High load / CPU usage and container stopping

Hi balena team,

I have a very urgent matter. Today a customer called to tell us that his system was becoming more and more unresponsive, so we SSH’d into the device. The device was showing a load of more than 4, and a 15-minute load average of more than 7. That was our first indication that the device had indeed become unresponsive.

So I checked the processes with the top command. I’ve added a screenshot below:

It looks like the following command was the one using the CPU (Could not get the full command):
/usr/bin/balenad --delta-data-root=/mnt/sysroot/active/balena --delta-storage-driver=aufs --log-driver=journald -s aufs --data-root=/mnt/sysroot/inactive/balena -H unix:///v

Because we needed to help the customer, we had to make the system responsive again. So I killed that process and hoped for the best. This did the trick. But I don’t know what this command does or is supposed to do. So I hope I didn’t break anything.

I’ve also checked all my containers, but they didn’t seem to have a high load. So it wasn’t the software running in the containers afaik.

Before I did that, I tried to gather all the logs I could think of because you’ll need them (I think). I used dmesg, journalctl and balena logs resin_supervisor. I’ve uploaded them in this thread.
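For anyone who wants to repeat the log collection in one go, here is a minimal sketch. It assumes Python 3 is available where it runs, which may not be the case on the balenaOS host itself. The three commands are the ones named above (with `--no-pager` added to journalctl so it does not wait for input); the function name, output file names and the 60-second timeout are illustrative choices.

```python
import subprocess
from pathlib import Path

# Commands named in the post; the output file names are illustrative.
LOG_COMMANDS = {
    "dmesg.log": ["dmesg"],
    "journalctl.log": ["journalctl", "--no-pager"],
    "resin_supervisor.log": ["balena", "logs", "resin_supervisor"],
}


def collect_logs(out_dir, commands=LOG_COMMANDS, run=subprocess.run):
    """Run each command and write its combined stdout/stderr to out_dir/<name>."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for name, argv in commands.items():
        try:
            result = run(argv, stdout=subprocess.PIPE,
                         stderr=subprocess.STDOUT, timeout=60)
            data = result.stdout
        except (OSError, subprocess.TimeoutExpired) as exc:
            # Record the failure instead of aborting, so the other logs
            # are still collected.
            data = f"failed to run {argv}: {exc}\n".encode()
        path = out / name
        path.write_bytes(data)
        written.append(path)
    return written
```

Calling `collect_logs("/tmp/balena-debug")` (the directory is only an example) writes one file per command. Doing this before killing or restarting anything preserves the state the support team will want to see.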

Some basic information:
Board UP Board (UP Squared)
HostOS balenaOS 2.29.2+rev1
Supervisor 9.0.1

I hope you guys can help me as soon as possible, because it’ll likely happen again if we don’t change anything. This is our beta customer, but we’d like this problem resolved as soon as possible.

If you have any questions, please let me know. I can grant access to the device, but because they’re working on the device, it is not possible to do any “real” work on it, like rebooting or restarting containers.

Thanks in advance and I hope to hear something soon!

dmesg.log (73.8 KB) journalctl.log (126.1 KB) resin_supervisor.log (4.3 KB)

You may indeed do a systemctl restart balena if you see this in the future. This looks like an intermittent issue - a docker/balena-engine bug. I pinged our balena-engine maintainer to take a look at your report.

One thing that I would suggest if that keeps happening is trying out a later OS version. In general it is always good to keep the OS updated to a more recent release (after doing some testing first).

Can you please grant support access and provide the UUID of the device, so that we can look at it?

Thanks,
Zahari

Hi,

Thanks for your response. I hope you find the cause of this problem, because we can’t ask our customers to restart the device every time this issue occurs.

About the OS version: this is the latest OS version for the UP Board. 2.39 was released a few months ago but was revoked some time later, so 2.29.2+rev1 is currently the latest OS for the UP Board. If there were an OS update available, installing it would be the first thing we’d do before reporting an issue.

I’ve granted support access for the next 12 hours. The UUID is 8a465161eb9b616472ef8b193d31b89f. As said, our customer uses this product, it’s connected to a touchscreen with an interface, so please don’t restart containers or reboot the device.

Last but not least, thanks for your support!

Hi,

Unfortunately, we have not yet been able to identify the root cause of the problem from the available data.
We’d appreciate it if you ping us when this occurs again, so we can do extra analysis on the misbehaving balenad process.
Thanks!

Hi,

That’s really disappointing to hear.
I’ve checked the logs, also on another device, and it seems like on that device it happens every day around 17:00 UTC. But I haven’t figured out yet what the reason is.

The resin_supervisor seems to restart at that time, and so do all the containers. Does balenad run something on a schedule? I’ve granted support access on this device 586aab3d59ce613c90d43669d75a283e for 1 day. This is an in-house device, so feel free to do some tests.
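To check whether the restarts really cluster around one hour of the day, the journal lines can be bucketed by UTC hour. The sketch below assumes the journal was exported with `journalctl -o short-iso`, so each line starts with a timestamp such as `2019-10-21T17:00:03+0000`. The function name and the search string you pass in are illustrative, since the exact restart message depends on what appears in your logs.

```python
from collections import Counter
from datetime import datetime, timezone


def restart_hours(lines, needle):
    """Count, per UTC hour, the journal lines that contain `needle`."""
    hours = Counter()
    for line in lines:
        if needle not in line:
            continue
        # In short-iso output the timestamp is the first space-separated field.
        stamp = line.split(" ", 1)[0]
        try:
            ts = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S%z")
        except ValueError:
            continue  # not a short-iso line (e.g. a continuation line)
        hours[ts.astimezone(timezone.utc).hour] += 1
    return hours
```

Running it over the exported file, for example `restart_hours(open("journalctl.log"), "some restart message")`, returns a counter keyed by hour. A single dominant hour across several days would support the suspicion of a scheduled trigger.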

@bversluijs Thanks for the access. I’m taking a look.

@bversluijs Have you restarted the device?

Hi,

My colleague has restarted the device. Sorry for that.
Have you seen anything already? I’ve told my colleagues that the device may not be restarted.

So, when I started looking at the device, I first dove into investigating some warnings in the balena-engine logs. Those warnings are fixed in a newer engine version, but the new OS is not yet available for the UP Board. Our devices team is still running internal tests on the new OS before publishing it to production. The warnings I’m talking about are:

  • failed to retrieve balena-engine-init version, triggered by the balena info command that our watchdog checks run
  • unknown container warnings

As I said, both are fixed in the newer engine/OS versions.

However, the device then got rebooted. The warnings remain in the logs, but balenad is no longer using a large amount of CPU. So those warnings may be unrelated to the issue (unless they have some cumulative effect).
We should strace balenad the next time we see high CPU usage. Please send us another ping. Thanks!
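As a rough sketch of how the strace step could be prepared ahead of time, the helper below finds the balenad PID by scanning /proc and builds an strace command line for it. It assumes Python 3 and strace are both available on the device, which is not guaranteed on a balenaOS host. The function names and the output path are hypothetical; the strace flags used are the standard `-f` (follow threads and children), `-tt` (timestamps), `-p` (attach to a PID) and `-o` (write to a file).

```python
import os


def find_pids(name, proc_root="/proc"):
    """Return the sorted PIDs whose argv[0] basename equals `name`."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as fh:
                # cmdline is NUL-separated; the first field is argv[0].
                argv0 = fh.read().split(b"\0", 1)[0].decode(errors="replace")
        except OSError:
            continue  # process exited or is not readable
        if os.path.basename(argv0) == name:
            pids.append(int(entry))
    return sorted(pids)


def strace_command(pid, out_path):
    """Build an strace invocation that attaches to `pid` and logs to `out_path`."""
    return ["strace", "-f", "-tt", "-p", str(pid), "-o", out_path]
```

For example, `strace_command(find_pids("balenad")[0], "/tmp/balenad.strace")` gives a command list that can be run as root while the CPU usage is high. Stopping it with Ctrl-C after a short while leaves the trace in the output file.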

We’ll also notify this thread when the new version for Up Board is released.

Hi,

First, thanks for looking into the device. It’s very unfortunate that the device was rebooted.
Is there a schedule for releasing the new UP Board version? The latest version for it is 2.29.2, but balenaOS is already at 2.44.0. We’re hoping this update fixes the issue, but our customer isn’t happy at the moment, because they use the system all day. It’s also weird, because we haven’t seen the problem before on our devices, and the hardware and software are exactly the same. And because I saw high CPU usage for balenad, it’s likely that the problem isn’t related to our containers, isn’t it?

Do you have any suggestions? When this occurs, the system freezes, so we can’t ask our customer to wait till we/you have investigated the device properly…

Hi, I got confirmation from our devices team that they will be releasing 2.44 to production for the UP Board soon. Please note, however, that we will be at our yearly company summit next week, so this will be done the week after.
We suggest that you try the newer version after it is released. Will that work for you? Otherwise you will need to look for a workaround, e.g. watching for high CPU usage and restarting the service if necessary, but that is something I would not recommend.
Thanks,
Zahari
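For reference, the workaround mentioned above (watch for sustained high load and restart the service) could look roughly like the sketch below. It is a stopgap under stated assumptions, not a recommended fix. It assumes Python 3 is available on the host, it uses the 1-minute load average from /proc/loadavg as a proxy for balenad CPU usage, and it restarts the engine with the `systemctl restart balena` command suggested earlier in this thread. The function names, threshold, window size and polling interval are all illustrative.

```python
import subprocess
import time


def read_load1(path="/proc/loadavg"):
    """Return the 1-minute load average (first field of /proc/loadavg)."""
    with open(path) as fh:
        return float(fh.read().split()[0])


def should_restart(samples, threshold, needed):
    """True when the last `needed` samples all exceed `threshold`."""
    recent = samples[-needed:]
    return len(recent) == needed and all(s > threshold for s in recent)


def restart_engine():
    """Restart the engine service, as suggested earlier in the thread."""
    subprocess.run(["systemctl", "restart", "balena"], check=False)


def watch(threshold, needed=5, interval=60.0, read=read_load1,
          restart=restart_engine, sleep=time.sleep, rounds=None):
    """Poll the load; restart once it stays high for `needed` samples in a row.

    `rounds=None` loops forever; a number limits the polls (useful for tests).
    Returns how many restarts were triggered.
    """
    samples = []
    restarts = 0
    done = 0
    while rounds is None or done < rounds:
        samples.append(read())
        samples = samples[-needed:]
        if should_restart(samples, threshold, needed):
            restart()
            restarts += 1
            samples = []  # start a fresh window after a restart
        done += 1
        if rounds is None or done < rounds:
            sleep(interval)
    return restarts
```

Something like `watch(threshold=6.0)` would restart the engine after five consecutive minutes above a load of 6. The value 6.0 is arbitrary and would need tuning for the core count of the board. Requiring several consecutive samples avoids restarting on a short spike, but keep in mind that restarting the engine also restarts the containers.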