Hi balena team,
I have a very urgent matter. Today a customer called to tell us that his system was becoming more and more unresponsive, so we SSH’d into the device. The device was showing a load average of more than 4, and more than 7 over the last 15 minutes. That was our first indication that the device had indeed become unresponsive.
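For context on those numbers, here is a minimal sketch of how a load average can be put in relation to the number of CPU cores. The helper names (load_per_core, is_overloaded) are hypothetical, and the threshold of 1.0 per core is only a common rule of thumb, not something balena prescribes.

```python
import os


def load_per_core(load_avg, cores):
    """Divide each load average (1, 5, 15 min) by the number of CPU cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return tuple(round(value / cores, 2) for value in load_avg)


def is_overloaded(load_avg, cores, threshold=1.0):
    """Return True if any per-core load average exceeds the threshold."""
    return any(value > threshold for value in load_per_core(load_avg, cores))


if __name__ == "__main__":
    # os.getloadavg() is only available on Unix-like systems.
    cores = os.cpu_count() or 1
    load = os.getloadavg()
    print("per-core load:", load_per_core(load, cores))
    print("overloaded:", is_overloaded(load, cores))
```

Run on the device itself, this prints the per-core load for the last 1, 5 and 15 minutes. A value that stays well above 1.0 per core would suggest that processes are queuing for CPU time.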
So I checked the processes with the top command. I’ve added a screenshot below:
It looks like the following command was the one using the CPU (I could not get the full command):
/usr/bin/balenad --delta-data-root=/mnt/sysroot/active/balena --delta-storage-driver=aufs --log-driver=journald -s aufs --data-root=/mnt/sysroot/inactive/balena -H unix:///v
Because we needed to help the customer, we had to make the system responsive again. So I killed that process and hoped for the best. This did the trick. But I don’t know what this command does or is supposed to do. So I hope I didn’t break anything.
I’ve also checked all my containers, but they didn’t seem to have a high load. So it wasn’t the software running in the containers afaik.
Before I did that, I tried to gather all the logs I could think of, because I think you’ll need them. I used dmesg, journalctl and balena logs resin_supervisor. I’ve uploaded them in this thread.
Some basic information:
Board: UP Board (UP Squared)
HostOS: balenaOS 2.29.2+rev1
Supervisor: 9.0.1
I hope you guys can help me as soon as possible, because it’ll likely happen again if we don’t change anything. This is a beta customer, but we’d still like to get this problem resolved quickly.
If you have any questions, please let me know. I can grant access to the device, but because the customer is working on it, it is not possible to do any “real” work on it, like rebooting or restarting containers.
Thanks in advance and I hope to hear something soon!
dmesg.log (73.8 KB) journalctl.log (126.1 KB) resin_supervisor.log (4.3 KB)