Hey Jeff,
Could you share your kernel logs (dmesg) from one of the affected devices?
I am waiting on another device to hit the issue to grab the dmesg log. I did find a brand new issue that may or may not be related to this, and I’m holding back the logs from that to not conflate that issue with this one in case it’s not related and so we don’t get distracted.
The executive summary of that new issue is: the agent service is “exited” and cannot be restarted. balena ps -a shows the container exited, and balena inspect shows exit code 255. Running balena start results in this output:
root@<removed>:~# balena start agent_3134344_1651895
Error response from daemon: task 6cc2fb58f3ef27b74b2e4d3159836f358e08f62539fe4df390f9b2a28d6d8f82 already exists: unknown
Error: failed to start containers: agent_3134344_1651895
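(For reference, the 255 exit code mentioned above comes from the container's inspect output; a format filter along these lines trims it to the relevant fields, with --format behaving the same as stock docker inspect:)
balena inspect --format '{{.State.Status}} {{.State.ExitCode}} {{.State.FinishedAt}}' agent_3134344_1651895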
Again, I don’t want to get stuck on this new symptom because it’s the first time we have seen it. If I see it again I will probably open a different issue.
I found this issue had been encountered on another field device. The “agent” process is streaming sensor data to a local influx instance on this device. The last bit of data successfully persisted to the database was timestamped around 9:38:15 AM on 01/18 (15:38:15 UTC), which is a good indication of the last known time we were still actively writing to the database. Note that the time data is sensed isn’t necessarily the time it is persisted, but we expect data to be persisted within a few seconds of being sensed.
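For reference, the last persisted point can be checked with an InfluxQL query along these lines against the local instance (the database and measurement names here are placeholders, not our actual schema):
influx -host 127.0.0.1 -database 'sensors' -execute 'SELECT * FROM "samples" ORDER BY time DESC LIMIT 1'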
Looking at dmesg on the device, I can see that around that time, we see some kernel watchdog messages about hung tasks:
[Mon Jan 18 15:34:12 2021] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[Mon Jan 18 15:39:37 2021] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[Mon Jan 18 15:41:02 2021] INFO: task jbd2/mmcblk0p6-:917 blocked for more than 120 seconds.
[Mon Jan 18 15:41:02 2021] Tainted: P C O 4.19.118 #1
[Mon Jan 18 15:41:02 2021] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Mon Jan 18 15:41:02 2021] jbd2/mmcblk0p6- D 0 917 2 0x00000000
[Mon Jan 18 15:41:02 2021] [<c094e598>] (__schedule) from [<c094ed10>] (schedule+0x4c/0xac)
[Mon Jan 18 15:41:02 2021] [<c094ed10>] (schedule) from [<c094f25c>] (io_schedule+0x20/0x40)
[Mon Jan 18 15:41:02 2021] [<c094f25c>] (io_schedule) from [<c094f708>] (bit_wait_io+0x1c/0x6c)
[Mon Jan 18 15:41:02 2021] [<c094f708>] (bit_wait_io) from [<c094f2ec>] (__wait_on_bit+0x70/0xc8)
[Mon Jan 18 15:41:02 2021] [<c094f2ec>] (__wait_on_bit) from [<c094f3dc>] (out_of_line_wait_on_bit+0x98/0xbc)
[Mon Jan 18 15:41:02 2021] [<c094f3dc>] (out_of_line_wait_on_bit) from [<c0315f78>] (__wait_on_buffer+0x40/0x44)
[Mon Jan 18 15:41:02 2021] [<c0315f78>] (__wait_on_buffer) from [<c03e7b68>] (jbd2_journal_commit_transaction+0x14f8/0x198c)
[Mon Jan 18 15:41:02 2021] [<c03e7b68>] (jbd2_journal_commit_transaction) from [<c03ed930>] (kjournald2+0x10c/0x2f4)
[Mon Jan 18 15:41:02 2021] [<c03ed930>] (kjournald2) from [<c0143238>] (kthread+0x16c/0x174)
[Mon Jan 18 15:41:02 2021] [<c0143238>] (kthread) from [<c01010ac>] (ret_from_fork+0x14/0x28)
[Mon Jan 18 15:41:02 2021] Exception stack(0xece67fb0 to 0xece67ff8)
[Mon Jan 18 15:41:02 2021] 7fa0: 00000000 00000000 00000000 00000000
[Mon Jan 18 15:41:02 2021] 7fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[Mon Jan 18 15:41:02 2021] 7fe0: 00000000 00000000 00000000 00000000 00000013 00000000
[Mon Jan 18 15:41:02 2021] INFO: task influxd:2992 blocked for more than 120 seconds.
[Mon Jan 18 15:41:02 2021] Tainted: P C O 4.19.118 #1
[Mon Jan 18 15:41:02 2021] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Mon Jan 18 15:41:02 2021] influxd D 0 2992 2456 0x00000000
[Mon Jan 18 15:41:02 2021] [<c094e598>] (__schedule) from [<c094ed10>] (schedule+0x4c/0xac)
[Mon Jan 18 15:41:02 2021] [<c094ed10>] (schedule) from [<c03ecf74>] (jbd2_log_wait_commit+0xdc/0x168)
[Mon Jan 18 15:41:02 2021] [<c03ecf74>] (jbd2_log_wait_commit) from [<c03ee910>] (jbd2_complete_transaction+0x90/0xf8)
[Mon Jan 18 15:41:02 2021] [<c03ee910>] (jbd2_complete_transaction) from [<c038d0e4>] (ext4_sync_file+0x378/0x47c)
[Mon Jan 18 15:41:02 2021] [<c038d0e4>] (ext4_sync_file) from [<c0312bcc>] (vfs_fsync_range+0x4c/0x8c)
[Mon Jan 18 15:41:02 2021] [<c0312bcc>] (vfs_fsync_range) from [<c0312c94>] (do_fsync+0x4c/0x80)
[Mon Jan 18 15:41:02 2021] [<c0312c94>] (do_fsync) from [<c0312fc4>] (sys_fsync+0x1c/0x20)
[Mon Jan 18 15:41:02 2021] [<c0312fc4>] (sys_fsync) from [<c0101000>] (ret_fast_syscall+0x0/0x28)
[Mon Jan 18 15:41:02 2021] Exception stack(0xdbac5fa8 to 0xdbac5ff0)
[Mon Jan 18 15:41:02 2021] 5fa0: 00000000 00000000 0000004e 00000000 00000000 00000000
[Mon Jan 18 15:41:02 2021] 5fc0: 00000000 00000000 00000000 00000076 00000005 06b28fb4 08e7ddc0 02afa870
[Mon Jan 18 15:41:02 2021] 5fe0: 00000001 06b28eb0 00012648 000d906c
[Mon Jan 18 15:41:02 2021] INFO: task influxd:5310 blocked for more than 120 seconds.
[Mon Jan 18 15:41:02 2021] Tainted: P C O 4.19.118 #1
[Mon Jan 18 15:41:02 2021] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Mon Jan 18 15:41:02 2021] influxd D 0 5310 2456 0x00000000
[Mon Jan 18 15:41:02 2021] [<c094e598>] (__schedule) from [<c094ed10>] (schedule+0x4c/0xac)
[Mon Jan 18 15:41:02 2021] [<c094ed10>] (schedule) from [<c03ecf74>] (jbd2_log_wait_commit+0xdc/0x168)
[Mon Jan 18 15:41:02 2021] [<c03ecf74>] (jbd2_log_wait_commit) from [<c03ee910>] (jbd2_complete_transaction+0x90/0xf8)
[Mon Jan 18 15:41:02 2021] [<c03ee910>] (jbd2_complete_transaction) from [<c038d0e4>] (ext4_sync_file+0x378/0x47c)
[Mon Jan 18 15:41:02 2021] [<c038d0e4>] (ext4_sync_file) from [<c0312bcc>] (vfs_fsync_range+0x4c/0x8c)
[Mon Jan 18 15:41:02 2021] [<c0312bcc>] (vfs_fsync_range) from [<c0312c94>] (do_fsync+0x4c/0x80)
[Mon Jan 18 15:41:02 2021] [<c0312c94>] (do_fsync) from [<c0312fc4>] (sys_fsync+0x1c/0x20)
[Mon Jan 18 15:41:02 2021] [<c0312fc4>] (sys_fsync) from [<c0101000>] (ret_fast_syscall+0x0/0x28)
[Mon Jan 18 15:41:02 2021] Exception stack(0xea6b7fa8 to 0xea6b7ff0)
[Mon Jan 18 15:41:02 2021] 7fa0: 00000000 00000000 00000077 00000000 00000000 00000000
[Mon Jan 18 15:41:02 2021] 7fc0: 00000000 00000000 00000000 00000076 00000005 00000000 07348540 02afa870
[Mon Jan 18 15:41:02 2021] 7fe0: 00000001 04109680 00012648 000d906c
[Mon Jan 18 15:41:02 2021] INFO: task influxd:24736 blocked for more than 120 seconds.
[Mon Jan 18 15:41:02 2021] Tainted: P C O 4.19.118 #1
[Mon Jan 18 15:41:02 2021] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Mon Jan 18 15:41:02 2021] influxd D 0 24736 2456 0x00000000
[Mon Jan 18 15:41:02 2021] [<c094e598>] (__schedule) from [<c094ed10>] (schedule+0x4c/0xac)
[Mon Jan 18 15:41:02 2021] [<c094ed10>] (schedule) from [<c03ecf74>] (jbd2_log_wait_commit+0xdc/0x168)
[Mon Jan 18 15:41:02 2021] [<c03ecf74>] (jbd2_log_wait_commit) from [<c03ee910>] (jbd2_complete_transaction+0x90/0xf8)
[Mon Jan 18 15:41:02 2021] [<c03ee910>] (jbd2_complete_transaction) from [<c038d0e4>] (ext4_sync_file+0x378/0x47c)
[Mon Jan 18 15:41:02 2021] [<c038d0e4>] (ext4_sync_file) from [<c0312bcc>] (vfs_fsync_range+0x4c/0x8c)
[Mon Jan 18 15:41:02 2021] [<c0312bcc>] (vfs_fsync_range) from [<c0312c94>] (do_fsync+0x4c/0x80)
[Mon Jan 18 15:41:02 2021] [<c0312c94>] (do_fsync) from [<c0312fc4>] (sys_fsync+0x1c/0x20)
[Mon Jan 18 15:41:02 2021] [<c0312fc4>] (sys_fsync) from [<c0101000>] (ret_fast_syscall+0x0/0x28)
[Mon Jan 18 15:41:02 2021] Exception stack(0xcf375fa8 to 0xcf375ff0)
[Mon Jan 18 15:41:02 2021] 5fa0: 00000000 00000000 00000027 00000000 00000000 00000000
[Mon Jan 18 15:41:02 2021] 5fc0: 00000000 00000000 00000001 00000076 00000005 112a37b4 08e7c2a0 02afa870
[Mon Jan 18 15:41:02 2021] 5fe0: 00000001 112a36b0 00012648 000d906c
[Mon Jan 18 15:41:02 2021] INFO: task kworker/0:0:13386 blocked for more than 120 seconds.
[Mon Jan 18 15:41:02 2021] Tainted: P C O 4.19.118 #1
[Mon Jan 18 15:41:02 2021] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Mon Jan 18 15:41:02 2021] kworker/0:0 D 0 13386 2 0x00000000
[Mon Jan 18 15:41:02 2021] Workqueue: events_freezable mmc_rescan
[Mon Jan 18 15:41:02 2021] [<c094e598>] (__schedule) from [<c094ed10>] (schedule+0x4c/0xac)
[Mon Jan 18 15:41:02 2021] [<c094ed10>] (schedule) from [<c0716ff0>] (__mmc_claim_host+0xe0/0x23c)
[Mon Jan 18 15:41:02 2021] [<c0716ff0>] (__mmc_claim_host) from [<c0717184>] (mmc_get_card+0x38/0x3c)
[Mon Jan 18 15:41:02 2021] [<c0717184>] (mmc_get_card) from [<c071dd18>] (mmc_detect+0x24/0x7c)
[Mon Jan 18 15:41:02 2021] [<c071dd18>] (mmc_detect) from [<c071964c>] (mmc_rescan+0xe8/0x3f8)
[Mon Jan 18 15:41:02 2021] [<c071964c>] (mmc_rescan) from [<c013c604>] (process_one_work+0x23c/0x518)
[Mon Jan 18 15:41:02 2021] [<c013c604>] (process_one_work) from [<c013d710>] (worker_thread+0x60/0x5b8)
[Mon Jan 18 15:41:02 2021] [<c013d710>] (worker_thread) from [<c0143238>] (kthread+0x16c/0x174)
[Mon Jan 18 15:41:02 2021] [<c0143238>] (kthread) from [<c01010ac>] (ret_from_fork+0x14/0x28)
[Mon Jan 18 15:41:02 2021] Exception stack(0xcf271fb0 to 0xcf271ff8)
[Mon Jan 18 15:41:02 2021] 1fa0: 00000000 00000000 00000000 00000000
[Mon Jan 18 15:41:02 2021] 1fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[Mon Jan 18 15:41:02 2021] 1fe0: 00000000 00000000 00000000 00000000 00000013 00000000
[Mon Jan 18 15:42:53 2021] br-ca8558b9fcfe: port 2(veth9b76e62) entered disabled state
[Mon Jan 18 15:42:53 2021] vethc2a548a: renamed from eth0
[Mon Jan 18 15:42:53 2021] br-ca8558b9fcfe: port 2(veth9b76e62) entered disabled state
[Mon Jan 18 15:42:53 2021] device veth9b76e62 left promiscuous mode
[Mon Jan 18 15:42:53 2021] br-ca8558b9fcfe: port 2(veth9b76e62) entered disabled state
[Mon Jan 18 15:42:54 2021] supervisor0: port 1(vetha236063) entered disabled state
[Mon Jan 18 15:42:54 2021] veth73e1b2d: renamed from eth1
[Mon Jan 18 15:42:54 2021] supervisor0: port 1(vetha236063) entered disabled state
[Mon Jan 18 15:42:54 2021] device vetha236063 left promiscuous mode
[Mon Jan 18 15:42:54 2021] supervisor0: port 1(vetha236063) entered disabled state
[Mon Jan 18 15:42:54 2021] br-ca8558b9fcfe: port 2(vethec30b0d) entered blocking state
[Mon Jan 18 15:42:54 2021] br-ca8558b9fcfe: port 2(vethec30b0d) entered disabled state
[Mon Jan 18 15:42:54 2021] device vethec30b0d entered promiscuous mode
[Mon Jan 18 15:42:54 2021] IPv6: ADDRCONF(NETDEV_UP): vethec30b0d: link is not ready
[Mon Jan 18 15:42:54 2021] br-ca8558b9fcfe: port 2(vethec30b0d) entered blocking state
[Mon Jan 18 15:42:54 2021] br-ca8558b9fcfe: port 2(vethec30b0d) entered forwarding state
[Mon Jan 18 15:42:54 2021] br-ca8558b9fcfe: port 2(vethec30b0d) entered disabled state
[Mon Jan 18 15:42:54 2021] supervisor0: port 1(veth107e9c6) entered blocking state
[Mon Jan 18 15:42:54 2021] supervisor0: port 1(veth107e9c6) entered disabled state
[Mon Jan 18 15:42:54 2021] device veth107e9c6 entered promiscuous mode
[Mon Jan 18 15:42:54 2021] IPv6: ADDRCONF(NETDEV_UP): veth107e9c6: link is not ready
[Mon Jan 18 15:42:54 2021] supervisor0: port 1(veth107e9c6) entered blocking state
[Mon Jan 18 15:42:54 2021] supervisor0: port 1(veth107e9c6) entered forwarding state
[Mon Jan 18 15:42:55 2021] supervisor0: port 1(veth107e9c6) entered disabled state
[Mon Jan 18 15:42:56 2021] eth0: renamed from veth1f61100
[Mon Jan 18 15:42:56 2021] IPv6: ADDRCONF(NETDEV_CHANGE): vethec30b0d: link becomes ready
[Mon Jan 18 15:42:56 2021] br-ca8558b9fcfe: port 2(vethec30b0d) entered blocking state
[Mon Jan 18 15:42:56 2021] br-ca8558b9fcfe: port 2(vethec30b0d) entered forwarding state
[Mon Jan 18 15:42:56 2021] eth1: renamed from veth7fe6f99
[Mon Jan 18 15:42:56 2021] IPv6: ADDRCONF(NETDEV_CHANGE): veth107e9c6: link becomes ready
[Mon Jan 18 15:42:56 2021] supervisor0: port 1(veth107e9c6) entered blocking state
[Mon Jan 18 15:42:56 2021] supervisor0: port 1(veth107e9c6) entered forwarding state
[Mon Jan 18 15:43:01 2021] br-ca8558b9fcfe: port 2(vethec30b0d) entered disabled state
[Mon Jan 18 15:43:01 2021] veth1f61100: renamed from eth0
[Mon Jan 18 15:43:01 2021] br-ca8558b9fcfe: port 2(vethec30b0d) entered disabled state
[Mon Jan 18 15:43:01 2021] device vethec30b0d left promiscuous mode
[Mon Jan 18 15:43:01 2021] br-ca8558b9fcfe: port 2(vethec30b0d) entered disabled state
[Mon Jan 18 15:43:01 2021] supervisor0: port 1(veth107e9c6) entered disabled state
[Mon Jan 18 15:43:01 2021] veth7fe6f99: renamed from eth1
[Mon Jan 18 15:43:01 2021] supervisor0: port 1(veth107e9c6) entered disabled state
[Mon Jan 18 15:43:01 2021] device veth107e9c6 left promiscuous mode
[Mon Jan 18 15:43:01 2021] supervisor0: port 1(veth107e9c6) entered disabled state
[Mon Jan 18 15:43:02 2021] br-ca8558b9fcfe: port 2(vethcd1a47d) entered blocking state
[Mon Jan 18 15:43:02 2021] br-ca8558b9fcfe: port 2(vethcd1a47d) entered disabled state
[Mon Jan 18 15:43:02 2021] device vethcd1a47d entered promiscuous mode
[Mon Jan 18 15:43:02 2021] IPv6: ADDRCONF(NETDEV_UP): vethcd1a47d: link is not ready
[Mon Jan 18 15:43:02 2021] br-ca8558b9fcfe: port 2(vethcd1a47d) entered blocking state
[Mon Jan 18 15:43:02 2021] br-ca8558b9fcfe: port 2(vethcd1a47d) entered forwarding state
[Mon Jan 18 15:43:02 2021] supervisor0: port 1(vethc5edb6f) entered blocking state
[Mon Jan 18 15:43:02 2021] supervisor0: port 1(vethc5edb6f) entered disabled state
[Mon Jan 18 15:43:02 2021] device vethc5edb6f entered promiscuous mode
[Mon Jan 18 15:43:02 2021] IPv6: ADDRCONF(NETDEV_UP): vethc5edb6f: link is not ready
[Mon Jan 18 15:43:02 2021] supervisor0: port 1(vethc5edb6f) entered blocking state
[Mon Jan 18 15:43:02 2021] supervisor0: port 1(vethc5edb6f) entered forwarding state
[Mon Jan 18 15:43:02 2021] br-ca8558b9fcfe: port 2(vethcd1a47d) entered disabled state
[Mon Jan 18 15:43:02 2021] supervisor0: port 1(vethc5edb6f) entered disabled state
[Mon Jan 18 15:43:03 2021] eth0: renamed from veth5a90daa
[Mon Jan 18 15:43:04 2021] IPv6: ADDRCONF(NETDEV_CHANGE): vethcd1a47d: link becomes ready
[Mon Jan 18 15:43:04 2021] br-ca8558b9fcfe: port 2(vethcd1a47d) entered blocking state
[Mon Jan 18 15:43:04 2021] br-ca8558b9fcfe: port 2(vethcd1a47d) entered forwarding state
[Mon Jan 18 15:43:04 2021] eth1: renamed from veth910f50b
[Mon Jan 18 15:43:04 2021] IPv6: ADDRCONF(NETDEV_CHANGE): vethc5edb6f: link becomes ready
[Mon Jan 18 15:43:04 2021] supervisor0: port 1(vethc5edb6f) entered blocking state
[Mon Jan 18 15:43:04 2021] supervisor0: port 1(vethc5edb6f) entered forwarding state
[Mon Jan 18 15:45:04 2021] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[Mon Jan 18 15:50:19 2021] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[Mon Jan 18 15:55:34 2021] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[Mon Jan 18 16:00:49 2021] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[Mon Jan 18 16:06:04 2021] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[Mon Jan 18 16:11:24 2021] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[Mon Jan 18 16:16:39 2021] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[Mon Jan 18 16:21:54 2021] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[Mon Jan 18 16:27:09 2021] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[Mon Jan 18 16:32:24 2021] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[Mon Jan 18 16:37:39 2021] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[Mon Jan 18 16:42:54 2021] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[Mon Jan 18 16:48:09 2021] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[Mon Jan 18 16:53:24 2021] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[Mon Jan 18 16:58:39 2021] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[Mon Jan 18 17:03:59 2021] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[Mon Jan 18 17:09:14 2021] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[Mon Jan 18 17:14:34 2021] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[Mon Jan 18 17:19:49 2021] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[Mon Jan 18 17:25:04 2021] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[Mon Jan 18 17:30:19 2021] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[Mon Jan 18 17:35:34 2021] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[Mon Jan 18 17:40:49 2021] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[Mon Jan 18 17:46:04 2021] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[Mon Jan 18 17:51:19 2021] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[Mon Jan 18 17:56:34 2021] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[Mon Jan 18 18:01:49 2021] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[Mon Jan 18 18:07:04 2021] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[Mon Jan 18 18:12:19 2021] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[Mon Jan 18 18:17:34 2021] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[Mon Jan 18 18:22:49 2021] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
Here is the output of our standard debugging commands on this device:
root@:~# uptime
18:20:23 up 2 days 15:18, 2 users, load average: 0.65, 0.74, 0.87
root@:~# date
Mon Jan 18 18:20:30 UTC 2021
root@:~# balena ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
bedab54c784a e36fce813fa7 "/entrypoint.sh infl…" 22 minutes ago Up 22 minutes influxdb_3134354_1651895
560c7c64756d 158780b9dd8a "/trex/edge capture" 51 minutes ago Up 51 minutes capture_3134342_1651895
5e0049975f36 registry2.balena-cloud.com/v2/e5a9b95f08373b03979ee007184e4753:latest "/usr/src/app/entry.…" 3 hours ago Up 3 hours (healthy) resin_supervisor
f521eb3a3986 158780b9dd8a "/trex/edge control" 2 days ago Up 2 days control_3134341_1651895
8791130fe2e6 02f603c54d98 "/entrypoint.sh" 11 days ago Up 2 days 0.0.0.0:2222->22/tcp scp_3134349_1651895
1beb74fa93df e5364636fadf "/entrypoint.sh" 2 weeks ago Up 2 days grafana_3134353_1651895
ea65c2c1e5bc e4ccd877f611 "/entrypoint_with_wa…" 2 weeks ago Up 2 days telegraf_3134352_1651895
d4b1fef88b50 8d80652d5ac6 "sleep infinity" 2 weeks ago Up 2 days inspect_3134348_1651895
573734c7fdb9 158780b9dd8a "/trex/edge agent" 2 weeks ago Up 21 hours agent_3134344_1651895
4346c0db5d8f 860c57b07dd9 "sleep infinity" 3 weeks ago Up 2 days co-processor-flasher_3134343_1651895
0223285ca905 c9767211b5b7 "./grpc-dfuTool" 3 weeks ago Up 2 days sensor-firmware-flasher_3134345_1651895
92a1b0a40e66 5b31debf8912 "/bin/sh -c /app/con…" 3 weeks ago Up 2 days connection-watchdog_3134350_1651895
a8836beee4e2 9f49ac54288d "python3 logger.py" 3 weeks ago Up 3 hours logger_3134347_1651895
476d372c5409 48a011276016 "/entrypoint.sh" 3 weeks ago Up 2 days (healthy) hw-init_3134346_1651895
root@:~# ps | grep agent
23189 root 2276 S grep agent
root@:~# balena inspect agent_3134344_1651895
[
{
"Id": "573734c7fdb9d9d49b9c8151a52f4b8554d0cca6cb4d04e379078ed90ede585e",
"Created": "2020-12-30T20:44:30.455373572Z",
"Path": "/trex/edge",
"Args": [
"agent"
],
"State": {
"Status": "running",
"Running": true,
"Paused": false,
"Restarting": false,
"OOMKilled": false,
"Dead": false,
"Pid": 22864,
"ExitCode": 0,
"Error": "",
"StartedAt": "2021-01-17T21:43:23.742110162Z",
"FinishedAt": "2021-01-17T21:43:21.154466563Z"
},
"Image": "sha256:158780b9dd8a4569fa86dfa8d134f584dea1b7e1dc51170634f32a94b841e1e0",
"ResolvConfPath": "/var/lib/docker/containers/573734c7fdb9d9d49b9c8151a52f4b8554d0cca6cb4d04e379078ed90ede585e/resolv.conf",
"HostnamePath": "/var/lib/docker/containers/573734c7fdb9d9d49b9c8151a52f4b8554d0cca6cb4d04e379078ed90ede585e/hostname",
"HostsPath": "/var/lib/docker/containers/573734c7fdb9d9d49b9c8151a52f4b8554d0cca6cb4d04e379078ed90ede585e/hosts",
"LogPath": "",
"Name": "/agent_3134344_1651895",
"RestartCount": 4,
"Driver": "aufs",
"Platform": "linux",
"MountLabel": "",
"ProcessLabel": "",
"AppArmorProfile": "",
"ExecIDs": null,
"HostConfig": {
"Binds": [
"/tmp/balena-supervisor/services/1499460/agent:/tmp/resin",
"/tmp/balena-supervisor/services/1499460/agent:/tmp/balena"
],
"ContainerIDFile": "",
"ContainerIDEnv": "",
"LogConfig": {
"Type": "journald",
"Config": {}
},
"NetworkMode": "host",
"PortBindings": {},
"RestartPolicy": {
"Name": "always",
"MaximumRetryCount": 0
},
"AutoRemove": false,
"VolumeDriver": "",
"VolumesFrom": null,
"CapAdd": [],
"CapDrop": [],
"Capabilities": null,
"Dns": [],
"DnsOptions": [],
"DnsSearch": [],
"ExtraHosts": [],
"GroupAdd": [],
"IpcMode": "shareable",
"Cgroup": "",
"Links": null,
"OomScoreAdj": 0,
"PidMode": "",
"Privileged": true,
"PublishAllPorts": false,
"ReadonlyRootfs": false,
"SecurityOpt": [],
"UTSMode": "",
"UsernsMode": "",
"ShmSize": 67108864,
"Runtime": "runc",
"ConsoleSize": [
0,
0
],
"Isolation": "",
"CpuShares": 0,
"Memory": 0,
"NanoCpus": 0,
"CgroupParent": "",
"BlkioWeight": 0,
"BlkioWeightDevice": null,
"BlkioDeviceReadBps": null,
"BlkioDeviceWriteBps": null,
"BlkioDeviceReadIOps": null,
"BlkioDeviceWriteIOps": null,
"CpuPeriod": 0,
"CpuQuota": 0,
"CpuRealtimePeriod": 0,
"CpuRealtimeRuntime": 0,
"CpusetCpus": "",
"CpusetMems": "",
"Devices": [],
"DeviceCgroupRules": null,
"DeviceRequests": [],
"KernelMemory": 0,
"KernelMemoryTCP": 0,
"MemoryReservation": 0,
"MemorySwap": 0,
"MemorySwappiness": null,
"OomKillDisable": false,
"PidsLimit": null,
"Ulimits": [],
"CpuCount": 0,
"CpuPercent": 0,
"IOMaximumIOps": 0,
"IOMaximumBandwidth": 0,
"MaskedPaths": null,
"ReadonlyPaths": null
},
"GraphDriver": {
"Data": null,
"Name": "aufs"
},
"Mounts": [
{
"Type": "bind",
"Source": "/tmp/balena-supervisor/services/1499460/agent",
"Destination": "/tmp/balena",
"Mode": "",
"RW": true,
"Propagation": "rprivate"
},
{
"Type": "bind",
"Source": "/tmp/balena-supervisor/services/1499460/agent",
"Destination": "/tmp/resin",
"Mode": "",
"RW": true,
"Propagation": "rprivate"
}
],
"Config": {
"Hostname": "<edited out>",
"Domainname": "",
"User": "",
"AttachStdin": false,
"AttachStdout": false,
"AttachStderr": false,
"Tty": true,
"OpenStdin": false,
"StdinOnce": false,
"Env": [
"RESIN_DEVICE_NAME_AT_INIT=winter-moon",
"BALENA_DEVICE_NAME_AT_INIT=winter-moon",
"INFLUXDB_HOST=127.0.0.1:8087",
"WAIT_FOR_INFLUXDB=true",
"AGENT_CAPTURE_URL=tcp://localhost:10000",
"AGENT_INFLUXDB_URL=http://localhost:8086",
"TREX_LOG_LEVEL=INFO",
"WELL_API_NUMBER=00-000-00000",
"AGENT_KAFKA_URL=tcp://broker.trex.petropower.com:9092",
"AGENT_LOG_LEVEL=INFO",
"AGENT_PARTIAL_SAMPLES_DATABASE=partial_samples/autogen",
"BALENA_APP_ID=1499460",
"BALENA_APP_NAME=Trex",
"BALENA_SERVICE_NAME=agent",
"BALENA_DEVICE_UUID=<edited out>",
"BALENA_DEVICE_TYPE=fincm3",
"BALENA_DEVICE_ARCH=armv7hf",
"BALENA_HOST_OS_VERSION=balenaOS 2.58.3+rev1",
"BALENA_SUPERVISOR_VERSION=11.14.0",
"BALENA_APP_LOCK_PATH=/tmp/balena/updates.lock",
"BALENA=1",
"RESIN_APP_ID=1499460",
"RESIN_APP_NAME=Trex",
"RESIN_SERVICE_NAME=agent",
"RESIN_DEVICE_UUID=<edited out>",
"RESIN_DEVICE_TYPE=fincm3",
"RESIN_DEVICE_ARCH=armv7hf",
"RESIN_HOST_OS_VERSION=balenaOS 2.58.3+rev1",
"RESIN_SUPERVISOR_VERSION=11.14.0",
"RESIN_APP_LOCK_PATH=/tmp/balena/updates.lock",
"RESIN=1",
"RESIN_SERVICE_KILL_ME_PATH=/tmp/balena/handover-complete",
"BALENA_SERVICE_HANDOVER_COMPLETE_PATH=/tmp/balena/handover-complete",
"USER=root",
"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
],
"Cmd": [
"agent"
],
"Healthcheck": {
"Test": [
"NONE"
]
},
"Image": "sha256:158780b9dd8a4569fa86dfa8d134f584dea1b7e1dc51170634f32a94b841e1e0",
"Volumes": null,
"WorkingDir": "",
"Entrypoint": [
"/trex/edge"
],
"OnBuild": null,
"Labels": {
"com.petropower.component.name": "agent",
"com.petropower.component.version": "421",
"io.balena.app-id": "1499460",
"io.balena.service-id": "682262",
"io.balena.service-name": "agent",
"io.balena.supervised": "true"
},
"StopSignal": "SIGTERM",
"StopTimeout": 10
},
"NetworkSettings": {
"Bridge": "",
"SandboxID": "bd23c5503360d3b8d342ac91ff59d057e172291a75a8f7918f5656fccccca8db",
"HairpinMode": false,
"LinkLocalIPv6Address": "",
"LinkLocalIPv6PrefixLen": 0,
"Ports": {},
"SandboxKey": "/var/run/balena-engine/netns/default",
"SecondaryIPAddresses": null,
"SecondaryIPv6Addresses": null,
"EndpointID": "",
"Gateway": "",
"GlobalIPv6Address": "",
"GlobalIPv6PrefixLen": 0,
"IPAddress": "",
"IPPrefixLen": 0,
"IPv6Gateway": "",
"MacAddress": "",
"Networks": {
"host": {
"IPAMConfig": null,
"Links": null,
"Aliases": null,
"NetworkID": "a1b0d489f69d651e7b7054b5bb74622acbb3254cc18986ba57befbdb5d798287",
"EndpointID": "5167be5522308ae117a8d69466a1f7f4db4d9c54d7b394d53c47a9ba56b5297a",
"Gateway": "",
"IPAddress": "",
"IPPrefixLen": 0,
"IPv6Gateway": "",
"GlobalIPv6Address": "",
"GlobalIPv6PrefixLen": 0,
"MacAddress": "",
"DriverOpts": null
}
}
}
}
]
Hi. Those dmesg logs are pretty conclusive: it’s either a storage failure, or something in your application (probably influxDB) is saturating the bandwidth of your storage device well above its capacity.
[Mon Jan 18 15:41:02 2021] INFO: task jbd2/mmcblk0p6-:917 blocked for more than 120 seconds.
When this happens, IO tasks will hang and so will any process waiting on them. systemctl restart balenad works because SIGKILL doesn’t care, and the new process won’t be waiting on those hung tasks. See this kernel issue: https://bugzilla.kernel.org/show_bug.cgi?id=198187.
Installing faster storage would certainly help and would probably solve this issue altogether. There are probably also some sysctl settings to increase that 120s timeout, although I’m not entirely sure which settings would be useful right now.
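For reference, the threshold that watchdog message refers to is exposed as a sysctl; raising it only delays the report (the hung-task detector doesn’t kill anything by default), so it hides the symptom rather than fixing the stall:
sysctl kernel.hung_task_timeout_secs          # read the current threshold, 120s by default
sysctl -w kernel.hung_task_timeout_secs=300   # example: raise the reporting threshold to 300s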
So, as it turns out, 120 seconds before the kernel watchdog started reporting stalled tasks, influxDB was in the process of compacting the _internal database.
I still don’t understand why, after this event occurs, we are able to use balena to manage some containers but not others.
Also, since we are using the telegraf diskio plugin, we can see a gigantic spike in reads at that time.
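For reference, the spike shows up with an InfluxQL query along these lines over telegraf’s diskio measurement (the field and tag names are the stock diskio plugin’s; the device name and time range are placeholders):
SELECT non_negative_derivative(last("read_bytes"), 1s) FROM "diskio" WHERE "name" = 'mmcblk0' AND time > now() - 6h GROUP BY time(10s)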
I think the logs suggest that something is deadlocked within influxdb, not agent. InfluxDB is the process being impacted here, but something about how the agent process exits is causing it to be unmanageable afterwards.
I fully accept that we can avoid getting into this state by managing our storage better. We are presently evaluating tuning influxdb to be less taxing on the storage and evaluating what options we have to upgrade the storage we are using. However, I’d be more satisfied if we understood why balena is unable to restart the agent container, which is really a tangentially connected part of the whole picture and isn’t even really where the action is happening. In theory, the system should self-recover from this problem when it’s encountered, but it does not, and I’d like to know why.
Hey Jeff,
As Carol pointed out, this seems to be related to your storage performance, and I’d like to dig into that a little deeper to see if it’s the culprit.
Could you tell us what type of devices you’re using, as well as what kind of storage medium? If the OS is running on microSD cards, could you give us more information about the model of card you’re using?
Additionally, on one of the devices exhibiting the issue, could you run iostat and report the results? You should be able to pull and run an Alpine container including iostat for your platform using a command such as this:
balena run --privileged --rm -it --pid=host balenalib/raspberry-pi-alpine iostat
Note that unless you’re using a Raspberry Pi (1) or another compatible device, you need to replace raspberry-pi with the platform you’re running. See here for more information on that: https://www.balena.io/docs/reference/base-images/base-images/
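For example, if the device type is fincm3 (the balenaFin), the adapted command would look roughly like this; the -xz flags add extended per-device statistics and a 5-second sampling interval, assuming the image ships the sysstat build of iostat:
balena run --privileged --rm -it --pid=host balenalib/fincm3-alpine iostat -xz 5 3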
Also note that downloading this container will use up some disk and bandwidth, which may be undesirable if your device is low on space, or using a cellular connection. You can remove this image after you’re done to free up this space.
The number we’re most interested in here is iowait. If the devices that aren’t performing as expected show a high percentage of iowait, that means you have processes blocked waiting for reads/writes to finish, which would explain some of the behavior you’re seeing. If this is the case, we’ll need to look at either reducing disk usage somehow or improving the overall throughput.
If this is not the case, we’re barking up the wrong tree and need to look elsewhere to diagnose this problem.
I’ve been monitoring this issue for a week after making the following changes:
We reconfigured our influxDB container to stop keeping internal metrics by setting the environment variable INFLUXDB_MONITOR_STORE_ENABLED=false.
We implemented that change as a stop-gap while we evaluate our options and continue to debug the issue at hand (that the balena service cannot manage the agent container). It appears to have provided some relief: of the 24 devices in our existing fleet, the 12 that received the change saw only 1 occurrence of our original issue in the last 7 days. The other 12 devices were not updated, and 4 of those hit this same issue. We also got a little lucky and hit the issue on one of our internal CI devices. I am in the process of collecting additional information from the CI device and will include it in a later update.
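For clarity, the change above amounts to a single environment entry in the influxdb service definition, roughly like this (a sketch of a balena docker-compose.yml fragment; the service and image names are illustrative, not our actual ones):
services:
  influxdb:
    image: influxdb:1.8
    environment:
      - INFLUXDB_MONITOR_STORE_ENABLED=false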
Regarding our storage configuration: We are using the eMMC included on the BalenaFin.
Regarding the I/O workload and the use of the alpine container: as I previously posted, we do have access to, and are already logging, disk metrics using prometheus. I am still considering the experiment with the alpine container because our existing metrics don’t include queue depth (read/write queue depth is always recorded as 0).
Now I want to mention once again that the specific issue at hand in this forum post is that once we hit this condition, the balena service is not able to manage the container. Our expectation is that regardless of how a container exits, the balena service should be able to kill the exited container and start it again. This is not what is happening right now.
This is an early alpha/PoC IoT solution and we understand there are challenges with using the eMMC and the platform in the way we are presently doing it. The root issue I am posting here is not that I can’t make my solution work, but instead that once a piece of my solution breaks, the underlying balena platform also breaks. We don’t want to see that issue in our product so we are trying to understand the root cause of the breakage in balena and continue to troubleshoot that.
I’ve provided many logs and outputs showing the symptoms of the issue. The one I would like to refocus on is that “balena kill” cannot kill the agent container until the balena service is restarted with “systemctl restart balena”.
What additional logs can be collected or experiments could be run to get to the bottom of the reason that balena is unable to manage this container?
Hey Jeff, this was quite the thread to catch up on but I think I know where we are at now, especially with your helpful summaries.
Regarding your main concern with the balena kill command not working, there are a number of possible causes for that. Here are two I think we should consider:
Are there any other services that fail to restart/kill besides the agent? If not, I would suspect that it’s locking up while waiting on influxdb, on available IO, or both.
Yes, we have two different failure modes, but the “solution” to both of them has been to restart the balena service.
In both failure modes, we noticed that other services are still manageable. It’s only the “agent” container that is unmanageable. We have restarted other containers with ease, including the influx container, which is the container bearing the brunt of the problem with its I/O going slowly.
I like your theory that there is a “threadlock”, though I’m not familiar with that concept. I’m just wondering how that’s possible, because the foreground process “edge” isn’t running (according to the output of ps), so I’m not sure what else in that container could be locked.
Is there a way we could eliminate one of those?
I will continue to read on both of these theories.
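One check that might help narrow this down: the engine’s top subcommand (same behaviour as docker top) lists what the engine still sees running inside the container from the host’s point of view, which would show whether anything other than the missing edge process is left behind:
balena top agent_3134344_1651895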
Jeff, since it’s just this one container that’s unmanageable, it does sound like the process is blocked. I found this link to be a pretty good summary of what might be going on: https://unix.stackexchange.com/questions/5642/what-if-kill-9-does-not-work
I would recommend trying to get a minimal reproduction of this state (the foreground process missing but the container still locked) to track down what, aside from the “edge” foreground process, is still running. With the container in the state it’s in, it’s difficult to troubleshoot further without pushing modified versions of the container to narrow down the point at which the issue no longer occurs.
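In the meantime, here is a rough sketch of host-side state worth capturing the next time the container wedges (run as root on the host OS; the container name and unit name follow what was used earlier in the thread, and the awk field assumes the busybox ps output format shown above):
ps | awk '$4 ~ /^D/'   # list tasks stuck in uninterruptible sleep (STAT is the 4th column of busybox ps)
balena inspect --format '{{.State.Status}} {{.State.Pid}} {{.State.ExitCode}}' agent_3134344_1651895   # the engine's view of the container
journalctl -u balena --no-pager -n 500   # engine logs around the failed kill/start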