I have a really weird situation. I’m trying to push an update to a Beaglebone Green Gateway board. The application uses three containers. Two of them have downloaded the update and are ready to go; however, the third container’s download never seems to finish! It gets to around 95% and then the board seems to reset, and the container download starts all over.
Any advice on troubleshooting or why this would be happening?
I ran into this issue before on RPi0s, and the root cause was a compute bottleneck. The balena engine + supervisor were using 50%+ of the CPU and my containers were using the rest. During the download this prevented the health check on the balena-supervisor from responding, which caused the download to crash and restart.
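For what it’s worth, here is a rough sketch of how the load can be checked from the balenaOS host OS (a sketch, not a definitive procedure; the supervisor’s systemd unit is called resin-supervisor on older OS releases and balena-supervisor on newer ones):

# From a host OS shell (Dashboard terminal or `balena ssh <uuid>`):
# One-shot snapshot of the busiest processes; balenad (the engine) and the
# supervisor's node process show up here, unlike in a `top` run inside a
# service container.
top -b -n 1 | head -n 20
# Follow the supervisor's own journal to catch healthcheck failures/restarts.
journalctl -u resin-supervisor -u balena-supervisor -f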
I switched to a kill-then-download update strategy, which kind of sucks due to the downtime, but even working with support we couldn’t come up with a better solution in the meantime.
@mpous as I mentioned above, there aren’t any relevant logs to share. There was no useful info in the logs; the device would just start the download over for no apparent reason. I gave up and reflashed the device, which somehow allowed the update to complete.
@jhamburger this sounds like what was probably happening to me. The Beaglebone Black processor isn’t exactly speedy in today’s world, and I’ve noticed CPU utilization between 80 and 100% during updates. In my case, reflashing the device meant there was no workload running, which lessened the burden on the CPU and probably allowed the update to complete.
That makes at least two accounts of high CPU usage preventing updates from completing, which seems like a real flaw in the update system.
I tried to push an update to my Beaglebone Green Gateway, and the update seems to be stuck and never finishes, although the device is online.
I can’t figure out what is wrong, but I see that the CPU is maxed out in the Balena UI, even though it doesn’t look maxed out in the top command.
The device ID is 3d31d42c0d06fa599a854b7ff1278afe and I’ve granted access to the support team if anyone could take a look and tell me why the device is not pulling down a new update.
Thanks @keenanjohnson for granting support access to the device.
We have seen this type of issue related to watchdog timeouts, which cause balenaEngine to restart and the images to be re-downloaded from zero.
There are some things that can be done here, but my first recommendation is to change the update strategy to free up resources on the device during the download and installation of new releases: Fleet update strategy - Balena Documentation
Could you please confirm which update strategy you are using?
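For reference, the strategy is set per service as a label in the fleet’s docker-compose.yml; if no label is set, the default is download-then-kill. A minimal sketch (the service names below are placeholders, not from your fleet):

version: "2.1"
services:
  main:
    build: ./main
    labels:
      # Stop the running container before pulling the new image, freeing
      # CPU/RAM during the download at the cost of some downtime.
      io.balena.update.strategy: kill-then-download
  worker:
    build: ./worker
    labels:
      io.balena.update.strategy: kill-then-download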
I’ve been using the default download-then-kill strategy, so are you suggesting I need to switch to the kill-then-download strategy to conserve CPU?
Is there a way to manually issue the kill command, as I can’t seem to get the supervisor on that device to pick up the configuration change in order to process the new update strategy?
Looking at the device I am seeing a ton of supervisor errors:
Dec 16 19:35:04 3d31d42 resin-supervisor[2835]: [info] Supervisor v12.11.20 starting up...
Dec 16 19:35:09 3d31d42 resin-supervisor[2835]: [info] Setting host to discoverable
Dec 16 19:35:09 3d31d42 resin-supervisor[2835]: [warn] Invalid firewall mode: . Reverting to state: off
Dec 16 19:35:09 3d31d42 resin-supervisor[2835]: [info] Applying firewall mode: off
Dec 16 19:35:10 3d31d42 resin-supervisor[2835]: [debug] Starting systemd unit: avahi-daemon.service
Dec 16 19:35:10 3d31d42 resin-supervisor[2835]: [debug] Starting systemd unit: avahi-daemon.socket
Dec 16 19:35:10 3d31d42 resin-supervisor[2835]: [debug] Starting logging infrastructure
Dec 16 19:35:11 3d31d42 resin-supervisor[2835]: [info] Starting firewall
Dec 16 19:35:11 3d31d42 resin-supervisor[2835]: [debug] Performing database cleanup for container log timestamps
Dec 16 19:35:14 3d31d42 resin-supervisor[2835]: [info] Previous engine snapshot was not stored. Skipping cleanup.
Dec 16 19:35:14 3d31d42 resin-supervisor[2835]: [debug] Handling of local mode switch is completed
Dec 16 19:35:14 3d31d42 resin-supervisor[2835]: [success] Firewall mode applied
Dec 16 19:35:14 3d31d42 resin-supervisor[2835]: [debug] Starting api binder
Dec 16 19:35:15 3d31d42 resin-supervisor[2835]: (node:1) [DEP0005] DeprecationWarning: Buffer() is deprecated due to security and usability issues. Please use the Buffer.alloc(), Buffer.allocUnsafe(), or Buffer.from() methods instead.
Dec 16 19:35:15 3d31d42 resin-supervisor[2835]: [info] API Binder bound to: https://api.balena-cloud.com/v6/
Dec 16 19:35:15 3d31d42 resin-supervisor[2835]: [event] Event: Supervisor start {}
Dec 16 19:35:16 3d31d42 resin-supervisor[2835]: [debug] Spawning journald with: chroot /mnt/root journalctl -a -S 2021-12-16 19:09:10 -o json CONTAINER_ID_FULL=1022874ca95d2cd81bd98ceabde17a32c0e4a501af346c13e356ce9fe5529ce9
Dec 16 19:35:48 3d31d42 resin-supervisor[2835]: [debug] Spawning journald with: chroot /mnt/root journalctl -a -S 2021-12-13 22:18:27 -o json CONTAINER_ID_FULL=1280442b5f410157b3f7c0559b383ade33683aebee8972b59e1099d754252ca5
Dec 16 19:36:35 3d31d42 resin-supervisor[2835]: (node:1) UnhandledPromiseRejectionWarning: KnexTimeoutError: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?
Dec 16 19:36:35 3d31d42 resin-supervisor[2835]: at Client_SQLite3.acquireConnection (/usr/src/app/dist/app.js:6:224388)
Dec 16 19:36:35 3d31d42 resin-supervisor[2835]: at runNextTicks (internal/process/task_queues.js:62:5)
Dec 16 19:36:35 3d31d42 resin-supervisor[2835]: at listOnTimeout (internal/timers.js:518:9)
Dec 16 19:36:35 3d31d42 resin-supervisor[2835]: at processTimers (internal/timers.js:492:7)
Dec 16 19:36:35 3d31d42 resin-supervisor[2835]: (node:1) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 8)
Dec 16 19:36:35 3d31d42 resin-supervisor[2835]: (node:1) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
Dec 16 19:36:58 3d31d42 resin-supervisor[2835]: [error] LogBackend: unexpected error: Error: Client network socket disconnected before secure TLS connection was established
Dec 16 19:36:58 3d31d42 resin-supervisor[2835]: [error] at connResetException (internal/errors.js:608:14)
Dec 16 19:36:58 3d31d42 resin-supervisor[2835]: [error] at TLSSocket.onConnectEnd (_tls_wrap.js:1514:19)
Dec 16 19:36:58 3d31d42 resin-supervisor[2835]: [error] at Object.onceWrapper (events.js:416:28)
Dec 16 19:36:58 3d31d42 resin-supervisor[2835]: [error] at TLSSocket.emit (events.js:322:22)
Dec 16 19:36:58 3d31d42 resin-supervisor[2835]: [error] at endReadableNT (_stream_readable.js:1187:12)
Dec 16 19:36:58 3d31d42 resin-supervisor[2835]: [error] at processTicksAndRejections (internal/process/task_queues.js:84:21)
The error UnhandledPromiseRejectionWarning: KnexTimeoutError: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call? from the Supervisor logs definitely indicates a resource constraint while downloading. The fact that the download only completed after you reflashed the device (with no workload running) hints that the device is operating near the limits of its resources. We’d recommend kill-then-download if that’s possible for your use case.