Trying to develop a Balena System with reliable rebooting

ghshephard · September 23, 2021, 6:03pm

For the last 3+ years we’ve been building and deploying Balena devices on the Onlogic Mk100b-51 (Intel Skylake 1U Rackmount Computer | OnLogic). We have about 50 of these systems in our fleet, and they are deployed throughout the world.

Prior to shipping any device, I always run at least 7 “Reboots” through the Balena console, to validate that the ACPI/Onlogic Firmware/YoctoLinux/BalenaOS stack are all playing well with each other - without exception it always works prior to shipping. I’ve never had a failure, so that means I’ve had 350+ successful power cycles across 50+ devices.

But - about 50% of the time, when I attempt a remote power cycle of systems in the field (using the balenaCloud “Reboot” option) - it fails, and I have to have someone go visit the site with remote hands and physically power cycle the device (at which point it returns to service.)

I’ve tried for the better part of a year to replicate this behavior on a lab bench, running the exact same containers, BalenaOS, Supervisor - to no avail.

I’m always running ESR, so this has persisted through multiple generations of BalenaOS.

The latest event occurred today, on a system running [BalenaOS 2021.01.0] Supervisor: 12.3.0

Our Intel Server vendor hasn’t ever issued any firmware upgrades, so there isn’t much I can do there.

I guess I have two questions:
o First, has anyone else ever had trouble with remote rebooting - I’m guessing that because you have to answer questions in the interface confirming your choice for a reboot, but not container restart, that this has been an issue for others.

o Second - Does anyone have some suggestions as to how I might go about validating/configuring a Balena Instance that can reliably remote power cycle? If I could ever replicate this behavior on a lab bench, then I have lots of tools at hand (try different hardware, for example) - but it only seems to occur with production systems that have been running remotely for a prolonged period of time.

This is kind of a big deal - because we really want to upgrade the BalenaOS on some of our systems that have been out in the field for a year, but at the same time, we can’t afford the risk of power cycling them and having to send out remote hands.

Any thoughts, recommended telemetry, topics to discuss with our hardware vendor around ACPI, or hardware recommendations for devices that have never failed to remote power cycle would be appreciated.

maggie0002 · September 23, 2021, 10:12pm

Interesting scenario. I have experienced something similar on an Orange Pi, but it is almost always caused by an issue with the device connecting to the Balena Cloud, or perhaps it is more broadly an issue with the connection from the device to the internet. I never investigated too much as it was a development device. But on all of these occasions I had an indication on the cloud that it was occurring, it would show ‘heartbeat’ and indicate a connection error. Presumably you haven’t had any indications of connection issues?

Are you able to ssh into the device from the cloud? It would suggest the connection is live if you can.

Are there any logs suggesting it received the request and failed, or that it didn’t receive a request?

ghshephard · September 23, 2021, 10:15pm

Connection issues are a daily occurrence with me - Proxies, Firewalls, Network Outages, etc…
The scenario I’m looking at here though - is where we have a flawless network connection. Both the VPN and the Logs and the Balena API connection have no issues - it’s only when we trigger the reboot that the device just gets wedged.

maggie0002 · September 23, 2021, 10:16pm

Weird. So you can see the devices and communicate no problem when the reboot fails? Is there no indication in the logs through the dashboard that it is receiving and trying to execute the request?

ghshephard · September 23, 2021, 10:17pm

The command executes just fine. Device power cycles. But somewhere between Yocto and the ACPI, something goes awry and the device isn’t able to complete a full power cycle.

maggie0002 · September 23, 2021, 10:20pm

It sounds different to what I experienced then.

I am looking at a similar type of deployment where devices won’t be accessible so this is of particular interest to me. I’m sure the Balena guys will be able to come up with some helpful debugging steps, I know for me though it would be interesting to see the journalctl logs from a device if they still exist (persistent logging) when the device comes back up. Should give some details as to the failed boot process.
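
For anyone wanting to try this, here is a rough sketch of pulling the previous boot's journal once the device is back up. It assumes persistent logging was enabled before the failed reboot, and that it runs somewhere with both Python and `journalctl` available (that may not be true of the host OS itself, in which case the commands it builds can simply be typed in a host terminal). The helper names are made up for illustration.

```python
import subprocess


def list_boots_cmd():
    """Command that lists the boots the journal knows about."""
    return ["journalctl", "--no-pager", "--list-boots"]


def journal_cmd(boot_offset=-1, priority=None):
    """Command that shows the journal for one boot.

    boot_offset=-1 is the boot before the current one, which is
    where a wedged reboot should have left its last messages.
    priority (e.g. "warning") optionally filters by severity.
    """
    cmd = ["journalctl", "--no-pager", "-b", str(boot_offset)]
    if priority is not None:
        cmd += ["-p", priority]
    return cmd


def run(cmd):
    """Run a command and return its stdout as text."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout


# Example use (not run automatically):
#   print(run(list_boots_cmd()))
#   print(run(journal_cmd(-1, "warning")))
```

If `--list-boots` only shows the current boot, the journal is not being persisted and there will be nothing to see from the failed attempt.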

nmaas87 · September 24, 2021, 5:08am

Hi there,
just some ideas from my end:
I had scenarios where embedded devices connected over OpenVPN failed after a reboot because of several overlapping problems, and I realized I could run into a deadlock:
Sometimes OpenVPN tried to connect too early, while DHCP wasn’t through yet, and then gave up at some point. As a really dirty hack I ended up deploying a ping container which just pings an IP at the far end - that motivated OpenVPN enough, and I never had problems like that afterwards. (It was a non-balena, more Docker-on-ARM environment.)

So, my gut feeling would tell me something between DHCP (that means also configuring DNS and routes) as well as OpenVPN getting into some “lock” situation there. Maybe…

But the rest of the community or the devs will probably have more insight

mpous · September 27, 2021, 11:54am

Hello @ghshephard thank you for your message.

I pinged the balenaOS team internally to see if they can help you solve this issue.

Let’s see meanwhile if the community can help with more ideas.

ghshephard · September 27, 2021, 4:45pm

One other data point - I’ve had reports from people in the field that when they come up to the system, it’s sitting there with the power light just “blinking”. So - it’s as though the power cycle occurred, and then the device ended up in some intermediate S0-S5 power state.

Clearly there needs to be a conversation with my hardware vendor, but given they haven’t released any firmware updates, and that I’ve never been able to replicate the problem, they’ll likely point their fingers at Yocto/BalenaOS.

maggie0002 · September 27, 2021, 10:02pm

On the second question, if it is network issues after boot rather than boot issues (without logs hard to know for sure), poll some online connection and on fail trigger a device reboot using the API: Interacting with the balena Supervisor - Balena Documentation
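
A minimal sketch of that idea, assuming the container has the Supervisor API enabled (the `io.balena.features.supervisor-api` label) so that `BALENA_SUPERVISOR_ADDRESS` and `BALENA_SUPERVISOR_API_KEY` are set in its environment. The probe target, failure threshold and function names are placeholders, not anything from the docs. It only makes sense if the reboot itself completes reliably, which is exactly what is in question in this thread.

```python
import os
import socket
import time
import urllib.request


def is_online(host="1.1.1.1", port=53, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds.

    The default target is an arbitrary placeholder; use an
    endpoint that matters for your deployment.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def reboot_url(address, api_key):
    """Build the Supervisor v1 reboot URL."""
    return f"{address}/v1/reboot?apikey={api_key}"


def request_reboot(address, api_key):
    """POST to the Supervisor reboot endpoint and return the HTTP status."""
    req = urllib.request.Request(
        reboot_url(address, api_key),
        data=b"{}",
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.status


def watch(check, reboot, max_failures=5, interval=60.0, sleep=time.sleep):
    """Poll check(); after max_failures consecutive failures call reboot().

    Returns the failure count that triggered the reboot.
    """
    failures = 0
    while True:
        failures = 0 if check() else failures + 1
        if failures >= max_failures:
            reboot()
            return failures
        sleep(interval)


# Example use inside a container (not run automatically):
#   addr = os.environ["BALENA_SUPERVISOR_ADDRESS"]
#   key = os.environ["BALENA_SUPERVISOR_API_KEY"]
#   watch(is_online, lambda: request_reboot(addr, key))
```

Requiring several consecutive failures before rebooting keeps a brief network blip from triggering a power cycle.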

mpous · September 28, 2021, 1:08pm

Hello @ghshephard, the balenaOS team read your issue and they think there could potentially be an issue with the device’s BSP, and that when it fails to boot there is probably a kernel panic.

A suggestion could be to use this reboot test tool. The tool runs thousands of reboot loops as a first step to make sure the device has no issues rebooting. GitHub - balena-io-playground/balena-reboot-app: Just an app that continuosly reboots the device

Let us know if this works!
