Trying to develop a Balena System with reliable rebooting

For the last 3+ years we’ve been building and deploying Balena devices on the Onlogic Mk100b-51 (Intel Skylake 1U Rackmount Computer | OnLogic). We have about 50 of the systems currently in our fleet, and they are deployed throughout the world.

Prior to shipping any device, I always run at least 7 “Reboots” through the Balena console to validate that the ACPI/Onlogic Firmware/YoctoLinux/BalenaOS stack all play well with each other. Without exception it always works prior to shipping. I’ve never had a failure, so that means I’ve had 350+ successful power cycles across 50+ devices.

But - about 50% of the time, when I attempt a remote power cycle of systems in the field (using the balenaCloud “Reboot” option) - it fails, and I have to have someone go visit the site with remote hands and physically power cycle the device (at which point it returns to service.)

I’ve tried for the better part of a year to replicate this behavior on a lab bench, running the exact same containers, BalenaOS, Supervisor - to no avail.

I’m always running ESR, so this has persisted through multiple generations of BalenaOS.

The latest event occurred today, on a system running [BalenaOS 2021.01.0] Supervisor: 12.3.0

Our Intel Server vendor hasn’t ever issued any firmware upgrades, so there isn’t much I can do there.

I guess I have two questions:
o First, has anyone else ever had trouble with remote rebooting? I’m guessing that because the interface makes you confirm your choice for a reboot, but not for a container restart, this has been an issue for others.

o Second - Does anyone have some suggestions as to how I might go about validating/configuring a Balena Instance that can reliably remote power cycle? If I could ever replicate this behavior on a lab bench, then I have lots of tools at hand (try different hardware, for example) - but it only seems to occur with production systems that have been running remotely for a prolonged period of time.

This is kind of a big deal - because we really want to upgrade the BalenaOS on some of our systems that have been out in the field for a year, but at the same time, we can’t afford the risk of power cycling them and having to send out remote hands.

Any thoughts, recommended telemetry, topics to discuss with our hardware vendor around ACPI, or hardware recommendations for devices that have never failed to remote power cycle would be appreciated.

Interesting scenario. I have experienced something similar on an Orange Pi, but it is almost always caused by an issue with the device connecting to the Balena Cloud, or perhaps it is more broadly an issue with the connection from the device to the internet. I never investigated too much as it was a development device. But on all of these occasions I had an indication on the cloud that it was occurring, it would show ‘heartbeat’ and indicate a connection error. Presumably you haven’t had any indications of connection issues?

Are you able to ssh into the device from the cloud? It would suggest the connection is live if you can.

Are there any logs suggesting it received the request and failed, or that it didn’t receive a request?

Connection issues are a daily occurrence with me - Proxies, Firewalls, Network Outages, etc…
The scenario I’m looking at here though - is where we have a flawless network connection. Both the VPN and the Logs and the Balena API connection have no issues - it’s only when we trigger the reboot that the device just gets wedged.

Weird. So you can see the devices and communicate no problem when the reboot fails? Is there no indication in the logs through the dashboard that it is receiving and trying to execute the request?

The command executes just fine. Device power cycles. But somewhere between Yocto and the ACPI, something goes awry and the device isn’t able to complete a full power cycle.

It sounds different to what I experienced then.

I am looking at a similar type of deployment where devices won’t be accessible so this is of particular interest to me. I’m sure the Balena guys will be able to come up with some helpful debugging steps, I know for me though it would be interesting to see the journalctl logs from a device if they still exist (persistent logging) when the device comes back up. Should give some details as to the failed boot process.
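To make the journalctl suggestion concrete, here is a minimal sketch of pulling the previous boot's journal once the device is back. It assumes persistent logging was already enabled before the failed reboot and that it runs somewhere `journalctl` can read the persisted journal. The function names are made up for illustration, and the same flags can simply be typed by hand in a host OS terminal.

```python
import subprocess


def previous_boot_journal_cmd(boots_back=1, max_priority="warning"):
    """Build a journalctl command for an earlier boot.

    boots_back=1 selects the boot before the current one (--boot=-1).
    max_priority="warning" keeps warnings and anything more severe.
    """
    if boots_back < 1:
        raise ValueError("boots_back must be >= 1")
    return [
        "journalctl",
        "--no-pager",
        f"--boot=-{boots_back}",
        f"--priority={max_priority}",
    ]


def read_previous_boot_journal(boots_back=1, max_priority="warning"):
    """Run journalctl and return its output as text (may be empty)."""
    cmd = previous_boot_journal_cmd(boots_back, max_priority)
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout


if __name__ == "__main__":
    # List the boots the journal knows about, then dump the previous one.
    boots = subprocess.run(
        ["journalctl", "--list-boots", "--no-pager"],
        capture_output=True, text=True, check=False,
    )
    print(boots.stdout)
    print(read_previous_boot_journal())
```

If the previous boot's journal ends partway through shutdown, that would point at the device hanging on the way down. If it ends cleanly and no new boot was recorded until the manual power cycle, the hang is more likely in firmware or early boot.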

Hi there,
just some ideas from my end:
I had scenarios where OpenVPN-connected embedded devices failed after reboot because of several overlapping problems, and realized I could run into a deadlock:
Sometimes OpenVPN tried too early to connect, while DHCP wasn’t through yet, and then gave up at some point. As a really dirty hack I ended up deploying a ping container which just pings an IP at the far end - that motivated OpenVPN enough, and I never had problems like that afterwards. (It was non balena, more Docker on ARM environment :slight_smile: )

So, my gut feeling would tell me something between DHCP (that means also configuring DNS and routes) as well as OpenVPN getting into some “lock” situation there. Maybe…

But the rest of the community or the devs will probably have more insight :slight_smile:

Hello @ghshephard thank you for your message.

I pinged the balenaOS team internally to see if they can help you solve this issue.

Let’s see meanwhile if the community can help with more ideas.

One other data point - I’ve had reports from people in the field that when they come up to the system, it’s sitting there with the power light just “blinking”. So - it’s as though the power cycle occurred, and then the device ended up in some intermediate state between the S0 and S5 power states.

Clearly there needs to be a conversation with my hardware vendor, but given they haven’t released any firmware updates, and that I’ve never been able to replicate the problem, they’ll likely point their fingers at Yocto/BalenaOS.

On the second question, if it is a network issue after boot rather than a boot issue (hard to know for sure without logs), you could poll some online endpoint and, when it fails, trigger a device reboot using the Supervisor API: Interacting with the balena Supervisor - Balena Documentation
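A rough sketch of that polling idea, for a container that has the Supervisor API enabled (the `io.balena.features.supervisor-api` label), so that `BALENA_SUPERVISOR_ADDRESS` and `BALENA_SUPERVISOR_API_KEY` are set in its environment. The probe target, interval, threshold and function names are all assumptions picked for illustration, not recommendations.

```python
import os
import socket
import time
import urllib.request


def is_online(host="1.1.1.1", port=443, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds.

    The default target is an arbitrary placeholder.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def next_failure_count(failures, online):
    """Reset the counter on success, otherwise count one more failure."""
    return 0 if online else failures + 1


def supervisor_reboot_request(address, api_key):
    """Build the POST /v1/reboot request for the local Supervisor."""
    url = f"{address}/v1/reboot?apikey={api_key}"
    return urllib.request.Request(
        url,
        data=b"{}",
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def watch(interval_s=60, max_failures=10):
    """Poll connectivity and request a reboot after repeated failures."""
    address = os.environ["BALENA_SUPERVISOR_ADDRESS"]
    api_key = os.environ["BALENA_SUPERVISOR_API_KEY"]
    failures = 0
    while True:
        failures = next_failure_count(failures, is_online())
        if failures >= max_failures:
            request = supervisor_reboot_request(address, api_key)
            urllib.request.urlopen(request, timeout=30)
            failures = 0
        time.sleep(interval_s)


if __name__ == "__main__":
    watch()
```

Since the reboot itself is what wedges the devices in this thread, a generous threshold seems wise, so that a short network blip does not trigger an unnecessary power cycle.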

Hello @ghshephard, the balenaOS team read your issue. They think there could be an issue with the device’s BSP, and that when it fails to boot there is probably a kernel panic.

A suggestion could be to use this reboot test tool. It runs thousands of reboot loops, as a first step to make sure the device has no issues rebooting. GitHub - balena-io-playground/balena-reboot-app: Just an app that continuosly reboots the device

Let us know if this works!
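Alongside a reboot soak test, it can help to record every boot that actually completes, so that requested reboots can be compared against finished ones afterwards. The sketch below is not the reboot app's own code, just one way to do it. It reads the kernel's per-boot ID from `/proc/sys/kernel/random/boot_id` and appends it to a log file. The `/data/boot_log.jsonl` path is hypothetical and assumes a persistent volume is mounted at `/data`.

```python
import json
import os
import time
from pathlib import Path

BOOT_ID_PATH = Path("/proc/sys/kernel/random/boot_id")
LOG_PATH = Path("/data/boot_log.jsonl")  # hypothetical persistent location


def current_boot_id():
    """Return the kernel's random ID for the current boot."""
    return BOOT_ID_PATH.read_text().strip()


def logged_boot_ids(log_path):
    """Return the boot IDs already recorded, skipping blank or torn lines."""
    path = Path(log_path)
    if not path.exists():
        return []
    ids = []
    for line in path.read_text().splitlines():
        try:
            ids.append(json.loads(line)["boot_id"])
        except (json.JSONDecodeError, KeyError, TypeError):
            continue
    return ids


def record_boot(log_path, boot_id, now=None):
    """Append one entry per new boot; return True if this boot was new."""
    if boot_id in logged_boot_ids(log_path):
        return False
    entry = {"boot_id": boot_id, "time": time.time() if now is None else now}
    with Path(log_path).open("a") as fh:
        fh.write(json.dumps(entry) + "\n")
        fh.flush()
        os.fsync(fh.fileno())  # push the entry to disk before the next reboot
    return True


def count_boots(log_path):
    """Number of distinct boots recorded so far."""
    return len(logged_boot_ids(log_path))


if __name__ == "__main__":
    is_new = record_boot(LOG_PATH, current_boot_id())
    print(f"boots recorded so far: {count_boots(LOG_PATH)} (new boot: {is_new})")
```

Run at container start, this gives a timestamped list of completed boots. A long gap between two entries, ending only when someone power cycled the device by hand, marks a reboot that wedged.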