For the last 3+ years we’ve been building and deploying Balena devices on the OnLogic Mk100b-51 (Intel Skylake 1U Rackmount Computer | OnLogic). We have about 50 of these systems in our fleet, deployed throughout the world.
Prior to shipping any device, I always run at least 7 “Reboots” through the Balena console, to validate that the ACPI / OnLogic firmware / Yocto Linux / BalenaOS stack is playing well together. Without exception, it always works prior to shipping. I’ve never had a failure on the bench, which means I’ve had 350+ successful power cycles across 50+ devices.
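To put those bench numbers in perspective, here is a rough back-of-the-envelope check. It is only a sketch, and it assumes every reboot is an independent trial with the same failure probability, which the field behavior suggests may not be true. The function name and the 95% confidence level are my own choices for illustration.

```python
def upper_bound_failure_rate(successes: int, confidence: float = 0.95) -> float:
    """One-sided upper bound on the per-reboot failure probability,
    given `successes` trials with zero failures (assumes independent trials)."""
    return 1.0 - (1.0 - confidence) ** (1.0 / successes)


reboots_per_device = 7   # pre-shipping reboots per device (from the post)
devices = 50             # approximate fleet size (from the post)
bench_reboots = reboots_per_device * devices  # 350

# Highest failure rate still consistent with 350 clean bench reboots.
bound = upper_bound_failure_rate(bench_reboots)

# Chance of seeing 350 clean reboots if the bench really failed as often
# as the field does (about 50% of the time).
p_all_pass_at_field_rate = 0.5 ** bench_reboots

print(f"bench reboots: {bench_reboots}")
print(f"95% upper bound on bench failure rate: {bound:.4%}")
print(f"P(all pass | 50% failure rate): {p_all_pass_at_field_rate:.3e}")
```

Under those assumptions the bench failure rate is bounded at a bit under 1%, so the roughly 50% field failure rate cannot be bad luck on the same process. Something about the deployed systems (uptime, environment, accumulated state) has to differ from the freshly provisioned ones.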
But about 50% of the time, when I attempt a remote power cycle of a system in the field (using the balenaCloud “Reboot” option), it fails, and I have to send someone to the site as remote hands to physically power cycle the device (at which point it returns to service).
I’ve tried for the better part of a year to replicate this behavior on a lab bench, running the exact same containers, BalenaOS, Supervisor - to no avail.
I’m always running ESR, so this has persisted through multiple generations of BalenaOS.
The latest event occurred today, on a system running [BalenaOS 2021.01.0] Supervisor: 12.3.0
Our Intel Server vendor hasn’t ever issued any firmware upgrades, so there isn’t much I can do there.
I guess I have two questions:
o First, has anyone else ever had trouble with remote rebooting? I’m guessing that because the interface makes you confirm your choice for a reboot, but not for a container restart, this has been an issue for others.
o Second, does anyone have suggestions as to how I might go about validating/configuring a Balena instance that can reliably remote power cycle? If I could ever replicate this behavior on a lab bench, I would have lots of tools at hand (trying different hardware, for example), but it only seems to occur with production systems that have been running remotely for a prolonged period of time.
This is kind of a big deal - because we really want to upgrade the BalenaOS on some of our systems that have been out in the field for a year, but at the same time, we can’t afford the risk of power cycling them and having to send out remote hands.
Any thoughts, recommended telemetry, topics to raise with our hardware vendor around ACPI, or recommendations for hardware that has never failed to remote power cycle would be appreciated.
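On the telemetry side, one thing I could imagine running in a container is a small boot-event logger, sketched below. It is not an existing Balena feature. The function names, the event names, and the `/data/boot-events.jsonl` path are all hypothetical, and it assumes the log lives on storage that survives a reboot. The only system interface it relies on is `/proc/sys/kernel/random/boot_id`, which the Linux kernel regenerates on every boot.

```python
import json
import os
import time
from pathlib import Path
from typing import List, Optional, Tuple

# Standard Linux interface: a random UUID regenerated on every boot.
BOOT_ID_PATH = Path("/proc/sys/kernel/random/boot_id")


def read_boot_id(path: Path = BOOT_ID_PATH) -> str:
    """Return the kernel's ID for the current boot."""
    return path.read_text().strip()


def record_event(log_path: Path, boot_id: str, event: str,
                 now: Optional[float] = None) -> None:
    """Append one JSON line and fsync it so it survives an abrupt power cut."""
    entry = {
        "ts": time.time() if now is None else now,
        "boot_id": boot_id,
        "event": event,
    }
    with log_path.open("a") as fh:
        fh.write(json.dumps(entry) + "\n")
        fh.flush()
        os.fsync(fh.fileno())


def last_event_per_boot(log_path: Path) -> List[Tuple[str, str]]:
    """Return (boot_id, last recorded event) pairs in first-seen boot order."""
    last = {}
    with log_path.open() as fh:
        for line in fh:
            if line.strip():
                entry = json.loads(line)
                last[entry["boot_id"]] = entry["event"]
    return list(last.items())


if __name__ == "__main__":
    # Hypothetical persistent location; adjust to wherever the log can survive a reboot.
    log = Path("/data/boot-events.jsonl")
    record_event(log, read_boot_id(), "started")
    for boot_id, event in last_event_per_boot(log):
        print(boot_id, event)
```

The idea would be to write a `started` event when the container comes up and a `reboot_requested` event just before triggering a reboot. After remote hands power cycle a hung unit, a boot whose last event is `reboot_requested` with a long gap before the next boot's `started` event would at least timestamp the hang, and could be compared against the journal or vendor ACPI logs if those are available.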