Devices no longer connect after upgrade from 2.32 to 2.48

We have now had three instances where, after a Host OS update from 2.32 to 2.48, the system goes offline and no longer connects to the internet. The containers do seem to start, since we see status lights blinking in a pattern consistent with our containers’ startup sequence, but the device never comes back online. The only thing that seems to fix it is a new SD card.

This is unfortunately a big problem for us because the systems in the field are difficult to reach, especially with COVID-related movement restrictions. Previously, updating our images on 2.32 over low-bandwidth connections was unreliable, and the advice we received was to update to the latest OS first.

Is there a best practice for upgrading that would minimize the chance of this happening? Is there any evidence that it would be more reliable to go from 2.32 to an intermediate release, such as 2.44, and from there to 2.48? Given our poor (<70%) upgrade success rate, as things stand we have to write off OTA updates completely, which is a pretty big bummer.

Hi,

On the devices that are failing, have you been able to do any analysis on the original SD card? For example, could they have run out of space during the upgrade? Also, do you reboot the devices after the upgrade? Any logs you could share would be useful for our engineering team to look at.
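If it helps, here’s a rough way to check free space from inside one of your containers before and after an update attempt (just a sketch; the non-root paths are guesses at where your data might live, so adjust as needed):

```python
import shutil

# Report free space on a few likely partitions. "/" is the container root;
# the other paths are placeholders for wherever app data and logs are kept.
for path in ("/", "/data", "/var/log"):
    try:
        usage = shutil.disk_usage(path)
    except FileNotFoundError:
        continue
    free_mib = usage.free / (1024 * 1024)
    pct_used = 100 * usage.used / usage.total
    print(f"{path}: {free_mib:.0f} MiB free ({pct_used:.1f}% used)")
```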

John

Can you also tell us what kind of devices you’re using?

Thank you. I will see if I can get the SD card back here, but it’s unlikely - the tech support staff there are not technical enough to provide that kind of detail. I do know for certain that we do not run out of space.

Also, we are running Raspberry Pi Zero W (pi0w) devices.

We don’t currently have any reports of such behaviour from other apps, so it would definitely be wise to analyze the SD cards after the fact to find out what specifically is happening here. I assume you’ve already looked at the logs during the update to try to find anything? At this point I’d say the investigation can’t proceed without being able to analyze the resulting bad state on the affected devices.

OK, thanks - I’ll see if I can get them to mail it back. I take it, then, that there isn’t any reason to believe that an alternate update strategy (such as going through an intermediate release) would be any safer.

In the meantime, in the absence of any other action to take, it can’t hurt, right? Even if it only results in fewer failed devices, it may still be hard to reason from that outcome back to the root cause, but at the very least it would mean a greater proportion of successful updates.

For our part in the meantime, something I think we can look into is whether there are any networking-related changes between those OS versions that could be relevant to a pi0w device (or in general). I’ll start that process now.

One question that could help in the meantime: are any of these devices (which complete the host update but then don’t connect to the internet, according to the dashboard) on the same subnet as devices that do update successfully? If so, there’s a way to SSH from a healthy device on the same subnet into a device that’s failing to connect to the VPN, and that might allow live debugging.
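As a rough sketch of the idea (the IP, username, and port below are placeholders; use whatever local SSH access your host OS actually exposes), something like this run from the healthy device would pull recent host logs off the failing one:

```python
import subprocess

FAILING_DEVICE_IP = "192.168.1.42"  # placeholder: local IP of the non-connecting device
LOCAL_SSH_PORT = 22222              # placeholder: the SSH port the host OS exposes on the LAN

# Run on the healthy device: fetch the last 200 journal lines from the device
# that isn't reaching the VPN, to see where its connectivity is breaking down.
result = subprocess.run(
    ["ssh", "-p", str(LOCAL_SSH_PORT), f"root@{FAILING_DEVICE_IP}",
     "journalctl", "--no-pager", "-n", "200"],
    capture_output=True, text=True, timeout=60,
)
print(result.stdout or result.stderr)
```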

The devices are all individually in remote areas - some are really remote, as in four-hour-burro-ride remote. I connected with a colleague who experienced this issue in the lab and, working with your support group, got this:

So I managed to get the logs from your device. The upgrade logs indicate the update process went well, but when trying to boot into the new OS, the kernel panicked and was unable to boot the userspace. Whilst this might be fixed by retrying the upgrade, it can indicate a problem with the storage medium of the device.
A slight correction: we don’t know specifically that the kernel panicked, but something stopped the runtime from reaching userspace, so the rollback occurred.

So perhaps it is some kind of SD card corruption, but what’s surprising is how often this has occurred.

Is the application - to your knowledge - heavy in disk IO? While we’re still at the stage of hypotheses with little data, one idea that comes to mind is that disk corruption accumulated over the lifespan of the SD cards in question, and the host OS update, for one reason or another, “exposed” it: what had been a minor issue became critical and caused the panic in question.
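If it helps to put a number on it, one quick way to gauge how hard the application hits the card is to sample /proc/diskstats on a device that’s still healthy - a minimal sketch, assuming the SD card shows up as mmcblk0 (typical on a Pi, but worth checking):

```python
import time

def sectors_written(device="mmcblk0"):
    """Return the running total of 512-byte sectors written to `device`."""
    with open("/proc/diskstats") as stats:
        for line in stats:
            fields = line.split()
            if fields[2] == device:
                return int(fields[9])  # 10th column overall: sectors written
    raise ValueError(f"{device} not found in /proc/diskstats")

INTERVAL = 60  # seconds to sample over
before = sectors_written()
time.sleep(INTERVAL)
after = sectors_written()
kib_per_sec = (after - before) * 512 / 1024 / INTERVAL
print(f"~{kib_per_sec:.1f} KiB/s written to the card over the last {INTERVAL}s")
```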

As for the reply above from one of our support engineers, do you happen to have the subject line of that thread on hand? This would enable us to search back and look at that history.

Yes, it’s heavy in log writes - our newer apps reduce log writes dramatically and also batch them to cut overall disk writes, but the older ones in the field write about a line per second. Furthermore, they write to the console, which means everything goes into the journal logs as well - a double whammy.
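For illustration, the batching in the newer apps is roughly along these lines (illustrative only, not our actual code) - buffer records in memory with Python’s standard MemoryHandler and flush them to disk in chunks:

```python
import logging
import logging.handlers

# Write to a file on the data partition (path is illustrative) rather than the
# console, so the lines don't also end up in the journal.
file_handler = logging.FileHandler("/data/app.log")

# Buffer up to 300 records and flush them as a single batch; anything at ERROR
# or above flushes immediately so urgent messages aren't delayed.
batching_handler = logging.handlers.MemoryHandler(
    capacity=300, flushLevel=logging.ERROR, target=file_handler
)

logger = logging.getLogger("sensor")
logger.setLevel(logging.INFO)
logger.addHandler(batching_handler)

# Still roughly a line per second from the app's point of view, but the card
# now sees one write per batch rather than one per line.
logger.info("reading: %s", 42)
```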

The support thread was on Intercom, so there’s no title. There was also no resolution beyond what I posted.

Unfortunately, in the absence of more data, SD card corruption seems to be the most likely explanation. It’s good that the log writes have been reduced, and it would also be good to make sure that persistent logging is disabled unless you actually need it. It’s disabled by default.
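If you want to double-check a device, here’s a generic way to see whether journald is keeping logs on disk (a sketch only; it assumes access to the host filesystem and standard systemd behaviour, where under the default Storage=auto the journal is only persisted if /var/log/journal exists - your host OS may also have its own persistent-logging switch on top of this):

```python
import os

# With systemd's default Storage=auto, the journal is only written to disk
# when /var/log/journal exists; otherwise it stays in memory (volatile).
print("persistent journal directory exists:",
      os.path.isdir("/var/log/journal"))

# Also surface any explicit Storage= setting in journald.conf, if present.
conf_path = "/etc/systemd/journald.conf"
if os.path.exists(conf_path):
    with open(conf_path) as conf:
        for line in conf:
            if line.strip().startswith("Storage"):
                print("journald.conf:", line.strip())
```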