Trouble with openbalena after internet outage

bidikov · December 19, 2019, 3:29pm

Hi team,
I have an update on the no internet at the devices / load on the VM issue…

Strangely, it seems that these 2 events are somehow connected… here is a short timeline of yesterday’s events:

Two devices in the same city lost internet connectivity (and after the UPS got drained they might even have been without power) because of a serious power outage in the city.
Now I’m almost certain that the UPS devices managed to hold up and only internet connectivity was lost, but I saw a strange connection… 30 minutes after the internet was gone (and the devices stopped sending data) we saw the load of the openbalena VM rise from 0.8 to 14…
One of the devices never showed as online in openbalena (until a power cycle today).
The second device was online in openbalena after we rebooted the VM (and the load returned to normal), but when we executed a device reboot xxxxx command the device went offline and never returned in openbalena until today…

After a power cycle today everything is normal…
Now 3 questions arise from this:
First - how and when to check if the IP address lease could be the reason the devices did not get an IP back after the internet returned (since I presume that the main router rebooted because of the outage)? I presume that DHCP failed (exited before the internet was back) and/or never renewed the lease as expected, and that is why a reboot was unavoidable…

Secondly, how safe is a power cycle as an operation in situations like this (especially because of the SD card risks), and can we make the device not use the SD card except for booting…

Third, any ideas how to ship logs to a central syslog or a similar solution, to have some of the logging (both from the application and the host device OS) in a place that is easier to analyze?

Thanks for the support and hope we can get to the bottom of this in order to continue the project/deployment…

tmigone · December 19, 2019, 5:16pm

Hello,

Thanks for the update. It still seems, though, that the underlying issue is the way the VPS is handling outages. When you say the VM load increased, do you mean CPU, memory, or both? Did you perhaps get the chance to test on a bare metal machine?

On the questions you have:

First - how and when to check if the IP address lease could be the reason the devices did not get an IP back after the internet returned (since I presume that the main router rebooted because of the outage)? I presume that DHCP failed (exited before the internet was back) and/or never renewed the lease as expected, and that is why a reboot was unavoidable…

What could be happening is that the router is not persisting leases across the outage, so once it gets back online it starts reassigning them with no knowledge of your devices’ existing leases. In this case it makes sense that power cycling the device gets it back online, since a restart will trigger a lease renew request. As to how and when to check for this behavior, I think the watchdog idea discussed previously could be of use. If you can detect from within the device when these outages happen (by checking connectivity to the DHCP server) you can then force a reboot (or even a lease renew) when required.
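A minimal sketch of that watchdog idea in Python, not a tested recipe. It assumes the service container carries the `io.balena.features.supervisor-api` label, so that `BALENA_SUPERVISOR_ADDRESS` and `BALENA_SUPERVISOR_API_KEY` are set, and it uses the Supervisor's `POST /v1/reboot` endpoint. The function names, the failure threshold, and the gateway address in the usage comment are illustrative assumptions. A lease renew could be swapped in for the reboot action if it proves to work.

```python
import os
import subprocess
import time
import urllib.request


def gateway_reachable(host, timeout_s=2):
    """Return True if a single ICMP ping to `host` succeeds."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def request_reboot():
    """Ask the local balena Supervisor to reboot the device."""
    address = os.environ["BALENA_SUPERVISOR_ADDRESS"]
    api_key = os.environ["BALENA_SUPERVISOR_API_KEY"]
    req = urllib.request.Request(
        f"{address}/v1/reboot?apikey={api_key}", data=b"", method="POST"
    )
    urllib.request.urlopen(req, timeout=10)


def watchdog(check, act, max_failures=5, interval_s=60,
             sleep=time.sleep, max_checks=None):
    """Call `act` each time `check` has failed `max_failures` times in a row.

    `max_checks` limits the number of iterations (None means run forever);
    it exists mainly so the loop can be exercised in a test.
    """
    failures = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        failures = 0 if check() else failures + 1
        if failures >= max_failures:
            act()
            failures = 0
        sleep(interval_s)


# Example (hypothetical gateway address, runs until the device reboots):
# watchdog(lambda: gateway_reachable("192.168.1.1"), request_reboot)
```

Requiring several consecutive failures before acting keeps a single dropped ping from rebooting the device.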

Secondly, how safe is a power cycle as an operation in situations like this (especially because of the SD card risks), and can we make the device not use the SD card except for booting…

Regarding SD cards, these events seem to be very occasional, so especially if you are using the SD cards we recommend (SanDisk Extreme Pro series) the risk should be very low. If you are still concerned about SD card reliability, I can recommend looking into the balenaFin product, which is a professional, industrial-quality carrier board that solves many reliability issues the Pis have (including SD card risks). You can learn more about the fin here:

Another thing to consider in order to minimize power cycling is the idea from the previous question of forcing the device to do a lease renew instead of a reboot, provided, of course, that it works.

Third, any ideas how to ship logs to a central syslog or a similar solution, to have some of the logging (both from the application and the host device OS) in a place that is easier to analyze?

Are you referring to sending logs to an online service or an on-device master log type of solution?

Thanks, Tomás

bidikov · December 19, 2019, 6:19pm

Both went up… memory was at 8GB used, CPU was at 400% (for 4 vCPU), and the Linux load in top/htop was 16.2 - still trying to verify if this was connected, but it seems to be the case (which makes it even more strange)

This was also my idea of the problem - and then it looks like there should be a way to ask the balenaOS DHCP client to keep the lease time (retry no matter what) very low (like 5-10 min), since requesting a renewal before the lease expires is never a problem for a DHCP server… compared to the risk of getting a lease time of 1/3/6/24 hours from a bad config in the LAN DHCP (which we cannot always control) and then being left dead in the water like this…
In any case, there is the question of the logs (which I will talk about below), and more examples of how this supervisor API works (too many keys and generic examples - some quick hack / howto for a simple command would help a lot) in order to get some watchdog-like procedure implemented…

Looks like we will power cycle devices as needed until we can fix/debug this, and we can definitely look into the balenaFin…

Well, an online service is a bit of overkill for me… a simple way to get remote syslog from balenaOS (with a simple host/IP as a variable), and maybe the same for the apps, would be more than good…
Then standard Linux syslog utils on that syslog machine can be more than sufficient for further debug analysis…

Thanks,
V.B

CameronDiver · December 19, 2019, 7:29pm

I may have misunderstood this, but it’s possible to read the journal of devices using a supervisor API endpoint: Interacting with the balena Supervisor - Balena Documentation
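As a rough illustration of that endpoint, here is a short Python sketch that requests recent journal lines from the Supervisor's `POST /v2/journal-logs` endpoint. It assumes the container has the `io.balena.features.supervisor-api` label (so the two `BALENA_SUPERVISOR_*` variables are available) and a Supervisor version that includes this endpoint. The helper names and default values are assumptions for illustration only.

```python
import json
import os
import urllib.request


def build_journal_request(address, api_key, count=100, unit=None):
    """Build the POST request for the Supervisor journal-logs endpoint."""
    body = {"follow": False, "all": True, "format": "short", "count": count}
    if unit is not None:
        body["unit"] = unit  # restrict output to a single systemd unit
    return urllib.request.Request(
        f"{address}/v2/journal-logs?apikey={api_key}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_journal(count=100, unit=None):
    """Return the last `count` journal lines as a list of strings."""
    req = build_journal_request(
        os.environ["BALENA_SUPERVISOR_ADDRESS"],
        os.environ["BALENA_SUPERVISOR_API_KEY"],
        count=count,
        unit=unit,
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace").splitlines()


# Example, run inside a container on the device:
# for line in fetch_journal(count=50):
#     print(line)
```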

Does this help?

bidikov · December 19, 2019, 8:16pm

Well, most of the supervisor API looks OK - although the documentation looks too complex for me (and again, some nice snippets and examples with all those keys/tokens would be really beneficial)
My idea is more like standard Linux thinking - yes, an API is a good concept, but I have to make requests and the integration is complex… my old and grumpy self thinks that having a way to set up one IP/host, which can already be routed over the VPN network overlay, would be quite a bit nicer…
Yes, I know it’s not 21st-century modern API science, but the concept of remote syslog has been in use for 30 years now and is compatible with most NMS and device monitoring systems (yes, SNMP and syslog are still king)

The implementation would be very simple, since syslog already has everything needed to support this by default - you just enable it, and in this mode syslog will not use the filesystem, which means fewer write I/O operations…

Hope we understand the idea completely…

All the best

CameronDiver · December 19, 2019, 8:29pm

I see. Yeah, unfortunately this is not something that we support with balenaOS right now, as far as I’m aware. We do run journald on devices, so perhaps there would be something you could do to run this yourself, although I’m not sure.
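One way to "run this yourself" could be a small forwarder in an application container that pushes log lines to a remote syslog host over UDP. The Python sketch below uses only the standard library's `logging.handlers.SysLogHandler`. It is not a balenaOS feature, and the function names, the tag, and the idea of reading the syslog host from a device variable are assumptions made for illustration.

```python
import logging
import logging.handlers


def make_syslog_logger(host, port=514, tag="balena-device"):
    """Return a logger that sends each record to `host:port` via UDP syslog."""
    handler = logging.handlers.SysLogHandler(
        address=(host, port),
        facility=logging.handlers.SysLogHandler.LOG_DAEMON,
    )
    # Prefix every message with a tag so the receiver can filter on it.
    handler.setFormatter(logging.Formatter(f"{tag}: %(message)s"))
    logger = logging.getLogger(f"remote-syslog.{host}.{port}")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers = [handler]
    return logger


def forward_lines(lines, logger):
    """Send every non-empty line to the remote syslog; return the count sent."""
    sent = 0
    for line in lines:
        line = line.strip()
        if line:
            logger.info(line)
            sent += 1
    return sent


# Example (the host would come from a device variable of your choosing):
# log = make_syslog_logger("10.0.0.5")
# forward_lines(["application started"], log)
```

The lines can come from any source, for example the journal read through the Supervisor API endpoint mentioned earlier in the thread. UDP syslog is fire-and-forget, so messages sent while the network is down are lost, not queued.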
