I’ve got a fleet of Pi3’s that I’m using for wallboards around the office.
Before I discovered Balena I was using stock Raspbian and Chromium, and everything was really stable. I’m not sure why yet, but since I shifted to Balena my Pis reboot themselves about once a day. Two of my dashboards Just Work without requiring anyone to log in, but after each reboot my NewRelic dashboard requires someone to VNC into the system and log in manually.
Debugging steps so far:
Upgraded to 2.5A micro USB power supplies: no effect.
Turned on persistent logging, but that appears to have had no effect either. I’m none the wiser as to just what is causing the boxes to reboot.
Anyone here got any suggestions as to how to proceed?
Uptime on device 00e083378d5df0144cf34ddca09515da is 4 hours as reported on the web UI, 3:51 as reported by ‘uptime’…
balenaOS 2.38.0+rev1
Unfortunately I don’t have any spare Pis at the moment to try the sample app on, but I’ll try and get hold of one.
The only thing the app is doing is starting up X11 and then running x11vnc and chromium. Chromium loads three tabs and cycles between them using tabcarousel.
Is there any place that records logs from a previously booted instance?
In Device Configuration activate variable RESIN_SUPERVISOR_PERSISTENT_LOGGING and set its state to Enabled.
After a reboot you should be able to list past boots with: journalctl --list-boots
You can read their logs with journalctl -b -1 for the previous boot, journalctl -b -2 for the one before that, and so on.
Please note that the log size is capped at 8MB, so older entries can be overwritten.
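For anyone who would rather script this than page through the journal by hand, here is a minimal Python sketch of the steps above. It assumes persistent logging is already enabled and that Python 3 is available wherever it runs (it may not be on the balenaOS host, in which case the filter can be applied to journal text copied off the device). The function names and the keyword list are illustrative assumptions, not anything provided by balena.

```python
import subprocess

# Illustrative keywords only (an assumption, not an official list of failure markers).
SUSPECT_KEYWORDS = ("under-voltage", "watchdog", "i/o error", "out of memory")


def journal_command(boot_offset):
    """Build the journalctl command for a previous boot (1 = the boot before this one)."""
    if boot_offset < 1:
        raise ValueError("boot_offset must be 1 or greater")
    return ["journalctl", "--no-pager", "-b", f"-{boot_offset}"]


def suspect_lines(journal_text, keywords=SUSPECT_KEYWORDS):
    """Return the journal lines that contain any keyword, ignoring case."""
    return [
        line
        for line in journal_text.splitlines()
        if any(keyword in line.lower() for keyword in keywords)
    ]


def read_previous_boot(boot_offset=1):
    """Run journalctl for the given previous boot and return its output as text."""
    result = subprocess.run(
        journal_command(boot_offset),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

On a device, suspect_lines(read_previous_boot(1)) would narrow the previous boot's log down to the lines worth reading first. Because of the 8MB cap, it is worth running soon after a reboot.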
What could also help debug this issue would be to use a debug build and have the board connected to a computer over serial. This way you would catch all the logs.
Here’s the last couple of hours before a reboot at 7pm UTC yesterday. Can you theorise as to what might be causing the watchdog to time out? What does it monitor and what is it expecting to see?
The Supervisor healthcheck shouldn’t cause a reboot, but it should cause the Supervisor to restart. There’s a couple of things here:
The healthcheck failing isn’t normal, and it sounds like there is an underlying issue here. Is it possible your service stalls, halts, or does something kernel-based that may be causing an issue?
A reboot usually signifies that something fairly catastrophic has occurred in the kernel (as Sebastian notes), or potentially a corrupt SD card.
Would it be possible to grant support access for this device (and supply the dashboard URL), so we can have a look at it in its current state? As previously stated, it would be incredibly helpful if persistent logging were enabled, as this would also allow us to examine what occurs after it does reboot.
We can see an unusual situation that occurred at 2:30am this morning (where the balenaEngine restarted, but there was no reboot), but apart from that there’s nothing to indicate anything unusual apart from the healthcheck failing before the reboot (and no other information). Running the sample app on a Pi exhibiting this would be really helpful for us, as it’d help narrow down whether something in the service was causing issues. Given the very violent nature of the reboot, and the fact that there’s nothing else in the logs, this does seem like something that isn’t graceful.
We’d obviously like to know as soon as you see this again so we can look just after it happens.
I did have some SD card problems recently so bought brand new cards for all five Pis. I’ll get a new Pi ordered so that we can have something to test with…
Kind regards,
Al.
PS: I enabled access for the max length of time possible…
Ok, new Pi is up and running - 127a42ad211fabb8c558376fc4fa09a2 - I granted support access to it for a week. We will see if it also restarts itself or not…
Thanks @ajs1k. Just to confirm, despite support access having been granted to the Pi for a week (thank you!), we don’t have a mechanism to alert us upon an “event” (the Pi restarting itself), so if you notice it happening, please send us a message here and we will look at it then.
Right now, 284708565de0b10a07721283235af17b is in a weird state: it is operational but seriously slow, and connecting to it via the balenaCloud web UI doesn’t work.
Also, a2cbe5478b9f868bca3f1ba56c517165 is similarly slow… and 00e083378d5df0144cf34ddca09515da is totally offline - I had to power-cycle it. Support access granted to all of the above…
Hey @ajs1k, on a2cbe5478b9f868bca3f1ba56c517165 and 00e083378d5df0144cf34ddca09515da the kernel detected under-voltage (0x00050005). Please make sure your devices’ power supplies are providing sufficient power. You can check for those messages by running journalctl on the host OS.
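As a rough aid for reading that hex value, here is a small Python sketch that decodes it. It assumes the number in the kernel message uses the same bit layout as the Raspberry Pi firmware's vcgencmd get_throttled output, which is an assumption on my part rather than something confirmed in this thread. The function name is made up for illustration.

```python
# Bit positions as documented for `vcgencmd get_throttled`.
# Assumption: the value in the kernel's under-voltage message uses the same layout.
THROTTLE_FLAGS = {
    0: "under-voltage detected",
    1: "ARM frequency capped",
    2: "currently throttled",
    3: "soft temperature limit active",
    16: "under-voltage has occurred",
    17: "ARM frequency capping has occurred",
    18: "throttling has occurred",
    19: "soft temperature limit has occurred",
}


def decode_throttled(value):
    """Return the descriptions of every flag bit set in value, lowest bit first."""
    return [
        description
        for bit, description in sorted(THROTTLE_FLAGS.items())
        if value & (1 << bit)
    ]


if __name__ == "__main__":
    for flag in decode_throttled(0x00050005):
        print(flag)
```

Under that assumption, 0x00050005 has bits 0, 2, 16 and 18 set: the board was under-voltage and throttled at the time of the message, and both conditions had also occurred earlier since boot.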
@ajs1k The issues on 284708565de0b10a07721283235af17b appear to be caused by SD card corruption or a slow SD card. You can check out this thread: SD Card lifetime on Raspberry, for some references on SD cards.
Many thanks - I’ll go and replace that SD card (they’re less than 3 months old!) and see if I can pick up some better power supplies. The PSUs on a2cbe54 and 2847085 are possibly dodgy, and on 00e0833 I discovered that the PSU is 2A and not 2.5A. The other PSUs are the “official raspberry PSUs”, at least.