Raspberry Pis keep rebooting

I’ve got a fleet of Pi 3s that I’m using for wallboards around the office.

Before I discovered Balena I was using stock Raspbian and Chromium, and everything was really stable. I’m not sure why yet, but since I shifted to Balena my Pis reboot themselves about once a day. Two of my dashboards Just Work without anyone needing to log in, but my NewRelic dashboard requires someone to VNC into the system and log in again after every reboot.

Debugging steps so far:

  • upgraded to 2.5A micro USB power supplies - no effect.
  • turned on persistent logging, but that doesn’t appear to have got me anywhere; I’m none the wiser as to what is causing the boxes to reboot.

Anyone here got any suggestions as to how to proceed?

Thanks,
Al.

Hi,

What BalenaOS version are you running?

Can you run uptime in the host OS console to see how long the devices run before they reboot?

We usually leave boards powered for multiple days and they don’t reboot by themselves.

I’m thinking that maybe your application somehow triggers a board reboot by crashing the kernel.

Could you run a sample application like https://github.com/balena-io-projects/simple-server-node and check uptime?
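
For reference, getting that sample running is just a couple of commands with the balena CLI (the application name below is a placeholder):

# clone the sample project and push it to a test application
git clone https://github.com/balena-io-projects/simple-server-node
cd simple-server-node
balena push my-test-app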

Uptime on device 00e083378d5df0144cf34ddca09515da is 4 hours as reported on the web UI, 3:51 as reported by ‘uptime’…

balenaOS 2.38.0+rev1

Unfortunately I don’t have any spare Pis at the moment to try the sample app on, but I’ll try and get hold of one.

The only thing the app is doing is starting up X11 and then running x11vnc and chromium. Chromium loads three tabs and cycles between them using tabcarousel.
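
For what it’s worth, the start script is roughly along these lines (the URLs are placeholders and the exact flags are from memory, so treat this as a sketch rather than the real script):

#!/bin/bash
# bring up a bare X server on display :0
X :0 &
export DISPLAY=:0
sleep 2

# expose the display over VNC so someone can log in to the NewRelic tab remotely
x11vnc -display :0 -forever &

# Chromium loads the three dashboard tabs; the tabcarousel extension cycles between them
chromium-browser --noerrdialogs --start-fullscreen \
  "https://dashboard-one.example" \
  "https://dashboard-two.example" \
  "https://newrelic-dashboard.example"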

Is there any place that records logs from a previously booted instance?

Hi,

We have the option to enable persistent logging.

In Device Configuration, activate the RESIN_SUPERVISOR_PERSISTENT_LOGGING variable and set it to Enabled.

After reboots you should see past logs with:
journalctl --list-boots

You can read them with journalctl -b -1, journalctl -b -2, and so on.

Please note that the persistent log store is capped at 8MB, so older entries can be overwritten.
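
In practice, after a reboot the workflow looks something like this:

# list the boots journald knows about
journalctl --list-boots
# each line shows an index (0 = current boot, -1 = previous), the boot ID, and the first/last timestamps

# read the previous boot’s log, jumping straight to the end
journalctl -b -1 -e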

What could also help debug this issue would be to use a debug build and keep the board connected to a computer over serial. That way you would catch all the logs, even across an abrupt reboot.
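
For example, with a USB-to-TTL adapter on the Pi’s UART pins, something like this on the attached computer would capture the console output to a file (the device node depends on your adapter; 115200 baud is the Pi’s default console speed):

# attach to the serial console and log everything (screen writes to screenlog.0)
screen -L /dev/ttyUSB0 115200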

Here’s the last couple of hours before a reboot at 7pm UTC yesterday. Can you theorise as to what might be causing the watchdog to time out? What does it monitor, and what is it expecting to see?

Thanks!
Al.

Jul 30 18:24:27 a2cbe54 df7643779caa[18434]: Supervisor API: GET /v1/healthy 200 - 408.004 ms
Jul 30 18:29:32 a2cbe54 df7643779caa[18434]: Supervisor API: GET /v1/healthy 200 - 170.878 ms
Jul 30 18:29:32 a2cbe54 resin-supervisor[18863]: Supervisor API: GET /v1/healthy 200 - 170.878 ms
Jul 30 18:30:51 a2cbe54 resin-supervisor[18863]: Attempting container log timestamp flush…
Jul 30 18:30:51 a2cbe54 df7643779caa[18434]: Attempting container log timestamp flush…
Jul 30 18:30:51 a2cbe54 resin-supervisor[18863]: Container log timestamp flush complete
Jul 30 18:30:51 a2cbe54 df7643779caa[18434]: Container log timestamp flush complete
Jul 30 18:32:17 a2cbe54 wpa_supplicant[799]: wlan0: WPA: Group rekeying completed with 80:2a:a8:15:fc:27 [GTK=CCMP]
Jul 30 18:34:35 a2cbe54 df7643779caa[18434]: Supervisor API: GET /v1/healthy 200 - 183.789 ms
Jul 30 18:34:35 a2cbe54 resin-supervisor[18863]: Supervisor API: GET /v1/healthy 200 - 183.789 ms
Jul 30 18:39:39 a2cbe54 df7643779caa[18434]: Supervisor API: GET /v1/healthy 200 - 372.797 ms
Jul 30 18:39:40 a2cbe54 resin-supervisor[18863]: Supervisor API: GET /v1/healthy 200 - 372.797 ms
Jul 30 18:40:51 a2cbe54 df7643779caa[18434]: Attempting container log timestamp flush…
Jul 30 18:40:51 a2cbe54 df7643779caa[18434]: Container log timestamp flush complete
Jul 30 18:40:51 a2cbe54 resin-supervisor[18863]: Attempting container log timestamp flush…
Jul 30 18:40:51 a2cbe54 resin-supervisor[18863]: Container log timestamp flush complete
Jul 30 18:44:43 a2cbe54 df7643779caa[18434]: Supervisor API: GET /v1/healthy 200 - 217.869 ms
Jul 30 18:44:43 a2cbe54 resin-supervisor[18863]: Supervisor API: GET /v1/healthy 200 - 217.869 ms
Jul 30 18:49:46 a2cbe54 df7643779caa[18434]: Supervisor API: GET /v1/healthy 200 - 183.341 ms
Jul 30 18:49:46 a2cbe54 resin-supervisor[18863]: Supervisor API: GET /v1/healthy 200 - 183.341 ms
Jul 30 18:50:51 a2cbe54 df7643779caa[18434]: Attempting container log timestamp flush…
Jul 30 18:50:51 a2cbe54 df7643779caa[18434]: Container log timestamp flush complete
Jul 30 18:50:51 a2cbe54 resin-supervisor[18863]: Attempting container log timestamp flush…
Jul 30 18:50:51 a2cbe54 resin-supervisor[18863]: Container log timestamp flush complete
Jul 30 18:54:50 a2cbe54 df7643779caa[18434]: Supervisor API: GET /v1/healthy 200 - 286.239 ms
Jul 30 18:54:50 a2cbe54 resin-supervisor[18863]: Supervisor API: GET /v1/healthy 200 - 286.239 ms
Jul 30 18:59:55 a2cbe54 df7643779caa[18434]: Supervisor API: GET /v1/healthy 200 - 176.160 ms
Jul 30 18:59:55 a2cbe54 resin-supervisor[18863]: Supervisor API: GET /v1/healthy 200 - 176.160 ms
Jul 30 19:00:51 a2cbe54 df7643779caa[18434]: Attempting container log timestamp flush…
Jul 30 19:00:52 a2cbe54 resin-supervisor[18863]: Attempting container log timestamp flush…
Jul 30 19:00:52 a2cbe54 resin-supervisor[18863]: Container log timestamp flush complete
Jul 30 19:00:51 a2cbe54 df7643779caa[18434]: Container log timestamp flush complete
Jul 30 19:05:42 a2cbe54 systemd[1]: resin-supervisor.service: Watchdog timeout (limit 3min)!
root@a2cbe54:~#

Hi,

The Supervisor healthcheck shouldn’t cause a reboot, but it should cause the Supervisor to restart. There are a couple of things here:

  1. The healthcheck failing isn’t normal, and it sounds like there’s an underlying issue here. Is it possible your service stalls, halts, or does something kernel-level that could be causing it?
  2. A reboot usually signifies that something fairly catastrophic has happened in the kernel (as Sebastian notes), or potentially a corrupt SD card.
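
To expand on the watchdog question: the Supervisor service registers with systemd’s software watchdog, so systemd expects a periodic keep-alive from the healthcheck and restarts the unit when it doesn’t arrive within the configured window (3 minutes in your log). The exact configuration varies between balenaOS versions, but you can inspect it on the device with something like:

# show the full unit file, including the WatchdogSec= setting behind the “limit 3min” message
systemctl cat resin-supervisor.service

# or query just the watchdog interval systemd is enforcing
systemctl show resin-supervisor.service -p WatchdogUSec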

Would it be possible to grant support access for this device (and supply the dashboard URL), so we can have a look at it in its current state? As previously stated, it would also be incredibly helpful if persistent logging were enabled, as that would allow us to examine what happens after it reboots.

Best regards, Heds

Certainly - I’ll sort that out for you… a2cbe5478b9f868bca3f1ba56c517165 is the device UUID…

And yes, persistent logging is indeed enabled. The reboots seem to happen at most about once every 24 hours.

Thank you, I’ll take a quick look now and let you know if I find anything untoward.

Best regards, Heds

Great - thank you!!

Hi,

We can see an unusual situation at 2:30am this morning (the balenaEngine restarted, but there was no reboot); apart from that, the only thing out of the ordinary is the healthcheck failing before the reboot, with no other information. Running the sample app on a Pi that exhibits this would be really helpful for us, as it would narrow down whether something in the service is causing issues. Given the very violent nature of the reboot, and the fact that there’s nothing else in the logs, this does seem like something that isn’t graceful.

We’d obviously like to know as soon as you see this again so we can look just after it happens.

Best regards, Heds

Thanks Heds!

I did have some SD card problems recently, so I bought brand new cards for all five Pis. I’ll get a new Pi ordered so that we have something to test with…

Kind regards,
Al.
PS: I enabled access for the max length of time possible…

Ok, new Pi is up and running - 127a42ad211fabb8c558376fc4fa09a2 - I granted support access to it for a week. We will see if it also restarts itself or not…

Thanks @ajs1k. Just to confirm, despite support access having been granted to the Pi for a week (thank you!), we don’t have a mechanism to alert us upon an “event” (the Pi restarting itself), so if you notice it happening, please send us a message here and we will look at it then.

Will do - the wallboard fleet partially restarted today. I’ll keep you posted!

Right now, 284708565de0b10a07721283235af17b is in a weird state - it is operational, but it’s seriously slow, and connecting to it via the balenaCloud web UI doesn’t work.

Indeed, it actually just rebooted itself:

root@2847085:~# uptime
09:41:51 up 0:01, 1 user, load average: 2.86, 0.99, 0.36
root@2847085:~#

Also, a2cbe5478b9f868bca3f1ba56c517165 is similarly slow… and 00e083378d5df0144cf34ddca09515da is totally offline - I had to power-cycle it. Support access granted to all of the above…

Hey @ajs1k, on a2cbe5478b9f868bca3f1ba56c517165 and 00e083378d5df0144cf34ddca09515da the kernel detected under-voltage (0x00050005), so please make sure your devices’ power supplies are providing sufficient power. You can check those messages by running journalctl on the host OS.
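
For example, from the host OS console you can filter the kernel log for those reports (the exact message text can vary by kernel version, and vcgencmd may or may not be present on the host OS):

# look for under-voltage / throttling warnings from the firmware
journalctl -k | grep -iE "voltage|throttl"

# if vcgencmd is available, a non-zero value here also points to power problems
vcgencmd get_throttled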

@ajs1k The issues on 284708565de0b10a07721283235af17b appear to be caused by SD card corruption or a slow SD card. You can check out this thread for some references on SD cards: SD Card lifetime on Raspberry.
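
For reference, SD card trouble usually shows up as mmc I/O errors in the kernel log, so a quick check on the host OS is something like (the exact messages vary by card and kernel):

# look for SD card (mmc) errors in the kernel log
journalctl -k | grep -i mmc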

Many thanks - I’ll go and replace that SD card (they’re less than 3 months old!) and see if I can pick up some better power supplies. The PSUs on a2cbe54 and 2847085 are possibly dodgy, and on 00e0833 I discovered the supply is 2A rather than 2.5A - the other PSUs are the “official Raspberry Pi PSUs”, at least.