Raspberry Pis keep rebooting

Same situation.

We will try this on one of our boards and see if we can reproduce.

I swapped over the board just now.

I can also try swapping over the SD card if swapping the board has had no effect…

No effect - catting /var/cache/ldconfig/aux-cache crashes the system. I’ll swap out the SD card next.
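If it helps narrow things down, a raw read of the card that bypasses the filesystem should tell a bad card apart from filesystem-level corruption. A sketch, assuming the SD card shows up as /dev/mmcblk0 (the usual device node on a Pi):

# Reading the file through the filesystem triggers the crash:
cat /var/cache/ldconfig/aux-cache > /dev/null

# Narrowing step: read raw blocks from the card instead. If this
# also crashes, suspect the card or hardware rather than the fs.
dd if=/dev/mmcblk0 of=/dev/null bs=4M count=256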

Just a quick one; have you re-downloaded the image from the dashboard, or is this the same image file each time you flash? Would be good to know, as it could be a bad image file.

It’s the same image file - I’ll re-fetch it and compare MD5. Swapped out the SD card in the meantime - it’s now 1f231ff.
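Something like this for the comparison (filenames here are placeholders, not the actual image names):

# Matching sums mean the flashing source is fine;
# a mismatch points at a corrupted download.
md5sum balena-original.img balena-redownloaded.img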

I can still crash the system by reading this file. I'll try flashing the newly downloaded image.

I reproduced this issue on our boards too. We will let you know how the investigation goes.


Hi guys - how's this looking?

I’m giving balenaOS 2.41.0+rev3 a try… let’s see if there were any fixes in that release…

After I deployed the new OS, everything stopped working. It gets stuck starting my X11 server, and I've also discovered that my docker build is busted since python no longer includes distutils.util as part of the base install… so far I've been unable to find out which package contains it.

[solved] An extra apt-get update was needed before apt-get install python3-distutils could find the package.
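For anyone hitting the same thing, a minimal Dockerfile sketch of the fix (the base image is whatever you were already building from):

# apt-get update must run in the same layer as the install, otherwise
# a stale package index is used and python3-distutils isn't found.
RUN apt-get update && apt-get install -y python3-distutils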

Well 2.41.0r3 is even worse: my container stays up for a few minutes and then crashes and gets stuck in an infinite crash/retry/crash/retry loop.

Hi,
This issue was fixed by this PR: volatile-binds: Avoid overlayfs mounts by agherzan · Pull Request #1620 · balena-os/meta-balena · GitHub

So it shouldn't really happen in v2.41.

There could be something else going on here as well.
Can you please grant support access for a week and share the full device URL?

Thanks
ZubairLK

I moved to 2.43.0r1 recently - ff0a69965cbf0e78924b1ef3fc5500a8 still seems to want to restart its container periodically. In fact none of my systems are really stable:

Uptimes are 17 hours, 16 hours, 2 days, 3 days and 3 days. None of the restarts was due to deliberate action on my part.

Thanks!
Al.

Well 2.41.0r3 is even worse: my container stays up for a few minutes and then crashes and gets stuck in an infinite crash/retry/crash/retry loop.

Can you please share the logs if you have any?
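If it happens again, the container's last output can be grabbed on the device itself. On a balenaOS host the Docker-compatible engine is invoked as balena; the container name below is a placeholder:

balena ps -a                        # find the crashing container
balena logs --tail 200 <container>  # last output before the crash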

Uptimes are 17 hours, 16 hours, 2 days, 3 days and 3 days. None of the restarts was due to deliberate action on my part.

Is this from the dashboard? That figure reflects VPN/internet connectivity, not the device itself. Actual device uptime can be seen by logging into the device and running the uptime command.
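For example (illustrative output only):

$ uptime
 10:02:04 up 1 day,  3:17,  0 users,  load average: 0.08, 0.12, 0.10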

I'm afraid it's quite hard to debug sporadic crashes/reboots without looking at logs or stack traces.

The device you linked ff0a69965cbf0e78924b1ef3fc5500a8 seems to be running OK at the moment. I see some warnings in the logs which I need to investigate further here: Investigate systemd slice warnings · Issue #1691 · balena-os/meta-balena · GitHub

Unfortunately I don’t have any logs from 2.41.0r3’s behaviour.

Regarding the uptimes, yes, this is from the dashboard - although there is another clue that all is not well: the wallboard has one page that I haven't been able to automate, so I have to VNC into it and enter a username/password before it displays. And of course every time the Pi or the container restarts, I have to log in again. This happens at least once a day, I'm afraid.

Hi, I checked the device you provided UUID for and I see a couple of issues there.

The device is shown online for 3 hours, but the uptime is more than a day. I found this corresponding log entry:

Oct 01 10:02:04 ff0a699 openvpn[1991]: Tue Oct  1 10:02:04 2019 Connection reset, restarting [-1]

So the VPN connection was reset somewhere on the way to our servers and this is why the device is shown as online for less time than it had been up.
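If you want to see how often that happens, the resets can be counted in the host journal. Assuming the VPN runs as the openvpn unit, as the log prefix suggests:

# Count connection resets recorded in the journal
journalctl -u openvpn | grep -c 'Connection reset'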

Earlier in the kernel logs I see:

[    1.941086] fsck.fat 4.1 (2017-01-24)
               0x41: Dirty bit is set. Fs was not properly unmounted and some data may be corrupt.
                Automatically removing dirty bit.
               Performing changes.
               /dev/disk/by-label/resin-boot: 157 files, 16367/80628 clusters

This leads me to think that the device was not rebooted cleanly, e.g. with a reboot command or similar.

This is all the information we could retrieve. I noticed that you enabled persistent logging, which is helpful. Once you notice a problem with those devices please ping us again so that we may continue investigating the logs.

Please let me know if you have any questions.

Thanks,
Zahari

I'm afraid it's quite hard to debug without an active stack trace or a fast, repeatable test case.

The logs are polluted with

Sep 30 14:20:42 balena systemd[1]: Removed slice libcontainer_937_systemd_test_default.slice.
Sep 30 14:20:42 balena systemd[1]: Created slice libcontainer_943_systemd_test_default.slice.

for which I have a PR that should land in the next version: https://github.com/balena-os/meta-balena/pull/1692

I’d recommend editing /mnt/boot/cmdline.txt and adding systemd.log_level=notice to reduce that noise.
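cmdline.txt has to stay a single line, so append the option rather than adding a new line. A sketch, run on the host OS (a reboot is needed for it to take effect):

# Append systemd.log_level=notice to the existing kernel command line
sed -i '1 s/$/ systemd.log_level=notice/' /mnt/boot/cmdline.txt
cat /mnt/boot/cmdline.txt   # verify it is still a single line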

Then perhaps with persistentLogging, we might be able to see a stack-trace or logs before the crash.
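With persistent logging enabled, the journal from the boot before a crash survives the reboot, so after the next incident something like this should show the tail end of the previous boot:

journalctl --list-boots   # enumerate the boots the journal has recorded
journalctl -b -1 -e       # jump to the end of the previous boot's log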

Also, please do check if your application can withstand intermittent connectivity as mentioned in the previous message by majorz.

OK, added that and will reboot. Let's see if we can get any further…