First things first, happy new year! Hopefully 2021 will be a much better year.
Next up, over the weekend I had an RPI4 with BalenaOS v2.65.0+rev1 (dev), supervisor v12.2.11 that “crashed”. The RPI4 is not in local mode and downloads and runs the releases from the openBalena server. One of the containers pings our server, and at about 01:00 it went offline and stopped pinging. I logged in via our VPN at the office and tried to SSH into it, but it timed out.
Today, I came back to the office and tried to find out what was wrong. The first thing I noticed was that both lights on the Ethernet port of the RPI4 were on, but solid, with none of the usual flickering.
The Node.js application that’s supposed to run on the RPI4 also creates an access point on wlan0 (via NetworkManager’s D-Bus API). But that access point was gone. I had hoped to SSH in via the access point to gather some logs.
All information:
Raspberry Pi 4 (2GB RAM)
BalenaOS 2.65.0+rev1 (development image)
32GB SanDisk Max Endurance SD-card
Supervisor v12.2.11
openBalena server v3.1.1
Running 3 containers (redis, custom Node.js application, custom Golang application with serial communication, so enable_uart=1 is added to the config.txt via environment variables)
Another RPI4, which is being tested by our client, has the same specs as above, except that it runs a production image. It has been running for 14 days now without a crash. It runs the same containers and I don’t see CPU spikes, high CPU temperatures or a memory leak there.
I’m happy to share additional information. I had to restart the RPI4 because it was ‘dead’. After a hard reset (cut power), it booted fine and is running again. I hope someone can do something with this information.
Looks like one of two things: either a network problem occurred so that neither the supervisor nor the containers could reach the outside world, or the device crashed. I found a similar problem on the RPi forum: https://www.raspberrypi.org/forums/viewtopic.php?t=247355
I’ll ping my colleagues to ask if there is a good way to debug this.
I’ll enable persistent logging and let you guys know when it happens again.
What should I do when it happens again? Reboot the device and post the output of journalctl or something? Or is there some other command / file that I should upload?
Is the only way to enable persistent logging via the config.json file? Because when I check the documentation, it says that it can be enabled via the config.json or a configuration variable. But on this page, there’s no configuration variable that enables this.
The crash happened again, but there’s no /var/log/journal after I set SUPERVISOR_PERSISTENT_LOGGING to true yesterday. So that probably didn’t work. Maybe I’ve done something wrong here? If so, I’d like to know before something like this happens to devices in the field and I need the logs.
I’ll try to add it to the config.json. Can I check if it works by looking for /var/log/journal after the config.json is changed?
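As a rough illustration of that check, here is a minimal Python sketch that looks for journal files under a directory. It assumes persistent journals end up as *.journal files below /var/log/journal (the usual journald layout); the function name journal_files and the throwaway demo directory are made up for this example, and Python may not be available on the host OS itself, so treat it as a sketch of the logic rather than something to run there directly.

```python
import tempfile
from pathlib import Path


def journal_files(journal_dir="/var/log/journal"):
    """Return the *.journal files found below journal_dir (empty list if it is missing)."""
    root = Path(journal_dir)
    if not root.is_dir():
        return []
    # journald normally keeps its files in a per-machine-id subdirectory,
    # so search recursively instead of only looking at the top level.
    return sorted(str(p) for p in root.rglob("*.journal"))


# Demo against a temporary directory so the example runs anywhere.
# The subdirectory name "0123abcd" is a placeholder, not a real machine ID.
with tempfile.TemporaryDirectory() as tmp:
    missing = journal_files(Path(tmp) / "journal")   # directory does not exist yet
    machine_dir = Path(tmp) / "journal" / "0123abcd"
    machine_dir.mkdir(parents=True)
    (machine_dir / "system.journal").touch()
    found = journal_files(Path(tmp) / "journal")     # now one file is present

print(missing)
print(len(found))
```

An empty result only says that nothing has been written there yet; it does not by itself tell you which setting was or wasn't applied.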
Hi Bart,
can I ask how you added the variable?
Also, could you please try to access https://dashboard.balena-cloud.com/apps/<APP_ID>/config (replace APP_ID with your application ID) and check whether there is a configuration named "Enable persistent logging. Only supported by supervisor versions >= v7.15.0."? If so, you should be able to change it from there. If this is what you did, could you please check the value of Storage in /etc/systemd/journald.conf and the value of persistentLogging in your config.json?
Oh sorry Bart, I missed the openBalena part, so please disregard the /config URL sentence in my previous answer.
Let us know the config.json and journald.conf values; this would help us understand whether something went wrong. Assuming the full UUID was used and not a short UUID, the command you just shared looks like the right one.
The contents of /etc/systemd/journald.conf are:
# This file is part of systemd.
#
# systemd is free software; you can redistribute it and/or modify it
# under the terms of the GNU Lesser General Public License as published by
# the Free Software Foundation; either version 2.1 of the License, or
# (at your option) any later version.
#
# Entries in this file show the compile time defaults.
# You can change settings by editing this file.
# Defaults can be restored by simply deleting this file.
#
# See journald.conf(5) for details.
[Journal]
#Storage=auto
#Compress=yes
#Seal=yes
#SplitMode=uid
#SyncIntervalSec=5m
#RateLimitIntervalSec=30s
#RateLimitBurst=10000
#SystemMaxUse=
#SystemKeepFree=
#SystemMaxFileSize=
#SystemMaxFiles=100
RuntimeMaxUse=64M
#RuntimeKeepFree=
#RuntimeMaxFileSize=
#RuntimeMaxFiles=100
#MaxRetentionSec=
#MaxFileSec=1month
ForwardToSyslog=yes
#ForwardToKMsg=no
#ForwardToConsole=no
#ForwardToWall=yes
#TTYPath=/dev/console
#MaxLevelStore=debug
#MaxLevelSyslog=debug
#MaxLevelKMsg=notice
#MaxLevelConsole=info
#MaxLevelWall=emerg
#LineMax=48K
#ReadKMsg=yes
Almost everything is commented out as you can see, including Storage (only RuntimeMaxUse and ForwardToSyslog are set), so the environment variable probably didn’t work. At this point, I haven’t changed the config.json, because I’m interested in what went wrong with the SUPERVISOR_PERSISTENT_LOGGING environment variable.
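To make it easier to see which settings in a file like the one above are actually active, here is a small Python sketch that lists the uncommented key=value lines. The function name active_journald_settings is made up for this example, and the sample text is a shortened copy of the file above. Note that with Storage left commented out, journald uses its compile-time default (auto, per journald.conf(5)), which only writes persistent logs if /var/log/journal exists; whether balenaOS enables persistence through this file at all is not something this thread confirms.

```python
def active_journald_settings(text):
    """Return the uncommented key=value pairs from journald.conf-style text."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, comments and section headers such as [Journal].
        if not line or line[0] in "#;[":
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


# Shortened excerpt of the journald.conf posted above.
sample = """\
[Journal]
#Storage=auto
#Compress=yes
RuntimeMaxUse=64M
ForwardToSyslog=yes
#ForwardToKMsg=no
"""

active = active_journald_settings(sample)
print(active)
print("Storage set explicitly:", "Storage" in active)
```

For the excerpt this reports RuntimeMaxUse and ForwardToSyslog as the only explicit settings, and Storage as not set explicitly.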
After adding that environment variable (RESIN_SUPERVISOR_PERSISTENT_LOGGING), the device rebooted after some time and is not coming back up again. I’ll try and power-cycle it, but I thought it was worth mentioning.
I’m sorry, yes, it did bring the device back online. It looked like it had crashed again, because both Ethernet LEDs were solid.
However, the environment variable (RESIN_SUPERVISOR_PERSISTENT_LOGGING) did not change the /etc/systemd/journald.conf file. That’s a file on the host OS, right, not one inside the resin_supervisor container?
I checked my own /etc/systemd/journald.conf, and it also hadn’t changed, despite me enabling persistent logging - I’m thinking that maybe we shouldn’t be expecting that file to change. Could you take a look at /mnt/boot/config.json?
Indeed, persistentLogging is true in /mnt/boot/config.json. The only thing left to figure out is which env variable set it to true: SUPERVISOR_PERSISTENT_LOGGING or RESIN_SUPERVISOR_PERSISTENT_LOGGING?
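For anyone who wants to script that config.json check, a minimal Python sketch could look like this. It assumes the file is plain JSON with a top-level persistentLogging key, as described above; the helper name persistent_logging_enabled, the sample contents and the handling of a "true" string value are assumptions for illustration, not something taken from the balena documentation.

```python
import json


def persistent_logging_enabled(config_text):
    """Return True if the config.json text has persistentLogging switched on."""
    config = json.loads(config_text)
    value = config.get("persistentLogging", False)
    # Assumption: tolerate the flag being stored as the string "true"
    # as well as a JSON boolean.
    if isinstance(value, str):
        return value.strip().lower() == "true"
    return bool(value)


# Hypothetical, heavily shortened config.json contents for the demo.
enabled_cfg = '{"deviceType": "raspberrypi4-64", "persistentLogging": true}'
default_cfg = '{"deviceType": "raspberrypi4-64"}'

print(persistent_logging_enabled(enabled_cfg))
print(persistent_logging_enabled(default_cfg))
```

On a device, the text would come from reading /mnt/boot/config.json (or a copy of it), wherever Python happens to be available.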
Nonetheless, persistent logging is (probably) enabled now, so when it crashes again, I’ll post the logs!