Hello,
We have an application deployed on the Balena Fin that requires a cell modem. (We typically use Quectel EC25 modems.) We have deployed our solution to many installations and it generally works well. However, we have seen occasions where a device suddenly goes offline and never recovers, even though the user has not unplugged it.
I added application code that detects when the gateway has not communicated with our servers for a while, and reboots the gateway using the following command (called from a shell script):
DBUS_SYSTEM_BUS_ADDRESS=unix:path=/host/run/dbus/system_bus_socket \
  dbus-send \
  --system \
  --print-reply \
  --dest=org.freedesktop.systemd1 \
  /org/freedesktop/systemd1 \
  org.freedesktop.systemd1.Manager.Reboot
I also tried running mmcli -m 0 -r to explicitly reset the cell modem before rebooting our device with the above command, but that did not restore communications either.
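A minimal sketch of this kind of watchdog, under stated assumptions: HEALTH_URL, MAX_FAILURES, the sleep intervals, and all function names are placeholders chosen for illustration. Only the mmcli and dbus-send commands are the ones quoted above.

```shell
#!/bin/sh
# Hypothetical watchdog sketch. The URL, threshold, and timings are
# illustrative assumptions, not values from a real deployment.

HEALTH_URL="${HEALTH_URL:-https://example.com/health}"  # placeholder endpoint
MAX_FAILURES="${MAX_FAILURES:-5}"                        # assumed threshold

# Returns 0 when the server answers within 10 seconds.
check_server() {
    curl --silent --fail --max-time 10 --output /dev/null "$HEALTH_URL"
}

# Ask ModemManager to reset modem 0.
reset_modem() {
    mmcli -m 0 -r
}

# Ask the host's systemd to reboot through the host D-Bus socket.
reboot_host() {
    DBUS_SYSTEM_BUS_ADDRESS=unix:path=/host/run/dbus/system_bus_socket \
        dbus-send --system --print-reply \
        --dest=org.freedesktop.systemd1 \
        /org/freedesktop/systemd1 \
        org.freedesktop.systemd1.Manager.Reboot
}

# Map a count of consecutive failures to "ok", "wait", or "recover".
next_action() {
    failures="$1"
    if [ "$failures" -eq 0 ]; then
        echo ok
    elif [ "$failures" -lt "$MAX_FAILURES" ]; then
        echo wait
    else
        echo recover
    fi
}

# Poll once a minute; after MAX_FAILURES misses, reset the modem, then reboot.
watchdog_loop() {
    failures=0
    while true; do
        if check_server; then
            failures=0
        else
            failures=$((failures + 1))
        fi
        if [ "$(next_action "$failures")" = recover ]; then
            reset_modem || true   # keep going even if the modem is not visible
            sleep 30              # assumed settle time before rebooting
            reboot_host
        fi
        sleep 60
    done
}
```

Calling watchdog_loop at the end of the script starts the polling. Keeping the decision in next_action separate from the side effects makes the threshold logic easy to check without touching the modem or rebooting.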
The only way we have been able to get these units back online is for the user to physically disconnect the power adapter and then reconnect it. Obviously, this is not an acceptable solution, so we need to figure out why the devices are going offline and, if possible, find a way to fix this in software.
–
Today, I worked with one of our customers who had a working device that went offline several days ago, even though it is in an area with strong cell coverage. He unplugged the power adapter on the device and reconnected it, and it came back online as expected. I was able to pull the logs (using journalctl --all). I noticed the following sequence repeated throughout the log before the user power cycled the device:
- OS fails to initialize / discover the cell modem.
- Application fails to connect to our servers for a lengthy time.
- Application reboots the device.
Here are some specific messages in the log (repeated many times before the customer power cycled the device):
ModemManager[1396]: [/dev/cdc-wdm0] Opening device with flags 'version-info, proxy'...
ModemManager[1396]: [/dev/cdc-wdm0] created endpoint
ModemManager[1396]: cannot connect to proxy: Could not connect: Connection refused
ModemManager[1396]: [/dev/cdc-wdm0] Checking version info (20 retries)...
ModemManager[1396]: [ttyUSB0/probe] failed to parse QCDM version info command result: -7
ModemManager[1396]: [base-manager] couldn't check support for device '/sys/devices/platform/soc/3f980000.usb/usb1/1-1/1-1.4': not supported by any plugin
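To check quickly whether a pulled journal contains this failure signature, something like the following could be used. It is only a sketch: the function name count_probe_failures is made up for illustration, and the pattern is taken from the "not supported by any plugin" message above.

```shell
#!/bin/sh
# Hypothetical helper: count journal lines (read from stdin) where
# ModemManager reports that no plugin supports the device.
count_probe_failures() {
    # grep -c prints the match count; it exits non-zero when the count is 0,
    # so "|| true" keeps the function's exit status at 0 in that case.
    grep -c "couldn't check support for device .*: not supported by any plugin" || true
}
```

For example, journalctl --all | count_probe_failures prints the number of failed probes recorded in the journal.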
Once the customer power cycled the device, it was able to come online just fine, and the following messages were in the log:
ModemManager[1399]: [device /sys/devices/platform/soc/3f980000.usb/usb1/1-1/1-1.2] creating modem with plugin 'quectel' and '6' ports
ModemManager[1399]: [base-manager] modem for device '/sys/devices/platform/soc/3f980000.usb/usb1/1-1/1-1.2' successfully created
ModemManager[1399]: [/dev/cdc-wdm0] Opening device with flags 'version-info, proxy'...
ModemManager[1399]: [/dev/cdc-wdm0] created endpoint
ModemManager[1399]: [/dev/cdc-wdm0] Checking version info (20 retries)...
ModemManager[1399]: [/dev/cdc-wdm0] QMI Device supports 30 services:
...
ModemManager[1399]: [/dev/cdc-wdm0] Reading expected data format from: /sys/class/net/wwan0/qmi/raw_ip
ModemManager[1399]: [/dev/cdc-wdm0] Allocating new client ID...
ModemManager[1399]: [/dev/cdc-wdm0] Registered 'wda' (version 1.16) client with ID '1'
...
ModemManager[1399]: [modem0] state changed (disabled -> enabling)
ModemManager[1399]: [modem0] state changed (enabling -> enabled)
ModemManager[1399]: [modem0] simple connect state (5/8): register
ModemManager[1399]: [modem0] simple connect state (6/8): bearer
ModemManager[1399]: [modem0] simple connect state (7/8): connect
ModemManager[1399]: [modem0] state changed (enabled -> connecting)
...
ModemManager[1399]: [modem0] state changed (connecting -> registered)
After the customer power cycled the gateway and it came back online, I used “mmcli” to query the modem. The state was set to “connected” and the signal quality was 70%.
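For scripting this kind of query, mmcli can print key-value output (the -K / --output-keyvalue option in recent ModemManager releases). The sketch below assumes that output format and the key names modem.generic.state and modem.generic.signal-quality.value; I have not confirmed them against the ModemManager version shipped in this OS release, and the helper name mm_value is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the value of one key from
# "mmcli -m 0 --output-keyvalue" text supplied on stdin.
# Assumes lines of the form "some.key   : value".
mm_value() {
    awk -F ' *: *' -v key="$1" '$1 == key { print $2; exit }'
}
```

Usage would look like mmcli -m 0 -K | mm_value modem.generic.state, which would print connected for the modem described above if the assumed key name is right.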
So, when the device is in an “offline” state, it appears that balena OS cannot communicate with the cell modem for some reason. Any idea why? Also, my reboot DBUS command is apparently not sufficient to recover from this bad state. Is there anything I could do in software to fix or recover from this problem? We cannot have units in the field suddenly go offline like this. Any suggestions would be greatly appreciated.
(Most of our Balena Fin devices are running balena OS 2.80.3+rev1.)