Balena Fin Cell Modem Failure

Hello,

We have an application deployed on the Balena Fin that requires use of a cell modem. (We typically use the Quectel EC25 modems.) We have deployed our solution to many installations and it generally works well. However, we have seen occasions where a device suddenly goes offline and never recovers, even though the user has not unplugged it.

I added application code that detects when the gateway has not communicated with our servers for a while, and reboots the gateway using the following command (called from a shell script):
DBUS_SYSTEM_BUS_ADDRESS=unix:path=/host/run/dbus/system_bus_socket \
  dbus-send \
  --system \
  --print-reply \
  --dest=org.freedesktop.systemd1 \
  /org/freedesktop/systemd1 \
  org.freedesktop.systemd1.Manager.Reboot
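For context, the watchdog logic described above could look roughly like the sketch below. This is only an illustration, not our production script: the function names and the MAX_SILENCE_SECS threshold are hypothetical, and the reboot call is the same D-Bus request shown above.

```shell
#!/bin/sh
# Sketch only: MAX_SILENCE_SECS and both function names are hypothetical.
MAX_SILENCE_SECS="${MAX_SILENCE_SECS:-1800}"

# Returns 0 (true) when the last successful server contact is older than
# the allowed silence window. Both arguments are Unix timestamps in seconds.
should_reboot() {
    last_contact="$1"
    now="$2"
    [ $((now - last_contact)) -gt "$MAX_SILENCE_SECS" ]
}

# Asks the host's systemd to reboot over the host D-Bus socket.
reboot_host() {
    DBUS_SYSTEM_BUS_ADDRESS=unix:path=/host/run/dbus/system_bus_socket \
      dbus-send --system --print-reply \
      --dest=org.freedesktop.systemd1 \
      /org/freedesktop/systemd1 \
      org.freedesktop.systemd1.Manager.Reboot
}
```

A caller would then run something like: should_reboot "$last_ok" "$(date +%s)" && reboot_host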

I also tried running mmcli -m 0 -r to explicitly reset the cell modem before rebooting our device with the above command, but that also didn’t recover communications.

The only way we have been able to get these units back online is for the user to physically disconnect the power adapter and then reconnect it. Obviously, this is not an acceptable solution, so we need to figure out why the devices are going offline, and figure out a way to fix this in software, if possible.


Today, I worked with one of our customers who had a working device that went offline several days ago, even though it is in an area with strong cell coverage. He unplugged the power adapter on the device, and re-connected it, and it came back online as expected. I was able to pull the logs (using journalctl --all). I noticed the following repeatedly throughout the log before the user power cycled the device:

  1. OS fails to initialize / discover the cell modem.
  2. Application fails to connect to our servers for a lengthy time.
  3. Application reboots the device.

Here are some specific messages in the log (repeated many times before the customer power cycled the device):

ModemManager[1396]: [/dev/cdc-wdm0] Opening device with flags 'version-info, proxy'...
ModemManager[1396]: [/dev/cdc-wdm0] created endpoint
ModemManager[1396]: cannot connect to proxy: Could not connect: Connection refused
ModemManager[1396]: [/dev/cdc-wdm0] Checking version info (20 retries)...
ModemManager[1396]: [ttyUSB0/probe] failed to parse QCDM version info command result: -7
ModemManager[1396]: [base-manager] couldn't check support for device '/sys/devices/platform/soc/3f980000.usb/usb1/1-1/1-1.4': not supported by any plugin

Once the customer power cycled the device, it was able to come online just fine, and the following messages were in the log:

ModemManager[1399]: [device /sys/devices/platform/soc/3f980000.usb/usb1/1-1/1-1.2] creating modem with plugin 'quectel' and '6' ports
ModemManager[1399]: [base-manager] modem for device '/sys/devices/platform/soc/3f980000.usb/usb1/1-1/1-1.2' successfully created
ModemManager[1399]: [/dev/cdc-wdm0] Opening device with flags 'version-info, proxy'...
ModemManager[1399]: [/dev/cdc-wdm0] created endpoint
ModemManager[1399]: [/dev/cdc-wdm0] Checking version info (20 retries)...
ModemManager[1399]: [/dev/cdc-wdm0] QMI Device supports 30 services:
...
ModemManager[1399]: [/dev/cdc-wdm0] Reading expected data format from: /sys/class/net/wwan0/qmi/raw_ip
ModemManager[1399]: [/dev/cdc-wdm0] Allocating new client ID...
ModemManager[1399]: [/dev/cdc-wdm0] Registered 'wda' (version 1.16) client with ID '1'
...
ModemManager[1399]: [modem0] state changed (disabled -> enabling)
ModemManager[1399]: [modem0] state changed (enabling -> enabled)
ModemManager[1399]: [modem0] simple connect state (5/8): register
ModemManager[1399]: [modem0] simple connect state (6/8): bearer
ModemManager[1399]: [modem0] simple connect state (7/8): connect
ModemManager[1399]: [modem0] state changed (enabled -> connecting)
...
ModemManager[1399]: [modem0] state changed (connecting -> registered)
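The two log excerpts above suggest a simple signature check that could be run against the journal after each boot. The sketch below is an assumption about how that might be done, not something we currently run: the function name and the three result strings are hypothetical, and it only matches the exact ModemManager messages quoted above.

```shell
#!/bin/sh
# Sketch only: classify_modem_log and its result strings are hypothetical.
# Reads journal text on stdin and prints one of: ok, probe_failed, unknown.
classify_modem_log() {
    log="$(grep 'ModemManager' || true)"
    # A created modem wins, because "not supported by any plugin" could
    # also be logged for other USB devices on a healthy boot.
    if printf '%s\n' "$log" | grep -q 'successfully created'; then
        echo ok
    elif printf '%s\n' "$log" | grep -q 'not supported by any plugin'; then
        echo probe_failed
    else
        echo unknown
    fi
}
```

It could be fed the current boot's journal, for example: journalctl -b --no-pager | classify_modem_log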

After the customer power cycled the gateway and it came back online, I used “mmcli” to query the modem. The state was set to “connected” and the signal quality was 70%.

So, when the device is in an “offline” state, it appears the balena OS cannot communicate with the cell modem for some reason. Any idea why? Also, my reboot DBUS command is apparently not sufficient to recover from this bad state. Is there anything I could do in software to try and fix or recover from this problem? We cannot have units in the field suddenly go offline like this. Any suggestions would be very appreciated.

(Most of our Balena Fin devices are running balena OS 2.80.3+rev1.)


Hi

Thanks for creating this issue. I am part of the hardware team here at Balena, and I have started a discussion with other folks from the team to figure out what could be happening. I have a couple of follow-up questions:

  • What version of the balenaFin are you using for your fleet? Especially the devices on which you have seen this issue.
  • Have you seen similar behavior with other devices, for example a Raspberry Pi 3?
  • Were you able to reproduce this issue with a device you have access to? Perhaps something that was also connected over LAN. Or has it only been customer devices?
  • Are the devices subject to high temperatures? What kind of case are these usually placed in?
  • Do you also have other devices connected to your USB ports that also become unresponsive?

Thank you for the quick response!

  1. We are using version 1.1 balena fins.
  2. Almost all of our products (including the device that went offline) use the Raspberry Pi Compute Module 3+.
  3. This problem is difficult to reproduce, so unfortunately, I have rarely seen this problem with our own devices. (It has happened to our devices, but rarely.)
  4. Our devices are often deployed to industrial / warehouse environments in the US and foreign sites, and are subject to a variety of temperatures. The device that was power cycled by the user is currently reporting a CPU temperature of 47.2C. (We do log the CPU temperature periodically. When the device was offline, it was typically measuring 25 to 32C. This lower temperature range seems to make sense since the cell modem was not functioning at that time). We use the standard case from the developer kit.
  5. We have a USB Bluetooth dongle installed into one of the USB ports. I cannot tell if the dongle was functioning or not when the device went offline. I can add code to try and monitor this in the future.
  • Can you confirm that it's the balenaFin 1.1 and not balenaFin 1.1.1?
  • One of the reasons why I am asking this is that the 1.1 has a known issue where at high temperatures the USB hub IC powers down. You can find more about that on our root cause analysis page here - Root-cause analysis of the balenaFin high-temperature USB issue

Is it possible for me to read the hardware version on a remote device from the Balena Cloud? Most of our devices were purchased in the last year, so I expect these should all be 1.1.1, but I’m not sure how to confirm that.

Hi @dstewart, you can find out remotely which hardware revision you have by using the Fin block on your devices, which reports the manufacturing information for the device: GitHub - balenablocks/fin: The fin block is a balenaBlock that provides flashing utilities, status tagging, sleep control and firmata control functionality of the balenaFin.

Quick update. First of all, I was mistaken in my earlier post that claimed we were using 1.1 balena fins. I expect most of our units are using 1.1.1 hardware.

I added the fin block service to my docker-compose.yml file and deployed it to a local device running on old 1.1.0 hardware. After that, I issued the following command:
curl -sX GET localhost:1337/eeprom

This returned the following:
{"schema":null,"hardwareRevision":11,"batchSerial":null...

Is 11 the expected hardware revision value for a 1.1.0 board? What value should it return for a version 1.1.1 board?
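To check this across many devices, the field could be pulled out of the response with something like the sketch below. The function name is hypothetical, and it only extracts the number; it makes no assumption about which value corresponds to which board revision.

```shell
#!/bin/sh
# Sketch only: hardware_revision is a hypothetical helper name.
# Prints the numeric "hardwareRevision" field from the JSON read on stdin.
hardware_revision() {
    sed -n 's/.*"hardwareRevision":[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}
```

For example: curl -sX GET localhost:1337/eeprom | hardware_revision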