Cellular connection reset watchdog?

We are using the BalenaFin v1.1.0 and, more recently, some v1.1.1 units. Our devices have a Raspberry Pi Compute Module 3+, and all have a Quectel cell modem (EC25-AF).

These devices are deployed in remote locations “out in the middle of nowhere” and we see intermittent cellular connectivity. That’s ok most of the time. Sometimes those devices “fall off” the network and don’t reconnect to cellular. When we see the device has been offline for more than a couple of days we roll a truck out to the site and power cycle the BalenaFin. That action restores cellular connectivity.

Is there a way to detect that we have lost cellular from the BalenaFin and restart ModemManager or something? Maybe a watchdog of some kind? Looking for an active way to monitor our connection and take action locally on the BalenaFin - I’d prefer not to simply reboot/reset the entire device on a schedule if possible.

Any suggestions are welcome! Thanks!

Hi Don, we currently don’t have built-in watchdogs for NetworkManager/ModemManager in balenaOS; that functionality has been left to user application space. This GH issue has a bit more context: https://github.com/balena-os/meta-balena/issues/1765

You can definitely add a sidecar container/service that monitors connectivity and restarts ModemManager/NetworkManager as required (or even reboots the device, though hopefully it won’t come to that). You’ll need to talk to those services over DBUS, or using nmcli might work for you. Here is a quick primer on using DBUS with balenaOS that you might find useful.
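To give an idea of what such a sidecar could look like, here is a minimal Python sketch. It assumes the container has the `io.balena.features.dbus` label set (so the host bus socket is mounted at `/host/run/dbus/system_bus_socket`) and that `nmcli` and `dbus-send` are installed in the image. The function names are made up for illustration; it asks NetworkManager for its connectivity state and, if that is not `full`, asks systemd over DBUS to restart ModemManager.

```python
import os
import subprocess

# Assumption: the service has the io.balena.features.dbus label, which mounts
# the host's system bus socket at this path inside the container.
HOST_BUS = "unix:path=/host/run/dbus/system_bus_socket"


def restart_unit_cmd(unit):
    """Build a dbus-send call to systemd's Manager.RestartUnit for `unit`."""
    return [
        "dbus-send", "--system", "--print-reply",
        "--dest=org.freedesktop.systemd1",
        "/org/freedesktop/systemd1",
        "org.freedesktop.systemd1.Manager.RestartUnit",
        f"string:{unit}", "string:replace",
    ]


def is_online(connectivity):
    """Treat only NetworkManager's 'full' connectivity state as online."""
    return connectivity.strip() == "full"


def check_and_recover(run=subprocess.run):
    """Restart ModemManager if NetworkManager reports no full connectivity.

    Returns True when a restart was requested, False when the link looked fine.
    `run` is injectable so the logic can be exercised without real hardware.
    """
    env = dict(os.environ, DBUS_SYSTEM_BUS_ADDRESS=HOST_BUS)
    result = run(["nmcli", "networking", "connectivity", "check"],
                 capture_output=True, text=True, env=env)
    if is_online(result.stdout):
        return False
    run(restart_unit_cmd("ModemManager.service"), check=False, env=env)
    return True
```

You would call `check_and_recover()` from a loop or a timer every few minutes. In practice you probably want to require several failed checks in a row before restarting anything, since spotty coverage will produce short outages that recover on their own.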

I’ve also found this project by a colleague: https://github.com/balena-io-playground/project-phoenix
Take it with a grain of salt: it sits in our playground org, which means it is not production ready and might not fully work. But you can definitely take inspiration from it. I’ll ask Shaun about the project, maybe he has some insight to share.

Hi Don, are these machines sitting in an environment where it gets hot?

Thanks @tmigone for the suggestions.

@floion - These devices are in an outdoor enclosure that probably gets to 60-65C on a hot day. I don’t have constant measurements on temp exposure, but it’s something we’ve talked about building into our metrics.

Hi Don, are you seeing this problem with both v1.1.0’s and v1.1.1? Or just on v1.1.0?

Just deployed our first 1.1.1 today so I’ve only got data for 1.1.0

Given your description of the problem, I believe this is a hw issue in 1.1.0. See here for more info: https://www.balena.io/blog/usb-issue-rca/
Having said that, can you monitor the 1.1.1s and get back to us in a few days’ time to let us know whether you see this problem on those devices too?

Yep, will do. It could take a week or so before we have enough indication to say if those units are experiencing the same issue. I’ll update here once we have more info.

I’ll also try to put a plan together for swapping out the 1.1.0’s we’ve seen issues on so we can send those in to you on recall. I’ll need to coordinate some things so we can replace the old units.

Thank you for your help!

Thanks, keep us posted

Hey Don,

We run a small network of devices (~85) in the field with spotty cellular connectivity. We have a software watchdog on these to reset the modem (USB), as well as a hardware version which will power cycle the device in the worst case. Our devices are up to 1000 miles apart.

We send a bunch of system stats to Grafana per device so we can see patterns/issues. Our devices are currently RPI 2b but we are prototyping the BalenaFin as a replacement.

I’m also interested in your setup and experience with the Fin.

There is no generic watchdog program for this kind of thing, I’m afraid. The reason why your device cannot be connected by the MM+NM combo could be a system issue of some sort or, even more likely, a modem firmware issue, or even some network condition problem. If a system reset helps, that would also power cycle the modem I assume, so if it’s a firmware issue at least the modem is not bricked :smiley:

I’ve lately thought of writing a generic “wwan monitor” program (what I personally call the watchdog) that would work alongside the MM+NM combo to detect stuck mobile connections. NM itself has a retry mechanism to attempt connecting, but again, if the modem firmware is stuck or if the network status of the device isn’t the correct one, only a “hard” solution would be able to help. Sometimes it’s enough to put the modem in airplane mode, because that runs a full IMSI detach procedure with the network, and once the modem is back in full functionality mode the registration with the network is fresh and the connections are allowed again. Other times, the airplane mode cycle is not enough, and a full modem reset is required (USB reset in the Pi I assume, although other systems have externally controlled power sources via e.g. dedicated GPIOs and such).
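As a rough illustration of that escalation (airplane mode cycle first, full modem reset second), here is a hedged Python sketch driving `mmcli`. The modem index `"0"`, the 60 second settle time, and the function names are assumptions for the example, not recommendations. The `is_connected` callable is whatever connectivity check fits your deployment. Note that NetworkManager may react to these modem state changes on its own, so treat this as a starting point only.

```python
import subprocess
import time


def airplane_cycle_cmds(modem="0"):
    """mmcli steps for an airplane mode cycle: disable, low power, full power, enable."""
    m = ["mmcli", "-m", modem]
    return [
        m + ["--disable"],              # modem must be disabled before changing power state
        m + ["--set-power-state-low"],  # low power mode, i.e. airplane mode
        m + ["--set-power-state-on"],   # back to full functionality
        m + ["--enable"],
    ]


def reset_cmd(modem="0"):
    """mmcli step asking the modem to reset itself."""
    return ["mmcli", "-m", modem, "--reset"]


def recover(is_connected, modem="0", settle_s=60,
            run=subprocess.run, sleep=time.sleep):
    """Try the airplane mode cycle first, then a modem reset.

    Returns the name of the step after which `is_connected()` succeeded,
    or None if neither step restored the connection.
    """
    for cmd in airplane_cycle_cmds(modem):
        run(cmd, check=False)
    sleep(settle_s)  # give the modem time to register and reconnect
    if is_connected():
        return "airplane-cycle"

    run(reset_cmd(modem), check=False)
    sleep(settle_s)
    return "reset" if is_connected() else None
```

If `recover()` returns `None`, the next rung on the ladder would be the USB reset or an external power cycle, depending on what the hardware allows.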

I have written multiple “wwan monitor” applications over the last few years for various clients, and every one of them was specific to that client. E.g. some clients needed to test not only generic connectivity with the network, but also explicit TCP/IP connectivity with a given server (think of a data logger that periodically sends data to a specific server). Some other clients wanted to attempt the airplane mode cycle first instead of the full device reset, because it’s quicker to reconnect and also safer. Some other clients didn’t care about the likelihood of breaking the module’s internal filesystems and wanted to just go ahead and fully power off the device externally as soon as it was disconnected, to always start fresh. All these different input conditions are client specific, and the decisions on how to recover are also client specific. Even the actual modem being used affects what to do! E.g. if a modem that is controlled exclusively via AT commands gets stuck, it may happen that we cannot request a clean reset, while modems controlled with QMI usually never fail to process QMI requests, so it’s easier to always request clean resets. All very client specific, so setting up a generic program that fits all needs is too much work, and I’m very busy just with ModemManager already :smiley:
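The “explicit TCP/IP connectivity with a given server” check mentioned above can be as small as the following sketch, using only the Python standard library. The function name and the 10 second timeout are arbitrary choices for the example; the host and port would be whatever server your devices actually report to.

```python
import socket


def can_reach(host, port, timeout_s=10.0):
    """Return True if a TCP connection to (host, port) opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers DNS failures, refused connections, unreachable networks and timeouts.
        return False
```

This only proves that a TCP handshake completes. If the server speaks a protocol where a stuck link can still accept connections, an application-level request on top of this would be a stronger test.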

For your specific issue, though, a very simple “wwan monitor” program that checks whether the NM connection is up or not, and if it’s up whether it has connectivity to a given remote server would be enough. If the checks fail, you can attempt the USB reset, which may be enough (or not!).
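A minimal sketch of that simple monitor could look like the following. Everything named here is an assumption for illustration: the `Watchdog` class, the three-failure threshold, and the idea of passing the checks in as callables. The USB reset uses the Linux `USBDEVFS_RESET` ioctl on the modem’s device node under `/dev/bus/usb/`, which needs sufficient privileges in the container, and the actual bus/device path has to be looked up on the device (e.g. with `lsusb`).

```python
import fcntl
import os
import subprocess

USBDEVFS_RESET = 0x5514  # _IO('U', 20) from <linux/usbdevice_fs.h>


def nm_connection_active(name, run=subprocess.run):
    """True if NetworkManager lists a connection called `name` as active."""
    out = run(["nmcli", "-t", "-f", "NAME", "connection", "show", "--active"],
              capture_output=True, text=True).stdout
    return name in out.splitlines()


def usb_reset(dev_path):
    """Issue a USB port reset on a device node such as /dev/bus/usb/BBB/DDD."""
    fd = os.open(dev_path, os.O_WRONLY)
    try:
        fcntl.ioctl(fd, USBDEVFS_RESET, 0)
    finally:
        os.close(fd)


class Watchdog:
    """Run the checks in order; call `recover` after N consecutive failed rounds."""

    def __init__(self, checks, recover, max_failures=3):
        self.checks = checks
        self.recover = recover
        self.max_failures = max_failures
        self.failures = 0

    def tick(self):
        """Run one round of checks. Returns True if recovery was triggered."""
        # all() stops at the first failing check, so later checks are skipped.
        if all(check() for check in self.checks):
            self.failures = 0
            return False
        self.failures += 1
        if self.failures >= self.max_failures:
            self.recover()
            self.failures = 0
            return True
        return False
```

Wiring it up would mean passing a check for the NM connection (with your own connection name), a check for reachability of your server, and a recovery callable that calls `usb_reset()` with the modem’s device path, then calling `tick()` from a loop with a sleep of a few minutes between rounds.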