Multi-container device connected over LTE doesn't reboot correctly

I have an issue with multi-container setups on a Raspberry Pi CM4 with an LTE module.

Setup details:

Host OS version: balenaOS 5.1.20+rev1
Supervisor version: 16.0.3
LTE module: Huawei ME906s

Problem description:

Note: this only happens in multi-container setups

When rebooting the device through the Balena dashboard → Actions → Reboot, it doesn’t always come back online.

When it doesn’t come back online, even connecting an Ethernet cable doesn’t help; the device seems stuck in a state where nothing but a power cycle can bring it back online.

Steps to repeat

  1. Create 2 fleets, one running a single container, the other multiple (tested with 2 and 3)
  2. Download the image of the single container fleet
  3. Modify image by adding a cellular configuration file
  4. Flash the CM4 with the modified image
  5. Power up, wait for Balena connection
  6. Reboot device through Balena dashboard, it should reconnect automatically (repeat until satisfied)
  7. Move device to multi-container fleet
  8. Reboot device through Balena dashboard, it might reconnect but probably won’t (repeat until it doesn’t)
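For reference, the cellular configuration file from step 3 is a NetworkManager connection profile dropped into the `system-connections/` folder of the image's boot partition, following the balena cellular setup docs. Mine looks roughly like this (the `id` and `apn` values here are placeholders, not my carrier's actual settings):

```ini
# system-connections/cellular -- picked up by NetworkManager on first boot
[connection]
id=cellular
type=gsm
autoconnect=true

[gsm]
# Replace with your carrier's APN
apn=internet
number=*99#

[serial]
baud=115200

[ipv4]
method=auto

[ipv6]
addr-gen-mode=stable-privacy
method=auto
```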

Troubleshooting so far

Moving the device to a single-container fleet fixes the issue, which in my opinion rules out cellular or Ethernet configuration issues.

Moving the device to a different multi-container fleet doesn’t fix the issue.

Note: the LTE module and SIM card are detected regardless of which fleet the device is on. This was checked using the following commands (hostname and IMEI redacted):

For the LTE module:

root@xxxxxx:~# lsusb
Bus 001 Device 003: ID 12d1:15c1 Huawei Technologies Co., Ltd. HUAWEI Mobile
Bus 001 Device 002: ID 0424:2514  
Bus 001 Device 001: ID 1d6b:0002 Linux 5.15.92-v8 dwc2_hsotg DWC OTG Controller

For the SIM card:

root@xxxxxx:~# mmcli -L
    /org/freedesktop/ModemManager1/Modem/0 [Huawei Technologies Co., Ltd.] HUAWEI Mobile
root@xxxxxx:~# mmcli -m 0
  --------------------------------
  General  |                 path: /org/freedesktop/ModemManager1/Modem/0
           |            device id: 4fe19f9208ad42ac94b4c75e9e0f1343c1b5caaa
  --------------------------------
  Hardware |         manufacturer: Huawei Technologies Co., Ltd.
           |                model: HUAWEI Mobile
           |    firmware revision: 11.617.00.00.00
           |         h/w revision: ML1ME906SM
           |            supported: gsm-umts, lte
           |              current: gsm-umts, lte
           |         equipment id: 867160022708402
  --------------------------------
  System   |               device: /sys/devices/platform/soc/fe980000.usb/usb1/1-1/1-1.1
           |              drivers: option, cdc_mbim
           |               plugin: huawei
           |         primary port: cdc-wdm0
           |                ports: cdc-wdm0 (mbim), ttyUSB0 (gps), wwan0 (net)
  --------------------------------
  Status   |       unlock retries: unknown (3)
           |                state: connected
           |          power state: on
           |          access tech: lte
           |       signal quality: 67% (cached)
  --------------------------------
  Modes    |            supported: allowed: 2g, 3g, 4g; preferred: none
           |              current: allowed: 2g, 3g, 4g; preferred: none
  --------------------------------
  IP       |            supported: ipv4, ipv6, ipv4v6
  --------------------------------
  3GPP     |                 imei: xxxxxxxxxxxxxxx
           |          operator id: 22802
           |        operator name: Sunrise
           |         registration: roaming
           | packet service state: attached
  --------------------------------
  SIM      |     primary sim path: /org/freedesktop/ModemManager1/SIM/0
  --------------------------------
  Bearer   |                paths: /org/freedesktop/ModemManager1/Bearer/0

I have tried swapping the CM4, LTE module, and carrier board to no avail: single-container works, multi-container doesn’t. This rules out hardware issues.

Has anyone else noticed a similar behaviour?
Can you suggest further troubleshooting steps?

@gh_maxcarrel you wanted to be in the loop as well.

@cedricreber thanks so much for reporting this.

Can you tell us a bit more about what’s running in the multi-container application? It does sound like one of the containers in the multi-container fleet might be causing a race condition or otherwise conflicting with the reboot.

What are you seeing in the logs? If you set persistent logging and then force it into this state where it won’t come back online, then switch it to a single-container fleet, are you seeing anything in the persistent log file that could give us any hints?
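(Side note, in case it helps: persistent logging is a toggle on the dashboard’s Device Configuration page. Under the hood, if I recall correctly, it corresponds to the `persistentLogging` field in `config.json` on the device’s boot partition, i.e. a fragment like the following merged into the existing file:)

```json
{
  "persistentLogging": true
}
```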

Anything you can share with us is appreciated.

@the-real-kenna of course!

The initial report came from a client of ours whose application we don’t have access to. We also first thought the issue resided in the application. To test this out, I created a barebones single-container fleet running the following application:

entry.sh, called directly from the Dockerfile:

#!/bin/bash

echo "############# Starting container ##############"
python main.py
sleep infinity

and main.py:

#!/usr/bin/env python3


def main():
    print("Hello, world!")


if __name__ == "__main__":
    main()

The multi-container fleet contains that same application, copied 3 times.
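Concretely, the docker-compose.yml for the multi-container fleet is essentially this (service names are arbitrary, and all three services build the same directory containing the Dockerfile above):

```yaml
version: "2.1"
services:
  app1:
    build: ./app
    restart: always
  app2:
    build: ./app
    restart: always
  app3:
    build: ./app
    restart: always
```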

I will get back to you with the persistent logs.

@the-real-kenna I’m back with troubling news/discoveries.

As requested, I’ve enabled persistent logging. Sadly, I had to notice that when the device doesn’t come back online, nothing is logged at all.
The following is the log from my latest run:

Feb 27 14:18:49 35aa343 5fb23ad57d08[1949]: [info]    VPN connection is active.
Feb 27 14:18:49 35aa343 5fb23ad57d08[1949]: [info]    VPN connection is active.
Feb 27 14:18:49 35aa343 balena-supervisor[3679]: [info]    VPN connection is active.
Feb 27 14:18:49 35aa343 balena-supervisor[3679]: [info]    VPN connection is active.
Feb 27 14:18:49 35aa343 5fb23ad57d08[1949]: [info]    VPN connection is active.
Feb 27 14:18:49 35aa343 balena-supervisor[3679]: [info]    VPN connection is active.
Feb 27 14:18:49 35aa343 5fb23ad57d08[1949]: [info]    VPN connection is active.
Feb 27 14:18:49 35aa343 balena-supervisor[3679]: [info]    VPN connection is active.
Feb 27 14:18:49 35aa343 openvpn[1945]: 2024-02-27 14:18:49 /sbin/ip route add 52.4.252.97/32 via 52.4.252.97
Feb 27 14:18:49 35aa343 openvpn[4247]: ip: RTNETLINK answers: File exists
Feb 27 14:18:49 35aa343 openvpn[1945]: 2024-02-27 14:18:49 ERROR: Linux route add command failed: external program exited with error status: 2
Feb 27 14:18:49 35aa343 openvpn[1945]: 2024-02-27 14:18:49 GID set to openvpn
Feb 27 14:18:49 35aa343 openvpn[1945]: 2024-02-27 14:18:49 UID set to openvpn
Feb 27 14:18:49 35aa343 openvpn[1945]: 2024-02-27 14:18:49 WARNING: this configuration may cache passwords in memory -- use the auth-nocache option to prevent this
Feb 27 14:18:49 35aa343 openvpn[1945]: 2024-02-27 14:18:49 Initialization Sequence Completed
Feb 27 14:18:50 35aa343 5fb23ad57d08[1949]: [info]    Applying target state
Feb 27 14:18:50 35aa343 balena-supervisor[3679]: [info]    Applying target state
Feb 27 14:18:51 35aa343 5fb23ad57d08[1949]: [debug]   Finished applying target state
Feb 27 14:18:51 35aa343 balena-supervisor[3679]: [debug]   Finished applying target state
Feb 27 14:18:51 35aa343 5fb23ad57d08[1949]: [success] Device state apply success
Feb 27 14:18:51 35aa343 balena-supervisor[3679]: [success] Device state apply success
Feb 27 14:18:53 35aa343 5fb23ad57d08[1949]: [info]    Internet Connectivity: OK
Feb 27 14:18:53 35aa343 balena-supervisor[3679]: [info]    Internet Connectivity: OK
Feb 27 14:19:04 35aa343 5fb23ad57d08[1949]: [info]    Reported current state to the cloud
Feb 27 14:19:04 35aa343 balena-supervisor[3679]: [info]    Reported current state to the cloud
Feb 27 14:41:05 35aa343 timesync-https.sh[1900]: [timesync-https.sh][INFO] Time synchronised via HTTPS.
Feb 27 14:41:05 35aa343 timesync-https.sh[4325]: Time synchronised via HTTPS.
Feb 27 14:41:05 35aa343 timesync-https.sh[1900]: [timesync-https.sh][INFO] Old time: Tue Feb 27 14:19:05 UTC 2024
Feb 27 14:41:05 35aa343 timesync-https.sh[4342]: Old time: Tue Feb 27 14:19:05 UTC 2024
Feb 27 14:41:05 35aa343 timesync-https.sh[1900]: [timesync-https.sh][INFO] New time: Tue Feb 27 14:41:05 UTC 2024
Feb 27 14:41:05 35aa343 timesync-https.sh[4359]: New time: Tue Feb 27 14:41:05 UTC 2024
Feb 27 14:41:05 35aa343 timesync-https.sh[4360]: hwclock: can't open '/dev/misc/rtc': No such file or directory
Feb 27 14:41:05 35aa343 healthdog[4361]: 2024-02-27T14:41:05Z chronyd version 4.2 starting (+CMDMON +NTP +REFCLOCK +RTC -PRIVDROP -SCFILTER -SIGND +ASYNCDNS -NTS -SECHASH +IPV6 -DEBUG)
Feb 27 14:41:05 35aa343 healthdog[4361]: 2024-02-27T14:41:05Z Enabled HW timestamping (TX only) on eth0
Feb 27 14:41:05 35aa343 healthdog[4361]: 2024-02-27T14:41:05Z Running with root privileges
Feb 27 14:41:05 35aa343 healthdog[4361]: 2024-02-27T14:41:05Z Frequency 4.845 +/- 0.649 ppm read from /var/lib/chrony/drift
Feb 27 14:41:05 35aa343 healthdog[4364]: [chrony-healthcheck][INFO] No online NTP sources - forcing poll
Feb 27 14:41:05 35aa343 chrony-healthcheck[4372]: No online NTP sources - forcing poll
Feb 27 14:41:05 35aa343 healthdog[4361]: 2024-02-27T14:41:05Z System clock was stepped by -0.000000 seconds
Feb 27 14:41:06 35aa343 healthdog[4361]: 2024-02-27T14:41:06Z Received KoD RATE from 104.244.76.13
Feb 27 14:41:11 35aa343 healthdog[4361]: 2024-02-27T14:41:11Z Selected source 185.105.3.230 (0.resinio.pool.ntp.org)
Feb 27 14:41:12 35aa343 healthdog[4361]: 2024-02-27T14:41:12Z Selected source 162.159.200.1 (2.resinio.pool.ntp.org)
Feb 27 14:41:16 35aa343 healthdog[4375]: try: 1, refid: 00000000, correction: 0.000000000, skew: 0.000
Feb 27 14:41:16 35aa343 healthdog[4375]: try: 2, refid: 00000000, correction: 0.000000000, skew: 0.000
Feb 27 14:41:16 35aa343 healthdog[4375]: try: 3, refid: 00000000, correction: 0.000000000, skew: 0.000
Feb 27 14:41:16 35aa343 healthdog[4375]: try: 4, refid: 00000000, correction: 0.000000000, skew: 0.000
Feb 27 14:41:16 35aa343 healthdog[4375]: try: 5, refid: 00000000, correction: 0.000000000, skew: 0.000
Feb 27 14:41:16 35aa343 healthdog[4375]: try: 6, refid: 00000000, correction: 0.000000000, skew: 0.000
Feb 27 14:41:16 35aa343 healthdog[4375]: try: 7, refid: B96903E6, correction: 0.456125826, skew: 0.649
Feb 27 14:41:16 35aa343 healthdog[4375]: try: 8, refid: A29FC801, correction: 0.375854552, skew: 0.649
Feb 27 14:41:16 35aa343 healthdog[4375]: try: 9, refid: A29FC801, correction: 0.298762411, skew: 0.649
Feb 27 14:41:16 35aa343 healthdog[4375]: try: 10, refid: A29FC801, correction: 0.221674398, skew: 0.649
Feb 27 14:41:16 35aa343 healthdog[4375]: try: 11, refid: A29FC801, correction: 0.144536331, skew: 0.649
Feb 27 14:41:16 35aa343 healthdog[4375]: try: 12, refid: A29FC801, correction: 0.067428723, skew: 0.649

The reboot was triggered at 14:19.
I let the device try to reconnect until around 14:35, when I pulled the plug to perform a power cycle.
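For anyone trying to reproduce this: the log above comes from the persistent journal, which can be read from the host OS shell with standard journalctl boot selectors (these are stock journalctl options, nothing balena-specific):

```shell
# List the boots recorded in the persistent journal
journalctl --list-boots

# Dump the journal of the previous boot, i.e. the one that
# ended in the power cycle
journalctl -b -1 --no-pager

# Restrict to supervisor messages for the current boot
journalctl -u balena-supervisor --no-pager
```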

I’m a bit stuck currently, is this expected behaviour?

Hello @the-real-kenna, did you by any chance have the time to look a bit more into this?

@cedricreber could you please confirm that the device has access to all the networking requirements that we have at balena?

Hi @mpous,

Could you elaborate on how I should go about testing that?

I don’t think it’s a port issue though as the device sometimes connects successfully.

Hello, to test the networking requirements you can try running the “device health checks” from the diagnostics tab on the dashboard and let us know the results.
Can you provide the test docker-compose and Dockerfile you used so we can try to replicate? Did you happen to try the same “steps to repeat” without the modem attached (using WiFi or Ethernet)?