Balena logs stops updating after network switch.

Hi,

We’ve noticed an issue where the console log stops updating when the device switches networks. For example, we start in a garage connected to wifi and the balena logs updating normally. When we leave the garage and transition to ethernet (connected via router to a cell modem), the log stops updating. The console output still updates as expected in the journal on the device, but never makes it to the cloud. When we return to the garage and transition back to wifi, the log starts updating again with a big gap (all the data during the time on ethernet is present in the journal, but never shown in the log).

It seems like whatever handles pushing log data to the cloud doesn’t detect the network change?

Thanks,

Adam

Doesn’t really tell us much that we don’t know – that wifi disconnected – but just after the last message that got out to the cloud (~3 seconds later according to our data) the journal has the following:

Nov 28 18:28:03 293028d wpa_supplicant[1127]: wlan0: CTRL-EVENT-DISCONNECTED bssid=24:05:88:08:54:5a reason=0 locally_generated=1
Nov 28 18:28:03 293028d wpa_supplicant[1127]: dbus: wpa_dbus_property_changed: no property SessionLength in object /fi/w1/wpa_supplicant1/Interfaces/0
Nov 28 18:28:03 293028d NetworkManager[1058]: <info>  [1606588083.9645] device (wlan0): supplicant interface state: completed -> disconnected

There’s nothing else in the journal from the OS until about 16 seconds later:

Nov 28 18:28:19 293028d NetworkManager[1058]: <warn>  [1606588099.1905] device (wlan0): link timed out.
Nov 28 18:28:19 293028d NetworkManager[1058]: <info>  [1606588099.1906] device (wlan0): state change: activated -> failed (reason 'ssid-not-found', sys-iface-state: 'managed')
Nov 28 18:28:19 293028d NetworkManager[1058]: <warn>  [1606588099.1963] device (wlan0): Activation: failed for connection 'wlan0-connection'
Nov 28 18:28:19 293028d dbus-daemon[890]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service' requested by ':1.4' (uid=0 pid=1058 comm="/usr/sbin/NetworkManager --no-daemon ")
Nov 28 18:28:19 293028d NetworkManager[1058]: <info>  [1606588099.2001] device (wlan0): state change: failed -> disconnected (reason 'none', sys-iface-state: 'managed')
Nov 28 18:28:19 293028d avahi-daemon[1042]: Withdrawing address record for fe80::e7ef:26a2:569f:11fc on wlan0.
Nov 28 18:28:19 293028d avahi-daemon[1042]: Leaving mDNS multicast group on interface wlan0.IPv6 with address fe80::e7ef:26a2:569f:11fc.
Nov 28 18:28:19 293028d kernel: IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
Nov 28 18:28:19 293028d avahi-daemon[1042]: Interface wlan0.IPv6 no longer relevant for mDNS.
Nov 28 18:28:19 293028d NetworkManager[1058]: <info>  [1606588099.2147] dhcp4 (wlan0): canceled DHCP transaction
Nov 28 18:28:19 293028d NetworkManager[1058]: <info>  [1606588099.2148] dhcp4 (wlan0): state changed bound -> done
Nov 28 18:28:19 293028d avahi-daemon[1042]: Withdrawing address record for 192.168.86.34 on wlan0.
Nov 28 18:28:19 293028d avahi-daemon[1042]: Leaving mDNS multicast group on interface wlan0.IPv4 with address 192.168.86.34.
Nov 28 18:28:19 293028d avahi-daemon[1042]: Interface wlan0.IPv4 no longer relevant for mDNS.
Nov 28 18:28:19 293028d dbus-daemon[890]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Nov 28 18:28:19 293028d babe3e8fbaf9[1070]: [api]     GET /v1/healthy 200 - 10.398 ms
Nov 28 18:28:19 293028d resin-supervisor[3045]: [api]     GET /v1/healthy 200 - 10.398 ms
Nov 28 18:28:19 293028d NetworkManager[1058]: <info>  [1606588099.7533] device (wlan0): set-hw-addr: set MAC address to AE:04:36:9F:43:F5 (scanning)
Nov 28 18:28:19 293028d kernel: IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
Nov 28 18:28:19 293028d NetworkManager[1058]: <info>  [1606588099.7729] dns-mgr: Writing DNS information to /sbin/resolvconf
Nov 28 18:28:19 293028d dnsmasq[1067]: dnsmasq[1067]: reading /etc/resolv.dnsmasq
Nov 28 18:28:19 293028d dnsmasq[1067]: dnsmasq[1067]: using nameserver 8.8.8.8#53
Nov 28 18:28:19 293028d dnsmasq[1067]: dnsmasq[1067]: using nameserver 192.168.32.1#53
Nov 28 18:28:19 293028d NetworkManager[1058]: <info>  [1606588099.9556] device (wlan0): supplicant interface state: scanning -> disabled
Nov 28 18:28:19 293028d NetworkManager[1058]: <info>  [1606588099.9563] device (wlan0): supplicant interface state: disabled -> inactive
Nov 28 18:28:19 293028d wpa_supplicant[1127]: wlan0: Reject scan trigger since one is already pending

A few seconds after that we stopped our application and then there’s a bunch of output after that. I can’t tell what is only related to us stopping the application/container and what is maybe related to the network change and this error so I’ve uploaded all of it here: http://files.pointonenav.com.s3.amazonaws.com/balena/shutdown_output.txt (the forum won’t let you attach a txt file, and the output is too big to paste in a code block).

Note that we came across this issue while trying to debug another problem: around the same time that we leave the garage, our container suddenly stops receiving serial data from /dev/ttymxc2 from our onboard microcontroller. We still have no idea what is causing that either, and nothing in the journal seems to be related. As soon as we restart the nautilus service, which closes the file handle to /dev/ttymxc2, the device starts receiving data again. We can tell via debugger that the MCU is operating normally when nothing is coming in, so this seems like an OS or docker issue. You can actually see exactly when it happens in the linked output file - the timestamp of the text_ui messages stops updating. It happens somewhere right in the middle of the network manager prints I pasted above.

Any advice would be greatly appreciated.

Hello,
During your transition from WiFi to ethernet, is there a period, even briefly, when there was no connectivity at all?
What device type are you using and is the MCU integrated into the device or connected by other means?

Hi @alanb128:

The device has two independent connectivity sources, one is hardline ethernet and one is wifi. Both are integrated onto the Variscite IMX8M module (and we’re using your BSP for that). The problem is not a transition from ethernet to wifi, but actually just wifi ‘fading’ away (ethernet remains on the whole time and I’ve verified this by maintaining an SSH connection to the device during this behavior).

The MCU is on the same device and is connected directly to a UART on the variscite module. We have 20 of these in the field and have never seen this behavior before the OS upgrade.

Hi. Which was the last working balenaOS+supervisor version, and which version doesn’t work?

The two devices we’re seeing this on are running balenaOS 2.60.1+rev3.

The others are all running balenaOS 2.50.1+rev1, but since there is no obvious way to actually downgrade devices (please correct me if I’m wrong here!) I’m not really able to revert these to verify that this issue isn’t also reproducible in the old version.

Hi Aaron,

You’re correct that we don’t support downgrading an existing balenaOS to an earlier version. That’s only possible by re-flashing the SD card with an earlier (and still available) OS image.

John

Understood, while it’s not the root issue here it definitely gives us some serious pause… we’ve identified a bit of an unknown regression and have no rollback path. Of course we tested this and luckily found a rare event in our dev fleet, we may not always be so lucky. Any plans to address this?

Hi Aaron,

During the time the device is on ethernet, does the device still appear as online on the dashboard? Have you looked at the output from the diagnostics during that period? Do any network issues appear there? Could you also share the output of the ip route command while on wifi and when transitioning to ethernet?

Regarding the question about downgrading. We generally recommend our users to test new OS releases on test devices before upgrading the fleet, for this reason we also generally provide access to the newest OS releases on our staging environment first, so users can check for compatibility issues early on.

Hi again Aaron. I forgot to ask before, is the ethernet network connection always available or is it turned on when the device loses wifi connectivity? Thanks for your help