Balena Container Loses Network Connection

We have an installation of approximately 50 units running Pi Compute hardware and the Balena Fin base OS image, connected through 2 routers. Sometimes a few devices will drop off Balena, showing as disconnected or heartbeat-only in the dashboard, BUT the router shows they are still connected. From the router console, refreshing the connection (I suppose by disconnecting and letting DHCP run again, or otherwise refreshing the link) causes the device to come back online with Balena.

While the device is not fully online with Balena, applications in the container cannot communicate with the network. This means the devices are non-functional, and it is becoming a problem. Is it possible that the container gets into a state where it no longer has network access?

Thanks for any help,

Kevin

Hi Kevin,

Thanks for reaching out!

Are you seeing any errors from NetworkManager during the disconnection? For instance, you could check journalctl -u NetworkManager from within the hostOS. Similarly, do you see anything helpful in the device logs? You might want to enable persistent logging if you’re having to do any reboots.

Let us know what you find, and feel free to grant us Support Access if we can be of help looking through any of the errors / logs.

To be more specific, there is no way to access the device once Balena says it is disconnected. The container application is cut off from the internet, and it is not possible to log into the hostOS through Balena console, or ssh from the LAN. But the router reports the device is still connected, and if the router is used to reset the connection, the device comes back online with Balena.

I would have suspected that the network interface just went down, except that the router reports it online (I'm not quite sure how), and the router can resurrect it by forcing a DHCP recycle.
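One way to narrow this down might be to have the application container record its own view of connectivity, so the timestamps can later be compared against the router's log. The following is only a rough sketch, assuming Python is available in the container; `can_reach` and `probe_record` are hypothetical helper names, and the host and port to probe (for example the router's LAN address, or an external service the application already talks to) are left to the caller.

```python
import socket
from datetime import datetime, timezone


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def probe_record(host, port, timeout=3.0):
    """Return one timestamped line describing the probe result (UTC)."""
    status = "up" if can_reach(host, port, timeout) else "down"
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    return f"{stamp} {host}:{port} {status}"
```

Calling `probe_record` once a minute for both a LAN target and an internet target, and appending the result to a file on a persistent volume, would show whether the LAN path, the internet path, or both stop working when the device drops off the dashboard.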

Kevin

@ko7eraven,

You should still be able to check for errors and logs using the options I shared above once you reconnect to the device. Can you do that and share them with us so that we have a place to start?

Unfortunately, persistent logging is not working for me. When I run journalctl --list-boots, it only shows the current boot, although persistent logging is enabled.

For example, this UUID: b6cff9d01e4a42ad34189842e7fba044

Kevin

On one device I was able to detect openvpn errors after some network weirdness, but the NetworkManager logs showed no issues.

This is the device: 3c172c8b9a3fa773ff855eb18f7dca2f

root@3c172c8:~# journalctl | grep vpn | grep error | more
Nov 18 17:51:19 3c172c8 openvpn[3258]: Thu Nov 18 17:51:19 2021 Linux ip addr del failed: external program exited with error status: 2
Nov 18 17:51:20 3c172c8 openvpn[3258]: Thu Nov 18 17:51:20 2021 Exiting due to fatal error
Nov 18 17:51:36 3c172c8 openvpn[3730]: Thu Nov 18 17:51:36 2021 ERROR: Linux route add command failed: external program exited with error status: 2
Nov 18 18:41:33 3c172c8 openvpn[3730]: Thu Nov 18 18:41:33 2021 Linux ip addr del failed: external program exited with error status: 2
Nov 18 18:41:34 3c172c8 openvpn[3730]: Thu Nov 18 18:41:34 2021 Exiting due to fatal error
Nov 18 18:59:50 3c172c8 openvpn[9776]: Thu Nov 18 18:59:50 2021 Fatal TLS error (check_tls_errors_co), restarting
Nov 18 18:59:50 3c172c8 openvpn[9776]: Thu Nov 18 18:59:50 2021 SIGUSR1[soft,tls-error] received, process restarting
Nov 18 19:57:16 3c172c8 openvpn[9776]: Thu Nov 18 19:57:16 2021 ERROR: Linux route add command failed: external program exited with error status: 2
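To see which openvpn failures repeat most often, the grep output above could be grouped by message. This is a minimal sketch, assuming the journal lines have the same layout as the ones pasted here (syslog prefix, then openvpn's own timestamp, then the message); `summarize` is a hypothetical helper name, not part of any Balena tooling.

```python
import re
from collections import Counter

# Matches lines shaped like:
# Nov 18 17:51:19 3c172c8 openvpn[3258]: Thu Nov 18 17:51:19 2021 <message>
LINE_RE = re.compile(
    r"^\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2}\s+\S+\s+openvpn\[\d+\]:\s+"
    r"\w{3}\s+\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2}\s+\d{4}\s+(?P<msg>.*)$"
)


def summarize(lines):
    """Count openvpn messages containing 'error', grouped by message text."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        # Case-insensitive, so both 'ERROR:' and 'fatal error' are counted.
        if match and "error" in match.group("msg").lower():
            counts[match.group("msg")] += 1
    return counts
```

Feeding it the journal text, for example `summarize(sys.stdin)` with `journalctl` piped in, gives a per-message count that is easier to compare across devices than the raw lines.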

OK, I was able to get some NetworkManager logs from journalctl. In this one I believe we see a disconnect (the device drops off the Balena dashboard), and then, after the router is used to reset the connection, it comes back online.

Kevin

networkmanager.txt (18.7 KB)