Unreliable Networking on Raspberry Pi W

Hi all,

I’ve been using balenaOS for some time now, and for the most part it’s been great, but I’ve had nothing but issues with networking. Originally, I had each Raspberry Pi Zero W connected over Wi-Fi, and I would experience random connectivity problems where I was unable to reach the devices via the Balena Dashboard. They would show as offline, heartbeat only, no VPN connection, et cetera. Thinking that this was a Wi-Fi problem, I purchased some micro USB to Ethernet adapters, but the same problems ensued.

There are 4 devices total at this location, all Raspberry Pi Zero Ws, all connected via Ethernet to an unmanaged network switch, then from the switch to the modem/router. This is on a business class internet connection that very frequently has outages on other devices, and there are no outbound firewall rules or similar devices between the Pis and the outside internet. I’ve ensured that all ports and hosts listed in the “Network Requirements” article in the balenaOS documentation are reachable.

I’ve been able to connect to them for short amounts of time (<1 minute) before they disconnect. Sometimes shorter, sometimes longer; I haven’t noticed any patterns. Any help or pointers would be greatly appreciated.

Also, I originally contacted Balena about this issue, and they directed me to this forum. They mentioned that including a traceroute could be useful in solving this problem, so here are the results from before it disconnected:

root@f399e03:~# traceroute api.balena-cloud[.]com
traceroute to api.balena-cloud[.]com (52.202.238.221), 30 hops max, 38 byte packets
1 _gateway (192.168.1.1) 1.191 ms 0.740 ms 0.972 ms
2 * * *
3 B4305.BFLONY-LCR-21.verizon-gni[.]net (100.41.223.230) 17.155 ms 16.015 ms B4305.BFLONY-LCR-22.verizon-gni[.]net (100.41.5.244) 12.687 ms
4 * * *
5 * * *
6 0.ae16.GW14.IAD8.ALTER[.]NET (140.222.226.37) 19.864 ms 20.501 ms 0.ae15.GW14.IAD8.ALTER[.]NET (140.222.226.31) 23.352 ms
7 204.148.170.66 (204.148.170.66) 23.905 ms 21.319 ms 21.450 ms
8 * * *
9 * * *
10 52.93.28.98 (52.93.28.98) 24.710 ms 25.448 ms 52.93.28.102 (52.93.28.102) 25.046 ms
11 * * *
12 * * *
13 * * *
14 * * *
15 * * *
16 * * *
17 * * *
18 * * *
19 * * *
20 * * *
21 * * *
22 * * *
23 * * *
24 * *

Also, please note that I added the [.] to the hostnames above myself. When I tried posting, it said new users could only post 5 links, so I’ve added the brackets to be able to post this.

Hi

Welcome to the forums!

  • Can you enable persistent logging on the devices, so that you can access the logs later and across boots?
  • Do you see anything suspicious under dmesg -wH?
  • Also take a look at our HostOS masterclass, especially the section on NetworkManager logs. If you can share verbose logs, it would help immensely in figuring out what’s going on.
  • Also, as I understand from your message, this is all at one site. You haven’t been able to recreate these issues locally, right? The fact that you are seeing issues with WiFi as well as wired is strange. Can you tell us what the power situation is like? What power supply are you using on all the devices? Is it the same for the devices that are showing issues? Do you have more hardware connected to the Pi Zeros that might be consuming current?

Yes, this was already enabled as part of my own troubleshooting.

Yes, it is this message repeated about every 5 minutes.

   [Feb22 14:14] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
   [  +0.000245] brcmfmac: brcmf_cfg80211_set_power_mgmt: power save disable

I’ve turned this on using nmcli general logging level DEBUG domain ALL, but I’m confused on how to actually read that log now. Is it journalctl or is that from the container only?

I haven’t tried to recreate it locally yet (I can try this soon on Wi-Fi, but I don’t have another spare Ethernet adapter). They’re plugged into a Raspberry Pi power supply that’s 5V 2.5A which should be plenty. Besides power, Ethernet is plugged in as well as the HDMI port to a TV.

Hello,

  • Can you please confirm the current balenaOS version you are using?
  • I’ve turned this on using nmcli general logging level DEBUG domain ALL, but I’m confused on how to actually read that log now. Is it journalctl or is that from the container only?

You should now have more verbose output when using journalctl. Can you please see if there is anything peculiar?
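In case it helps with reading that output, here is a minimal sketch of a filter that keeps only the NetworkManager lines logged at warn or error level. It assumes the usual journal line format, where the message starts with a level tag such as `<warn>` or `<error>`; the function name `nm_problems` is made up for this example.

```shell
# Keep only NetworkManager lines logged at warn or error level.
# Reads journal text on stdin, so it also works on a saved log file.
# Assumption: lines look like "... NetworkManager[PID]: <warn>  [timestamp] ...".
nm_problems() {
    grep -E 'NetworkManager\[[0-9]+\]: <(warn|error)>'
}
```

On the host OS this could be used as `journalctl --no-pager | nm_problems`, or `nm_problems < saved-journal.txt` for a log that was copied off the device (the file name is only a placeholder).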

  • Can you please provide more details about those USB Ethernet adapters?

Thanks

Yes, it’s balenaOS 2.54.2+rev1.

I was only able to connect for a few seconds before it quit, but taking a scroll through it, it seems to all be openvpn and kernel logs, no warnings or errors. I can keep trying, but I think the comment below helps more towards solving the situation.

Yes, here’s the link to it. Taking another look at the Amazon page, I saw some reviews mentioning duplicate MAC addresses, and after checking in the balenaOS panel online, this seems to be true of my adapters and is perhaps causing interference. Is there a way to manually set the MAC address in balenaOS, and would something like that help in this situation?
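For anyone wanting to confirm a duplicate-MAC suspicion across several devices, a rough sketch: collect one "device-name mac-address" pair per line and let a small helper report any address that shows up more than once. The helper name `duplicate_macs` and the device names are hypothetical; the MAC of an interface can usually be read with `cat /sys/class/net/eth0/address`, assuming the adapter appears as eth0.

```shell
# Print every MAC address that appears on more than one device.
# Input on stdin: one "device-name mac-address" pair per line.
duplicate_macs() {
    awk '{
             mac = tolower($2)                 # compare case-insensitively
             count[mac]++
             names[mac] = names[mac] " " $1    # remember which devices use it
         }
         END {
             for (mac in count)
                 if (count[mac] > 1) print mac ":" names[mac]
         }'
}
```

Example use: `printf 'pi1 %s\npi2 %s\n' "$mac1" "$mac2" | duplicate_macs`, where the two variables hold addresses collected by hand. If duplicates do turn up, NetworkManager's `cloned-mac-address` connection property might be worth looking into, though whether that is the right approach on balenaOS is an assumption to verify.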

From your original post:

This is on a business class internet connection that very frequently has outages

Did you mean it very infrequently has outages?

Sorry to have to ask this, but I just want to make sure.

Yes, sorry that was a typo on my end. Very infrequent outages, yes.

Ok, I mean, I was 99% sure that’s what you meant, but just wanted to verify.

Can you tunnel into the device locally and post the logs from journalctl? You should be able to tunnel with balena ssh [uuid].local, where uuid is the short uuid (7 characters).

Yes, I was able to, but the log is very lengthy, so I’ve posted it here.

I looked in the logs you posted and there’s nothing obvious to me. However, it doesn’t look like NetworkManager is in debug mode either. Could you try running nmcli general logging level DEBUG domain ALL in the host OS, wait a bit, and get the logs again?

From your messages, it looks like you might be running NetworkManager inside a container. Is this the case?

Yes, I’ll do that now and update this thread in a few hours, hopefully something will come up.

I don’t think so, but maybe I’m wrong. This is the container I’m running.

Hi there, so it looks like you’ve already checked off all the obvious troubleshooting steps, like ruling out the WiFi network and swapping to Ethernet/USB.

Though it is still possible that, when both WLAN and LAN interfaces are configured, the default gateway is set to whichever interface comes up last. So even if you have both WiFi and Ethernet connected, the connectivity may still be using WiFi.

The quickest way to check is to open a terminal on the device and run ip route, then inspect the dev set against the default gateway (hopefully it’s the USB Ethernet interface).
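As a small illustration of that check, the sketch below pulls the interface name out of the default route line. It reads `ip route` output on stdin and prints the word following `dev` on the first line that starts with `default`; the function name `default_route_dev` is invented for this example.

```shell
# Print the interface used by the first default route in `ip route` output.
# Reads the routing table text on stdin; prints nothing if no default exists.
default_route_dev() {
    awk '$1 == "default" {
             for (i = 2; i < NF; i++)
                 if ($i == "dev") { print $(i + 1); exit }
         }'
}
```

Run it as `ip route | default_route_dev`. If it prints wlan0 rather than the Ethernet interface, traffic is still leaving over WiFi.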

So when I said I’ve also tried Wi-Fi, this was on a separate copy of the OS. I initially tried WiFi and had connection problems, so I then moved to Ethernet. In my troubleshooting process, I wiped all the SD cards on the Raspberry Pis and never reloaded the WiFi credentials onto them, so all traffic must be going through Ethernet. Sorry for the confusion.

Yes, I’ve got a bunch of logs now from each device, hopefully this is enough to tell what’s going on.

Device 1 Logs
Device 2 Logs
Device 3 Logs
Device 4 Logs

Are you able to get a “known working” device from Balena OS to work on your network?

For instance, could you try running the screenly app on your device so we can narrow down if it’s a container or network issue?

Unfortunately, I’m not there and cannot go there due to travel restrictions right now. Is there something else I can try with the existing devices?

If I can switch a device over to another application to try this and then put it back on the original, I would be able to try this and get back to you, assuming I’m able to do this with the limited network connection it has.

The [vpn.balena-cloud.com] Inactivity timeout (--ping-restart), restarting lines in the log suggest that OpenVPN keeps restarting the connection because it thinks it is dead.
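To get a feel for how often this happens, a rough sketch that counts those restart lines in a log read from stdin. The function name `count_vpn_restarts` is hypothetical, and the match string is taken from the log message quoted above.

```shell
# Count OpenVPN inactivity restarts in a log read from stdin.
count_vpn_restarts() {
    # -F matches the text literally; "|| true" keeps the exit status zero
    # when the count is 0, so the helper is safe under `set -e`.
    grep -c -F 'Inactivity timeout (--ping-restart)' || true
}
```

For example, `journalctl --no-pager | count_vpn_restarts` on the host OS gives the number of restarts in the available journal.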

One way to troubleshoot this further would be to run some regular traces on the devices from the host OS container, to send traffic across the VPN connection, for example:

# ping 52.4.252.97
PING 52.4.252.97 (52.4.252.97): 56 data bytes
64 bytes from 52.4.252.97: seq=0 ttl=255 time=102.123 ms
64 bytes from 52.4.252.97: seq=1 ttl=255 time=148.565 ms
64 bytes from 52.4.252.97: seq=2 ttl=255 time=90.183 ms
64 bytes from 52.4.252.97: seq=3 ttl=255 time=94.110 ms
^C

and

# while true; do curl -I 52.4.252.97/ping; sleep 3s; done

HTTP/1.1 200 OK
content-length: 58
cache-control: no-cache
content-type: text/html
connection: close
...

If these start timing out or you see a large variability in response times, then the issue could be a saturated network upstream of the device (it could also be resource starvation on the device itself).
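To put a number on "large variability", here is a minimal sketch that summarises ping output. It assumes reply lines contain `time=<milliseconds>` as in the output above, and the function name `ping_summary` is made up for illustration.

```shell
# Summarise round-trip times from ping output read on stdin.
# Prints the reply count plus min/avg/max and the max-min spread in ms.
ping_summary() {
    awk '
        /time=/ {
            sub(/.*time=/, "")      # drop everything up to the RTT value
            t = $1 + 0              # first field is now the time in ms
            if (n == 0 || t < min) min = t
            if (n == 0 || t > max) max = t
            sum += t
            n++
        }
        END {
            if (n == 0) { print "no replies"; exit }
            printf "replies=%d min=%.3f avg=%.3f max=%.3f spread=%.3f\n",
                   n, min, sum / n, max, max - min
        }'
}
```

A possible use is `ping -c 20 52.4.252.97 | ping_summary`. Fewer replies than packets sent points to loss, and a spread that is large relative to the average points to jitter.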

Hi, I would like to follow up on this. Were you able to solve the issue? If not, have you tried the connectivity check my colleague suggested? Thanks