WiFi restart makes my Raspberry Pi3 device go offline until reboot

We are experiencing the same issue with one of our devices running resinOS 2.2.0+rev1 (prod) on an RPI3. The connection quality at the client's site is bad, with WiFi connectivity dropping every now and then. Once the device gets into the described state of not being able to reconnect, only a reboot fixes it. This has happened twice so far within 3 weeks of deployment.

Hi, is there any news on the update?

Sorry to nag about this, but this issue is truly neck-breaking for us. We have multiple devices at partner locations, and it’s sad to see some of them not come back up after a WiFi outage (due to a power outage or something else).

We have to call them and tell them they should restart the devices which makes us look bad and unstable.

Hi,

There’s a fix currently being implemented by the devices team. I’m reaching out to them and will share more news as soon as possible.

Hi. Sorry this is taking so long; here is where things stand. To mitigate the issue you are seeing, we have this open pull request: https://github.com/resin-os/resin-raspberrypi/pull/122
However, for this to get merged without breaking the Raspberry Pi 1, we also need this patch http://lists.openembedded.org/pipermail/openembedded-core/2017-September/141938.html to be merged into the poky pyro branch. That is what we are currently waiting on, and we are pushing to have it merged as soon as possible.

Just to update everyone on this. The latest resinOS v2.7.5+rev1 for the RPI3 now has the latest 4.9 kernel and the latest wifi firmware, so I believe this issue should be fixed. If you continue to see the issue on this version of the OS, please let us know.

I just experienced the issue again with the new resinOS:

While it is still working, the messages are as follows (wlan0 is not used, wlan1 is a USB dongle):

[Thu Nov  9 16:41:43 2017] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[Thu Nov  9 16:41:43 2017] brcmfmac: power management disabled

Then there is the disconnect:

[Thu Nov  9 16:52:27 2017] RTL871X: linked_status_chk(wlan1) disconnect or roaming
[Thu Nov  9 16:52:34 2017] RTL871X: indicate disassoc
[Thu Nov  9 16:52:39 2017] RTL871X: nolinked power save enter
[Thu Nov  9 16:52:43 2017] RTL871X: nolinked power save leave
[Thu Nov  9 16:52:47 2017] RTL871X: set ssid [XXXXXX] fw_state=0x00000008
[Thu Nov  9 16:52:47 2017] RTL871X: set bssid:XXXXXX
[Thu Nov  9 16:52:47 2017] RTL871X: start auth
[Thu Nov  9 16:52:47 2017] RTL871X: auth success, start assoc
[Thu Nov  9 16:52:47 2017] RTL871X: assoc success
[Thu Nov  9 16:52:47 2017] UpdateHalRAMask8812A => mac_id:0, networkType:0x14, mask:0x000ffff0
	 ==> rssi_level:0, rate_bitmap:0x000ff010
[Thu Nov  9 16:52:47 2017] RTL871X: send eapol packet
[Thu Nov  9 16:52:47 2017] RTL871X: indicate disassoc
[Thu Nov  9 16:52:47 2017] RTL871X: set bssid:00:00:00:00:00:00
\xffffffb71X\xffffffa3Z%]X\xffffffe9^\xffffffd4\xffffffab\xffffffb2\xffffffcdƛ\xffffffb4T\xffffff82tA!=܇2	] fw_state=0x00000008
[Thu Nov  9 16:52:48 2017] RTL871X: indicate disassoc
[Thu Nov  9 16:52:48 2017] IPv6: ADDRCONF(NETDEV_UP): wlan1: link is not ready
[Thu Nov  9 16:52:56 2017] RTL871X: nolinked power save enter
[Thu Nov  9 16:53:12 2017] RTL871X: nolinked power save leave
[Thu Nov  9 16:53:16 2017] RTL871X: nolinked power save enter
[Thu Nov  9 16:53:45 2017] RTL871X: nolinked power save leave
...

Is there an easy way to restart the networking stack from inside the container? Then we could have a script which checks for connectivity and restarts the stack if it gets stuck.

What USB dongle are you using when you have this issue?

It is a D-Link DWA-171, using this driver: https://github.com/gnab/rtl8812au.git

Thanks for the info. It looks like there are other reports elsewhere that look similar to, or the same as, yours:

https://github.com/gnab/rtl8812au/issues/119

Have you tried the dongle with regular Raspbian to check whether it works there? According to the issue above, the same seems to happen there too.

Also, we have a list of known working dongles: https://docs.resin.io/hardware/wifi-dongles/#known-working-devices

As far as I can see, it is not working at all for them. For us it works for a long time, but after a disconnect it sometimes does not recover.

Is there an easy way to restart the network stack completely in resinOS? As the problem happens very rarely, this would be an option for us. I tried reloading the connections via dbus, which didn’t work. What did work was adding a new connection via dbus, but it is not feasible to add a new connection at every disconnect. The last resort would be rebooting the whole device, but I wouldn’t like to do that.

All the devices on the list are either unavailable or do not support 5 GHz, as far as I can see. Our device is not listed on the elinux RPi WiFi page, but a lot of other DWA-1xx models with lower numbers are. I guess these pages just do not get updated?

I was looking into the dbus interface for systemd, and one way to restart a service would be

DBUS_SYSTEM_BUS_ADDRESS=unix:path=/host/run/dbus/system_bus_socket \
  gdbus call --system \
    --dest org.freedesktop.systemd1 \
    --object-path /org/freedesktop/systemd1 \
    --method org.freedesktop.systemd1.Manager.RestartUnit \
    "<servicename>.service" \
    "replace"

Here you’d replace <servicename> with the name of a host OS service, such as NetworkManager. I just tried it on a connected test device, and it reconnected fine afterwards.

See these docs for which methods are available and what parameters they take: https://www.freedesktop.org/wiki/Software/systemd/dbus/

The dbus-send example would be as follows:

DBUS_SYSTEM_BUS_ADDRESS=unix:path=/host/run/dbus/system_bus_socket \
  dbus-send --system --print-reply --reply-timeout=2000 \
    --type=method_call \
    --dest=org.freedesktop.systemd1 \
    /org/freedesktop/systemd1 \
    org.freedesktop.systemd1.Manager.RestartUnit \
    string:"<servicename>.service" \
    string:replace

This is just an example, though, and in general be careful with automatic network manipulation, as you might end up with a disconnected device. :warning: You are right that rebooting the device should be the last resort. The best outcome would be to figure out what causes the outage and fix it at the firmware level. For that we usually have to rely on upstream, though we do our fair share of upstreaming fixes…
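If you do go the watchdog route, a minimal sketch could look like the following. It only restarts NetworkManager after several consecutive failed checks, so a single lost ping does not trigger a restart. The function names, the ping target, the threshold and the interval are all assumptions made up for illustration, and it presumes `ping` and `gdbus` are available inside the container.

```shell
#!/bin/sh
# Hypothetical connectivity watchdog; every name and default below is an assumption.
CHECK_HOST="${CHECK_HOST:-8.8.8.8}"      # host used for the reachability check
MAX_FAILURES="${MAX_FAILURES:-3}"        # consecutive failures before restarting
CHECK_INTERVAL="${CHECK_INTERVAL:-60}"   # seconds between checks
failures=0

# Succeeds if one ping gets a reply within 5 seconds.
is_online() {
  ping -c 1 -W 5 "$CHECK_HOST" >/dev/null 2>&1
}

# Same RestartUnit call as in the gdbus example above, fixed to NetworkManager.
restart_network_manager() {
  DBUS_SYSTEM_BUS_ADDRESS=unix:path=/host/run/dbus/system_bus_socket \
    gdbus call --system \
      --dest org.freedesktop.systemd1 \
      --object-path /org/freedesktop/systemd1 \
      --method org.freedesktop.systemd1.Manager.RestartUnit \
      "NetworkManager.service" "replace"
}

# One check: reset the counter when online, otherwise count the failure and
# restart NetworkManager once the threshold is reached.
watchdog_step() {
  if is_online; then
    failures=0
    return 0
  fi
  failures=$((failures + 1))
  echo "connectivity check failed ($failures/$MAX_FAILURES)" >&2
  if [ "$failures" -ge "$MAX_FAILURES" ]; then
    failures=0
    restart_network_manager
  fi
}

# To run it as a loop, something like:
#   while true; do watchdog_step; sleep "$CHECK_INTERVAL"; done
```

Resetting the counter after a restart gives NetworkManager a few check intervals to reconnect before the next attempt.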

Let us know if you have any experience trying it!

Yeah, we need more 5 GHz dongles, though we have made some developments in that direction too (will keep everyone posted). If you have any other dongles that you’d recommend based on experience, we would love to hear about them.

I can confirm that this works: After the dbus service replace command the wifi dongle connects to the router again!

@imrehg if this command is executed, does this interfere with the networking of running containers? Would all services need to be restarted when running this? The reason I ask is that we use an update lock in a container and if service restarts are required I’ll need to take that into consideration to remove the lock before doing this.

There’s a small problem when using this to restart NetworkManager while running an access point from the device. It seems that the RestartUnit method does not properly shut down all processes of a unit, and doing this while running an access point leaves the dnsmasq process around. When NetworkManager restarts, this leftover dnsmasq causes problems. The solution I’ve found is to first call KillUnit("NetworkManager.service", "all", 15), which sends SIGTERM to all processes within the unit and stops NetworkManager cleanly. Then follow KillUnit with StartUnit("NetworkManager.service", "replace") and it is back up and running.
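For reference, the KillUnit then StartUnit sequence can be sketched with gdbus roughly as below, using the same socket path as the earlier examples. The helper names `systemd_call` and `restart_unit_cleanly` and the settle delay are my own assumptions, not anything provided by resinOS; the delay is only a guess at giving the unit's processes time to exit before StartUnit is issued.

```shell
#!/bin/sh
# Hypothetical helpers; the names and the settle delay are assumptions.
export DBUS_SYSTEM_BUS_ADDRESS=unix:path=/host/run/dbus/system_bus_socket

# Call a method on the systemd Manager interface.
# $1 = method name, remaining arguments = method parameters.
systemd_call() {
  method="$1"
  shift
  gdbus call --system \
    --dest org.freedesktop.systemd1 \
    --object-path /org/freedesktop/systemd1 \
    --method "org.freedesktop.systemd1.Manager.$method" \
    "$@"
}

# Send SIGTERM (15) to all processes of the unit, wait briefly, then start it.
restart_unit_cleanly() {
  unit="$1"
  systemd_call KillUnit "$unit" "all" 15 \
    || echo "KillUnit failed for $unit, trying StartUnit anyway" >&2
  sleep "${SETTLE_SECONDS:-2}"
  systemd_call StartUnit "$unit" "replace"
}

# Example (not run automatically):
#   restart_unit_cleanly NetworkManager.service
```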

Hi @ejohnso49,
No, it doesn’t interfere with the running containers and will not require any app container restarts.
Once the host OS network connectivity is restored, then the requests from the app container will start to succeed.

Thanks for sharing back your findings :+1:

Hi All,

Resurrecting the discussion:
I’m running a simple RPi3, downloaded the image, loaded up node-red and in a matter of hours or days it goes offline, without coming back.
I’m running 2.36.0+rev2.

I was wondering whether the problem has been addressed in a definitive way, and whether it is known why the device randomly drops off the WLAN?

Thanks for the great support.

Hi @mvargasevans,
Can you share a few extra details about your setup?
What WiFi dongle are you using?
Do you have the same issue on more than one device?
Could it be a connectivity issue? Does the device lose connectivity when using ethernet?

Could you also grant us support access to that device and share its dashboard url with us so that we can pull some logs from it?