Sent you a PM here, maybe it works better that way for the start.
Just to update everyone on the experiments. I spent quite a bit of the afternoon trying to reproduce this scenario, having the wifi disconnect about 20 times with varying lengths of downtime and monitoring the NetworkManager logs of the device. So far I haven’t been able to get into this state. Perhaps a mobile hotspot is not the best way to test this, as it could produce a softer, friendlier disconnect. Tomorrow I will try to set up a new router to test this (can’t disconnect the office one over and over).
If there are other things anyone in the thread can think of to try to induce this issue, it would really help.
Hey everyone. Just another update. I finally managed to get a device into this state (although it was due to pulling out a GSM modem and having the Pi fail over to wifi). I pulled some logs, and from what we can see it seems that the onboard wifi chip firmware somehow gets stuck while scanning.
The next thing to test is whether updating the wifi firmware helps. The update is being added here: https://github.com/agherzan/meta-raspberrypi/commit/acd58692356df2bbdc3aa1d6b15f44c104ad3b45
Hopefully this will fix this very weird bug!
Thanks for the update. When can we expect to see a resinOS version with the included fix?
Hey @peterjuras I am pushing to get it included into the next release of resinOS for the RPi3 so hopefully in the next couple of weeks.
This thread describes pretty much the same thing that happened to me: Raspberry pi 3 Wifi issues on poor signal
See post #17, where someone came up with a good way to test it.
So far I’ve been checking internet connection inside my app and restarting the raspberry’s wifi interface after a few minutes.
We are experiencing the same issue with one of our devices running resinOS 2.2.0+rev1 (prod) on an RPI3. The connection quality at the client’s site is bad, with wifi connectivity dropping every now and then. Once the device runs into the described state of not being able to reconnect, only a reboot fixes it. This has happened twice so far within 3 weeks of deployment.
Hi, is there any news on the update?
Sorry to nag about this, but this issue is truly neckbreaking for us, we have multiple devices at partner locations and it’s sad to see some of them not come up after a wifi outage (due to a power outage or something else).
We have to call them and tell them they should restart the devices which makes us look bad and unstable.
There’s a fix currently being implemented by the devices team. I’m reaching out to them so we can share more news as soon as possible.
Hi. Sorry this is taking so long; here is where things stand. To mitigate the issue you are seeing, we have this open pull request: https://github.com/resin-os/resin-raspberrypi/pull/122
However, for this to get merged without breaking the Raspberry Pi 1, we also need this patch http://lists.openembedded.org/pipermail/openembedded-core/2017-September/141938.html to get merged into the poky pyro branch. That is what we are currently waiting on, and we are pushing to have it merged as soon as possible.
Just to update everyone on this. The latest resinOS v2.7.5+rev1 for the RPI3 now has the latest 4.9 kernel and the latest wifi firmware, so I believe this issue should be fixed. If you continue to see the issue on this version of the OS, please let us know.
Just experienced the issue again with the new resinOS.
While it is still working, the messages are (wlan0 is not used, wlan1 is a USB dongle):
[Thu Nov 9 16:41:43 2017] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[Thu Nov 9 16:41:43 2017] brcmfmac: power management disabled
Then there is the disconnect:
[Thu Nov 9 16:52:27 2017] RTL871X: linked_status_chk(wlan1) disconnect or roaming
[Thu Nov 9 16:52:34 2017] RTL871X: indicate disassoc
[Thu Nov 9 16:52:39 2017] RTL871X: nolinked power save enter
[Thu Nov 9 16:52:43 2017] RTL871X: nolinked power save leave
[Thu Nov 9 16:52:47 2017] RTL871X: set ssid [XXXXXX] fw_state=0x00000008
[Thu Nov 9 16:52:47 2017] RTL871X: set bssid:XXXXXX
[Thu Nov 9 16:52:47 2017] RTL871X: start auth
[Thu Nov 9 16:52:47 2017] RTL871X: auth success, start assoc
[Thu Nov 9 16:52:47 2017] RTL871X: assoc success
[Thu Nov 9 16:52:47 2017] UpdateHalRAMask8812A => mac_id:0, networkType:0x14, mask:0x000ffff0 ==> rssi_level:0, rate_bitmap:0x000ff010
[Thu Nov 9 16:52:47 2017] RTL871X: send eapol packet
[Thu Nov 9 16:52:47 2017] RTL871X: indicate disassoc
[Thu Nov 9 16:52:47 2017] RTL871X: set bssid:00:00:00:00:00:00 \xffffffb71X\xffffffa3Z%]X\xffffffe9^\xffffffd4\xffffffab\xffffffb2\xffffffcdƛ\xffffffb4T\xffffff82tA!=܇2 ] fw_state=0x00000008
[Thu Nov 9 16:52:48 2017] RTL871X: indicate disassoc
[Thu Nov 9 16:52:48 2017] IPv6: ADDRCONF(NETDEV_UP): wlan1: link is not ready
[Thu Nov 9 16:52:56 2017] RTL871X: nolinked power save enter
[Thu Nov 9 16:53:12 2017] RTL871X: nolinked power save leave
[Thu Nov 9 16:53:16 2017] RTL871X: nolinked power save enter
[Thu Nov 9 16:53:45 2017] RTL871X: nolinked power save leave
...
Is there an easy way to restart the networking stack from inside the container? Then we could have a script which checks for connectivity and restarts the stack if it gets stuck.
What USB dongle are you using when you have this issue?
it is a D-Link DWA-171, using this driver: https://github.com/gnab/rtl8812au.git
Thanks for the info. It looks like there are other reports elsewhere that look similar to, or the same as, yours:
Have you tried the dongle with regular Raspbian to check whether it works there? According to the issue above, the same seems to happen there too.
Also, we have a list of known working dongles: https://docs.resin.io/hardware/wifi-dongles/#known-working-devices
As far as I can see, it is not working at all for them. For us it works for a long time, but once disconnected it sometimes does not recover.
Is there an easy way to restart the network stack completely in resin os? As it happens very rarely this would be an option for us. I tried reloading the connections via dbus which didn’t work. What did work was adding a new connection via dbus but it is not feasible to add a new connection at every disconnect. Last resort would be rebooting the whole device but I wouldn’t like to do that.
All the devices on the list are either unavailable or do not support 5 GHz, as far as I can see. Our device is not listed on the elinux RPi wifi page, but a lot of other DWA-1xx models with lower numbers are. I guess these lists just do not get updated?
I was looking into the dbus interface for systemd, and one way to restart a service would be
DBUS_SYSTEM_BUS_ADDRESS=unix:path=/host/run/dbus/system_bus_socket \
  gdbus call --system \
  --dest org.freedesktop.systemd1 \
  --object-path /org/freedesktop/systemd1 \
  --method org.freedesktop.systemd1.Manager.RestartUnit \
  "<servicename>.service" \
  "replace"
Where, for example, you’d replace <servicename> with a host OS service’s name, such as NetworkManager. Just tried it on a test device that was connected, and it reconnected fine afterwards.
See these docs for which methods (the --method values) are available and what their parameters are: https://www.freedesktop.org/wiki/Software/systemd/dbus/
A dbus-send example would be as follows (the placeholder is quoted so the shell does not treat < and > as redirections):
DBUS_SYSTEM_BUS_ADDRESS=unix:path=/host/run/dbus/system_bus_socket \
  dbus-send --system --print-reply --reply-timeout=2000 \
  --type=method_call \
  --dest=org.freedesktop.systemd1 \
  /org/freedesktop/systemd1 \
  org.freedesktop.systemd1.Manager.RestartUnit \
  "string:<servicename>.service" \
  string:replace
This is just an example, though, and in general be careful with automatic network manipulation, as you might end up with a disconnected device. You are right that a device reboot should be the last resort too. The best outcome would be to figure out what causes the outage and fix it at the firmware level. For that we usually have to rely on upstream, though we do our fair share of upstreaming fixes…
Let us know if you have any experience trying it!
Yeah, need more 5GHz dongles, though we have made some developments in that direction too (will keep everyone posted). If you have any other dongles that you’d recommend based on experience, would love to hear.
I can confirm that this works: After the dbus service replace command the wifi dongle connects to the router again!
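For anyone who wants the check-and-restart script discussed above, here is a minimal sketch of how it could look from inside a container. The probe host, the failure threshold, and the function names are assumptions made for illustration, not something taken from this thread. The gdbus call is the RestartUnit one posted earlier. Keep the earlier warning in mind: automatic network manipulation can leave a device disconnected, so treat this as a starting point only.

```shell
#!/bin/sh
# Sketch of a connectivity watchdog (assumptions: probe host, threshold, names).

# Host D-Bus socket as used in the commands earlier in the thread.
DBUS_SYSTEM_BUS_ADDRESS="${DBUS_SYSTEM_BUS_ADDRESS:-unix:path=/host/run/dbus/system_bus_socket}"
export DBUS_SYSTEM_BUS_ADDRESS

PROBE_HOST="${PROBE_HOST:-8.8.8.8}"   # hypothetical probe target
MAX_FAILURES="${MAX_FAILURES:-3}"     # hypothetical threshold before restarting
FAILURES=0                            # consecutive failed checks so far

# Succeeds if a single ping to the probe host gets an answer.
is_online() {
    ping -c 1 -W 5 "$PROBE_HOST" >/dev/null 2>&1
}

# Ask systemd on the host to restart NetworkManager.
restart_network_manager() {
    gdbus call --system \
        --dest org.freedesktop.systemd1 \
        --object-path /org/freedesktop/systemd1 \
        --method org.freedesktop.systemd1.Manager.RestartUnit \
        "NetworkManager.service" "replace"
}

# One check: reset the counter when online, otherwise count the failure and
# restart NetworkManager once MAX_FAILURES consecutive checks have failed.
watchdog_step() {
    if is_online; then
        FAILURES=0
        return 0
    fi
    FAILURES=$((FAILURES + 1))
    if [ "$FAILURES" -ge "$MAX_FAILURES" ]; then
        restart_network_manager
        FAILURES=0
    fi
}
```

To run it, call watchdog_step in a loop with a pause in between, for example `while true; do watchdog_step; sleep 60; done`. Requiring several consecutive failures keeps a single lost ping from triggering a restart.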
@imrehg if this command is executed, does this interfere with the networking of running containers? Would all services need to be restarted when running this? The reason I ask is that we use an update lock in a container and if service restarts are required I’ll need to take that into consideration to remove the lock before doing this.
There’s a small problem when using this to restart NetworkManager while running an access point from the device. It seems that the RestartUnit method does not properly shut down all processes of a unit, and doing this while running an access point leaves the dnsmasq process around. When NetworkManager restarts, this leftover dnsmasq causes problems. The solution I’ve found is to first call KillUnit("NetworkManager.service", "all", 15), which sends SIGTERM to all processes within the unit and stops NetworkManager cleanly. Then follow KillUnit with StartUnit("NetworkManager.service", "replace") and it is back up and running.
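As a rough sketch, the KillUnit-then-StartUnit sequence could be wrapped with gdbus like this. The function names and the short pause between the two calls are assumptions for illustration; the thread does not say whether a delay is needed.

```shell
#!/bin/sh
# Sketch: stop every process in a unit with KillUnit, then start it again.

# Host D-Bus socket as used in the commands earlier in the thread.
DBUS_SYSTEM_BUS_ADDRESS="${DBUS_SYSTEM_BUS_ADDRESS:-unix:path=/host/run/dbus/system_bus_socket}"
export DBUS_SYSTEM_BUS_ADDRESS

# Call a method on the systemd Manager interface.
# $1 is the method name; the remaining arguments are the method parameters.
systemd_call() {
    method="$1"
    shift
    gdbus call --system \
        --dest org.freedesktop.systemd1 \
        --object-path /org/freedesktop/systemd1 \
        --method "org.freedesktop.systemd1.Manager.$method" \
        "$@"
}

# Send SIGTERM (15) to all processes of the unit, wait briefly, start it again.
restart_unit_cleanly() {
    unit="$1"
    systemd_call KillUnit "$unit" "all" 15 || return 1
    # Assumed grace period so the processes can exit; not from the thread.
    sleep "${KILL_GRACE_SECONDS:-2}"
    systemd_call StartUnit "$unit" "replace"
}
```

Usage would be `restart_unit_cleanly NetworkManager.service`. If KillUnit fails, the function returns without trying StartUnit, so a failed call is not hidden by the second one.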