Unreliable wifi connection

Hello all,

I have a small fleet (18 RPi Zero W) connected to a wifi network (actually there are 4 different login/psk pairs).

Some RPis lose their connection and cannot reconnect, even though an RPi nearby is connected just fine. Usually, if I add a hotspot w/ some other wifi that is currently not showing, it does connect.

What I wanted to do is:
- Check the status of the wifi (connected or not);
- Somehow reset the wifi if not connected (so it starts looking for connections afresh, similar to a complete reboot)

I thought about using ifconfig wlan0 to check connections and ifconfig wlan0 down && ifconfig wlan0 up to reset it.

But I think it would only work on the host, not in containers (even w/ privileged=true).

We use NetworkManager on balenaOS. You can have a look at how to enable dbus communication with the host here: https://www.balena.io/docs/learn/develop/runtime/#dbus-communication-with-host-os
Then you should be able to use nmcli in the container and do the introspection / network manipulation using that.

Thanks, I have enabled dbus and I’m playing around w/ python NetworkManager (seems a better option than directly interacting w/ the terminal).

I’ve found how to check the state of connections but have not found how to reset them. Could you share some ideas?

Do you mean something like nmcli dev wifi rescan ?

Seems not. As I understood, NetworkManager automatically scans already.

But I’m in a situation where sometimes the device is disconnected even when a perfect wifi connection is available and the device has the auth to connect.

At these times a reboot always resolves the issue, but I wanted to “reboot” only the connection… (The reason this problem occurs is not clear yet)

Hi,
What balenaOS version are you using on those RPis? Are you using any external WiFi dongle?
You should be able to restart the NetworkManager as a whole with the following dbus command:

DBUS_SYSTEM_BUS_ADDRESS=unix:path=/host/run/dbus/system_bus_socket dbus-send --system --print-reply --dest=org.freedesktop.systemd1 /org/freedesktop/systemd1 org.freedesktop.systemd1.Manager.RestartUnit string:'NetworkManager.service' string:'fail'

Give it a try and let us know whether it works for you.
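
For anyone who would rather trigger this from Python than from a shell, here is a minimal sketch that wraps the same dbus-send call. It assumes dbus-send is installed in the container and the host bus socket is mounted at the path used above; the function name restart_network_manager and the injectable run parameter are only illustrative.

```python
import os
import subprocess

# Host system bus socket, as used in the dbus-send command above.
HOST_BUS = "unix:path=/host/run/dbus/system_bus_socket"


def restart_network_manager(run=subprocess.run):
    """Ask the host's systemd, over D-Bus, to restart NetworkManager.service.

    Returns (ok, reply) where ok is True if dbus-send exited with status 0
    and reply is whatever dbus-send printed.
    """
    env = dict(os.environ, DBUS_SYSTEM_BUS_ADDRESS=HOST_BUS)
    cmd = [
        "dbus-send", "--system", "--print-reply",
        "--dest=org.freedesktop.systemd1",
        "/org/freedesktop/systemd1",
        "org.freedesktop.systemd1.Manager.RestartUnit",
        "string:NetworkManager.service",
        "string:fail",
    ]
    result = run(cmd, env=env, capture_output=True, text=True)
    return result.returncode == 0, result.stdout.strip()
```

Passing the command as a list avoids shell quoting issues, and the run parameter makes it possible to exercise the function without touching the real bus.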

I will not be able to tell if it fixed the problem, but the command ran. Output below:

method return time=1574103572.058136 sender=:1.1 -> destination=:1.80357 serial=2016701 reply_serial=2 object path "/org/freedesktop/systemd1/job/257564"

HOST OS VERSION
balenaOS 2.44.0+rev1

SUPERVISOR VERSION
10.3.7

The command I use to do what you want to do is:

nmcli c up CONNECTION_NAME

To find the connection name, run
nmcli c show

In my application, I spin off a python thread that is entirely devoted to waiting a set interval, checking for internet with the function below and, on failure, running the above command (which, if run with os.system, blocks until the command finishes), then checking again to confirm it worked and logging the result for statistics.

My interval is 20 seconds and my devices will “soft-reset” their network manager anywhere from 20 to several hundred times a day. I suspect this has something to do with the adapter I am using.

import socket

def internet(host="8.8.8.8", port=53, timeout=3):
    """
    source:
    https://stackoverflow.com/a/33117579/11116438
    Host: 8.8.8.8 (google-public-dns-a.google.com)
    OpenPort: 53/tcp
    Service: domain (DNS/TCP)
    """
    try:
        # create_connection applies the timeout to this socket only, and the
        # with-block closes it so repeated checks do not leak descriptors
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError as ex:
        print(ex)
        return False

It seems to remind the network manager that it can and should connect to a connection it has chosen to stop connecting to. Note that why it stops in the first place remains unknown to me.

Restarting the entire network manager service was never an option for me because it seems to forget that unmanaged devices exist after such a restart.
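
The check-reset-recheck loop described above could be sketched roughly like this. It is not the production code, just an illustration: the names bring_up, watchdog and start_watchdog are made up, and it assumes nmcli is reachable from the container.

```python
import logging
import socket
import subprocess
import threading


def internet(host="8.8.8.8", port=53, timeout=3):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def bring_up(connection_name):
    """Run `nmcli c up NAME`; True if nmcli exits with status 0."""
    result = subprocess.run(["nmcli", "c", "up", connection_name],
                            capture_output=True, text=True)
    return result.returncode == 0


def watchdog(connection_name, stop_event, interval=20,
             check=internet, reset=bring_up):
    """Check connectivity every `interval` seconds and reset on failure.

    Returns the number of resets attempted once stop_event is set.
    """
    resets = 0
    while not stop_event.is_set():
        if not check():
            resets += 1
            reset(connection_name)
            # Re-check right away so the log shows whether the reset helped.
            logging.info("soft reset #%d, online again: %s", resets, check())
        # wait() doubles as the sleep and returns early when asked to stop.
        stop_event.wait(interval)
    return resets


def start_watchdog(connection_name, **kwargs):
    """Launch watchdog() in a daemon thread; set the returned event to stop it."""
    stop_event = threading.Event()
    thread = threading.Thread(target=watchdog,
                              args=(connection_name, stop_event),
                              kwargs=kwargs, daemon=True)
    thread.start()
    return thread, stop_event
```

Calling start_watchdog("CONNECTION_NAME") from the main process would start the loop in the background. The check and reset parameters are there so the loop can be tried out without a real network.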

To access nmcli in a container you have to follow the instructions in the balena networking documentation.

Good luck, and if you ever find out more about what is going on let me know.

-Thomas

Thank you very much @tacLog .

Since I have several possible connections, I don’t want to manually set one of them (and check whether it is working, etc.).

What problems do you see w/ “it seems to forget that unmanaged devices exist after such a restart”?

I’m using logic similar to yours to detect whether it is connected, but thought about using nmcli radio wifi off && sleep 5 && nmcli radio wifi on to reset the connections…
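
A rough Python equivalent of that radio off/on idea, as an untested-on-hardware sketch. The function name cycle_wifi_radio and the injectable run/sleep parameters are only for illustration.

```python
import subprocess
import time


def cycle_wifi_radio(pause=5, run=subprocess.run, sleep=time.sleep):
    """Turn the wifi radio off, wait `pause` seconds, then turn it back on.

    Returns True only if both nmcli calls exited with status 0.
    """
    off = run(["nmcli", "radio", "wifi", "off"])
    sleep(pause)
    # Unlike the `&&` chain, always try to switch the radio back on, so a
    # failed "off" cannot leave the device without wifi.
    on = run(["nmcli", "radio", "wifi", "on"])
    return off.returncode == 0 and on.returncode == 0
```

The one deliberate difference from the shell one-liner is that the "on" step runs even if "off" reported an error.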

Hey @deoqc

It makes sense you don’t want to mess with connections if you don’t know which one is active. You could, as a last resort, try to bring each one up in turn and test between each. In addition, nmcli reports success or failure that you could parse.

As for the problems with restarting it: if you un-manage a device (nmcli dev set wlan9 managed no) then NetworkManager no longer touches that device. If you then restart NetworkManager, that device disappears from nmcli d s and can’t be managed again for use. This is only relevant if you want to do other things with your adapters, like monitor mode.
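
Trying each connection in turn could look something like the sketch below. It assumes the terse listing from nmcli -t -f NAME,TYPE c show reports wifi profiles with the type 802-11-wireless (the code also accepts wifi, in case a version prints that instead); the function names are made up for the example.

```python
import subprocess


def parse_wifi_connections(terse_output):
    """Extract wifi connection names from `nmcli -t -f NAME,TYPE c show` output."""
    names = []
    for line in terse_output.splitlines():
        # TYPE never contains a colon, so split on the last one; colons
        # inside NAME are escaped as "\:" in terse mode.
        name, _, ctype = line.rpartition(":")
        if ctype in ("802-11-wireless", "wifi"):
            names.append(name.replace("\\:", ":").replace("\\\\", "\\"))
    return names


def try_each(names, check, run=subprocess.run):
    """Bring up each connection in turn; return the first one that passes check()."""
    for name in names:
        # nmcli signals failure through a non-zero exit status.
        if run(["nmcli", "c", "up", name]).returncode == 0 and check():
            return name
    return None
```

Here check would be something like the internet() function from earlier in the thread, so a connection only counts as working if traffic actually gets through.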

nmcli radio wifi off sounds like it would work to me, but I have never tried it.

My favorite manual for nmcli doesn’t say much about it.

It does say that nmcli connection up ifname "$DEVICE" is a valid command, which would avoid you having to choose which connection you want to use. You just have to choose which adapter to use, which in your case is probably just wlan0.

Also if you just have one adapter, you could always just un-manage and manage it again. That would have to reset network manager.
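
Put together, those two suggestions could be tried in order of how disruptive they are. This is a sketch only, assuming a single adapter called wlan0; the function names are illustrative, and whether un-manage/manage really clears the stuck state is untested.

```python
import subprocess


def reconnect_device(ifname="wlan0", run=subprocess.run):
    """Ask NetworkManager to activate a suitable connection on ifname."""
    return run(["nmcli", "connection", "up", "ifname", ifname]).returncode == 0


def remanage_device(ifname="wlan0", run=subprocess.run):
    """Un-manage the device, then hand it back to NetworkManager."""
    run(["nmcli", "dev", "set", ifname, "managed", "no"])
    return run(["nmcli", "dev", "set", ifname, "managed", "yes"]).returncode == 0


def reset_wifi(ifname="wlan0", run=subprocess.run):
    """Try the gentle command first; fall back to un-manage/manage."""
    return reconnect_device(ifname, run) or remanage_device(ifname, run)
```

Because of the `or`, the un-manage/manage step only runs when nmcli connection up reports a failure.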

Let me know what ends up working for you. I would love to learn more in case we end up deploying on RPi Zero Ws in the future. (They were second on our short list)

-Thomas

Hi!
I’m really interested in running this python thread but just wonder where to put the code and how to run it. I’m new to BalenaCloud and have issues with wifi connectivity on four different Pis.

Hey @henrik
Are you referring to my mention of this thread here?

If so, that is part of the closed source production code base of the company I work for. I would have to seek permission to spin the WiFi connectivity checking and control thread off into a separate open source project. To be clear, I think I could get permission to do so, but I want to make sure that it will serve your need before doing so.

I think I can describe its function as it isn’t any secret.
This thread is designed to be spun off the main process and report and maintain the connectivity status.
It does several things to accomplish that.
First it writes a new network connection with nmcli that is adapter specific and sets it up:

print_command(f'nmcli connection add '
                      f'con-name {self.id} '
                      f'ifname {self.interface.name} '
                      f'type wifi '
                      f'autoconnect yes '
                      f'save no '
                      f'ssid {self.env["prod_ssid_master"]} '
                      f'802-11-wireless.cloned-mac {self.env.master_mac} '
                      f'802-11-wireless-security.auth-alg open '
                      f'802-11-wireless-security.key-mgmt wpa-psk '
                      f'802-11-wireless-security.psk {self.env["prod_psk_master"]}')

As you can see it relies on an object framework that fills in the important settings.
Then it can use that connection to accomplish some of the more useful debugging steps below.
The connectivity loop is probably more what you're after:
Over and over it checks for internet with

import socket

def internet(host="8.8.8.8", port=53, timeout=3):
    """
    source:
    https://stackoverflow.com/a/33117579/11116438
    Host: 8.8.8.8 (google-public-dns-a.google.com)
    OpenPort: 53/tcp
    Service: domain (DNS/TCP)
    """
    try:
        # create_connection applies the timeout to this socket only, and the
        # with-block closes it so repeated checks do not leak descriptors
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError as ex:
        print(ex)
        return False

If that fails we do what I called a soft_reset:

def soft_reset(env):
    # Flush the default route before re-activating the connection
    print_command('ip route flush 0/0')
    if not env.current_connection:
        # No tracked connection: fall back to a named one (lab vs production)
        if env['in_the_lab']:
            print_command('nmcli c up dm_debug')
        else:
            print_command(f'nmcli c up {env["primary_con_name"]}')
    else:
        env.current_connection.up()
    # Confirm the reset actually restored connectivity
    status = internet()
    if status:
        logging.info("Soft-reset complete, you should be able to see this")
    else:
        logging.info("Soft-reset failed")
    return status

That solves the bulk of solvable network issues in my experience. However, we ship with multiple adapters, and if there is anything NetworkManager sucks at, it is handling multiple wifi adapters with only one set of connection settings.
So the next step would be to rotate adapters by managing and un-managing them in NM.
This is probably not relevant to most people.

@henrik
What are your needs from this python thread?
What kind of problems are you looking to solve?
What would be the most useful way for me to share this code? My preference would be a simple python module that you can pull in from github and launch with a simple Thread() command.

-Thomas

Hi Thomas!

Yes, that was the thread I was referring to. We have tried a similar approach and will evaluate how it works!

Thanks!

// Henrik

As additional information, NetworkManager itself has a built-in connectivity check, and it is really easy to get the overall connectivity state of the device by using it.

You can check nmcli gen, or, to see just the precise value, nmcli -t -f CONNECTIVITY gen.

Alternatively, the same can be done using the D-Bus API (e.g. with a library like python-networkmanager). My personal preference is the API, since I do not have to install nmcli in the container, but both are perfectly fine.
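
As a small sketch of using that from Python (assuming nmcli is installed in the container and D-Bus access to the host is enabled), the terse command can be wrapped like this. NetworkManager reports the state as one of unknown, none, portal, limited or full; the function names are just for illustration.

```python
import subprocess


def connectivity(run=subprocess.run):
    """Return NetworkManager's connectivity state, e.g. 'full' or 'none'."""
    result = run(["nmcli", "-t", "-f", "CONNECTIVITY", "gen"],
                 capture_output=True, text=True)
    return result.stdout.strip()


def is_online(state):
    """Only 'full' means the connectivity check reached the internet."""
    return state == "full"
```

This could stand in for a hand-rolled socket check, with the caveat that it reflects whatever NetworkManager's own check last concluded.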

There is some related important information also in our documentation: https://www.balena.io/docs/reference/OS/network/2.x/#changing-the-network-at-runtime

Thanks,
Zahari

Hey all,

@henrik
I would be interested in your experience solving these issues because, while our fleet is much healthier than it used to be, there is still a percentage of devices that only connect during the night due to WiFi network congestion issues (as best we can tell).

Anything you learn while solving this might help me one day, so I would love to hear how you're doing.

@majorz
The D-Bus API seems the way to go if you really want to unlock the full configuration options of NetworkManager. But I picked nmcli because I could run and test the commands by hand, and I had never used D-Bus for anything and didn’t have time to learn how.

One thing I notice is that nmcli has poor error handling at best, and using it with a python script means at best errors get sent to the logs un-parsed. If I were to do the project again I would use the D-bus.

I didn’t know the nmcli gen commands. I will incorporate them into my stuff the next time I take a stab at further solving this issue on our fleet.

-Thomas