We have deployed about 20 Intel NUCs with balenaOS in a remote location. The devices keep going down, but we can’t seem to get any debug info; they simply lose connectivity. 4-5 devices are unreachable, while the rest are working fine. It just seems like they are frozen.
It doesn’t look like we can SSH into them from the local network either, unless we reboot them.
Hi,
The xhci_hcd errors suggest something wrong with your USB.
If you search for those errors, you may be able to find more info.
This may or may not be connected to your wifi issue.
Regardless, you should be able to run sysctl net.ipv4.tcp_ecn=0 or similar commands from your container; just hook it into your entrypoint and you should be good to go.
It’s not really good to go. I need to run sysctl net.ipv4.tcp_ecn=0 on the host OS, not in the container.
The host OS cuts off all the connections. The container just follows.
Is there any way to do that (I mean at the application level)? The reason is that these machines lose power once in a while and I don’t want to have to reapply this.
Yes, you’d need to apply that to the host OS. This can be done by adding net.ipv4.tcp_ecn=0 in a sysctl configuration file in the host OS’s filesystem. For each affected device, open a shell into the host OS and run:
(feel free to change tcp_ecn.conf to another file name. It doesn’t matter as long as it ends with .conf)
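The command itself did not survive in this thread. As a rough sketch only, assuming the host OS reads drop-in files from /etc/sysctl.d (the usual location on systemd-based systems, but an assumption here) and that this directory is writable, writing the setting could look like this. The helper name write_sysctl_conf and its arguments are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: write one sysctl setting into a drop-in .conf file.
# $1 = directory to write to (assumed to be /etc/sysctl.d on the host OS)
# $2 = file name, which must end in .conf
# $3 = the setting, e.g. net.ipv4.tcp_ecn=0
write_sysctl_conf() {
    dir=$1
    name=$2
    setting=$3
    # Refuse file names that do not end in .conf.
    case "$name" in
        *.conf) ;;
        *) echo "file name must end with .conf" >&2; return 1 ;;
    esac
    mkdir -p "$dir" && printf '%s\n' "$setting" > "$dir/$name"
}

# Example call on the host OS (the path is an assumption, not taken from this thread):
# write_sysctl_conf /etc/sysctl.d tcp_ecn.conf net.ipv4.tcp_ecn=0
```

Keeping the directory as an argument makes it easy to try the helper against a scratch directory before touching a real device.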
After the device is rebooted, check that the setting has been successfully applied:
$ sysctl net.ipv4.tcp_ecn
net.ipv4.tcp_ecn = 0
I’d recommend trying this out on a single device first and seeing if it solves your issue before going through the trouble of applying this setting to all affected devices.
I noticed that if I update balenaOS, the persistent settings go away.
Is there any way to make the sysctl tweaks persist even after an OS update, so I don’t have to go into each machine and set them again whenever I update the OS?
I have one more question. I have been looking, but I can’t seem to figure out how to add a script that runs on boot on the host OS.
By using the development version, we figured out that the boxes don’t actually freeze; the network gets cut off for some of them. If we simply reboot the device when that happens, everything comes back online.
I made a script that checks whether connectivity has been down for more than 15 minutes. At that point it’s fairly safe to say the box will stay unreachable, so the script reboots it.
What would be the best way of having this script run at boot?
Is this something that you can do with a service that starts first (before other services), checks something, and then exits? Or do you absolutely need it in balenaOS?
If a simple service can work for you, check the depends_on field in docker compose.
The script needs to start after the machine has been powered on, and it checks once every x minutes whether the device is still online.
If connectivity has been lost for 15-30 minutes, the script will reboot the machine automatically. I haven’t tried to see if restarting the network would bring the machine back online, though. If that works, I’d replace the reboot command with a network restart.
This is a script that runs constantly.
Here’s the code I am planning on using:
#!/bin/bash

check_connectivity() {
    # Test wget's exit status directly. Echoing $? first would overwrite it,
    # so the check would always report success.
    if wget -q --spider http://google.com; then
        echo "$(date) 1" | tee -a /mnt/data/connectivity/connectivity.log
        # Online: empty the status file so the failure count starts from zero.
        : > /mnt/data/connectivity/status
    else
        echo "$(date) 0" | tee -a /mnt/data/connectivity/connectivity.log
        # Offline: append one line per failed check.
        echo 0 >> /mnt/data/connectivity/status
    fi
}

fix_connectivity() {
    # Number of consecutive failed checks recorded so far.
    status=$(wc -l < /mnt/data/connectivity/status)
    if [[ $status -ge 5 ]]; then
        ### connectivity failed to get restored within 10 minutes. box likely dead.
        ## reboot
        /sbin/reboot
        ## restart NetworkManager is also an option
        # systemctl restart NetworkManager
    fi
}
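The two functions above still need something that calls them every few minutes. Here is a minimal sketch of such a loop, assuming a 2-minute interval (five failed checks would then be roughly the 10 minutes mentioned in the comment; the post does not state the interval). The name monitor_loop and the optional round limit are hypothetical.

```shell
#!/bin/sh
# Hypothetical driver loop. check_connectivity and fix_connectivity are the
# functions from the script above and must be defined before this is called.
# $1 = seconds between checks (assumed value, not given in the post)
# $2 = optional number of rounds, mainly for dry runs; empty means run forever
monitor_loop() {
    interval=$1
    rounds=${2:-}
    i=0
    while [ -z "$rounds" ] || [ "$i" -lt "$rounds" ]; do
        check_connectivity
        fix_connectivity
        i=$((i + 1))
        sleep "$interval"
    done
}

# Example: check every 120 seconds, forever.
# monitor_loop 120
```

The round limit is only there so the loop can be tried with stubbed functions before it is allowed to reboot anything.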
I think this is something you can do using a small service. Using the supervisor API, you can restart the device from this service - if the network fails as per your constraints.
You shouldn’t have to make changes to balenaOS for this - as that would be a lot more complicated than having a service that monitors this. You can look at different network modes for docker services - so that you can get direct access to the network of the OS - in case that’s something that you need.
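As a sketch of the Supervisor API route: balena documents a POST /v1/reboot endpoint, reached through the BALENA_SUPERVISOR_ADDRESS and BALENA_SUPERVISOR_API_KEY environment variables, which are made available to a service that carries the io.balena.features.supervisor-api label. The function names below are hypothetical, and the snippet assumes curl is installed in the container image.

```shell
#!/bin/sh
# Build the reboot URL from the two Supervisor environment variables.
supervisor_reboot_url() {
    printf '%s/v1/reboot?apikey=%s' "$BALENA_SUPERVISOR_ADDRESS" "$BALENA_SUPERVISOR_API_KEY"
}

# Ask the Supervisor to reboot the device (hypothetical wrapper around curl).
request_reboot() {
    curl -sS -X POST -H "Content-Type: application/json" "$(supervisor_reboot_url)"
}

# In a monitoring service, request_reboot would take the place of /sbin/reboot.
```

Because the Supervisor runs on the device itself, this request should not depend on the outside connectivity that the script is checking.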