debugging machines going down

hey guys, so we have been debugging this and implemented an automatic reboot via the supervisor, but the NUCs keep going down.

  • we tested rebooting the routers when we had about 15-20 NUCs down, thinking they would come back online if it was a router problem. they didn’t.
  • here’s a screenshot with the last output I can get from one of these NUCs. they are connected via wifi, and the wifi just seems to go down. the automated reboot doesn’t always bring them back online either, so I am at a loss here.


any tips on how to debug this further or get closer to a solution would be appreciated.

@katmai I’m sorry to hear you’re still struggling with this. I know you’ve checked journalctl in the past after enabling persistent logging, but can you tell us specifically what you see when running journalctl -u NetworkManager from the hostOS? (The unit name is case-sensitive.) It would be helpful to see those specific logs if you have them, as well as any insights you’ve since gained from Datadog.

Hi, some things that might help debugging the problem:

  • Keep persistentLogging on as you have now
  • Add the following to the kernel command line (note this is removed in kernel v5.8):
iwlwifi.fw_monitor=1
  • Enable NetworkManager verbose logging from the hostOS:
nmcli general logging level DEBUG domains ALL
  • Check the logs for kernel oops messages
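
For the last step, here is a rough sketch of how the kernel log could be filtered for crash signatures. The function name filter_kernel_crashes and the pattern list are my own assumptions, not an official balena tool, and the patterns are only a starting point (they cover common oops, panic, lockup and iwlwifi error lines, but will not catch everything):

```shell
#!/bin/sh
# Sketch only: filter_kernel_crashes is a hypothetical helper name.
# Reads log lines on stdin and prints the ones that look like kernel
# crash or wifi firmware trouble. The pattern list is an assumption
# and will likely need tuning for your devices.
filter_kernel_crashes() {
    grep -E -i 'oops|BUG:|kernel panic|call trace|soft lockup|blocked for more than|iwlwifi.*(error|firmware|timeout)'
}

# Example usage on the hostOS (needs persistent logging so that the
# previous boot's journal is still available):
#   journalctl -k -b -1 --no-pager | filter_kernel_crashes
#   journalctl -u NetworkManager -b -1 --no-pager | tail -n 200
```

Looking at the previous boot (-b -1) is usually more useful than the current one here, since the interesting messages are the ones logged just before the device dropped off.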

hi, just a quick update on this.

a common theme was that the cpu was maxed out, so we used cpuset to leave 1 core (core 0) free for balena, and our containers now run on cores 1-3 only.

the nucs have been stable since this change.

I’m guessing our app was hogging the resources and then docker would just freeze.
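
For anyone landing here later, the cpuset change described above could look roughly like this. It is a sketch under assumptions: app_cpuset is a made-up helper name, my-app-image is a placeholder, and it assumes core 0 is the one to keep free. In a docker-compose.yml the equivalent should be the cpuset field on the service (e.g. cpuset: "1-3"), but check that against the compose fields your platform supports.

```shell
#!/bin/sh
# Sketch only: app_cpuset is a hypothetical helper.
# Given the total number of cores, print a cpuset string that leaves
# core 0 free for the OS and supervisor, e.g. 4 cores -> "1-3".
app_cpuset() {
    total=$1
    if [ "$total" -lt 2 ]; then
        echo "need at least 2 cores to reserve one" >&2
        return 1
    fi
    last=$((total - 1))
    if [ "$last" -eq 1 ]; then
        echo "1"
    else
        echo "1-$last"
    fi
}

# Example usage with the plain docker CLI (image name is a placeholder):
#   docker run --cpuset-cpus="$(app_cpuset "$(nproc)")" my-app-image
```

Pinning by core list rather than by CPU share means the app can still max out its own cores, but it can never take the last one away from the engine and the networking stack.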

Hi

Thanks for the update! That’s quite an interesting solution, and good to know it works for you!