debugging machines going down

katmai · June 10, 2021, 4:27pm

hey guys, so we have been going and debugging this, implemented an automatic reboot via supervisor, but the nucs keep going down.

we tested rebooting the routers when we had about 15-20 NUCs down, thinking they would come back online if there was a router problem. they didn’t.
here’s a screenshot with the last output I can get from one of these nucs. they are connected via wifi, and the wifi just seems to go down. the automated reboot doesn’t always bring them back online either, so I am at a relative loss here.

any tips on how to debug further or get closer to a solution would be appreciated.

the-real-kenna · June 11, 2021, 1:09am

@katmai I’m sorry to hear you’re still struggling with this. I know you’ve checked journalctl in the past after enabling persistent logging, but can you tell us specifically what you see when running journalctl -u NetworkManager from the hostOS? It would be helpful to see those specific logs if you have them, as well as any insights you’ve since gained from Datadog.

alexgg · June 23, 2021, 4:48pm

Hi, some things that might help debugging the problem:

Keep persistentLogging on as you have now
Add the following to the kernel command line (note that this parameter was removed in kernel v5.8):

iwlwifi.fw_monitor=1

Enable NetworkManager verbose logging from the hostOS:

nmcli general logging level DEBUG domain ALL

Check the logs for kernel oops messages
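As a rough sketch of that last step: with persistent logging on, the kernel messages from the boot before a reboot can be piped through a small filter. The function name and the grep pattern below are my own assumptions and not a complete list of what iwlwifi can print, so adjust them to whatever shows up in your logs.

```shell
#!/bin/sh
# Keep only lines that look like a kernel oops or an iwlwifi failure.
# filter_kernel_errors is a hypothetical helper; the pattern is a starting
# point, not an exhaustive list of kernel or driver error strings.
filter_kernel_errors() {
    grep -iE 'oops|BUG:|call trace|iwlwifi.*(error|fail|timeout)'
}

# Usage on the hostOS (assumes persistent logging is enabled, so the
# previous boot is still in the journal):
#   journalctl -k -b -1 --no-pager | filter_kernel_errors
```

journalctl -k restricts output to kernel messages and -b -1 selects the previous boot, which is the one that ended with the device offline.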

katmai · June 24, 2021, 12:50pm

hi, just a quick update on this.

a common theme was that the CPU was maxed out, so we used cpuset to leave 1 core free for balena, and our containers now run on cores 1-3 only.

the nucs have been stable since this change.

I’m guessing our app was hogging the resources and docker would then just freeze.
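For anyone wanting to confirm that a pinning like this took effect, here is a minimal sketch that reads the allowed CPU list of a process from /proc. The helper name is made up for illustration, and it assumes a Linux host where /proc/<pid>/status has a Cpus_allowed_list field.

```shell
#!/bin/sh
# Print the CPU list a process is allowed to run on, e.g. "1-3".
# allowed_cpus is a hypothetical helper. It takes the path to a
# /proc/<pid>/status file and defaults to the current process.
allowed_cpus() {
    status_file="${1:-/proc/self/status}"
    awk '/^Cpus_allowed_list:/ { print $2 }' "$status_file"
}

# Usage (replace <pid> with the PID of a process inside the container):
#   allowed_cpus /proc/<pid>/status
```

If the container was started with something like docker run --cpuset-cpus="1-3" (or the equivalent cpuset setting in a compose file), the helper should print 1-3 for processes in that container.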

anujdeshpande · June 28, 2021, 6:03am

Hi

Thanks for the update! That’s quite an interesting solution, and good to know it works for you!
