debugging machines going down

katmai · February 8, 2021, 4:12pm

hi guys,

we have deployed about 20 intel nucs with balenaOS in a remote location. We have been experiencing the devices going down, but we can’t seem to have any debug info. they simply lose connectivity. 4-5 devices are unreachable, while the rest are working fine. it just seems like they are frozen.

it doesn’t look like we can SSH into them from the local network either, unless we reboot them.

uuid:
c081b603ce4d87562166e5b02d9c8c29
0502c9a83cf93a1a7fd740e821e94d84

do you guys have any pointers on what info might help in figuring this out?

thanks.

TJvV · February 9, 2021, 8:16am

Hi,

Have you tried enabling persistent logging and then checking journalctl from the Host OS after rebooting?

I had an issue with a bunch of Raspberry Pi’s a while back and this helped me identify the power supply as the source of the issue.

katmai · February 9, 2021, 2:30pm

thanks, i just did that, in addition to adding datadog. maybe something will pop up during next downtime.
will keep you posted.

katmai · February 17, 2021, 6:24pm

after enabling persistent logging we found this on a machine that’s crashing more frequently. it’s also running the latest 2.68 version.

Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: Looking for event-dma 000000022d43ee50 trb-start 000000022d43f000 trb-end 000000022d43f000 seg-start 000000022d43f000 seg-end 000000022d43fff0
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: Looking for event-dma 000000022d43ee60 trb-start 000000022d43f000 trb-end 000000022d43f000 seg-start 000000022d43f000 seg-end 000000022d43fff0
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: Looking for event-dma 000000022d43ee70 trb-start 000000022d43f000 trb-end 000000022d43f000 seg-start 000000022d43f000 seg-end 000000022d43fff0
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: Looking for event-dma 000000022d43ee80 trb-start 000000022d43f000 trb-end 000000022d43f000 seg-start 000000022d43f000 seg-end 000000022d43fff0
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: Looking for event-dma 000000022d43ee90 trb-start 000000022d43f000 trb-end 000000022d43f000 seg-start 000000022d43f000 seg-end 000000022d43fff0
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: Looking for event-dma 000000022d43eea0 trb-start 000000022d43f000 trb-end 000000022d43f000 seg-start 000000022d43f000 seg-end 000000022d43fff0
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: Looking for event-dma 000000022d43eeb0 trb-start 000000022d43f000 trb-end 000000022d43f000 seg-start 000000022d43f000 seg-end 000000022d43fff0
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: Looking for event-dma 000000022d43eec0 trb-start 000000022d43f000 trb-end 000000022d43f000 seg-start 000000022d43f000 seg-end 000000022d43fff0
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: Looking for event-dma 000000022d43eed0 trb-start 000000022d43f000 trb-end 000000022d43f000 seg-start 000000022d43f000 seg-end 000000022d43fff0
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: Looking for event-dma 000000022d43eee0 trb-start 000000022d43f000 trb-end 000000022d43f000 seg-start 000000022d43f000 seg-end 000000022d43fff0
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: Looking for event-dma 000000022d43eef0 trb-start 000000022d43f000 trb-end 000000022d43f000 seg-start 000000022d43f000 seg-end 000000022d43fff0
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: Looking for event-dma 000000022d43ef00 trb-start 000000022d43f000 trb-end 000000022d43f000 seg-start 000000022d43f000 seg-end 000000022d43fff0
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: Looking for event-dma 000000022d43ef10 trb-start 000000022d43f000 trb-end 000000022d43f000 seg-start 000000022d43f000 seg-end 000000022d43fff0
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: Looking for event-dma 000000022d43ef20 trb-start 000000022d43f000 trb-end 000000022d43f000 seg-start 000000022d43f000 seg-end 000000022d43fff0
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: Looking for event-dma 000000022d43ef30 trb-start 000000022d43f000 trb-end 000000022d43f000 seg-start 000000022d43f000 seg-end 000000022d43fff0
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: Looking for event-dma 000000022d43ef40 trb-start 000000022d43f000 trb-end 000000022d43f000 seg-start 000000022d43f000 seg-end 000000022d43fff0
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: Looking for event-dma 000000022d43ef50 trb-start 000000022d43f000 trb-end 000000022d43f000 seg-start 000000022d43f000 seg-end 000000022d43fff0
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: Looking for event-dma 000000022d43ef60 trb-start 000000022d43f000 trb-end 000000022d43f000 seg-start 000000022d43f000 seg-end 000000022d43fff0
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: Looking for event-dma 000000022d43ef70 trb-start 000000022d43f000 trb-end 000000022d43f000 seg-start 000000022d43f000 seg-end 000000022d43fff0
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: Looking for event-dma 000000022d43ef80 trb-start 000000022d43f000 trb-end 000000022d43f000 seg-start 000000022d43f000 seg-end 000000022d43fff0
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: Looking for event-dma 000000022d43ef90 trb-start 000000022d43f000 trb-end 000000022d43f000 seg-start 000000022d43f000 seg-end 000000022d43fff0
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: Looking for event-dma 000000022d43efa0 trb-start 000000022d43f000 trb-end 000000022d43f000 seg-start 000000022d43f000 seg-end 000000022d43fff0
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: Looking for event-dma 000000022d43efb0 trb-start 000000022d43f000 trb-end 000000022d43f000 seg-start 000000022d43f000 seg-end 000000022d43fff0
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: Looking for event-dma 000000022d43efc0 trb-start 000000022d43f000 trb-end 000000022d43f000 seg-start 000000022d43f000 seg-end 000000022d43fff0
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: Looking for event-dma 000000022d43efd0 trb-start 000000022d43f000 trb-end 000000022d43f000 seg-start 000000022d43f000 seg-end 000000022d43fff0

katmai · February 17, 2021, 6:27pm

another error i found is the one below. the machines are on the same network. majority of other devices are fine.

Feb 17 18:13:53 67902a2 kernel: iwlwifi 0000:02:00.0: Failed to wake NIC for hcmd
Feb 17 18:13:53 67902a2 kernel: iwlwifi 0000:02:00.0: Error sending STATISTICS_CMD: enqueue_hcmd failed: -5
Feb 17 18:13:53 67902a2 kernel: iwlwifi 0000:02:00.0: Failed to wake NIC for hcmd
Feb 17 18:13:53 67902a2 kernel: iwlwifi 0000:02:00.0: Error sending STATISTICS_CMD: enqueue_hcmd failed: -5
Feb 17 18:13:59 67902a2 kernel: iwlwifi 0000:02:00.0: Failed to wake NIC for hcmd
Feb 17 18:13:59 67902a2 kernel: iwlwifi 0000:02:00.0: Error sending STATISTICS_CMD: enqueue_hcmd failed: -5
Feb 17 18:13:59 67902a2 kernel: iwlwifi 0000:02:00.0: Failed to wake NIC for hcmd
Feb 17 18:13:59 67902a2 kernel: iwlwifi 0000:02:00.0: Error sending STATISTICS_CMD: enqueue_hcmd failed: -5
Feb 17 18:14:04 67902a2 cc47f4472d22[1090]: [error]   LogBackend: unexpected error: Error: getaddrinfo EAI_AGAIN api.balena-cloud.com
Feb 17 18:14:04 67902a2 cc47f4472d22[1090]: [error]         at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:64:26)
Feb 17 18:14:04 67902a2 resin-supervisor[2955]: [error]   LogBackend: unexpected error: Error: getaddrinfo EAI_AGAIN api.balena-cloud.com
Feb 17 18:14:04 67902a2 resin-supervisor[2955]: [error]         at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:64:26)

katmai · February 17, 2021, 7:28pm

i came across this thread: [SOLVED] iwlwifi Disconnecting Issues / Kernel & Hardware / Arch Linux Forums

i have manually set

sysctl net.ipv4.tcp_ecn=0

i wonder if this solves it. now I’m looking for ways to make this change permanent in case this turns out to keep the machines from going down.

TJvV · February 18, 2021, 7:51am

Hi,
The xhci_hcd errors suggest something wrong with your USB.
If you search for those errors, you may be able to find more info.
This may or may not be connected to your wifi issue.

Regardless, you should be able to do that sysctl net.ipv4.tcp_ecn=0 or similar commands from your container; just hook it up in your entrypoint and you should be good to go.

katmai · February 18, 2021, 2:28pm

it’s not really good to go. i need to run sysctl net.ipv4.tcp_ecn=0 on the host os not on the container.

the host os cuts off all the connections. the container just follows.

any way to do that? (i mean at application level) the reason is that these machines lose power once in a while and i don’t want to have to reapply this.

nghiant2710 · February 19, 2021, 8:01am

Hi,

I think you can set it in your Docker-compose file by adding something like this to your service.

sysctls:
      net.ipv4.tcp_ecn: 0

katmai · February 19, 2021, 8:17am

Doesn’t this add it to the service container, not to the host? which is not what i am looking for?

Ereski · February 24, 2021, 6:19pm

Yes, you’d need to apply that to the host OS. This can be done by adding net.ipv4.tcp_ecn=0 in a sysctl configuration file in the host OS’s filesystem. For each affected device, open a shell into the host OS and run:

$ mount -o remount,rw /
$ echo 'net.ipv4.tcp_ecn=0' > /etc/sysctl.d/tcp_ecn.conf
$ reboot

(feel free to change tcp_ecn.conf to another file name. It doesn’t matter as long as it ends with .conf)

After the device is rebooted, check that the setting has been successfully applied:

$ sysctl net.ipv4.tcp_ecn
net.ipv4.tcp_ecn = 0

I’d recommend trying this out on a single device first and seeing if it solves your issue before going through the trouble of applying this setting to affected devices.

katmai · February 25, 2021, 10:35am

thank you! this works wonderfully. i’ll keep you posted if the issues persist. hopefully this fixes all the problems.

have an awesome day!

rahul-thakoor · February 25, 2021, 10:54am

Glad it works. let us know if there are issues

katmai · March 1, 2021, 8:16pm

i have 1 more question:

i noticed that if i update balenaOS, the persistent settings go away.
is there any way to add the sysctl tweaks to persist even after an OS update, so i don’t have to go into each machine and set them again if i have to update the OS?

thank you very much for your time.

Ereski · March 2, 2021, 6:45pm

As far as I know, no. But you could use balena ssh in a script to do that. Something like:

while read id; do
 balena ssh $id <<EOF
mount -o remount,rw /
echo 'net.ipv4.tcp_ecn=0' > /etc/sysctl.d/tcp_ecn.conf
reboot
EOF
done

And then you could call this script with the UUID of all affected devices sent to stdin.

katmai · March 15, 2021, 5:29pm

thanks for this. didn’t think of it

katmai · March 18, 2021, 6:52am

i have one more question. i have been looking but i can’t seem to figure out how to add a script that runs on boot on the host OS.

We figured out, by using the development version, that the boxes don’t actually freeze, but the network gets cut off, for some of them. if we simply reboot the device when that happens, everything comes back online.

i made a script that checks if connectivity has been down for more than 15 minutes, and then it’s sort of safe to say the box will be unreachable, and the script reboots the box.

what would be the best way of having this script running at boot?

Thanks in advance.

anujdeshpande · March 18, 2021, 6:55am

Hi

Is this something that you can do with a service that starts first - before other services - checks something, and then proceeds to exit. Or do you absolutely need it in balenaOS?

If a simple service can work for you, check the depends_on field in docker compose.

katmai · March 18, 2021, 7:18am

the script needs to start after the machine has been powered on, and checks once every x minutes if the device is still online.

if connectivity has been lost for 15-30 minutes, the script will reboot the machine automatically. i haven’t tried to see if restarting the network would bring the machine online though. if that works, i’d replace the reboot command with a network restart.

this is a script that runs constantly.
here’s the code i am planning on using:

#!/bin/bash

check_connectivity() {

wget -q --spider http://google.com ; echo $?

if [[ $? -eq 0 ]]; then

    echo $(date) "1" | tee -a /mnt/data/connectivity/connectivity.log

    echo OK > /mnt/data/connectivity/status

else

    echo $(date) "0" | tee -a /mnt/data/connectivity/connectivity.log

    echo 0 >> /mnt/data/connectivity/status

fi

}

fix_connectivity() {

status=$(cat /mnt/data/connectivity/status |wc -l)

if [[ $status -ge 5 ]]; then

    ### connectivity failed to get restored within 10 minutes. box likely dead.

    ## reboot

    /sbin/reboot

    ## restart NetworkManager is also an option

    # systemctl restart NetworkManager

fi

}

while true

do

check_connectivity

sleep 60

fix_connectivity

done

anujdeshpande · March 18, 2021, 7:33am

I think this is something you can do using a small service. Using the supervisor API, you can restart the device from this service - if the network fails as per your constraints.

You shouldn’t have to make changes to balenaOS for this - as that would be a lot more complicated than having a service that monitors this. You can look at different network modes for docker services - so that you can get direct access to the network of the OS - in case that’s something that you need.

Topic		Replies	Views
My Intel NUC Kit NUC6CAYS frequently goes offline / down balenaOS	33	2852	January 21, 2020
Containers do not come up after the reboot Product support	16	714	September 25, 2020
Balena Containers leaving Bridge network balenaOS support , docker , network	25	1574	December 28, 2021
Balena Device only showing up on one router Product support	22	1269	February 19, 2020
Error: Request error: tunneling socket could not be established, cause=socket hang up Product support	12	1955	February 14, 2020

debugging machines going down

Related topics