debugging machines going down

hi guys,

we have deployed about 20 intel nucs with balenaOS in a remote location. We have been experiencing the devices going down, but we can’t seem to have any debug info. they simply lose connectivity. 4-5 devices are unreachable, while the rest are working fine. it just seems like they are frozen.

it doesn’t look like we can SSH into them from the local network either, unless we reboot them.

uuid:
c081b603ce4d87562166e5b02d9c8c29
0502c9a83cf93a1a7fd740e821e94d84

do you guys have any pointers on what info might help in figuring this out?

thanks.

Hi,

Have you tried enabling persistent logging and then checking journalctl from the Host OS after rebooting?

I had an issue with a bunch of Raspberry Pi’s a while back and this helped me identify the power supply as the source of the issue.

thanks, i just did that, in addition to adding datadog. maybe something will pop up during next downtime.
will keep you posted.

1 Like

after enabling persistent logging we found this on a machine that’s crashing more frequently. it’s also running the latest 2.68 version.

Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: Looking for event-dma 000000022d43ee50 trb-start 000000022d43f000 trb-end 000000022d43f000 seg-start 000000022d43f000 seg-end 000000022d43fff0
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: Looking for event-dma 000000022d43ee60 trb-start 000000022d43f000 trb-end 000000022d43f000 seg-start 000000022d43f000 seg-end 000000022d43fff0
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: Looking for event-dma 000000022d43ee70 trb-start 000000022d43f000 trb-end 000000022d43f000 seg-start 000000022d43f000 seg-end 000000022d43fff0
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: Looking for event-dma 000000022d43ee80 trb-start 000000022d43f000 trb-end 000000022d43f000 seg-start 000000022d43f000 seg-end 000000022d43fff0
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: Looking for event-dma 000000022d43ee90 trb-start 000000022d43f000 trb-end 000000022d43f000 seg-start 000000022d43f000 seg-end 000000022d43fff0
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: Looking for event-dma 000000022d43eea0 trb-start 000000022d43f000 trb-end 000000022d43f000 seg-start 000000022d43f000 seg-end 000000022d43fff0
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: Looking for event-dma 000000022d43eeb0 trb-start 000000022d43f000 trb-end 000000022d43f000 seg-start 000000022d43f000 seg-end 000000022d43fff0
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: Looking for event-dma 000000022d43eec0 trb-start 000000022d43f000 trb-end 000000022d43f000 seg-start 000000022d43f000 seg-end 000000022d43fff0
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: Looking for event-dma 000000022d43eed0 trb-start 000000022d43f000 trb-end 000000022d43f000 seg-start 000000022d43f000 seg-end 000000022d43fff0
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: Looking for event-dma 000000022d43eee0 trb-start 000000022d43f000 trb-end 000000022d43f000 seg-start 000000022d43f000 seg-end 000000022d43fff0
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: Looking for event-dma 000000022d43eef0 trb-start 000000022d43f000 trb-end 000000022d43f000 seg-start 000000022d43f000 seg-end 000000022d43fff0
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: Looking for event-dma 000000022d43ef00 trb-start 000000022d43f000 trb-end 000000022d43f000 seg-start 000000022d43f000 seg-end 000000022d43fff0
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: Looking for event-dma 000000022d43ef10 trb-start 000000022d43f000 trb-end 000000022d43f000 seg-start 000000022d43f000 seg-end 000000022d43fff0
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: Looking for event-dma 000000022d43ef20 trb-start 000000022d43f000 trb-end 000000022d43f000 seg-start 000000022d43f000 seg-end 000000022d43fff0
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: Looking for event-dma 000000022d43ef30 trb-start 000000022d43f000 trb-end 000000022d43f000 seg-start 000000022d43f000 seg-end 000000022d43fff0
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: Looking for event-dma 000000022d43ef40 trb-start 000000022d43f000 trb-end 000000022d43f000 seg-start 000000022d43f000 seg-end 000000022d43fff0
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: Looking for event-dma 000000022d43ef50 trb-start 000000022d43f000 trb-end 000000022d43f000 seg-start 000000022d43f000 seg-end 000000022d43fff0
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: Looking for event-dma 000000022d43ef60 trb-start 000000022d43f000 trb-end 000000022d43f000 seg-start 000000022d43f000 seg-end 000000022d43fff0
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: Looking for event-dma 000000022d43ef70 trb-start 000000022d43f000 trb-end 000000022d43f000 seg-start 000000022d43f000 seg-end 000000022d43fff0
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: Looking for event-dma 000000022d43ef80 trb-start 000000022d43f000 trb-end 000000022d43f000 seg-start 000000022d43f000 seg-end 000000022d43fff0
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: Looking for event-dma 000000022d43ef90 trb-start 000000022d43f000 trb-end 000000022d43f000 seg-start 000000022d43f000 seg-end 000000022d43fff0
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: Looking for event-dma 000000022d43efa0 trb-start 000000022d43f000 trb-end 000000022d43f000 seg-start 000000022d43f000 seg-end 000000022d43fff0
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: Looking for event-dma 000000022d43efb0 trb-start 000000022d43f000 trb-end 000000022d43f000 seg-start 000000022d43f000 seg-end 000000022d43fff0
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: Looking for event-dma 000000022d43efc0 trb-start 000000022d43f000 trb-end 000000022d43f000 seg-start 000000022d43f000 seg-end 000000022d43fff0
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
Feb 17 18:14:23 67902a2 kernel: xhci_hcd 0000:00:15.0: Looking for event-dma 000000022d43efd0 trb-start 000000022d43f000 trb-end 000000022d43f000 seg-start 000000022d43f000 seg-end 000000022d43fff0

another error i found is the one below. the machines are on the same network. majority of other devices are fine.

Feb 17 18:13:53 67902a2 kernel: iwlwifi 0000:02:00.0: Failed to wake NIC for hcmd
Feb 17 18:13:53 67902a2 kernel: iwlwifi 0000:02:00.0: Error sending STATISTICS_CMD: enqueue_hcmd failed: -5
Feb 17 18:13:53 67902a2 kernel: iwlwifi 0000:02:00.0: Failed to wake NIC for hcmd
Feb 17 18:13:53 67902a2 kernel: iwlwifi 0000:02:00.0: Error sending STATISTICS_CMD: enqueue_hcmd failed: -5
Feb 17 18:13:59 67902a2 kernel: iwlwifi 0000:02:00.0: Failed to wake NIC for hcmd
Feb 17 18:13:59 67902a2 kernel: iwlwifi 0000:02:00.0: Error sending STATISTICS_CMD: enqueue_hcmd failed: -5
Feb 17 18:13:59 67902a2 kernel: iwlwifi 0000:02:00.0: Failed to wake NIC for hcmd
Feb 17 18:13:59 67902a2 kernel: iwlwifi 0000:02:00.0: Error sending STATISTICS_CMD: enqueue_hcmd failed: -5
Feb 17 18:14:04 67902a2 cc47f4472d22[1090]: [error]   LogBackend: unexpected error: Error: getaddrinfo EAI_AGAIN api.balena-cloud.com
Feb 17 18:14:04 67902a2 cc47f4472d22[1090]: [error]         at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:64:26)
Feb 17 18:14:04 67902a2 resin-supervisor[2955]: [error]   LogBackend: unexpected error: Error: getaddrinfo EAI_AGAIN api.balena-cloud.com
Feb 17 18:14:04 67902a2 resin-supervisor[2955]: [error]         at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:64:26)

i came across this thread: [SOLVED] iwlwifi Disconnecting Issues / Kernel & Hardware / Arch Linux Forums

i have manually set

sysctl net.ipv4.tcp_ecn=0

i wonder if this solves it. now I’m looking for ways to make this change permanent in case this turns out to keep the machines from going down.

1 Like

Hi,
The xhci_hcd errors suggest something wrong with your USB.
If you search for those errors, you may be able to find more info.
This may or may not be connected to your wifi issue.

Regardless, you should be able to do that sysctl net.ipv4.tcp_ecn=0 or similar commands from your container; just hook it up in your entrypoint and you should be good to go.

it’s not really good to go. i need to run sysctl net.ipv4.tcp_ecn=0 on the host os not on the container.

the host os cuts off all the connections. the container just follows.

any way to do that? (i mean at application level) the reason is that these machines lose power once in a while and i don’t want to have to reapply this.

Hi,

I think you can set it in your Docker-compose file by adding something like this to your service.

sysctls:
      net.ipv4.tcp_ecn: 0
1 Like

Doesn’t this add it to the service container, not to the host? which is not what i am looking for?

Yes, you’d need to apply that to the host OS. This can be done by adding net.ipv4.tcp_ecn=0 in a sysctl configuration file in the host OS’s filesystem. For each affected device, open a shell into the host OS and run:

$ mount -o remount,rw /
$ echo 'net.ipv4.tcp_ecn=0' > /etc/sysctl.d/tcp_ecn.conf
$ reboot

(feel free to change tcp_ecn.conf to another file name. It doesn’t matter as long as it ends with .conf)

After the device is rebooted, check that the setting has been successfully applied:

$ sysctl net.ipv4.tcp_ecn
net.ipv4.tcp_ecn = 0

I’d recommend trying this out on a single device first and seeing if it solves your issue before going through the trouble of applying this setting to affected devices.

thank you! this works wonderfully. i’ll keep you posted if the issues persist. hopefully this fixes all the problems.

have an awesome day!

Glad it works. let us know if there are issues :slight_smile:

i have 1 more question:

i noticed that if i update balenaOS, the persistent settings go away.
is there any way to add the sysctl tweaks to persist even after an OS update, so i don’t have to go into each machine and set them again if i have to update the OS?

thank you very much for your time.

As far as I know, no. But you could use balena ssh in a script to do that. Something like:

while read id; do
 balena ssh $id <<EOF
mount -o remount,rw /
echo 'net.ipv4.tcp_ecn=0' > /etc/sysctl.d/tcp_ecn.conf
reboot
EOF
done

And then you could call this script with the UUID of all affected devices sent to stdin.

1 Like

thanks for this. didn’t think of it :slight_smile:

i have one more question. i have been looking but i can’t seem to figure out how to add a script that runs on boot on the host OS.

We figured out, by using the development version, that the boxes don’t actually freeze, but the network gets cut off, for some of them. if we simply reboot the device when that happens, everything comes back online.

i made a script that checks if connectivity has been down for more than 15 minutes, and then it’s sort of safe to say the box will be unreachable, and the script reboots the box.

what would be the best way of having this script running at boot?

Thanks in advance.

Hi

Is this something that you can do with a service that starts first - before other services - checks something, and then proceeds to exit. Or do you absolutely need it in balenaOS?

If a simple service can work for you, check the depends_on field in docker compose.

the script needs to start after the machine has been powered on, and checks once every x minutes if the device is still online.

if connectivity has been lost for 15-30 minutes, the script will reboot the machine automatically. i haven’t tried to see if restarting the network would bring the machine online though. if that works, i’d replace the reboot command with a network restart.

this is a script that runs constantly.
here’s the code i am planning on using:

#!/bin/bash

check_connectivity() {

wget -q --spider http://google.com ; echo $?

if [[ $? -eq 0 ]]; then

    echo $(date) "1" | tee -a /mnt/data/connectivity/connectivity.log

    echo OK > /mnt/data/connectivity/status

else

    echo $(date) "0" | tee -a /mnt/data/connectivity/connectivity.log

    echo 0 >> /mnt/data/connectivity/status

fi

}

fix_connectivity() {

status=$(cat /mnt/data/connectivity/status |wc -l)

if [[ $status -ge 5 ]]; then

    ### connectivity failed to get restored within 10 minutes. box likely dead.

    ## reboot

    /sbin/reboot

    ## restart NetworkManager is also an option

    # systemctl restart NetworkManager

fi

}

while true

do

check_connectivity

sleep 60

fix_connectivity

done

I think this is something you can do using a small service. Using the supervisor API, you can restart the device from this service - if the network fails as per your constraints.

You shouldn’t have to make changes to balenaOS for this - as that would be a lot more complicated than having a service that monitors this. You can look at different network modes for docker services - so that you can get direct access to the network of the OS - in case that’s something that you need.