Stuck devices - "tunneling socket could not be established: cause=socket hang up"


#1

Hi,

We’ve had approximately 2-3 remote production devices go into a weird state in which our application containers stop running, but the devices still show as “Online” on the balena website.

The “weird state” is characterized by: the containers appear not to be running (no balena logs and nothing reaching our internal log infrastructure). Issuing a restart command from the website fails with a “tunneling socket could not be established: cause=socket hang up” message. Issuing a reboot command from the website is similarly unsuccessful, with the same error displayed. We discovered that we were able to SSH into the host OS and run “shutdown -r now”, and this seemed to fully restore the devices to operational status. We have observed this on balenaOS 2.29.0+rev1 with supervisor 9.0.1, and possibly other versions.

Why is this happening, and how can we prevent it from happening? What can we look into?


#4

Hi,
I have not heard of this type of behaviour before, so we will have to take a look when it occurs again.
The best option would be to contact support as soon as you find a device in that state and enable support access. You may have to give us a little time to respond, depending on the number of active support issues.
Unfortunately, most of the relevant logs are no longer available after a reboot, so there is not much we can see once the device has come back up. If you share a device link and enable support access, I can still take a look and see if I can spot anything unusual.
Regards
Thomas


#5

Hi Thomas,
Thanks for the info. We’ll do our best to grant you support access to a device in this state the next time we see one.


#6

Out of curiosity, is there any way for us to go in and capture the low level system logs ourselves? Maybe we can look at them too without requiring the full support access option.


#7

Hey @troyvsvs, you should have the option on the dashboard to log in to the host OS. Once there you can use the standard tools to look around…
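As a rough starting point once you’re on the host OS (assuming the standard systemd tooling that ships with balenaOS), something like the following can surface the most relevant information:

journalctl -b --no-pager | tail -n 200    # recent journal entries from the current boot
dmesg | tail -n 100                       # recent kernel messages (storage errors, undervoltage, etc.)
systemctl --failed                        # any systemd units currently in a failed state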


#8

I’ve also seen these exact symptoms; @samothx, I left one of my devices in this “stuck” state and granted support access so y’all can poke around: https://dashboard.balena-cloud.com/devices/d1ed939d7e2a7e1a4c14a0c791505d27/


#21

Hello there @msegado,

We’ve looked into the device @msegado kindly granted support access for, and we’ve determined that the failure is because the balenaEngine binary (amongst other files) has become corrupted. This appears to be down to SD card corruption.

We’ve also noticed that the kernel ring buffer shows a constant stream of the following errors:

[589909.286836] usb 1-1-port2: over-current change

This points to either an issue with the power supply (e.g. not using a sufficiently powerful supply) or an issue with the USB ports themselves. The following thread gives you a little more information:

https://www.element14.com/community/thread/55091/l/how-to-get-rid-of-over-current-change?displayFullThread=true

Taken together, these two factors point to power issues that have caused SD card corruption, damaging several required binaries.

We’re attempting to fix this device manually, but please be aware that if there are power issues on this device, the likelihood of this happening again is quite high.
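As an aside, if the board is a Raspberry Pi and vcgencmd happens to be available on the host OS (an assumption, not something we’ve verified on this device), a quick sketch for checking power problems would be:

dmesg | grep -iE "under-voltage|over-current"    # power-related kernel messages
vcgencmd get_throttled                           # firmware flags; 0x0 means no under-voltage or throttling detected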

Best regards, Heds


#28

Hi again,

Just as an aside, we strongly recommend using an SD card that’s been rigorously tested, such as the SanDisk Extreme Pro. We often see customers with SD corruption due to cheap SD cards.

Best regards, Heds


#30

This doesn’t seem to address the obvious: we were able to get the devices working again by running “shutdown -r now” on the host OS. How can that possibly square with the “SD corruption”?


#31

Hi
SD card corruption can manifest itself in a lot of ways. Depending on where the corruption is located, you can expect all sorts of weird behaviours.
Regards
Thomas


#35

@troyvsvs, the device that we know had SD corruption was @msegado’s. If you’re seeing the device come back fine after a reboot then it’s probably a different issue.

You had asked how you can get low-level logs yourselves - there are several things that could be useful for diagnosis, but in this case I think these would be the most relevant:

systemctl status resin-supervisor.service
journalctl -n 500 -u resin-supervisor -a
systemctl status balena.service
journalctl -n 500 -u balena -a
balena ps -a
balena images
dmesg

This is because I suspect the supervisor isn’t running properly, given how the restart command is failing. I’m hoping we can see some useful error messages if you post the output of those commands (especially the journalctl ones).
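If it helps, here is a rough sketch (untested, and assuming the standard shell on a balenaOS host) for collecting all of those outputs into a single file that you can then copy off the device or paste here:

OUT="/tmp/diag-$(date +%Y%m%d-%H%M%S).log"    # one timestamped file for everything
{
  systemctl status resin-supervisor.service
  journalctl -n 500 -u resin-supervisor -a
  systemctl status balena.service
  journalctl -n 500 -u balena -a
  balena ps -a
  balena images
  dmesg
} > "$OUT" 2>&1
echo "Diagnostics written to $OUT"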


#38

Thanks for looking into (and fixing!) this. While we’re not running any USB devices, both our SD cards and our PSUs were very cheap (rated at 2.5A but not verified…) so one of those seems like a likely culprit. I’ll look into replacing those.

That said, the last time this happened I was also able to reboot from the CLI like @troyvsvs and have the device come back online, so I’m wondering if there’s more going on here. I’m not too familiar with how the disk cache operates in Linux, but is it possible that the cache became corrupt due to power issues rather than the filesystem itself (and just never got invalidated since it’s read-only)? In that case the corruption would fix itself after a reboot.
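If that theory holds, one way I could test it next time (assuming the usual /proc interface is available on the host OS, and purely as a hypothesis check, not a fix) would be to drop the kernel’s page cache without rebooting and see whether the binaries start working again:

sync                                  # flush any dirty pages to the SD card first
echo 3 > /proc/sys/vm/drop_caches     # ask the kernel to drop the page cache, dentries and inodes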

Anyway, I’ll keep an eye out and let you all know if I get a device stuck in this state again so we can experiment more =)


#39

@pcarranzav

Those commands are very helpful thanks. We’ll run them (in addition to providing support access) if this is seen again.

To all balena folk on here: is there any way we can gather these low level logs from our application container? We have a fairly robust internal log infrastructure that might be useful in debugging this stuff if we could forward the OS logs to it. We could, e.g., give you guys logs going back a week or more.

@msegado
Thanks for the update and the info about your ability to reboot. My hunch is that there’s something else going on here. Your disk cache explanation is logical, but it seems a stretch to justify SD corruption as the culprit. I’m not aware of a corruptible read-only disk cache (living somewhere between nonvolatile cells and RAM) like the one you’re describing. It seems that if power is too low to hold things in volatile memory, even RAM would be affected. I could be wrong on that, though. Another thing is that we saw a spurt of devices encountering this issue immediately after a large portion of our fleet experienced a network outage. I guess that “network outage” could have been a power outage, though. Random power outages might cause any power-related instability to surface.


#43

@troyvsvs, regarding “is there any way we can gather these low-level logs from our application container?” - to the best of my knowledge, unfortunately the answer is no. This request/question pops up every now and then, and every time it does, it gets upvoted in our internal issue tracker and discussed in our roadmap meetings, so thank you for bringing it up!


#44

Are the devices all on a common power infrastructure? If the issue was correlated with a network outage as well, and they share power infrastructure, it seems possible that a power fluctuation is the variable connecting the network outage to any disk/memory corruption.

Also, could you clarify which kinds of logs you’d like to gather from the application containers? It may be possible to include a logging export agent in your application image for the logs you want.
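For example, and purely as a sketch (the log file path and endpoint URL below are hypothetical placeholders, and it assumes curl is installed in the container), a small loop inside the application container could periodically ship the application’s own logs to your infrastructure:

LOG_FILE=/data/app.log                         # wherever your application writes its logs
LOG_ENDPOINT=https://logs.example.com/ingest   # placeholder for your own ingestion endpoint
while true; do
  # naive approach: re-send the whole file each minute; a real agent would track offsets
  curl -s -X POST --data-binary @"$LOG_FILE" "$LOG_ENDPOINT"
  sleep 60
done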


#45

@pdcastro
You’re welcome. Thanks for hopping on, and for the insight into Balena’s development processes.

@dt-rush
We aren’t sure of the exact topology of the power infrastructure at our install locations, but several of the devices were in the same building.

As to which kinds of logs we’d like to gather: anything that can help Balena or us debug stability issues, improve the core OS, or better understand the interaction between application containers and the OS for maximum reliability. Eventually we want our IoT service to reach >99.9% uptime, and numbers like that will become a requirement for us, so I want to do whatever I can to boost the reliability numbers. Any or all logs useful for achieving that can be accommodated.