Stuck devices - "tunneling socket could not be established: cause=socket hang up"

Hi,

We’ve had approximately 2-3 remote production devices go into a weird state in which our application containers stop running, but the devices still show as “Online” on the balena dashboard.

The “weird state” is characterized by: containers appear to not be running (no balena logs and no logs reaching our internal infrastructure). Issuing a restart command from the dashboard fails with the message “tunneling socket could not be established: cause=socket hang up”. Issuing a reboot command from the dashboard fails in the same way, with the same error message. We discovered that we were able to SSH into the host OS and run “shutdown -r now”, and this seemed to fully restore the devices to operational status. This has been observed on balenaOS 2.29.0+rev1 with supervisor 9.0.1, and possibly other versions.

Why is this happening, and how can we prevent it from happening? What can we look into?

Hi,
I have not heard of this type of behaviour, so we will have to take a look when it occurs again.
The best option would be to contact support as soon as you find a device in that state and enable support access. You will have to give us a little time to respond, depending on the number of active support issues.
Unfortunately, most of the relevant logs are no longer available after a reboot, so there is not much we can see once the device has come back up. If you share a device link and enable support access I can take a look anyway and see if I can detect anything unusual.
Regards
Thomas


Hi Thomas,
Thanks for the info. We’ll do our best to try to forward you support access to a device that gets into this state the next time we see it.

Out of curiosity, is there any way for us to go in and capture the low level system logs ourselves? Maybe we can look at them too without requiring the full support access option.

Hey @troyvsvs, you should have the option on the dashboard to log in to the host OS. Once there you can use the standard tools to look around…
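
For example, a rough sketch (the exact access route depends on your setup: the dashboard terminal with the “Host OS” target works, or SSH to the host OS port if you have a key configured, with <device-ip> as a placeholder):

# balenaOS runs the host OS SSH server on port 22222
ssh -p 22222 root@<device-ip>
# once on the host OS, the usual systemd tooling is available
systemctl --failed
journalctl -b -p err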

I’ve also seen these exact symptoms; @samothx, I left one of my devices in this “stuck” state and granted support access so y’all can poke around: https://dashboard.balena-cloud.com/devices/d1ed939d7e2a7e1a4c14a0c791505d27/

Hello there @msegado,

We’ve looked into the device @msegado kindly granted support access for, and we’ve determined that the failure occurred because the balenaEngine binary (amongst other files) has become corrupted. This appears to be down to SD card corruption.

We’ve also noticed that the kernel log shows a constant stream of the following errors:

[589909.286836] usb 1-1-port2: over-current change

This points either to an issue with the power supply (e.g. not using a powerful enough supply) or to an issue with the USB ports themselves. The following thread gives a little more information:

https://www.element14.com/community/thread/55091/l/how-to-get-rid-of-over-current-change?displayFullThread=true
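
If anyone else wants to check their own devices for this, a quick way to look for the same messages from the host OS is:

# search the kernel log for over-current events
dmesg | grep -i over-current
journalctl -k | grep -i over-current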

These two factors combined appear to point to power issues, which have caused SD card corruption and in turn corrupted several required binaries.

We’re attempting to fix this device manually, but please be aware that if there are ongoing power issues on this device, the likelihood of this happening again is quite high.

Best regards, Heds

Hi again,

Just as an aside, we strongly recommend using an SD card that’s been rigorously tested, such as the SanDisk Extreme Pro. We often see customers with SD corruption due to cheap SD cards.

Best regards, Heds

This doesn’t seem to address the obvious: we were able to get the devices working again by running “shutdown -r now” on the host OS. How can that possibly square with the “SD corruption”?

Hi
SD card corruption can manifest itself in many ways. Depending on where the corruption is located, you can expect all sorts of weird behaviours.
Regards
Thomas

@troyvsvs, the device that we know had SD corruption was @msegado’s. If you’re seeing the device come back fine after a reboot then it’s probably a different issue.

You had asked how you can get low-level logs yourselves. There are several things that could be useful for diagnosis, but in this case I think these would be the most relevant:

systemctl status resin-supervisor.service
journalctl -n 500 -u resin-supervisor -a
systemctl status balena.service
journalctl -n 500 -u balena -a
balena ps -a
balena images
dmesg

This is because I suspect the supervisor isn’t running properly, given how the restart command is failing. I’m hoping we can see some useful error messages if you post the output of those commands (especially the journalctl ones).
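
If it helps, here is a rough sketch of how you could capture all of that into a single file on the host OS for sharing (the output path is only an example; /mnt/data persists across reboots):

# run on the host OS while the device is still in the bad state
OUT=/mnt/data/debug-$(date +%s).log
{
  systemctl status resin-supervisor.service
  journalctl -n 500 -u resin-supervisor -a
  systemctl status balena.service
  journalctl -n 500 -u balena -a
  balena ps -a
  balena images
  dmesg
} > "$OUT" 2>&1
echo "wrote $OUT"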

Thanks for looking into (and fixing!) this. While we’re not running any USB devices, both our SD cards and our PSUs were very cheap (rated at 2.5A but not verified…) so one of those seems like a likely culprit. I’ll look into replacing those.

That said, the last time this happened I was also able to reboot from the CLI like @troyvsvs and have the device come back online, so I’m wondering if there’s more going on here. I’m not too familiar with how the disk cache operates in Linux, but is it possible that the cache became corrupt due to power issues rather than the filesystem itself (and just never got invalidated since it’s read-only)? In that case the corruption would fix itself after a reboot.
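
If anyone wants to test that theory next time a device gets stuck, I think something like this on the host OS would distinguish a bad in-memory copy from bad data on the SD card (the binary path is my guess at where the engine lives, so treat this as a sketch):

# checksum the binary as currently cached, then drop the page cache and checksum again
md5sum /usr/bin/balena-engine
sync
echo 3 > /proc/sys/vm/drop_caches
md5sum /usr/bin/balena-engine
# if the two sums differ, the corruption was only in the cached copy, not on disk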

Anyway, I’ll keep an eye out and let you all know if I get a device stuck in this state again so we can experiment more =)

@pcarranzav

Those commands are very helpful thanks. We’ll run them (in addition to providing support access) if this is seen again.

To all balena folk on here: is there any way we can gather these low level logs from our application container? We have a fairly robust internal log infrastructure that might be useful in debugging this stuff if we could forward the OS logs to it. We could, e.g., give you guys logs going back a week or more.

@msegado
Thanks for the update and info about your ability to reboot. My hunch is that there’s something else going on here. Your disk cache explanation is a logical argument but it seems a stretch to justify the SD corruption as the culprit. I’m not aware of a corruptible read only disk cache (living somewhere between nonvolatile cells and RAM) as you’re describing. It seems that if power is too low to hold things in volatile memory, even RAM would be affected. Could be wrong on that though. Another thing is that we saw a spurt of devices encountering this issue immediately after a large portion of our fleet experienced a network outage. I guess that “network outage” could have been a power outage though. Random power outages might cause any power-related instability to surface.

@troyvsvs, regarding: “is there any way we can gather these low level logs from our application container?,” to the best of my knowledge unfortunately the answer is no. This request/question pops up every now and then, and every time it does, it gets upvoted in our internal issue tracker and discussed in our roadmap meetings, so thank you for bringing it up!

Are the devices all on a common power infrastructure? If the issue was correlated with a network outage as well, and the devices share power infrastructure, it seems possible that a power fluctuation is the variable connecting the network outage to any disk or memory corruption.

Also, could you clarify which kinds of logs you’d like to gather from the application containers? It may be possible to include a logging export agent in your application image for the logs you want.
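
As a very minimal sketch of such an agent (assuming your container runs privileged so dmesg is readable, that curl is available in your image, and with a placeholder URL standing in for your own log collector):

# run from your container's start script
while true; do
  dmesg | tail -n 200 > /tmp/kernel.log
  curl -s -X POST --data-binary @/tmp/kernel.log https://logs.example.com/ingest || true
  sleep 300
done

This won’t get you the host’s systemd journal (which, as noted above, isn’t exposed to application containers today), but kernel messages such as the over-current errors mentioned earlier would be captured.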

@pdcastro
You’re welcome. Thanks for hopping on, and for the insight into Balena’s development processes.

@dt-rush
We aren’t sure of the exact topology of the power infrastructure at our install locations, but several of the devices were in the same building.

As for which kinds of logs, we’d like to gather anything that can help balena or us debug stability issues and improve the core OS, or improve our understanding of the interaction between application containers and the OS for maximum reliability. Eventually we want our IoT service to reach > 99.9% uptime, and in the future numbers like that will be a requirement for us, so I want to do whatever I can to boost the reliability numbers. Any and all logs useful for achieving that would be welcome.

Hi, I am having this problem right now. I tried to run the command “shutdown -r now” but it did not solve the problem.

Is it possible to get support from the balena team? I have already enabled support access.

Thanks a lot

@nuno_martins1, I’ve sent you a private e-mail that you can reply to with your device UUID or web dashboard URL, as we need that in addition to support access being granted for the device.

Right: my UUID is 8b2b584c18961f602b82e38865d1be3e.

This problem has happened in the past and I replaced the memory card then. But the device is far away, so I would like another kind of solution.

@nuno_martins1, thanks for sharing the device UUID and granting support access. I was looking at the device logs and I see that balena-engine and the balena supervisor are failing to start because of filesystem issues:

# on a host OS prompt
root@8b2b584:~# journalctl -xe
...
Oct 03 18:43:35 8b2b584 kernel: EXT4-fs error (device mmcblk0p6): ext4_lookup:1587: inode #2925: comm balenad: deleted inode referenced: 1573070
Oct 03 18:43:35 8b2b584 kernel[637]: [ 7829.876197] EXT4-fs error (device mmcblk0p6): ext4_lookup:1587: inode #2925: comm balenad: deleted inode referenced: 1573070
Oct 03 18:43:35 8b2b584 balenad[7509]: time="2019-10-03T18:43:35.916044766Z" level=info msg="stopping healthcheck following graceful shutdown" module=libcontainerd
Oct 03 18:43:35 8b2b584 balenad[7509]: time="2019-10-03T18:43:35.916334712Z" level=info msg="stopping event stream following graceful shutdown" error="context canceled" module=libcontainerd nam>
Oct 03 18:43:36 8b2b584 balenad[7509]: Error starting daemon: lstat /var/lib/docker/image/aufs/layerdb/sha256/dc34b7df1552b76c451b24f9c42c5b0364005fc166360901d436eca93f0935cb: structure needs c>
Oct 03 18:43:36 8b2b584 systemd[1]: balena.service: Main process exited, code=exited, status=1/FAILURE
Oct 03 18:43:36 8b2b584 systemd[1]: balena.service: Failed with result 'exit-code'.
Oct 03 18:43:36 8b2b584 systemd[1]: Failed to start Balena Application Container Engine.
Oct 03 18:43:37 8b2b584 systemd[1]: balena.service: Service RestartSec=100ms expired, scheduling restart.
Oct 03 18:43:37 8b2b584 systemd[1]: balena.service: Scheduled restart job, restart counter is at 5.
Oct 03 18:43:37 8b2b584 systemd[1]: Stopped Balena Application Container Engine.
Oct 03 18:43:37 8b2b584 systemd[1]: Closed Docker Socket for the API.
Oct 03 18:43:37 8b2b584 systemd[1]: Stopping Docker Socket for the API.
Oct 03 18:43:37 8b2b584 systemd[1]: balena-engine.socket: Start request repeated too quickly.
Oct 03 18:43:37 8b2b584 systemd[1]: balena-engine.socket: Failed with result 'start-limit-hit'.
Oct 03 18:43:37 8b2b584 systemd[1]: Failed to listen on Docker Socket for the API.
Oct 03 18:43:37 8b2b584 systemd[1]: Dependency failed for Balena Application Container Engine.
Oct 03 18:43:37 8b2b584 systemd[1]: balena.service: Job balena.service/start failed with result 'dependency'.
root@8b2b584:~# journalctl -au resin-supervisor
...
Oct 03 18:31:29 8b2b584 resin-supervisor[6828]: Unhandled rejection Error: (HTTP code 500) server error - layer does not exist
...

The web dashboard Diagnostics tab (Supervisor State) for the device says “error: tunneling socket could not be established, cause=socket hang up” because the supervisor is not running on the device, which in turn is caused by the filesystem issues.

Such filesystem corruption issues are most commonly caused by SD card failures, or power outages. If you cannot replace the SD card, it may be possible to remotely unmount the /mnt/data partition and run fsck on it. Is this something you would like me to try? There is a risk of further data loss on the device.
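
For reference, the rough shape of that procedure would be something like the following, run on the host OS (the device node is taken from the errors above, and this assumes fsck.ext4 is present on the host OS; please treat it as a sketch rather than a tested recipe, and note that umount will fail if anything still holds the partition open):

# stop the services that keep /mnt/data busy
systemctl stop resin-supervisor.service balena.service
# unmount the data partition and check it
umount /mnt/data
fsck.ext4 -f -y /dev/mmcblk0p6
# reboot so everything comes back up cleanly
reboot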