balenaFin stuck

I have a balenaFin 1.1 stuck at the very end of an application update. It updated 3 out of 4 containers; the 4th container is stuck at the “Installed” status. Hard and soft reboots and subsequent application updates have had no effect. The 3 good containers are running. It’s been stuck this way for a couple of days. I’ve been having similar problems with Pi 3s/4s and moved over to the Fin to eliminate SD card considerations.

diagnostics produces: “Error reported when querying checks data: ssh client socket error while initiating SSH connection”

Device is pingable throughout.

Anyone have any ideas?

Hi,

So by the sounds of it, you’re not able to SSH into the device from the Dashboard or the balena CLI? If that isn’t the case, could you supply us with the device dashboard URL so we can have a look at the device for you?

If it is the case that you’re not able to SSH to the device, is there another device on the same network?
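
For reference, SSH access can also be tested from a workstation with the balena CLI; a minimal sketch, with the UUID and service name as placeholders:

    balena ssh <device-uuid>                   # open a shell on the host OS
    balena ssh <device-uuid> <service-name>    # open a shell inside a running service container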

Best regards,

Heds

979bfd593d1f9a1f2bf4892e882dd51f

Thank you. Looking at the Dashboard, it seems that all services are running normally here (none are stuck at “Installed” and all are showing the latest application release).

I do see something strange in that the Supervisor service seems to have got into a bad state (and is not connecting properly). It’s also taken me several attempts to SSH in, which usually corresponds with a very bad network connection; however, once in, the device is very responsive.

Currently, the device is in a bad state due to the Supervisor. I’d like to try and sort the Supervisor out, but this might result in me having to stop balenaEngine and the services as well. Would this be acceptable?
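
For transparency, the kind of thing I would run on the host OS is roughly the following (a sketch only; the Supervisor unit is called balena-supervisor on recent balenaOS releases and resin-supervisor on older ones):

    systemctl status balena-supervisor    # check the Supervisor's systemd state
    systemctl restart balena-supervisor   # restart just the Supervisor
    systemctl restart balena              # restart balenaEngine (this also stops the service containers)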

Best regards,

Heds

Go for it. I need to figure this out.

Hi. Some operations are timing out and taking too long, including a restart of the balenaEngine. Can this device be restarted?

As for the reason why, my guess is an SD card issue. System logs show balenad hanging for at least two minutes while trying to do I/O (and that’s also why it is timing out on a restart). I saw timeouts on I/O-heavy operations as well, and you mentioned SD card issues. If it’s possible, could you try another SD card? From experience, we usually recommend the SanDisk Extreme Pro.

This device is a Fin - so … SD card? It’s been power cycled.

Sorry, yes it’s a Fin. I’ll get in touch with the right people and update you when we have something.

I’m not looking so much to fix this individual device as to learn how to debug this situation myself. Personally, I suspected network connectivity, but do you see otherwise?

If it’s any help, the device has 6 I2C and 2 GPIO devices (4 GPIO pins) in use. The GPIO/I2C stuff all seems to work correctly individually. I can unplug all of that if it helps.

I don’t think it’s a power thing either - the Fin is on the standard supply, and the Pi 4, 3B and 3B+ are all on decent 2.5 A supplies that have handled similar loads for me in the past, with no blinky power LED.

Different pieces of the stack are disagreeing about the actual state of the Supervisor, and that is leading to some odd issues. Part of that seems to be caused by slow I/O, which may be triggering some timeouts. I don’t think I2C or GPIO are related.

I/O as in network I/O or storage I/O?

Storage I/O

When I tried to restart the engine, balenad was killed by the kernel due to blocking way too long while doing disk I/O, so that’s an issue. I don’t know yet if that’s the issue.

Container issue? 1 python-buster, 1 python-stretch, 2 golangs?

Camera presence? (1)

What are you looking at to determine this, and can I see the same info?

Both the standard dmesg and journalctl commands should give you info if a task is killed due to extreme situations, such as running out of memory or blocking too long inside the kernel (for example, because of very slow I/O). journalctl is a standard utility on systemd machines, and it usually gives a lot of good information for debugging.
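
As a concrete sketch of what to look for (the grep patterns are examples, not an exhaustive list):

    dmesg | grep -iE "blocked for more than|hung_task|out of memory"
    journalctl -k -b | grep -iE "blocked for more than|oom"
    journalctl -u balena.service --since "1 hour ago"   # engine logs on the host OS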

Ideas?

Hi @rodley, as noted above this may be related to a storage issue. One of our engineers suggested a way to get a ballpark estimate of the state of the eMMC on the Fin, which might provide some more insight into this. If you want to try, here are the instructions:

  • Install mmc-utils (apt-get install mmc-utils)
  • Run mmc extcsd read /dev/mmcblk0
  • There will be a field called EXT_CSD_DEVICE_LIFE_TIME_EST_TYP_B
  • The value of that field gives you an idea of the wear on the memory in the following way:
    0x01 ~ <10% of maximum wear
    0x02 ~ <20% of maximum wear
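
Putting that together, a rough sketch of the whole check, run from a shell on the device (assuming the eMMC shows up as /dev/mmcblk0, and that you run it somewhere with apt available, e.g. a Debian-based container with access to the device):

    apt-get update && apt-get install -y mmc-utils
    mmc extcsd read /dev/mmcblk0 | grep -i "LIFE_TIME_EST"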

Here’s my action plan:

  • Reflash the Pi 3B, 3B+, 4B and Fin as development versions with local mode enabled, using the recommended SD cards (except for the Fin)
  • Run the eMMC diagnostics above on the Fin to estimate wear (which should be very low, as the device has been very lightly used)
  • Report back here with the wear results.
  • Push some application updates locally (sketched below) and see if they “take” quickly, unlike the current situation where updates take many hours.
  • If updates go wonky, we’ve eliminated the firewall/router/internet connection as the issue. Check dmesg and journalctl and get back on the forum for further help.
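
To illustrate the local-mode step, the workflow I have in mind is roughly the following (the device address is a placeholder):

    balena scan                                  # discover development devices on the local network
    balena push <device-ip-or-hostname.local>    # build and start the release directly on the device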

Sound good?