balenaFin stuck

I have a balenaFin 1.1 stuck at the very end of an application update. It updated 3 out of 4 containers; the 4th container is stuck at the “Installed” status. Hard and soft reboots and subsequent application updates have had no effect. The 3 good containers are running. It’s been stuck this way for a couple of days. I’ve been having similar problems with Pi 3s/4s and moved over to the Fin to eliminate SD card considerations.

diagnostics produces: “Error reported when querying checks data: ssh client socket error while initiating SSH connection”

Device is pingable throughout.

Anyone have any ideas?

Hi,

So by the sounds of it, you’re not able to SSH into the device from the Dashboard or the balena CLI? If that isn’t the case, could you supply us with the device dashboard URL so we can have a look at the device for you?

If it is the case that you’re not able to SSH to the device, is there another device on the same network?
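
For reference, SSH access can also be tested from a workstation with the balena CLI; a minimal sketch, with the UUID and service name as placeholders:

    balena ssh <device-uuid>                   # open a shell on the host OS
    balena ssh <device-uuid> <service-name>    # open a shell inside a running service container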

Best regards,

Heds

979bfd593d1f9a1f2bf4892e882dd51f

Thank you. Looking at the Dashboard, it seems that all services are running normally here (none are stuck at “Installed” and all are showing the latest application release).

I do see something strange in that the Supervisor service seems to have got into a bad state (and is not connecting properly). It’s also taken me several attempts to SSH in, which usually corresponds with a very bad network connection; however, once in, the device is very responsive.

Currently, the device is in a bad state due to the Supervisor. I’d like to try and sort the Supervisor out, but this might result in me having to stop balenaEngine and the services as well. Would this be acceptable?
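
For transparency, the kind of thing I would run on the host OS is roughly the following (a sketch only; the Supervisor unit is called balena-supervisor on recent balenaOS releases and resin-supervisor on older ones):

    systemctl status balena-supervisor    # check the Supervisor's systemd state
    systemctl restart balena-supervisor   # restart just the Supervisor
    systemctl restart balena              # restart balenaEngine (this also stops the service containers)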

Best regards,

Heds

Go for it. I need to figure this out.

Hi. Some operations are timing out and taking too long, including a restart of the balenaEngine. Can this device be restarted?

As for the reason why, my guess is an SD card issue. System logs show balenad hanging for at least two minutes while trying to do I/O (and that’s also why it is timing out on a restart). I saw timeouts on I/O-heavy operations as well, and you mentioned SD card issues. If it’s possible, could you try another SD card? From experience, we usually recommend the SanDisk Extreme Pro.

This device is a Fin - so … SD card? It’s been power cycled.

Sorry, yes it’s a Fin. I’ll get in touch with the right people and update you when we have something.

I’m not looking so much to fix this individual device as to learn how to debug this situation myself. Personally, I suspected network connectivity, but do you see otherwise?

If it’s any help, the device has 6 I2C and 2 GPIO devices (4 GPIO pins) in use. The GPIO/I2C stuff all seems to work correctly individually. I can unplug all of that if it helps.

I don’t think it’s a power thing either - the Fin is on the standard supply, and the Pi 4, 3B and 3B+ are all on decent 2.5 A supplies that have handled similar loads for me in the past, with no blinky power LED.

Different pieces of the stack are disagreeing about the actual state of the Supervisor, and that is leading to some odd issues. Part of that seems to be caused by slow I/O, which may be triggering some timeouts. I don’t think I2C or GPIO are related.

I/O as in network I/O or storage I/O?

Storage I/O

When I tried to restart the engine, balenad was killed by the kernel due to blocking way too long while doing disk I/O, so that’s an issue. I don’t know yet if that’s the issue.

Container issue? 1 python-buster, 1 python-stretch, 2 golangs?

Camera presence? (1)

What are you looking at to determine this, and can I see the same info?

Both the standard dmesg and journalctl commands should give you info if a task is killed due to extreme situations, such as running out of memory or blocking too long inside the kernel (for example, because of very slow I/O). journalctl is a standard utility on systemd machines, and it usually gives a lot of good information for debugging.
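
As a concrete sketch of what to look for (the grep patterns are examples, not an exhaustive list):

    dmesg | grep -iE "blocked for more than|hung_task|out of memory"
    journalctl -k -b | grep -iE "blocked for more than|oom"
    journalctl -u balena.service --since "1 hour ago"   # engine logs on the host OS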

Ideas?

Hi @rodley, as noted above this may be related to a storage issue. One of our engineers suggested a way to get a ballpark estimate of the state of the eMMC on the Fin, which might provide some more insight into this. If you want to try, here are the instructions:

  • Install mmc-utils (apt-get install mmc-utils)
  • Run mmc extcsd read /dev/mmcblk0
  • There will be a field called EXT_CSD_DEVICE_LIFE_TIME_EST_TYP_B
  • The value of that field gives you an idea of the wear on the memory in the following way:
    0x01 ~ <10% of maximum wear
    0x02 ~ <20% of maximum wear
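
Putting that together, a rough sketch of the whole check, run from a shell on the device (assuming the eMMC shows up as /dev/mmcblk0, and that you run it somewhere with apt available, e.g. a Debian-based container with access to the device):

    apt-get update && apt-get install -y mmc-utils
    mmc extcsd read /dev/mmcblk0 | grep -i "LIFE_TIME_EST"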

Here’s my action plan:

  • Reflash the Pi 3B, 3B+, 4B and Fin as development versions with local mode enabled, using the recommended SD cards (except for the Fin)
  • Run the eMMC diagnostics above on the Fin to estimate wear (which should be very low, as the device has been very lightly used)
  • Report back here with the wear results.
  • Push some application updates locally (sketched below) and see if they “take” quickly, unlike the current situation where updates take many hours.
  • If updates go wonky, we’ve eliminated the firewall/router/internet connection as the issue. Check dmesg and journalctl and get back on the forum for further help.
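
To illustrate the local-mode step, the workflow I have in mind is roughly the following (the device address is a placeholder):

    balena scan                                  # discover development devices on the local network
    balena push <device-ip-or-hostname.local>    # build and start the release directly on the device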

Sound good?