balenaFin stuck

Potentially useful data points from the fin:

  1. Running “apt-get install mmc-utils” in the python-buster container hung while trying to connect to prod.debian.map.fastly.net, and did not complete until I killed the process named “/usr/local/lib/python3.5/site-packages/adafruit_blinka/microcontroller”. Once I killed that process, everything started working.
  2. “mmc extcsd read” reported back 0x01, i.e. less than 10% of max wear, so this does not look like a memory corruption issue.
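For reference, here is a minimal sketch of how a wear value like the 0x01 above can be read. It assumes the value came from one of the eMMC "device life time estimation" fields in the EXT_CSD register, which the post does not state explicitly. Those fields report wear in 10% steps. The function name `describe_life_time` is made up for illustration.

```python
def describe_life_time(value):
    """Translate an eMMC life time estimation byte into a readable range.

    Assumption: `value` is the raw byte of a device life time estimation
    field, where 0x01..0x0A each cover a 10% band of estimated life used.
    """
    if value == 0x00:
        return "not defined"
    if 0x01 <= value <= 0x0A:
        low = (value - 1) * 10
        high = value * 10
        return "{}-{}% of estimated device life used".format(low, high)
    if value == 0x0B:
        return "exceeded maximum estimated device life"
    return "reserved value 0x{:02x}".format(value)
```

Calling `describe_life_time(0x01)` gives the 0-10% band, which matches the "less than 10% max wear" reading.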

Action plan from here while I wait for the SanDisk SD cards: get rid of the Adafruit Blinka code (not a big deal) and see whether that magically fixes all my problems. I suspect it will.

The process that was actually running was:

/usr/local/lib/python3.5/site-packages/adafruit_blinka/microcontroller/bcm283x/pulseio/libgpiod_pulsein --pulses 2 --queue 20391 gpiochip0 20
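Since the process name first appeared truncated, it can help to list the full command line of anything matching it. A minimal sketch, assuming a Linux /proc filesystem is available inside the container; the helper name `find_processes` is hypothetical and the code avoids f-strings so it also runs on Python 3.5:

```python
import os


def find_processes(needle, proc_root="/proc"):
    """Return sorted (pid, command line) pairs whose command line contains `needle`."""
    matches = []
    if not os.path.isdir(proc_root):
        return matches
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as f:
                raw = f.read()
        except OSError:
            continue  # process exited or is not readable
        # Arguments in /proc/<pid>/cmdline are separated by NUL bytes.
        cmdline = raw.replace(b"\0", b" ").decode("utf-8", "replace").strip()
        if needle in cmdline:
            matches.append((int(entry), cmdline))
    return sorted(matches)
```

For example, `find_processes("libgpiod_pulsein")` should return the pid together with the full command shown above, if the helper is still running.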

This is strong evidence that the Linux package libgpiod2, running in a python-buster container, is the problem.

Removed it (I was only using it for an HC-SR04 ultrasonic distance sensor) and the fin is behaving much better: updates apply immediately, logs come through balenaCloud promptly, and ssh’ing into containers and the host OS succeeds on the first try and stays connected.

libgpiod2, a Debian package required by the Adafruit Blinka Python package, seems to interfere seriously with the operation of the supervisor, in particular when the command “libgpiod_pulsein --pulses 2 --queue 20391 gpiochip0 20” is running.

The symptoms are:

  • spotty access to logs
  • very flaky update behavior
  • very flaky terminal access
  • logging within the supervisor that mimics corrupt SD card behavior

The confusing part is that the GPIO functionality of the library appears to work when run from the command line in a terminal. This is not balenaFin-specific behavior: it also occurs on the Raspberry Pi 3B, 3B+ and 4B.

hey @rodley,

Thanks for your thorough investigation of this. We will look into it and update you once we have some feedback.
Cheers,
Rahul

Hey, based on https://github.com/adafruit/libgpiod_pulsein/blob/master/src/libgpiod_pulsein.c it looks like it sets itself to the maximum possible priority and runs in a tight loop. I expect that is starving the system of resources, which would cause the issues you’re seeing. Can you check its behaviour while it is running to confirm whether it’s maxing out resources?
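One possible way to run that check from inside the container is sketched below. It reads the scheduling policy and priority with Python's `os.sched_getscheduler` and `os.sched_getparam`, then samples the CPU time counters in /proc/<pid>/stat twice. This assumes Linux and a known pid; the helper names (`policy_name`, `parse_cpu_ticks`, `cpu_percent`, `describe`) are made up for illustration, and what the output will actually show for libgpiod_pulsein is not something this thread has confirmed yet.

```python
import os
import time


def policy_name(policy):
    """Map a numeric scheduling policy to its name, where the platform defines it."""
    for name in ("SCHED_OTHER", "SCHED_FIFO", "SCHED_RR", "SCHED_BATCH", "SCHED_IDLE"):
        if getattr(os, name, None) == policy:
            return name
    return "unknown ({})".format(policy)


def parse_cpu_ticks(stat_line):
    """Return utime + stime (in clock ticks) from a /proc/<pid>/stat line."""
    # Field 2 (the command name) is parenthesised and may contain spaces,
    # so split after the last ')' before counting fields.
    fields = stat_line.rsplit(")", 1)[1].split()
    # fields[0] is field 3 (state), so utime (14) and stime (15) are 11 and 12.
    return int(fields[11]) + int(fields[12])


def cpu_percent(pid, interval=1.0):
    """Estimate the share of one core used by `pid` over `interval` seconds."""
    ticks_per_second = os.sysconf("SC_CLK_TCK")

    def read_ticks():
        with open("/proc/{}/stat".format(pid)) as f:
            return parse_cpu_ticks(f.read())

    before = read_ticks()
    time.sleep(interval)
    after = read_ticks()
    return 100.0 * (after - before) / ticks_per_second / interval


def describe(pid, interval=1.0):
    """Summarise scheduling policy, priority and CPU use for one process."""
    policy = os.sched_getscheduler(pid)
    priority = os.sched_getparam(pid).sched_priority
    return "pid {}: {} priority {}, {:.0f}% of one core".format(
        pid, policy_name(policy), priority, cpu_percent(pid, interval)
    )
```

Calling `print(describe(pid))` with the pid of the running libgpiod_pulsein process would print one summary line. A real-time policy such as SCHED_FIFO combined with a figure close to 100% of a core would be consistent with the starvation theory.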

That’s what I figured. The use case is arguably real time. I’ll check it out and report back here.

Jr