performance heavy UART affected by balena supervisor?

Issue with reading processes for high speed UART connections

We’re facing an issue on balenaOS when reading from UART devices connected to our service running on the balena device. The received byte string seems to contain corrupted data.

Setup

Raspberry Pi 4 (2GB RAM)
Python 3.6.9
balenaOS (all tested versions, up to 2.75.0+rev1)
balena supervisor (all tested versions, up to 12.8.3)

Device UART settings / overlays

We use the independent UART connections exposed through the overlays:

  • UART2, UART3, UART4
  • Available through /dev/ttyAMA1, /dev/ttyAMA2, /dev/ttyAMA3

The container runs as a balena service.
The service has the following components:

  1. main-process
  2. reading-process for the UART connections

main-process

The main-process reads the data packages delivered by the reading-process via a Python multiprocessing Queue. This queue implementation uses OS pipes to exchange byte data between processes.

reading-process

The reading-process holds a UART connection to a sensor device through the pySerial library. The UART connection has the following characteristics:

  1. Baudrate: 921600
  2. Parity: Even
  3. Stopbits: One
  4. Bytesize: 8
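
For reference, these settings map onto pySerial roughly as follows. This is a minimal sketch, not our production code: the device path is one of the overlay ports listed above, and the read timeout and the function name are assumptions made for illustration.

```python
# Sketch only: the timeout is an assumed value, not part of the setup above.
UART_SETTINGS = {
    "baudrate": 921600,
    "parity": "E",    # same value as serial.PARITY_EVEN
    "stopbits": 1,    # same value as serial.STOPBITS_ONE
    "bytesize": 8,    # same value as serial.EIGHTBITS
    "timeout": 1.0,   # assumed read timeout in seconds
}


def open_sensor_port(device="/dev/ttyAMA1"):
    """Open the UART with pySerial (requires the pyserial package)."""
    # Imported lazily so the settings can be inspected without pySerial installed.
    import serial

    return serial.Serial(port=device, **UART_SETTINGS)
```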

The connected sensor sends a data frame of around 3000 bytes every 120 milliseconds. Each data frame contains multiple 4-byte strings that act as headers for different data types. Each header is followed by the payload size and the payload itself. The code checks each 4-byte string for a valid header; if the header is invalid, the package is flagged as corrupted and a “Corrupted header”-exception is raised.
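
The header check can be sketched like this. It is an illustration only: the header tags are placeholders, and the 4-byte little-endian size field is an assumption, since the real sensor protocol is not shown here.

```python
import struct

# Hypothetical header tags; the sensor's real 4-byte strings are not listed here.
VALID_HEADERS = {b"DAT1", b"DAT2"}


class CorruptedHeaderError(Exception):
    """Raised when a 4-byte header is not one of the known tags."""


def parse_frame(frame):
    """Split a frame into (header, payload) records.

    Assumed record layout: 4-byte header, 4-byte little-endian payload size,
    then the payload. The size-field width and byte order are assumptions.
    """
    records = []
    offset = 0
    while offset < len(frame):
        header = frame[offset:offset + 4]
        if header not in VALID_HEADERS:
            raise CorruptedHeaderError(
                "invalid header {!r} at offset {}".format(header, offset)
            )
        if offset + 8 > len(frame):
            raise ValueError("frame ends inside the size field")
        (size,) = struct.unpack_from("<I", frame, offset + 4)
        payload = frame[offset + 8:offset + 8 + size]
        if len(payload) != size:
            raise ValueError("frame ends inside the payload")
        records.append((header, payload))
        offset += 8 + size
    return records
```

A single flipped or dropped byte before a header shifts every following record, so one lost byte is enough to trigger the exception for the whole frame.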

We do receive approximately 500 frames per minute. Of these 500 frames around 5 are corrupted.
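
As a sanity check on these numbers, the line load can be worked out from the settings above. With 8 data bits, even parity and one stop bit, each byte takes 11 bits on the wire (including the start bit). The frame size is the approximate 3000 bytes mentioned above, so the results are estimates.

```python
BAUDRATE = 921600
BITS_PER_BYTE = 1 + 8 + 1 + 1   # start + data + even parity + stop
FRAME_BYTES = 3000              # approximate frame size
FRAME_PERIOD_MS = 120

frames_per_minute = 60000 / FRAME_PERIOD_MS
wire_time_ms = FRAME_BYTES * BITS_PER_BYTE * 1000 / BAUDRATE
utilisation = wire_time_ms / FRAME_PERIOD_MS
byte_time_us = BITS_PER_BYTE * 1e6 / BAUDRATE
corruption_rate = 5 / 500

print("frames per minute: {:.0f}".format(frames_per_minute))
print("time on the wire per frame: {:.1f} ms".format(wire_time_ms))
print("line utilisation: {:.0%}".format(utilisation))
print("time per byte: {:.1f} us".format(byte_time_us))
print("observed corruption rate: {:.0%}".format(corruption_rate))
```

So the link itself is only about 30% utilised, but during a frame a new byte arrives roughly every 12 microseconds.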

Issue

We only see “Corrupted header”-exceptions when running the code on balenaOS in a balena service. On the other installations we tested, this cannot be observed.

Other installation on identical hardware

  • Running the code on Ubuntu 20.04 64-bit with a “plain” Python interpreter. The OS was set up with the same commands used in the balena service Dockerfile.
  • Running the code on Ubuntu 20.04 64-bit in a Docker container built from the balena service Dockerfile. The Docker version was identical to the balena-docker version used.

Known improvements

  • Overclocking the Raspberry Pi 4 above the standard clock of 1.5 GHz seems to reduce the problem, especially at 1.7 GHz.
  • Updating the supervisor to the newest version, 12.8.3, also seems to reduce the problem significantly.

All this leads to the conclusion that the issue is caused by the balena supervisor or by balenaOS itself. Since updating the balena supervisor reduced how often the issue occurs, the supervisor seems the more likely cause.

Can somebody identify any issue with the configuration?
Can somebody confirm the supervisor is affecting the reading of the UART?
Does somebody have an idea how to stabilise / fix this?

Any help is much appreciated! Many thanks in advance!

Hi

Thanks for creating this ticket - this seems like a tricky one!

  • You seem to have done a lot of groundwork - is it possible for you to share an app for us to try out on a Pi4?
  • Do you see this on higher spec’d Pi4s? 4GB one for example
  • I am going to check the supervisor changelog and figure out what could have changed and helped so that we know if we can do more of that.

Hey, it seems like you’ve found some correlation between the CPU load/speed and the frequency of corrupt headers. To rule out the Supervisor as the cause (I’m not sure how it could be at this time), run your test and record the results, which you have already done with “We do receive approximately 500 frames per minute. Of these 500 frames around 5 are corrupted.” Then stop the Supervisor and run the test again to see whether the results differ. You can stop the Supervisor with systemctl stop balena-supervisor. On older OS versions, use the resin-supervisor service name instead (systemctl stop resin-supervisor).

Let us know the results.
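
If it helps with recording the two runs, a small counter along these lines could be kept in the reading process. It is only a sketch with made-up names (FrameStats, record), not something from your code or from the Supervisor.

```python
class FrameStats:
    """Tally received and corrupted frames for one test run (hypothetical helper)."""

    def __init__(self):
        self.total = 0
        self.corrupted = 0

    def record(self, ok):
        """Count one frame; ok=False means a corrupted header was detected."""
        self.total += 1
        if not ok:
            self.corrupted += 1

    def corruption_rate(self):
        """Fraction of corrupted frames, or 0.0 if nothing was received yet."""
        return self.corrupted / self.total if self.total else 0.0
```

Use one instance for the run with the Supervisor running and one for the run with it stopped, over the same duration, and compare the two rates.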

Hi! Thanks for your reply.

  • Unfortunately we cannot provide an app with the code.
  • We haven’t tried this yet, but it might be worth a try. The current RAM usage is around 500MB, so it shouldn’t be a low-memory issue, but it could behave differently on the 4GB version. I will try to test this.
  • This would be great! Especially as you might have better insight into what was solved by certain updates. The change in v12.8.3, “Prevent a recursive loop when reporting current state [Miguel Casqueira]”, sounds like it could be a solved performance issue.

Hi! Thanks for your reply!

After upgrading our fleet to supervisor version 12.8.3, the newest at the time, the exception occurs about half as often as before.
Another test we made was running a CPU load/benchmark on the balena device. Weirdly, the exception count didn’t rise as one would expect, especially given that increasing the CPU clock seems to lower it.

I will stop the supervisors in our development environment to test this. Btw, what functionality is lost when the supervisor is stopped?

If you stop the Supervisor, any updates made to the application (new releases) or to device configurations will not be applied. Additionally, actions such as restarting, shutting down and purging data, as well as streaming of logs and device metrics, will be halted. The device will otherwise run normally for your end user; you essentially just lose admin control, which you can regain by starting the Supervisor again (note that you do not lose VPN access, which allows you to SSH to the device).

Interesting that you mentioned 12.8.3 showed fewer exceptions, because in that version we resolved a condition that produced CPU spikes quite frequently. See New CPU usage calculation causes cpu spikes only engine sees · Issue #1673 · balena-os/balena-supervisor · GitHub for that issue, which was resolved in 12.8.3.

I also created Optimize state diff calculation / reporting · Issue #1716 · balena-os/balena-supervisor · GitHub for more CPU optimizations to make. Let us know how your test goes after stopping the Supervisor and we can continue to work on this together :slight_smile:

Hey @crzdg, just wanted to ping on this issue. Especially interested in knowing if the packet corruption disappears or is otherwise affected by stopping the supervisor. Any news?