UART Failure after upgrading to 2.71.3+rev1 from 2.50.1+rev1 on Variscite IMX8M

We recently upgraded our lab devices to 2.71.3+rev1 (after recommendations from this thread: "Chronyc config is bad if device comes online without internet - #56 by adamshapiro0").

Unfortunately this has caused our UART to become unstable over long periods of time. Our device uses a 1.5 Mbps UART on ttymxc2 to communicate with another MCU on the same PCB, and these devices normally run for days without issues (no CRC errors, etc.).

After upgrading to 2.71.3+rev1, the devices (two we’ve tested) fail after several hours. The failure mode is that they completely stop receiving any data on the serial port. Restarting the container without power cycling the device DOES resume the stream of data, which is interesting but obviously not a useful solution in production.

There are no entries in dmesg regarding the UART, and beyond that we’re not really sure how to debug further from here. We’d be happy to grant access to a device and would appreciate support on this issue, as we now NEED to upgrade the base OS but simply cannot, since this serial issue is catastrophic for our application.
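For reference, we were looking at the kernel log with nothing fancier than dmesg plus a grep roughly like this (the pattern is just what seemed relevant, not exhaustive):

dmesg | grep -iE 'ttymxc|uart|serial'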

Note that we first encountered this issue with the 2.63 release and spent some time trying to figure it out, but upgrading was not critical at that point so we held off and continued using 2.50. So we know that the issue appeared some time between 2.50 and 2.63.

FYI @acostach and @alexgg in case you didn’t see this thread.

I did some work to replicate the issue. Over the course of the day yesterday I was able to capture it 10 times. It seems to happen somewhat randomly. The longest I was able to run before hitting it was ~1.5 hours, and the shortest was just 1 minute.

I captured the journalctl output each time. The only thing that is consistent, and which appears every single time just before the data stops arriving, is what looks like a wifi scan by NetworkManager:

Mar 20 00:53:49 40fa62f NetworkManager[1093]: <info>  [1616201629.4267] device (wlan0): set-hw-addr: set MAC address to E6:01:44:CA:8C:6C (scanning)
Mar 20 00:53:49 40fa62f kernel: IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
Mar 20 00:53:49 40fa62f NetworkManager[1093]: <info>  [1616201629.4375] device (wlan0): supplicant interface state: inactive -> interface_disabled
Mar 20 00:53:49 40fa62f NetworkManager[1093]: <info>  [1616201629.4380] device (wlan0): supplicant interface state: interface_disabled -> inactive

Mar 20 00:53:52 40fa62f d600f32cd7a7[1107]: E20210320 00:53:52.340080    30 measurement_hub.cc:124]
Mar 20 00:53:52 40fa62f d600f32cd7a7[1107]:
Mar 20 00:53:52 40fa62f d600f32cd7a7[1107]: ********************************************************************************
Mar 20 00:53:52 40fa62f d600f32cd7a7[1107]:
Mar 20 00:53:52 40fa62f d600f32cd7a7[1107]: No IMU data received in 2.90612 seconds.
Mar 20 00:53:52 40fa62f d600f32cd7a7[1107]:
Mar 20 00:53:52 40fa62f d600f32cd7a7[1107]: ********************************************************************************

There are no other common details, at least not in the journal output.

I have no clue how a wifi scan could affect the Docker containers’ ability to receive from the UART, or if it’s unrelated. It did happen 100% of the time though, so it’s very possible that it’s related.
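For reference, I was pulling the journal around each stall with something roughly like this (the time window is illustrative, matching the excerpt above) and then comparing the captures:

journalctl --since "2021-03-20 00:50:00" --until "2021-03-20 00:56:00" --no-pager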

Do you guys have any ideas, or any thoughts on things we can try to help track this down? As @anathan84 said, this is a major issue for our software so we are not able to upgrade the host OS until we resolve it.

Thanks,

Adam

Hi,

Restarting the container without power cycling the device DOES resume the stream of data,

Does this mean if the container app closes and re-opens the fd, the transfer is resumed?

The first thing you could check is whether there’s any other process, be it in the container or on the host, that has this file open. Then check whether it stalls when reading from the hostOS too; stty and cat should work for that.
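For example, something along these lines from the hostOS should be enough (same device node as in your setup; the output file is just somewhere to watch the byte count, and raw is only a shorthand, so match your application’s settings if they differ):

# check whether any other process, in a container or on the host, still has the port open
lsof /dev/ttymxc2

# read directly from the hostOS and watch whether the byte count keeps growing
stty -F /dev/ttymxc2 1500000 raw -echo
cat /dev/ttymxc2 > /tmp/uart-test.bin &
watch -n1 ls -l /tmp/uart-test.bin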

Yes, if we close and reopen the container, the fd works again for some time.

There are no other processes using that port from our side (the software we run on both versions of balenaOS is the same).

We’ll get back to you with the results of reading from the hostOS via stty/cat. Obviously we’ll have to shut off our normally operating container, since it would otherwise have the port open.

Just to clarify, our application does not close and reopen the fd without restarting the container, so we don’t know if that would resume the data. We do know that restarting the container (and therefore closing the fd) without restarting the device does resolve the issue.
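If it turns out to be useful, one way we could test the reopen-only case without restarting the container would be something like this (assuming coreutils timeout is available in our container; the output path is just illustrative):

# close and reopen the fd every 60 s without touching the container; if the file
# resumes growing after a reopen, the stall is tied to the open fd itself
while : ; do timeout 60 cat /dev/ttymxc2 >> /data/uart-test.bin ; done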

I’m running a test now just doing a cat inside the container (I’m catting to a file and watching the file size, since I didn’t want to risk the console printing of the high-rate binary data affecting the test). Hopefully that fails eventually and this isn’t something we can only replicate via the fd being open inside our application. When it fails, I can check for other processes and then try killing it and starting it again without restarting the container. After that I’ll try the host OS. Unfortunately it takes a random amount of time to fail, so I don’t know when it’ll happen, or even if it will under these conditions, but I’ll let you know once it does.

@acostach I ran with cat and confirmed that we can replicate the issue both inside the container and in the host OS. Steps to replicate were as follows (the full sequence is also collected as a single block after the list):

  1. Set the baud rate to 1.5 Mbps.
    • In practice, the serial settings had already been set by our application, and since we didn’t restart the device they remained as-is, so we never actually ran this step. We confirmed the settings via stty:
    root@40fa62f:~# stty -F /dev/ttymxc2
    speed 1500000 baud; line = 0;
    min = 1; time = 0;
    -brkint -icrnl -imaxbel
    -opost
    -isig -icanon -iexten -echo
    
  2. Generate a 100% CPU load on one core: while : ; do : ; done
  3. Run cat writing into a file so we can watch for progress and stalls: cat /dev/ttymxc2 > /var/lib/docker/volumes/1623312_resin-data/_data/garbage
  4. Watch until the file size stops changing: watch -n1 ls -l /var/lib/docker/volumes/1623312_resin-data/_data/garbage
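For convenience, the same sequence strung together (device node and volume path exactly as above; the serial settings line matches what stty reported, and the CPU load loop goes in a separate shell):

# step 1: serial settings (already set by our application in our case)
stty -F /dev/ttymxc2 1500000 cs8 -brkint -icrnl -imaxbel -opost -isig -icanon -iexten -echo

# step 2, in a separate shell: 100% load on one core
while : ; do : ; done

# steps 3 and 4: read into a file and watch until the size stops changing
cat /dev/ttymxc2 > /var/lib/docker/volumes/1623312_resin-data/_data/garbage &
watch -n1 ls -l /var/lib/docker/volumes/1623312_resin-data/_data/garbage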

I ran 3 trials:

  1. Run steps 3 and 4 inside the Docker container with no CPU load (no step 2)
    • Ran for 6 hours continuously with no issue
  2. Run steps 2-4 inside the Docker container (i.e., add the CPU load to mimic our application doing work – our application doesn’t use 100% CPU but this is a reasonable proxy)
    • UART died after ~1.5 hours
    • Confirmed via lsof in host OS that no other processes had /dev/ttymxc2 open at the time, just the one copy of cat
    • Also confirmed that running another cat /dev/ttymxc2 produced nothing. When it was running normally, if we ran a second cat (both inside the container and in the host OS), it would output data and the first cat would stop temporarily; when we closed the second cat, the first one would start outputting again
    • Closing the first cat and restarting it resolved the issue immediately
  3. Stop all Docker containers (except the supervisor) and run steps 2-4 in the host OS
    • UART died after ~3 hours
    • Again confirmed no open files via lsof and confirmed second cat didn’t work
    • Closing the first cat and restarting it resolved the issue immediately

So we now know that it can be replicated in the host OS and isn’t just a Docker thing. We also know that it seems to require a CPU load, at least for the failure to happen in a reasonable amount of time.

I do not see any wifi scans in the journal around the time of the failures. (In fact, I haven’t seen any wifi scans at all since March 20 – we were having some network trouble at my house that day, so it’s possible that has since resolved and the connection is now steady.)

Any ideas on where to look next? We’re at a loss for what else to do to debug this and could definitely use some help.

Thanks,

Adam

Hi, thanks for the comprehensive explanation above. It is especially useful to know it can be easily replicated on the hostOS. I suspect a driver problem, and it would be useful to reproduce this on a Variscite development kit running the same kernel version as balenaOS. Do you have a Variscite development kit for this board?

We do have that. Can you identify the exact kernel version that you are using in balenaOS so we can make sure we’re doing the right thing?

Hi, could you please confirm whether your board is a Variscite DART-MX8M or a Variscite DART-MX8M Mini? As far as I can tell, only the Variscite DART-MX8M Mini has a 2.71.3+rev1 release in production. In that case, the kernel version is 4.19.35.

Please let us know if that is not the board you are using.

We are using the MX8M-Mini.

We ran a test using the Variscite recovery image with kernel version 4.19.35, on our device rather than the Variscite development kit, to avoid changing two variables at once.

Confirming the kernel version:
root@imx8mm-var-dart:~# uname -r
4.19.35-imx8mm+g8aeff32a1a1f

Just as @adamshapiro0 did above, I ran the following:
stty -F /dev/ttymxc2 1500000 cs8 -brkint -icrnl -imaxbel -opost -isig -icanon -iexten -echo

And loaded one CPU core with this command in another session:
while : ; do : ; done
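The read side was the same cat-into-a-file plus watch combination as in the steps above; roughly (the output path is just whatever was convenient on the recovery image):

cat /dev/ttymxc2 > /tmp/garbage &
watch -n1 ls -l /tmp/garbage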

After 12 hours of running, I’m unable to replicate the same issue we saw repeatedly on the balena-based BSP.

I’m open to suggestions on what to try next. One idea we had was to try shutting down all balena services on balena’s BSP, but we’d need some assistance with how to do that properly.

@adamshapiro0 are you using the MX8M-Mini too or the MX8M as you mentioned in the ticket you raised here?

@anathan84 you can either stop all containers from the dashboard or temporarily move your device to another empty application, one that has no releases pushed.
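Either way, you can double-check from the hostOS that nothing application-side is left running using the engine CLI:

# lists any containers still running; with everything stopped, only the supervisor should remain
balena ps    # or: balena-engine ps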

@acostach all of our devices are MX8M-Mini, that was a typo above.

Trial 3 from my note last week was as you suggested: I stopped all of our containers in the dashboard and ran both cat and the dummy CPU load from the host OS. We have not tried an empty application, but can do that too.

Since I already tested with no containers running though, we were wondering if it might be caused by an interaction with the supervisor or some other Balena service that we can’t stop from the dashboard. That’s what @anathan84 was curious about - how to stop the supervisor and anything else that might be running in the background.

Thanks for confirming, @adamshapiro0. You can stop any service with systemctl, but I don’t think stopping the supervisor will have any impact.
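If you do want to take the supervisor out of the picture for a test, something like this from the hostOS should do it (the unit name varies between OS versions, so check it first):

# find the exact unit name (resin-supervisor on older releases, balena-supervisor on newer ones)
systemctl list-units | grep -i supervisor

# stop it for the duration of the test, then start it again afterwards
systemctl stop <supervisor-unit>
systemctl start <supervisor-unit>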

I’ve started running tests on our side with the MX8M Mini Devkit and the 2.71.3 image. So far it’s been running for almost 4 hours and transferred about 1.1G. Hasn’t failed yet but will let it run until it fails.

Ok, that sounds good. I’m running my device the same way as before, with all containers shut down, to try to replicate it again. After that I’ll try moving it to an empty application.

Replicated after ~1.5 hours (not sure when it stopped exactly), ~56 MB collected. Will try an empty application next.

Edit: more likely ~30 minutes, assuming 56 MB @ ~33 kB/s.

Confirmed: with the empty application, the file stopped updating after ~5 minutes (16 MB @ ~33 kB/s).

FYI I ran it again with the empty application and it died after about an hour (~124 MB).