UART Failure after upgrading to 2.71.3+rev1 from 2.50.1+rev1 on Variscite IMX8M

@adamshapiro0 I’ve been running the test for almost a day on the Devkit, it has transferred about 7GB and it hasn’t failed yet. I’m running on an empty App, with one instance of the cpu load script:

root@00d8c8f:~# cat /etc/os-release 
ID="balena-os"
NAME="balenaOS"
VERSION="2.71.3+rev1"
VERSION_ID="2.71.3+rev1"
PRETTY_NAME="balenaOS 2.71.3+rev1"
MACHINE="imx8mm-var-dart"
VARIANT="Development"
VARIANT_ID="dev"
META_BALENA_VERSION="2.71.3"
RESIN_BOARD_REV="9925f5d"
META_RESIN_REV="15117bd8"
SLUG="imx8mm-var-dart"

root@00d8c8f:~# date && uptime && ls -l /mnt/data/ttydata.bin 
Tue Mar 30 07:15:02 UTC 2021
 07:15:02  up  22:28,  1 user,  load average: 1.23, 1.23, 1.19
-rw-r--r-- 1 root root 7234537880 Mar 30 07:15 /mnt/data/ttydata.bin

root@00d8c8f:~# date && uptime && ls -l /mnt/data/ttydata.bin 
Tue Mar 30 07:15:16 UTC 2021
 07:15:16  up  22:28,  1 user,  load average: 1.19, 1.22, 1.19
-rw-r--r-- 1 root root 7235731688 Mar 30 07:15 /mnt/data/ttydata.bin

root@00d8c8f:~# date && uptime && ls -l /mnt/data/ttydata.bin 
Tue Mar 30 07:21:34 UTC 2021
 07:21:34  up  22:34,  1 user,  load average: 1.10, 1.14, 1.16
-rw-r--r-- 1 root root 7270151250 Mar 30 07:21 /mnt/data/ttydata.bin

root@00d8c8f:~# stty -F /dev/ttymxc2 
speed 1500000 baud; line = 0;
min = 1; time = 0;
-brkint -icrnl -imaxbel
-opost
-isig -icanon -iexten -echo

The SOM I’m using has the code 1933 and the carrier board is VAR-DT8MCustomboard V1.4. Are you using the same setup?

I’ve been doing my testing on our custom device, which is based on their carrier board and uses the DART-MX8M-Mini SOM. I don’t currently have a carrier board I can test on but that is what @anathan84 was using to test Variscite’s OS image above.

Once he’s around in a bit we can image that board with the Balena image and empty application and run the same test that I ran on our device.

@acostach how are you generating your incoming UART data and what is the generated data rate? Is your device connected to wifi? Mine is connected to both Ethernet and wifi currently. The DART bt+wifi module uses UART4, so one possibility is that it somehow interferes with the other UARTs when it does transactions or wifi scans. I can try testing mine with the wifi disconnected.

@adamshapiro0 my devkit is connected through wifi only. I’m sending the same generated string in a loop, from my PC:

char *sendstr = "H12130192839128398-38394809328490304948-31-1923-01293-019230192309-01293
-01930912-30919-0322093-01237392837283719283792137217398219837982179387398712937237982173
98217938712983721983712983721983723792873283798237283789217398123732198729183791283791283
71928371928377772831237287398217392173921739823791283721837283798217398278934703243248327
49832749328472938473298473298473287943874371490714093874032174317493208741487087314392087
47431873204873215672497564159723748327493802749382749321486713298546719324739208473984709
38474879320803473487014879307849301744879321487348787348793084793084731908474873920847148
74873190487314837194871390813749837141479362143786479321479216347632149786314197463214673
48937489037148956107890163290498748512459862137467631274693864357651093789267187765789645
767yuhghbvbhhvjoyugyixrdhuhiuy3274y312n73y7wefy7wq0e8fyef8qwehufrhv38u21gf8hrvhfdgaudoq3u
fh49371fhkjvchsaiyvg32o8yfgdcghsdjvgdiqvqhwovireygvpiqgvviwoeygviourvgewfivgerivohijchsdu
ifwhvuihuewvuihewriurhviewurhvurewivhup2g301fg48yfg13fg31yrviurvgiro28yvrgfiwvgofevqhiceh
quvfegvhwfoiohvco2igvyu8regvyiergvruihfjvhuisw0ruivriuvhsdvyuigsuiherruivhrqevuivuhwpeqru
hvpiwrehvewprhvrephvrp2iuhvgp23riuhvruihverueoivh2rpvhwpfivuhewpuihv2rpuivhruevipurhevupi
hwevuf31ygveuyrvhpeuhvu2egv8yprihg2rpuvhgrvp92hvpihuvhfevhv2pehrvrvurhgpeviur2hevp9r1hu2v
2chrhvuriepurgfh2u[wnecpwqjvr1oh[fouh[ovhio[eqvihpwhvufihvwpufiwhvhspdcqhdudwqhpwqohdpofd
hqfqphdjfhpqwoeuhfqwoihpefewqoifwpofhqvriuvhjdohvpexlo(\n";
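size_t len = strlen(sendstr);
ssize_t sent;

while (1) {
    sent = write(fd, sendstr, len);  /* fd: serial port opened and configured elsewhere */
    /* check write failed, etc omitted */
    tcdrain(fd);   /* block until all queued output has been transmitted */
    usleep(100);   /* short pause between repetitions */
}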

It’s still running:

root@00d8c8f:~# date && uptime && ls -l /mnt/data/ttydata.bin 
Tue Mar 30 13:11:32 UTC 2021
 13:11:32  up 1 day  4:24,  1 user,  load average: 1.47, 1.21, 1.12
-rw-r--r-- 1 root root 9179786570 Mar 30 13:11 /mnt/data/ttydata.bin
root@00d8c8f:~# date && uptime && ls -l /mnt/data/ttydata.bin 
Tue Mar 30 13:11:33 UTC 2021
 13:11:33  up 1 day  4:24,  1 user,  load average: 1.47, 1.21, 1.12
-rw-r--r-- 1 root root 9179893369 Mar 30 13:11 /mnt/data/ttydata.bin

Ok interesting. I wouldn’t think it would matter, but just for reference we’re sending arbitrary binary data so it includes non-ASCII characters. Our MCU is outputting data at roughly 33 kB/s.
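For comparison with the 33 kB/s figure, the average receive rate of the devkit test can be estimated from the `ls -l` samples posted above. The sketch below is plain bash integer arithmetic. The function names are made up for illustration, and the figure of 10 bits on the wire per byte assumes 8N1 framing (one start bit, eight data bits, one stop bit), which the thread does not state explicitly.

```shell
#!/bin/bash
# Average receive rate in bytes per second between two file-size samples.
# Usage: bytes_per_sec <size_before> <size_after> <elapsed_seconds>
bytes_per_sec() {
    echo $(( ($2 - $1) / $3 ))
}

# Upper bound on payload throughput for a given baud rate.
# Assumption: 10 bits on the wire per byte (8N1 framing).
uart_max_bytes_per_sec() {
    echo $(( $1 / 10 ))
}

# Samples taken from the console output earlier in this thread:
bytes_per_sec 7234537880 7235731688 14     # 07:15:02 -> 07:15:16, prints 85272
bytes_per_sec 7235731688 7270151250 378    # 07:15:16 -> 07:21:34, prints 91057
uart_max_bytes_per_sec 1500000             # prints 150000
```

With those samples the devkit was receiving roughly 85 to 91 kB/s, which is well above the 33 kB/s produced by the MCU and below the 150 kB/s ceiling of a 1.5 Mbaud link under the 8N1 assumption. The timestamps only have one-second resolution, so the short 14-second window is the less precise of the two estimates.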

Just a quick update: I did another test on my device with the empty application, this time with wifi disconnected, and it died again after 15 minutes. I still need to get the dev board set up with the Balena image to test that way.

@adamshapiro0, @acostach, what is the status of this investigation? I understand that there was a suggestion that @acostach could repeat his tests using binary (non-ASCII) data at a rate around 33 kB/s, while @adamshapiro0 mentioned getting a dev board set up with the balena image. Did you get a chance to make some progress?

I ask because this issue is now blocking another, unrelated issue (a paid support thread) concerning an overlayfs bug in certain combinations of balenaEngine and kernel versions. That bug is expected to be fixed in newer balenaOS versions, but this UART issue prevents upgrading balenaOS to pick up the fix.

Thanks!

Hi @pdcastro,

On our end, we still need to do the dev board + Balena image test. I have a new dev board coming but it is currently delayed in shipping. I’m hoping to have it available to test with this weekend.

This ticket is actually a blocker for a few of our issues (the overlayfs issue as well as the Chrony issue), so we would definitely like to resolve it as soon as possible.

Hi Adam, I’ve run several tests ranging from 10 to 24 hours each, continuously reading data from /dev/urandom on my PC and then sending it at various rates: 35, 80 and 300 kB/s.
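A paced random-data sender along these lines can be sketched in bash. This is not the script used for the tests described here, only an illustration: `send_random_paced` is a made-up name, the serial port is assumed to be already configured (for example with `stty`), and pacing with `sleep 1` is only approximate.

```shell
#!/bin/bash
# Write <rate> random bytes to <out> once per second, <seconds> times.
# Usage: send_random_paced <out> <rate_bytes_per_sec> <seconds>
# Assumption: <out> is a serial device that has already been configured,
# e.g. a USB-to-TTL adapter on the PC (a regular file also works for a dry run).
send_random_paced() {
    local out=$1 rate=$2 seconds=$3 i
    for (( i = 0; i < seconds; i++ )); do
        head -c "$rate" /dev/urandom >> "$out"
        sleep 1
    done
}

# Example (hypothetical device path): nominally 35 kB/s for 10 hours
# send_random_paced /dev/ttyUSB0 35000 36000
```

Because the time spent writing each chunk adds to the one-second sleep, the real rate ends up somewhat below the nominal one. A sender that needs an exact rate would have to schedule against a clock instead.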

Still hasn’t failed on the devkit. I’ve also kept one core loaded as you do, with no application pushed, and read the data with cat:

root@bebe144:~# cat script.sh

#!/bin/bash
stty -F /dev/ttymxc2 1500000 cs8 -brkint -icrnl -imaxbel -opost -isig -icanon -iexten -echo
cat /dev/ttymxc2 > /mnt/data/randomdata.bin &

Sending at 35kB/s:

root@bebe144:~# uptime && ls -l /mnt/data/randomdata.bin
 09:35:21  up 1 day  3:46,  1 user,  load average: 2.34, 2.40, 2.36
-rw-r--r-- 1 root root 2903136019 Apr 20 09:35 /mnt/data/randomdata.bin
root@bebe144:~# uptime && ls -l /mnt/data/randomdata.bin
 09:35:24  up 1 day  3:46,  1 user,  load average: 2.31, 2.39, 2.35
-rw-r--r-- 1 root root 2903209163 Apr 20 09:35 /mnt/data/randomdata.bin
root@bebe144:~# uptime && ls -l /mnt/data/randomdata.bin
 09:35:33  up 1 day  3:47,  1 user,  load average: 2.21, 2.37, 2.35
-rw-r--r-- 1 root root 2903458068 Apr 20 09:35 /mnt/data/randomdata.bin

Apart from the devkit, is it reproducible on your side with random data instead of what the microcontroller is sending?

Hi @acostach,

I finally got my new dev board but I am having trouble imaging it for some reason - it got through provisioning on the dashboard, but then after I booted it from internal flash and got to the “going to reboot” step it just never rebooted, and power cycling just seems to keep coming to the same “going to reboot” message. Need to figure out what’s going on there so I can test with it.

That aside, unfortunately the MCU UART is hard wired into the IMX8 so we can’t replace its data stream with an external one using the same IMX8 UART. We do have external hookups for other IMX8 UARTs so we can give those a shot.

My current hypothesis is that it could be a floating flow control pin or something like that on the board, though we have not found anything yet (and our design is based directly on the Variscite reference design). Even though flow control is disabled, it’s possible there is an issue in the kernel driver causing weird behavior. If that’s the case, this might be specific to the MCU UART so we might not be able to replicate it with the alternate UARTs. We’ll try them though.
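Whether hardware flow control is really off on the port can be checked from the `stty -a` output, where `crtscts` means RTS/CTS handshaking is enabled and `-crtscts` means it is disabled. The helper below is a minimal sketch with a made-up name. It only reports the termios setting and says nothing about the electrical state of the pins.

```shell
#!/bin/bash
# Read `stty -a` output on stdin and report the RTS/CTS flow-control setting.
# Prints "enabled", "disabled" or "unknown".
flow_control_state() {
    local flags
    # Put every whitespace- or semicolon-separated word on its own line.
    flags=$(tr -s ' \t;' '\n')
    if grep -qx -- 'crtscts' <<< "$flags"; then
        echo enabled
    elif grep -qx -- '-crtscts' <<< "$flags"; then
        echo disabled
    else
        echo unknown
    fi
}

# Example on the device:
# stty -a -F /dev/ttymxc2 | flow_control_state
```

If this reports "disabled" and the link still stalls, the termios configuration itself is probably not the cause, which leaves the pin state and the driver as candidates.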

Hi Adam,

It’s not clear why the board would need to reboot though. On the Variscite iMX8M Mini devkit we have, provisioning works as indicated in the dashboard:

  1. Boot switch is in external position and flasher sd-card inserted
  2. Power up the board, wait for image to get flashed. When flashing is completed the device should power off and notify post-provisioning state in the dashboard. It shouldn’t reboot. The power switch can also be toggled to off once flashing is completed.
  3. Change boot switch back to internal, remove sd-card, power on the board.

The only reboot I can think of would be in the case where a hostOS update was performed.

I figured out what was happening with the reboots: I had left the SD card in the device when I switched to internal boot. The switch tells it to load uboot from emmc instead of SD, but apparently uboot itself actually ignores the switch and loads the flasher image from the SD card if it finds it no matter what. So basically, even though the board was configured for internal emmc boot, it was still redoing the flasher step over and over instead of booting the actual host OS image.

Yes, u-boot checks whether there’s an sd-card containing a flasher image and, if there is, boots it. Only the firmware that loads u-boot checks the boot pins to decide which storage to load u-boot from.

Hi @acostach and @pdcastro -

I finally got a setup where we are using the Variscite IMX8M dev board and just hooking up the UART to our device.

The setup is as pictured:

The jumper wires connect UART3 (ttymxc2) on the dev board directly to our MCU, which outputs data at 1.5 Mbps.

I tried two setups:

  • Running our software on the IMX8M mini with the 2.71.3+rev1 host image. Result: Data stoppage after about 1 hour.

  • Running nothing on the IMX8M except the HostOS and the Balena Supervisor. In the host, I ran:
    In console 1: cat /dev/ttymxc2 > /var/lib/docker/volumes/1623312_resin-data/_data/garbage
    In console 2: while : ; do : ; done
    In console 3: watch -n1 ls -l /var/lib/docker/volumes/1623312_resin-data/_data/

Result: Data stoppage after ~5 hours. Total data received is 429MB.
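To capture the moment of a stoppage without watching the listing by eye, a loop can compare the file size between samples and log when it stops growing. This is a minimal sketch, assuming `stat -c %s` is available on the host (GNU coreutils and BusyBox both provide it, but this was not verified on balenaOS). `wait_for_stall` is a made-up name.

```shell
#!/bin/bash
# Block until <file> has not grown for one full <interval>, then print
# the UTC time and the final size.
# Usage: wait_for_stall <file> <interval_seconds>
wait_for_stall() {
    local file=$1 interval=$2 prev cur
    prev=$(stat -c %s "$file") || return 1
    while sleep "$interval"; do
        cur=$(stat -c %s "$file") || return 1
        if [ "$cur" -eq "$prev" ]; then
            echo "$(date -u '+%Y-%m-%d %H:%M:%S') stalled at $cur bytes"
            return 0
        fi
        prev=$cur
    done
}

# Example, using the capture path from the test above:
# wait_for_stall /var/lib/docker/volumes/1623312_resin-data/_data/garbage 5
```

The interval should be long enough that a healthy stream always adds data between samples. Recording the timestamp makes it easier to line the stoppage up with dmesg output.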

As @adamshapiro0 mentioned, this is a total show stopper for us. Our application has a critical data link between the MCU and the IMX8 over this UART and in previous releases, this link is stable for literally days at a time.

Please advise on what we should try next.

(Also, I checked dmesg: it shows no events over this time except an occasional wlan0: link is not ready. The device is on the hardline, so I suppose this is expected.)
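Beyond dmesg, the kernel serial core keeps per-port counters that can show whether bytes are still arriving at the driver and whether overruns or framing errors were recorded. On i.MX kernels these are normally exposed in `/proc/tty/driver/IMX-uart`. That path and the line format handled below are assumptions based on the mainline serial core and were not confirmed on this balenaOS build. `uart_counter` is a made-up helper.

```shell
#!/bin/bash
# Extract one counter (rx, tx, fe, pe, brk, oe) from a serial-core status line.
# Prints 0 when the counter is absent; the kernel only lists error counters
# once they are non-zero.
# Usage: uart_counter "<status line>" <name>
uart_counter() {
    local line=$1 name=$2 field
    local -a fields
    read -r -a fields <<< "$line"
    for field in "${fields[@]}"; do
        case $field in
            "$name":*) echo "${field#*:}"; return 0 ;;
        esac
    done
    echo 0
}

# Example on the device (assumed path; line 2 corresponds to ttymxc2):
# line=$(grep '^2:' /proc/tty/driver/IMX-uart)
# echo "rx=$(uart_counter "$line" rx) oe=$(uart_counter "$line" oe)"
```

Sampling `rx` a few seconds apart after a stoppage would help tell whether the driver is still receiving bytes (which points at the reading side) or not (which points at the hardware or the driver).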

Hi Aaron,

Is the issue reproducible with the latest Yocto Dunfell official release v6.7 that has kernel 5.4.85 from Variscite’s website - dart-mx8mm-recovery-sd.v67.img.gz? I recommend flashing that image to the devkit eMMC so that it resembles the original setup. If it isn’t reproducible anymore with that image, we can look into updating to that yocto release.

Hey @acostach -

Quick update on our end. We haven’t yet been able to test with the latest Yocto, but I did have a follow-up question for you. In your test, were you using a 5V or 3.3V TTL adapter? Our system is 3.3V at 1.5 Mbps and I wanted to check whether that’s the same as yours.

We’ll be working in parallel on getting the Dunfell release tested this week.

Thanks,

Aaron

@anathan84 I used a 3.3V adapter.