balenaFin Reset

Hello,

We have an application running on the balenaFin and have deployed it to several balenaFin gateways in the field. All gateways are running balenaOS 2.46.1+rev3 or 2.51.1+rev1.

The gateways are always powered, yet all of them eventually reset, typically within a few hours to a few days at most. Looking at the fleet in balenaCloud, every online gateway shows an uptime of anywhere from a few minutes to one day.

Our application is very simple: it periodically scans to discover Bluetooth devices, pulls data from the devices (if needed), and sends the data to the cloud.

I was able to capture the log output from my desk gateway when a reset occurred, using the serial UART pins on the 40-pin HAT header. Unfortunately, there is very little useful information in the log. Here is a portion of the output from around the time of the reset.

[16117.615955] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[16432.622568] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[16747.630134] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[17044.981471] NET: Registered protocol family 38
[17045.014193] cryptd: max_cpu_qlen set to 1000
[17062.649126] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[17377.652784] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[17623.855779] i2c i2c-3: sendbytes: NAK bailout.
[17623.860313] leds pca963x:green: Setting an LED's brightness failed (-5)
[17691.348509] i2c i2c-3: sendbytes: NAK bailout.
[17691.353001] leds pca963x:blue: Setting an LED's brightness failed (-5)
[17692.695421] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[18008.952954] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
MMC:   mmc@7e202000: 0, mmcnr@7e300000: 1
Loading Environment from FAT... WARNING at drivers/mmc/bcm2835_sdhost.c:408/bcm2835_send_command()!
WARNING at drivers/mmc/bcm2835_sdhost.c:408/bcm2835_send_command()!
*** Warning - bad CRC, using default environment

In:    serial
Out:   serial
Err:   serial
Net:   No ethernet found.
WARNING at drivers/mmc/bcm2835_sdhost.c:408/bcm2835_send_command()!
WARNING at drivers/mmc/bcm2835_sdhost.c:408/bcm2835_send_command()!
switch to partitions #0, OK
mmc0(part 0) is current device
Scanning mmc 0:1...
Found U-Boot script /boot.scr
437 bytes read in 1 ms (426.8 KiB/s)
## Executing script at 02400000
Scanning mmc usb devices 0 1 2
24 bytes read in 2 ms (11.7 KiB/s)
Found resin image on mmc 0
Loading resinOS_uEnv.txt from mmc device 0 partition 1
** Unable to read file resinOS_uEnv.txt **
Loading bootcount.env from mmc device 0 partition 1
** Unable to read file bootcount.env **
No bootcount.env file. Setting bootcount=0 in environment
9439392 bytes read in 410 ms (22 MiB/s)
Kernel image @ 0x080000 [ 0x000000 - 0x9008a0 ]
## Flattened Device Tree blob at 2eff9300
   Booting using the fdt blob at 0x2eff9300
   reserving fdt memory region: addr=0 size=1000
   Using Device Tree in place at 2eff9300, end 2f002f3b

Starting kernel ...

[    0.000000] Booting Linux on physical CPU 0x0
[    0.000000] Linux version 4.19.75 (oe-user@oe-host) (gcc version 8.3.0 (GCC)) #1 SMP Thu Jun 4 14:34:24 UTC 2020
[    0.000000] CPU: ARMv7 Processor [410fd034] revision 4 (ARMv7), cr=10c5383d
[    0.000000] CPU: div instructions available: patching division code
[    0.000000] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing instruction cache
[    0.000000] OF: fdt: Machine model: Raspberry Pi Compute Module 3 Plus Rev 1.0
[    0.000000] Memory policy: Data cache writealloc
[    0.000000] cma: Reserved 8 MiB at 0x3dc00000
[    0.000000] random: get_random_bytes called from start_kernel+0xb0/0x4b8 with crng_init=0
[    0.000000] percpu: Embedded 17 pages/cpu s39564 r8192 d21876 u69632
[    0.000000] Built 1 zonelists, mobility grouping on.  Total pages: 253242
[    0.000000] Kernel command line: coherent_pool=1M bcm2708_fb.fbwidth=656 bcm2708_fb.fbheight=416 bcm2708_fb.fbdepth=16 bcm2708_fb.fbswap=1 smsc95xx.macaddr=B8:27:EB:2F:F0:D2 vc_mem.mem_base=0x3f000000 vc_mem.mem_size=0x3f600000  dwc_otg.lpm_enable=0 console=tty1 console=ttyAMA0,115200 rootfstype=ext4 rootwait root=UUID=ba1eadef-e193-4d7e-a9c0-2ceb396ff21e rootwait
[    0.000000] Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
[    0.000000] Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
[    0.000000] Memory: 983640K/1021952K available (8192K kernel code, 655K rwdata, 2372K rodata, 7168K init, 828K bss, 30120K reserved, 8192K cma-reserved)
[    0.000000] Virtual kernel memory layout:
[    0.000000]     vector  : 0xffff0000 - 0xffff1000   (   4 kB)
[    0.000000]     fixmap  : 0xffc00000 - 0xfff00000   (3072 kB)
[    0.000000]     vmalloc : 0xbe800000 - 0xff800000   (1040 MB)
[    0.000000]     lowmem  : 0x80000000 - 0xbe600000   ( 998 MB)
[    0.000000]     modules : 0x7f000000 - 0x80000000   (  16 MB)
[    0.000000]       .text : 0x(ptrval) - 0x(ptrval)   (9184 kB)
[    0.000000]       .init : 0x(ptrval) - 0x(ptrval)   (7168 kB)
[    0.000000]       .data : 0x(ptrval) - 0x(ptrval)   ( 656 kB)
[    0.000000]        .bss : 0x(ptrval) - 0x(ptrval)   ( 829 kB)
[    0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=4, Nodes=1
[    0.000000] ftrace: allocating 28894 entries in 85 pages
[    0.000000] rcu: Hierarchical RCU implementation.
[    0.000000] NR_IRQS: 16, nr_irqs: 16, preallocated irqs: 16
[    0.000000] arch_timer: cp15 timer(s) running at 19.20MHz (phys).
[    0.000000] clocksource: arch_sys_counter: mask: 0xffffffffffffff max_cycles: 0x46d987e47, max_idle_ns: 440795202767 ns
[    0.000006] sched_clock: 56 bits at 19MHz, resolution 52ns, wraps every 4398046511078ns
[    0.000022] Switching to timer-based delay loop, resolution 52ns
[    0.000271] Console: colour dummy device 80x30
[    0.000917] console [tty1] enabled

[ ...and a lot more startup messages after this... ]
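For reference, I captured the console with a 3.3 V USB-serial adapter on the HAT header’s UART pins at 115200 baud, which matches the console=ttyAMA0,115200 setting on the kernel command line. It was roughly this (the /dev/ttyUSB0 path depends on your adapter and system):

```shell
PORT=/dev/ttyUSB0    # USB-serial adapter; the device path is an assumption
if [ -c "$PORT" ]; then
    stty -F "$PORT" 115200 raw -echo    # 115200 8N1, raw mode, no local echo
    cat "$PORT" | tee fin-serial.log    # mirror the console to a file
fi
```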

Since this reset happens at seemingly random times, do you have any suggestions on steps I could take to figure out what’s going on? I have run “df” to check disk usage, and we seem to have plenty of space.
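For completeness, this is essentially the check I ran from the host OS shell (I’m assuming the application data lives on the partition mounted at /mnt/data, which is where balenaOS keeps container data):

```shell
# Free space on the root filesystem:
df -h /
# ...and on the balenaOS data partition, if it is mounted:
if [ -d /mnt/data ]; then
    df -h /mnt/data
fi
```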

Probably the best thing you can do is run the diagnostics (from the Diagnostics tab on the dashboard) as soon as you notice a device showing as offline. In the past this could be caused by VPN disconnections, but we improved the online-status detection a few months back to avoid that. Do you know for a fact that the device is rebooting?

If you have the diagnostics file, just attach it to this thread; it will contain a bunch of information that may be relevant to finding out what is going on with these devices.

You asked if I’m sure the device is rebooting. I’m not certain what is going on, but from the log I posted in my original message, it looks like the system is completely resetting. Am I missing something?

Here is a diagnostics log file from a gateway in the field that just reset. Since this is a unit in the field, it’s possible (though unlikely) that the end user power-cycled the gateway. If I can catch my local gateway resetting again, I’ll run diagnostics on it immediately afterward and upload the log.

9e8e67c48547d8992230521659c0308c_2020.08.14_15.14.25+0000.log (775.1 KB)

Also, if nothing else, I can try commenting out sections of our code until the resets stop occurring. The problem is that it can take anywhere from several hours to a day or two for a reset to occur.

Sorry, I misread where the logs came from (serial) and thought they might be persistent logging plus journald. I’ve scanned through the diagnostics, and the only thing that looks out of place is a balena-engine crash. That alone should never cause a reboot, so either there’s a serious bug in the engine that your app is somehow exposing, or there’s a deeper cause that triggers both the balena-engine crash and the reboot. Unfortunately the logs didn’t capture the reason for the crash, only the tail end of the stack trace, so if you could grab the diagnostics again we may be able to see the cause. In the meantime I’ll ask our engine maintainer to take a look, but he’s out until Monday.

Thank you - hopefully I’ll be able to catch another crash and run diagnostics on my local gateway by Monday.

My gateway just reset again, and I was able to capture the serial log and run the device diagnostics from balenaCloud within 1-2 minutes of the reset. Both are attached.

The serial log also shows the output of “df” from about 45 minutes before the reset, as well as the “df” output from after the reset completed.

I also ran the Device Health Checks and all tests succeeded - no errors or warnings.

Thanks for your help!

cb894b2e8e19589cde7ae7762b284bce_2020.08.14_17.52.55+0000.log (888.8 KB)
Gateway_Reset_Serial_Log.log (31.2 KB)

Hi there – thanks for the additional info. I’ll pass this on to our engineers, and we hope to have more info for you shortly.

All the best,
Hugh

Hi there – would you be able to grant access to one of the devices that is experiencing this?

Additionally, I’m not sure which version of the balenaFin you’re using, but I wanted to make sure you’re aware of the recall we’ve issued for version 1.1; you can read more about that here.

All the best,
Hugh

Hi Hugh,

Yes, I just granted access to one of our devices. Let me know if you need any information to access it.

Thanks for the information on the recall. I am running a v1.1.0 board, but I don’t think this is related to the recall issue.

Damon

Hello Damon,

Thanks for granting device access. We will need the full device UUID to gain access. If you prefer not to post it in this forum, I can send you an e-mail which you can reply to.

Alan

Ok, please send an email to “dstewart@parsyl.com” and I’ll send you the UUID.

Thanks, we have the UUID now. However, it looks like the device is currently in local mode which will need to be turned off for us to gain access. Let us know if it is OK to do that.

My bad - I just took it out of local mode.

Are you able to access the gateway?

Hi, yes. This looks pretty strange. Just to check: are you seeing this reboot problem on multiple units, or just on one? I want to make sure it’s not a hardware issue on a particular board, since we have a lot of Fins in production and this is the first time I’ve encountered this problem.

To clarify: did you also check that the devices shown as restarted actually were restarted, rather than just reconnecting to the VPN? If “uptime” on the device reports the same duration as the online time in the dashboard, then the units were indeed restarted.
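On the device’s host OS shell, something along these lines would show the kernel’s own view of the last boot (the exact output will of course differ per device):

```shell
# Boot timestamp as reported by procps:
uptime -s
# Seconds since boot, straight from the kernel:
awk '{ printf "up %.0f seconds\n", $1 }' /proc/uptime
```

If that figure matches the dashboard’s “online for” time, the device genuinely rebooted rather than merely dropping its VPN connection.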

Yes, this is a problem on all of our balenaFins. The balena dashboard shows that ALL of our online gateways, across various facilities, have been online for at most a few (five) hours. And yes, I am quite confident the gateways actually reset; see the serial log file I attached earlier (Gateway_Reset_Serial_Log.log) for details.

OK. Can you describe the hardware setup you are using?

Sure. We are using the balenaFin v1.1.0. Our devices have a Raspberry Pi Compute Module 3+, and most have a Quectel EC25-A cell modem. The solution supports Ethernet, Wi-Fi, and cellular connectivity.

Do you have anything else connected to the board, over serial, SPI, or another port?