Regarding the LED driver max frequency and error code, I will do some offline investigation and get back to you as soon as I have more info.
Sounds good.
One more LED note. Anytime I set the color / brightness of the LED, I am actually writing all 3 files sequentially:
FileOutputStream(BRIGHTNESS_PATH_RED).use { it.writeBrightness(red) }
FileOutputStream(BRIGHTNESS_PATH_GREEN).use { it.writeBrightness(green) }
FileOutputStream(BRIGHTNESS_PATH_BLUE).use { it.writeBrightness(blue) }
The BRIGHTNESS_PATHs are /sys/class/leds/pca963x:{blue, green, red}/brightness
So perhaps writing the three files sequentially, back-to-back is part of the problem?
Looks like the gateway you were monitoring just reset. And the “Florin” device seems to have reset also. It shows as online for only 2 hours.
Hi, the device did not reset. The uptime shows it’s been running for over 5 days. The online time only means the connectivity to the VPN was interrupted, so the running time and the online time are different things.
Okay, thanks for letting me know.
My local gateway reset again just now. Shortly before the reset, the temperature measured 47.8 C, so this doesn’t appear to be caused by temperature. I’m going to start commenting out code and see if I can narrow down the cause.
Also, I think I found why the gateway log showed some occasional LED failures. My application has a tick() function that sets the LED every 31 milliseconds. I discovered a bug today where the LED files were being written on every tick, even when the color had not changed. I changed that to only write the files if the LED color is changing, which significantly reduces the number of writes. Since making that change, I have not seen another LED failure in the log. However, I made this change earlier today and this latest gateway reset at my desk happened with that fix in place, so this suggests the LED is probably not related to the reset.
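For anyone wanting to try the same write-on-change fix, here is a minimal sketch of the idea. It is not the code from the application above: the LedWriter class and its set() method are hypothetical names, and it assumes the brightness files accept a plain decimal value written as text.

```kotlin
import java.io.File

// Hypothetical helper (not from the original application): remembers the
// last color written and skips the sysfs writes when nothing has changed.
class LedWriter(
    private val redPath: String,
    private val greenPath: String,
    private val bluePath: String
) {
    private var last: Triple<Int, Int, Int>? = null

    /** Returns true if the files were written, false if the color was unchanged. */
    fun set(red: Int, green: Int, blue: Int): Boolean {
        val next = Triple(red, green, blue)
        if (next == last) return false

        // Assumption: each brightness file takes a decimal value as text.
        File(redPath).writeText(red.toString())
        File(greenPath).writeText(green.toString())
        File(bluePath).writeText(blue.toString())

        // Only remember the color after all three writes succeeded, so a
        // failed write is retried on the next call.
        last = next
        return true
    }
}
```

Calling set() from a 31 ms tick then only touches the files when the requested color differs from the last one that was written successfully.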
Hi there, as promised, here is some more information about the LED error you experienced. Error -5 is defined as EIO 5 /* I/O error */, i.e. an input/output error. It appears as a negative value because kernel code returns negated error numbers to signal failure. As suspected, this is likely triggered by the LED animation updating at too high a frequency. Glad to read you have sorted that out already!
Thank you!
Hello,
I’ve noticed your balena fin running our application seems to be resetting regularly as well. I’m not sure if it’s the same reset condition that I am running into or not, but it has now been online for about 21 hours. I have checked this unit occasionally, and it has not been online for more than 3 days, to my knowledge.
Are you still looking at this unit? Do you know why it is resetting?
Hey @dstewart, how regularly is it resetting? And when you say resetting, is that a container/application restart or a full reboot? What source are you using to track the online hours: the dashboard or uptime?
Hi Shaun,
Our units crash / reset and a bunch of bad stuff is printed out the debug console. For example:
MMC: mmc@7e202000: 0, mmcnr@7e300000: 1
Loading Environment from FAT… WARNING at drivers/mmc/bcm2835_sdhost.c:408/bcm2835_send_command()!
WARNING at drivers/mmc/bcm2835_sdhost.c:408/bcm2835_send_command()!
*** Warning - bad CRC, using default environment
In: serial
Out: serial
Err: serial
Net: No ethernet found.
WARNING at drivers/mmc/bcm2835_sdhost.c:408/bcm2835_send_command()!
WARNING at drivers/mmc/bcm2835_sdhost.c:408/bcm2835_send_command()!
switch to partitions #0, OK
mmc0(part 0) is current device
Scanning mmc 0:1…
Found U-Boot script /boot.scr
437 bytes read in 2 ms (212.9 KiB/s)
Executing script at 02400000
Scanning mmc usb devices 0 1 2
24 bytes read in 1 ms (23.4 KiB/s)
Found resin image on mmc 0
Loading resinOS_uEnv.txt from mmc device 0 partition 1
** Unable to read file resinOS_uEnv.txt **
Loading bootcount.env from mmc device 0 partition 1
** Unable to read file bootcount.env **
No bootcount.env file. Setting bootcount=0 in environment
9439392 bytes read in 410 ms (22 MiB/s)
Kernel image @ 0x080000 [ 0x000000 - 0x9008a0 ]
Flattened Device Tree blob at 2eff9300
Booting using the fdt blob at 0x2eff9300
reserving fdt memory region: addr=0 size=1000
Using Device Tree in place at 2eff9300, end 2f002f3b
Starting kernel …
[ 0.000000] Booting Linux on physical CPU 0x0
[ 0.000000] Linux version 4.19.75 (oe-user@oe-host) (gcc version 8.3.0 (GCC)) #1 SMP Thu Jun 4 14:34:24 UTC 2020
[ 0.000000] CPU: ARMv7 Processor [410fd034] revision 4 (ARMv7), cr=10c5383d
…
–
I helped Balena support program a Balena fin with our application several days ago. That unit is called “Florin”. I don’t know what the Florin unit is doing, other than what I can see through the Balena console. That’s why I’m wondering if you are looking at the serial output or anything else to figure out if your unit is crashing like all of our units are.
I have watched the Florin unit occasionally, but have never seen the online time in the console get larger than 3 days. This morning, the Florin unit was listed as online for 1 day. I then connected to the Host OS console and ran the “uptime” command and it returned an uptime of 21:47 - about 1 day. So if no one there has physically reset the unit in the last 24 hours, then perhaps it is resetting like ours are in the field.
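Since the dashboard online time and the OS uptime measure different things, it may help to log the OS uptime from inside the application as well. The following is only a sketch: parseUptime and readUptime are hypothetical helper names, and it assumes a Linux host where /proc/uptime holds the uptime in seconds as its first field.

```kotlin
import java.io.File
import java.time.Duration

// Parses /proc/uptime-style text: "<uptime seconds> <idle seconds>".
// Only the first field (time since boot) is used.
fun parseUptime(procUptime: String): Duration {
    val seconds = procUptime.trim().split(Regex("\\s+")).first().toDouble()
    return Duration.ofMillis((seconds * 1000).toLong())
}

// Reads the real file; assumes a Linux host that exposes /proc/uptime.
fun readUptime(): Duration = parseUptime(File("/proc/uptime").readText())
```

If this value drops back toward zero between two log entries, the device really rebooted; if it keeps growing while the dashboard online time resets, only the connection was interrupted.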
Okay, thanks for the info. If you are working on debugging it with Florin, it might be best to keep that all in one thread so we don’t break the context; Florin is probably one of the best people to debug that with you. I will check in with him, see what has been happening on the device he provisioned into your app, and see if he has any findings to share with you.
Thank you!
Hi. On my unit I could not reproduce the reboots. I had a debug cable connected and did not witness the reboots you experience.
Could we maybe arrange that you send over a complete hw setup so I can have it here locally to try and further debug it?
One other thing, what type of compute module are you using?
Has this problem been solved?
I’m experiencing the same symptom on a Balena Fin 1.1 – a unit in the field is rebooting every 10 to 12 minutes. An identical unit here in my office is not exhibiting the issue.
I see the same failed LED message in dmesg (although on this one it’s the red channel that’s failing). The Fin reboots about 10 seconds after.
Has any progress been made in finding a root cause?
Thanks!
I think the reset problem has been significantly reduced. In our case, we updated our database drivers from sqldelight to exposed, and that seemed to make a big difference.
Unfortunately, our application is likely quite different from yours, but I can share the process we went through to reach a solution and hopefully some of this will be helpful.
We first tried commenting out most of our application code until the device stopped resetting to establish a baseline. We then added sections back until we saw the device reset again. This helped determine the problem was due to something in our application, not the OS. This took a long time, but ultimately helped us pinpoint that the reset occurred during database transactions. We then did some general study on the sqldelight database library, and tried a couple other things (wrapping the database transactions in a mutex, updating the library, etc). Those steps didn’t fix the reset, so we finally switched to the exposed library, and now our gateways remain online for much longer.
Regarding the LED, I found that we were writing the LED files way too often, so we made a change to our code to only write the file if the value changes (which is rare for us), and that eliminated almost all of the failed LED messages.
I hope that helps! Since this issue is already super long, feel free to create a new issue and link to it.
Thanks for the insight. Our situation is that it’s a device that has run flawlessly, with no software changes, for months but now exhibits the issue. I’ll see if changing the software helps.
Was it a watchdog event that resulted in the reboot? Kernel Panic? Something else? It’s unusual for a software problem to cause a full device reset without a watchdog, and I’m looking for a place to start the search for a cause.
Thanks again.
The very first post has a log that shows the reset we were experiencing, which was captured using the serial UART pins on the HAT header of the Balena Fin (pin 6=GND, pin 8=TXD, pin 10=RXD). Our devices would pretty consistently reboot within one to two days.
Do you have just one device or multiple devices that are rebooting?
We also looked closely at the CPU temperature and discovered it sometimes got quite hot. In our case, we ultimately found that we could power down our Bluetooth radio when not in use, and that helped reduce the CPU temperature, which may or may not have helped with the reset issue. To measure the CPU temperature:
/opt/vc/bin/vcgencmd measure_temp
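If you want to record the temperature from the application rather than by hand, something like the sketch below might work. parseTemp and measureTemp are hypothetical names, and it assumes the command prints a line of the form temp=47.8'C, which is what we saw on our units.

```kotlin
// Parses output such as "temp=47.8'C"; returns null if the format is unexpected.
fun parseTemp(output: String): Double? =
    Regex("""temp=([0-9]+(?:\.[0-9]+)?)'C""")
        .find(output)
        ?.groupValues
        ?.get(1)
        ?.toDouble()

// Runs the command shown above and parses its output. This only works on a
// device where the binary exists; start() throws an IOException otherwise.
fun measureTemp(): Double? {
    val process = ProcessBuilder("/opt/vc/bin/vcgencmd", "measure_temp")
        .redirectErrorStream(true)
        .start()
    val output = process.inputStream.bufferedReader().readText()
    process.waitFor()
    return parseTemp(output)
}
```

Logging this value with a timestamp every minute or so makes it easier to see whether a reset lines up with a temperature spike.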