Regular Crashes


#1

Hi,

I’m experiencing regular crashes of my Balena devices. I’m using the UP Squared board with the UP Board image in combination with balenaCloud. I have 4 containers running: 2 of them are plain NodeJS containers, 1 is an NGINX container and 1 is an ElectronJS container.

But sometimes after hours and maybe a day, the whole OS is unresponsive, it appears offline at balenaCloud and I can’t SSH into it. Sometimes even the electronjs isn’t visible any more but just a grey screen with nothing on it, like it’s in sleep mode.

The host OS version is: balenaOS 2.26.0+rev3 (Development image)
The supervisor version: 8.0.0

I have persistent logging set to true because of these problems. This is the journalctl output from the last time it crashed (blank screen like sleep mode and unresponsive OS):

journalctl.log (1.1 MB)

If these problems keep occurring, it’s impossible for me to use it in production. That would be very disappointing for us, because we love the product…

Thanks in advance!


#4

I’ve disabled the ElectronJS container, because that is the container with the highest load. The 2 NodeJS containers don’t do anything really, only start an ExpressJS server with 1 page with just some text, for testing purposes. And the NGINX is a simple proxy, not really rocket science for a computer/UP Squared.

But even with the ElectronJS container disabled, the OS still crashes and I have to restart the whole PC. The CPU isn’t really hot. And the screen flickers, even with the ElectronJS container not running. The OS now crashes regularly, within an hour or 2.

I hope someone from the balena staff can look into it, because we can’t use the OS this way.


#5

Hi there @vedicium!

There is an interesting line at the very end of your journal that may be of note:

Nov 25 07:32:09 ee7bc41 kernel: [drm:intel_cpu_fifo_underrun_irq_handler [i915]] *ERROR* CPU pipe A FIFO underrun
Nov 25 07:32:09 ee7bc41 kernel[615]: [   51.047618] [drm:intel_cpu_fifo_underrun_irq_handler [i915]] *ERROR* CPU pipe A FIFO underrun

These messages seem to indicate a kernel bug of some sort, possibly the one referenced here. Since your device is running kernel 4.9.x, I think it would be best to first test the workaround listed in that bug report. Would you mind adding the kernel command line option i915.enable_rc6=0 as a test and letting us know the results?
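To get a feel for how often messages like these show up in a saved journal, a minimal Python sketch along these lines could count them. It assumes the log is plain text and that kernel errors carry the `*ERROR*` tag seen in the two lines above; the function name `count_kernel_errors` is made up for illustration.

```python
import re
from collections import Counter

# Captures the message that follows the kernel's "*ERROR*" tag.
ERROR_RE = re.compile(r"\*ERROR\*\s+(.*\S)")


def count_kernel_errors(lines):
    """Count how often each '*ERROR*' message appears in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = ERROR_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

It could be run over the attached log with something like `count_kernel_errors(open("journalctl.log", errors="replace"))`. A high count only shows that the message is frequent, not that it causes the freeze.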

Thank you very much!


#9

Hi @xginn8,

Thanks for your response!
I’m going to try that! Is there an explanation/quick link on how to change the kernel in BalenaOS?

At the moment I’m testing an older BalenaOS version, 2.17 off the top of my head, and that doesn’t have the flicker and seems to be running okay. But obviously I prefer to always use the newest version.

UPDATE
I’ve added i915.enable_rc6=0 at boot: I pressed “e” in the GRUB menu and appended the parameter to the end of the linux line. After checking /sys/module/i915/parameters/enable_rc6, it’s off.
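For anyone who wants to script that check, here is a minimal Python sketch. It assumes a Linux host with /proc/cmdline and /sys available, and it splits the command line on whitespace only, so quoted values containing spaces are not handled. The helper names `parse_cmdline` and `read_module_param` are made up for illustration.

```python
from pathlib import Path


def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def read_module_param(module, param, sysfs_root="/sys"):
    """Return a module parameter as the kernel reports it, or None if unreadable."""
    path = Path(sysfs_root) / "module" / module / "parameters" / param
    try:
        return path.read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    try:
        boot = parse_cmdline(Path("/proc/cmdline").read_text())
    except OSError:
        boot = {}  # not on Linux, or /proc is not mounted
    print("requested on cmdline:", boot.get("i915.enable_rc6"))
    print("reported by sysfs:   ", read_module_param("i915", "enable_rc6"))
```

Comparing the two values shows whether the parameter was both passed at boot and picked up by the driver.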

I spoke too soon about the earlier version, because when I checked it this morning, the UP Squared crashed and there was a huge flicker in the screen. After checking, the kernel used is also 4.9.x, so that could be the explanation.

In my dmesg, I also found some interesting errors:
[ 5.067887] i915 0000:00:02.0: Direct firmware load for i915/bxt_dmc_ver1_07.bin failed with error -2
[ 5.067911] i915 0000:00:02.0: Failed to load DMC firmware [https://01.org/linuxgraphics/intel-linux-graphics-firmwares], disabling runtime power management.


#11

Hi @xginn8,

I don’t see the flicker anymore in my screen, so that’s good news!

The bad news is, after about an hour or 2, the whole OS becomes unresponsive. I can’t SSH into the system, I can’t click anywhere on the ElectronJS docker. It’s like the whole OS freezes.

I’ve added the journalctl.log. Persistent logging was set to true, so there is a log of the last boot and the current boot. I’ve looked at the log, but the last logs before the reboot were just some balenad warnings.

journalctl.log (1.1 MB)

I’m making an issue at the balena-os/resin-up-board regarding the OS crash referring to this topic. (Issue)

Extra information:
Board: UP Squared Intel Atom E3940
Connection: Ethernet
Display Connection: DisplayPort
BalenaOS Version: balenaOS 2.26.0+rev3
BalenaOS Supervisor: 8.0.0
Running on balenaCloud


#13

Hi again @vedicium,

First off, thank you very much for testing that fix. We appreciate all the information as well. To be clear, you have to physically power cycle the device in order to recover from that state? In the most recent journal, I see a reboot some time between 02:35:44 and 07:31:22 on Nov 25, presumably you manually rebooted the device around that second timestamp? We have safeguards in place to prevent device hangs like this, so I want to make sure we understand the sequence of events here.

Nov 25 02:35:44 
-- Reboot --
Nov 25 07:31:22
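As a rough way to pull these gaps out of a long journal export, a Python sketch like the one below could be used. It assumes the plain-text format shown above (lines starting with `Mon DD HH:MM:SS`, boots separated by `-- Reboot --`) and English month names. Because the log carries no year, the `year` argument is an arbitrary placeholder that only matters for computing differences. The function names are made up for illustration.

```python
from datetime import datetime

REBOOT_MARKER = "-- Reboot --"


def parse_stamp(line, year=2000):
    """Parse the leading 'Mon DD HH:MM:SS' of a journal line, or return None."""
    parts = line.split()
    if len(parts) < 3:
        return None
    try:
        return datetime.strptime(f"{year} {' '.join(parts[:3])}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def reboot_gaps(lines, year=2000):
    """Return (last_before, first_after) timestamp pairs around each reboot marker."""
    gaps = []
    last_seen = None   # most recent timestamp seen so far
    pending = None     # last timestamp before the marker we are currently past
    waiting = False    # True between a marker and the next stamped line
    for line in lines:
        if line.strip() == REBOOT_MARKER:
            pending, waiting = last_seen, True
            continue
        stamp = parse_stamp(line, year)
        if stamp is None:
            continue
        if waiting and pending is not None:
            gaps.append((pending, stamp))
        waiting = False
        last_seen = stamp
    return gaps
```

The first timestamp of each pair is only the last moment the journal was still being written, so the hang itself may have started somewhat later.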

Thank you again!


#20

Hi @xginn8,

I’m putting as much effort as needed in this, because we really want to use balenaOS for our current project and (all of) our future projects. In the current project we use the UP Squared for this, and in future projects we’ll use other boards like the RPI. And I hope we can help other people that use the UP Squared to start with balenaOS, because the more the merrier.

Back to your question, I had to power-cycle the device in order to get it to work again. Like I said, an ElectronJS container is running, so the display is used by that. SSH isn’t working anymore at that time. So it’s impossible for me to log into the system and retrieve logs or do anything. In order to get it working again, I have to really pull the plug. And yes, I’ve rebooted the system around 7:30, so that timestamp is correct!

In a last attempt to fix this issue, I googled everything: kernel issues, processor issues, etc. I found another kernel command-line parameter: intel_idle.max_cstate=1. I can’t confirm 100% that this fixes the freezing/crashing issue, but the system has been running for about 21 hours now and it still works. It’s still connected to balenaCloud and the ElectronJS container still works. But like I said, I can’t confirm that this fixes the issue. 1 day of testing in an idle state isn’t really testing. I’m also developing on this machine, so sometimes containers are restarting. I’ll try to make an exact copy of this setup and just keep it running and connected via ethernet to see if it’s stable.

But it would be best for me and for other people if you guys can confirm this fixes the issue or if an update is needed in balenaOS. We’re eager to use, help, develop and improve balenaOS and all other balena products!

EDIT
The fix with intel_idle.max_cstate=1 did not work. After about 23 hours, it crashed…

I’ve added processor.max_cstate=1, because the processor (Intel Atom E3940) probably doesn’t use the intel_idle driver. I hope this helps.

I’ve added the journalctl again.

journal.28-11.log (1.0 MB)


#28

Hi. We have seen situations in the past where cstate transitions would halt the system as you describe.
Thanks for testing this in your use case. Now that you have set this new cstate param, can you please check whether you actually have the intel_idle module loaded by running cat /sys/module/intel_idle/parameters/max_cstate?
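A slightly broader check can be sketched in Python as below. It assumes the standard Linux cpuidle sysfs layout, where /sys/devices/system/cpu/cpuidle/current_driver names the active idle driver (for example intel_idle or acpi_idle). That is a more direct indicator than the parameter file, which can exist even when the driver is not the one in use. The processor.max_cstate path is an assumption and may not be exposed on every kernel, in which case the sketch reports None. The function names are made up for illustration.

```python
from pathlib import Path


def read_text_or_none(path):
    """Return the stripped contents of a sysfs file, or None if it is unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def cpuidle_report(sysfs_root="/sys"):
    """Collect the active cpuidle driver and the max_cstate limits, where exposed."""
    root = Path(sysfs_root)
    return {
        "current_driver": read_text_or_none(
            root / "devices/system/cpu/cpuidle/current_driver"),
        "intel_idle.max_cstate": read_text_or_none(
            root / "module/intel_idle/parameters/max_cstate"),
        # Assumption: only some kernels expose this parameter in sysfs.
        "processor.max_cstate": read_text_or_none(
            root / "module/processor/parameters/max_cstate"),
    }


if __name__ == "__main__":
    for name, value in cpuidle_report().items():
        print(f"{name}: {value}")
```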


#29

Hi @floion,

As I said, I’m using the Intel Atom E3940. In my dmesg, after setting intel_idle.max_cstate, I got the response that intel_idle isn’t used for my processor family and model. So after much googling, I have now added processor.max_cstate=1. That seems, and I say this carefully, to work!

I didn’t respond in this topic earlier, because I thought intel_idle.max_cstate=1 worked: the system was stable for 24 hours (which was a record), and then it crashed. So I wanted to test some more. Now, after exactly 1 day and 21:57 hours, the system is still stable and hasn’t crashed/frozen/halted. So I think it works!

Now my grub.cfg looks like this:

# Automatically created by OE
serial --unit=0 --speed=115200 --word=8 --parity=no --stop=1
default=boot
timeout=3
menuentry 'boot'{
linux /vmlinuz root=LABEL=resin-rootA rootwait i915.enable_rc6=0 intel_idle.max_cstate=1 processor.max_cstate=1
}

For some detailed information:

  • i915.enable_rc6=0
    – Fixed the problem of the screen glitching/flickering

  • intel_idle.max_cstate=1
    – Maybe helps the Intel Pentium processors of the UP Squared board to stop halting (I didn’t test it)

  • processor.max_cstate=1
    – Helps the Intel Atom processor of the UP Squared board to stop halting (After 2 days of idle)

So maybe it’s something to add to the default UP Board images? For now, I’ve edited the image. I’ve mounted the image, changed grub.cfg_internal and the grub.cfg in the flash EFI, just in case.
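If the same change has to be applied to several images, the edit to the linux line could be scripted. The Python sketch below is only an illustration under some assumptions: the config text uses a `linux <kernel> <params...>` line as in the grub.cfg above, and a parameter that is already present (with any value) is left untouched rather than overwritten. The function name `add_kernel_params` is made up.

```python
def add_kernel_params(grub_cfg, params):
    """Append missing kernel parameters to every 'linux' line of a grub.cfg text."""
    out = []
    for line in grub_cfg.splitlines():
        tokens = line.split()
        if tokens and tokens[0] == "linux":
            # tokens[1] is the kernel path; everything after it is a parameter.
            present = {t.split("=", 1)[0] for t in tokens[2:]}
            missing = [p for p in params if p.split("=", 1)[0] not in present]
            if missing:
                line = line.rstrip() + " " + " ".join(missing)
        out.append(line)
    # Keep the trailing newline only if the input had one.
    return "\n".join(out) + ("\n" if grub_cfg.endswith("\n") else "")
```

Running it twice gives the same result as running it once, so it is safe to re-apply to an image that was already edited.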

But like I said, I’ve tested it for 2 days and it’s still working and stable. But I can’t confirm that it’s a definite fix and stays stable. I’ve ordered another UP Squared with the same processor to do a test, and just install BalenaOS and let it run until it crashes (which I hope it doesn’t).

P.S.
Would it be an idea to create more forum categories, like “General” for feature/forum suggestions etc. and “Development” for people who want help with their software, specific to Balena content (like how to make the best Docker container/setup for Balena, or when something doesn’t work how a developer thinks it’s supposed to work)? The community and the Balena team could help with those kinds of problems.


#32

Hi. Then let’s wait a few more days and check whether that fixes the issue on your side. In the meantime, I will also check the SoC on the board I have here.
About i915.enable_rc6=0, I think the next BSP update will have a newer kernel, so this workaround should soon no longer be necessary. The next BSP update is scheduled to arrive by Christmas, and then we’ll incorporate it and have the new kernel as well.


#33

Hi @floion,

That would be awesome. I’ll keep you posted whether the device crashes again or not. But like I said, it’s a development device, so sometimes I need to reboot.

But it would be very nice to have a balenaOS with the coming BSP update.

Just a quick update. The uptime of my board is now 5 days and 3 hours. So I think it’s stable now!