Balena Crash / Balena Reliability

We have recorded three crashes of balena lately; here is the journalctl output of the latest.

Balena version:

Client:
 Version:       unknown-version
 API version:   1.35
 Go version:    go1.9.4
 Git commit:    unknown-commit
 Built: unknown-buildtime
 OS/Arch:       linux/arm
 Experimental:  false
 Orchestrator:  swarm

Server:
 Engine:
  Version:      17.12.0-dev
  API version:  1.35 (minimum version 1.12)
  Go version:   go1.9.4
  Git commit:   2fe3ad1568c1b783a9201fc3082452ad79d7f396
  Built:        Wed Aug  8 13:31:24 2018
  OS/Arch:      linux/arm
  Experimental: true

This appears to be the crash:

Oct 09 07:14:24 5f14761 healthdog[887]: time="2018-10-09T07:14:24.511615460Z" level=warning msg="unknown container" container=557844a07200e6499ba605d7d8c649e3ed60c7926774d49997bd06c15a6b01b5 module=libcontainerd namespace=plugins.moby
Oct 09 07:14:35 5f14761 healthdog[887]: time="2018-10-09T07:14:35.669668778Z" level=warning msg="unknown container" container=557844a07200e6499ba605d7d8c649e3ed60c7926774d49997bd06c15a6b01b5 module=libcontainerd namespace=plugins.moby
Oct 09 07:15:01 5f14761 healthdog[887]: time="2018-10-09T07:15:00.975296678Z" level=info msg="killing and restarting containerd" module=libcontainerd pid=911
Oct 09 07:15:02 5f14761 healthdog[887]: time="2018-10-09T07:15:02.258496627Z" level=error msg="containerd did not exit successfully" error="signal: killed" module=libcontainerd
Oct 09 07:15:02 5f14761 healthdog[887]: time="2018-10-09T07:15:02.906617033Z" level=error msg="failed to get event" error="rpc error: code = Internal desc = transport is closing" module=libcontainerd namespace=moby
Oct 09 07:15:02 5f14761 healthdog[887]: time="2018-10-09T07:15:02.875516960Z" level=error msg="failed to get event" error="rpc error: code = Internal desc = transport is closing" module=libcontainerd namespace=plugins.moby
Oct 09 07:15:03 5f14761 healthdog[887]: time="2018-10-09T07:15:03.418238603Z" level=error msg="failed restarting containerd" error="fork/exec /usr/bin/balena-containerd: cannot allocate memory" module=libcontainerd
Oct 09 07:15:03 5f14761 healthdog[887]: time="2018-10-09T07:15:03.635437375Z" level=error msg="failed restarting containerd" error="fork/exec /usr/bin/balena-containerd: cannot allocate memory" module=libcontainerd
Oct 09 07:15:03 5f14761 healthdog[887]: time="2018-10-09T07:15:03.715753524Z" level=error msg="failed restarting containerd" error="fork/exec /usr/bin/balena-containerd: cannot allocate memory" module=libcontainerd
Oct 09 07:15:04 5f14761 healthdog[887]: time="2018-10-09T07:15:04.184153060Z" level=error msg="failed restarting containerd" error="fork/exec /usr/bin/balena-containerd: cannot allocate memory" module=libcontainerd
Oct 09 07:15:05 5f14761 healthdog[887]: time="2018-10-09T07:15:05.040319989Z" level=error msg="failed restarting containerd" error="fork/exec /usr/bin/balena-containerd: cannot allocate memory" module=libcontainerd
Oct 09 07:15:05 5f14761 healthdog[887]: time="2018-10-09T07:15:05.205719180Z" level=error msg="failed restarting containerd" error="fork/exec /usr/bin/balena-containerd: cannot allocate memory" module=libcontainerd
Oct 09 07:15:06 5f14761 healthdog[887]: time="2018-10-09T07:15:06.448325394Z" level=error msg="failed restarting containerd" error="fork/exec /usr/bin/balena-containerd: cannot allocate memory" module=libcontainerd
Oct 09 07:15:07 5f14761 healthdog[887]: time="2018-10-09T07:15:06.788386607Z" level=error msg="failed restarting containerd" error="fork/exec /usr/bin/balena-containerd: cannot allocate memory" module=libcontainerd
Oct 09 07:15:07 5f14761 healthdog[887]: time="2018-10-09T07:15:06.975459667Z" level=error msg="failed restarting containerd" error="fork/exec /usr/bin/balena-containerd: cannot allocate memory" module=libcontainerd
Oct 09 07:15:08 5f14761 healthdog[887]: time="2018-10-09T07:15:08.147785303Z" level=error msg="failed restarting containerd" error="fork/exec /usr/bin/balena-containerd: cannot allocate memory" module=libcontainerd
Oct 09 07:15:09 5f14761 healthdog[887]: time="2018-10-09T07:15:09.711485782Z" level=error msg="failed restarting containerd" error="fork/exec /usr/bin/balena-containerd: cannot allocate memory" module=libcontainerd
Oct 09 07:15:10 5f14761 healthdog[887]: time="2018-10-09T07:15:10.466691012Z" level=error msg="failed restarting containerd" error="fork/exec /usr/bin/balena-containerd: cannot allocate memory" module=libcontainerd
Oct 09 07:15:10 5f14761 healthdog[887]: time="2018-10-09T07:15:10.536724177Z" level=error msg="failed restarting containerd" error="fork/exec /usr/bin/balena-containerd: cannot allocate memory" module=libcontainerd
Oct 09 07:15:11 5f14761 healthdog[887]: time="2018-10-09T07:15:10.738530827Z" level=error msg="failed restarting containerd" error="fork/exec /usr/bin/balena-containerd: cannot allocate memory" module=libcontainerd
Oct 09 07:15:12 5f14761 healthdog[887]: time="2018-10-09T07:15:12.345340048Z" level=error msg="failed restarting containerd" error="fork/exec /usr/bin/balena-containerd: cannot allocate memory" module=libcontainerd
Oct 09 07:15:13 5f14761 healthdog[887]: time="2018-10-09T07:15:13.546465747Z" level=error msg="failed restarting containerd" error="fork/exec /usr/bin/balena-containerd: cannot allocate memory" module=libcontainerd
Oct 09 07:15:14 5f14761 healthdog[887]: time="2018-10-09T07:15:14.664733541Z" level=error msg="failed restarting containerd" error="fork/exec /usr/bin/balena-containerd: cannot allocate memory" module=libcontainerd
Oct 09 07:15:17 5f14761 healthdog[887]: time="2018-10-09T07:15:16.765043746Z" level=error msg="failed restarting containerd" error="fork/exec /usr/bin/balena-containerd: cannot allocate memory" module=libcontainerd
Oct 09 07:15:19 5f14761 healthdog[887]: SIGABRT: abort
Oct 09 07:15:19 5f14761 healthdog[887]: PC=0x6e9d0 m=0 sigcode=0
Oct 09 07:15:19 5f14761 healthdog[887]: goroutine 0 [idle]:
Oct 09 07:15:17 5f14761 systemd[1]: balena.service: Watchdog timeout (limit 1min)!
Oct 09 07:15:18 5f14761 systemd[1]: balena.service: Killing process 887 (balenad) with signal SIGABRT.
Oct 09 07:15:19 5f14761 systemd[1]: balena.service: Killing process 889 (exe) with signal SIGABRT.
Oct 09 07:15:19 5f14761 systemd[1]: balena.service: Killing process 3468 (balena-containe) with signal SIGABRT.
Oct 09 07:15:19 5f14761 systemd[1]: balena.service: Killing process 4795 (balena-containe) with signal SIGABRT.
Oct 09 07:15:19 5f14761 systemd[1]: balena.service: Killing process 14639 (balena-healthch) with signal SIGABRT.
Oct 09 07:15:19 5f14761 systemd[1]: balena.service: Killing process 14640 (balena) with signal SIGABRT.

After it crashes there are many log entries; here's a sample: https://paste.ee/p/ZuAIx

The hardware is a 512MB SBC that normally has around 90-120MB of free memory, plus a 64MB swap file.

Why did it even try to kill and restart containerd?

Thanks for the report; logs like these are very useful. It looks like you’re probably running Balena within ResinOS here, is that right? Could you confirm the specific ResinOS version that you’re using?

We’ll take a look through and see if we can find anything more from there. Do let us know if you see this again in future, or if you have any more info on what exactly triggers this issue.

Yes this is a ResinOS device.

Resin OS 2.14.3+rev5

About once a week, across a fleet of 3 monitored devices, we see a balena crash. There is no known cause on our end. This is the first time I’ve been around soon enough to capture the journal (as you can see, it fills and cycles quickly due to the massive healthcheck errors that follow).
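
For anyone else trying to catch one of these, something along these lines grabs the current boot’s balena journal before it cycles out (the output path is just an example, use whatever persistent storage you have):

# Dump the balena unit journal for the current boot to persistent storage
journalctl -u balena.service -b --no-pager > /mnt/data/balena-crash-$(date +%s).log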

Here’s another balena crash. This one occurred on a device provisioned today that was updating (delta) at the time. If I were to hazard a guess, the update connection may have been interrupted; this device is at the edge of WiFi range (router in building A, device in building B for supplementary radio RF testing).

In this case no containers booted after the crash.

root@b15ae3c:~# balena ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
root@b15ae3c:~#

Restarting the balena service didn’t resolve the issue; only a hardware power cycle did.

After a reboot, that device’s balena crashed again while updating. This time containers were restored after the crash and it looks like updates are resuming.

@pimterry

Anything come from looking at those logs?

We are currently experiencing this issue very frequently. Here are our logs, which look very similar (watchdog timeout, followed by killing every container).
I’m attaching the available logs.

We are heavily using resources on our devices, and on some of the devices these issues seem to appear at a regular interval (roughly every 60 minutes). The RAM usage is measured and sent to our monitoring service every 5 minutes. It does not show an increase in RAM usage over time for any of the devices, which leads us to believe it is not a memory leak on the containers’ side.

One thing I could imagine is a scenario in which balena or the supervisor needs RAM every 60 minutes to do something, and if at that point the RAM usage is already over 95%, for example, it causes an OOM from one moment to the next.

We have been debugging this issue for quite some time; it started occurring at startup with one of our firmware versions. We lowered the memory usage and it no longer occurred at startup, but on a regular basis instead. As said, we investigated memory leaks, but our memory consumption stays constant over time. Older devices (on 2.12.7+rev1) also sit at 97% RAM usage but do not produce these errors (they don’t run the latest firmware, though).
top also shows at least 20K free at all times until the crash happens.
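
For context, the sampling we do is roughly equivalent to the sketch below (our real monitoring agent differs; the interval, counters, and log path are only illustrative):

# Append a timestamped memory snapshot to a log every 5 minutes
while true; do
  echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) $(grep -E 'MemFree|MemAvailable|SwapFree' /proc/meminfo | tr '\n' ' ')"
  sleep 300
done >> /var/log/mem-samples.log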

Can you give me pointers on how to further pin down what is going on?

journalctl.log (11.6 KB)

EDIT: I will also try to get a more complete journalctl log from a device if I can catch one.

Hello!
Thank you for reporting this - as per our previous communication, we’re responding on the forums as this seems easier :slight_smile:
We took a look into the release history and it seems that balenaOS v2.14.3 had a problem with memory consumption and was pulled back from the release server.
Any version after that should have the resource issue fixed.
Let us know if this helps!

What would our version be? It seems to report version numbers from a different series.

I’m assuming that by balenaOS you mean resinOS. (I see that “resin os versions raspberrypi3” does not return any 2.14.* versions, so I am assuming that’s what you meant)

My last comment was a bit of a mess regarding version definitions. While writing it I was still testing on other resinOS versions, and when the error occurred there as well, I completely removed the section of my comment outlining the versions on which it occurs.

The attached logs from my last comment were from a device running on 2.15.1, but with our latest devices we haven’t had the problem occur in quite some time. I am therefore assuming this to be resolved (or our own software’s fault) for now. If it happens again to us, we will surely investigate further and try to pin it down correctly as well as get the error to occur in a repeatable fashion.

Thank you for your assistance so far! :slight_smile:

@SplitIce I suggest updating to resinOS 2.15.1, as your resinOS version is the one that was pulled back from the release server. (You can see available versions with the resin CLI command "resin os versions $deviceType".)
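
For example, for a Raspberry Pi 3 device type that would be:

resin os versions raspberrypi3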

Best Regards,
Tarek

OT: I see the forums are now on balena.io; can we expect resinOS to be renamed balenaOS in the future? :stuck_out_tongue:

@Tschebbischeff @cyplo

The device I do most of my testing on is running Resin OS 2.14.3+rev5. As it’s a development-mode image, it’s not easy to upgrade (no ResinHUP).

We had a script for upgrading development devices, but it’s been broken since ~2.15 and @brdrcol (a co-worker) hasn’t yet worked out why.

@imrehg do we update development versions yet?

Hi @floion @SplitIce, we don’t support self-serve dev version updates, but:

  • adding support for this is in the works
  • as one-off fleetops support, we can help update the device: just open a support request with the device UUID, the version you want, and when we can run the update.

@imrehg Any chance you could just provide the basics of the process?

Historically we have dd’ed the image over the network onto the non-active partition and flipped the active partition in the boot folder (I need to ask my co-worker for the specific locations).
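
Very roughly, it looks something like the sketch below (the device path and boot-flag location are illustrative only; the real locations depend on the partition layout and I would need to confirm them with my co-worker):

# Stream the new rootfs image over SSH onto the currently inactive root partition
cat resinos-rootfs.img | ssh root@<device> 'dd of=/dev/mmcblk0p3 bs=4M && sync'
# Then edit whichever file in the boot partition selects the active root
# partition (location depends on the OS build) and reboot the device.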

We are currently only using development versions as all devices are with external contractors, pilot sites, developers or located at testing labs. Not to mention we have a custom OS build (so no ResinHUP for us through your cloud).