Balena-engine crashing during livepush

Hey folks,

I’ve been having trouble livepushing a 6-container app to my balenaFin in local mode.
It appears that the balena-engine / docker daemon may be crashing but it’s hard to tell because sometimes the device completely locks up.

Here’s some output that will hopefully help pin-point the issue.

balena push 192.168.0.11 --debug sometimes just locks up and stops streaming logs, and any other active SSH sessions lock up as well.

The last thing I’ll see is something like this:

[Build] [map-service] Step 5/9 : RUN pip install -r requirements.txt
[Build] [hal] Step 5/9 : RUN pip install -r requirements.txt
[Build] [supervisor] ---> Running in 2f88a7d31c5d
[Build] [scheduler] ---> Running in 367ced99e0e8
[Build] [map-service] ---> Running in 9272ba052ebb
[Build] [hal] ---> Running in a8a578ed9e21

The supervisor’s device API also goes down, which I verify by hitting http://192.168.0.11:48484/ping (no response).
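For reference, that check is just something like this from my laptop (assuming curl is available; a healthy supervisor normally answers with a 200 OK):

curl http://192.168.0.11:48484/ping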

Other times I see an actual error during balena push 192.168.0.11 --debug

connect ECONNREFUSED 192.168.0.11:2375
or
[Error] Connection to device lost
or
ECONNRESET: socket hang up

And then I see errors in the system logs like this:

Jul 01 20:28:43 07a2ee6 systemd[1]: resin-supervisor.service: Main process exited, code=killed, status=15/TERM

Jul 01 20:28:54 07a2ee6 resin-supervisor[19165]: [info] Supervisor v11.4.10 starting up…

Jul 01 20:28:43 07a2ee6 balenad[16612]: time="2020-07-01T20:28:43.778617677Z" level=error msg="Handler for GET /images/json returned error: write unix /var/run/balena-engine.sock->@: write: broken pipe"
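For anyone following along, logs like these can be pulled on the device with journalctl; assuming the usual balenaOS unit names, something like:

journalctl -f -u balena.service -u resin-supervisor.service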

It appears to me that the resin_supervisor container, balenad, or both are crashing (or getting terminated?).

Any thoughts on what’s going on?

EDIT: I’ve uploaded another log over here with more details: https://gist.github.com/ebradbury/80537374db3471442033cc5b26c04abe

Thanks for the help,
Elliot

Hi Elliot,

Probably the best strategy would be to narrow down whether a specific container causes this, or whether it is an out-of-memory type of issue unrelated to any particular container, e.g. what happens when you push 5, 4, … containers?

Also, are you able to push the same application through the cloud builders, which are much beefier?

Thanks,
Zahari

Hey Zahari,

Thanks for the response. I can get it to build with 4 containers, but I don’t see a direct correlation with any specific container causing the issue, since they’re mostly ~15 MB, low-CPU Flask servers. I haven’t tried building in the cloud since this app is still in development and I’m trying to nail down a quick, dependable build process.

Some other things to note that seem to have helped with the problem:

  • Throttled down some looping code with Python’s time.sleep()
  • Assigned all of my containers to cores 1, 2, and 3 using cpuset, in an attempt to guarantee that resin_supervisor has core 0 to itself (see the sketch after this list)
  • Added better cooling to the Raspberry Pi CM3
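To be concrete, the core pinning is done in docker-compose.yml; here’s a minimal sketch of one service (the service name and build path are just examples, and I’m assuming the cpuset field is passed through by the supervisor the same way docker-compose handles it):

services:
  map-service:
    build: ./map-service
    cpuset: "1,2,3"   # keep this container off core 0 so resin_supervisor has it to itself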

I’m not sure if this is a stable solution and would still like to get to the root of this issue.

Thanks for the help,
Elliot

Hi @ebradbury – thanks for the additional details. I will ping our devices team to see if they have any advice on this. In the meantime, there are a couple things I might suggest:

  • I’m curious what power supply you’re using, and whether it might benefit from improvement.

  • If you go to the dashboard for the device and click on Diagnostics, you should be able to run Device Checks. Are there any warnings that turn up? (One of the things we check for is undervoltage events, which relates to the previous point.)

  • Can you give a bit more detail on the cooling improvements you’ve made?

  • What are the specs on the Pi3 module you’re using?

  • I realize that you’re looking for a quick, dependable build process, but I wonder how your application behaves if you use the cloud builders – that is, whether running the application on its own might be enough to cause these problems, or whether they only appear when the device is doing a build as well.

  • We have been doing some exploration with Netdata, which allows quick and easy monitoring of system resources (CPU, RAM, network, temperature, etc.). This is still a work in progress, but we’ve had some luck adding a Netdata container to applications. Have a look at https://github.com/balena-io-examples/balena-netdata/tree/netdata-smol (note: the branch I’m recommending is netdata-smol, not master); you may find that you can get an idea of what’s happening on the device up to the point where it hangs. A rough sketch of what adding such a service looks like follows this list.
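To give a rough idea, a service along these lines can be added to docker-compose.yml (this is only an illustrative sketch, not the exact service from that repository; the example branch builds its own image, so treat the image, port, and privileged setting here as assumptions):

  netdata:
    image: netdata/netdata      # illustrative; the example repo builds a balena-friendly image instead
    privileged: true            # netdata needs broad access to read host metrics
    ports:
      - "19999:19999"           # netdata’s default dashboard port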

All the best,
Hugh

For those following, @ebradbury and I have resolved this issue by replacing the base image

FROM balenalib/raspberrypi3-alpine-python:3.8-latest

with

FROM balenalib/fincm3-python:3-stretch-run
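In practice the change is just the base image line at the top of each service’s Dockerfile; the rest of this sketch (the requirements install and entrypoint) is only illustrative of our services, not an exact file:

FROM balenalib/fincm3-python:3-stretch-run
# previously: FROM balenalib/raspberrypi3-alpine-python:3.8-latest
WORKDIR /usr/src/app
COPY requirements.txt ./
RUN pip install -r requirements.txt
COPY . ./
CMD ["python", "main.py"]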

This has resolved all of our issues with livepush and laggy local performance, though it’s not clear why. Mentioning it here as a cross-reference to our separate but related conversation at https://forums.balena.io/t/is-there-a-maximum-number-allowable-containers/146698/33

I’m not sure this is an Alpine-specific issue in general; it may be more a bug in how livepush interacts with Alpine images. This is complete conjecture, but I wanted to bring it to the balena / Fin team’s attention.

Thanks for the note, Alexander, appreciate the heads-up on this. :slight_smile: