Logs in BalenaOS

So, just a quick update @hedss:

I’ve implemented a solution that bind-mounts the socket via the io.balena.features.balena-socket label and then uses it to get logs for all containers.

This requires installing the docker CLI on the Alpine image (I’m using balenalib/intel-nuc-alpine:3.9-run) for the scraping container, and adding the io.balena.features.balena-socket: '1' label to the container definition in the docker-compose file.
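For reference, a minimal docker-compose sketch of what this looks like (the service name, build path, and Dockerfile details are just placeholders from my setup, not a prescription):

```yaml
version: '2.1'
services:
  log-scraper:
    # The Dockerfile is assumed to start FROM balenalib/intel-nuc-alpine:3.9-run
    # and install the docker CLI (e.g. via apk).
    build: ./log-scraper
    restart: always
    labels:
      # Bind-mounts the engine socket into the container at /var/run/balena.sock
      io.balena.features.balena-socket: '1'
```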

It has the nice benefit of making the whole range of docker commands available for use, which makes it easy to parse out container names and IDs.

The main problem here is that we’d like to get the logs for all running containers simultaneously, which means tailing all the containers at once. This can be done with something like docker -H unix:///var/run/balena.sock ps -q | xargs -L 1 docker -H unix:///var/run/balena.sock logs --follow --timestamps, but then we will not be able to match a log line to a specific container.

Another option is to have a wrapper script that calls our scraping script on each of the containers, which makes it quite easy to match the container name and ID to a log line, and then push to Loki. This is exactly what I did, btw - my scraper script runs docker -H unix:///var/run/balena.sock logs <CONTAINER_NAME> --follow --timestamps, with <CONTAINER_NAME> being passed in from the wrapper script (see the sketch below).
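To make that concrete, here is a rough sketch of the wrapper/scraper logic (file and function names are illustrative, and the actual push to Loki is omitted - this is not the exact script I’m running):

```sh
#!/bin/sh
SOCK=unix:///var/run/balena.sock

scrape() {
    # Follow one container's logs; prefixing each line with the container name
    # is what lets us attribute lines later. The real script ships to Loki here.
    docker -H "$SOCK" logs "$1" --follow --timestamps 2>&1 | sed "s/^/$1 /"
}

# Spawn one scraper per running container and wait on all of them.
for name in $(docker -H "$SOCK" ps --format '{{.Names}}'); do
    scrape "$name" &
done
wait
```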

This brings to light, btw, that using the HTTP API seems more reasonable - it does all of that for us by returning succinct messages that match each log message to the container it came from.
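For the record, this is roughly what I mean - the standard Docker Engine API endpoints served over the balena socket (the curl calls below are just a sketch; <CONTAINER_ID> is a placeholder, and the exact API version prefix may vary):

```sh
SOCK=/var/run/balena.sock

# List running containers - each JSON entry carries the Id and Names fields.
curl -s --unix-socket "$SOCK" http://localhost/containers/json

# Stream one container's logs; since the container id is in the URL, every line
# of this stream is already matched to its container. Note that for non-TTY
# containers the response is a multiplexed stream with 8-byte frame headers.
curl -s --unix-socket "$SOCK" \
  "http://localhost/containers/<CONTAINER_ID>/logs?follow=1&stdout=1&stderr=1&timestamps=1"
```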

After running that solution overnight, I’m seeing a grabbing logs: error opening journald follow descriptor: "Too many open files" error across most of the containers in my docker-compose application.

I’ve followed the instructions here: Grabbing logs: error opening journald follow descriptor: “Too many open files”, and ran sysctl -w fs.inotify.max_user_instances=512 on the Host OS (including restarting the containers, and even rebooting the device). To no avail - the same message persists. It actually threw my setup into a flux of kill → can’t kill for a couple of my containers, which was only resolved by restarting the supervisor. Even after that I was still getting the grabbing logs: error opening journald follow descriptor: "Too many open files" error.
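For anyone else hitting this, a quick way to sanity-check the inotify limit on the Host OS would be something like the following (not part of my original debugging, just a sketch):

```sh
# Current limit vs. a rough count of inotify instances currently open.
sysctl fs.inotify.max_user_instances
find /proc/*/fd -lname 'anon_inode:inotify' 2>/dev/null | wc -l

# The workaround suggested in the linked thread (did not help in my case).
sysctl -w fs.inotify.max_user_instances=512
```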

Conclusion: I will attempt to implement the same logic using the HTTP API and report back. All in all, I get the feeling that this is a weird edge case: we’re running ResinOS-In-Container and trying to get the socket of the BalenaOS instance that runs inside the container. I wouldn’t mind digging deeper, but for now I’d like to get something working first.

Update 1: BTW, my device is now officially gorked. Might have to purge it just to get it usable again…

Update 2: can’t purge since it’s now getting socket timeouts. Full logs of the supervisor:

2019-09-05T06:54:15.997451000Z [info]    Supervisor v10.0.3 starting up...
2019-09-05T06:54:16.241393000Z (node:1) [DEP0005] DeprecationWarning: Buffer() is deprecated due to security and usability issues. Please use the Buffer.alloc(), Buffer.allocUnsafe(), or Buffer.from() methods instead.
2019-09-05T06:54:16.312126000Z [debug]   Starting event tracker
2019-09-05T06:54:16.314596000Z [debug]   Starting api binder
2019-09-05T06:54:16.317504000Z [debug]   Starting logging infrastructure
2019-09-05T06:54:16.343566000Z [event]   Event: Supervisor start {}
2019-09-05T06:54:16.385788000Z [debug]   Performing database cleanup for container log timestamps
2019-09-05T06:54:16.404518000Z [info]    Previous engine snapshot was not stored. Skipping cleanup.
2019-09-05T06:54:16.404679000Z [debug]   Handling of local mode switch is completed
2019-09-05T06:55:22.059791000Z Starting system message bus: dbus.
2019-09-05T06:55:22.061683000Z  * Starting Avahi mDNS/DNS-SD Daemon: avahi-daemon
2019-09-05T06:55:22.061742000Z    ...done.
2019-09-05T06:55:36.092923000Z (node:1) [DEP0005] DeprecationWarning: Buffer() is deprecated due to security and usability issues. Please use the Buffer.alloc(), Buffer.allocUnsafe(), or Buffer.from() methods instead.
2019-09-05T06:55:44.160625000Z [debug]   Starting event tracker
2019-09-05T06:55:44.160639000Z [debug]   Starting api binder
2019-09-05T06:55:44.160652000Z [debug]   Starting logging infrastructure
2019-09-05T06:55:47.187593000Z [event]   Event: Supervisor start {}
2019-09-05T06:56:20.265384000Z Starting system message bus: dbus.
2019-09-05T06:57:55.884371000Z Starting system message bus: dbus.
2019-09-05T06:57:55.884393000Z  * Starting Avahi mDNS/DNS-SD Daemon: avahi-daemon
2019-09-05T06:57:55.884415000Z    ...done.
2019-09-05T06:59:14.838559000Z [info]    Supervisor v10.0.3 starting up...

Update 3: Ungorked by deleting all the volumes and re-deploying a release that does not contain the socket scraping script. Will do some HTTP API stuff later today and update. Also noted this problem in this ticket:

Same issue here: running sysctl -w fs.inotify.max_user_instances=512 does not fix the issue. Note that I’m running ResinOS-In-Container: