Logs in BalenaOS

I’m trying to figure out how I can manually scrape logs out of BalenaOS-based containers, specifically a BalenaOS-in-container setup.

In usual Docker installations, container logs are written to /var/lib/docker/containers/<container-id>/<container-id>-json.log. In our case though, logging into the Host OS and browsing:

/var/lib/docker/containers/<container-id>
/var/lib/balena/containers/<container-id>
/var/lib/balena-engine/containers/<container-id>

All reveal nothing.

I also tried balena inspect <container-id> | grep LogPath, but I get "LogPath": "", which is not very helpful. Since balenaEngine is based on Moby, I also took a look at the way Moby does logging here, and couldn't find any clues there.

This means that either the logs are never persisted into a file or they’re stored in some esoteric non-standard place, right? Or maybe there’s a third option I’m not aware of.

In any case - where is the Balena Supervisor scraping its logs from, for example?

Source: https://stackoverflow.com/questions/33017329/where-is-a-log-file-with-logs-from-a-container

Hi there,

The Supervisor talks directly to the balenaEngine socket to manage application service images and containers. As such, it uses this socket to retrieve logs from containers rather than persisting logs to storage (log files could fill the storage media very quickly, and prolonged writes, especially on SD media, wear the media down significantly). These logs are then processed and sent to our own log retention system.

You can subscribe to the logs for a device, for example using the SDK (see here https://www.balena.io/docs/reference/sdk/node-sdk/#balenalogs--codeobjectcode), to consume all output from your service containers.

Examining the journal for balenaEngine will also allow you to read the logs for the current session (using journalctl -f -n 1000 -a -u balena will show you the last 1000 lines from balenaEngine's journal), although this will show everything from all containers. If you want to 'scrape' specific containers, you can discover the container ID for the specific service you want logs for and grep through the balenaEngine output (e.g. journalctl -f -n 1000 -a -u balena | grep 1234567 will only show you output from the container with ID 1234567 that was present in the last 1000 lines of journal from balenaEngine).
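For reference, those two commands as a copy-pasteable sketch (the container ID 1234567 is just a placeholder):

# Follow balenaEngine's journal, starting from its last 1000 lines (all containers)
journalctl -f -n 1000 -a -u balena

# The same, filtered down to one container by (a prefix of) its ID
journalctl -f -n 1000 -a -u balena | grep 1234567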

Please let us know if this answers your question, or if you have any others.

Best regards,

Heds

@hedss this is very thorough - thank you!

I am deciding between two different options right now: bind-mounting the socket via the io.balena.features.balena-socket label and then using it to get logs for all containers, or using the HTTP API endpoint for the local mode application logs from here.

I will attempt both and see the results, and if neither satisfies my need I will go the journalctl route.
I'm enticed by the HTTP API because it spits out nicely formatted JSON, though.
Will post a detailed solution here + my use case once it's working :)

Thanks!!

Hey @tomgs ,

You’re very welcome, glad I could help!

The API is usually the preferred format for exactly the reasons you state, and it also means fewer extra services having to run on-device. We’d love to hear about your use-case!

Please let us know if anything else comes up we can help with. :slight_smile:

Best regards,

Heds

So, just a quick update @hedss:

I've implemented a solution that bind-mounts the socket via the io.balena.features.balena-socket label and then uses it to get logs for all containers.

This requires installing docker on the Alpine image (I’m using balenalib/intel-nuc-alpine:3.9-run) for the scraping container, and adding the io.balena.features.balena-socket: '1' label to the container definition in the docker-compose file.
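For illustration, a minimal sketch of what that service definition might look like in the docker-compose file (the service name and build directory are hypothetical; the base image and label are the ones described above):

version: '2.1'
services:
  log-scraper:
    # Built FROM balenalib/intel-nuc-alpine:3.9-run, with docker installed
    # (e.g. apk add docker) so the CLI is available inside the container
    build: ./log-scraper
    labels:
      io.balena.features.balena-socket: '1'  # bind-mounts the engine socket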

It has the nice benefit of making the whole range of docker commands available for use, which makes it easy to parse out container names and IDs.

The main problem here is that we'd like to get all the logs for all running containers simultaneously, which means polling all the containers simultaneously. This can be done with something like docker -H unix:///var/run/balena.sock ps -q | xargs -L 1 docker -H unix:///var/run/balena.sock logs --follow --timestamps, but then we will not be able to match a log line to a specific container.

Another option is to have a wrapper script calling our scraping script on each of the containers, which makes it quite easy to match container name and ID to a log line, and then push to Loki. This is exactly what I did, btw - my scraper script pipes docker -H unix:///var/run/balena.sock logs <CONTAINER_NAME> --follow --timestamps, with the <CONTAINER_NAME> being passed in from the wrapper script.
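Roughly, the wrapper/scraper pair looks like this (a minimal sketch; the script names and the Loki push step are placeholders for my actual setup):

#!/bin/sh
# wrapper.sh - start one scraper per running container, then wait on all of them
SOCK="unix:///var/run/balena.sock"
for name in $(docker -H "$SOCK" ps --format '{{.Names}}'); do
  ./scraper.sh "$name" &
done
wait

#!/bin/sh
# scraper.sh - follow a single container's logs, tagging every line with its name
NAME="$1"
docker -H unix:///var/run/balena.sock logs "$NAME" --follow --timestamps 2>&1 \
  | while read -r line; do
      echo "$NAME $line"  # a real version would push the tagged line to Loki here
    done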

This brings to light, btw, the fact that using the HTTP API seems more reasonable - since it does all of that for us by giving us succinct messages that match each log message to the container it came from.

After running that solution overnight, there’s a grabbing logs: error opening journald follow descriptor: "Too many open files" error across most of the containers in my docker-compose application.

I've followed the instructions here: Grabbing logs: error opening journald follow descriptor: "Too many open files", and ran sysctl -w fs.inotify.max_user_instances=512 on the Host OS (including restarting the containers, and even rebooting the device). To no avail: the same message persists. It actually threw my setup into a flux of kill → can't-kill for a couple of my containers, which was only solved by restarting the supervisor. After that I was still getting the grabbing logs: error opening journald follow descriptor: "Too many open files" error.

Conclusion: I will attempt to implement the same logic using the HTTP API and report back. All in all - I get the feeling that this is a weird edge case: We’re running ResinOS-In-Container, and trying to get the socket for that version of BalenaOS that runs inside the container. I wouldn’t mind digging deeper, but for now I’d like to get something working first.

Update 1: BTW, my device is now officially gorked. Might have to purge it just to get it usable again…

Update 2: can't purge since it's now getting socket timeouts. Full logs of the supervisor:

2019-09-05T06:54:15.997451000Z [info]    Supervisor v10.0.3 starting up...
2019-09-05T06:54:16.241393000Z (node:1) [DEP0005] DeprecationWarning: Buffer() is deprecated due to security and usability issues. Please use the Buffer.alloc(), Buffer.allocUnsafe(), or Buffer.from() methods instead.
2019-09-05T06:54:16.312126000Z [debug]   Starting event tracker
2019-09-05T06:54:16.314596000Z [debug]   Starting api binder
2019-09-05T06:54:16.317504000Z [debug]   Starting logging infrastructure
2019-09-05T06:54:16.343566000Z [event]   Event: Supervisor start {}
2019-09-05T06:54:16.385788000Z [debug]   Performing database cleanup for container log timestamps
2019-09-05T06:54:16.404518000Z [info]    Previous engine snapshot was not stored. Skipping cleanup.
2019-09-05T06:54:16.404679000Z [debug]   Handling of local mode switch is completed
2019-09-05T06:55:22.059791000Z Starting system message bus: dbus.
2019-09-05T06:55:22.061683000Z  * Starting Avahi mDNS/DNS-SD Daemon: avahi-daemon
2019-09-05T06:55:22.061742000Z    ...done.
2019-09-05T06:55:36.092923000Z (node:1) [DEP0005] DeprecationWarning: Buffer() is deprecated due to security and usability issues. Please use the Buffer.alloc(), Buffer.allocUnsafe(), or Buffer.from() methods instead.
2019-09-05T06:55:44.160625000Z [debug]   Starting event tracker
2019-09-05T06:55:44.160639000Z [debug]   Starting api binder
2019-09-05T06:55:44.160652000Z [debug]   Starting logging infrastructure
2019-09-05T06:55:47.187593000Z [event]   Event: Supervisor start {}
2019-09-05T06:56:20.265384000Z Starting system message bus: dbus.
2019-09-05T06:57:55.884371000Z Starting system message bus: dbus.
2019-09-05T06:57:55.884393000Z  * Starting Avahi mDNS/DNS-SD Daemon: avahi-daemon
2019-09-05T06:57:55.884415000Z    ...done.
2019-09-05T06:59:14.838559000Z [info]    Supervisor v10.0.3 starting up...

Update 3: Ungorked by deleting all the volumes and re-deploying a release that does not contain the socket scraping script. Will do some HTTP API stuff later today and update. Also noted this problem in this ticket:

Same issue here, running sysctl -w fs.inotify.max_user_instances=512 does not fix the issue. Note that I'm running ResinOS-In-Container.

Hi @tomgs ,
I just wanted to mention one more option you could try.
We've recently added a new API to the supervisor running on a device that allows you to obtain a stream from journald, where all the container logs are written.
https://www.balena.io/docs/reference/supervisor/supervisor-api/#journald-logs
It could help you obtain logs from all containers on the device and avoid the 'too many open files' problem.
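For example, something along these lines (the parameters follow what's documented at the link above):

# Stream the device journal through the supervisor API; this needs the
# io.balena.features.supervisor-api label mentioned below
curl -X POST -H "Content-Type: application/json" \
  --data '{"follow": true, "all": true, "format": "short"}' \
  "$BALENA_SUPERVISOR_ADDRESS/v2/journal-logs?apikey=$BALENA_SUPERVISOR_API_KEY"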

A problem with using the cloud APIs is that your device has to be connected to the VPN for them to work. If you have spotty connectivity, you might prefer an on-device solution that allows you to buffer the data and upload it when connectivity is established.

In order to access the supervisor API, you need to apply the io.balena.features.supervisor-api: '1' label to your container.
Hope it helps.

hey @roman-mazur,

I think I may have misspoken - by 'HTTP API' I was definitely referring to the supervisor API.
I have no interest in using the proxy on balena.io, since as you mentioned that could definitely be spotty.

Or am I missing something?

Or better yet - can you please explain the differences between:

  1. https://www.balena.io/docs/reference/supervisor/supervisor-api/#journald-logs

and

  2. https://www.balena.io/docs/reference/supervisor/supervisor-api/#stream-local-mode-application-logs-from-device

And which do you think is preferable?

Ah, sorry, I misread the message. Thanks for correcting me, and let us know how it goes.

@roman-mazur Literally just wrote a message above you :)

@tomgs
Sure, the difference is mainly in what type of logs you will get, and from what source.

https://www.balena.io/docs/reference/supervisor/supervisor-api/#journald-logs
This API allows you to obtain a stream from journald, which contains kernel and engine logs in addition to container logs (analogous to calling journalctl on the host).

https://www.balena.io/docs/reference/supervisor/supervisor-api/#stream-local-mode-application-logs-from-device
And this API provides a stream which is a composition of container logs taken from the engine (analogous to calling balena-engine logs on the host).

If you want the full picture of what is happening on the device, journald is the best choice. But if you are interested in your container logs only, then the second option may be better. Also, if you enable persistent logging on a device, only the first endpoint will let you get logs from before the last boot.
Log formats are also different.

Does it help?

It really does @roman-mazur, thanks!

I will explore and ping back once I’ve got some idea of what’s a better way to go.

@roman-mazur An update:

First of all, and as an aside, I think the docs are missing the API key here; they say:

curl "$BALENA_SUPERVISOR_ADDRESS/v2/local/logs"

Where I suppose it should say:

curl --no-buffer "$BALENA_SUPERVISOR_ADDRESS/v2/local/logs?apikey=$BALENA_SUPERVISOR_API_KEY"

Back on point - re the local mode logs:

After running curl --no-buffer "$BALENA_SUPERVISOR_ADDRESS/v2/local/logs?apikey=$BALENA_SUPERVISOR_API_KEY" and looking at balena logs for the scraper container in the Host OS through the web GUI, I can see:

{"message":"Streaming logs","isSystem":true}

If I understand correctly, this comes from here, which means all systems are go and we're waiting for more logs to come - although none do. Any idea why?

As for the journald logs:

Tried doing curl -X POST -H "Content-Type: application/json" --data '{"follow":true,"all":true,"format":"short"}' "$BALENA_SUPERVISOR_ADDRESS/v2/journal-logs?apikey=$BALENA_SUPERVISOR_API_KEY" > logfile.txt, and I keep getting:

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Error</title>
</head>
<body>
<pre>Cannot POST /v2/journal-logs</pre>
</body>
</html>

Same for GET, btw. Any thoughts why? The variables are expanded correctly (since the curl works and doesn’t get permission denied or however you deal with bad API keys).

Hey @tomgs, supervisor maintainer here,

The reason that the docs don't contain a $BALENA_SUPERVISOR_API_KEY variable on the /v2/local/logs endpoint is that this endpoint only works in local mode; the logs get sent to the cloud otherwise. In local mode, an API key isn't required by the supervisor API. It's probably worth me adding this information to the documentation, though.

I'll take a look at the endpoint and try to see why it's not working for you; at first glance that should definitely work, and if you've applied the API label and are using a supervisor version > 10.3.0 you shouldn't be getting that error.

Ah! I'm not using a supervisor version > 10.3.0, @CameronDiver.
I’ve tried updating my Intel NUC’s BalenaOS, but it seems there is no newer version…
Can I somehow update the supervisor alone?

Hey @tomgs, supervisor v10.3.0 hasn't been released yet in a balena-os/meta-balena release. Once it's included in a new meta-balena release, you should be able to upgrade your OS to the balenaOS release that includes a supervisor version > 10.3.0.

@mikesimos So, no way of using that API yet?
Including it in the docs is a bit misleading though, no? :slight_smile:
In any case - when can we expect it?

Also, I’m having trouble with the other API you provide for streaming the local logs above. Can you please assist me with that as well?

After running curl --no-buffer "$BALENA_SUPERVISOR_ADDRESS/v2/local/logs?apikey=$BALENA_SUPERVISOR_API_KEY" and looking at balena logs for the scraper container in the Host OS through the web GUI, I can see:

{"message":"Streaming logs","isSystem":true}

If I understand correctly, this comes from here, which means all systems are go and we're waiting for more logs to come - although none do. Any idea why?

This endpoint only works in local mode, as the logs get sent to the cloud otherwise. For it to be useful (as with all /v2/local endpoints), the device must first be in local mode.

So, in essence, you’re saying the ‘local mode’ API is only suitable for development and not for production, and the other API (the one that scrapes journald) is good for production, just not released yet?

Hey @tomgs, this sounds about right. The non-local-mode API is actually already in production for supervisor versions > 10.2.0 (I think 10.2.2 is the latest version available in production). The requirement on the project for version > 10.3.0 comes from the fact that that version adds this field https://www.balena.io/docs/reference/supervisor/supervisor-api/#format, which is required to format the output and be able to parse it consistently.