We have noticed some (but not all) devices experiencing errors when maintaining a stream to the supervisor API endpoint for journald logs: /v2/journal-logs. In these cases, attempts to create a stream result in socket errors due to connection aborts.
Manually running the example curl command provided in the docs works (including with --no-buffer):
curl -X POST -H "Content-Type: application/json" --data '{"follow":true,"all":true}' "$BALENA_SUPERVISOR_ADDRESS/v2/journal-logs?apikey=$BALENA_SUPERVISOR_API_KEY"
However, the error occurs when making this request programmatically (in the container’s application code). By running the above, I can see that the application is able to trigger the journal stream. Unlike on the “good” devices, the request immediately reports a 200/OK and our abort occurs:
...
Jul 18 09:34:55 7435aa8 balena-supervisor[3465]: [debug] Spawning journalctl -a --follow -o json
Jul 18 09:34:55 7435aa8 balena-supervisor[3465]: [api] POST /v2/journal-logs 200 - 42.511 ms
...
It seems the first attempt at streaming offers the best chance of stability - sometimes it will have several minutes of uptime before eventually hitting a loop of immediate aborts.
Could you please advise on the questions below?
What is the expected response-code structure for this endpoint? It is not specified in the docs.
Is there any further debug logging we can leverage to understand the outcome of the API call?
Do you have any advice on request structure that might circumvent any of these issues?
If you would like me to provide specifics on our fleet/device ID via PM, please reach out!
The supervisor starts a journalctl in a subprocess and pipes the stdout to the response object.
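Roughly along these lines (a simplified sketch for illustration only, not the actual supervisor source; the handler name and option handling here are made up):

import { spawn } from 'child_process';
import type { Request, Response } from 'express';

// Simplified sketch: spawn journalctl with options derived from the request
// body and pipe its stdout straight into the HTTP response. Not the
// supervisor's actual implementation.
export function journalLogsHandler(req: Request, res: Response): void {
  const args = ['-o', 'json'];
  if (req.body.all) {
    args.push('-a');
  }
  if (req.body.follow) {
    args.push('--follow');
  }

  const journalctl = spawn('journalctl', args, { stdio: 'pipe' });

  res.status(200);
  journalctl.stdout.pipe(res);

  // Tear down whichever side is left when the other goes away.
  journalctl.on('close', () => res.end());
  res.on('close', () => journalctl.kill('SIGTERM'));
}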
I’ll try to reproduce this locally. Could you elaborate on “before eventually hitting a loop of immediate aborts”? What log messages (if any) are received client-side?
I’ve tried to reproduce, but for whatever reason I’m not experiencing the issue on my device. I think that if you could enable support access to a device experiencing the problems, and one that isn’t, and share their UUIDs (perhaps privately), we can investigate further.
curl -X POST -H "Content-Type: application/json" --data '{"follow":true,"all":true}' "$BALENA_SUPERVISOR_ADDRESS/v2/journal-logs?apikey=$BALENA_SUPERVISOR_API_KEY" --no-buffer
I ran this on your device, from inside the container that I assume is trying to stream the logs via the supervisor API. The command works fine, and I can even see the error messages generated by the container’s application logic saying that the connection to the supervisor API is being reset.
How are you programmatically making this request? As the curl from inside the application container seems to work fine, I believe there is something in the application logic that we must investigate.
Thanks for investigating @rcooke-warwick! We have shared a UUID of a problematic device through a direct support message.
As noted in my original post, we were also able to establish the stream via curl and see the errors - and it was this output that indicated an unusually fast termination of the request (stream).
I’ll try to reproduce this locally. Could you elaborate on “before eventually hitting a loop of immediate aborts”? What log messages (if any) are received client-side?
Sorry for not being clear: this is some of our proprietary (IP) code that re-establishes the journal stream (a new POST request), but it continually fails for the device referenced by the UUID we just shared. I’m sure you gathered this from the error stream.
We haven’t been able to systematically reproduce the error, but you might try to reproduce using e.g. axios (setting responseType: "stream").
Of course it is possible this relates to the specific device’s network stability, interacting with the specific application code. However, it would be useful to rule out any server-side connection issues (or terminations, etc).
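For reference, here is a minimal sketch along the lines of what our application does (assuming axios in a Node environment; the logging shown is illustrative rather than our actual code):

import axios from 'axios';

// Open the journal stream via the supervisor API and report when it ends or errors.
async function streamJournal(): Promise<void> {
  const url = `${process.env.BALENA_SUPERVISOR_ADDRESS}/v2/journal-logs?apikey=${process.env.BALENA_SUPERVISOR_API_KEY}`;

  const response = await axios.post(
    url,
    { follow: true, all: true },
    { responseType: 'stream' },
  );

  response.data.on('data', (chunk: Buffer) => process.stdout.write(chunk));
  response.data.on('error', (err: Error) => console.error('journal stream errored:', err.message));
  response.data.on('end', () => console.log('journal stream ended'));
}

streamJournal().catch((err) => console.error('request failed:', err.message));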
Hello, we are still investigating this issue. I created a minimal example here using axios: GitHub - otaviojacobi/debug-axios-stream, but it worked fine for me on balenaOS 2023.7.0 with supervisor 14.10.10. I will reflash my debug device to balenaOS 2.115.18+rev2, supervisor 14.11.8 to see if I can reproduce.
Hi there, it is hard to tell exactly what is going on with the device you shared, and we have not been able to replicate the issue. What I can tell you is that this endpoint has unfortunately not been tested for the kind of continuous usage you require, and has not been tested with axios. It’s very likely that this endpoint will be removed in a future v3 of the API. Moreover, even if the endpoint were stable, the supervisor is a service that may be restarted by the OS at any point because of configuration changes or if it’s found to be unhealthy, which will cause any long-running connections to be interrupted.
If you need to reliably query the journal logs, we would recommend using the io.balena.feature.journal-logs label in the container configuration and querying the journal directly with journalctl or another compatible tool.
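As a rough sketch of what that could look like (this assumes the io.balena.feature.journal-logs label is set on the service in docker-compose.yml and that journalctl is installed in the container image):

import { spawn } from 'child_process';

// With the host journal made available inside the container via the label,
// the journal can be followed directly, without going through the supervisor API.
const journalctl = spawn('journalctl', ['--follow', '-a', '-o', 'json'], {
  stdio: ['ignore', 'pipe', 'inherit'],
});

journalctl.stdout.on('data', (chunk: Buffer) => {
  // Newline-delimited JSON journal entries.
  process.stdout.write(chunk);
});

journalctl.on('close', (code) => {
  console.log(`journalctl exited with code ${code}`);
});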
Please let us know if this works for you or if you need help with using the journal-logs label.
Thank you for your further investigation; it’s interesting to know that this might not have been a supported usage of this API endpoint - possibly something to note in the docs! I appreciate that there are several reasons the supervisor might restart, but in this case it seems there is a more specific reason why the stream isn’t maintained. We will consider some alternative way to establish the stream - as we both noted, a simple curl seems to be quite a bit more stable…
Is it possible to provide a rough estimate of when v3 of the API will become available? That would be much appreciated, if this endpoint is going to be deprecated/removed.
it’s interesting to know that this might not have been a supported usage of this API endpoint - possibly something to note in the docs!
Yes, I fully agree with this. I’ll add a note to the docs.
Is it possible to provide a rough estimate of when v3 of the API will become available? That would be much appreciated, if this endpoint is going to be deprecated/removed.
We are not actively working on the v3 API, but it’s definitely on my mind. Even when v3 is released, we’ll need to deprecate v1 and v2 in advance, and we’ll probably need to do an OS major version bump to remove them to avoid breaking any user devices.
That’s another way of saying it’s going to take some time.