For those running openbalena who have observed an error message that regularly appears in device supervisor logs:
[error] LogBackend: server responded with status code: 504
The mystery has finally been solved - and I’d like to share the resolution with the community.
First, some background: device logs (including container logs and other supervisor events) are passed from the device's balena-supervisor to open-balena-api via the endpoint /device/v2/:uuid/log-stream. Upon receiving a log event, balena-supervisor opens a connection to the log-stream endpoint, but rather than streaming log events directly to it, aggregates them in a local buffer first. This buffer is then flushed to the log-stream endpoint after aggregating 50 log lines or after 60 seconds of inactivity, whichever comes first.
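To make the flush rule concrete, here is a minimal sketch of a buffer that flushes on 50 lines or 60 seconds, whichever comes first. This is not the actual balena-supervisor code: the `LogBuffer` class, its `send` callback and the `tick()` method are hypothetical names for illustration, and it reads "inactivity" as time since the last flush, which is an assumption.

```python
import time
from typing import Callable, List

MAX_LINES = 50          # flush once this many lines are buffered (from the post)
MAX_IDLE_SECONDS = 60   # flush after this long without a flush (from the post)


class LogBuffer:
    """Hypothetical buffer: flushes on 50 lines or 60 s, whichever comes first."""

    def __init__(self, send: Callable[[List[str]], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._send = send            # stand-in for the POST to log-stream
        self._clock = clock          # injectable so the timing can be tested
        self._lines: List[str] = []
        self._last_flush = clock()

    def add(self, line: str) -> None:
        """Buffer one log line; flush immediately when the line limit is hit."""
        self._lines.append(line)
        if len(self._lines) >= MAX_LINES:
            self.flush()

    def tick(self) -> None:
        """Call periodically; flushes pending lines once the time limit passes."""
        if self._lines and self._clock() - self._last_flush >= MAX_IDLE_SECONDS:
            self.flush()

    def flush(self) -> None:
        """Send whatever is buffered and restart the timer."""
        if self._lines:
            self._send(self._lines)
            self._lines = []
        self._last_flush = self._clock()
```

The point to notice is that a quiet device holds its connection open for up to 60 seconds without sending anything, which is what matters for the proxy timeout discussed next.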
We recently added a loki server to our openbalena instance to aggregate and monitor server-side logs, which prompted us to look into why devices were routinely failing to post logs to the log-stream endpoint (which in turn posts those logs to loki). It turns out to be a timeout mismatch between balena-supervisor and haproxy-ingress. haproxy-ingress has a default connection timeout of 50 seconds, so unless your device generates more than 50 log lines in 50 seconds, the server times out the connection before the buffer is flushed. The result is the LogBackend 504 message showing up on the device, and the log lines never reaching the log-stream endpoint.
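The arithmetic behind the mismatch can be sketched as follows. This is a simplified model that assumes a steady log rate and takes the 50-line and 60-second figures from the description above; the function names are hypothetical.

```python
FLUSH_LINES = 50       # supervisor flushes at 50 buffered lines (from the post)
FLUSH_SECONDS = 60.0   # ...or after 60 seconds (from the post)


def seconds_until_flush(lines_per_second: float) -> float:
    """Time until the first flush under a steady log rate (simplified model)."""
    if lines_per_second <= 0:
        return FLUSH_SECONDS
    return min(FLUSH_LINES / lines_per_second, FLUSH_SECONDS)


def proxy_times_out(lines_per_second: float, proxy_timeout: float) -> bool:
    """True if the proxy would drop the idle connection before the flush."""
    return seconds_until_flush(lines_per_second) >= proxy_timeout


# A device logging one line every two seconds, behind a 50 s proxy timeout:
print(proxy_times_out(0.5, 50.0))   # flush would happen at 60 s, after the cutoff
# The same device behind a 75 s proxy timeout:
print(proxy_times_out(0.5, 75.0))   # flush at 60 s happens before the cutoff
```

Under this model, any proxy timeout above 60 seconds is safe regardless of log rate, because the time-based flush always fires first.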
To solve this, we had to change the haproxy-ingress default timeouts. In our case the changes were made via a helm script config, but depending on your configuration they might be implemented differently (the underlying settings changes should remain the same):
config:
  timeout-server: 75s
  timeout-server-fin: 75s
  timeout-client: 75s
  timeout-client-fin: 75s
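If you want to sanity-check your own values, the rule is simply that each timeout must be longer than the supervisor's 60-second flush interval. A minimal sketch, assuming the timeouts are available as a plain dict of strings; the helper names are hypothetical, and treating unit-less values as milliseconds follows HAProxy's own time format, which I am assuming also applies to haproxy-ingress:

```python
import re
from typing import Dict, List

SUPERVISOR_FLUSH_SECONDS = 60.0  # flush interval described in the post

# HAProxy-style time suffixes, converted to seconds.
_UNIT_SECONDS = {"us": 1e-6, "ms": 1e-3, "s": 1.0,
                 "m": 60.0, "h": 3600.0, "d": 86400.0}


def to_seconds(value: str) -> float:
    """Parse a duration such as '75s' or '2m' into seconds."""
    match = re.fullmatch(r"(\d+)(us|ms|s|m|h|d)?", value.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {value!r}")
    number, unit = match.groups()
    # Assumption: a bare number means milliseconds, as in HAProxy itself.
    return int(number) * _UNIT_SECONDS[unit or "ms"]


def too_short(timeouts: Dict[str, str]) -> List[str]:
    """Return the keys whose timeout does not exceed the flush interval."""
    return [key for key, value in timeouts.items()
            if to_seconds(value) <= SUPERVISOR_FLUSH_SECONDS]


print(too_short({"timeout-server": "50s", "timeout-client": "75s"}))
```

With the 75s values shown above, `too_short` returns an empty list.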
After deploying this change, all of the 504 errors should disappear (because balena-supervisor flushes the log after 60 seconds, before the server aborts the connection), and logs are captured by loki.
Hope this helps!