Observability solutions


What are the recommendations for observability?

Some of the things I want to achieve:

  • centralised searchable logs (ELK style)
  • granular metrics (host & container)
  • alarms on the above

I found that Sumo Logic has a good pricing model for my needs. But are there any already-integrated options?

Hello @zukoo, I don’t have any recommendations, but did you read this?

In the meantime, I will ask internally if anyone from the team can help you!

Here are a couple more blog posts:

Jason Dixon writes a weekly newsletter on monitoring. See the issue archive for lots of use cases and tools.

Thanks for the responses,

I’m now more seriously considering Grafana Cloud and Datadog.

I’ve tried to set up both with the help of the blog posts above. Datadog works for the system metrics, but two things are missing: logs and containers (I can see the images, but none of them show as “Running”).

logs_enabled: true
listeners:
  - name: docker
config_providers:
  - name: docker
    polling: true
logs_config:
  container_collect_all: true
process_config:
  enabled: true
apm_config:
  enabled: false # disable APM
site: us3.datadoghq.com

Not sure if it’s related, but I see these errors:

Hello @zukoo, I’m not sure whether the Datadog blog post is too old. Did you try this?

It’s old as well, so I’m checking the existing pull requests on the associated repo → Pull requests · balena-io-examples/balena-datadog · GitHub

What did you try?

@mpous I tried all three blog posts above. For Grafana, I couldn’t make it work. For Datadog, both the IoT client and the normal client (I had to modify the Dockerfile to get it to work and use the latest version) had the same issue.

This is an example from the IoT client, where metrics work (even the Docker ones) but there are no logs and no containers:

@zukoo I’ve never worked with Datadog! Maybe @kb2ma has some ideas?

On the other hand, what issues do you have with Grafana?

It’s not sending any data. The collector logs are filled with ‘Permanent error: Permanent error: Post "https://1480699:***@https//prometheus-prod-37-prod-ap-southeast-1.grafana.net/api/prom/push\": dial tcp: lookup https on no such host’

EDIT: Sorry, after writing this I realized the issue is with the URL, which shouldn’t include the https. I’m trying without it now.
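
For what it’s worth, the “lookup https … no such host” message is consistent with how a URL of that shape parses: with the scheme repeated after the credentials, the host part becomes the literal word https. A minimal Python sketch using only the standard library shows this. The helper name scheme_in_host is made up for illustration, and the URLs are the one from the error above (with the stray escape dropped) plus a corrected variant:

```python
from urllib.parse import urlsplit


def scheme_in_host(url: str) -> bool:
    """Return True when the parsed host is just a stray 'http' or 'https'."""
    host = urlsplit(url).hostname or ""
    return host in ("http", "https")


# URL as it appeared in the collector error (scheme repeated after the "@").
bad = "https://1480699:***@https//prometheus-prod-37-prod-ap-southeast-1.grafana.net/api/prom/push"
# Same URL with the repeated scheme removed.
good = "https://1480699:***@prometheus-prod-37-prod-ap-southeast-1.grafana.net/api/prom/push"

print(urlsplit(bad).hostname, scheme_in_host(bad))
print(urlsplit(good).hostname, scheme_in_host(good))
```

Which setting expects a bare host and which expects a full URL depends on the collector configuration in use, so this is only a way to sanity-check the final URL before deploying.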

Thanks @mpous, my Grafana setup is now on par with the Datadog one. I don’t see anything in the blog post about logs. Do you recommend any way/exporter to upload the stdout/stderr of my containers to Grafana?

So I didn’t manage to get the agent to tail the containers’ output, but at least temporarily I got the logs out to Datadog directly from my application code using: GitHub - DataDog/datadog-api-client-python: Python client for the Datadog API
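
As a rough idea of what sending logs from application code can look like, here is a minimal sketch of a standard-library logging handler that forwards records through datadog-api-client. This is not the poster’s actual code. The names DatadogHandler, record_to_item, send_items, and the service name are hypothetical; the client classes (LogsApi, HTTPLog, HTTPLogItem) are the ones in the client’s v2 logs API as far as I know, so check them against the installed version. It assumes DD_API_KEY and DD_SITE are set in the environment.

```python
import logging
import socket


def record_to_item(record: logging.LogRecord, service: str) -> dict:
    """Map a stdlib log record to the fields of a Datadog HTTP log item."""
    return {
        "ddsource": "python",
        "ddtags": f"level:{record.levelname.lower()}",
        "hostname": socket.gethostname(),
        "message": record.getMessage(),
        "service": service,
    }


def send_items(items: list) -> None:
    """Submit items via datadog-api-client (imported lazily, so the module
    still loads on machines where the package is not installed)."""
    from datadog_api_client import ApiClient, Configuration
    from datadog_api_client.v2.api.logs_api import LogsApi
    from datadog_api_client.v2.model.http_log import HTTPLog
    from datadog_api_client.v2.model.http_log_item import HTTPLogItem

    # Assumption: Configuration() picks up DD_API_KEY and DD_SITE from the env.
    with ApiClient(Configuration()) as client:
        LogsApi(client).submit_log(body=HTTPLog([HTTPLogItem(**i) for i in items]))


class DatadogHandler(logging.Handler):
    """Forward each log record to a sender callable (send_items by default)."""

    def __init__(self, service: str, sender=send_items):
        super().__init__()
        self.service = service
        self.sender = sender

    def emit(self, record: logging.LogRecord) -> None:
        try:
            self.sender([record_to_item(record, self.service)])
        except Exception:
            # Never let a logging failure crash the application.
            self.handleError(record)
```

Attaching it is the usual logging.getLogger().addHandler(DatadogHandler("my-service")). Sending one HTTP request per record is simple but chatty, so batching records before calling the sender would be a sensible next step on a constrained device.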

That sounds good! Let us know if this is your latest configuration!

Let us know if we can help you more!

I’m heading down this path myself and need to set something up. @zukoo, any updates on your setup and testing?

Hey @philletourneau,

Currently I use both the setup from IoT fleet monitoring with Datadog and balenaCloud: How small agent containers make a big impact - balena Blog to get host metrics, and the Datadog SDK (GitHub - DataDog/datadog-api-client-python: Python client for the Datadog API) to stream logs.

If you find a way to use the agent to stream logs, I’d be interested to know how.

Thanks for the update. I’d love to get logs and host metrics all-in-one too! I haven’t set anything up yet, but Grafana Cloud looks very tempting because of the pricing, though I don’t know if I can figure out how to set it up. It looks like they’re moving to something new now?

@philletourneau @zukoo We started using GCP Monitoring to collect both (logs and metrics), published by an OTEL collector: opentelemetry-collector-contrib/exporter/googlecloudexporter/README.md at main · open-telemetry/opentelemetry-collector-contrib · GitHub

So as long as you get stuff into the collector, everything ends up in a GCP dashboard.
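
To illustrate “getting stuff into the collector” for plain console lines, here is a minimal standard-library sketch that wraps lines in an OTLP/JSON logs payload and builds the HTTP request. It is not the setup described above. It assumes the collector has an OTLP receiver with the HTTP protocol enabled (conventionally port 4318, path /v1/logs); the function names and the otel-collector hostname are hypothetical.

```python
import json
import time
import urllib.request


def otlp_log_payload(service_name, lines, now_ns=None):
    """Wrap plain console lines in an OTLP/JSON logs envelope."""
    ts = str(now_ns if now_ns is not None else time.time_ns())
    return {
        "resourceLogs": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": service_name}},
            ]},
            "scopeLogs": [{
                "logRecords": [
                    {"timeUnixNano": ts, "body": {"stringValue": line}}
                    for line in lines
                ],
            }],
        }],
    }


def build_request(endpoint, payload):
    """Build (but do not send) the POST request for the collector."""
    return urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


payload = otlp_log_payload("my-service", ["service started", "listening on :8080"])
# Hypothetical collector address; adjust to the service name in your compose file.
request = build_request("http://otel-collector:4318/v1/logs", payload)
# urllib.request.urlopen(request) would send it once a collector is reachable.
```

The OpenTelemetry SDKs do this with batching and retries, so a hand-rolled sender like this is mainly useful for seeing the shape of the data or for third-party services whose output can only be captured as text.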

@mpous To get logs and metrics up, running an OTEL collector as a service is obviously easy. For metrics you have to deal with each service, which makes sense because they are very specific. But it would be a tremendous simplification to have balena export the console logs of all services to the collector automatically. Then we wouldn’t have to instrument each service independently. Our devices run 5 to 10 different services, and some are third-party.


@ada this is really interesting feedback!

Could you add this to our public roadmap tool so our team can think about it?


👍 Done: Publish all console logs to OpenTelemetry Collector · Balena Roadmap

Thanks for the additional ideas and feedback, folks!

I just started experimenting with a new solution in beta, Pydantic Logfire | Uncomplicated observability. So far it’s really great if you’re running Python services.