Notifications for production errors?

Hello guys.

Is there a feature in Balena for sending alerts based on log metrics, for example when an exception shows up in a log? I am thinking of something like the alert metrics you can set up in AWS CloudWatch for a server running in the cloud.

What do you guys use to ensure QA on your running fleet and quickly respond to errors?

I have searched Google and the forums but have not found anything.

Best, mpark

Hey @mpark,

First off, thanks for bringing this topic up! It is a great thing to discuss and it is something we are actively working on. tl;dr as of today, there are no features for alerting like you describe, but we would love to know more about your use case.

We have published a few recommendations on how to log remotely or deploy to a canary group of devices. Canary deploys should mitigate the risk associated with a faulty push, and are an integral part of managing a production fleet.

Are you already deploying new releases to canary groups? If so, what types of errors are still causing downtime and headaches for you?

Hey, thanks for the inquisitive response!

I am not using canary at the moment. I honestly did not know it existed until now. :laughing:

I have not deployed my first fleet to production yet, although I have already made arrangements with a customer to get some RPIs out in the field. So to sleep better at night, I would like to be notified on errors.

My use case would primarily be to get alarms in my inbox when a log matches a certain filter, basically on any and all errors. I think the log agent could be a way to start out, at least.
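As a rough illustration of the "alarm when a log matches a filter" idea, here is a minimal Python sketch. The pattern, device name, and function names are all hypothetical, and actually sending the email (or fetching the logs) is left out; it only shows the filtering and the alert text.

```python
import re
from typing import Iterable, Iterator, List

# Hypothetical filter: a line counts as an error if it contains one of these markers.
ERROR_PATTERN = re.compile(r"\b(ERROR|CRITICAL|Exception|Traceback)\b")


def matching_lines(lines: Iterable[str], pattern: "re.Pattern[str]" = ERROR_PATTERN) -> Iterator[str]:
    """Yield only the log lines that match the error pattern."""
    for line in lines:
        if pattern.search(line):
            yield line.rstrip("\n")


def build_alert(device: str, errors: List[str]) -> str:
    """Format the matched lines into a plain-text alert body, e.g. for an email."""
    header = f"{len(errors)} error line(s) on device {device}"
    return "\n".join([header, *errors])


if __name__ == "__main__":
    sample = [
        "INFO player started",
        "ERROR token refresh failed",
        "Traceback (most recent call last):",
    ]
    print(build_alert("screen-01", list(matching_lines(sample))))
```

The match is case-sensitive on purpose, so a harmless line such as "0 errors found" does not trigger an alert.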

Best
mpark

Hi again @mpark,

If you have the development bandwidth, it is always good practice to set up monitoring for your services & devices. Some common stacks include the telegraf/TICK stack, Prometheus, or Datadog. I hope to publish an updated Prometheus guide soon, so stay tuned for that as well. Moreover, something like a log forwarding service can be useful if you have a robust logging setup, though I find pattern matching in logs to be a little brittle for arbitrary errors in production.

Additionally, there are some things you can do on-device to make your application more resilient to failure. We always recommend configuring a HEALTHCHECK in your Dockerfile, and making sure you have tested some common failure cases for your app.
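To make the HEALTHCHECK suggestion a bit more concrete, here is a minimal sketch of a check script in Python. It assumes the app exposes a local HTTP status endpoint; the URL, port, file name, and `--check` flag are all hypothetical. A Dockerfile could then invoke it with something like `HEALTHCHECK CMD python3 healthcheck.py --check`.

```python
import sys
import urllib.error
import urllib.request

# Hypothetical endpoint: assumes the application serves a status page locally.
HEALTH_URL = "http://127.0.0.1:8080/health"


def is_healthy(url: str = HEALTH_URL, timeout: float = 3.0) -> bool:
    """Return True if the endpoint answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP 4xx/5xx, or a malformed URL.
        return False


def exit_code(healthy: bool) -> int:
    """Docker treats exit code 0 as healthy and 1 as unhealthy."""
    return 0 if healthy else 1


if __name__ == "__main__" and "--check" in sys.argv:
    sys.exit(exit_code(is_healthy()))
```

Checking an endpoint the app itself serves is only one option; a check that a process is alive or that a file was touched recently would follow the same exit-code convention.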

Again, we are working on many of these problems now internally, so please let us know what issues you run into or what would make your life as a fleet owner easier!

I’ve looked a bit into what you’ve linked. I think the Prometheus guide is not an optimal design for IoT in general. I believe the way to go is to have your IoT units isolated and only pushing outwards, not receiving inbound calls from a Prometheus server, unless the server can be placed in the same network as your IoT devices.

However, at least in my case, I plan to have multiple IoT instances spread across locations, even countries, for a Digital Signage solution. Perhaps the Pushgateway is an option? But they seem to write that it’s not recommended in many cases.

Another approach could be DataDog or NewRelic (is that still a thing?) I reckon, or some other hosted solution. Perhaps Grafana Cloud. Obviously, it all costs money, so… :man_shrugging:

Hi @mpark,

We’re using Sentry to log any errors in our software, including our Balena devices!

For the Balena devices, we’ve made an API that sends Sentry errors to our server before adding them to our Sentry dashboard. This is for authentication purposes and for adding the serial number to any errors, so we know which devices have these errors.

Sentry has many options, like adding the release ID, adding user details, etc. It has a paid option, but also an open-source version which you can run on your own server! It sends emails to your inbox with the error, but also has options to send it via Slack or other channels.

I’ve integrated it in our software (NodeJS), but it has many supported languages and options. Perhaps this is what you need!
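For anyone who wants a starting point, here is a minimal sketch in Python rather than NodeJS. It tags events directly from the device instead of going through an intermediate API like the one described above, so it is a simpler variant and not the same setup. `SENTRY_DSN` is a hypothetical variable you would set yourself, and the sketch assumes the device UUID is available to the container as `BALENA_DEVICE_UUID`.

```python
from typing import Any, Dict, Mapping, Optional


def sentry_settings(env: Mapping[str, str], release: Optional[str] = None) -> Dict[str, Any]:
    """Collect Sentry options and tags from the container environment."""
    return {
        # Hypothetical variable name; set it to your project's DSN.
        "dsn": env.get("SENTRY_DSN", ""),
        # For example a release ID obtained elsewhere.
        "release": release,
        "tags": {
            # Assumption: balena exposes the device UUID under this name.
            "device_uuid": env.get("BALENA_DEVICE_UUID", "unknown"),
        },
    }


def init_sentry(settings: Dict[str, Any]) -> None:
    """Initialise the Sentry SDK; requires `pip install sentry-sdk`."""
    import sentry_sdk  # imported lazily so sentry_settings works without the SDK

    sentry_sdk.init(dsn=settings["dsn"], release=settings["release"])
    for key, value in settings["tags"].items():
        sentry_sdk.set_tag(key, value)
```

Calling `init_sentry(sentry_settings(os.environ))` once at startup would be the typical usage; after that, unhandled exceptions are reported with the device tag attached.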

Only question I have for the balena team regarding this: how can I retrieve the release ID of the current running containers? (Multi-container setup). This way I can set the correct Release ID to the Sentry setup!

Sounds cool, will check it out! :+1:

Regarding getting the release id, you can maybe ask the supervisor?

https://www.balena.io/docs/reference/supervisor/supervisor-api/#get-v2applicationsstate

Yeah, but I don’t know if it’s going to change when a new release is downloaded, so I’d like some clarity on that. But that was my thought too! I’m going to use the release ID for more things inside my software, like restarting the UI when it changes. But thanks for the tip!

If you are concerned about Prometheus’ pull architecture over the WAN (fair concern!), I recommend looking into telegraf or Datadog. It is pretty straightforward to roll your own TICK stack if desired, but Datadog may be quicker to get up and running (if perhaps more expensive).

If you still want to test out Prometheus, another option is setting up a VPN and scraping over that interface rather than the public URL. The Pushgateway is definitely useful, but at the same time quite heavily discouraged for anything that can avoid it. You lose a lot of the niceties of the pull architecture when using the Pushgateway, and Prometheus does not really have workarounds for that in lots of cases.
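For reference, pushing to a Pushgateway is just an HTTP request carrying metrics in the Prometheus text format, so a device only needs outbound access. A minimal Python sketch using the standard library follows; the gateway address, job name, instance name, and metric name are hypothetical, and it only handles simple unlabelled metrics.

```python
import urllib.request
from typing import Mapping
from urllib.parse import quote


def push_url(base: str, job: str, instance: str) -> str:
    """Build the Pushgateway grouping URL for a job/instance pair."""
    return "{}/metrics/job/{}/instance/{}".format(
        base.rstrip("/"), quote(job, safe=""), quote(instance, safe="")
    )


def exposition(metrics: Mapping[str, float]) -> bytes:
    """Render metrics in the text format, one `name value` pair per line."""
    return "".join(f"{name} {value}\n" for name, value in metrics.items()).encode()


def push(base: str, job: str, instance: str, metrics: Mapping[str, float]) -> int:
    """PUT the metrics to the gateway, replacing the previous group; returns the HTTP status."""
    request = urllib.request.Request(
        push_url(base, job, instance), data=exposition(metrics), method="PUT"
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

A device would call something like `push("http://gateway.example:9091", "signage", "screen-01", {"player_up": 1})` on a timer. Note that the gateway keeps the last pushed value indefinitely, which is one of the niceties of the pull model you give up.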

@bversluijs to your question about tracking the releaseId of currently running containers, @mpark is correct in that you can remotely query the supervisor for the current set of running containers. These values are updated as soon as a new release is deployed to your device, and reflect the current state rather than the target state. I hope this helps!
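A minimal sketch of reading the release ID from inside a container, in Python. It queries the supervisor locally rather than remotely, and it rests on a few assumptions that are worth checking against the supervisor API docs linked earlier: that `BALENA_SUPERVISOR_ADDRESS` and `BALENA_SUPERVISOR_API_KEY` are exposed to the service, and that the response maps each app name to an object whose `services` entries carry a `releaseId` field.

```python
import json
import os
import urllib.request
from typing import Any, Dict


def fetch_state() -> Dict[str, Any]:
    """Query the supervisor's /v2/applications/state endpoint from inside a container."""
    url = "{}/v2/applications/state?apikey={}".format(
        os.environ["BALENA_SUPERVISOR_ADDRESS"],
        os.environ["BALENA_SUPERVISOR_API_KEY"],
    )
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)


def release_ids(state: Dict[str, Any]) -> Dict[str, Any]:
    """Map each service name to its releaseId, under the assumed payload shape."""
    result: Dict[str, Any] = {}
    for app in state.values():
        for name, service in app.get("services", {}).items():
            result[name] = service.get("releaseId")
    return result
```

Polling `release_ids(fetch_state())` and comparing it with the previous result would be one way to notice a new release, for example to restart a UI or to update the release reported to an error tracker.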

@xginn8 thanks for the confirmation! That’s the way to go for me!

I am glad you found Matthew’s recommendations useful. Let us know how it goes once you set up your alerting and what solution you went with!

So I have been running my digital screens with Sentry for a few weeks now, and so far it has been working really well. I quickly discovered some bugs I had introduced, where authentication tokens were not updating, which I had not caught when testing up front.

So far I can recommend Sentry, at least for my use case.

Hi,
Thank you very much for sharing your experience with Sentry. Happy that it works well for you.