Notifications for production errors?

mpark · May 1, 2019, 12:46pm

Hello guys.

Is there a feature in Balena for sending alerts on certain log metrics, for example when an exception shows up in a log? I am thinking of something like the metric alarms I can set up in AWS CloudWatch or the like for a server running in the cloud.

What do you guys use to ensure QA on your running fleet and quickly respond to errors?

I have searched Google and the forums, but have not found anything.

Best, mpark

xginn8 · May 1, 2019, 2:35pm

Hey @mpark,

First off, thanks for bringing this topic up! It is a great thing to discuss and it is something we are actively working on. tl;dr as of today, there are no features for alerting like you describe, but we would love to know more about your use case.

We have published a few recommendations on how to log remotely or deploy to a canary group of devices. Canary deploys should mitigate the risk associated with a faulty push, and are an integral part of managing a production fleet.

Are you already deploying new releases to canary groups? If so, what types of errors are still causing downtime and headaches for you?

mpark · May 1, 2019, 3:39pm

Hey, thanks for the inquisitive response!

I am not using canary at the moment. I honestly did not know it existed until now.

I have not deployed my first fleet to production yet, although I have already made arrangements with a customer to get some RPIs out in the field. So to sleep better at night, I would like to be notified on errors.

My use case would primarily be to get alarms in my inbox when a log matches a certain filter, basically on any and all errors. I think the log agent could be a way to start out, at least.

Best
mpark
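The log-filter idea in the post above could be sketched roughly as follows. This is only an illustration under stated assumptions: the pattern, the `matching_lines` and `build_alert` helpers, and the device name are all hypothetical, and actually delivering the alert (email, Slack, and so on) is left out.

```python
import re
from typing import Iterable, Iterator, List, Optional

# Hypothetical pattern: adjust it to whatever your application actually logs.
ERROR_PATTERN = re.compile(r"\b(ERROR|CRITICAL|Exception|Traceback)\b")


def matching_lines(lines: Iterable[str],
                   pattern: "re.Pattern[str]" = ERROR_PATTERN) -> Iterator[str]:
    """Yield only the log lines that match the error pattern."""
    for line in lines:
        if pattern.search(line):
            yield line.rstrip("\n")


def build_alert(device: str, hits: List[str]) -> Optional[str]:
    """Return an alert body for the given hits, or None if there is nothing to report."""
    if not hits:
        return None
    header = f"{len(hits)} error line(s) on {device}"
    return "\n".join([header] + hits)
```

A filter like this only catches errors whose wording you anticipated, so it works best as a first step rather than a complete solution.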

xginn8 · May 1, 2019, 4:57pm

Hi again @mpark,

If you have the development bandwidth, it is always good practice to set up monitoring for your services & devices. Some common stacks include Telegraf/TICK stack, Prometheus, or Datadog. I hope to publish an updated Prometheus guide soon, so stay tuned for that as well. Moreover, something like a log forwarding service can be useful if you have a robust logging setup, though I find pattern matching in logs to be a little brittle for arbitrary errors in production.

Additionally, there are some things you can do on-device to make your application more resilient to failure. We always recommend configuring a HEALTHCHECK in your Dockerfile, and making sure you have tested some common failure cases for your app.
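As a rough illustration of the HEALTHCHECK suggestion, one possible approach is a small script that the Dockerfile invokes, for example with `HEALTHCHECK --interval=30s CMD python3 /usr/src/app/healthcheck.py`. Everything specific here is an assumption: the script path, the heartbeat file, and the 60-second threshold are hypothetical, and the application is assumed to touch the heartbeat file regularly while it is working. Docker treats exit code 0 as healthy and 1 as unhealthy.

```python
import os
import time
from typing import Optional

# Hypothetical path and threshold: the application is assumed to touch this
# file regularly while it is working normally.
HEARTBEAT_FILE = os.environ.get("HEARTBEAT_FILE", "/tmp/app-heartbeat")
MAX_AGE_SECONDS = 60.0


def is_healthy(path: str, max_age: float, now: Optional[float] = None) -> bool:
    """Return True if the heartbeat file exists and was modified recently."""
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        # A missing or unreadable heartbeat file counts as unhealthy.
        return False
    current = time.time() if now is None else now
    return (current - mtime) <= max_age


def main() -> int:
    """Return the exit code Docker expects: 0 for healthy, 1 for unhealthy."""
    return 0 if is_healthy(HEARTBEAT_FILE, MAX_AGE_SECONDS) else 1

# When used as the health check script, end the file with:
#   import sys; sys.exit(main())
```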

Again, we are working on many of these problems now internally, so please let us know what issues you run into or what would make your life as a fleet owner easier!

mpark · May 2, 2019, 7:47am

I’ve looked a bit into what you’ve linked. I think the Prometheus guide is not an optimal design for IoT in general, since I believe the way to go is to have your IoT units isolated and only pushing outwards, not receiving calls inbound from a Prometheus server, unless the server can be placed in the same network as your IoT devices.

However, at least in my case, I plan to have multiple IoT instances spread across locations, even countries, for a digital signage solution. Perhaps the Pushgateway is an option? But they seem to write that it’s not recommended in many cases.

Another approach could be DataDog or NewRelic (is that still a thing?) I reckon, or some other hosted solution. Perhaps Grafana Cloud. Obviously, it all costs money, so…

bversluijs · May 2, 2019, 11:09am

Hi @mpark,

We’re using Sentry to log any errors in our software, including our Balena devices!

For the Balena devices, we’ve made an API that sends Sentry errors to our server before adding them to our Sentry dashboard. This is for authentication purposes and for adding the serial number to any errors, so we know which devices have these errors.
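The enrichment step described above might look something like the sketch below. It is not the poster's actual implementation: `enrich_event` and the `device_serial` tag name are hypothetical, and the event is treated as a plain dictionary with the `tags` and `release` fields that Sentry event payloads use.

```python
import copy
from typing import Any, Dict, Optional


def enrich_event(event: Dict[str, Any],
                 serial_number: str,
                 release_id: Optional[int] = None) -> Dict[str, Any]:
    """Return a copy of the event tagged with the device serial and release."""
    enriched = copy.deepcopy(event)  # leave the caller's event untouched
    tags = dict(enriched.get("tags") or {})
    tags["device_serial"] = serial_number  # hypothetical tag name
    enriched["tags"] = tags
    if release_id is not None:
        enriched["release"] = str(release_id)
    return enriched
```

Tagging on the server rather than on the device means a device cannot report errors under another device's serial number, provided the server derives the serial from the authenticated request.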

Sentry has many options, like adding the release ID, adding user details, etc. It has a paid option, but also an open-source version which you can run on your own server! It sends emails to your inbox with the error, but also has options to send it via Slack or other channels.

I’ve integrated it in our software (NodeJS), but it has many supported languages and options. Perhaps this is what you need!

Only question I have for the balena team regarding this: how can I retrieve the release ID of the current running containers? (Multi-container setup). This way I can set the correct Release ID to the Sentry setup!

mpark · May 2, 2019, 11:34am

Sounds cool, will check it out!

mpark · May 2, 2019, 11:40am

Regarding getting the release id, you can maybe ask the supervisor?

https://www.balena.io/docs/reference/supervisor/supervisor-api/#get-v2applicationsstate

bversluijs · May 2, 2019, 11:58am

Yeah, but I don’t know if it’s going to change when a new release is downloaded, so I’d like some clarity on that. But that was my thought too! I’m going to use the release ID for more things inside my software, like restarting the UI-interface when it changes. But thanks for the tip!

xginn8 · May 2, 2019, 12:43pm

If you are concerned about Prometheus’ pull architecture over the WAN (fair concern!), I recommend looking into Telegraf or Datadog. It is pretty straightforward to roll your own TICK stack if desired, although Datadog may be quicker to get up and running (if perhaps more expensive).

If you still want to test out Prometheus, another option is setting up a VPN and scraping over that interface rather than the public URL. The Pushgateway is definitely useful, but at the same time quite heavily discouraged for anything that can avoid it. You lose a lot of the niceties of the pull architecture when using the Pushgateway, and Prometheus does not really have workarounds for that in lots of cases.

xginn8 · May 2, 2019, 1:23pm

@bversluijs to your question about tracking the releaseId of currently running containers, @mpark is correct in that you can remotely query the supervisor for the current set of running containers. These values will be updated as soon as a new release is deployed to your device, and reflect the current state rather than the target state. I hope this helps!
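For reference, reading the release ID per service from the endpoint linked earlier might look roughly like this when done from inside a container. It is a sketch under assumptions: it relies on the `BALENA_SUPERVISOR_ADDRESS` and `BALENA_SUPERVISOR_API_KEY` variables being exposed to the container, and on the response being a mapping of application name to an object with a `services` map whose entries carry a `releaseId`. The helper names are hypothetical; check the supervisor API documentation for your supervisor version.

```python
import json
import os
import urllib.request
from typing import Any, Dict, Optional


def fetch_applications_state() -> Dict[str, Any]:
    """Query the local supervisor; only works inside a container on a device."""
    address = os.environ["BALENA_SUPERVISOR_ADDRESS"]
    api_key = os.environ["BALENA_SUPERVISOR_API_KEY"]
    url = f"{address}/v2/applications/state?apikey={api_key}"
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)


def release_ids_by_service(state: Dict[str, Any]) -> Dict[str, Optional[int]]:
    """Map each service name to the releaseId reported for it (assumed shape)."""
    result: Dict[str, Optional[int]] = {}
    for app in state.values():
        for name, service in app.get("services", {}).items():
            result[name] = service.get("releaseId")
    return result
```

On a device, `release_ids_by_service(fetch_applications_state())` could then be polled to notice when the running release changes, for example to update the release reported to Sentry.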

bversluijs · May 3, 2019, 12:41pm

@xginn8 thanks for the confirmation! That’s the way to go for me!

sradevski · May 3, 2019, 3:40pm

I am glad you found Matthew’s recommendations useful. Let us know how it goes once you set up your alerting and what solution you went with!

mpark · May 30, 2019, 12:33pm

So I have been running my digital screens with Sentry for a few weeks now, and so far it has been working really well. I quickly discovered some bugs I had introduced, with some authentication tokens not updating, which were not caught when testing up front.

So far, I can recommend using Sentry, at least for my use case.

afitzek · May 30, 2019, 12:40pm

Hi,
Thank you very much for sharing your experience with Sentry. Happy that it works well for you.
