I’m looking for a way to notify ops when our devices are not ok. I really like the device diagnostics feature, since it already covers a lot of ground. To start with, I’m thinking of making a batch job that triggers the diagnostics check every few hours for all of our online devices and sends Slack messages for failed checks.
It seems to me that would give me a lot of operational insight for not a whole lot of effort, and with minimal use of bandwidth. These are all things I like very much.
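A minimal sketch of the reporting half of such a batch job, in Python. It assumes each check result is a dict with `name`, `success` and `status` keys, and that `fetch_results` is a hypothetical callable you supply to trigger and read the diagnostics for one device; neither is a confirmed API. The Slack side uses a standard incoming webhook, which accepts a JSON body with a `text` field.

```python
import json
import urllib.request
from typing import Callable, Dict, Iterable, List, Optional


def failed_checks(results: Iterable[Dict]) -> List[Dict]:
    """Return only the checks that did not succeed."""
    return [r for r in results if not r.get("success", False)]


def build_slack_payload(device_name: str, results: Iterable[Dict]) -> Optional[Dict]:
    """Build a Slack message for one device, or None if all checks passed."""
    failures = failed_checks(results)
    if not failures:
        return None
    lines = [f"Device {device_name}: {len(failures)} diagnostics check(s) failed"]
    for check in failures:
        name = check.get("name", "unknown")
        status = check.get("status", "no status")
        lines.append(f"- {name}: {status}")
    return {"text": "\n".join(lines)}


def post_to_slack(webhook_url: str, payload: Dict) -> None:
    """POST the payload to a Slack incoming webhook."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        response.read()


def notify_ops(
    webhook_url: str,
    devices: Iterable[str],
    fetch_results: Callable[[str], Iterable[Dict]],  # hypothetical, supplied by you
    send: Callable[[str, Dict], None] = post_to_slack,
) -> int:
    """Send one message per device with failures; return how many were sent."""
    sent = 0
    for device_name in devices:
        payload = build_slack_payload(device_name, fetch_results(device_name))
        if payload is not None:
            send(webhook_url, payload)
            sent += 1
    return sent
```

Devices whose checks all pass produce no message at all, so Slack only hears about problems. Running `notify_ops` from cron or any scheduler every few hours would cover the "batch job" part.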
Now I was wondering if I could use this as basic infrastructure towards better monitoring overall. Can I add my own application-specific device checks to the diagnostics? I found this https://github.com/balena-io/device-diagnostics and that looks quite easy to extend. But how would I go about getting those extensions on my devices? And would the API and dashboard automatically pick those new checks up?
Hello,
You can fork the diagnostics repository and extend it.
But unless you create a pull request and it gets accepted, your additions won’t be picked up on the dashboard.
You can still add your own diagnostics container to your application using your forked repository.
I’m the main developer responsible for the diagnostics at the moment, so thanks for the feedback! We are glad you find them valuable as-is. To add to my colleague’s comment, the ability to add custom checks is already on our radar, though it may not necessarily run through this same interface. There is an open issue for custom checks that you should feel free to subscribe to for updates: https://github.com/balena-io/device-diagnostics/issues/157.
As you noted, the diagnostics themselves are open source and are hopefully structured in a way that makes contribution easy. If there are generalizable improvements you’re considering, PRs are more than welcome.
If you have any other feedback, I’m always happy to learn more about what you do or do not find useful.
I will definitely keep an eye on that issue. My main interest at the moment is really in checks that are quite specific to our applications. But should we come up with a generic one we’ll shoot a PR.
I think for now I’ll just start by leveraging the Docker health checks. The check_service_restarts check would then give us a heads-up to go take a look at what exactly happened.
But I’m curious, what are other people doing to monitor the state of their devices out in the wild?
Unfortunately, our devices are usually connected over rather expensive satellite connections. Our customers want us to keep the bandwidth usage of the devices down as much as possible, so I’d like to do without sending a whole slew of metrics to a central monitoring system.
If it hadn’t been for the bandwidth constraints I’d probably have opted for Prometheus.
@ErikHH interesting situation! If you can’t send metrics to a central system, are you planning to keep it on-device? I suppose you could keep it on-device and only send out an alert or trigger an event if something goes out of threshold? I (and I know others) would be interested to know what you do end up deciding to do for your solution so please keep us posted!
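To make the "only send out an alert when something goes out of threshold" idea concrete, here is a small Python sketch. The class name, metric name and limits are made up for illustration. It keeps the last known state per metric on the device and returns a message only when a metric crosses its limit or recovers, so a metric that stays bad does not generate repeated traffic.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ThresholdAlerter:
    """Tracks metrics locally and reports only state changes."""

    limits: Dict[str, float]  # metric name -> upper limit (illustrative)
    _breached: Dict[str, bool] = field(default_factory=dict)

    def observe(self, metric: str, value: float) -> Optional[str]:
        """Record a sample; return a message only on a state transition."""
        limit = self.limits[metric]
        now_breached = value > limit
        was_breached = self._breached.get(metric, False)
        self._breached[metric] = now_breached
        if now_breached and not was_breached:
            return f"ALERT {metric}={value} exceeds {limit}"
        if was_breached and not now_breached:
            return f"RESOLVED {metric}={value} back under {limit}"
        return None  # no change, nothing to send over the link


alerter = ThresholdAlerter(limits={"cpu_temp_c": 80.0})
for sample in (70.0, 85.0, 90.0, 75.0):
    message = alerter.observe("cpu_temp_c", sample)
    if message is not None:
        print(message)  # in practice: hand off to whatever sends the alert
```

Only two of the four samples above produce a message (the breach and the recovery). The state lives in memory here; persisting it to disk would be needed if alerts must survive a service restart.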
@chrisys Yes, that’s basically the idea. This is the main reason the diagnostics are so useful to me. They run on the device and only send back a small summarized result.
I’m going to see how far I can get by making advanced health checks an integral part of each of the services running on the device. Then I’ll leverage the diagnostics checks to alert us about services that have been restarted after these health checks have failed. Fairly minimal maybe, but a hell of a lot more than I have now.
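As a rough illustration of what such a per-service health check could look like, here is a Python sketch. The two checks (a heartbeat file the service touches periodically, and free disk space) and all paths and limits are hypothetical examples, not anything specific to balena. The only convention relied on is Docker's: a health check command exits with 0 when healthy and 1 when unhealthy.

```python
import os
import shutil
import sys
import time
from typing import Callable, List, Tuple

# A check returns (ok, human-readable message).
Check = Callable[[], Tuple[bool, str]]


def heartbeat_is_fresh(path: str, max_age_s: float) -> Tuple[bool, str]:
    """Healthy if the service touched its heartbeat file recently."""
    try:
        age = time.time() - os.path.getmtime(path)
    except OSError:
        return False, f"heartbeat file {path} is missing"
    if age > max_age_s:
        return False, f"heartbeat is {age:.0f}s old (limit {max_age_s:.0f}s)"
    return True, "heartbeat is fresh"


def disk_has_space(path: str, min_free_bytes: int) -> Tuple[bool, str]:
    """Healthy if the filesystem holding `path` has enough free space."""
    free = shutil.disk_usage(path).free
    if free < min_free_bytes:
        return False, f"only {free} bytes free on {path}"
    return True, "enough free disk space"


def run_checks(checks: List[Check]) -> int:
    """Run every check; return 0 if all pass, 1 otherwise."""
    exit_code = 0
    for check in checks:
        ok, message = check()
        if not ok:
            print(message, file=sys.stderr)  # shows up in the health check output
            exit_code = 1
    return exit_code
```

A health check script would end with something like `sys.exit(run_checks([lambda: heartbeat_is_fresh("/tmp/heartbeat", 120), lambda: disk_has_space("/data", 50_000_000)]))` and be wired in through the Dockerfile `HEALTHCHECK` instruction. Running every check instead of stopping at the first failure means the output lists everything that is wrong in one go.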
If that doesn’t suffice, maybe I’ll look at running a small Prometheus on each device and use its alerting capabilities to phone home. But my gut feeling is that this would be quite a lot of overhead on each device, CPU, memory and disk-wise. Might be a bit of a squeeze.
Anyway, I’ll keep the community posted on how it goes.