Can you extend device diagnostics with custom checks?

ErikHH · February 26, 2020, 4:22pm

Hi,

I’m looking for a way to notify ops when our devices are not OK. I really like the device diagnostics feature, since it already covers a lot of ground. To start with, I’m thinking of making a batch job that triggers the diagnostics check every few hours for all of our online devices and sends Slack messages for failed checks.

It seems to me that would give me a lot of operational insight for not a whole lot of effort, and with minimal use of bandwidth. These are all things I like very much.
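The reporting half of such a batch job could be sketched roughly as follows. This is only an illustration: the shape of the check results (`name`, `success`, `status`) is an assumption and not the documented diagnostics output, and the step that actually triggers the diagnostics is left out. The only external fact relied on is that a Slack incoming webhook accepts a JSON body with a `text` field.

```python
import json
import urllib.request


def failed_checks(results):
    """Return only the checks that did not pass.

    `results` is a hypothetical list of dicts such as
    {"name": "check_service_restarts", "success": False, "status": "..."}.
    The real diagnostics output may be shaped differently.
    """
    return [r for r in results if not r.get("success", False)]


def build_slack_payload(device_name, results):
    """Build a Slack incoming-webhook payload, or None if all checks passed."""
    failed = failed_checks(results)
    if not failed:
        return None
    lines = [f"Device {device_name}: {len(failed)} diagnostics check(s) failed"]
    for r in failed:
        lines.append(f"- {r['name']}: {r.get('status', 'no details')}")
    return {"text": "\n".join(lines)}


def post_to_slack(webhook_url, payload):
    """POST the payload as JSON to a Slack incoming webhook URL."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Returning `None` when everything passes means the job stays silent for healthy devices, so ops only see messages that need attention.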

Now I was wondering if I could use this as a basic infrastructure towards better monitoring overall. Can I add my own application-specific device checks to the diagnostics? I found this https://github.com/balena-io/device-diagnostics and that looks quite easy to extend. But how would I go about getting those extensions on my devices? And would the API and dashboard automatically pick those new checks up?

Cheers,
Erik

zvin · February 26, 2020, 4:43pm

Hello,
You can fork the diagnostics repository and extend it.
But unless you create a pull request and it gets accepted, your additions won’t be picked up on the dashboard.
You can still add your own diagnostics container to your application using your forked repository.

xginn8 · February 26, 2020, 8:12pm

Hi there @ErikHH,

I’m the main developer responsible for the diagnostics at the moment, so thanks for the feedback! We are glad you find them valuable as-is. To add to my colleague’s comment, the ability to add custom checks is already on our radar, though it may not necessarily run through this same interface. There is an open issue for custom checks that you should feel free to subscribe to for updates: https://github.com/balena-io/device-diagnostics/issues/157.

As you noted, the diagnostics themselves are open-source, and hopefully are structured in a way that makes contribution easy. If there are generalizable improvements you’re considering, PRs are more than welcome.

If you have any other feedback, I’m always happy to learn more about what you do or do not find useful.

ErikHH · February 27, 2020, 8:32am

Thanks for your replies.

I will definitely keep an eye on that issue. My main interest at the moment is really in checks that are quite specific to our applications. But should we come up with a generic one we’ll shoot a PR.

I think for now I’ll just start by leveraging the Docker health checks. The check_service_restarts would then give us a heads up to go take a look at what happened exactly.

But I’m curious, what are other people doing to monitor the state of their devices out in the wild?

robertgzr · February 27, 2020, 9:12am

hi @ErikHH,
there’s a blog post written by the very @xginn8 about monitoring devices using Prometheus you could check out: https://www.balena.io/blog/monitoring-the-edge-with-prometheus-pt-1/

ErikHH · February 27, 2020, 1:25pm

That’s a really cool solution, looks good.

Unfortunately, our devices are usually connected over rather expensive satellite connections. Our customers want us to keep the bandwidth usage of the devices down as much as possible. I’d like to do without sending a whole slew of metrics to a central monitoring system.

If it hadn’t been for the bandwidth constraints I’d probably have opted for Prometheus.

chrisys · February 27, 2020, 1:49pm

@ErikHH interesting situation! If you can’t send metrics to a central system, are you planning to keep it on-device? I suppose you could keep it on-device and only send out an alert or trigger an event if something goes out of threshold? I (and I know others) would be interested to know what you do end up deciding to do for your solution so please keep us posted!
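The "keep it on-device and only send out an alert" idea could look something like the sketch below. The class name, metric name and limit are all made up for illustration; the point is only that samples stay local and a message is produced solely when the metric crosses the limit in either direction, which keeps traffic over a metered link to a minimum.

```python
class ThresholdMonitor:
    """Track one metric locally and report only when it crosses a limit."""

    def __init__(self, name, limit):
        self.name = name
        self.limit = limit
        self.alerting = False  # True while the metric is above the limit

    def observe(self, value):
        """Record a sample; return an alert message on a state change, else None."""
        above = value > self.limit
        if above == self.alerting:
            return None  # no change in state, nothing to send
        self.alerting = above
        state = "exceeded" if above else "back within"
        return f"{self.name} {state} limit {self.limit} (value: {value})"


# Example: only the second and fourth samples produce a message.
monitor = ThresholdMonitor("disk_used_percent", 90)
messages = [monitor.observe(v) for v in (50, 95, 97, 80)]
```

A value hovering right at the limit would still generate a message on every crossing, so a real setup would probably want some hysteresis or a minimum interval between alerts.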

ErikHH · February 27, 2020, 2:39pm

@chrisys Yes, that’s basically the idea. This is the main reason the diagnostics are so useful to me. They run on the device and only send back a small summarized result.

I’m going to see how far I can get by making advanced health checks an integral part of each of the services running on the device. Then I’ll leverage the diagnostics checks to alert us about services that have been restarted after these health checks have failed. Fairly minimal maybe, but a hell of a lot more than I have now.
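A per-service health check of that kind could be a small script that the container's Docker health check command runs. The sketch below assumes nothing about the actual services: `queue_is_draining` is a hypothetical placeholder for an application-specific check. What it relies on is the Docker convention that a health check command exiting with 0 means healthy and 1 means unhealthy.

```python
import sys


def run_health_checks(checks):
    """Run each (name, callable) check; return the names of those that failed.

    A check fails if it returns a falsy value or raises an exception.
    """
    failures = []
    for name, check in checks:
        try:
            ok = bool(check())
        except Exception:
            ok = False
        if not ok:
            failures.append(name)
    return failures


def exit_code(failures):
    """Docker treats exit code 0 as healthy and 1 as unhealthy."""
    return 1 if failures else 0


def queue_is_draining():
    """Hypothetical application check; replace with real logic."""
    return True


def main():
    failed = run_health_checks([("queue_is_draining", queue_is_draining)])
    if failed:
        print("unhealthy: " + ", ".join(failed), file=sys.stderr)
    return exit_code(failed)


# When used as the health check command, the script would end with:
#     sys.exit(main())
```

Treating an exception as a failed check keeps a crashing check from being mistaken for a healthy service.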

If that doesn’t suffice, maybe I’ll look at running a small Prometheus on each device and use its alerting capabilities to phone home. But my gut feeling is that this would be quite a lot of overhead on each device, CPU, memory and disk-wise. Might be a bit of a squeeze.

Anyway, I’ll keep the community posted on how it goes.
