What do you use for performance monitoring?

JoeClark · August 21, 2017, 10:54am

Hello,

Does anyone use a cheap or, better yet, free performance monitoring tool? I need to monitor about a dozen machines during an event to help identify bottlenecks, but I’m having trouble finding a tool to do it. Ideally not something overly complex, since this is a one-time need.

Or does anyone know of a good tutorial for Performance Monitor? It can record the info I want to watch (CPU, memory, disk latency, network utilization, etc.), but configuring it for remote machines seems harder than it needs to be. I would also like to avoid setting up collector sets on 12 separate machines and then having to deal with a bunch of separate reports. I actually checked Performance Monitoring Video for help but wasn’t satisfied.

Any help will be appreciated.
Thank you.

imrehg · August 21, 2017, 3:12pm

Hi @JoeClark, would this be along the lines of your thinking?

Tristan107 · August 22, 2017, 11:00am

Hey,

I guess that’s what I was looking for with my question below on resin supervision features, thanks!

And I’ve found the second tutorial mentioned in the article here:

Tristan107 · August 26, 2017, 11:02am

@imrehg @craig-mulligan

So I’ve gone through both tutorials and I think these supervision features are really a “must have” for any production use of resin.io with a few dozen devices or more.

One problem I can see is that the Prometheus discovery service is presented as a demo and not a maintained node (it has not moved in a year), so it may not stay compatible with the Resin SDK or the Prometheus format.

Is there a plan to submit it to Prometheus to become an SD they maintain, or to make it more “officially” maintained by Resin.io?

imrehg · August 27, 2017, 9:16am

We have a couple of performance and monitoring related improvements that we’re working on. These are interesting thoughts and we’ll discuss them with the team. Thanks a lot for the feedback, and we’ll keep you posted. One difficulty is that one size really does not fit all, as we see with our users, so everything new we build has to go through a lot of thinking about how it will affect different fleets.

Tristan107 · August 27, 2017, 9:38am

Ok, good to know!

Are your performance and monitoring improvements built around the Prometheus tool, or are they a new custom resin.io way of doing it?

And are we still talking about this autumn, or later?

imrehg · August 27, 2017, 11:43am

It’s still in flux, so I can’t share many facts. But our track record is to go with existing open source if it’s good enough, and only build something new (and open source!) if existing solutions don’t quite cut it.

I think at this moment the main priority is on-device stability and reliability (tying in with improvements made in Docker for that), and adding suitable monitoring along the way (it feels like that makes more sense than the other way around, but as I said, it’s work in progress).

We have a bunch of stuff scheduled for the autumn, and don’t want to overpromise/underdeliver. Will make sure to post updates here in the forums as things develop…

craig-mulligan · August 28, 2017, 11:23am

Hey @Tristan107 @imrehg,

I’ve been looking at monitoring again lately, and while we may add some alerting on device status etc., it’s more likely we will focus on making resin integrate well with existing monitoring solutions like Datadog, Prometheus, the TICK stack, etc.

My current recommendation is to use the TICK stack. It’s a combination of InfluxData’s open source projects, namely Telegraf, InfluxDB, Chronograf and Kapacitor. The projects work well together and are free to use. Many SaaS products’ pricing models are based on cloud infrastructure, which makes them way too expensive for IoT use cases. This may change soon.

Brief explanation on how TICK stack works:

telegraf is the agent that will run on the device. It has a series of input and output plugins, making it really easy to collect data from anything (machine metrics, statsd, cadvisor, docker, or stats pushed directly from your app). Its output plugins allow you to use the push or the pull model. Previously we used prometheus, which relies on the pull model: you have to know the targets’ IP addresses, and therefore use the resin-sdk to get those addresses and scrape their metrics endpoints. With telegraf we can easily switch to a push model, making the stack a lot simpler.
The cloud portion will be influxdb, a time-series database; Kapacitor, which allows you to set alerts on queries; and Chronograf, which allows you to create graphs to visualise the queries.
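
To make the pull vs. push difference a bit more concrete, here is a rough Python sketch (not the actual setup from the tutorials). The pull side builds a Prometheus `file_sd`-style target list from device IPs you already looked up; the push side formats one sample as InfluxDB line protocol, which is roughly what an agent on the device would send. The function names, the port `9100`, the job name `resin-devices` and the example values are all assumptions made up for illustration.

```python
def file_sd_targets(device_ips, port=9100, job="resin-devices"):
    """Pull model: the server needs every device address up front.

    Returns a structure in the shape Prometheus expects in a file_sd JSON file.
    The port and job name are placeholders, not values from this thread.
    """
    return [{
        "targets": [f"{ip}:{port}" for ip in device_ips],
        "labels": {"job": job},
    }]


def _escape(value, chars):
    """Backslash-escape characters that line protocol treats as separators."""
    for ch in chars:
        value = value.replace(ch, "\\" + ch)
    return value


def _format_field(value):
    # bool is checked first because bool is a subclass of int in Python
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integers carry an "i" suffix in line protocol
    if isinstance(value, float):
        return repr(value)
    # strings are double-quoted, with backslashes and quotes escaped
    return '"' + _escape(str(value), '\\"') + '"'


def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Push model: the device formats its own sample and sends it out.

    No device address is needed on the server side.
    """
    tag_part = "".join(
        f",{_escape(k, ', =')}={_escape(str(v), ', =')}"
        for k, v in sorted(tags.items())
    )
    field_part = ",".join(
        f"{_escape(k, ', =')}={_format_field(v)}" for k, v in fields.items()
    )
    return f"{_escape(measurement, ', ')}{tag_part} {field_part} {timestamp_ns}"
```

For example, `to_line_protocol("cpu", {"device": "abc123"}, {"usage_percent": 42.5}, 1503919380000000000)` gives `cpu,device=abc123 usage_percent=42.5 1503919380000000000`. In practice telegraf does this formatting and sending for you; the sketch is only meant to show why the push side needs no discovery step.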

We are still missing a few key pieces of functionality to make monitoring 1st class.

In most cases you’d want to monitor host metrics, which requires mounting specific volumes from the host, this is something resin doesn’t currently support.
In many cases you’d want to monitor docker stats, in this case you’d need access to the docker socket, this is something resin doesn’t support in production mode.
Multi-container is needed to make monitoring solutions a drop in replacement.
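
As a rough illustration of the first point, host metrics on Linux mostly come down to reading files under `/proc`, so what matters is which `/proc` the agent can see. Below is a minimal Python sketch that parses `meminfo` text and computes memory usage. The `proc_root` parameter and the `/hostfs/proc` path in the usage note are hypothetical, standing in for wherever a host volume would be mounted if that were supported.

```python
def parse_meminfo(text):
    """Parse the contents of a meminfo file into {key: integer value}.

    Most values are reported in kB; lines without a numeric value are skipped.
    """
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def memory_used_percent(info):
    """Share of memory in use, based on MemTotal and MemAvailable."""
    total = info["MemTotal"]
    available = info["MemAvailable"]
    return 100.0 * (total - available) / total


def read_meminfo(proc_root="/proc"):
    """Read meminfo from the given proc root (hypothetical mount point for the host)."""
    with open(f"{proc_root}/meminfo") as f:
        return parse_meminfo(f.read())
```

Usage would be something like `memory_used_percent(read_meminfo("/hostfs/proc"))` with a mounted host path, or `read_meminfo()` for the container's own view. Whether the two views actually differ depends on how the container is run.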

Here’s a screenshot from Chronograf: I’m using a resin dev image so I can monitor Docker, and it looks pretty slick.

I’ll probably put together another blog post once we have a clearer idea on everything above.

Tristan107 · August 28, 2017, 3:47pm

@craig-mulligan Thank you very much for all these ideas.

Could you give some examples of what you miss when you gather system stats from inside a container and not directly from the host? Won’t you see the same %CPU in use, memory usage, and storage (there are no storage variations to see on the host since it’s a read-only fs)?

About Chronograf, do you think it’s better than Grafana, or is it just to stick with the stack?

Looking forward to reading your next blog article.

craig-mulligan · August 28, 2017, 5:15pm

@Tristan107,

Yea, currently you won’t miss much in terms of machine metrics because of the way we run containers, but that may change with multi-container support.

Re: Chronograf, it’s still early days (I’ve only just tried it out) but it integrates really nicely with the TICK stack. For instance, they have prebuilt dashboards for most telegraf plugins, which makes setup really quick. Ofc if you’re doing custom stuff you still have to do things manually.

It also integrates with the kapacitor api so you can manage your alerts from the chronograf dashboard, super nice that it unifies things.

Are you working on a monitoring setup atm? I can share what I have so far if it’s useful to you in the interim.

Tristan107 · August 28, 2017, 6:36pm

Well, I’m “working” on it by reading your articles and reading what you say to read

So yeah, I’ll be reading the TICK stack docs and finding some plugins that fit our needs. If you have some resin.io related setup, I’ll be happy to see it.

olekspickle · April 29, 2024, 12:57pm

This is still relevant in 2024. What do people use for fleets nowadays?

mpous · May 2, 2024, 11:18am

@olekspickle thanks for your message!

Maybe this thread is relevant for this? Observability solutions - #3 by kb2ma

What do you recommend?

olekspickle · May 17, 2024, 11:55am

For us datadog looks like the more probable, albeit costly, solution, since you pay datadog per host.
I’ll explore the prometheus route, but it’s a huge undertaking.

A huge disadvantage of datadog-iot-agent is that it does not export docker metrics. I am not sure if prometheus is capable of it, but my guess is that some exporter must surely be able to.
