We’re piloting a number of Pi4-based LoRa gateways and are looking for the best infrastructure monitoring platform to keep them running reliably.
We’ve been testing/running a Datadog container on a Pi4 with Balena for a few weeks now. It works like a charm, and I’m VERY impressed with Datadog as a company (their customer service is extraordinary).
I’d use them in production in a heartbeat if it weren’t for their per-host pricing model. At $15–$18 per month per host, costs could get very high considering we’re looking at eventually having roughly 20 small gateways per customer.
This led me to explore Elastic as an option (either self-hosted or their cloud offering) for two main reasons: 1) ELK is trendy, and 2) there are no per-agent/host fees. But ELK seems complex and pretty hardcore.
We’ve also had some experience with Dynatrace (not really an option) and Netdata (a possibility: free, but not as fully featured as Datadog).
But rather than guess on our own, what is the community doing? Any recommendations??
Former balenista here, currently doing Developer Relations at Netdata. We ship the most efficient and lightweight agent for system monitoring that exists (partly because it’s written in C), and we are currently developing a Cloud platform to manage multiple Netdata Agents.
The Netdata Agent is one of the top open-source projects on GitHub, while Netdata Cloud is a closed-source SaaS. Both are free and will remain so.
I’ve been meaning to create a balena-netdata sample application for a while, but I never found the time. I would love to see you try something, and I’ll help in any way possible. The experience with Netdata is not as smooth as with Balena, but I’ll do my best to help you leverage the best features of each platform.
Btw, I have incorporated Netdata into a balena project that I actively use, so you can see just how easy it is: https://github.com/OdysLam/balena-nginx-raspberry.
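For anyone who wants to try this before a dedicated sample app exists, a minimal docker-compose service for a Netdata container on balena might look like the sketch below. The image name and port 19999 are Netdata’s published defaults; the balena feature labels and the capability/security settings are my assumptions based on what Netdata’s Docker docs generally require, so verify against your fleet:

```yaml
version: '2.1'

services:
  netdata:
    image: netdata/netdata:latest      # official image on Docker Hub
    ports:
      - "19999:19999"                  # Netdata's default dashboard port
    cap_add:
      - SYS_PTRACE                     # lets Netdata inspect other processes
    security_opt:
      - apparmor:unconfined
    labels:
      # balena-specific feature labels that expose the host's procfs/sysfs
      # to the container, which Netdata needs for system-wide metrics
      io.balena.features.procfs: '1'
      io.balena.features.sysfs: '1'
```

On a stock Docker host you would instead bind-mount `/proc` and `/sys` read-only into the container; balenaOS restricts arbitrary host bind mounts in compose files, which is why the feature labels are used here.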
Thanks @odyslam for the link! I’ll try it out. I managed to get the agent on a test Balena device myself, and you’re right, it was very easy.
Netdata is terrific (1-second sample rates are incredible!), though honestly I was left wishing it had logs, traces, APM, etc. all rolled into one. Hard to do in a free product, I’m sure, but you’ve undoubtedly built an incredible product so far.
This is a really interesting thread. We have been weighing up a number of options but are still not sure what the best solution would be. I would welcome all thoughts and suggestions.
Currently we use InfluxData for our monitoring. The Telegraf reporter integrates very nicely with Balena; there is even a Balena blog post called “Aggregate data from a fleet of sensors with balenaSense and InfluxDB”. However, when sharing metrics frequently (every 10 seconds) from a number of devices, the cost can quickly escalate.
We are considering using LogDNA for our logs. They are one of the few providers which do not charge per host, their pricing is solely based on retention and storage. We had a call with them the other week and it seems as though there are other IoT companies which make use of their platform.
Given the cost of uploading metrics to InfluxData, we are considering moving to Grafana Labs. The cost for 1,000 series reported every 10 seconds is $16, which is roughly a 4x reduction from InfluxData (based on rough calculations from our existing metrics and costs). They also provide log storage, which could be interesting; however, their logs are not as well indexed as LogDNA’s.
@odyslam thanks for sharing Netdata, it looks very promising, especially the sample rate. However, I have tried to look into the future pricing model and cannot find any information. If available, would it be possible to share a link to the pricing page?
I assume you are speaking of Influx’s cloud costs or are you running your own influx and talking about your infrastructure costs?
My rationale is that if we’re going to pay for a hosted solution, it makes sense to go with a full turnkey suite, which is why I like Datadog so much. Especially if you are looking at the entire IoT solution from nodes, gateways, network, server, and end applications. From a development perspective I like the idea of having all that visibility in one place, where you can cross-reference logs to APM and uptime with a single click, for example.
I was chatting with the Datadog team, and they just announced a lighter-weight version of their Agent specifically for resource-constrained IoT devices. It’s still in limited beta but looks promising. They’re going to have to rework their per-agent pricing, though, if they expect customers to deploy thousands of agents across a fleet of small devices.
I imagine that if you have the dev resources to tool up, weave together, and support an Influx, Grafana, LogDNA, Prometheus, etc. solution, then an Elastic ELK setup will probably work really well for you too.
My thing is I’d rather we focus our limited resources on the customer’s product, which means finding the easiest turnkey analytics tool around to set and forget.
Indeed, InfluxDB is a very good choice, especially for IoT due to its simple push paradigm. Moreover, I think Datadog is pretty much a no-brainer for APM.
On the other hand, Netdata does a few things (at the moment), but it does them in a phenomenal way. It’s super lightweight; it’s super scalable, since it has a distributed architecture and data live on the agents (although you can stream/export them to something like Prometheus for longer retention); and it has a zero-configuration approach. This last point means it will auto-detect all the metric sources it can and start aggregating instantly.
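To make the export path concrete: a Prometheus server can scrape a Netdata agent directly, since the agent exposes its metrics in Prometheus text format via its `/api/v1/allmetrics` endpoint on the normal dashboard port. A minimal scrape job might look like the following (the target hostname is a placeholder; check Netdata’s exporting docs for the exact parameters your version supports):

```yaml
# prometheus.yml -- scrape a Netdata agent for longer-term retention
scrape_configs:
  - job_name: 'netdata'
    metrics_path: '/api/v1/allmetrics'
    params:
      format: ['prometheus']               # ask Netdata for Prometheus text format
    static_configs:
      - targets: ['gateway-01.local:19999']  # placeholder host; 19999 is Netdata's default port
```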
As a user, having connected several agents to Netdata Cloud, you will get composite charts out-of-the-box, organized in meaningful ways, so that you can focus on the business logic and the development of your application. It’s really something to see it first-hand.
Regarding pricing: we are free, and the features you are experiencing will continue to be free. The Netdata Agent is FOSS, so free by design. Netdata Cloud will eventually have paid features, but we can’t share more information at this point. You can think of our philosophy as akin to GitHub’s. Furthermore, we are determined to make pricing as simple and transparent as possible once we come up with a scheme that works and makes sense for our users and for us.
Finally, you are not paying us with your data. Netdata Cloud does not store anything; it simply streams the data directly from the agent to your browser whenever you access the dashboard.
Question that perhaps the more technical team from Balena can answer.
We’ve got Datadog running in a container with some visibility into adjacent containers (in our case Basicstation and Wifi Connect).
But Datadog (and others) permit running their agent on the host OS per their docs:
Datadog offers native Docker container monitoring, either by running the Agent on the host or running in a sidecar container. Which is the best way to run it? It ultimately depends on the tooling you have in place to manage the Agent’s configuration.
Is that even possible with balenaOS? I get that it’s a highly customized Linux distribution, so there are probably a bunch of particulars, but I’m curious whether there is a good reason to install host-wide apps like monitoring agents, as opposed to running them within a container.
Hi Barry, this is a delayed reply, but I wanted to add a little extra context to the “install directly to the hostOS” topic. balenaOS (and its upstream source, Yocto) is not designed for installing applications after the OS has been built, generally speaking. Yocto is an “embedded Linux” distribution and is designed with extreme flexibility to customize prior to deployment, but changing things later is less straightforward.
In Ubuntu, Fedora, or similar Linux distributions, installing applications is easy via ‘apt-get’, ‘dnf install’, etc. But that basic capability doesn’t exist in balenaOS, so even if Datadog had a Yocto-friendly binary package or a clone-and-build-from-source method in place, you would essentially be in the position of building your own balenaOS with that Yocto recipe added (again, assuming it even exists).
Morning guys, I was hoping someone from the Balena team could help out here. I’m seeing a problem with the Datadog agent installation related to /host/proc. Basically the DD agent is not gaining access to disk or Docker stats. The agent runs and gives me some basic system status, but the disk and Docker modules don’t report correctly, which is a bummer because I’d love to see our balena containers in DD.
From Datadog:
Docker integration error:
Instance docker [ERROR]: could not get cgroups: open /host/proc: no such file or directory
Disk integration error:
[ERROR]: [{"message": "[Errno 2] No such file or directory: '/host/proc/filesystems'",
I’m almost certain this is related to privileged access, the DD agent expecting a traditional OS layout, and some compose config problem on my part, but I thought I’d ask in case there’s a quick resolution.
I thought the labels would have solved the privileged-permission problem, and perhaps they did, but DD may just be looking in the wrong place, considering balenaOS isn’t Ubuntu, Debian, etc.
Also, for what it’s worth, here’s what the logs in the balena UI are showing:
20.12.20 06:39:28 (-0500) datadog [ AGENT ] 2020-12-20 11:39:28 UTC | WARN | (cgroup_detect.go:126 in parseCgroupMountPoints) | No mountPoints were detected, current cgroup root is: /host/sys/fs/cgroup/
20.12.20 06:39:28 (-0500) datadog [ AGENT ] 2020-12-20 11:39:28 UTC | WARN | (checkbase.go:100 in Warnf) | Error collecting containers: could not get cgroups: open /host/proc: no such file or directory
20.12.20 06:39:28 (-0500) datadog [ AGENT ] 2020-12-20 11:39:28 UTC | ERROR | (runner.go:289 in work) | Error running check docker: could not get cgroups: open /host/proc: no such file or directory
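The errors above all boil down to the agent looking for the host’s procfs/cgroups at `/host/proc` and `/host/sys/fs/cgroup`, paths a plain compose deployment on balena never populates. A sketch of one possible compose approach is below; the balena feature labels are documented balena features, while re-pointing the agent via `DD_CONTAINER_PROC_ROOT`/`DD_CONTAINER_CGROUP_ROOT` and the socket path are my assumptions, so verify both against Datadog’s container docs and balena’s label docs before relying on this:

```yaml
# Sketch: Datadog agent service on balena (not verified on balenaOS --
# the feature labels and env vars below are the key assumptions)
services:
  datadog:
    image: datadog/agent:latest
    labels:
      io.balena.features.procfs: '1'         # expose host procfs to the container
      io.balena.features.sysfs: '1'          # expose host sysfs (incl. cgroup hierarchy)
      io.balena.features.balena-socket: '1'  # engine socket, needed by the docker check
    environment:
      DD_API_KEY: "<your-api-key>"
      # The agent defaults to /host/proc and /host/sys/fs/cgroup; if the balena
      # labels surface the host filesystems at /proc and /sys instead, re-point it:
      DD_CONTAINER_PROC_ROOT: /proc
      DD_CONTAINER_CGROUP_ROOT: /sys/fs/cgroup
      # balena's engine socket path (assumption; check where the label mounts it)
      DOCKER_HOST: "unix:///var/run/balena.sock"
```

The underlying idea either way: the errors are path problems, not (only) privilege problems, so the fix is making the host’s /proc and /sys visible in the container and telling the agent where they landed.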