Balena and Telegraf - Bad Performance in inputs

7ser23 · August 30, 2022, 12:23pm

Hi, I have been using Balena and Telegraf for quite some time, and I see that after my deployment has been running for a few hours, performance degrades and only recovers after a reboot.

My Dockerfile looks like this:

FROM telegraf:latest

RUN apt-get update && apt-get install -y --no-install-recommends dnsutils mtr git iperf3 telnet tcpdump traceroute wvdial usb-modeswitch ppp nano vim lftp nmap cron && \
    rm -rf /var/lib/apt/lists/*

COPY ./telegraf.conf /etc/telegraf/telegraf.conf
COPY ./mtr-own.sh /etc/telegraf
COPY ./nping-own.sh /usr/local/bin/
RUN chmod 775 /usr/local/bin/nping-own.sh

After the reboot, tests look like this

but, at 19:12 they start to look worse and my nping stops working:

you can clearly see the problem.

Does anyone know what I might be doing wrong, or why it gets fixed when I reboot the Raspberry Pi? It would also help if I could schedule a reboot at a set time, as with crontab, an option that I think is not available in Balena. Please help.

vipulgupta2048 · August 31, 2022, 6:11am

Hello,
Has this started happening recently? With newer iterations of balenaOS?
I don’t have a lot of experience working with Telegraf. To help you with scheduling reboots, we now have a cron block on balenaHub that you can use to schedule that reboot.

mcraa · August 31, 2022, 10:59am

What are you pinging, for how long, and how frequently?
Could it be that after a certain number of requests the other server treats your pings as spam and replies differently?

Going by the behavior itself (fails after a while, works after a reboot), my first guess would be that logs or other data needed for the functionality fill up the storage or RAM and cause it to malfunction.
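One way to test that guess is to log disk and memory usage at intervals and compare the numbers from just after a reboot with those from when the tests degrade. A minimal sketch follows; the function names are hypothetical, and it assumes a Linux container with `df`, `awk`, and `/proc/meminfo` available.

```shell
#!/bin/sh
# Hypothetical helpers (not part of balena or Telegraf): report how full
# the storage and the RAM are, as whole percentages.

# Percentage in use of the filesystem that holds "$1" (default: /).
disk_used_pct() {
    df -P "${1:-/}" | awk 'NR==2 { gsub("%", "", $5); print $5 }'
}

# Percentage of RAM in use, from MemTotal and MemAvailable in /proc/meminfo.
mem_used_pct() {
    awk '/^MemTotal:/ { t = $2 }
         /^MemAvailable:/ { a = $2 }
         END { if (t > 0) printf "%d\n", (t - a) * 100 / t }' /proc/meminfo
}

# One timestamped line per run, easy to append to a log file.
echo "$(date -u +%FT%TZ) disk=$(disk_used_pct /)% mem=$(mem_used_pct)%"
```

Running this every few minutes and keeping the output would show whether either figure climbs steadily before the failure. If both stay flat, storage and RAM exhaustion are less likely to be the cause.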

@vipulgupta2048's suggestion of the cron block could be a quick workaround, or, since you already install cron in your Dockerfile, you can configure that to call the supervisor to restart.

rcooke-warwick · August 31, 2022, 11:33am

Hi there, in addition, could you also enable support access for this device? I can look at the device logs to see what happened at 19:12. However, if the device has been rebooted since then, these logs will be lost, so let me know if that's the case.

7ser23 · August 31, 2022, 12:30pm

Well, I do not have this problem when I use Raspberry Pi OS (with the same Telegraf config). I have around 60 sensors (about 20 of them run balena, in two fleets). The whole deployment involves DNS queries, HTTP responses, speedtests, MTR, ping, etc.

I am pinging around 100 sites, 5 counts, 0.25 intervals every 3 min. All this data is dumped to the database every 20 min.

This is in production already and I have to reboot the devices myself. I am checking the health and behavior of a Telco HFC, Remote PHY, FTTH network.

I have tried to restart the device by calling the supervisor:

$ curl -X POST --header "Content-Type:application/json" \
    "$BALENA_SUPERVISOR_ADDRESS/v1/reboot?apikey=$BALENA_SUPERVISOR_API_KEY"

But you have to understand that I will need to echo each API key for each sensor. On top of that, I cannot make it work. For some weird reason I can GET info with the command, but I cannot POST anything.
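For reference, a minimal sketch of wrapping that call in a script that reports the HTTP status, which can help narrow down why a POST fails while a GET works. The function names are hypothetical. It assumes `BALENA_SUPERVISOR_ADDRESS` and `BALENA_SUPERVISOR_API_KEY` are already present in the container environment (the GET working suggests they are), so no key needs to be written into the script for each sensor.

```shell
#!/bin/sh
# Hypothetical wrapper around the Supervisor reboot call shown above.
# It only reads the two BALENA_SUPERVISOR_* variables from the environment.

# Build the reboot endpoint URL from the two Supervisor variables.
supervisor_reboot_url() {
    printf '%s/v1/reboot?apikey=%s\n' \
        "$BALENA_SUPERVISOR_ADDRESS" "$BALENA_SUPERVISOR_API_KEY"
}

# POST to the endpoint and print the HTTP status code.
# Returns 0 on a 2xx answer, 1 on any other answer, 2 if a variable is missing.
request_reboot() {
    if [ -z "$BALENA_SUPERVISOR_ADDRESS" ] || [ -z "$BALENA_SUPERVISOR_API_KEY" ]; then
        echo "Supervisor variables are not set in this environment" >&2
        return 2
    fi
    status=$(curl -s -o /dev/null -w '%{http_code}' -X POST \
        --header "Content-Type:application/json" \
        "$(supervisor_reboot_url)")
    echo "Supervisor answered HTTP $status"
    case "$status" in
        2*) return 0 ;;
        *)  return 1 ;;
    esac
}

# Nothing is called here on purpose: add a line with `request_reboot`
# at the end only once the printed status has been checked by hand.
```

If this is later run from cron, note that cron jobs usually start with a minimal environment rather than the container's, so the two variables may need to be exported to the job explicitly. In that case the function returns 2 and prints the "not set" message instead of sending a malformed request.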

7ser23 · August 31, 2022, 12:43pm

The devices were rebooted because I lost tests at 3 am again. The whole project is in production.

I can enable support access to my devices. Should I share the fleet and UUID via support chat?

7ser23 · August 31, 2022, 1:03pm

Right now I have a sensor at 100% CPU usage, and when I try to open a terminal this is what I get:

I can grant access to this sensor as well.

the-real-kenna · September 5, 2022, 9:13pm

Hi @7ser23,

We just thought to reach out again and see if you’d be willing to enable support access so we can review device logs with you. Let us know if that’s possible. Thank you!

7ser23 · September 14, 2022, 3:46pm

Hi, I was able to fix the problems doing the following:

I added tini to my Dockerfile so that it runs as PID 1 and reaps zombie processes. It fixed the issue, but it caused problems with my inputs.exec. Because of this we checked the Telegraf documentation and found:

We added the line USER telegraf to our Dockerfile and we are not seeing any problems now.

The dockerfile looks like this now:

FROM telegraf:latest
ENV TINI_VERSION v0.19.0
ADD https://github.com/krallin/tini/releases/download/${TINI_VERSION}/tini-static-arm64 /tini
RUN chmod +x /tini
ENTRYPOINT ["/tini", "--"]
RUN apt-get update && apt-get install -y --no-install-recommends dnsutils mtr git iperf3 telnet tcpdump traceroute nano vim lftp nmap cron && \
    rm -rf /var/lib/apt/lists/*
COPY ./telegraf.conf /etc/telegraf/telegraf.conf
COPY ./my_script.sh /usr/local/bin/
RUN chmod 775 /usr/local/bin/my_script.sh
USER telegraf
CMD telegraf
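To check whether zombie processes are piling up in a container, with or without tini, their number can be read from `/proc`. This is a minimal sketch, not something taken from the thread: `count_zombies` is a hypothetical helper, and it assumes a Linux `/proc` filesystem and `awk` are available in the container.

```shell
#!/bin/sh
# Hypothetical helper: count processes in the Z (zombie) state by reading
# the "State:" line of every /proc/<pid>/status file.
count_zombies() {
    n=0
    for f in /proc/[0-9]*/status; do
        # A process may exit between the glob and the read, so ignore errors.
        state=$(awk '/^State:/ { print $2; exit }' "$f" 2>/dev/null)
        if [ "$state" = "Z" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

echo "zombie processes: $(count_zombies)"
```

A count that keeps growing while Telegraf runs would suggest that child processes started by exec inputs are not being reaped, which is the situation an init process such as tini is meant to handle.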

hraftery · September 15, 2022, 4:01pm

Great, thank you for reporting back with your solution! We’ll make sure this is logged so future users might benefit. All the best with it.
