DNS failure not caught by supervisor

Hi

Looks like both the devices are offline now - can you please restart them?

Apologies, both are online.

Think it's due to the multiple ethernet plug/unplug issue that they weren't showing up.

… I'll replace that ethernet cable later so that won't happen…

Interestingly, I've just checked and the internal DNS resolution on the container bridge is working on both devices. They can all resolve each other by name.
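(For reference, this is roughly how I checked - a rough sketch rather than anything rigorous, assuming the engine CLI is exposed as balena on the host OS and that our containers have getent available; "orchestration" is just one of our service names used as an example:)

# On the host OS: list the running containers to pick one to test from
balena ps --format '{{.Names}}'

# From inside one container, resolve another service by name over the bridge
balena exec <container-name> getent hosts orchestration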

It is surprising that they've been online for 15h without dying. A new record for both devices in over a month.

Last error on cec711 was apparently yesterday morning:

{"message":"ScheduleExecutor.executeJob(): Error running task 'orchestration-time-update'- getaddrinfo EAI_AGAIN orchestration orchestration:8080","level":"error","timestamp":"2020-09-17T04:26:30.3030Z"}

Hmm, also interesting; they are not registering an ethernet connection…
Scratch that, that's my issue; something wrong with my equipment that I'll need to sort out later. Sigh.
[Edit: there we go, re-paired my powerline adapters and we magically have ethernet connectivity back]

Well, unless you guys have changed something, it should just be a matter of time until they fall over so you can begin gathering data again.
I'll keep an eye out and yell.

There we go, cec711 is deaded. Have at it.

And there goes a70bc.

Interestingly, both were up for quite a while, as per my first incredulous update this morning, and then when I start taking a look around and hopping into shells to start looking at logs… they suddenly start falling over on me.

Hey Aidan, that is very interesting. Perhaps it's some resource getting choked; 13 containers is quite a bit for a Pi 3 with 1GB of RAM to handle (depending, of course, on what they are doing), but it could be that somehow poking around pushes this just over the edge and results in the engine panicking. It looks like the engine panicked because of:

Sep 18 11:20:46 cec711c balenad[20252]: runtime/cgo: pthread_create failed: Resource temporarily unavailable
Sep 18 11:20:46 cec711c balenad[20252]: SIGABRT: abort

Which, going by https://github.com/golang/go/issues/37006, seems to indicate that this usually comes from trying to create too many threads or allocating too much RAM (leaking, maybe?). I know there was some version of the OS that had an issue with sshd processes not being cleaned up correctly, which would slowly accumulate resource usage each time you SSH'd into your device. Let me try to dig up what version that was fixed in; perhaps that's playing in here.

Okay, looking at the OS changelogs it seems the sshd bug I mentioned was fixed in 2.47.1 and above, so that's not the culprit; it must be something else leaking memory or creating tons of threads. We would probably need to watch a functioning device in htop, or better yet something like netdata.
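In the meantime, a few one-off checks from the host OS should at least tell us whether it's threads or memory running out before we set up proper monitoring. A rough sketch only - the busybox ps on the host may not support all of these flags, and I'm going by your journal snippet for the engine process being named balenad:

# Memory and swap currently in use
free -m

# Total threads across the system vs. the kernel's limit
ps -eo nlwp --no-headers | awk '{s+=$1} END {print s}'
cat /proc/sys/kernel/threads-max

# Threads owned by the container engine itself
ps -o nlwp= -p "$(pidof balenad)"

# Sanity check that sshd sessions aren't piling up anyway
pgrep -c sshd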

One of my colleagues mentioned that it could also be due to the way container logging is implemented in the container engine with the journald logging driver; as I understand it, this method creates a new logging process for each container and ends up allocating more memory than intended. We are in the process of changing that implementation, but in the meantime you can check whether that's the issue by setting some (or all) of your services to the logging driver "none", which you can do by adding this snippet to each of the services:

logging:
  driver: "none"

Note that this is obviously not ideal, as these containers will no longer publish logs to the dashboard. Let me know if you want me to explain anything further or help with adding this to your services.
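Once you've pushed that change, one way to double-check from the host OS that the setting actually took is to ask the engine what logging driver each container ended up with. This is just a sketch, assuming the engine CLI is available as balena on the host (it may be balena-engine on some versions):

# List running containers and print the logging driver each one is using
balena ps --format '{{.Names}}' | while read name; do
  echo -n "$name: "
  balena inspect --format '{{.HostConfig.LogConfig.Type}}' "$name"
done

Containers that picked up the change should report "none" rather than "journald".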

Good morning!

Apologies for the delay, but I've now pushed that out - both with none and "none", because they haven't exhibited the lack-of-logging behaviour that you've described…

E.g.
[screenshot]

This, presumably, is what you had in mind? If so, cool - the two devices are now running with this; if not, then you'll need to elaborate.

Note: the devices are being requisitioned for some other testing purposes for a couple of days - we should be able to restore them in the very near future for DNS issue testing once again.

Apologies, we haven't gotten round to restoring the devices for this specific testing just yet - we're reconfiguring some of the containers and apps to consolidate and reduce their number. That's in its final stages before we start testing in earnest.

However, I would like to raise an issue that is currently quite vexing, and I am hoping you can help us with it. We have devices that are just… restarting, without warning.

It happens at around a couple of minutes past every hour, and it's not our services doing it, or we'd see all the relevant logs to that effect.

I was streaming journal logs in the hope of catching some indication of what was going on, but there was nothing - no warning, just an immediate restart.
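(For reference, these are the sorts of things I've been poking at from the host OS, with the caveat that I'm not sure the journal even persists across reboots on these devices, so the previous-boot query may well come back empty:)

# List the boots the journal knows about
journalctl --list-boots

# Tail the previous boot's journal, if it was persisted
journalctl -b -1 -n 100 --no-pager

# Check whether any systemd timer lines up with the hourly pattern
systemctl list-timers --all --no-pager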

I've already granted support access for this particular device (60be607480fbea1cbacedc1809f80972) on our app (basestation-production).

Thanks in advance once again.

Good afternoon!

As a minor update to the most recent issue: some of the restarting observed in the devices was, in fact, our own doing. We have pushed an update to fix these out-of-turn restarts on our side, but are still left with some devices restarting in the same fashion as above - no logs or warning of any controlled restart happening.

One of the main things causing restart issues is a CPU temp over 80°C. That seems to successfully kill stuff off and get a 'Supervisor starting' log spam going.
That's fair enough.
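(For the record, this is how I've been spot-checking the temperature from the host OS - the sysfs path is the standard one on the Pi, and vcgencmd may or may not be present on this image:)

# CPU temperature in millidegrees Celsius (divide by 1000)
cat /sys/class/thermal/thermal_zone0/temp

# Raspberry Pi-specific alternative, if the tool is available
vcgencmd measure_temp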

Others, such as 5368440a25b40715bdaa6b9668083209 on our basestation-production app, aren't so happy (support access already granted, FYI):
[screenshot]

The latter restarts are us - our sensor seems to be offline, so it's attempting various remedial actions, including full restarts of the fin.
However, before 12:00 everything is registering as fine. For example: a snippet of our logs for that 10:58 restart

And FYI, example logs for when we specifically deal with restarts/shutdowns ourselves:


Lots of lovely log spam telling us exactly what's doing what, and when, and why, and how. We don't see these logs with these hourly restarts, which leads us to wonder what else is going on under there somewhere.

How can we nicely see what may be doing this?

As an aside - a new batch of devices on the go, flashed with OS 2.38, is not seeing the same issue.
A similar look at the restarts of one of the new 2.38-flashed devices:
[screenshot]

One restart. Lovely. Same software on our side across the fleet; all 13 containers running.

@shaunmulligan

Good morning!

Apologies for the long hiatus on this topic. However, it's back on the priority list now that we have some of our own stabilisations out of the way.

I have flashed device c3910bbbdd9d6a79433f9ad6d453e632 on basestation-develop with the latest 2.58.3+rev1/11.14.0 OS/Supervisor image and have granted support access for it.

A word of warning first, though, if anyone is going to be poking around the device today - I'm going to be running some temp tests with it (just got my hands on one of your heat sinks), so it'll be running hot, and I'll obviously need to take it offline to fit said heat sink once I'm ready for that.

So, the latest OS is behaving similarly to, but not the same as, the 2.51 version.

  • I can see in the logs we get that same large churn of ethernet interfaces that all fail to come up properly and we just start incrementing the DBus connections count per one of the set of logs I vaguely recall sharing like a billion years ago now.
  • Thereā€™s no outward sign of the DNS failing as with the 2.51 image (where our containers were all spamming EAI_AGAIN error messages). However, some kind of restarting is going on pretty regularly, including supervisor restart spam:

[screenshot]

Snippet of logs showing one of our containers booting:
[screenshot]
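(The quick check I mentioned in the first bullet above, for watching the DBus connection count - a rough sketch, assuming busctl and nmcli on this host image behave like a stock systemd/NetworkManager setup:)

# Count the peers currently connected to the system bus
busctl list --no-pager | wc -l

# See whether NetworkManager is still churning through ethernet interfaces
nmcli device status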

As an aside, the latest test SW running on this app features a wrap-up of several of our containers smushed into one, so we're down from 13 running to 10, in the vague hope that it might relieve some stress on the device.

I'll once again point out, as I like to do, that we have no issues with the 2.38 OS set. We have started to get our manufacturing process in place and are getting devices with the 2.38 image out there (hampered, of course, by lockdown), and those few we have managed to install are showing as much stability as the original 2.38s we had running for the previous year.

So, now that we're back on the case with the shiny new 2.58 image, where do you want to start on the troubleshooting for all this?

Thanks.