DNS failure not caught by supervisor

Hi

Looks like both the devices are offline now - can you please restart them?

Apologies, both are online.

Think it's due to the multiple ethernet plug/unplug issue that they weren't showing up.

… I'll replace that ethernet cable later so that won't happen…

Interestingly, I've just checked and the internal DNS resolution on the container bridge is working on both devices. They can all resolve each other by name.
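(For reference, this is roughly how I checked - a rough sketch rather than anything rigorous, assuming the engine CLI is exposed as balena on the host OS and that our containers have getent available; "orchestration" is just one of our service names used as an example:)

# On the host OS: list the running containers to pick one to test from
balena ps --format '{{.Names}}'

# From inside one container, resolve another service by name over the bridge
balena exec <container-name> getent hosts orchestration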

It is surprising that they've been online for 15h without dying. A new record for both devices in over a month.

Last error on cec711 was apparently yesterday morning:

{"message":"ScheduleExecutor.executeJob(): Error running task 'orchestration-time-update'- getaddrinfo EAI_AGAIN orchestration orchestration:8080","level":"error","timestamp":"2020-09-17T04:26:30.3030Z"}

Hmm, also interesting; they are not registering an ethernet connection…
Scratch that, that's my issue; something wrong with my equipment that I'll need to sort out later. Sigh.
[Edit: there we go, re-paired my powerline adapters and we magically have ethernet connectivity back]

Well, unless you guys have changed something, it should just be a matter of time until they fall over so you can begin gathering data again.
I'll keep an eye out and yell.

There we go, cec711 is deaded. Have at it.

And there goes a70bc.

Interestingly, both were up for quite a while, as per my first incredulous update this morning, and then when I start taking a look around and hopping into shells to start looking at logs… they suddenly start falling over on me.

Hey Aidan, that is very interesting. Perhaps it's some resource getting choked; 13 containers is quite a bit for a Pi 3 with 1GB of RAM to handle (depending, of course, on what they are doing), but it could be that somehow poking around pushes this just over the edge and results in the engine panicking. It looks like the engine panicked because of:

Sep 18 11:20:46 cec711c balenad[20252]: runtime/cgo: pthread_create failed: Resource temporarily unavailable
Sep 18 11:20:46 cec711c balenad[20252]: SIGABRT: abort

Which, going by https://github.com/golang/go/issues/37006, seems to indicate that this usually comes from trying to create too many threads or allocating too much RAM (leaking, maybe?). I know there was some version of the OS that had an issue with sshd processes not being cleaned up correctly, which would slowly accumulate resource usage each time you SSH'd into your device. Let me try to dig up what version that was fixed in; perhaps that's playing in here.

Okay, looking at the OS changelogs it seems the sshd bug I mentioned was fixed in 2.47.1 and above, so that's not the culprit; it must be something else leaking memory or creating tons of threads. We would probably need to watch a functioning device in htop, or better yet something like netdata.
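In the meantime, a few one-off checks from the host OS should at least tell us whether it's threads or memory running out before we set up proper monitoring. A rough sketch only - the busybox ps on the host may not support all of these flags, and I'm going by your journal snippet for the engine process being named balenad:

# Memory and swap currently in use
free -m

# Total threads across the system vs. the kernel's limit
ps -eo nlwp --no-headers | awk '{s+=$1} END {print s}'
cat /proc/sys/kernel/threads-max

# Threads owned by the container engine itself
ps -o nlwp= -p "$(pidof balenad)"

# Sanity check that sshd sessions aren't piling up anyway
pgrep -c sshd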

One of my colleagues mentioned that it could also be due to the way container logging is implemented in the container engine with the journald logging driver; as I understand it, this method creates a new logging process for each container and ends up allocating more memory than intended. We are in the process of changing that implementation, but in the meantime you can check whether that's the issue by setting some (or all) of your services to the logging driver "none", which you can do by adding this snippet to each of the services:

logging:
  driver: "none"

Note that this is obviously not ideal, as these containers will no longer publish logs to the dashboard. Let me know if you want me to explain anything further or help with adding this to your services.
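Once you've pushed that change, one way to double-check from the host OS that the setting actually took is to ask the engine what logging driver each container ended up with. This is just a sketch, assuming the engine CLI is available as balena on the host (it may be balena-engine on some versions):

# List running containers and print the logging driver each one is using
balena ps --format '{{.Names}}' | while read name; do
  echo -n "$name: "
  balena inspect --format '{{.HostConfig.LogConfig.Type}}' "$name"
done

Containers that picked up the change should report "none" rather than "journald".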

Good morning!

Apologies for the delay, but I've now pushed that out - both with none and "none", because they haven't exhibited the lack-of-logging behaviour that you've described…

E.g.
[screenshot]

This, presumably, is what you had in mind? If so, cool - the two devices are now running with this; if not, then you'll need to elaborate.

Note: the devices are being requisitioned for some other testing purposes for a couple of days - we should be able to restore them in the very near future for DNS issue testing once again.

Apologies, we haven't gotten round to restoring the devices for this specific testing just yet - we're reconfiguring some of the containers and apps to consolidate and reduce their number. That's in its final stages before we start testing in earnest.

However, I would like to raise an issue that is currently quite vexing, and I am hoping you can help us with it. We have devices that are just… restarting, without warning.

It happens at around a couple of minutes past every hour, and it's not our services doing it, or we'd see all the relevant logs to that effect.

I was streaming journal logs in the hope of catching some indication of what was going on, but there was nothing - no warning, just an immediate restart.
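(For reference, these are the sorts of things I've been poking at from the host OS, with the caveat that I'm not sure the journal even persists across reboots on these devices, so the previous-boot query may well come back empty:)

# List the boots the journal knows about
journalctl --list-boots

# Tail the previous boot's journal, if it was persisted
journalctl -b -1 -n 100 --no-pager

# Check whether any systemd timer lines up with the hourly pattern
systemctl list-timers --all --no-pager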

I've already granted support access for this particular device (60be607480fbea1cbacedc1809f80972) on our app (basestation-production).

Thanks in advance once again.

Good afternoon!

As a minor update to the most recent issue: some of the restarting observed in the devices was, in fact, our own doing. We have pushed an update to fix these out-of-turn restarts on our side, but are still left with some devices restarting in the same fashion as above - no logs or warning of any controlled restart happening.

One of the main things causing restart issues is a CPU temp over 80°C. That seems to successfully kill stuff off and get a 'Supervisor starting' log spam going.
That's fair enough.
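(For the record, this is how I've been spot-checking the temperature from the host OS - the sysfs path is the standard one on the Pi, and vcgencmd may or may not be present on this image:)

# CPU temperature in millidegrees Celsius (divide by 1000)
cat /sys/class/thermal/thermal_zone0/temp

# Raspberry Pi-specific alternative, if the tool is available
vcgencmd measure_temp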

Others, such as 5368440a25b40715bdaa6b9668083209 on our basestation-production app, aren't so happy (support access already granted, FYI):
[screenshot]

The latter restarts are us - our sensor seems to be offline, so it's attempting various remedial actions, including full restarts of the fin.
However, before 12:00 everything is registering as fine. For example: a snippet of our logs for that 10:58 restart

And FYI, example logs for when we specifically deal with restarts/shutdowns ourselves:


Lots of lovely log spam telling us exactly what's doing what, and when, and why, and how. We don't see these logs with these hourly restarts, which leads us to wonder what else is going on under there somewhere.

How can we nicely see what may be doing this?

As an aside - a new batch of devices on the go, flashed with OS 2.38, is not seeing the same issue.
A similar look at the restarts of one of the new 2.38-flashed devices:
[screenshot]

One restart. Lovely. Same software on our side across the fleet; all 13 containers running.

@shaunmulligan

Good morning!

Apologies for the long hiatus on this topic. However, it's back on the priority list now that we have some of our own stabilisations out of the way.

I have flashed device c3910bbbdd9d6a79433f9ad6d453e632 on basestation-develop with the latest 2.58.3+rev1/11.14.0 OS/Supervisor image and have granted support access for it.

A word of warning first, though, if anyone is going to be poking around the device today - I'm going to be running some temp tests with it (just got my hands on one of your heat sinks), so it'll be running hot, and I'll obviously need to take it offline to fit said heat sink once I'm ready for that.

So, the latest OS is behaving similarly to, but not the same as, the 2.51 version.

  • I can see in the logs we get that same large churn of ethernet interfaces that all fail to come up properly and we just start incrementing the DBus connections count per one of the set of logs I vaguely recall sharing like a billion years ago now.
  • Thereā€™s no outward sign of the DNS failing as with the 2.51 image (where our containers were all spamming EAI_AGAIN error messages). However, some kind of restarting is going on pretty regularly, including supervisor restart spam:

[screenshot]

Snippet of logs showing one of our containers booting:
[screenshot]
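(The quick check I mentioned in the first bullet above, for watching the DBus connection count - a rough sketch, assuming busctl and nmcli on this host image behave like a stock systemd/NetworkManager setup:)

# Count the peers currently connected to the system bus
busctl list --no-pager | wc -l

# See whether NetworkManager is still churning through ethernet interfaces
nmcli device status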

As an aside, the latest test SW running on this app features a wrap-up of several of our containers smushed into one, so we're down from 13 running to 10, in the vague hope that it might relieve some stress on the device.

I'll once again point out, as I like to do, that we have no issues with the 2.38 OS set. We have started to get our manufacturing process in place and are getting devices with the 2.38 image out there (hampered, of course, by lockdown), and those few we have managed to install are showing as much stability as the original 2.38s we had running for the previous year.

So, now that we're back on the case with the shiny new 2.58 image, where do you want to start on the troubleshooting for all this?

Thanks.