DNS failure not caught by supervisor

Hi

Looks like both devices are offline now - can you please restart them?

Apologies, both are online.

Think it’s due to the multiple ethernet plug/unplug issue that they weren’t showing up.

… I’ll replace that ethernet cable later so that won’t happen…

Interestingly, I’ve just checked, and internal DNS resolution on the container bridge is working on both devices; the containers can all resolve each other by name.

It is surprising that they’ve been online for 15h without dying. A new record for both devices in over a month.

Last error on cec711 was apparently yesterday morning:

{"message":"ScheduleExecutor.executeJob(): Error running task 'orchestration-time-update'- getaddrinfo EAI_AGAIN orchestration orchestration:8080","level":"error","timestamp":"2020-09-17T04:26:30.3030Z"}
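(For concreteness, "resolving each other by name" means roughly this check passing from inside the containers. A rough Python sketch - the service name and port are just lifted from that log entry, and EAI_AGAIN is the resolver reporting a temporary failure in name resolution:)

import socket

# Rough sketch of the "resolve by name" check, run from inside a container
# on the bridge. Service name and port are taken from the log entry above;
# substitute whichever services you actually run.
SERVICE, PORT = "orchestration", 8080

try:
    infos = socket.getaddrinfo(SERVICE, PORT, proto=socket.IPPROTO_TCP)
    print(SERVICE, "->", sorted({info[4][0] for info in infos}))
except socket.gaierror as exc:
    # When the bridge DNS is unhealthy this raises EAI_AGAIN
    # ("temporary failure in name resolution"), matching the journal entry.
    print(SERVICE, "failed to resolve:", exc)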

Hmm, also interesting; they are not registering an ethernet connection…
Scratch that, that’s my issue; something wrong with my equipment that I’ll need to sort out later. Sigh.
[Edit: there we go, re-paired my powerline adapters and we magically have ethernet connectivity back]

Well, unless you guys have changed something, it should just be a matter of time until they fall over so you can begin gathering data again.
I’ll keep an eye out and yell.

There we go, cec711 is deaded. Have at it.

And there goes a70bc.

Interestingly, both were up for quite a while, as per my first incredulous update this morning, and then when I start taking a look around and hopping into shells to look at logs… they suddenly start falling over on me.

Hey Aidan, that is very interesting. Perhaps it’s some resource getting choked: 13 containers is quite a bit for a Pi 3 with 1 GB of RAM to handle (depending, of course, on what they are doing), and it could be that somehow poking around pushes this just over the edge and results in the engine panicking. It looks like the engine panicked because of:

Sep 18 11:20:46 cec711c balenad[20252]: runtime/cgo: pthread_create failed: Resource temporarily unavailable
Sep 18 11:20:46 cec711c balenad[20252]: SIGABRT: abort

Which, going by https://github.com/golang/go/issues/37006, seems to indicate that this usually comes from trying to create too many threads or allocating too much RAM (a leak, maybe?). I know there was a version of the OS that had an issue with sshd processes not being cleaned up correctly, so resource usage would slowly accumulate each time you SSH’d into your device. Let me try to dig up what version that was fixed in; perhaps that’s playing a part here.

Okay, looking at the OS changelogs it seems the sshd bug I mentioned was fixed in 2.47.1 and above, so that’s not the culprit; it must be something else leaking memory or creating tons of threads. We would probably need to watch a functioning device with htop, or better yet something like netdata.
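Something like this would do as a crude stand-in for netdata if it’s easier to drop into a container - just a sketch that logs the total thread count against the kernel limit plus available memory once a minute, so a leak should show up as a steady climb before pthread_create starts failing (it assumes the process can see the host’s /proc, e.g. by running with pid: host):

import re
import time
from pathlib import Path

# Crude resource watcher (a stand-in for htop/netdata): once a minute, log
# the total number of threads on the box against the kernel limit, plus
# MemAvailable, so a slow leak shows up as a steady climb before the engine
# hits "pthread_create failed: Resource temporarily unavailable".

def total_threads() -> int:
    count = 0
    for status in Path("/proc").glob("[0-9]*/status"):
        try:
            match = re.search(r"^Threads:\s+(\d+)", status.read_text(), re.M)
            if match:
                count += int(match.group(1))
        except OSError:
            pass  # process went away between listing and reading it
    return count

def mem_available_kb() -> int:
    meminfo = Path("/proc/meminfo").read_text()
    return int(re.search(r"^MemAvailable:\s+(\d+)", meminfo, re.M).group(1))

threads_max = int(Path("/proc/sys/kernel/threads-max").read_text())

while True:
    print(f"threads={total_threads()}/{threads_max} "
          f"mem_available_kB={mem_available_kb()}")
    time.sleep(60)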

One of my colleagues mentioned that it could also be due to the way container logging is implemented in the container engine with the journald logging driver: I believe this method creates a new logging process for each container and ends up allocating more memory than wanted. We are in the process of changing that implementation, but in the meantime you can check whether that’s the issue by setting some (or all) of your services to use the logging driver "none", which you can do by adding this snippet to each of the services:

logging:
  driver: "none"

Note that this is obviously not ideal, as these containers will no longer publish logs to the dashboard. Let me know if you want me to explain anything further or help with how to add this to your services.
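And if you want to double-check what each container actually ended up with after the push, something along these lines should work - just a sketch, assuming the docker SDK for Python (pip install docker) and the engine socket at /var/run/balena-engine.sock on the host OS (plain Docker would use /var/run/docker.sock):

import docker

# Sketch: list each running container and the log driver its HostConfig
# actually carries, to confirm the "none" setting took effect.
client = docker.DockerClient(base_url="unix:///var/run/balena-engine.sock")

for container in client.containers.list():
    log_config = container.attrs["HostConfig"]["LogConfig"]
    print(f"{container.name}: driver={log_config.get('Type')}")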

Good morning!

Apologies for the delay, but I’ve now pushed that out - both with none and with "none", because they haven’t exhibited the lack-of-logging behaviour that you described…

E.g. (see attached screenshot).

This, presumably, is what you had in mind? If so, cool - the two devices are now running with this; if not, you’ll need to elaborate.

Note: the devices are being requisitioned for some other testing for a couple of days - I should be able to restore them for DNS issue testing again in the very near future.