Hi
Looks like both the devices are offline now - can you please restart them?
Apologies, both are online.
Think it's due to the multiple ethernet plug/unplug issue that they weren't showing up.
… I'll replace that ethernet cable later so that won't happen…
Interestingly, I've just checked and the internal DNS resolution on the container bridge is working on both devices. They can all resolve each other by name.
It is surprising that they've been online for 15h without dying. A new record for both devices in over a month.
Last error on cec711 was apparently yesterday morning:
{"message":"ScheduleExecutor.executeJob(): Error running task 'orchestration-time-update' - getaddrinfo EAI_AGAIN orchestration orchestration:8080","level":"error","timestamp":"2020-09-17T04:26:30.3030Z"}
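In case it's useful, name resolution between the services can be spot-checked from the host OS with the engine CLI (a rough sketch rather than exactly what I ran; balena-engine mirrors the docker CLI, it assumes the container image ships a busybox-style nslookup, and it uses the orchestration service from the error above as the target):

balena-engine ps                                                 # pick one of the service containers
balena-engine exec -it <container-name> nslookup orchestration   # should resolve via the internal DNS on the bridge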
Hmm, also interesting; they are not registering an ethernet connection…
Scratch that, that's my issue; something wrong with my equipment that I'll need to sort out later. Sigh.
[Edit: there we go, re-paired my powerline adapters and we magically have ethernet connectivity back]
Well, unless you guys have changed something, it should just be a matter of time until they fall over so you can begin gathering data again.
Iāll keep an eye out and yell.
There we go, cec711 is deaded. Have at it.
And there goes a70bc.
Interestingly, both were up for quite a while as per my first incredulous update this morning, and then when I start taking a look around and hopping into shells to start looking at logs… they suddenly start falling over on me.
Hey Aidan, that is very interesting. Perhaps it's some resource getting choked; 13 containers is quite a bit for a Pi 3 with 1GB of RAM to handle (depending, of course, on what they are doing), but it could be that somehow poking around pushes this just over the edge and results in the engine panicking. It looks like the engine panicked because of:
Sep 18 11:20:46 cec711c balenad[20252]: runtime/cgo: pthread_create failed: Resource temporarily unavailable
Sep 18 11:20:46 cec711c balenad[20252]: SIGABRT: abort
Which, going by https://github.com/golang/go/issues/37006, seems to indicate that this is usually from trying to create too many threads or allocating too much RAM (leaking, maybe?). I know there was some version of the OS that had an issue with sshd processes not being cleaned up correctly, which would slowly accumulate resource usage each time you ssh'd into your device. Let me try to dig up what version that was fixed in; perhaps that's playing in here.
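In the meantime, a quick way to get a feel for whether the device is running out of threads or memory is to snapshot the counts on the host OS (a rough sketch using standard Linux procfs paths, which should work in the host's busybox shell):

cat /proc/sys/kernel/threads-max               # kernel-wide thread limit
ls -d /proc/[0-9]*/task/* | wc -l              # threads currently in existence
grep -E 'MemTotal|MemAvailable' /proc/meminfo  # memory headroom

If either number is creeping steadily in one direction over time, that points at a leak rather than a one-off spike.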
Okay, looking at the OS changelogs it seems the sshd bug I mentioned was fixed in 2.47.1 and above, so that's not the culprit; it must be something else leaking memory or creating tons of threads. We would probably need to watch a functioning device with htop, or better yet something like netdata.
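If setting up netdata is too much hassle, even a crude watch loop over SSH on the host would tell us whether something is growing steadily (a minimal sketch under the same procfs assumptions as above; tweak the interval as you like):

while true; do
  echo "$(date) threads=$(ls -d /proc/[0-9]*/task/* | wc -l) $(grep MemAvailable /proc/meminfo)"
  sleep 60
done

Capturing that output right up to one of the crashes should make it clear whether we are looking at a slow leak or a sudden spike.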
One of my colleagues mentioned that it could also be due to the way container logging is implemented in the container engine with the journald logging driver: this method, I believe, creates a new logging process for each container and ends up allocating more memory than wanted. We are in the process of changing that implementation, but in the meantime you can verify whether it's that issue by setting some (or all) of your services to use the logging driver "none", which you can do by adding this snippet to each of the services:
logging:
  driver: "none"
Note that this is obviously not ideal, as these containers will no longer publish logs to the dashboard. Let me know if you want me to explain anything further or help with how to add this to your services.
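For clarity, this is roughly how it would sit inside your docker-compose.yml for one of the services (a sketch only - the service name and image are placeholders for whatever you actually run):

version: '2.1'
services:
  my-service:                      # placeholder service name
    image: my-registry/my-image    # placeholder image
    logging:
      driver: "none"               # stop collecting logs for this service

The same logging block can then be repeated under each service you want to silence.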
Good morning!
Apologies for the delay, but I've now pushed that out - both with none and "none" because they haven't exhibited the lack of logging behaviour that you've described…
E.g.
This, presumably, is what you had in mind? If so, cool, then the two devices are now running with this; if not, then you'll need to elaborate.
Note: the devices are being requisitioned for some other testing purposes for a couple of days - I should be able to restore them in the very near future for DNS issue testing once again.
Apologies, we haven't gotten round to restoring the devices for this specific testing just yet - we're reconfiguring some of the containers and apps to consolidate and reduce the number thereof. That's still ongoing in the final stages before we start testing in earnest.
However, I would like to raise an issue that is currently quite vexing and that I am hoping you can help us with. We have devices that are just… restarting, without warning.
It happens at around a couple of minutes past every hour, and it's not our services doing it, or we'd see all the relevant logs about it doing so.
I was streaming journal logs to hopefully catch some indication of what was going on, but there was nothing - no warning, just an immediate restart:
I've already granted support access for this particular device (60be607480fbea1cbacedc1809f80972) on our app (basestation-production).
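For reference, the sort of thing I'd expect to be useful here, run on the host OS over SSH (a sketch rather than exactly what I ran; note that looking back at earlier boots only works if the journal is persisted across reboots, which I don't believe is the default):

journalctl -f                       # stream the host journal live
journalctl --list-boots             # list the boots journald knows about
journalctl -b -1 -n 50 --no-pager   # the last 50 lines from the boot before the restart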
Thanks in advance once again.
Good afternoon!
As a minor update to the most recent issue: Some of the restarting observed in the devices was, in fact, our own. We have pushed an update to fix these out of turn restarts on our side but are still left with some devices restarting in the same fashion as the above - no logs or warning of any controlled restart happening.
One of the main things causing restart issues is a CPU temp over 80°C. That seems to successfully kill stuff off and get a "Supervisor starting" log spam going.
That's fair enough.
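For anyone keeping an eye on that, the SoC temperature can be read straight from sysfs on the host OS (a quick sketch; the exact thermal zone path may vary by board):

cat /sys/class/thermal/thermal_zone0/temp   # millidegrees C, so 80000 means 80°C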
Others, such as 5368440a25b40715bdaa6b9668083209 on our basestation-production app aren't so happy (support access granted already fyi):
The latter restarts are us - our sensor seems to be offline, so it's doing various remedial attempts including full restarts of the fin.
However, before 12:00 everything is registering as fine. For example, a snippet of our logs for that 10:58 restart:
And FYI, example logs for when we specifically deal with restarts/shutdowns ourselves:
How can we nicely see what may be doing this?
As an aside - a new batch of devices on the go, flashed with OS 2.38, is not seeing the same issue.
A similar look at the restarts of one of the new 2.38-flashed devices:
One restart. Lovely. Same software on our side across the fleet; all 13 containers running.
Good morning!
Apologies for the long hiatus on this topic. However, it's back on the priority list now that we have some of our own stabilisations out of the way.
I have flashed device c3910bbbdd9d6a79433f9ad6d453e632 on basestation-develop with the latest 2.58.3+rev1/11.14.0 OS/Supervisor image and got my share on with support for it.
Word of warning first though if anyone is going to be poking around the device today - I'm going to be running some temp tests with it (just got my hands on one of your heat sinks), so it'll be running hot and I'll obviously need to take it offline to fit said heat sink once I'm ready for that.
So, the latest OS is behaving similarly to, but not the same as, the 2.51 version.
Snippet of logs showing one of our containers booting:
As an aside, the latest test SW running on this app features a wrap-up of several of our containers smushed into one, so we're down from 13 running containers to 10, in the vague hope that it might relieve some stress on the device.
I'll once again point out, as I like to do, that we have no issues with the 2.38 OS set. We have started to get our manufacturing process in place and are getting devices with the 2.38 image out there (hampered, of course, by lockdown) and those few we have managed to install are showing as much stability as the original 2.38s we had running for the previous year.
So, now that weāre back on the case, with the shiny new 2.58 image, where do you want to start up on the troubleshooting for all this now?
Thanks.