balenad[PID]: runtime/cgo: pthread_create failed: Resource temporarily unavailable

Hi Balena team,

I’m running into the same symptom presented in this thread.

While troubleshooting, I noticed balenad logging “shim started” and “shim reaped” every three minutes. Is this normal for containers that are not restarting? I’m hoping the answer to this question is related to the symptoms described above. Thank you!

balena_log.log (7.6 MB)
balena_log2.log (9.5 MB)
balena_log3.log (9.5 MB)
balena_log4.log (9.5 MB)
balena_log5.log (9.5 MB)

Thanks for your report and logs. Is it correct to say you’re also running OS 2.46.1+rev1, as in that thread, and that you’re seeing “pthread_create failed: Resource temporarily unavailable”?
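For reference, the exact release string can be checked from the host OS terminal; assuming /etc/os-release is populated as on standard balenaOS images, something like this should print it:

```
# Show the balenaOS release string, e.g. PRETTY_NAME="balenaOS 2.46.1+rev1"
grep PRETTY_NAME /etc/os-release
```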

From reading the logs, my initial theory is that the “shim started” / “shim reaped” messages are due to a crash loop of the container engine itself. The shim is just a part of the container runtime, and these messages are symptomatic of container starts and stops (so, yes, they are completely normal for containers that are not restarting, or for containers that are restarting just before the container engine itself exits). As for the root cause, the logs show that balenad is exiting with pthread_create failed: Resource temporarily unavailable. Does either of alexgg’s bullet points in this thread comment apply to your case?
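If it helps to correlate those shim messages with engine restarts, a rough way to watch the engine’s own lifecycle from the host OS (assuming the engine runs under the balena.service systemd unit, as on recent balenaOS releases) is:

```
# Follow the engine's logs to see "shim started" / "shim reaped" alongside any daemon exits
journalctl --follow --unit balena.service

# How many times systemd has restarted the engine unit since boot
systemctl show balena.service --property=NRestarts
```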

We’ve been able to reproduce the error by running for i in $(seq 1 200); do balena run -d busybox sleep 1d; done on the host. We’re not sure exactly what in our application mirrors that situation. We do have a container A that exits when it cannot contact another service B; the balena engine schedules it again (as we expect), and this cycle continues until service B is available. Is it possible that there is a resource leak associated with container A constantly restarting?
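For context, pthread_create returning “Resource temporarily unavailable” (EAGAIN) seems to point at either a thread/PID limit or memory pressure, so while container A is crash-looping we have been watching roughly the following (standard Linux paths, nothing Balena-specific):

```
# System-wide limits on tasks (each thread counts as a task)
cat /proc/sys/kernel/threads-max
cat /proc/sys/kernel/pid_max

# Rough count of all threads currently running, via /proc
ls -d /proc/[0-9]*/task/[0-9]* | wc -l
```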

Hi Daniel,

You mention running multiple containers, and I wonder if you might be running out of disk space. A df -Th from the HostOS terminal would show current usage and storage capacity. This is admittedly a bit of a long shot, but it’s worth checking.
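For example, something like the following from the host OS terminal (the second command assumes the engine CLI mirrors Docker’s system df subcommand):

```
# Overall filesystem usage and capacity on the host
df -Th

# Space used by engine-managed images, containers and volumes
# (assumes the engine CLI provides a Docker-style "system df")
balena system df
```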

John

Thanks for the suggestion John. It doesn’t appear to be the case.

[screenshot attached showing available disk space]


I think this currently open issue best matches the situation.

Hi Daniel,

In the issue you linked to, we believed that the number of containers the user was running caused balena-engine to consume more memory than usual, due to the way logging to the dashboard works: the Supervisor uses balena logs, which spawns OS threads that consume memory. At the time, our recommendation was to upgrade the HostOS to 2.53.9+rev1.

Can you tell us if you have many containers running? Are you seeing any memory errors?
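For the memory side of things, a couple of quick checks from the host OS terminal could help here (this assumes the engine daemon process is named balenad):

```
# Thread count and resident memory of the engine daemon
grep -E 'Threads|VmRSS' /proc/"$(pidof balenad | cut -d' ' -f1)"/status

# Overall memory and swap picture
free -m

# Any OOM-killer activity since boot
dmesg | grep -iE 'out of memory|oom'
```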

John

We’re running 13 containers and application memory doesn’t show any steady leaks. However, we had been running balenaOS 2.51.1+rev1. After updating to balenaOS 2.58.3+rev1, the device seems to restart successfully without getting into a corrupt state when resource limits are reached.

Hi Daniel, something that might be a factor is the adoption of compressed swap (zram) in balenaOS v2.52. This reinforces the theory that the system was running low on memory and that the engine was crashing due to memory constraints. Enabling zram helped, but the problem will likely re-emerge if either the number of containers or the amount of log data increases in the future.
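If you want to verify that the compressed swap is actually active on 2.58.3, a quick check from the host OS terminal is:

```
# zram swap should show up here as an active swap device
cat /proc/swaps

# Swap totals and how much is currently in use
grep -i swap /proc/meminfo
```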
Please let us know if your immediate problem is resolved so we can close this issue.

Hi Alex,

With 2.58.3, I am unable to reproduce the issue reliably, so I’d say this is resolved. We’re working on reducing our memory footprint and making sure containers don’t restart endlessly. Hopefully all of this is enough to mitigate the issue. Thanks for the heads-up!
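For anyone hitting the same restart loop, one possible way to keep a container like our A from restarting without end is to have its entrypoint wait for the dependent service with a capped backoff instead of exiting immediately. A minimal sketch, where the service name, port and final binary are placeholders and a netcat supporting -z is assumed:

```
#!/bin/sh
# Hypothetical entrypoint for container A: wait for service B with a capped
# backoff instead of exiting immediately and being rescheduled in a tight loop.
# "service-b", port 8080 and the final exec are placeholders, not from this thread.
delay=1
until nc -z service-b 8080; do
  echo "service B not reachable, retrying in ${delay}s"
  sleep "$delay"
  delay=$(( delay * 2 ))
  [ "$delay" -gt 60 ] && delay=60
done
exec /usr/local/bin/container-a-main
```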