balenaOS 2.46.1+rev1 crashed with pthread_create failed: Resource temporarily unavailable

Hi all,

As the title suggests, last week one of my devices (Raspberry Pi 3B+) running balenaOS (2.46.1+rev1) failed.
This has left me with a stack trace of about 1000 lines (balenad_crash.log, 149.6 KB).

Of course, I would like to prevent this from happening in the future, so I was wondering if someone can point me in the right direction.
In my (3 years?) experience with Balena, this is the first time I’ve ever seen such a crash.

My application uses a relatively large number of threads, so judging from the error message, I’m assuming I’m hitting a limit there.
Is there a practical limit to the number of threads that can be used on a Raspberry Pi 3B+ with Balena?
Are there any other things I can keep an eye on to prevent this from happening?

Updating the HostOS is an option, but not ideal due to some manual changes I have made to it.
I also don’t have direct access to the physical device as it has been deployed abroad.

Thanks in advance,
TJvV

Hi there, in order to help you I will need the following information:

  1. What OS version are you running, and what modifications have you made to it?

  2. Is this device on balenaCloud? If so, could you share the device UUID and grant support access to it?

  3. Regarding multi-threading: as far as I know, the RPi 3B+ has 4 cores with 1 thread each. How are you using multi-threading? There could be edge cases where, if you deliberately lock a thread, you end up with a bad collision that could explain the balena-engine crash you reported. This is why I’m interested in understanding how you are implementing multi-threading :slight_smile:

Hello @curcuz,

Thank you for your reply.

  1. I am running balenaOS v2.46.1+rev1 with a custom script in /etc/NetworkManager/dispatcher.d; this script is used to enable roaming on certain Huawei 4G modems when they are plugged in.
    I have also modified the journald logging configuration in /etc/systemd/journald.conf.d/journald-balena-os.conf so that it will use more disk space.

  2. The device is indeed connected to balenaCloud; UUID is 11ac67e0f2d51f69e368a72bdebc872a, support access has been granted for a week.

  3. In my container I am running a shell script in the background that will check periodically for issues with my 4G modems.
    The main application I am running is a Java (1.6u65, ancient I know) application, with a mix of threads spawned by opening serial ports and threads spawned from various (Scheduled)ExecutorServices.
    Most threads are started during initialization of the application, with others being spawned to handle messages from/to the different peripheral ports, timed events and handling socket connections.
    Synchronization is done using block-level locks at the finest level, with different locks for different parts.
    It’s a bit tough going into too much detail on the forum as it’s a commercial product.

Thanks, this is already good information to help with the investigation. I’m looking into alternative ways to achieve the configuration you currently perform as a balenaOS modification, to see if you could rely on stock balenaOS and try newer releases.

Hi there – one additional bit of detail: you can check the maximum number of threads like so:

cat /proc/sys/kernel/threads-max

and you can count the number of open threads by running this command:

cat /proc/*/status | awk '/Threads/ {total += $NF} END {print "Total threads: ", total}'

Note that both these commands depend on having access to the host OS /proc filesystem. If you wanted to do this from a container, you could use the io.balena.features.procfs label. This would allow you to continuously monitor the number of threads, and see if it’s growing over time or hitting a limit. Given the description of your application and the error you reported in the crash log, this seems like a good avenue for investigation.
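
A minimal sketch of such monitoring, assuming a POSIX shell and awk are available in the container. The function names and the procfs root argument are illustrative and not part of balenaOS; pass whichever path the host /proc is visible at in your container (it defaults to /proc).

```shell
#!/bin/sh
# Sum the "Threads:" field over every process found under a procfs root.
# $1 is the procfs root (assumption: defaults to /proc).
count_threads() {
    proc_root="${1:-/proc}"
    # Processes can exit while we scan, so read errors are discarded.
    cat "$proc_root"/[0-9]*/status 2>/dev/null |
        awk '/^Threads:/ {total += $2} END {print total + 0}'
}

# Read the kernel-wide thread limit from the same procfs root.
max_threads() {
    cat "${1:-/proc}/sys/kernel/threads-max"
}

# Print one timestamped sample, e.g. "2020-11-02T07:00:00Z threads=243 max=15369".
log_sample() {
    printf '%s threads=%s max=%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
        "$(count_threads "$1")" "$(max_threads "$1")"
}

# Example: sample once a minute (left commented out so that sourcing
# this file does not block):
#   while :; do log_sample; sleep 60; done
```

Writing the samples somewhere that survives a crash, such as a file on a persistent volume, would make it possible to see whether the count was climbing before the next failure.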

That brings up another question: have you had this happen at all since your initial post?

In regards to your custom OS, you may be able to duplicate the functionality of your /etc/NetworkManager/dispatcher.d script with a UDEV rule for the modem; one of our engineers has written up a good overview of UDEV and balenaOS in this article.

Unfortunately, I don’t believe we expose the ability to configure the maximum disk space used by journald logging. You may want to consider using an external service for sending application and host logs; we have had good luck with Datadog in the past. I’ve also added a feature request to our internal tracking to allow users to customize this setting.

We’ve asked our OS engineers to take a look at the stack trace to see if they have any insights to share. We’ll be sure to pass on anything they have to offer. Let us know how you get on with your investigation, or if you have any other questions.

All the best,
Hugh

Thank you for your suggestions, @saintaardvark.

As far as I can tell, this has only happened that one time.
Right now the application appears to be running nominally, and the commands you gave report 243/15369 threads (total/max).

I will see if I can move the modem configuration script to be triggered by a udev rule.
However, I believe I looked into this before and had trouble getting the timing right, because the interface needs to be UP before the configuration (through a web API) can actually be performed.

The extra disk space for logging has only been configured for this specific device because it was showing different issues (application bugs, which have mostly been resolved since) and the journal would only show the last 2 hours or so.
This is one of a few devices using a CAN bus, where messages (and by extension logging) pour in pretty much every millisecond.

Hi @TJvV – thanks for the additional information. That sounds like there was a lot of headroom at the moment you checked. I would definitely consider monitoring this, though.

I’ve created a bug here; feel free to subscribe to that issue if you’d like updates. We’re looking into the backtrace, and will let you know if we have any questions. Please let us know if this comes up again, or if you have any other questions.

All the best,
Hugh

Hi,

Just this morning it occurred again.
I have attached the stack trace together with the last few system messages.
balenad_crash_2020_11_02.log (150.9 KB)

At the time of crashing my application did not report any warnings/errors.

I sadly do not have any monitoring of the total number of threads running at the time.

Hi, thanks for the stack trace. This looks to be the same problem we have been debugging before. In summary, there are two things that can cause it:

  • Reaching the maximum number of threads. For a balena application container this is actually around 4915.
  • Insufficient memory (less likely)

As it seems you are not able to reproduce this easily, I would suggest you set up a test system where the problem can be reproduced, and remove the system-wide thread limit there by configuring /etc/systemd/system.conf with:
DefaultTasksMax=infinity
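
Before lifting the limit, it may help to see how close a unit already is to it. The sketch below parses the `TasksCurrent=` and `TasksMax=` lines that `systemctl show` prints and reports the usage. The helper name `tasks_usage` is illustrative, and the unit name `balena.service` in the usage comment is an assumption about how the engine is run on the host OS. The 4915 figure quoted above is consistent with systemd's documented default of 15% of the default kernel.pid_max (32768), though that origin is an assumption here.

```shell
#!/bin/sh
# Read "Key=Value" lines (as printed by `systemctl show`) from stdin
# and report task usage against the limit.
tasks_usage() {
    awk -F= '
        $1 == "TasksCurrent" {cur = $2}
        $1 == "TasksMax"     {max = $2}
        END {
            # Some systemd versions print the maximum 64-bit value
            # instead of the word "infinity".
            if (max == "" || max == "infinity" || max == "18446744073709551615")
                print cur " of unlimited"
            else
                printf "%d of %d (%.0f%%)\n", cur, max, 100 * cur / max
        }'
}

# Usage on the host OS (unit name is an assumption):
#   systemctl show balena.service -p TasksCurrent -p TasksMax | tasks_usage
```

Note that a change to DefaultTasksMax in system.conf is only picked up after the systemd manager re-reads its configuration (for example via `systemctl daemon-reexec` or a reboot), and already-running units may need a restart before the new default applies to them.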