Tcmalloc large alloc and subsequent out of memory crash

We’ve been having some issues running out of memory on a multi-container application running on an RPi3B+. The container that crashes is always the same. It’s running an Electron application and I haven’t been able to detect any memory leaks. Recently we upgraded to an RPi4 to see if it was truly a memory limitation, and over the weekend we experienced the same crash, but this time with an additional log message containing a large allocation warning:

17.11.19 19:13:18 (-0600) app [96:1118/011318.808760:INFO:CONSOLE(16)] “memory usage @ 15016104”, source: file:///usr/src/app/src/route.js (16)
17.11.19 19:19:04 (-0600) app tcmalloc: large alloc 1148030976 bytes == (nil) @
17.11.19 19:19:04 (-0600) app [129:1118/011904.621043:FATAL:memory_linux.cc(42)] Out of memory.

As you can see the memory usage dump that I’m doing from the renderer doesn’t even touch the limits of the RPi4. I’ve done quite a bit of Googling but haven’t seen anything super concrete. I am doing a test without network connectivity now so we’ll see if that has any interesting results. On the RPi3 this has been happening very consistently on a 24 hour time frame. The RPi4 took a couple of days but it’s a new installation so take that with a grain of salt.

Has anyone run into this before by chance?

Hi there, some questions first (for both devices):

  • What version of balenaOS + supervisor is the device running?
  • Is it 64bit or 32bit?
  • Is there any error happening on the RPI3 similar to the one you linked?

@thundron

Pi3: balenaOS 2.38.0+rev1, supervisor 9.15.7
Pi4: balenaOS 2.44.0+rev3 (64 bit), supervisor 10.3.7

The Pi3 displays the same error (fatal out of memory) without the tcmalloc warning beforehand.

Hi @AdamLee, which RPi4 version are you running? How much memory does it have?

1148030976 bytes is around 1.15 GB (about 1.07 GiB). Since the RPi3 has 1 GB of RAM, your app would be going beyond the limit.
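A quick sketch of the unit conversion for the byte count in the tcmalloc warning. The helper name `describeBytes` is made up for illustration and is not part of any library.

```javascript
// Convert a raw byte count from the tcmalloc warning into readable units.
// `describeBytes` is a hypothetical helper, not part of any library.
function describeBytes(bytes) {
  return {
    gb: bytes / 1e9,      // decimal gigabytes (10^9 bytes)
    gib: bytes / 2 ** 30, // binary gibibytes (2^30 bytes)
  };
}

const failedAlloc = describeBytes(1148030976);
console.log(`${failedAlloc.gb.toFixed(2)} GB, ${failedAlloc.gib.toFixed(2)} GiB`);
// prints "1.15 GB, 1.07 GiB"
```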

thundron made an internal note that he suspects you might be hitting node’s memory limit. Could you test it by increasing the default memory allocation threshold of node and your app?

Before the application crash happens, could you please take a look at the Device Diagnostics view and run diagnostics to see if anything pops up related to memory checks?

Finally, if you could enable persistent logging on the device, we could take a look at the logs across reboots to see if they tell anything extra.

@gelbal

We’re using the 4GB version, model B. I really wanted to rule out device limitations.

I had been running diagnostics and didn’t see any issues. Our Pi4 has been running for close to 23 hours now and our most resource-intensive container (again, Electron) is only using 364 MB. One thing that’s pretty interesting, though, is that it’s currently reading 207.49% CPU. That seems pretty weird. I don’t see anything else memory related in the diagnostic logs, but there’s a lot in there to eyeball through. I can send them over if you wish.

I’ve increased the app memory limit to 3GB but it looks like it was already 2GB by default. We’ll see how it goes. I’m also going to start logging v8.getHeapStatistics to see if there’s anything else I can glean from some more granular information.
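A minimal sketch of what periodic `v8.getHeapStatistics` logging could look like. It assumes the code runs somewhere Node's `v8` module can be required (the main process, or a renderer with Node integration enabled); `formatHeapStats` and `startHeapLogging` are hypothetical helper names.

```javascript
const v8 = require('v8');

// Turn the object returned by v8.getHeapStatistics() into "key -> value" lines.
function formatHeapStats(stats) {
  return Object.entries(stats).map(([key, value]) => `${key} -> ${value}`);
}

// Log a heap snapshot every `intervalMs` milliseconds and return the timer
// so the caller can stop it with clearInterval().
function startHeapLogging(intervalMs, log = console.log) {
  const timer = setInterval(() => {
    formatHeapStats(v8.getHeapStatistics()).forEach((line) => log(line));
  }, intervalMs);
  // In plain Node the timer object has unref(); in a browser-like context
  // setInterval returns a number, so only call it when it exists.
  if (typeof timer.unref === 'function') timer.unref();
  return timer;
}
```

Calling `startHeapLogging(60000)` would log one snapshot per minute. These numbers cover V8's JavaScript heap only, so a native allocation may not show up in them.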

I’ve also enabled persistent logging.

Hi there @AdamLee,

Please do keep us up to date on your latest test results having bumped those limits! Obviously I realize you have only seen issues on the RPi4 after a few days, so these results may take some time.

As an additional idea, you may be able to load some proper metrics collectors like netdata/datadog/telegraf/node_exporter to better understand what leads up to these memory alloc issues. We have published a few blog posts on how to set these up (and other users have contributed some great forum posts!), see:



Increasing the memory limit didn’t help. Here’s the log for the latest crash. This did take a few days, which makes me think there’s got to be a memory leak somewhere, but the usage on the diagnostics is still way lower than the available RAM.

22.11.19 23:41:46 (-0600) app “total_heap_size -> 35495936”
22.11.19 23:41:46 (-0600) app “total_heap_size_executable -> 3194880”
22.11.19 23:41:46 (-0600) app “total_physical_size -> 34353964”
22.11.19 23:41:46 (-0600) app “total_available_size -> 3136987344”
22.11.19 23:41:46 (-0600) app “used_heap_size -> 16653528”
22.11.19 23:41:46 (-0600) app “heap_size_limit -> 3162505216”
22.11.19 23:41:46 (-0600) app “malloced_memory -> 57364”
22.11.19 23:41:46 (-0600) app “peak_malloced_memory -> 4859504”
22.11.19 23:41:46 (-0600) app “does_zap_garbage -> 0”
23.11.19 00:02:48 (-0600) app tcmalloc: large alloc 1148030976 bytes == (nil) @
23.11.19 00:02:48 (-0600) app [130:1123/060248.819314:FATAL:memory_linux.cc(42)] Out of memory.

@AdamLee how are you setting the memory limit for your application? It looks like memory_linux.cc is part of the Chromium component so I wonder if the increased limit is perhaps not making it that far.

I’m setting it in electron with js flags:

app.commandLine.appendSwitch('js-flags', '--expose_gc --max-old-space-size=3000');
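For what it's worth, the `heap_size_limit` of 3162505216 bytes in the log above works out to 3016 MiB, which suggests the 3000 MB setting did reach V8. A rough sketch for comparing that limit with the memory of the whole process follows. It assumes Node's `v8` and `process` modules are available where it runs, and `memorySnapshot` is a hypothetical helper name. Keep in mind that `--max-old-space-size` caps V8's old-generation heap, so it may not govern allocations made natively by Chromium.

```javascript
const v8 = require('v8');

const MIB = 1024 * 1024;

// Summarize V8's heap limit alongside the resident set size of the process.
// `memorySnapshot` is a hypothetical helper name.
function memorySnapshot() {
  const heap = v8.getHeapStatistics();
  const usage = process.memoryUsage();
  return {
    heapLimitMiB: Math.round(heap.heap_size_limit / MIB), // influenced by --max-old-space-size
    heapUsedMiB: Math.round(heap.used_heap_size / MIB),   // JavaScript heap only
    rssMiB: Math.round(usage.rss / MIB),                  // resident memory, native allocations included
    externalMiB: Math.round(usage.external / MIB),        // C++ memory bound to JavaScript objects
  };
}

console.log(memorySnapshot());
```

If `rssMiB` keeps growing while `heapUsedMiB` stays flat, the growth is probably happening outside the JavaScript heap.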

It seems like it’s making some difference since we’re able to operate a lot longer before it crashes, but that doesn’t necessarily prove causation since this application is in the development phase and is changing somewhat frequently.

I’ve added monitoring with Prometheus so we’ll see how that goes. Customizing Grafana is way less intuitive than I thought it would be, so I’m just operating off of the template from the link you posted. I’ll monitor for the rest of the week and see if I notice anything.

Thanks Adam, keep us updated on your findings!

So I can see so far that my memory consumption is very slowly climbing. I’m starting to wonder if perhaps the error message I was getting was a red herring and in fact there is a leak somewhere else. What I’m looking into now is breaking out monitoring by device and container. Is that possible? It looks like perhaps I’ll have to switch to a docker exporter.

Hi again,

In order to help you better profile your memory usage we’ll need to understand what you have set up in terms of metrics collection. Specifically, I am not sure what you mean by “breaking out monitoring by device and container”. Can you describe what you have set up to collect & visualize the data?

Right now I’m doing almost an exact replica of what you originally led me to here:

What I’m seeing from the node exporter looks like the statistics for each host, which is the device level. What I’d like to see is statistics for each container, since these devices are running as multi-container environments. This way I could potentially tell which container is showing signs of a memory leak versus just knowing one is possible. Make sense? I suspect that one of two containers is my issue: one is just running a Node back-end and the other is running Electron.

What I’m seeing from the node exporter looks like the statistics for each host, which is the device level. What I’d like to see is statistics for each container […]

Is node exporter running in each container you’d like to monitor? From talking to colleagues, I gather that node exporter running in a privileged container would export device/host stats, whereas a non-privileged container would export stats for the container only. Alternatively, I understand that the cAdvisor project is able to collect container stats even for privileged containers.

Does this reply go in the right direction? Let us know where you’d like further guidance, and we’ll go from there.

Hi Adam,

I was just skimming through this thread and wondering why you’re not able to profile the memory usage of the application outside the device (perhaps in your development environment)? Is there something happening on the device that prohibits you from doing this?

James.

Yes it does, thank you. I’ll take a look at cAdvisor.

Currently I do have node exporter running in its own privileged container.

Originally this problem showed up in our Electron container, and monitoring it locally didn’t display any issues with memory usage. After taking a look at some of the output I’m seeing from node exporter across the device, I’m starting to suspect that the error we were getting from the Electron container may have been a red herring. I’m doing more testing locally now on our other containers to see if I can find the issue, but monitoring our live development devices provides me with some more data that I can simply capture passively.

Hi,
Just a quick note regarding cAdvisor. Its Docker image seems to be published for the amd64 architecture only (I tried both what is published to Docker Hub and to gcr.io). So it will not run on an RPi unless we build it on our own for the ARM architecture.

Another caveat with cAdvisor on balena is that we don’t allow arbitrary bind mounts in docker-compose files. This is done in order to minimize the risk of what containers could do with the host data. Hence, cAdvisor would expose the machine networks, but it would not have access to cgroup files to expose per-container metrics. That said, for a one-shot monitoring task, you could still ssh to the device and run cAdvisor with the balena run command (exactly as the docker run command is described in their readme), as long as you have the right image working on your architecture.

For basic per-container stats, you could also use the balena stats command on the machine. It works as described in the Docker docs:


Maybe it will be helpful.
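If the stats output needs to be logged or compared over time, a small parser along these lines might help. It is only a sketch: it assumes balena stats accepts the same `--no-stream` and `--format` options as docker stats and prints memory usage in the same `used / limit` style. The container name in the example and the helper names `parseSize` and `parseStatsLine` are made up.

```javascript
// Assumed invocation on the device (flags borrowed from `docker stats`):
//   balena stats --no-stream --format "{{.Name}}|{{.MemUsage}}"

// Binary unit suffixes as printed by `docker stats`; anything else is treated as unknown.
const UNITS = new Map([
  ['B', 1],
  ['KiB', 2 ** 10],
  ['MiB', 2 ** 20],
  ['GiB', 2 ** 30],
]);

// Parse one size token such as "364MiB" into bytes (NaN if it is not recognized).
function parseSize(token) {
  const match = /^([\d.]+)\s*([A-Za-z]+)$/.exec(token.trim());
  if (!match || !UNITS.has(match[2])) return NaN;
  return parseFloat(match[1]) * UNITS.get(match[2]);
}

// Parse one "name|used / limit" line into a container name and byte counts.
function parseStatsLine(line) {
  const [name, memUsage] = line.split('|');
  const [usedBytes, limitBytes] = memUsage.split('/').map(parseSize);
  return { name, usedBytes, limitBytes };
}

// Example with a made-up container name and values.
console.log(parseStatsLine('app_1|364MiB / 3.7GiB'));
```

Sampling this every few minutes per container would show which one is growing, without needing cAdvisor.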