We’re experiencing a growing local image cache when building in local mode. We are currently developing on a balenaFin v1.1 running balenaOS 2.80.3 and Supervisor 12.7.0.
We read in this post that the Supervisor should be pruning images, but our cache keeps growing, and we are forced to either run balena system prune or eventually purge all data from the device via the balenaCloud dashboard.
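For reference, this is the manual cleanup we currently resort to on the device (a sketch; as far as we understand, balenaEngine mirrors the Docker CLI):

```bash
# List the dangling <none>:<none> layers left behind by rebuilds
balena images --filter "dangling=true"

# Remove stopped containers, dangling images, and unused networks
balena system prune
```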
I’m not sure whether this is an error; the balena folks are better placed to know. The behaviour you describe makes sense to me, though. In local mode I don’t think I would want auto-purging of my images or build cache; I would want them there for speedier rebuilds on my next push (local mode being for development), and would then manually purge them when I wanted to free up space.
If it were occurring outside of local mode, on a production image for example, then I would be concerned.
As @maggie0002 says, it makes sense that images would be retained in local mode, as this is a development environment where speed is at a premium over storage capacity. From your initial post, it sounds like you are switching between local mode and managed mode frequently on the Fin that you’re developing on. In addition, you’re looking to purge images that are used only in local mode when you switch out of local mode. Finally, you’re running a development OS variant. Are all these assumptions correct?
I believe the post you linked to is in reference to this issue: On leaving local mode the supervisor should clean up dangling images · Issue #993 · balena-os/balena-supervisor · GitHub. Could you paste example output from balena images, and also tell us what sorts of base images you’re using during local mode? This would help me determine what’s actually going on with the device.

Also, the Supervisor does not remove any volumes, user-created (“unmanaged”) or otherwise, so if the cache filling up is due to volume creation, a manual prune or a Supervisor API call would be required. If the cache fill-up is indeed due to volumes, this design is intended to preserve user data.
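In the meantime, a quick way to check whether images or volumes are behind the growth (a sketch, assuming balenaEngine exposes the same subcommands as the Docker CLI):

```bash
# Disk usage broken down by images, containers, local volumes, and build cache
balena system df

# List volumes; the Supervisor never removes these automatically
balena volume ls
```

Thanks, and let us know!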
@cywang117 yes on all fronts for your first set of questions: we are using the dev OS variant and switching back and forth between local mode and managed mode.
The specific issue we’ve been seeing is that the Fin bogs down during development, and memory continues to increase as we live-push over several days of testing. We noticed that the “build from cache” image total continued to increase, and the memory usage shown on the balena dashboard continues to grow as we keep using live push.
Ultimately, this results in us losing connection to the device regularly and having to purge all data from the dashboard before being able to continue. Shortly after a purge, memory usage sits at only around 50% after the first build, but as we continue to live-push it increases to 95% over a couple of days of development.
Unfortunately, we’ve lost the ability to see the number of cached images when running balena push -m --debug. It used to tell us the number of images in the cache before building, but for some reason the output no longer displays anything different from a straight balena push.
Here is a fresh pull using balena images -a. Also, as I’m describing this, I think our root issue is around memory, and that’s likely not affected by cached images, correct? Thanks for all your help.
root@2a38113:~# balena images -a
REPOSITORY TAG IMAGE ID CREATED SIZE
local_image_ble-gatt-server latest 0dce54fc27b8 18 hours ago 324MB
<none> <none> 539ad52570aa 18 hours ago 324MB
<none> <none> 64e315cbbc09 18 hours ago 324MB
<none> <none> dea73fa38431 18 hours ago 324MB
<none> <none> 6ec2ee931441 18 hours ago 734MB
<none> <none> 922642ac9006 18 hours ago 734MB
<none> <none> 432afd1aa37e 18 hours ago 734MB
<none> <none> 2931f5a436bb 18 hours ago 734MB
<none> <none> 9d35fa4b4514 18 hours ago 734MB
<none> <none> 93232fa90cc7 18 hours ago 734MB
<none> <none> 10c03e732ab0 18 hours ago 324MB
<none> <none> 6707719c4305 18 hours ago 324MB
<none> <none> 6581eec858a0 18 hours ago 324MB
<none> <none> 3b5ff8057013 18 hours ago 324MB
<none> <none> 554bfdcd9b6c 19 hours ago 324MB
<none> <none> 75645582f199 19 hours ago 324MB
<none> <none> 5eca70d75a9c 19 hours ago 324MB
<none> <none> 63c07f40b714 19 hours ago 324MB
<none> <none> 9372a8e9607a 19 hours ago 324MB
<none> <none> 02cf0298162f 19 hours ago 324MB
<none> <none> 27595cab6c3b 19 hours ago 324MB
<none> <none> 2767a94650dc 19 hours ago 324MB
<none> <none> 02e18bfe9320 20 hours ago 324MB
<none> <none> cb10dce3a772 20 hours ago 324MB
<none> <none> 1d54538f6dee 20 hours ago 324MB
<none> <none> 55bfca4c7b18 20 hours ago 324MB
<none> <none> eb3b184b2e20 20 hours ago 267MB
<none> <none> e9d984cc7134 20 hours ago 263MB
<none> <none> be490c93f20b 20 hours ago 268MB
<none> <none> 710fc16fe2ea 20 hours ago 261MB
<none> <none> 2aa76855309b 20 hours ago 263MB
<none> <none> ae7edbedc73a 20 hours ago 267MB
<none> <none> 525f692079e1 20 hours ago 268MB
<none> <none> 2e22eaa6d9f9 20 hours ago 267MB
<none> <none> 5a7467c7a364 20 hours ago 261MB
<none> <none> dfb6db36eaec 20 hours ago 263MB
<none> <none> ed2f3bdf0a1b 20 hours ago 261MB
<none> <none> 0daab0256416 20 hours ago 268MB
<none> <none> 95f0d9331932 20 hours ago 796MB
<none> <none> 7c5c57302b11 20 hours ago 267MB
<none> <none> 80d902902e86 20 hours ago 263MB
<none> <none> 2c430d40776c 20 hours ago 268MB
<none> <none> 185ef88d5a05 20 hours ago 261MB
<none> <none> 28ae82bf8623 20 hours ago 796MB
<none> <none> 0c80631029bd 20 hours ago 796MB
<none> <none> 0d747bea9a36 20 hours ago 796MB
local_image_hal latest b89f9142fc23 20 hours ago 796MB
<none> <none> 3301d4458e3c 20 hours ago 796MB
<none> <none> 329625c0f223 20 hours ago 796MB
<none> <none> 044c1b4e55de 20 hours ago 796MB
<none> <none> b83563c021cf 20 hours ago 796MB
<none> <none> f3f5a372b672 20 hours ago 796MB
<none> <none> 9e49c7d8b620 20 hours ago 796MB
<none> <none> ab41a9baaf25 20 hours ago 796MB
<none> <none> 097302411b4b 21 hours ago 796MB
<none> <none> f138cbbbf1d9 21 hours ago 796MB
<none> <none> b8ad2b6df627 21 hours ago 796MB
<none> <none> 3f72dd61e03b 21 hours ago 796MB
<none> <none> 98a4744151df 21 hours ago 796MB
<none> <none> 2d0821140ae4 21 hours ago 796MB
<none> <none> beea7a8a1716 21 hours ago 796MB
<none> <none> ca199f160bd3 21 hours ago 796MB
<none> <none> 97be6c3a7bfc 21 hours ago 324MB
<none> <none> 0b6f2e7c960a 21 hours ago 324MB
<none> <none> f323ee81718f 21 hours ago 324MB
<none> <none> e09b774226fc 21 hours ago 324MB
<none> <none> 9c7ee3cfb7fb 21 hours ago 324MB
<none> <none> 90ed8453d0d2 21 hours ago 324MB
<none> <none> bc551c2baeea 21 hours ago 250MB
<none> <none> 915aed6e7529 21 hours ago 796MB
<none> <none> e387b3eb38dd 21 hours ago 796MB
<none> <none> 68ed10db0300 21 hours ago 796MB
<none> <none> bd0ca4c78290 21 hours ago 307MB
<none> <none> 4b71e891ca2b 21 hours ago 796MB
<none> <none> 3f21824ba82a 11 days ago 796MB
<none> <none> 321b5fb6d843 11 days ago 796MB
<none> <none> 2940fa5b7fb9 11 days ago 796MB
<none> <none> a4a1fdd51fbe 11 days ago 796MB
<none> <none> 64f5417f0953 11 days ago 796MB
<none> <none> 72444ba37bb5 11 days ago 796MB
<none> <none> a11accbc9c94 11 days ago 796MB
<none> <none> d276bdee3e7a 11 days ago 796MB
local_image_map-service latest 0886c9ea3e8a 12 days ago 268MB
<none> <none> 39b82ef11ce6 12 days ago 268MB
local_image_core latest 6fe1208a5d58 12 days ago 261MB
<none> <none> 6a866cd4a78d 12 days ago 261MB
<none> <none> ccb7ac13f7fd 12 days ago 268MB
<none> <none> 1463bb3bfe91 12 days ago 268MB
<none> <none> d2c0f14bc50e 12 days ago 261MB
<none> <none> 975c079129cf 12 days ago 261MB
local_image_scheduler latest a157ae8caaf8 2 weeks ago 263MB
<none> <none> f0ee9da4a594 2 weeks ago 263MB
<none> <none> 888c296e0cf8 2 weeks ago 263MB
local_image_cmd-shell latest a955aa53a360 2 weeks ago 267MB
<none> <none> a371a017bfb8 2 weeks ago 263MB
<none> <none> febceaa2adb4 2 weeks ago 267MB
<none> <none> e22bceccf1f7 2 weeks ago 267MB
<none> <none> 68d731dd341a 2 weeks ago 267MB
<none> <none> 53b00658ad76 2 weeks ago 267MB
local_image_supervisor latest 413a3c6d3240 2 weeks ago 748MB
<none> <none> cba30dd17702 2 weeks ago 748MB
<none> <none> 65c516bcdf3d 2 weeks ago 748MB
<none> <none> c15e9105c49f 2 weeks ago 748MB
<none> <none> ac18d2c15dd1 2 weeks ago 748MB
<none> <none> b5308b846c86 2 weeks ago 730MB
<none> <none> 3cec7dd0041a 2 weeks ago 730MB
<none> <none> 848569fff94f 2 weeks ago 730MB
<none> <none> 344465121c64 2 weeks ago 730MB
<none> <none> 1e66bb47d232 2 weeks ago 263MB
<none> <none> 581a2b0daed3 2 weeks ago 261MB
<none> <none> 246f32580f43 2 weeks ago 256MB
<none> <none> 036c8532630d 2 weeks ago 256MB
<none> <none> e20613b55db0 2 weeks ago 256MB
<none> <none> cd4d0d1549a0 2 weeks ago 256MB
<none> <none> 3fb036afa55e 2 weeks ago 256MB
<none> <none> 9bee2d2ba94b 2 weeks ago 256MB
<none> <none> c565a9e5a946 2 weeks ago 684MB
<none> <none> b9b545aeb474 2 weeks ago 795MB
<none> <none> 8f7378a5b2fe 2 weeks ago 307MB
local_image_controller latest 428e1c3f31cc 2 weeks ago 307MB
<none> <none> ff06bf608721 2 weeks ago 307MB
<none> <none> bcc523ba6179 2 weeks ago 307MB
<none> <none> 7009d0989c2e 2 weeks ago 307MB
<none> <none> 1cba93ebb2b5 2 weeks ago 307MB
<none> <none> 8ace330ed48a 2 weeks ago 307MB
<none> <none> ef985c24637a 2 weeks ago 780MB
<none> <none> 8de517ba0e9d 2 weeks ago 780MB
<none> <none> cd64d8082708 2 weeks ago 780MB
<none> <none> db410ee70b47 2 weeks ago 780MB
<none> <none> 0fd3a340edc1 2 weeks ago 780MB
local_image_camera latest 5275c083b1d4 2 weeks ago 263MB
<none> <none> df02bb68f2b2 2 weeks ago 263MB
<none> <none> 10ea1ee21552 2 weeks ago 263MB
<none> <none> 07dc2bc201f8 2 weeks ago 263MB
<none> <none> 8f5ecbac07cd 2 weeks ago 263MB
local_image_redis-websocket latest 068338f4fde7 2 weeks ago 692MB
<none> <none> ffb99daad0fb 2 weeks ago 692MB
<none> <none> f878d5b46a44 2 weeks ago 692MB
<none> <none> dbdcc1ca3141 2 weeks ago 692MB
<none> <none> 041aca63166f 2 weeks ago 692MB
local_image_redis latest 3f521293f8ae 2 weeks ago 76.9MB
<none> <none> 054fcf09b967 2 weeks ago 76.9MB
<none> <none> 9242bb41bbba 2 weeks ago 76.9MB
<none> <none> 1c35cf415059 2 weeks ago 76.9MB
<none> <none> 29ab01651c35 2 weeks ago 256MB
<none> <none> 6d855ef70ab3 2 weeks ago 76.9MB
<none> <none> 6f22972c9614 2 weeks ago 691MB
<none> <none> 61a89ce99cb9 2 weeks ago 256MB
<none> <none> 0d0b09c7b03b 2 weeks ago 256MB
<none> <none> 59747c031fe8 2 weeks ago 691MB
<none> <none> 907447e14b01 2 weeks ago 691MB
<none> <none> 46110ec6e085 2 weeks ago 256MB
balenalib/fincm3-python 3.7-run 950ecca7c6a1 2 weeks ago 250MB
balenalib/fincm3-python 3.7-build f32a6df1f337 2 months ago 684MB
balenalib/fincm3-python 3.8-run 205778f82f27 2 months ago 256MB
balenalib/fincm3-python 3.8-build 201fd4f0c698 2 months ago 691MB
arm32v7/redis latest df0b976e6584 3 months ago 76.9MB
balenalib/fincm3-debian buster-run c363e2794ff1 3 months ago 160MB
local_image_bluetooth latest 188a20699f51 4 months ago 212MB
balena/armv7hf-supervisor v12.7.0 ac4cb891dc5b 5 months ago 62.8MB
balenalib/fincm3-python 3.8-bullseye-build d390559467c7 19 months ago 694MB
balena-healthcheck-image latest 851163c78e4a 21 months ago 4.85kB
balena/arm-balena-multibuild-scripts latest cfd5d368de72 2 years ago 34.7MB
I’ve spoken with my colleague @pipex about this, and perhaps I can provide some context about the current local mode implementation in the Supervisor. Upon receiving the command from the dashboard to enter local mode, the Supervisor on the device takes a snapshot of engine images, containers, networks, and volumes. Thereafter, the Supervisor does not automatically prune images, or anything else for that matter. When the device switches out of local mode, the Supervisor reads the engine snapshot and restores the device to the state it was in before entering local mode, with the exception of volumes, which remain untouched for the device’s entire lifecycle unless explicitly pruned via a Supervisor API endpoint (or balena volume prune | rm). This behaviour is intentional, and seems like it would be fine in your case if you’re willing to run balena system prune every now and then.
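For completeness, the Supervisor API call mentioned above looks roughly like this (a sketch of the documented v1 purge endpoint, which is what the dashboard’s purge button uses; it assumes the request is made from a container carrying the io.balena.features.supervisor-api label, so the environment variables below are injected, and <appId> is a placeholder for your application ID):

```bash
# Purge /data (i.e. volume data) for the given app via the Supervisor API
curl -X POST \
  --header "Content-Type: application/json" \
  "$BALENA_SUPERVISOR_ADDRESS/v1/purge?apikey=$BALENA_SUPERVISOR_API_KEY" \
  --data '{"appId": <appId>}'
```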
What may be more interesting is that device memory increases over days of development in local mode. I’m wondering what kinds of processes are eating into the Fin’s memory – could you take a look at that the next time you notice memory increasing past levels you’re comfortable with? Thanks!
Hi @cywang117, thanks for looking into this. That makes total sense and is great to be aware of.
I will keep digging and post back when I have more data on where the memory increases are coming from. I had used balena system prune while we were debugging this issue and did not see a significant memory reduction; it was only when using purge from the dashboard that I saw the memory utilization go down. I’m regularly watching balena stats, and memory usage there is well under what the dashboard reports, but there’s definitely more investigation work on my end to properly reproduce and debug this. And as you said, since production and local mode are handled differently by the Supervisor, there may be something going on there as well, since we don’t often take our dev units out of local mode.
I will keep an eye on it and report back here if I find something. Thanks for all your help!
@cywang117 I’ve run into a similar memory bloat issue while working with only a fraction of the containers I’m developing, most being commented out in my docker-compose file. The discrepancy I notice most is that the balena dashboard reports 498 MB of memory usage while balena stats shows only a few containers running at less than 50 MB each.
I tried looking on the forum and found a few mentions of this issue from 2020; however, the fix appears to be upgrading to Supervisor 12.0.0+, and I am currently on 12.11.2. Should I treat balena stats as ground truth, or is the dashboard surfacing some type of memory leak outside of what balena stats shows?
I assume the services you see when running balena ps are only the ones you expect? If there were a running container for each of those images, the cumulative memory use would be high. If the prune is working it’s likely not that, but just making sure.
The memory thing could be the difference between free memory and used memory. Next time there is a memory spike, run top from the root of the host OS (not inside a container) and you will get system-level stats on process and memory use. The top few lines will show you the difference.
Long story short, systems are designed to use all your memory. When you get a spike in use, usage will gradually build up and up, keeping contents in memory in case they are used again. If you need the RAM for something else, though, it will dump the old stuff to make room for the new. So you get used RAM (what is occupied) and free RAM (what is not used, plus what can be dumped to make room for new stuff). It’s the second one that matters.
I may have got some of the terms mixed up (I don’t think top uses the term “used RAM”), but you get the idea. Post the output next time it spikes and have a little Google about it.
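For example, from the host OS during a spike you can read the raw kernel figures directly (a sketch; MemAvailable is the number that matters, since it includes cache the kernel can drop on demand):

```bash
# Run on the host OS, not inside a container
grep -E '^(MemTotal|MemFree|MemAvailable|Buffers|Cached):' /proc/meminfo
```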
Not sure what figure the Supervisor is reading for the dashboard; it could be explained by the same thing.
Hi @alexanderkjones and @maggie0002, here is how the used-memory calculation is performed for the dashboard: the Supervisor tries to calculate the actual memory usage of the whole OS, discounting memory used for cache and buffers. This is also how other utilities like htop do it.
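Roughly speaking, the figure comes out of a calculation like the one below (a simplified sketch reading /proc/meminfo, not the Supervisor’s actual code):

```bash
# htop-style used memory: total - free - buffers - cache (values in kB)
awk '/^MemTotal:/     {total = $2}
     /^MemFree:/      {free = $2}
     /^Buffers:/      {buffers = $2}
     /^Cached:/       {cached = $2}
     /^SReclaimable:/ {sreclaim = $2}
     END {printf "used: %.0f MiB\n", (total - free - buffers - cached - sreclaim) / 1024}' /proc/meminfo
```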
On the other hand, balena stats only shows the memory usage of running containers, so the sum across all the containers should still be less than the figure seen on the dashboard. It seems that Docker also performs the same calculation (subtracting cache) in order to come up with an actual figure.
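To illustrate where that per-container number comes from, here is a rough cgroup v1 sketch (the paths assume the default cgroupfs layout, the exact field subtracted varies between engine versions, and <container-id> is a placeholder):

```bash
CID=<container-id>  # full container ID, e.g. from `balena ps --no-trunc`
usage=$(cat /sys/fs/cgroup/memory/docker/$CID/memory.usage_in_bytes)
cache=$(awk '/^total_inactive_file /{print $2}' /sys/fs/cgroup/memory/docker/$CID/memory.stat)
echo "$((usage - cache)) bytes"
```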
Also, depending on your configuration, you should expect a difference between the figure shown on the dashboard and the data seen on the device, as the data shown on the dashboard is not real time.
As @maggie0002 points out, this figure will differ from the values shown in top, as systems by default try to use as much memory as possible, and the used memory shown by top is not representative of the actual memory used by running processes.