We’re experiencing a growing local image cache when building in local mode. We are currently developing on a balenaFin v1.1 running balenaOS 2.80.3 and Supervisor 12.7.0.
We read in this post that the Supervisor should be pruning images, but our cache keeps growing, and we are forced to either run balena system prune or eventually purge all data from the device via the balenaCloud dashboard.
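For reference, this is the manual cleanup we currently resort to on the device (a sketch; as far as we understand, balenaEngine mirrors the Docker CLI):

```bash
# List the dangling <none>:<none> layers left behind by rebuilds
balena images --filter "dangling=true"

# Remove stopped containers, dangling images, and unused networks
balena system prune
```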
I’m not sure whether this is an error; the balena folks are better placed to know. The behaviour you describe makes sense to me, though. In local mode I don’t think I would want auto-purging of my images or build cache; I would want them there for speedier rebuilds on my next push (local mode being for development), and would then manually purge them when I wanted to free up space.
If it were occurring outside of local mode, on a production image for example, then I would be concerned.
As @maggie0002 says, it makes sense that images would be retained in local mode, as this is a development environment where speed is at a premium over storage capacity. From your initial post, it sounds like you are switching between local mode and managed mode frequently on the Fin that you’re developing on. In addition, you’re looking to purge images that are used only in local mode when you switch out of local mode. Finally, you’re running a development OS variant. Are all these assumptions correct?
I believe the post you linked to is in reference to this issue: On leaving local mode the supervisor should clean up dangling images · Issue #993 · balena-os/balena-supervisor · GitHub. Could you paste example output from balena images, and also tell us what sorts of base images you’re using during local mode? This would help me determine what’s actually going on with the device.

Also, the Supervisor does not remove any volumes, user-created (“unmanaged”) or otherwise, so if the cache filling up is due to volume creation, a manual prune or a Supervisor API call would be required. If the cache fill-up is indeed due to volumes, this design is intended to preserve user data.
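In the meantime, a quick way to check whether images or volumes are behind the growth (a sketch, assuming balenaEngine exposes the same subcommands as the Docker CLI):

```bash
# Disk usage broken down by images, containers, local volumes, and build cache
balena system df

# List volumes; the Supervisor never removes these automatically
balena volume ls
```

Thanks, and let us know!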
@cywang117 yes on all fronts for your first set of questions: we are using the dev OS variant and switching back and forth between local mode and managed mode.
The specific issue we’ve been seeing is that the Fin bogs down during development, and memory continues to increase as we live-push over several days of testing. We noticed that the “build from cache” image total continued to increase, and the memory usage shown on the balena dashboard continues to grow as we keep using live push.
Ultimately, this results in us losing connection to the device regularly and having to purge all data from the dashboard before being able to continue. Shortly after a purge, memory usage sits at only around 50% after the first build, but as we continue to live-push it increases to 95% over a couple of days of development.
Unfortunately, we’ve lost the ability to see the number of cached images when running balena push -m --debug. It used to tell us the number of images in the cache before building, but for some reason the output no longer displays anything different from a straight balena push.
Here is a fresh pull using balena images -a. Also, as I’m describing this, I think our root issue is around memory, and that’s likely not affected by cached images, correct? Thanks for all your help.
root@2a38113:~# balena images -a
REPOSITORY TAG IMAGE ID CREATED SIZE
local_image_ble-gatt-server latest 0dce54fc27b8 18 hours ago 324MB
<none> <none> 539ad52570aa 18 hours ago 324MB
<none> <none> 64e315cbbc09 18 hours ago 324MB
<none> <none> dea73fa38431 18 hours ago 324MB
<none> <none> 6ec2ee931441 18 hours ago 734MB
<none> <none> 922642ac9006 18 hours ago 734MB
<none> <none> 432afd1aa37e 18 hours ago 734MB
<none> <none> 2931f5a436bb 18 hours ago 734MB
<none> <none> 9d35fa4b4514 18 hours ago 734MB
<none> <none> 93232fa90cc7 18 hours ago 734MB
<none> <none> 10c03e732ab0 18 hours ago 324MB
<none> <none> 6707719c4305 18 hours ago 324MB
<none> <none> 6581eec858a0 18 hours ago 324MB
<none> <none> 3b5ff8057013 18 hours ago 324MB
<none> <none> 554bfdcd9b6c 19 hours ago 324MB
<none> <none> 75645582f199 19 hours ago 324MB
<none> <none> 5eca70d75a9c 19 hours ago 324MB
<none> <none> 63c07f40b714 19 hours ago 324MB
<none> <none> 9372a8e9607a 19 hours ago 324MB
<none> <none> 02cf0298162f 19 hours ago 324MB
<none> <none> 27595cab6c3b 19 hours ago 324MB
<none> <none> 2767a94650dc 19 hours ago 324MB
<none> <none> 02e18bfe9320 20 hours ago 324MB
<none> <none> cb10dce3a772 20 hours ago 324MB
<none> <none> 1d54538f6dee 20 hours ago 324MB
<none> <none> 55bfca4c7b18 20 hours ago 324MB
<none> <none> eb3b184b2e20 20 hours ago 267MB
<none> <none> e9d984cc7134 20 hours ago 263MB
<none> <none> be490c93f20b 20 hours ago 268MB
<none> <none> 710fc16fe2ea 20 hours ago 261MB
<none> <none> 2aa76855309b 20 hours ago 263MB
<none> <none> ae7edbedc73a 20 hours ago 267MB
<none> <none> 525f692079e1 20 hours ago 268MB
<none> <none> 2e22eaa6d9f9 20 hours ago 267MB
<none> <none> 5a7467c7a364 20 hours ago 261MB
<none> <none> dfb6db36eaec 20 hours ago 263MB
<none> <none> ed2f3bdf0a1b 20 hours ago 261MB
<none> <none> 0daab0256416 20 hours ago 268MB
<none> <none> 95f0d9331932 20 hours ago 796MB
<none> <none> 7c5c57302b11 20 hours ago 267MB
<none> <none> 80d902902e86 20 hours ago 263MB
<none> <none> 2c430d40776c 20 hours ago 268MB
<none> <none> 185ef88d5a05 20 hours ago 261MB
<none> <none> 28ae82bf8623 20 hours ago 796MB
<none> <none> 0c80631029bd 20 hours ago 796MB
<none> <none> 0d747bea9a36 20 hours ago 796MB
local_image_hal latest b89f9142fc23 20 hours ago 796MB
<none> <none> 3301d4458e3c 20 hours ago 796MB
<none> <none> 329625c0f223 20 hours ago 796MB
<none> <none> 044c1b4e55de 20 hours ago 796MB
<none> <none> b83563c021cf 20 hours ago 796MB
<none> <none> f3f5a372b672 20 hours ago 796MB
<none> <none> 9e49c7d8b620 20 hours ago 796MB
<none> <none> ab41a9baaf25 20 hours ago 796MB
<none> <none> 097302411b4b 21 hours ago 796MB
<none> <none> f138cbbbf1d9 21 hours ago 796MB
<none> <none> b8ad2b6df627 21 hours ago 796MB
<none> <none> 3f72dd61e03b 21 hours ago 796MB
<none> <none> 98a4744151df 21 hours ago 796MB
<none> <none> 2d0821140ae4 21 hours ago 796MB
<none> <none> beea7a8a1716 21 hours ago 796MB
<none> <none> ca199f160bd3 21 hours ago 796MB
<none> <none> 97be6c3a7bfc 21 hours ago 324MB
<none> <none> 0b6f2e7c960a 21 hours ago 324MB
<none> <none> f323ee81718f 21 hours ago 324MB
<none> <none> e09b774226fc 21 hours ago 324MB
<none> <none> 9c7ee3cfb7fb 21 hours ago 324MB
<none> <none> 90ed8453d0d2 21 hours ago 324MB
<none> <none> bc551c2baeea 21 hours ago 250MB
<none> <none> 915aed6e7529 21 hours ago 796MB
<none> <none> e387b3eb38dd 21 hours ago 796MB
<none> <none> 68ed10db0300 21 hours ago 796MB
<none> <none> bd0ca4c78290 21 hours ago 307MB
<none> <none> 4b71e891ca2b 21 hours ago 796MB
<none> <none> 3f21824ba82a 11 days ago 796MB
<none> <none> 321b5fb6d843 11 days ago 796MB
<none> <none> 2940fa5b7fb9 11 days ago 796MB
<none> <none> a4a1fdd51fbe 11 days ago 796MB
<none> <none> 64f5417f0953 11 days ago 796MB
<none> <none> 72444ba37bb5 11 days ago 796MB
<none> <none> a11accbc9c94 11 days ago 796MB
<none> <none> d276bdee3e7a 11 days ago 796MB
local_image_map-service latest 0886c9ea3e8a 12 days ago 268MB
<none> <none> 39b82ef11ce6 12 days ago 268MB
local_image_core latest 6fe1208a5d58 12 days ago 261MB
<none> <none> 6a866cd4a78d 12 days ago 261MB
<none> <none> ccb7ac13f7fd 12 days ago 268MB
<none> <none> 1463bb3bfe91 12 days ago 268MB
<none> <none> d2c0f14bc50e 12 days ago 261MB
<none> <none> 975c079129cf 12 days ago 261MB
local_image_scheduler latest a157ae8caaf8 2 weeks ago 263MB
<none> <none> f0ee9da4a594 2 weeks ago 263MB
<none> <none> 888c296e0cf8 2 weeks ago 263MB
local_image_cmd-shell latest a955aa53a360 2 weeks ago 267MB
<none> <none> a371a017bfb8 2 weeks ago 263MB
<none> <none> febceaa2adb4 2 weeks ago 267MB
<none> <none> e22bceccf1f7 2 weeks ago 267MB
<none> <none> 68d731dd341a 2 weeks ago 267MB
<none> <none> 53b00658ad76 2 weeks ago 267MB
local_image_supervisor latest 413a3c6d3240 2 weeks ago 748MB
<none> <none> cba30dd17702 2 weeks ago 748MB
<none> <none> 65c516bcdf3d 2 weeks ago 748MB
<none> <none> c15e9105c49f 2 weeks ago 748MB
<none> <none> ac18d2c15dd1 2 weeks ago 748MB
<none> <none> b5308b846c86 2 weeks ago 730MB
<none> <none> 3cec7dd0041a 2 weeks ago 730MB
<none> <none> 848569fff94f 2 weeks ago 730MB
<none> <none> 344465121c64 2 weeks ago 730MB
<none> <none> 1e66bb47d232 2 weeks ago 263MB
<none> <none> 581a2b0daed3 2 weeks ago 261MB
<none> <none> 246f32580f43 2 weeks ago 256MB
<none> <none> 036c8532630d 2 weeks ago 256MB
<none> <none> e20613b55db0 2 weeks ago 256MB
<none> <none> cd4d0d1549a0 2 weeks ago 256MB
<none> <none> 3fb036afa55e 2 weeks ago 256MB
<none> <none> 9bee2d2ba94b 2 weeks ago 256MB
<none> <none> c565a9e5a946 2 weeks ago 684MB
<none> <none> b9b545aeb474 2 weeks ago 795MB
<none> <none> 8f7378a5b2fe 2 weeks ago 307MB
local_image_controller latest 428e1c3f31cc 2 weeks ago 307MB
<none> <none> ff06bf608721 2 weeks ago 307MB
<none> <none> bcc523ba6179 2 weeks ago 307MB
<none> <none> 7009d0989c2e 2 weeks ago 307MB
<none> <none> 1cba93ebb2b5 2 weeks ago 307MB
<none> <none> 8ace330ed48a 2 weeks ago 307MB
<none> <none> ef985c24637a 2 weeks ago 780MB
<none> <none> 8de517ba0e9d 2 weeks ago 780MB
<none> <none> cd64d8082708 2 weeks ago 780MB
<none> <none> db410ee70b47 2 weeks ago 780MB
<none> <none> 0fd3a340edc1 2 weeks ago 780MB
local_image_camera latest 5275c083b1d4 2 weeks ago 263MB
<none> <none> df02bb68f2b2 2 weeks ago 263MB
<none> <none> 10ea1ee21552 2 weeks ago 263MB
<none> <none> 07dc2bc201f8 2 weeks ago 263MB
<none> <none> 8f5ecbac07cd 2 weeks ago 263MB
local_image_redis-websocket latest 068338f4fde7 2 weeks ago 692MB
<none> <none> ffb99daad0fb 2 weeks ago 692MB
<none> <none> f878d5b46a44 2 weeks ago 692MB
<none> <none> dbdcc1ca3141 2 weeks ago 692MB
<none> <none> 041aca63166f 2 weeks ago 692MB
local_image_redis latest 3f521293f8ae 2 weeks ago 76.9MB
<none> <none> 054fcf09b967 2 weeks ago 76.9MB
<none> <none> 9242bb41bbba 2 weeks ago 76.9MB
<none> <none> 1c35cf415059 2 weeks ago 76.9MB
<none> <none> 29ab01651c35 2 weeks ago 256MB
<none> <none> 6d855ef70ab3 2 weeks ago 76.9MB
<none> <none> 6f22972c9614 2 weeks ago 691MB
<none> <none> 61a89ce99cb9 2 weeks ago 256MB
<none> <none> 0d0b09c7b03b 2 weeks ago 256MB
<none> <none> 59747c031fe8 2 weeks ago 691MB
<none> <none> 907447e14b01 2 weeks ago 691MB
<none> <none> 46110ec6e085 2 weeks ago 256MB
balenalib/fincm3-python 3.7-run 950ecca7c6a1 2 weeks ago 250MB
balenalib/fincm3-python 3.7-build f32a6df1f337 2 months ago 684MB
balenalib/fincm3-python 3.8-run 205778f82f27 2 months ago 256MB
balenalib/fincm3-python 3.8-build 201fd4f0c698 2 months ago 691MB
arm32v7/redis latest df0b976e6584 3 months ago 76.9MB
balenalib/fincm3-debian buster-run c363e2794ff1 3 months ago 160MB
local_image_bluetooth latest 188a20699f51 4 months ago 212MB
balena/armv7hf-supervisor v12.7.0 ac4cb891dc5b 5 months ago 62.8MB
balenalib/fincm3-python 3.8-bullseye-build d390559467c7 19 months ago 694MB
balena-healthcheck-image latest 851163c78e4a 21 months ago 4.85kB
balena/arm-balena-multibuild-scripts latest cfd5d368de72 2 years ago 34.7MB
I’ve spoken with my colleague @pipex about this, and perhaps I can provide some context about the current local mode implementation in the Supervisor. Upon receiving the command from the dashboard to enter local mode, the Supervisor on the device takes a snapshot of engine images, containers, networks, and volumes. Thereafter, the Supervisor does not automatically prune images, or anything else for that matter. When the device switches out of local mode, the Supervisor reads the engine snapshot and restores the device to the state it was in before entering local mode, with the exception of volumes, which remain untouched for the device’s entire lifecycle unless explicitly pruned via a Supervisor API endpoint (or balena volume prune | rm). This behaviour is intentional, and seems like it would be fine in your case if you’re willing to run balena system prune every now and then.
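For completeness, the Supervisor API call mentioned above looks roughly like this (a sketch of the documented v1 purge endpoint, which is what the dashboard’s purge button uses; it assumes the request is made from a container carrying the io.balena.features.supervisor-api label, so the environment variables below are injected, and <appId> is a placeholder for your application ID):

```bash
# Purge /data (i.e. volume data) for the given app via the Supervisor API
curl -X POST \
  --header "Content-Type: application/json" \
  "$BALENA_SUPERVISOR_ADDRESS/v1/purge?apikey=$BALENA_SUPERVISOR_API_KEY" \
  --data '{"appId": <appId>}'
```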
What may be more interesting is that device memory increases over days of development in local mode. I’m wondering what kinds of processes are eating into the Fin’s memory – could you take a look at that the next time you notice memory increasing past levels you’re comfortable with? Thanks!
Hi @cywang117, thanks for looking into this. That makes total sense and is great to be aware of.
I will keep digging and post back when I have more data on where the memory increases are coming from. I had used balena system prune while we were debugging this issue and did not see a significant memory reduction; it was only when using purge from the dashboard that I saw the memory utilization go down. I’m regularly watching balena stats, and memory usage there is well under what the dashboard reports, but there’s definitely more investigation work on my end to properly reproduce and debug this. And as you said, since production and local mode are handled differently by the Supervisor, there may be something going on there as well, since we don’t often take our dev units out of local mode.
I will keep an eye on it and report back here if I find something. Thanks for all your help!
@cywang117 I’ve run into a similar memory bloat issue while working with only a fraction of the containers I’m developing, most being commented out in my docker-compose file. The discrepancy I notice most is that the balena dashboard reports 498 MB of memory usage while balena stats shows only a few containers running at less than 50 MB each.
I tried looking on the forum and found a few mentions of this issue from 2020; however, the fix appears to be upgrading to Supervisor 12.0.0+, and I am currently on 12.11.2. Should I treat balena stats as ground truth, or is the dashboard surfacing some type of memory leak outside of what balena stats shows?
I assume the services you see when running balena ps are only the ones you expect? If there were a running container for each of those images, the cumulative memory use would be high. If the prune is working it’s likely not that, but just making sure.
The memory thing could be the difference between free memory and used memory. Next time there is a memory spike, run top from the root of the host OS (not inside a container) and you will get system-level stats on process and memory use. The top few lines will show you the difference.
Long story short, systems are designed to use all your memory. When you get a spike in use, usage will gradually build up and up, keeping contents in memory in case they are used again. If you need the RAM for something else, though, it will dump the old stuff to make room for the new. So you get used RAM (what is occupied) and free RAM (what is not used, plus what can be dumped to make room for new stuff). It’s the second one that matters.
I may have got some of the terms mixed up (I don’t think top uses the term “used RAM”), but you get the idea. Post the output next time it spikes and have a little Google about it.
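For example, from the host OS during a spike you can read the raw kernel figures directly (a sketch; MemAvailable is the number that matters, since it includes cache the kernel can drop on demand):

```bash
# Run on the host OS, not inside a container
grep -E '^(MemTotal|MemFree|MemAvailable|Buffers|Cached):' /proc/meminfo
```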
Not sure what figure the Supervisor is reading for the dashboard; it could be explained by the same thing.
Hi @alexanderkjones and @maggie0002, here is how the used-memory calculation is performed for the dashboard: the Supervisor tries to calculate the actual memory usage of the whole OS, discounting memory used for cache and buffers. This is also how other utilities like htop do it.
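Roughly speaking, the figure comes out of a calculation like the one below (a simplified sketch reading /proc/meminfo, not the Supervisor’s actual code):

```bash
# htop-style used memory: total - free - buffers - cache (values in kB)
awk '/^MemTotal:/     {total = $2}
     /^MemFree:/      {free = $2}
     /^Buffers:/      {buffers = $2}
     /^Cached:/       {cached = $2}
     /^SReclaimable:/ {sreclaim = $2}
     END {printf "used: %.0f MiB\n", (total - free - buffers - cached - sreclaim) / 1024}' /proc/meminfo
```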
On the other hand, balena stats only shows the memory usage of running containers, so the sum across all the containers should still be less than the figure seen on the dashboard. It seems that Docker also performs the same calculation (subtracting cache) in order to come up with an actual figure.
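To illustrate where that per-container number comes from, here is a rough cgroup v1 sketch (the paths assume the default cgroupfs layout, the exact field subtracted varies between engine versions, and <container-id> is a placeholder):

```bash
CID=<container-id>  # full container ID, e.g. from `balena ps --no-trunc`
usage=$(cat /sys/fs/cgroup/memory/docker/$CID/memory.usage_in_bytes)
cache=$(awk '/^total_inactive_file /{print $2}' /sys/fs/cgroup/memory/docker/$CID/memory.stat)
echo "$((usage - cache)) bytes"
```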
Also, depending on your configuration, you should expect a difference between the figure shown on the dashboard and the data seen on the device, as the data shown on the dashboard is not real time.
As @maggie0002 points out, this figure will differ from the values shown in top, as systems by default try to use as much memory as possible, and the used memory shown by top is not representative of the actual memory used by running processes.