Unable to stop container with swap file

I am having trouble with a container that manages the swap file for our app. I understand that swapfiles are not natively supported by balena, but we require additional memory capacity for running TensorFlow apps on our Jetson Nano-based device, and the internal RAM is falling short with all of the other containers that we also need to run, resulting in frequent crashes of the container / host. This is fixed by adding a simple swap container, which I built by following the helpful example provided in another similar thread on this forum (https://github.com/muchlearning/make-swap/).

The issue is that whenever I deploy a new release, the host is not able to stop the swap container. The specific error message I get is as follows:

Error removing dead container 'sys-swap sha256:<ID>' due to '(HTTP code 500) server error - container <ID>: driver "overlay2" failed to remove root filesystem: unlinkat /var/lib/docker/overlay2/<ID>/diff/var/lib/swap.img: operation not permitted '

I’m assuming the swap file has some kind of lock on it that is preventing the container from being stopped. Has anyone seen this before / have any ideas on how I can get around it? One thought I had was to run a script on shutdown of the container that executes the swapoff command, but it seems that catching the SIGTERM isn’t the most straightforward approach, so I would like to avoid it if possible.

Hi there, we do follow the SIGTERM => SIGKILL approach when restarting containers, so all you would have to do is implement a SIGTERM hook in your script/app to disable swap.

For example, in a bash script, you could use this approach to trap the signal and carry out clean-up.

I’ve been playing around with this with no luck… the container is a simple busybox build, and the start script is as follows:

#!/bin/sh

delete_swap() {
   echo "Caught SIGTERM, deleting swap file..."
   swapoff $SWAPFILE
   rm $SWAPFILE
   rm /etc/fstab
   echo "Swap file deleted"
   exit $?
}

trap delete_swap TERM

./create_swap.sh

sleep infinity

but for some reason it’s not running the delete_swap function on shutdown (I’m not seeing the echo in the logs, and I get the same error that the swap file is preventing the container from being killed). Any ideas what the issue could be?

Hi there - we’re checking this with our team, and will get back to you shortly. One thing to check in the meantime: if you’re not seeing the echo in the logs, I’d wonder if you are catching the signal properly. This article may help you, and it should be easy to debug this with a simpler script (skipping the swapon/swapoff steps, for example).

One other thing: I understand that your application needs more memory than the Jetson Nano provides. As you’ve said, we don’t support swapfiles on devices; this is because of the problems you’re likely to encounter if you do use them:

  • The wear on your SD card will be increased, and you risk significantly shortening the life of your storage.
  • Our supervisor generally assumes that it is free to restart containers as needed. There are ways around this, but these are meant for use during short, critical periods – not the entire time the device is up.
  • When your container traps SIGTERM and runs swapoff, it will take time to evacuate the contents of the swapfile – this may be longer than the normal timeout when restarting containers. At that point, the supervisor will switch to using SIGKILL, and you’ll be back where you started: balenaEngine is unable to remove the container, because the swapfile is still in use by the host operating system and so the container’s filesystem can’t be cleaned up.
  • When running swapoff, the evacuation of the swapfile will also add to the memory pressure on your device. It’s entirely possible that your device will not have enough memory, and the OOM-killer will be invoked. This is liable to leave your device in a very bad state – possibly requiring a power cycle to recover.

For these reasons, we strongly recommend against using swap. Your best option is to either reduce memory usage, or to use a device that is a better match for your software’s requirements.

All the best,
Hugh

Thanks for the tips and links - I’ve read through them and tried everything suggested - including using ENTRYPOINT exec /usr/src/app/start.sh instead of CMD ["/bin/sh", "/usr/src/app/start.sh"] that I was previously using to launch the startup script. I even removed everything but the echo line from the delete_swap function to see if it works - which it doesn’t. It seems the busybox container is just not catching the signal for some reason. Has anyone done this successfully in the past? If so I would really appreciate any feedback on what I am missing in the script / why it’s not catching the signal.

Hi,
looking at your example, I think the issue is with the last line: sleep infinity. The shell blocks while the foreground sleep process runs and only runs the trap handler once that process exits, which here never happens. You should be able to use one of the following to get the behavior you expect:

...
sleep infinity &
wait

or

...
while true
do
  sleep 1
done

I also think $SWAPFILE is undefined in your example so that is going to trigger another set of errors.
Please let us know whether this helps.
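
To make the first variant concrete, here is a minimal sketch of the "sleep infinity & wait" pattern as a self-terminating demo. It is not the original start.sh: the SWAPFILE default is a hypothetical path, the serve function and the demo lines at the bottom are made up for illustration, and the real swapoff call is left commented out so the sketch runs without privileges.

```shell
#!/bin/sh
# Sketch only: shows why "sleep infinity & wait" lets the trap fire.
# SWAPFILE default is an assumed path, not taken from the original script.
SWAPFILE="${SWAPFILE:-/var/lib/swap.img}"

cleanup() {
    echo "Caught SIGTERM, disabling swap on $SWAPFILE..."
    # Real cleanup would go here (needs privileges), e.g.:
    # swapoff "$SWAPFILE" && rm -f "$SWAPFILE"
    kill "$sleep_pid" 2>/dev/null   # do not leave the sleep process behind
    exit 0
}

# Hypothetical stand-in for the container's main script.
serve() {
    trap cleanup TERM
    sleep infinity &      # run sleep in the background...
    sleep_pid=$!
    wait "$sleep_pid"     # ...and idle in "wait", which a trapped signal interrupts
}

# Demo: start the service, send it SIGTERM, and report how it exited.
serve &
serve_pid=$!
sleep 1                   # give the background shell time to install the trap
kill -TERM "$serve_pid"
wait "$serve_pid"
status=$?
echo "service exited with status $status"
```

In a real container only the body of serve would run, as PID 1, and the supervisor would deliver the SIGTERM. The shell also has to be the process that receives the signal, which is why launching the script through exec matters.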

hi @drcnyc, just faced a similar issue. Maybe it can help?

Hi there, just to add some info: due to this issue, the supervisor might be killed before the app containers, so anything you log after catching the SIGTERM signal might not show up in the logs. You shouldn’t judge whether the signal was caught based on the logs alone. Let us know if Tomasso or Michal’s suggestions work for you.

@mtoman your suggestion worked - thank you! I used the sleep infinity & wait approach.

@tgirotto I had actually already set up the Dockerfile to use ENTRYPOINT, so I suspect this is also necessary to make the solution work. Thank you for pointing this out; hopefully this will all help the next person who gets stuck on something like this.

@sradevski It actually does print out the messages to the logs on restart of the container, but good to know that it might not in some cases.