I do NOT see the SIGTERM when I stop the Raspberry Pi.
We have a STOP button connected to the Raspberry Pi GPIO. We use the dtoverlay to have the kernel start the shutdown process when that signal is pulled low.
Did that answer your question…
(If I click exit on the dashboard or directly send SIGTERM, I can see the process shut down).
When we stop the kernel with the STOP button, I do NOT see the SIGTERM.
Great, thank you, this clears up a huge amount of confusion. Yes, I think there’s going to be an issue with the way this occurs: if it doesn’t actually go through the shutdown properly, then systemd won’t start cleaning up.
balenaOS uses systemd, which does not treat SIGTERM as a shutdown request, but instead looks for SIGRTMIN+3 to start the shutdown process. My working hypothesis, depending on how the shutdown is actually performed, is that you’re just watching the kernel start its shutdown procedure, systemd does nothing because it doesn’t respond to the signal, and then the entire thing just gets rebooted in its current state.
The best way to do this is to have whichever service responds to the GPIO use DBus to communicate directly with systemd on the host and request a shutdown. This would then go through the correct shutdown procedure, sending the right signals to the processes, including balenaEngine, which will in turn issue the correct signals to each of the service containers.
Sorry, it’s the first time I’ve seen this ticket, and looking back it seems to have diverged massively, which is why I was trying to find the crux of the problem.
I’m really sorry you’ve had these issues. We carry out support rotation at balena, and every engineer takes part. Looking back, I think part of the problem was that there also seemed to be an issue with CPU resource utilisation, which became a bit of a rabbit hole, and the actual underlying signal issue was unfortunately lost in communication.
I have now flagged this thread internally to our product and support heads, as I’d like to carry out a post-mortem and determine why this has taken so long to resolve and work towards ensuring this situation does not happen again.
I can only apologise again for your frustration.
If we can be of any further help with this, please let us know. If you do need further help and you’d like to ping me directly, I will be more than happy to step in and answer questions for you.
Thank you Heds, I truly appreciate it. I have to admit my confidence in support was almost gone.
Especially when we have a significant number of machines in other countries, I was not feeling like your team was going to be able to help us out much …
We don’t have a dedicated support number for balena support, as it goes through many different systems.
You’re using gpio-shutdown in the dtoverlay? I don’t have any experience with doing this, though from what I understand it should shut down systemd cleanly, so my first worry is why it sounds like it isn’t. Unfortunately I’d need to carry out some research into using this to understand why it doesn’t work.
From my own personal viewpoint, I’d carry this out in a slightly different way, and use one of the application services to listen for the GPIO pin and then carry out the reboot:
Ensure that GPIO pins on the /sys interface are bound into a service along with the DBus socket from the host balenaOS. You can do this with the io.balena.features.dbus and io.balena.features.sysfs labels, as documented here: https://www.balena.io/docs/learn/develop/multicontainer/#labels .
I have noticed that in our docs we’re suggesting org.freedesktop.systemd1.Manager.Reboot which as far as I know performs an immediate reboot cycle (and doesn’t attempt to carry out the unit shutdown procedure). I think this will still result in SIGTERMs to the service containers, but a more elegant way is probably to use org.freedesktop.login1.Manager.Reboot which I think actually carries out a full shutdown procedure.
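To make the approach above a little more concrete, here is a minimal Python sketch of a service that polls a GPIO pin through sysfs and then asks logind on the host for a clean power action over DBus. Treat it as an illustration under assumptions rather than a tested recipe: the pin number and the helper names are hypothetical, the bus socket path assumes the io.balena.features.dbus label is set on the service, the pin is assumed to be exported already and not claimed by the gpio-shutdown overlay, and dbus-send is assumed to be installed in the container image.

```python
import os
import subprocess
import time

# Assumption: io.balena.features.dbus exposes the host bus socket at this path.
HOST_BUS = "unix:path=/host/run/dbus/system_bus_socket"
# Hypothetical pin; it must already be exported via /sys/class/gpio/export
# and must not be claimed by a kernel driver such as gpio-shutdown.
GPIO_VALUE_PATH = "/sys/class/gpio/gpio3/value"


def is_low(value_path):
    """Return True when the sysfs GPIO value file reads 0 (button pulled low)."""
    with open(value_path) as fh:
        return fh.read().strip() == "0"


def login1_command(method="Reboot"):
    """Build the dbus-send argv for a logind Reboot or PowerOff request."""
    if method not in ("Reboot", "PowerOff"):
        raise ValueError(f"unsupported method: {method}")
    return [
        "dbus-send", "--system", "--print-reply",
        "--dest=org.freedesktop.login1",
        "/org/freedesktop/login1",
        f"org.freedesktop.login1.Manager.{method}",
        "boolean:false",  # the 'interactive' argument: do not prompt
    ]


def request_power_action(method="Reboot"):
    """Send the request to logind on the host through the mounted bus socket."""
    env = dict(os.environ, DBUS_SYSTEM_BUS_ADDRESS=HOST_BUS)
    subprocess.run(login1_command(method), env=env, check=True)


def watch_button(value_path=GPIO_VALUE_PATH, method="PowerOff", poll_seconds=0.1):
    """Poll the pin and request the power action once it goes low."""
    while not is_low(value_path):
        time.sleep(poll_seconds)
    request_power_action(method)
```

Calling watch_button() from the service entrypoint would block until the button is pressed. Polling is used here only to keep the sketch short; an edge-triggered read would be lighter on the CPU.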
OK, at this point I am reluctant to repeat myself on this forum any longer. I have thus far wasted far too many hours iterating over many different “attempts” at getting this to work.
I have already stated my observations, I have answered every question. And I have done each of these more than once due to “your rotating” support.
I have been told by different people that this “will work”, that should “work”, only to have my observations and statements ignored.
I do not want to repeat myself again.
At this point, I have ZERO confidence I am going to get any useful support if/when we run into problems in the field.
Y’all have built a support system that refuses to provide much more efficient phone calls.
My recommendation to my boss is that we cut our losses and move to a different solution before we launch production and get stuck with this support.
But it is my boss’s call, so I will do what he says.
Sorry, I really do NOT mean to be an asshole, but just go over this thread and imagine how much money it must have cost my boss to get right back here, no further along than when I sent the very first support message regarding this problem…
Again, I understand your exasperation, but as I say, reading and parsing this thread has been very difficult. Whilst you said you were using the dtoverlay, I hadn’t seen anywhere where you’d explicitly said how the shutdown then occurs (which is why I asked about gpio-shutdown: some customers add their own circuitry for events like this, and I couldn’t assume you were using it). I’ve tried to use my knowledge of balena to suggest an alternative which I believe will work, unless the reason you’re using the overlay is specifically because you think the container or balenaEngine is getting into a situation where it no longer works.
I have modified my service to catch SIGTERM, log output to stdio and also write to a new file in the container layer. On running this POWER_OFF simulation I, like you, do not see SIGTERM signals in the container. This might be because it’s missing a udev rule, and if you set one up it might work as expected.
However, on trying a reboot with what we suggest in our docs:
I don’t see logging (which I expect, as the Supervisor is also terminated and the logging connection between the service container and the Supervisor is no longer valid), but I do see the file getting written, which confirms that SIGTERM is being sent to containers. This is why I’d highly recommend the method I proposed beforehand of examining the GPIO input to shut down the system.
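For anyone wanting to repeat this test, a SIGTERM-catching service along the lines described above could look roughly like the following Python sketch. It is not the exact service I used: the function names and the marker path are made up for illustration. The handler writes to stdout and also appends to a file, so there is still evidence of the signal if the log stream has already gone away.

```python
import os
import signal
import time


def install_sigterm_handler(marker_path):
    """Register a SIGTERM handler that logs to stdout and to marker_path."""

    def on_sigterm(signum, frame):
        # Unbuffered write so the message is not lost if the process exits soon.
        os.write(1, b"SIGTERM received\n")
        # Append to a file so there is a record even without the log stream.
        with open(marker_path, "a") as fh:
            fh.write(f"SIGTERM received at {time.time()}\n")

    signal.signal(signal.SIGTERM, on_sigterm)


def run(marker_path="/tmp/sigterm-received.log"):
    """Hypothetical entrypoint: install the handler, then idle forever."""
    install_sigterm_handler(marker_path)
    while True:
        time.sleep(1)
```

Calling run() as the container's main process and then checking for the marker file after a reboot shows whether the signal arrived. The handler deliberately does not exit, so the process is eventually killed if the shutdown continues.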
Interesting! Please humor me and do a search for “dtoverlay” in this thread! You will see it hopefully. This just proves my point regarding the “rotating” support not working.
Regarding GPIO shutdown, you are correct. That is what I’m using and it’s not working in the container. The alternative solution you provided is not a “clean” solution, since if my container is not running for whatever reason (doing an OTA, crashed, etc.) the system won’t work as designed. Therefore it is not acceptable in our application. And finally we are now in SYNC! My question from the very beginning is why is SIGTERM not being sent to containers after a KEY_POWER event and how do we fix it?
I have already flagged to our OS and device team this particular issue, so they can look at it and try and determine why this doesn’t work. Hopefully they can come back to you with some more information, although they are all based in Europe so this may be some time tomorrow.
To be clear, you actually said “I am using dt_overlays to program the RPI kernel to begin the shutdown process, including send SIGTERM signals to every container.” It was not immediately obvious to me from this that you were using gpio-shutdown, as you had not mentioned it and it’s not actually what gpio-shutdown does (it in fact fires a KEY_POWER event, which logind then captures to start the shutdown process).
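One way to check whether the KEY_POWER event is actually being generated when the button is pressed is to read the raw input device and look for it. The Python sketch below is only a diagnostic idea: the device path is a guess (the right /dev/input/event* node would need to be found on the device), the function names are hypothetical, and the container would need access to that device node. The event layout follows the kernel's struct input_event as two native longs for the timestamp, then type, code and value.

```python
import struct

EV_KEY = 0x01    # event type for key and button events
KEY_POWER = 116  # key code for the power key

# struct input_event: timeval (two native longs), u16 type, u16 code, s32 value
EVENT_FORMAT = "llHHi"
EVENT_SIZE = struct.calcsize(EVENT_FORMAT)


def is_power_key_press(raw):
    """Return True if the raw event bytes describe a KEY_POWER press."""
    _sec, _usec, ev_type, code, value = struct.unpack(EVENT_FORMAT, raw)
    return ev_type == EV_KEY and code == KEY_POWER and value == 1


def watch(device_path="/dev/input/event0"):
    """Print a line for every KEY_POWER press seen on the (assumed) device."""
    with open(device_path, "rb") as dev:
        while True:
            raw = dev.read(EVENT_SIZE)
            if len(raw) < EVENT_SIZE:
                break  # device went away or short read
            if is_power_key_press(raw):
                print("KEY_POWER press seen on", device_path, flush=True)
```

If watch() reports the press but no shutdown follows, that would suggest the event is being produced and the gap is on the logind side.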
As mentioned in my last message, it’s entirely possible this is down to a missing udev rule, which if set up may solve your problem (for example as referenced in this thread: https://www.raspberrypi.org/forums/viewtopic.php?t=185571#p1172933 ). You will have to set this rule in a service container to add it, and if this does work, it’s something we can look at adding back into the OS if for some reason we missed it (and as said, hopefully the OS folks will get back to you tomorrow).
I’m sorry my efforts have not led to the solution you currently require, although I have tried to propose something else that, having tested myself, I believe would work.
Whilst I also understand, to a degree, your concerns about containers, these still run whilst OTA updates are occurring, until all of the new images have been saved and the Supervisor is ready to start the new version of the application. Should a container terminate, it will also be restarted. In situations where a container keeps crashing and restarting without ever executing successfully, sending a SIGTERM to it becomes moot, as it will not deal with the signal cleanly anyway. We very rarely see problems in the field with the download or execution of images/containers, which is why we recommend this type of approach.
I still hope we can come through for you, and we shall see what response the OS team can give us tomorrow.