Container missing shutdown signal

rustyeddy · March 7, 2020, 12:24am

I do NOT see the SIGTERM when I stop the Raspberry Pi.

We have a STOP button connected to the Raspberry Pi GPIO. We use the dtoverlay to have the kernel start the shutdown process when that pin is pulled low.

Does that answer your question…

(If I click exit on the dashboard or directly send SIGTERM, I can see the process shutdown).

When we stop the kernel with the STOP button, I do NOT see the SIGTERM.

rustyeddy · March 7, 2020, 12:27am

hedss · March 7, 2020, 12:27am

Great, thank you, this clears up a huge amount of confusion. Yes, I think there’s going to be an issue with the way this happens: if it doesn’t actually go through the shutdown properly, then systemd won’t start cleaning up.

balenaOS uses systemd, which does not begin shutting down when it receives SIGTERM; it instead looks for SIGRTMIN+3 to start the shutdown process. My working hypothesis, depending on how the shutdown is actually performed, is that you’re seeing the kernel start its shutdown procedure, systemd doesn’t do anything because it doesn’t respond to that signal, and then the whole device simply reboots in its current state.

rustyeddy · March 7, 2020, 12:28am

And here is just the container with your code…

hedss · March 7, 2020, 12:29am

The best way to do this is to have whichever service responds to the GPIO use DBus to talk directly to systemd on the host and request a shutdown. That request then goes through the correct shutdown procedure, sending the right signals to the processes, including balenaEngine, which in turn sends the correct signals to each of the service containers.

rustyeddy · March 7, 2020, 12:29am

YES! Finally after 1 month, I think you have basically concluded what I have been trying to say…

hedss · March 7, 2020, 12:31am

Sorry, it’s the first time I’ve seen this ticket, and looking back it seemed to have diverged massively, which is why I was trying to find the crux of the problem.

hedss · March 7, 2020, 12:33am

What you basically want to do is this: https://www.balena.io/docs/learn/develop/runtime/#rebooting-the-device
This’ll send the reboot command, and you should see everything work as you want it to.
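A minimal POSIX sh sketch of what that docs command looks like when issued from inside a service container. It assumes the service carries the io.balena.features.dbus label, which exposes the host’s system bus socket at /host/run/dbus/system_bus_socket; the function name and the DRY_RUN switch are hypothetical, added only so the command can be inspected without rebooting anything.

```shell
#!/bin/sh
# Sketch: ask the host's systemd to reboot from inside a service container.
# Assumes the io.balena.features.dbus label, so the host system bus socket
# is mounted at /host/run/dbus/system_bus_socket.
# DRY_RUN=1 (hypothetical switch) prints the command instead of running it.
request_host_reboot() {
  local bus="${DBUS_SYSTEM_BUS_ADDRESS:-unix:path=/host/run/dbus/system_bus_socket}"
  # Build the command in the function's positional parameters.
  set -- dbus-send --system --print-reply \
    --dest=org.freedesktop.systemd1 \
    /org/freedesktop/systemd1 \
    org.freedesktop.systemd1.Manager.Reboot
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "DBUS_SYSTEM_BUS_ADDRESS=$bus $*"
    return 0
  fi
  # Point libdbus at the host's system bus for this one call.
  DBUS_SYSTEM_BUS_ADDRESS="$bus" "$@"
}
```

Calling request_host_reboot from the service that watches the button would hand the shutdown over to the host’s systemd.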

rustyeddy · March 7, 2020, 12:36am

Yes, massively. I feel like I have just been explaining and re-explaining myself for the last two days, not only to support but also to my boss…

Finally, I’m glad we are at least on the same page, I think…

rustyeddy · March 7, 2020, 12:40am

Ooofff, it is unfortunate that we will have to make this change to work with Balena. But if there is no other choice, I will talk it over with my boss.

Thanks for your help.

Any idea why so many people seem to think the shutdown and signal process works differently than it actually does?

Anyway, thanks again for your help…

hedss · March 7, 2020, 12:44am

I’m really sorry you’ve had these issues. We carry out support rotation at balena, and every engineer takes part. Looking back, I think part of the problem was that there also seemed to be an issue with CPU resource utilisation; that became a bit of a rabbit hole, and the underlying signal issue was unfortunately lost in communication.

I have now flagged this thread internally to our product and support heads, as I’d like to carry out a post-mortem to determine why this has taken so long to resolve and to make sure this situation does not happen again.

I can only apologise again for your frustration.

If we can be of any further help with this, please let us know. If you need further help and would like to ping me directly, I will be more than happy to step in and answer your questions.

Best regards,

Heds

rustyeddy · March 7, 2020, 12:53am

Thank you Heds, I truly appreciate it. I have to admit my confidence in support was almost gone.

Especially when we have a significant number of machines in other countries, I was not feeling like your team was going to be able to help us out much …

Anyway, thanks again.

Now to figure out how we want to go forward…

hedss · March 7, 2020, 1:05am

You’re most welcome!

Just as an aside, SIGRTMIN+3 is actually the halt command; SIGRTMIN+4 powers off, and for a full reboot what you probably want is SIGRTMIN+5 (or SIGRTMIN+6 for a kexec-based reboot). There’s a full systemd signal guide here: https://www.freedesktop.org/software/systemd/man/systemd.html#Signals
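For reference, a small sh lookup of how systemd running as PID 1 reacts to the signals discussed in this thread, following the Signals section of the systemd(1) man page linked above. It only prints the documented action and sends nothing; the function name is hypothetical, and the man page for your systemd version remains the authority.

```shell
#!/bin/sh
# Sketch: documented reaction of systemd (as PID 1) to a few signals,
# per the "Signals" section of systemd(1). Lookup only, nothing is sent.
systemd_pid1_action() {
  case "$1" in
    SIGTERM)    echo "re-execute itself, keeping its state (no shutdown)" ;;
    SIGINT)     echo "start ctrl-alt-del.target" ;;
    SIGRTMIN+3) echo "start halt.target" ;;
    SIGRTMIN+4) echo "start poweroff.target" ;;
    SIGRTMIN+5) echo "start reboot.target" ;;
    SIGRTMIN+6) echo "start kexec.target" ;;
    *)          echo "not covered by this sketch"; return 1 ;;
  esac
}
```

For example, systemd_pid1_action SIGTERM shows why a SIGTERM aimed at systemd alone does not start a shutdown.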

Although as noted, if you use the command from https://www.balena.io/docs/learn/develop/runtime/#rebooting-the-device this will do it for you.

Have a good weekend!

rustyeddy · March 9, 2020, 7:11pm

OK, Heds, I get how DBus can be used by an application in a container to cause a reboot.

But what I am missing is: how does the signal get from the kernel to the application that in turn uses DBus to reboot the system?

The GPIO dtoverlay causes the kernel to read the signal and produce SIGTERMs.

Is there any way we can get on a quick call to nail this down?

I am literally 3 days away from a release and still do NOT know if I am going to solve this SHOW STOPPER.

hedss · March 9, 2020, 8:18pm

Hi @rustyeddy,

We don’t have a dedicated support number for balena support, as it goes through many different systems.

You’re using gpio-shutdown in the dtoverlay? I don’t have any experience with this, though from what I understand it should shut down systemd cleanly, so my first worry is why it sounds like it isn’t. Unfortunately I’d need to do some research into using it to understand why it doesn’t work.

From my own personal viewpoint, I’d carry this out in a slightly different way, and use one of the application services to listen for the GPIO pin and then carry out the reboot:

Ensure that GPIO pins on the /sys interface are bound into a service along with the DBus socket from the host balenaOS. You can do this with the io.balena.features.dbus and io.balena.features.sysfs labels, as documented here: https://www.balena.io/docs/learn/develop/multicontainer/#labels .
From this service container, when the relevant GPIO is set, the DBus call mentioned here: https://www.balena.io/docs/learn/develop/runtime/#rebooting-the-device would be invoked, and part of this should be balenaEngine sending SIGTERM to all your other service containers.
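A rough sh sketch of that approach, under several assumptions: the legacy sysfs GPIO interface at /sys/class/gpio, a hypothetical BCM pin 17 for the STOP button, and a button that pulls the line low when pressed. The service would need the labels io.balena.features.sysfs: '1' and io.balena.features.dbus: '1' in its docker-compose.yml, and most likely privileged: true to write to the GPIO files. The function names and the GPIO_ROOT, POLL_INTERVAL and DRY_RUN variables are illustrative, not part of any balena API.

```shell
#!/bin/sh
# Sketch: watch a STOP button on a sysfs GPIO, then ask the host to reboot.
# Assumptions (hypothetical): legacy sysfs GPIO, button on BCM pin 17,
# active low, service has the io.balena.features.sysfs/dbus labels.
GPIO_ROOT="${GPIO_ROOT:-/sys/class/gpio}"

# Export the pin if needed and configure it as an input.
setup_pin() {
  local pin="$1"
  if [ ! -d "$GPIO_ROOT/gpio$pin" ]; then
    echo "$pin" > "$GPIO_ROOT/export"
    sleep 0.1  # give the kernel a moment to create gpioN
  fi
  echo in > "$GPIO_ROOT/gpio$pin/direction"
}

# Block until the pin reads 0 (button pressed), polling periodically.
wait_for_press() {
  local pin="$1"
  until [ "$(cat "$GPIO_ROOT/gpio$pin/value" 2>/dev/null)" = "0" ]; do
    sleep "${POLL_INTERVAL:-0.1}"
  done
}

# Ask the host's systemd to reboot over the host DBus socket.
# DRY_RUN=1 only reports what would be called.
reboot_host() {
  local bus="${DBUS_SYSTEM_BUS_ADDRESS:-unix:path=/host/run/dbus/system_bus_socket}"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "would call org.freedesktop.systemd1.Manager.Reboot via $bus"
    return 0
  fi
  DBUS_SYSTEM_BUS_ADDRESS="$bus" dbus-send --system --print-reply \
    --dest=org.freedesktop.systemd1 /org/freedesktop/systemd1 \
    org.freedesktop.systemd1.Manager.Reboot
}

watch_stop_button() {
  local pin="${1:-17}"
  setup_pin "$pin"
  wait_for_press "$pin"
  reboot_host
}
```

Running watch_stop_button 17 as the service’s main process would tie the button to the host reboot. Polling every 100 ms keeps the sketch short; an edge-triggered wait on the value file would be lighter on CPU at the cost of more code.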

I have noticed that in our docs we’re suggesting org.freedesktop.systemd1.Manager.Reboot which as far as I know performs an immediate reboot cycle (and doesn’t attempt to carry out the unit shutdown procedure). I think this will still result in SIGTERMs to the service containers, but a more elegant way is probably to use org.freedesktop.login1.Manager.Reboot which I think actually carries out a full shutdown procedure.

Best regards,

Heds

rustyeddy · March 9, 2020, 8:40pm

Hi Heds,

OK, at this point I am reluctant to repeat myself on this forum any longer. I have already wasted far too many hours iterating over many different “attempts” at getting this to work.

I have already stated my observations, and I have answered every question. I have done each of these more than once due to your “rotating” support.

I have been told by different people that this “will work” or “should work”, only to have my observations and statements ignored.

I do not want to repeat myself again.

At this point, I have ZERO confidence I am going to get any useful support if/when we run into problems in the field.

Y’all have built a support system that refuses to offer phone calls, which would be much more efficient.

My recommendation to my boss is that we cut our losses and move to a different solution before we launch production and get stuck with this support.

But it is my boss’s call, so I will do what he says.

Sorry, I really do NOT mean to be an asshole, but just go over this thread and imagine how much money it must have cost my boss to get right back here, no further along than when I sent the very first support message regarding this problem…

Sincerely,

Rusty

rustyeddy · March 9, 2020, 8:48pm

One last thing: please keep in mind that this problem only exists when we use Balena. Our software running natively on Raspbian works just fine.

So it’s not like we are asking you guys to fix a problem that we created …

hedss · March 9, 2020, 10:03pm

Hi Rusty,

Again, I understand your exasperation, but as I say, reading and parsing this thread has been very difficult. Whilst you said you were using the dtoverlay, I hadn’t seen anywhere where you explicitly said how the shutdown then occurs (which is why I asked about gpio-shutdown: some customers add their own circuitry for events like this, and I couldn’t assume you were using it). I’ve tried to use my knowledge of balena to suggest an alternative which I believe will work, unless the reason you’re using the overlay is specifically because you think the container or balenaEngine could get into a state where it no longer works.

gpio-shutdown (https://github.com/raspberrypi/firmware/blob/9f4983548584d4f70e6eec5270125de93a081483/boot/overlays/README#L775) sends a KEY_POWER event to logind. For what it’s worth, I’ve just simulated this on balenaOS (same version as you, on an RPi3) in the host OS with:

```shell
dbus-send \
  --system \
  --print-reply \
  --dest=org.freedesktop.login1 \
  /org/freedesktop/login1 \
  org.freedesktop.login1.Manager.Reboot \
  boolean:"false"
```

I have modified my service to catch SIGTERM, log to stdout, and also write a new file in the container layer. On running this POWER_OFF simulation I, like you, do not see SIGTERM signals in the container. This may be because a udev rule is missing, and if you set one up it might work as expected.

However, on trying a reboot with what we suggest in our docs:

```shell
dbus-send \
  --system \
  --print-reply \
  --dest=org.freedesktop.systemd1 \
  /org/freedesktop/systemd1 \
  org.freedesktop.systemd1.Manager.Reboot
```

I don’t see any logging (which I expect, since the Supervisor is also terminated and the logging connection between the service container and the Supervisor is no longer valid), but I do see the file getting written, which shows that SIGTERM is being sent to the containers. This is why I’d highly recommend the method I proposed earlier: examine the GPIO input in a service and use it to shut down the system.
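A minimal sh sketch of that kind of SIGTERM check, assuming the script runs as the container’s main process: it logs to stdout and writes a marker file, so the result can be checked after the reboot even though the log stream to the Supervisor is already gone. MARKER_FILE and its default path are hypothetical. The explicit trap matters if the script is PID 1 in the container, because a namespace’s init process ignores SIGTERM unless it installs a handler.

```shell
#!/bin/sh
# Sketch: container entrypoint that records receipt of SIGTERM.
# MARKER_FILE (hypothetical) is written when the signal arrives.
run_until_sigterm() {
  marker="${MARKER_FILE:-/var/tmp/sigterm-received}"
  # On SIGTERM: log to stdout, write a timestamp to the marker, exit cleanly.
  trap 'echo "SIGTERM received"; date > "$marker"; exit 0' TERM
  echo "running, waiting for SIGTERM"
  while :; do
    # Sleep in the background and wait on it, so the trap runs as soon as
    # the signal arrives instead of after the sleep finishes.
    sleep 1 &
    wait $!
  done
}
```

Start it with run_until_sigterm, reboot the device by the method under test, and check afterwards whether the marker file exists in the container.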

Best regards,

Heds

rustyeddy · March 9, 2020, 10:54pm

Interesting! Please humor me and do a search for “dtoverlay” in this thread! You will see it hopefully. This just proves my point regarding the “rotating” support not working.

Regarding gpio-shutdown, you are correct. That is what I’m using, and it’s not working in the container. The alternative solution you provided is not a “clean” solution, since if my container is not running for whatever reason (an OTA update in progress, a crash, etc.) the system won’t work as designed. Therefore it is not acceptable for our application. And finally we are now in SYNC! My question from the very beginning has been: why is SIGTERM not being sent to containers after a KEY_POWER event, and how do we fix it?

hedss · March 9, 2020, 11:56pm

Hi again,

I have already flagged this particular issue to our OS and device team, so they can look at it and try to determine why this doesn’t work. Hopefully they can come back to you with more information, although they are all based in Europe, so this may be some time tomorrow.

To be clear, you actually said “I am using dt_overlays to program the RPI kernel to begin the shutdown process, including send SIGTERM signals to every container.” It was not immediately obvious to me from this that you were using gpio-shutdown, as you had not mentioned it, and that’s not actually what gpio-shutdown does (it fires a KEY_POWER event, which logind then captures to start the shutdown process).

As mentioned in my last message, it’s entirely possible this is down to a missing udev rule, which, if set up, may solve your problem (for example, as referenced in this thread: https://www.raspberrypi.org/forums/viewtopic.php?t=185571#p1172933 ). You would have to add this rule from a service container, and if it does work, it’s something we can look at adding to the OS if for some reason we missed it (and as I said, hopefully the OS folks will get back to you tomorrow).
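A sketch of what adding such a rule might look like, assuming it resembles the one discussed in that Raspberry Pi forum thread: it tags the gpio-keys input device that emits KEY_POWER (key code 116) with the power-switch tag, which is the tag systemd-logind uses to pick the input devices it watches for power-key events. The file name, default directory and function names are assumptions, and this is untested on balenaOS; in particular, whether a rule installed from inside a service container is seen by the host’s logind is exactly what would need to be confirmed.

```shell
#!/bin/sh
# Sketch: install a udev rule tagging the gpio-shutdown key device
# (gpio-keys, KEY_POWER = 116) as a logind "power-switch".
# Rule file name and default directory are assumptions; untested on balenaOS.
install_power_switch_rule() {
  local dir="${1:-/etc/udev/rules.d}"
  mkdir -p "$dir"
  cat > "$dir/99-gpio-power.rules" <<'EOF'
ACTION!="remove", SUBSYSTEM=="input", KERNEL=="event*", SUBSYSTEMS=="platform", DRIVERS=="gpio-keys", ATTRS{keys}=="116", TAG+="power-switch"
EOF
}

# Reload the rules and re-run them for input devices.
reload_udev_rules() {
  udevadm control --reload-rules
  udevadm trigger --subsystem-match=input
}
```

Calling install_power_switch_rule and then reload_udev_rules would write and load the rule; pressing the STOP button afterwards would show whether logind now reacts to it.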

I’m sorry my efforts have not led to the solution you currently require, although I have tried to propose an alternative that, having tested it myself, I believe would work.

Whilst I also understand, to a degree, your concerns about containers, they keep running whilst OTA updates are occurring, until all of the new images have been saved and the Supervisor is ready to start the new version of the application. Should a container terminate, the Supervisor will also restart it. Where a container keeps crashing and never runs successfully, sending it a SIGTERM becomes moot, as it would not handle the signal cleanly anyway. We very rarely see problems in the field with the download or execution of images/containers, which is why we recommend this type of approach.

I still hope we can come through for you, and we shall see what response the OS team can give us tomorrow.

Best regards,

Heds
