Container missing shutdown signal

OK, we changed to a different power supply, rated at 3 A…

By the way, we have another test device, a real device with a laser, that is displaying this same behavior.

Again, both of these were working fine until midday yesterday… Just for fun I am turning on the other device.

The new device is UUID: e84f657041f89aee499e564898933203

Is the device, by any chance:

  • Mounting the host DBus and sending restart commands there
  • Overloading and crashing the device
  • Performing any privileged action that could make balena’s health checks fail

?

Ummm.

  1. No, no DBUS involved.
  2. I do not know, how do we test it?
  3. Not on purpose. The only thing I was going to try to do in privileged mode is change the host name, but I moved that to the preloaded images.

So, I have no idea what is causing the host OS to act this way.

As I mentioned previously, the ‘sig’ program I wrote can run in a tight loop in its container; I could see that container spinning. But how that affects the host OS is beyond me.

As a matter of fact, this is one of Docker’s big selling points, correct?

I can tell you, this software works as expected when we run it outside the containers. The problems we are dealing with now have only been issues in this last month or more that I have been trying to get the containers to work properly…

Hello, you still looking at this? I need to take off, and power down the device. Will be back tomorrow.

In the meantime, I created another image, booted a different device and I am able to reproduce my original problem…

Anyway, I can turn this back on and let you have access tomorrow…

We’d be happy to continue looking at it for you; when you get the devices back online please enable support access again and let us know.

It would be helpful to know more detail about the sig container, since you mentioned that the problems started after that was added to the device. Is there any way you’re able to share the reproduction of the issue with us? If not I’d suggest working to isolate the smallest change possible which starts to cause the problem.

I am curious: what did you all find out yesterday? You mentioned it was continuing to reboot because the supervisor was not running, but I have heard nothing after that.

Did you guys have a chance to look at it any further?

This thread is turning into a mess: We have two problems:

  1. Original BIG Problem with SIGTERM
  2. New FIRE erupted when BalenaOS continued to fail, preventing containers from being installed and updated, and leaving us stuck in an unrecoverable perpetual reboot.

For Problem 1: I have done everything I have been asked to do, and have at one time or another described the result. Observationally, I think I know what is happening, but I need your team to determine if it can be fixed at all.

Please keep in mind: Our software works when we run it native on raspbian. It only breaks since we have been trying to integrate it with balena.

This has been going on for over a month (among other things), but I am now really coming down to the wire and I need to know if Balena is capable of supporting our application.

I am beginning to doubt you will be able to support us…

Second Problem - Persistent Failing of BalenaOS

The second problem, which you all had a peek at yesterday, came out of nowhere. I wanted to push a new version to balena when its failures became persistent.

Now this is NOT our primary problem, but it scares the bejesus out of us. The thought of this happening to one of our devices in a remote foreign country, and having to send somebody out on an emergency, is also giving us pause.

What gives us as much pause as the failure itself is the fact that I, as a customer in a semi-stressful situation, have so far had very little to no help resolving either of these issues.

Heck, I left the rig running with support mode enabled for 3+ hours afterwards, but have heard nothing new.

Moving forward: I am going to put a new SD image in my device and attempt to get the SIGTERM problem ready to debug, and hopefully make some headway.

If you would like me to provide support mode again, I will be happy to do so. However, I would like to request that you: a) let me know if you make any progress, and b) let me know when you are finished using it so I can reclaim it for work over here.

Thanks.

- rusty

Hey Rusty,

Thanks for your patience so far in this forums thread. I’ve just pinged the engineers who have looked at this thread internally to get all our heads together at once, and I’ve also summarized what I believe are the main points so far and our lines of thinking for possible explanations / further debug pathways. Feel free to let us know if you have anything to add to this, or if anything could be clarified.

The state of things so far:

  • we are seeing (1) SIGTERMs in a specific container and (2) the device itself crashlooping

  • none of this happens when run on raspbian outside a container context

  • the release which coincided with the issue contains a process, sig, which can end up in a tight loop that tries to max out the CPU within whatever limits cgroups allow it

  • it’s been on the same power supply it always has

  • containers are running as privileged

Ideas:

  • power supply issues (swapped power supply, no change, and I’m not aware of any “under voltage” messages in logs being found to begin with; usually these are the telltale sign of such)

  • privileged container is overwriting some resource causing the host OS / supervisor to break in some way (as an example of this species of error: when a privileged container bind-mounts the system dbus socket directory and inside the container runs its own dbus, overwriting the system dbus socket, causing general havoc)

  • the sig CPUmaxing is somehow causing this

Next steps

I can’t get into the devices at the UUIDs shared above; as you point out, the support access expired. If you successfully flash that new SD and get the device up, we’d be happy to jump in, in parallel to you or otherwise, as we’d only need read-access type operations. I have a suspicion myself that something may hide in the logs we can make use of, as I don’t see any log dumps so far in this thread.

Thanks a ton for getting back to me. Just to be clear, with an update:

Update:

  1. The reboot problem is a SCARY distraction. The real problem is SIGTERM.

  2. I have burned a new SD card and have rebooted the rig. It boots fine now and has loaded the latest version of the software. Hence the reboot problem does not currently exist.

  3. I did save the SD card with the broken image (the one that causes the reboots).

Next Step

I have rebooted with a new SD card and have also turned on Support Mode for 6 hours.

uuid: e8f5596a4ef43cad28de82bfb1559b6f

Feel free to have a look. I will be testing, and pushing new versions. If you all want to freeze on a version, and do your own thing with it, I will be happy to accommodate.

The SIGTERM Problem (MY GUESS).

I do not want to summarize everything from this thread and a previous, but as a result of my experimentation I THINK I am seeing the following happen:

  1. I have written a C program that catches SIGTERM, prints to STDOUT, then calls a function called seeya() to shut various items down.

  2. I have started said program in a container doing this:

CMD ["./ABC"]

Which according to the Dockerfile documentation should exec (run the program in the parent process, NOT spawn a child process to run the program).

I believe I can verify that is in fact what happens: I can see that my program is running with PID 1 (no parent).

Fast forward to Running Containers

As per the previous thread, I understand that a terminate signal sent to the kernel (from a GPIO trigger) tells the Linux/balena kernel to begin the shutdown process, which involves sending SIGTERM to its containers.

In theory, the container should receive the signal and propagate it to our program “./ABC”, which in turn handles the signal as described above. The kernel will allow about a 10 s wait for children to clean up before actually terminating them.

What I SEE Happen

It appears to me that the container is getting SIGTERM and terminating directly (and immediately restarting).
However, I am pretty certain the program ABC never gets the signal, because the function seeya() is never called.

I infer that because the string we write to STDOUT is never written to the log file…

That’s All

That’s what I have right now.

Hey Rusty!

Thanks for that detailed summary. My colleague points out it may be a simple matter of the format of the CMD directive. You may need to use the exec form of CMD, which is:
CMD ["/usr/local/bin/ABC"]

This may be behind the container failures in itself, hopefully. As to the machine itself, it’s possible in some cases that extremely high CPU usage will trigger a shutdown. Is your device possibly a Raspberry Pi? These especially are biased toward such defensive measures.

That difference is only because I changed where I copied the executable to in more recent versions.

COPY ABC /usr/local/bin/ABC
CMD ["/usr/local/bin/ABC"]

changed to

COPY ABC ./ABC
CMD ["./ABC"]

Those changes happened at the same time.

It is a Raspberry Pi. I can’t imagine a point where we would ever exceed 25% - 40% CPU usage across all CPUs for even a few seconds, let alone enough to bring it down… It is controlling the physical movement of motors and a laser…

Additionally, as stated previously. Our software is fine away from Balena …

Now, my concern is less with the crashing than with its inability to shut down.

If I can’t get SIGTERM to work, it is a show stopper.

A quick test just now and I’ve confirmed what ought to be documented in the docker “Dockerfile” reference, but seems not to be… regardless of the location used for COPY directives being an absolute or relative path, the full path will still be needed for the CMD line, which cannot use the ./script.sh syntax (the . char). I’m watching the device for whenever it comes online, at any rate, and hope to be able to grab some diagnostics or offload some logs.

Hi there,

I’m sorry you’re having these issues, but I’d like to get to a simple example where SIGTERM definitely works for you. As such, I’ve just written a very basic example which should catch a terminating signal on container restart/device reboot, etc.

I’ve quickly tried this out on a Pi3 running 2.43.0+rev1 and I do indeed see SIGTERM being correctly sent from balenaEngine to the running container when a restart/reboot is initiated.

As these are not very large files, I’m going to post them here so you can try them on your same setup and see if they work. All three files should go in a new directory before you balena push to a relevant application:

Dockerfile

FROM balenalib/%%BALENA_MACHINE_NAME%%-debian:buster

RUN install_packages build-essential

WORKDIR /usr/src/app
COPY hello-world.c startup.sh ./

RUN gcc -static -o hello-world hello-world.c

CMD ["/usr/src/app/startup.sh"]

hello-world.c:

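#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Set from the signal handler, checked by the main loop. */
volatile sig_atomic_t terminated = 0;

void sigtermAction(int signum)
{
    /* write() is async-signal-safe; printf() is not. */
    static const char msg[] = "SIGTERM!\n";
    ssize_t ignored = write(STDOUT_FILENO, msg, sizeof(msg) - 1);
    (void)ignored;
    (void)signum;
    terminated = 1;
}

int main(void)
{
    struct sigaction act;
    memset(&act, 0, sizeof(act));
    act.sa_handler = sigtermAction;

    sigaction(SIGTERM, &act, NULL);

    while (!terminated)
    {
        printf("Main loop...\n");
        fflush(stdout); /* stdout is block-buffered when it is not a terminal */
        sleep(5);       /* returns early when the signal arrives */
    }

    return 0;
}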

startup.sh

#!/bin/bash
exec /usr/src/app/hello-world

A couple of points to note:

  • Ensure startup.sh has the execution permission bit (chmod a+x startup.sh) before pushing
  • As mentioned previously (although I see you are doing this), the exec form of CMD is vital to receive signals
  • Any script used to run a binary needs to invoke using exec else signals will again not get caught (and again, the script must be called using the exec form of the CMD directive)

I apologise for paring this back to such a simple test case, but if you wouldn’t mind trying this yourself to ensure that your device is seeing SIGTERM, it would give us a good foundation to start working back on top of.

Many thanks and best regards,

Heds