Not on purpose. The only thing I was going to try to do in privileged mode was change the hostname, but I moved that to the preloaded images.
So, I have no idea what is causing the host OS to act this way.
As I mentioned previously, the ‘sig’ program I wrote can run in a tight loop in its container, and I could see that container spinning. But how that affects the host OS is beyond me.
As a matter of fact, this is one of Docker’s big selling points, correct?
I can tell you, this software works as expected when we run it outside the containers. The problems we are dealing with now have only been issues during the last month-plus that I have been trying to get the containers to work properly…
We’d be happy to continue looking at it for you; when you get the devices back online please enable support access again and let us know.
It would be helpful to know more detail about the sig container, since you mentioned that the problems started after that was added to the device. Is there any way you’re able to share the reproduction of the issue with us? If not I’d suggest working to isolate the smallest change possible which starts to cause the problem.
I am curious: what did you all find out yesterday? You mentioned it was continuing to reboot because the supervisor was not running, but I have heard nothing after that.
Did you guys have a chance to look at it any further?
This thread is turning into a mess. We have two problems:
1. The original BIG problem with SIGTERM.
2. A new FIRE that erupted when balenaOS kept failing, preventing containers from being installed and updated and leaving us stuck in an unrecoverable perpetual reboot.
For Problem 1: I have done everything I have been asked to do, and have at one time or another described the result. Observationally, I think I know what is happening, but I need your team to determine if it can be fixed at all.
Please keep in mind: our software works when we run it natively on Raspbian. It has only broken since we started trying to integrate it with balena.
This has been going on for over a month (among other things), but I am now really coming down to the wire and I need to know if Balena is capable of supporting our application.
I am beginning to doubt you will be able to support us…
Second Problem - Persistent Failing of BalenaOS
The second problem, which you all had a peek at yesterday, came out of nowhere. I wanted to push a new version to balena when the failures became persistent.
Now, this is NOT our primary problem, but it scares the bejesus out of us. The thought of this happening to one of our devices in a remote foreign country, and having to send somebody out on an emergency call, is also giving us pause.
What gives me as much pause as the failure itself is the fact that, as a customer sitting in a semi-stressful situation, I have so far had very little to no help resolving either of these issues.
Heck, I left the rig running with support mode enabled for 3+ hours afterwards, but have heard nothing new.
Moving forward: I am going to put a new SD image in my device and attempt to get the SIGTERM problem ready to debug, and hopefully make some headway.
If you would like me to provide support mode again, I will be happy to do so. However, I would like to request that you: a) let me know if you make any progress, and b) let me know when you are finished using it so I can reclaim it for work over here.
Thanks for your patience so far in this forums thread. I’ve just pinged the engineers who have looked at this thread internally to get all our heads together at once, and I’ve also summarized what I believe are the main points so far and our lines of thinking for possible explanations / further debug pathways. Feel free to let us know if you have anything to add to this, or if anything could be clarified.
The state of things so far:
we are seeing (1) SIGTERMs in a specific container and (2) the device itself crashlooping
none of this happens when run on Raspbian outside a container context
the release which coincided with the issue contains a process, sig, which can end up in a tight loop that tries to max out the CPU within whatever limits cgroups have allowed it
it’s been on the same power supply it always has
containers are running as privileged
Ideas:
power supply issues (swapped power supply, no change, and I’m not aware of any “under voltage” messages in logs being found to begin with; usually these are the telltale sign of such)
privileged container is overwriting some resource causing the host OS / supervisor to break in some way (as an example of this species of error: when a privileged container bind-mounts the system dbus socket directory and inside the container runs its own dbus, overwriting the system dbus socket, causing general havoc)
the sig CPUmaxing is somehow causing this
Next steps
I can’t get in to the device/app UUIDs shared above since, as you point out, the support access has expired. If you successfully flash that new SD and get the device up, we’d be happy to jump in, in parallel with you or otherwise, as we’d only need read-access type operations. I have a suspicion myself that something we can make use of may be hiding in the logs, as I don’t see any log dumps so far in this thread.
Thanks a ton for getting back to me. Just to be clear, with an update:
Update:
The reboot problem is a SCARY distraction. The real problem is SIGTERM.
I have burned a new SD card and have rebooted the rig. It boots fine now and has loaded the latest version of the software. Hence the reboot problem does not currently exist.
(I did save the SD with the broken image that causes the reboots.)
Next Step
I have rebooted with a new SD card and have also turned on Support Mode for 6 hours.
uuid: e8f5596a4ef43cad28de82bfb1559b6f
Feel free to have a look. I will be testing and pushing new versions. If you all want to freeze on a version and do your own thing with it, I will be happy to accommodate.
I do not want to summarize everything from this thread and a previous, but as a result of my experimentation I THINK I am seeing the following happen:
I have written a C program that catches SIGTERM, prints to STDOUT, then calls a function called seeya() to shut various items down.
I have started said program in a container doing this:
CMD ["./ABC"]
Which according to the Dockerfile documentation should exec (run the program in the parent process, NOT spawn a child process to run the program).
I believe I can verify that this is in fact what happens: I can see that my program is running with PID 1 (no parent).
Fast-forward to Running Containers
As per the previous thread, I understand that a terminate signal sent to the kernel (from a GPIO trigger) tells the Linux/balena kernel to begin the shutdown process, which involves sending SIGTERM to its containers.
In theory, the container should receive the signal and propagate it to our program “./ABC”, which in turn handles the signal as described above. The kernel will allow about a 10s wait for children to clean up before actually terminating them.
What I SEE Happen
It appears to me that the container is getting SIGTERM, then terminating directly (and immediately restarting).
However, I am pretty certain the program ABC never gets the signal, because the function seeya() is never called.
I infer that because the string we write to STDOUT is never written to the log file…
Thanks for that detailed summary. My colleague points out it may be a simple matter of the format of the CMD directive. You may need to use the exec form of CMD, which is:
CMD ["/usr/local/bin/ABC"]
This may be behind the container failures in itself, hopefully. As to the machine itself, it’s possible in some cases that extremely high CPU usage will trigger a shutdown. Is your device possibly a Raspberry Pi? These especially are biased toward such defensive measures.
It is a Raspberry Pi. I can’t imagine a point where we ever exceed 25% - 40% CPU usage across all CPUs for even a few seconds, let alone enough to bring it down… It is controlling the physical movement of motors and a laser…
A quick test just now has confirmed what ought to be documented in the Docker “Dockerfile” reference, but seems not to be: regardless of whether the location used for COPY directives is an absolute or a relative path, the full path is still needed for the CMD line, which cannot use the ./script.sh syntax (the
I’m sorry you’re having these issues, but I’d like to get to a simple example where SIGTERM definitely works for you. As such, I’ve just written a very basic example which should catch a terminating signal on container restart/device reboot, etc.
I’ve quickly tried this out on a Pi3 running 2.43.0+rev1 and I do indeed see SIGTERM being correctly sent from balenaEngine to the running container when a restart/reboot is initiated.
As these are not very large files, I’m going to post them here so you can try them on your same setup and see if they work. All three files should go in a new directory before you balena push to a relevant application:
Dockerfile
FROM balenalib/%%BALENA_MACHINE_NAME%%-debian:buster
RUN install_packages build-essential
WORKDIR /usr/src/app
COPY hello-world.c startup.sh ./
RUN gcc -static -o hello-world hello-world.c
CMD ["/usr/src/app/startup.sh"]
Ensure startup.sh has the execution permission bit (chmod a+x startup.sh) before pushing
As mentioned previously (although I see you are doing this), the exec form of CMD is vital to receive signals
Any script used to run a binary needs to invoke it using exec, or else signals will again not be caught (and again, the script must be called using the exec form of the CMD directive)
I apologise for paring this back to such a simple test case, but if you wouldn’t mind trying this yourself to ensure that your device is seeing SIGTERM, it would give us a good foundation to start building back on top of.