'balena ps' shows container restarting for many hours; reboot required to clear condition

jweide · January 7, 2021, 8:18pm

Executive Summary: After upgrading to supervisor version 11.14.0, one of our service containers consistently ends up in a state where it is “Restarting” and unable to transition out of that state. Once this condition is reached, the container cannot be stopped using balena kill. The only recovery we have found is to run reboot -f from the host shell.

Another symptom we observed is that our logging service, which consumes the supervisor API, stops logging when the issue is first encountered. Other services on the device appear to continue working, so the problem looks isolated to the supervisor.

In a deployment of 25 devices, we have encountered this issue at least 10 times within the last 2 weeks. Once the issue is encountered, it also causes deploys to fail because the existing container cannot be stopped. Oddly, containers that are not impacted can still be managed (stopped/restarted), so deploys that do not include changes to our “agent” container are not impacted.

In all cases, the same container is impacted (agent). More than likely there is something about this container and its behavior that induces the issue most reliably, but our expectation is that no matter how the application in the container behaves, it should not cause the balena service or supervisor to stop functioning.

Here is a sample output from a device experiencing the issue:

Here is terminal output from my first encounter with the issue a couple weeks ago:

Welcome to balenaOS

=============================================================
root@e0a1c10:~# balena ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
ead66de702d3 54a9ef2adff5 “/trex/edge capture” 25 hours ago Up 25 hours capture_3019399_1618759
5db9eba2d086 registry2.balena-cloud.com/v2/e5a9b95f08373b03979ee007184e4753:latest “/usr/src/app/entry.…” 37 hours ago Up 37 hours (healthy) resin_supervisor
5e8a3fe22dde 54a9ef2adff5 “/trex/edge control” 3 days ago Up 37 hours control_3019397_1618759
f05c7c6849f6 54a9ef2adff5 “/trex/edge agent” 7 days ago Restarting (1) 37 hours ago agent_3019398_1618759
2250f78d28c1 8f63f9e805d7 “sleep infinity” 7 days ago Up 5 days co-processor-flasher_3019396_1618759
2b39c0652c1d 4cae659624d8 “/entrypoint.sh” 7 days ago Up 5 days (healthy) hw-init_3019403_1618759
2400de462e9e dc0ec1501a28 “/bin/sh -c /app/con…” 7 days ago Up 5 days connection-watchdog_3019406_1618759
30fb207864b5 d39267e33fea “/entrypoint.sh” 7 days ago Up 37 hours 0.0.0.0:2222->22/tcp scp_3019402_1618759
c7c2f9c6903a ff0c7181a848 “python3 logger.py” 7 days ago Up 37 hours logger_3019404_1618759

root@e0a1c10:~# balena kill f05c7c6849f6
^C // I aborted the command after waiting maybe 60-120 seconds.
root@e0a1c10:~# ps | grep edge
15820 root 2276 S grep edge
18108 root 978m S /trex/edge capture
18517 root 890m S /trex/edge control
// The absence of /trex/edge agent from this output indicates that the service we run in the agent container is not presently running
root@e0a1c10:~#
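
For anyone scripting around this condition, the manual "wait, then Ctrl-C" step above can be given an explicit upper bound. This is only a minimal sketch: the helper name `run_bounded` is hypothetical, and the 60-second limit in the usage note is an arbitrary choice, not a value recommended by balena.

```python
import subprocess


def run_bounded(cmd, timeout_s: float):
    """Run cmd; return its exit status, or None if it did not finish in time."""
    try:
        # subprocess.run kills the child process when the timeout expires
        # and then raises TimeoutExpired.
        return subprocess.run(cmd, timeout=timeout_s, check=False).returncode
    except subprocess.TimeoutExpired:
        return None
```

On the host shell this could be called as `run_bounded(["balena", "kill", "f05c7c6849f6"], 60)`; a `None` result would correspond to the hang shown in the transcript above.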

Here are details from a representative/identically configured system in our CI environment:

root@efdbd72:~# cat /etc/*-release
ID=“balena-os”
NAME=“balenaOS”
VERSION=“2.58.3+rev1”
VERSION_ID=“2.58.3+rev1”
PRETTY_NAME=“balenaOS 2.58.3+rev1”
MACHINE=“fincm3”
VARIANT=“Production”
VARIANT_ID=“prod”
META_BALENA_VERSION=“2.58.3”
RESIN_BOARD_REV=“6a40e45”
META_RESIN_REV=“eabb6e1”
SLUG=“fincm3”

I am happy to provide additional information. I’m unsure how to really debug this further and I’m also unsure what logs or outputs would be useful, so I was hoping to get some suggestions from the community or support from balena.

jtonello · January 8, 2021, 12:34am

Hi,

Are you able to restart the supervisor by opening the HostOS shell and entering the command systemctl restart resin-supervisor? It sounds like the supervisor isn’t restarting properly after the upgrade, which would explain why your services aren’t impacted and why you can no longer interact via the dashboard.

Can you give that a try?

John

jtonello · January 8, 2021, 12:36am

Also, can you tell us what kind of devices you’re running and if you updated the systems using the dashboard (or some other mechanism)? Generally, when updating the OS version via the dashboard, the supervisor is updated at the same time. Perhaps that didn’t happen.

John

jweide · January 8, 2021, 6:26pm

Sorry, I didn’t mean to distract from the original issue by mentioning deployments. This issue can occur naturally without any external interaction with the balena device.

For example: I have a device XYZ. This device’s supervisor was upgraded 6 weeks ago. The most recent deploy was perhaps 2 weeks ago. The device was fully operational at the conclusion of the deploy and has had no external intervention since, but at midnight last night, when the device was supposed to upload metrics to the cloud, it did not. An investigation the next morning showed that the “agent” service had been in the “restarting” state for 27 hours, meaning that sometime in the early morning of the prior day this device got into a state that caused some services to fail and not restart, while remaining healthy enough to show green status on the balena dashboard. Looking at the logs collected through the supervisor API shows that no logs have come through for the same 27 hours, even though many of the services are still running.

So now, getting away from the example story of the problem, I have a few updates from our internal efforts and I must mention that the original problem remains but some of our symptoms are a little different.

On this device we use InfluxDB. When InfluxDB starts on an initial deployment, before any database is created, it’s necessary to delay the start of the “agent” service until a separate staging script has done the initial configuration of InfluxDB. We modified the influx entrypoint to start influx, test when it becomes ready, test that it has been correctly staged and is ready for the agent to connect, and then open a TCP listen port using socat in the influx container. Once the socat listen port is open, the agent container is able to see that it’s safe to connect to the influx instance. We had a problem where, if something went wrong in the influx container (for example, it was killed due to a resource problem), the socat socket was left in the CLOSE_WAIT state and could not be terminated. This left the influx container in a bad state. We corrected the problem by cleanly closing the socat socket, which now allows the influx container to cleanly close and restart. We believe that this issue was unrelated to our agent container problem (getting stuck in the restarting state), but somehow, by fixing this problem with socat in the influx container, the agent container no longer gets stuck in the “restarting” state. It does, however, still fail to restart if the entrypoint process exits.
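
To make the readiness gate described above concrete, here is a minimal sketch of what the client (agent) side of such a check could look like. It is not our actual implementation: the function name `wait_for_port` and the default timings are assumptions for illustration only.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 60.0,
                  interval_s: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # The context manager closes the probe socket immediately, so the
            # probing side does not hold a connection open after the check.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Connection refused or timed out: the port is not open yet.
            time.sleep(interval_s)
    return False
```

The agent would call something like `wait_for_port("127.0.0.1", ready_port)` before connecting to influx, where `ready_port` stands for whichever port the readiness listener opens.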

I think a new and more appropriate headline for this issue would be “Container entrypoint terminates, but the container is not restarted”.

You can see this behavior in the following log output from one of our devices that reproduced the issue this morning:

Observations:

The container “agent_3134344_1651895” shows “Up 13 hours” upon initial login
grepping ps for the agent container entrypoint shows no running process matching the container.
The ‘balena kill’ command succeeds but a subsequent “balena ps” shows the container still “Up 13 hours”
balena inspect shows the state is “Status”: “running”, and “Running”: true, and “Pid”: 27829,
‘ps | grep 27829’ shows nothing using PID 27829
‘systemctl restart balena’ resulted in things working again.
Contrary to earlier reports, the logging service actually continued to work now that we resolved the socat issue in the influx container.

Logs are below

=============================================================
Welcome to balenaOS

root@4ef2f72:~# balena ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
472710a7a738 registry2.balena-cloud.com/v2/e5a9b95f08373b03979ee007184e4753:latest “/usr/src/app/entry.…” 13 hours ago Up 13 hours (healthy) resin_supervisor
“/entrypoint.sh” 23 hours ago Up 23 hours 0.0.0.0:2222->22/tcp scp_3134349_1651895
“/entrypoint.sh” 8 days ago Up 2 days grafana_3134353_1651895
“/entrypoint_with_wa…” 8 days ago Up 2 days telegraf_3134352_1651895
“sleep infinity” 8 days ago Up 2 days inspect_3134348_1651895
“/entrypoint.sh infl…” 8 days ago Up 2 days influxdb_3134354_1651895
“/trex/edge agent” 8 days ago Up 13 hours agent_3134344_1651895
“/trex/edge control” 8 days ago Up 2 days control_3134341_1651895
“/trex/edge capture” 8 days ago Up 2 days capture_3134342_1651895
“sleep infinity” 11 days ago Up 2 days co-processor-flasher_3134343_1651895
“./grpc-dfuTool” 11 days ago Up 2 days sensor-firmware-flasher_3134345_1651895
“python3 logger.py” 11 days ago Up 13 hours logger_3134347_1651895
“/entrypoint.sh” 11 days ago Up 13 hours (healthy) hw-init_3134346_1651895
“/bin/sh -c /app/con…” 11 days ago Up 2 days connection-watchdog_3134350_1651895
root@4ef2f72:~# ps |grep edge
2049 root 890m S /trex/edge control
2339 root 987m S /trex/edge capture
19492 root 2276 S grep edge
root@4ef2f72:~# balena kill 1f78b10e5b5d
root@4ef2f72:~# balena ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
472710a7a738 registry2.balena-cloud.com/v2/e5a9b95f08373b03979ee007184e4753:latest “/usr/src/app/entry.…” 13 hours ago Up 13 hours (healthy) resin_supervisor
“/entrypoint.sh” 23 hours ago Up 23 hours 0.0.0.0:2222->22/tcp scp_3134349_1651895
“/entrypoint.sh” 8 days ago Up 2 days grafana_3134353_1651895
“/entrypoint_with_wa…” 8 days ago Up 2 days telegraf_3134352_1651895
“sleep infinity” 8 days ago Up 2 days inspect_3134348_1651895
“/entrypoint.sh infl…” 8 days ago Up 2 days influxdb_3134354_1651895
“/trex/edge agent” 8 days ago Up 13 hours agent_3134344_1651895
“/trex/edge control” 8 days ago Up 2 days control_3134341_1651895
“/trex/edge capture” 8 days ago Up 2 days capture_3134342_1651895
“sleep infinity” 11 days ago Up 2 days co-processor-flasher_3134343_1651895
“./grpc-dfuTool” 11 days ago Up 2 days sensor-firmware-flasher_3134345_1651895
“python3 logger.py” 11 days ago Up 13 hours logger_3134347_1651895
“/entrypoint.sh” 11 days ago Up 13 hours (healthy) hw-init_3134346_1651895
“/bin/sh -c /app/con…” 11 days ago Up 2 days connection-watchdog_3134350_1651895
root@4ef2f72:~# ps |grep edge
2049 root 890m S /trex/edge control
2339 root 987m S /trex/edge capture
19697 root 2276 S grep edge

root@4ef2f72:~# balena inspect agent_3134344_1651895
[
{
“Id”: “1f78b10e5b5d644c2a51d39189ee6babf2bc55dc08654a8c69b55ad58463b934”,
“Created”: “2020-12-30T20:44:28.231746912Z”,
“Path”: “/trex/edge”,
“Args”: [
“agent”
],
“State”: {
“Status”: “running”,
“Running”: true,
“Paused”: false,
“Restarting”: false,
“OOMKilled”: false,
“Dead”: false,
“Pid”: 27829,
“ExitCode”: 0,
“Error”: “”,
“StartedAt”: “2021-01-08T03:51:03.489862922Z”,
“FinishedAt”: “2021-01-08T03:50:58.401232618Z”
},
“Image”: “sha256:158780b9dd8a4569fa86dfa8d134f584dea1b7e1dc51170634f32a94b841e1e0”,
“ResolvConfPath”: “/var/lib/docker/containers/1f78b10e5b5d644c2a51d39189ee6babf2bc55dc08654a8c69b55ad58463b934/resolv.conf”,
“HostnamePath”: “/var/lib/docker/containers/1f78b10e5b5d644c2a51d39189ee6babf2bc55dc08654a8c69b55ad58463b934/hostname”,
“HostsPath”: “/var/lib/docker/containers/1f78b10e5b5d644c2a51d39189ee6babf2bc55dc08654a8c69b55ad58463b934/hosts”,
“LogPath”: “”,
“Name”: “/agent_3134344_1651895”,
“RestartCount”: 34,
“Driver”: “aufs”,
“Platform”: “linux”,
“MountLabel”: “”,
“ProcessLabel”: “”,
“AppArmorProfile”: “”,
“ExecIDs”: null,
“HostConfig”: {
“Binds”: [
“/tmp/balena-supervisor/services/1499460/agent:/tmp/resin”,
“/tmp/balena-supervisor/services/1499460/agent:/tmp/balena”
],
“ContainerIDFile”: “”,
“ContainerIDEnv”: “”,
“LogConfig”: {
“Type”: “journald”,
“Config”: {}
},
“NetworkMode”: “host”,
“PortBindings”: {},
“RestartPolicy”: {
“Name”: “always”,
“MaximumRetryCount”: 0
},
“AutoRemove”: false,
“VolumeDriver”: “”,
“VolumesFrom”: null,
“CapAdd”: [],
“CapDrop”: [],
“Capabilities”: null,
“Dns”: [],
“DnsOptions”: [],
“DnsSearch”: [],
“ExtraHosts”: [],
“GroupAdd”: [],
“IpcMode”: “shareable”,
“Cgroup”: “”,
“Links”: null,
“OomScoreAdj”: 0,
“PidMode”: “”,
“Privileged”: true,
“PublishAllPorts”: false,
“ReadonlyRootfs”: false,
“SecurityOpt”: [],
“UTSMode”: “”,
“UsernsMode”: “”,
“ShmSize”: 67108864,
“Runtime”: “runc”,
“ConsoleSize”: [
0,
0
],
“Isolation”: “”,
“CpuShares”: 0,
“Memory”: 0,
“NanoCpus”: 0,
“CgroupParent”: “”,
“BlkioWeight”: 0,
“BlkioWeightDevice”: null,
“BlkioDeviceReadBps”: null,
“BlkioDeviceWriteBps”: null,
“BlkioDeviceReadIOps”: null,
“BlkioDeviceWriteIOps”: null,
“CpuPeriod”: 0,
“CpuQuota”: 0,
“CpuRealtimePeriod”: 0,
“CpuRealtimeRuntime”: 0,
“CpusetCpus”: “”,
“CpusetMems”: “”,
“Devices”: [],
“DeviceCgroupRules”: null,
“DeviceRequests”: [],
“KernelMemory”: 0,
“KernelMemoryTCP”: 0,
“MemoryReservation”: 0,
“MemorySwap”: 0,
“MemorySwappiness”: null,
“OomKillDisable”: false,
“PidsLimit”: null,
“Ulimits”: [],
“CpuCount”: 0,
“CpuPercent”: 0,
“IOMaximumIOps”: 0,
“IOMaximumBandwidth”: 0,
“MaskedPaths”: null,
“ReadonlyPaths”: null
},
“GraphDriver”: {
“Data”: null,
“Name”: “aufs”
},
“Mounts”: [
{
“Type”: “bind”,
“Source”: “/tmp/balena-supervisor/services/1499460/agent”,
“Destination”: “/tmp/balena”,
“Mode”: “”,
“RW”: true,
“Propagation”: “rprivate”
},
{
“Type”: “bind”,
“Source”: “/tmp/balena-supervisor/services/1499460/agent”,
“Destination”: “/tmp/resin”,
“Mode”: “”,
“RW”: true,
“Propagation”: “rprivate”
}
],
“Config”: {
“Hostname”: “4ef2f72”,
“Domainname”: “”,
“User”: “”,
“AttachStdin”: false,
“AttachStdout”: false,
“AttachStderr”: false,
“Tty”: true,
“OpenStdin”: false,
“StdinOnce”: false,
“Env”: [
“RESIN_DEVICE_NAME_AT_INIT=silent-sea”,
“BALENA_DEVICE_NAME_AT_INIT=silent-sea”,
“INFLUXDB_HOST=127.0.0.1:8087”,
“WAIT_FOR_INFLUXDB=true”,
“AGENT_CAPTURE_URL=tcp://localhost:10000”,
“AGENT_INFLUXDB_URL=http://localhost:8086”,
“TREX_LOG_LEVEL=INFO”,
“WELL_API_NUMBER=00-000-00000”,
“AGENT_KAFKA_URL=tcp://broker.trex.petropower.com:9092”,
“AGENT_LOG_LEVEL=INFO”,
“AGENT_PARTIAL_SAMPLES_DATABASE=partial_samples/autogen”,
“BALENA_APP_ID=1499460”,
“BALENA_APP_NAME=Trex”,
“BALENA_SERVICE_NAME=agent”,
"BALENA_DEVICE_UUID=// removed
“BALENA_DEVICE_TYPE=fincm3”,
“BALENA_DEVICE_ARCH=armv7hf”,
“BALENA_HOST_OS_VERSION=balenaOS 2.58.3+rev1”,
“BALENA_SUPERVISOR_VERSION=11.14.0”,
“BALENA_APP_LOCK_PATH=/tmp/balena/updates.lock”,
“BALENA=1”,
“RESIN_APP_ID=1499460”,
“RESIN_APP_NAME=Trex”,
“RESIN_SERVICE_NAME=agent”,
"RESIN_DEVICE_UUID=// removed
“RESIN_DEVICE_TYPE=fincm3”,
“RESIN_DEVICE_ARCH=armv7hf”,
“RESIN_HOST_OS_VERSION=balenaOS 2.58.3+rev1”,
“RESIN_SUPERVISOR_VERSION=11.14.0”,
“RESIN_APP_LOCK_PATH=/tmp/balena/updates.lock”,
“RESIN=1”,
“RESIN_SERVICE_KILL_ME_PATH=/tmp/balena/handover-complete”,
“BALENA_SERVICE_HANDOVER_COMPLETE_PATH=/tmp/balena/handover-complete”,
“USER=root”,
“PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin”
],
“Cmd”: [
“agent”
],
“Healthcheck”: {
“Test”: [
“NONE”
]
},
“Image”: “sha256:158780b9dd8a4569fa86dfa8d134f584dea1b7e1dc51170634f32a94b841e1e0”,
“Volumes”: null,
“WorkingDir”: “”,
“Entrypoint”: [
“/trex/edge”
],
“OnBuild”: null,
“Labels”: {
“com.petropower.component.name”: “agent”,
“com.petropower.component.version”: “421”,
“io.balena.app-id”: “1499460”,
“io.balena.service-id”: “682262”,
“io.balena.service-name”: “agent”,
“io.balena.supervised”: “true”
},
“StopSignal”: “SIGTERM”,
“StopTimeout”: 10
},
“NetworkSettings”: {
“Bridge”: “”,
“SandboxID”: “81b2b80e3248f5c239fc339b28caab06797e22e566b963b328c680b16d2ccc62”,
“HairpinMode”: false,
“LinkLocalIPv6Address”: “”,
“LinkLocalIPv6PrefixLen”: 0,
“Ports”: {},
“SandboxKey”: “/var/run/balena-engine/netns/default”,
“SecondaryIPAddresses”: null,
“SecondaryIPv6Addresses”: null,
“EndpointID”: “”,
“Gateway”: “”,
“GlobalIPv6Address”: “”,
“GlobalIPv6PrefixLen”: 0,
“IPAddress”: “”,
“IPPrefixLen”: 0,
“IPv6Gateway”: “”,
“MacAddress”: “”,
“Networks”: {
“host”: {
“IPAMConfig”: null,
“Links”: null,
“Aliases”: null,
“NetworkID”: “a1b0d489f69d651e7b7054b5bb74622acbb3254cc18986ba57befbdb5d798287”,
“EndpointID”: “31a8e9ff9518e27430dbe5f4b4840a04f2398d00117225d41eca3e934ffc3b8f”,
“Gateway”: “”,
“IPAddress”: “”,
“IPPrefixLen”: 0,
“IPv6Gateway”: “”,
“GlobalIPv6Address”: “”,
“GlobalIPv6PrefixLen”: 0,
“MacAddress”: “”,
“DriverOpts”: null
}
}
}
}
]
root@4ef2f72:~# ps |grep edge
2049 root 890m S /trex/edge control
2339 root 987m S /trex/edge capture
21894 root 2276 S grep edge

root@4ef2f72:~# ps |grep 27829
22911 root 2276 S grep 27829
root@4ef2f72:~# service balena restart
bash: service: command not found
root@4ef2f72:~# systemctl restart balena

root@4ef2f72:~#
root@4ef2f72:~#
root@4ef2f72:~# balena ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
472710a7a738 registry2.balena-cloud.com/v2/e5a9b95f08373b03979ee007184e4753:latest “/usr/src/app/entry.…” 13 hours ago Up 21 seconds (health: starting) resin_supervisor
71d597882a28 02f603c54d98 “/entrypoint.sh” 24 hours ago Up 15 seconds 0.0.0.0:2222->22/tcp scp_3134349_1651895
68bf0adab5d4 e5364636fadf “/entrypoint.sh” 8 days ago Restarting (143) 3 seconds ago grafana_3134353_1651895
b96312a92de1 e4ccd877f611 “/entrypoint_with_wa…” 8 days ago Up Less than a second telegraf_3134352_1651895
da24521e0155 8d80652d5ac6 “sleep infinity” 8 days ago Up 18 seconds inspect_3134348_1651895
50ad55275d8c e36fce813fa7 “/entrypoint.sh infl…” 8 days ago Up 19 seconds influxdb_3134354_1651895
1f78b10e5b5d 158780b9dd8a “/trex/edge agent” 8 days ago Up 21 seconds agent_3134344_1651895
4fa6876d14be 158780b9dd8a “/trex/edge control” 8 days ago Up 21 seconds control_3134341_1651895
9fb70291c035 158780b9dd8a “/trex/edge capture” 8 days ago Up 19 seconds capture_3134342_1651895
9441093ec07d 860c57b07dd9 “sleep infinity” 11 days ago Up 15 seconds co-processor-flasher_3134343_1651895
3af1f6e5d436 c9767211b5b7 “./grpc-dfuTool” 11 days ago Up 19 seconds sensor-firmware-flasher_3134345_1651895
d15ed98e4d06 9f49ac54288d “python3 logger.py” 11 days ago Up 6 seconds logger_3134347_1651895
9c1c8640aede 48a011276016 “/entrypoint.sh” 11 days ago Up 19 seconds (health: starting) hw-init_3134346_1651895
23abad88c563 5b31debf8912 “/bin/sh -c /app/con…” 11 days ago Up 15 seconds connection-watchdog_3134350_1651895
root@4ef2f72:~# ps |grep agent
24348 root 872m S /trex/edge agent
27160 root 2276 S grep agent
root@4ef2f72:~# ^C
root@4ef2f72:~#

ab77 · January 9, 2021, 12:50am

Hi there, it’s a bit difficult to read the logs you posted since they’ve not been quoted in triple backticks.

Generally speaking, if a container cannot be stopped or killed with SIGTERM/SIGKILL, that suggests the processes running within are doing something potentially unstable. The “green” status on the dashboard you are referring to doesn’t consider the health of the user containers, only the device’s ability to talk to the balena API and VPN.

Are you able to provision a test device and grant us support access, so we can take a look and troubleshoot further?

ab77 · January 9, 2021, 1:00am

In addition, you should implement a health check at the docker-compose level and establish dependencies between the relevant containers, so that the dependent container will not start until Influx is ready and passing its health checks.

jweide · January 11, 2021, 3:55pm

A few comments in reply:

We use container dependencies, but in the case of Influx specifically, a newly deployed device does not yet have its databases configured. You can only configure those initial settings after the service is fully ready.
We decided to use socat to establish “full readiness” in the influx container. Perhaps we should choose something else that does not create stability issues. We did resolve the CLOSE_WAIT problem in our influx container already. Oddly, it’s not the influx container we have a problem restarting, it’s the agent container.
Internally we agree we need to implement health-check. I’m evaluating what that would look like.

I am working on provisioning a new device now and I will update the ticket when it’s ready. Even after provisioning, it could take some time to hit the bug. Do we need to set up support access before we hit the issue? Is granting support access something that would be destructive to the system state? I would like to make sure we are able to make use of the first reproduction, since it could take a while and we don’t presently have a procedure to induce the problem.

klutchell · January 11, 2021, 9:34pm

Hey @jweide, granting support access can be done at any time and does not impact the device state at all. It would be best to do it after the issue is reproduced to avoid leaving support access enabled for longer than required. Keep us posted!

jweide · January 12, 2021, 4:45pm

We’ve encountered this issue on a field device. This particular device is connected over a 4G cell network, so connectivity can sometimes be a little troublesome, but I can assure you that it’s totally safe to manipulate however we like because its function has been discontinued for our alpha partner at this particular site. It’s a passive device and presently we are not making use of it.

Who should I communicate with about further debugging the issue? Is there a secure way for me to transmit the device’s identity to an internal engineer?

Here is the signature we look for to know we hit the bug (it’s like the original report, that the container is “restarting” for several hours)

‘ps’ shows the edge agent is not running
‘balena ps’ shows the container is running or restarting for several hours

~# ps | grep edge
 9772 root      882m S    /trex/edge control
 9864 root      962m S    /trex/edge capture
14491 root      2276 S    grep edge
~# balena ps

<truncated>

177de4fa331f        158780b9dd8a                                                            "/trex/edge agent"       12 days ago         Restarting (1) 9 hours ago

<truncated>

~# balena inspect agent_3134344_1651895

<truncated>
"State": {
            "Status": "restarting",
            "Running": true,
            "Paused": false,
            "Restarting": true,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 0,
            "ExitCode": 1,
            "Error": "",
            "StartedAt": "2021-01-12T07:43:58.087910729Z",
            "FinishedAt": "2021-01-12T07:44:51.445698099Z"
<truncated>

We did confirm on a different device that using “systemctl restart balena” recovered the issue, but since opening this ticket it’s been a little more quiet and we haven’t had an opportunity to try “systemctl restart resin-supervisor” like was suggested before.

I can hold this device in this state and I will open support access as soon as I can communicate the device identity.

20k-ultra · January 12, 2021, 8:32pm

Hey, thanks for the extensive information you have provided. You’ve done a great job debugging the issue so far.

I’ll address a few of the points mentioned in your original post followed by some questions…

The Supervisor deploys your release, which consists of downloading the images, creating the containers, and running them. The fact that this container is in a Restarting state means that it has been started. Once a container is running, the Supervisor no longer performs any steps to keep it running: it will not monitor the container or trigger any further starts. There is one exception: if the Supervisor container itself is restarted, whether directly (systemctl restart resin-supervisor), via balena kill, or by a host reboot, then the Supervisor does not know that it previously started the container and will try to start it if it is not running. I mention this because the Supervisor is not what keeps restarting the container; that responsibility falls to the engine, based on the restart policy you have configured in the docker-compose file. By default the restart policy is always.

From your inspection of the agent container we can see that it has ExitCode 0. This would indicate that the container is gracefully terminating itself because there is no longer a foreground process attached. This is not to be confused with an application error which would return an ExitCode value of 1. Additionally, if the engine or Supervisor were terminating this container the ExitCode would be 143 (SIGTERM) or 137 for SIGKILL which would be used when restarting the engine.
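
As a quick reference for reading these values, the sketch below decodes an exit code using the common "128 + signal number" convention, which is where 143 (128 + 15, SIGTERM) and 137 (128 + 9, SIGKILL) come from. The function name `describe_exit_code` is made up for this illustration and is not part of any balena tooling.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a container exit code using the 128 + signal-number convention."""
    if code == 0:
        return "exited cleanly"
    if code > 128:
        signum = code - 128
        try:
            # signal.Signals maps a signal number to its symbolic name.
            return f"terminated by {signal.Signals(signum).name}"
        except ValueError:
            return f"terminated by unknown signal {signum}"
    return f"application error (exit status {code})"
```

For example, `describe_exit_code(143)` reports termination by SIGTERM, while `describe_exit_code(1)` is treated as an error raised by the application itself.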

As for deployments being blocked, since the Supervisor cannot kill the running release (can’t stop the agent container) then it cannot begin to deploy the new release. Once we figure out why the agent container is having issues then deployments will be able to be applied.

Time for some questions and next steps to try…

You mention that once the error state is established (the agent_3019398_1618759 container is stuck restarting), the Supervisor API no longer provides container logs. Is this just for the agent container or for all your containers? If I had to guess, I would say the API is not passing logs for any container. This might be a bug with the implementation that wants ALL container logs or none… If it is all your containers, could you SSH to the device and check whether logs are still being generated via balena logs <container-name>? This would help us figure out whether the restarting container state breaks the log pipe to the Supervisor or within the Engine.

Considering everything I’ve just said, I believe the issue is something within the application container. That might be due to some configuration change that the new Supervisor version introduced, or just a coincidence of timing. I would say that we should figure out why that container is Restarting. With a device that is in this state, SSH to the device and run balena logs agent_3134344_1651895. Do you see any messages? Does your application also contain any logic to make it exit gracefully?

I am confident that it’s nothing from balenaOS that is telling the container to stop so we need to find out why the container itself is deciding to exit.

20k-ultra · January 12, 2021, 8:42pm

You can also check if the Supervisor is stopping the container by SSHing to the device and running journalctl -fn 100 -u resin-supervisor. In the following example we can see that when I press the restart action on the dashboard this is the output from the Supervisor:

Jan 12 20:38:54 0ee913a resin-supervisor[2871]: [event]   Event: Service kill {"service":{"appId":1748670,"serviceId":864492,"serviceName":"main","releaseId":1651873}}
Jan 12 20:39:21 0ee913a resin-supervisor[2871]: [event]   Event: Service exit {"service":{"appId":1748670,"serviceId":864492,"serviceName":"main","releaseId":1651873}}
Jan 12 20:39:22 0ee913a resin-supervisor[2871]: [event]   Event: Service stop {"service":{"appId":1748670,"serviceId":864492,"serviceName":"main","releaseId":1651873}}
Jan 12 20:39:22 0ee913a resin-supervisor[2871]: [event]   Event: Service install {"service":{"appId":1748670,"serviceId":864492,"serviceName":"main","releaseId":1651873}}
Jan 12 20:39:24 0ee913a resin-supervisor[2871]: [event]   Event: Service installed {"service":{"appId":1748670,"serviceId":864492,"serviceName":"main","releaseId":1651873}}
Jan 12 20:39:25 0ee913a resin-supervisor[2871]: [event]   Event: Service start {"service":{"appId":1748670,"serviceId":864492,"serviceName":"main","releaseId":1651873}}
Jan 12 20:39:43 0ee913a resin-supervisor[2871]: [api]     POST /v2/applications/1748670/restart-service  -  ms

You’ll see similar events if the Supervisor is actually stopping/starting the container over and over.
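
If the journal is long, a small script can tally these events per service instead of reading them by eye. This is a rough sketch that assumes the line layout shown in the example above ("[event]   Event: <name> <JSON payload>"); the function name `count_service_events` is hypothetical.

```python
import json
import re
from collections import Counter

# Matches "[event]   Event: Service kill {...}" and captures the name and JSON.
EVENT_RE = re.compile(r"\[event\]\s+Event: (?P<name>.+?) (?P<payload>\{.*\})\s*$")


def count_service_events(lines):
    """Count (serviceName, event name) pairs found in supervisor journal lines."""
    counts = Counter()
    for line in lines:
        match = EVENT_RE.search(line)
        if not match:
            continue  # e.g. [api] lines or unrelated output
        try:
            payload = json.loads(match.group("payload"))
        except json.JSONDecodeError:
            continue  # payload was truncated or is not valid JSON
        service = payload.get("service", {}).get("serviceName", "?")
        counts[(service, match.group("name"))] += 1
    return counts
```

A high and growing count of "Service kill" and "Service start" for one service would point at the Supervisor; a count that stays flat while the container keeps restarting would point at the engine or the application.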

20k-ultra · January 12, 2021, 8:47pm

Additionally, you can also check if the Engine is having issues by checking for logs there with: journalctl -fn 50 -u balena.service -t balenad

jweide · January 12, 2021, 11:34pm

Thank you for comprehensively addressing many of the messages and themes ongoing with this issue, because it helps bring into focus what aspects of the issue are important.

In reply to many of the components of your message

Regarding the supervisors role in starting containers:
Good to know that the supervisor’s role is only to download, create, and run containers, but NOT to restart them. This redirects my focus away from the supervisor and onto the balena/docker engine. There is one caveat, which is the supervisor’s role in logging, which I will address further down in my reply.

Regarding the exit code of the agent container, I’ve attached two of those to this defect. In the first case, we see:

“State”: {
“Status”: “running”,
“Running”: true,
“Paused”: false,
“Restarting”: false,
“OOMKilled”: false,
“Dead”: false,
“Pid”: 27829,
“ExitCode”: 0,
“Error”: “”,
“StartedAt”: “2021-01-08T03:51:03.489862922Z”,
“FinishedAt”: “2021-01-08T03:50:58.401232618Z”
},

In the second case, we see:

"State": {
            "Status": "restarting",
            "Running": true,
            "Paused": false,
            "Restarting": true,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 0,
            "ExitCode": 1,
            "Error": "",
            "StartedAt": "2021-01-12T07:43:58.087910729Z",
            "FinishedAt": "2021-01-12T07:44:51.445698099Z"
},

I’m making a couple observations, contrasting each of those events:

In the case of the 1st event:

The Status=”running”; Running=true; Restarting=false
The PID is non-zero (yet cannot be found in output of ‘ps’)
The Exit Code is 0 (graceful termination)
The “StartedAt” time occurs AFTER the “FinishedAt” time.

In the case of the 2nd event:

The Status=“restarting” (vs running)
The PID = 0 (vs. non-zero)
The Exit code is 1 (the application exited with error) (vs 0 = graceful exit)
The “StartedAt” time occurs BEFORE the “FinishedAt” time. (vs. After)
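
The two signatures above could be checked automatically along these lines. This is only a heuristic sketch built from the two State blocks in this thread: the function names, the `pid_alive` callback, and the 10-minute grace period are all assumptions, not anything provided by balena.

```python
from datetime import datetime


def parse_engine_time(value: str) -> datetime:
    """Parse an engine timestamp such as 2021-01-08T03:51:03.489862922Z."""
    base, _, fraction = value.rstrip("Z").partition(".")
    # strptime handles at most microseconds, so trim nanoseconds to 6 digits.
    micros = (fraction + "000000")[:6]
    return datetime.strptime(f"{base}.{micros}", "%Y-%m-%dT%H:%M:%S.%f")


def looks_stuck(state: dict, pid_alive, now: datetime,
                grace_s: float = 600.0) -> bool:
    """Heuristic: does the engine's State disagree with what is really running?

    state     -- the "State" object from `balena inspect`
    pid_alive -- callable taking a PID and returning True if it exists
    now       -- current time as a naive UTC datetime
    """
    status = state["Status"]
    if status == "running":
        # 1st event: reported as running, but the recorded PID is gone.
        return not pid_alive(state["Pid"])
    if status == "restarting":
        # 2nd event: the process exited (Pid 0) and no restart has followed
        # within the grace period.
        finished = parse_engine_time(state["FinishedAt"])
        return state["Pid"] == 0 and (now - finished).total_seconds() > grace_s
    return False
```

On the host, `pid_alive` could be as simple as checking whether `/proc/<pid>` exists, and `now` can be obtained with `datetime.now(timezone.utc).replace(tzinfo=None)`.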

If I were to characterize the runtime behavior of the agent application, it would be that it fails or exits upon almost any condition that might require more than the most straightforward handling code. It does not handle or retry most things; it simply dies and depends on the engine to restart it. The reason is that we want to avoid making stateful decisions in the agent, and if something about the running environment or related services changes, we would prefer to die and restart. This being the case, the agent is routinely crashing or exiting and restarting by design. We rely on the engine restarting the agent for it to accomplish its purpose.

Regarding the logging service:
We have a separate logging container. It runs a python script that makes use of the supervisor api defined here: https://www.balena.io/docs/reference/supervisor/supervisor-api/#journald-logs

Relevant part of the implementation:

    request_body = {"follow":"true","all":"true"}
    requests.post(f"{balena_supervisor_address}/v2/journal-logs?apikey={api_key}", data=request_body, stream=True, hooks={'response': write_logs})

So we set follow=true and all=true. If I’m following you correctly in your reply, by setting all=true it sounds like perhaps we can get into trouble? To check, I looked at the logging container on the device I have held in this condition (where the container says it’s restarting) and verified that in fact at a minimum the resin-supervisor, balenad, and our influx container still have logs streaming into our local log files. This was not the case on some of our initial encounters with this issue, but I will continue to monitor and see if the failure mode is consistent.

Running balena logs does show the complete log history for the two containers I checked, the influx container and the agent container (this is contrary to my early encounters with this issue). In fact, I can see that for this latest hit of the issue, the agent container was basically in a restart loop because it couldn’t reach the influx database. The agent container was successfully restarted many times, but then finally hit this bug and for some reason wasn’t able to restart.

In the case of the device I’m presently holding in this condition, the agent shut down intentionally, but set the exit code to 1. Here are the logs we see:

2021/01/12 07:44:30 influxdb2client E! Write error: Post "http://localhost:8086/api/v2/write?bucket=partial_samples%2Fautogen&org=&precision=ns": context deadline exceeded
2021-01-12 07:44:30 CRITICAL agent util.go:47 "persister" encountered a critical error: Post "http://localhost:8086/api/v2/write?bucket=partial_samples%2Fautogen&org=&precision=ns": context deadline exceeded
2021-01-12 07:44:30 CRITICAL agent util.go:52 Signaling application to shutdown
2021-01-12 07:44:30 INFO agent cmd.go:483 Exiting partial sample persister goroutine
2021-01-12 07:44:30 INFO agent cmd.go:501 Exiting cloud persister goroutine
2021-01-12 07:44:31 INFO agent cmd.go:465 Exiting metrics server goroutine
2021-01-12 07:44:31 CRITICAL  main.go:27 Post "http://localhost:8086/api/v2/write?bucket=partial_samples%2Fautogen&org=&precision=ns": context deadline exceeded

Here is the source that generated those last logs:

27		rootLogger.Criticalf(err.Error())
28		os.Exit(1)

Clearly I’m not asking you to debug my application, but I do want to bring the focus back to the core problem we are trying to investigate. Regardless of how the container exits, the engine should always be able to stop and start containers. In its present state, on the device I am holding, using “balena kill” cannot kill my container:

:~# balena kill agent_3134344_1651895
^Z
[1]+  Stopped                 balena kill agent_3134344_1651895
:~# bg
[1]+ balena kill agent_3134344_1651895 & // < I backgrounded this process for 2+ minutes
:~# balena ps -a
CONTAINER ID        IMAGE                                                                   COMMAND                  CREATED             STATUS                        PORTS                  NAMES
798624c85fad        registry2.balena-cloud.com/v2/e5a9b95f08373b03979ee007184e4753:latest   "/usr/src/app/entry.…"   15 hours ago        Up 15 hours (healthy)                                resin_supervisor
6207a243973e        02f603c54d98                                                            "/entrypoint.sh"         5 days ago          Up 16 hours                   0.0.0.0:2222->22/tcp   scp_3134349_1651895
93f60708b243        e5364636fadf                                                            "/entrypoint.sh"         6 days ago          Up 16 hours                                          grafana_3134353_1651895
03963d43d718        e36fce813fa7                                                            "/entrypoint.sh infl…"   7 days ago          Up 6 days                                            influxdb_3134354_1651895
c53024946cea        e4ccd877f611                                                            "/entrypoint_with_wa…"   13 days ago         Up 16 hours                                          telegraf_3134352_1651895
8ff49cf7f1bf        8d80652d5ac6                                                            "sleep infinity"         13 days ago         Up 6 days                                            inspect_3134348_1651895
ab33e5ce9662        158780b9dd8a                                                            "/trex/edge control"     13 days ago         Up 16 hours                                          control_3134341_1651895
f80ff8dd101f        158780b9dd8a                                                            "/trex/edge capture"     13 days ago         Up 16 hours                                          capture_3134342_1651895
177de4fa331f        158780b9dd8a                                                            "/trex/edge agent"       13 days ago         Restarting (1) 16 hours ago                          agent_3134344_1651895
59b1406b0185        c9767211b5b7                                                            "./grpc-dfuTool"         2 weeks ago         Up 16 hours                                          sensor-firmware-flasher_3134345_1651895
69218d46f0d6        860c57b07dd9                                                            "sleep infinity"         3 weeks ago         Up 6 days                                            co-processor-flasher_3134343_1651895
db27076ca499        48a011276016                                                            "/entrypoint.sh"         3 weeks ago         Up 6 days (healthy)                                  hw-init_3134346_1651895
603ab8c1c607        9f49ac54288d                                                            "python3 logger.py"      3 weeks ago         Up 15 hours                                          logger_3134347_1651895
9177ecb1cf3c        5b31debf8912                                                            "/bin/sh -c /app/con…"   3 weeks ago         Up 6 days                                            connection-watchdog_3134350_1651895
~#

Am I wrong to expect the ‘balena kill’ command to always be successful?
Am I wrong to expect the balena engine to ‘always’ restart containers that have exited based on the policy defined in docker-compose?
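
A hung `balena kill` like the one shown above can also be detected from a script rather than by backgrounding it by hand. The following is only a sketch: `run_with_timeout` is a hypothetical helper, and any time limit passed to it is an arbitrary choice, not a value documented by the engine.

```python
import subprocess


def run_with_timeout(command, timeout_seconds):
    """Run command; return its exit code, or None if it did not finish in time."""
    try:
        completed = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this exception.
        return None
    return completed.returncode
```

On the host, `run_with_timeout(["balena", "kill", "agent_3134344_1651895"], 30)` returning `None` would correspond to the hang seen in the session above. Note that the timeout only kills the CLI process; it does not change whatever the engine itself is blocked on.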

I just took a look at our docker-compose file and noticed that only 3 services (out of 13) specify any restart policy, and those all explicitly specify “restart: always”. I believe that the default policy for the others is also “always”. I’m going to go ahead and explicitly specify “always” for the others, but I’m not expecting that to fix anything, especially since I can’t manually restart this container without restarting the balena service.

20k-ultra · January 13, 2021, 4:10am

These are some great observations! I applaud you for being so analytical.

Just to confirm, it seems that the container is exiting as a result of being unable to connect to the InfluxDB service. Once that is resolved, everything should be functional again.

However, you want to point out that even while that issue is going on, when the engine has a container in this volatile state of rapidly restarting, it is not possible to send a stop or kill to halt the restarting. Additionally, the discrepancies pointed out in your inspection of a container show some illogical results (i.e., StartedAt being after FinishedAt).

Can you confirm I am understanding what I should focus on helping with?

jweide · January 13, 2021, 3:30pm

Here is what I want to debug:

Why is this agent container “stuck” in the restarting state for many many hours without actually restarting? (Alternatively, why is ‘balena ps’ showing the container state is “running” with a non-zero PID, but the ‘ps’ command does not show any host process for the agent application, or any application with that PID)
Why can I not use ‘balena kill’ to manually clean up the stuck container, regardless of its state being “restarting” or “running” (This bit also impacts deploys - because the supervisor also cannot manage these containers)
Why is it that sometimes the logging API from the supervisor stops sending logs when we get into this state? (I need to reproduce this bit again - because the device I’m holding did not exhibit that behavior. This point is also far less important than the first two)

If we can address the first two items I think that will go a long way.

ab77 · January 13, 2021, 11:20pm

Hi Jeff, is simply (re)flashing this problematic device a realistic option for you at this stage? It would reset it back to a known good state, and potentially give us some pointers whether or not there are any external issues, networking, hardware, etc…

jweide · January 13, 2021, 11:34pm

As already stated in the original bug report: "In a deployment of 25 devices, we have encountered this issue no less than 10 times within the last 2 weeks."

So unless there is something broken in common with all these devices, I don’t think going down that path will yield fruit.

I do however have a device that I am holding in this bad state, if an engineer would like to log in and take a look.

I am also willing to have a phone call if that helps clarify things.

ab77 · January 13, 2021, 11:38pm

No worries, this ticket now has quite a lot of information, so I’m starting to miss things. Are you able to grant support access to one or more of the affected devices/applications, so we can take a look?

ab77 · January 14, 2021, 12:34am

Also, if you could please share your docker-compose file in a private message (DM) as well as the device uuid.

jweide · January 15, 2021, 9:26pm

I’ve hit this issue on several other devices today. Here is some debugging of one of them. I’m really not so much debugging as I am documenting that it’s following a consistent pattern of symptoms.

In short:

‘balena ps’ shows agent is running
‘balena inspect’ shows “Status”: “running”, “Pid”: 13423, “ExitCode”: 0, “StartedAt” occurs chronologically AFTER “FinishedAt”
‘ps | grep 13423’ shows no process with PID 13423 (from balena inspect output)
‘ps | grep agent’ shows no agent process running
‘systemctl restart resin-supervisor’ did not recover the problem. In fact it might have caused the logging container to break for a short time but I only noticed after going through the backscroll
‘systemctl restart balena’ actually does recover the issue without a system reboot.
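
The mismatch in the list above (the engine reports a running container with a PID, but the host has no such process) can be checked mechanically. This is a rough sketch under stated assumptions: `pid_exists` and `find_phantom_containers` are hypothetical names, the input is the JSON text printed by `balena inspect`, and the `/proc` lookup only makes sense when run in the host OS PID namespace on Linux.

```python
import json
import os


def pid_exists(pid):
    """Return True if the host has a /proc entry for pid (Linux only)."""
    return pid > 0 and os.path.exists(f"/proc/{pid}")


def find_phantom_containers(inspect_output, pid_check=pid_exists):
    """Return names of containers reported as running whose PID is not alive.

    inspect_output is the JSON text printed by `balena inspect <name>...`,
    which is a list with one object per container.
    """
    phantoms = []
    for container in json.loads(inspect_output):
        state = container["State"]
        # "Running": true with a PID that no longer exists is the symptom
        # described in this thread.
        if state["Running"] and not pid_check(state["Pid"]):
            phantoms.append(container["Name"].lstrip("/"))
    return phantoms
```

The `pid_check` parameter exists so the logic can be exercised without a real device. On a device, the output of `balena inspect` for the containers of interest would be passed in unchanged.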

Supporting logs from the debugged device:

Connecting to <edited>
Spawning shell...
=============================================================
    Welcome to balenaOS
=============================================================
root@:~# balena ps
CONTAINER ID        IMAGE                                                                   COMMAND                  CREATED             STATUS                 PORTS                  NAMES
8953abfc2c7e        158780b9dd8a                                                            "/trex/edge capture"     4 minutes ago       Up 4 minutes                                  capture_3134342_1651895
bcf6a225289a        registry2.balena-cloud.com/v2/e5a9b95f08373b03979ee007184e4753:latest   "/usr/src/app/entry.…"   6 hours ago         Up 6 hours (healthy)                          resin_supervisor
5c75768374fa        02f603c54d98                                                            "/entrypoint.sh"         8 days ago          Up 2 days              0.0.0.0:2222->22/tcp   scp_3134349_1651895
9511189c9075        e5364636fadf                                                            "/entrypoint.sh"         2 weeks ago         Up 2 days                                     grafana_3134353_1651895
06149a25efd4        8d80652d5ac6                                                            "sleep infinity"         2 weeks ago         Up 2 days                                     inspect_3134348_1651895
00d0665c3a4a        e4ccd877f611                                                            "/entrypoint_with_wa…"   2 weeks ago         Up 2 days                                     telegraf_3134352_1651895
37573393002b        e36fce813fa7                                                            "/entrypoint.sh infl…"   2 weeks ago         Up 2 days                                     influxdb_3134354_1651895
77d3bc897dfa        158780b9dd8a                                                            "/trex/edge agent"       2 weeks ago         Up 30 hours                                   agent_3134344_1651895
42959deacde3        158780b9dd8a                                                            "/trex/edge control"     2 weeks ago         Up 2 days                                     control_3134341_1651895
0322245e1c9d        860c57b07dd9                                                            "sleep infinity"         2 weeks ago         Up 2 days                                     co-processor-flasher_3134343_1651895
47734dad0887        c9767211b5b7                                                            "./grpc-dfuTool"         2 weeks ago         Up 2 days                                     sensor-firmware-flasher_3134345_1651895
abf182a00346        9f49ac54288d                                                            "python3 logger.py"      2 weeks ago         Up 6 hours                                    logger_3134347_1651895
edc0a36d0059        5b31debf8912                                                            "/bin/sh -c /app/con…"   2 weeks ago         Up 2 days                                     connection-watchdog_3134350_1651895
d70dc3638c7f        48a011276016                                                            "/entrypoint.sh"         2 weeks ago         Up 2 days (healthy)                           hw-init_3134346_1651895
root@:~# ps | grep agent
 7287 root      2276 S    grep agent
root@:~# system 
systemctl                       systemd-detect-virt             systemd-mount                   systemd-sysusers
systemd-analyze                 systemd-escape                  systemd-notify                  systemd-tmpfiles
systemd-ask-password            systemd-firstboot               systemd-nspawn                  systemd-tty-ask-password-agent
systemd-cat                     systemd-hwdb                    systemd-path                    systemd-umount
systemd-cgls                    systemd-id128                   systemd-run                     
systemd-cgtop                   systemd-inhibit                 systemd-socket-activate         
systemd-delta                   systemd-machine-id-setup        systemd-stdio-bridge            
root@:~# system
systemctl                       systemd-detect-virt             systemd-mount                   systemd-sysusers
systemd-analyze                 systemd-escape                  systemd-notify                  systemd-tmpfiles
systemd-ask-password            systemd-firstboot               systemd-nspawn                  systemd-tty-ask-password-agent
systemd-cat                     systemd-hwdb                    systemd-path                    systemd-umount
systemd-cgls                    systemd-id128                   systemd-run                     
systemd-cgtop                   systemd-inhibit                 systemd-socket-activate         
systemd-delta                   systemd-machine-id-setup        systemd-stdio-bridge            
root@:~# systemctl restart resin-supervisor
root@:~# balena ps -a
CONTAINER ID        IMAGE                                                                   COMMAND                  CREATED             STATUS                            PORTS                  NAMES
20d1b8efebf9        registry2.balena-cloud.com/v2/e5a9b95f08373b03979ee007184e4753:latest   "/usr/src/app/entry.…"   5 seconds ago       Up 2 seconds (health: starting)                          resin_supervisor
8953abfc2c7e        158780b9dd8a                                                            "/trex/edge capture"     5 minutes ago       Up 5 minutes                                             capture_3134342_1651895
5c75768374fa        02f603c54d98                                                            "/entrypoint.sh"         8 days ago          Up 2 days                         0.0.0.0:2222->22/tcp   scp_3134349_1651895
9511189c9075        e5364636fadf                                                            "/entrypoint.sh"         2 weeks ago         Up 2 days                                                grafana_3134353_1651895
06149a25efd4        8d80652d5ac6                                                            "sleep infinity"         2 weeks ago         Up 2 days                                                inspect_3134348_1651895
00d0665c3a4a        e4ccd877f611                                                            "/entrypoint_with_wa…"   2 weeks ago         Up 2 days                                                telegraf_3134352_1651895
37573393002b        e36fce813fa7                                                            "/entrypoint.sh infl…"   2 weeks ago         Up 2 days                                                influxdb_3134354_1651895
77d3bc897dfa        158780b9dd8a                                                            "/trex/edge agent"       2 weeks ago         Up 30 hours                                              agent_3134344_1651895
42959deacde3        158780b9dd8a                                                            "/trex/edge control"     2 weeks ago         Up 2 days                                                control_3134341_1651895
0322245e1c9d        860c57b07dd9                                                            "sleep infinity"         2 weeks ago         Up 2 days                                                co-processor-flasher_3134343_1651895
47734dad0887        c9767211b5b7                                                            "./grpc-dfuTool"         2 weeks ago         Up 2 days                                                sensor-firmware-flasher_3134345_1651895
abf182a00346        9f49ac54288d                                                            "python3 logger.py"      2 weeks ago         Restarting (1) 1 second ago                              logger_3134347_1651895
edc0a36d0059        5b31debf8912                                                            "/bin/sh -c /app/con…"   2 weeks ago         Up 2 days                                                connection-watchdog_3134350_1651895
d70dc3638c7f        48a011276016                                                            "/entrypoint.sh"         2 weeks ago         Up 2 days (healthy)                                      hw-init_3134346_1651895
root@:~# ps | grep agent
 8283 root      2276 R    grep agent
root@:~# balena inspect agent_3134344_1651895
[
    {
        "Id": "77d3bc897dfa281fd94329bff555fea801679c8a996c383da0687e8e714bb19e",
        "Created": "2020-12-30T20:44:33.7734142Z",
        "Path": "/trex/edge",
        "Args": [
            "agent"
        ],
        "State": {
            "Status": "running",
            "Running": true,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 13423,
            "ExitCode": 0,
            "Error": "",
            "StartedAt": "2021-01-14T15:42:09.338221817Z",
            "FinishedAt": "2021-01-14T15:42:05.859593611Z"
        },
        "Image": "sha256:158780b9dd8a4569fa86dfa8d134f584dea1b7e1dc51170634f32a94b841e1e0",
        "ResolvConfPath": "/var/lib/docker/containers/77d3bc897dfa281fd94329bff555fea801679c8a996c383da0687e8e714bb19e/resolv.conf",
        "HostnamePath": "/var/lib/docker/containers/77d3bc897dfa281fd94329bff555fea801679c8a996c383da0687e8e714bb19e/hostname",
        "HostsPath": "/var/lib/docker/containers/77d3bc897dfa281fd94329bff555fea801679c8a996c383da0687e8e714bb19e/hosts",
        "LogPath": "",
        "Name": "/agent_3134344_1651895",
        "RestartCount": 3,
        "Driver": "aufs",
        "Platform": "linux",
        "MountLabel": "",
        "ProcessLabel": "",
        "AppArmorProfile": "",
        "ExecIDs": null,
        "HostConfig": {
            "Binds": [
                "/tmp/balena-supervisor/services/1499460/agent:/tmp/resin",
                "/tmp/balena-supervisor/services/1499460/agent:/tmp/balena"
            ],
            "ContainerIDFile": "",
            "ContainerIDEnv": "",
            "LogConfig": {
                "Type": "journald",
                "Config": {}
            },
            "NetworkMode": "host",
            "PortBindings": {},
            "RestartPolicy": {
                "Name": "always",
                "MaximumRetryCount": 0
            },
            "AutoRemove": false,
            "VolumeDriver": "",
            "VolumesFrom": null,
            "CapAdd": [],
            "CapDrop": [],
            "Capabilities": null,
            "Dns": [],
            "DnsOptions": [],
            "DnsSearch": [],
            "ExtraHosts": [],
            "GroupAdd": [],
            "IpcMode": "shareable",
            "Cgroup": "",
            "Links": null,
            "OomScoreAdj": 0,
            "PidMode": "",
            "Privileged": true,
            "PublishAllPorts": false,
            "ReadonlyRootfs": false,
            "SecurityOpt": [],
            "UTSMode": "",
            "UsernsMode": "",
            "ShmSize": 67108864,
            "Runtime": "runc",
            "ConsoleSize": [
                0,
                0
            ],
            "Isolation": "",
            "CpuShares": 0,
            "Memory": 0,
            "NanoCpus": 0,
            "CgroupParent": "",
            "BlkioWeight": 0,
            "BlkioWeightDevice": null,
            "BlkioDeviceReadBps": null,
            "BlkioDeviceWriteBps": null,
            "BlkioDeviceReadIOps": null,
            "BlkioDeviceWriteIOps": null,
            "CpuPeriod": 0,
            "CpuQuota": 0,
            "CpuRealtimePeriod": 0,
            "CpuRealtimeRuntime": 0,
            "CpusetCpus": "",
            "CpusetMems": "",
            "Devices": [],
            "DeviceCgroupRules": null,
            "DeviceRequests": [],
            "KernelMemory": 0,
            "KernelMemoryTCP": 0,
            "MemoryReservation": 0,
            "MemorySwap": 0,
            "MemorySwappiness": null,
            "OomKillDisable": false,
            "PidsLimit": null,
            "Ulimits": [],
            "CpuCount": 0,
            "CpuPercent": 0,
            "IOMaximumIOps": 0,
            "IOMaximumBandwidth": 0,
            "MaskedPaths": null,
            "ReadonlyPaths": null
        },
        "GraphDriver": {
            "Data": null,
            "Name": "aufs"
        },
        "Mounts": [
            {
                "Type": "bind",
                "Source": "/tmp/balena-supervisor/services/1499460/agent",
                "Destination": "/tmp/balena",
                "Mode": "",
                "RW": true,
                "Propagation": "rprivate"
            },
            {
                "Type": "bind",
                "Source": "/tmp/balena-supervisor/services/1499460/agent",
                "Destination": "/tmp/resin",
                "Mode": "",
                "RW": true,
                "Propagation": "rprivate"
            }
        ],
        "Config": {
            "Hostname": "<edited>",
            "Domainname": "",
            "User": "",
            "AttachStdin": false,
            "AttachStdout": false,
            "AttachStderr": false,
            "Tty": true,
            "OpenStdin": false,
            "StdinOnce": false,
            "Env": [
                "RESIN_DEVICE_NAME_AT_INIT=crimson-fire",
                "BALENA_DEVICE_NAME_AT_INIT=crimson-fire",
                "INFLUXDB_HOST=127.0.0.1:8087",
                "WAIT_FOR_INFLUXDB=true",
                "AGENT_CAPTURE_URL=tcp://localhost:10000",
                "AGENT_INFLUXDB_URL=http://localhost:8086",
                "TREX_LOG_LEVEL=INFO",
                "WELL_API_NUMBER=00-000-00000",
                "AGENT_KAFKA_URL=tcp://broker.trex.petropower.com:9092",
                "AGENT_LOG_LEVEL=INFO",
                "AGENT_PARTIAL_SAMPLES_DATABASE=partial_samples/autogen",
                "BALENA_APP_ID=1499460",
                "BALENA_APP_NAME=Trex",
                "BALENA_SERVICE_NAME=agent",
                "BALENA_DEVICE_UUID=<edited>",
                "BALENA_DEVICE_TYPE=fincm3",
                "BALENA_DEVICE_ARCH=armv7hf",
                "BALENA_HOST_OS_VERSION=balenaOS 2.58.3+rev1",
                "BALENA_SUPERVISOR_VERSION=11.14.0",
                "BALENA_APP_LOCK_PATH=/tmp/balena/updates.lock",
                "BALENA=1",
                "RESIN_APP_ID=1499460",
                "RESIN_APP_NAME=Trex",
                "RESIN_SERVICE_NAME=agent",
                "RESIN_DEVICE_UUID=<edited>",
                "RESIN_DEVICE_TYPE=fincm3",
                "RESIN_DEVICE_ARCH=armv7hf",
                "RESIN_HOST_OS_VERSION=balenaOS 2.58.3+rev1",
                "RESIN_SUPERVISOR_VERSION=11.14.0",
                "RESIN_APP_LOCK_PATH=/tmp/balena/updates.lock",
                "RESIN=1",
                "RESIN_SERVICE_KILL_ME_PATH=/tmp/balena/handover-complete",
                "BALENA_SERVICE_HANDOVER_COMPLETE_PATH=/tmp/balena/handover-complete",
                "USER=root",
                "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
            ],
            "Cmd": [
                "agent"
            ],
            "Healthcheck": {
                "Test": [
                    "NONE"
                ]
            },
            "Image": "sha256:158780b9dd8a4569fa86dfa8d134f584dea1b7e1dc51170634f32a94b841e1e0",
            "Volumes": null,
            "WorkingDir": "",
            "Entrypoint": [
                "/trex/edge"
            ],
            "OnBuild": null,
            "Labels": {
                "com.petropower.component.name": "agent",
                "com.petropower.component.version": "421",
                "io.balena.app-id": "1499460",
                "io.balena.service-id": "682262",
                "io.balena.service-name": "agent",
                "io.balena.supervised": "true"
            },
            "StopSignal": "SIGTERM",
            "StopTimeout": 10
        },
        "NetworkSettings": {
            "Bridge": "",
            "SandboxID": "c2adbf09584caaeb34781d739ac340c1cec24ff71026ae3e006435e750ea6136",
            "HairpinMode": false,
            "LinkLocalIPv6Address": "",
            "LinkLocalIPv6PrefixLen": 0,
            "Ports": {},
            "SandboxKey": "/var/run/balena-engine/netns/default",
            "SecondaryIPAddresses": null,
            "SecondaryIPv6Addresses": null,
            "EndpointID": "",
            "Gateway": "",
            "GlobalIPv6Address": "",
            "GlobalIPv6PrefixLen": 0,
            "IPAddress": "",
            "IPPrefixLen": 0,
            "IPv6Gateway": "",
            "MacAddress": "",
            "Networks": {
                "host": {
                    "IPAMConfig": null,
                    "Links": null,
                    "Aliases": null,
                    "NetworkID": "a1b0d489f69d651e7b7054b5bb74622acbb3254cc18986ba57befbdb5d798287",
                    "EndpointID": "1f7a116d7379c0ee8ce7e93d8d1d022716176d0bef32f8e63b7bed7319e9ada9",
                    "Gateway": "",
                    "IPAddress": "",
                    "IPPrefixLen": 0,
                    "IPv6Gateway": "",
                    "GlobalIPv6Address": "",
                    "GlobalIPv6PrefixLen": 0,
                    "MacAddress": "",
                    "DriverOpts": null
                }
            }
        }
    }
]
root@:~# ps | grep 13423
 8381 root      2276 S    grep 13423
root@:~# systemctl restart balena
root@:~# balena ps
CONTAINER ID        IMAGE                                                                   COMMAND                  CREATED             STATUS                             PORTS                  NAMES
20d1b8efebf9        registry2.balena-cloud.com/v2/e5a9b95f08373b03979ee007184e4753:latest   "/usr/src/app/entry.…"   2 minutes ago       Up 15 seconds (health: starting)                          resin_supervisor
8953abfc2c7e        158780b9dd8a                                                            "/trex/edge capture"     7 minutes ago       Up 20 seconds                                             capture_3134342_1651895
5c75768374fa        02f603c54d98                                                            "/entrypoint.sh"         8 days ago          Up 12 seconds                      0.0.0.0:2222->22/tcp   scp_3134349_1651895
9511189c9075        e5364636fadf                                                            "/entrypoint.sh"         2 weeks ago         Up 15 seconds                                             grafana_3134353_1651895
06149a25efd4        8d80652d5ac6                                                            "sleep infinity"         2 weeks ago         Up 18 seconds                                             inspect_3134348_1651895
00d0665c3a4a        e4ccd877f611                                                            "/entrypoint_with_wa…"   2 weeks ago         Restarting (124) 1 second ago                             telegraf_3134352_1651895
37573393002b        e36fce813fa7                                                            "/entrypoint.sh infl…"   2 weeks ago         Up 15 seconds                                             influxdb_3134354_1651895
77d3bc897dfa        158780b9dd8a                                                            "/trex/edge agent"       2 weeks ago         Up 16 seconds                                             agent_3134344_1651895
42959deacde3        158780b9dd8a                                                            "/trex/edge control"     2 weeks ago         Up 15 seconds                                             control_3134341_1651895
0322245e1c9d        860c57b07dd9                                                            "sleep infinity"         2 weeks ago         Up 12 seconds                                             co-processor-flasher_3134343_1651895
47734dad0887        c9767211b5b7                                                            "./grpc-dfuTool"         2 weeks ago         Up 16 seconds                                             sensor-firmware-flasher_3134345_1651895
abf182a00346        9f49ac54288d                                                            "python3 logger.py"      2 weeks ago         Up 4 seconds                                              logger_3134347_1651895
edc0a36d0059        5b31debf8912                                                            "/bin/sh -c /app/con…"   2 weeks ago         Up 11 seconds                                             connection-watchdog_3134350_1651895
d70dc3638c7f        48a011276016                                                            "/entrypoint.sh"         2 weeks ago         Up 16 seconds (health: starting)                          hw-init_3134346_1651895
root@:~# ps | grep agent
 9822 root      873m S    /trex/edge agent
12300 root      2276 S    grep agent
root@:~# ^C
root@:~#
