Sorry, I didn’t mean to distract from the original issue by mentioning deployments. This issue can occur naturally without any external interaction with the balena device.
For example: I have a device XYZ. This device’s supervisor was upgraded 6 weeks ago. The most recent deploy was perhaps 2 weeks ago. The device was fully operational at the conclusion of the deploy and has had no external intervention since, but at midnight last night when the device was supposed to upload metrics to the cloud, this device did not. An investigation the next morning shows that the “agent” service has been in the “restarting” state for 27 hours, meaning that sometime in the early morning the prior day, this device got into a state that caused some services to fail and not restart, but was healthy enough to show green status on the balena dashboard. An attempt to look at the logs collected through the supervisor API show that no logs have come through for the same 27 hours, even though many of the services are still running.
So now, getting away from the example story of the problem, I have a few updates from our internal efforts and I must mention that the original problem remains but some of our symptoms are a little different.
On this device we use InfluxDB. When influxDB starts on an initial deployment, before any database is created, it’s necessary to delay the start of the “agent” service until a separate staging script runs to do the initial configuration of influxDB. We modified the influx entrypoint to start influx, test when it becomes ready, test that is has been correctly staged and ready for the agent to connect, and then we open a tcp listen port using socat in the influx container. Once the socat listen port is open, the agent container is able to see that it’s safe to connect to the influx instance. We had a problem where if something went wrong in the influx container (for example, it was killed due to a resource problem) the socat socket was in the CLOSE_WAIT state and could not be terminated. This caused the influx container to be in a bad state. We corrected the problem by cleanly closing the socat socket, which no allows the influx container to cleanly close and restart. We believe that this issue was unrelated to our agent container problem (getting stuck in the restarting state) but somehow, by fixing this problem with socat in the influx container, the agent container no longer gets stuck “restarting” state, but does continue to fail to restart if the entrypoint process exits.
I think a new and more appropriate headline for this issue would be “Container entrypoint terminates, but the container is not restarted”.
You can see this behavior from the following log output on on of our devices that reproduced the issue this morning:
Observations:
- The container “agent_3134344_1651895” shows “Up 13 hours” upon initial login
- grepping ps for the agent container entrypoint shows no running process matching the container.
- The ‘balena kill’ command succeeds but a subsequent “balena ps” shows the container still “Up 13 hours”
- balena inspect shows the state is “Status”: “running”, and “Running”: true, and “Pid”: 27829,
- ‘ps | grep 27829’ shows nothing using PID 27829
- ‘systemctl restart balena’ resulted in things working again.
- Contrary to earlier reports, the logging service actually continued to work now that we resolved the socat issue in the influx container.
Logs are below
=============================================================
Welcome to balenaOS
root@4ef2f72:~# balena ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
472710a7a738 registry2.balena-cloud.com/v2/e5a9b95f08373b03979ee007184e4753:latest “/usr/src/app/entry.…” 13 hours ago Up 13 hours (healthy) resin_supervisor
71d597882a28 02f603c54d98 “/entrypoint.sh” 23 hours ago Up 23 hours 0.0.0.0:2222->22/tcp scp_3134349_1651895
68bf0adab5d4 e5364636fadf “/entrypoint.sh” 8 days ago Up 2 days grafana_3134353_1651895
b96312a92de1 e4ccd877f611 “/entrypoint_with_wa…” 8 days ago Up 2 days telegraf_3134352_1651895
da24521e0155 8d80652d5ac6 “sleep infinity” 8 days ago Up 2 days inspect_3134348_1651895
50ad55275d8c e36fce813fa7 “/entrypoint.sh infl…” 8 days ago Up 2 days influxdb_3134354_1651895
1f78b10e5b5d 158780b9dd8a “/trex/edge agent” 8 days ago Up 13 hours agent_3134344_1651895
4fa6876d14be 158780b9dd8a “/trex/edge control” 8 days ago Up 2 days control_3134341_1651895
9fb70291c035 158780b9dd8a “/trex/edge capture” 8 days ago Up 2 days capture_3134342_1651895
9441093ec07d 860c57b07dd9 “sleep infinity” 11 days ago Up 2 days co-processor-flasher_3134343_1651895
3af1f6e5d436 c9767211b5b7 “./grpc-dfuTool” 11 days ago Up 2 days sensor-firmware-flasher_3134345_1651895
d15ed98e4d06 9f49ac54288d “python3 logger.py” 11 days ago Up 13 hours logger_3134347_1651895
9c1c8640aede 48a011276016 “/entrypoint.sh” 11 days ago Up 13 hours (healthy) hw-init_3134346_1651895
23abad88c563 5b31debf8912 “/bin/sh -c /app/con…” 11 days ago Up 2 days connection-watchdog_3134350_1651895
root@4ef2f72:~# ps | grep edge
2049 root 890m S /trex/edge control
2339 root 987m S /trex/edge capture
19492 root 2276 S grep edge
root@4ef2f72:~# balena kill 1f78b10e5b5d
1f78b10e5b5d
root@4ef2f72:~# balena ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
472710a7a738 registry2.balena-cloud.com/v2/e5a9b95f08373b03979ee007184e4753:latest “/usr/src/app/entry.…” 13 hours ago Up 13 hours (healthy) resin_supervisor
71d597882a28 02f603c54d98 “/entrypoint.sh” 23 hours ago Up 23 hours 0.0.0.0:2222->22/tcp scp_3134349_1651895
68bf0adab5d4 e5364636fadf “/entrypoint.sh” 8 days ago Up 2 days grafana_3134353_1651895
b96312a92de1 e4ccd877f611 “/entrypoint_with_wa…” 8 days ago Up 2 days telegraf_3134352_1651895
da24521e0155 8d80652d5ac6 “sleep infinity” 8 days ago Up 2 days inspect_3134348_1651895
50ad55275d8c e36fce813fa7 “/entrypoint.sh infl…” 8 days ago Up 2 days influxdb_3134354_1651895
1f78b10e5b5d 158780b9dd8a “/trex/edge agent” 8 days ago Up 13 hours agent_3134344_1651895
4fa6876d14be 158780b9dd8a “/trex/edge control” 8 days ago Up 2 days control_3134341_1651895
9fb70291c035 158780b9dd8a “/trex/edge capture” 8 days ago Up 2 days capture_3134342_1651895
9441093ec07d 860c57b07dd9 “sleep infinity” 11 days ago Up 2 days co-processor-flasher_3134343_1651895
3af1f6e5d436 c9767211b5b7 “./grpc-dfuTool” 11 days ago Up 2 days sensor-firmware-flasher_3134345_1651895
d15ed98e4d06 9f49ac54288d “python3 logger.py” 11 days ago Up 13 hours logger_3134347_1651895
9c1c8640aede 48a011276016 “/entrypoint.sh” 11 days ago Up 13 hours (healthy) hw-init_3134346_1651895
23abad88c563 5b31debf8912 “/bin/sh -c /app/con…” 11 days ago Up 2 days connection-watchdog_3134350_1651895
root@4ef2f72:~# ps | grep edge
2049 root 890m S /trex/edge control
2339 root 987m S /trex/edge capture
19697 root 2276 S grep edge
root@4ef2f72:~# balena inspect agent_3134344_1651895
[
{
“Id”: “1f78b10e5b5d644c2a51d39189ee6babf2bc55dc08654a8c69b55ad58463b934”,
“Created”: “2020-12-30T20:44:28.231746912Z”,
“Path”: “/trex/edge”,
“Args”: [
“agent”
],
“State”: {
“Status”: “running”,
“Running”: true,
“Paused”: false,
“Restarting”: false,
“OOMKilled”: false,
“Dead”: false,
“Pid”: 27829,
“ExitCode”: 0,
“Error”: “”,
“StartedAt”: “2021-01-08T03:51:03.489862922Z”,
“FinishedAt”: “2021-01-08T03:50:58.401232618Z”
},
“Image”: “sha256:158780b9dd8a4569fa86dfa8d134f584dea1b7e1dc51170634f32a94b841e1e0”,
“ResolvConfPath”: “/var/lib/docker/containers/1f78b10e5b5d644c2a51d39189ee6babf2bc55dc08654a8c69b55ad58463b934/resolv.conf”,
“HostnamePath”: “/var/lib/docker/containers/1f78b10e5b5d644c2a51d39189ee6babf2bc55dc08654a8c69b55ad58463b934/hostname”,
“HostsPath”: “/var/lib/docker/containers/1f78b10e5b5d644c2a51d39189ee6babf2bc55dc08654a8c69b55ad58463b934/hosts”,
“LogPath”: “”,
“Name”: “/agent_3134344_1651895”,
“RestartCount”: 34,
“Driver”: “aufs”,
“Platform”: “linux”,
“MountLabel”: “”,
“ProcessLabel”: “”,
“AppArmorProfile”: “”,
“ExecIDs”: null,
“HostConfig”: {
“Binds”: [
“/tmp/balena-supervisor/services/1499460/agent:/tmp/resin”,
“/tmp/balena-supervisor/services/1499460/agent:/tmp/balena”
],
“ContainerIDFile”: “”,
“ContainerIDEnv”: “”,
“LogConfig”: {
“Type”: “journald”,
“Config”: {}
},
“NetworkMode”: “host”,
“PortBindings”: {},
“RestartPolicy”: {
“Name”: “always”,
“MaximumRetryCount”: 0
},
“AutoRemove”: false,
“VolumeDriver”: “”,
“VolumesFrom”: null,
“CapAdd”: ,
“CapDrop”: ,
“Capabilities”: null,
“Dns”: ,
“DnsOptions”: ,
“DnsSearch”: ,
“ExtraHosts”: ,
“GroupAdd”: ,
“IpcMode”: “shareable”,
“Cgroup”: “”,
“Links”: null,
“OomScoreAdj”: 0,
“PidMode”: “”,
“Privileged”: true,
“PublishAllPorts”: false,
“ReadonlyRootfs”: false,
“SecurityOpt”: ,
“UTSMode”: “”,
“UsernsMode”: “”,
“ShmSize”: 67108864,
“Runtime”: “runc”,
“ConsoleSize”: [
0,
0
],
“Isolation”: “”,
“CpuShares”: 0,
“Memory”: 0,
“NanoCpus”: 0,
“CgroupParent”: “”,
“BlkioWeight”: 0,
“BlkioWeightDevice”: null,
“BlkioDeviceReadBps”: null,
“BlkioDeviceWriteBps”: null,
“BlkioDeviceReadIOps”: null,
“BlkioDeviceWriteIOps”: null,
“CpuPeriod”: 0,
“CpuQuota”: 0,
“CpuRealtimePeriod”: 0,
“CpuRealtimeRuntime”: 0,
“CpusetCpus”: “”,
“CpusetMems”: “”,
“Devices”: ,
“DeviceCgroupRules”: null,
“DeviceRequests”: ,
“KernelMemory”: 0,
“KernelMemoryTCP”: 0,
“MemoryReservation”: 0,
“MemorySwap”: 0,
“MemorySwappiness”: null,
“OomKillDisable”: false,
“PidsLimit”: null,
“Ulimits”: ,
“CpuCount”: 0,
“CpuPercent”: 0,
“IOMaximumIOps”: 0,
“IOMaximumBandwidth”: 0,
“MaskedPaths”: null,
“ReadonlyPaths”: null
},
“GraphDriver”: {
“Data”: null,
“Name”: “aufs”
},
“Mounts”: [
{
“Type”: “bind”,
“Source”: “/tmp/balena-supervisor/services/1499460/agent”,
“Destination”: “/tmp/balena”,
“Mode”: “”,
“RW”: true,
“Propagation”: “rprivate”
},
{
“Type”: “bind”,
“Source”: “/tmp/balena-supervisor/services/1499460/agent”,
“Destination”: “/tmp/resin”,
“Mode”: “”,
“RW”: true,
“Propagation”: “rprivate”
}
],
“Config”: {
“Hostname”: “4ef2f72”,
“Domainname”: “”,
“User”: “”,
“AttachStdin”: false,
“AttachStdout”: false,
“AttachStderr”: false,
“Tty”: true,
“OpenStdin”: false,
“StdinOnce”: false,
“Env”: [
“RESIN_DEVICE_NAME_AT_INIT=silent-sea”,
“BALENA_DEVICE_NAME_AT_INIT=silent-sea”,
“INFLUXDB_HOST=127.0.0.1:8087”,
“WAIT_FOR_INFLUXDB=true”,
“AGENT_CAPTURE_URL=tcp://localhost:10000”,
“AGENT_INFLUXDB_URL=http://localhost:8086”,
“TREX_LOG_LEVEL=INFO”,
“WELL_API_NUMBER=00-000-00000”,
“AGENT_KAFKA_URL=tcp://broker.trex.petropower.com:9092”,
“AGENT_LOG_LEVEL=INFO”,
“AGENT_PARTIAL_SAMPLES_DATABASE=partial_samples/autogen”,
“BALENA_APP_ID=1499460”,
“BALENA_APP_NAME=Trex”,
“BALENA_SERVICE_NAME=agent”,
"BALENA_DEVICE_UUID=// removed
“BALENA_DEVICE_TYPE=fincm3”,
“BALENA_DEVICE_ARCH=armv7hf”,
“BALENA_HOST_OS_VERSION=balenaOS 2.58.3+rev1”,
“BALENA_SUPERVISOR_VERSION=11.14.0”,
“BALENA_APP_LOCK_PATH=/tmp/balena/updates.lock”,
“BALENA=1”,
“RESIN_APP_ID=1499460”,
“RESIN_APP_NAME=Trex”,
“RESIN_SERVICE_NAME=agent”,
"RESIN_DEVICE_UUID=// removed
“RESIN_DEVICE_TYPE=fincm3”,
“RESIN_DEVICE_ARCH=armv7hf”,
“RESIN_HOST_OS_VERSION=balenaOS 2.58.3+rev1”,
“RESIN_SUPERVISOR_VERSION=11.14.0”,
“RESIN_APP_LOCK_PATH=/tmp/balena/updates.lock”,
“RESIN=1”,
“RESIN_SERVICE_KILL_ME_PATH=/tmp/balena/handover-complete”,
“BALENA_SERVICE_HANDOVER_COMPLETE_PATH=/tmp/balena/handover-complete”,
“USER=root”,
“PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin”
],
“Cmd”: [
“agent”
],
“Healthcheck”: {
“Test”: [
“NONE”
]
},
“Image”: “sha256:158780b9dd8a4569fa86dfa8d134f584dea1b7e1dc51170634f32a94b841e1e0”,
“Volumes”: null,
“WorkingDir”: “”,
“Entrypoint”: [
“/trex/edge”
],
“OnBuild”: null,
“Labels”: {
“com.petropower.component.name”: “agent”,
“com.petropower.component.version”: “421”,
“io.balena.app-id”: “1499460”,
“io.balena.service-id”: “682262”,
“io.balena.service-name”: “agent”,
“io.balena.supervised”: “true”
},
“StopSignal”: “SIGTERM”,
“StopTimeout”: 10
},
“NetworkSettings”: {
“Bridge”: “”,
“SandboxID”: “81b2b80e3248f5c239fc339b28caab06797e22e566b963b328c680b16d2ccc62”,
“HairpinMode”: false,
“LinkLocalIPv6Address”: “”,
“LinkLocalIPv6PrefixLen”: 0,
“Ports”: {},
“SandboxKey”: “/var/run/balena-engine/netns/default”,
“SecondaryIPAddresses”: null,
“SecondaryIPv6Addresses”: null,
“EndpointID”: “”,
“Gateway”: “”,
“GlobalIPv6Address”: “”,
“GlobalIPv6PrefixLen”: 0,
“IPAddress”: “”,
“IPPrefixLen”: 0,
“IPv6Gateway”: “”,
“MacAddress”: “”,
“Networks”: {
“host”: {
“IPAMConfig”: null,
“Links”: null,
“Aliases”: null,
“NetworkID”: “a1b0d489f69d651e7b7054b5bb74622acbb3254cc18986ba57befbdb5d798287”,
“EndpointID”: “31a8e9ff9518e27430dbe5f4b4840a04f2398d00117225d41eca3e934ffc3b8f”,
“Gateway”: “”,
“IPAddress”: “”,
“IPPrefixLen”: 0,
“IPv6Gateway”: “”,
“GlobalIPv6Address”: “”,
“GlobalIPv6PrefixLen”: 0,
“MacAddress”: “”,
“DriverOpts”: null
}
}
}
}
]
root@4ef2f72:~# ps |grep edge
2049 root 890m S /trex/edge control
2339 root 987m S /trex/edge capture
21894 root 2276 S grep edge
root@4ef2f72:~# ps |grep 27829
22911 root 2276 S grep 27829
root@4ef2f72:~# service balena restart
bash: service: command not found
root@4ef2f72:~# systemctl restart balena
root@4ef2f72:~#
root@4ef2f72:~#
root@4ef2f72:~# balena ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
472710a7a738 registry2.balena-cloud.com/v2/e5a9b95f08373b03979ee007184e4753:latest “/usr/src/app/entry.…” 13 hours ago Up 21 seconds (health: starting) resin_supervisor
71d597882a28 02f603c54d98 “/entrypoint.sh” 24 hours ago Up 15 seconds 0.0.0.0:2222->22/tcp scp_3134349_1651895
68bf0adab5d4 e5364636fadf “/entrypoint.sh” 8 days ago Restarting (143) 3 seconds ago grafana_3134353_1651895
b96312a92de1 e4ccd877f611 “/entrypoint_with_wa…” 8 days ago Up Less than a second telegraf_3134352_1651895
da24521e0155 8d80652d5ac6 “sleep infinity” 8 days ago Up 18 seconds inspect_3134348_1651895
50ad55275d8c e36fce813fa7 “/entrypoint.sh infl…” 8 days ago Up 19 seconds influxdb_3134354_1651895
1f78b10e5b5d 158780b9dd8a “/trex/edge agent” 8 days ago Up 21 seconds agent_3134344_1651895
4fa6876d14be 158780b9dd8a “/trex/edge control” 8 days ago Up 21 seconds control_3134341_1651895
9fb70291c035 158780b9dd8a “/trex/edge capture” 8 days ago Up 19 seconds capture_3134342_1651895
9441093ec07d 860c57b07dd9 “sleep infinity” 11 days ago Up 15 seconds co-processor-flasher_3134343_1651895
3af1f6e5d436 c9767211b5b7 “./grpc-dfuTool” 11 days ago Up 19 seconds sensor-firmware-flasher_3134345_1651895
d15ed98e4d06 9f49ac54288d “python3 logger.py” 11 days ago Up 6 seconds logger_3134347_1651895
9c1c8640aede 48a011276016 “/entrypoint.sh” 11 days ago Up 19 seconds (health: starting) hw-init_3134346_1651895
23abad88c563 5b31debf8912 “/bin/sh -c /app/con…” 11 days ago Up 15 seconds connection-watchdog_3134350_1651895
root@4ef2f72:~# ps |grep agent
24348 root 872m S /trex/edge agent
27160 root 2276 S grep agent
root@4ef2f72:~# ^C
root@4ef2f72:~#