Service sometimes fails to restart after crash (error 500)

Hello everyone,

I’m encountering a seemingly “random” problem with the Balena Engine when one of my services crashes and needs to be started again.

Most of the time, the restart behaves as expected on the majority of my devices. However, on certain occasions, the service gets stuck and cannot restart.
On those occasions, several errors appear in the logs of the Balena Engine:

[info]    Applying target state
[debug]   Found unmanaged Volume: b8b019f912b56107e4dfe23ac896ac0b56599dd08ce14bb6ea899584698438c7
[debug]   Found unmanaged Volume: e5857af62c6d1f458ae212f7eb76bda4635e4c2fb3b2d9acf1b0808745de8132
[warn]    Ignoring unsupported or unknown compose fields: containerName
[debug]   Found unmanaged Volume: b8b019f912b56107e4dfe23ac896ac0b56599dd08ce14bb6ea899584698438c7
[debug]   Found unmanaged Volume: e5857af62c6d1f458ae212f7eb76bda4635e4c2fb3b2d9acf1b0808745de8132
[event]   Event: Service start {"service":{"appId":1835016,"serviceId":1083089,"serviceName":"cloud-connector","commit":"662af403dd1b8e1f1ef8f75cdf90ee3c","releaseId":1942744}}
[error]   Scheduling another update attempt in 1800000ms due to failure:  Error: Failed to apply state transition steps. (HTTP code 500) server error - task 71b657d018ca3d28e9ef3079f593eae207ed414a9ec0ab60c2f0b294bfd51158 already exists: unknown  Steps:["start"]
[error]         at fn (/usr/src/app/dist/app.js:6:8594)
[error]       at runMicrotasks (<anonymous>)
[error]       at processTicksAndRejections (internal/process/task_queues.js:97:5)
[error]   Device state apply error Error: Failed to apply state transition steps. (HTTP code 500) server error - task 71b657d018ca3d28e9ef3079f593eae207ed414a9ec0ab60c2f0b294bfd51158 already exists: unknown  Steps:["start"]
[error]         at fn (/usr/src/app/dist/app.js:6:8594)
[error]       at runMicrotasks (<anonymous>)
[error]       at processTicksAndRejections (internal/process/task_queues.js:97:5)

Clicking the start button on BalenaCloud does not resolve the issue, but calling the Balena Supervisor API through the /restart-service endpoint does.
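For anyone who wants to try the same workaround, here is a minimal sketch of that call in Python. It assumes the container has the `io.balena.features.supervisor-api` label set, so that the `BALENA_SUPERVISOR_ADDRESS`, `BALENA_SUPERVISOR_API_KEY` and `BALENA_APP_ID` environment variables are available, and that the endpoint lives under `/v2/applications/<appId>/restart-service` as I understand it from the Supervisor API docs. The helper names `build_restart_request` and `restart_service` are my own, not part of any balena library.

```python
import json
import os
import urllib.request


def build_restart_request(address, app_id, api_key, service_name):
    """Build (but do not send) the POST request that restarts one service.

    Assumption: the v2 restart-service endpoint takes the service name
    in a JSON body and the API key as an `apikey` query parameter.
    """
    url = f"{address}/v2/applications/{app_id}/restart-service?apikey={api_key}"
    body = json.dumps({"serviceName": service_name}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def restart_service(service_name, timeout=30):
    """Send the restart request using the variables injected by the Supervisor."""
    req = build_restart_request(
        os.environ["BALENA_SUPERVISOR_ADDRESS"],
        os.environ["BALENA_APP_ID"],
        os.environ["BALENA_SUPERVISOR_API_KEY"],
        service_name,
    )
    # urlopen raises urllib.error.HTTPError on a 4xx/5xx response.
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```

Calling `restart_service("cloud-connector")` from inside a container on the device should then trigger the restart of that service; the service name here is the one from my logs above.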

I have not yet been able to reproduce the problem deliberately, so I don’t think I’ll be able to provide support access to one of my devices for further investigation, but I have a diagnostic report from one of the devices on which the problem occurred:
215508c94253ee8609a796b391e816cc46b285334f7f1d31a928b3e92795e4_diagnostics_2021.12.14_14.43.58+0000.txt (2.7 MB)

Balena Supervisor version on my devices: 12.10.3

I would be glad if someone could help me with this, as it is critical for me to have this service up at all times.

Thanks in advance,
Christopher

Hi,

The Service start event is triggered by the supervisor when it applies the target state, because it attempts to bring the service to a healthy state. However, the engine sees that the container is already running/started, so it rejects the task with an error. I see in your diagnostics that the lorawan-service is in a starting state: Up 18 hours (health: starting)

We will need to find a solution to prevent this issue. I have created the following GitHub issue to track this: Supervisor could not apply target state when a service is stuck in a `health: starting` state · Issue #1855 · balena-os/balena-supervisor · GitHub

Thank you for reporting this and for finding a temporary solution. We will update you again as soon as we have a fix for this.

Regards,
Carlo