Three of our remote production devices have gotten into a strange state. We first noticed it because our application containers appeared to have stopped running.
This state is characterized by the following: the application containers are stopped, yet the device shows as “Online” in the balena dashboard. The device seems to accept a “Restart” request, but then goes into an infinite loop of trying to start the container and failing, with this message in the balena logs:
22.01.19 09:04:20 (-0800) Failed to start service ‘main sha256:d190a908c4728f103821c160c827ceed4a6e94be932abd2fcad7ac4972f5f928’ due to '(HTTP code 500) server error - error while creating mount source path ‘/tmp/balena-supervisor/services/485241/main’: mkdir /tmp/balena-supervisor: file exists ’
Issuing a “Reboot” request restores the devices to a fully operational status.
Why is this happening, and how can we prevent it? What can we look into?
Hi @troyvsvs,
is any of your devices experiencing this issue right now? Can you please enable support access for this device and share the dashboard link with me (you can use direct message)?
Thanks,
Robert
Hi zrzka,
Downtime in production is unacceptable for us, so in the past we’ve rebooted these devices as soon as we noticed this state. However, to help with troubleshooting, we’ll do our best to grant you support access to a device the next time one gets into this state.
Out of curiosity, is there any way for us to capture the low-level system logs ourselves? We could look at them too, in addition to granting full support access.
Okay, thanks. The next time the problem appears, please enable support access and share the dashboard link with us.
If you’d like to investigate yourself, you can start a terminal session on the host OS and use standard tools (journalctl, …) to examine the logs.
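While in such a host OS session, one more thing worth checking is what actually sits at /tmp/balena-supervisor, since the error says mkdir found something already there (mkdir reports “file exists” whenever any entry, including a symlink, already occupies the path). Below is a minimal sketch, assuming a Python 3 interpreter is available where you run it; if it isn’t, ls -la /tmp gives the same information. The function name describe_path is hypothetical, made up for this illustration.

```python
import os
import stat


def describe_path(path: str) -> str:
    """Report what, if anything, occupies `path` (hypothetical helper).

    Uses lstat so a symlink is inspected itself rather than followed.
    """
    try:
        st = os.lstat(path)
    except FileNotFoundError:
        return "missing"
    if stat.S_ISLNK(st.st_mode):
        target = os.readlink(path)
        # os.path.exists follows the link, so False means the link is dangling
        state = "ok" if os.path.exists(path) else "dangling"
        return f"symlink -> {target} ({state})"
    if stat.S_ISDIR(st.st_mode):
        return "directory"
    if stat.S_ISREG(st.st_mode):
        return f"regular file ({st.st_size} bytes)"
    return f"other (mode {oct(st.st_mode)})"


if __name__ == "__main__":
    # Path taken from the error message; assumed to be checked on the host OS
    print("/tmp/balena-supervisor:", describe_path("/tmp/balena-supervisor"))
```

On a healthy device you would expect “directory” or “missing”; anything else is a lead worth sharing in the thread, though it is not by itself proof of the root cause.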