Local-mode Supervisor kills containers randomly

Hi all,

I’m using an RPI4 (balenaOS 2.58.6+rev1, supervisor v11.14.0) with local mode enabled to develop apps. This was the latest OS version at the time (I see that 2.65.0 is available now; I’ll check that out). But sometimes the supervisor just kills the containers out of the blue. I didn’t do another balena push and I didn’t reboot the device; it just stops all the containers.

Here are some logs from the supervisor:

I’m using openBalena by the way.

Also, after a reboot the containers don’t start automatically. On another RPI4 with local mode enabled, also on supervisor v11.14.0, this behaviour doesn’t happen and the containers start just fine after a reboot. I’ve tried a different SD card and reflashed the current SD card, but it keeps doing weird stuff.

Thanks for helping out!

Hi, could you try running balena push using the --debug option and share the output of the command? I’m particularly interested in the part of the output that says

[Debug]   Setting device state...
[Debug]   Sending request to http://192.168.1.110:48484/v2/local/target-state
[Debug]   Sending target state: {"local": {....

just before the error appears. Hopefully that can provide some more information to find the source of the issue.
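
For reference, the full command would look something like this (substituting your device’s local IP address for the placeholder):

balena push <device local IP> --debug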

Also, can you share the version of balena CLI you are using? Thank you!

Hi Felipe,

I’m using balena-cli 12.26.1 at the moment. I’m fine with updating it to 12.36.0, but since I’m using openBalena I’m not certain whether the new version works with it. I’ll try, though.

About the logs:

[Debug]   Setting device state...
[Debug]   Sending request to http://192.168.10.40:48484/v2/local/target-state
[Debug]   Sending target state: {"local":{"name":"development-rpi","config":{"SUPERVISOR_LOCAL_MODE":"1","HOST_CONFIG_dtoverlay":"pi3-miniuart-bt","HOST_CONFIG_enable_uart":"1","HOST_CONFIG_arm_64bit":"1","HOST_CONFIG_disable_splash":"1","HOST_CONFIG_dtparam":"\"i2c_arm=on\",\"spi=on\",\"audio=on\"","HOST_CONFIG_gpu_mem":"16","HOST_FIREWALL_MODE":"","SUPERVISOR_POLL_INTERVAL":"600000","SUPERVISOR_VPN_CONTROL":"true","SUPERVISOR_INSTANT_UPDATE_TRIGGER":"true","SUPERVISOR_CONNECTIVITY_CHECK":"true","SUPERVISOR_LOG_CONTROL":"true","SUPERVISOR_DELTA":"false","SUPERVISOR_DELTA_REQUEST_TIMEOUT":"30000","SUPERVISOR_DELTA_APPLY_TIMEOUT":"0","SUPERVISOR_DELTA_RETRY_COUNT":"30","SUPERVISOR_DELTA_RETRY_INTERVAL":"10000","SUPERVISOR_DELTA_VERSION":"2","SUPERVISOR_OVERRIDE_LOCK":"false","SUPERVISOR_PERSISTENT_LOGGING":"false","HOST_DISCOVERABILITY":"true"},"apps":{"1":{"name":"localapp","commit":"localrelease","releaseId":"1","services":{"1":{"environment":{},"labels":{},"restart":"always","network_mode":"host","imageId":1,"serviceName":"redis","serviceId":1,"image":"local_image_redis:latest","running":true},"2":{"environment":{"RS485_ENABLED":"1","SOCKET_ADDR":"/var/lib/container/communication.sock"},"labels":{},"restart":"always","devices":["/dev/ttyS0:/dev/ttyAMA0"],"cap_add":["SYS_RAWIO"],"volumes":["container_lib:/var/lib/container"],"imageId":2,"serviceName":"serial","serviceId":2,"image":"local_image_serial:latest","running":true},"3":{"environment":{"NODE_ENV":"development","UDEV":"on","DBUS_SYSTEM_BUS_ADDRESS":"unix:path=/host/run/dbus/system_bus_socket","SOCKET_ADDR":"/var/lib/container/communication.sock","EXPRESS_HOSTNAME":"127.0.0.1","QUEUE_WORKER_SPOTS":"4","MAX_WORKERS":"1","SERVER_API_ENDPOINT":"http://192.168.10.10:3000"},"labels":{"io.balena.features.supervisor-api":"1","io.balena.features.dbus":"1"},"restart":"always","network_mode":"host","cap_add":["NET_ADMIN"],"ports":["80:80"],"volumes":["container_lib:/var/lib/container"],"imageId":3,"serviceName":"system","serviceId":3,"image":"local_image_system:latest","running":true}},"volumes":{"container_lib":{}},"networks":{}}}},"dependent":{"apps":[],"devices":[]}}
[Debug]   Sending request to http://192.168.10.40:48484/v2/local/target-state

The thing is, the build and push go fine and the containers run fine. But after some hours (let’s say 5), the supervisor just kills the containers. The balena push command has already been stopped by that time. It happens randomly as far as I can tell.

Hi

I wonder if network settings are involved here somehow, since this is local mode. Maybe some router timeouts are kicking in. For example, there is often a TCP Session Timeout setting which could kill off “inactive” sessions.

Can you check, for your router make and model, how to tweak these settings? IIRC the maximum timeout you can set is 24 hours. You could try that and see if it makes any difference.

Hi Anuj,

I don’t think the balena push command is really the problem, and it’s not that the connection with the device is lost: after a push I kill the command myself, and the device still runs the containers at that point.

For example, I push the latest changes to the device at the end of the day to see if everything works properly overnight and whether I get any error logs. But when I come back in the morning and SSH into the device, the containers are stopped. If I then check the supervisor logs, they show that the supervisor has killed the containers for some reason. You can find those logs in the opening post. By that time, the balena push command had already been killed some hours before.

Hi Bart, it might take a few days due to the holidays, but we’ll be sure to have a look at this one. Thanks as always!

Hi again Bart, I’m still investigating this. Thank you for sharing the logs by the way.

The target state provided by the CLI looks fine. That state should remain the source of truth for the device even if you kill the balena push command or reboot the device. It can only change through a POST request to the /v2/local/target-state endpoint (see the supervisor API docs for more info) or if local mode is disabled on the cloud.
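
As a rough sketch (using the device IP and port from your debug output, and a hypothetical target-state.json file containing a state in the same format the CLI sends), you should be able to read back or set that state directly with something like:

curl http://192.168.10.40:48484/v2/local/target-state

curl -X POST -H "Content-Type: application/json" --data @target-state.json http://192.168.10.40:48484/v2/local/target-state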

I see an Applying target state message at the beginning of the screenshot you shared. That is not necessarily related, but is there anything in the logs before that? Is there anything else interacting with this device?

Could you try to replicate the issue and share the logs that you get by running the following command on your device?

journalctl -u resin-supervisor -a --no-pager

I need to see if anything is triggering a change in the local state. Thanks again for your help!

Hi Felipe,

I updated the balena CLI to 12.36.1 last week, and I thought the issue didn’t happen anymore. But over the weekend, the processes stopped again. I’ve started the process again, but didn’t think about gathering the logs first, so sorry for that.

I’ve checked the journalctl output and it has logs going back to the 31st of December, so it probably contains the relevant entries.

Thanks for looking into it!

resin-supervisor.log (408.0 KB)

Hi there, have you inspected the device running the openBalena container stack for memory, disk, and CPU pressure? Is network connectivity to the API stable?
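
For a quick snapshot, something like the following from the host OS shell should do (assuming the usual busybox tools are available there):

free -m
df -h
top -b -n 1 | head -n 20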

Hi Anton,

I’m not monitoring memory, disk and CPU pressure constantly. In the beginning I checked CPU and memory every now and then, looking for memory leaks or high CPU usage over time, and it didn’t show significant load or memory consumption. I’m trying to limit disk writes as much as possible.

The weird thing is that I have another RPI4 running the same application, same OS and also a development image, and it doesn’t kill the containers after a few hours. The only difference is that this RPI4 isn’t in local mode but uses releases.

It happened again about 30 minutes ago. I can share the supervisor logs if you want, but they just say it has killed the containers. So maybe there are some other logs I can give you?

Hi Bart,

Yes, please share the logs again so I can see if there is some consistency in what is being reported by the supervisor.

The logs you shared intermittently show

Event: Device state report failure {"error":{"message":""}}

which suggests some network issues reaching your server. This doesn’t give a cause for the removal of the local mode services, though.

When the removal occurs, there is an InternalInconsistencyError shown in the logs, which points to a conflict between the in-memory state and the supervisor database. I would like to check for database corruption. Would you mind performing the following steps?

  • Start the local mode application with balena push <local ip>
  • On a different terminal ssh into the device: balena ssh <local ip>
  • On the device, open the supervisor node console: balena exec -ti resin_supervisor node
  • On the node console, enter the following commands:
sqlite3 = require('sqlite3');
db = new sqlite3.Database('/data/database.sqlite');
db.all("SELECT * FROM app;", console.log);

Please let us know about the result of this test.
Felipe

Hi Felipe,

I didn’t do anything with the device yesterday while waiting for your response, so I’ll share the logs I have now! Because they contain some sensitive information, I’ll message you privately with the logs.

I’ll do the test now and let you know!


I’ve completed your steps, and this is the output:

> db.all("SELECT * FROM app;", console.log);
Database { open: true, filename: '/data/database.sqlite', mode: 65542 }
> null [
  {
    id: 1,
    name: 'localapp',
    releaseId: 1,
    commit: 'localrelease',
    appId: 1,
    services: '[{"environment":{},"labels":{},"restart":"always","network_mode":"host","imageId":1,"serviceName":"redis","serviceId":1,"image":"local_image_redis:latest","running":true,"appId":1,"releaseId":"1","commit":"localrelease"},{"environment":{"RS485_ENABLED":"1","SOCKET_ADDR":"/var/lib/data/communication.sock"},"labels":{},"restart":"always","devices":["/dev/ttyS0:/dev/ttyAMA0"],"cap_add":["SYS_RAWIO"],"volumes":["data_lib:/var/lib/data"],"imageId":2,"serviceName":"serial","serviceId":2,"image":"local_image_serial:latest","running":true,"appId":1,"releaseId":"1","commit":"localrelease"},{"environment":{"NODE_ENV":"development","UDEV":"on","DBUS_SYSTEM_BUS_ADDRESS":"unix:path=/host/run/dbus/system_bus_socket","SOCKET_ADDR":"/var/lib/data/communication.sock","EXPRESS_HOSTNAME":"127.0.0.1","QUEUE_WORKER_SPOTS":"10","MAX_WORKERS":"1","SERVER_API_ENDPOINT":"http://192.168.10.10:3000"},"labels":{"io.balena.features.supervisor-api":"1","io.balena.features.dbus":"1"},"restart":"always","network_mode":"host","cap_add":["NET_ADMIN"],"ports":["80:80"],"volumes":["data_lib:/var/lib/data"],"imageId":3,"serviceName":"system","serviceId":3,"image":"local_image_system:latest","running":true,"appId":1,"releaseId":"1","commit":"localrelease"}]',
    networks: '{}',
    volumes: '{"data_lib":{}}',
    source: 'local'
  }
]

It happened again, so I’ve gone through your steps again, and now it shows the app from my openBalena server (so name is the name of the app and source is my openBalena domain). The localapp is gone by the way.

Something I’ve noticed, though I don’t know if this is always the case, is that I had deployed a new release of that app. I don’t know if the containers were killed right after that, but it’s probably worth mentioning.

Hi again,

Thanks for sharing the logs and going through the steps.

To be honest, I’m a bit stumped by this. The supervisor should not remove the local mode services unless the device is taken out of local mode or something changes the target state through the /v2/local/target-state supervisor endpoint. I don’t see either of those happening in the logs here, and my database corruption hypothesis didn’t pan out.

I will create a supervisor issue to improve observability of target state changes on the device logs, so we can better track the sequence of events leading to a service being deleted.

If you are up to it, you could try replicating the issue on your other device with local mode enabled, or upgrading to a newer OS (and hence supervisor) version and seeing if the issue still happens there.

Please let us know about your findings.

Hi Felipe,

Thanks for looking into this issue.
I’ll check if I can replicate this issue on another device in the near future. We have another application running on openBalena where I didn’t see this behaviour (also an RPI4, same OS and same supervisor version). But I’m not sure we ever left that device running for 24 hours or more to see whether the containers were killed.

I’ll update to a new OS version / supervisor version once it’s available on https://www.balena.io/os/, because we only want to work with what you guys publish instead of running ‘beta’/unreleased versions. But as soon as it’s available, I’ll definitely upgrade!


After the weekend, the containers stopped again. I’d just like to mention that no new release was created this time, so my previous post should be ignored; it’s not related.

Hi,

Coming back to this, I think I have an indication of what’s happening.
It only occurs with one of my projects so far, running on openBalena. I think it has to do with the fact that the project ID is 1. I’ve seen some logs (I didn’t write down which) that indicate project ID 1 is causing those issues.

Maybe someone from the balena team knows whether project ID 1 is a special case? This is probably also why balenaCloud users don’t run into this problem, since project ID 1 doesn’t come up for them.

Hello,

Indeed, fleet ID 1 is reserved for the CLI when pushing a local application. We’ve flagged this with our team and will ensure that fleet ID counting starts at 2 in openBalena.