Local-mode Supervisor kills containers randomly

Hi all,

I’m using an RPI4 (balenaOS 2.58.6+rev1, supervisor v11.14.0) with local mode enabled to develop apps. This was the latest OS version at the time (I see that 2.65.0 is available now; I’ll check that out). But sometimes the supervisor just kills the containers out of the blue. I didn’t do another balena push and I didn’t reboot the device; it just stops all the containers.

Here are some logs from the supervisor:

I’m using openBalena by the way.

Also, after a reboot the containers don’t start automatically. On another RPI4 with local mode enabled, also on supervisor v11.14.0, this behaviour doesn’t happen and the containers start just fine after a reboot. I’ve tried a different SD card and reflashed the current SD card, but it keeps doing weird stuff.

Thanks for helping out!

Hi, could you try running balena push using the --debug option and share the output of the command? I’m particularly interested in the part of the output that says

[Debug]   Setting device state...
[Debug]   Sending request to http://192.168.1.110:48484/v2/local/target-state
[Debug]   Sending target state: {"local": {....

just before the error appears. Hopefully that can provide some more information to find the source of the issue.
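
For reference, the full command would look something like this (substituting your device’s local IP address for the placeholder):

balena push <device local IP> --debug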

Also, can you share the version of balena CLI you are using? Thank you!

Hi Felipe,

I’m using balena-cli 12.26.1 at the moment. I’m fine with updating it to 12.36.0, but since I’m using openBalena I’m not certain whether the new version works with it. I’ll try, though.

About the logs:

[Debug]   Setting device state...
[Debug]   Sending request to http://192.168.10.40:48484/v2/local/target-state
[Debug]   Sending target state: {"local":{"name":"development-rpi","config":{"SUPERVISOR_LOCAL_MODE":"1","HOST_CONFIG_dtoverlay":"pi3-miniuart-bt","HOST_CONFIG_enable_uart":"1","HOST_CONFIG_arm_64bit":"1","HOST_CONFIG_disable_splash":"1","HOST_CONFIG_dtparam":"\"i2c_arm=on\",\"spi=on\",\"audio=on\"","HOST_CONFIG_gpu_mem":"16","HOST_FIREWALL_MODE":"","SUPERVISOR_POLL_INTERVAL":"600000","SUPERVISOR_VPN_CONTROL":"true","SUPERVISOR_INSTANT_UPDATE_TRIGGER":"true","SUPERVISOR_CONNECTIVITY_CHECK":"true","SUPERVISOR_LOG_CONTROL":"true","SUPERVISOR_DELTA":"false","SUPERVISOR_DELTA_REQUEST_TIMEOUT":"30000","SUPERVISOR_DELTA_APPLY_TIMEOUT":"0","SUPERVISOR_DELTA_RETRY_COUNT":"30","SUPERVISOR_DELTA_RETRY_INTERVAL":"10000","SUPERVISOR_DELTA_VERSION":"2","SUPERVISOR_OVERRIDE_LOCK":"false","SUPERVISOR_PERSISTENT_LOGGING":"false","HOST_DISCOVERABILITY":"true"},"apps":{"1":{"name":"localapp","commit":"localrelease","releaseId":"1","services":{"1":{"environment":{},"labels":{},"restart":"always","network_mode":"host","imageId":1,"serviceName":"redis","serviceId":1,"image":"local_image_redis:latest","running":true},"2":{"environment":{"RS485_ENABLED":"1","SOCKET_ADDR":"/var/lib/container/communication.sock"},"labels":{},"restart":"always","devices":["/dev/ttyS0:/dev/ttyAMA0"],"cap_add":["SYS_RAWIO"],"volumes":["container_lib:/var/lib/container"],"imageId":2,"serviceName":"serial","serviceId":2,"image":"local_image_serial:latest","running":true},"3":{"environment":{"NODE_ENV":"development","UDEV":"on","DBUS_SYSTEM_BUS_ADDRESS":"unix:path=/host/run/dbus/system_bus_socket","SOCKET_ADDR":"/var/lib/container/communication.sock","EXPRESS_HOSTNAME":"127.0.0.1","QUEUE_WORKER_SPOTS":"4","MAX_WORKERS":"1","SERVER_API_ENDPOINT":"http://192.168.10.10:3000"},"labels":{"io.balena.features.supervisor-api":"1","io.balena.features.dbus":"1"},"restart":"always","network_mode":"host","cap_add":["NET_ADMIN"],"ports":["80:80"],"volumes":["container_lib:/var/lib/container"],"imageId":3,"serviceName":"system","serviceId":3,"image":"local_image_system:latest","running":true}},"volumes":{"container_lib":{}},"networks":{}}}},"dependent":{"apps":[],"devices":[]}}
[Debug]   Sending request to http://192.168.10.40:48484/v2/local/target-state

The thing is, the build and push go fine and the containers run fine. But after some hours (let’s say 5), the supervisor just kills the containers. The balena push command has already been stopped by that time. It happens randomly as far as I can tell.

Hi

I wonder if network settings are involved here somehow, since this is local mode. Maybe some router timeouts are kicking in. For example, there is often a TCP Session Timeout setting which could kill off “inactive” sessions.

Can you check, for your router make and model, how to tweak these settings? IIRC the maximum timeout you can set is 24 hours. You could try that and see if it makes any difference.

Hi Anuj,

I don’t think the balena push command is really the problem, and it’s not that the connection with the device is lost: after a push I kill the command myself, and the device still runs the containers at that point.

For example, I push the latest changes to the device at the end of the day to see if everything works properly overnight and whether I get any error logs. But when I come back in the morning and SSH into the device, the containers are stopped. If I then check the supervisor logs, they show that the supervisor has killed the containers for some reason. You can find those logs in the opening post. By that time, the balena push command had already been killed some hours before.

Hi Bart, it might take a few days due to the holidays, but we’ll be sure to have a look at this one. Thanks as always!

Hi again Bart, I’m still investigating this. Thank you for sharing the logs by the way.

The target state provided by the CLI looks fine. That state should remain the source of truth for the device even if you kill the balena push command or reboot the device. It can only change through a POST request to the /v2/local/target-state endpoint (see the supervisor API docs for more info) or if local mode is disabled on the cloud.
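
As a rough sketch (using the device IP and port from your debug output, and a hypothetical target-state.json file containing a state in the same format the CLI sends), you should be able to read back or set that state directly with something like:

curl http://192.168.10.40:48484/v2/local/target-state

curl -X POST -H "Content-Type: application/json" --data @target-state.json http://192.168.10.40:48484/v2/local/target-state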

I see an Applying target state message at the beginning of the screenshot you shared. That is not necessarily related, but is there anything in the logs before that? Is there anything else interacting with this device?

Could you try to replicate the issue and share the logs that you get by running the following command on your device?

journalctl -u resin-supervisor -a --no-pager

I need to see if anything is triggering a change in the local state. Thanks again for your help!

Hi Felipe,

I updated the balena CLI to 12.36.1 last week, and I thought the issue didn’t happen anymore. But over the weekend, the processes stopped again. I’ve started the process again, but didn’t think about gathering the logs first, so sorry for that.

I’ve checked the journalctl output and it has logs going back to the 31st of December, so it probably contains the relevant entries.

Thanks for looking into it!

resin-supervisor.log (408.0 KB)

Hi there, have you inspected the device running the openBalena container stack for memory, disk, and CPU pressure? Is network connectivity to the API stable?
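
For a quick snapshot, something like the following from the host OS shell should do (assuming the usual busybox tools are available there):

free -m
df -h
top -b -n 1 | head -n 20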

Hi Anton,

I’m not monitoring memory, disk and CPU pressure constantly. In the beginning I checked CPU and memory every now and then, looking for memory leaks or high CPU usage over time, and it didn’t show significant load or memory consumption. I’m trying to limit disk writes as much as possible.

The weird thing is that I have another RPI4 running the same application, same OS and also a development image, and it doesn’t kill the containers after a few hours. The only difference is that this RPI4 isn’t in local mode but uses releases.

It happened again about 30 minutes ago. I can share the supervisor logs if you want, but they just say it has killed the containers. So maybe there are some other logs I can give you?

Hi Bart,

Yes, please share the logs again so I can see if there is some consistency in what is being reported by the supervisor.

The logs you shared intermittently show

Event: Device state report failure {"error":{"message":""}}

which suggests some network issues reaching your server. This doesn’t give a cause for the removal of the local mode services, though.

When the removal occurs, there is an InternalInconsistencyError shown in the logs, which points to a conflict between the in-memory state and the supervisor database. I would like to check for database corruption. Would you mind performing the following steps?

  • Start the local mode application with balena push <local ip>
  • On a different terminal ssh into the device: balena ssh <local ip>
  • On the device, open the supervisor node console: balena exec -ti resin_supervisor node
  • On the node console, enter the following commands:
sqlite3 = require('sqlite3');
db = new sqlite3.Database('/data/database.sqlite');
db.all("SELECT * FROM app;", console.log);

Please let us know about the result of this test.
Felipe

Hi Felipe,

I didn’t do anything with the device yesterday while waiting for your response, so I’ll share the logs I have now! Because they contain some sensitive information, I’ll message you privately with the logs.

I’ll do the test now and let you know!


I’ve completed your steps, and this is the output:

> db.all("SELECT * FROM app;", console.log);
Database { open: true, filename: '/data/database.sqlite', mode: 65542 }
> null [
  {
    id: 1,
    name: 'localapp',
    releaseId: 1,
    commit: 'localrelease',
    appId: 1,
    services: '[{"environment":{},"labels":{},"restart":"always","network_mode":"host","imageId":1,"serviceName":"redis","serviceId":1,"image":"local_image_redis:latest","running":true,"appId":1,"releaseId":"1","commit":"localrelease"},{"environment":{"RS485_ENABLED":"1","SOCKET_ADDR":"/var/lib/data/communication.sock"},"labels":{},"restart":"always","devices":["/dev/ttyS0:/dev/ttyAMA0"],"cap_add":["SYS_RAWIO"],"volumes":["data_lib:/var/lib/data"],"imageId":2,"serviceName":"serial","serviceId":2,"image":"local_image_serial:latest","running":true,"appId":1,"releaseId":"1","commit":"localrelease"},{"environment":{"NODE_ENV":"development","UDEV":"on","DBUS_SYSTEM_BUS_ADDRESS":"unix:path=/host/run/dbus/system_bus_socket","SOCKET_ADDR":"/var/lib/data/communication.sock","EXPRESS_HOSTNAME":"127.0.0.1","QUEUE_WORKER_SPOTS":"10","MAX_WORKERS":"1","SERVER_API_ENDPOINT":"http://192.168.10.10:3000"},"labels":{"io.balena.features.supervisor-api":"1","io.balena.features.dbus":"1"},"restart":"always","network_mode":"host","cap_add":["NET_ADMIN"],"ports":["80:80"],"volumes":["data_lib:/var/lib/data"],"imageId":3,"serviceName":"system","serviceId":3,"image":"local_image_system:latest","running":true,"appId":1,"releaseId":"1","commit":"localrelease"}]',
    networks: '{}',
    volumes: '{"data_lib":{}}',
    source: 'local'
  }
]

It happened again, so I’ve gone through your steps again, and now it shows the app from my openBalena server (so name is the name of the app and source is my openBalena domain). The localapp is gone by the way.

Something I’ve noticed, though I don’t know if this is always the case, is that I had deployed a new release of that app. I don’t know if the containers were killed right after that, but it’s probably worth mentioning.

Hi again,

Thanks for sharing the logs and going through the steps.

To be honest, I’m a bit stumped by this. The supervisor should not remove the local mode services unless the device is taken out of local mode or something changes the target state through the /v2/local/target-state supervisor endpoint. I don’t see either of those happening in the logs here, and my database corruption hypothesis didn’t pan out.

I will create a supervisor issue to improve observability of target state changes on the device logs, so we can better track the sequence of events leading to a service being deleted.

If you are up to it, you could try replicating the issue on your other device with local mode enabled, or upgrading to a newer OS (and hence supervisor) version and seeing if the issue still happens there.

Please let us know about your findings.

Hi Felipe,

Thanks for looking into this issue.
I’ll check if I can replicate this issue on another device in the near future. We have another application running on openBalena where I didn’t see this behaviour (also an RPI4, same OS and same supervisor version). But I’m not sure we ever left that device running for 24 hours or more to see whether the containers were killed.

I’ll update to a new OS version / supervisor version once it’s available on https://www.balena.io/os/, because we only want to work with what you guys publish instead of running ‘beta’/unreleased versions. But as soon as it’s available, I’ll definitely upgrade!


After the weekend, the containers stopped again. I’d just like to mention that no new release was created this time, so my previous post should be ignored; it’s not related.

Hi,

Coming back to this, I think I have an indication of what’s happening.
It only occurs with one of my projects so far, running on openBalena. I think it has to do with the fact that the project ID is 1. I’ve seen some logs (I didn’t write down which) that indicate project ID 1 is causing those issues.

Maybe someone from the balena team knows whether project ID 1 is a special case? This is probably also why balenaCloud users don’t run into this problem, since project ID 1 doesn’t come up for them.

Hello,

Indeed, fleet ID 1 is reserved for the CLI when pushing a local application. We’ve flagged this with our team and will ensure that fleet ID counting starts at 2 in openBalena.