We've had a couple of devices unable to update to new releases. I can't speak to the exact timing of the first device as I didn't handle its update, but the device I updated went like this: I pushed a new release to the application with balena push, then switched my development device from local mode to application mode. Looking at the device's summary page, the services from the new release were downloaded and started, but the current release still showed the previous release, while the target release was the new one. The first device has the same issue, except it was upgrading from two releases ago to the previous release. Neither device is pinned to a specific release, and restarting and rebooting do not seem to have an effect. We are running a modified OS version, balenaOS 2.38.3+rev3, so it can run on an Odroid-C2.
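For context, the sequence that led to this was roughly the following; the application name below is a placeholder, and the local-mode toggle was done from the device's dashboard page rather than the CLI:

# push a new release of the multi-container application
# ("my-app" is a placeholder for our application name)
balena push my-app

# then, from the device's page in the dashboard, switch the device out of
# local mode so it goes back to tracking application releases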
Another thing I noticed was that in the logs for both devices, two named volumes we defined in the compose file, lo3energy and readings, were continually being created (Creating volume 'lo3energy' and Creating volume 'readings'). The former is used by all four of our services, and the latter by two of them. These are not the only two named volumes we create.
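For reference, the relevant part of our docker-compose file looks roughly like the sketch below; the service and image names are illustrative placeholders, only the shared named volumes correspond to what I described:

version: '2.1'
services:
  api:                        # illustrative service name, not our real one
    image: api-image          # placeholder image
    restart: always
    volumes:
      - lo3energy:/data/shared      # lo3energy is mounted by all four services
      - readings:/data/readings     # readings is mounted by two of them
  worker:                     # second illustrative service
    image: worker-image       # placeholder image
    restart: always
    volumes:
      - lo3energy:/data/shared
volumes:
  lo3energy:
  readings: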
I’m wondering if the switch from local mode to application mode during a new release update somehow caused some crossed wires.
Hi there. That is very strange. If you have access to one of the devices in this state currently, would it be possible to pull some logs from the supervisor:
journalctl --no-pager -a -n 400 -u resin-supervisor.service
and from the balena-engine:
journalctl --no-pager -a -n 500 -u balena.service
That should give us some errors to start looking for the problem.
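If it is easier to share, the same commands can be redirected to files and pasted from there (the file names are just suggestions):

journalctl --no-pager -a -n 400 -u resin-supervisor.service > supervisor.log
journalctl --no-pager -a -n 500 -u balena.service > engine.log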
I’ve posted the logs to Pastebin here to keep this thread clean. If you’d like me to instead post them directly here, I can do that as well.
Thanks, that is super weird. It seems like the supervisor is struggling to create volumes correctly and just gets stuck doing that over and over until the supervisor healthcheck fails and it gets restarted. Could you also grab the logs from the balena-engine daemon using journalctl -u balena.service -t balenad --no-pager, as that might show us some errors in the volume creation? The other thing to try is to stop the supervisor with systemctl stop resin-supervisor and then try to run a container with a volume option manually and see if it throws any more specific errors.
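As a sketch of that manual test (the volume name and mount path here are placeholders, and any small image should do, for example the arm64v8/busybox image mentioned further down):

# stop the supervisor so it does not recreate its containers while we test
systemctl stop resin-supervisor

# create a throwaway named volume and run a container with it mounted
balena volume create testvol
balena run --rm -it -v testvol:/data arm64v8/busybox /bin/sh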
When I initially ran journalctl -u balena.service -t balenad --no-pager, it said there were no logs. Then I tried to stop the resin-supervisor and another container, but they kept starting back up. I know that for the containers I set restart: always in the compose file, but I'm not sure why the supervisor keeps restarting.
After trying a couple of times to stop the containers, checking the logs with the command above gave me:
-- Logs begin at Fri 2020-01-31 22:18:52 UTC, end at Fri 2020-01-31 22:32:04 UTC. --
Jan 31 22:22:52 balena balenad[763]: time="2020-01-31T22:22:52.493990168Z" level=info msg="Container d27e07b29167e63ed8499f82836f8b668fa264b25297b374898b061cbe322809 failed to exit within 10 seconds of signal 15 - using the force"
Jan 31 22:22:53 balena balenad[763]: time="2020-01-31T22:22:53.057202747Z" level=info msg="shim reaped" id=d27e07b29167e63ed8499f82836f8b668fa264b25297b374898b061cbe322809
Jan 31 22:22:53 balena balenad[763]: time="2020-01-31T22:22:53.063648652Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jan 31 22:22:53 balena balenad[763]: time="2020-01-31T22:22:53.067618111Z" level=warning msg="Failed to delete conntrack state for 172.19.0.3: invalid argument"
Jan 31 22:22:53 balena balenad[763]: time="2020-01-31T22:22:53.487819437Z" level=info msg="shim balena-engine-containerd-shim started" address=/containerd-shim/moby/d27e07b29167e63ed8499f82836f8b668fa264b25297b374898b061cbe322809/shim.sock debug=false pid=20961
Jan 31 22:24:23 balena balenad[763]: time="2020-01-31T22:24:23.408503979Z" level=info msg="Container c845c6eb11178a84aae553a1f866e75f4c78d74a2ef597cbe6969a50ae5c6b91 failed to exit within 10 seconds of signal 15 - using the force"
Jan 31 22:24:24 balena balenad[763]: time="2020-01-31T22:24:24.128867783Z" level=info msg="shim reaped" id=c845c6eb11178a84aae553a1f866e75f4c78d74a2ef597cbe6969a50ae5c6b91
Jan 31 22:24:24 balena balenad[763]: time="2020-01-31T22:24:24.134670107Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jan 31 22:24:24 balena balenad[763]: time="2020-01-31T22:24:24.134798073Z" level=error msg="attach failed with error: error attaching stdout stream: write unix /var/run/balena-engine.sock->@: write: broken pipe"
Jan 31 22:24:24 balena balenad[763]: time="2020-01-31T22:24:24.997676932Z" level=info msg="shim balena-engine-containerd-shim started" address=/containerd-shim/moby/c845c6eb11178a84aae553a1f866e75f4c78d74a2ef597cbe6969a50ae5c6b91/shim.sock debug=false pid=21337
Jan 31 22:25:28 balena balenad[763]: time="2020-01-31T22:25:28.345176970Z" level=info msg="Container d27e07b29167e63ed8499f82836f8b668fa264b25297b374898b061cbe322809 failed to exit within 10 seconds of signal 15 - using the force"
Jan 31 22:25:28 balena balenad[763]: time="2020-01-31T22:25:28.930426228Z" level=info msg="shim reaped" id=d27e07b29167e63ed8499f82836f8b668fa264b25297b374898b061cbe322809
Jan 31 22:25:28 balena balenad[763]: time="2020-01-31T22:25:28.937650516Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jan 31 22:25:28 balena balenad[763]: time="2020-01-31T22:25:28.941016978Z" level=warning msg="Failed to delete conntrack state for 172.19.0.3: invalid argument"
Jan 31 22:25:29 balena balenad[763]: time="2020-01-31T22:25:29.389715390Z" level=info msg="shim balena-engine-containerd-shim started" address=/containerd-shim/moby/d27e07b29167e63ed8499f82836f8b668fa264b25297b374898b061cbe322809/shim.sock debug=false pid=21563
Jan 31 22:28:47 balena balenad[763]: time="2020-01-31T22:28:47.787077825Z" level=info msg="Container c845c6eb11178a84aae553a1f866e75f4c78d74a2ef597cbe6969a50ae5c6b91 failed to exit within 10 seconds of signal 15 - using the force"
Jan 31 22:28:48 balena balenad[763]: time="2020-01-31T22:28:48.395185522Z" level=info msg="shim reaped" id=c845c6eb11178a84aae553a1f866e75f4c78d74a2ef597cbe6969a50ae5c6b91
Jan 31 22:28:48 balena balenad[763]: time="2020-01-31T22:28:48.401748769Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jan 31 22:28:59 balena balenad[763]: time="2020-01-31T22:28:59.404899588Z" level=info msg="shim balena-engine-containerd-shim started" address=/containerd-shim/moby/c845c6eb11178a84aae553a1f866e75f4c78d74a2ef597cbe6969a50ae5c6b91/shim.sock debug=false pid=22089
Jan 31 22:29:10 balena balenad[763]: time="2020-01-31T22:29:10.911786242Z" level=info msg="Container d27e07b29167e63ed8499f82836f8b668fa264b25297b374898b061cbe322809 failed to exit within 10 seconds of signal 15 - using the force"
Jan 31 22:29:11 balena balenad[763]: time="2020-01-31T22:29:11.510608771Z" level=info msg="shim reaped" id=d27e07b29167e63ed8499f82836f8b668fa264b25297b374898b061cbe322809
Jan 31 22:29:11 balena balenad[763]: time="2020-01-31T22:29:11.517248481Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jan 31 22:29:11 balena balenad[763]: time="2020-01-31T22:29:11.520676113Z" level=warning msg="Failed to delete conntrack state for 172.19.0.3: invalid argument"
Jan 31 22:29:11 balena balenad[763]: time="2020-01-31T22:29:11.977223395Z" level=info msg="shim balena-engine-containerd-shim started" address=/containerd-shim/moby/d27e07b29167e63ed8499f82836f8b668fa264b25297b374898b061cbe322809/shim.sock debug=false pid=22254
The supervisor has a systemd service that will continually try to restart it if the container is stopped. So to stop the supervisor you need to do systemctl stop resin-supervisor, and that should also stop it trying to start the other containers.
From the logs it seems like one of the containers is stuck in some weird hung state, where it is failing to be killed properly. I would stop the supervisor as described above, then list all the containers using balena ps -a, try to remove all of them manually with balena rm <container_id>, and see what happens. It also seems like there might be something wrong with the way the kernel is running containers; perhaps something is missing in your custom build to enable everything it needs. What kernel version is it running? We should also try stopping all the containers, the supervisor, etc., removing all of them and their images, and then just running a simple container with a volume mounted to it, checking that this works correctly, and moving forward from there.
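A rough sequence for that cleanup might look like the following; treat it as a sketch and double-check the container IDs before removing anything you want to keep:

# stop the supervisor so it stops recreating containers
systemctl stop resin-supervisor

# list every container the engine knows about
balena ps -a

# force-remove the stuck containers by ID (repeat per container)
balena rm -f <container_id>

# optionally clear out the images as well
balena rmi $(balena images -q)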
I went ahead and stopped/deleted all containers and deleted all images from my device. I asked the developer who made our OS what kernel version was used, as I am unsure.
Do you mean for me to add a simple container to my application (I'm seeing this issue in application mode, not local mode)? I'd rather not potentially mess up the other devices running on the application, and I'm not sure if making a new application would change too much. And what kind of simple container?
Our custom OS uses kernel version 4.19.
No, there's no need to add anything to your app or code. This is just a pure test on the device, to exercise the engine by itself. All you need to do is run balena run --privileged --volume "/dev:/dev" -it arm64v8/busybox /bin/sh and see that this works correctly. It should drop you into a terminal session in the busybox container. If that works, then you need to try to figure out which of the images in your application is causing the issue, by using the same command but running them instead of arm64v8/busybox.
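For example, once one of your application images is back on the device, the test is the same command with that image substituted in (the image reference below is just a placeholder for whatever balena images shows):

# same test, but against one of your own service images
balena run --privileged --volume "/dev:/dev" -it <your-service-image:tag> /bin/sh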
Oh, I see.
Running that container worked, as it did drop me into a terminal session. I exited out of it and checked journalctl -u balena.service -t balenad --no-pager, which gave me:
Jan 31 23:20:13 balena balenad[763]: time="2020-01-31T23:20:13.835940810Z" level=info msg="shim balena-engine-containerd-shim started" address=/containerd-shim/moby/9d3964a281226b990000571fc652b9b58f9af637de8cb5ddff1caee0593d37fd/shim.sock debug=false pid=25578
Jan 31 23:21:06 balena balenad[763]: time="2020-01-31T23:21:06.071023249Z" level=info msg="shim reaped" id=9d3964a281226b990000571fc652b9b58f9af637de8cb5ddff1caee0593d37fd
Jan 31 23:21:06 balena balenad[763]: time="2020-01-31T23:21:06.076145238Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jan 31 23:21:06 balena balenad[763]: time="2020-01-31T23:21:06.079910183Z" level=warning msg="Failed to delete conntrack state for 10.114.101.2: invalid argument"
I'm not sure if these are the expected logs when stopping a container or not.
How do I specify pulling the images from my application, and the right release? I deleted them as part of working through this.
It definitely seems like that could be causing a problem. Just from a quick Google on the specific error, it seems like some part of the networking is not getting cleaned up properly, but I don't know very much about the Docker engine internals, unfortunately. One other thing to check is the kernel logs for when this error happens, which you should be able to grab from dmesg.
For testing your containers you can either just try with the base images for each service, so whatever you used as FROM ..., or alternatively you can restart the resin-supervisor, let it pull all the images to disk again, then stop the supervisor and all its containers once more but don't delete the images, and then try to start up each image in the same way as you did with busybox.
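A rough outline of that second option; the image reference at the end is a placeholder for whatever balena images lists once the supervisor has re-downloaded your release:

# let the supervisor pull the release's images again
systemctl start resin-supervisor

# once the downloads have finished, stop the supervisor and its containers,
# but leave the images on disk
systemctl stop resin-supervisor
balena stop $(balena ps -q)

# then test each application image the same way as busybox
balena images
balena run --privileged --volume "/dev:/dev" -it <image:tag> /bin/sh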
I just want to check whether you have had any success with shaunmulligan's suggestions?
Kind regards,
Alida
I have not had the chance to. I may be able to get to it later today, and will get back to you on my results.