Tried logging in with another browser that I’ve never used with balena - still getting the auth error for restart. weird.
If it helps… The other device in that application is also giving errors when I try to issue app restart. So both devices in that app are showing similar behavior. The other device is a Raspberry Pi 3.
Maybe you did something? It just let me restart the app.
Hello again,
Sorry, it was my bad: I purged the supervisor database, which caused the `Unauthorized` errors.
I’ve set it back to what it was and you should be able to restart containers again.
Yep. Working.
@zvin - I am an incredible pain in the arse today!
Rebooted the device - first reboot everything came up fine. 2nd reboot, back into the service restart cycle. Maybe it's some kind of service dependency timing issue? Maybe I need a script somewhere that lets things get to a certain state before some of the other containers try to start?
Support still enabled if you want to dig any further.
@zvin - some observations after playing with this a bit more:
- when an application restart is issued:
  - all the services are killed and then started back up, which seems like normal behavior of course
  - looks to me like ALL the services start at the same time regardless of dependencies specified in the docker-compose file
  - then almost all of the containers stop (or are killed, I'm not 100% sure which) maybe 8-10 seconds after everything starts
  - then services seem to start up in an appropriate order and everything comes up to a "running" and stable state.
- when a device reboot is issued:
  - it's hard to tell what's actually happening; once logs start showing up we're already in a state of cycling.
  - the supervisor state in the diagnostics indicates there is a failed update pending, so I think it's trying to correct a problem that doesn't exist.
In any case, I'm going to try adding a startup script that waits for some state to be true before allowing the dependent services to actually get going.
@zvin - okay, I’ve been able to reduce my application stack down to 2 services and I’m still able to recreate the weird service restart loop we’ve been seeing. I think it has something to do with `depends_on` during boot-up.
Application restarts work as expected with everything returning to normal.
Device reboots are where this loop starts happening.
So it’s one of two things IMO, either:
- `depends_on` is being ignored at boot, so things end up in a race condition, or
- there is something wrong with my `proxy` service (entirely possible).
I’ve enabled support and the id = 64b5fccd4dc2a5f42268020c6578c0e4. Maybe it will be easier to isolate the issue with fewer services running?
Hey @donfmorrison, I think what you are seeing is actually kind of expected. In the earlier days (back when we were called resin.io) the supervisor used to start up all the containers after a reboot, but many users complained of slow boot times, especially on the rpi0 device types. One of the largest slowdowns was that the device had to wait for the container engine and then the supervisor to get up and running fully before their containers and logic started running. To reduce this time, we changed it so that the container engine now starts up the containers that were running when it was shut down.
The problem comes in because the balena container engine knows nothing about the notion of `depends_on`; that is purely a docker-compose concept, so it is implemented in the supervisor. This is why you see the behaviour you do. It’s not ideal at all and we have had several discussions on how to improve it, but there is not an easy way to fix it right now.
For now my recommendation would be to have your services not rely on `depends_on`, but instead be defensive in the way they run and ensure they only try to do what they need to do once they have confirmed the other services they need are there and running healthily.
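For example (just a rough sketch, nothing specific to your app; the `proxy` hostname and port 8080 are placeholders for whatever each service actually depends on), a dependent service could poll its dependency before starting its real work:

```python
# Sketch of a defensive startup wait: block until a dependency accepts TCP
# connections before doing any real work. Host/port below are placeholders.
import socket
import time

def wait_for_service(host: str, port: int, retry_delay: float = 2.0) -> None:
    """Poll host:port until a TCP connection succeeds."""
    while True:
        try:
            with socket.create_connection((host, port), timeout=2.0):
                return  # dependency is up and accepting connections
        except OSError:
            time.sleep(retry_delay)  # not ready yet, try again shortly

if __name__ == "__main__":
    wait_for_service("proxy", 8080)
    # ...start this service's actual work here...
```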
Thanks, @shaunmulligan. This gives me a path forward.
I would think once the supervisor was up it would start enforcing `depends_on`? But it makes sense to be more defensive in startup scripts for each service that needs it. My concern here is that I only have 1 dependent service, but the ones that have no dependency on other services also get into the recycling chaos. It’s clear now that the “issue/feature” causing my issues is the container engine just starting containers, but I don’t understand why services with no dependencies would be restarting. The logs don’t show the services dying with an error; they are receiving a TERM signal.
Also, the supervisor gets into what I’ll call a dirty state where `update_pending` and `update_failed` are both `true` and never recover. Is that expected behavior?
@zvin @shaunmulligan
I removed `depends_on` from my containers and added scripting to control the startup of any service that depends on another. As a test, I’ve also removed all services that have a dependency on another service. So in my application I have 7 containers: 5 of them are stand-alone with no dependencies, and the other 2 glue those 5 together to accomplish what the app needs to do.
So, I have 5 services right now in this test that do not talk to each other and have no `depends_on` tags in the docker-compose file.
I’m still seeing the service reboot loops. Can someone take a look?
uuid: 64b5fccd4dc2a5f42268020c6578c0e4
Support is enabled.
I can reliably recreate this issue on this device by rebooting it.
Hi @donfmorrison, I was looking at the device, but I cannot find anything that stands out as an issue, and all diagnostics pass. I see the restarts happening without any precursor of the cause in the logs, so maybe we can start with the following:
- Remove all services from the compose file, and leave just one. Keep adding services and test at what point the issue occurs.
- Simplify the docker-compose to the bare minimum and try changing the `restart` policy.
If we can narrow down when and due to which service the restart is happening, it will be easier to debug the root cause. Let me know when you have experimented with this approach, and we can see what else we can do.
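For example (just a sketch; the service name, image, and policy below are placeholders rather than anything from your app), the bare-minimum compose file for that test could look something like this:

```yaml
# Hypothetical minimal compose file for the isolation test.
version: '2'
services:
  main:
    image: alpine          # placeholder image with no app-specific logic
    command: sleep 3600    # keep the container alive without any real workload
    restart: on-failure    # compare 'no', 'on-failure', 'always', 'unless-stopped'
```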
Ok, @sradevski – I think I have the container causing the issue isolated and it is the only container in my application right now. I am unable to recreate the constant service container restarts, but if I reboot the device while the service is running – the service starts, runs for 10-15 seconds and then gets killed. It stays killed.
You should still have support access. Application restarts seem to work fine. Device reboots are where things go bad.
The purpose of this container is to establish a VPN connection to my OpenVPN server. I am open to other ways of achieving the same end if you have ideas.
I would appreciate it if you could take a look at the device and see if you can find anything. Thank you!
hey @donfmorrison
Can I restart the device to try and reproduce the error and look into it further?
Also can I enable persistent logging?
Can you provide more details regarding the VPN service? Would applying OpenVPN configs somehow interfere with the network or device healthchecks?
@rahul-thakoor - yes, you can do whatever you like to that device.
As for the VPN service: it’s a VPN client. Network communications for that container are tunneled to my VPN. Device (host) healthchecks would still go through the balena VPN, I think.
This application (7 service containers) works just fine as long as you never reboot it. Application restarts, where primarily only the supervisor is involved, are not affected by whatever is happening at boot-up. There is just something about this container that upsets the boot process, I guess.
The image I’m using for the vpn-client is widely used: https://github.com/dperson/openvpn-client
Thanks. I enabled `persistentLogging` and will try to reboot the device now.
Hey,
I want to take this back to the initial error you saw:

```
level=error msg="Handler for POST /containers/245a5107e4854b82c816e84a20bd619a2fa549aeffae1d6bc9a1ae787c266255/start returned error: OCI runtime create failed: container_linux.go:338: creating new parent process caused \"container_linux.go:1897: running lstat on namespace path \\\"/proc/10273/ns/net\\\" caused \\\"lstat /proc/10273/ns/net: no such file or directory\\\"\": unknown"
```

This looks to me like a process in the startup phase of your container failed:

```
container_linux.go:1897: running lstat on namespace path \\\"/proc/10273/ns/net\\\" caused \\\"lstat /proc/10273/ns/net: no such file or directory\\\"
```
Please could you share your compose file here so that we can take a look. Thanks.
@richbayliss - here is my compose file for the device you’ve been looking at:
```yaml
version: '2'

volumes:
  ovpn-data:

services:
  vpn:
    image: dperson/openvpn-client
    cap_add:
      - net_admin
    networks:
      - default
    tmpfs:
      - /run
      - /tmp
    security_opt:
      - label=disable
    devices:
      - "/dev/net:/dev/net"
    volumes:
      - 'ovpn-data:/vpn'
    restart: unless-stopped

networks:
  default:
    ipam:
      driver: default
      config:
        - subnet: 192.168.12.0/24
```
The image is from here: https://github.com/dperson/openvpn-client