Tried logging in with another browser that I’ve never used with balena - still getting the auth error for restart. weird.
If it helps… The other device in that application is also giving errors when I try to issue app restart. So both devices in that app are showing similar behavior. The other device is a Raspberry Pi 3.
Maybe you did something? It just let me restart the app.
Hello again,
Sorry, it was my bad: I purged the supervisor database, which caused the `Unauthorized` errors.
I’ve set it back to what it was and you should be able to restart containers again.
Yep. Working.
@zvin - I am an incredible pain in the arse today!
Rebooted the device - first reboot everything came up fine. 2nd reboot, back into the service restart cycle. Maybe it's some kind of service dependency timing issue? Maybe I need a script somewhere that lets things get to a certain state before some of the other containers try to start?
Support still enabled if you want to dig any further.
@zvin - some observations after playing with this a bit more:
- when an application restart is issued:
  - all the services are killed and then started back up, which seems like normal behavior of course
  - looks to me like ALL the services start at the same time regardless of dependencies specified in the docker-compose file
  - then almost all of the containers stop (or are killed, I'm not 100% sure which) maybe 8-10 seconds after everything starts
  - then services seem to start up in an appropriate order and everything comes up to a "running" and stable state.
- when a device reboot is issued:
  - it's hard to tell what's actually happening; once logs start showing up we're already in a state of cycling.
  - the supervisor state in the diagnostics indicates there is a failed update pending, so I think it's trying to correct a problem that doesn't exist.
In any case, I'm going to try adding a startup script that waits for some state to be true before allowing the dependent services to actually get going.
@zvin - okay, I’ve been able to reduce my application stack down to 2 services and I’m still able to recreate the weird service restart loop we’ve been seeing. I think it has something to do with `depends_on` during boot-up.
Application restarts work as expected with everything returning to normal.
Device reboots are where this loop starts happening.
So it’s one of two things IMO, either:
- `depends_on` is being ignored at boot, so things end up in a race condition, or
- there is something wrong with my `proxy` service (entirely possible).
I’ve enabled support and the id = 64b5fccd4dc2a5f42268020c6578c0e4. Maybe it will be easier to isolate the issue with fewer services running?
Hey @donfmorrison, I think what you are seeing is actually kind of expected. In the earlier days (back when we were called resin.io) the supervisor used to start up all the containers after a reboot, but many users complained of slow boot times, especially on the rpi0 device types. One of the largest slowdowns was that the device had to wait for the container engine and then the supervisor to get up and running fully before their containers and logic started running. To reduce this time, we changed it so that the container engine now starts up the containers that were running when it was shut down.
The problem comes in because the balena container engine knows nothing about the notion of `depends_on`; that is purely a docker-compose concept, so it is implemented in the supervisor. This is why you see the behaviour you do. It’s not ideal at all and we have had several discussions on how to improve it, but there is not an easy way to fix it right now.
For now my recommendation would be to have your services not rely on `depends_on`, but instead be defensive in the way they run and ensure they only try to do what they need to do once they have confirmed the other services they need are there and running healthily.
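For example (just a rough sketch, nothing specific to your app; the `proxy` hostname and port 8080 are placeholders for whatever each service actually depends on), a dependent service could poll its dependency before starting its real work:

```python
# Sketch of a defensive startup wait: block until a dependency accepts TCP
# connections before doing any real work. Host/port below are placeholders.
import socket
import time

def wait_for_service(host: str, port: int, retry_delay: float = 2.0) -> None:
    """Poll host:port until a TCP connection succeeds."""
    while True:
        try:
            with socket.create_connection((host, port), timeout=2.0):
                return  # dependency is up and accepting connections
        except OSError:
            time.sleep(retry_delay)  # not ready yet, try again shortly

if __name__ == "__main__":
    wait_for_service("proxy", 8080)
    # ...start this service's actual work here...
```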
Thanks, @shaunmulligan. This gives me a path forward.
I would think once the supervisor was up it would start enforcing `depends_on`? But it makes sense to be more defensive in startup scripts for each service that needs it. My concern here is that I only have 1 dependent service, but the ones that have no dependency on other services also get into the recycling chaos. It’s clear now that the “issue/feature” causing my issues is the container engine just starting containers, but I don’t understand why services with no dependencies would be restarting. The logs don’t show the services dying with an error; they are receiving a TERM signal.
Also, the supervisor gets into what I’ll call a dirty state where `update_pending` and `update_failed` are both `true` and never recover. Is that expected behavior?
@zvin @shaunmulligan
I removed `depends_on` from my containers and added scripting to control the startup of any service that depends on another. As a test, I’ve also removed all services that have a dependency on another service. So in my application I have 7 containers: 5 of them are stand-alone with no dependencies, and the other 2 glue those 5 together to accomplish what the app needs to do.
So, I have 5 services right now in this test that do not talk to each other and have no `depends_on` tags in the docker-compose file.
I’m still seeing the service reboot loops. Can someone take a look?
uuid: 64b5fccd4dc2a5f42268020c6578c0e4
Support is enabled.
I can reliably recreate this issue on this device by rebooting it.
Hi @donfmorrison, I was looking at the device, but I cannot find anything that stands out as an issue, and all diagnostics pass. I see the restarts happening without any precursor of the cause in the logs, so maybe we can start with the following:
- Remove all services from the compose file, and leave just one. Keep adding services and test at what point the issue occurs.
- Simplify the docker-compose to the bare minimum and try changing the `restart` policy.
If we can narrow down when and due to which service the restart is happening, it will be easier to debug the root cause. Let me know when you have experimented with this approach, and we can see what else we can do.
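For example (just a sketch; the service name, image, and policy below are placeholders rather than anything from your app), the bare-minimum compose file for that test could look something like this:

```yaml
# Hypothetical minimal compose file for the isolation test.
version: '2'
services:
  main:
    image: alpine          # placeholder image with no app-specific logic
    command: sleep 3600    # keep the container alive without any real workload
    restart: on-failure    # compare 'no', 'on-failure', 'always', 'unless-stopped'
```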
Ok, @sradevski – I think I have the container causing the issue isolated and it is the only container in my application right now. I am unable to recreate the constant service container restarts, but if I reboot the device while the service is running – the service starts, runs for 10-15 seconds and then gets killed. It stays killed.
You should still have support access. Application restarts seem to work fine. Device reboots are where things go bad.
The purpose of this container is to establish a VPN connection to my OpenVPN server. I am open to other ways of achieving the same end if you have ideas.
I would appreciate it if you could take a look at the device and see if you can find anything. Thank you!
hey @donfmorrison
Can I restart the device to try and reproduce the error and look into it further?
Also can I enable persistent logging?
Can you provide more details regarding the VPN service? Would applying OpenVPN configs somehow interfere with the network or device healthchecks?
@rahul-thakoor - yes, you can do whatever you like to that device.
As for the VPN service: it’s a VPN client. Network communications for that container are tunneled to my VPN. Device (host) healthchecks would still go through the balena VPN, I think.
This application (7 service containers) works just fine as long as you never reboot it. Application restarts, where primarily only the supervisor is involved, are not affected by whatever is happening at boot-up. There is just something about this container that upsets the boot process, I guess.
The image I’m using for the vpn-client is widely used: https://github.com/dperson/openvpn-client
Thanks. I enabled `persistentLogging` and will try to reboot the device now.
Hey,
I want to take this back to the initial error you saw:

```
level=error msg="Handler for POST /containers/245a5107e4854b82c816e84a20bd619a2fa549aeffae1d6bc9a1ae787c266255/start returned error: OCI runtime create failed: container_linux.go:338: creating new parent process caused \"container_linux.go:1897: running lstat on namespace path \\\"/proc/10273/ns/net\\\" caused \\\"lstat /proc/10273/ns/net: no such file or directory\\\"\": unknown"
```

This looks to me like a process in the startup phase of your container failed:

```
container_linux.go:1897: running lstat on namespace path \\\"/proc/10273/ns/net\\\" caused \\\"lstat /proc/10273/ns/net: no such file or directory\\\"
```
Please could you share your compose file here so that we can take a look. Thanks.
@richbayliss - here is my compose file for the device you’ve been looking at:
```yaml
version: '2'

volumes:
  ovpn-data:

services:
  vpn:
    image: dperson/openvpn-client
    cap_add:
      - net_admin
    networks:
      - default
    tmpfs:
      - /run
      - /tmp
    security_opt:
      - label=disable
    devices:
      - "/dev/net:/dev/net"
    volumes:
      - 'ovpn-data:/vpn'
    restart: unless-stopped

networks:
  default:
    ipam:
      driver: default
      config:
        - subnet: 192.168.12.0/24
```
The image is from here: https://github.com/dperson/openvpn-client