Services are in a constant restart loop!

I’ve got an application deployed to a single Balena Fin v1.1 device for testing, and my services seem to be in a constant state of restarting. If I look at the supervisor diagnostics (Diagnostics -> Supervisor State) I can see the following:

{
  "api_port": 48484,
  "ip_address": "xxxxxxxxxx",
  "os_version": "balenaOS 2020.04.0",
  "supervisor_version": "10.8.0",
  "update_pending": true,
  "update_failed": true,
  "update_downloaded": false,
  "status": "Downloaded",
  "download_progress": null
}
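
(Side note: I believe the same state can also be queried from inside a service container through the supervisor API, assuming the service carries the io.balena.features.supervisor-api label; the endpoint and variable names below are my assumption from the balena docs, not something I’ve verified on this device.)

# queries GET /v1/device on the local supervisor; should return the same JSON as above
curl "$BALENA_SUPERVISOR_ADDRESS/v1/device?apikey=$BALENA_SUPERVISOR_API_KEY"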

The status seems to keep cycling through some combination of:

Downloaded --> Idle --> Starting --> Stopping --> Idle --> Downloaded...

Service containers start and then seem to get a TERM signal and restart. It actually looks like they are getting stopped, removed, re-installed, started, then stopped… in a continuous loop.

I suspect this is occurring because the supervisor is trying to correct whatever error it thinks occurred during the last update (pending and failed are both true), and I’m not sure how to force a full, non-delta update. Right now delta updates are turned on and I have NOT tried turning them off. I actually want to leave the device in this state so someone at Balena can take a look.
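
For reference, in case turning deltas off does become necessary later, I believe it can be controlled with a device configuration variable; the variable name and CLI invocation below are my assumption from the balena docs rather than something I’ve tried on this device:

# set the delta config variable to 0 for just this device (a fleet-wide setting would use --application instead)
balena env add BALENA_SUPERVISOR_DELTA 0 --device <device-uuid>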

Any ideas? I’m glad to grant access to support - will send details about the app
via private message if needed.

Eventually the cycle does stop or slow down with containers in a Stopped state - but I’m guessing that’s only because everything is going into backoff and trying to start less often. Need help at this point!

I found that running the command journalctl -f -n 1000 -a -u balena can shed a little more light on whatever errors are occurring.
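
For anyone else digging into this, the supervisor’s own log can be tailed the same way from the host OS (resin-supervisor is the supervisor’s systemd unit on this OS release):

# balena engine log (the source of the errors below)
journalctl -f -n 1000 -a -u balena
# supervisor log (where the "Scheduling another update attempt..." messages come from)
journalctl -f -n 1000 -a -u resin-supervisor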

Found this error in that log output - it repeats on a regular basis!

Apr 13 04:12:02 b05ae2a balenad[790]: time="2020-04-13T04:12:02.439484599Z" level=error msg="245a5107e4854b82c816e84a20bd619a2fa549aeffae1d6bc9a1ae787c266255 cleanup: failed to delete container from containerd: no such container"
Apr 13 04:12:02 b05ae2a balenad[790]: time="2020-04-13T04:12:02.439729025Z" level=error msg="Handler for POST /containers/245a5107e4854b82c816e84a20bd619a2fa549aeffae1d6bc9a1ae787c266255/start returned error: OCI runtime create failed: container_linux.go:338: creating new parent process caused \"container_linux.go:1897: running lstat on namespace path \\\"/proc/10273/ns/net\\\" caused \\\"lstat /proc/10273/ns/net: no such file or directory\\\"\": unknown"
Apr 13 04:12:02 b05ae2a 8a566be31289[790]: [error]   Scheduling another update attempt in 900000ms due to failure:  Error: Failed to apply state transition steps. (HTTP code 500) server error - OCI runtime create failed: container_linux.go:338: creating new parent process caused "container_linux.go:1897: running lstat on namespace path \"/proc/10273/ns/net\" caused \"lstat /proc/10273/ns/net: no such file or directory\"": unknown  Steps:["start","start","kill","kill","kill","kill"]
Apr 13 04:12:02 b05ae2a 8a566be31289[790]: [error]         at /usr/src/app/dist/app.js:614:16375
Apr 13 04:12:02 b05ae2a 8a566be31289[790]: [error]       at c (/usr/src/app/dist/app.js:9:77566)
Apr 13 04:12:02 b05ae2a 8a566be31289[790]: [error]       at O._settlePromiseFromHandler (/usr/src/app/dist/app.js:312:325515)
Apr 13 04:12:02 b05ae2a 8a566be31289[790]: [error]       at O._settlePromise (/usr/src/app/dist/app.js:312:326315)
Apr 13 04:12:02 b05ae2a 8a566be31289[790]: [error]       at O._settlePromise0 (/usr/src/app/dist/app.js:312:327014)
Apr 13 04:12:02 b05ae2a 8a566be31289[790]: [error]       at O._settlePromises (/usr/src/app/dist/app.js:312:328255)
Apr 13 04:12:02 b05ae2a 8a566be31289[790]: [error]       at d (/usr/src/app/dist/app.js:312:329960)
Apr 13 04:12:02 b05ae2a 8a566be31289[790]: [error]       at p (/usr/src/app/dist/app.js:312:329899)
Apr 13 04:12:02 b05ae2a 8a566be31289[790]: [error]       at s._drainQueues (/usr/src/app/dist/app.js:312:331419)
Apr 13 04:12:02 b05ae2a 8a566be31289[790]: [error]       at Immediate.drainQueues [as _onImmediate] (/usr/src/app/dist/app.js:312:329641)
Apr 13 04:12:02 b05ae2a 8a566be31289[790]: [error]       at runCallback (timers.js:705:18)
Apr 13 04:12:02 b05ae2a 8a566be31289[790]: [error]       at tryOnImmediate (timers.js:676:5)
Apr 13 04:12:02 b05ae2a 8a566be31289[790]: [error]       at processImmediate (timers.js:658:5)
Apr 13 04:12:02 b05ae2a 8a566be31289[790]: [error]   Device state apply error Error: Failed to apply state transition steps. (HTTP code 500) server error - OCI runtime create failed: container_linux.go:338: creating new parent process caused "container_linux.go:1897: running lstat on namespace path \"/proc/10273/ns/net\" caused \"lstat /proc/10273/ns/net: no such file or directory\"": unknown  Steps:["start","start","kill","kill","kill","kill"]
Apr 13 04:12:02 b05ae2a 8a566be31289[790]: [error]         at /usr/src/app/dist/app.js:614:16375
Apr 13 04:12:02 b05ae2a 8a566be31289[790]: [error]       at c (/usr/src/app/dist/app.js:9:77566)
Apr 13 04:12:02 b05ae2a 8a566be31289[790]: [error]       at O._settlePromiseFromHandler (/usr/src/app/dist/app.js:312:325515)
Apr 13 04:12:02 b05ae2a 8a566be31289[790]: [error]       at O._settlePromise (/usr/src/app/dist/app.js:312:326315)
Apr 13 04:12:02 b05ae2a 8a566be31289[790]: [error]       at O._settlePromise0 (/usr/src/app/dist/app.js:312:327014)
Apr 13 04:12:02 b05ae2a 8a566be31289[790]: [error]       at O._settlePromises (/usr/src/app/dist/app.js:312:328255)
Apr 13 04:12:02 b05ae2a 8a566be31289[790]: [error]       at d (/usr/src/app/dist/app.js:312:329960)
Apr 13 04:12:02 b05ae2a 8a566be31289[790]: [error]       at p (/usr/src/app/dist/app.js:312:329899)
Apr 13 04:12:02 b05ae2a 8a566be31289[790]: [error]       at s._drainQueues (/usr/src/app/dist/app.js:312:331419)
Apr 13 04:12:02 b05ae2a 8a566be31289[790]: [error]       at Immediate.drainQueues [as _onImmediate] (/usr/src/app/dist/app.js:312:329641)
Apr 13 04:12:02 b05ae2a 8a566be31289[790]: [error]       at runCallback (timers.js:705:18)
Apr 13 04:12:02 b05ae2a 8a566be31289[790]: [error]       at tryOnImmediate (timers.js:676:5)
Apr 13 04:12:02 b05ae2a 8a566be31289[790]: [error]       at processImmediate (timers.js:658:5)

I’m sure you’ll need more of the log than this, but I’m not sure which parts are pertinent yet. Will share more with support once someone is available to take a look.

Hello, can you please check dmesg in the host OS and run the diagnostics tool from the dashboard (last item on the left) on this device?

@zvin

Some of the dmesg output:

[59384.096539] device vethce0d80f entered promiscuous mode
[59384.097116] IPv6: ADDRCONF(NETDEV_UP): vethce0d80f: link is not ready
[59384.223727] br-ebc537e44695: port 3(veth67f2e69) entered blocking state
[59384.223745] br-ebc537e44695: port 3(veth67f2e69) entered disabled state
[59384.223976] device veth67f2e69 entered promiscuous mode
[59384.224277] IPv6: ADDRCONF(NETDEV_UP): veth67f2e69: link is not ready
[59384.224291] br-ebc537e44695: port 3(veth67f2e69) entered blocking state
[59384.224300] br-ebc537e44695: port 3(veth67f2e69) entered forwarding state
[59385.099870] br-ebc537e44695: port 3(veth67f2e69) entered disabled state
[59385.387963] br-ebc537e44695: port 2(vethb88dfc4) entered disabled state
[59385.388551] vethbb40a59: renamed from eth0
[59385.505614] br-ebc537e44695: port 6(veth8ddbd4a) entered disabled state
[59385.508940] veth28f0639: renamed from eth0
[59385.646338] br-ebc537e44695: port 2(vethb88dfc4) entered disabled state
[59385.665720] device vethb88dfc4 left promiscuous mode
[59385.665737] br-ebc537e44695: port 2(vethb88dfc4) entered disabled state
[59385.670234] vetha3bc066: renamed from eth0
[59385.730659] br-ebc537e44695: port 4(veth930713b) entered disabled state
[59385.806571] br-ebc537e44695: port 6(veth8ddbd4a) entered disabled state
[59385.826056] device veth8ddbd4a left promiscuous mode
[59385.826080] br-ebc537e44695: port 6(veth8ddbd4a) entered disabled state
[59385.900683] br-ebc537e44695: port 4(veth930713b) entered disabled state
[59385.927718] device veth930713b left promiscuous mode
[59385.927738] br-ebc537e44695: port 4(veth930713b) entered disabled state
[59386.505356] br-ebc537e44695: port 5(veth06e5425) entered disabled state
[59386.510655] vethae54f63: renamed from eth0
[59386.833069] br-ebc537e44695: port 5(veth06e5425) entered disabled state
[59386.846914] device veth06e5425 left promiscuous mode
[59386.846935] br-ebc537e44695: port 5(veth06e5425) entered disabled state
[59387.213584] eth0: renamed from veth5180a1c
[59387.264177] eth0: renamed from vethed0998c
[59387.310341] IPv6: ADDRCONF(NETDEV_CHANGE): veth67f2e69: link becomes ready
[59387.310623] br-ebc537e44695: port 3(veth67f2e69) entered blocking state
[59387.310639] br-ebc537e44695: port 3(veth67f2e69) entered forwarding state
[59387.314350] IPv6: ADDRCONF(NETDEV_CHANGE): vethce0d80f: link becomes ready
[59387.314587] br-ebc537e44695: port 1(vethce0d80f) entered blocking state
[59387.314613] br-ebc537e44695: port 1(vethce0d80f) entered forwarding state
[59594.853977] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[59909.824857] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[60224.819264] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[60287.121532] br-ebc537e44695: port 2(vethaae3af5) entered blocking state
[60287.121550] br-ebc537e44695: port 2(vethaae3af5) entered disabled state
[60287.123723] device vethaae3af5 entered promiscuous mode
[60287.125241] IPv6: ADDRCONF(NETDEV_UP): vethaae3af5: link is not ready
[60287.548785] br-ebc537e44695: port 4(veth4a0ba06) entered blocking state
[60287.548802] br-ebc537e44695: port 4(veth4a0ba06) entered disabled state
[60287.549286] device veth4a0ba06 entered promiscuous mode
[60287.549938] IPv6: ADDRCONF(NETDEV_UP): veth4a0ba06: link is not ready
[60287.549964] br-ebc537e44695: port 4(veth4a0ba06) entered blocking state
[60287.549977] br-ebc537e44695: port 4(veth4a0ba06) entered forwarding state
[60287.979016] br-ebc537e44695: port 5(veth45c4c5a) entered blocking state
[60287.979034] br-ebc537e44695: port 5(veth45c4c5a) entered disabled state
[60287.979389] device veth45c4c5a entered promiscuous mode
[60287.983610] IPv6: ADDRCONF(NETDEV_UP): veth45c4c5a: link is not ready
[60287.983636] br-ebc537e44695: port 5(veth45c4c5a) entered blocking state
[60287.983645] br-ebc537e44695: port 5(veth45c4c5a) entered forwarding state
[60288.114942] br-ebc537e44695: port 4(veth4a0ba06) entered disabled state
[60288.115206] br-ebc537e44695: port 5(veth45c4c5a) entered disabled state
[60288.134063] br-ebc537e44695: port 6(veth776d04f) entered blocking state
[60288.134082] br-ebc537e44695: port 6(veth776d04f) entered disabled state
[60288.134350] device veth776d04f entered promiscuous mode
[60288.135484] IPv6: ADDRCONF(NETDEV_UP): veth776d04f: link is not ready
[60288.135509] br-ebc537e44695: port 6(veth776d04f) entered blocking state
[60288.135519] br-ebc537e44695: port 6(veth776d04f) entered forwarding state
[60288.292794] br-ebc537e44695: port 6(veth776d04f) entered disabled state
[60288.293411] br-ebc537e44695: port 3(veth67f2e69) entered disabled state
[60288.297930] veth5180a1c: renamed from eth0
[60288.510853] br-ebc537e44695: port 3(veth67f2e69) entered disabled state
[60288.519430] device veth67f2e69 left promiscuous mode
[60288.519451] br-ebc537e44695: port 3(veth67f2e69) entered disabled state
[60288.558525] br-ebc537e44695: port 1(vethce0d80f) entered disabled state
[60288.561308] vethed0998c: renamed from eth0
[60288.775951] br-ebc537e44695: port 1(vethce0d80f) entered disabled state
[60288.786723] device vethce0d80f left promiscuous mode
[60288.786744] br-ebc537e44695: port 1(vethce0d80f) entered disabled state
[60289.756186] eth0: renamed from vethef18472
[60289.806308] IPv6: ADDRCONF(NETDEV_CHANGE): vethaae3af5: link becomes ready
[60289.806471] br-ebc537e44695: port 2(vethaae3af5) entered blocking state
[60289.806486] br-ebc537e44695: port 2(vethaae3af5) entered forwarding state
[60290.376483] eth0: renamed from vethaf2aa43
[60290.415930] IPv6: ADDRCONF(NETDEV_CHANGE): veth4a0ba06: link becomes ready
[60290.416055] br-ebc537e44695: port 4(veth4a0ba06) entered blocking state
[60290.416067] br-ebc537e44695: port 4(veth4a0ba06) entered forwarding state
[60291.036098] eth0: renamed from vethe396aff
[60291.097267] IPv6: ADDRCONF(NETDEV_CHANGE): veth776d04f: link becomes ready
[60291.097406] br-ebc537e44695: port 6(veth776d04f) entered blocking state
[60291.097416] br-ebc537e44695: port 6(veth776d04f) entered forwarding state
[60291.426182] eth0: renamed from veth5aaddbf
[60291.478234] IPv6: ADDRCONF(NETDEV_CHANGE): veth45c4c5a: link becomes ready
[60291.478361] br-ebc537e44695: port 5(veth45c4c5a) entered blocking state
[60291.478376] br-ebc537e44695: port 5(veth45c4c5a) entered forwarding state
[60294.871359] br-ebc537e44695: port 1(veth30d08d3) entered blocking state
[60294.871375] br-ebc537e44695: port 1(veth30d08d3) entered disabled state
[60294.874606] device veth30d08d3 entered promiscuous mode
[60294.880869] IPv6: ADDRCONF(NETDEV_UP): veth30d08d3: link is not ready
[60295.168480] br-ebc537e44695: port 3(veth322e099) entered blocking state
[60295.168506] br-ebc537e44695: port 3(veth322e099) entered disabled state
[60295.168868] device veth322e099 entered promiscuous mode
[60295.169381] IPv6: ADDRCONF(NETDEV_UP): veth322e099: link is not ready
[60295.169402] br-ebc537e44695: port 3(veth322e099) entered blocking state
[60295.169410] br-ebc537e44695: port 3(veth322e099) entered forwarding state
[60295.905158] br-ebc537e44695: port 3(veth322e099) entered disabled state
[60296.109308] br-ebc537e44695: port 6(veth776d04f) entered disabled state
[60296.109626] vethe396aff: renamed from eth0
[60296.299971] br-ebc537e44695: port 6(veth776d04f) entered disabled state
[60296.309852] device veth776d04f left promiscuous mode
[60296.309879] br-ebc537e44695: port 6(veth776d04f) entered disabled state
[60296.504938] br-ebc537e44695: port 4(veth4a0ba06) entered disabled state
[60296.511532] vethaf2aa43: renamed from eth0
[60296.769741] br-ebc537e44695: port 4(veth4a0ba06) entered disabled state
[60296.783316] device veth4a0ba06 left promiscuous mode
[60296.783351] br-ebc537e44695: port 4(veth4a0ba06) entered disabled state
[60297.180850] br-ebc537e44695: port 5(veth45c4c5a) entered disabled state
[60297.181299] veth5aaddbf: renamed from eth0
[60297.393470] br-ebc537e44695: port 5(veth45c4c5a) entered disabled state
[60297.401939] device veth45c4c5a left promiscuous mode
[60297.401955] br-ebc537e44695: port 5(veth45c4c5a) entered disabled state
[60297.742132] br-ebc537e44695: port 2(vethaae3af5) entered disabled state
[60297.742605] vethef18472: renamed from eth0
[60298.006605] eth0: renamed from veth1e04dbd
[60298.067170] br-ebc537e44695: port 2(vethaae3af5) entered disabled state
[60298.079836] device vethaae3af5 left promiscuous mode
[60298.079852] br-ebc537e44695: port 2(vethaae3af5) entered disabled state
[60298.082367] eth0: renamed from vethb568c95
[60298.116108] IPv6: ADDRCONF(NETDEV_CHANGE): veth322e099: link becomes ready
[60298.116370] br-ebc537e44695: port 3(veth322e099) entered blocking state
[60298.116388] br-ebc537e44695: port 3(veth322e099) entered forwarding state
[60298.118017] IPv6: ADDRCONF(NETDEV_CHANGE): veth30d08d3: link becomes ready
[60298.118261] br-ebc537e44695: port 1(veth30d08d3) entered blocking state
[60298.118288] br-ebc537e44695: port 1(veth30d08d3) entered forwarding state

I’ve run the diagnostics and will send the output to you via private message.

Can you please enable support access on this device and send me its uuid ?

uuid: b05ae2a037f4150558906f8a1947a080

Thanks, I’m having a look

Thanks!

I noticed yesterday that the OS version is tagged “next” and it doesn’t seem to be the current released version. I didn’t want to change anything, as I figured it was good to have someone look at this behavior.

I’m guessing I downloaded that OS image before someone was able to mark it as pre-release.

It should be fixed now.
The containers were in a weird state.
What I did to fix it (from the host OS; see the combined sketch after the list):

  • systemctl stop resin-supervisor to stop the supervisor from trying to restart the containers
  • balena ps -a to list all containers
  • balena kill <all container ids from above except resin-supervisor>
  • balena rm <all container ids from above except resin-supervisor>
  • systemctl start resin-supervisor
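
Put together, the whole sequence looks roughly like this (the container IDs are placeholders - make sure the supervisor’s own container is NOT in the kill/rm list):

# stop the supervisor so it stops fighting the cleanup
systemctl stop resin-supervisor
# list every container, running or stopped, and note the application container IDs
balena ps -a
# kill and remove only the application containers (placeholder IDs shown)
balena kill <container-id-1> <container-id-2>
balena rm <container-id-1> <container-id-2>
# start the supervisor again so it recreates the services from a clean slate
systemctl start resin-supervisor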

@zvin - Excellent! Thank you. Having those steps will help in the future if this ever happens again.

Am I okay to stay on this version, balenaOS 2020.04.0? Or should I drop this one (totally redeploy) to the v2.48.0 official latest release?

You should be fine with this version.
Please check https://www.balena.io/blog/balenaos-esr-peace-of-mind-with-only-two-updates-a-year/

@zvin - I restarted the services and everything returned to a normal state after your fix.

I rebooted the device and it’s back into that weird state again. Is there a way to track down what’s causing it to get into this state?

We seem to have a remedy - manually correcting the issue by stopping the supervisor, stopping and removing containers, and restarting the supervisor - but how do we prevent it from getting into this state in the first place?

I’ll leave it in the strange state just in case you need to look at it.

I’m still looking at this device.
Did you add any update locks in your services?

No, I don’t have any locks that I’ve created.

I’ve updated the supervisor and purged its database.
It works as expected now.
Still, I’m not sure how it gets into that weird state.
Please ping us if that happens again.
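
(A rough sketch of how that can be done from the host OS - the database path and the host-side update script name are assumptions for this balenaOS generation, so double-check them before running anything:)

# stop the supervisor first
systemctl stop resin-supervisor
# purge the supervisor's state database (assumed path)
rm /mnt/data/resin-data/resin-supervisor/database.sqlite
# pull and install the target supervisor version, then start the supervisor again
update-resin-supervisor
systemctl start resin-supervisor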

Right now, if I try to restart services or anything, I get an error in the dashboard:
Request error: Unauthorized

Please check that you are still logged in.

Logged out and back in. Still get the error. :man_shrugging:

Please try forcing a refresh (CTRL + F5).
Can you still access other pages, open terminals, etc.?

I’m able to start terminal sessions from the dashboard to the host and any service container I want. I was also able to change a service variable and see it change in the log output.

Restart and/or Reboot both throw the error Request error: Unauthorized