Supervisor restarting continuously

cnr · March 23, 2020, 6:04pm

For some reason, my supervisor is continuously restarting:

23.03.20 03:19:28 (-0600) Supervisor starting
23.03.20 03:19:30 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 04:06:13 (-0600) Supervisor starting
23.03.20 04:06:14 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 04:06:35 (-0600) Supervisor starting
23.03.20 04:06:36 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 04:53:20 (-0600) Supervisor starting
23.03.20 04:53:21 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 04:53:42 (-0600) Supervisor starting
23.03.20 04:53:43 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 05:40:27 (-0600) Supervisor starting
23.03.20 05:40:28 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 05:40:49 (-0600) Supervisor starting
23.03.20 05:40:51 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 06:27:34 (-0600) Supervisor starting
23.03.20 06:27:36 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 06:27:56 (-0600) Supervisor starting
23.03.20 06:27:57 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 07:14:41 (-0600) Supervisor starting
23.03.20 07:14:42 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 07:15:03 (-0600) Supervisor starting
23.03.20 07:15:05 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 08:01:48 (-0600) Supervisor starting
23.03.20 08:01:50 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 08:02:10 (-0600) Supervisor starting
23.03.20 08:02:12 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 08:48:56 (-0600) Supervisor starting
23.03.20 08:48:58 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 08:49:18 (-0600) Supervisor starting
23.03.20 08:49:19 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 09:36:03 (-0600) Supervisor starting
23.03.20 09:36:04 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 09:36:25 (-0600) Supervisor starting
23.03.20 09:36:26 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 10:23:09 (-0600) Supervisor starting
23.03.20 10:23:11 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 10:23:32 (-0600) Supervisor starting
23.03.20 10:23:34 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 11:10:17 (-0600) Supervisor starting
23.03.20 11:10:18 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 11:10:39 (-0600) Supervisor starting
23.03.20 11:10:40 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 11:45:21 (-0600) Starting service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 11:45:22 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
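
The "Supervisor starting" timestamps above alternate between a long gap of about 47 minutes and a short gap of about 22 seconds. A minimal Python sketch for measuring those gaps from a pasted log follows. The function name `restart_gaps` is hypothetical, and the sketch assumes every line uses the `DD.MM.YY HH:MM:SS (-0600)` format shown above with the same UTC offset throughout, so the offset is ignored.

```python
from datetime import datetime


def restart_gaps(log_text):
    """Return the seconds between consecutive 'Supervisor starting' entries."""
    starts = []
    for line in log_text.splitlines():
        if "Supervisor starting" not in line:
            continue
        # The first two whitespace-separated fields are the date and the time.
        stamp = " ".join(line.split()[:2])
        starts.append(datetime.strptime(stamp, "%d.%m.%y %H:%M:%S"))
    return [(later - earlier).total_seconds()
            for earlier, later in zip(starts, starts[1:])]


# A few lines copied from the log above.
sample = """\
23.03.20 03:19:28 (-0600) Supervisor starting
23.03.20 03:19:30 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 04:06:13 (-0600) Supervisor starting
23.03.20 04:06:35 (-0600) Supervisor starting
23.03.20 04:53:20 (-0600) Supervisor starting
"""

print(restart_gaps(sample))  # [2805.0, 22.0, 2805.0]
```

A regular interval like this can hint at a timer or watchdog rather than a random crash, though the log alone does not confirm the cause.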

The supervisor version I’m using is:
balena/aarch64-supervisor:v10.6.27

OS version:
balenaOS 2.47.0+rev5

Though I have also had this problem with OS version balenaOS 2.45.1+rev1 on this device. It seems to be device specific.

I have seen other forum posts about potentially similar problems resulting from a full disk, but as you can see I have plenty of space:

root@36a1450:~# du -hs /
...
11G     /

root@36a1450:~# lsblk
NAME         MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
mmcblk0      179:0    0 29.1G  0 disk 
...
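
Note that `du -hs /` reports how much space files occupy and `lsblk` reports the size of the whole disk; neither shows the free space on an individual partition. As a rough sketch, free space for a given mount point can be read with Python's standard `shutil.disk_usage`. The helper name `free_space_report` is hypothetical, and the path to pass in (the partition that actually holds the container images) is an assumption to verify on the device.

```python
import shutil


def free_space_report(path="/"):
    """Summarise total and free space for the filesystem containing `path`."""
    usage = shutil.disk_usage(path)
    percent_used = 100 * usage.used / usage.total if usage.total else 0.0
    return {
        "total_gib": usage.total / 2**30,
        "free_gib": usage.free / 2**30,
        "percent_used": percent_used,
    }


# "/" is only a default; pass the mount point of the partition of interest.
report = free_space_report("/")
print(f"{report['free_gib']:.1f} GiB free of {report['total_gib']:.1f} GiB "
      f"({report['percent_used']:.0f}% used)")
```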

When I run balena-engine logs on the supervisor I get this:

...
[debug]   Replacing container for service my_app because of config changes:
[debug]     Non-array fields:  {"added":{},"deleted":{},"updated":{"image":"sha256:7d180c0a927f2d994a7ec06843f2fc414997773d2e1af9ba6ffb44bd3ab71e1b"}}
[debug]   Replacing container for service my_app because of config changes:
[debug]     Non-array fields:  {"added":{},"deleted":{},"updated":{"image":"sha256:7d180c0a927f2d994a7ec06843f2fc414997773d2e1af9ba6ffb44bd3ab71e1b"}}
[debug]   Replacing container for service my_app because of config changes:
[debug]     Non-array fields:  {"added":{},"deleted":{},"updated":{"image":"sha256:7d180c0a927f2d994a7ec06843f2fc414997773d2e1af9ba6ffb44bd3ab71e1b"}}
[debug]   Replacing container for service my_app because of config changes:
[debug]     Non-array fields:  {"added":{},"deleted":{},"updated":{"image":"sha256:7d180c0a927f2d994a7ec06843f2fc414997773d2e1af9ba6ffb44bd3ab71e1b"}}
[debug]   Replacing container for service my_app because of config changes:
[debug]     Non-array fields:  {"added":{},"deleted":{},"updated":{"image":"sha256:7d180c0a927f2d994a7ec06843f2fc414997773d2e1af9ba6ffb44bd3ab71e1b"}}
[debug]   Replacing container for service my_app because of config changes:
[debug]     Non-array fields:  {"added":{},"deleted":{},"updated":{"image":"sha256:7d180c0a927f2d994a7ec06843f2fc414997773d2e1af9ba6ffb44bd3ab71e1b"}}
[debug]   Replacing container for service my_app because of config changes:
[debug]     Non-array fields:  {"added":{},"deleted":{},"updated":{"image":"sha256:7d180c0a927f2d994a7ec06843f2fc414997773d2e1af9ba6ffb44bd3ab71e1b"}}
[debug]   Replacing container for service my_app because of config changes:
[debug]     Non-array fields:  {"added":{},"deleted":{},"updated":{"image":"sha256:7d180c0a927f2d994a7ec06843f2fc414997773d2e1af9ba6ffb44bd3ab71e1b"}}
[debug]   Replacing container for service my_app because of config changes:
[debug]     Non-array fields:  {"added":{},"deleted":{},"updated":{"image":"sha256:7d180c0a927f2d994a7ec06843f2fc414997773d2e1af9ba6ffb44bd3ab71e1b"}}
[api]     POST /v2/applications/1506891/stop-service  -  ms
...
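
The image ID in the `updated` field of these debug lines (sha256:7d18...) differs from the one in the "Killing service" entries of the device log (sha256:66d9...). A small Python sketch for pulling that field out of pasted supervisor output follows. It assumes each "Non-array fields:" line carries a single JSON object, as in the excerpt above; the function name `updated_images` is hypothetical.

```python
import json
from collections import Counter

MARKER = "Non-array fields:"


def updated_images(log_text):
    """Collect the 'updated.image' value from every 'Non-array fields' line."""
    images = []
    for line in log_text.splitlines():
        if MARKER not in line:
            continue
        # Everything after the marker is expected to be one JSON object.
        payload = json.loads(line.split(MARKER, 1)[1])
        image = payload.get("updated", {}).get("image")
        if image is not None:
            images.append(image)
    return images


# Two entries copied from the excerpt above.
sample = """\
[debug]   Replacing container for service my_app because of config changes:
[debug]     Non-array fields:  {"added":{},"deleted":{},"updated":{"image":"sha256:7d180c0a927f2d994a7ec06843f2fc414997773d2e1af9ba6ffb44bd3ab71e1b"}}
[debug]   Replacing container for service my_app because of config changes:
[debug]     Non-array fields:  {"added":{},"deleted":{},"updated":{"image":"sha256:7d180c0a927f2d994a7ec06843f2fc414997773d2e1af9ba6ffb44bd3ab71e1b"}}
"""

# Count how often each image ID is requested.
print(Counter(updated_images(sample)))
```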

It leaves my containers in a stopped state, and it has been this way for multiple days.

The device in question is found here: https://dashboard.balena-cloud.com/devices/36a1450a2a85cc82c1c103f75398d5d7/summary
I have granted support access for a week.

Rebooting the device has helped in the past, but the problem still occurs relatively frequently, so I will leave it in the failed state to facilitate debugging.

richbayliss · March 23, 2020, 6:17pm

Have you run the diagnostics from the menu on the left of the dashboard? This will provide some useful pointers if the HW is having issues.

cnr · March 23, 2020, 6:38pm

Confusingly, it says that the supervisor is unhealthy, yet when I run balena-engine ps it shows that the container is in a healthy state.

It also mentions a high temperature; I’m not sure whether this is a symptom or a cause.

cnr · March 23, 2020, 6:50pm

Accidentally rebooted, now it’s in a working state. Will follow up here when this happens again.

richbayliss · March 23, 2020, 7:30pm

Yes, a weird one. Please do.

cnr · March 24, 2020, 1:22am

OK, it’s in a failed state again. All it took was restarting the container. I would very much appreciate some help.

anujdeshpande · March 24, 2020, 9:20am

Just to clarify: rebooting the device fixed the issue, but restarting the container brought it back. Is that right?

cnr · March 24, 2020, 4:32pm

Yep. That’s correct

alexgg · March 24, 2020, 4:55pm

Hi, I see that you have granted support access. Would it be an inconvenience for me to reboot the device, see the container working, and then restart the container to see the failure? Thanks.

cnr · March 24, 2020, 4:57pm

Please go ahead and do that. I’m not sure whether it enters the failed state on every reboot, but you’re certainly welcome to do whatever you need to make it work. I will stay off it until you’re done.

alexgg · March 24, 2020, 4:58pm

Thanks, we will let you know once we’re done.

zvin · March 24, 2020, 6:14pm

Hello, according to dmesg it seems like the mmc on this device is misbehaving:

[   23.627980] mmc1: Timeout waiting for hardware cmd interrupt.
[   23.633943] sdhci: =========== REGISTER DUMP (mmc1)===========
[   23.639874] sdhci: Sys addr: 0x00000000 | Version:  0x00000404
[   23.645773] sdhci: Blk size: 0x00000000 | Blk cnt:  0x00000000
[   23.651676] sdhci: Argument: 0x80062000 | Trn mode: 0x00000000
[   23.657573] sdhci: Present:  0x01fb00f1 | Host ctl: 0x00000001
[   23.663468] sdhci: Power:    0x00000000 | Blk gap:  0x00000000
[   23.669358] sdhci: Wake-up:  0x00000000 | Clock:    0x00000403
[   23.675242] sdhci: Timeout:  0x00000000 | Int stat: 0x00000000
[   23.681125] sdhci: Int enab: 0x00ff1003 | Sig enab: 0x00fc1003
[   23.687008] sdhci: AC12 err: 0x00000000 | Slot int: 0x00000000
[   23.692895] sdhci: Caps:     0x3f6cd08c | Caps_1:   0x18006f73
[   23.698781] sdhci: Cmd:      0x0000341a | Max curr: 0x00000000
[   23.704666] sdhci: Host ctl2: 0x00003000
[   23.708654] sdhci: ADMA Err: 0x00000000 | ADMA Ptr: 0x00000000fc100410
[   23.715325] sdhci: ===========================================
[   23.722188] brcmfmac: probe of mmc1:0001:3 failed with error -110
[   23.723709] usbcore: registered new interface driver brcmfmac
[   23.726620] vdd-1v8: voltage operation not allowed
[   23.731594] sdhci-tegra 3440000.sdhci: could not set regulator OCR (-1)
[   23.749768] sdhci-tegra 3440000.sdhci: Auto calibration timed out
[   23.756388] mmc1: Got command interrupt 0x00010001 even though no command operation was in progress.
[   23.765571] sdhci: =========== REGISTER DUMP (mmc1)===========
[   23.771450] sdhci: Sys addr: 0x00000000 | Version:  0x00000404
[   23.777325] sdhci: Blk size: 0x00000000 | Blk cnt:  0x00000000
[   23.783200] sdhci: Argument: 0x80062000 | Trn mode: 0x00000000
[   23.789078] sdhci: Present:  0x01fb0000 | Host ctl: 0x00000000
[   23.794952] sdhci: Power:    0x00000001 | Blk gap:  0x00000000
[   23.800823] sdhci: Wake-up:  0x00000000 | Clock:    0x00000407
[   23.806694] sdhci: Timeout:  0x00000000 | Int stat: 0x00000000
[   23.812565] sdhci: Int enab: 0x00ff1003 | Sig enab: 0x00fc1003
[   23.818431] sdhci: AC12 err: 0x00000000 | Slot int: 0x00000000
[   23.824299] sdhci: Caps:     0x3f6cd08c | Caps_1:   0x18006f73
[   23.830166] sdhci: Cmd:      0x0000341a | Max curr: 0x00000000
[   23.836031] sdhci: Host ctl2: 0x00003008
[   23.839996] sdhci: ADMA Err: 0x00000000 | ADMA Ptr: 0x00000000fc100410
[   23.846643] sdhci: ===========================================

That would explain the weird behaviour you’re seeing.
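
For anyone checking their own device, here is a rough Python sketch that filters saved `dmesg` output for messages like the ones above. The patterns are assumptions taken from this excerpt only, not a complete list of MMC/SDHCI failure messages, and the function name `suspicious_lines` is hypothetical. For reference, error -110 in the brcmfmac line corresponds to the Linux errno ETIMEDOUT.

```python
import re

# Patterns drawn from the excerpt above; extend as needed for other drivers.
SUSPICIOUS = re.compile(
    r"mmc\d+: Timeout"
    r"|failed with error -?\d+"
    r"|timed out"
    r"|could not set regulator",
    re.IGNORECASE,
)


def suspicious_lines(dmesg_text):
    """Return the dmesg lines that match any of the patterns above."""
    return [line for line in dmesg_text.splitlines() if SUSPICIOUS.search(line)]


# Lines copied from the dmesg output above.
sample = """\
[   23.627980] mmc1: Timeout waiting for hardware cmd interrupt.
[   23.633943] sdhci: =========== REGISTER DUMP (mmc1)===========
[   23.722188] brcmfmac: probe of mmc1:0001:3 failed with error -110
[   23.723709] usbcore: registered new interface driver brcmfmac
[   23.749768] sdhci-tegra 3440000.sdhci: Auto calibration timed out
"""

for line in suspicious_lines(sample):
    print(line)
```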

cnr · March 24, 2020, 6:40pm

Oh ok. So is this a hardware problem, or would reflashing the device fix it?

ntzovanis · March 24, 2020, 6:49pm

Hi there, it’s hard to tell for sure, but this often means that the mmc is nearing the end of its life. I would suggest you try the same device with a brand new mmc to rule out any other issues.

zvin · March 25, 2020, 10:14am

Looks like a hardware problem to me.

cnr · June 26, 2020, 10:54pm

After returning to this much later, this problem persists.

Even after switching to a completely new device and a new carrier board, I still have this problem. Here is the output from trying to access the CUDA cores:

 compute_capability = 620, cudnn_half = 0 
layer     filters    size              input                output
   0  Try to set subdivisions=64 in your cfg-file. 
CUDA Error: all CUDA-capable devices are busy or unavailable: Invalid argument: [...] Assertion `0' failed.
Demo
CUDA status Error: file: ./src/cuda.c : () : line: 203 : build time: Feb 19 2020 - 22:33:15 
CUDA Error: all CUDA-capable devices are busy or unavailable
Aborted (core dumped)

The device is https://dashboard.balena-cloud.com/devices/966d4ecab74231ab4e1a7126e32cdf69
Support is enabled for 1 week. I haven’t been able to make any development progress for months due to this issue.

cnr · June 27, 2020, 12:44pm

Feel free to do anything you want (reboot, update OS, whatever) to get it working.

cnr · June 27, 2020, 12:46pm

Is there any way to roll back to an older OS version? It’s working with balenaOS 2.45.1+rev1 but not balenaOS 2.47.0+rev5.

Maybe that would help?

rahul-thakoor · June 29, 2020, 11:09am

Hey there!

Are you able to ssh into the device? I am not able to access the hostOS.

We noticed you are using a production OS. Would you be able to flash the device (or another one) with the dev variant and try to ssh again?
If you are unable to ssh into the device, perhaps you can access it via serial?

This would help us gather logs from the device so we can further determine what is going on.

Thanks

cnr · June 29, 2020, 5:37pm

Weird, it should be fixed now. Try again?
