Supervisor restarting continuously

For some reason, my supervisor is continuously restarting:

23.03.20 03:19:28 (-0600) Supervisor starting
23.03.20 03:19:30 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 04:06:13 (-0600) Supervisor starting
23.03.20 04:06:14 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 04:06:35 (-0600) Supervisor starting
23.03.20 04:06:36 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 04:53:20 (-0600) Supervisor starting
23.03.20 04:53:21 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 04:53:42 (-0600) Supervisor starting
23.03.20 04:53:43 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 05:40:27 (-0600) Supervisor starting
23.03.20 05:40:28 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 05:40:49 (-0600) Supervisor starting
23.03.20 05:40:51 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 06:27:34 (-0600) Supervisor starting
23.03.20 06:27:36 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 06:27:56 (-0600) Supervisor starting
23.03.20 06:27:57 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 07:14:41 (-0600) Supervisor starting
23.03.20 07:14:42 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 07:15:03 (-0600) Supervisor starting
23.03.20 07:15:05 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 08:01:48 (-0600) Supervisor starting
23.03.20 08:01:50 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 08:02:10 (-0600) Supervisor starting
23.03.20 08:02:12 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 08:48:56 (-0600) Supervisor starting
23.03.20 08:48:58 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 08:49:18 (-0600) Supervisor starting
23.03.20 08:49:19 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 09:36:03 (-0600) Supervisor starting
23.03.20 09:36:04 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 09:36:25 (-0600) Supervisor starting
23.03.20 09:36:26 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 10:23:09 (-0600) Supervisor starting
23.03.20 10:23:11 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 10:23:32 (-0600) Supervisor starting
23.03.20 10:23:34 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 11:10:17 (-0600) Supervisor starting
23.03.20 11:10:18 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 11:10:39 (-0600) Supervisor starting
23.03.20 11:10:40 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 11:45:21 (-0600) Starting service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 11:45:22 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
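
The timing in the log above is easier to see as numbers: the restarts appear to come in pairs roughly 20 seconds apart, with about 47 minutes between pairs. Below is a minimal Python sketch for measuring those gaps. It assumes the timestamps are in DD.MM.YY HH:MM:SS form, as they look in the dashboard log; `restart_gaps` is a hypothetical helper name, not part of any balena tooling.

```python
from datetime import datetime

def restart_gaps(log_text):
    """Return the seconds between consecutive 'Supervisor starting' entries."""
    starts = []
    for line in log_text.splitlines():
        if "Supervisor starting" not in line:
            continue
        # Assumption: the first two fields are the date and time, DD.MM.YY HH:MM:SS.
        stamp = " ".join(line.split()[:2])
        starts.append(datetime.strptime(stamp, "%d.%m.%y %H:%M:%S"))
    return [int((b - a).total_seconds()) for a, b in zip(starts, starts[1:])]

# A few entries copied from the log above (image hash shortened for readability).
sample = """\
23.03.20 03:19:28 (-0600) Supervisor starting
23.03.20 03:19:30 (-0600) Killing service 'my_app sha256:...'
23.03.20 04:06:13 (-0600) Supervisor starting
23.03.20 04:06:35 (-0600) Supervisor starting
23.03.20 04:53:20 (-0600) Supervisor starting
"""

print(restart_gaps(sample))  # [2805, 22, 2805]
```

A regular interval like this may point to something periodic triggering the restart rather than random crashes, though the log alone does not say what.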

The supervisor version I’m using is:
balena/aarch64-supervisor:v10.6.27

OS version:
balenaOS 2.47.0+rev5

I have also had this problem with balenaOS 2.45.1+rev1 on this device, so it seems to be device-specific.

I have seen other forum posts about potentially similar problems resulting from a full disk, but as you can see I have plenty of space:

root@36a1450:~# du -hs /
...
11G     /

root@36a1450:~# lsblk
NAME         MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
mmcblk0      179:0    0 29.1G  0 disk 
...
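
Note that du reports how much space files occupy and lsblk reports the size of the whole block device, so neither shows the free space on a particular filesystem directly. As a rough cross-check, here is a minimal Python sketch using the standard library's `shutil.disk_usage`. It assumes Python 3 is available wherever it is run (for example inside an application container), and `percent_used` is a hypothetical helper name.

```python
import shutil

def percent_used(path="/"):
    """Return the percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return round(100 * usage.used / usage.total, 1)

# Check the root filesystem; pass another mount point to check a different one.
print(f"{percent_used('/')}% used")
```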

When I run balena-engine logs on the supervisor I get this:

...
[debug]   Replacing container for service my_app because of config changes:
[debug]     Non-array fields:  {"added":{},"deleted":{},"updated":{"image":"sha256:7d180c0a927f2d994a7ec06843f2fc414997773d2e1af9ba6ffb44bd3ab71e1b"}}
[debug]   Replacing container for service my_app because of config changes:
[debug]     Non-array fields:  {"added":{},"deleted":{},"updated":{"image":"sha256:7d180c0a927f2d994a7ec06843f2fc414997773d2e1af9ba6ffb44bd3ab71e1b"}}
[debug]   Replacing container for service my_app because of config changes:
[debug]     Non-array fields:  {"added":{},"deleted":{},"updated":{"image":"sha256:7d180c0a927f2d994a7ec06843f2fc414997773d2e1af9ba6ffb44bd3ab71e1b"}}
[debug]   Replacing container for service my_app because of config changes:
[debug]     Non-array fields:  {"added":{},"deleted":{},"updated":{"image":"sha256:7d180c0a927f2d994a7ec06843f2fc414997773d2e1af9ba6ffb44bd3ab71e1b"}}
[debug]   Replacing container for service my_app because of config changes:
[debug]     Non-array fields:  {"added":{},"deleted":{},"updated":{"image":"sha256:7d180c0a927f2d994a7ec06843f2fc414997773d2e1af9ba6ffb44bd3ab71e1b"}}
[debug]   Replacing container for service my_app because of config changes:
[debug]     Non-array fields:  {"added":{},"deleted":{},"updated":{"image":"sha256:7d180c0a927f2d994a7ec06843f2fc414997773d2e1af9ba6ffb44bd3ab71e1b"}}
[debug]   Replacing container for service my_app because of config changes:
[debug]     Non-array fields:  {"added":{},"deleted":{},"updated":{"image":"sha256:7d180c0a927f2d994a7ec06843f2fc414997773d2e1af9ba6ffb44bd3ab71e1b"}}
[debug]   Replacing container for service my_app because of config changes:
[debug]     Non-array fields:  {"added":{},"deleted":{},"updated":{"image":"sha256:7d180c0a927f2d994a7ec06843f2fc414997773d2e1af9ba6ffb44bd3ab71e1b"}}
[debug]   Replacing container for service my_app because of config changes:
[debug]     Non-array fields:  {"added":{},"deleted":{},"updated":{"image":"sha256:7d180c0a927f2d994a7ec06843f2fc414997773d2e1af9ba6ffb44bd3ab71e1b"}}
[api]     POST /v2/applications/1506891/stop-service  -  ms
...

It leaves my containers in a stopped state, and it has been this way for multiple days.

The device in question is found here: https://dashboard.balena-cloud.com/devices/36a1450a2a85cc82c1c103f75398d5d7/summary
I have granted support access for a week.

Rebooting the device has helped in the past, but the problem still occurs relatively frequently, so I will leave it in the failed state to facilitate debugging.

Have you run the diagnostics from the menu on the left of the dashboard? This will provide some useful pointers if the HW is having issues.

Confusingly, it says that the supervisor is unhealthy, yet when I run balena-engine ps it shows that the container is in a healthy state.

It also mentions a high temperature; I’m not sure whether this is a symptom or a cause.

I accidentally rebooted the device, and now it’s in a working state. I will follow up here when this happens again.

Yes, a weird one. Please do :+1:

OK, it’s in a failed state again. All it took was restarting the container. I would very much appreciate some help.

Just to clarify - rebooting the device fixed the issue, but restarting the container started it again. Is that right?

Yep. That’s correct

Hi, I see that you have granted support access. Would it be an inconvenience for me to reboot the device, see the container working, and then restart the container to see the failure? Thanks.

Please go ahead and do that. I’m not sure whether it enters the failed state on every reboot, but you’re certainly welcome to do whatever you need to make it work. I will stay off it until you’re done.

Thanks, we will let you know once we’re done.

Hello, according to dmesg it seems like the mmc on this device is misbehaving:

[   23.627980] mmc1: Timeout waiting for hardware cmd interrupt.
[   23.633943] sdhci: =========== REGISTER DUMP (mmc1)===========
[   23.639874] sdhci: Sys addr: 0x00000000 | Version:  0x00000404
[   23.645773] sdhci: Blk size: 0x00000000 | Blk cnt:  0x00000000
[   23.651676] sdhci: Argument: 0x80062000 | Trn mode: 0x00000000
[   23.657573] sdhci: Present:  0x01fb00f1 | Host ctl: 0x00000001
[   23.663468] sdhci: Power:    0x00000000 | Blk gap:  0x00000000
[   23.669358] sdhci: Wake-up:  0x00000000 | Clock:    0x00000403
[   23.675242] sdhci: Timeout:  0x00000000 | Int stat: 0x00000000
[   23.681125] sdhci: Int enab: 0x00ff1003 | Sig enab: 0x00fc1003
[   23.687008] sdhci: AC12 err: 0x00000000 | Slot int: 0x00000000
[   23.692895] sdhci: Caps:     0x3f6cd08c | Caps_1:   0x18006f73
[   23.698781] sdhci: Cmd:      0x0000341a | Max curr: 0x00000000
[   23.704666] sdhci: Host ctl2: 0x00003000
[   23.708654] sdhci: ADMA Err: 0x00000000 | ADMA Ptr: 0x00000000fc100410
[   23.715325] sdhci: ===========================================
[   23.722188] brcmfmac: probe of mmc1:0001:3 failed with error -110
[   23.723709] usbcore: registered new interface driver brcmfmac
[   23.726620] vdd-1v8: voltage operation not allowed
[   23.731594] sdhci-tegra 3440000.sdhci: could not set regulator OCR (-1)
[   23.749768] sdhci-tegra 3440000.sdhci: Auto calibration timed out
[   23.756388] mmc1: Got command interrupt 0x00010001 even though no command operation was in progress.
[   23.765571] sdhci: =========== REGISTER DUMP (mmc1)===========
[   23.771450] sdhci: Sys addr: 0x00000000 | Version:  0x00000404
[   23.777325] sdhci: Blk size: 0x00000000 | Blk cnt:  0x00000000
[   23.783200] sdhci: Argument: 0x80062000 | Trn mode: 0x00000000
[   23.789078] sdhci: Present:  0x01fb0000 | Host ctl: 0x00000000
[   23.794952] sdhci: Power:    0x00000001 | Blk gap:  0x00000000
[   23.800823] sdhci: Wake-up:  0x00000000 | Clock:    0x00000407
[   23.806694] sdhci: Timeout:  0x00000000 | Int stat: 0x00000000
[   23.812565] sdhci: Int enab: 0x00ff1003 | Sig enab: 0x00fc1003
[   23.818431] sdhci: AC12 err: 0x00000000 | Slot int: 0x00000000
[   23.824299] sdhci: Caps:     0x3f6cd08c | Caps_1:   0x18006f73
[   23.830166] sdhci: Cmd:      0x0000341a | Max curr: 0x00000000
[   23.836031] sdhci: Host ctl2: 0x00003008
[   23.839996] sdhci: ADMA Err: 0x00000000 | ADMA Ptr: 0x00000000fc100410
[   23.846643] sdhci: ===========================================

That would explain the weird behaviour you’re seeing.
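
For anyone who wants to check their own device for the same symptoms, here is a minimal Python sketch that filters saved dmesg output for the kinds of messages shown above. The pattern list is only an assumption based on this one log and is not an exhaustive list of mmc failure messages; `mmc_errors` is a hypothetical helper name.

```python
import re

# Assumption: patterns taken from the messages in the dmesg excerpt above.
PATTERNS = [
    r"mmc\d+: Timeout waiting for hardware",
    r"sdhci.*Auto calibration timed out",
    r"probe of mmc\S+ failed with error",
    r"could not set regulator",
]
MMC_ERROR = re.compile("|".join(PATTERNS))

def mmc_errors(dmesg_text):
    """Return the dmesg lines that match any of the known mmc/sdhci error patterns."""
    return [line for line in dmesg_text.splitlines() if MMC_ERROR.search(line)]

# Lines copied from the excerpt above.
sample = """\
[   23.627980] mmc1: Timeout waiting for hardware cmd interrupt.
[   23.633943] sdhci: =========== REGISTER DUMP (mmc1)===========
[   23.722188] brcmfmac: probe of mmc1:0001:3 failed with error -110
[   23.723709] usbcore: registered new interface driver brcmfmac
[   23.749768] sdhci-tegra 3440000.sdhci: Auto calibration timed out
"""

for line in mmc_errors(sample):
    print(line)
```

With the sample above, this prints the timeout, probe failure, and calibration lines and skips the register dump header and the usbcore line.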

Oh ok. So is this a hardware problem, or would reflashing the device fix it?

Hi there, it’s hard to tell for sure, but this often means that the mmc is nearing the end of its life. I would suggest you try the same device with a brand new mmc to rule out any other issues.

Looks like a hardware problem to me.