Supervisor restarting continuously

For some reason, my supervisor is continuously restarting:

23.03.20 03:19:28 (-0600) Supervisor starting
23.03.20 03:19:30 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 04:06:13 (-0600) Supervisor starting
23.03.20 04:06:14 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 04:06:35 (-0600) Supervisor starting
23.03.20 04:06:36 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 04:53:20 (-0600) Supervisor starting
23.03.20 04:53:21 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 04:53:42 (-0600) Supervisor starting
23.03.20 04:53:43 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 05:40:27 (-0600) Supervisor starting
23.03.20 05:40:28 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 05:40:49 (-0600) Supervisor starting
23.03.20 05:40:51 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 06:27:34 (-0600) Supervisor starting
23.03.20 06:27:36 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 06:27:56 (-0600) Supervisor starting
23.03.20 06:27:57 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 07:14:41 (-0600) Supervisor starting
23.03.20 07:14:42 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 07:15:03 (-0600) Supervisor starting
23.03.20 07:15:05 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 08:01:48 (-0600) Supervisor starting
23.03.20 08:01:50 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 08:02:10 (-0600) Supervisor starting
23.03.20 08:02:12 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 08:48:56 (-0600) Supervisor starting
23.03.20 08:48:58 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 08:49:18 (-0600) Supervisor starting
23.03.20 08:49:19 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 09:36:03 (-0600) Supervisor starting
23.03.20 09:36:04 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 09:36:25 (-0600) Supervisor starting
23.03.20 09:36:26 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 10:23:09 (-0600) Supervisor starting
23.03.20 10:23:11 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 10:23:32 (-0600) Supervisor starting
23.03.20 10:23:34 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 11:10:17 (-0600) Supervisor starting
23.03.20 11:10:18 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 11:10:39 (-0600) Supervisor starting
23.03.20 11:10:40 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 11:45:21 (-0600) Starting service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 11:45:22 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
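
The timestamps above suggest a pattern: the supervisor appears to start twice about 22 seconds apart, and each pair recurs roughly every 47 minutes. Here is a minimal Python sketch for measuring those gaps, using the first few log lines as sample input. The helper name `start_gaps` is illustrative, and reading the date field as day.month.year is an assumption.

```python
import re
from datetime import datetime

# First lines of the dashboard log quoted above, copied verbatim.
LOG = """\
23.03.20 03:19:28 (-0600) Supervisor starting
23.03.20 03:19:30 (-0600) Killing service 'my_app sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'
23.03.20 04:06:13 (-0600) Supervisor starting
23.03.20 04:06:35 (-0600) Supervisor starting
23.03.20 04:53:20 (-0600) Supervisor starting
23.03.20 04:53:42 (-0600) Supervisor starting
23.03.20 05:40:27 (-0600) Supervisor starting
"""

START_RE = re.compile(
    r"^(\d{2}\.\d{2}\.\d{2} \d{2}:\d{2}:\d{2}) \([+-]\d{4}\) Supervisor starting$",
    re.MULTILINE,
)


def start_gaps(log_text):
    """Return the seconds between consecutive 'Supervisor starting' lines."""
    # Assumption: the date field is day.month.year. All lines share one UTC
    # offset, so the offset is ignored when computing differences.
    times = [
        datetime.strptime(stamp, "%d.%m.%y %H:%M:%S")
        for stamp in START_RE.findall(log_text)
    ]
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


gaps = start_gaps(LOG)
print(gaps)  # [2805, 22, 2805, 22, 2805]
```

A regular interval like this may point to something periodic (a timer, watchdog, or poll cycle) rather than random crashes, though the log alone does not say which.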

The supervisor version I’m using is:
balena/aarch64-supervisor:v10.6.27

OS version:
balenaOS 2.47.0+rev5

Though I have also had this problem with OS version balenaOS 2.45.1+rev1 on this device. It seems to be device specific.

I have seen other forum posts about potentially similar problems resulting from a full disk, but as you can see I have plenty of space:

root@36a1450:~# du -hs /
...
11G     /

root@36a1450:~# lsblk
NAME         MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
mmcblk0      179:0    0 29.1G  0 disk 
...
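
Note that `du -hs /` reports how much space is used and `lsblk` reports the device size, so neither shows the free space on a particular filesystem directly. Here is a minimal Python sketch of a free-space check, assuming Python is available where it runs (for example inside an application container). The helper names `free_fraction` and `report` are illustrative.

```python
import shutil

GIB = 1024 ** 3


def free_fraction(path="/"):
    """Return the fraction of the filesystem containing `path` that is free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def report(path="/"):
    """Format used, free, and total space for the filesystem containing `path`."""
    usage = shutil.disk_usage(path)
    return (
        f"{path}: {usage.used / GIB:.1f} GiB used, "
        f"{usage.free / GIB:.1f} GiB free of {usage.total / GIB:.1f} GiB"
    )


print(report("/"))
```

On a shell without Python, `df -h` gives the same per-filesystem view.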

When I run balena-engine logs on the supervisor I get this:

...
[debug]   Replacing container for service my_app because of config changes:
[debug]     Non-array fields:  {"added":{},"deleted":{},"updated":{"image":"sha256:7d180c0a927f2d994a7ec06843f2fc414997773d2e1af9ba6ffb44bd3ab71e1b"}}
[debug]   Replacing container for service my_app because of config changes:
[debug]     Non-array fields:  {"added":{},"deleted":{},"updated":{"image":"sha256:7d180c0a927f2d994a7ec06843f2fc414997773d2e1af9ba6ffb44bd3ab71e1b"}}
[debug]   Replacing container for service my_app because of config changes:
[debug]     Non-array fields:  {"added":{},"deleted":{},"updated":{"image":"sha256:7d180c0a927f2d994a7ec06843f2fc414997773d2e1af9ba6ffb44bd3ab71e1b"}}
[debug]   Replacing container for service my_app because of config changes:
[debug]     Non-array fields:  {"added":{},"deleted":{},"updated":{"image":"sha256:7d180c0a927f2d994a7ec06843f2fc414997773d2e1af9ba6ffb44bd3ab71e1b"}}
[debug]   Replacing container for service my_app because of config changes:
[debug]     Non-array fields:  {"added":{},"deleted":{},"updated":{"image":"sha256:7d180c0a927f2d994a7ec06843f2fc414997773d2e1af9ba6ffb44bd3ab71e1b"}}
[debug]   Replacing container for service my_app because of config changes:
[debug]     Non-array fields:  {"added":{},"deleted":{},"updated":{"image":"sha256:7d180c0a927f2d994a7ec06843f2fc414997773d2e1af9ba6ffb44bd3ab71e1b"}}
[debug]   Replacing container for service my_app because of config changes:
[debug]     Non-array fields:  {"added":{},"deleted":{},"updated":{"image":"sha256:7d180c0a927f2d994a7ec06843f2fc414997773d2e1af9ba6ffb44bd3ab71e1b"}}
[debug]   Replacing container for service my_app because of config changes:
[debug]     Non-array fields:  {"added":{},"deleted":{},"updated":{"image":"sha256:7d180c0a927f2d994a7ec06843f2fc414997773d2e1af9ba6ffb44bd3ab71e1b"}}
[debug]   Replacing container for service my_app because of config changes:
[debug]     Non-array fields:  {"added":{},"deleted":{},"updated":{"image":"sha256:7d180c0a927f2d994a7ec06843f2fc414997773d2e1af9ba6ffb44bd3ab71e1b"}}
[api]     POST /v2/applications/1506891/stop-service  -  ms
...
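
One detail worth checking in these logs is that the image digest in the supervisor's config diff differs from the digest in the "Killing service" lines of the dashboard log. Here is a minimal Python sketch that extracts and compares the two, using lines copied from this post. The helper names are illustrative, and the sketch does not claim to know how the supervisor interprets the `updated` field.

```python
import json
import re

# Lines copied verbatim from the logs quoted in this post.
DEBUG_LINE = (
    '[debug]     Non-array fields:  {"added":{},"deleted":{},"updated":'
    '{"image":"sha256:7d180c0a927f2d994a7ec06843f2fc414997773d2e1af9ba6ffb44bd3ab71e1b"}}'
)
KILL_LINE = (
    "23.03.20 11:45:22 (-0600) Killing service 'my_app "
    "sha256:66d99d89085257007509534088b4aa803e4b9460df88409b4a28a5c6cefb567e'"
)


def updated_fields(line):
    """Parse the JSON payload after 'Non-array fields:' and return its 'updated' map."""
    _, _, payload = line.partition("Non-array fields:")
    return json.loads(payload)["updated"]


def killed_image(line):
    """Return the sha256 digest from a 'Killing service' line, or None if absent."""
    match = re.search(r"Killing service '\S+ (sha256:[0-9a-f]{64})'", line)
    return match.group(1) if match else None


diff_image = updated_fields(DEBUG_LINE)["image"]
dashboard_image = killed_image(KILL_LINE)
print("digests match:", diff_image == dashboard_image)
```

If the two digests never converge, that could be consistent with the supervisor replacing the container on every pass, but that reading is an inference from the logs rather than confirmed behaviour.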

It leaves my containers in a stopped state, and it has been this way for multiple days.

The device in question is found here: https://dashboard.balena-cloud.com/devices/36a1450a2a85cc82c1c103f75398d5d7/summary
I have granted support access for a week.

Rebooting the device has helped in the past, but the problem still occurs relatively frequently, so I will leave it in the failed state to facilitate debugging.

Have you run the diagnostics from the menu on the left of the dashboard? This will provide some useful pointers if the HW is having issues.

Confusingly, it says that the supervisor is unhealthy, yet when I run balena-engine ps it shows that the container is in a healthy state.

It also mentions a high temperature; I’m not sure whether this is a symptom or a cause.

Accidentally rebooted, now it’s in a working state. Will follow up here when this happens again.

Yes, a weird one. Please do :+1:

Ok, it’s in a failed state again. All it took was restarting the container. I would very much appreciate some help.

Just to clarify - rebooting the device fixed the issue, but restarting the container started it again. Is that right?

Yep. That’s correct

Hi, I see that you have granted support access. Would it be an inconvenience for me to reboot the device, see the container working, and then restart the container to see the failure? Thanks.

Please go ahead and do that. Not sure if it enters the failed state on every reboot, but you’re certainly welcome to do what you need to to make it work. I will stay off of it until you’re done

Thanks, we will let you know once we’re done.

Hello, according to dmesg it seems like the mmc on this device is misbehaving:

[   23.627980] mmc1: Timeout waiting for hardware cmd interrupt.
[   23.633943] sdhci: =========== REGISTER DUMP (mmc1)===========
[   23.639874] sdhci: Sys addr: 0x00000000 | Version:  0x00000404
[   23.645773] sdhci: Blk size: 0x00000000 | Blk cnt:  0x00000000
[   23.651676] sdhci: Argument: 0x80062000 | Trn mode: 0x00000000
[   23.657573] sdhci: Present:  0x01fb00f1 | Host ctl: 0x00000001
[   23.663468] sdhci: Power:    0x00000000 | Blk gap:  0x00000000
[   23.669358] sdhci: Wake-up:  0x00000000 | Clock:    0x00000403
[   23.675242] sdhci: Timeout:  0x00000000 | Int stat: 0x00000000
[   23.681125] sdhci: Int enab: 0x00ff1003 | Sig enab: 0x00fc1003
[   23.687008] sdhci: AC12 err: 0x00000000 | Slot int: 0x00000000
[   23.692895] sdhci: Caps:     0x3f6cd08c | Caps_1:   0x18006f73
[   23.698781] sdhci: Cmd:      0x0000341a | Max curr: 0x00000000
[   23.704666] sdhci: Host ctl2: 0x00003000
[   23.708654] sdhci: ADMA Err: 0x00000000 | ADMA Ptr: 0x00000000fc100410
[   23.715325] sdhci: ===========================================
[   23.722188] brcmfmac: probe of mmc1:0001:3 failed with error -110
[   23.723709] usbcore: registered new interface driver brcmfmac
[   23.726620] vdd-1v8: voltage operation not allowed
[   23.731594] sdhci-tegra 3440000.sdhci: could not set regulator OCR (-1)
[   23.749768] sdhci-tegra 3440000.sdhci: Auto calibration timed out
[   23.756388] mmc1: Got command interrupt 0x00010001 even though no command operation was in progress.
[   23.765571] sdhci: =========== REGISTER DUMP (mmc1)===========
[   23.771450] sdhci: Sys addr: 0x00000000 | Version:  0x00000404
[   23.777325] sdhci: Blk size: 0x00000000 | Blk cnt:  0x00000000
[   23.783200] sdhci: Argument: 0x80062000 | Trn mode: 0x00000000
[   23.789078] sdhci: Present:  0x01fb0000 | Host ctl: 0x00000000
[   23.794952] sdhci: Power:    0x00000001 | Blk gap:  0x00000000
[   23.800823] sdhci: Wake-up:  0x00000000 | Clock:    0x00000407
[   23.806694] sdhci: Timeout:  0x00000000 | Int stat: 0x00000000
[   23.812565] sdhci: Int enab: 0x00ff1003 | Sig enab: 0x00fc1003
[   23.818431] sdhci: AC12 err: 0x00000000 | Slot int: 0x00000000
[   23.824299] sdhci: Caps:     0x3f6cd08c | Caps_1:   0x18006f73
[   23.830166] sdhci: Cmd:      0x0000341a | Max curr: 0x00000000
[   23.836031] sdhci: Host ctl2: 0x00003008
[   23.839996] sdhci: ADMA Err: 0x00000000 | ADMA Ptr: 0x00000000fc100410
[   23.846643] sdhci: ===========================================

That would explain the weird behaviour you’re seeing.
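
To check other devices for the same symptoms, the dmesg output can be filtered for MMC/SDHCI lines that also mention a failure. Here is a minimal Python sketch using a few lines from the dump above as sample input. The keyword list and the helper name `suspicious` are illustrative choices, not an exhaustive set of kernel error messages.

```python
import re

# A few lines copied verbatim from the dmesg output above.
DMESG = """\
[   23.627980] mmc1: Timeout waiting for hardware cmd interrupt.
[   23.722188] brcmfmac: probe of mmc1:0001:3 failed with error -110
[   23.723709] usbcore: registered new interface driver brcmfmac
[   23.731594] sdhci-tegra 3440000.sdhci: could not set regulator OCR (-1)
[   23.749768] sdhci-tegra 3440000.sdhci: Auto calibration timed out
[   23.756388] mmc1: Got command interrupt 0x00010001 even though no command operation was in progress.
"""

# A line is flagged only if it names an MMC/SDHCI device AND a problem keyword.
SUBSYSTEM = re.compile(r"\b(mmc\d+|sdhci)")
PROBLEM = re.compile(
    r"timeout|timed out|failed|could not|even though no command", re.IGNORECASE
)


def suspicious(lines):
    """Return the lines that mention an MMC/SDHCI device together with a problem."""
    return [line for line in lines if SUBSYSTEM.search(line) and PROBLEM.search(line)]


flagged = suspicious(DMESG.splitlines())
for line in flagged:
    print(line)
```

On a real device the input would come from the saved output of `dmesg` rather than an embedded string.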

Oh ok. So is this a hardware problem, or would reflashing the device fix it?

Hi there, it’s hard to tell for sure, but it often means that the mmc is nearing the end of its life. I would suggest you try the same device with a brand new mmc to rule out any other issues.

Looks like a hardware problem to me.

After returning to this much later, this problem persists.

Even after switching to a completely new device and a new carrier board, I still have this problem. Here is the output from trying to access the CUDA cores:

 compute_capability = 620, cudnn_half = 0 
layer     filters    size              input                output
   0  Try to set subdivisions=64 in your cfg-file. 
CUDA Error: all CUDA-capable devices are busy or unavailable: Invalid argument: [...] Assertion `0' failed.
Demo
CUDA status Error: file: ./src/cuda.c : () : line: 203 : build time: Feb 19 2020 - 22:33:15 
CUDA Error: all CUDA-capable devices are busy or unavailable
Aborted (core dumped)

The device is https://dashboard.balena-cloud.com/devices/966d4ecab74231ab4e1a7126e32cdf69
Support is enabled for 1 week. I haven’t been able to make any development progress for months due to this issue.

Feel free to do anything you want (reboot, update OS, whatever) to get it working.

Is there any way to roll back to an older OS version? It’s working with balenaOS 2.45.1+rev1 but not balenaOS 2.47.0+rev5.

Maybe that would help?

Hey there!

Are you able to ssh into the device? I am not able to access the hostOS.

We noticed you are using a production OS. Would you be able to flash the device (or another one) with the dev variant and try to ssh again?
If you are unable to ssh into the device, perhaps you can access it via serial?

This would help us gather logs from the device so we can further determine what is going on.

Thanks

Weird, it should be fixed now. Try again?