Supervisor restarting over and over

I can confirm: the ticket was created for my own device, but a few days ago my friend Gloria told me she was experiencing the same problem.
Production images on both devices.
Both devices are Raspberry Pi 3 B devices. Mine was actually a B+.
Mine was connected to the Internet via a 3G USB dongle. I can ask Gloria, but hers should be on WiFi.
But these are “remote” devices, so we can’t easily SSH into them, except via the VPN.

Proxy service? Any hint on that?

Just in case this may help with further investigation… I’ve found the microSD from the very first system, but it would be quite difficult to also find the corresponding Raspberry Pi. :slight_smile:

Hi. If you try to boot that SD on another Raspberry Pi, does it still behave in the same way?

If yes, then we could convert the SD to a development image, and then you could use a serial console to log in and debug further.

You would use this project to turn the SD card into a development image: https://github.com/balena-os/serial-it

Hi @floion,
maybe life is getting too complicated here: I have the microSD and I can try to boot another device of the same model, but… I’ve deleted the application from my balenaCloud account! I think this happened as soon as I realized that I had lost the microSD.
Do you think I can modify some files on the microSD to try to join another application?

@daghemo if you deleted the application and device from the dashboard, you may be able to get it online again with a new config.json file from a new application (if you use the advanced setting on the add device screen, you can download just this file). However, if the device comes online in a new application, it will attempt to clear all the containers and data from the old application and bring itself in line with the new one, so I don’t think we’re gaining anything by doing that.

Hi @chrisys,
you are right, obviously.
I booted another Raspberry Pi of mine with that microSD and a config.json from another application and it is now up and running, but there is nothing left to investigate! :slight_smile:

When I’m suffering from microSD corruption and I’m sucked into the “Supervisor starting… Killing service… Killed service… Rebooting… Service exited… repeat…” black hole, is there any way to exit it?
That is, I don’t know where the corruption was on this microSD, but, if it is only within the application, is there a way to simulate what I’ve just done? That is, clear all the containers (and maybe check /data and clear it in case of problems) and start again?

Hi, as you pointed out, it’s impossible to know what data has been corrupted on the SD card, and even if you did clear all the containers, there is no guarantee the device would behave correctly. The best advice I can give is to flash a new SD card and copy your device’s config.json across from the old corrupted card (see https://www.balena.io/docs/reference/OS/configuration/ for more details).
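
A minimal sketch of that copy step, assuming both cards' boot partitions are mounted on a workstation. The function name and the mount points in the usage comment are hypothetical; adjust them to wherever your system mounts the cards. It parses the old config.json before copying, since a file on a corrupted card may itself be damaged.

```python
import json
import shutil
from pathlib import Path


def copy_config(old_boot: Path, new_boot: Path) -> dict:
    """Copy config.json from the old card's boot partition to the new card's.

    Both arguments are directories where the boot partitions are mounted
    (assumed paths, not something balena prescribes).
    """
    src = old_boot / "config.json"
    dst = new_boot / "config.json"

    # Parse first: raises an error if the old file is unreadable or not valid JSON.
    with src.open(encoding="utf-8") as fh:
        config = json.load(fh)
    if not isinstance(config, dict):
        raise ValueError(f"{src} does not contain a JSON object")

    # Keep a backup of the config.json that came with the freshly flashed image.
    if dst.exists():
        shutil.copy2(dst, dst.with_name("config.json.bak"))

    shutil.copy2(src, dst)
    return config


# Example (hypothetical mount points):
# copy_config(Path("/media/old-boot"), Path("/media/new-boot"))
```

If the parse step fails, the old config.json is probably damaged as well, and downloading a fresh one from the dashboard is the safer route.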

Hi @lucianbuzzo,
maybe you’re missing my point here.
I’ve reported a problem here: the supervisor was restarting over and over again.
You said this could be corruption on the microSD card, and I think you were right.

But I’m just wondering if balenaCloud could “mitigate” this problem on remote devices that are not easily accessible. Two proposals from my side:

  • Give the user the option not to reboot again and again, but to set a timeout instead, so that he/she is given enough time to SSH into the Host OS via the VPN to investigate and/or clear all the containers (and maybe the /data filesystem) and start again. See Supervisor restarting over and over.
  • Give the user the option to simply get rid of his/her containers on the device and fetch the application again. This option could also be split in two: dispose of the containers only, while trying to keep the /data filesystem, or get rid of both at once.
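
To make the first proposal a bit more concrete, here is a rough sketch of the kind of delay I have in mind. It is purely hypothetical: it is not how the Supervisor works today, and the function name, parameters and default values are all made up for illustration. The idea is that each consecutive failure doubles the wait before the next reboot, up to a user-configurable cap, so the device stays up long enough to be reached over the VPN.

```python
def reboot_delay(consecutive_failures: int, base: int = 30, cap: int = 3600) -> int:
    """Seconds to wait before the next reboot (hypothetical policy).

    consecutive_failures: failed starts in a row since the last healthy run
    base: delay after the first failure, in seconds (assumed default)
    cap: maximum delay, i.e. the user-configurable timeout (assumed default)
    """
    if consecutive_failures < 1:
        return 0  # no failure yet, nothing to wait for
    # Double the delay on every further failure, but never exceed the cap.
    return min(cap, base * 2 ** (consecutive_failures - 1))


if __name__ == "__main__":
    for failures in range(1, 9):
        print(failures, reboot_delay(failures))
```

With these made-up defaults, a device stuck in the loop would pause for up to an hour between reboots after the eighth failure, which should be enough time to open an SSH session to the Host OS.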

That is, my point here is that IT IS NOT impossible to know what data has been corrupted on the microSD card. :slight_smile: On the other hand, if the corruption affects balenaOS itself, the only solution would be to replace the microSD card.

Hi @chrisys,
I don’t know if this is of interest to you, but I’ve just noticed that booting the old microSD with a new config.json from another application just required a reboot.
This is because on the very first boot the dashboard showed the balenaOS version as unknown, and neither the Supervisor version nor the IP address was shown at all. Also, the Reboot button was greyed out, while the Restart button was not. But I managed to SSH into the Host OS via the dashboard and issue a reboot.
Now I can see the correct version of balenaOS in the dashboard, along with the Supervisor version and the IP address, and the Reboot button can be clicked!

Hi

Thanks for the suggestions. I think the options you propose make sense. I have passed them on to our product team, and we’ll update this thread once we have news.

If you see a device come up on the dashboard with the version and IP address not shown, as you reported, can you please grant support access to the device so that we can see what’s going on? That definitely shouldn’t happen, and I’d like to take a look at it if you see it again.
(Once you enable support access, you can send a private message with the UUID, and then ping us here to let us know that you have shared access. There is no need to share the UUID publicly.)

Thank you @anujdeshpande,
but the device is showing the correct information now. The problem happened only on the very first boot.