Supervisor is unable to start service

  • Python script initiated a reboot of the entire device using the API (it may have done this multiple times)
  • Device rebooted
  • Supervisor was unable to initialise script again
  • I manually rebooted from cloud console
  • Device rebooted
  • Script started fine

Anyone seen this behaviour before?

Hi @bithell, I find it hard to imagine how this behaviour would be triggered by the balena env. Could there be any state in the weatherStationCoreLink container that makes it fail to start?
The way to analyse this would be to look at the device while it is in the failing state. Can you trigger the reboot, and does it show that behaviour when you do?

Hi @samothx - no, when I trigger the reboot from the dashboard it starts fine

Can you please enable support access and share the device url/uuid with us so we can take a closer look?

Sure thing, it’s f453f0e67eb717e9b5930cd9452d2d1f

Hey there,

Is the device in the problematic state right now? If not, can you get the device into the state where the supervisor fails to start the service? It’d be interesting to check the device logs as that happens.

It’s not in that state right now, sorry, and I’m not really sure how to get it there. It’s happened a few times today (and yesterday), but I don’t have a surefire way to trigger it other than initiating a restart pretty soon after the script starts

No worries! Please keep an eye on it and once it happens, leave it in that state and ping us here. We are always watching the forums so we’ll be able to jump into the device as soon as it happens again and get all the necessary information.

@jviotti it’s just gone into it now

Awesome, I’m taking a look now

Hey @bithell,

I have a theory for what’s going on. It seems that the supervisor starts all the other containers before binding to the HTTP port, so I think that if your container is fast enough, the supervisor API might not be available yet. I’m double checking that this is indeed the case, and if so we can update it to ensure that the API is available before the other containers start.

For now, can you try updating your script to retry several times if the connection is refused, waiting a bit before each retry? I believe that the supervisor will eventually start responding.
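
A minimal sketch of that retry approach in Python, in case it helps. Everything named here is an assumption rather than something confirmed in this thread: the helper names (retry_on_refused, is_connection_refused, request_reboot) are hypothetical, and the environment variables and the /v1/reboot path are assumed to match how your script already calls the supervisor API, so adjust them to your actual request.

```python
import os
import time
import urllib.error
import urllib.request


def is_connection_refused(exc):
    """Return True if exc represents a refused TCP connection."""
    if isinstance(exc, ConnectionRefusedError):
        return True
    # urllib wraps socket-level errors in URLError, with the cause in .reason
    return isinstance(exc, urllib.error.URLError) and isinstance(
        exc.reason, ConnectionRefusedError
    )


def retry_on_refused(call, attempts=10, delay=5.0, sleep=time.sleep):
    """Run call(); if the connection is refused, wait and try again.

    Any other error, or a refusal on the final attempt, is re-raised.
    """
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except OSError as exc:
            if not is_connection_refused(exc) or attempt == attempts:
                raise
            sleep(delay)


def request_reboot():
    """Hypothetical request; the variable names and URL path are assumptions."""
    address = os.environ["BALENA_SUPERVISOR_ADDRESS"]
    api_key = os.environ["BALENA_SUPERVISOR_API_KEY"]
    req = urllib.request.Request(
        f"{address}/v1/reboot?apikey={api_key}", data=b"", method="POST"
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status
```

The script would then call retry_on_refused(request_reboot) instead of making the request directly. A delay of a few seconds is probably enough if the API only becomes available shortly after the containers start; HTTP error responses are not retried here, since they mean the API is already up.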

Sure thing - I’ve set it to loop around waiting 10 minutes each time

@jviotti - seems to be stuck in a state of trying to start the service but it can’t. Any chance you could take a look?

Hey @bithell, can you remind us of the device UUID to check, and enable support access please? Thanks!

Sure thing it’s f453f0e67eb717e9b5930cd9452d2d1f

Taking a look now.

Could you also make sure that support access is enabled? It does not seem to be currently.

Apologies, done now @CameronDiver

Hey, it seems like the device still does not have support access enabled. Could you check that you enabled it on this device?

Apologies, I’ve tried again