Hi @bithell, I find it hard to image how this behaviour would be triggered by the balena env. Could there be any state in the weatherStationCoreLink container that makes it fail to start ?
The way to analyse this would be to look at the device while it is in the failing state. Can you trigger the reboot and does it show that behaviour when you do ?
Is the device on the problematic state right now? If not, can you make device get into the state where the supervisor fails to start the service? I’d be interesting to check the device logs as that happens
It’s not in that state right now sorry, and not really sure how to get it there - it’s happened a few times today (and yesterday) but I don’t really have a sure fire way to get it there other than by initiating a restart pretty soon after the script starts
No worries! Please keep an eye on it and once it happens, leave it in that state and ping us here. We are always watching the forums so we’ll be able to jump into the device as soon as it happens again and get all the necessary information.
I have a theory for what’s going on. It seems that the supervisor starts all the other containers before binding to the HTTP port, so I think that if your container is fast enough, the supervisor API might not be available yet. I’m double checking that this is indeed the case, and if so we can update it to ensure that the API is available before the other containers start.
For now, can you try updating your script to retry various times if the connection is refused, waiting a bit before each retry? I believe that the supervisor will eventually start responding