Unexpected supervisor crash/container restart on network change

We’re still experiencing a number of host OS/supervisor-related issues that are affecting our customers. In particular, we have trouble communicating with the supervisor when a device boots with the local network available but the internet not yet available. In that case, our nemo service periodically ends up in one of two states: either it hits the OCI runtime exec failed issue described above, where the service appears to be running but can’t reach DNS and we can’t SSH into it, or DNS works but it gets 404 responses from the supervisor when calling the restart-service API endpoint.

The 404 is very strange - the request is clearly reaching the supervisor, since we get a response at all, and we know the REST query is well-formed because the exact same query works under normal conditions. Why would it return a 404?
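To make the failure easier to capture, here is a minimal sketch of a restart-service call that retries and records every status and response body it sees. It is illustrative only, not our actual nemo code. It assumes the v2 form of the endpoint (POST /v2/applications/<appId>/restart-service with a serviceName body) and the BALENA_SUPERVISOR_ADDRESS, BALENA_SUPERVISOR_API_KEY and BALENA_APP_ID environment variables; the function names post_json and restart_service, the default address, and the retry count and delay are all made up for the example.

```python
import json
import os
import time
import urllib.error
import urllib.request


def post_json(url, payload, timeout=10):
    """POST a JSON body and return (status, body), including for HTTP errors."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        # 4xx/5xx: keep the body, it may say why the supervisor answered 404.
        return err.code, err.read().decode("utf-8", "replace")
    except urllib.error.URLError as err:
        # No HTTP response at all (connection refused, DNS failure, timeout).
        return None, str(err.reason)


def restart_service(service_name, send=post_json, attempts=5, delay=2.0,
                    sleep=time.sleep):
    """Ask the supervisor to restart a service, retrying on non-2xx replies.

    Returns the list of (status, body) pairs seen, oldest first.
    """
    # Assumed variable names and endpoint path; adjust to the real setup.
    address = os.environ.get("BALENA_SUPERVISOR_ADDRESS", "http://127.0.0.1:48484")
    api_key = os.environ.get("BALENA_SUPERVISOR_API_KEY", "")
    app_id = os.environ.get("BALENA_APP_ID", "0")
    url = f"{address}/v2/applications/{app_id}/restart-service?apikey={api_key}"

    history = []
    for attempt in range(attempts):
        status, body = send(url, {"serviceName": service_name})
        history.append((status, body))
        if status is not None and 200 <= status < 300:
            break
        if attempt < attempts - 1:
            sleep(delay)
    return history
```

Logging the returned history (especially the body of each 404) on an affected device would show whether the supervisor is rejecting the route itself or the application/service it is asked about.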

We’re still running a very old Balena release and supervisor on our devices (2.50.1). We very much want to update to the latest release, particularly to solve system time issues when booting a device (Chronyc config is bad if device comes online without internet - #51 by alexgg). Right now, though, the UART dies on the latest Balena release (2.71.3), so we’re trying to get that resolved as soon as possible (UART Failure after upgrading to 2.71.3+rev1 from 2.50.1+rev1 on Variscite IMX8M - #22 by acostach).