unmatched local and remote supervisor

We have had several instances where a device comes up in the dashboard with a target build that differs from the reported current build, and with no application listed. Diagnostics shows “unmatched local and remote supervisor”. The dashboard shows 10.8.0, but what is running locally is 9.14.0. We have seen this happen in two scenarios: (1) moving a device from one application to another, and (2) updating a system by burning a new SD card with the same UUID but a different host OS (2.32 -> 2.48).

In this latest incident, I see:

root@51e8aa2:~# balena ps
CONTAINER ID        IMAGE                           COMMAND                  CREATED             STATUS                            PORTS               NAMES
944ac6e74434        balena/rpi-supervisor:v9.14.0   "./entry.sh"             2 weeks ago         Up 7 minutes (health: starting)                       resin_supervisor
4acaecc51458        47d9342dad3d                    "/usr/bin/entry.sh /…"   3 weeks ago         Up 19 hours                                           solmon_2423264_1433257
root@51e8aa2:~# balena images
REPOSITORY                                                       TAG                 IMAGE ID            CREATED             SIZE
registry2.balena-cloud.com/v2/ddf04b3f5d8ca26cd61796853192e796   <none>              47d9342dad3d        2 months ago        340MB
balena/rpi-supervisor                                            v10.8.0             cc42f1e9307e        7 months ago        60.6MB
balena-healthcheck-image                                         latest              cfdb1bf11e4c        8 months ago        8.95kB
balena/rpi-supervisor                                            v9.14.0             1764acf9f2c9        17 months ago       56.4MB
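The mismatch visible in the two listings above (9.14.0 running, 10.8.0 already on disk) can be checked mechanically. Below is a minimal sketch that only parses the text output of balena ps and balena images in the column layout shown above. The function names running_supervisor_tag and image_present are hypothetical helpers written for this example, not part of balenaOS, and the tag parsing assumes an image reference with a single colon.

```shell
#!/bin/sh
# Hypothetical helpers: they only parse text, they do not talk to the engine.

# Read `balena ps` output on stdin and print the image tag of the
# container whose name (last column) is resin_supervisor.
running_supervisor_tag() {
    awk '$NF == "resin_supervisor" { split($2, a, ":"); print a[2] }'
}

# Read `balena images` output on stdin and succeed if the given
# repository (column 1) and tag (column 2) are listed.
image_present() {
    awk -v repo="$1" -v tag="$2" \
        '$1 == repo && $2 == tag { found = 1 } END { exit !found }'
}
```

On the device this could be used as: balena ps | running_supervisor_tag, and balena images | image_present balena/rpi-supervisor v10.8.0 && echo "target image already on disk".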

The proposed solution in Can't update supervisor shows that I can remove the old supervisor image (9.14.0) and trigger an update, and I have used that process previously to correct the situation when there wasn’t an image for 10.8.0 already downloaded.

Now, when I follow the instructions in the forum topic above (set the new tag, run the script to set the correct remote tag, then run update-resin-supervisor), I correctly see:

Sep 10 18:34:24 51e8aa2 balenad[880]: time="2020-09-10T18:34:24.261728224Z" level=info msg="shim balena-engine-containerd-shim started" address=/containerd-shim/moby/093375c525e3511c34c4c2589c02743c93ba76b44cd153c8bec3e88297541d63/shim.sock debug=false pid=1635
Supervisor configuration found from API.
Getting image id...
Supervisor balena/rpi-supervisor:v10.8.0 already downloaded.

When I then start the supervisor using systemctl start resin-supervisor, I get into a loop with the following errors:

root@51e8aa2:/lib/systemd/system# systemctl start resin-supervisor
root@51e8aa2:/lib/systemd/system# Sep 10 18:35:27 51e8aa2 resin-supervisor[1827]: Error response from daemon: No such container: resin_supervisor
Sep 10 18:35:27 51e8aa2 resin-supervisor[1832]: active
Sep 10 18:35:32 51e8aa2 resin-supervisor[1833]: Error: No such object: balena/rpi-supervisor:v9.14.0
Sep 10 18:35:34 51e8aa2 resin-supervisor[1833]: Error: No such object: resin_supervisor
Sep 10 18:35:36 51e8aa2 resin-supervisor[1893]: Error response from daemon: No such container: resin_supervisor
Sep 10 18:35:36 51e8aa2 sh[1892]: Getting image name and tag...
Sep 10 18:35:38 51e8aa2 sh[1892]: Supervisor configuration found from API.
Sep 10 18:35:38 51e8aa2 sh[1892]: Getting image id...
Sep 10 18:35:39 51e8aa2 sh[1892]: Supervisor balena/rpi-supervisor:v10.8.0 already downloaded.
Sep 10 18:35:47 51e8aa2 resin-supervisor[1942]: Error response from daemon: No such container: resin_supervisor
Sep 10 18:35:47 51e8aa2 resin-supervisor[1947]: active
Sep 10 18:35:51 51e8aa2 resin-supervisor[1948]: Error: No such object: balena/rpi-supervisor:v9.14.0
Sep 10 18:35:52 51e8aa2 resin-supervisor[1948]: Error: No such object: resin_supervisor
Sep 10 18:35:55 51e8aa2 resin-supervisor[1985]: Error response from daemon: No such container: resin_supervisor
Sep 10 18:35:56 51e8aa2 sh[1984]: Getting image name and tag...
Sep 10 18:35:58 51e8aa2 sh[1984]: Supervisor configuration found from API.
Sep 10 18:35:58 51e8aa2 sh[1984]: Getting image id...
Sep 10 18:35:59 51e8aa2 sh[1984]: Supervisor balena/rpi-supervisor:v10.8.0 already downloaded.
Sep 10 18:36:06 51e8aa2 resin-supervisor[2032]: Error response from daemon: No such container: resin_supervisor
Sep 10 18:36:06 51e8aa2 resin-supervisor[2038]: active
Sep 10 18:36:11 51e8aa2 resin-supervisor[2039]: Error: No such object: balena/rpi-supervisor:v9.14.0

I suppose I can fix this by removing the existing image completely and forcing the download, but in some of our installations a 60MB download is problematic.

My questions are:

  1. What process leads to this mismatch? Understanding it may let us prevent it from happening (especially in the field, where connections are weak).
  2. Why does the resin-supervisor systemd unit think that the supervisor version should be 9.14.0 when the update script recognizes that it should be 10.8.0? How can I remedy this mismatch short of the brute-force method of deleting the existing image completely and forcing a redownload, which presumably triggers something that restores a consistent view of the version?

Thanks

Hi there – thanks for getting in touch with us. I’ll need to do some digging to understand the sequence of events you’re describing, and will get back to you on this.

As for upgrading the supervisor, we are in the process of making this available to do from the dashboard in the same way you can currently upgrade the host OS. In the meantime, though, you can accomplish the same thing by using the developer console in your browser. Open a tab in your browser to the balena dashboard, ensure you’re logged in, open up the console and run:

await window.sdk.models.device.setSupervisorRelease("[uuid of device]", "v[supervisor version]");

For example: to set device 380df5d944932ab47f5bcfdc34a85e1c to supervisor version 10.8.0, I would run:

await window.sdk.models.device.setSupervisorRelease("380df5d944932ab47f5bcfdc34a85e1c", "v10.8.0");

Some things to note:

  • The supervisor version must start with v – that is, v10.8.0, not 10.8.0.
  • The supervisor image will, if needed, be downloaded in its entirety – we plan on adding the capability to download the image delta in order to save bandwidth, but this is not yet in production.
  • Supervisors can be upgraded, but not downgraded.
  • The upgrade should happen quickly (within a few minutes at most) if the device is online; if the device is not online, it should attempt it the next time it can hit our API.
  • This method is documented in our SDK documentation.

I’ll get back to you on your other questions, and in the meantime please let us know if this works for you.

All the best,
Hugh

Hi Hugh,

Thanks for the quick response. I suspect that the line of code that you provided does the same thing as:

  curl -s "${API_ENDPOINT}/v2/device($DEVICEID)?apikey=$APIKEY" -X PATCH -H 'Content-Type: application/json;charset=UTF-8' --data-binary "{\"supervisor_release\": \"$SUPERVISOR_ID\"}"

in the script provided in the forum post.

I believe this sets the remote supervisor version, or am I misunderstanding? My issue on this device is the errors that I have posted, that is, the remote version is set and understood, the image has already been downloaded, but the local system seems to still reference the old supervisor.

Hi there – thanks for the response. I need to give a bit of background first, to answer this and your previous questions.

Our API has the task of setting target device state: this is the application you should be running, the version of the containers you should have, the version of the supervisor you should be running, and so on. The task of the device’s supervisor is to get its target state from the API, carry it out as best it can, and then report its current state. The current state may not always match the target state, but the idea is that the supervisor will do its best to converge on it. In our API, the states for the supervisor version are:

  • target state: should_be_managed_by__supervisor_release
  • current state: supervisor_version

Note that a release is a device-version tuple – thus, release 1234 might be Pi4-10.8.0, while release 5678 might be Pi3-10.8.0. See here for details.

In the case of the curl you’ve posted, this is setting the current state on the API – it does not change the target state. You would need to patch should_be_managed_by__supervisor_release in order to change the target state. This is the advantage of the SDK method, by the way – it accepts a version for the supervisor, and figures out the appropriate release for you. However, if you pick the right release for your device-version tuple, a PATCH should have the same effect.

As for how your devices came to be in this state: In the case where you’ve installed a new version of the OS on a separate SD card, while preserving the UUID, the API notes that the device has reported the newer supervisor version, but does not adjust its target state. When you run update-resin-supervisor on the device, this does not update the API’s target state for the device. Thus, while the upgrade succeeds, the change may be reverted when the supervisor gets its target state next. One complication is that supervisors cannot be downgraded; I believe that if the target version is lower than the current version, it should be left alone (though still triggering the warning you saw in your device checks).
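The rule described above (upgrade when the target is newer, leave the device alone when the target is older) can be written down as a small comparison. This is an illustrative sketch only and not the supervisor's or the API's actual code. The function names version_lt and supervisor_action are made up for this example, and it assumes a sort that supports -V (GNU coreutils), which may not be present on every device image.

```shell
#!/bin/sh
# Illustrative only: hypothetical helpers, not balena code.

# Succeed if version $1 is strictly lower than version $2.
# A leading "v" is stripped; ordering relies on `sort -V` (assumed available).
version_lt() {
    a="${1#v}"; b="${2#v}"
    [ "$a" != "$b" ] &&
        [ "$(printf '%s\n%s\n' "$a" "$b" | sort -V | head -n 1)" = "$a" ]
}

# Print the expected outcome for a (current, target) supervisor pair.
supervisor_action() {
    current="$1"; target="$2"
    if [ "${current#v}" = "${target#v}" ]; then
        echo "in-sync"
    elif version_lt "$current" "$target"; then
        echo "upgrade"
    else
        # Target is lower than current: downgrades are not supported,
        # so the running supervisor is left alone (the check still warns).
        echo "leave-alone"
    fi
}
```

For the device in this thread, supervisor_action v9.14.0 v10.8.0 falls into the upgrade branch, while a freshly flashed card reporting a newer supervisor than the stored target would fall into the leave-alone branch.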

To resolve this, I would definitely recommend setting the target state for the supervisor using the SDK method, or by PATCHing should_be_managed_by__supervisor_release.
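For the PATCH route, here is a minimal sketch. It reuses the URL shape and the API_ENDPOINT, DEVICEID and APIKEY variables from the curl posted earlier in this thread and only changes the field being set. API_VERSION is a placeholder you must set yourself, since the thread does not confirm which API version exposes should_be_managed_by__supervisor_release. The function names are hypothetical, and the release ID is assumed to be the numeric ID of the release matching your device type and supervisor version.

```shell
#!/bin/sh
# Sketch under assumptions: API_ENDPOINT, API_VERSION, DEVICEID and APIKEY
# must be set by the caller; function names are hypothetical.

# Build the JSON body that sets the *target* supervisor release.
# Rejects anything that is not a plain numeric release ID.
target_payload() {
    case "$1" in
        ''|*[!0-9]*) echo "release id must be numeric" >&2; return 1 ;;
    esac
    printf '{"should_be_managed_by__supervisor_release": %s}\n' "$1"
}

# PATCH the device resource with the target release ID given as $1.
set_target_supervisor_release() {
    payload=$(target_payload "$1") || return 1
    curl -s -X PATCH \
        "${API_ENDPOINT}/${API_VERSION}/device(${DEVICEID})?apikey=${APIKEY}" \
        -H 'Content-Type: application/json;charset=UTF-8' \
        --data-binary "$payload"
}
```

Calling target_payload 1234 on its own prints the request body without touching the network, which is a cheap way to check what would be sent before running set_target_supervisor_release 1234.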

In the case you describe where this is triggered by moving a device to another application: I don’t know a reason why this would happen without an OS upgrade (as in the previous case), or some other change in device configuration. If you can provide us with a UUID for the device and enable support, we can take a look to see what might be going on.

I’ll be taking a look at the forum post you mention, as it’s possible we should note that it is out of date.

I hope this helps. Please let us know if you have any further questions.

All the best,
Hugh

Thanks for the really detailed explanation. Super helpful, and good for posterity also. And as no good deed shall go unpunished :wink:, that triggers another couple of questions:

  1. With the application build, the dashboard shows both the current release and the target. For the supervisor version it shows only a single number. Is that the target or the current? I would assume current, but in my system it was already showing 10.8.0, which was definitely not current.
  2. You indicate that in the curl command the options are patching supervisor_version or should_be_managed_by__supervisor_release. I’m setting supervisor_release, which I don’t see in the API doc. Perhaps I’m using an older API?
  3. Per your explanation, I’m setting the current state when I use that API. (Note that the dashboard was already showing 10.8.0.) If so, and the target is set to 10.8.0 and the current is set to 10.8.0, why is the system flailing and trying to load 9.14.0?

I ran the command you indicated with my UUID in the browser console, with no change in behavior. The exact same errors as posted above show up in journalctl -f.

Hello there. We are investigating this issue further. Can you send us your device’s UUID and grant us support access?