unmatched local and remote supervisor

We have had several instances where a device comes up in the dashboard with a target build that differs from the supposed current build, and with no application listed. Diagnostics shows “unmatched local and remote supervisor”. The dashboard shows 10.8.0, but what is running locally is 9.14.0. We have seen this happen in two different scenarios: (1) moving a device from one application to another, and (2) updating a system by burning a new SD card with the same UUID but a different host OS (2.32 -> 2.48).

In this latest incident, I see:

root@51e8aa2:~# balena ps
CONTAINER ID        IMAGE                           COMMAND                  CREATED             STATUS                            PORTS               NAMES
944ac6e74434        balena/rpi-supervisor:v9.14.0   "./entry.sh"             2 weeks ago         Up 7 minutes (health: starting)                       resin_supervisor
4acaecc51458        47d9342dad3d                    "/usr/bin/entry.sh /…"   3 weeks ago         Up 19 hours                                           solmon_2423264_1433257
root@51e8aa2:~# balena images
REPOSITORY                                                       TAG                 IMAGE ID            CREATED             SIZE
registry2.balena-cloud.com/v2/ddf04b3f5d8ca26cd61796853192e796   <none>              47d9342dad3d        2 months ago        340MB
balena/rpi-supervisor                                            v10.8.0             cc42f1e9307e        7 months ago        60.6MB
balena-healthcheck-image                                         latest              cfdb1bf11e4c        8 months ago        8.95kB
balena/rpi-supervisor                                            v9.14.0             1764acf9f2c9        17 months ago       56.4MB

The proposed solution in Can't update supervisor shows that I can remove the old supervisor image (9.14.0) and trigger an update, and I have used that process previously to correct the situation when there wasn’t an image for 10.8.0 already downloaded.

Now, when I follow the instructions in the forum topic above (set the new tag, run the script to set the correct remote tag, then run update-resin-supervisor), I correctly see:

Sep 10 18:34:24 51e8aa2 balenad[880]: time="2020-09-10T18:34:24.261728224Z" level=info msg="shim balena-engine-containerd-shim started" address=/containerd-shim/moby/093375c525e3511c34c4c2589c02743c93ba76b44cd153c8bec3e88297541d63/shim.sock debug=false pid=1635
Supervisor configuration found from API.
Getting image id...
Supervisor balena/rpi-supervisor:v10.8.0 already downloaded.

When I then start the supervisor using systemctl start resin-supervisor, I get into a loop with the following errors:

root@51e8aa2:/lib/systemd/system# systemctl start resin-supervisor
root@51e8aa2:/lib/systemd/system# Sep 10 18:35:27 51e8aa2 resin-supervisor[1827]: Error response from daemon: No such container: resin_supervisor
Sep 10 18:35:27 51e8aa2 resin-supervisor[1832]: active
Sep 10 18:35:32 51e8aa2 resin-supervisor[1833]: Error: No such object: balena/rpi-supervisor:v9.14.0
Sep 10 18:35:34 51e8aa2 resin-supervisor[1833]: Error: No such object: resin_supervisor
Sep 10 18:35:36 51e8aa2 resin-supervisor[1893]: Error response from daemon: No such container: resin_supervisor
Sep 10 18:35:36 51e8aa2 sh[1892]: Getting image name and tag...
Sep 10 18:35:38 51e8aa2 sh[1892]: Supervisor configuration found from API.
Sep 10 18:35:38 51e8aa2 sh[1892]: Getting image id...
Sep 10 18:35:39 51e8aa2 sh[1892]: Supervisor balena/rpi-supervisor:v10.8.0 already downloaded.
Sep 10 18:35:47 51e8aa2 resin-supervisor[1942]: Error response from daemon: No such container: resin_supervisor
Sep 10 18:35:47 51e8aa2 resin-supervisor[1947]: active
Sep 10 18:35:51 51e8aa2 resin-supervisor[1948]: Error: No such object: balena/rpi-supervisor:v9.14.0
Sep 10 18:35:52 51e8aa2 resin-supervisor[1948]: Error: No such object: resin_supervisor
Sep 10 18:35:55 51e8aa2 resin-supervisor[1985]: Error response from daemon: No such container: resin_supervisor
Sep 10 18:35:56 51e8aa2 sh[1984]: Getting image name and tag...
Sep 10 18:35:58 51e8aa2 sh[1984]: Supervisor configuration found from API.
Sep 10 18:35:58 51e8aa2 sh[1984]: Getting image id...
Sep 10 18:35:59 51e8aa2 sh[1984]: Supervisor balena/rpi-supervisor:v10.8.0 already downloaded.
Sep 10 18:36:06 51e8aa2 resin-supervisor[2032]: Error response from daemon: No such container: resin_supervisor
Sep 10 18:36:06 51e8aa2 resin-supervisor[2038]: active
Sep 10 18:36:11 51e8aa2 resin-supervisor[2039]: Error: No such object: balena/rpi-supervisor:v9.14.0

I suppose I can fix this by removing the existing image completely and forcing the download, but in some of our installations a 60MB download is problematic.

My questions are:

  1. Help me understand the process that leads to this mismatch so we can perhaps prevent it from happening (especially in the field, where connections are weak).
  2. Why does the resin-supervisor systemd unit think that the supervisor version should be 9.14.0 when the update script recognizes that it should be 10.8.0? How can I remedy this mismatch short of the brute-force method of deleting the existing image completely and forcing a redownload, which presumably triggers something that produces a consistent view of the version?


Hi there – thanks for getting in touch with us. I’ll need to do some digging to understand the sequence of events you’re describing, and will get back to you on this.

As for upgrading the supervisor, we are in the process of making this available from the dashboard in the same way you can currently upgrade the host OS. In the meantime, though, you can accomplish the same thing by using the developer console in your browser. Open a tab in your browser to the balena dashboard, ensure you’re logged in, open up the console and run:

await window.sdk.models.device.setSupervisorRelease("[uuid of device]", "v[supervisor version]");

For example: to set device 380df5d944932ab47f5bcfdc34a85e1c to supervisor version 10.8.0, I would run:

await window.sdk.models.device.setSupervisorRelease("380df5d944932ab47f5bcfdc34a85e1c", "v10.8.0");

Some things to note:

  • The supervisor version must start with v – that is, v10.8.0, not 10.8.0.
  • The supervisor image will, if needed, be downloaded in its entirety – we plan on adding the capability to download the image delta in order to save bandwidth, but this is not yet in production.
  • Supervisors can be upgraded, but not downgraded.
  • The upgrade should happen quickly (within a few minutes at most) if the device is online; if the device is not online, it should attempt it the next time it can hit our API.
  • This method is documented in our SDK documentation.

I’ll get back to you on your other questions, and in the meantime please let us know if this works for you.

All the best,


Hi Hugh,

Thanks for the quick response. I suspect that the line of code that you provided does the same thing as:

  curl -s "${API_ENDPOINT}/v2/device($DEVICEID)?apikey=$APIKEY" -X PATCH -H 'Content-Type: application/json;charset=UTF-8' --data-binary "{\"supervisor_release\": \"$SUPERVISOR_ID\"}"

in the script provided in the forum post.

I believe this sets the remote supervisor version, or am I misunderstanding? My issue on this device is the errors that I have posted, that is, the remote version is set and understood, the image has already been downloaded, but the local system seems to still reference the old supervisor.

Hi there – thanks for the response. I need to give a bit of background first, to answer this and your previous questions.

Our API has the task of setting target device state: the application you should be running, the versions of the containers you should have, the version of the supervisor you should be running, and so on. The task of the device’s supervisor is to get its target state from the API, carry it out as best it can, and then report its current state. The current state may not always match the target state, but the idea is that the supervisor will do its best to match it. In our API, the fields for the supervisor version are:

  • target state: should_be_managed_by__supervisor_release
  • current state: supervisor_version

Note that a release is a device-version tuple – thus, release 1234 might be Pi4-10.8.0, while release 5678 might be Pi3-10.8.0. See here for details.

In the case of the curl you’ve posted, this is setting the current state on the API – it does not change the target state. You would need to patch should_be_managed_by__supervisor_release in order to change the target state. This is the advantage of the SDK method, by the way – it accepts a version for the supervisor, and figures out the appropriate release for you. However, if you pick the right release for your device-version tuple, a PATCH should have the same effect.
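For example, adapting the /v2 curl above into a target-state PATCH might look roughly like this. This is a sketch only: the /v6 endpoint shape and the numeric release id are assumptions you would need to verify against the current API docs before use.

```shell
# Sketch: PATCH the *target* supervisor release (not the current state).
# SUPERVISOR_RELEASE_ID is a placeholder; it must be the numeric id of the
# supervisor release matching your device-version tuple.
API_ENDPOINT="https://api.balena-cloud.com"
APIKEY="<api key>"
UUID="<device uuid>"
SUPERVISOR_RELEASE_ID=1234   # placeholder id

PAYLOAD="{\"should_be_managed_by__supervisor_release\": ${SUPERVISOR_RELEASE_ID}}"
echo "$PAYLOAD"

# Endpoint shape assumed, not confirmed; verify against the API docs, then:
# curl -s "${API_ENDPOINT}/v6/device(uuid='${UUID}')?apikey=${APIKEY}" \
#   -X PATCH -H 'Content-Type: application/json' \
#   --data-binary "${PAYLOAD}"
```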

As for how your devices came to be in this state: In the case where you’ve installed a new version of the OS on a separate SD card, while preserving the UUID, the API notes that the device has reported the newer supervisor version, but does not adjust its target state. When you run update-resin-supervisor on the device, this does not update the API’s target state for the device. Thus, while the upgrade succeeds, the change may be reverted when the supervisor gets its target state next. One complication is that supervisors cannot be downgraded; I believe that if the target version is lower than the current version, it should be left alone (though still triggering the warning you saw in your device checks).

To resolve this, I would definitely recommend setting the target state for the supervisor using the SDK method, or by PATCHing should_be_managed_by__supervisor_release.

In the case you describe where this is triggered by moving a device to another application: I don’t know a reason why this would happen without an OS upgrade (as in the previous case), or some other change in device configuration. If you can provide us with a UUID for the device and enable support, we can take a look to see what might be going on.

I’ll be taking a look at the forum post you mention, as it’s possible we should note that it is out of date.

I hope this helps. Please let us know if you have any further questions.

All the best,

Thanks for the really detailed explanation. Super helpful, and good for posterity also. And since no good deed goes unpunished :wink:, that triggers another couple of questions:

  1. With the application build, the dashboard shows both the current release and the target. For supervisor version it only shows a single number. Is that the target or the current? I would assume current, but on my system it was already showing 10.8.0, which was definitely not current.
  2. You indicate that in the curl command the options are patching supervisor_version or should_be_managed_by__supervisor_release. I’m setting supervisor_release, which I don’t see in the API doc. Perhaps I’m using an older API?
  3. Per your explanation, I’m setting the current state when I use that API. (Note that the dashboard was already showing 10.8.0.) If so, with the target set to 10.8.0 and the current set to 10.8.0, why is the system flailing and trying to load 9.14.0?

I ran the command you indicated with my UUID in the browser window; no change in behavior. The same exact errors as posted above show up in journalctl -f

Hello there, we are investigating this issue further. Can you send us your device’s UUID and provide us with support access?

Hi again,

We have received your device UUID, and can upgrade that supervisor as a one-time measure in order to address that for you.

Please re-enable support access whenever convenient and I’ll take care of that for you.

Hi @xginn8 ,

As indicated at the top of the thread, I know I can simply remove the new supervisor image and run the update supervisor script and that will fix this instance (and since then, I have done precisely that, so I’ll save the one time intervention measure for another time :wink: ).

But we have quite a few balena devices in the field and it’s not uncommon to find one in some pathological state. I posted this in an effort to further understand how the system works, how it got into this state, and hopefully, so your team can address it at the cause and continue improving the robustness of the system.

Hopefully it’s been addressed in newer OS releases and is now moot.

Hi there, and apologies for the misunderstanding! I originally thought the method you used (overriding the current state) would not be sufficient to persist the change long-term, which is why I offered to upgrade your device directly.

I’m the developer responsible for redesigning this update process and have been making changes to improve this that I’m hoping will make it to production soon. As such, let’s take a step back here, as I think we’re getting our wires crossed:

  1. Moving a device from one application to another as an isolated action will not cause a supervisor mismatch. If there is a host OS update involved as part of that move, that could affect the states. If there is a reproduction for a simple move, I’d love to know what that is as it’s obviously not something that should be happening!
  2. Unfortunately at the moment, UUID reuse will always result in a funky UX for supervisor updates. I think this should be an easy fix though and I’ve started to look into how to improve that for you in the near term.

I can answer some of your subsequent questions as well:

  1. The version of the supervisor you see in your dashboard is the current state, i.e. what the supervisor is reporting as its version. When we release the improved UI for these updates, that will be much clearer!
  2. The /v2 endpoint of the API is quite old, and I would recommend not using it moving forward. More than anything, it becomes harder to interact with, given that our docs are kept in line with the latest version (at the time of writing, that’s /v6).

To give a bit more color to your original question of “why is this mismatch happening at all?”: when the UUID is reused, there is a mismatch between the state on the device and the target state in the cloud. In the case where the device already has the cloud’s target supervisor, the on-device update mechanism short-circuits before updating its target state. That stale state is then referenced by the next script that starts the supervisor, which is why you see something that looks like a split brain on the device. Fortunately, I believe this will be fixed as part of allowing device UUID reuse.
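To make the split brain concrete: the tag the systemd unit starts is the one recorded locally on the device, regardless of the cloud target. A minimal sketch of reading that locally recorded tag, assuming the supervisor.conf file layout (SUPERVISOR_TAG=vX.Y.Z) found on older balenaOS releases; the path and variable name may differ on your OS version:

```shell
# Sketch: extract the locally recorded supervisor tag from a
# supervisor.conf-style file. Assumption: the file contains a line of the
# form SUPERVISOR_TAG=vX.Y.Z (as on older balenaOS releases).
read_supervisor_tag() {
    # $1: path to the conf file; prints the tag, e.g. "v9.14.0"
    sed -n 's/^SUPERVISOR_TAG=//p' "$1"
}

# On a device you would run something like:
#   read_supervisor_tag /etc/resin-supervisor/supervisor.conf
# and compare the result against the tag the cloud reports as the target.
```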

If you have any more questions or comments, I am all ears! As I mentioned I am actively working to improve this UX, so any feedback you do have is well-received. We will of course keep you updated as we make progress!