Python SDK release pin overrides service update lock and does not have an option not to force.

Hey folks,

We’re using update locks to prevent the supervisor from restarting our containers when actively running. Our build script has an option to change the pinning for a single specified device under test via the Python SDK. When we do that, we still want it to abide by the update lock and not restart the services unexpectedly. The Python SDK doesn’t seem to have a force control like the supervisor API does though, and in practice it looks like it always forces.

Is this intended behavior? It seems unexpected to me. For comparison, the dashboard does not do this.

Thanks,

Adam

Hi

If I understand your situation correctly -

  1. You are using application update locking on your devices using the pythonlockfile method mentioned here
  2. You remove this for a particular device using your script when you want to test something.
  3. While you are doing this using the Python SDK, you are using the update function in there. What are you setting the force argument to that function?

Hey Anuj,

No that’s not what we’re doing, sorry if that wasn’t clear:

  1. We are running multiple containers
  2. One runs a C++ application, which creates updates.lock as in that page you linked while it runs and releases it when it is shut down
    • To shut it down when a user commands it, a different container stops it using the supervisor API stop-service command with force: true. The C++ application catches SIGTERM and shuts down cleanly.
  3. Separately, on our developers’ computers, we have a Bash build script that calls balena push through the CLI.
  4. When the build is done, the script captures the release UUID from the console output of balena push.
  5. If the user running the Bash script specifies an optional command-line argument, the script then uses the Python SDK set_to_release() function to pin a single device to that release for testing.
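For readers following along, the lock handling in step 2 can be sketched in Python roughly as follows. This is only an illustration, not the C++ application from this thread: the helper name `update_lock` is hypothetical, and the lock directory is passed in as a parameter because the real location (the balena docs describe a file named updates.lock under a balena-specific directory) should be taken from the update locking documentation rather than from this sketch.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def update_lock(lock_dir):
    """Hold <lock_dir>/updates.lock for the duration of the with-block.

    Hypothetical helper for illustration; lock_dir is an assumption and
    must point at the directory the supervisor actually watches.
    """
    path = os.path.join(lock_dir, "updates.lock")
    # O_EXCL makes creation fail if the lock file already exists.
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    os.close(fd)
    try:
        yield path
    finally:
        # Release the lock on normal exit or on an exception.
        os.remove(path)


if __name__ == "__main__":
    # Demonstrate against a throwaway directory instead of a real device.
    with tempfile.TemporaryDirectory() as demo_dir:
        with update_lock(demo_dir) as lock_path:
            print("locked:", os.path.exists(lock_path))
        print("locked after exit:", os.path.exists(lock_path))
```

Note that the `finally` clause only runs on a normal exit or an exception. A process that is killed by a signal needs its own handler to release the lock, which is why the application described above catches SIGTERM.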

Step 5 is the one that does not work as expected. Neither set_to_release() nor should_be_running__release have a way to specify force: false - they always forcibly override the lock file. We do not want that to happen.

I’m not sure if the dashboard uses the REST API under the hood, or if it is implemented differently, but the dashboard service stop/restart buttons do abide by the lock file. If it does use the REST API, I would imagine the force option is possible but just missing from the documentation. Either way, the Python SDK set_to_release() function definitely does not have that argument.
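For reference, the force control on the supervisor side looks roughly like the sketch below. It only builds the request and does not send it. The endpoint path, the `apikey` query parameter and the `serviceName`/`force` body fields are written from memory of the supervisor v2 API docs and should be treated as assumptions to check against them; `build_stop_service_request` is a hypothetical helper name.

```python
import json
import urllib.request


def build_stop_service_request(supervisor_address, api_key, app_id,
                               service_name, force=False):
    """Build (but do not send) a supervisor stop-service request.

    Assumed endpoint shape: POST /v2/applications/<appId>/stop-service.
    force=True asks the supervisor to ignore the update lock.
    """
    url = "%s/v2/applications/%s/stop-service?apikey=%s" % (
        supervisor_address, app_id, api_key)
    body = json.dumps({"serviceName": service_name, "force": force})
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder values; on a device these would come from the
    # supervisor-provided environment variables.
    req = build_stop_service_request(
        "http://127.0.0.1:48484", "example-key", 1234, "main", force=True)
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

The point of the comparison is that `force` is an explicit, per-request choice here, whereas set_to_release() has no equivalent argument.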

Hi Adam, that definitely sounds like strange behaviour. However, I don’t think set_to_release() is missing a force: true option, since that function simply hits the backend API and changes which target release the device should be pointing at; the device then decides what to do with the new release it downloads. It sounds more like there is some incorrect behaviour in the update locks for multicontainer. Can you tell us what version of balenaOS you are experiencing this on? If we can create a minimal reproduction of the issue, we can get to the bottom of it.

Hey Shaun,

Yeah, we’re running 2.50.1+rev1 - the dart mx8mm release.

Can you test whether the same behaviour happens if you set the pin using the dashboard UI? That would let us rule out an issue in the SDK and would point to an issue on the device side of things.

It does not - if we build a release and then pin a device in the dashboard it obeys the update lock. When we build the release and the build script pins the device via Python SDK it overrides the lock.

Is the build and release script something you could share with us? I have had a look at the Python SDK calls and can’t see why it would do that, so having a way for our team to reproduce this case would help us narrow down the issue. It sounds very strange, as both the UI and the Python SDK should be hitting the exact same API and there aren’t too many options for that API. One other thing to try from your side is to set the device pin directly using the API. Essentially, replace the set_to_release() call with the curl snippet here: https://www.balena.io/docs/reference/api/resources/device/#set-device-to-release and see if that behaves correctly, since that is what the Python SDK should be calling under the hood.

Sure, happy to. I’ve stripped out a bunch of junk for building our software before it gets pushed to your build servers, but the important stuff is pretty simple:

# Build the service images remotely on Balena's build servers.
echo "Performing remote build for ${BALENA_APPLICATION} application..."
stdbuf -o0 balena push ${BALENA_APPLICATION} 2>&1 | tee build.log
RELEASE_COMMIT=$(perl -n -e 's/.*Release: *(?:[[;\\ 0-9Em\x1b]*[m ])?(\w+).*/$1/s && print' build.log)

# Finally, use the Balena Python SDK to tag the release and tell the device
# to update its images (i.e., change its pinned release to the new release).
GIT_EMAIL=$(git config user.email)
GIT_COMMIT=$(git rev-parse HEAD)
VERSION=$(git describe --always --tags --dirty)

cat <<EOF | python3
from balena import Balena

release_commit = '${RELEASE_COMMIT}'

balena = Balena()
balena.auth.login_with_token('${BALENA_AUTH_TOKEN}')

balena.models.tag.release.set(release_commit, 'author', '${GIT_EMAIL}')
balena.models.tag.release.set(release_commit, 'git_commit', '${GIT_COMMIT}')
balena.models.tag.release.set(release_commit, 'version', '${VERSION}')

device_uuid = '${DEVICE_UUID}'
if device_uuid == '':
    device_name = '${DEVICE_NAME}'
    if device_name != '':
        device_uuid = balena.models.device.get_by_name(device_name)[0]['uuid']

if device_uuid != '':
    print('Updating device %s to release %s.' % (device_uuid, release_commit))
    balena.models.device.set_to_release(device_uuid, release_commit)
EOF
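
The Perl one-liner in the script strips terminal colour codes and pulls out the word following `Release:` in the build log. A rough Python equivalent is sketched below for anyone who prefers to keep that step in the same language as the SDK calls. The sample log line in the demo is made up for illustration, and the exact format printed by balena push is an assumption; `extract_release_commit` is a hypothetical helper name.

```python
import re

# Matches ANSI colour/style sequences such as ESC[32m or ESC[1;31m.
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*m")
# Captures the first word after "Release:" once colours are stripped.
RELEASE_LINE = re.compile(r"Release:\s*(\w+)")


def extract_release_commit(log_text):
    """Return the release commit found in log_text, or None if absent."""
    clean = ANSI_ESCAPE.sub("", log_text)
    match = RELEASE_LINE.search(clean)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Hypothetical log line; real CLI output may differ.
    sample = "\x1b[32m[Success]\x1b[39m Release: \x1b[1mabc123def\x1b[22m (id: 42)"
    print(extract_release_commit(sample))
```

Returning None when nothing matches lets the caller skip the tagging and pinning steps instead of passing an empty commit to the SDK.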

Hi Adam, we have forwarded your request to the developer in charge of the Python SDK.

Hi,

I tried to replicate the issue but did not succeed. Here is my test: an application with one service and a lockfile created. While the lockfile existed, the device downloaded the new update but did not install it, as described in our docs. I then tried balena.models.device.set_to_release and nothing changed: the target release was still not installed (I could see on the dashboard that the retry period for lockfile checking was reset).

Hey Adam, we also tried testing the same scenario for an application that has multiple services, but we still were not able to reproduce the bug. Just as a note, we tested this with the latest SDK version available (v10.1.1), so if you have an older version you could try updating to the latest one. One more thing I would try, to completely rule out any issue in the SDK itself, is what Shaun suggested above: in your build script, replace the single SDK call to set_to_release with an HTTP call to the API. This call should hit the device endpoint and PATCH should_be_running__release with the correct release ID (have a look at the Pin device to a specific release section of https://www.balena.io/docs/reference/api/resources/device/#set-device-to-release).
Are you able to get the correct behaviour through the UI consistently? The API endpoint that is hit through the SDK and UI should not really be related to the decision the device makes once it has the release available, so even if the UI and SDK were doing different things when hitting that endpoint (and from what we are able to tell they are not) it should not affect the behaviour of the device with respect to update locks.
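
As a rough sketch of the direct API call suggested above, the snippet below builds the PATCH request without sending it. The API version in the base URL, the use of numeric device and release IDs, and the helper name `build_pin_request` are assumptions for illustration; the curl snippet in the linked docs is the authoritative form.

```python
import json
import urllib.request

# Assumed API version; check the linked docs for the one to use.
API_BASE = "https://api.balena-cloud.com/v6"


def build_pin_request(device_id, release_id, token):
    """Build (but do not send) a request that pins a device to a release.

    device_id and release_id are assumed to be numeric resource IDs,
    not the device UUID or the release commit hash.
    """
    url = "%s/device(%d)" % (API_BASE, device_id)
    body = json.dumps({"should_be_running__release": release_id})
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer %s" % token,
        },
        method="PATCH",
    )


if __name__ == "__main__":
    req = build_pin_request(1234567, 7654321, "example-token")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To actually send it: urllib.request.urlopen(req)
```

The request body only names the target release. Nothing in it expresses force or no-force, which matches the point above that the update lock decision is made on the device.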

Hey guys,

I just tried to replicate this issue explicitly using supervisor v11.4.10 and SDK v9.3.0 and could not. I had the device set to track the application release, started our software, and then pinned it to another release. The supervisor did the right thing and printed a series of Updates are locked, retrying in ~2m... messages. When I stopped our software, the supervisor eventually timed out and updated the device.

One minor question: is there an upper limit for retry time? I noticed it increases with each attempt.

I’m not sure what the conditions were exactly when we did see this. We were working on time-sensitive issues and pushing releases, and we saw what we thought was a restart on a customer device when we pushed a build while it was running. It’s hard to recall what we saw exactly unfortunately.

We can probably close this for now. I appreciate you looking into it.

Cheers,

Adam

Hi Adam, the retry logic is an exponential back-off; you can see the logic here: https://github.com/balena-io/balena-supervisor/blob/master/src/device-state.ts#L707-L710

You will notice that the maximum is maxPollTime. This value is loaded from RESIN_SUPERVISOR_POLL_INTERVAL; you can read more about it here: https://www.balena.io/docs/reference/supervisor/bandwidth-reduction/
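
To illustrate what a capped exponential back-off of this kind looks like, here is a minimal sketch. It is not the supervisor's code: the function name and both constants are placeholders chosen for the example, and the real values live in the supervisor source linked above.

```python
def retry_delay_ms(failed_attempts, backoff_increment_ms=500,
                   max_poll_time_ms=600_000):
    """Delay before the next retry, doubling per failure up to a cap.

    Both defaults are illustrative placeholders, not supervisor values.
    """
    return min((2 ** failed_attempts) * backoff_increment_ms,
               max_poll_time_ms)


if __name__ == "__main__":
    # The delay doubles with each failed attempt until it hits the cap.
    for attempt in range(0, 14, 2):
        print(attempt, retry_delay_ms(attempt))
```

With these placeholder values the delay stops growing once the doubled value exceeds the cap, so raising or lowering the poll interval moves the ceiling on how long a locked device waits between attempts.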

Hey Tom,

Thanks for that - didn’t realize the poll interval also controlled the retry time after it had already downloaded the update. Looks like 10 mins is the minimum (and the default). Good to know.

Cheers,

Adam