We’ve been noticing that a lot of our builds have been getting unexpectedly cancelled recently, usually towards the end of the build:
[Success] Successfully generated image deltas
[Info] Uploading images
[Error] Build has been cancelled while running. Aborting...
[Success] Successfully uploaded images
[Error] This build has been cancelled
[Info] Built on arm03
[Error] Not deploying release.
I believe most of these failures occurred on our Jenkins CI machine, but it has also happened to us when building manually. As far as we can tell the Jenkins job is running normally when the build suddenly says it has been cancelled - Jenkins didn’t send SIGTERM or anything. Certainly when this happened with manual builds we didn’t kill the build.
And of course it just happened again, this time while the build seems to be hung/suspended on the remote build server:
[=================================================> ] 97%
[=================================================> ] 98%
[Info] Still Working...
15:39:12  [Info] Still Working...
15:40:02  [Info] Still Working...
15:40:52  [Info] Still Working...
15:41:42  [Info] Still Working...
15:42:32  [Info] Still Working...
15:43:22  [Info] Still Working...
15:44:12  [Info] Still Working...
15:45:02  [Info] Still Working...
15:45:52  [Info] Still Working...
15:46:42  [Info] Still Working...
15:47:32  [Info] Still Working...
15:48:14  [Error] Build has been cancelled while running. Aborting...
15:49:03  [Info] Still Working...
15:49:53  [Info] Still Working...
15:50:43  [Info] Still Working...
15:51:33  [Info] Still Working...
This is for release d63861c. According to the dashboard, the build is still actively running… No clue why it thinks it was cancelled, or why it’s been stuck at 98% for 12 minutes.
Possibly related: I ran the build again and this time it succeeded instead of getting cancelled, but after finishing, the CLI is still stuck printing Still Working..., even though the dashboard says the release (d1b0cca) is finished. Not sure why the dashboard says the build is done while the CLI thinks it’s still ongoing.
For the record, the dashboard says the build finished at 16:09.
Update: I cancelled the stuck run above after about half an hour and restarted it. The new run failed yet again with “cancelled” (release 52b54a7).
[==============================================> ] 92%
[Error] Build has been cancelled while running. Aborting...
16:45:26  [==============================================> ] 92%
...
[==================================================>] 100%
[Success] Successfully uploaded images
16:48:25  [Error] This build has been cancelled
16:48:25  [Info] Built on arm03
16:48:25  [Error] Not deploying release.
I ran it yet another time and it finally succeeded (release 9a08dd9).
Any help would be much appreciated here. Our CI builds are failing or hanging more often than they are passing.
Note: The CI machine is currently running CLI version 12.14.18. If you think anything related to this has been resolved upstream and updating would help, we can definitely try that.
Sure, I went back a few weeks in our release list and these are all the builds that are listed as cancelled or failed (all times listed in EST since that’s my current setting). You can see that it was happening a ton yesterday, but also it’s been happening for a while.
Not sure if you can view our releases, but I included the hashes. I figured that would probably be more helpful since it has the build log, docker compose, and other relevant stuff.
12/7 17:02 (52b54a7)
12/7 16:48 (ddd4a95)
12/7 16:43 (251a313)
12/7 15:54 (9a394cc)
12/7 11:29 (9c9f90c)
This one failed instead of being cancelled, with the following error:
[Error] An error occured: (HTTP code 404) no such image - no such image: dfb713e8eb38: No such image: dfb713e8eb38:latest
It was right after 6b1b2f1, which was cancelled, so I’m not sure if that’s the reason it couldn’t find the image or not.
12/7 11:27 (6b1b2f1)
12/7 10:49 (377183b)
12/4 12:39 (1dd1702)
12/4 12:39 (33c93ca)
12/1 19:58 (fd86bbd)
12/1 17:50 (9b88f62)
12/1 18:47 (355e6e4)
12/1 16:09 (53fb6cc)
12/1 14:48 (4c1f82c)
11/29 22:57 (fde1a37)
11/24 19:02 (033d6c2)
11/24 18:58 (257dcf5)
11/24 16:37 (ed20734)
11/17 16:46 (726165f)
I didn’t check the log for every single one of these, but most of them got all the way to generating deltas so it’s not like the builds actually failed, and I know they weren’t actually cancelled, at least not intentionally.
Quite some time, since the QEMU emulation is pretty slow. I’m not really sure, though; I haven’t run a local build in months because of that.
The biggest slow point used to be compiling (pip installing) one of our Python dependencies (pynacl), but that is now pre-built in our base Docker images, so I imagine it would be much faster now. Last time I ran one I think it took maybe 30 minutes to an hour, but that’s really just a guess.
Does that affect anything here though? This should all be running on the native ARM machines, no?
Right, yes. The ARM machines are shared with other balenaCloud customers, so if multiple builds are running at the same time, builder response times can degrade quite substantially, to the point where requests to fetch images over the network time out. We have seen this with two parallel heavy builds running, for example ones compiling OpenCV or TensorFlow from source.
I don’t really have a solution for this at present, other than to say that we are working on next-generation build infrastructure, which will address these concerns.
In the meantime, the only thing I can recommend is to wrap the build command in a retry function with exponential backoff, which retries up to X times in your CI pipeline(s).
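For example, something along these lines as a CI step. This is only a minimal sketch; the app name, attempt count and initial delay are placeholders to adjust for your pipeline:
```bash
#!/usr/bin/env bash
# Retry "balena push" with exponential backoff.
# MAX_ATTEMPTS, DELAY and the app name ("myApp") are placeholders.
set -u

MAX_ATTEMPTS=5
DELAY=60  # seconds before the first retry; doubles after each failure

for attempt in $(seq 1 "$MAX_ATTEMPTS"); do
  if balena push myApp; then
    echo "Build succeeded on attempt ${attempt}"
    exit 0
  fi
  echo "Build attempt ${attempt} failed; retrying in ${DELAY}s..."
  sleep "$DELAY"
  DELAY=$((DELAY * 2))
done

echo "Build still failing after ${MAX_ATTEMPTS} attempts" >&2
exit 1
```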
Well, that might be fair for the one build that failed to fetch an image, since that doesn’t seem to happen often, but to be clear, retrying is a stopgap for now, not a great long-term solution.
What about all the other builds that are mysteriously claiming they were cancelled though? That definitely should not be happening, and it’s happening frequently. It’s also happening both on our CI machine and when we build manually (remote build, but started on a developer machine instead of CI).
Because of this outstanding issue, each build typically takes 10+ minutes even for tiny changes. Looping through multiple retries just to test a single change wastes a ton of time, especially when you’re trying to iterate quickly to debug and track down an issue.
TLDR for the other outstanding issue: the build system automatically creates two sets of deltas for every build:
Delta since the most recent build
Delta against the build that’s currently being used by the most online devices
For larger images, these can often take a ton of time. In our case, it adds about 7-10 minutes to nearly every build, even if all we changed was a single text file.
#1 is sometimes nice, but it’s useless if multiple developers are working on different things at the same time and pushing to different devices. It’s also not really necessary, since the delta will get generated on demand the first time the release is pushed to a device.
#2 is generally a big waste of time for people doing development builds. I would expect that most of the time the “most popular” release is the production release pushed to users. Development builds will never get pushed to users, so there’s no need for a delta to be created automatically. Again, if one is ever needed, the system will generate it on demand.
Are you familiar with the local mode flow we’ve added to balenaOS to help speed up development? Is there any reason this can’t work for your use case(s)?
Yes, my understanding was that local mode has a few limitations that do not work for our use case:
Devices must be available on your local network
Our devices are often deployed in moving vehicles, and the developer testing against it may not be in the vehicle
We often test a single build on multiple devices, so developers would have to build them all separately, and we wouldn’t be able to simply pin another device to an existing dev build
Environment variables from the cloud do not apply
We use the environment variables for a number of things, including turning on/off some development settings on the fly. Having to manually set them in the docker compose file to be in sync with the settings in the cloud every time we do a quick development test isn’t really feasible and is error prone
To deploy on multiple devices, the developer would then have to hand-edit the compose file for each device and make sure all the settings are right, which takes time and is easy to get wrong, and then run a separate build for each of those devices
Our preference would definitely be to use the Balena cloud build servers for development builds.
Hi Adam, I am catching up on this thread, and you are correct, Local Mode only applies to a device on a local network, so I definitely understand if that does not fit your use case. Your other bullet points are also valid, ENV variables do not apply, and there are even a few more caveats listed here: https://www.balena.io/docs/learn/develop/local-mode/#local-mode-caveats
As Anton mentioned, the long term solution is currently being designed and built, but in the meantime, he pinged our Build maintainers to take a look. So, hopefully we will have some information to pass along to you soon. Thanks!
Ok, thanks David. Our main concern at the moment is the large number of “cancellations” we’re seeing. It seems to have gotten worse recently to the point that it’s very hard to get a CI build completed (as you can see above).
If there’s any chance of addressing the other thread re long build times as well, that would also be very helpful. Even simply an option to disable automatic delta generation altogether might be a good enough band-aid for now. We can continue that discussion in the other thread though.
We have noticed that the “build cancelled” seems to happen frequently when two developers are building at the same time (with different accounts and access keys, but for the same application). I don’t think that is always the case - when I got all of the CI build failures on 12/7, that was the only machine building at the time - but it seems like we can replicate it pretty reliably if two people build at the same time.
I’m hoping that might provide a hint. Other than the cancellation error, there are no other indications from the build that anything is wrong. The following is the output from the latest CI build from the start of the balena push to when it said it was cancelled. One of our developers happened to be running a build from his own machine at the same time.
Worth noting that it doesn’t actually abort when it says it is going to. This build continued after that error message for another few minutes, including uploading images even though they will never be used.
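To spell out the reproduction: it really is nothing more than two pushes to the same app running at the same time, roughly like this (the app name is a placeholder, and in practice the second push is started from another developer’s machine with their own credentials rather than from the same shell):
```bash
# Two cloud builds for the same app started at roughly the same time.
# "ourApp" is a placeholder; in reality the second push comes from another
# developer's machine, logged in with their own account/API key.
balena push ourApp > build-a.log 2>&1 &
balena push ourApp > build-b.log 2>&1 &
wait
# One of the two logs usually ends with:
#   [Error] Build has been cancelled while running. Aborting...
```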
Thanks Adam, your last message could help pinpoint it a bit more. We do have concurrent-build handling logic in the code, so we’ll take a look and see if we can first replicate this issue internally and hopefully produce a fix. We’ll keep you posted.
Hey Adam, after speaking to the back-end engineers, it would seem that our automated builders only expect to be building one instance of an application at a time. So if multiple developers are pushing to the same app, we would see this behaviour where one or more of the builds get cancelled. That wouldn’t fully explain the cancelled builds when only one CI job/developer was pushing at a time, though.
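Until that changes, a possible stopgap, assuming all of your CI pushes originate from a single Linux host, would be to serialize them behind a lock, for example with flock (the lock file path and app name below are placeholders):
```bash
# Serialize cloud pushes started from this host, so only one build per app
# runs at a time. Lock file path and app name are placeholders.
flock /tmp/balena-push-ourApp.lock balena push ourApp
```
That only covers pushes started from that one machine, though, so it wouldn’t help when a developer pushes from their own laptop at the same time as CI.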
Have you looked into local development builds for devs, and using CI only to push official releases through the balena builders?
By having each developer use a local development device to run builds, it could potentially speed up your development workflow by not relying on cloud services and image uploads.
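For reference, the local mode flow is roughly the following; the device IP is a placeholder, and the device needs to be running a development variant of balenaOS with local mode enabled:
```bash
# Local mode flow: the device must run a development balenaOS image with
# local mode enabled, and be reachable on the local network.
sudo balena scan              # discover balenaOS devices nearby
balena push 192.168.1.42      # build the containers on, and deploy to, that device
```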
That is pretty unexpected and a little disappointing to be honest. As I described a few comments earlier in response to @ab77’s question (Release builds getting "cancelled" unexpectedly.), local builds do not work for us for a few different reasons.
We have multiple developers working on independent things in different timezones, plus automated CI. Trying to coordinate builds among even just two developers has proven to be tricky, especially since the builds currently take much longer than expected due to unnecessary delta generation.
I’m not entirely shocked that a single user can’t start multiple builds in parallel, though it would be nice, particularly when using different auth tokens from different machines. For example, if I’m trying to resolve issues for different customers, I would like to be able to kick off two builds right away if needed. I am, however, quite surprised that builds aren’t at least distinguished by user. Isn’t that the point of having multiple developer accounts? Really though, since each release is assigned a unique hash, it seems like they should be treated independently and should be able to build in parallel.
If nothing else, having the build suddenly claim to be “cancelled” with no error explaining why is definitely not good…
Thanks for the update, Adam. I had missed the prior discussion regarding local mode limitations but I’m caught up now!
The current limitation on builds is carried over from before we supported pinning releases. At that time there would have been no use in allowing concurrent builds, since the app would always track the latest release and the running build would be cancelled and never applied. We will discuss it internally and investigate alternative approaches.