Release builds getting "cancelled" unexpectedly.

Hey folks,

We’ve been noticing that a lot of our builds have been getting unexpectedly cancelled recently, usually towards the end of the build:

[Success]      Successfully generated image deltas
[Info]         Uploading images
[Error]        Build has been cancelled while running. Aborting...
[Success]      Successfully uploaded images
[Error]        This build has been cancelled
[Info]         Built on arm03
[Error]        Not deploying release.

I believe most of these failures occurred on our Jenkins CI machine, but it has also happened to us when building manually. As far as we can tell the Jenkins job is running normally when the build suddenly says it has been cancelled - Jenkins didn’t send SIGTERM or anything. Certainly when this happened with manual builds we didn’t kill the build.

Any idea why this would be happening?

Thanks,

Adam

And of course it just happened again, this time while the build seems to be hung/suspended on the remote build server:

[=================================================> ] 97%
[=================================================> ] 98%
         [Info]         Still Working...
15:39:12 [Info]         Still Working...
15:40:02 [Info]         Still Working...
15:40:52 [Info]         Still Working...
15:41:42 [Info]         Still Working...
15:42:32 [Info]         Still Working...
15:43:22 [Info]         Still Working...
15:44:12 [Info]         Still Working...
15:45:02 [Info]         Still Working...
15:45:52 [Info]         Still Working...
15:46:42 [Info]         Still Working...
15:47:32 [Info]         Still Working...
15:48:14 [Error]        Build has been cancelled while running. Aborting...
15:49:03 [Info]         Still Working...
15:49:53 [Info]         Still Working...
15:50:43 [Info]         Still Working...
15:51:33 [Info]         Still Working...

This is for release d63861c. According to the dashboard, the build is still actively running… No clue why it thinks it was cancelled, or why it’s been stuck at 98% for 12 minutes.

Possibly related, I ran the build again and this time it succeeded instead of getting cancelled, but after finishing it is still stuck saying Still working..., even though the dashboard says the release (d1b0cca) is finished. Not sure why the dashboard says the build is done but the CLI thinks it’s still ongoing.

For the record, the dashboard says the build finished at 16:09.

         [Info]         ├─────────────┼────────────┼────────────┼──────────────────────┤
16:09:41 [Info]         │ eos         │ 22.22 MB   │ 1.08 MB    │ 47 seconds           │
16:09:41 [Info]         └─────────────┴────────────┴────────────┴──────────────────────┘
16:09:41 [Info]         Build finished in 9 minutes, 31 seconds
16:09:41                (balena CLI ASCII-art success banner)
16:10:31 [Info]         Still Working...
16:11:21 [Info]         Still Working...
16:12:11 [Info]         Still Working...
...
16:24:41 [Info]         Still Working...
...
16:38:01 [Info]         Still Working...

Update: I cancelled the above stuck run after ~half an hour and restarted. The restarted run failed yet again with “cancelled” (release 52b54a7).

[==============================================>    ] 92%
         [Error]        Build has been cancelled while running. Aborting...
16:45:26 [==============================================>    ] 92%
...
[==================================================>] 100%
         [Success]      Successfully uploaded images
16:48:25 [Error]        This build has been cancelled
16:48:25 [Info]         Built on arm03
16:48:25 [Error]        Not deploying release.

I ran it yet another time and it finally succeeded (release 9a08dd9).

Any help would be much appreciated here. Our CI builds are failing or hanging more often than they are passing.

Note: The CI machine is currently running CLI version 12.14.18. If you think anything has been resolved upstream related to this and updating would help we can definitely try that.
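In case it's useful, here's a sketch of a small guard we could add to the CI pipeline to flag an outdated CLI before building. `version_lt` and `MIN_CLI_VERSION` are our own placeholder names (not part of the balena CLI), and the minimum version is just an example value:

```shell
#!/bin/sh
# Hypothetical CI guard: warn when the installed balena CLI predates a
# known-good version. version_lt and MIN_CLI_VERSION are our own names,
# not part of the balena CLI itself.
MIN_CLI_VERSION="12.14.18"   # placeholder; whatever release has the fix

version_lt() {
  # Exit 0 iff $1 sorts strictly before $2 in version order.
  # (sort -V is a GNU/BSD extension, present on typical CI images.)
  [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

installed="$(balena version 2>/dev/null || echo 0.0.0)"
if version_lt "$installed" "$MIN_CLI_VERSION"; then
  echo "balena CLI $installed < $MIN_CLI_VERSION; consider 'npm install -g balena-cli'" >&2
fi
```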

Hey, thanks for sharing this.

Could you let us know the times that you saw these issues, and also is it possible to see your Dockerfile?

Sure, I went back a few weeks in our release list and these are all the builds that are listed as cancelled or failed (all times listed in EST since that’s my current setting). You can see that it was happening a ton yesterday, but also it’s been happening for a while.

Not sure if you can view our releases, but I included the hashes. I figured that would probably be more helpful since it has the build log, docker compose, and other relevant stuff.

  • 12/7 17:02 (52b54a7)
  • 12/7 16:48 (ddd4a95)
  • 12/7 16:43 (251a313)
  • 12/7 15:54 (9a394cc)
  • 12/7 11:29 (9c9f90c)
    • This one was failed instead of cancelled with the following:
      [Error]        An error occured: (HTTP code 404) no such image - no such image: dfb713e8eb38: No such image: dfb713e8eb38:latest 
      
      It was right after 6b1b2f1, which was cancelled so not sure if that’s the reason it couldn’t find the image or what
  • 12/7 11:27 (6b1b2f1)
  • 12/7 10:49 (377183b)
  • 12/4 12:39 (1dd1702)
  • 12/4 12:39 (33c93ca)
  • 12/1 19:58 (fd86bbd)
  • 12/1 17:50 (9b88f62)
  • 12/1 18:47 (355e6e4)
  • 12/1 16:09 (53fb6cc)
  • 12/1 14:48 (4c1f82c)
  • 11/29 22:57 (fde1a37)
  • 11/24 19:02 (033d6c2)
  • 11/24 18:58 (257dcf5)
  • 11/24 16:37 (ed20734)
  • 11/17 16:46 (726165f)

I didn’t check the log for every single one of these, but most of them got all the way to generating deltas so it’s not like the builds actually failed, and I know they weren’t actually cancelled, at least not intentionally.

Our docker compose file is currently:

version: '2.1'
networks: {}
volumes:
  resin-data: {}
services:
  nautilus:
    build: ./nautilus
    # Note: Disabling 'always' restart until Nautilus application handles
    # start/stop behavior internally.
    restart: on-failure
    devices:
      - "/dev/ttymxc1:/dev/ttymxc1"
      - "/dev/ttymxc2:/dev/ttymxc2"
    cap_add:
      - SYS_RAWIO
    volumes:
      - 'resin-data:/data'
    ports:
      - "5500:5500" # Vision ZMQ
      - "30200:30200" # NMEA-0183
      - "30201:30201" # FusionEngine Client
  nemo:
    build: ./nemo
    privileged: true
    restart: always
    depends_on:
      - "nautilus"
    volumes:
      - 'resin-data:/data'
    ports:
      - "8888:8888" # Nemo REST Interface
    labels:
      io.balena.features.supervisor-api: '1'
      io.balena.features.dbus: '1'
      io.balena.features.procfs: '1'
    environment:
      - DBUS_SYSTEM_BUS_ADDRESS=unix:path=/host/run/dbus/system_bus_socket
  ui-bridge:
    build: ./nautilus
    restart: always
    ports:
      - "3000:3000" # UI Websocket
    command: ["/home/point_one/bootstrap.sh", "--force-start",
              "./bin/ui_websocket_bridge",
              "--ui-multicast-port=30108",
              "--ws-port=3000"]
  ntrip-proxy:
    build: ./nautilus
    restart: always
    ports:
      - "2101:2101" # Proxied NTRIP Output
    command: ["/home/point_one/bootstrap.sh", "--force-start",
              "./bin/ntrip_proxy",
              "--polaris-api-key=${P1_POLARIS_API_KEY}",
              "--polaris-unique-id=${P1_DEVICE_ID:-${BALENA_DEVICE_NAME_AT_INIT}}-ntrip-proxy",
              "0.0.0.0", "2101", "point_one/applications/ntrip_proxy"]
  eos:
    build: ./eos
    restart: always
    volumes:
      - 'resin-data:/data'
    ports:
      - "80:80" # EOS UI

Adam, if you run the build locally with the CLI balena build ... flow, how long does it usually take?

Quite some time since the QEMU emulation is pretty slow. Not really sure - I haven’t run a local build in months because of that.

The biggest slow point used to be compiling (pip installing) one of our Python dependencies (pynacl), but that is now pre-built in our base Docker images, so I imagine it would be much faster now. The last time I ran one I think it took maybe 30 minutes to an hour, but that's just a guess really.

Does that affect anything here though? This should all be running on the native ARM machines, no?

Right, yes. The ARM machines are shared with other customers of balenaCloud, so if multiple builds are running at the same time, the builder response times can degrade quite substantially, to the point where requests to fetch images over the network will time out. We have seen this when two parallel heavy builds are running, for example compiling OpenCV or TensorFlow from source.

I don’t really have a solution for this at present, other than to say that we are working on next-generation build infrastructure, which will address these concerns.

In the meantime, the only thing I can recommend is to wrap the build command in a retry function with exponential backoff, which will retry X times in your CI pipeline(s).
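For example, a minimal POSIX-sh sketch of such a wrapper (`with_backoff` is just an illustrative name, not part of any balena tooling; substitute your real build command, e.g. `balena push <app>`):

```shell
#!/bin/sh
# Sketch of a retry wrapper with exponential backoff for a CI build step.
# "with_backoff" is an illustrative name; the wrapped command is whatever
# the pipeline actually runs.
with_backoff() {
  max_attempts=$1; shift
  delay=5    # initial backoff in seconds; doubles on every failure
  attempt=1
  while ! "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "Giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "Attempt $attempt failed; retrying in ${delay}s..." >&2
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}

# Usage in a pipeline (hypothetical app name):
# with_backoff 5 balena push Atlas
```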

Well, retries might be fair for the one build that failed to fetch an image, since that doesn’t seem to happen often. To be clear though, that’s a workaround for now, not a great long-term solution.

What about all the other builds that are mysteriously claiming they were cancelled though? That definitely should not be happening, and it’s happening frequently. It’s also happening both on our CI machine and when we build manually (remote build, but started on a developer machine instead of CI).

Because of this outstanding issue, each build typically takes 10+ minutes even for tiny changes, so looping multiple times just to test a single change wastes a ton of time, especially when you’re trying to iterate fast to debug and find an issue.

TLDR for the other outstanding issue: the build system automatically creates two sets of deltas for every build:

  1. Delta since the most recent build
  2. Delta against the build that’s currently being used by the most online devices

For larger images, these can often take a ton of time. In our case, it adds about 7-10 minutes to nearly every build, even if all we changed was a single text file.

#1 is sometimes nice but is useless if multiple developers are working on different things at the same time and pushing to different devices. It’s also not really necessary since the delta will get generated on demand when you push it to a device for the first time.

#2 is generally a big waste of time for people doing development builds. I would expect that most of the time the “most popular” release is the production release pushed to users. Development builds will never get pushed to users, so there’s no need for a delta to be created automatically. Again, if one is ever needed, the system will generate it on demand.

Are you familiar with the local mode flow we’ve added to balenaOS to help speed up development? Is there any reason this can’t work for your use case(s)?

Yes, my understanding was that local mode has a few limitations that do not work for our use case:

  • Devices must be available on your local network
    • Our devices are often deployed in moving vehicles, and the developer testing against it may not be in the vehicle
    • We often test a single build on multiple devices, so developers would have to build them all separately, and we wouldn’t be able to simply pin another device to an existing dev build
  • Environment variables from the cloud do not apply
    • We use the environment variables for a number of things, including turning on/off some development settings on the fly. Having to manually set them in the docker compose file to be in sync with the settings in the cloud every time we do a quick development test isn’t really feasible and is error prone
    • To deploy on multiple devices the developer would then have to set the compose file for each device by hand and make sure they got all the settings right, which takes time and is easy to get wrong, and then run separate builds for each of those devices

Our preference would definitely be to use the Balena cloud build servers for development builds.

Hi Adam, I am catching up on this thread, and you are correct, Local Mode only applies to a device on a local network, so I definitely understand if that does not fit your use case. Your other bullet points are also valid, ENV variables do not apply, and there are even a few more caveats listed here: https://www.balena.io/docs/learn/develop/local-mode/#local-mode-caveats

As Anton mentioned, the long term solution is currently being designed and built, but in the meantime, he pinged our Build maintainers to take a look. So, hopefully we will have some information to pass along to you soon. Thanks!

Ok, thanks David. Our main concern at the moment is the large number of “cancellations” we’re seeing. It seems to have gotten worse recently to the point that it’s very hard to get a CI build completed (as you can see above).

If there’s any chance of addressing the other thread re long build times as well, that would also be very helpful. Even simply an option to disable automatic delta generation altogether might be a good enough band-aid for now. We can continue that discussion in the other thread though.

We have noticed that the “build cancelled” seems to happen frequently when two developers are building at the same time (with different accounts and access keys, but for the same application). I don’t think that is always the case - when I got all of the CI build failures on 12/7, that was the only machine building at the time - but it seems like we can replicate it pretty reliably if two people build at the same time.
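To make the repro concrete, it’s essentially just two pushes racing each other. Hypothetically, something like this (`run_pair` is a made-up helper, and `Atlas` is our app name; in practice the two pushes come from different machines and accounts):

```shell
#!/bin/sh
# Hypothetical repro harness: run two commands concurrently and report
# both exit statuses. In practice the two commands would be two
# "balena push Atlas" invocations from different machines/accounts.
run_pair() {
  sh -c "$1" & pid1=$!
  sh -c "$2" & pid2=$!
  wait "$pid1"; status1=$?
  wait "$pid2"; status2=$?
  echo "first exited $status1, second exited $status2"
  [ "$status1" -eq 0 ] && [ "$status2" -eq 0 ]
}

# run_pair "balena push Atlas" "balena push Atlas"
```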

I’m hoping that might provide a hint. Other than the cancellation error, there are no other indications from the build that anything is wrong. The following is the output from the latest CI build from the start of the balena push to when it said it was cancelled. One of our developers happened to be running a build from his own machine at the same time.

15:55:07 Performing remote build for Atlas application...
15:55:27 [Info]         Starting build for Atlas, user g_aaron_nathan
15:55:27 [Info]         Dashboard link: https://dashboard.balena-cloud.com/apps/1623312/devices
15:55:49 [Info]         Building on arm01
15:55:49 [Info]         Pulling previous images for caching purposes...
15:55:50 [Success]      Successfully pulled cache images
15:56:40 [Info]         Still Working...
15:57:30 [Info]         Still Working...
15:58:11 [nemo]         Step 1/10 : FROM pointonenav/atlas-nemo-base:v1.2.1
15:58:11 [ntrip-proxy]  Step 1/10 : FROM pointonenav/atlas-nautilus-base:v1.0.0
15:58:11 [ui-bridge]    Step 1/10 : FROM pointonenav/atlas-nautilus-base:v1.0.0
15:58:11 [nautilus]     Step 1/10 : FROM pointonenav/atlas-nautilus-base:v1.0.0
15:58:12 [nemo]          ---> a4d01cb00907
15:58:12 [nemo]         Step 2/10 : WORKDIR /home/point_one
15:58:12 [ntrip-proxy]   ---> a3144caf25aa
15:58:12 [ntrip-proxy]  Step 2/10 : RUN mkdir -p /home/point_one     && mkdir -p /home/point_one/nautilus     && ln -sf /data/logs /home/point_one/logs     && ln -sf /data/logs /logs     && ln -sf /data/nautilus/config /home/point_one/config     && ln -sf /data/nautilus/config/config.glog /home/point_one/nautilus/config.glog
15:58:12 [ui-bridge]     ---> a3144caf25aa
15:58:12 [ui-bridge]    Step 2/10 : RUN mkdir -p /home/point_one     && mkdir -p /home/point_one/nautilus     && ln -sf /data/logs /home/point_one/logs     && ln -sf /data/logs /logs     && ln -sf /data/nautilus/config /home/point_one/config     && ln -sf /data/nautilus/config/config.glog /home/point_one/nautilus/config.glog
15:58:12 [nautilus]      ---> a3144caf25aa
15:58:12 [nautilus]     Step 2/10 : RUN mkdir -p /home/point_one     && mkdir -p /home/point_one/nautilus     && ln -sf /data/logs /home/point_one/logs     && ln -sf /data/logs /logs     && ln -sf /data/nautilus/config /home/point_one/config     && ln -sf /data/nautilus/config/config.glog /home/point_one/nautilus/config.glog
15:58:13 [eos]          Step 1/5 : FROM nginx:stable-alpine
15:58:13 [nemo]          ---> Running in 888177db1c94
15:58:14 [eos]           ---> dfb713e8eb38
15:58:14 [eos]          Step 2/5 : COPY eos.conf /etc/nginx/conf.d/default.conf
15:58:16 [nautilus]      ---> Running in 0faab13f2797
15:58:16 [ui-bridge]     ---> Running in e862710e5c9c
15:58:16 [ntrip-proxy]   ---> Running in e139c1eebed6
15:58:20 [nemo]         Removing intermediate container 888177db1c94
15:58:20 [nemo]          ---> 6ca38ced82d7
15:58:20 [nemo]         Step 3/10 : COPY nemo_pkg.tar.gz s3_uploader_pkg.tar.gz ./
15:58:20 [eos]           ---> b60363a42a2b
15:58:20 [eos]          Step 3/5 : WORKDIR /usr/share/nginx/html
15:58:21 [eos]           ---> Running in 03322502fa6f
15:58:25 [eos]          Removing intermediate container 03322502fa6f
15:58:25 [eos]           ---> a617c462f954
15:58:25 [eos]          Step 4/5 : COPY eos-atlas.tar.gz .
15:58:32 [nemo]          ---> cc1bc4d122c5
15:58:32 [nemo]         Step 4/10 : RUN tar xzf nemo_pkg.tar.gz     && tar xzf s3_uploader_pkg.tar.gz     && rm -f nemo_pkg.tar.gz s3_uploader_pkg.tar.gz
15:58:32 [nemo]          ---> Running in 8bc688c4886e
15:58:34 [eos]           ---> cc1b61770e4d
15:58:34 [eos]          Step 5/5 : RUN tar xzf eos-atlas.tar.gz --strip-components=1     && rm -f eos-atlas.tar.gz
15:58:35 [eos]           ---> Running in e6911af714b2
15:58:35 [nautilus]     Removing intermediate container 0faab13f2797
15:58:35 [nautilus]      ---> a0fb78bb3ca5
15:58:35 [nautilus]     Step 3/10 : RUN echo "http 80/tcp www" >> /etc/services
15:58:36 [nautilus]      ---> Running in 58e2b714d562
15:58:40 [ntrip-proxy]  Removing intermediate container e139c1eebed6
15:58:40 [ntrip-proxy]   ---> 79c82a3c48b4
15:58:40 [ntrip-proxy]  Step 3/10 : RUN echo "http 80/tcp www" >> /etc/services
15:58:40 [ntrip-proxy]   ---> Running in 952f5971ac89
15:58:41 [ui-bridge]    Removing intermediate container e862710e5c9c
15:58:41 [ui-bridge]     ---> 41503797b1b4
15:58:41 [ui-bridge]    Step 3/10 : RUN echo "http 80/tcp www" >> /etc/services
15:58:41 [ui-bridge]     ---> Running in 8d5cbc18f3ea
15:58:47 [nemo]         Removing intermediate container 8bc688c4886e
15:58:47 [nemo]          ---> 3112a1cb551a
15:58:47 [nemo]         Step 5/10 : RUN pip install --user -r nemo/requirements.txt
15:58:47 [nemo]          ---> Running in 591542618c6d
15:58:50 [eos]          Removing intermediate container e6911af714b2
15:58:50 [eos]           ---> 523f6374ce16
15:58:50 [eos]          Successfully built 523f6374ce16
15:58:50 [Error]        Build has been cancelled while running. Aborting...
...

Worth noting that it doesn’t actually abort when it says it is going to. This build continued after that error message for another few minutes, including uploading images even though they will never be used.
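For anyone else hitting this, a rough sketch of a fail-fast filter we could put in the CI step so the job aborts as soon as the message appears instead of waiting for the build to wind down. The match strings are copied from the logs above; `check_cancelled` is just our own helper name:

```shell
#!/bin/sh
# Sketch: stream build output through a filter that aborts the CI step
# on the first spurious cancellation message. check_cancelled is our
# own helper name, not a balena command.
check_cancelled() {
  # Reads log lines on stdin, echoes them through, exits 1 on a match.
  while IFS= read -r line; do
    printf '%s\n' "$line"
    case "$line" in
      *"Build has been cancelled while running"*|*"This build has been cancelled"*)
        echo "Detected spurious cancellation; failing fast." >&2
        return 1
        ;;
    esac
  done
}

# Usage in the pipeline (hypothetical):
# balena push Atlas | check_cancelled
```

In a real pipeline you'd want `set -o pipefail` (where the shell supports it) so genuine failures of the push itself still propagate through the pipe.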

Thanks Adam, your last message could help pinpoint it a bit more. We do have concurrent-build handling logic in the code, so we’ll take a look and see if we can replicate this issue internally first and hopefully produce a fix. We’ll keep you posted.

Hey Adam, after speaking to the back-end engineers it would seem that our automated builders only expect building one instance of an application at a time. So if multiple developers are pushing to the same app, we would see this behaviour where one or more of the builds are cancelled. Though that wouldn’t fully explain any cancelled builds if there was only one CI/developer pushing at a time.

Have you looked into local development builds for devs, and using CI only to push official releases through the balena builders?

By having each developer use a local development device to run builds, it could potentially speed up your development workflow by not relying on cloud services and image uploads.

Hey Kyle,

That is pretty unexpected and a little disappointing to be honest. As I described a few comments earlier in response to @ab77’s question, local builds do not work for us for a few different reasons.

We have multiple developers working on independent things in different timezones, plus automated CI. Trying to coordinate builds among even just two developers has proven to be tricky, especially since the builds currently take much longer than expected due to unnecessary delta generation.

I’m not entirely shocked that a single user can’t start multiple builds in parallel, though it would be nice, particularly if using different auth tokens, say from different machines. For example, if I’m trying to resolve issues for different customers, I would like to be able to kick off two builds right away if needed. I am however quite surprised that builds aren’t at least distinguished by user. Isn’t that the point of having multiple developer accounts? Really though, since each release is assigned a unique hash, it would seem like they should be treated independently and should be able to build in parallel.

If nothing else, having the build suddenly claim to be “cancelled” with no error explaining why is definitely not good…

Thanks for the update, Adam. I had missed the prior discussion regarding local mode limitations but I’m caught up now!
The current limitation on the builds is carried over from before we supported pinning releases. At that time there would have been no use in allowing concurrent builds, since the app would always track latest and the running build would be cancelled and never applied. We will discuss it internally and investigate alternative approaches.