Hey all, we use balena to distribute updates to our remote devices. The builds tend to be rather flaky: either some random Go package repository pull hits a network error (killing the deploy), or balena reports a caching error (e.g. "failed to restore cached image from “sha256: …” failed to get layer … : layer does not exist"). Each failure means restarting the build, which takes about 20-30 minutes (build time is a separate issue; local builds take <3 minutes).
Does anyone else have issues with the remote builder’s reliability?
Hey @mpous, our target device is an ARM-based Linux machine (iot-gate-imx8). It is currently connected over local Wi-Fi. Host OS is balenaOS 2.98.33, supervisor version 14.2.16.
All the builds seem to fail in the cloud build stage. We are not currently building locally, as that did not work when we set up our pipelines a year ago. The current cycle looks like this:
Make a code change;
balena push to the fleet;
Wait for either a successful push and build, or a failure in the remote build.
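As a stopgap for the cycle above, a small wrapper can retry the push automatically instead of someone restarting it by hand. This is only a rough sketch, not anything balena provides: `push_with_retry`, the attempt count, and the delays are made-up illustration values, and it assumes the push command exits with a non-zero status when the remote build fails.

```python
import subprocess
import time
from typing import Callable, Sequence


def push_with_retry(
    cmd: Sequence[str],
    attempts: int = 3,
    base_delay: float = 60.0,
    run: Callable = subprocess.run,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run cmd until it exits with status 0; return the attempt that succeeded.

    Hypothetical helper: assumes a failed push/build exits non-zero.
    The wait doubles after every failed attempt (base_delay, 2x, 4x, ...).
    """
    for attempt in range(1, attempts + 1):
        result = run(list(cmd))
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"command still failing after {attempts} attempts")
```

It could be called as `push_with_retry(["balena", "push", "myFleet"])`, where `myFleet` is a placeholder fleet name. Retrying will not help during the multi-hour outages, but it may paper over one-off network or cache errors.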
@mpous We have continued to experience periods where remote builds flat out don’t work for hours. Usually it is a mix of cache failures and network errors, followed by a long period of 408s. This has been a common occurrence for a couple of months now.
Thank you for reporting this. We replaced the native ARM builders with new hardware on Dec 8th. Other customers now report normal build times. If you haven’t already, please try a build.
I just tried a build and got the dreaded 408. I was getting them pretty hardcore over the weekend, but had a few days where builds went off without a hitch this week.
Hey @johnwarila,
I just took a look at the builder logs and I see a few failed builds around the time you’re talking about. Unlike before the 8th, though, the builder system is not showing unusually high load from failed ARM builders. All pods are up and running as normal.
The reliability of our builder is important to us for obvious reasons. If you have access to paid support, I would recommend opening a ticket there. If not, then please DM me your Dockerfile(s) and docker-compose.yml.
Publicly, can you post your fleet URL and enable support access to your entire fleet so we can take a look? Please note that, as the forums are a little slower at getting a response, it would be great if you could enable support access for a week or more.
If you have any concerns with this, please let me know, and make sure to tag me so that I get notified.
Thanks,
-Thomas
Hey @johnwarila ,
Thanks for DMing me your Dockerfiles. Before digging into them too much, I looked at the application that contains the device whose UUID starts with 4ee and ends with 6be3, and noticed that it seems to have mostly successful builds. Are the failed builds examples of the problem occurring? For example, can I look at commit id: 52f1a2a, which is the most recent failed build, as an example of what it looks like when your builds fail randomly? Are all of the builds that failed in the last month examples of this?
I just want to make sure I get all the details before I go diving into the builder logs.
Thanks, I should have more time to look into this further this week.
-Thomas
I am not super confident that the failed builds are the failure mode. If you don’t see clusters of failed builds (ones that did not fail for reasons like missing dependencies etc.), the failures may not even be touching your build servers. We have definitely still been observing these failures over the last month (at least once per week when doing heavy build trains). The failures are usually clustered in time: we do not see mixed successes and failures, it is always a cluster of failures, and then it starts working again some hours later.
One hunch I have is that the build bundle is not even being uploaded, and we are getting a 408 error before hitting the build servers.
Actually, an interesting discovery: another member of the team tried to push a build with their balena ID and it worked just fine! We started poking around and discovered that, although we can see my account with admin permissions in the balena dashboards, editing my account fails when someone tries it. I can still access all the dashboards and do things related to the fleet and device.
Maybe my account gets messed up for several hours every now and again, and that results in my not being able to do certain things like push a build?
Hey @johnwarila
First off, thanks for the investigation from your side. I will keep tracking this down on my side, but I needed to ask for help on how to find the logs for the events you are seeing, as they don’t even get logged as failed builds.
If this is a partial and intermittent failure of the builder’s auth system, then this is certainly something we want to track down completely.
Can you confirm that your account is the one I am DMing?
The next time you see the 408 error, can you dump the full logs into this post so I can search for normal log lines around the error text and gather context on which system might be failing to authenticate you? This issue is complicated because the CLI should be, and is, pre-authenticating your account, or you would see a different error entirely.
I am looking into whether the builder uses a token or something similar from your account to act on your behalf, which could be the underlying source of the issue.
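For capturing the full logs around the next 408, one option is to run the push through a small script that timestamps every output line and appends it to a file. This is a minimal sketch under assumptions: `run_and_log` and the log file name are hypothetical, and nothing here is part of the balena CLI itself.

```python
import subprocess
import sys
from datetime import datetime, timezone
from typing import Sequence


def run_and_log(cmd: Sequence[str], log_path: str) -> int:
    """Run cmd, echo its output, and append each line to log_path.

    stderr is merged into stdout so error text keeps its place among
    the normal log lines. Each logged line gets a UTC timestamp prefix.
    Returns the exit status of the command.
    """
    with open(log_path, "a", encoding="utf-8") as log:
        proc = subprocess.Popen(
            list(cmd),
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
        )
        assert proc.stdout is not None
        for line in proc.stdout:
            stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
            log.write(f"{stamp} {line}")
            sys.stdout.write(line)
        return proc.wait()
```

For example, `run_and_log(["balena", "push", "myFleet"], "push.log")`, with `myFleet` as a placeholder. If your CLI version offers a debug or verbose option (check `balena push --help`), adding it to the command should give more context around the error. The UTC timestamps make it easier to line the output up with server-side logs.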
@johnwarila
Hey John,
I also wanted to be sure that the application you were experiencing failures on today has a name that starts with p and ends with e.
Just to confirm that the failures are indeed not getting logged as failed builds.
-Thomas