Balena Builds Often Fail

johnwarila · November 7, 2022, 11:18pm

Hey all, we use balena to distribute updates to our remote devices. The builds tend to be rather flaky: either some random Go package repository pull hits a network error (killing the deploy), or balena reports a caching error (i.e. "failed to restore cached image from “sha256: …” failed to get layer … : layer does not exist"). These failures mean we have to restart the build, which takes about 20-30 minutes (build time is another issue; local builds take under 3 minutes).

Does anyone else have issues with the remote builder’s reliability?

mpous · November 8, 2022, 2:51pm

Hello @johnwarila welcome to the balena community!

Could you please share more details? What hardware are you using? What connectivity does the hardware have? And what are the OS and supervisor versions?

Looking forward to helping you more!

johnwarila · November 8, 2022, 5:33pm

Hey @mpous, our target device is an arm-based Linux machine (iot-gate-imx8). It is connected currently over local wifi. Host OS is balenaOS 2.98.33, supervisor version 14.2.16.

All the builds seem to fail in the cloud build stage. We are not currently building locally, as that did not work when we set up our pipelines a year ago. The current cycle looks like this:

Make code change;
Balena push to the fleet;
Await either successful push and build, or failed in remote build.
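
Editor's sketch: the push-and-wait cycle above could be wrapped in a small retry loop so that a transient failure (for example a one-off network error while pulling a Go package) does not need a manual restart. This is only an illustration under stated assumptions: `run_with_retries` is a hypothetical helper, `myorg/myfleet` is a placeholder fleet name, and the attempt count and delays are arbitrary. It will not help during an hours-long outage, and it retries on any non-zero exit code, including real build errors.

```python
import subprocess
import time


def run_with_retries(cmd, attempts=3, base_delay=30.0,
                     runner=subprocess.run, sleep=time.sleep):
    """Run cmd until it exits with code 0 or the attempts run out.

    Returns (succeeded, attempts_used). The wait doubles after each failure.
    runner and sleep are injectable so the loop can be tested without
    actually invoking the CLI or waiting.
    """
    delay = base_delay
    for attempt in range(1, attempts + 1):
        result = runner(cmd)
        if result.returncode == 0:
            return True, attempt
        if attempt < attempts:
            sleep(delay)  # back off before the next push
            delay *= 2
    return False, attempts


# Hypothetical usage; "myorg/myfleet" is a placeholder, not a real fleet:
# ok, used = run_with_retries(["balena", "push", "myorg/myfleet"])
```

Because each remote build reportedly takes 20-30 minutes, a small attempt count with a long delay is probably more sensible than many quick retries.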

johnwarila · November 8, 2022, 6:59pm

In addition to the above: all our builds are now failing with 408 Request Timeout from the remote builder.

johnwarila · November 9, 2022, 7:41pm

Hey @mpous, we are still experiencing frequent 408s, any insight here?

johnwarila · December 6, 2022, 10:34pm

@mpous We have continued to experience periods where remote builds flat out don’t work for hours. Usually, a mix of cache failures and network errors, followed by a long period of 408s. This has been a common occurrence for a couple of months now.

Any word on this?

mpous · December 7, 2022, 9:42am

Hello @johnwarila, apologies for the experience and for not answering your previous message.

Let me check internally!

mpous · December 7, 2022, 10:00am

@johnwarila could you please grant support access and send me the ID of your device over DM? Thanks

rosswesleyporter · December 16, 2022, 10:50pm

Hello @johnwarila,

Thank you for reporting this. We replaced the native ARM builders with new hardware on Dec 8th. Other customers now report normal build times. If you haven’t already, please try a build.

johnwarila · December 16, 2022, 11:14pm

Hey @rosswesleyporter,

I just tried a build and got the dreaded 408. I was getting them pretty hardcore over the weekend, but had a few days where builds went off without a hitch this week.

tacLog · December 16, 2022, 11:54pm

Hey @johnwarila,
I just took a look at the builder logs and I see a few failed builds around the time you're talking about. Unlike before the 8th, though, the builder system is not showing unusually high load caused by the failed ARM builders. All pods are up and running as normal.

The reliability of our builder is important to us for obvious reasons. If you have access to paid support, I would recommend opening a ticket there. If not, then please DM me your Dockerfile(s) and docker-compose.yml.

Publicly, can you post your fleet URL and enable support access to your entire fleet so we can take a look? Please note that since the forums are a little slower at getting a response, it would be great if you could enable support access for a week or more.

If you have any concerns with this, please let me know, and make sure to tag me so that I get notified.
Thanks,
-Thomas

mpous · December 21, 2022, 10:46am

Hello @johnwarila, your device is offline! Do you think you can bring it back online?

Thanks

johnwarila · December 21, 2022, 6:22pm

Yeah, the device was down for physical testing recently. We are reactivating today.

johnwarila · December 28, 2022, 8:23pm

Hey @tacLog, sent you a DM with the info.

tacLog · January 4, 2023, 12:56am

Hey @johnwarila ,
Thanks for DMing me your Dockerfiles. Before digging into them too much, I noticed that the application containing the device whose UUID starts with 4ee and ends with 6be3 seems to have mostly successful builds. Are the failed builds examples of the problem occurring? For example, can I look at commit ID 52f1a2a, which is the most recent failed build, as an example of what it looks like when your builds fail randomly? Are all of the builds that failed in the last month examples of this?
I just want to make sure I get all the details before I go diving into the builder logs.

Thanks, I should have more time to look into this further this week.
-Thomas

johnwarila · January 4, 2023, 9:18pm

Hey @tacLog

I am not super confident that the failed builds are the failure mode. If you don’t see clusters of failed builds (ones that did not fail for reasons like missing dependencies), the failures may not even be touching your build servers. We have definitely still been observing these failures over the last month (at least once per week when doing heavy build trains). The failures are usually clustered in time: we do not see mixed successes and failures, it is always a cluster of failures and then it starts working again some hours later.

One hunch I have is that the build bundle is not even being uploaded, and we are getting a 408 error before hitting the build servers.

johnwarila · January 5, 2023, 8:19pm

Hey we are encountering the errors on builds right now. Letting you know in case you wanted some timestamps.

johnwarila · January 5, 2023, 8:33pm

Actually, an interesting discovery: another member of the team tried to push a build with their balena ID and it worked just fine! We started poking around and discovered that although we can see my account with admin permissions in the balena dashboards, editing my account also fails when someone tries it. I can still access all the dashboards and do things related to the fleet and device.

Maybe my account gets messed up for several hours every now and again, and that results in my not being able to do certain things like push a build?

tacLog · January 5, 2023, 10:57pm

Hey @johnwarila
First off, thanks for the investigation from your side. I will keep tracking this down on my side, but I needed to ask for help on how to find the logs for the events you are seeing, as they don’t even get logged as failed builds.
If this is a partial and intermittent failure of the builder's auth system, then this is certainly something we want to track down completely.
Can you confirm that your account is the one I am DMing?
The next time you see the 408 error, can you dump the full logs into this post? That way I can search for normal log lines around the error text and gather context on which system might be failing to authenticate you. This issue is complicated because the CLI should be, and is, pre-authenticating your account; otherwise you would see a different error entirely.
I am looking into whether the builder uses a token or something similar from your account to act on your behalf, which could be the underlying source of the issue.

Thanks again for the valuable context.
-Thomas
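
Editor's sketch: one way to collect the full logs requested above is to run the push through a small wrapper that timestamps every output line and appends it to a file, so the lines around the 408 can be matched against server-side logs. This is an illustration only: `run_and_log` and the file name `push.log` are hypothetical, `myorg/myfleet` is a placeholder, and the `--debug` flag in the usage comment is assumed to be accepted by the installed balena CLI version (check `balena push --help`).

```python
import subprocess
from datetime import datetime, timezone


def run_and_log(cmd, log_path):
    """Run cmd, append its output to log_path with UTC timestamps,
    echo it to the terminal, and return the exit code."""
    with open(log_path, "a", encoding="utf-8") as log:
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # keep errors interleaved with normal output
            text=True,
        )
        for line in proc.stdout:
            stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
            log.write(f"{stamp} {line.rstrip()}\n")
            print(line, end="")
        return proc.wait()


# Hypothetical usage; fleet name and log file are placeholders:
# exit_code = run_and_log(["balena", "push", "myorg/myfleet", "--debug"], "push.log")
```

UTC timestamps are used so the client-side log can be compared with builder logs without time zone guesswork. Review the file for tokens or other secrets before pasting it into a public thread.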

tacLog · January 5, 2023, 11:13pm

@johnwarila
Hey John,
I also wanted to be sure that the application you were experiencing failures on today has a name that starts with p and ends with e.
Just to confirm that the failures are indeed not getting logged as failed builds.
-Thomas
