OpenVPN: AUTH_FAILED when bootstrapping a new device from a custom build

Hi,

A while back I made some modifications to balenaOS to support my own board (similar to fincm3), and I am now in the process of automating the build/deployment so my colleagues can work with it as well.
However, I am noticing some issues when trying to bootstrap a new device.

I have updated/rebased my changes to a newer Balena version and I'm not sure if this is what broke it.
My current build is based on BalenaOS v2.72.0+rev1 and I use balena-cli 12.44.15.

As far as I’m aware, I used the same steps to get my previous builds (based on v2.58+rev3 and v2020.07) up and running; those may have been done with an older balena-cli version.

The steps I’m going through are:

  1. Build image with: docker run --rm -it --privileged -v <repo-path>:/home/build aggurio/docker-balena ./balena-yocto-scripts/build/barys -d --machine mb1929 --rm-work;
  2. Configure image with: balena os configure <image> --app <balena application> --type fincm3 --debug --version 2.72.0 --config-network ethernet;
  3. Flash image with:
    3.a. dd if=<image> of=/dev/sdb bs=4M conv=fsync or
    3.b. balena os initialize <image> --type fincm3 --drive /dev/sdb or
    3.c. using Etcher.

Next, when I boot up my device, it first complains about the certs not being valid yet, but this resolves itself once NTP kicks in after a while.
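(A quick sketch of how to confirm the clock has synced, run from the host OS; whether this build ships chrony or systemd-timesyncd is an assumption, so use whichever is present:)

# Check whether the clock has synced yet (host OS):
timedatectl status | grep -i 'synchronized'
chronyc tracking 2>/dev/null || true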

However, what completely stops my bootstrap (I think) is openvpn returning an AUTH_FAILED control message; there is no further information with this message. The device just keeps retrying after a short delay and hitting the same error again.

Looking at the config.json, the failing device does have some deviceApiKey and UUID configured.
The userId, mixpanelToken and applicationId are the same as on a working image.
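For reference, this is roughly how I compare the two configs (just a sketch; the /mnt/boot path and having jq available are assumptions, cat/grep works just as well):

# Dump the provisioning-related fields from config.json on the boot partition:
jq '{deviceType, uuid, deviceApiKey, applicationId, userId}' /mnt/boot/config.json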

I have tried grabbing an image from the balenaCloud dashboard (fincm3, v2.72.0, development) and flashing that with Etcher.
This shows the same error at first, but then actually moves on, reboots and connects properly.
After a 503 from the API, it pulls the proper application version and reboots again, after which it’s ready to use.

I will check if a fresh build of an older version still works, but building this will take a while (as will rebasing to a newer Balena version).

In the meantime I will gladly take any suggestions on how to get this working (again).

Edit: added some logging: bootstrap.log (14.7 KB)
It will keep running in this loop.

Edit 2: This works fine with a fresh v2020.07 build.

Edit 3: full log of bootstrap attempt bootstrap_full.log (132.4 KB)

Edit 4: Rebasing my changes to v2.73.1+rev1 doesn’t help; following the same procedure with the unmodified v2.73.1+rev1 version does work.
I will see if I can narrow this down tomorrow by incrementally adding the changes.

Hey,
I think the VPN error is a red herring. The more interesting log line is this:

Sep 06 12:14:28 09e71b1 resin-supervisor[2254]: [info]    New device detected. Provisioning...
Sep 06 12:13:58 09e71b1 resin-supervisor[2254]: [event]   Event: Device bootstrap failed, retrying {"delay":30000,"error":{"message":""}}

This isn’t exactly helpful, but without proper provisioning the VPN won’t have credentials to connect, so that part is expected.

Hi,

I’ve been trying out my changes (still on v2.73.1+rev1), and it seems to break as soon as I try to define my own device type and layer (no recipes yet).

I’ve tried debugging the bitbake run by adding env MACHINE=$machine bitbake -e ${IMAGE} >> ${LOGFILE} before the actual build command.
I noticed that this did not show the value fincm3 in OVERRIDES.
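Roughly, the check looks like this (a sketch; "balena-image" stands in for whatever image recipe barys actually builds in your setup):

# Dump the parsed metadata and see what ended up in (MACHINE)OVERRIDES:
MACHINE=mb1929 bitbake -e balena-image | grep -E '^(OVERRIDES|MACHINEOVERRIDES)='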

I suspect the issue is that meta-balena-raspberrypi/conf/machine/fincm3.conf completely overwrites the MACHINEOVERRIDES that I have set, which then breaks the bootstrapping.

I’m not quite sure why this wouldn’t break the older version though.

For reference, my machine config is now:

#@TYPE: Machine
#@NAME: MB1929 (CM3)
#@DESCRIPTION: Machine configuration for DynaLynx Mainboard

MACHINEOVERRIDES =. "fincm3:${MACHINE}:"
include conf/machine/fincm3.conf

Hi,

I think I may have found my problem, or at the very least a workaround.

In meta-balena/meta-balena-common/classes/image-balena.bbclass, the supervisor is fed the slug from my device configuration, which is set to mb1929.
I think it fails to provision properly because Balena doesn’t know this device type.

I’ve had a look at the configuration in /resin-boot/config.json.
A freshly flashed SD card has its deviceType set to fincm3.
I noticed that after a failed provisioning attempt the deviceType was set to mb1929.
So it seems like the config.json gets overwritten?

If I change the slug (and aliases) in the generated /resin-boot/device-type.json file to fincm3 before booting up, the device provisions properly.
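In case it helps anyone else, the edit is just a one-liner before first boot (a sketch; /mnt/sd-boot is a placeholder for wherever the flashed boot partition is mounted on your workstation, and the exact shape of the aliases field is an assumption, so adjust to what your generated file contains):

# Point the generated device-type.json back at a slug the API knows:
jq '.slug = "fincm3" | .aliases = ["fincm3"]' /mnt/sd-boot/device-type.json > /tmp/device-type.json
cp /tmp/device-type.json /mnt/sd-boot/device-type.json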

Maybe the mismatch between config.json and device-type.json is relevant?

I’ve propagated these changes to my coffee file and now it works as expected.
I’m still confused why the custom slug was not an issue in the older version.

Thanks for the steps to reproduce this issue. We will try them as well and see what the actual error message is so we can pin down the exact issue. We’ve also documented the lack of logging for this error in https://github.com/balena-os/balena-supervisor/issues/1785 so we can make it easier to debug in the future.

We’ll get back to you after we perform some tests.


Hi @tjvv,

I’ve performed some tests on the device and made an update to the GitHub issue. You can check it out here: https://github.com/balena-os/balena-supervisor/issues/1785

In summary, I determined that one possible cause of the Device bootstrap failed, retrying {"delay":30000,"error":{"message":""}} error logs is a request timeout to the device provisioning servers. This error message is unrelated to incorrect device type reporting, since all cases of an incorrect or unspecified device type log a more verbose error message.

The strange thing is, the default apiTimeout is 15 minutes (900000 ms) as of Supervisor v12.10.x, which is quite long, so without the device in question it might be difficult to pinpoint a definite cause. Have you tried this operation again? Could you paste the complete journal logs from a device that’s refusing to provision, without editing either config.json or device-type.json? That would help us debug more accurately. Thanks!
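For example, something like this captured from the device console would work (just a sketch; grab the output however is most convenient for you):

# Capture the complete journal from boot onwards:
journalctl -a --no-pager > /tmp/bootstrap_full.log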

Regards,
Christina

Hello @cywang117,

Thank you for looking into this.
The device does not wait the full 15 minutes before giving the error, so I kinda doubt it’s that timeout.
For completeness’ sake: the device is connected via ethernet and doesn’t show any connectivity issues when provisioned.
I will rebuild my image and see if I can get some logs later today.

Edit: I’ve just run a test run with my latest build, which is based on v2.83.21+rev1.
This version does show the “Unknown device type” error.
I’m guessing v2.73.1+rev1 just didn’t translate the error correctly?
Here’s the log, with the config.json and device-type.json at the end.

boot.log (183.1 KB)

Hi @tjvv,

Thanks for the follow-up and the boot logs. I looked through them and the error message is indeed different than before, which makes replicating the vague "error":{"message":""} message from before more puzzling… Did you do anything differently this time compared to last time, aside from editing or not editing device-type.json?

Supervisors >= v12.0.9 do read the device type field from device-type.json, which explains why you were able to provision your custom board successfully before. If you edit device-type.json like last time, does the device provision successfully? Also, I was assuming that you were running a Supervisor version newer than 12.0.9 on your device when you encountered the "error":{"message":""} message, but can you confirm your version just in case that’s a factor? Thanks!
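If it’s easier than checking the dashboard, the Supervisor version is also visible on the device itself via the image tag (a sketch; the exact tag naming varies slightly between OS releases):

# Read the Supervisor version from the image tag on the device:
balena-engine images | grep -i supervisor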

Regards,
Christina

To clarify, I’m asking about editing device-type.json on your latest device that’s erroring with “Unknown device type” so we can at least see if the behavior is consistent.

I’ll take another look at the provided logs and get back to you.

Hi,

I have not changed anything in my build/provision process between the last few tests.
The only difference was the version of the balena-raspberrypi repo (and submodules) that I used.

I can confirm that changing the slug in device-type.json still makes the difference on my v2.83.21+rev1 image.
The way I’ve tested this:

  1. Create image with the slug set to “mb1929” in the coffee script;
  2. Flash SD;
  3. Boot device, check logs: shows “Unknown device type” errors;
  4. Edit /mnt/boot/device-type.json, changing slug to “fincm3”;
  5. Reboot device, check logs: shows proper provisioning;
  6. Check dashboard, see newly provisioned device.

Checking logs is done via UART, running journalctl -fam.

I have not specified a custom supervisor tag, so the version mappings should be:
image v2.58.3+rev3 (warrior) = supervisor v11.14.0 => provisions fine;
image v2.73.1+rev1 (dunfell) = supervisor v12.4.6 => provisions after changing slug;
image v2.83.21+rev1 (dunfell) = supervisor v12.10.3 => provisions after changing slug.

I am currently (re-)building the v2.73.1+rev1 image, to double check if the error is still empty, but this will take a while.

It took a while to double-check the v2.73.1+rev1 build due to some errors with moby (also triggered on fincm3 builds).
At first my builds still seemed to compile, but running them gave errors about mounts like /resin-data, suggesting something was wrong with aufs.
This effectively blocked me before the device would even try to provision.

Thinking maybe something changed in balena-cli, I tried a couple of different (older and newer) versions of that, with the same result.

Running a completely clean build gave me errors about failing to obtain the Go package providing aufs.
I finally replaced the (clean) go directory with the one from my v2.83.21+rev1 build, and that allowed the image build to finish properly.

Long story short, my v2.73.1+rev1 image still gives the empty error, and the Supervisor is indeed version v12.4.6.

Hi @tjvv,

We saw another instance of this cryptic error message recently. It’s indeed not related to API requests timing out, but rather to the api.balena-cloud.com DNS name not resolving as expected inside the container. Here is an issue which goes into more detail: https://github.com/balena-os/balena-supervisor/issues/1787

In short, some ISPs do not support IPv6. The Supervisor (and any user containers, for that matter) resolves the API hostname to IPv6 addresses. We could add a block of code to the Supervisor so that it also resolves to IPv4, as that issue suggests, but user containers would still run into the same problem. Internally we’ve discussed this, and the ideal path forward is for this part of networking to be handled in the OS. You can test this hypothesis by running curl https://api.balena-cloud.com in both the Supervisor container and the host OS. The host OS will resolve the name because it also tries IPv4, but the Supervisor container will not. Please let us know if this is not the case for you when you test, as we will then need to investigate further.

I was previously assuming this was a race condition as it only appears sporadically. I’m relieved to hear it’s not a race condition, but rather a networking issue.

Regards,
Christina

Sorry, I meant curl https://api.balena-cloud.com/ping

To clarify: the host OS resolves over both IPv4 and IPv6. The Supervisor container resolves only to IPv6, which breaks when the network lacks IPv6 support. Can you confirm whether your network is limited in this way?
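A minimal sketch of the two checks (the Supervisor container name differs between OS releases, e.g. resin_supervisor vs. balena_supervisor, and whether curl exists inside that container is an assumption, so adjust as needed):

# From the host OS (expected to succeed, since IPv4 is also tried):
curl -v https://api.balena-cloud.com/ping

# From inside the Supervisor container (container name is an assumption):
balena-engine exec -it balena_supervisor curl -v https://api.balena-cloud.com/ping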

Hi @cywang117,

I can confirm that resolving seems to work fine from the Supervisor container.
I’ve tested this both with a properly provisioned v2.83.21+rev1 image and with a freshly preloaded and flashed v2.73.1+rev1 image that fails to provision.
For what it’s worth, our network should be limited to IPv4.

Either way, I fail to see how this would correspond to a difference in the device-type configuration.

I’m guessing we are talking about two different issues at the same time.
The first issue is the device not provisioning properly with a custom device type, which seems to produce a clearer error message in the newer Balena and/or Supervisor versions.

The second issue is the VPN failing with an error, which is probably a consequence of the device not being provisioned properly in the first place.

@tjvv, agreed, I was talking about another, more networking-centric issue in my last post. I suppose we can ignore the VPN failure for now, as it may be due to improper device provisioning, as you say.

I missed replying to one point in your message from Sept 15 (sorry for the delay!). You wrote that Supervisor v11.14.0 provisions fine but v12.4.6+ only provisions after changing the slug by editing device-type.json. I took a look at the Supervisor’s changelogs, and it appears that the source of truth for the device slug was moved to device-type.json in v12.0.9+.

This may explain the difference you’ve seen better than I have thus far!

If you do find a way to reliably reproduce the empty "error":{"message":""} logs on provision failure, I’d also be curious to hear what it is. Thank you!