When Secure Boot is enabled, the OS is stuck in an infinite boot loop

Hey Guys

We wanted to test the Secure Boot feature on balenaOS (v3.0.15) using a Maxtang EHL-35 motherboard with AMI BIOS v2.22.1282. It has an Intel J6412 CPU with a built-in TPM 2.0 chip (firmware version 600.15).

  • We reset the BIOS and entered Secure Boot setup mode
  • We inserted the USB drive, booted from it, and waited a minute or two (watching the cloud dashboard) for the system to copy all files to the SSD
  • The installer correctly shut down the system (all LEDs off)
  • We restarted the machine and set the boot device to SSD UEFI, and it is stuck in the “Post Provisioning” state

It keeps rebooting after the “Welcome to GRUB” text. It looks like the Secure Boot part is working, but there may be a problem mounting the LUKS root partition. With Secure Boot enabled in the BIOS, the boot process successfully gets to GRUB, so the signatures are probably okay: when we tried resetting the keys in the BIOS, it correctly threw an incorrect signature error on boot.

We followed this guide:

Here are things we have tried:

  • Without Secure Boot (no --secureBoot flag), the OS image works perfectly
  • We tried it with Prod and Dev images as well
  • We tried the first boot in the BIOS with Secure Boot enabled and disabled
  • The boot order in the BIOS is clean: all boot options are disabled except the first one, which is set to USB UEFI; after the shutdown we set it to SSD UEFI.

One interesting thing we noticed: on the first boot the installer creates a device in the fleet, then something happens, the machine reboots, and the installer starts again and creates another device (the one that actually gets installed). This all happens by itself. Then the system shuts down, ready for the first boot. Only the development image does this; the production image creates just one device.

Is there any way to get more verbose error messages to help further the investigation?

Thank you.

UPDATE: I tried everything with v3.1.3 OS version, with the same results.

UPDATE2: Here are a few screenshots from the BIOS.

Hi Peter,

Thanks for all of the detail. I have installed balenaOS with Secure Boot & Full Disk Encryption (SB & FDE) on many different types of x86 hardware. Most of the challenges we have seen so far have been with finding the right BIOS settings for provisioning, and these do vary from one type of device to another.

But in this case, I agree with your synopsis that the provisioning seems to have gone well, except for the creation of the extra zombie device. I don’t have a suggestion for you yet. But we will look at it further and keep you posted.

BTW, smart move to try resetting the keys in the BIOS and finding that it correctly threw the incorrect signature error. That’s informative, and rules out some potential causes.

thanks for the reply @rosswesleyporter

is there any way to get some debug messages from the OS? is there any way to at least figure out what goes wrong?

Hi,

thanks for your interest in the secure boot feature. You seem to be doing everything correctly, and your thought about failing to unlock LUKS also seems correct. Though uncommon, we have seen similar behavior on devices where the PCR values change between reboots even though nothing else has changed on the device. We can quite easily test whether this is the case with your device as well: the procedure is to provision an unencrypted, non-secure-boot OS on the device and reboot it a few times, watching whether the PCR values change in between.

If you want us to perform the test, all we need from you is to provision the device and share support access. If you prefer running the test yourself, I can put together a testing application that you can deploy to the device. Let us know which of the two you prefer.
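
For anyone running the check by hand, here is a minimal Python sketch of the comparison step. It assumes you have saved the text output of tpm2_pcrread after each reboot, and that the output is a bank name ending in a colon followed by `index : 0xVALUE` lines. The function names `parse_pcrread` and `changed_pcrs` are made up for this illustration, and this is not the testing application mentioned above.

```python
import re

def parse_pcrread(text):
    """Parse tpm2_pcrread-style text into {bank: {pcr_index: hex_value}}.

    Assumed layout: a bank line such as "sha256:" followed by
    lines such as "1 : 0xABCD...".
    """
    banks = {}
    bank = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        m = re.fullmatch(r"(\d+)\s*:\s*0x([0-9A-Fa-f]+)", line)
        if m and bank is not None:
            banks[bank][int(m.group(1))] = m.group(2).upper()
        elif line.endswith(":"):
            bank = line[:-1].strip()
            banks[bank] = {}
    return banks

def changed_pcrs(before, after):
    """Return {bank: sorted PCR indices whose value differs between snapshots}."""
    diff = {}
    for bank in before.keys() & after.keys():
        indices = before[bank].keys() | after[bank].keys()
        changed = sorted(i for i in indices
                         if before[bank].get(i) != after[bank].get(i))
        if changed:
            diff[bank] = changed
    return diff

if __name__ == "__main__":
    # Shortened, made-up values purely for demonstration.
    boot1 = "sha256:\n  0 : 0xAA\n  1 : 0xBB\n  7 : 0xCC\n"
    boot2 = "sha256:\n  0 : 0xAA\n  1 : 0xBD\n  7 : 0xCC\n"
    print(changed_pcrs(parse_pcrread(boot1), parse_pcrread(boot2)))
```

If the result is non-empty across reboots where nothing on the device was changed, the PCRs listed are the unstable ones.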

Hi,

Is there a workaround for setting up secure boot if the device does change PCR values on reboot? I’m looking to set up a new PC, but haven’t bought the device yet. It’s likely to be a 12th or 13th gen i7 processor. I just want to confirm I’ll be able to set up secure boot if the device turns out to have this issue.

Or better still, is there any way to identify from their specs what devices might have this issue?

Thanks

Hi,

unfortunately there is no simple way to guess this from just the specs. The good news is that we have been able to make this work with most of the 12th and 13th gen hardware available to us (the only exception being a device that came with Secure Boot enabled and provided no option to enter Setup Mode or replace the default keys). Some devices do need additional steps depending on what gets measured into PCR1 (we have seen e.g. dynamic memory voltage or CPU temperature at boot), but these are mostly set-once-and-forget-forever BIOS settings.
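
To illustrate why even a tiny difference in what gets measured matters, here is a minimal Python sketch of the TPM "extend" operation for a SHA-256 bank: each new PCR value is the hash of the old value concatenated with the digest of the measured data. The event strings are hypothetical placeholders; real firmware measures binary structures, and this is not how balenaOS itself reads or seals against PCRs.

```python
import hashlib

def extend(pcr: bytes, event_data: bytes) -> bytes:
    """TPM-style extend: new = SHA-256(old || SHA-256(event_data))."""
    digest = hashlib.sha256(event_data).digest()
    return hashlib.sha256(pcr + digest).digest()

def replay(events):
    """Replay a list of measured events starting from an all-zero PCR."""
    pcr = bytes(32)
    for event in events:
        pcr = extend(pcr, event)
    return pcr

if __name__ == "__main__":
    # Hypothetical measurements: identical except for one reading.
    boot_a = [b"bios-config", b"memory-voltage=1.20V"]
    boot_b = [b"bios-config", b"memory-voltage=1.21V"]
    print(replay(boot_a).hex())
    print(replay(boot_b).hex())
    print(replay(boot_a) == replay(boot_b))  # False
```

Because the final value depends on every measurement in the chain, a secret sealed against one value cannot be unsealed once any measured input differs, which matches the boot loop symptom when LUKS cannot be unlocked.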

We realize this is not ideal and are already working on moving away from PCR1, as it is quite unreliable. If you are interested, you can follow the related GitHub PR: "Seal LUKS passphrase with PCR7" by jakogut (Pull Request #3259, balena-os/meta-balena). This is still a work in progress, and we have no ETA for when it will be finished.

In the meantime, could you maybe elaborate about your use-case and why you need to have Secure Boot enabled?

Thanks, that’s very promising.

The use case is a remote deployment of a PC to act as an edge server. Because it’s remote I’m keen to ensure any proprietary data is protected.

Thanks again

Hi,

Just some feedback on the current PCR1 based secure boot option for anyone following this thread. I set up a few NUC11 devices with secure boot a few months ago. That went well at the time, but starting them up this morning, all of them went into endless reboots.

I can see there’s been good progress on a PCR7 based solution and look forward to trying it when it’s released.

Thanks

Hi,

I am using balenaOS with secure boot on NUC13 devices and I am also experiencing random boot loops.

I’ll try to provide some hints to help tackle this problem once and for all.

What I have observed on the “old” image (with PCR1):

  • Some NUCs never boot loop
  • Some NUCs sometimes boot loop, though not forever; it can take up to 20 minutes to boot.

I think it was due to PCR1 fragility on NUC13.

My concern is that after reflashing with a recent balenaOS release (5.24) that uses PCR7, I am still experiencing boot loops on some NUCs, and they are now more reproducible. They happen mainly when a USB network device is connected. After reflashing without secure boot, we can see with tpm2_pcrread that PCR1 and PCR10 do change when a network device is plugged in at boot, but these should not normally be read in the policy.

Another hint is that with the new image on an empty NUC, I do not see this. It may be random, or could it be related to a previous state on the hard drive, or to the TPM not being properly cleared on reflashing?

This boot loop problem is quite concerning for my company. We sell autonomous robots, and secure boot gives our clients assurance that the robot cannot be hacked with malicious code, and helps us avoid code leaks. But having to flip a coin every time we boot is not very sustainable :sweat_smile:.

A curious thing is that plugging a screen into the HDMI port of the NUC almost always fixes the boot loop (PCR1 changes back to the correct value).