Unable to complete Getting Started on Intel NUC

Confusingly, this morning when I fired the device up to check some things in the logs, it managed to get itself running the daemon. I did literally nothing to get it to this point, other than turn it on.

I was able to do a push of the multicontainer demo, and it was up and running as described on the Getting Started page.

I went to work, and when I got back I used a Live USB to use smartctl to check the drive - no signs of problems there. So I’m going to try the Cloud install again, and see where that gets me…

Hmmm.

Take care,

Caligari

Not sure whether this belongs here or in the Cloud section.

I worked at getting the Cloud install working. I did the same thing that worked for me here:

  1. flash with image - daemon does not start, /dev/sda6 problems reported (exactly the same ones as above, as far as I can tell)
  2. ssh in, and run fsck on /dev/sda6. This took a lot of patience as it had to be manual and there were perhaps a thousand changes - inodes, directories, all sorts of things - that had to be confirmed.
  3. rebooted - still daemon does not start, /dev/sda6 problems reported (again, exactly the same ones as above, as far as I can tell)
  4. rebooted - daemon came up, and finally the device shows up on my dashboard as being alive!

I don’t understand why it requires the two boots after the fsck. Does the fsck actually do anything? Would three reboots work? I don’t know if I ever just rebooted three times, initially.
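
One way to tell whether an fsck pass changed anything is its exit status. The sketch below decodes the exit bits documented in the e2fsck man page. It assumes the partition is ext4 and is checked with e2fsck while unmounted; describe_fsck_exit is a hypothetical helper name, not part of balenaOS.

```shell
#!/bin/sh
# Hypothetical helper: decode the exit status of e2fsck.
# The status is a sum of bit flags (1, 2, 4, 8, ...), so each bit is tested.
describe_fsck_exit() {
    code="$1"
    if [ "$code" -eq 0 ]; then
        echo "clean"
        return 0
    fi
    out=""
    if [ $(( code & 1 )) -ne 0 ]; then out="$out errors-corrected"; fi
    if [ $(( code & 2 )) -ne 0 ]; then out="$out reboot-recommended"; fi
    if [ $(( code & 4 )) -ne 0 ]; then out="$out errors-left-uncorrected"; fi
    if [ $(( code & 8 )) -ne 0 ]; then out="$out operational-error"; fi
    # Anything else (usage error, cancelled, library error) is reported as-is.
    if [ -z "$out" ]; then out=" other($code)"; fi
    echo "${out# }"
}

# Intended use on the device (needs root and an unmounted partition):
#   e2fsck -f -n /dev/sda6; describe_fsck_exit $?   # -n: read-only check
#   e2fsck -f -y /dev/sda6; describe_fsck_exit $?   # -y: answer yes to every fix
```

With -y the thousand or so prompts from step 2 would be answered automatically. That is convenient, but it also means accepting every proposed fix without reviewing it.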

And what is causing the trouble in the first place? Is it/was it on the disk itself? Is it a hardware flaw of some sort? Is there some incompatibility in a driver or something?

I’m going to play around with balena a bit, and I hope that means things just work, a bit. :slight_smile:

But if there is any more information I can provide, just let me know.

Take care,

Caligari

Hi Liam, reading your report above it does seem that the hard drive on that device is damaged. Could you try to install another OS like Ubuntu on it and confirm it works?

I’ve installed Ubuntu server 20.04 on the system. There were no problems with that at all.

It seems to have created a single large partition across the whole 2TB drive.

I’ve run some brief stress-ng tests, and some smartctl tests, and there is no indication from them of any issues with the drive.

Take care,

Caligari

Hey there,
thank you for all the information given. I’d like to ask you to run another test: would it be possible for you to try to run BalenaOS directly from a USB disk? Just to be clear, you will need to extract the resin-image from the flasher image and burn that onto the USB drive. To do so, you can use https://github.com/balena-os/resin-image-flasher-unwrap.

I’m just off to bed, but I will work through that tomorrow.

I’m definitely curious to find out the root cause of these difficulties.

Take care,

Caligari

Thank you again @Caligari for all the information and tests you are doing.
Looking forward to your feedback, I wish you a good day

Ah, it seems that the image flasher unwrap needs to know the image format type, but I can’t find that? vdi/vhd/vmdk?

I’m using the dev build for Intel NUC from https://www.balena.io/os/#download

Not understanding this aspect of image files well, I don’t know if I can just try each format…

Advice welcome!

Update: I have worked out this is asking for the output format. Still not sure what is recommended, but I can just try all three if I have any problems.

I’m not quite there.

I extracted using all three image format types, and now I’m trying to burn a usb, but none of the tools are happy.

I have three files named balena-cloud-intel-nuc-2.50.1+rev1-dev-v11.4.10 vdi/vhd/vmdk

The only format which the tools recognize is the vhd file. But each tool (I’ve tried both etcher and rufus) complains that there is something missing (MBR/partition table) in the image, which means it is unlikely to be bootable. And when I asked etcher to go ahead anyway, it did not boot.

So, I’m afraid I’m not clear on the process here. Advice welcome!

OK, so after some more research, I ended up doing the following:

qemu-img convert balena-cloud-intel-nuc-2.50.1+rev1-dev-v11.4.10.vmdk -O raw balena-test.img
sudo dd if=./balena-test.img of=/dev/sdb

That has given me a bootable usb.
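
For anyone repeating this, here is a slightly more defensive sketch of the same two steps. write_raw_image is a hypothetical wrapper name. The -f vmdk, bs=4M, conv=fsync and status=progress options are additions that were not in the commands above, and status=progress assumes GNU dd. The guard refuses to run unless the target is a block device, since dd will overwrite whatever it is pointed at.

```shell
#!/bin/sh
# Hypothetical wrapper: convert a VMDK to a raw image and write it to a USB drive.
write_raw_image() {
    src="$1"   # input .vmdk file
    raw="$2"   # raw image to create
    dev="$3"   # target block device, e.g. /dev/sdb
    [ -f "$src" ] || { echo "no such image: $src" >&2; return 1; }
    [ -b "$dev" ] || { echo "not a block device: $dev" >&2; return 1; }
    # Convert first, so a failed conversion never reaches the dd step.
    qemu-img convert -f vmdk -O raw "$src" "$raw" || return 1
    # conv=fsync flushes the data to the device before dd exits.
    sudo dd if="$raw" of="$dev" bs=4M conv=fsync status=progress
}

# Intended use (double-check the device name with lsblk first):
#   write_raw_image balena-cloud-intel-nuc-2.50.1+rev1-dev-v11.4.10.vmdk balena-test.img /dev/sdb
```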

That boots and appears to get to a workable state - “balena-engine version” shows that both client and server are up and running.

Is there anything else that I need to check in order to verify the condition of the system?

Update: I should note that the whole boot was quite quick, relatively speaking - nothing like the 10-20 minutes I’ve had previously when the resize has gone badly.

Update: from “fdisk -l”:

Device     Boot   Start      End  Sectors  Size Id Type
/dev/sdb1  *       8192    90111    81920   40M  e W95 FAT16 (LBA)
/dev/sdb2         90112   729087   638976  312M 83 Linux
/dev/sdb3        729088  1368063   638976  312M 83 Linux
/dev/sdb4       1368064 15633407 14265344  6.8G  f W95 Ext'd (LBA)
/dev/sdb5       1376256  1417215    40960   20M 83 Linux
/dev/sdb6       1425408 15633407 14208000  6.8G 83 Linux
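
As a sanity check on this listing, the Size column can be reproduced from the Sectors column. This is a small sketch assuming 512-byte sectors (the fdisk header line that states the sector size is not shown above); sectors_to_mib is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: convert a sector count to whole MiB.
# The second argument is the sector size in bytes (assumed 512 if omitted).
sectors_to_mib() {
    sectors="$1"
    sector_size="${2:-512}"
    echo $(( sectors * sector_size / 1048576 ))
}

sectors_to_mib 81920      # sdb1, listed as 40M
sectors_to_mib 638976     # sdb2 and sdb3, listed as 312M
sectors_to_mib 14208000   # sdb6, listed as 6.8G (result is in MiB)
```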

Hi Liam,
Thanks for the effort you are putting in to help us debug this.

Starting with the USB boot that you successfully got working, allow me to give you some background. BalenaOS supports two different classes of devices: those that can boot directly from an SD card or USB drive, and those that need a flasher image to boot from SD card or USB that will then program the internal storage drive with BalenaOS.

The Intel NUC is released as the latter, with a flasher image that boots up from USB, flashes into internal storage, and then shuts down the device. Once the USB is removed, the device then boots BalenaOS from internal storage.

However, as you know now, it is also able to boot directly from USB using the latest version of BalenaOS. For that we need to extract the raw image from the flasher image. The process shouldn’t be as convoluted as you experienced, and I have pushed a PR to fix it along with an update to the README file of the image extracting project that should make things easier in the future. (See https://github.com/balena-os/resin-image-flasher-unwrap/pull/7)

Anyway, once you have confirmed that the Intel NUC device both boots Ubuntu and BalenaOS from a USB, I would like to go back and debug the original problem you had with the BalenaOS flasher image.

The reason the install failed was unexpected inconsistencies in the data partition:

Jul 22 10:58:11 localhost resin-partition-mounter[747]: resin-data: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
Jul 22 10:58:11 localhost resin-partition-mounter[747]:         (i.e., without -a or -p options)

As I don’t know what the current contents of the internal drive are, could you please post the output from:

ls -al /dev/disk/by-state
ls -al /dev/disk/by-id

When running from the BalenaOS USB drive?

After looking at the logs you posted, the OS could not boot due to inconsistencies on the resin-data partition, so I would like to find out which are the partitions for the internal drive and run fsck manually on all of them to fix the inconsistencies reported above.

So, at the moment on the NUC I have that Ubuntu install on the internal hdd, which should still be functional. And I’ve booted on the balena OS USB drive.

Running the above commands we get:

root@localhost:~# ls -al /dev/disk/by-state
total 0
drwxr-xr-x 2 root root 180 Jul 26 03:01 .
drwxr-xr-x 8 root root 160 Jul 26 03:01 ..
lrwxrwxrwx 1 root root  10 Jul 26 03:02 active -> ../../sdb2     
lrwxrwxrwx 1 root root  10 Jul 26 03:02 inactive -> ../../sdb3   
lrwxrwxrwx 1 root root  10 Jul 26 03:02 resin-boot -> ../../sdb1 
lrwxrwxrwx 1 root root  10 Jul 26 03:02 resin-data -> ../../sdb6 
lrwxrwxrwx 1 root root  10 Jul 26 03:02 resin-rootA -> ../../sdb2
lrwxrwxrwx 1 root root  10 Jul 26 03:02 resin-rootB -> ../../sdb3
lrwxrwxrwx 1 root root  10 Jul 26 03:02 resin-state -> ../../sdb5

root@localhost:~# ls -al /dev/disk/by-id
total 0
drwxr-xr-x 2 root root 300 Jul 26 03:00 .
drwxr-xr-x 8 root root 160 Jul 26 03:01 ..
lrwxrwxrwx 1 root root   9 Jul 26 03:01 ata-ST2000LM015-2E8174_WDZQLQDD -> ../../sda
lrwxrwxrwx 1 root root  10 Jul 26 03:01 ata-ST2000LM015-2E8174_WDZQLQDD-part1 -> ../../sda1
lrwxrwxrwx 1 root root  10 Jul 26 03:01 ata-ST2000LM015-2E8174_WDZQLQDD-part2 -> ../../sda2
lrwxrwxrwx 1 root root   9 Jul 26 03:02 usb-SanDisk_Cruzer_Blade_4C532000040915120043-0:0 -> ../../sdb       
lrwxrwxrwx 1 root root  10 Jul 26 03:02 usb-SanDisk_Cruzer_Blade_4C532000040915120043-0:0-part1 -> ../../sdb1
lrwxrwxrwx 1 root root  10 Jul 26 03:02 usb-SanDisk_Cruzer_Blade_4C532000040915120043-0:0-part2 -> ../../sdb2
lrwxrwxrwx 1 root root  10 Jul 26 03:02 usb-SanDisk_Cruzer_Blade_4C532000040915120043-0:0-part3 -> ../../sdb3
lrwxrwxrwx 1 root root  10 Jul 26 03:02 usb-SanDisk_Cruzer_Blade_4C532000040915120043-0:0-part4 -> ../../sdb4
lrwxrwxrwx 1 root root  10 Jul 26 03:02 usb-SanDisk_Cruzer_Blade_4C532000040915120043-0:0-part5 -> ../../sdb5
lrwxrwxrwx 1 root root  10 Jul 26 03:02 usb-SanDisk_Cruzer_Blade_4C532000040915120043-0:0-part6 -> ../../sdb6
lrwxrwxrwx 1 root root   9 Jul 26 03:01 wwn-0x5000c500cfccd618 -> ../../sda
lrwxrwxrwx 1 root root  10 Jul 26 03:01 wwn-0x5000c500cfccd618-part1 -> ../../sda1
lrwxrwxrwx 1 root root  10 Jul 26 03:01 wwn-0x5000c500cfccd618-part2 -> ../../sda2
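
The by-id names in this listing encode the bus each disk sits on (ata- for the internal SATA drive, usb- for the stick). Below is a minimal sketch that reads the ls -al output and prints the bus prefix and kernel name of each whole disk, skipping the -partN entries. classify_disks is a hypothetical helper name, and it assumes the ls output format shown above.

```shell
#!/bin/sh
# Hypothetical helper: read "ls -al /dev/disk/by-id" on stdin and print
# "<bus-prefix> <kernel-name>" for every whole-disk symlink.
classify_disks() {
    awk 'NF >= 3 && $(NF-1) == "->" && $(NF-2) !~ /-part[0-9]+$/ {
        name = $(NF-2)
        target = $NF
        sub(/^.*\//, "", target)   # ../../sda -> sda
        bus = name
        sub(/-.*/, "", bus)        # ata-ST2000LM015-... -> ata
        print bus, target
    }'
}

# Intended use on the device:
#   ls -al /dev/disk/by-id | classify_disks
```

The wwn- entries are reported too, since they are just a second name for the same SATA disk.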

Happy to continue from this point however we need. To be clear, I do not need the Ubuntu install, and can remove it at any time. It was useful to get the USB booting, but we seem to be past that now.

Take care,

Caligari

Hi, thanks for coming back to us.
So, /dev/sda is the internal drive and /dev/sdb is the USB drive.
While running from the USB drive, please manually run a fsck on the /dev/sda1 and /dev/sda2 partitions and make sure they are clean.

Then re-try the install of the flasher image of BalenaOS as instructed in the Getting Started. Program the image into a USB drive using Etcher, boot the flasher image, wait for the device to shut down, remove the USB drive and boot normally.

Once the disk errors have disappeared I would expect no problems, but if there are please attach the output of journalctl --no-pager so we can take a look.

OK. So I ran those fsck checks, and there were no problems on the /dev/sda1 and /dev/sda2 partitions.

So I grabbed another USB and used etcher to flash balena-cloud-intel-nuc-2.50.1+rev1-dev-v11.4.10.img on to it. I booted that, and waited for it to do its magic and turn off. I removed the usb and turned it back on, and waited for it to expand that partition out.

Once that was complete, I used ssh to login, and checked where we were at:

root@balena:~# balena-engine version
Client:
 Version:           18.09.17-dev
 API version:       1.39
 Go version:        go1.10.8
 Git commit:        2ab17e0536b6a4528b33c75e8f350447e9882af0
 Built:             Mon May 11 15:17:45 2020
 OS/Arch:           linux/amd64
 Experimental:      false
Cannot connect to the balenaEngine daemon at unix:///var/run/balena-engine.sock. Is the balenaEngine daemon running?

That’s not good. So I ran the journalctl command, and here are the results:
balena_logs_29_07.log (121.0 KB)

The key lines appear to be:

Jul 28 14:17:44 localhost sh[740]: Resizing the filesystem on /dev/disk/by-state/resin-data to 1952801880 (1k) blocks.
Jul 28 14:17:44 localhost sh[740]: The filesystem on /dev/disk/by-state/resin-data is now 1952801880 (1k) blocks long.
Jul 28 14:17:44 localhost kernel: EXT4-fs (sda6): ext4_check_descriptors: Block bitmap for group 131056 not in group (block 35184616189952)!
Jul 28 14:17:44 localhost kernel: EXT4-fs (sda6): group descriptors corrupted!
Jul 28 14:17:44 localhost resin-partition-mounter[758]: mount: /run/tmp.gN6HRW0bOF: mount(2) system call failed: Structure needs cleaning.
Jul 28 14:17:44 localhost resin-partition-mounter[758]: umount: /run/tmp.gN6HRW0bOF: not mounted.
Jul 28 14:17:44 localhost resin-partition-mounter[758]: resin-data: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
Jul 28 14:17:44 localhost resin-partition-mounter[758]:         (i.e., without -a or -p options)

Which appears to be exactly the same error as before. At least it is consistent, right?
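
To pull lines like these out of a long journal quickly, a filter along the following lines may help. It is only a sketch: find_fs_errors is a hypothetical helper name, and the patterns are simply the messages quoted in this thread, not an exhaustive list of ext4 errors.

```shell
#!/bin/sh
# Hypothetical helper: keep only journal lines that match the filesystem
# error messages seen in this thread. Reads the journal text on stdin.
find_fs_errors() {
    grep -E 'EXT4-fs .*(corrupted|not in group)|UNEXPECTED INCONSISTENCY|Structure needs cleaning'
}

# Intended use on the device:
#   journalctl --no-pager | find_fs_errors
```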

Let me know what is next. I’m off to bed, though, so it might be a while until I respond again - sorry about that!

Hi. Just a quick question. Throughout this whole process you have been flashing the same image, correct? It still looks like a bad disk (installation does not take that long in normal circumstances), but there might be a chance that the image is corrupted in just the right way to cause these issues. Could you try redownloading the image and reflashing from it?

And thanks for the patience.

Note that smartctl may report no problems even with a damaged disk as it relies on the disk itself to report issues. And Ubuntu may simply not have hit the (presumably) bad sectors.

What’s your NUC’s model?

Additionally, could you run the following and post the output after opening a terminal to the host OS:

balena run -it --rm --privileged  nixery.dev/shell/smartmontools smartctl -a /dev/sda
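
When reading smartctl output, the attributes most often associated with failing sectors can be picked out with a filter like the sketch below. It assumes the usual ten-column attribute table that smartctl -A (or -a) prints for ATA drives, with the raw value in the last column. sector_health is a hypothetical helper name, and vendors do not all report these attributes the same way, so treat the result as a hint only.

```shell
#!/bin/sh
# Hypothetical helper: read smartctl attribute output on stdin and print
# the raw values of three sector-related attributes.
sector_health() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable)$/ {
        print $2, $10   # column 10 is RAW_VALUE in the attribute table
    }'
}

# Intended use (from any environment where smartctl can reach the disk):
#   smartctl -A /dev/sda | sector_health
```

Non-zero raw values here would point towards the disk, while all zeros would not rule it out, for the reason given above.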

Before doing anything else (so with the same setup that I created yesterday, which is not working yet):

root@balena:~# balena run -it --rm --privileged  nixery.dev/shell/smartmontools smartctl -a /dev/sda
balena: Cannot connect to the balenaEngine daemon at unix:///var/run/balena-engine.sock. Is the balenaEngine daemon running?.

Also, my NUC model is NUC8i3BEH, and I updated to the newest firmware from Intel - 0081, from 5/4/20.

I’m downloading the image again as I type and will let you know how that goes. Thanks!

Hi, can you also do a test to provision a device with an older release, 2.38.3+rev4? When you do this test, please flash the image you download from the dashboard and let it do its thing and shut down by itself. Then you can power it back on and let us know if that version works OK.

I’m just about out of time, this morning (local time), but I will do this later today.

I have downloaded the image again, and flashed that, and let it do the install. It boots to the same point, without the server running. It seems to be exactly the same issue.

I’m inclined to run some disk tests from the hdd manufacturer, as well, to see if I can identify flaws with the hardware itself. I’ll let you know if that turns anything up.