JN30B Nano full storage: can't bring up containers

Hi all

Since deploying a new release, some devices’ status went from online to offline in the BalenaCloud dashboard without any clear cause. The first step was to rule out hardware failure by replacing almost all of the vital components.

After re-flashing the device, the unwanted behavior was resolved and the device worked perfectly for days. When the device went offline again, the serial port showed that the boot loader stages were handled correctly and were the same as any other device working as it should.

Assertion: the device can’t bring up its containers, because none of the containers’ prints can be found in the logs. Knowing that eMMC storage was getting critically low on some devices, we recreated the problem by writing random data to storage until it was full. We did this on a freshly flashed device that was working perfectly fine.

As storage was being filled, the device was still accessible and showed an online status in the BalenaCloud dashboard. As soon as the storage was 100% full, a reboot of the device made the unwanted behavior appear again. The same boot logs were captured as before.

My question is as follows: why is the device unable to bring up its containers when storage is full, and is there any protection against this? Is this something we should check ourselves, or does Balena provide a protection mechanism?

Thanks in advance for looking into this!

Hi @robbeg,

Thanks for reaching out and for the details on your findings. This is expected behavior: as the storage fills up, containers fail to start, and new releases run into similar issues. You can find this in our documentation here: Development Anti-patterns - Balena Documentation

On protection against such scenarios, we do have some recommendations for general cases such as logs filling up storage. I have also reached out to our OS team to see if there are any automated measures for this.
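
For checking this yourselves in the meantime, here is a minimal sketch of a storage self-check that could run inside one of your containers. It is only an illustration, not a Balena-provided mechanism: the `/data` path, the 80% and 95% thresholds, and the function names are all assumptions you would adapt to your application.

```python
import os
import shutil


def used_percent(path):
    """Return how full the filesystem holding `path` is, as a percentage."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def classify(pct, warn=80.0, critical=95.0):
    """Map a usage percentage to a coarse status (thresholds are assumptions)."""
    if pct >= critical:
        return "critical"
    if pct >= warn:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # "/data" is an assumed mount point; fall back to "/" if it is not present.
    path = "/data" if os.path.isdir("/data") else "/"
    pct = used_percent(path)
    print(f"{path}: {pct:.1f}% used, status: {classify(pct)}")
```

Run periodically, a check like this could stop the application from writing, or clean up old files, before the storage is completely full.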

Regards,
N

Hi @nitish

The docs indicate that full storage could indeed be the issue. Now, the goal is to find out which files are taking up that much space so that the code responsible for it can be rewritten.
To do this, I stopped the booting device in the U-Boot stage, which showed that the logs were not taking up much space (~ 35 MB) of the data partition. The rest of the files were hard to examine, as there are not many useful commands available in U-Boot.
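
Once a full shell is available, something like the sketch below could be used to list the biggest files. It assumes Python is available in the environment that can see the partition (for example in a container), and the `/data` path and the `largest_files` name are just placeholders for illustration.

```python
import heapq
import os
import stat


def largest_files(root, limit=20):
    """Return up to `limit` (size_in_bytes, path) tuples, largest first."""
    sizes = []
    # os.walk does not follow symlinked directories by default.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                info = os.lstat(full)
            except OSError:
                continue  # file vanished or is not readable
            if stat.S_ISREG(info.st_mode):  # skip symlinks, sockets, devices
                sizes.append((info.st_size, full))
    return heapq.nlargest(limit, sizes)


if __name__ == "__main__":
    # "/data" is an assumed path; fall back to the current directory.
    root = "/data" if os.path.isdir("/data") else "."
    for size, path in largest_files(root):
        print(f"{size / 1e6:10.1f} MB  {path}")
```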

To be able to use more advanced command line tools, the idea was to make sure the device could fully boot by either one of two options:

(1) Remove some files so that the containers can be brought up again
(2) Modify the U-Boot bootargs so that init=/bin/sh is passed to the kernel command line

Using one of these options, the file system can be inspected and the files responsible for filling up storage can be found. However, some problems arise when actually trying to execute these theories.

(1) As the resin-data partition uses the ext4 file system, there are no command line tools available in U-Boot to remove files. Is there any way known to you guys to overcome this problem?
(2) When setting the bootargs environment variable to init=/bin/sh, or when appending it to the already existing cbootargs variable, no shell becomes available. Has anybody done this before who could provide me with some tips/tricks?

Thanks for looking into this and to the OS team for checking out any automated measures to prevent completely filling up storage.

Hi Robbe, sorry for the delayed reply. I don’t know of any U-Boot tooling that will enable access to the ext4 partition, so I think that may not work. But, for theory number 2, you should be able to make use of the extra_os_cmdline U-Boot argument to launch the shell. Additionally, you might even be able to just execute a shell from U-Boot to reach an initramfs, which might be enough to get you going. Hope that helps!

Hi David

No problem. I don’t know exactly when I’ll be able to test this because priorities have shifted.
If I do, I’ll let you know the results!