I have an Intel NUC Kit NUC6CAYS
running balenaOS 2.46.0+rev1.
It is directly connected to my router via Ethernet (as Wi-Fi is not working; see this forum post).
The problem is that after some time (from less than an hour up to several hours) it goes offline.
It is reported offline in my balena dashboard, and I also cannot ping it.
To bring the device back online I have to power off / power on the device using the power button.
Any idea how I can troubleshoot this? I have no clue what the root cause might be.
This morning (at 5:44 CET) I faced another instance of this problem (the system had then been running for a bit more than 6 hours).
Note that my system not only went offline (pinging it from my local machine gave no response) but all my Docker services also stopped running. I can confirm this because I have a simple node-red service that writes a timestamp to a file every minute, and this writing stopped at 05:44 and only resumed after the restart at 08:23. See its contents here:
I am also monitoring the system resources of my NUC using grafana - telegraf - influxdb and there I also see that all monitoring stopped at 5:44.
The only strange thing I see in the charts is that the CPU usage (system and user) slowly increases over time, which cannot be explained by any of the Docker services using more CPU. So it seems that it is balenaOS itself that is using more CPU over time.
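For anyone who wants to pin down the outage window from a heartbeat file like the node-red one described above, here is a minimal sketch. It assumes one timestamp per line in the format `YYYY-MM-DD HH:MM:SS`; the actual file format was not posted, and the date in the sample data is made up. The function name `find_gaps` is only illustrative.

```python
from datetime import datetime, timedelta


def find_gaps(lines, fmt="%Y-%m-%d %H:%M:%S", max_gap=timedelta(minutes=2)):
    """Return (last_seen, next_seen) pairs where consecutive timestamps
    are more than max_gap apart."""
    stamps = [datetime.strptime(line.strip(), fmt) for line in lines if line.strip()]
    return [(a, b) for a, b in zip(stamps, stamps[1:]) if b - a > max_gap]


# Sample data: the times match the outage described in the post,
# the date is an assumption.
sample = [
    "2020-01-15 05:43:00",
    "2020-01-15 05:44:00",
    "2020-01-15 08:23:00",
    "2020-01-15 08:24:00",
]

for last_seen, next_seen in find_gaps(sample):
    print(f"no writes between {last_seen} and {next_seen} ({next_seen - last_seen})")
```

The 2-minute threshold is a guess that tolerates one missed write; tighten or loosen it to match how often the service really writes.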
As all the logs are rotated after power cycling the device, I think you should enable persistentLogging so the logs won’t be wiped. You can then send them to us, or grant support access so we can have a look at your device and investigate the issue.
So I would say that when the device dies again, after rebooting it, grab /var/log/messages and a dump from dmesg, and make sure support access is enabled. Then just drop a reply here with the details and someone should be able to take a look and see what’s up.
As mentioned before: I set persistent logging to true in config.json in the folder /mnt/boot, but when I check it now it is set to false. (I also don’t see a folder /var/log/journal.)
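A quick way to double-check what the device actually has on disk is to read the setting back from config.json. This is a minimal sketch, assuming the key is named `persistentLogging` (as used earlier in this thread) and that the file lives at /mnt/boot/config.json; accepting the string "true" as well as a JSON boolean is my own assumption, and the helper name is hypothetical.

```python
import json
from pathlib import Path


def persistent_logging_enabled(config_path):
    """Return True only if config.json has persistentLogging switched on."""
    config = json.loads(Path(config_path).read_text())
    value = config.get("persistentLogging", False)
    # Tolerate both a JSON boolean and the string "true" (assumption).
    return value is True or str(value).lower() == "true"


config_file = Path("/mnt/boot/config.json")
if config_file.exists():
    print("persistentLogging:", persistent_logging_enabled(config_file))
    print("journal directory present:", Path("/var/log/journal").is_dir())
```

If this prints False right after a reboot, the value was most likely overwritten, which would match what is described above.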
FYI, here are the contents of /var/log/messages (I have masked the SSH keys as I don’t know whether I can make them public):
Jan 15 14:01:35 localhost syslog.info syslogd started: BusyBox v1.30.1
Jan 15 14:10:13 localhost auth.err sshd[3897]: error: Could not load host key: /etc/ssh/hostkeys/ssh_host_dsa_key
Jan 15 14:10:15 localhost auth.info sshd[3897]: Accepted publickey for root from 52.4.252.97 port 35862 ssh2: RSA SHA256:gMJ9L............qUQ
Jan 15 14:13:20 localhost auth.err sshd[4321]: error: Could not load host key: /etc/ssh/hostkeys/ssh_host_dsa_key
Jan 15 14:13:22 localhost auth.info sshd[4321]: Accepted publickey for root from 52.4.252.97 port 57058 ssh2: RSA SHA256:gMJ9............JhRqUQ
If persistent logging isn’t enabled then it’s unlikely to show anything. Could you enable support access and post the device UUID, please? I will see if anything remains.
Sadly, it seems the persistent log wasn’t enabled; so that’s a pain.
Could you enable persistent logging via the device dashboard, under device configuration? The device should then reboot and the change should be reflected in /mnt/boot/config.json. Then, if it hangs again, we should have the logs.
@richbayliss
The problem happened again at 18:26 CET (this is 17:26 in the log files, as they are expressed in UTC!).
This is 26 minutes after I rebooted the device (I rebooted it at 18:00 CET, or 17:00 UTC).
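To avoid off-by-one-hour mistakes when matching local times against the UTC log files, the conversion above can be sketched as follows. It assumes a fixed UTC+1 offset for CET (fine for January, wrong during summer time), and the calendar date is a placeholder since only the times were given.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC+1 offset; this deliberately ignores summer time (assumption).
CET = timezone(timedelta(hours=1), "CET")


def cet_to_utc(local):
    """Interpret a naive datetime as CET and return it in UTC."""
    return local.replace(tzinfo=CET).astimezone(timezone.utc)


# The date is a placeholder; only the times come from the post.
reboot = cet_to_utc(datetime(2020, 1, 16, 18, 0))
failure = cet_to_utc(datetime(2020, 1, 16, 18, 26))

print("look in the logs around", failure.strftime("%H:%M UTC"))
print("uptime before failure:", failure - reboot)
```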
Thanks for the logs. There’s nothing obvious to us there, though we can’t see what the service containers are doing. Would it be possible to give us the unaltered logs (e.g. not just a text output)? This will allow us to see the entire service output too (though I’m not sure this will help us). Just a quick question to make sure: none of your service containers are attempting to reboot the device (for example, via D-Bus)?
Hi, thank you for the extended support access.
We haven’t been able to identify a root cause yet and are still searching through the various logs as I write this. In the meantime, could you try to include the specific firmware for your NUC in a custom image? I see there’s a PR already up here with what you need to start building a custom image.
You can find more information in the thread you linked, and also have a look at our docs on custom builds here.
Debugging this with only 1 working interface is kinda hard, so setting up the wifi would speed this up a lot
Then I cloned my balena-intel fork in that poky container and updated the submodule as specified in https://www.balena.io/os/docs/custom-build/, but when running ./balena-yocto-scripts/build/barys -m intel I get the error below about npm not being installed in my poky container.
pokyuser@ae30fa5181cf:/workdir/balena-intel$ ./balena-yocto-scripts/build/barys -m intel
Building JSON manifest...
/workdir/balena-intel/balena-yocto-scripts/build/build-device-type-json.sh: line 20: npm: command not found
/workdir/balena-intel/balena-yocto-scripts/build/build-device-type-json.sh: ERROR - Please make sure the 'npm' package is installed and working before running this script.
[000000000][ERROR]Could not generate .json file(s).
pokyuser@ae30fa5181cf:/workdir/balena-intel$
I tried to install npm but I don’t have permissions to do this in my container.
So I am stuck and I am also not sure if I am doing it right.
Thanks, by running the 2 following commands in a terminal window to install npm and jq in my docker container I was able to run ./balena-yocto-scripts/build/barys -h
… but when launching the command ./balena-yocto-scripts/build/barys -m genericx86-64
I get the error below:
[000000001][LOG]BalenaOS build initialized in directory: build.
[000000002][LOG]Run build for genericx86-64: MACHINE=genericx86-64 bitbake resin-image-flasher
[000000002][LOG]This might take a while ...
ERROR: Unable to start bitbake server (None)
ERROR: Server log for this session (/workdir/balena-intel/build/bitbake-cookerdaemon.log):
--- Starting bitbake server pid 1595 at 2020-01-17 16:39:44.800394 ---
Traceback (most recent call last):
  File "/workdir/balena-intel/layers/poky/bitbake/lib/bb/cookerdata.py", line 282, in parseBaseConfiguration
    bb.event.fire(bb.event.ConfigParsed(), self.data)
  File "/workdir/balena-intel/layers/poky/bitbake/lib/bb/event.py", line 215, in fire
    fire_class_handlers(event, d)
  File "/workdir/balena-intel/layers/poky/bitbake/lib/bb/event.py", line 124, in fire_class_handlers
    execute_handler(name, handler, event, d)
  File "/workdir/balena-intel/layers/poky/bitbake/lib/bb/event.py", line 96, in execute_handler
    ret = handler(event)
  File "/workdir/balena-intel/build/../layers/poky/meta/classes/base.bbclass", line 238, in base_eventhandler
    setup_hosttools_dir(d.getVar('HOSTTOOLS_DIR'), 'HOSTTOOLS', d)
  File "/workdir/balena-intel/build/../layers/poky/meta/classes/base.bbclass", line 142, in setup_hosttools_dir
    bb.fatal("The following required tools (as specified by HOSTTOOLS) appear to be unavailable in PATH, please install them in order to proceed:\n %s" % " ".join(notfound))
  File "/workdir/balena-intel/layers/poky/bitbake/lib/bb/__init__.py", line 110, in fatal
    raise BBHandledException()
bb.BBHandledException
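The traceback ends in setup_hosttools_dir, which means bitbake could not find some required host tools on PATH inside the container. The log above does not show which ones, so as a rough sketch the same kind of check can be done by hand. The tool names below are only an illustrative list, not the actual missing ones; the real names should appear in the bb.fatal message or in the HOSTTOOLS variable of the build.

```python
import shutil


def missing_tools(tools):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# Illustrative list only (assumption); replace it with the names
# reported by bitbake for your build.
candidates = ["chrpath", "diffstat", "gawk", "cpio"]

print("not found on PATH:", missing_tools(candidates))
```

Running this inside the poky container before starting barys should show which packages still need to be installed there.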