My Intel NUC Kit NUC6CAYS frequently goes offline / down

janvda · January 14, 2020, 9:02pm

Hi,

I am having an Intel NUC Kit NUC6CAYS
running balenaOS 2.46.0+rev1

It is directly connected to my router via ethernet (as wifi is not working - see this forum post).

The problem is that after some time (from less than an hour up to several hours) it goes offline.
It is reported offline in my balena dashboard but I can also not ping to it.

To bring the device back online I have to power off / power on the device using the power button.

Any idea how I can troubleshoot this ? I have no clue what might be the root cause.

janvda · January 15, 2020, 7:57am

This morning (at 5:44 CET) I faced another instance of this problem (The system was then running for a bit more than 6 hours).

Note that my system not only went offline (pinging from local machine didn’t give a response ) but also all my docker services stopped running. I can confirm this as I have simple node-red service that is writing a timestamp every minute in a file and this writing stopped at 05:44 before it got restarted at 8:23. see here the contents of it:

2020-01-15T05:41:23.735Z
2020-01-15T05:42:23.744Z
2020-01-15T05:43:23.754Z
2020-01-15T08:23:31.645Z
2020-01-15T08:24:31.644Z
2020-01-15T08:25:31.645Z
2020-01-15T08:26:31.660Z
2020-01-15T08:27:31.676Z
2020-01-15T08:28:31.691Z

I am also monitoring the system resources of my NUC using grafana - telegraf - influxdb and there I also see that all monitoring stopped at 5:44.

The only strange thing I see in the charts is that the CPU resources (system and user) are slowly increasing over time which cannot be explained by any of the docker services using more CPU. So it seems that it is the balena OS that is using more CPU resources over time.

nghiant2710 · January 15, 2020, 11:51am

Hi,

As all the logs are rotated after power cycling the device so I think you should enable persistentLogging so logs won’t be wiped and you can send them to us or grant support access so we can have a look into your device to investigate the issue.

janvda · January 15, 2020, 1:13pm

@nghiant2710, Thanks for your response.

I have activated persistentLogging.
So which specific log files do you want me to send to you once the problem reappears ?

I have also granted support access to my device for one week.

Is there something more you need ?
Jan.

richbayliss · January 15, 2020, 1:38pm

So I would say that when the device dies again, after rebooting it grab the /var/log/messages and a dump from dmesg and make sure support access is enabled. Then just drop a reply here with the details and someone should be able to take a look and see what’s up.

janvda · January 15, 2020, 2:30pm

The system went down again at 14:21 CET.
FYI the lines logged by the 1 minute logging node-red service:

2020-01-15T14:16:52.379Z
2020-01-15T14:17:52.385Z
2020-01-15T14:18:52.399Z
2020-01-15T14:19:52.413Z
2020-01-15T14:20:52.427Z
2020-01-15T15:01:52.318Z
2020-01-15T15:02:52.318Z
2020-01-15T15:03:52.333Z
2020-01-15T15:04:52.348Z
2020-01-15T15:05:52.364Z

As told before: I have set persistent logging to true in config.json in folder \mnt\boot but when I check it now it is set to false. (I also don’t see a folder /var/log/journal)

{"persistentLogging":false,"apiEndpoint":"https://a...

I am pretty sure I have set it to true as specified on page https://www.balena.io/docs/reference/OS/overview/2.x/

FYI the contents of /var/log/messages (I have masked the ssh keys as I don’t know I can make them public)

Jan 15 14:01:35 localhost syslog.info syslogd started: BusyBox v1.30.1
Jan 15 14:10:13 localhost auth.err sshd[3897]: error: Could not load host key: /etc/ssh/hostkeys/ssh_host_dsa_key
Jan 15 14:10:15 localhost auth.info sshd[3897]: Accepted publickey for root from 52.4.252.97 port 35862 ssh2: RSA SHA256:gMJ9L............qUQ
Jan 15 14:13:20 localhost auth.err sshd[4321]: error: Could not load host key: /etc/ssh/hostkeys/ssh_host_dsa_key
Jan 15 14:13:22 localhost auth.info sshd[4321]: Accepted publickey for root from 52.4.252.97 port 57058 ssh2: RSA SHA256:gMJ9............JhRqUQ

richbayliss · January 15, 2020, 2:32pm

if persistent logging isnt enable then its unlikely to show anything. Could you enable support access and post the link to the device UUID please, I will see if anything remains.

janvda · January 15, 2020, 2:42pm

@richbayliss thanks for the swift response.

Support access is enabled and I have shared the device UUID with you via a forum message.

Here below the output of dmesg

dmesg.log (50.8 KB)

richbayliss · January 15, 2020, 2:48pm

Sadly, it seems the persistent log wasn’t enabled; so that’s a pain.

Could you enable persistent logging via the device dashboard, in device configuration and it should reboot and the changes be reflected in /mnt/boot/config.json. Then, if it hangs again we should have the logs

janvda · January 15, 2020, 3:03pm

@richbayliss Thanks for the response.
I have enabled it (and checked it also).
I will notify you once the issue reappears.

richbayliss · January 15, 2020, 3:11pm

awesome - let’s hope it captures something

janvda · January 15, 2020, 6:38pm

@richbayliss
The problem happened again at 18:26 CET (this is 17:26 in the log files as they are expressed in UTC !)
This is 26 minutes after I have rebooted the device (I rebooted it at 18:00 CET or 17:00 UTC).

2020-01-15T18:23:58.924Z
2020-01-15T18:24:58.937Z
2020-01-15T18:25:58.941Z
2020-01-15T19:30:55.004Z
2020-01-15T19:31:55.003Z

support access is still enabled !

FYI the log files
dmesg.2020-01-15_1933.log (51.3 KB)
journalctl.2020-01-15_1933.log (862.1 KB)

FYI the following is logged before the reboot in journalctl.2020-01-15_1933.log:

Jan 15 17:19:00 0365728 7e1ec30bd130[787]: [175B blob data]
Jan 15 17:19:03 0365728 balenad[787]: time="2020-01-15T17:19:03.790276245Z" level=info msg="shim balena-engine-containerd-shim started" address=/containerd-shim/moby/31bf61536b4a427ce4c1a9c43fae7ab38b8032
9409af40c66ea80462a482263f/shim.sock debug=false pid=5223
Jan 15 17:19:04 0365728 balenad[787]: time="2020-01-15T17:19:04.738986629Z" level=info msg="shim reaped" id=31bf61536b4a427ce4c1a9c43fae7ab38b80329409af40c66ea80462a482263f
Jan 15 17:19:04 0365728 balenad[787]: time="2020-01-15T17:19:04.747271262Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jan 15 17:20:00 0365728 7e1ec30bd130[787]: [175B blob data]
Jan 15 17:20:59 0365728 aa7aaab0f1af[787]: [160B blob data]
Jan 15 17:20:59 0365728 aa7aaab0f1af[787]: [95B blob data]
Jan 15 17:21:00 0365728 7e1ec30bd130[787]: [175B blob data]
Jan 15 17:21:06 0365728 b2bc00404067[787]: [api]     GET /v1/healthy 200 - 3.767 ms
Jan 15 17:21:06 0365728 resin-supervisor[2813]: [api]     GET /v1/healthy 200 - 3.767 ms
Jan 15 17:21:08 0365728 b2bc00404067[787]: [debug]   Attempting container log timestamp flush...
Jan 15 17:21:08 0365728 resin-supervisor[2813]: [debug]   Attempting container log timestamp flush...
Jan 15 17:21:08 0365728 b2bc00404067[787]: [debug]   Container log timestamp flush complete
Jan 15 17:21:08 0365728 resin-supervisor[2813]: [debug]   Container log timestamp flush complete
Jan 15 17:22:00 0365728 7e1ec30bd130[787]: [175B blob data]
Jan 15 17:22:05 0365728 balenad[787]: time="2020-01-15T17:22:05.368874386Z" level=info msg="shim balena-engine-containerd-shim started" address=/containerd-shim/moby/11800726a02e678ccde5e251aa271b9822a592e078900ac47692e0ed0b665a80/shim.sock debug=false pid=5632
Jan 15 17:22:06 0365728 balenad[787]: time="2020-01-15T17:22:06.320084333Z" level=info msg="shim reaped" id=11800726a02e678ccde5e251aa271b9822a592e078900ac47692e0ed0b665a80
Jan 15 17:22:06 0365728 balenad[787]: time="2020-01-15T17:22:06.328754038Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jan 15 17:23:00 0365728 7e1ec30bd130[787]: [175B blob data]
Jan 15 17:23:59 0365728 7e1ec30bd130[787]: [257B blob data]
Jan 15 17:23:59 0365728 7e1ec30bd130[787]: [384B blob data]
.... I deleted several blob data logs.
Jan 15 17:23:59 0365728 7e1ec30bd130[787]: [272B blob data]
Jan 15 17:23:59 0365728 7e1ec30bd130[787]: [385B blob data]
Jan 15 17:24:30 0365728 7e1ec30bd130[787]: [1.0K blob data]
Jan 15 17:25:00 0365728 7e1ec30bd130[787]: [175B blob data]
Jan 15 17:25:07 0365728 balenad[787]: time="2020-01-15T17:25:07.102932215Z" level=info msg="shim balena-engine-containerd-shim started" address=/containerd-shim/moby/597bb61b3f3a2fdc63519858271a520aba43a1
740cb36195104a6112ae65aa9f/shim.sock debug=false pid=5967
Jan 15 17:25:08 0365728 balenad[787]: time="2020-01-15T17:25:08.077695659Z" level=info msg="shim reaped" id=597bb61b3f3a2fdc63519858271a520aba43a1740cb36195104a6112ae65aa9f
Jan 15 17:25:08 0365728 balenad[787]: time="2020-01-15T17:25:08.086342205Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jan 15 17:25:14 0365728 7e1ec30bd130[787]: [246B blob data]
Jan 15 17:25:14 0365728 7e1ec30bd130[787]: [250B blob data]
.... I deleted several blob data logs.
Jan 15 17:25:59 0365728 aa7aaab0f1af[787]: [107B blob data]
Jan 15 17:26:00 0365728 7e1ec30bd130[787]: [175B blob data]
-- Reboot --
Jan 15 18:30:34 localhost kernel: Linux version 5.2.10-yocto-standard (oe-user@oe-host) (gcc version 8.3.0 (GCC)) #1 SMP PREEMPT Tue Dec 24 06:42:14 UTC 2019
Jan 15 18:30:34 localhost kernel: Command line: BOOT_IMAGE=/boot/bzImage root=LABEL=resin-rootA rootwait quiet loglevel=0 splash udev.log-priority=3 vt.global_cursor_default=0 intel_idle.max_cstate=1
Jan 15 18:30:34 localhost kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'

janvda · January 16, 2020, 7:25am

Another instance happened last night, I have restarted the system (it is online now) and support access is still granted.

More precisely the device stopped at 3:33 CET (= 2:33 UTC) and I restarted it at 7:41 CET (= 6:41 UTC)

2020-01-16T03:30:59.189Z
2020-01-16T03:31:59.201Z
2020-01-16T03:32:59.212Z
2020-01-16T07:41:55.780Z
2020-01-16T07:42:55.780Z

These are the messages logged just before the reboot as reported by journalctl:

Jan 16 02:26:37 0365728 resin-supervisor[2766]: [api]     GET /v1/healthy 200 - 2.845 ms
Jan 16 02:27:00 0365728 7e1ec30bd130[769]: [175B blob data]
Jan 16 02:28:00 0365728 7e1ec30bd130[769]: [175B blob data]
Jan 16 02:29:00 0365728 7e1ec30bd130[769]: [175B blob data]
Jan 16 02:29:18 0365728 balenad[769]: time="2020-01-16T02:29:18.825723210Z" level=info msg="shim balena-engine-containerd-shim started" address=/containerd-shim/moby/c12684f54758de3a21b273fc696e3e3d1104f0
1a04bb0b80ad73fbae65ae89b5/shim.sock debug=false pid=30951
Jan 16 02:29:19 0365728 balenad[769]: time="2020-01-16T02:29:19.796847068Z" level=info msg="shim reaped" id=c12684f54758de3a21b273fc696e3e3d1104f01a04bb0b80ad73fbae65ae89b5
Jan 16 02:29:19 0365728 balenad[769]: time="2020-01-16T02:29:19.805067909Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jan 16 02:30:00 0365728 7e1ec30bd130[769]: [175B blob data]
Jan 16 02:31:00 0365728 7e1ec30bd130[769]: [175B blob data]
Jan 16 02:31:03 0365728 b2bc00404067[769]: [debug]   Attempting container log timestamp flush...
Jan 16 02:31:03 0365728 resin-supervisor[2766]: [debug]   Attempting container log timestamp flush...
Jan 16 02:31:03 0365728 b2bc00404067[769]: [debug]   Container log timestamp flush complete
Jan 16 02:31:03 0365728 resin-supervisor[2766]: [debug]   Container log timestamp flush complete
Jan 16 02:31:06 0365728 7e1ec30bd130[769]: [229B blob data]
Jan 16 02:31:06 0365728 7e1ec30bd130[769]: [250B blob data]
Jan 16 02:31:37 0365728 b2bc00404067[769]: [api]     GET /v1/healthy 200 - 3.022 ms
Jan 16 02:31:38 0365728 resin-supervisor[2766]: [api]     GET /v1/healthy 200 - 3.022 ms
Jan 16 02:32:00 0365728 7e1ec30bd130[769]: [175B blob data]
Jan 16 02:32:20 0365728 balenad[769]: time="2020-01-16T02:32:20.534425899Z" level=info msg="shim balena-engine-containerd-shim started" address=/containerd-shim/moby/dd6c646c0b81e5590b989611b370b26bcd619b8c93916d24f7c90048fce4fe31/shim.sock debug=false pid=31559
Jan 16 02:32:21 0365728 balenad[769]: time="2020-01-16T02:32:21.498862924Z" level=info msg="shim reaped" id=dd6c646c0b81e5590b989611b370b26bcd619b8c93916d24f7c90048fce4fe31
Jan 16 02:32:21 0365728 balenad[769]: time="2020-01-16T02:32:21.507316777Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jan 16 02:33:00 0365728 7e1ec30bd130[769]: [175B blob data]
-- Reboot --
Jan 16 06:41:34 localhost kernel: Linux version 5.2.10-yocto-standard (oe-user@oe-host) (gcc version 8.3.0 (GCC)) #1 SMP PREEMPT Tue Dec 24 06:42:14 UTC 2019
Jan 16 06:41:34 localhost kernel: Command line: BOOT_IMAGE=/boot/bzImage root=LABEL=resin-rootA rootwait quiet loglevel=0 splash udev.log-priority=3 vt.global_cursor_default=0 intel_idle.max_cstate=1
Jan 16 06:41:34 localhost kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'

hedss · January 16, 2020, 1:22pm

Hi,

Thanks for the logs. There’s nothing obvious to us there, though we can’t see what the service containers are doing. Would it be possible to give us the unaltered logs (eg. not just a text output). This will allow us to see the entire service output too (though I’m not sure this will help us). Just a quick question to make sure, none of your service containers are attempting to reboot the device (for example, via Dbus)?

Thanks and best regards,

Heds

janvda · January 16, 2020, 4:07pm

@hedss thanks for the response.

My containers are not attempting to reboot the device (at least not that I am aware off).

Note that I have enabled device support. I will share the device UUID through a message.

thundron · January 16, 2020, 7:18pm

Hi, thank you for the extended support access.
We weren’t able to identify a root cause yet and are still searching through the various logs as I’m writing you, in the meantime could you try to include the specific firmware for your NUC in a custom image? I see there’s a PR already up with what you need here to start building a custom image.
You can find more information on the thread you linked, while also having a look at our docs on custom builds here

Debugging this with only 1 working interface is kinda hard, so setting up the wifi would speed this up a lot

janvda · January 16, 2020, 11:03pm

At 2020-01-17 00:03 it went down again. I have restarted the device (power off - power on).

2020-01-16T23:57:00.539Z
2020-01-16T23:58:00.552Z
2020-01-16T23:59:00.565Z
2020-01-17T00:00:00.578Z
2020-01-17T00:01:00.590Z
2020-01-17T00:02:00.604Z
2020-01-17T00:03:00.607Z
2020-01-17T00:09:39.367Z
2020-01-17T00:09:55.247Z
2020-01-17T00:10:55.247Z
2020-01-17T00:11:55.254Z
2020-01-17T00:12:55.269Z

Here an extract of what is logged in journalctl around that time:

Jan 16 23:03:01 0365728 7e1ec30bd130[787]: [175B blob data]
Jan 16 23:03:07 0365728 balenad[787]: time="2020-01-16T23:03:07.871917958Z" level=info msg="shim balena-engine-containerd-shim started" address=/containerd-shim/moby/5f518a55e64d600e2f4f82f548cf5ad724453d4513419da9f
38a022c309cc4b2/shim.sock debug=false pid=19423
Jan 16 23:03:08 0365728 balenad[787]: time="2020-01-16T23:03:08.818998855Z" level=info msg="shim reaped" id=5f518a55e64d600e2f4f82f548cf5ad724453d4513419da9f38a022c309cc4b2
Jan 16 23:03:08 0365728 balenad[787]: time="2020-01-16T23:03:08.827824917Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jan 16 23:03:18 0365728 b2bc00404067[787]: [api]     GET /v1/healthy 200 - 8.384 ms
Jan 16 23:03:18 0365728 resin-supervisor[2818]: [api]     GET /v1/healthy 200 - 8.384 ms
-- Reboot --
Jan 16 23:09:18 localhost kernel: Linux version 5.2.10-yocto-standard (oe-user@oe-host) (gcc version 8.3.0 (GCC)) #1 SMP PREEMPT Tue Dec 24 06:42:14 UTC 2019
Jan 16 23:09:18 localhost kernel: Command line: BOOT_IMAGE=/boot/bzImage root=LABEL=resin-rootA rootwait quiet loglevel=0 splash udev.log-priority=3 vt.global_cursor_default=0 intel_idle.max_cstate=1
Jan 16 23:09:18 localhost kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Jan 16 23:09:18 localhost kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'

In the mean time I am trying to create a custom build as you proposed.

So I have forked https://github.com/balena-os/balena-intel and integrated the 1 line change of pull request #276 in my fork: see => https://github.com/janvda/balena-intel
I have created and run a crops/poky container on my macbook as outlined in https://github.com/crops/docker-win-mac-docs/wiki
Then I have cloned my balena-intel fork in that poky container and updated the submodule as specified in https://www.balena.io/os/docs/custom-build/ but when running ./balena-yocto-scripts/build/barys -m intel I am getting below error regarding npm not being installed in my poky container.

pokyuser@ae30fa5181cf:/workdir/balena-intel$ ./balena-yocto-scripts/build/barys -m intel
Building JSON manifest...
/workdir/balena-intel/balena-yocto-scripts/build/build-device-type-json.sh: line 20: npm: command not found
/workdir/balena-intel/balena-yocto-scripts/build/build-device-type-json.sh: ERROR - Please make sure the 'npm' package is installed and working before running this script.
[000000000][ERROR]Could not generate .json file(s).
pokyuser@ae30fa5181cf:/workdir/balena-intel$

I tried to install npm but I don’t have permissions to do this in my container.

So I am stuck and I am also not sure if I am doing it right.

karaxuna · January 17, 2020, 11:05am

Hi, can you try solution described here? https://github.com/crops/poky-container/issues/19#issuecomment-343571950

karaxuna · January 17, 2020, 11:06am

This looks even better: https://github.com/crops/poky-container/issues/19#issuecomment-343601187

janvda · January 17, 2020, 4:32pm

Thanks, by running the 2 following commands in a terminal window to install npm and jq in my docker container I was able to run ./balena-yocto-scripts/build/barys -h

docker exec -u root ae30fa5181cf apt-get install npm -y
docker exec -u root ae30fa5181cf apt-get install jq -y

… but when launching the command ./balena-yocto-scripts/build/barys -m genericx86-64
I am getting below error:

[000000001][LOG]BalenaOS build initialized in directory: build.
[000000002][LOG]Run build for genericx86-64: MACHINE=genericx86-64 bitbake resin-image-flasher 
[000000002][LOG]This might take a while ...
ERROR: Unable to start bitbake server (None)
ERROR: Server log for this session (/workdir/balena-intel/build/bitbake-cookerdaemon.log):
--- Starting bitbake server pid 1595 at 2020-01-17 16:39:44.800394 ---
Traceback (most recent call last):
  File "/workdir/balena-intel/layers/poky/bitbake/lib/bb/cookerdata.py", line 282, in parseBaseConfiguration
    bb.event.fire(bb.event.ConfigParsed(), self.data)
  File "/workdir/balena-intel/layers/poky/bitbake/lib/bb/event.py", line 215, in fire
    fire_class_handlers(event, d)
  File "/workdir/balena-intel/layers/poky/bitbake/lib/bb/event.py", line 124, in fire_class_handlers
    execute_handler(name, handler, event, d)
  File "/workdir/balena-intel/layers/poky/bitbake/lib/bb/event.py", line 96, in execute_handler
    ret = handler(event)
  File "/workdir/balena-intel/build/../layers/poky/meta/classes/base.bbclass", line 238, in base_eventhandler
    setup_hosttools_dir(d.getVar('HOSTTOOLS_DIR'), 'HOSTTOOLS', d)
  File "/workdir/balena-intel/build/../layers/poky/meta/classes/base.bbclass", line 142, in setup_hosttools_dir
    bb.fatal("The following required tools (as specified by HOSTTOOLS) appear to be unavailable in PATH, please install them in order to proceed:\n  %s" % " ".join(notfound))
  File "/workdir/balena-intel/layers/poky/bitbake/lib/bb/__init__.py", line 110, in fatal
    raise BBHandledException()
bb.BBHandledException

Topic		Replies	Views
Intel NUC 10 - no wifi with balenaOS balenaOS	13	904	June 17, 2021
Custom BalenaOS NUC build 'find' error balenaOS	11	1029	April 14, 2020
Intel NUC Kit NUC6CAYH Intel Celeron J3455 , Wifi not provisioning Product support	26	5129	January 29, 2020
[FIXED- USE 2.29.2+rev2] Intel NUC WiFi Doesn't Work Product support	19	2858	March 6, 2019
Unable to complete Getting Started on Intel NUC balenaOS	60	2523	August 10, 2020

My Intel NUC Kit NUC6CAYS frequently goes offline / down

Related topics