Multiple Health Check Errors / Root Cause of Container Failure on NUC?

I have a multi-container system running on a NUC, and am stress-testing it. When the containers fail, I see multiple errors in the Health Check, and I’m having difficulty sorting out what the actual root problem is.

Sometimes I see low memory, but I often get the same symptoms and container failures with plenty of memory available.

I’ve read through the tutorials provided by Balena, but I think my Linux background is too sparse for me to resolve this on my own. Can anyone point me in the right direction for diagnosing and resolving the root problem(s)?

Below is the Health Check output; the Diagnostics run is attached.

Thanks for any advice, and please let me know if this is an incorrect use of this forum.
Sandy

Health Check
{"diagnose_version":"4.21.3","checks":[{"name":"check_balenaOS","success":true,"status":"Supported balenaOS 2.x detected"},{"name":"check_container_engine","success":false,"status":"Some container_engine issues detected: \ntest_container_engine_running_now Container engine balena is NOT running\ntest_container_engine_restarts Container engine balena has 2894 restarts and may be crashlooping (most recent start time: Mon 2021-10-18 15:26:22 UTC)\ntest_container_engine_responding Error querying container engine: Cannot connect to the balenaEngine daemon at unix:///var/run/balena-engine.sock. Is the balenaEngine daemon running?"},{"name":"check_localdisk","success":false,"status":"Some localdisk issues detected: \ntest_data_partition_mounted Data partition not mounted read-write"},{"name":"check_memory","success":false,"status":"Low memory: 2% (96MB) available, 3680MB/3776MB used"},{"name":"check_networking","success":true,"status":"No networking issues detected"},{"name":"check_os_rollback","success":true,"status":"No OS rollbacks detected"},{"name":"check_service_restarts","success":true,"status":"No services are restarting unexpectedly"},{"name":"check_supervisor","success":false,"status":"Supervisor is NOT running"},{"name":"check_temperature","success":true,"status":"No temperature issues detected"},{"name":"check_timesync","success":true,"status":"Time is synchronized"}]}

68e5f8598c35b231be3bdf337eb1916d_diagnostics_2021.10.18_15.30.52+0000.txt (975.3 KB)

Hi,

From the uploaded diagnostics file, I see that the main issue is that balenaEngine isn’t reachable. This is confirmed by the checks JSON you pasted:

{
   "name":"check_container_engine",
   "success":false,
   "status":"Some container_engine issues detected: \ntest_container_engine_running_now Container engine balena is NOT running\ntest_container_engine_restarts Container engine balena has 2894 restarts and may be crashlooping (most recent start time: Mon 2021-10-18 15:26:22 UTC)\ntest_container_engine_responding Error querying container engine: Cannot connect to the balenaEngine daemon at unix:///var/run/balena-engine.sock. Is the balenaEngine daemon running?"
},
{
   "name":"check_localdisk",
   "success":false,
   "status":"Some localdisk issues detected: \ntest_data_partition_mounted Data partition not mounted read-write"
},
{
   "name":"check_memory",
   "success":false,
   "status":"Low memory: 2% (96MB) available, 3680MB/3776MB used"
},
{
   "name":"check_supervisor",
   "success":false,
   "status":"Supervisor is NOT running"
}

Here are the check descriptions for reference: Check Descriptions - Balena Documentation

check_supervisor is related to check_container_engine, as the Supervisor cannot run with an unhealthy engine. The engine on this device appears to be crash looping, having restarted 2894 times. One possible root cause is the crash loop itself, which can occur in low-memory situations; however, there may be other factors at play. Whether check_memory and check_localdisk fail also depends on how you’re stress-testing the system. You mentioned that you’re seeing similar symptoms with plenty of device memory remaining. Are you also seeing the engine crash loop when there’s plenty of memory?

Also, what are you using to stress test your NUC? Thanks!
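In the meantime, here are a few commands you can run from a host OS terminal to watch the engine directly (a sketch; on balenaOS the engine runs as the balena.service systemd unit):

# check whether the engine is up and how it last exited
systemctl status balena.service
# tail recent engine logs
journalctl -u balena -n 100 --no-pager
# only responds while the engine socket is up
balena info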

Regards,
Christina

Christina,
Thanks for the speedy and informative response. A large part of the stress testing is allocating and freeing most of the NUC’s available memory, and indeed we uncovered a memory management problem in how we configured redis.
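For anyone hitting something similar: the usual fix is to cap redis’s memory so it evicts keys instead of growing without bound, along these lines (illustrative values; actual limits depend on the workload):

# cap redis's data size and evict least-recently-used keys at the cap
redis-cli config set maxmemory 256mb
redis-cli config set maxmemory-policy allkeys-lru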

A few days ago, I also saw the same Health Check errors without the low-memory error, but I did not save the Diagnostics file or Health Check report, so I consider this anecdotal until I can reproduce it and save the results for viewing.

And, frankly, I think it makes the most sense to fix the obvious memory management problem. Then we can see if there are other lingering problems.

Thanks again,
Sandy

Hi,

Thanks for the update. Please let us know if you run into any other issues.

Regards

Hi, Christina,
Per your previous recommendation, I deployed memory-management improvements. However, I am still seeing failures with multiple errors, though fortunately no longer memory errors. Can you please review the most recent Health Check (below) and Diagnostics file (attached) and recommend where I should next focus my debugging efforts?

The test scenario: a 4 GB NUC engaging in XBee communication with 5 sensors at 1-minute intervals, receiving sensor data in packets of up to 22 KB, and then uploading the data to the cloud after some minimal processing.

Thanks for your time and any advice you can provide.

Sandy

{
   "diagnose_version":"4.21.3",
   "checks":[
      {
         "name":"check_balenaOS",
         "success":true,
         "status":"Supported balenaOS 2.x detected"
      },
      {
         "name":"check_container_engine",
         "success":false,
         "status":"Some container_engine issues detected: \ntest_container_engine_running_now Container engine balena is NOT running\ntest_container_engine_restarts Container engine balena has 86 restarts and may be crashlooping (most recent start time: Tue 2021-10-26 20:48:07 UTC)\ntest_container_engine_responding Error querying container engine: "
      },
      {
         "name":"check_localdisk",
         "success":false,
         "status":"Some localdisk issues detected: \ntest_data_partition_mounted Data partition not mounted read-write"
      },
      {
         "name":"check_memory",
         "success":true,
         "status":"56% memory available"
      },
      {
         "name":"check_networking",
         "success":false,
         "status":"Some networking issues detected: \ntest_balena_registry: Could not communicate with [registry2.balena-cloud.com](http://registry2.balena-cloud.com) for authentication"
      },
      {
         "name":"check_os_rollback",
         "success":true,
         "status":"No OS rollbacks detected"
      },
      {
         "name":"check_service_restarts",
         "success":true,
         "status":"No services are restarting unexpectedly"
      },
      {
         "name":"check_supervisor",
         "success":false,
         "status":"Supervisor is NOT running"
      },
      {
         "name":"check_temperature",
         "success":true,
         "status":"No temperature issues detected"
      },
      {
         "name":"check_timesync",
         "success":false,
         "status":"Time is not being synchronized via NTP"
      }
   ]
}

68e5f8598c35b231be3bdf337eb1916d_diagnostics_2021.10.25_22.33.03+0000.txt (887 KB)

Hi @smoore,

Thanks for pasting more info. By 4 GB, you mean memory, right?

Which version of balenaOS are you running? Maybe we can get a look at the journal logs on the device, since you are testing this locally (I presume)? Is there anything relevant when you run journalctl -u balena -xef? I noticed in this latest diagnostics JSON that the engine is crash looping, so we may be able to get the reason. If the journal logs say that the engine is getting killed with "timeout" as the reason, could you also paste the output of dmesg? There could be a kernel oops or something similar; I saw this recently.
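If it’s easier, you can capture both to files and attach them here; something like this (assuming /mnt/data, the writable data partition on balenaOS, is available):

# dump the full engine journal and the kernel ring buffer to files
journalctl -u balena --no-pager > /mnt/data/balena-journal.txt
dmesg > /mnt/data/dmesg.txt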

Thanks,
Christina

Hi, Christina,
Thanks for checking in. Here are some answers…

  1. 4GB refers to RAM.
  2. balenaOS 2.83.18+rev1
  3. I did not run journalctl, but will be happy to do so at the next crash.
  4. I did not run dmesg, but will be happy to do so at the next crash.

The NUC is running at a remote field site. I access it through the balena dashboard. When it locks up, I am able to revive it with a reboot command from the Host OS Terminal.

Great, thank you! Please paste the logs here when you get access to them.

Christina,
So, a weird turn of events: recent lockups are back to complaining about "out of memory" among other Health Check errors. When I investigated with ps -eo pid,rss,vsz,cmd --sort -vsz | more, I see that all the memory is taken up by a gazillion processes for:

  • balena info
  • /proc/self/exe --healthcheck /usr/lib/balena/balena-healthcheck --pid 12236
  • bin/sh /usr/lib/balena/balena-healthcheck

My services are still using reasonable amounts of memory. Am I correct that some original error (not sure what it was) spawns these balena processes, maybe due to crash looping, and they then proceed to gobble up all the memory?
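In case it’s useful, here’s roughly how I’m totaling the memory held by these stragglers (the RSS column from ps is in KB):

# sum resident memory of the leftover healthcheck-related processes
ps -eo rss,cmd | grep -E 'balena-healthcheck|balena info' | grep -v grep | awk '{sum += $1} END {printf "%.0f MB\n", sum/1024}'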

20211102ps.pdf (22.8 KB)

20211102dmesg.pdf (99.4 KB)

20211102journalctl-ubalena-xef.pdf (120 KB)

Hi @smoore,

Apologies for the lateness. This appears to be a known internal behavior of the engine, and I believe it’s unintentional (a bug), caused by the engine getting killed instead of gracefully stopped, which doesn’t give it time to clean up its leftover processes. I’ve created an issue which you can track here: Leftover healthcheck processes from engine crash loops · Issue #280 · balena-os/balena-engine · GitHub

From what I’ve read of our internal notes on this particular behavior, it hasn’t seen further investigation yet. How much memory do these leftover processes take up, as far as you can tell? Thanks!
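As a possible stopgap until that’s addressed (untested, so please verify the match first), you might be able to reclaim the memory without a full reboot by killing the leftover processes from a host OS terminal, assuming procps-style pgrep/pkill are available on your OS version:

# list the stragglers first to confirm the pattern matches only them
pgrep -af balena-healthcheck
# then kill them
pkill -f balena-healthcheck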

Regards,
Christina

Hi, Christina,
Thanks for checking in. It looks like when the crash first occurs, I have around 80% of memory available. Then memory gets progressively cluttered with crash-loop processes until there’s only 1-3% left.

I appreciate the link to the bug report so I can follow its progress.

I’m interested in learning more about balenaEngine not being reachable, for debugging on my end. Also, how critical is a constant internet connection between our NUC and Balena? We are using a cell modem at that site, and service is usually good.

Thank you for your time,
Sandy

Hey @smoore,
Thanks for sharing the information about the memory taken up by the leftover balena processes.

Regarding ‘how critical is a constant internet connection between our NUC and Balena’: balenaOS itself does not depend on internet connectivity, and your device will continue to function well without internet (or even without any network) connectivity. An internet connection is definitely needed if you wish to interact with your device (e.g. via balena ssh or the web terminal) or to trigger device checks or device diagnostics.
We have numerous customers employing cellular connectivity for their balena devices, so I don’t think using a cell modem with balenaOS is anything to be worried about. :)
Hope that helps.

Thanks and regards,
Pranav

Attached/pasted below please find a current example of a 12/28/21 system lock-up, with the health checks, dmesg, and journalctl data requested a while back. I was able to run Device Health Checks; however, Device Diagnostics just spins.

Thanks for any insight you can provide about what might be driving this situation. My current workaround is a reboot from a host terminal session, but I am still looking for a deeper understanding of what drives this error state and how I can prevent it.
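In case it helps, here’s what I plan to run from the host terminal during the next lock-up, since check_localdisk keeps flagging the data partition (a sketch; I’m assuming findmnt is available and that the data partition normally mounts read-write at /mnt/data):

# is the data partition still mounted read-write?
findmnt -o TARGET,SOURCE,OPTIONS /mnt/data
# look for I/O errors or a forced read-only remount in the kernel log
dmesg | grep -iE 'i/o error|read-only|remount'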

Thanks again,
Sandy

Health Checks
{
   "diagnose_version":"4.21.3",
   "checks":[
      {
         "name":"check_balenaOS",
         "success":true,
         "status":"Supported balenaOS 2.x detected"
      },
      {
         "name":"check_container_engine",
         "success":false,
         "status":"Some container_engine issues detected: \ntest_container_engine_running_now Container engine balena is NOT running\ntest_container_engine_restarts Container engine balena has 16 restarts and may be crashlooping (most recent start time: Tue 2021-12-28 16:17:10 UTC)\ntest_container_engine_responding Error querying container engine: "
      },
      {
         "name":"check_localdisk",
         "success":false,
         "status":"Some localdisk issues detected: \ntest_data_partition_mounted Data partition not mounted read-write"
      },
      {
         "name":"check_memory",
         "success":true,
         "status":"91% memory available"
      },
      {
         "name":"check_networking",
         "success":false,
         "status":"Some networking issues detected: \ntest_balena_registry: Could not communicate with registry2.balena-cloud.com for authentication"
      },
      {
         "name":"check_os_rollback",
         "success":true,
         "status":"No OS rollbacks detected"
      },
      {
         "name":"check_service_restarts",
         "success":true,
         "status":"No services are restarting unexpectedly"
      },
      {
         "name":"check_supervisor",
         "success":false,
         "status":"Supervisor is NOT running"
      },
      {
         "name":"check_temperature",
         "success":true,
         "status":"No temperature issues detected"
      },
      {
         "name":"check_timesync",
         "success":true,
         "status":"Time is synchronized"
      }
   ]
}

journalctl -u balena -xef

Dec 28 16:29:53 8447c70 systemd[1]: balena.service: Found left-over process 2728 (balena-healthch) in control group while starting unit. Ignoring.
Dec 28 16:29:53 8447c70 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Dec 28 16:29:53 8447c70 systemd[1]: balena.service: Found left-over process 2734 (balena) in control group while starting unit. Ignoring.
Dec 28 16:29:53 8447c70 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Dec 28 16:29:53 8447c70 balenad[2745]: time="2021-12-28T16:29:53.209614083Z" level=info msg="Starting up"
Dec 28 16:29:53 8447c70 balenad[2745]: time="2021-12-28T16:29:53.209738327Z" level=warning msg="Running experimental build"
Dec 28 16:29:53 8447c70 balenad[2745]: chmod /var/lib/docker: read-only file system
Dec 28 16:29:53 8447c70 systemd[1]: balena.service: Main process exited, code=exited, status=1/FAILURE
Dec 28 16:29:53 8447c70 systemd[1]: balena.service: Failed with result 'exit-code'.
Dec 28 16:29:53 8447c70 systemd[1]: Failed to start Balena Application Container Engine.
Dec 28 16:29:53 8447c70 systemd[1]: Dependency failed for Balena Application Container Engine.
Dec 28 16:29:53 8447c70 systemd[1]: balena.service: Job balena.service/start failed with result 'dependency'.

GR_Sandia_wild_sun-dmesg.pdf (114 KB)