Container fails to start - "failed to attach 1 to compat systemd cgroup"

@ajs1k please ping us when you see the problem occur again and we will look into it.

So far so good. Out of my five wallboards one is offline and I’m going to get it restarted shortly, and the other four are running happily. I’ll keep you posted…

468439d25ffaf42298bb7caace3fb776 is now stuck in this crash loop…

Any thoughts? It’s still stuck here…

@ajs1k I’ve tried to reproduce this issue locally using the Dockerfile you shared.
Unfortunately, I haven’t managed to reproduce it. (I had to trim out references to non-existent files, etc.)

I’ve also tried to trace it by accessing the device directly. When I manually ran the image with a command like balena run --rm IMAGE, it ran just fine.

root@468439d:~# balena run --privileged --rm -i -t 18f9cc788806
Systemd init system enabled.
systemd 241 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid)
Detected virtualization docker.
Detected architecture arm.

Welcome to Debian GNU/Linux 10 (buster)!

Set hostname to <f21409246a60>.
Failed to bump fs.file-max, ignoring: Invalid argument
File /lib/systemd/system/systemd-journald.service:12 configures an IP firewall (IPAddressDeny=any), but the local system does not support BPF/cgroup based firewalling.
Proceeding WITHOUT firewalling in effect! (This warning is only shown for the first loaded unit using IP firewalling.)
[  OK  ] Reached target Remote File Systems.
[  OK  ] Set up automount Arbitrary Executable File Formats File System Automount Point.
[  OK  ] Listening on initctl Compatibility Named Pipe.
[  OK  ] Reached target Slices.
[  OK  ] Listening on udev Kernel Socket.

Which led me to believe something else is fishy here.

I stopped the supervisor, deleted the previous container, started the supervisor and the same image started in a new container just fine.

Strange. I wish there was an easy way to reproduce the issue as I’ve seen it on two forum threads now…

Would it be OK if I rebooted the device in an attempt to get it back into the bad state? This hunch comes from the other forum thread, “Container reboot and "Failed to attach 1 to compat systemd cgroup"”, where a reboot put the device into a bad state and a restart of the container fixed the issue.

Feel free to reboot/restart/whatever you need - the device is unusable for us at the moment anyway.

The problem also doesn’t occur straight away: it can take up to about a day before it throws a fit and gets into this weird state.

For the rest of our fleet I might go back to 2.36 and see if that brings us stability. The problem is that, because of what we’re displaying on the wallboard, we have to log in manually every time it restarts, so it’s a fairly visible failure mode.

I see the issue again. This time I managed to spot something in the logs: balena-engine crashed. Unfortunately, I didn’t see the initial part of the stack trace as the logs had rotated. I’m going to try to reproduce it and see if I can catch it in action.

Sep 03 14:15:41 468439d balenad[780]:         /usr/lib/go/src/io/io.go:400 +0x14c
Sep 03 14:15:41 468439d balenad[780]: io.CopyBuffer(0x13f8c60, 0x13e38fc0, 0x7512fdd0, 0x1389e380, 0x144e6000, 0x8000, 0x8000, 0x1, 0x55601, 0xd6a1b4, ...)
Sep 03 14:15:41 468439d balenad[780]:         /usr/lib/go/src/io/io.go:373 +0x5c
Sep 03 14:15:41 468439d balenad[780]: github.com/docker/docker/pkg/pools.Copy(0x13f8c60, 0x13e38fc0, 0x7512fdd0, 0x1389e380, 0x1389e380, 0x0, 0x0, 0x0)
Sep 03 14:15:41 468439d balenad[780]:         /yocto/resin-board/build/tmp/work/cortexa7t2hf-neon-vfpv4-poky-linux-gnueabi/balena/18.09.8-dev+git80d443d400dc>
Sep 03 14:15:41 468439d balenad[780]: github.com/docker/docker/container/stream.(*Config).CopyToPipe.func1.1(0x13f8c60, 0x13e38fc0, 0x7512fd70, 0x1389e380, 0>
Sep 03 14:15:41 468439d balenad[780]:         /yocto/resin-board/build/tmp/work/cortexa7t2hf-neon-vfpv4-poky-linux-gnueabi/balena/18.09.8-dev+git80d443d400dc>
Sep 03 14:15:41 468439d balenad[780]: created by github.com/docker/docker/container/stream.(*Config).CopyToPipe.func1
Sep 03 14:15:41 468439d balenad[780]:         /yocto/resin-board/build/tmp/work/cortexa7t2hf-neon-vfpv4-poky-linux-gnueabi/balena/18.09.8-dev+git80d443d400dc>
Sep 03 14:15:41 468439d balenad[780]: goroutine 368 [select, 4 minutes]:
Sep 03 14:15:41 468439d balenad[780]: github.com/docker/docker/api/server/router/system.(*systemRouter).getEvents(0x13c5a2a0, 0x1409388, 0x145d0a00, 0x1406c8>
Sep 03 14:15:41 468439d balenad[780]:         /yocto/resin-board/build/tmp/work/cortexa7t2hf-neon-vfpv4-poky-linux-gnueabi/balena/18.09.8-dev+git80d443d400dc>
Sep 03 14:15:41 468439d balenad[780]: github.com/docker/docker/api/server/router/system.(*systemRouter).(github.com/docker/docker/api/server/router/system.ge>
Sep 03 14:15:41 468439d balenad[780]:         /yocto/resin-board/build/tmp/work/cortexa7t2hf-neon-vfpv4-poky-linux-gnueabi/balena/18.09.8-dev+git80d443d400dc>
Sep 03 14:15:41 468439d balenad[780]: github.com/docker/docker/api/server/router.cancellableHandler.func1(0x14093e8, 0x145d09e0, 0x1406c88, 0x145a77a0, 0x143>
Sep 03 14:15:41 468439d balenad[780]:         /yocto/resin-board/build/tmp/work/cortexa7t2hf-neon-vfpv4-poky-linux-gnueabi/balena/18.09.8-dev+git80d443d400dc>
Sep 03 14:15:41 468439d balenad[780]: github.com/docker/docker/api/server/middleware.ExperimentalMiddleware.WrapHandler.func1(0x14093e8, 0x145d09e0, 0x1406c8

One more step forward.

If I restart the balena-engine service twice, the app container goes into the bad state.
So we have a way of trying to reproduce it faster now.
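
For anyone wanting to script that reproduction, here is a minimal sketch, not a tested procedure. It assumes the engine runs as a systemd unit named "balena" on the host OS (the thread only says "the balena-engine service"), and the pause between restarts is a guess. The helper names are hypothetical.

```python
import subprocess
import time

# Assumption: the engine's systemd unit is called "balena" on this host OS.
ENGINE_UNIT = "balena"


def restart_commands(unit=ENGINE_UNIT, times=2):
    """Return the systemctl invocations that restart `unit` `times` times."""
    return [["systemctl", "restart", unit] for _ in range(times)]


def restart_engine(run=subprocess.run, pause=10.0, sleep=time.sleep):
    """Restart the engine twice in a row, pausing between the restarts.

    `run` and `sleep` are injectable so the sequence can be dry-run
    without touching a real device.
    """
    commands = restart_commands()
    for index, command in enumerate(commands):
        run(command, check=True)
        if index < len(commands) - 1:
            sleep(pause)  # assumed settle time; adjust as needed


# On the device (as root) this would be called as: restart_engine()
```

After running it on the host OS, the app container logs would be the place to look for the "Failed to attach 1 to compat systemd cgroup" message.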

Ah splendid - that ought to speed things up if you can repro it. (-;

And I’ve managed to reproduce it on my own device too: restart the balena service twice and watch for that same error message.

I do wonder what’s going on with my devices that causes the service to restart anyway? I’m often seeing that the container restarts itself (which causes me to have to log in again on my wallboards). Maybe there’s still another problem to solve beyond the service restarting causing a crash-loop.

Indeed. Nothing should just restart balena-engine, so maybe we are looking at two issues: one that crashes balena-engine, and this systemd restart loop.

I have a suggestion to try.
Can you change the first line of your Dockerfile from

FROM balenalib/%%BALENA_MACHINE_NAME%%

to

FROM balenalib/%%BALENA_MACHINE_NAME%%:stretch

Without a tag you fetch the latest Debian, which is buster and ships systemd 241. I am unable to reproduce the “Failed to attach 1 to compat systemd cgroup” issue on stretch, which uses the older systemd.

Is the issue a mix of host OS systemd, kernel and container OS systemd? I don’t know yet, and this still needs looking into and fixing. My suggestion is meant to let you proceed with your work.
And perhaps that will let us find the other issue, i.e. why balenaEngine restarts (which it shouldn’t do for no reason).

Thanks - I’m giving this a try - will let you know tomorrow if there were any restarts during the night.

Not really any better, I’m afraid. Around two hours in, the web browser locked up. I tried restarting the container and it’s just stuck somehow: VNC connects but gives me a black screen. I had to reboot the Pi to regain control.

Thanks for the update. I have passed on your findings to the engineer who is looking at this, as he is offline right now.

The logs seem to point to something else failing… sigh.
Would it be possible for you to share your application in a zip with me via a direct message? I think it manages to stir up quite a few things, especially on 2.41 (which has a newer Pi kernel, Pi firmware, new systemd, etc.).
Changing those shouldn’t ‘technically’ make a difference… but there are always subtleties…

04.09.19 20:31:02 (+0100)  wallboard  (xfce4-session:129): xfce4-session-WARNING **: failed to run script: Failed to execute child process "/usr/bin/pm-is-supported" (No such file or directory)
04.09.19 20:31:02 (+0100)  wallboard  
04.09.19 20:31:02 (+0100)  wallboard  (xfce4-session:129): xfce4-session-WARNING **: failed to run script: Failed to execute child process "/usr/bin/pm-is-supported" (No such file or directory)
04.09.19 20:31:02 (+0100)  wallboard  
04.09.19 20:31:02 (+0100)  wallboard  (xfce4-panel:147): Wnck-CRITICAL **: wnck_workspace_is_virtual: assertion 'WNCK_IS_WORKSPACE (space)' failed
04.09.19 20:31:04 (+0100)  wallboard  libGL error: MESA-LOADER: failed to retrieve device information
04.09.19 20:31:05 (+0100)  wallboard  MESA-LOADER: failed to retrieve device information
04.09.19 20:31:05 (+0100)  wallboard  MESA-LOADER: failed to retrieve device information
04.09.19 20:31:06 (+0100)  wallboard  Fontconfig warning: "/etc/fonts/fonts.conf", line 100: unknown element "blank"
04.09.19 20:31:07 (+0100)  wallboard  libGL error: MESA-LOADER: failed to retrieve device information
04.09.19 20:31:07 (+0100)  wallboard  [153:214:0904/193107.446893:ERROR:object_proxy.cc(621)] Failed to call method: org.freedesktop.Notifications.GetCapabilities: object_path= /org/freedesktop/Notifications: org.freedesktop.DBus.Error.ServiceUnknown: The name org.freedesktop.Notifications was not provided by any .service files
04.09.19 20:31:07 (+0100)  wallboard  MESA-LOADER: failed to retrieve device information
04.09.19 20:31:07 (+0100)  wallboard  MESA-LOADER: failed to retrieve device information
04.09.19 20:31:07 (+0100)  wallboard  ATTENTION: default value of option force_s3tc_enable overridden by environment.
04.09.19 20:31:08 (+0100)  wallboard  [153:195:0904/193108.110697:ERROR:top_sites_backend.cc(92)] Failed to initialize database.
04.09.19 20:31:08 (+0100)  wallboard  Draw call returned Invalid argument.  Expect corruption.
04.09.19 20:31:19 (+0100)  wallboard  Wed Sep  4 19:31:19 UTC 2019
04.09.19 20:31:24 (+0100)  wallboard  Updating DNS... Traceback (most recent call last):
04.09.19 20:31:24 (+0100)  wallboard    File "/usr/local/lib/python3.5/dist-packages/urllib3/connection.py", line 160, in _new_conn
04.09.19 20:31:24 (+0100)  wallboard      (self._dns_host, self.port), self.timeout, **extra_kw)
04.09.19 20:31:24 (+0100)  wallboard    File "/usr/local/lib/python3.5/dist-packages/urllib3/util/connection.py", line 80, in create_connection
04.09.19 20:31:24 (+0100)  wallboard      raise err
04.09.19 20:31:24 (+0100)  wallboard    File "/usr/local/lib/python3.5/dist-packages/urllib3/util/connection.py", line 70, in create_connection
04.09.19 20:31:24 (+0100)  wallboard      sock.connect(sa)
04.09.19 20:31:24 (+0100)  wallboard  ConnectionRefusedError: [Errno 111] Connection refused
04.09.19 20:31:24 (+0100)  wallboard  
04.09.19 20:31:24 (+0100)  wallboard  During handling of the above exception, another exception occurred:
04.09.19 20:31:24 (+0100)  wallboard  
04.09.19 20:31:24 (+0100)  wallboard  Traceback (most recent call last):
04.09.19 20:31:24 (+0100)  wallboard    File "/usr/local/lib/python3.5/dist-packages/urllib3/connectionpool.py", line 603, in urlopen
04.09.19 20:31:24 (+0100)  wallboard      chunked=chunked)
04.09.19 20:31:24 (+0100)  wallboard    File "/usr/local/lib/python3.5/dist-packages/urllib3/connectionpool.py", line 355, in _make_request
04.09.19 20:31:24 (+0100)  wallboard      conn.request(method, url, **httplib_request_kw)
04.09.19 20:31:24 (+0100)  wallboard    File "/usr/lib/python3.5/http/client.py", line 1107, in request
04.09.19 20:31:24 (+0100)  wallboard      self._send_request(method, url, body, headers)
04.09.19 20:31:24 (+0100)  wallboard    File "/usr/lib/python3.5/http/client.py", line 1152, in _send_request
04.09.19 20:31:24 (+0100)  wallboard      self.endheaders(body)
04.09.19 20:31:24 (+0100)  wallboard    File "/usr/lib/python3.5/http/client.py", line 1103, in endheaders
04.09.19 20:31:24 (+0100)  wallboard      self._send_output(message_body)
04.09.19 20:31:24 (+0100)  wallboard    File "/usr/lib/python3.5/http/client.py", line 934, in _send_output
04.09.19 20:31:24 (+0100)  wallboard      self.send(msg)
04.09.19 20:31:24 (+0100)  wallboard    File "/usr/lib/python3.5/http/client.py", line 877, in send
04.09.19 20:31:24 (+0100)  wallboard      self.connect()
04.09.19 20:31:24 (+0100)  wallboard    File "/usr/local/lib/python3.5/dist-packages/urllib3/connection.py", line 183, in connect
04.09.19 20:31:24 (+0100)  wallboard      conn = self._new_conn()
04.09.19 20:31:24 (+0100)  wallboard    File "/usr/local/lib/python3.5/dist-packages/urllib3/connection.py", line 169, in _new_conn
04.09.19 20:31:24 (+0100)  wallboard      self, "Failed to establish a new connection: %s" % e)
04.09.19 20:31:24 (+0100)  wallboard  urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x76006170>: Failed to establish a new connection: [Errno 111] Connection refused
04.09.19 20:31:24 (+0100)  wallboard  
04.09.19 20:31:24 (+0100)  wallboard  During handling of the above exception, another exception occurred:
04.09.19 20:31:24 (+0100)  wallboard  
04.09.19 20:31:24 (+0100)  wallboard  Traceback (most recent call last):
04.09.19 20:31:24 (+0100)  wallboard    File "/usr/local/lib/python3.5/dist-packages/requests/adapters.py", line 449, in send
04.09.19 20:31:24 (+0100)  wallboard      timeout=timeout
04.09.19 20:31:24 (+0100)  wallboard    File "/usr/local/lib/python3.5/dist-packages/urllib3/connectionpool.py", line 641, in urlopen
04.09.19 20:31:24 (+0100)  wallboard      _stacktrace=sys.exc_info()[2])
04.09.19 20:31:24 (+0100)  wallboard    File "/usr/local/lib/python3.5/dist-packages/urllib3/util/retry.py", line 399, in increment
04.09.19 20:31:24 (+0100)  wallboard      raise MaxRetryError(_pool, url, error or ResponseError(cause))
04.09.19 20:31:24 (+0100)  wallboard  urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='10.114.104.1', port=48484): Max retries exceeded with url: /v1/device?apikey=80d7e0a9ced6e88bf0260d53468fa6c5bb5b3c612276faa5699ee89a085acd (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x76006170>: Failed to establish a new connection: [Errno 111] Connection refused',))
04.09.19 20:31:24 (+0100)  wallboard  
04.09.19 20:31:24 (+0100)  wallboard  During handling of the above exception, another exception occurred:
04.09.19 20:31:24 (+0100)  wallboard  
04.09.19 20:31:24 (+0100)  wallboard  Traceback (most recent call last):
04.09.19 20:31:24 (+0100)  wallboard    File "/usr/bin/dns.py", line 47, in <module>
04.09.19 20:31:24 (+0100)  wallboard      sys.exit(main())
04.09.19 20:31:24 (+0100)  wallboard    File "/usr/bin/dns.py", line 22, in main
04.09.19 20:31:24 (+0100)  wallboard      (ip, hostname) = getSystemInfo()
04.09.19 20:31:24 (+0100)  wallboard    File "/usr/bin/dns.py", line 13, in getSystemInfo
04.09.19 20:31:24 (+0100)  wallboard      ipInfo = requests.get('%s/v1/device?apikey=%s' % (os.environ['RESIN_SUPERVISOR_ADDRESS'], os.environ['RESIN_SUPERVISOR_API_KEY']), headers={'content-type': 'application/json'}).json()
04.09.19 20:31:24 (+0100)  wallboard    File "/usr/local/lib/python3.5/dist-packages/requests/api.py", line 75, in get
04.09.19 20:31:24 (+0100)  wallboard      return request('get', url, params=params, **kwargs)
04.09.19 20:31:24 (+0100)  wallboard    File "/usr/local/lib/python3.5/dist-packages/requests/api.py", line 60, in request
04.09.19 20:31:24 (+0100)  wallboard      return session.request(method=method, url=url, **kwargs)
04.09.19 20:31:24 (+0100)  wallboard    File "/usr/local/lib/python3.5/dist-packages/requests/sessions.py", line 533, in request
04.09.19 20:31:24 (+0100)  wallboard      resp = self.send(prep, **send_kwargs)
04.09.19 20:31:24 (+0100)  wallboard    File "/usr/local/lib/python3.5/dist-packages/requests/sessions.py", line 646, in send
04.09.19 20:31:24 (+0100)  wallboard      r = adapter.send(request, **kwargs)
04.09.19 20:31:24 (+0100)  wallboard    File "/usr/local/lib/python3.5/dist-packages/requests/adapters.py", line 516, in send
04.09.19 20:31:24 (+0100)  wallboard      raise ConnectionError(e, request=request)
04.09.19 20:31:24 (+0100)  wallboard  requests.exceptions.ConnectionError: HTTPConnectionPool(host='10.114.104.1', port=48484): Max retries exceeded with url: /v1/device?apikey=80d7e0a9ced6e88bf0260d53468fa6c5bb5b3c612276faa5699ee89a085acd (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x76006170>: Failed to establish a new connection: [Errno 111] Connection refused',))
04.09.19 20:31:24 (+0100)  wallboard   done.
04.09.19 20:32:25 (+0100)  wallboard  [ TIME ] Timed out waiting for device /dev/mmcblk0p6.

That last line is quite suspicious. But I think there are other things happening here too.

I managed to make some progress on the original “Failed to attach 1” issue where we started.
I’ve added detail on GitHub: https://github.com/balena-os/meta-balena/issues/1645#issuecomment-528338112

OK. We have workaround number 1 for the “Failed to attach 1 to compat systemd cgroup” issue…

root@468439d:~# cat /etc/docker/daemon.json 
{ 
"exec-opts": ["native.cgroupdriver=systemd"] 
}
root@468439d:~#
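
If the file already holds other settings, the option has to be merged in rather than overwriting the whole file. A minimal sketch of that merge follows, assuming the engine reads /etc/docker/daemon.json as shown above; the function name is hypothetical, and the engine would presumably need a restart before the change takes effect.

```python
import json

CGROUP_PREFIX = "native.cgroupdriver="


def with_systemd_cgroup_driver(config_text):
    """Return daemon.json text with exec-opts set to use the systemd cgroup driver.

    Existing keys and unrelated exec-opts entries are preserved; any previous
    native.cgroupdriver entry is replaced.
    """
    config = json.loads(config_text) if config_text.strip() else {}
    opts = [opt for opt in config.get("exec-opts", [])
            if not opt.startswith(CGROUP_PREFIX)]
    opts.append(CGROUP_PREFIX + "systemd")
    config["exec-opts"] = opts
    return json.dumps(config, indent=2) + "\n"
```

The function works on text only, so it can be checked off-device before the result is written to the file on the device.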

Now to try workaround 2

Can you please add initcall_blacklist=bcm2708_fb_init to /mnt/boot/cmdline.txt? Does that make xfce start working on your wallboard?
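
On the Raspberry Pi, cmdline.txt must stay a single line, so the parameter goes on the end of the existing line rather than on a new one. A minimal sketch of an idempotent edit follows; the function name is hypothetical, and the example input is illustrative, not the device's real command line.

```python
def add_kernel_param(cmdline, param="initcall_blacklist=bcm2708_fb_init"):
    """Append `param` to a kernel command line unless it is already present.

    The result is always a single line terminated by one newline.
    """
    tokens = cmdline.split()
    if param not in tokens:
        tokens.append(param)
    return " ".join(tokens) + "\n"


# Illustrative input only; the real contents of cmdline.txt will differ.
example = add_kernel_param("console=tty1 rootwait\n")
```

Reading /mnt/boot/cmdline.txt, passing its contents through the function and writing the result back would apply the change. A reboot is then needed for the kernel to see it.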

I tried your app, and with the above workaround on 2.41 I can see Chrome open. But I can’t run the wallboard fully, as I think you have some env vars that hold secrets such as API keys.

P.S. We have taken down 2.41 from production due to these issues. We are tracing and fixing them. It is a mix of kernel/systemd/firmware issues coming together.

And I have also started an internal thread to improve testing in this area. Our testing uses https://github.com/balena-io-playground/x11-window-manager but that doesn’t use systemd… So I’m trying to see if we can add x11+systemd into our testing flow as well.

Thanks - do appreciate the help!

All I really want to be able to do is to have a web browser that can display two or three tabs and cycle between them. And I need VNC because I’ll need to provide login credentials as things start up. If I don’t need systemd then I’m happy to do without it…

I’ll add that text to the cmdline.txt file now.