Container fails to start - "failed to attach 1 to compat systemd cgroup"

After the upgrade to 2.41.0r3 the system has gone completely crackers and every minute the following happens - any idea? I don’t really know where to start on that one…

28.08.19 16:05:00 (+0200) wallboard Systemd init system enabled.
28.08.19 16:05:00 (+0200) wallboard systemd 241 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid)
28.08.19 16:05:00 (+0200) wallboard Detected virtualization docker.
28.08.19 16:05:00 (+0200) wallboard Detected architecture arm.
28.08.19 16:05:00 (+0200) wallboard Set hostname to <b83d65ecdb5b>.
28.08.19 16:05:00 (+0200) wallboard Failed to bump fs.file-max, ignoring: Invalid argument
28.08.19 16:05:00 (+0200) wallboard Failed to attach 1 to compat systemd cgroup /docker/b83d65ecdb5bc27bd19d096c89942fb8b7376f30532ad899aaefac091d0a0803/init.scope: No such file or directory
28.08.19 16:05:00 (+0200) wallboard Failed to open pin file: No such file or directory
28.08.19 16:05:00 (+0200) wallboard Failed to allocate manager object: No such file or directory
28.08.19 16:05:00 (+0200) wallboard [!!!!!!] Failed to allocate manager object.

Hi,

Can you please share your Dockerfile snippet?
And the device?

Thanks

`FROM balenalib/%%BALENA_MACHINE_NAME%%

ENV container docker

# Install other apt deps

RUN apt-get update && apt-get install -y --no-install-recommends \
    apt-utils \
    curl \
    dbus \
    wget \
    xserver-xorg-core \
    xserver-xorg-legacy \
    xserver-xorg-input-all \
    xserver-xorg-video-fbdev \
    x11vnc \
    x11-xserver-utils \
    xorg \
    chromium-browser \
    libglvnd-dev \
    libxcb-image0 \
    libxcb-util0 \
    xdg-utils \
    libdbus-1-dev \
    libcap-dev \
    libxtst-dev \
    libxss1 \
    lsb-release \
    fbset \
    systemd-sysv \
    libexpat-dev && rm -rf /var/lib/apt/lists/*

RUN install_packages xfce4 dbus-x11 xdotool

ENV XFCE_PANEL_MIGRATE_DEFAULT=1

RUN echo "#!/bin/bash" > /etc/X11/xinit/xserverrc \
    && echo "" >> /etc/X11/xinit/xserverrc \
    && echo 'exec /usr/bin/X -s 0 dpms -nocursor -nolisten tcp "$@"' >> /etc/X11/xinit/xserverrc

# advised by balena to add the below for fixing local input issues

RUN systemctl mask \
    dev-hugepages.mount \
    sys-fs-fuse-connections.mount \
    sys-kernel-config.mount \
    getty@.service \
    systemd-logind.service \
    systemd-remount-fs.service \
    getty.target \
    graphical.target

RUN apt-get update
RUN apt-get install python3-distutils

COPY get-pip.py /tmp/get-pip.py
RUN python3 /tmp/get-pip.py
RUN rm /tmp/get-pip.py
RUN pip install google-cloud-dns requests
COPY dns.py /usr/bin
RUN chmod 755 /usr/bin/dns.py

COPY entry.sh /usr/bin/entry.sh
COPY balena.service /etc/systemd/system/balena.service
RUN chmod 644 /etc/systemd/system/balena.service

RUN systemctl enable /etc/systemd/system/balena.service

STOPSIGNAL 37
#VOLUME ["/sys/fs/cgroup"]
ENTRYPOINT ["/usr/bin/entry.sh"]

RUN mkdir -p /home/chromium/.config
COPY config /home/chromium/.config

# Move to app dir

WORKDIR /usr/src/app

# Move app to filesystem

COPY ./app .

# uncomment if you want systemd

ENV INITSYSTEM on

# Start app

CMD ["bash", "/usr/src/app/start.sh"]`

Hello, can you try modifying the entry.sh script so that systemd is not started in quiet mode? I see from your Dockerfile that you are copying your own one over the default one. Hopefully there is a line in there similar to https://github.com/balena-io-library/base-images/blob/master/examples/INITSYSTEM/systemd/systemd.v230/entry.sh#L82. Can you try removing `SYSTEMD_LOG_LEVEL=info` and `quiet systemd.show_status=0` from that line (keeping `/sbin/init`) so we can see all the systemd logs?

Trying that now…

You have support access to 468439d25ffaf42298bb7caace3fb776 - it’s up and running at the moment but it probably won’t last long. I’ll update the thread if/when it does.

Currently it’s running like this: `exec env DBUS_SYSTEM_BUS_ADDRESS=unix:path=/run/dbus/system_bus_socket /sbin/init`
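
For anyone following along, here is a minimal sketch of how the tail of a custom entry.sh could make the quiet flags optional. The `SYSTEMD_QUIET` switch and the `build_init_cmd` helper are hypothetical names made up for this illustration and are not part of the balena base images; the bus address and the `/sbin/init` arguments are the ones quoted in this thread.

```shell
#!/bin/bash
# Sketch only: SYSTEMD_QUIET and build_init_cmd are hypothetical names.
# Prints the command line that would be used to hand over to systemd.
build_init_cmd() {
    local cmd="env DBUS_SYSTEM_BUS_ADDRESS=unix:path=/run/dbus/system_bus_socket"
    if [ "${SYSTEMD_QUIET:-0}" = "1" ]; then
        # Quiet variant, using the flags quoted in this thread.
        cmd="$cmd SYSTEMD_LOG_LEVEL=info /sbin/init quiet systemd.show_status=0"
    else
        # Verbose variant: systemd prints its full boot log.
        cmd="$cmd /sbin/init"
    fi
    printf '%s\n' "$cmd"
}

# Show the command that would be run. In a real entry.sh the last line
# would instead be something like: exec $(build_init_cmd)
build_init_cmd
```

With `SYSTEMD_QUIET` unset this prints the same command line as the one currently running on the device.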

Hmm I can’t see the error message you mention in the logs, did the device previously start to print this after some time?

Yes, it’s still up and running at the moment…

@ajs1k please ping us when you see the problem occur again and we will look into it.

So far so good. Out of my five wallboards one is offline and I’m going to get it restarted shortly, and the other four are running happily. I’ll keep you posted…

468439d25ffaf42298bb7caace3fb776 is now stuck in this crash loop…

Any thoughts? It’s still stuck here…

@ajs1k I’ve tried to reproduce this issue locally using the Dockerfile you shared.
Unfortunately, I haven’t managed to reproduce it. (Had to trim down non-existent files etc)

I’ve also tried to trace it by accessing the device. When I run the image manually with something like `balena run --rm IMAGE`, it runs just fine.

root@468439d:~# balena run --privileged --rm -i -t 18f9cc788806
Systemd init system enabled.
systemd 241 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid)
Detected virtualization docker.
Detected architecture arm.

Welcome to Debian GNU/Linux 10 (buster)!

Set hostname to <f21409246a60>.
Failed to bump fs.file-max, ignoring: Invalid argument
File /lib/systemd/system/systemd-journald.service:12 configures an IP firewall (IPAddressDeny=any), but the local system does not support BPF/cgroup based firewalling.
Proceeding WITHOUT firewalling in effect! (This warning is only shown for the first loaded unit using IP firewalling.)
[  OK  ] Reached target Remote File Systems.
[  OK  ] Set up automount Arbitrary Executable File Formats File System Automount Point.
[  OK  ] Listening on initctl Compatibility Named Pipe.
[  OK  ] Reached target Slices.
[  OK  ] Listening on udev Kernel Socket.

Which led me to believe something else is fishy here.

I stopped the supervisor, deleted the previous container, started the supervisor and the same image started in a new container just fine.
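
Those steps can be sketched roughly as below. This is an illustration under assumptions, not the exact commands that were run: the supervisor unit name `resin-supervisor` may differ between balenaOS releases, `CONTAINER` is a placeholder for the stuck container's ID, and the `run` helper and `DRY_RUN` switch are hypothetical names added so the script only prints the commands by default.

```shell
#!/bin/bash
# Sketch only. Assumptions: the supervisor unit is named resin-supervisor
# (check with `systemctl list-units` on your OS version) and CONTAINER is
# a placeholder for the ID of the stuck app container.
DRY_RUN="${DRY_RUN:-1}"                    # 1 = only print commands, 0 = execute
CONTAINER="${CONTAINER:-<container-id>}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

recreate_app_container() {
    run systemctl stop resin-supervisor    # keep the supervisor from interfering
    run balena rm -f "$CONTAINER"          # delete the previous container
    run systemctl start resin-supervisor   # supervisor recreates it from the same image
}

recreate_app_container
```

Running it with `DRY_RUN=0 CONTAINER=<id>` on the host OS would execute the three steps for real.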

Strange. I wish there was an easy way to reproduce the issue as I’ve seen it on two forum threads now…

Would it be OK if I rebooted the device to try to get it back into the bad state? This hunch comes from the other forum thread, Container reboot and "Failed to attach 1 to compat systemd cgroup", where a reboot put the device into a bad state and a restart of the container fixed the issue.

Feel free to reboot/restart/whatever you need - the device is unusable for us at the moment anyway.

The problem also doesn’t occur straight away: it can take up to about a day before it throws a fit and gets into this weird state.

For the rest of our fleet I might go back to trying 2.36 and seeing if that brings us stability. The problem is that due to what we’re displaying on the wallboard it requires us to manually log in every time it restarts so it’s a fairly visible failure-mode.

I see the issue again. This time I managed to spot something in the logs: balena-engine crashed. Unfortunately, I didn’t see the initial part of the stack trace as the logs had rotated. I’m going to try to reproduce it and see if I can catch it in action.

Sep 03 14:15:41 468439d balenad[780]:         /usr/lib/go/src/io/io.go:400 +0x14c
Sep 03 14:15:41 468439d balenad[780]: io.CopyBuffer(0x13f8c60, 0x13e38fc0, 0x7512fdd0, 0x1389e380, 0x144e6000, 0x8000, 0x8000, 0x1, 0x55601, 0xd6a1b4, ...)
Sep 03 14:15:41 468439d balenad[780]:         /usr/lib/go/src/io/io.go:373 +0x5c
Sep 03 14:15:41 468439d balenad[780]: github.com/docker/docker/pkg/pools.Copy(0x13f8c60, 0x13e38fc0, 0x7512fdd0, 0x1389e380, 0x1389e380, 0x0, 0x0, 0x0)
Sep 03 14:15:41 468439d balenad[780]:         /yocto/resin-board/build/tmp/work/cortexa7t2hf-neon-vfpv4-poky-linux-gnueabi/balena/18.09.8-dev+git80d443d400dc>
Sep 03 14:15:41 468439d balenad[780]: github.com/docker/docker/container/stream.(*Config).CopyToPipe.func1.1(0x13f8c60, 0x13e38fc0, 0x7512fd70, 0x1389e380, 0>
Sep 03 14:15:41 468439d balenad[780]:         /yocto/resin-board/build/tmp/work/cortexa7t2hf-neon-vfpv4-poky-linux-gnueabi/balena/18.09.8-dev+git80d443d400dc>
Sep 03 14:15:41 468439d balenad[780]: created by github.com/docker/docker/container/stream.(*Config).CopyToPipe.func1
Sep 03 14:15:41 468439d balenad[780]:         /yocto/resin-board/build/tmp/work/cortexa7t2hf-neon-vfpv4-poky-linux-gnueabi/balena/18.09.8-dev+git80d443d400dc>
Sep 03 14:15:41 468439d balenad[780]: goroutine 368 [select, 4 minutes]:
Sep 03 14:15:41 468439d balenad[780]: github.com/docker/docker/api/server/router/system.(*systemRouter).getEvents(0x13c5a2a0, 0x1409388, 0x145d0a00, 0x1406c8>
Sep 03 14:15:41 468439d balenad[780]:         /yocto/resin-board/build/tmp/work/cortexa7t2hf-neon-vfpv4-poky-linux-gnueabi/balena/18.09.8-dev+git80d443d400dc>
Sep 03 14:15:41 468439d balenad[780]: github.com/docker/docker/api/server/router/system.(*systemRouter).(github.com/docker/docker/api/server/router/system.ge>
Sep 03 14:15:41 468439d balenad[780]:         /yocto/resin-board/build/tmp/work/cortexa7t2hf-neon-vfpv4-poky-linux-gnueabi/balena/18.09.8-dev+git80d443d400dc>
Sep 03 14:15:41 468439d balenad[780]: github.com/docker/docker/api/server/router.cancellableHandler.func1(0x14093e8, 0x145d09e0, 0x1406c88, 0x145a77a0, 0x143>
Sep 03 14:15:41 468439d balenad[780]:         /yocto/resin-board/build/tmp/work/cortexa7t2hf-neon-vfpv4-poky-linux-gnueabi/balena/18.09.8-dev+git80d443d400dc>
Sep 03 14:15:41 468439d balenad[780]: github.com/docker/docker/api/server/middleware.ExperimentalMiddleware.WrapHandler.func1(0x14093e8, 0x145d09e0, 0x1406c8
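
Since the beginning of the trace was lost to log rotation, one way to find it next time is to search a saved copy of the journal for the line where the Go runtime starts its crash output, which begins with `panic:` or `fatal error:`. The sketch below assumes the engine logs under a unit called `balena` (a guess based on the `balenad` process name above); `find_crash_start` is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch only: locates the start of a Go crash dump in log text.
# Go runtime crash output begins with a line containing "panic:" or
# "fatal error:". Reads log text on stdin and prints
# "<line number>:<line>" for the first match.
find_crash_start() {
    grep -n -m 1 -E 'panic:|fatal error:'
}

# Example usage on a device (the unit name "balena" is an assumption):
#   journalctl -u balena --no-pager | find_crash_start
```

Following the journal into a file while reproducing (for example with `journalctl -f`) would avoid depending on how much history is retained.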

One more step forward.

If I restart the balena-engine service twice, the app container goes into the bad state.
So we have a way of trying to reproduce it faster now.
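
A rough sketch of that reproduction loop, for anyone who wants to try it on their own device. The unit name `balena`, the 30-second pause, and the `CONTAINER` placeholder are assumptions; `run`, `reproduce`, and `DRY_RUN` are hypothetical names, and by default the script only prints what it would do.

```shell
#!/bin/bash
# Sketch only. Assumptions: the engine's systemd unit is "balena", a
# 30 s pause between restarts is enough (arbitrary value), and CONTAINER
# is a placeholder for the app container's ID.
DRY_RUN="${DRY_RUN:-1}"                    # 1 = only print commands, 0 = execute
PAUSE="${PAUSE:-30}"
CONTAINER="${CONTAINER:-<container-id>}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

reproduce() {
    local i
    for i in 1 2; do
        run systemctl restart balena       # restart the engine, twice in total
        run sleep "$PAUSE"                 # give the containers time to come back
    done
    # Check the app container's output for the "Failed to attach 1 to
    # compat systemd cgroup" message.
    run balena logs --tail 20 "$CONTAINER"
}

reproduce
```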

Ah splendid - that ought to speed things up if you can repro it. (-;

I’ve also managed to reproduce it on my own device. Restart the balena service twice and watch for that same error message.
