Multi-container project that works on RPi3 fails on Fin board (Avahi & supervisor interaction)

I have just received my Fin board, and the first thing I did was push a multi-container project that works like a charm on a regular RPi3. The image was built and pulled fine on the Fin board; however, all but one of my containers fail to start (repeatedly, due to the restart=always setting). Any ideas on how to troubleshoot this? I am unable to SSH into the container because it instantly dies. Please note that the exact same container works fine if I create a single-container project.

thanks in advance,

Alex

Have you checked the log for that container in the balena dashboard?

Yes. There is nothing printed other than:

05.12.18 14:19:10 (+0200) Installing service 'signalk sha256:83c3797b7144bb8fa3a9632c3edc54ad56088179dd5c7c25fc324588c20c461d'
05.12.18 14:19:19 (+0200) Installed service 'signalk sha256:83c3797b7144bb8fa3a9632c3edc54ad56088179dd5c7c25fc324588c20c461d'
05.12.18 14:19:19 (+0200) Starting service 'signalk sha256:83c3797b7144bb8fa3a9632c3edc54ad56088179dd5c7c25fc324588c20c461d'
05.12.18 14:19:22 (+0200) Started service 'signalk sha256:83c3797b7144bb8fa3a9632c3edc54ad56088179dd5c7c25fc324588c20c461d'
05.12.18 14:19:23 (+0200) Killing service 'signalk sha256:83c3797b7144bb8fa3a9632c3edc54ad56088179dd5c7c25fc324588c20c461d'
05.12.18 14:19:27 (+0200) Service exited 'signalk sha256:83c3797b7144bb8fa3a9632c3edc54ad56088179dd5c7c25fc324588c20c461d'
05.12.18 14:19:29 (+0200) Killed service 'signalk sha256:83c3797b7144bb8fa3a9632c3edc54ad56088179dd5c7c25fc324588c20c461d'

in a loop every 10 sec or so…

I also tried executing the below on the host:

journalctl --no-pager -u resin-supervisor

which produces the below in a loop:

Dec 05 12:23:47 b6bb9bb resin-supervisor[17600]: [2018-12-05T12:23:47.733Z] Replacing container for service signalk because of config changes:
Dec 05 12:23:47 b6bb9bb resin-supervisor[17600]: [2018-12-05T12:23:47.737Z] Non-array fields: {"added":{},"deleted":{},"updated":{"portMaps":{"0":{"ports":{"internalStart":3000,"internalEnd":3000,"externalStart":3000,"externalEnd":3000}},"1":{"ports":{"internalStart":1883,"internalEnd":1883,"externalStart":1883,"externalEnd":1883}}}}}
Dec 05 12:23:47 b6bb9bb resin-supervisor[17600]: [2018-12-05T12:23:47.740Z] Replacing container for service signalk because of config changes:
Dec 05 12:23:47 b6bb9bb resin-supervisor[17600]: [2018-12-05T12:23:47.743Z] Non-array fields: {"added":{},"deleted":{},"updated":{"portMaps":{"0":{"ports":{"internalStart":3000,"internalEnd":3000,"externalStart":3000,"externalEnd":3000}},"1":{"ports":{"internalStart":1883,"internalEnd":1883,"externalStart":1883,"externalEnd":1883}}}}}
Dec 05 12:23:49 b6bb9bb resin-supervisor[17600]: [2018-12-05T12:23:49.042Z] Replacing container for service signalk because of config changes:
Dec 05 12:23:49 b6bb9bb resin-supervisor[17600]: [2018-12-05T12:23:49.044Z] Non-array fields: {"added":{},"deleted":{},"updated":{"portMaps":{"0":{"ports":{"internalStart":3000,"internalEnd":3000,"externalStart":3000,"externalEnd":3000}},"1":{"ports":{"internalStart":1883,"internalEnd":1883,"externalStart":1883,"externalEnd":1883}}}}}
Dec 05 12:23:49 b6bb9bb resin-supervisor[17600]: [2018-12-05T12:23:49.047Z] Replacing container for service signalk because of config changes:
Dec 05 12:23:49 b6bb9bb resin-supervisor[17600]: [2018-12-05T12:23:49.048Z] Non-array fields: {"added":{},"deleted":{},"updated":{"portMaps":{"0":{"ports":{"internalStart":3000,"internalEnd":3000,"externalStart":3000,"externalEnd":3000}},"1":{"ports":{"internalStart":1883,"internalEnd":1883,"externalStart":1883,"externalEnd":1883}}}}}
Dec 05 12:23:51 b6bb9bb resin-supervisor[17600]: [2018-12-05T12:23:51.420Z] Event: Service exit {"service":{"appId":1312298,"serviceId":135772,"serviceName":"signalk","releaseId":686247}}
Dec 05 12:23:51 b6bb9bb resin-supervisor[17600]: [2018-12-05T12:23:51.540Z] Replacing container for service signalk because of config changes:
Dec 05 12:23:51 b6bb9bb resin-supervisor[17600]: [2018-12-05T12:23:51.542Z] Non-array fields: {"added":{},"deleted":{},"updated":{"portMaps":{"0":{"ports":{"internalStart":3000,"internalEnd":3000,"externalStart":3000,"externalEnd":3000}},"1":{"ports":{"internalStart":1883,"internalEnd":1883,"externalStart":1883,"externalEnd":1883}}}}}
Dec 05 12:23:51 b6bb9bb resin-supervisor[17600]: [2018-12-05T12:23:51.543Z] Replacing container for service signalk because of config changes:
Dec 05 12:23:51 b6bb9bb resin-supervisor[17600]: [2018-12-05T12:23:51.545Z] Non-array fields: {"added":{},"deleted":{},"updated":{"portMaps":{"0":{"ports":{"internalStart":3000,"internalEnd":3000,"externalStart":3000,"externalEnd":3000}},"1":{"ports":{"internalStart":1883,"internalEnd":1883,"externalStart":1883,"externalEnd":1883}}}}}
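
For context, these portMaps entries appear to correspond to a docker-compose fragment roughly like the following (a sketch only: the service name and ports are taken from the log above, everything else is an assumption):

version: '2'
services:
  signalk:
    build: ./signalk    # assumed build context
    restart: always     # the restart setting mentioned earlier
    ports:
      - "3000:3000"
      - "1883:1883"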

thanks,

Alex

Just to update: looking at the entire output of journalctl, it seems to me that the config changes are caused by changes at the network level. So I went ahead and re-flashed my Fin with an Ethernet-only image, and while the problem initially appeared to persist, after several restarts it seems to have stabilised. I believe the issue is made worse because my container uses Avahi/mDNS: as the various interfaces are renamed, added to the bridge, etc., my container detects this as a config change and restarts. Does this theory make sense?
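
One way to watch these network-level events from the host OS as they happen (a sketch, assuming balenaOS's usual NetworkManager and supervisor unit names) is:

# follow NetworkManager and supervisor activity side by side
journalctl -f -u NetworkManager -u resin-supervisor
# or watch interface/connection events directly
nmcli monitor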

Please find attached the journalctl.log

Thanks,

Alex
journalctl.log (1.3 MB)

Hi @krital, can you tell us a bit more about your situation: are you using balenaOS? Which version?

Hi @ntzovanis,

I am using the latest version of balenaOS on the Fin (fincm3-2.26.0+rev1-dev-v8.0.0), running a project with 5 containers. One of the containers advertises itself over mDNS, which is where I believe the problem occurs. What I have narrowed it down to so far: if WiFi is not configured, or is configured but the SSID cannot be found, then the ongoing network scanning repeatedly triggers 'network config change' events. These are picked up by that container instance, which is then killed and reinstalled, and this repeats in a loop until it eventually stabilises. Let me know if you need more information than what I have already uploaded in the thread.

Thanks,

Alex

Hi @krital, there are some known bugs in Avahi. Can you please check whether this one affects you: https://github.com/balena-os/meta-balena/issues/1287
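
A quick first check, assuming Avahi runs on the host OS under its usual unit name, is to inspect its journal the same way as the supervisor's:

# recent Avahi activity on the host OS
journalctl --no-pager -u avahi-daemon | tail -n 50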

Also, you mentioned that your application worked on the RPi3 but does not on the Fin; perhaps you were using a different OS version on each? Can you please confirm whether the OS version was the same on both? This would be very important information. I am asking because the board itself does not look like the culprit: both boards use the same kernel and tool versions, and only some hardware-specific settings differ.

I will check the journal logs now.

I checked the logs now, and we will need some of your assistance to reproduce this on our side so that we can investigate further. Ideally, we need a minimal setup that reproduces the issue. Can you please explain in more detail what interactions with Avahi you are doing, and possibly share some code with us?

Thanks for looking at the logs. I also checked my RPi3 image and it is indeed an earlier version of balenaOS, but I doubt that this is related. It seems the only other difference was that I had either configured for Ethernet only, or configured for WiFi & Ethernet while the AP in question was switched off / temporarily unavailable.
To recap: I think the subject of this thread is now incorrect, as my problem is not related to the Fin board but to the Avahi interaction you noticed. I am happy to share the Dockerfile of the signalk container, which should be able to reproduce this, if that is what you need. Do we need to move to a new thread?

There were some supervisor changes related to Avahi that could possibly have something to do with this, so I would not rule out the OS version. Let me find the corresponding versions.

No need to start a new thread, but you may change the subject of the thread to mention Avahi, and also move the thread to the balenaCloud folder. There is an edit icon (pencil symbol) at the very top of the page that allows you to do that.

OK, moved… I also found the versions: on the RPi3 I was running raspberrypi3-2.15.1+rev2-dev-v7.16.6, while on the Fin I am running fincm3-2.26.0+rev1-dev-v8.0.0.

Just to recap, I would like to describe what I think is happening:

I have a multi-container project where one of the containers runs the Avahi daemon. On the older OS version (RPi3) the actual registration was failing and I had not yet dealt with it; I assume it still fails on the Fin board. When I built the image for the Fin board (newer supervisor version), I had configured a WiFi AP that was not on at the time I booted. As the WiFi SSIDs were being scanned, or joining the configured one failed, the supervisor seems to propagate these failures via Avahi. As a result, my container believes its network configuration has changed, and it restarts.

Does this match what you noticed in the logs?
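
For what it's worth, a quick way to check from inside the container whether the mDNS registration actually succeeds (assuming the avahi-utils package is installed there) is something like:

# dump all services currently visible over mDNS, then exit
avahi-browse --all --terminate

If the service never shows up in that list, the registration itself is failing rather than the network configuration changing.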

The mDNS-related change was introduced in supervisor v7.18.0, which matches the version difference. It could be something else, but most probably this is also why registration was failing on the older version.

I will check the logs again to see whether WiFi scanning was involved, as it looked more like it was related to the Ethernet interface.

I do not think this is related to the WiFi scanning. Can you please share with me the docker-compose file, the Dockerfile for the container, and the part of your code that deals with Avahi? You may do that in a private message: paste them into a private gist, for example, and share the links with me in a PM.

I received the files on my side. I have an idea: can you please comment out the mDNS-related calls from your application and check the logs of the device to see whether Avahi still exhibits the same behaviour?

I have rebooted the Fin and the problem is occurring again; it is currently stuck in the restart loop. I have granted support access to the device, so perhaps that would be easier, because my code never gets a chance to run. All I see is:
12.12.18 15:49:01 (+0200) Killing service 'signalk sha256:2b440092bb4e77941ffa4f47231aaedf0c6863d91160624e93a768d5fa77162a'
12.12.18 15:49:06 (+0200) Service exited 'signalk sha256:2b440092bb4e77941ffa4f47231aaedf0c6863d91160624e93a768d5fa77162a'
12.12.18 15:49:08 (+0200) Killed service 'signalk sha256:2b440092bb4e77941ffa4f47231aaedf0c6863d91160624e93a768d5fa77162a'
12.12.18 15:49:08 (+0200) Installing service 'signalk sha256:2b440092bb4e77941ffa4f47231aaedf0c6863d91160624e93a768d5fa77162a'
12.12.18 15:49:39 (+0200) Installed service 'signalk sha256:2b440092bb4e77941ffa4f47231aaedf0c6863d91160624e93a768d5fa77162a'
12.12.18 15:49:39 (+0200) Starting service 'signalk sha256:2b440092bb4e77941ffa4f47231aaedf0c6863d91160624e93a768d5fa77162a'
12.12.18 15:49:41 (+0200) Started service 'signalk sha256:2b440092bb4e77941ffa4f47231aaedf0c6863d91160624e93a768d5fa77162a'
12.12.18 15:49:42 (+0200) Killing service 'signalk sha256:2b440092bb4e77941ffa4f47231aaedf0c6863d91160624e93a768d5fa77162a'
12.12.18 15:49:46 (+0200) Service exited 'signalk sha256:2b440092bb4e77941ffa4f47231aaedf0c6863d91160624e93a768d5fa77162a'
12.12.18 15:49:49 (+0200) Killed service 'signalk sha256:2b440092bb4e77941ffa4f47231aaedf0c6863d91160624e93a768d5fa77162a'
12.12.18 15:49:49 (+0200) Installing service 'signalk sha256:2b440092bb4e77941ffa4f47231aaedf0c6863d91160624e93a768d5fa77162a'

Can you access it?

I think this is most probably due to the fact that you are using ENTRYPOINT rather than CMD at the end of your Dockerfile. Can you please try switching it?
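
For illustration, the distinction looks roughly like this (a sketch; the base image and start command are assumptions, since the actual Dockerfile was shared privately). The resin/balena base images ship their own ENTRYPOINT, /usr/bin/entry.sh, which sets up the configured init system before handing control to the application, so declaring your own ENTRYPOINT replaces that setup, while CMD is executed through it:

FROM resin/fincm3-node:latest    # illustrative base image, not the one from the thread

# ENTRYPOINT ["node", "index.js"]   # would replace the base image's entry.sh
CMD ["node", "index.js"]            # is run via entry.sh, which executes first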

I tried changing ENTRYPOINT to CMD, but now it fails to start with the following:

13.12.18 17:22:03 (+0200) Installed service 'signalk sha256:9206eb8186dc8ae6df37bf7e2be2f4e85d38dd80774fe20ddcccf2f3fe0ea602'
13.12.18 17:22:03 (+0200) Starting service 'signalk sha256:9206eb8186dc8ae6df37bf7e2be2f4e85d38dd80774fe20ddcccf2f3fe0ea602'
13.12.18 17:22:05 (+0200) signalk Systemd init system enabled.
13.12.18 17:22:05 (+0200) signalk /usr/bin/entry.sh: line 76: /etc/docker.env: Permission denied
13.12.18 17:22:05 (+0200) signalk /usr/bin/entry.sh: line 84: /etc/resinApp.sh: Permission denied
13.12.18 17:22:05 (+0200) signalk /usr/bin/entry.sh: line 85: /etc/resinApp.sh: Permission denied
13.12.18 17:22:05 (+0200) signalk chmod: cannot access '/etc/resinApp.sh': No such file or directory
13.12.18 17:22:05 (+0200) signalk mkdir: cannot create directory '/etc/systemd/system/launch.service.d': Permission denied
13.12.18 17:22:05 (+0200) signalk /usr/bin/entry.sh: line 89: /etc/systemd/system/launch.service.d/override.conf: No such file or directory
13.12.18 17:22:05 (+0200) signalk Failed to mount tmpfs at /run: Operation not permitted
13.12.18 17:22:05 (+0200) Started service 'signalk sha256:9206eb8186dc8ae6df37bf7e2be2f4e85d38dd80774fe20ddcccf2f3fe0ea602'
13.12.18 17:22:06 (+0200) Killing service 'signalk sha256:9206eb8186dc8ae6df37bf7e2be2f4e85d38dd80774fe20ddcccf2f3fe0ea602'
13.12.18 17:22:10 (+0200) Service exited 'signalk sha256:9206eb8186dc8ae6df37bf7e2be2f4e85d38dd80774fe20ddcccf2f3fe0ea602'

Hi there,
Could you try to create a minimal application (a single container with a simpler Dockerfile) that shows the same issue, so we can reproduce and debug it locally?

I will certainly give it a try. However, in previous attempts this was not happening with single containers; now that I know more about the issue, I can try again.