All services restarted randomly

Every 10 minutes all the services running on our balenaCloud device restart. Here is a section of the logs from the device. I cannot see any error in the code that has caused all the services to restart at the same time.

19.02.19 18:18:52 (+0000) Service is already running 'controller sha256:2bc98d7493a0b5bd98b0ed316536def139a85055b8d03d9fec05dfdc390c6529'
19.02.19 18:18:52 (+0000) Service is already running 'camera sha256:8dfb371313a91168515574ab913d8a1b721a8e8136cc9173db759bc23e63d7d0'
19.02.19 18:18:52 (+0000) Service is already running 'proxy sha256:9491ab3748924ff6d19af6be5c66df6257138c418c8c7399203a05274956805e'
19.02.19 18:18:52 (+0000) Service is already running 'interface sha256:4595c12ed503140d5aaf00312442dba8e648edc42a200e156e6c159c33367845'
19.02.19 18:18:52 (+0000) Service is already running 'metrics sha256:ad08a3c3220a92e5ce50f55cee697bdd450de848ab457d317a66d5ed707e5e7f'
19.02.19 18:18:52 (+0000) Service is already running 'hardware sha256:e1f197d54248ca64acc3e132dda9ae30944bac1880f2907adf2587777f1414f5'
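
A rough way to check the "every 10 minutes" pattern against a longer log export is to group these lines into bursts and measure the gap between bursts. The sketch below is illustrative only. It assumes the log has been saved as plain text in the format shown above, that the timestamps are DD.MM.YY, and that the lines are in chronological order. The function names are made up for this example and are not part of any balena tooling.

```python
import re
from datetime import datetime, timedelta

# Matches lines such as:
# 19.02.19 18:18:52 (+0000) Service is already running 'camera sha256:8dfb...'
LINE_RE = re.compile(
    r"^(\d{2}\.\d{2}\.\d{2} \d{2}:\d{2}:\d{2}) \([+-]\d{4}\) "
    r"Service is already running '(\S+) sha256:[0-9a-f]+'"
)


def restart_bursts(log_text, gap=timedelta(seconds=5)):
    """Group 'Service is already running' lines into bursts.

    A line starts a new burst when its timestamp is more than `gap`
    after the previous matching line. Returns a list of
    (burst_start_time, [service_name, ...]) tuples.
    Assumption: timestamps are DD.MM.YY and lines are chronological.
    """
    bursts = []
    last_time = None
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # ignore unrelated log lines
        when = datetime.strptime(match.group(1), "%d.%m.%y %H:%M:%S")
        if last_time is None or when - last_time > gap:
            bursts.append((when, []))
        bursts[-1][1].append(match.group(2))
        last_time = when
    return bursts


def burst_intervals(bursts):
    """Return the time between the starts of consecutive bursts."""
    return [later[0] - earlier[0] for earlier, later in zip(bursts, bursts[1:])]
```

If every burst lists all six services and the intervals are close to constant, that points toward something restarting the whole engine or device rather than individual services crashing.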

Here is the output from the balena ps command. What is interesting here is that all the containers, including the supervisor, have restarted.

CONTAINER ID        IMAGE                              COMMAND                  CREATED             STATUS                   PORTS                NAMES
ac9eaf74f144        2bc98d7493a0                       "/usr/bin/entry.sh .…"   About an hour ago   Up 4 minutes             8082/tcp             controller_888819_793723
e50fcfcb70a8        8dfb371313a9                       "/usr/bin/entry.sh .…"   About an hour ago   Up 6 minutes             8084/tcp             camera_888818_793723
338d4763c857        9491ab374892                       "/usr/bin/entry.sh n…"   4 days ago          Up 6 minutes             0.0.0.0:80->80/tcp   proxy_888823_793723
46501db12e26        4595c12ed503                       "/usr/bin/entry.sh n…"   4 days ago          Up 6 minutes             8081/tcp             interface_888821_793723
d2bfb6dba414        ad08a3c3220a                       "/usr/bin/entry.sh n…"   4 days ago          Up 6 minutes                                  metrics_888822_793723
bc1446718096        e1f197d54248                       "/usr/bin/entry.sh .…"   4 days ago          Up 6 minutes             8083/tcp             hardware_888820_793723
dfcef19cba3b        balena/armv7hf-supervisor:v9.0.1   "./entry.sh"             4 weeks ago         Up 6 minutes (healthy)                        resin_supervisor
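
The observation that everything restarted together can be checked mechanically by comparing the "Up ..." values in the STATUS column. The following is a minimal sketch, assuming the balena ps output has been captured as text in the layout shown above. The helper names and the five-minute window are arbitrary choices for this example.

```python
import re

UNIT_SECONDS = {
    "second": 1,
    "minute": 60,
    "hour": 3600,
    "day": 86400,
    "week": 7 * 86400,
}

# Matches the STATUS column, e.g. "Up 6 minutes" or "Up About an hour".
STATUS_RE = re.compile(
    r"\bUp (?:About )?(\d+|an?) (second|minute|hour|day|week)s?\b"
)


def uptimes(ps_output):
    """Return {container_name: approximate_uptime_seconds}.

    Lines without an "Up ..." status (the header, stopped containers)
    are skipped. The container name is taken from the last column.
    """
    result = {}
    for line in ps_output.splitlines():
        match = STATUS_RE.search(line)
        if not match:
            continue
        count, unit = match.groups()
        number = 1 if count in ("a", "an") else int(count)
        name = line.split()[-1]
        result[name] = number * UNIT_SECONDS[unit]
    return result


def restarted_together(ps_output, window_seconds=300):
    """True if all running containers came up within window_seconds."""
    values = list(uptimes(ps_output).values())
    return bool(values) and max(values) - min(values) <= window_seconds
```

The uptimes are only as precise as the rounded values that the ps output prints, so this is a coarse check rather than an exact measurement.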

I have looked into the journalctl logs, and it looks like the whole device is rebooting. The dashboard also says the device has only been online for 18 minutes.

What could be the issue here?

Hi Henry,
Thanks for reaching out to us.
I can take a look at your device if you like.
I will send you a private message. If you grant support access to the device and send me the device dashboard link via PM I can take a look.
Regards
Thomas

Hi Henry,
I have made the link available to other support agents, but it looks like support access has not been enabled. Please let us know when the device comes back online and support access has been granted.
Regards
Thomas

Hi all,

I have enabled the support mode and turned on the device. Let me know what else I can do.

Thanks,

Henry

Hi,

I know it's not been long, but has anyone had a chance to take a look at the device?

Our scientist is looking to use the device for some testing, so later this morning I am going to roll back to a previous version of our software. Originally we were using resin base images rather than balena base images, and we were not experiencing any issues.

The change was originally brought about by updating from Node version 9 to version 10. There were no resin images available for Node version 10, so we needed to change to the balena images and a multi-stage build, which uses separate build and run base images.

Thanks,
Henry

I am going to force push in order to roll back the device now.

Perhaps an interesting update. When the system had downloaded the rolled-back code, it did not replace the running containers - we have update locking enabled - until a random point when all the services logged that they were already running.

20.02.19 12:14:25 (+0000) Downloaded image 'registry2.balena-cloud.com/v2/c76f103c0343716475818ff8c7d15918@sha256:c892d7df296ba3ef431cce7c679402be30846a39913906a06c7cb7b03f9a0976'
20.02.19 12:14:51 (+0000) Downloaded image 'registry2.balena-cloud.com/v2/c97175bef2590fb644e118b8c1bb5ba2@sha256:20988763ce0600a29bcf8faccfed722062661d3768bd3e30afd0456441bdc98d'
20.02.19 12:14:52 (+0000) Downloaded image 'registry2.balena-cloud.com/v2/63721429ba82b3be3ed5b5d53b451947@sha256:85cb4c21c2ac8dbc28d0978fd8ed3d2843b251d67c9e845ba43187a483058701'
20.02.19 12:15:21 (+0000) Downloaded image 'registry2.balena-cloud.com/v2/d6e5407d21206e3d6b782fbadc88f8bd@sha256:ba20a3f57f2c1da2328933dfe1061f6ac86afd9ba2c7c91ea9afb69f54b9726a'
20.02.19 12:15:31 (+0000) Downloaded image 'registry2.balena-cloud.com/v2/22064dddcd5d1a2d17319c4f46a275de@sha256:b7fbfed0589dec6d613f3cc5d9294a9e5c9918ef9fc8039c999129575fa67a81'
20.02.19 12:19:04 (+0000) Service is already running 'controller sha256:447a2760eddea45a3ad4a9c18da5cac9be8287d45ec023aebef9204a6c315940'
20.02.19 12:19:04 (+0000) Service is already running 'interface sha256:10f21a1014f58f0a4d6f8c4e79118115a63c4acd88f5a9a9c51e887dc5d16273'
20.02.19 12:19:04 (+0000) Service is already running 'camera sha256:8dfb371313a91168515574ab913d8a1b721a8e8136cc9173db759bc23e63d7d0'
20.02.19 12:19:04 (+0000) Service is already running 'proxy sha256:9491ab3748924ff6d19af6be5c66df6257138c418c8c7399203a05274956805e'
20.02.19 12:19:05 (+0000) Service is already running 'metrics sha256:ad08a3c3220a92e5ce50f55cee697bdd450de848ab457d317a66d5ed707e5e7f'
20.02.19 12:19:05 (+0000) Service is already running 'hardware sha256:e1f197d54248ca64acc3e132dda9ae30944bac1880f2907adf2587777f1414f5'

I think I have fixed the issue by restoring our previous Dockerfile, which used the original resin base images. The following Dockerfile is from one of our services that was causing the issue.

FROM balenalib/%%BALENA_MACHINE_NAME%%-node:10-stretch-build AS buildstep

RUN npm install -g npm@latest

WORKDIR /build
COPY .npmrc package.json ./
RUN npm install --only=production --unsafe-perm

FROM balenalib/%%BALENA_MACHINE_NAME%%-node:10-stretch-run

ENV INITSYSTEM on

RUN install_packages dbus udev usbutils

WORKDIR /app
COPY --from=buildstep /build/node_modules ./node_modules
COPY babel.config.js index.js package.json start.sh ./
RUN chmod +x start.sh

CMD ["./start.sh"]

Here is the original Dockerfile, which I have now reverted to and which has not yet experienced any restart issues.

FROM resin/%%RESIN_MACHINE_NAME%%-node:9

ENV INITSYSTEM on

RUN apt-get update && apt-get install dbus
RUN npm install -g npm@latest

WORKDIR /app
COPY .npmrc package.json ./
RUN JOBS=MAX npm install --silent --production --unsafe-perm && npm cache clean --force

COPY babel.config.js index.js start.sh ./
RUN chmod +x start.sh

CMD ["./start.sh"]

To give context, we have a number of similar services. I only updated 3 of the 4 Node services to the latest base images. I had not bothered updating the fourth, as it has a number of OS packages and I thought it could cause issues.

Perhaps a cause of the issue is that too much memory is consumed by the multiple base images? If I had changed all services to the new images, it might not have caused an issue.
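
One way to test the memory hypothesis would be to sample the device's memory shortly before a restart. The sketch below is only an illustration, assuming a Linux host that exposes /proc/meminfo with a MemAvailable field. The function names are invented for this example, and a low reading would not by itself prove that memory is the cause.

```python
def parse_meminfo(text):
    """Parse the contents of /proc/meminfo into {field: integer_value}.

    Most fields are reported in kB; the unit suffix is dropped.
    Lines that do not look like "Name: number [unit]" are ignored.
    """
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def available_fraction(info):
    """Fraction of total memory the kernel estimates is still available."""
    return info["MemAvailable"] / info["MemTotal"]
```

On the device, the input would come from reading the file, for example `parse_meminfo(open("/proc/meminfo").read())`, logged at intervals so the values leading up to a restart can be compared.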

Hi,
We are happy you managed to fix the problem.
We have been discussing your issue internally and would be interested in learning what caused the problem in the first place. Because the log files are not persistent, we will not find many traces on the repaired device. If you have a test device on which you can reproduce the failing configuration, we would be happy to take a look at it.

Regards,
Steve

Hi Steve,

I would be happy to deploy the code that was causing an issue on the device when I am back in the office on Monday.

The continual restarts are not very good for the hardware, as we have a moving gantry that calibrates its location every time the device starts, so I would prefer not to leave it in the problematic state for too long. Perhaps it would be possible to arrange a time at which you might be able to take a look? I am on GMT but fairly flexible on the time of day.

Thanks,

Henry

Hi Henry,

Could you please share with us a minimal piece of code to reproduce this issue on our end?
If you prefer to share in private, let us know. We’ll send a private message to collect it.

Due to resource constraints, we cannot promise to provide support at a specific time for forum posts (though we do provide this for our paid support).

Cheers…
Fırat