Every 10 minutes, all of the services running on our balenaCloud device restart. Here is a section of the logs from the device; I cannot see any error in our code that would cause all of the services to restart at the same time.
19.02.19 18:18:52 (+0000) Service is already running 'controller sha256:2bc98d7493a0b5bd98b0ed316536def139a85055b8d03d9fec05dfdc390c6529'
19.02.19 18:18:52 (+0000) Service is already running 'camera sha256:8dfb371313a91168515574ab913d8a1b721a8e8136cc9173db759bc23e63d7d0'
19.02.19 18:18:52 (+0000) Service is already running 'proxy sha256:9491ab3748924ff6d19af6be5c66df6257138c418c8c7399203a05274956805e'
19.02.19 18:18:52 (+0000) Service is already running 'interface sha256:4595c12ed503140d5aaf00312442dba8e648edc42a200e156e6c159c33367845'
19.02.19 18:18:52 (+0000) Service is already running 'metrics sha256:ad08a3c3220a92e5ce50f55cee697bdd450de848ab457d317a66d5ed707e5e7f'
19.02.19 18:18:52 (+0000) Service is already running 'hardware sha256:e1f197d54248ca64acc3e132dda9ae30944bac1880f2907adf2587777f1414f5'
Here is the output of the balena ps command. What is interesting is that all of the containers, including the supervisor, have restarted.
CONTAINER ID   IMAGE                              COMMAND                  CREATED             STATUS                   PORTS                NAMES
ac9eaf74f144   2bc98d7493a0                       "/usr/bin/entry.sh .…"   About an hour ago   Up 4 minutes             8082/tcp             controller_888819_793723
e50fcfcb70a8   8dfb371313a9                       "/usr/bin/entry.sh .…"   About an hour ago   Up 6 minutes             8084/tcp             camera_888818_793723
338d4763c857   9491ab374892                       "/usr/bin/entry.sh n…"   4 days ago          Up 6 minutes             0.0.0.0:80->80/tcp   proxy_888823_793723
46501db12e26   4595c12ed503                       "/usr/bin/entry.sh n…"   4 days ago          Up 6 minutes             8081/tcp             interface_888821_793723
d2bfb6dba414   ad08a3c3220a                       "/usr/bin/entry.sh n…"   4 days ago          Up 6 minutes                                  metrics_888822_793723
bc1446718096   e1f197d54248                       "/usr/bin/entry.sh .…"   4 days ago          Up 6 minutes             8083/tcp             hardware_888820_793723
dfcef19cba3b   balena/armv7hf-supervisor:v9.0.1   "./entry.sh"             4 weeks ago          Up 6 minutes (healthy)                       resin_supervisor
I have looked at the journalctl logs and it looks like the whole device is rebooting; the dashboard says the device has only been online for 18 minutes.
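For anyone following along, these are roughly the commands I have been running from the host OS terminal to tell a full device reboot apart from the containers simply being restarted (the supervisor unit name below matches our balenaOS release but may differ on newer ones):

# How long the host itself has been up; if this resets every ~10 minutes,
# the whole device is rebooting, not just the containers.
uptime

# Previous boots, if the journal is persisted across reboots.
journalctl --list-boots

# Recent supervisor logs (the unit is resin-supervisor on our release;
# newer releases may name it balena-supervisor).
journalctl -u resin-supervisor -n 50 --no-pager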
Hi Henry,
Thanks for reaching out to us.
I can take a look at your device if you like.
I will send you a private message; if you grant support access to the device and send me the device dashboard link via PM, I will investigate.
Regards
Thomas
Hi Henry,
I have made the link available to other support agents, but it looks like support access has not been enabled. Please let us know when the device comes back online and support access has been granted.
Regards
Thomas
I know it has not been long, but has anyone had a chance to take a look at the device?
Our scientist is looking to use the device for some testing, so later this morning I am going to roll back to a previous version of our software. Originally we were using resin base images rather than balena base images, and we were not experiencing any issues with them.
The change was originally prompted by updating from Node version 9 to version 10. There were no resin images available for Node 10, so we needed to switch to the multi-stage balena images, which use separate build and run base images.
Perhaps an interesting update: once the device had downloaded the rolled-back code it did not replace the running containers (we have update locking enabled) until, at a seemingly random point, all of the services logged that they were already running.
20.02.19 12:14:25 (+0000) Downloaded image 'registry2.balena-cloud.com/v2/c76f103c0343716475818ff8c7d15918@sha256:c892d7df296ba3ef431cce7c679402be30846a39913906a06c7cb7b03f9a0976'
20.02.19 12:14:51 (+0000) Downloaded image 'registry2.balena-cloud.com/v2/c97175bef2590fb644e118b8c1bb5ba2@sha256:20988763ce0600a29bcf8faccfed722062661d3768bd3e30afd0456441bdc98d'
20.02.19 12:14:52 (+0000) Downloaded image 'registry2.balena-cloud.com/v2/63721429ba82b3be3ed5b5d53b451947@sha256:85cb4c21c2ac8dbc28d0978fd8ed3d2843b251d67c9e845ba43187a483058701'
20.02.19 12:15:21 (+0000) Downloaded image 'registry2.balena-cloud.com/v2/d6e5407d21206e3d6b782fbadc88f8bd@sha256:ba20a3f57f2c1da2328933dfe1061f6ac86afd9ba2c7c91ea9afb69f54b9726a'
20.02.19 12:15:31 (+0000) Downloaded image 'registry2.balena-cloud.com/v2/22064dddcd5d1a2d17319c4f46a275de@sha256:b7fbfed0589dec6d613f3cc5d9294a9e5c9918ef9fc8039c999129575fa67a81'
20.02.19 12:19:04 (+0000) Service is already running 'controller sha256:447a2760eddea45a3ad4a9c18da5cac9be8287d45ec023aebef9204a6c315940'
20.02.19 12:19:04 (+0000) Service is already running 'interface sha256:10f21a1014f58f0a4d6f8c4e79118115a63c4acd88f5a9a9c51e887dc5d16273'
20.02.19 12:19:04 (+0000) Service is already running 'camera sha256:8dfb371313a91168515574ab913d8a1b721a8e8136cc9173db759bc23e63d7d0'
20.02.19 12:19:04 (+0000) Service is already running 'proxy sha256:9491ab3748924ff6d19af6be5c66df6257138c418c8c7399203a05274956805e'
20.02.19 12:19:05 (+0000) Service is already running 'metrics sha256:ad08a3c3220a92e5ce50f55cee697bdd450de848ab457d317a66d5ed707e5e7f'
20.02.19 12:19:05 (+0000) Service is already running 'hardware sha256:e1f197d54248ca64acc3e132dda9ae30944bac1880f2907adf2587777f1414f5'
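For context, the update locking we have enabled is the standard supervisor lockfile mechanism. Below is a simplified sketch of how a service can hold the lock from its start.sh; it is not our exact script, the lockfile path depends on the supervisor version, and the helper names are only for illustration.

#!/bin/bash
# Sketch only: hold the supervisor update lock while the service is running.
# Recent supervisors look for /tmp/balena/updates.lock, older ones for
# /tmp/resin/resin-updates.lock -- check the docs for the version you run.
LOCK=/tmp/balena/updates.lock

# 'lockfile' is provided by the Debian procmail package, so it must be
# installed in the image if it is not already present.
take_lock()    { lockfile "$LOCK"; }
release_lock() { rm -f "$LOCK"; }

# Always drop the lock when the service stops, otherwise updates stay blocked.
trap release_lock EXIT

take_lock
node index.js   # while this runs the supervisor will not replace the container

The supervisor still downloads a new release while the lock is held but only applies it once the lock is released, which is consistent with the images above being downloaded without the containers being replaced straight away.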
I think I have fixed the issue by restoring our previous Dockerfile, which used the original resin base images. The following Dockerfile is from one of the services that was causing the issue.
FROM balenalib/%%BALENA_MACHINE_NAME%%-node:10-stretch-build AS buildstep
RUN npm install -g npm@latest
WORKDIR /build
COPY .npmrc package.json ./
RUN npm install --only=production --unsafe-perm
FROM balenalib/%%BALENA_MACHINE_NAME%%-node:10-stretch-run
ENV INITSYSTEM on
RUN install_packages dbus udev usbutils
WORKDIR /app
COPY --from=buildstep /build/node_modules ./node_modules
COPY babel.config.js index.js package.json start.sh ./
RUN chmod +x start.sh
CMD ["./start.sh"]
Here is the original Dockerfile, which I have now reverted to using; so far it has not experienced any restart issues.
FROM resin/%%RESIN_MACHINE_NAME%%-node:9
ENV INITSYSTEM on
RUN apt-get update && apt-get install -y dbus
RUN npm install -g npm@latest
WORKDIR /app
COPY .npmrc package.json ./
RUN JOBS=MAX npm install --silent --production --unsafe-perm && npm cache clean --force
COPY babel.config.js index.js start.sh ./
RUN chmod +x start.sh
CMD ["./start.sh"]
To give some context, we have a number of similar services, and I only updated 3 of the 4 Node services to the latest base images. I had not bothered updating the remaining one, as it installs a number of OS packages and I thought the change could cause issues.
Perhaps a cause of the issue is that too much memory is consumed by running the multiple different base images? If I had changed all of the services to the new images, it might not have caused an issue.
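If it is useful, this is roughly how I have been checking memory on the device to test that theory, from the host OS terminal (assuming the on-device balena command mirrors the usual docker subcommands):

# Overall memory on the device.
free -m

# One-shot snapshot of per-container CPU and memory usage.
balena stats --no-stream

# Images currently stored on the device, to see how much space the extra base images take.
balena images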
Hi,
We are happy you managed to fix the problem.
We have been discussing your issue internally and would be interested in learning what caused it in the first place. Because the log files are not persistent, we will not find many traces on the repaired device. If you have a test device on which you can reproduce the failing configuration, we would be happy to take a look at it.
I would be happy to deploy the code that was causing the issue onto the device when I am back in the office on Monday.
As the continual restarts are not very good for the hardware (we have a moving gantry that calibrates its location every time the device starts), I would prefer not to leave it in the problematic state for too long. Perhaps it would be possible to arrange a time at which you might be able to take a look? I am on GMT but fairly flexible about the time of day.
Could you please share with us a minimal piece of code to reproduce this issue on our end?
If you prefer to share in private, let us know. We’ll send a private message to collect it.
Due to resource constraints, we cannot promise to provide support at a specific time for forum posts (though we do offer this for our paid support).