Resin data deleted

We had a situation today where one of our devices failed. On closer inspection it became clear that ResinOS had decided to delete the data volume.

The following was logged:

11.07.18 02:13:59 (+1000) Removing volume 'resin-data'

It then errored repeatedly (looping on attempts to remove the volume) because a file was held open; our application, running in a container in dev mode, was not designed to have its data destroyed like this.

The device has never changed applications. The application is a classic (single-service) type. On the device summary page in the resin dashboard, no services are currently displayed.

Device UUID: dcea8b0b6e97154dfecb4f09c929bc34 (support access granted).

If this was in production this would have been a critical failure, so obviously this concerns me.

Hey @SplitIce, I took a look at the device, and it appears it was rebooted after the volume was removed; is that correct? I was going to check the supervisor logs to see what caused the volume removal, but they are no longer present.

After the volume was deleted (the removal got stuck in an infinite loop because we had a swapfile on that volume), I gave the device a reboot and moved it to a newer container release.

It was the new container release that led to the device resuming operation.

@CameronDiver Never mind it failed again.

I have to go now, so I’ll leave the device as is.

@SplitIce I’ve been looking into this device as best I can, and there isn’t anything obvious to point to why this would occur. Is there any information you can provide about how to reproduce it, or how this device may differ from a standard resin device in any way?

This isn’t something we’ve ever seen before, so any information should be helpful.

@CameronDiver

There is nothing special or new about this device that should be relevant. I don’t have reproduction steps; as far as I know we did nothing at the time. It looks like the service was deleted on the resin cloud end (no services displayed in the UI), and that does not even look possible for us to do.

This device runs the pre-release resin-allwinner BSP (as do all the other devices in that application). However, this had never happened before this date. The device is running kernel 4.14.53, a step up from the 4.14.52 running on all our others.

This should have no effect on the operation of resin-supervisor, however.

For the sake of anyone googling, here are the resin-supervisor logs from the time the device decided to delete all user data:

[2018-07-10T17:26:58.844Z] Applying target state
[2018-07-10T17:26:59.710Z] Event: Service kill {"service":{"appId":**,"serviceId":**,"serviceName":"main","releaseId":**,"image":"sha256:a8f***efd","labels":{"io.resin.app-id":"**","io.resin.architecture":"armv7hf","io.resin.device-type":"raspberry-pi2","io.resin.features.dbus":"1","io.resin.features.firmware":"1","io.resin.features.kernel-modules":"1","io.resin.features.resin-api":"1","io.resin.features.supervisor-api":"1","io.resin.qemu.version":"2.9.0.resin1-arm","io.resin.service-id":"2092","io.resin.service-name":"main","io.resin.supervised":"true"}}}
[2018-07-10T17:27:03.470Z] Event: Service exit {"service":{"appId":**,"serviceId":2092,"serviceName":"main","releaseId":**,"image":"sha256:a8***efd","labels":{"io.resin.app-id":"**","io.resin.architecture":"armv7hf","io.resin.device-type":"raspberry-pi2","io.resin.features.dbus":"1","io.resin.features.firmware":"1","io.resin.features.kernel-modules":"1","io.resin.features.resin-api":"1","io.resin.features.supervisor-api":"1","io.resin.qemu.version":"2.9.0.resin1-arm","io.resin.service-id":"2092","io.resin.service-name":"main","io.resin.supervised":"true"}}}
[2018-07-10T17:27:03.584Z] Event: Service stop {"service":{"appId":**,"serviceId":**,"serviceName":"main","releaseId":**,"image":"sha256:a8f**efd","labels":{"io.resin.app-id":"**","io.resin.architecture":"armv7hf","io.resin.device-type":"raspberry-pi2","io.resin.features.dbus":"1","io.resin.features.firmware":"1","io.resin.features.kernel-modules":"1","io.resin.features.resin-api":"1","io.resin.features.supervisor-api":"1","io.resin.qemu.version":"2.9.0.resin1-arm","io.resin.service-id":"2092","io.resin.service-name":"main","io.resin.supervised":"true"}}}
[2018-07-10T17:27:04.818Z] Event: Image removal {"image":{"name":"registry2.resin.io/v2/1fd**39","appId":**,"serviceId":**,"serviceName":"main","imageId":**,"releaseId":**,"dependent":0}}
[2018-07-10T17:27:13.060Z] Event: Image removed {"image":{"name":"registry2.resin.io/v2/1fd**39","appId":**,"serviceId":**,"serviceName":"main","imageId":**,"releaseId":**,"dependent":0}}
[2018-07-10T17:27:13.626Z] Event: Volume removal {}
[2018-07-10T17:27:14.251Z] Finished applying target state
[2018-07-10T17:27:14.254Z] Apply success!

@CameronDiver Something I just noticed: in the resin UI, the pin indicating that the device is set to a specific commit is gone on that device.

According to the UI, however, the device remains on the previously pinned commit (the actual device has no service).

Also @CameronDiver

The next bug on my list to report is also showing on this hub at the moment; perhaps the two are related? Maybe the supervisor was unable to communicate correctly with balena, and that’s why it decided to delete the volume.

# balena exec -it 2ec2246c603b /bin/bash
rpc error: code = Unknown desc = invalid character 'e' looking for beginning of value

We see this all the time on many devices, typically when there is less than 25% memory free.

root@dcea8b0:~# free -m
             total       used       free     shared    buffers     cached
Mem:        504284     445464      58820      10616       6424     112424
-/+ buffers/cache:     326616     177668
Swap:            0          0          0

It’s not exactly memory pressure, but perhaps fragmentation? I can see we are not the first people to hit this issue either; there are a few posts on this forum regarding this particular error.
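One standard way to test the fragmentation theory is /proc/buddyinfo, which lists free-block counts per allocation order; if the high-order columns are near zero, large contiguous allocations can fail despite plenty of total free memory. A small helper to sum the high-order counts (a sketch, assuming the standard buddyinfo column layout with 4KB pages; not something from the resin toolchain):

```shell
#!/bin/sh
# Sum free blocks of order >= 4 (64KB+ contiguous, with 4KB pages) from
# buddyinfo-format input. Columns 5..15 hold orders 0..10, so order >= 4
# starts at column 9.
high_order_free() {
  awk '{ for (i = 9; i <= NF; i++) total += $i } END { print total + 0 }'
}
# On a device: high_order_free < /proc/buddyinfo
```

A result at or near zero while `free` still reports plenty of memory would point at fragmentation rather than raw memory pressure.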

Any of the following actions typically fix the issue (temporarily):

  • Killing the daemon running in the container (ruby); this application has an RSS of just under 100MB
  • Restarting the container with balena restart
  • Restarting balena with systemctl restart balena

Hi @SplitIce

I just wanted to update you, to let you know that we are still working on this one.

Could you tell me if you’ve ever patched the commit value for this application before?

Thanks.

@richbayliss

Yes, all our devices move forward through commits via the staged-releases scripts; as I understand it, these scripts set the device’s build ID.

Also, it deleted the volume again today. Here are the supervisor container logs for the last 15 or so minutes before the volume deletion:

[2018-07-15T04:33:14.715Z] delta(registry2.resin.io/v2/c380f7859f3bc7ab3f478a112b70e9f7@sha256:c6b22b07cc97a400fc2b4f682f578862dd1b9548a18a7ea584e25576be682824): Starting delta to registry2.resin.io/v2/e5645a1694688aeb7b5a0233295e1d44@sha256:61b23c5ed62a84b3c8c3089655a99cc7034b9fe4251d169d670f28fecd5f8bb5
[2018-07-15T04:33:35.161Z] delta(registry2.resin.io/v2/c380f7859f3bc7ab3f478a112b70e9f7@sha256:c6b22b07cc97a400fc2b4f682f578862dd1b9548a18a7ea584e25576be682824): Delta failed with: Error: Delta server is still processing the delta, will retry
[2018-07-15T04:33:35.173Z] Event: Image download error {"error":{"message":"Delta server is still processing the delta, will retry","stack":"Error: Delta server is still processing the delta, will retry\n    at /usr/src/app/dist/app.js:618:34021\n    at u (/usr/src/app/dist/app.js:9:75210)\n    at A._settlePromiseFromHandler (/usr/src/app/dist/app.js:315:128550)\n    at A._settlePromise (/usr/src/app/dist/app.js:315:129450)\n    at A._settlePromise0 (/usr/src/app/dist/app.js:315:130149)\n    at A._settlePromises (/usr/src/app/dist/app.js:315:131476)\n    at A._fulfill (/usr/src/app/dist/app.js:315:130519)\n    at s._resolve (/usr/src/app/dist/app.js:315:139433)\n    at s._promiseFulfilled (/usr/src/app/dist/app.js:315:139774)\n    at s._iterate (/usr/src/app/dist/app.js:315:139254)\n    at s.t [as _init] (/usr/src/app/dist/app.js:315:138475)\n    at A._settlePromise (/usr/src/app/dist/app.js:315:129488)\n    at A._settlePromise0 (/usr/src/app/dist/app.js:315:130149)\n    at A._settlePromises (/usr/src/app/dist/app.js:315:131476)\n    at A._fulfill (/usr/src/app/dist/app.js:315:130519)\n    at U._callback (/usr/src/app/dist/app.js:118:48500)"},"image":{"name":"registry2.resin.io/v2/e5645a1694688aeb7b5a0233295e1d44@sha256:61b23c5ed62a84b3c8c3089655a99cc7034b9fe4251d169d670f28fecd5f8bb5","appId":738415,"serviceId":2092,"serviceName":"main","imageId":367029,"releaseId":537738,"dependent":0,"dockerImageId":null}}
[2018-07-15T04:33:35.714Z] Event: Delta image download {"image":{"name":"registry2.resin.io/v2/e5645a1694688aeb7b5a0233295e1d44@sha256:61b23c5ed62a84b3c8c3089655a99cc7034b9fe4251d169d670f28fecd5f8bb5","appId":738415,"serviceId":2092,"serviceName":"main","imageId":367029,"releaseId":537738,"dependent":0,"dockerImageId":null}}
[2018-07-15T04:33:35.745Z] delta(registry2.resin.io/v2/c380f7859f3bc7ab3f478a112b70e9f7@sha256:c6b22b07cc97a400fc2b4f682f578862dd1b9548a18a7ea584e25576be682824): Starting delta to registry2.resin.io/v2/e5645a1694688aeb7b5a0233295e1d44@sha256:61b23c5ed62a84b3c8c3089655a99cc7034b9fe4251d169d670f28fecd5f8bb5
[2018-07-15T04:33:42.196Z] delta(registry2.resin.io/v2/c380f7859f3bc7ab3f478a112b70e9f7@sha256:c6b22b07cc97a400fc2b4f682f578862dd1b9548a18a7ea584e25576be682824): Delta failed with: Error: Delta server is still processing the delta, will retry
[2018-07-15T04:33:42.206Z] Event: Image download error {"error":{"message":"Delta server is still processing the delta, will retry","stack":"Error: Delta server is still processing the delta, will retry\n    at /usr/src/app/dist/app.js:618:34021\n    at u (/usr/src/app/dist/app.js:9:75210)\n    at A._settlePromiseFromHandler (/usr/src/app/dist/app.js:315:128550)\n    at A._settlePromise (/usr/src/app/dist/app.js:315:129450)\n    at A._settlePromise0 (/usr/src/app/dist/app.js:315:130149)\n    at A._settlePromises (/usr/src/app/dist/app.js:315:131476)\n    at A._fulfill (/usr/src/app/dist/app.js:315:130519)\n    at s._resolve (/usr/src/app/dist/app.js:315:139433)\n    at s._promiseFulfilled (/usr/src/app/dist/app.js:315:139774)\n    at s._iterate (/usr/src/app/dist/app.js:315:139254)\n    at s.t [as _init] (/usr/src/app/dist/app.js:315:138475)\n    at A._settlePromise (/usr/src/app/dist/app.js:315:129488)\n    at A._settlePromise0 (/usr/src/app/dist/app.js:315:130149)\n    at A._settlePromises (/usr/src/app/dist/app.js:315:131476)\n    at A._fulfill (/usr/src/app/dist/app.js:315:130519)\n    at U._callback (/usr/src/app/dist/app.js:118:48500)"},"image":{"name":"registry2.resin.io/v2/e5645a1694688aeb7b5a0233295e1d44@sha256:61b23c5ed62a84b3c8c3089655a99cc7034b9fe4251d169d670f28fecd5f8bb5","appId":738415,"serviceId":2092,"serviceName":"main","imageId":367029,"releaseId":537738,"dependent":0,"dockerImageId":null}}
[2018-07-15T04:33:43.125Z] Event: Delta image download {"image":{"name":"registry2.resin.io/v2/e5645a1694688aeb7b5a0233295e1d44@sha256:61b23c5ed62a84b3c8c3089655a99cc7034b9fe4251d169d670f28fecd5f8bb5","appId":738415,"serviceId":2092,"serviceName":"main","imageId":367029,"releaseId":537738,"dependent":0,"dockerImageId":null}}
[2018-07-15T04:33:43.155Z] delta(registry2.resin.io/v2/c380f7859f3bc7ab3f478a112b70e9f7@sha256:c6b22b07cc97a400fc2b4f682f578862dd1b9548a18a7ea584e25576be682824): Starting delta to registry2.resin.io/v2/e5645a1694688aeb7b5a0233295e1d44@sha256:61b23c5ed62a84b3c8c3089655a99cc7034b9fe4251d169d670f28fecd5f8bb5
[2018-07-15T04:34:04.334Z] delta(registry2.resin.io/v2/c380f7859f3bc7ab3f478a112b70e9f7@sha256:c6b22b07cc97a400fc2b4f682f578862dd1b9548a18a7ea584e25576be682824): Delta failed with: Error: Delta server is still processing the delta, will retry
[2018-07-15T04:34:04.360Z] Event: Image download error {"error":{"message":"Delta server is still processing the delta, will retry","stack":"Error: Delta server is still processing the delta, will retry\n    at /usr/src/app/dist/app.js:618:34021\n    at u (/usr/src/app/dist/app.js:9:75210)\n    at A._settlePromiseFromHandler (/usr/src/app/dist/app.js:315:128550)\n    at A._settlePromise (/usr/src/app/dist/app.js:315:129450)\n    at A._settlePromise0 (/usr/src/app/dist/app.js:315:130149)\n    at A._settlePromises (/usr/src/app/dist/app.js:315:131476)\n    at A._fulfill (/usr/src/app/dist/app.js:315:130519)\n    at s._resolve (/usr/src/app/dist/app.js:315:139433)\n    at s._promiseFulfilled (/usr/src/app/dist/app.js:315:139774)\n    at s._iterate (/usr/src/app/dist/app.js:315:139254)\n    at s.t [as _init] (/usr/src/app/dist/app.js:315:138475)\n    at A._settlePromise (/usr/src/app/dist/app.js:315:129488)\n    at A._settlePromise0 (/usr/src/app/dist/app.js:315:130149)\n    at A._settlePromises (/usr/src/app/dist/app.js:315:131476)\n    at A._fulfill (/usr/src/app/dist/app.js:315:130519)\n    at U._callback (/usr/src/app/dist/app.js:118:48500)"},"image":{"name":"registry2.resin.io/v2/e5645a1694688aeb7b5a0233295e1d44@sha256:61b23c5ed62a84b3c8c3089655a99cc7034b9fe4251d169d670f28fecd5f8bb5","appId":738415,"serviceId":2092,"serviceName":"main","imageId":367029,"releaseId":537738,"dependent":0,"dockerImageId":null}}
[2018-07-15T04:34:05.540Z] Event: Delta image download {"image":{"name":"registry2.resin.io/v2/e5645a1694688aeb7b5a0233295e1d44@sha256:61b23c5ed62a84b3c8c3089655a99cc7034b9fe4251d169d670f28fecd5f8bb5","appId":738415,"serviceId":2092,"serviceName":"main","imageId":367029,"releaseId":537738,"dependent":0,"dockerImageId":null}}
[2018-07-15T04:34:05.579Z] delta(registry2.resin.io/v2/c380f7859f3bc7ab3f478a112b70e9f7@sha256:c6b22b07cc97a400fc2b4f682f578862dd1b9548a18a7ea584e25576be682824): Starting delta to registry2.resin.io/v2/e5645a1694688aeb7b5a0233295e1d44@sha256:61b23c5ed62a84b3c8c3089655a99cc7034b9fe4251d169d670f28fecd5f8bb5
[2018-07-15T04:34:08.023Z] delta(registry2.resin.io/v2/c380f7859f3bc7ab3f478a112b70e9f7@sha256:c6b22b07cc97a400fc2b4f682f578862dd1b9548a18a7ea584e25576be682824): Applying rsync delta...
[2018-07-15T04:34:09.972Z] delta(registry2.resin.io/v2/c380f7859f3bc7ab3f478a112b70e9f7@sha256:c6b22b07cc97a400fc2b4f682f578862dd1b9548a18a7ea584e25576be682824): Extracted image config
[2018-07-15T04:34:10.911Z] delta(registry2.resin.io/v2/c380f7859f3bc7ab3f478a112b70e9f7@sha256:c6b22b07cc97a400fc2b4f682f578862dd1b9548a18a7ea584e25576be682824): Created empty image 17fc3966d0221aa7a0343840d26405054c54d8bccd1fcf6fe64f501dcc800689
[2018-07-15T04:34:28.192Z] delta(registry2.resin.io/v2/c380f7859f3bc7ab3f478a112b70e9f7@sha256:c6b22b07cc97a400fc2b4f682f578862dd1b9548a18a7ea584e25576be682824): Hard-linked files from '/var/lib/docker/overlay2/tmp-cbc8e04a04861454c8bfa4578d1bfaf85b2b081282115ce06f41db4c37ba8a2a-2adJhbhW/' to '/var/lib/docker/overlay2/ea0154acccd27884cca88bf634f2716d1ae824c740c9167c63db97236d87caf3/diff/'
[2018-07-15T04:34:28.193Z] delta(registry2.resin.io/v2/c380f7859f3bc7ab3f478a112b70e9f7@sha256:c6b22b07cc97a400fc2b4f682f578862dd1b9548a18a7ea584e25576be682824): Spawning rsync with arguments --archive --delete --read-batch - /var/lib/docker/overlay2/ea0154acccd27884cca88bf634f2716d1ae824c740c9167c63db97236d87caf3/diff/
[2018-07-15T04:35:08.715Z] delta(registry2.resin.io/v2/c380f7859f3bc7ab3f478a112b70e9f7@sha256:c6b22b07cc97a400fc2b4f682f578862dd1b9548a18a7ea584e25576be682824): Batch input stream ended; waiting for rsync...
[2018-07-15T04:35:08.730Z] delta(registry2.resin.io/v2/c380f7859f3bc7ab3f478a112b70e9f7@sha256:c6b22b07cc97a400fc2b4f682f578862dd1b9548a18a7ea584e25576be682824): rsync exited cleanly
[2018-07-15T04:35:08.732Z] delta(registry2.resin.io/v2/c380f7859f3bc7ab3f478a112b70e9f7@sha256:c6b22b07cc97a400fc2b4f682f578862dd1b9548a18a7ea584e25576be682824): rsync exited successfully
[2018-07-15T04:35:08.733Z] delta(registry2.resin.io/v2/c380f7859f3bc7ab3f478a112b70e9f7@sha256:c6b22b07cc97a400fc2b4f682f578862dd1b9548a18a7ea584e25576be682824): fsync'ing...
[2018-07-15T04:35:09.045Z] delta(registry2.resin.io/v2/c380f7859f3bc7ab3f478a112b70e9f7@sha256:c6b22b07cc97a400fc2b4f682f578862dd1b9548a18a7ea584e25576be682824): All done. Image ID: 17fc3966d0221aa7a0343840d26405054c54d8bccd1fcf6fe64f501dcc800689
[2018-07-15T04:35:09.146Z] delta(registry2.resin.io/v2/c380f7859f3bc7ab3f478a112b70e9f7@sha256:c6b22b07cc97a400fc2b4f682f578862dd1b9548a18a7ea584e25576be682824): Delta applied successfully
[2018-07-15T04:35:09.270Z] Event: Image downloaded {"image":{"name":"registry2.resin.io/v2/e5645a1694688aeb7b5a0233295e1d44@sha256:61b23c5ed62a84b3c8c3089655a99cc7034b9fe4251d169d670f28fecd5f8bb5","appId":738415,"serviceId":2092,"serviceName":"main","imageId":367029,"releaseId":537738,"dependent":0,"dockerImageId":null}}
[2018-07-15T04:35:09.637Z] Event: Service kill {"service":{"appId":738415,"serviceId":2092,"serviceName":"main","releaseId":536884,"image":"sha256:cbc8e04a04861454c8bfa4578d1bfaf85b2b081282115ce06f41db4c37ba8a2a","labels":{"io.resin.app-id":"738415","io.resin.architecture":"armv7hf","io.resin.device-type":"raspberry-pi2","io.resin.features.dbus":"1","io.resin.features.firmware":"1","io.resin.features.kernel-modules":"1","io.resin.features.resin-api":"1","io.resin.features.supervisor-api":"1","io.resin.qemu.version":"2.9.0.resin1-arm","io.resin.service-id":"2092","io.resin.service-name":"main","io.resin.supervised":"true"}}}
[2018-07-15T04:35:13.668Z] Event: Service exit {"service":{"appId":738415,"serviceId":2092,"serviceName":"main","releaseId":536884,"image":"sha256:cbc8e04a04861454c8bfa4578d1bfaf85b2b081282115ce06f41db4c37ba8a2a","labels":{"io.resin.app-id":"738415","io.resin.architecture":"armv7hf","io.resin.device-type":"raspberry-pi2","io.resin.features.dbus":"1","io.resin.features.firmware":"1","io.resin.features.kernel-modules":"1","io.resin.features.resin-api":"1","io.resin.features.supervisor-api":"1","io.resin.qemu.version":"2.9.0.resin1-arm","io.resin.service-id":"2092","io.resin.service-name":"main","io.resin.supervised":"true"}}}
[2018-07-15T04:35:13.787Z] Event: Service stop {"service":{"appId":738415,"serviceId":2092,"serviceName":"main","releaseId":536884,"image":"sha256:cbc8e04a04861454c8bfa4578d1bfaf85b2b081282115ce06f41db4c37ba8a2a","labels":{"io.resin.app-id":"738415","io.resin.architecture":"armv7hf","io.resin.device-type":"raspberry-pi2","io.resin.features.dbus":"1","io.resin.features.firmware":"1","io.resin.features.kernel-modules":"1","io.resin.features.resin-api":"1","io.resin.features.supervisor-api":"1","io.resin.qemu.version":"2.9.0.resin1-arm","io.resin.service-id":"2092","io.resin.service-name":"main","io.resin.supervised":"true"}}}
[2018-07-15T04:35:14.944Z] Event: Image removal {"image":{"name":"registry2.resin.io/v2/c380f7859f3bc7ab3f478a112b70e9f7@sha256:c6b22b07cc97a400fc2b4f682f578862dd1b9548a18a7ea584e25576be682824","appId":738415,"serviceId":2092,"serviceName":"main","imageId":365929,"releaseId":536884,"dependent":0}}
[2018-07-15T04:35:22.178Z] Event: Image removed {"image":{"name":"registry2.resin.io/v2/c380f7859f3bc7ab3f478a112b70e9f7@sha256:c6b22b07cc97a400fc2b4f682f578862dd1b9548a18a7ea584e25576be682824","appId":738415,"serviceId":2092,"serviceName":"main","imageId":365929,"releaseId":536884,"dependent":0}}
[2018-07-15T04:35:22.853Z] Event: Service install {"service":{"appId":738415,"serviceId":2092,"serviceName":"main","releaseId":537738,"image":"sha256:17fc3966d0221aa7a0343840d26405054c54d8bccd1fcf6fe64f501dcc800689","labels":{"io.resin.features.dbus":"1","io.resin.features.firmware":"1","io.resin.features.kernel-modules":"1","io.resin.features.resin-api":"1","io.resin.features.supervisor-api":"1","io.resin.architecture":"armv7hf","io.resin.device-type":"raspberry-pi2","io.resin.qemu.version":"2.9.0.resin1-arm","io.resin.supervised":"true","io.resin.app-id":"738415","io.resin.service-id":"2092","io.resin.service-name":"main"}}}
[2018-07-15T04:35:23.149Z] Event: Service installed {"service":{"appId":738415,"serviceId":2092,"serviceName":"main","releaseId":537738,"image":"sha256:17fc3966d0221aa7a0343840d26405054c54d8bccd1fcf6fe64f501dcc800689","labels":{"io.resin.features.dbus":"1","io.resin.features.firmware":"1","io.resin.features.kernel-modules":"1","io.resin.features.resin-api":"1","io.resin.features.supervisor-api":"1","io.resin.architecture":"armv7hf","io.resin.device-type":"raspberry-pi2","io.resin.qemu.version":"2.9.0.resin1-arm","io.resin.supervised":"true","io.resin.app-id":"738415","io.resin.service-id":"2092","io.resin.service-name":"main"}}}
[2018-07-15T04:35:23.177Z] Event: Service start {"service":{"appId":738415,"serviceId":2092,"serviceName":"main","releaseId":537738,"image":"sha256:17fc3966d0221aa7a0343840d26405054c54d8bccd1fcf6fe64f501dcc800689","labels":{"io.resin.features.dbus":"1","io.resin.features.firmware":"1","io.resin.features.kernel-modules":"1","io.resin.features.resin-api":"1","io.resin.features.supervisor-api":"1","io.resin.architecture":"armv7hf","io.resin.device-type":"raspberry-pi2","io.resin.qemu.version":"2.9.0.resin1-arm","io.resin.supervised":"true","io.resin.app-id":"738415","io.resin.service-id":"2092","io.resin.service-name":"main"}}}
[2018-07-15T04:35:25.094Z] Event: Service started {"service":{"appId":738415,"serviceId":2092,"serviceName":"main","releaseId":537738,"image":"sha256:17fc3966d0221aa7a0343840d26405054c54d8bccd1fcf6fe64f501dcc800689","labels":{"io.resin.features.dbus":"1","io.resin.features.firmware":"1","io.resin.features.kernel-modules":"1","io.resin.features.resin-api":"1","io.resin.features.supervisor-api":"1","io.resin.architecture":"armv7hf","io.resin.device-type":"raspberry-pi2","io.resin.qemu.version":"2.9.0.resin1-arm","io.resin.supervised":"true","io.resin.app-id":"738415","io.resin.service-id":"2092","io.resin.service-name":"main"}}}
[2018-07-15T04:35:25.802Z] Finished applying target state
[2018-07-15T04:35:25.810Z] Apply success!
[2018-07-15T07:39:11.627Z] Event: Update notification {}
[2018-07-15T07:39:58.737Z] Event: Update notification {}
[2018-07-15T07:47:24.302Z] Event: Update notification {}
[2018-07-15T07:47:26.488Z] Event: Update notification {}
[2018-07-15T07:48:26.443Z] Event: Update notification {}
[2018-07-15T07:48:34.510Z] Event: Update notification {}
[2018-07-15T07:48:41.804Z] Event: Update notification {}
[2018-07-15T07:48:49.022Z] Applying target state
[2018-07-15T07:48:49.723Z] Event: Service kill {"service":{"appId":738415,"serviceId":2092,"serviceName":"main","releaseId":537738,"image":"sha256:17fc3966d0221aa7a0343840d26405054c54d8bccd1fcf6fe64f501dcc800689","labels":{"io.resin.app-id":"738415","io.resin.architecture":"armv7hf","io.resin.device-type":"raspberry-pi2","io.resin.features.dbus":"1","io.resin.features.firmware":"1","io.resin.features.kernel-modules":"1","io.resin.features.resin-api":"1","io.resin.features.supervisor-api":"1","io.resin.qemu.version":"2.9.0.resin1-arm","io.resin.service-id":"2092","io.resin.service-name":"main","io.resin.supervised":"true"}}}
[2018-07-15T07:49:00.963Z] Event: Service exit {"service":{"appId":738415,"serviceId":2092,"serviceName":"main","releaseId":537738,"image":"sha256:17fc3966d0221aa7a0343840d26405054c54d8bccd1fcf6fe64f501dcc800689","labels":{"io.resin.app-id":"738415","io.resin.architecture":"armv7hf","io.resin.device-type":"raspberry-pi2","io.resin.features.dbus":"1","io.resin.features.firmware":"1","io.resin.features.kernel-modules":"1","io.resin.features.resin-api":"1","io.resin.features.supervisor-api":"1","io.resin.qemu.version":"2.9.0.resin1-arm","io.resin.service-id":"2092","io.resin.service-name":"main","io.resin.supervised":"true"}}}
[2018-07-15T07:49:01.097Z] Event: Service stop {"service":{"appId":738415,"serviceId":2092,"serviceName":"main","releaseId":537738,"image":"sha256:17fc3966d0221aa7a0343840d26405054c54d8bccd1fcf6fe64f501dcc800689","labels":{"io.resin.app-id":"738415","io.resin.architecture":"armv7hf","io.resin.device-type":"raspberry-pi2","io.resin.features.dbus":"1","io.resin.features.firmware":"1","io.resin.features.kernel-modules":"1","io.resin.features.resin-api":"1","io.resin.features.supervisor-api":"1","io.resin.qemu.version":"2.9.0.resin1-arm","io.resin.service-id":"2092","io.resin.service-name":"main","io.resin.supervised":"true"}}}
[2018-07-15T07:49:01.551Z] Event: Image removal {"image":{"name":"registry2.resin.io/v2/e5645a1694688aeb7b5a0233295e1d44@sha256:61b23c5ed62a84b3c8c3089655a99cc7034b9fe4251d169d670f28fecd5f8bb5","appId":738415,"serviceId":2092,"serviceName":"main","imageId":367029,"releaseId":537738,"dependent":0}}
[2018-07-15T07:49:10.822Z] Event: Image removed {"image":{"name":"registry2.resin.io/v2/e5645a1694688aeb7b5a0233295e1d44@sha256:61b23c5ed62a84b3c8c3089655a99cc7034b9fe4251d169d670f28fecd5f8bb5","appId":738415,"serviceId":2092,"serviceName":"main","imageId":367029,"releaseId":537738,"dependent":0}}
[2018-07-15T07:49:11.355Z] Event: Volume removal {}
[2018-07-15T07:49:11.998Z] Finished applying target state
[2018-07-15T07:49:12.001Z] Apply success!

Thanks for the extra data, @SplitIce. To update you as transparently as possible: we have a theory, and we’d like to see if it makes sense to you.

You mentioned the staged-releases repository earlier; it has two different methods for staging releases. You can either pin the device, which changes the should_be_running__release field on the device API resource, or set the entire application to a release, which changes the commit value on the application resource. That commit value is a plain string which should match a successful release in your application. The reason it isn’t a proper foreign key is down to a limitation of the initial version of our API (those limitations no longer exist, and API maintainer @richbayliss is going to convert this soon).

Given the above, our working theory is that the commit value on the application was at some point set to a value for a release that does not exist (for example, the ID instead of the commit). Is that something that could have happened?
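For clarity, the two methods can be sketched as raw API calls. This is a hypothetical reconstruction pieced together from the field names mentioned here (should_be_running__release on the device resource, commit on the application resource), not the actual staged-releases implementation; the IDs are placeholders:

```shell
#!/bin/sh
# Build the PATCH bodies so they can be inspected before sending.
device_pin_body() { printf '{"should_be_running__release": %s}' "$1"; }
app_commit_body() { printf '{"commit": "%s"}' "$1"; }

# Method 1: pin a single device to a release (device resource), e.g.:
#   curl -X PATCH -H "Authorization: Bearer $API_TOKEN" \
#        -H "Content-Type: application/json" \
#        --data "$(device_pin_body 537738)" \
#        "https://api.resin.io/v4/device(1081392)"

# Method 2: move the whole application by writing the commit *string*,
# the unvalidated free-text field at the root of this issue:
#   curl -X PATCH -H "Authorization: Bearer $API_TOKEN" \
#        -H "Content-Type: application/json" \
#        --data "$(app_commit_body 'abc123')" \
#        "https://api.resin.io/v4/application(738415)"
```

Because method 2 writes a plain string, nothing stops it from being set to a value that matches no release.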

@CameronDiver There could be conditions in which that’s possible. We move devices forward by running set-device-to-a-build.sh with the latest successful build for the device’s branch.

So, to confirm I understand correctly: if I were to set the commit to a random ID (say 123), the device would cease running any release (as seen here)? Or would the ID have to be valid (say, an ID in use by another application)?

We use a basic PHP script to call the underlying bash scripts. I’ll see what we can do about implementing the curl calls there and doing more extensive validation.

@CameronDiver It looks like you might be on to something.

root@hookserver:/opt/git/build-resin/staged-releases# bash set-device-to-a-build.sh ./hub-app.env dcea8b0b6e97154dfecb4f09c929bc34 123
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   120  100   120    0     0    361      0 --:--:-- --:--:-- --:--:--   362
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100     8  100     8    0     0     25      0 --:--:-- --:--:-- --:--:--    25
setting device dcea8b0b6e97154dfecb4f09c929bc34 (1081392) to commit "123" with buildID "null"
OK

I’ll replace the staged releases scripts with PHP curl calls on our end, and add some logging - the API looks simple enough.

@SplitIce Yes, currently any random value is accepted as the release. This is being patched as we speak, so it should no longer be an issue, but until then you can validate a commit with a curl query along the lines of:

curl -H "Authorization: Bearer $API_TOKEN" "https://api.resin.io/v4/release?\$filter=(status%20eq%20'success'%20and%20belongs_to__application%20eq%20$APP_ID%20and%20commit%20eq%20'$COMMIT')"

Note that without the url-encoding, this is:

https://api.resin.io/v4/release?$filter=(status eq 'success' and belongs_to__application eq $APP_ID and commit eq '$COMMIT')

If this returns a release, you can use that $COMMIT value; otherwise it’s not valid.
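A minimal wrapper around that query, for anyone scripting this. The `release_exists` check just greps the response body for a release record; the exact response shape, and the `APP_ID`/`API_TOKEN` variables, are assumptions rather than anything confirmed in this thread:

```shell
#!/bin/sh
# True if the response body contains at least one release record.
release_exists() {
  printf '%s' "$1" | grep -q '"id"'
}

# Refuse to use a commit unless the query above returns a release.
validate_commit() {
  commit="$1"
  body=$(curl -s -H "Authorization: Bearer $API_TOKEN" \
    "https://api.resin.io/v4/release?\$filter=(status%20eq%20'success'%20and%20belongs_to__application%20eq%20$APP_ID%20and%20commit%20eq%20'$commit')")
  if release_exists "$body"; then
    echo "commit $commit is valid"
  else
    echo "refusing to stage unknown commit $commit" >&2
    return 1
  fi
}
```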

Really glad to hear we found this out, and as I say, we’re updating our backend to avoid this situation in the future.

@CameronDiver That’s a good idea for get-build-id.sh; we do something very similar to find the latest successful build as we iterate the branches. As a quick solution I’ll add that as a safety check, along with a check for the resulting null, in set-device-to-a-build.sh.
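The null check I have in mind is trivial; sketching it here in case it helps anyone else using the staged-releases scripts (`check_build_id` is my own hypothetical helper, not part of those scripts):

```shell
#!/bin/sh
# Guard for set-device-to-a-build.sh: bail out before PATCHing the device
# if the build lookup returned nothing or the literal string "null"
# (as seen in the 'buildID "null"' output above).
check_build_id() {
  case "$1" in
    ""|null)
      echo "refusing to stage: invalid buildID '$1'" >&2
      return 1
      ;;
  esac
}
```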

Thanks for finding this, and glad I could help you find an API shortfall.

@CameronDiver I think I found the place in our script where an invalid build ID could previously be passed through when staging devices (specifically for that release), of which this device was one.

Thanks for your help.

Also @CameronDiver, re: the other issue; we found that the balena exec problem goes away if a small swapfile (e.g. 64MB) is enabled. Strangely, at the time the issue occurs the device has over 150MB of RAM free (free + cached).

If you want a device to replicate that on, let me know. I suspect balena makes a large virtual memory allocation (perhaps never written to, so never actually allocated in physical memory/RSS) that is being rejected due to lack of free memory.
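Two standard Linux knobs worth inspecting under that theory (nothing resin-specific here, just my suggestion): the kernel overcommit policy and the commit accounting in /proc/meminfo. Under strict accounting, a large virtual allocation can be refused even though it would never be touched:

```shell
#!/bin/sh
# 0 = heuristic overcommit, 1 = always allow, 2 = strict (never overcommit)
cat /proc/sys/vm/overcommit_memory
# Compare CommitLimit against Committed_AS; allocations fail when the
# latter would exceed the former under strict accounting.
grep -E 'CommitLimit|Committed_AS' /proc/meminfo
```

If Committed_AS is close to CommitLimit when the exec fails, that would fit the swapfile observation, since adding swap raises CommitLimit.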

I did some more debugging into the balena exec failures.

  1. It’s the tty (-t) flag that hides the actual error message. The actual error on the client side is:
rpc error: code = Unknown desc = oci runtime error: exec failed: container_linux.go:263: starting container process caused "process_linux.go:95: writing config to pipe caused \"write init-p: broken pipe\""
  2. The call usually succeeds if there is 157.5MB of available RAM (157508kB was the lowest recorded success) and fails with anything less. No failures were recorded above 160MB available.

  3. I checked with the Docker daemon’s debug logging enabled, and nothing of interest was logged:

Jul 17 01:51:32 dcea8b0 healthdog[831]: time="2018-07-17T01:51:32.179892581Z" level=debug msg="Calling POST /v1.31/containers/3ee230408f8f/exec"
Jul 17 01:51:32 dcea8b0 healthdog[831]: time="2018-07-17T01:51:32.181059484Z" level=debug msg="form data: {\"AttachStderr\":true,\"AttachStdin\":true,\"AttachStdout\":true,\"Cmd\":[\"/bin/bash\"],\"Detach\":false,\"DetachKeys\":\"\",\"Env\":null,\"Privileged\":false,\"Tty\":true,\"User\":\"\"}"
Jul 17 01:51:32 dcea8b0 healthdog[831]: time="2018-07-17T01:51:32.192466392Z" level=debug msg="Calling POST /v1.31/exec/49b2e3c0a968a2843933520b1a9935dd1025adf7776682d7fe5b346c5c274791/start"
Jul 17 01:51:32 dcea8b0 healthdog[831]: time="2018-07-17T01:51:32.193612546Z" level=debug msg="form data: {\"AttachStderr\":true,\"AttachStdin\":true,\"AttachStdout\":true,\"Cmd\":[\"/bin/bash\"],\"Detach\":false,\"DetachKeys\":\"\",\"Env\":null,\"Privileged\":false,\"Tty\":true,\"User\":\"\"}"
Jul 17 01:51:32 dcea8b0 healthdog[831]: time="2018-07-17T01:51:32.223102868Z" level=debug msg="Calling POST /v1.31/exec/49b2e3c0a968a2843933520b1a9935dd1025adf7776682d7fe5b346c5c274791/resize?h=24&w=88"
Jul 17 01:51:32 dcea8b0 healthdog[831]: time="2018-07-17T01:51:32.220952185Z" level=debug msg="starting exec command 49b2e3c0a968a2843933520b1a9935dd1025adf7776682d7fe5b346c5c274791 in container 3ee230408f8ff3c9de44246abee7eaddccd8d3b17cf56dd55468e0c0d2866952"
Jul 17 01:51:32 dcea8b0 healthdog[831]: time="2018-07-17T01:51:32.229934164Z" level=debug msg="attach: stdin: begin"
Jul 17 01:51:32 dcea8b0 healthdog[831]: time="2018-07-17T01:51:32.230341451Z" level=debug msg="attach: stdout: begin"
Jul 17 01:51:33 dcea8b0 healthdog[831]: time="2018-07-17T01:51:33.017854434Z" level=error msg="Error running exec in container: rpc error: code = Unknown desc = invalid character 'e' looking for beginning of value"
Jul 17 01:51:33 dcea8b0 healthdog[831]: time="2018-07-17T01:51:33.018565467Z" level=debug msg="Closing buffered stdin pipe"
Jul 17 01:51:33 dcea8b0 healthdog[831]: time="2018-07-17T01:51:33.018785340Z" level=debug msg="attach: stdin: end"
Jul 17 01:51:33 dcea8b0 healthdog[831]: time="2018-07-17T01:51:33.018923046Z" level=debug msg="attach: stdout: end"
Jul 17 01:51:33 dcea8b0 healthdog[831]: time="2018-07-17T01:51:33.022034760Z" level=debug msg="Calling GET /v1.31/exec/49b2e3c0a968a2843933520b1a9935dd1025adf7776682d7fe5b346c5c274791/json"

The container event is created:

2018-07-17T01:55:22.373419834Z container exec_start: /bin/bash  3ee230408f8ff3c9de44246abee7eaddccd8d3b17cf56dd55468e0c0d2866952 (image=sha256:296efdbf7685dcbebb1807225ee1902ae8a309bf9f7512090cd445b4d14e013d, io.resin.app-id=738415, io.resin.architecture=armv7hf, io.resin.device-type=raspberry-pi2, io.resin.features.dbus=1, io.resin.features.firmware=1, io.resin.features.kernel-modules=1, io.resin.features.resin-api=1, io.resin.features.supervisor-api=1, io.resin.qemu.version=2.9.0.resin1-arm, io.resin.service-id=2092, io.resin.service-name=main, io.resin.supervised=true, name=main_368440_538510)

And stdin & stdout are attached.

Which means the daemon successfully gets at least this far: https://github.com/resin-os/balena/blob/df5d0f6fe7410472a0f8a4e1f45f991cf71d4def/daemon/exec.go#L232

It either fails critically in that method, or subsequently in AddProcess.

Additional information from my co-worker:

The working/not working paths diverge at (self).remote.apiClient.AddProcess:
https://github.com/resin-os/balena/blob/17.06-resin/libcontainerd/client_linux.go#L120

That method is invoked via exec.go’s d.containerd.AddProcess. AttachErr is always set, with values like 295186944, 295952832, 293988160, 293993152, 294342784.