How to prevent my container from being deleted ever

I understand that if my container misbehaves it gets killed.
By misbehaves I mean, as @gelbal stated:

Yes, high memory / cpu consumption by the application is a plausible reason.

I understand that is necessary; what is not necessary is that the entire container gets deleted.
In production, I won’t have the bandwidth to download an entire container very often. I need the container to stick around so that if I successfully push a patch it uses the cache, which works great.

Also, I would appreciate some detail in the logs on why the container was killed so I can get to the bottom of this.

Is there any way I can accomplish those things as the balena supervisor stands, or can someone point me at the relevant code that decides when to kill a container so I can better understand how it functions?

Logs for context and people from the future on google:

Killing service 'main sha256:3b51aaadaf801f4042034c59df5c59b5849681febcec23940a7041897078b5cf'
Service exited 'main sha256:3b51aaadaf801f4042034c59df5c59b5849681febcec23940a7041897078b5cf'
Killed service 'main sha256:3b51aaadaf801f4042034c59df5c59b5849681febcec23940a7041897078b5cf'
Deleting image 'registry2.balena-cloud.com/v2/09502b550b4c5258b15a231b29374b4c@sha256:d8dd231fa99b92ca09bee30e9071134ad77dfb41f07da5194f841c64c752bc39'
Deleted image 'registry2.balena-cloud.com/v2/09502b550b4c5258b15a231b29374b4c@sha256:d8dd231fa99b92ca09bee30e9071134ad77dfb41f07da5194f841c64c752bc39'
Removing volume 'resin-data'

Support access is enabled, please do not restart the device.
Thank you

Edit:
The problem is with the Windows CLI version 9.15.5; as of version 10.1.1 the problem is gone.
In short, when ctrl+c was pressed after the build succeeded it mistakenly cancelled the build and removed it from the device.

Hi @tacLog, thanks for enabling support access but we’ll need to know the device UUID in order to take a look. It’s not expected behaviour for the container to be deleted.

uuid:
a49451f78439e9a53876fb5a6172f6ee

Also as a note, the container is running development code right now. I can’t do anything to reproduce the error at will. I can provide details about the application running on it in PM.

I could go back to a build that I know failed quickly if you want to monitor it live.

I left the container in a state where it just encountered the problem for the convenience of support.

I will leave it in that state all weekend at least if not most of Monday. I have not managed to solve this problem by changing my app.

Thanks to anyone that looks at this.

Hey @tacLog, looking at the device, it is in a strange state. I see on your application “Releases” page that the latest release is marked as “Cancelled”. Is that correct? Was that build cancelled during the push? Or did your push succeed and it’s marked erroneously as cancelled?

I see that the API gives the device a state that indeed has no containers, no volumes, etc. So the device seems to be working as it should, but the state the API gives the device is unusual, in my opinion.

Could you push your application code again and see if the result on the application “Releases” page becomes “Successful”? If so, the device should download things correctly. We are checking with the team how the device could even try to deploy a “Cancelled” build, and will get back to you.

Also, how do you deploy your code? git push ..., balena push? If balena push, what version of the CLI are you using? (Run balena version to get the version number.)

Hey @imrehg,

The pushes always succeed and run for highly variable amounts of time. I have had builds run for days or just a few minutes. (of course, I am developing this activity so the code is changing often) They only appear as canceled after the next release gets pushed or they fail and get deleted.

I have never canceled a release and I don’t even know how I would do so. If the build is looping out of control I hit the stop button sometimes but I didn’t this last time.

I will push a new release shortly.

I push with “balena push test” and I have the Windows standalone version 9.15.5.

Please let me know if there is anything further I can do. This is key to our ongoing development with balena cloud.

-Thomas

Further documentation around the issue.

As far as I know, the releases page always starts like this after a build.

The release becomes cancelled after the device deletes it.
The logs above are always the logs I get when the device deletes it.

One of the worst side effects is the removal of resin-data and the image, forcing a full redownload that is annoying while I have the device with me, but will take the device offline for more than a day in the field.
-Thomas

Hi,

This does sound strange, since /resin-data is never deleted except if

  • the device moves to a different app or
  • an explicit app/device action to do so is triggered.

Can you provide some extra details about this?
The device only deletes the local copy of the images, but that shouldn’t have any effect on the state of the release on our backend. I was checking some logs about release 3126c2e and it seems that it was created at 2019-04-19T23:47:03.516Z and last updated from an external API consumer at 2019-04-19T23:51:49.848Z. Are you suggesting that it was marked as cancelled at a later point in time?

I was able to end up in a state where a successful release gets marked as canceled, while using balena push and doing a Ctrl+C when the build is almost done. We have also just released balena-CLI v10.1.1 which resolves that issue and it would be great to hear back from you whether it changes the situation that you are facing or not.

Kind regards,
Thodoris

Hey @thgreasi

I think you resolved my issue. Pressing ctrl+c after the build is finished cancelled the build and deleted everything.
I am happy to provide as much detail as I can.
Thank you to everyone that helped me figure this out. I assumed wrongly that hanging forever after the unicorn was normal behavior to allow the dev to view the log.

Thank you
-Thomas

Details:

As for the removal of resin-data, that happens every time this occurs. See the logs from the above post,
or below from the most recent failure on Friday.

Can you provide some extra details about this?

I only noticed this last week and thought it was a side effect of this bug. There is a correlation between when the device removes the image and when the image is marked as canceled.

However, I think you are on to something: when a build completes, the builder just hangs here:

To exit I press control+c. I just confirmed this marks the build as canceled, which causes the deletion.

The reason this was hard to correlate for me was that the device takes some time to sync and then responds by deleting everything. I still feel bad for not noticing this before.

I will try the newest version and hopefully it will resolve the issue.

Logs from the most recent occurrence:

Killing service 'main sha256:253eaf313bebc6c10af385c029796326e58b66a3eb2e282f6cddc56d25374009'
Edited to remove NDA material here.
Service exited 'main sha256:253eaf313bebc6c10af385c029796326e58b66a3eb2e282f6cddc56d25374009'
Killed service 'main sha256:253eaf313bebc6c10af385c029796326e58b66a3eb2e282f6cddc56d25374009'
Deleting image 'registry2.balena-cloud.com/v2/5e8874b04e9604e76200f7e094639601@sha256:2ea5899797064154cffadb10f7e0f173c370eff44bad71d265440658a758f271'
Deleted image 'registry2.balena-cloud.com/v2/5e8874b04e9604e76200f7e094639601@sha256:2ea5899797064154cffadb10f7e0f173c370eff44bad71d265440658a758f271'
Removing volume 'resin-data'

After I balena push test the exact same code to make a new release:

Creating volume 'resin-data'
Downloading image 'registry2.balena-cloud.com/v2/2133ac12ec4f7d8fc0e75a6915c227ba@sha256:3cb399e13f5df810ebc1dae70d5c60828c6e7cf662f2d720e5b7865d0139d2ec'
Downloaded image 'registry2.balena-cloud.com/v2/2133ac12ec4f7d8fc0e75a6915c227ba@sha256:3cb399e13f5df810ebc1dae70d5c60828c6e7cf662f2d720e5b7865d0139d2ec'
Installing service 'main sha256:8005e34c5478e530b0ea0d35838a78be0471a58cf9772323d36150875e77ce02'
Installed service 'main sha256:8005e34c5478e530b0ea0d35838a78be0471a58cf9772323d36150875e77ce02'
Starting service 'main sha256:8005e34c5478e530b0ea0d35838a78be0471a58cf9772323d36150875e77ce02'
Started service 'main sha256:8005e34c5478e530b0ea0d35838a78be0471a58cf9772323d36150875e77ce02'
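For people from the future who want to spot this pattern in their own device logs: below is a minimal Python sketch, not a balena tool, that scans saved log text for the teardown sequence shown in the logs above. The function names find_teardown_events and looks_like_full_removal are hypothetical, and saving the logs to plain text is an assumption; only the message prefixes (Killing service, Deleting image, Removing volume) are taken from this thread.

```python
import re

# Message prefixes copied from the device logs quoted in this thread.
# Everything else about the log format is an assumption.
TEARDOWN_PATTERN = re.compile(
    r"^(Killing service|Deleting image|Removing volume) '([^']*)'"
)


def find_teardown_events(log_text):
    """Return (action, target) tuples for each teardown line, in log order."""
    events = []
    for line in log_text.splitlines():
        match = TEARDOWN_PATTERN.match(line.strip())
        if match:
            events.append((match.group(1), match.group(2)))
    return events


def looks_like_full_removal(events):
    """True if both an image deletion and a volume removal were logged,
    the combination that forced a full re-download in this thread."""
    actions = {action for action, _ in events}
    return {"Deleting image", "Removing volume"} <= actions


if __name__ == "__main__":
    # Placeholder hashes and image name, not real ones.
    sample = "\n".join([
        "Killing service 'main sha256:aaa'",
        "Service exited 'main sha256:aaa'",
        "Killed service 'main sha256:aaa'",
        "Deleting image 'registry.example/v2/img@sha256:bbb'",
        "Deleted image 'registry.example/v2/img@sha256:bbb'",
        "Removing volume 'resin-data'",
    ])
    events = find_teardown_events(sample)
    for action, target in events:
        print(action, "->", target)
    print("Full removal:", looks_like_full_removal(events))
```

A hit only says that the image and volume were removed, not why. In this thread the cause turned out to be the release being marked as cancelled, so the Releases page is the next place to look.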

@tacLog let us know how v10.1.1 works for you.
Given that you mentioned using v9.15.5, I think it will be easier to debug the Removing volume 'resin-data' issue once your CLI is up to date, if it is still an issue for you.

Hey @thgreasi,

Just updated to 10.1.1 and it appears the behavior is fixed. The build succeeds and gets deployed and starts running. Pressing ctrl+c after the unicorn does not cancel the build.

I am 100 percent certain that this entire issue was a CLI problem and half on me, because I didn’t realize that pressing ctrl+c was killing my build. I would just leave the console up to get lost in my windows until I would hit control+c and then immediately push another version. This prevented me from noticing anything until I would close a version I was happy with and then look at it later to realize it got deleted.

Important context: I am using the Windows 32 standalone version. The problem does not persist in version 10.1.1.

Thank you for addressing this!
-Thomas
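For anyone landing here later, a quick way to check whether a CLI install predates the fix is to compare the number printed by balena version against 10.1.1. This is a minimal Python sketch under assumptions: the helper names parse_version and predates_fix are hypothetical, and it expects a plain MAJOR.MINOR.PATCH string with no suffix. The thread only confirms that 9.15.5 showed the problem and 10.1.1 did not; versions in between are not covered here, so treat the result as a hint.

```python
# First version reported in this thread as not showing the problem.
FIXED_VERSION = (10, 1, 1)


def parse_version(text):
    """Turn a string like '9.15.5' or 'v10.1.1' into a tuple of ints.

    Assumes plain MAJOR.MINOR.PATCH with no pre-release suffix.
    """
    cleaned = text.strip().lstrip("vV")
    return tuple(int(part) for part in cleaned.split("."))


def predates_fix(version_text):
    """True if the given CLI version is older than FIXED_VERSION."""
    return parse_version(version_text) < FIXED_VERSION


if __name__ == "__main__":
    for version in ("9.15.5", "10.1.1"):
        print(version, "predates fix:", predates_fix(version))
```

Comparing tuples of integers avoids the trap of comparing version strings directly, where "9.15.5" would sort after "10.1.1".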