Issue when moving database to AWS RDS postgres

g.corrigan · September 15, 2021, 6:47pm

We have dev and production environments, and we recently tested moving the database from the local Docker container to AWS RDS. This worked well in dev: there were no issues when we restored the data from the openbalena server to RDS and switched the config to point to RDS.

However, when we applied the exact same steps to the production server, devices failed to download the containers for the app after being flashed. Both dev and prod are on the same version of openbalena, and RDS is the same version of postgres with the same settings.

See the attached logs and the error message below. The issue seems to be with authentication to the registry, and the error seems to indicate a parsing issue, but I'm not sure how this is possible. We are using AWS S3 as the registry storage; that has been working well for a few months and nothing has changed there.

[event]   Event: Image download error {"error":{"message":"(HTTP code 500) server error - Get https://registry.devices.xxx.xxxx.xx/v2/v2/5b54c4058263bfbe9097a29a7532c0d5/manifests/sha256:5436e3666cc18da8da86411becd31c6039cb58d2578a90e7b157ec302572742b: invalid token auth challenge realm: parse <https://api.devices.xxx.xxxx.xx/auth/v1/token>: first path segment in URL cannot contain colon ","stack":"Error: (HTTP code 500) server error - Get https://registry.devices.xxx.xxxx.xx/v2/v2/5b54c4058263bfbe9097a29a7532c0d5/manifests/sha256:5436e3666cc18da8da86411becd31c6039cb58d2578a90e7b157ec302572742b: invalid token auth challenge realm: parse <https://api.devices.xxx.xxxx.xx/auth/v1/token>: first path segment in URL cannot contain colon \n    at /usr/src/app/dist/app.js:10:2303379\n    at IncomingMessage.<anonymous> (/usr/src/app/dist/app.js:10:2303266)\n    at IncomingMessage.emit (events.js:322:22)\n    at endReadableNT (_stream_readable.js:1187:12)\n    at processTicksAndRejections
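
The "invalid token auth challenge realm" part of the error refers to the realm value a registry sends in its WWW-Authenticate header when a request is not yet authenticated. Below is a minimal sketch of how that value could be sanity-checked once the header has been captured. The header strings, the example.com hostnames, and the helper names parse_bearer_challenge and realm_problem are all hypothetical and only for illustration. The sketch also assumes the angle brackets in the log are really part of the realm value, although they may just be an artifact of how the log was pasted.

```python
import re
from urllib.parse import urlsplit


def parse_bearer_challenge(header: str) -> dict:
    """Parse a `Bearer key="value",...` challenge into a dict of its parameters."""
    scheme, _, params = header.partition(" ")
    if scheme.lower() != "bearer":
        raise ValueError(f"not a Bearer challenge: {scheme!r}")
    return dict(re.findall(r'(\w+)="([^"]*)"', params))


def realm_problem(realm: str):
    """Return a short description of what is wrong with the realm, or None if it looks fine."""
    if not realm:
        return "realm is empty"
    if realm != realm.strip() or realm[0] in "<'" or realm[-1] in ">'":
        return "realm is wrapped in stray characters or whitespace"
    parts = urlsplit(realm)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return "realm is not an absolute http(s) URL"
    return None


# Hypothetical header values; the real one comes from the registry's 401 response.
GOOD = 'Bearer realm="https://api.example.com/auth/v1/token",service="registry.example.com"'
BAD = 'Bearer realm="<https://api.example.com/auth/v1/token>",service="registry.example.com"'

for header in (GOOD, BAD):
    realm = parse_bearer_challenge(header)["realm"]
    print(repr(realm), "->", realm_problem(realm) or "ok")
```

Comparing the realm reported by the dev and prod registries in this way might show whether the two environments are sending different values.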

If we revert the db back to the local one on the openbalena server, everything works as normal again.

There are no errors showing in any of the container logs that look like they might relate to this.
Does anyone have any ideas as to what might be causing the issue, or where to focus our attention?

prod_balena_issue.txt (16.1 KB)

cmfcruz · October 21, 2021, 1:16pm

Hi,

There seems to be a routing issue to the image when the engine is trying to do a download. I noticed a difference in the paths shown in the logs.

This event shows the engine attempting to download the image:

[event]   Event: Docker image download {"image":{"name":"registry.devices.xxx.xxxx.xx/v2/5b54c4058263bfbe9097a29a7532c0d5@sha256:5436e3666cc18da8da86411becd31c6039cb58d2578a90e7b157ec302572742b","appId":10,"serviceId":97,"serviceName":"telegraf","imageId":553,"releaseId":111,"dependent":0,"dockerImageId":null}}

I see that the path of the image in the event error has /v2/v2/ instead of just /v2/:

[event]   Event: Image download error {"error":{"message":"(HTTP code 500) server error - Get https://registry.devices.xxx.xxxx.xx/v2/v2/5b54c4058263bfbe9097a29a7532c0d5/manifests/sha256:5436e3666cc18da8da86411becd31c6039cb58d2578a90e7b157ec302572742b: invalid token auth challenge realm: parse <https://api.devices.xxx.xxxx.xx/auth/v1/token>: first path segment in URL cannot contain colon ",....

You mentioned that you are using the same registry in production and that no changes have been made to it. You also mentioned that reverting to the local database fixes the issue. Have you tried doing another backup and restore of the local database to AWS RDS? I suspect some data corruption may have happened during the transfer.

Regards,
Carlo

g.corrigan · October 26, 2021, 8:24pm

Hi @cmfcruz, we tried the backup and restore a few times, but I haven’t tried the migration again since I logged the issue, and I also didn’t notice the /v2/v2 issue.

I’ll see if I can figure out why the /v2/v2 is appearing. Thanks for the response and for pointing that out.
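
As a starting point for that check, here is a minimal sketch of how a manifest request URL could be derived from the image name in the download event. It assumes the engine follows the Docker Registry HTTP API V2 layout, where the manifest endpoint is /v2/<name>/manifests/<reference>, and that everything between the host and the digest is treated as the repository name. manifest_url is a hypothetical helper, not part of balena.

```python
def manifest_url(image: str) -> str:
    """Build a Registry API V2 manifest URL from a `host/name@digest` image reference."""
    host, sep, remainder = image.partition("/")
    name, at, digest = remainder.partition("@")
    if not sep or not at or not name or not digest:
        raise ValueError(f"expected host/name@digest, got {image!r}")
    # The API prefix /v2/ is prepended to whatever the repository name is.
    return f"https://{host}/v2/{name}/manifests/{digest}"


# Image name copied from the "Docker image download" event earlier in the thread.
IMAGE = (
    "registry.devices.xxx.xxxx.xx/v2/5b54c4058263bfbe9097a29a7532c0d5"
    "@sha256:5436e3666cc18da8da86411becd31c6039cb58d2578a90e7b157ec302572742b"
)

print(manifest_url(IMAGE))
```

If that assumption holds, the /v2/v2/ in the error URL would follow from the image name itself, since the repository name in the event already begins with v2/. Comparing against the request URL seen on the working dev setup should confirm or rule this out.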

Regards,
Gerard.