Unable to update OpenBalena to 3.6.0

We have an openBalena server running on Ubuntu 18.04.
After updating from 3.5.1 to 3.6.0, openBalena is inaccessible from the balena CLI and from all devices.
I first tried to update straight to the latest version (3.7.6), but the step that fails appears to be the 3.5.1 to 3.6.0 update described above. So for now we're stuck at 3.5.1 with a backup.

After the update the database seems to be corrupted. This is the output after restarting openBalena:

root@open-balena:~/open-balena# ./scripts/compose up
Creating network "openbalena_default" with the default driver
Creating openbalena_redis_1         ... done
Creating openbalena_cert-provider_1 ... done
Creating openbalena_s3_1            ... done
Creating openbalena_db_1            ... done
Creating openbalena_registry_1      ... done
Creating openbalena_api_1           ... done
Creating openbalena_vpn_1           ... done
Creating openbalena_haproxy_1       ... done
Attaching to openbalena_s3_1, openbalena_redis_1, openbalena_db_1, openbalena_cert-provider_1, openbalena_registry_1, openbalena_api_1, openbalena_vpn_1, openbalena_haproxy_1
s3_1             | Systemd init system enabled.
db_1             | === Upgrading data from v10 to v13
db_1             | === Creating postgres directory /usr/lib/postgresql/upgrade
db_1             | === Installing tools for Postgres v10
cert-provider_1  | [Info] VALIDATION not set. Using default: http-01
cert-provider_1  | [Info] Waiting for api.openbalena.domain-left-out.com to be available via HTTP...
cert-provider_1  | [Info] (1/3) Connecting...
redis_1          | 1:C 07 Dec 2022 12:45:32.723 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
redis_1          | 1:C 07 Dec 2022 12:45:32.723 # Redis version=6.2.3, bits=64, commit=00000000, modified=0, pid=1, just started
redis_1          | 1:C 07 Dec 2022 12:45:32.723 # Warning: no config file specified, using the default config. In order to specify a config file use redis-server /path/to/redis.conf
cert-provider_1  | [Info] (1/3) Failed. Retrying in 5 seconds...
cert-provider_1  | [Info] (2/3) Connecting...
cert-provider_1  | [Info] (2/3) Failed. Retrying in 5 seconds...
registry_1       | Systemd init system enabled.
redis_1          | 1:M 07 Dec 2022 12:45:32.725 * monotonic clock: POSIX clock_gettime
redis_1          | 1:M 07 Dec 2022 12:45:32.728 * Running mode=standalone, port=6379.
redis_1          | 1:M 07 Dec 2022 12:45:32.728 # Server initialized
redis_1          | 1:M 07 Dec 2022 12:45:32.728 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
redis_1          | 1:M 07 Dec 2022 12:45:32.750 * Loading RDB produced by version 6.2.3
redis_1          | 1:M 07 Dec 2022 12:45:32.750 * RDB age 28 seconds
redis_1          | 1:M 07 Dec 2022 12:45:32.750 * RDB memory usage when created 3.81 Mb
redis_1          | 1:M 07 Dec 2022 12:45:32.797 * DB loaded from disk: 0.068 seconds
redis_1          | 1:M 07 Dec 2022 12:45:32.797 * Ready to accept connections
api_1            | Systemd init system enabled.
vpn_1            | Systemd init system enabled.
haproxy_1        | Using certificate from cert-provider...
haproxy_1        | Setting up watches.  Beware: since -r was given, this may take a while!
haproxy_1        | Watches established.
haproxy_1        | [NOTICE] 340/124541 (10) : New worker #1 (12) forked
haproxy_1        | [WARNING] 340/124541 (12) : Server backend_api/balena_api_1 is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
haproxy_1        | [ALERT] 340/124541 (12) : backend 'backend_api' has no server available!
haproxy_1        | [WARNING] 340/124541 (12) : Server backend_registry/balena_registry_1 is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
haproxy_1        | [ALERT] 340/124541 (12) : backend 'backend_registry' has no server available!
haproxy_1        | [WARNING] 340/124542 (12) : Server backend_vpn/balena_vpn_1 is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
haproxy_1        | [ALERT] 340/124542 (12) : backend 'backend_vpn' has no server available!
haproxy_1        | [WARNING] 340/124542 (12) : Server backend_s3/balena_s3_1 is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
haproxy_1        | [ALERT] 340/124542 (12) : backend 'backend_s3' has no server available!
haproxy_1        | [WARNING] 340/124542 (12) : Server backend_db/balena_db_1 is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
haproxy_1        | [ALERT] 340/124542 (12) : backend 'backend_db' has no server available!
haproxy_1        | [WARNING] 340/124543 (12) : Server vpn-tunnel/balena_vpn is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
haproxy_1        | [ALERT] 340/124543 (12) : proxy 'vpn-tunnel' has no server available!
haproxy_1        | [WARNING] 340/124543 (12) : Server vpn-tunnel-tls/balena_vpn is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
haproxy_1        | [ALERT] 340/124543 (12) : proxy 'vpn-tunnel-tls' has no server available!
db_1             | debconf: delaying package configuration, since apt-utils is not installed
db_1             | Selecting previously unselected package postgresql-client-10.
(Reading database ... 12041 files and directories currently installed.)
db_1             | Preparing to unpack .../postgresql-client-10_10.23-1.pgdg110+1_amd64.deb ...
db_1             | Unpacking postgresql-client-10 (10.23-1.pgdg110+1) ...
db_1             | Selecting previously unselected package postgresql-10.
db_1             | Preparing to unpack .../postgresql-10_10.23-1.pgdg110+1_amd64.deb ...
db_1             | Unpacking postgresql-10 (10.23-1.pgdg110+1) ...
db_1             | Setting up postgresql-client-10 (10.23-1.pgdg110+1) ...
cert-provider_1  | [Info] (3/3) Connecting...
cert-provider_1  | [Info] (3/3) Failed!
cert-provider_1  | [Info] Unable to access api.openbalena.domain-left-out.com on port 80. This is needed for certificate validation. Retrying in 30 seconds...
haproxy_1        | [WARNING] 340/124546 (12) : Server backend_s3/balena_s3_1 is UP, reason: Layer4 check passed, check duration: 0ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
db_1             | Setting up postgresql-10 (10.23-1.pgdg110+1) ...
db_1             | debconf: unable to initialize frontend: Dialog
db_1             | debconf: (TERM is not set, so the dialog frontend is not usable.)
db_1             | debconf: falling back to frontend: Readline
db_1             | invoke-rc.d: could not determine current runlevel
db_1             | invoke-rc.d: policy-rc.d denied execution of start.
db_1             | Processing triggers for postgresql-common (238.pgdg110+1) ...
db_1             | debconf: unable to initialize frontend: Dialog
db_1             | debconf: (TERM is not set, so the dialog frontend is not usable.)
db_1             | debconf: falling back to frontend: Readline
db_1             | Building PostgreSQL dictionaries from installed myspell/hunspell packages...
db_1             | Removing obsolete dictionary files:
db_1             | === Initializing new data directory /var/lib/postgresql/data/13
db_1             | === Creating postgres directory /var/lib/postgresql/data/13
db_1             | The files belonging to this database system will be owned by user "postgres".
db_1             | This user must also own the server process.
db_1             | 
db_1             | The database cluster will be initialized with locale "en_US.utf8".
db_1             | The default database encoding has accordingly been set to "UTF8".
db_1             | The default text search configuration will be set to "english".
db_1             | 
db_1             | Data page checksums are disabled.
db_1             | 
db_1             | initdb: error: directory "/var/lib/postgresql/data/13" exists but is not empty
db_1             | If you want to create a new database system, either remove or empty
db_1             | the directory "/var/lib/postgresql/data/13" or run initdb
db_1             | with an argument other than "/var/lib/postgresql/data/13".
openbalena_db_1 exited with code 1
haproxy_1        | [WARNING] 340/124551 (12) : Server backend_registry/balena_registry_1 is UP, reason: Layer4 check passed, check duration: 0ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
haproxy_1        | [WARNING] 340/124552 (12) : Server backend_vpn/balena_vpn_1 is UP, reason: Layer4 check passed, check duration: 0ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
cert-provider_1  | [Info] Waiting for api.openbalena.domain-left-out.com to be available via HTTP...
cert-provider_1  | [Info] (1/3) Connecting...
cert-provider_1  | [Info] (1/3) Failed. Retrying in 5 seconds...
cert-provider_1  | [Info] (2/3) Connecting...
cert-provider_1  | [Info] (2/3) Failed. Retrying in 5 seconds...
cert-provider_1  | [Info] (3/3) Connecting...
cert-provider_1  | [Info] (3/3) Failed!
cert-provider_1  | [Info] Unable to access api.openbalena.domain-left-out.com on port 80. This is needed for certificate validation. Retrying in 30 seconds...

Hello,

Thanks for your post about updating openBalena from 3.5.1 to 3.6.0.
Checking the openBalena image versions used in 3.6.0, I found that open-balena-db 5.1.2 is used: openBalena image versions

Checking the open-balena-db 5.1.2 commit history, I found that the ${PGNEWDATA} directory, which in your case is /var/lib/postgresql/data/13, is not removed before it is initialised.
This was changed from: open-balena-db/balena-entrypoint.sh at 2308dc90e0c5156552872cdab0fe0837bb4c54be · balena-io/open-balena-db · GitHub
to version 5.1.2: open-balena-db/balena-entrypoint.sh at 2308dc90e0c5156552872cdab0fe0837bb4c54be · balena-io/open-balena-db · GitHub

The reasoning here is that if Postgres 13 was never initialised on the openBalena instance, this folder should not exist and is created fresh. Given your information that you tried an update to 3.7.6 which failed, I could imagine that the folder was created and filled at that point.

My suggestion is to exec into the open-balena-db container of the running 3.5.1 openBalena and look for /var/lib/postgresql/data/13. Check this folder and remove it, then retry the update from 3.5.1 to 3.6.0.
In the current situation the database shouldn't know about or use the /var/lib/postgresql/data/13 folder at all. You may want to double-check this by inspecting the $PGDATA environment variable inside the container. For version 3.5.1 this is set here: https://github.dev/balena-io/open-balena-db/blob/30eb64e12b8adf9169059719162383cab6cc0296/balena-entrypoint.sh#L114
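
A minimal sketch of that check, assuming a POSIX shell is available inside the container. The function name check_leftover_pgdata is hypothetical; the data root and version number are taken from the log output above and may differ on other setups. It only reports what it finds and deletes nothing.

```shell
#!/bin/sh
# Report whether a leftover upgrade target directory exists under a
# Postgres data root. Both arguments are assumptions to adapt:
#   $1 = data root (the log above suggests /var/lib/postgresql/data)
#   $2 = target major version (13 in the log above)
check_leftover_pgdata() {
    root="$1"
    version="$2"
    target="$root/$version"

    # No directory at all: initdb can create it freshly.
    if [ ! -d "$target" ]; then
        echo "absent: $target"
        return 0
    fi
    # A directory without any entries is also acceptable to initdb.
    if [ -z "$(ls -A "$target")" ]; then
        echo "empty: $target"
        return 0
    fi
    # Anything else triggers the "exists but is not empty" error.
    echo "non-empty: $target"
    return 1
}
```

To use it, open a shell in the container (for example with docker exec -it openbalena_db_1 sh, using the container name from the compose output above), paste the function, then run echo "$PGDATA" to see the active data directory and check_leftover_pgdata /var/lib/postgresql/data 13 to see whether the upgrade target is in the way.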

Please report back your findings.

Best Regards
Harald

Thank you very much for the suggestion, Harald. I will let you know as soon as I get time to try it out.
Best Regards
Lars

We finally got around to trying this again, and I got it working!

We had started to have problems with our balena VPN not working, despite it reporting that everything was fine.

I did not see the directory before I tried to update, so I updated and got the same error (but with newer versions now… a year later…).
I then found the volume on the host OS, outside the container, and renamed the directory, which was now named 14, to 14_bak. I ran compose build and then up, and all was good except that our database was naturally not there. So I moved 14_bak back to 14, and that seems to just work now.

It also fixed our VPN problems.
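
For anyone repeating this, the rename step can be sketched roughly as below. This is only an illustration under assumptions: set_aside_pgdata is a hypothetical helper name, and the path to the versioned directory on the host has to be located first (docker volume ls and docker volume inspect can show a volume's mountpoint). It refuses to overwrite an existing backup.

```shell
#!/bin/sh
# Rename a versioned data directory to "<name>_bak" so the next start
# finds no target directory. Nothing is deleted.
#   $1 = path to the versioned data directory on the host (assumption:
#        located beforehand, e.g. under the volume's mountpoint)
set_aside_pgdata() {
    dir="${1%/}"          # drop a trailing slash, if any
    backup="${dir}_bak"

    if [ ! -d "$dir" ]; then
        echo "nothing to move: $dir" >&2
        return 1
    fi
    if [ -e "$backup" ]; then
        echo "backup already exists: $backup" >&2
        return 1
    fi
    # Print the backup path on success so it can be restored later.
    mv "$dir" "$backup" && echo "$backup"
}
```

Moving the backup back afterwards is a plain mv in the other direction, done while the containers are stopped.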