Recreate device permissions after loss of openBalena volume data

While trying to get renewed certificates picked up by openBalena, I found comments here recommending removing the volumes as part of the container restart process, or discussing how quickstart sets up automatic certificate renewal. Following that advice resulted in losing or overwriting the volume that contained the openBalena database, so the old application and its devices are no longer listed by the Balena CLI. I still have back-ups of the original (still-valid) certificates and the activate file with secrets. Is there any way to remotely register the devices with a new application, on the same domain as before, or otherwise remotely connect to them?

Edit: specifically, how to add records for the old devices to the new copy of the database.

I’ve confirmed that devices’ resin_supervisor containers are still attempting to connect and are fine with the still-intact certs:

Thu Apr 28 07:45:52 2022 VERIFY OK: depth=1, CN=vpn-ca.[domain]
Thu Apr 28 07:45:52 2022 VERIFY KU OK
Thu Apr 28 07:45:52 2022 Validating certificate extended key usage
Thu Apr 28 07:45:52 2022 ++ Certificate has EKU (str) TLS Web Server Authentication, expects TLS Web Server Authentication
Thu Apr 28 07:45:52 2022 VERIFY EKU OK
Thu Apr 28 07:45:52 2022 VERIFY OK: depth=0, CN=vpn.[domain]
Thu Apr 28 07:45:52 2022 Control Channel: TLSv1.3, cipher TLSv1.3 TLS_AES_256_GCM_SHA384, 4096 bit RSA
Thu Apr 28 07:45:52 2022 [vpn.[domain]] Peer Connection Initiated with [AF_INET]44.231.88.53:443
Thu Apr 28 07:45:54 2022 SENT CONTROL [vpn.[domain]]: 'PUSH_REQUEST' (status=1)
Thu Apr 28 07:45:54 2022 AUTH: Received control message: AUTH_FAILED

These requests appear to align with open-balena-vpn logs like:

AUTH FAIL: API Authentication failed for c9cd7417dc7ff7c4148e6cb5bde98203
[28/Apr/2022:07:43:30 +0000] "POST /api/v1/auth/ HTTP/1.1" 401 12 "-" "curl/7.64.0"
127.0.0.1:60726 WARNING: Failed running command (--auth-user-pass-verify): external program exited
127.0.0.1:60726 TLS Auth Error: Auth Username/Password verification failed for peer
127.0.0.1:60650 Connection reset, restarting [0]
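
To get an inventory of which clients are being rejected, the log lines above can be parsed. This is a minimal sketch: the regular expression only matches the "AUTH FAIL" line format quoted here, the function name rejected_clients is made up for illustration, and it is an assumption that the logged identifier is the username the device presented.

```python
import re
from collections import Counter

# Matches the open-balena-vpn line quoted above. What the captured value
# represents (assumed: the username sent by the device) is not confirmed here.
AUTH_FAIL = re.compile(r"AUTH FAIL: API Authentication failed for ([0-9a-f]+)")


def rejected_clients(log_lines):
    """Count rejected authentication attempts per logged identifier."""
    counts = Counter()
    for line in log_lines:
        match = AUTH_FAIL.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "AUTH FAIL: API Authentication failed for c9cd7417dc7ff7c4148e6cb5bde98203",
    '[28/Apr/2022:07:43:30 +0000] "POST /api/v1/auth/ HTTP/1.1" 401 12 "-" "curl/7.64.0"',
    "127.0.0.1:60726 TLS Auth Error: Auth Username/Password verification failed for peer",
]

for identifier, attempts in rejected_clients(sample).items():
    print(identifier, attempts)
```

Feeding it the full container log instead of the sample gives one line per distinct client that is still trying to connect.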

I am guessing that this is based on one of the keys in /boot/config.json, probably deviceApiKey. If so, what would be the corresponding setting on the openBalena side that I should adjust to let the authentication succeed?

Still curious about where the openvpn credentials are set. In the meantime, attempting to recover volume data from the filesystem. This requires:

  1. Stopping the server instance.
  2. Disconnecting the filesystem from the server instance and mounting it as a secondary disk on another instance (which has enough space to contain any recovered files).
  3. Installing TestDisk on the new server and running testdisk on the attached partition. If there is a warning that “the harddisk seems too small”, continue anyway. Then press p to view the files.
  4. Using the TestDisk interface to navigate to /var/lib/docker/volumes. This shows a mix of recoverable (gray) and damaged (red) directories.
  5. The desired ones are likely mostly overwritten, but it’s possible to select individual ones with : (followed by C), or just press c while a particular volume directory or the whole /var/lib/docker/volumes is highlighted. Then choose a location on the server’s main disk (outside the partition being recovered) to copy them to for later review.

Any documentation on how the keys in the configuration files fit into the openvpn authentication process would be appreciated.
The server is still using 13GB despite the fresh install, on an operating system that takes under 2GB. The extra data appears to be located in /var/lib/overlay2, and I am wondering whether any of it is recoverable.

So far, the authentication flow is:

  1. BalenaOS’ supervisor container sends a connection request via openvpn client.
  2. The request arrives on openBalena and is routed to the open-balena-vpn container.
  3. There, openvpn receives the request, and has /etc/openvpn/server.conf configured to check username/password via shell script:
# Allow authorisation via username/password.
script-security 3 # Level 3 for username/password auth
verify-client-cert none
username-as-common-name
  4. Balena’s /etc/openvpn/scripts/auth.sh (openvpn/scripts/auth.sh in open-balena-vpn) packages the authentication as a POST to /api/v1/auth/ in the same container.
  5. src/api.ts in open-balena-vpn receives the POST and converts it into a GET request addressed to the open-balena-api container, using the API’s standard UUID / Bearer Token format.
  6. src/features/vpn/index.ts in open-balena-api routes this GET request through various middleware.
  7. prefetchApiKeyMiddleware() and apiKeyMiddleware() in src/infra/auth/middleware.ts work on converting the bearer token to an API key. src/infra/auth/middleware.ts ultimately does this using Balena’s version of PineJS. This seems to involve some processing for a database interaction, but does not seem to be the core auth decision.
  8. authDevice() in src/features/vpn/services.ts relays to checkAuth() which builds a query object, again ultimately relying on balena/pinejs. The query appears to go to PineJS’ Auth model, looking for an Auth record with a matching uuid and apiKey. If found, the authentication succeeds.
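
The chain above can be modelled as three small functions, one per hop (auth.sh, the VPN's src/api.ts, and the API's checkAuth()). This is only a toy model of the flow as I understand it, not the real code: all function names, the DeviceKey type and the in-memory KNOWN_KEYS set are made up, and it is an assumption that the openvpn username maps to the device uuid and the password to the bearer token.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceKey:
    uuid: str
    api_key: str


# Hypothetical stand-in for the database records the API consults.
KNOWN_KEYS = {DeviceKey("uuid-of-device-a", "key-of-device-a")}


def api_check_auth(uuid: str, bearer_token: str) -> int:
    """API side: succeed only if a record matches both uuid and key."""
    return 200 if DeviceKey(uuid, bearer_token) in KNOWN_KEYS else 401


def vpn_api_auth(username: str, password: str) -> int:
    """VPN-side API: turn the POSTed credentials into uuid + bearer token."""
    return api_check_auth(uuid=username, bearer_token=password)


def auth_script(username: str, password: str) -> int:
    """Auth script: exit code 0 lets openvpn accept the client, 1 rejects it."""
    return 0 if vpn_api_auth(username, password) == 200 else 1


print(auth_script("uuid-of-device-a", "key-of-device-a"))  # known device
print(auth_script("uuid-of-device-b", "key-of-device-b"))  # no matching record
```

In this model an empty KNOWN_KEYS set rejects every device, which matches the 401 and AUTH_FAILED lines seen after the database was lost.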

So, I need to create some auth records in the database. PineJS seems to be configured for PostgreSQL. Can anyone point to the location where those are stored? The PostgreSQL in open-balena-db appeared to be empty at first, but running psql -U docker -d resin and then \dt lists the expected tables.

Found open-balena-api logging queries based on the key. Not yet clear on the best process for allowing those to succeed.

SELECT "permission"."name"
FROM "permission"
WHERE (EXISTS (
        SELECT 1
        FROM "api key-has-permission" AS "permission.api key-has-permission"
        WHERE "permission"."id" = "permission.api key-has-permission"."permission"
        AND EXISTS (
                SELECT 1
                FROM "api key" AS "permission.api key-has-permission.api key"
                WHERE "permission.api key-has-permission"."api key" = "permission.api key-has-permission.ap
                AND ("permission.api key-has-permission.api key"."key") IS NOT NULL AND ("permission.api ke
        )
)
OR EXISTS (
        SELECT 1
        FROM "role-has-permission" AS "permission.role-has-permission"
        WHERE "permission"."id" = "permission.role-has-permission"."permission"
        AND EXISTS (
                SELECT 1
                FROM "role" AS "permission.role-has-permission.role"
                WHERE "permission.role-has-permission"."role" = "permission.role-has-permission.role"."id"
                AND EXISTS (
                        SELECT 1
                        FROM "api key-has-role" AS "permission.role-has-permission.role.api key-has-role"
                        WHERE "permission.role-has-permission.role"."id" = "permission.role-has-permission.
                        AND EXISTS (
                                SELECT 1
                                FROM "api key" AS "permission.role-has-permission.role.api key-has-role.api
                                WHERE "permission.role-has-permission.role.api key-has-role"."api key" = "p
                                AND ("permission.role-has-permission.role.api key-has-role.api key"."key") 
                        )
                )
        )
))
ORDER BY "permission"."name" ASC [ '<key>' ]

While I am able to create actor and device records manually, the key/role/permission system is too complex to recreate. Based on the Balena primer, openBalena normally creates all these records during the device provisioning process, and my goal is now to recreate the records produced by the provisioning process without the provisioning key.

Based on the provisionDevice() test function, I found that a device is created with a POST request. For example, with application (fleet) id 1 and device type 56, that comes to:

curl -X POST "https://api.unicopower.com/v6/device" -H "Content-Type: application/json" -H "Authorization: Bearer <token>" --data '{
	"belongs_to__application": 1,
	"uuid": "<device uuid>",
	"is_of__device_type": 56
}'

(or similar - currently getting errors on this request and having to use open-balena-admin). Once the numerical id of the device is known, an API key can be set with a subsequent:

curl -X POST "https://api.unicopower.com/api-key/device/<id>/device-key" -H "Content-Type: application/json" -H "Authorization: Bearer <token>" --data '{"apiKey": "<apiKey>"}'
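
For more than a handful of devices, the two requests above can be built in a script. The sketch below only constructs the requests with the standard library and does not send them; API_BASE is a placeholder host, the function names are made up, and the paths and JSON fields are copied from the curl commands above (which, as noted, were not fully working for me at the time).

```python
import json
import urllib.request

API_BASE = "https://api.example.com"  # placeholder; use your own API host


def _post(url, token, body):
    """Build (but do not send) a JSON POST request with a bearer token."""
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def device_request(token, uuid, application_id, device_type_id):
    """Request that creates the device record (first curl command)."""
    body = {
        "belongs_to__application": application_id,
        "uuid": uuid,
        "is_of__device_type": device_type_id,
    }
    return _post(f"{API_BASE}/v6/device", token, body)


def device_key_request(token, device_id, api_key):
    """Request that sets the device's API key (second curl command)."""
    url = f"{API_BASE}/api-key/device/{device_id}/device-key"
    return _post(url, token, {"apiKey": api_key})


req = device_request("<token>", "<device uuid>", 1, 56)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing a built request to urllib.request.urlopen() would send it. Reading the numeric device id out of the first response is left out because the response shape is not shown in this thread.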

This results in a device record that is linked to an api key record through an actor record, but queries still fail at step 7 of the flow above (the API key middleware), when getAPIKey() calls permissions.resolveApiKey(req). Following that to pinejs/src/sbvr-api/permissions.ts leads further to checkApiKey() → getApiKeyPermissions() → $getApiKeyPermissions(). The 10-layer-deep query object there appears to seek the name of a permission that is granted (via api key-has-permission) to an api key with a matching apiKey and a null or future expiryDate, or that is granted to a role (via role-has-permission) that is in turn granted (via api key-has-role) to an api key that meets those conditions. api key-has-permission is empty, while role-has-permission has been painstakingly populated by openBalena, so I focused on choosing a role from the ones that already existed:

select * from role;
         created at         |        modified at         | id |         name         
----------------------------+----------------------------+----+----------------------
 2022-04-27 21:24:26.743892 | 2022-04-27 21:24:26.743892 |  1 | provisioning-api-key
 2022-04-27 21:24:26.743892 | 2022-04-27 21:24:26.743892 |  2 | named-user-api-key
 2022-04-27 21:24:26.743892 | 2022-04-27 21:24:26.743892 |  3 | device-api-key
 2022-04-27 21:24:26.743892 | 2022-04-27 21:24:26.743892 |  4 | default-user
 2022-04-27 21:24:26.743892 | 2022-04-27 21:24:26.743892 |  5 | service.api
 2022-04-27 21:24:26.743892 | 2022-04-27 21:24:26.743892 |  6 | service.vpn

So, role id 3 (device-api-key). In fact, the above requests had already caused openBalena to create an api key-has-role record with role=3. After adding a few more devices with the right UUID and key, permissions are being found, and several devices are reported as “Is Online” in the Balena CLI, though only as “VPN State: Connected” (with “Connectivity: Unknown”) in open-balena-admin.
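
To check my reading of the permission lookup, here is a reduced model of it in an in-memory SQLite database. It is a sketch under assumptions, not the real schema: table and column names are simplified (underscores instead of spaces, no actor table), the permission names are placeholders, and the expiry check is written from my description of the query rather than from the truncated log output.

```python
import sqlite3

SCHEMA = """
CREATE TABLE permission (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE role (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE api_key (id INTEGER PRIMARY KEY, "key" TEXT, expiry_date TEXT);
CREATE TABLE api_key_has_permission (api_key INTEGER, permission INTEGER);
CREATE TABLE role_has_permission (role INTEGER, permission INTEGER);
CREATE TABLE api_key_has_role (api_key INTEGER, role INTEGER);
"""

# A permission counts if it is granted to the key directly, or to a role
# that the key holds; in both cases the key must be unexpired.
QUERY = """
SELECT p.name FROM permission AS p
WHERE EXISTS (
    SELECT 1 FROM api_key_has_permission AS kp
    JOIN api_key AS k ON k.id = kp.api_key
    WHERE kp.permission = p.id AND k."key" = :key
      AND (k.expiry_date IS NULL OR k.expiry_date > :now)
)
OR EXISTS (
    SELECT 1 FROM role_has_permission AS rp
    JOIN api_key_has_role AS kr ON kr.role = rp.role
    JOIN api_key AS k ON k.id = kr.api_key
    WHERE rp.permission = p.id AND k."key" = :key
      AND (k.expiry_date IS NULL OR k.expiry_date > :now)
)
ORDER BY p.name ASC
"""


def permissions_for(conn, key, now="2022-04-28T00:00:00"):
    """Return the sorted permission names that the given key resolves to."""
    rows = conn.execute(QUERY, {"key": key, "now": now}).fetchall()
    return [name for (name,) in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO role VALUES (3, 'device-api-key')")
conn.executemany("INSERT INTO permission VALUES (?, ?)",
                 [(1, "example.read"), (2, "example.update")])  # placeholder names
conn.executemany("INSERT INTO role_has_permission VALUES (3, ?)", [(1,), (2,)])
conn.executemany("INSERT INTO api_key VALUES (?, ?, ?)", [
    (1, "key-of-device-a", None),
    (2, "key-without-role", None),
    (3, "expired-key", "2020-01-01T00:00:00"),
])
conn.executemany("INSERT INTO api_key_has_role VALUES (?, 3)", [(1,), (3,)])

print(permissions_for(conn, "key-of-device-a"))
print(permissions_for(conn, "key-without-role"))
print(permissions_for(conn, "expired-key"))
```

In this model only the unexpired key that holds role 3 resolves to any permissions, which is consistent with devices starting to authenticate once their keys had an api key-has-role row pointing at device-api-key.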

The fastest workflow proved to be using open-balena-admin to first add each device with the correct uuid and name, then editing the api key field of the API key record that this automatically created to match the hashes that the open-balena-api server was querying for. This led to everything being marked as connected to the VPN and “online” in Balena CLI, but not “Online” in the admin UI (which is based on the Connectivity state), as above. This connection was sufficient to regain SSH access and allow essential debugging and network inspection tasks. The server also allowed a development build unit to compile and upload a first release.

Restarting the regular group of open-balena containers caused the connections to start being marked as Online in the UI too. A handful of the device records also began showing details like OS version, CPU utilization, temperature, etc. The logs had shown that these were already being reported before, but not accepted by open-balena. The devices whose statuses were now saved tended to also update to the only release, even when the fleet was not set to track this release. I believe that the status reporting might be hindered as long as a device is trying to work through this first update, which is quite demanding on the server since the registry/device cannot find and skip any layers that the device’s (forgotten) previous release and the registry’s (only) new release might have in common. The server load has been quite high due to the messy first upgrades and associated log flood, but over a quarter of the fleet has upgraded so far. Essential capabilities appear to be within reach.

Edit: server load appears to decline proportionally as more devices finish upgrading.