Devices offline & VPN Error: Cannot load DH parameters from dh.pem

Hey,
Thanks for this amazing open source project.

I installed OpenBalena on a DigitalOcean droplet on Ubuntu 19.04, following the Getting Started guide.

Before that, I used to have it installed on a Virtual Machine running in my local network and everything was fine.

I am having an issue that looks similar to this one:

Since installing on the droplet, my devices have always shown as “offline”.
However, I was initially still able to push updates to them from the droplet, but not any more.

In case it matters, I have a Raspberry Pi Zero W connected via Wi-Fi, running balenaOS 2.32.0+rev1.
Also, I cannot SSH into them any more, so I can’t take any logs from that side.

Following the mentioned thread, I pulled the logs from the VPN, using ./scripts/compose exec vpn journalctl -fn100, and they revealed a few errors:

vpn-logs.log (11.9 KB)

I’ll quote one of them here to make this thread easier to find, but please take a look at the full log above.

WARNING: POTENTIALLY DANGEROUS OPTION --verify-client-cert none|optional (or --client-cert-not-required) may accept clients which do not present a certificate
WARNING: file 'server.key' is group or others accessible
OpenVPN 2.4.0 x86_64-pc-linux-gnu [SSL (OpenSSL)] [LZO] [LZ4] [EPOLL] [PKCS11] [MH/PKTINFO] [AEAD] built on Oct 14 2018
library versions: OpenSSL 1.0.2r  26 Feb 2019, LZO 2.08
NOTE: the current --script-security setting may allow this configuration to call user-defined scripts
OpenSSL: error:0906D06C:PEM routines:PEM_read_bio:no start line
Cannot load DH parameters from dh.pem
Exiting due to fatal error

I’m an IT guy, but I don’t know much about Docker, and I certainly have no idea how to go about “fixing” this issue.
Thanks in advance for your help. I can provide any other required logs or info, just ask.

Tim

How did you move the RPi to the DigitalOcean openBalena instance? Did you re-provision it? The logs from the VPN service look scary and something seems wrong with the certificates. Can you try reprovisioning the instance from scratch? I wonder if something went silently wrong during setup.
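For context on that fatal error: “no start line” means OpenSSL found no PEM header in dh.pem at all, so the file is most likely empty or corrupt rather than merely wrong. A quick local sketch of what a valid file looks like (512 bits only so it generates quickly; real deployments use 2048+):

```shell
# A valid dh.pem must begin with the "-----BEGIN DH PARAMETERS-----" header.
# The "no start line" error means that header is missing entirely.
# Generate a throwaway file to see the expected shape:
openssl dhparam -out /tmp/dh-example.pem 512 2>/dev/null
head -n1 /tmp/dh-example.pem
# Verify the parameters actually parse and check out:
openssl dhparam -in /tmp/dh-example.pem -check -noout 2>/dev/null && echo "valid PEM"
```

Running the same `-check` against the dh.pem inside the vpn container (its exact path may differ; that part is an assumption) would confirm whether the file itself is the problem.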

I’m not sure what you mean by “move the RPi”.

If you mean moved it from my VM instance to the DigitalOcean instance, then I re-flashed the SD card with a new OS configured with the DigitalOcean instance. All good there.

If it has something to do with certificates, I would like to take the opportunity to set up a Let’s Encrypt cert. I saw that it was possible:

However, I do not know how it works exactly and I don’t think it is mentioned in the docs or the getting started.

So I’m going to re-install the project from scratch on my server, but first, can you explain how to enable the Let’s Encrypt cert?

Thank you very much.

EDIT: I’ll try re-installing using the “-c” option

[Note: I reinstalled once]
Just adding “-c” when running the quickstart script also generates a self-signed cert. No difference.
I must be missing something.

[Note: I reinstalled again]
Things keep getting weirder.
I took down the containers and deleted all balena files, then re-installed and re-configured again.
I decided to use a different email as the login this time.

I could not log in to the instance using the CLI; I would get BalenaRequestError: Request error: Unauthorized.

However, using the credentials of the old instance, I am able to log in to the new one.
So that means I am not deleting everything correctly, or that the quickstart script doesn’t work properly.

How do you go about doing a clean install?

Thanks in advance

Hey.

I would back up and remove your ./config directory, as this holds the values from the quickstart script output. The -c flag is the one you want, and you’re correct that it will still create a self-signed cert, BUT at runtime the cert is swapped out, since the ACME protocol has to run at runtime to generate the cert. You should see the cert-provider container doing this if you check the logs :+1:
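On the old credentials surviving a re-install: my guess (an assumption on my part, not verified) is that the database lives in named Docker volumes, which a plain `down` does not delete — removing the repo directory alone would not touch them. A fully clean wipe would then look something like this (untested sketch; the quickstart flags shown are placeholders for your own values):

```shell
# Untested sketch; assumes openBalena's state (users, apps, devices)
# lives in named Docker volumes managed by the compose stack.
./scripts/compose down --volumes   # stop the stack AND drop its named volumes
rm -rf ./config                    # remove the generated configuration
./scripts/quickstart -U <email> -P <password> -d <domain> -c   # fresh setup, ACME enabled
```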

Note: Make sure that your DNS is set up correctly before trying this, as ACME will fail to provide a cert if your DNS is not pointing to the server.

Hey,

my DNS is ok, so I will try again and look at the logs.

In the help, it talks about a “production mode”. Is there a production flag somewhere in this?
-c enable the ACME certificate service in staging or production mode.

What config folder are you talking about? I deleted everything in the root directory cloned from git and cloned anew.

The cert-provider container will try to get you a staging cert first, and if this is successful it will then acquire the production one. Let’s Encrypt rate-limits acquisition of production/trusted certs, so we run a check against their staging server first to make sure a cert can be acquired.
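As an aside, a quick way to tell which cert you actually ended up with is to compare the certificate’s issuer and subject: for a self-signed cert they are identical, while a Let’s Encrypt cert shows their CA as issuer. A minimal local sketch (api.example.com is a placeholder):

```shell
# Generate a throwaway self-signed cert just to demonstrate the check:
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -subj "/CN=api.example.com" \
  -keyout /tmp/key.pem -out /tmp/cert.pem 2>/dev/null
issuer=$(openssl x509 -in /tmp/cert.pem -noout -issuer)
subject=$(openssl x509 -in /tmp/cert.pem -noout -subject)
# Self-signed means issuer == subject:
[ "${issuer#issuer=}" = "${subject#subject=}" ] && echo "self-signed"
# Against the live API you would inspect the served cert instead:
#   openssl s_client -connect api.<domain>:443 -servername api.<domain> \
#     </dev/null 2>/dev/null | openssl x509 -noout -issuer
```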

The ./config folder is created in the root directory of the repo after running ./quickstart - so if you have a fresh directory then you should be fine.

I only deleted the config folder this time, and used the “-c” option.

Running ./scripts/compose exec cert-provider journalctl -fn100 returns:

OCI runtime exec failed: exec failed: container_linux.go:345: starting container process caused "exec: \"journalctl\": executable file not found in $PATH": unknown

Running ./scripts/compose exec vpn journalctl -fn100 returns:
new-vpn-logs.log (13.2 KB)

I can still see a few warnings and errors in the VPN.

Trying to connect with the CLI still gives me SELF_SIGNED_CERT_IN_CHAIN: request to https://api.{{mydomain}}/login failed, reason: self signed certificate in certificate chain, after waiting for a while.

Using the self-signed cert to connect reveals that I cannot connect with the newly configured email address and password, only the old one.
That’s not normal, but right now it’s not the main issue I suppose.

At least that means my Apps and devices still exist, even though I clearly deleted everything multiple times… Very strange.
Should I have restarted the server before installing again?

Anyways. The device is still shown offline, hasn’t received any logs recently, and creating a new release doesn’t push it to the device.

Nothing has changed with the reinstall, except the VPN error isn’t as scary as before.

@Winkelmann

The logs for the cert-provider container are just docker logs, not journalctl - so ./scripts/compose logs cert-provider should show you what’s happening there.

The devices will show as offline if they cannot bring up the VPN, so you could see what’s going on in the device supervisor by running journalctl -u resin-supervisor -fn100 in the device’s terminal (SSH on port 22222, if you have used a development balenaOS image).

Okay, for the cert ./scripts/compose logs cert-provider returns

Attaching to openbalena_cert-provider_1
cert-provider_1  | [Info] VALIDATION not set. Using default: http-01
cert-provider_1  | [Info] Waiting for api.{{mydomain}} to be available via HTTP...
cert-provider_1  | (1/6) Retrying in 5 seconds...
cert-provider_1  | (2/6) Retrying in 5 seconds...
cert-provider_1  | (3/6) Retrying in 5 seconds...
cert-provider_1  | (4/6) Retrying in 5 seconds...
cert-provider_1  | (5/6) Retrying in 5 seconds...
cert-provider_1  | (6/6) Retrying in 5 seconds...
cert-provider_1  | [Error] Unable to access api.{{mydomain}} on port 80. This is needed for certificate validation. [Stopping]

Should I be the one doing something to make api.{{mydomain}} accessible, or is it automatically done by OpenBalena?

As for the device logs: sadly, since everything was working fine, I switched to a production image a while ago.
Is it possible to flash my SD card with the dev image while retaining the data partitions?

Thanks for your help.

No, unfortunately you can re-flash an SD card with a dev image and retain the data partitions.

Not sure what you meant. “No, unfortunately you cannot re-flash”? Typo?

And what about the certificate? Is there something I’m missing, or should OpenBalena be able to do it on its own, and it’s a bug?

So the message [Error] Unable to access api.openbalena.richbayliss.dev on port 80. This is needed for certificate validation. [Stopping] happens when the API container is not reachable when the cert-provider container is started. I found this to be the case when the images have to be pulled, so I had to issue a ./scripts/compose restart cert-provider to make it try the cert acquisition again.
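For what it’s worth, that availability check boils down to a plain HTTP request on port 80, so you can reproduce it yourself with curl. A sketch against a throwaway local server (against the droplet you would probe http://api.&lt;domain&gt;/ instead — the port and URL below are stand-ins):

```shell
# Spin up a throwaway HTTP server as a stand-in for the API on port 80:
python3 -m http.server 8080 --bind 127.0.0.1 >/dev/null 2>&1 &
server_pid=$!
sleep 1
# The probe itself: fetch only the status code, discard the body.
code=$(curl -s -o /dev/null -w '%{http_code}' http://127.0.0.1:8080/)
kill "$server_pid"
echo "HTTP $code"   # a connection-refused here is what triggers the 6 retries
```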

You need to make sure api.{openbalena domain} is pingable and resolving to the machine’s IP address, but otherwise the application stack will route the requests as required :+1:
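A quick way to sanity-check the resolution part from any machine (localhost is used below only so the example runs anywhere; the openBalena subdomain list is my assumption from the getting started guide):

```shell
# Tiny resolver check using the system resolver via getent (Linux):
check_resolves() {
  if getent hosts "$1" >/dev/null; then
    echo "$1 resolves"
  else
    echo "$1 MISSING"
  fi
}
# Demonstrated on localhost; for openBalena you would check e.g.
# api.<domain>, registry.<domain>, vpn.<domain> and s3.<domain>.
check_resolves localhost
```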

I have also confirmed that the stack works on Digital Ocean with Ubuntu 19.04 and the latest Docker-CE – so your setup should be fine.

Sorry about the typo. As you guessed, you cannot re-flash an SD card and preserve the data partitions.

Hey,

Thank you very much @richbayliss . Restarting the cert-provider fixed the certificate.
I now have a beautiful Let’s Encrypt certificate.

Obviously, I would recommend mentioning the “-c” option of the quickstart script, and the need to restart the container, in the Getting Started page for OpenBalena :slight_smile:

As for the other (main, I guess) issue, I suppose I will reflash the RPi with a dev image and see from there.
It might be a problem with the device itself, or my software on it that made it crash… Don’t know…
Thanks @thgreasi for the answer.

I’ll keep you updated here.

@Winkelmann please report back here with your experience :+1: Glad it’s working, and I will be putting your feedback into our development loop :tada:

The new device works perfectly. Shows Online, I can SSH into it and it received the latest version of my software.

Only one minor issue left: the old device has stayed in the list as a ghost. I thought it would get replaced, since the new device has the same MAC address, same IP address, and now the same name.

There is a “leave” command in the CLI, but that works by IP.
If you know a way to remove a device from the list, I’m up for it.

Thanks again

If you know a way to remove a device from the list, I’m up for it.

To confirm, do you mean the list of devices that belong to an application in its web dashboard? If so, indeed I think devices are identified purely by UUID (not MAC address, IP and the like), and re-flashing generates a new UUID, so the old UUID will stay in the list until the device is deleted from the balena API’s database.

The balena leave command (CLI) indeed will not delete the device from the list, but I think the balena device rm <uuid> command will do the trick:

$ balena device rm 28a734d6cdcebe0b1051ab476f503df0
? Are you sure you want to delete the device? Yes
balena device rm <uuid>

… was the command I was looking for, thanks :slight_smile:

I tried balena help device, but that didn’t show me the existence of the rm command; that’s why I asked.

With that, I’m out of problems (for now :stuck_out_tongue: )