SSH has stopped working to all devices

Hi All,

I have 20 devices that have been running for around a year without any issues.

I am running openBalena 2.0.1 and the devices are running BalenaOS versions between 2.44 and 2.58.

I have previously been able to SSH into them using a proxy tunnel, as described in ‘HowTo: SSH into host device’ (@richbayliss).

In the last couple of days, though, whenever I try to SSH into any of the devices I am running, I receive the following error:

$ ssh root@d6a20843efed99bbe56f813ac4b797e2.balena
Via xxx.xxx.xxx.xxx:3128 -> d6a20843efed99bbe56f813ac4b797e2.balena:22222
analyze_HTTP: readline failed: Connection closed by remote host
ssh_exchange_identification: Connection closed by remote host

All of the devices show online status, and in the logs I can see them making requests to the API regularly.

What is interesting is that the certificate presented by the API endpoint was automatically renewed earlier this week, which makes me suspect some sort of certificate error is occurring with the VPN.

This is also interesting as Let’s Encrypt recently added a new root CA - I’m wondering if perhaps these devices do not have this root CA installed in their base OS?

In short, has anyone else experienced this, or does anyone have advice on how I might debug VPN/SSH to identify the root cause?

Cheers,
Chris

Just updating with debugging done to date:

  • Inspected logs of all openBalena server containers. Can see all of the devices regularly checking in with API using /device/v2/(uuid)/state endpoint.
  • Updated all OS packages on openBalena server and rebooted.
  • Tested certificates on endpoints with openssl results as follows:

CONNECTED(00000005)
140263854088640:error:1408F10B:SSL routines:ssl3_get_record:wrong version number:…/ssl/record/ssl3_record.c:332:

no peer certificate available

No client certificate CA names sent

SSL handshake has read 5 bytes and written 327 bytes
Verification: OK

New, (NONE), Cipher is (NONE)
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
Early data was not sent
Verify return code: 0 (ok)

  • Provisioned a new device from the same image used for the most recently provisioned devices (intel-nuc 2.50.1). This new device is not connecting to either the API or the VPN, and nothing specific to it is coming through in the logs.

My intention now is to try provisioning a new device with the 2.58.6 x86 Generic image and see if it will connect to this openBalena server.

Any guidance on how to debug this would be much appreciated.

Okay, further investigation. As my suspicion is SSL-based, I started looking through the certificates in the open-balena/config/certs directory.

Looking at /open-balena/config/certs/root/issues/*.mydomain.com.crt I noticed the following:

Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            db:a1:be:74:fa:31:f1:08:74:8c:2e:7c:1b:5e:1a:50
    Signature Algorithm: sha256WithRSAEncryption
        Issuer: CN=ca.mydomain.com
        Validity
            Not Before: May  1 03:33:13 2019 GMT
            Not After : Apr 30 03:33:13 2021 GMT

Also the VPN-CA:

Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            88:5e:45:21:43:c0:32:18:dc:2d:ce:4e:f4:60:2b:da
    Signature Algorithm: sha256WithRSAEncryption
        Issuer: CN=vpn-ca.mydomain.com
        Validity
            Not Before: May  1 03:33:15 2019 GMT
            Not After : Apr 30 03:33:15 2021 GMT
        Subject: CN=vpn.mydomain.com

This would line up with when SSH stopped working (nothing has been changed about this server or the openBalena config in months).
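As a quick cross-check of those dates, here is a minimal Python sketch (an illustration only, not part of openBalena). It compares the two `Not After` values from the dumps above against a chosen moment. `is_expired` and the `certs` labels are hypothetical names made up for this example; `ssl.cert_time_to_seconds` is the standard-library call that parses the date format openssl prints.

```python
import ssl
from datetime import datetime, timezone

def is_expired(not_after: str, at: datetime) -> bool:
    """Return True if the certificate's 'Not After' time is earlier than `at`.

    `not_after` is the date string as printed by openssl,
    e.g. "Apr 30 03:33:15 2021 GMT". `at` must be timezone-aware.
    """
    expiry = ssl.cert_time_to_seconds(not_after)  # seconds since the epoch, UTC
    return at.timestamp() > expiry

# 'Not After' values copied from the two certificate dumps above.
# The dictionary keys are just labels for this example.
certs = {
    "*.mydomain.com.crt": "Apr 30 03:33:13 2021 GMT",
    "CN=vpn.mydomain.com": "Apr 30 03:33:15 2021 GMT",
}

# Assumption: a moment shortly after SSH stopped working, picked for illustration.
when = datetime(2021, 5, 2, tzinfo=timezone.utc)

for name, not_after in certs.items():
    status = "EXPIRED" if is_expired(not_after, when) else "valid"
    print(f"{name}: Not After {not_after} -> {status}")
```

Passing `datetime.now(timezone.utc)` as `at` gives the same check against the current time.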

Would the root CA expiring stop VPN connections from working properly, particularly tunnels via the VPN?

Also what is the procedure for renewing the CA and safely having existing devices continue to work after renewal?

Actually, looking at the logs for the VPN, there is a lot of this:

May 02 08:47:51 4adc57eb6870 haproxy[231]: Routing 223.186.35.235:1373@tcp-443 to vpn-cluster/vpn1:59996 [C:2/2 Q:0/0 T:1/0]
May 02 08:47:51 4adc57eb6870 balena-vpn-api[1489]: TCP connection established with [AF_INET]127.0.0.1:59996
May 02 08:47:52 4adc57eb6870 balena-vpn-api[1489]: 127.0.0.1:59992 Connection reset, restarting [0]
May 02 08:47:52 4adc57eb6870 balena-vpn-api[1489]: 127.0.0.1:59996 Connection reset, restarting [0]

VPN connections appear to be reset immediately. Will see if I can get some more verbosity on that.

Sorry for the spam. Just trying to share all the information I have and steps I have tried.

So I have set up a new device with a dev BalenaOS version. It isn’t connected to the openBalena server (its status is offline).

When I ssh into it locally and look at the logs for the supervisor I see this:

root@00cc9cf:~# balena logs resin_supervisor --tail 10000 -f
[api] GET /v1/healthy 200 - 12.964 ms
[api] GET /v1/healthy 200 - 1.319 ms
[debug] Attempting container log timestamp flush…
[debug] Container log timestamp flush complete
WARNING: ca-certificates.crt does not contain exactly one certificate or CRL: skipping
Warning: Ignoring extra certs from /etc/ssl/certs/balenaRootCA.pem, load failed: error:02001002:system library:fopen:No such file or directory
[success] Device state apply success

This would suggest it is connecting to the API and updating state correctly; however, its state remains offline when I query the device with the balena CLI.

Looking in config.json on the image that was used to provision this device, ‘balenaRootCA’ is clearly set.

Also checking /mnt/boot/config.json balenaRootCA is clearly set on this device as well.

I can also tail the logs for openvpn on this device and can see this:

root@00cc9cf:/resin-boot# systemctl status openvpn.service
openvpn.service - OpenVPN
Loaded: loaded (/lib/systemd/system/openvpn.service; enabled; vendor preset: enabled)
Active: active (running) since Sun 2021-05-02 10:16:48 UTC; 36min ago
Main PID: 1150 (openvpn)
Tasks: 1 (limit: 2358)
Memory: 2.1M
CGroup: /system.slice/openvpn.service
└─1150 /usr/sbin/openvpn --writepid /run/openvpn/openvpn.pid --cd /etc/openvpn/ --config /etc/openvpn/openvpn.conf --connect-retry 5 120

May 02 10:52:05 00cc9cf openvpn[1150]: Sun May 2 10:52:05 2021 TLS: Initial packet from [AF_INET]3.104.60.33:443, sid=9aa02bbf c23f58b3
May 02 10:52:05 00cc9cf openvpn[1150]: Sun May 2 10:52:05 2021 VERIFY OK: depth=1, CN=vpn-ca.mydomain.com
May 02 10:52:05 00cc9cf openvpn[1150]: Sun May 2 10:52:05 2021 VERIFY ERROR: depth=0, error=certificate has expired: CN=vpn.mydomain.com
May 02 10:52:05 00cc9cf openvpn[1150]: Sun May 2 10:52:05 2021 OpenSSL: error:1416F086:SSL routines:tls_process_server_certificate:certificate verify failed
May 02 10:52:05 00cc9cf openvpn[1150]: Sun May 2 10:52:05 2021 TLS_ERROR: BIO read tls_read_plaintext error
May 02 10:52:05 00cc9cf openvpn[1150]: Sun May 2 10:52:05 2021 TLS Error: TLS object -> incoming plaintext read error
May 02 10:52:05 00cc9cf openvpn[1150]: Sun May 2 10:52:05 2021 TLS Error: TLS handshake failed
May 02 10:52:05 00cc9cf openvpn[1150]: Sun May 2 10:52:05 2021 Fatal TLS error (check_tls_errors_co), restarting
May 02 10:52:05 00cc9cf openvpn[1150]: Sun May 2 10:52:05 2021 SIGUSR1[soft,tls-error] received, process restarting
May 02 10:52:05 00cc9cf openvpn[1150]: Sun May 2 10:52:05 2021 Restart pause, 120 second(s)
root@00cc9cf:/resin-boot#

Seems to be the smoking gun re: vpn certificate expiry being the issue, will see if I can work out how to renew safely at server end.
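To pull just the verification results out of a longer OpenVPN log, something like the following Python sketch could help. It is only an illustration: `verify_events` and `VERIFY_RE` are made-up names, and the regular expression is written against the two `VERIFY` line shapes seen in the log above, so other OpenVPN versions or subjects with more fields may need a looser pattern. In the chain, depth=0 is the server's own certificate and depth=1 is the CA that issued it, so the output above points at the vpn.mydomain.com certificate having expired while vpn-ca.mydomain.com still verifies.

```python
import re

# Matches OpenVPN certificate verification lines of the two shapes seen above:
#   VERIFY OK: depth=1, CN=vpn-ca.mydomain.com
#   VERIFY ERROR: depth=0, error=certificate has expired: CN=vpn.mydomain.com
VERIFY_RE = re.compile(
    r"VERIFY (?P<result>OK|ERROR): depth=(?P<depth>\d+), "
    r"(?:error=(?P<error>.+?): )?(?P<subject>CN=.*)$"
)

def verify_events(log_text: str):
    """Return a (result, depth, error, subject) tuple for every VERIFY line."""
    events = []
    for line in log_text.splitlines():
        m = VERIFY_RE.search(line)
        if m:
            events.append((m["result"], int(m["depth"]), m["error"], m["subject"]))
    return events

# Three lines copied from the openvpn output above.
sample = """\
May 02 10:52:05 00cc9cf openvpn[1150]: Sun May 2 10:52:05 2021 VERIFY OK: depth=1, CN=vpn-ca.mydomain.com
May 02 10:52:05 00cc9cf openvpn[1150]: Sun May 2 10:52:05 2021 VERIFY ERROR: depth=0, error=certificate has expired: CN=vpn.mydomain.com
May 02 10:52:05 00cc9cf openvpn[1150]: Sun May 2 10:52:05 2021 TLS Error: TLS handshake failed
"""

for result, depth, error, subject in verify_events(sample):
    print(f"depth={depth} {result} {subject} {error or ''}")
```

On a device the log text could come from the journal output for openvpn.service saved to a file and read with `open(...).read()`.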

Okay, all solved. Ma man @wolf_karl has all the answers you seek over here if you need to renew the VPN certificate: VPN Certs seems to be expired - #10 by wolf_karl