Trouble with openbalena after internet outage

Hi,
We have been testing openBalena for several months now, but today we ran into some very interesting problems…

Here is a short timeline of events and the measures we took:

  1. The openBalena server and other services lost connectivity for a few hours (broken fiber from our ISP).
  2. After the services came back, the container running the app was misbehaving (it could be a bug in the release, but we are not sure).
  3. When we looked at the openBalena VM we had a surprise: it was stuck with a lot of memory-exhausted errors (see image: https://prnt.sc/q2zfut).
  4. We rebooted the VM and started the services… all was fine.
  5. We logged in to balena in order to use the balena CLI utilities…
  6. We saw the devices online and issued a reboot to both of them; the commands went through without errors and the devices went offline…
  7. Now, an hour later, openBalena still shows the devices as offline…

Here is the device info at the moment: http://prntscr.com/q2zh7a
Since we are using a production image (yes, we saw the SSH part way too late), it seems we have limited or no remote options, so we plan to send out engineers to power-cycle each device (unfortunately by removing the power connector).

We are a bit lost on how to proceed, since we are not sure where and why things got stuck…
Any ideas?
Also, do you think going back to development images is a good idea?
Can I just ship 2 new SD cards with the development image and have them come online, in order to get the software onto the same hardware?
This will only create new device UUIDs, but the application will work as expected (we already did this with device number 2 since we had a dead SD card).

Thanks for creating openBalena; I'm willing to provide more info as this project develops…

Please share some ideas… we have an important event on Friday and this is a very bad moment to lose the devices…

Thanks all…

Hi there,
I'm glad you find openBalena helpful, and welcome to the forums.

Here are a few questions to clarify things:
What version of openBalena are you using?
When you say the services lost connectivity, do you mean that the openBalena servers were not reachable by the devices, or that the openBalena servers could not access the Internet?

As for the devices not connecting after the reboot, it's not really possible to understand why this happened without getting extra info from the devices. Please let us know if power-cycling helps in your case.
I don't think a development OS image would really help you in this case, unless you have another machine on the same network that you can SSH into and use as a jump host…

When I say we lost services… I mean all VPSes (app server, openBalena server) lost internet connectivity in both directions (an outage that lasted almost an hour). This means both that the openBalena servers were not reachable by the devices and that they could not access the Internet…

The openBalena version was the latest from approximately 1 month ago: version 2.0 according to the changelog…

I'm still not certain whether they rebooted correctly after I sent out balena device reboot xxxxxxx

My idea about maybe using a devel image is that I could SSH through the openBalena server to the Pi 2 (if I understood the proxytunnel tutorial you have on the forums) before executing any reboots (and debug possible connectivity issues there)…
Also, I'm not sure whether, in the current setup, there is a command to restart only the app container rather than the whole Pi 2 device (I think I have seen this in the cloud version of balena).

Thanks for the quick support…
I'll update you on the power cycle in the morning (it's 9 PM in Skopje at the moment, so at least 11 hours from now).

If there are any additional ideas, please advise…

Just another update…
I opened htop on the openBalena server and I see usage going pretty high for an openBalena instance with only 2 end devices, on a VPS with 6 GB RAM and 4 vCPUs…

Is this normal?

> My idea about maybe using a devel image is that I could SSH through the openBalena server to the Pi 2 (if I understood the proxytunnel tutorial you have on the forums) before executing any reboots (and debug possible connectivity issues there)…

The development image is basically open to SSH connections from the local network. It also exposes its engine control port to enable the development workflow, where you can push images to the device directly from a dev machine without making releases on the server. However, development images should not be used in production deployments for security reasons: anyone on the local network would be able to control a device running a dev image.

The balena ssh {uuid} command should work for prod images and give you SSH access to the connected devices.

> I opened htop on the openBalena server and I see usage going pretty high for an openBalena instance with only 2 end devices, on a VPS with 6 GB RAM and 4 vCPUs…

How do you deploy openBalena on the server? Is it done with docker-compose? I think the docker stats command would give more meaningful information on which resources are used by which openBalena components. You could also try checking the service logs (with docker logs) for any anomalies.
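
For example (the container name here is illustrative; it depends on your compose project name, so yours may differ):

    docker stats --no-stream
    docker logs --tail 100 openbalena_api_1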

Ok,
The devel vs. production image usage concept is now clear - we will remain on production images :slight_smile:

Also, glad that

> balena ssh uuid

now works (which probably makes this explanation no longer relevant: HowTo: SSH into host device)

The openBalena server has been deployed with docker-compose…
Here are the stats: http://prntscr.com/q323xn - they seem pretty normal/sane to me…
After running stats for several minutes, the only spike I saw was:
70efdf44e81c openbalena_s3_1 ~3 to 5%

In the docker logs, most containers had nothing special; the only ones with something informative are:

DB:
2019-11-27 17:46:44.466 UTC [1] LOG: listening on IPv4 address "0.0.0.0", port 5432
2019-11-27 17:46:44.466 UTC [1] LOG: listening on IPv6 address "::", port 5432
2019-11-27 17:46:44.590 UTC [1] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2019-11-27 17:46:44.842 UTC [23] LOG: database system was interrupted; last known up at 2019-11-26 01:14:29 UTC
2019-11-27 17:46:45.394 UTC [23] LOG: database system was not properly shut down; automatic recovery in progress
2019-11-27 17:46:45.509 UTC [23] LOG: redo starts at 0/132FAFD0
2019-11-27 17:46:45.644 UTC [23] LOG: invalid record length at 0/132FC8F8: wanted 24, got 0
2019-11-27 17:46:45.644 UTC [23] LOG: redo done at 0/132FC8C0
2019-11-27 17:46:45.644 UTC [23] LOG: last completed transaction was at log time 2019-11-26 01:18:13.78291+00
2019-11-27 17:46:49.101 UTC [1] LOG: database system is ready to accept connections
2019-11-27 17:47:07.200 UTC [30] ERROR: relation "uniq_model_model_type_vocab" already exists
2019-11-27 17:47:07.200 UTC [30] STATEMENT: CREATE UNIQUE INDEX "uniq_model_model_type_vocab" ON "model" ("is of-vocabulary", "model type");

cert-provider:
[Error] ACTIVE variable is not enabled. Value should be "true" or "yes" to continue.
[Error] Unable to continue due to misconfiguration. See errors above. [Stopping]
[Error] ACTIVE variable is not enabled. Value should be "true" or "yes" to continue.
[Error] Unable to continue due to misconfiguration. See errors above. [Stopping]
[Error] ACTIVE variable is not enabled. Value should be "true" or "yes" to continue.
[Error] Unable to continue due to misconfiguration. See errors above. [Stopping]
[Error] ACTIVE variable is not enabled. Value should be "true" or "yes" to continue.
[Error] Unable to continue due to misconfiguration. See errors above. [Stopping]

haproxy:
Building certificate from environment variables…
Setting up watches. Beware: since -r was given, this may take a while!
Watches established.
[NOTICE] 330/174655 (15) : New worker #1 (17) forked
[WARNING] 330/174655 (17) : Server backend_api/balena_api_1 is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[ALERT] 330/174655 (17) : backend 'backend_api' has no server available!
[WARNING] 330/174656 (17) : Server backend_registry/balena_registry_1 is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[ALERT] 330/174656 (17) : backend 'backend_registry' has no server available!
[WARNING] 330/174656 (17) : Server backend_vpn/balena_vpn_1 is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[ALERT] 330/174656 (17) : backend 'backend_vpn' has no server available!
[WARNING] 330/174656 (17) : Server backend_s3/balena_s3_1 is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[ALERT] 330/174656 (17) : backend 'backend_s3' has no server available!
[WARNING] 330/174657 (17) : Server vpn-tunnel/balena_vpn is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[ALERT] 330/174657 (17) : proxy 'vpn-tunnel' has no server available!
[WARNING] 330/174700 (17) : Server backend_vpn/balena_vpn_1 is UP, reason: Layer4 check passed, check duration: 0ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[WARNING] 330/174703 (17) : Server vpn-tunnel/balena_vpn is UP, reason: Layer4 check passed, check duration: 0ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[WARNING] 330/174704 (17) : Server backend_registry/balena_registry_1 is UP, reason: Layer4 check passed, check duration: 0ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[WARNING] 330/174713 (17) : Server backend_api/balena_api_1 is UP, reason: Layer4 check passed, check duration: 0ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.

Looking at this, most of these entries are related to the reboot of the openBalena VM after it exhausted its memory…
Not sure whether this sheds any light on the matter…

Sorry for posting again; I forgot to ask for clarification. Does the balena CLI command:

balena ssh uuid

work without setting up SSH keys, and if not, what is the easiest way to deploy keys to the existing devices in the field?

Sorry for misleading you. balena ssh is still not supported, but the balena tunnel command should provide the most convenient workaround, as described in the post you referred to.
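
A rough sketch of that workaround (the UUID is a placeholder and the local port is just an example; 22222 is the host OS SSH port on balenaOS):

    balena tunnel <uuid> -p 22222:4321
    # then, from another terminal:
    ssh -p 4321 root@localhost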

Indeed, you need SSH keys installed on devices running prod images. As you are probably aware, one way is to include them in config.json on the device. For already-deployed devices, this would require re-flashing them with an OS image that contains this config.json.
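
If you do go the config.json route, newer balenaOS releases accept public keys in an os.sshKeys array; a minimal sketch (the key value is a placeholder, and you should check that your OS version actually supports this field):

    {
      "os": {
        "sshKeys": [
          "ssh-ed25519 AAAA... user@host"
        ]
      }
    }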

That said, with recent versions of balenaOS and the openBalena API, devices are capable of dynamically fetching the user's public key using a dedicated API endpoint. We really lack documentation for this, but you can find the relevant PRs in our GitHub repos:

I haven't personally tried this feature with openBalena, but I believe it should be possible to get it working after bumping the API service on your openBalena server and getting a new OS onto the device, which also requires re-flashing.

And a note on logging: the API service container has systemd inside, and service logs are written to journald there.
Hence, to read the API logs you would need to run:

docker exec -it {api_service_container} journalctl
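
For example, to follow the most recent entries (the container name is just what a default docker-compose deployment would produce):

    docker exec -it openbalena_api_1 journalctl -f -n 100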

Hmmm…
I see config.json as an option for SSH now… and it seems re-imaging is a must…

But I can't figure out how to apply this config.json in the getting-started part of openBalena, where you prepare the image before flashing it to the device in accordance with the docs…

I found this: https://www.balena.io/docs/reference/cli/#config-generate
But I'm not sure whether this can be used for SSH, and if so, at what point I should execute it while creating the .img for balenaEtcher, or whether something else needs to be done…

Thanks for the great help so far - things are starting to look not quite so black now…

Hi @bidikov,

One of our openBalena engineers has had a look into this, and we believe the underlying problem with the memory comes down to the way the VPS handled the outage. We've noted that the processes with the largest memory footprint were the docker-proxy processes, each taking 500 MB (and there were three of them). We've not seen this occur before (and we have several systems using the openBalena framework which are essentially soak-tested), so we believe this might be an artefact of how the underlying networking on the VPS was dealing with data (possibly buffering a large amount of untransmittable packets).

As for device SSH access, you can actually generate your own private and public keypair, and then set the OPENBALENA_SSH_AUTHORIZED_KEY value in the generated config/activate file (created when you ran the quickstart script) to the generated public key. The private key will then allow you to access your devices, as they'll pull the authorized key from the API. We highly recommend this over changing config.json with the desired keys, as it will also allow you to change the keys dynamically in the future.
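
A minimal sketch of that flow, assuming config/activate is a shell file of export statements as produced by the quickstart (the file path and key type are just examples):

    ssh-keygen -t ed25519 -f ~/.ssh/openbalena -N ""
    # in config/activate, point the variable at the generated public key:
    export OPENBALENA_SSH_AUTHORIZED_KEY="$(cat ~/.ssh/openbalena.pub)"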

Best regards,

Heds

Hi,
Just to update… one of the 2 devices is online after a “simple” power cycle…
The second device is being debugged - we see symptoms of SD card failure, which is statistically very unlikely for devices that have been online only 1 week, especially since we already saw a strange “SD failure” once before on the same hardware, when we did a simple balena shutdown xxxxx and then mounted the device in the production location…

Thank you for looking into the VPS issue… if we can provide additional info (even SSH access to the VPS), please say so…

Can you clarify the SSH access once more, since in this thread I have seen several different positions from you balena people :slight_smile:

Thanks for the great support so far…

Hi @bidikov,

Apologies for the SSH access confusion. There are several ways to go about this, but the one we officially advise with openBalena is to alter the environment variables in the generated config/activate file, as openBalena itself will then handle the exchange with the devices. I'm going to ping the relevant openBalena maintainer to verify my previous answer and ensure it has enough info.

On the SD card issues, we regularly see problems with devices using SD cards that aren't of an industrial quality matching their use with balena. We generally recommend the SanDisk Extreme Pro series of SD cards, which we know are industrially soak-tested to a very high degree, and which we use internally at balena. We'd strongly suggest trying a different make of SD card to see whether this also solves your issue.

Thank you for the offer of VPS access, but unfortunately it's difficult for us to get involved with underlying hardware/distribution issues, as these differ for most customers. We'd recommend trying openBalena on a local Linux machine as well, to confirm that this is a VPS problem (which it does appear to be).

Best regards,

Heds

Hi,
thanks for the clarifications…

I will wait for the official info (let's say a how-to) on doing SSH the right way…

We have used SanDisk SD cards for many years and across many projects, and this is the first time we have had 2 failures on the same device - this could be a bad batch of SD cards (we use a different supplier precisely for this reason, so maybe it was just bad luck).

Well, trying this on a bare-metal server will be an interesting exercise… we have been using VMs on a VMware hypervisor for 10+ years now…

Best regards,

Hi again,

My colleague has confirmed that the information I gave you on adding the public key to the config/activate file is indeed the correct way to add your keys for the devices.

It does sound possible that the SD card batch is problematic. We've spent a lot of effort on making sure balenaOS does not abuse SD cards and performs the minimum number of writes to avoid wear. I personally don't think I've ever seen a problem caused by balenaOS itself; the problems have mostly come from card makes that aren't well rated.

We do have customers running both openBalena and On-premises balena on VPCs, and have not seen this issue before. As we do carry out soak-testing, it does still appear to be an issue with this particular VPS. If you could let us know the outcome of testing openBalena on a bare metal machine, we’d be very interested to hear if you see the same issue!

Thanks and best regards,

Heds

Hi,
So, just using:
balena env add OPENBALENA_SSH_AUTHORIZED_KEY xxxxxxxxxxxx --application app1
will add the SSH key for the root user to the devices?
Do I need to wrap the key content in anything, since it's a long string?

We will also look into the VM angle you mentioned… there could be some incompatibility between the latest Ubuntu 18.04 LTS and an older VMware hypervisor…

Thank you for all the superb support - it shows how good this project is, and as I promised your colleague David, I will send you more details and suggestions based on our openBalena experience…

All the best,

Hey there!

The OPENBALENA_SSH_AUTHORIZED_KEY is the public SSH key that you allow to connect to your devices as root. The SSH public key should be just a string (a long one indeed), but it shouldn’t require any special treatment.
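
When setting it from a shell, quote the value so the spaces inside the key survive; a sketch reusing the key path and application name from earlier in this thread (both are examples):

    balena env add OPENBALENA_SSH_AUTHORIZED_KEY "$(cat ~/.ssh/openbalena.pub)" --application app1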

Thanks a lot for the kind words and keep us updated of how things are going!


Hi team,
I have to give an update on several fronts:

  1. The VM seems stable, although we did not do anything special (apart from adding some request limits on the web services for countries that will never communicate with the server).

  2. One of the 2 nodes that hung last time was traced to an issue with the R-PI 3 itself. Even with other SD cards (the ones we suspected were corrupt), the Pi did not boot after a balena reboot or shutdown command was issued. We are still debugging that “bad” R-PI.

  3. The other node lost connectivity 20 hours ago (balena also shows it as offline) and the app is not working… but since there were internet provider outages, I'm not sure where the problem lies. I need a way to set up a well-known, stable DNS server (like 1.1.1.1) while keeping DHCP on the LAN interface of the R-PI, to rule out DNS as the issue…

  4. Is there a way to have some kind of “watchdog” in balenaOS that will allow a graceful reboot of the R-PI in case of prolonged connectivity loss (let's say, reboot automatically if there is no connectivity to a given IP/host)?

As soon as we restore connectivity, I plan to set up the SSH key part we already discussed…
I'm sending engineers on site this Saturday, so any ideas will be more than welcome :slight_smile:

Hello, thanks for the update. Unfortunately there's no such watchdog in balenaOS, but it should be trivial to implement in your application. Let us know how everything pans out.

Hi,
Based on the initial info from the field, there was a local internet outage on site for 2-3 hours, after which both balenaOS and the app inside stopped working…
After a power cycle everything was OK…
Since the network is DHCP-based, I suppose the problem was that the Pi lost its lease…

Having a watchdog in the Docker container will not help, since from inside the container it cannot reset the whole Pi…
Some ideas will definitely be needed…

Hello,

You can interact with the balena supervisor from inside a container using the supervisor API. One of the endpoints enables you to reboot the device; you could then trigger a reboot whenever certain connectivity conditions are met (for example, if X successive pings to 8.8.8.8 fail). Be sure to have well defined conditions to avoid forcing a device into a reboot loop.

You can find more information on how to interact with the Supervisor API here:
https://www.balena.io/docs/reference/supervisor/supervisor-api
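
A minimal sketch of such a watchdog to run inside your app container (the thresholds are illustrative; BALENA_SUPERVISOR_ADDRESS and BALENA_SUPERVISOR_API_KEY are injected by the supervisor, and in multicontainer setups the service may need the io.balena.features.supervisor-api label):

    #!/bin/sh
    # Reboot via the supervisor API after too many consecutive failed pings.
    FAILS=0
    while true; do
      if ping -c 1 -W 5 8.8.8.8 > /dev/null 2>&1; then
        FAILS=0
      else
        FAILS=$((FAILS + 1))
      fi
      # 30 failures at 60-second intervals ~= 30 minutes offline before rebooting
      if [ "$FAILS" -ge 30 ]; then
        curl -X POST "$BALENA_SUPERVISOR_ADDRESS/v1/reboot?apikey=$BALENA_SUPERVISOR_API_KEY"
        FAILS=0
      fi
      sleep 60
    done
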

Let us know if this approach would work for your use case.