No API communication

Hi,

I have a device running balenaOS 2.47.1+rev1, supervisor 10.6.27.
It only comes online as Online (VPN only) and it will not start updating to the latest release.

I can ssh into it. In the supervisor logs I see:

[event]   Event: Device state report failure {"error":{"message":""}}
(node:1) UnhandledPromiseRejectionWarning: Error: getaddrinfo EAI_AGAIN api.balena-cloud.com api.balena-cloud.com:443
    at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:56:26)
(node:1) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 2)
[event]   Event: Device state report failure {"error":{"message":""}}
(node:1) PromiseRejectionHandledWarning: Promise rejection was handled asynchronously (rejection id: 2)
[info]    Internet Connectivity: OK

I understand that this means it can’t resolve api.balena-cloud.com.

But when I run:

root@22bbd38:~# nslookup api.balena-cloud.com
Server:    127.0.0.2
Address 1: 127.0.0.2 22bbd38

Name:      api.balena-cloud.com
Address 1: 18.208.61.234 ec2-18-208-61-234.compute-1.amazonaws.com
Address 2: 18.210.71.79 ec2-18-210-71-79.compute-1.amazonaws.com
Address 3: 3.229.114.198 ec2-3-229-114-198.compute-1.amazonaws.com

And I can also ping the API from the command line.

root@22bbd38:~# curl -v  https://api.balena-cloud.com/ping
*   Trying 3.229.114.198...
* TCP_NODELAY set
* Connected to api.balena-cloud.com (3.229.114.198) port 443 (#0)
* found 128 certificates in /etc/ssl/certs/ca-certificates.crt
* ALPN, offering http/1.1
* SSL connection using TLS1.2 / ECDHE_RSA_AES_128_GCM_SHA256
* 	 server certificate verification OK
* 	 server certificate status verification SKIPPED
* 	 common name: balena.io (matched)
* 	 server certificate expiration date OK
* 	 server certificate activation date OK
* 	 certificate public key: RSA
* 	 certificate version: #3
* 	 subject: CN=balena.io
* 	 start date: Fri, 27 Sep 2019 00:00:00 GMT
* 	 expire date: Tue, 27 Oct 2020 12:00:00 GMT
* 	 issuer: C=US,O=Amazon,OU=Server CA 1B,CN=Amazon
* ALPN, server accepted to use http/1.1
> GET /ping HTTP/1.1
> Host: api.balena-cloud.com
> User-Agent: curl/7.64.1
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Thu, 02 Jul 2020 10:43:12 GMT
< Content-Type: text/plain; charset=utf-8
< Content-Length: 2
< Connection: keep-alive
< X-Frame-Options: DENY
< X-Content-Type-Options: nosniff
< ETag: W/"2-nOO9QiTIwXgNtWtBJezz8kv3SLc"
<

Any ideas what could be the matter? Why won’t the supervisor connect to the API?

Cheers,
Erik

Hi, can you run the device health checks from the diagnostics tab on the dashboard?

Some networking issues detected: 
test_upstream_dns: DNS lookup failed for 0.resinio.pool.ntp.org via upstream: 192.168.0.1
test_upstream_dns: DNS lookup failed for api.balena-cloud.com via upstream: 192.168.0.1
test_balena_registry: Could not communicate with registry2.balena-cloud.com for authentication

But on the command line I can resolve all of these just fine.

I noticed the latency on the connection is fairly high.

root@22bbd38:~# ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
64 bytes from 8.8.8.8: seq=0 ttl=112 time=588.433 ms
64 bytes from 8.8.8.8: seq=1 ttl=112 time=647.057 ms

It is around 600 ms. Could that be throwing the JavaScript code off?

Hi @ErikHH

Thanks for the extra info. I just want to follow up on what karaxuna asked - are you able to run health checks for this device from the diagnostics tab on the dashboard?

Kind regards
Alida

Hi @AlidaOdendaal,

Yes, I managed to run the health checks from the dashboard. That’s where the output I pasted above came from. The check_networking check was the only one failing.

In the meantime the device did manage to contact the API on its own; we didn’t change anything. It installed the latest update and we’re now able to continue.

I would still be interested to know the root cause of this. It’s kind of difficult to explain to our customer that the device doesn’t come through for two hours even though they meet all the connectivity criteria we set.

Cheers,
Erik

Hi Erik,

You can try running the same nslookup command from the supervisor container with:

balena exec -it resin_supervisor /bin/sh

If nslookup fails in the supervisor container but not on the host OS, then something is wrong with the internal balenaEngine/Docker resolver. If it fails in both the supervisor container and the host OS, then the problem is most probably a networking issue outside the device.
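
To spell out how I would read the two results, here is a minimal sketch in Python. The function name `diagnose` and its return strings are only illustrative, and the case where the host fails but the container succeeds is not covered by the rule above, so it is marked as inconclusive.

```python
def diagnose(host_ok, container_ok):
    """Map the two nslookup outcomes to the most likely fault location."""
    if host_ok and container_ok:
        return "resolution works in both places"
    if host_ok and not container_ok:
        # Host resolves, container does not: suspect the engine's resolver
        return "internal balenaEngine/Docker resolver"
    if not host_ok and not container_ok:
        # Neither resolves: suspect the network outside the device
        return "network outside the device"
    # Assumption: host fails but container succeeds is not covered above
    return "inconclusive"
```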

Can you please check this and let us know what you see there?

Thanks,
Zahari

Hi @majorz,

Sorry for the delay, the issue went away for a while, so I couldn’t investigate. But it’s back now.

In the supervisor log I have:

[error]   Failed to get target state for device: Error: getaddrinfo EAI_AGAIN api.balena-cloud.com api.balena-cloud.com:443

On the host OS:

root@7d83bd2:~# nslookup api.balena-cloud.com
Server:    127.0.0.2
Address 1: 127.0.0.2 7d83bd2

Name:      api.balena-cloud.com
Address 1: 18.210.71.79 ec2-18-210-71-79.compute-1.amazonaws.com
Address 2: 18.208.61.234
Address 3: 3.229.114.198 ec2-3-229-114-198.compute-1.amazonaws.com

From the supervisor container:

root@7d83bd2:~# balena exec -it resin_supervisor /bin/sh
/usr/src/app # nslookup api.balena-cloud.com
Server:    10.114.102.1
Address 1: 10.114.102.1

Name:      api.balena-cloud.com
Address 1: 3.229.114.198 ec2-3-229-114-198.compute-1.amazonaws.com
Address 2: 18.208.61.234 ec2-18-208-61-234.compute-1.amazonaws.com
Address 3: 18.210.71.79 ec2-18-210-71-79.compute-1.amazonaws.com

Cheers,
Erik

Thank you for the update Erik. We’ll discuss this internally and get back to you.

Hi there, what sort of Internet connection is this device running on? We typically see these sorts of intermittent issues on poor quality cellular links and sometimes on networks where DNS requests are being filtered upstream…

We operate in the maritime industry; most of our devices connect over satellite links.
Their main flaw is very high latency, which goes up even further in poor weather conditions. I never notice much packet loss; it just takes (very) long.

I timed some of these nslookup commands; they take between 10 and 30 seconds to complete. It obviously needs to make an effort to muscle it through.
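
A lookup can also be timed from a script with something like the sketch below (Python). `time_lookup` is just an illustrative name, and it goes through the system resolver via `socket.getaddrinfo` rather than nslookup, so the numbers may differ somewhat from the ones above.

```python
import socket
import time


def time_lookup(hostname, resolve=socket.getaddrinfo):
    """Return (elapsed_seconds, error) for a single lookup of hostname.

    `resolve` is injectable so the function can be exercised without a network.
    """
    start = time.monotonic()
    try:
        resolve(hostname, 443)
        error = None
    except socket.gaierror as exc:
        # e.g. EAI_AGAIN, the temporary failure seen in the supervisor log
        error = exc
    return time.monotonic() - start, error
```

Calling `time_lookup("api.balena-cloud.com")` on the device returns how long the lookup took and, if it failed, the resolver error.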

Would it be possible to make DNS lookups in the supervisor code more robust by implementing a fallback to DNS over TCP when the default UDP lookup fails? And maybe also increase the timeout on DNS lookups, or make it something I can control through configuration?
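
To illustrate the kind of behaviour I mean, here is a rough sketch in Python. It is not based on the supervisor code. It assumes a plain A-record query sent to a single upstream server, and all function names and the 30-second default timeout are my own placeholders. It returns the raw response bytes and skips answer parsing and response ID checks.

```python
import socket
import struct


def build_query(hostname, query_id=0x1234):
    """Build a minimal DNS query for an A record (RFC 1035 wire format)."""
    # Header: id, flags (recursion desired), 1 question, no other records
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def is_truncated(response):
    """True if the TC (truncated) bit is set in the response header."""
    (flags,) = struct.unpack("!H", response[2:4])
    return bool(flags & 0x0200)


def recv_exact(sock, n):
    """Read exactly n bytes from a stream socket."""
    data = b""
    while len(data) < n:
        chunk = sock.recv(n - len(data))
        if not chunk:
            raise ConnectionError("connection closed mid-response")
        data += chunk
    return data


def query_udp(server, query, timeout):
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(query, (server, 53))
        response, _ = sock.recvfrom(4096)
        return response


def query_tcp(server, query, timeout):
    with socket.create_connection((server, 53), timeout=timeout) as sock:
        # DNS over TCP prefixes each message with a 2-byte length
        sock.sendall(struct.pack("!H", len(query)) + query)
        (length,) = struct.unpack("!H", recv_exact(sock, 2))
        return recv_exact(sock, length)


def lookup(hostname, server, timeout=30.0, udp=query_udp, tcp=query_tcp):
    """Try UDP first; fall back to TCP on a socket error or a truncated reply."""
    query = build_query(hostname)
    try:
        response = udp(server, query, timeout)
        if not is_truncated(response):
            return response
    except OSError:
        # Includes socket.timeout; fall through to the TCP attempt
        pass
    return tcp(server, query, timeout)
```

With `lookup("api.balena-cloud.com", "192.168.0.1")` the query would go to the upstream server over UDP first and be retried over TCP if that attempt times out, with the timeout being the single knob I would like to be able to configure.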

Cheers,
Erik

Hi Erik

Thanks for the additional information about your use case and for the suggestions toward a solution.
I had a quick discussion with some of our engineers and we are taking a closer look at what we can do to make things work smoothly on connections with high latency but no packet loss.
I have created an issue to track this feature request - https://github.com/balena-os/meta-balena/issues/1943
Feel free to add more details on that thread.