No API communication

ErikHH · July 2, 2020, 10:37am

Hi,

I have a device running balenaOS 2.47.1+rev1, supervisor 10.6.27.
It only comes online as Online (VPN only) and it will not start updating to the latest release.

I can SSH into it. In the supervisor logs I see:

[event]   Event: Device state report failure {"error":{"message":""}}
(node:1) UnhandledPromiseRejectionWarning: Error: getaddrinfo EAI_AGAIN api.balena-cloud.com api.balena-cloud.com:443
    at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:56:26)
(node:1) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 2)
[event]   Event: Device state report failure {"error":{"message":""}}
(node:1) PromiseRejectionHandledWarning: Promise rejection was handled asynchronously (rejection id: 2)
[info]    Internet Connectivity: OK

I understand that this means it can’t resolve api.balena-cloud.com.

But when I run:

root@22bbd38:~# nslookup api.balena-cloud.com
Server:    127.0.0.2
Address 1: 127.0.0.2 22bbd38

Name:      api.balena-cloud.com
Address 1: 18.208.61.234 ec2-18-208-61-234.compute-1.amazonaws.com
Address 2: 18.210.71.79 ec2-18-210-71-79.compute-1.amazonaws.com
Address 3: 3.229.114.198 ec2-3-229-114-198.compute-1.amazonaws.com

And I can also ping the API from the command line.

root@22bbd38:~# curl -v  https://api.balena-cloud.com/ping
*   Trying 3.229.114.198...
* TCP_NODELAY set
* Connected to api.balena-cloud.com (3.229.114.198) port 443 (#0)
* found 128 certificates in /etc/ssl/certs/ca-certificates.crt
* ALPN, offering http/1.1
* SSL connection using TLS1.2 / ECDHE_RSA_AES_128_GCM_SHA256
* 	 server certificate verification OK
* 	 server certificate status verification SKIPPED
* 	 common name: balena.io (matched)
* 	 server certificate expiration date OK
* 	 server certificate activation date OK
* 	 certificate public key: RSA
* 	 certificate version: #3
* 	 subject: CN=balena.io
* 	 start date: Fri, 27 Sep 2019 00:00:00 GMT
* 	 expire date: Tue, 27 Oct 2020 12:00:00 GMT
* 	 issuer: C=US,O=Amazon,OU=Server CA 1B,CN=Amazon
* ALPN, server accepted to use http/1.1
> GET /ping HTTP/1.1
> Host: api.balena-cloud.com
> User-Agent: curl/7.64.1
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Thu, 02 Jul 2020 10:43:12 GMT
< Content-Type: text/plain; charset=utf-8
< Content-Length: 2
< Connection: keep-alive
< X-Frame-Options: DENY
< X-Content-Type-Options: nosniff
< ETag: W/"2-nOO9QiTIwXgNtWtBJezz8kv3SLc"
<

Any ideas what could be the matter? Why won’t the supervisor connect to the API?

Cheers,
Erik

karaxuna · July 2, 2020, 10:57am

Hi, can you run device health checks from the diagnostics tab from the dashboard?

ErikHH · July 2, 2020, 11:03am

Some networking issues detected: 
test_upstream_dns: DNS lookup failed for 0.resinio.pool.ntp.org via upstream: 192.168.0.1
test_upstream_dns: DNS lookup failed for api.balena-cloud.com via upstream: 192.168.0.1
test_balena_registry: Could not communicate with registry2.balena-cloud.com for authentication

But on the command line I can resolve all of these just fine.

ErikHH · July 2, 2020, 11:06am

I noticed the latency on the connection is fairly high.

root@22bbd38:~# ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
64 bytes from 8.8.8.8: seq=0 ttl=112 time=588.433 ms
64 bytes from 8.8.8.8: seq=1 ttl=112 time=647.057 ms

It's around 600 ms. Could that be throwing the JavaScript code off?

AlidaOdendaal · July 2, 2020, 12:01pm

Hi @ErikHH

Thanks for the extra info. I just want to follow up on what karaxuna asked - are you able to run health checks for this device from the diagnostics tab on the dashboard?

Kind regards
Alida

ErikHH · July 2, 2020, 12:24pm

Hi @AlidaOdendaal,

Yes, I managed to run the health checks from the dashboard. That’s where the output I pasted above came from. The check_networking check was the only one failing.

In the meantime the device did manage to contact the API on its own; we didn’t change anything. It installed the latest update and we’re now able to continue.

I would still be interested to know the root cause of this. It’s difficult to explain to our customer that the device doesn’t come through for two hours even though they meet all the connectivity criteria we set.

Cheers,
Erik

majorz · July 2, 2020, 1:45pm

Hi Erik,

You can try running the same nslookup command from the supervisor container with:

balena exec -it resin_supervisor /bin/sh

If nslookup fails in the supervisor container but not on the host OS, then something is wrong with the internal balenaEngine/Docker resolver. If it fails in both the supervisor container and the host OS, then the problem is most probably a networking issue outside the device.
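
The decision logic above can be written down as a short Python sketch. This is only an illustration of the reasoning, not a balena tool: the names `succeeds` and `locate_dns_fault` are hypothetical, the non-interactive `balena exec` form (without `-it`) is an assumption, and so is the idea that nslookup exits non-zero on failure. The host OS may not ship Python, so read it as a description of the checks rather than something to paste onto the device.

```python
import subprocess
from typing import Callable, Sequence

# Hostname taken from the supervisor error earlier in the thread.
API_HOST = "api.balena-cloud.com"
HOST_CMD = ["nslookup", API_HOST]
# Assumption: non-interactive variant of the `balena exec` call (no -it).
CONTAINER_CMD = ["balena", "exec", "resin_supervisor", "nslookup", API_HOST]


def succeeds(cmd: Sequence[str], timeout: float = 60.0) -> bool:
    """Return True if the command exits with status 0 within the timeout."""
    try:
        result = subprocess.run(list(cmd), capture_output=True, timeout=timeout)
    except (subprocess.TimeoutExpired, OSError):
        # A hung lookup or a missing binary both count as a failed check.
        return False
    return result.returncode == 0


def locate_dns_fault(run: Callable[[Sequence[str]], bool] = succeeds) -> str:
    """Compare name resolution on the host OS and in the supervisor container."""
    host_ok = run(HOST_CMD)
    container_ok = run(CONTAINER_CMD)
    if host_ok and container_ok:
        return "both resolve: failure is likely intermittent"
    if host_ok:
        return "container only: suspect the balenaEngine/Docker resolver"
    if not container_ok:
        return "both fail: suspect the network outside the device"
    return "host only: unexpected, re-run the checks"
```

Calling `locate_dns_fault()` runs both lookups once; on a flaky link it would make sense to repeat it several times, since a single successful run does not rule out intermittent failures.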

Can you please check this and let us know what you see there?

Thanks,
Zahari

ErikHH · July 6, 2020, 10:59am

Hi @majorz,

Sorry for the delay, the issue went away for a while, so I couldn’t investigate. But it’s back now.

In the supervisor log I have:

[error]   Failed to get target state for device: Error: getaddrinfo EAI_AGAIN api.balena-cloud.com api.balena-cloud.com:443

On the host OS:

root@7d83bd2:~# nslookup api.balena-cloud.com
Server:    127.0.0.2
Address 1: 127.0.0.2 7d83bd2

Name:      api.balena-cloud.com
Address 1: 18.210.71.79 ec2-18-210-71-79.compute-1.amazonaws.com
Address 2: 18.208.61.234
Address 3: 3.229.114.198 ec2-3-229-114-198.compute-1.amazonaws.com

From the supervisor container:

root@7d83bd2:~# balena exec -it resin_supervisor /bin/sh
/usr/src/app # nslookup api.balena-cloud.com
Server:    10.114.102.1
Address 1: 10.114.102.1

Name:      api.balena-cloud.com
Address 1: 3.229.114.198 ec2-3-229-114-198.compute-1.amazonaws.com
Address 2: 18.208.61.234 ec2-18-208-61-234.compute-1.amazonaws.com
Address 3: 18.210.71.79 ec2-18-210-71-79.compute-1.amazonaws.com

Cheers,
Erik

srlowe · July 6, 2020, 12:03pm

Thank you for the update Erik. We’ll discuss this internally and get back to you.

ab77 · July 7, 2020, 9:34pm

Hi there, what sort of Internet connection is this device running on? We typically see these sorts of intermittent issues on poor quality cellular links and sometimes on networks where DNS requests are being filtered upstream…

ErikHH · July 8, 2020, 7:17am

We operate in the maritime industry, and most of our devices connect over satellite links.
Their main flaw is very high latency, which goes up even further in poor weather conditions. I never notice much packet loss; it just takes (very) long.

I timed some of these nslookup commands, they take between 10 and 30 seconds to complete. It obviously needs to make an effort to muscle it through.
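
For reference, the same kind of timing can be done from a script. The following is a minimal Python sketch, not what was actually run on the device: `time_lookup` is a hypothetical helper, and it goes through the system resolver via `socket.getaddrinfo` rather than through nslookup.

```python
import socket
import time
from typing import Callable, Tuple


def time_lookup(
    host: str,
    port: int = 443,
    resolver: Callable = socket.getaddrinfo,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[str, float]:
    """Resolve host once and return (outcome, elapsed seconds)."""
    start = clock()
    try:
        resolver(host, port)
        outcome = "ok"
    except socket.gaierror as err:
        # EAI_AGAIN is the "temporary failure in name resolution" code,
        # the same one that appears in the supervisor log.
        if err.errno == socket.EAI_AGAIN:
            outcome = "EAI_AGAIN"
        else:
            outcome = f"error {err.errno}"
    return outcome, clock() - start
```

Something like `print(time_lookup("api.balena-cloud.com"))` in a loop would show how the lookup time and failure rate vary over the day.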

Would it be possible to make DNS lookups in the supervisor code more robust by implementing a DNS lookup fallback over TCP when the UDP default fails? And maybe also increasing the timeout on DNS lookups, or maybe make that something I can control through configuration.
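
To make the request concrete, here is a rough Python sketch of the "be more patient" half of the idea: retry a lookup with growing delays when it fails with EAI_AGAIN. It is not the supervisor's code (the supervisor runs on Node.js), `resolve_with_retries` and its parameters are hypothetical, and it does not implement the TCP fallback, which `socket.getaddrinfo` gives no control over.

```python
import socket
import time
from typing import Callable, List, Optional


def resolve_with_retries(
    host: str,
    port: int = 443,
    attempts: int = 5,
    base_delay: float = 2.0,
    resolver: Callable = socket.getaddrinfo,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Resolve host, retrying temporary failures with exponential backoff."""
    last_error: Optional[socket.gaierror] = None
    for attempt in range(attempts):
        try:
            results = resolver(host, port, proto=socket.IPPROTO_TCP)
            # Each entry is (family, type, proto, canonname, sockaddr);
            # sockaddr[0] is the IP address.
            return sorted({entry[4][0] for entry in results})
        except socket.gaierror as err:
            if err.errno != socket.EAI_AGAIN:
                raise  # permanent errors are not worth retrying
            last_error = err
            if attempt < attempts - 1:
                # Wait 2 s, 4 s, 8 s, ... before the next attempt.
                sleep(base_delay * 2 ** attempt)
    if last_error is None:
        raise ValueError("attempts must be at least 1")
    raise last_error
```

Making `attempts` and `base_delay` configurable corresponds to the "something I can control through configuration" part of the request.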

Cheers,
Erik

anujdeshpande · July 8, 2020, 9:36am

Hi Erik

Thanks for the additional information about your use case and the suggestions for a possible solution.
I had a quick discussion with some of our engineers and we are taking a closer look at what we can do to make things work smoothly on connections with high latency but no packet loss.
I have created an issue to track this feature request - https://github.com/balena-os/meta-balena/issues/1943
Feel free to add more details on that thread.
