I have a device running balenaOS 2.47.1+rev1, supervisor 10.6.27.
It only comes online as Online (VPN only) and it will not start updating to the latest release.
I can SSH into it. In the supervisor logs I see:
[event] Event: Device state report failure {"error":{"message":""}}
(node:1) UnhandledPromiseRejectionWarning: Error: getaddrinfo EAI_AGAIN api.balena-cloud.com api.balena-cloud.com:443
at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:56:26)
(node:1) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 2)
[event] Event: Device state report failure {"error":{"message":""}}
(node:1) PromiseRejectionHandledWarning: Promise rejection was handled asynchronously (rejection id: 2)
[info] Internet Connectivity: OK
I understand that this means it can’t resolve api.balena-cloud.com. The health checks on the dashboard report the same:
Some networking issues detected:
test_upstream_dns: DNS lookup failed for 0.resinio.pool.ntp.org via upstream: 192.168.0.1
test_upstream_dns: DNS lookup failed for api.balena-cloud.com via upstream: 192.168.0.1
test_balena_registry: Could not communicate with registry2.balena-cloud.com for authentication
But on the command line I can resolve all of these just fine.
I noticed the latency on the connection is fairly high.
root@22bbd38:~# ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
64 bytes from 8.8.8.8: seq=0 ttl=112 time=588.433 ms
64 bytes from 8.8.8.8: seq=1 ttl=112 time=647.057 ms
It is around 600 ms. Could that be throwing the JavaScript code off?
Thanks for the extra info. I just want to follow up on what karaxuna asked - are you able to run health checks for this device from the diagnostics tab on the dashboard?
Yes, I managed to run the health checks from the dashboard. That’s where the output I pasted above came from. The check_networking check was the only one failing.
In the meantime the device did manage to contact the API on its own; we didn’t change anything. It installed the latest update and we’re now able to continue.
I would still be interested to know the root cause of this. It’s kind of difficult to explain to our customer that the device doesn’t come through for two hours even though they meet all the connectivity criteria we set.
You can try running the same nslookup command from the supervisor container with:
balena exec -it resin_supervisor /bin/sh
If nslookup fails in the supervisor container but not on the host OS, then something is wrong with the internal balenaEngine/Docker resolver. If it fails both in the supervisor container and on the host OS, then the problem is most probably a networking issue outside the device.
Can you please check this and let us know what you see there?
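To make the reasoning above explicit, here is a minimal Python sketch of the decision. The function name `diagnose` and the returned strings are made up for illustration, and the two cases the advice above does not mention are labeled as assumptions in the comments.

```python
def diagnose(host_ok: bool, container_ok: bool) -> str:
    """Map the two nslookup outcomes to the likely location of the fault."""
    if host_ok and not container_ok:
        # Host OS resolves the name, the supervisor container does not.
        return "internal balenaEngine/Docker resolver"
    if not host_ok and not container_ok:
        # Neither the host OS nor the supervisor container resolves the name.
        return "networking issue outside the device"
    if host_ok and container_ok:
        # Assumption: both succeeding only means no fault was visible during the test.
        return "no DNS fault visible at test time"
    # Assumption: container-only success is not covered by the advice above.
    return "not covered by this check"
```

On an intermittent link the result can change from one run to the next, so it is worth repeating both lookups a few times before drawing a conclusion.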
Hi there, what sort of Internet connection is this device running on? We typically see these sorts of intermittent issues on poor-quality cellular links and sometimes on networks where DNS requests are being filtered upstream…
We operate in the maritime industry; most of our devices connect over satellite links.
Their main flaw is very high latency, which goes up even further in poor weather conditions. I never notice much packet loss, it just takes (very) long.
I timed some of these nslookup commands, they take between 10 and 30 seconds to complete. It obviously needs to make an effort to muscle it through.
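For anyone who wants to reproduce this kind of measurement from a machine with Python available, a rough sketch follows. It times the system resolver through `socket.getaddrinfo`, not nslookup, so the numbers are comparable but not identical. The helper name `time_lookup` and the port 443 are choices made for this example only.

```python
import socket
import time


def time_lookup(hostname, resolver=socket.getaddrinfo):
    """Return (elapsed_seconds, succeeded) for a single name lookup."""
    start = time.monotonic()
    try:
        resolver(hostname, 443)
        succeeded = True
    except socket.gaierror:
        # Resolution failures such as EAI_AGAIN surface as gaierror in Python.
        succeeded = False
    return time.monotonic() - start, succeeded


# Example use:
#   elapsed, ok = time_lookup("api.balena-cloud.com")
#   print(f"{'ok' if ok else 'failed'} after {elapsed:.1f}s")
```

The `resolver` parameter exists only so the timing logic can be exercised without a network connection.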
Would it be possible to make DNS lookups in the supervisor code more robust by implementing a fallback to DNS over TCP when the default UDP lookup fails? And maybe also increase the timeout on DNS lookups, or make it something I can control through configuration?
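To illustrate what I mean, here is a rough sketch in Python (not the supervisor's own language, and not how the supervisor works today). It sends a plain A-record query over UDP with a configurable timeout and retries over TCP if UDP fails or the reply is truncated. All function names are hypothetical, the reply is returned as raw bytes without parsing, and the reply ID is not checked.

```python
import socket
import struct


def build_query(hostname, query_id=0x1234):
    """Build a minimal DNS query for an A record (RFC 1035 wire format)."""
    # Header: id, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(part)]) + part.encode("ascii")
        for part in hostname.rstrip(".").split(".")
    )
    # Question: name, terminating zero byte, QTYPE=A (1), QCLASS=IN (1).
    return header + labels + b"\x00" + struct.pack("!HH", 1, 1)


def query_udp(packet, server, timeout):
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(packet, (server, 53))
        data, _ = sock.recvfrom(4096)
        return data


def _recv_exact(sock, count):
    buf = b""
    while len(buf) < count:
        chunk = sock.recv(count - len(buf))
        if not chunk:
            raise ConnectionError("DNS server closed the TCP connection early")
        buf += chunk
    return buf


def query_tcp(packet, server, timeout):
    with socket.create_connection((server, 53), timeout=timeout) as sock:
        # DNS over TCP prefixes every message with a 2-byte length.
        sock.sendall(struct.pack("!H", len(packet)) + packet)
        (length,) = struct.unpack("!H", _recv_exact(sock, 2))
        return _recv_exact(sock, length)


def resolve_with_fallback(hostname, server, timeout=30.0,
                          udp=query_udp, tcp=query_tcp):
    """Try UDP first; fall back to TCP on failure or a truncated reply."""
    packet = build_query(hostname)
    try:
        reply = udp(packet, server, timeout)
    except OSError:
        # Timeouts and other socket errors are all subclasses of OSError.
        return tcp(packet, server, timeout)
    if len(reply) >= 4 and struct.unpack("!H", reply[2:4])[0] & 0x0200:
        # TC bit set: the UDP reply was truncated, so ask again over TCP.
        return tcp(packet, server, timeout)
    return reply
```

The 30 second default is only a guess based on the lookup times I measured. The point is that the timeout is a parameter rather than a constant.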
Thanks for the additional information about your use case and the suggestions for a solution.
I had a quick discussion with some of our engineers and we are taking a closer look at what we can do to make things work smoothly on connections with high latency but no packet loss.
I have created an issue to track this feature request - https://github.com/balena-os/meta-balena/issues/1943
Feel free to add more details on that thread.