Oddity while testing balenaCloud with QEMU X86

Hi all,

I am just playing around with balenaCloud and started with the “Intel NUC” image and VirtualBox. Thanks to https://www.balena.io/blog/no-hardware-use-virtualbox/ this was pleasantly easy and without any difficulties, even when the blog entry is little bit aged.
Afterwards I wanted to try a QEMU image direct on a VM. The image booted, but did not appear in my balenaCloud.

Spoiler: I got it working, but maybe this is helpful for others - or someone can tell me, why I had these issues. :slight_smile:

After some investigations I found hints in the OpenVPN logs, that it cannot connect at all

root@2d1c5fc:~# journalctl -u openvpn -f
-- Logs begin at Sat 2020-05-16 11:38:56 UTC. --
May 16 11:44:43 2d1c5fc openvpn[1303]: Sat May 16 11:44:43 2020 NOTE: the current --script-security setting may allow this configuration to call user-defined scripts
May 16 11:44:43 2d1c5fc openvpn[1303]: Sat May 16 11:44:43 2020 TCP/UDP: Preserving recently used remote address: [AF_INET]35.169.89.252:443
May 16 11:44:43 2d1c5fc openvpn[1303]: Sat May 16 11:44:43 2020 Socket Buffers: R=[87380->87380] S=[16384->16384]
May 16 11:44:43 2d1c5fc openvpn[1303]: Sat May 16 11:44:43 2020 Attempting to establish TCP connection with [AF_INET]35.169.89.252:443 [nonblock]
May 16 11:44:44 2d1c5fc openvpn[1303]: Sat May 16 11:44:44 2020 TCP connection established with [AF_INET]35.169.89.252:443
May 16 11:44:44 2d1c5fc openvpn[1303]: Sat May 16 11:44:44 2020 TCP_CLIENT link local: (not bound)
May 16 11:44:44 2d1c5fc openvpn[1303]: Sat May 16 11:44:44 2020 TCP_CLIENT link remote: [AF_INET]35.169.89.252:443
May 16 11:44:44 2d1c5fc openvpn[1303]: Sat May 16 11:44:44 2020 NOTE: UID/GID downgrade will be delayed because of --client, --pull, or --up-delay
May 16 11:44:44 2d1c5fc openvpn[1303]: Sat May 16 11:44:44 2020 TLS: Initial packet from [AF_INET]35.169.89.252:443, sid=9040810e 79edb9b0
May 16 11:44:44 2d1c5fc openvpn[1303]: Sat May 16 11:44:44 2020 WARNING: this configuration may cache passwords in memory -- use the auth-nocache option to prevent this
May 16 11:44:45 2d1c5fc openvpn[1303]: Sat May 16 11:44:45 2020 VERIFY OK: depth=1, C=AU, ST=Some-State, O=Internet Widgits Pty Ltd
May 16 11:44:45 2d1c5fc openvpn[1303]: Sat May 16 11:44:45 2020 VERIFY KU OK
May 16 11:44:45 2d1c5fc openvpn[1303]: Sat May 16 11:44:45 2020 Validating certificate extended key usage
May 16 11:44:45 2d1c5fc openvpn[1303]: Sat May 16 11:44:45 2020 ++ Certificate has EKU (str) TLS Web Server Authentication, expects TLS Web Server Authentication
May 16 11:44:45 2d1c5fc openvpn[1303]: Sat May 16 11:44:45 2020 VERIFY EKU OK
May 16 11:44:45 2d1c5fc openvpn[1303]: Sat May 16 11:44:45 2020 VERIFY OK: depth=0, CN=vpn.balena-cloud.com
May 16 11:44:45 2d1c5fc openvpn[1303]: Sat May 16 11:44:45 2020 Control Channel: TLSv1.2, cipher TLSv1/SSLv3 ECDHE-RSA-AES256-GCM-SHA384, 2048 bit RSA
May 16 11:44:45 2d1c5fc openvpn[1303]: Sat May 16 11:44:45 2020 [vpn.balena-cloud.com] Peer Connection Initiated with [AF_INET]35.169.89.252:443
May 16 11:44:46 2d1c5fc openvpn[1303]: Sat May 16 11:44:46 2020 SENT CONTROL [vpn.balena-cloud.com]: 'PUSH_REQUEST' (status=1)
May 16 11:44:46 2d1c5fc openvpn[1303]: Sat May 16 11:44:46 2020 AUTH: Received control message: AUTH_FAILED
May 16 11:44:46 2d1c5fc openvpn[1303]: Sat May 16 11:44:46 2020 SIGTERM[soft,auth-failure] received, process exiting

and I saw an error in the “resin-supervisor”

root@2d1c5fc:~# journalctl -u resin-supervisor -f
-- Logs begin at Sat 2020-05-16 11:38:56 UTC. --
May 16 11:48:49 2d1c5fc resin-supervisor[1594]: resin_supervisor
May 16 11:48:50 2d1c5fc systemd[1]: Stopped Resin supervisor.
May 16 11:48:50 2d1c5fc systemd[1]: Starting Resin supervisor...
May 16 11:48:50 2d1c5fc resin-supervisor[1645]: resin_supervisor
May 16 11:48:51 2d1c5fc resin-supervisor[1652]: active
May 16 11:48:51 2d1c5fc systemd[1]: Started Resin supervisor.
May 16 11:48:58 2d1c5fc resin-supervisor[1653]: Container config has not changed
May 16 11:49:08 2d1c5fc resin-supervisor[1653]: Starting system message bus: dbus.
May 16 11:49:08 2d1c5fc resin-supervisor[1653]:  * Starting Avahi mDNS/DNS-SD Daemon: avahi-daemon
May 16 11:49:09 2d1c5fc resin-supervisor[1653]:    ...done.
May 16 11:49:37 2d1c5fc resin-supervisor[1653]: (node:1) [DEP0005] DeprecationWarning: Buffer() is deprecated due to security and usability issues. Please use the Buffer.alloc(), Buffer.allocUnsafe(), or Buffer.from() methods instead.
May 16 11:49:39 2d1c5fc resin-supervisor[1653]: Starting event tracker
May 16 11:49:39 2d1c5fc resin-supervisor[1653]: Starting up api binder
May 16 11:49:39 2d1c5fc resin-supervisor[1653]: Starting logging infrastructure
May 16 11:49:39 2d1c5fc resin-supervisor[1653]: Unhandled rejection TypeError [ERR_INVALID_OPT_VALUE]: The value "4294967295" is invalid for option "family"
May 16 11:49:39 2d1c5fc resin-supervisor[1653]:     at lookup (dns.js:112:11)
May 16 11:49:39 2d1c5fc resin-supervisor[1653]:     at r.lookup (/usr/src/app/dist/app.js:293:101715)
May 16 11:49:39 2d1c5fc resin-supervisor[1653]:     at net.js:994:5
May 16 11:49:39 2d1c5fc resin-supervisor[1653]:     at defaultTriggerAsyncIdScope (internal/async_hooks.js:294:19)
May 16 11:49:39 2d1c5fc resin-supervisor[1653]:     at lookupAndConnect (net.js:993:3)
May 16 11:49:39 2d1c5fc resin-supervisor[1653]:     at TLSSocket.Socket.connect (net.js:928:5)
May 16 11:49:39 2d1c5fc resin-supervisor[1653]:     at Object.connect (_tls_wrap.js:1150:12)
May 16 11:49:39 2d1c5fc resin-supervisor[1653]:     at Agent.createConnection (https.js:120:22)
May 16 11:49:39 2d1c5fc resin-supervisor[1653]:     at Agent.createSocket (_http_agent.js:227:26)
May 16 11:49:39 2d1c5fc resin-supervisor[1653]:     at Agent.addRequest (_http_agent.js:185:10)
May 16 11:49:39 2d1c5fc resin-supervisor[1653]:     at new ClientRequest (_http_client.js:249:16)
May 16 11:49:39 2d1c5fc resin-supervisor[1653]:     at Object.request (https.js:281:10)
May 16 11:49:39 2d1c5fc resin-supervisor[1653]:     at Object.i.request (/usr/src/app/dist/app.js:293:217995)
May 16 11:49:39 2d1c5fc resin-supervisor[1653]:     at t.<anonymous> (/usr/src/app/dist/app.js:604:7771)
May 16 11:49:39 2d1c5fc resin-supervisor[1653]:     at m (/usr/src/app/dist/app.js:9:41356)
May 16 11:49:39 2d1c5fc resin-supervisor[1653]:     at /usr/src/app/dist/app.js:9:41689
May 16 11:49:39 2d1c5fc resin-supervisor[1653]:     at t.b [as setup] (/usr/src/app/dist/app.js:9:41696)
May 16 11:49:39 2d1c5fc resin-supervisor[1653]:     at t.write (/usr/src/app/dist/app.js:604:9463)
May 16 11:49:39 2d1c5fc resin-supervisor[1653]:     at t.log (/usr/src/app/dist/app.js:604:9172)
May 16 11:49:39 2d1c5fc resin-supervisor[1653]:     at e.log (/usr/src/app/dist/app.js:604:4373)
May 16 11:49:39 2d1c5fc resin-supervisor[1653]:     at e.logSystemMessage (/usr/src/app/dist/app.js:604:4522)
May 16 11:49:39 2d1c5fc resin-supervisor[1653]:     at /usr/src/app/dist/app.js:293:103766

Some googling about the “AUTH_FAILED” led me towards “network issues” and the error from the supervisor in “dns.js” reinforced this assumption.

After some wild changes on my VM without any success, I tried to use the img-file for QEMU like the “Intel NUC” image before. And tada: in VirtualBox this image started and showed up in the balenaCloud instandly.

After comparing the OpenVPN logs and configurations I was quite sure, the system was not completly provisioned yet. So I investigated a little bit more for the supervisor…

I found the source of the exception in the code, inserted some more logging to the javascript in the supervisor-container and found, there is no difference in the balenaOS supervisor itself wether it is running on QEMU or VirtualBox (thankfully).
BUT: there is a difference in one bitshift operation, which breaks everything.

Line of interest:

family = options.family >>> 0;

This family parameter is undefined and gets to “0” while running in VirtualBox, but to “4294967295” while running in QEMU. And this leads to an exception, so the supervisor cannot finish the device configuration and OpenVPN cannot connect.

As a workaround I set the family parameter now fixed to zero before the function call and the supervisor can successful setup my QEMU image.

At the end this was done by a one liner (tested with balenaOS version qemux86-64-2.38.0+rev1-dev-v9.15.7):

balena exec -it resin_supervisor sed -i 's/r.lookup=function(e,t,n){/r.lookup=function(e,t,n){t.family=0;/' dist/app.js

So the final steps for a working QEMU image were

  • start like the getting started
  • run the QEMU command
  • wait for the login prompt
  • login with root
  • wait some time until the supervisor container is running
  • run the balena exec command from above
  • reboot

If anybody knows something about this behaviour regarding QEMU, I would appreciate details. Although I double checked the QEMU command on my VM and a native Linux PC, there can of course also be an issue in my systems.

Interesting. Thanks for sharing this.
I wonder what is setting family to 0xffffffff when using qemu.
Family is supposed to be either 0, 4 or 6.
https://nodejs.org/api/dns.html#dns_dns_lookup_hostname_options_callback

To clarify this: for me it did not look like a real “set”, but a internal conversation issue.
I inserted a console.log right before the lookup-call and got two results:

  • console.log(options.family) --> undefined
    • a JSON.stringify(options) showed only hints as a member (value 32 I believe)
  • console.log(options.family >>> 0) --> 4294967295

I wondered if the QEMU-CPU has a “special” instruction set which is acting different than a real CPU - but here I know too less about QEMU :wink: