Updated bandwidth usage numbers

Recently we got a few questions about resin.io bandwidth usage, asking whether the numbers mentioned in our Device Bandwidth/Data usage blogpost and documentation are still accurate.

We’ve decided to revisit it, and take some more notes on how we tested, so it’s easy to replicate in the future. :wrench:

Setup

The setup we had was the following:

We had two Raspberry Pi 3 devices running the latest resinOS 2.3.0+rev1, freshly provisioned. We chose two so we could run two tests in parallel: the default/not-adjusted resinOS behaviour for a baseline, and a bandwidth-minimized behaviour.

Both devices were connected over Ethernet to measurement laptops that shared their wifi connection over Ethernet. That made it really easy to separate the test devices’ traffic from everything else.

We ran Wireshark on the laptops to record all the traffic on the Ethernet interface to the test devices.

The devices had the following application deployed:

# Dockerfile.template
FROM resin/%%RESIN_MACHINE_NAME%%-alpine

CMD while : ; do if [ ! -z "${DEBUG}" ]; then echo "Hello..."; fi ; sleep ${INTERVAL=10}; done

By default this application just sleeps continuously, but you can enable log output by setting the DEBUG environment variable to any non-empty value, and adjust the frequency of log entries by setting the INTERVAL env var (in seconds, 10 by default). In these tests we’ve kept everything at the default (that is, no log output), so that we can benchmark the bandwidth usage of resinOS alone.

The “baseline” device was not modified in any way, while the “minimal traffic” device had these variables set (as mentioned in the docs):

  • RESIN_SUPERVISOR_VPN_CONTROL to false
  • RESIN_SUPERVISOR_CONNECTIVITY_CHECK to false
  • RESIN_SUPERVISOR_POLL_INTERVAL to 86400000
  • RESIN_SUPERVISOR_LOG_CONTROL to false

This minimizes the amount of data the device transfers, while still allowing us to control the device (as it checks in once every 24h=86400000ms).
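As a small sanity check on the poll interval value, here is a minimal Python sketch that writes the four variables above as a plain dictionary. The helper name `hours_to_ms` and the dictionary name are made up for illustration; how you actually set the variables (dashboard, API, etc.) is not covered here.

```python
def hours_to_ms(hours: float) -> int:
    """Convert hours to the millisecond value used for the poll interval."""
    return int(hours * 60 * 60 * 1000)


# The "minimal traffic" settings from the list above, as name -> value strings.
MINIMAL_TRAFFIC_CONFIG = {
    "RESIN_SUPERVISOR_VPN_CONTROL": "false",
    "RESIN_SUPERVISOR_CONNECTIVITY_CHECK": "false",
    "RESIN_SUPERVISOR_POLL_INTERVAL": str(hours_to_ms(24)),
    "RESIN_SUPERVISOR_LOG_CONTROL": "false",
}

for name, value in MINIMAL_TRAFFIC_CONFIG.items():
    print(f"{name}={value}")
```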

After applying these settings the devices were rebooted, and the recording started 30 minutes after restart (to keep it similar to the blogpost).

Wireshark recorded the traffic for about 134 hours (roughly 5.5 days), to get some decent averaging.

Results

Taking the Wireshark records, we can filter out some traffic that is not due to resinOS but to the network environment, since the devices are on a LAN (and that e.g. would not be present in a 3G modem setting). The filter we’ve used for the analysis removes mDNS (Avahi) traffic, DHCP updates (the “type==53” filter below), and address resolution protocol (ARP) packets:

not mdns  && not (bootp.option.type == 53) && not arp

What’s left is the actual communication (TCP/SSL/TLS) traffic, the NTP (network time protocol) traffic, DNS resolution traffic, and a couple of ICMP packets.

Baseline

For the baseline device, the bandwidth usage comes in at around 145 B/s ~= 11.95 MB/day ~= 358 MB/month (30 days); see the “Displayed” column in the results below:

The details are:

  • TLS traffic: averages around 127 B/s, of which about 89 B/s goes to our API and 38 B/s to our VPN (filtered in the results with ip.addr == a.b.c.d, where the IPs of api.resin.io and vpn.resin.io were filled in)
  • DNS/ICMP traffic: averages around 17 B/s
  • NTP: 2*90 byte messages every ~2000s (as systemd-timesyncd synchronizes time), or ~0.09 B/s

So the breakdown would be something like

  • API traffic: 61%
  • VPN traffic: 26%
  • DNS traffic: 12%
  • NTP traffic: << 1%

This result is better than the previous test’s, where about 527MB/month was extrapolated (see the linked blogpost).
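For anyone who wants to redo the arithmetic, here is a short Python sketch of the baseline conversion and breakdown. It only uses the rates quoted above. One assumption to flag: the per-day and per-month figures match the post when “MB” is taken as 1024*1024 bytes (with 1 MB = 1,000,000 bytes the same rate would be about 12.53 MB/day).

```python
SECONDS_PER_DAY = 24 * 60 * 60
BYTES_PER_MB = 1024 * 1024  # assumption: the post's "MB" is 1024-based


def mb_per_day(rate_bytes_per_s: float) -> float:
    """Convert an average rate in B/s to MB/day."""
    return rate_bytes_per_s * SECONDS_PER_DAY / BYTES_PER_MB


baseline_rate = 145.0  # B/s, total from the capture
per_day = mb_per_day(baseline_rate)
per_month = per_day * 30  # 30-day month, as in the post

# Per-protocol averages quoted above, in B/s.
components = {"API": 89.0, "VPN": 38.0, "DNS/ICMP": 17.0, "NTP": 0.09}
shares = {name: 100 * rate / baseline_rate for name, rate in components.items()}

print(f"{per_day:.2f} MB/day, {per_month:.0f} MB/month")
for name, share in shares.items():
    print(f"{name}: {share:.2f}%")
```

The component rates add up to about 144 B/s rather than exactly 145 B/s, which is just rounding in the quoted averages.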

Minimal

The bandwidth minimization resulted in an average of 0.283 B/s ~= 23.9 kB/day ~= 716.3 kB/month, which is ~0.2% of the baseline traffic (see the “Displayed” column below, with the appropriate filters applied):

  • TCP traffic: 0.172 B/s, all of it to the API. There was zero VPN traffic, as expected from the settings, and the API calls show up in the logs roughly every 24h, just as configured
  • DNS/ICMP traffic: 0.021 B/s on average, since much less traffic needed DNS resolution
  • NTP traffic: ~0.09 B/s, the same as in the baseline case, as the settings do not affect NTP in any way

So the breakdown would be something like:

  • API traffic: 61%
  • DNS traffic: 7%
  • NTP traffic: 32%

This is again better than before, about 50% of the previous minimal bandwidth usage estimate.
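The same kind of cross-check works for the minimal case. The sketch below rebuilds the total from the three quoted components (deriving the NTP rate from the 2*90 bytes per ~2000 s figure), and compares it to the 145 B/s baseline. As before, it assumes 1 kB = 1024 bytes, since that is what reproduces the quoted numbers.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# NTP: two 90-byte messages roughly every 2000 s, as measured in the baseline.
ntp_rate = 2 * 90 / 2000  # B/s

# Per-protocol averages for the "minimal traffic" device, in B/s.
minimal = {"API": 0.172, "DNS/ICMP": 0.021, "NTP": ntp_rate}
minimal_total = sum(minimal.values())

kb_per_day = minimal_total * SECONDS_PER_DAY / 1024  # assumption: 1 kB = 1024 B
kb_per_month = kb_per_day * 30

baseline_total = 145.0  # B/s, from the baseline device
ratio_percent = 100 * minimal_total / baseline_total

shares = {name: 100 * rate / minimal_total for name, rate in minimal.items()}

print(f"{minimal_total:.3f} B/s = {kb_per_day:.1f} kB/day = {kb_per_month:.1f} kB/month")
print(f"{ratio_percent:.1f}% of baseline")
for name, share in shares.items():
    print(f"{name}: {share:.0f}%")
```

NTP goes from a negligible share to about a third of the traffic simply because its absolute rate stays the same while everything else shrinks.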

Future tests

This should be a good cross-check to show that previous bandwidth estimates are still in the right ballpark, and if anything they are too conservative. It was a somewhat contrived setup (filtering out local network traffic, etc., even if justified). Most of our users who want minimal bandwidth usage are on metered 3G connections. For a more realistic setup we’ll be putting some devices on the 3G network, and using the network provider’s traffic estimate to confirm the numbers above.


Would be happy to hear any feedback or comments on these tests or your network traffic minimization setup! :slight_smile:


Thanks for the detailed report. These numbers are in the same ballpark as what we’ve seen on our devices.

Even though the average numbers are quite small, the API polls saturate low bandwidth links. I was wondering whether there is a way to soft-limit the available bandwidth for those API polls or the “resin stuff” in general?

What sort of network setup are you working with? We’d like to have a similar setup to test the resin behaviour with. I’m not sure how saturated bandwidth manifests itself for you. What sort of available bandwidth do you have?

If the bandwidth is the problem, wouldn’t setting RESIN_SUPERVISOR_POLL_INTERVAL to something larger than 60000ms=60s but smaller than 24h work? Let’s say every 10 or 30 minutes, or even longer if that makes sense…
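To get a feel for what different poll intervals mean in practice, here is a tiny Python sketch that converts a RESIN_SUPERVISOR_POLL_INTERVAL value (in milliseconds) to the number of polls per day. The function name is made up for illustration, and it says nothing about how large each poll is, only how often it happens.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000


def polls_per_day(poll_interval_ms: int) -> float:
    """Number of API polls per day for a given poll interval in ms."""
    return MS_PER_DAY / poll_interval_ms


for label, interval_ms in [
    ("60 s", 60_000),
    ("10 min", 600_000),
    ("30 min", 1_800_000),
    ("24 h", 86_400_000),
]:
    print(f"{label:>6}: {polls_per_day(interval_ms):g} polls/day")
```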

We’re using a GPRS link. The HTTPS handshake involves quite sizeable packets, so for a limited time it uses a significant slice of the overall available link bandwidth. I was just wondering if it’s possible to soft-limit the data rate of that service? Maybe it’s possible to limit the bandwidth that’s available to the entire resin-supervisor container?

Can you give us a bit more context for your use? When the API call / HTTPS handshake uses up the link bandwidth, what happens then? How does it affect your application / the device / the link itself? Trying to see how a short burst of data like that will interact with the rest of the system. Also, what’s the available bandwidth for you? What hardware do you use to connect?

I have followed your guide to save on data usage, and after a device restart the log output on the resin.io dashboard looks as follows:

03.10.17 23:05:55 (+0200) Applying config variable RESIN_SUPERVISOR_VPN_CONTROL = false
03.10.17 23:05:56 (+0200) Applied config variable RESIN_SUPERVISOR_VPN_CONTROL = false
03.10.17 23:05:56 (+0200) Applying config variable RESIN_SUPERVISOR_CONNECTIVITY_CHECK = false
03.10.17 23:05:56 (+0200) Applied config variable RESIN_SUPERVISOR_CONNECTIVITY_CHECK = false
03.10.17 23:05:56 (+0200) Applying config variable RESIN_SUPERVISOR_POLL_INTERVAL = 600000
03.10.17 23:05:56 (+0200) Applied config variable RESIN_SUPERVISOR_POLL_INTERVAL = 600000
03.10.17 23:05:56 (+0200) Applying config variable RESIN_SUPERVISOR_LOG_CONTROL = false

When I restart the device this same block of text is logged to the online console. You can see that the device is supposed to check in every 600000ms = 10 minutes.

My concerns:
My device is listed as “offline” on the dashboard. The “last heard” time also does not update, not even after a restart of the device. Shouldn’t the “last seen” indicate the last time the device checked in to the API (at 10 minute intervals in my case) to indicate that the device is still alive? Is there another way to know if the device is still alive and still checking in periodically?

The restart app and restart device buttons are disabled. Shouldn’t these be enabled and the action queued to be executed when the device checks in on the 10 minutes cycle?

Hey @jpmeijers some general comments

  • the device is “offline”, which means “not connected to the VPN”, because RESIN_SUPERVISOR_VPN_CONTROL is set to false. Since reboot and similar actions are issued through the VPN, you won’t be able to use them with this setting
  • there are no logs because you set RESIN_SUPERVISOR_LOG_CONTROL to false, so the logs won’t leave the device, won’t show up in the dashboard
  • the 10min update interval is set correctly

These changes are all done to absolutely minimize the traffic, and that comes with obvious trade-offs.

  • If it’s set not to display logs, there won’t be any logs
  • If it’s set not to connect to the VPN, then it won’t show up online, and you can’t do remote maintenance
  • If the poll interval is very long, then any changes made to the device take longer to be reflected (i.e. if you’d turn logs or VPN back on, that would take up to one poll interval to be applied)

So yes, if you limit the device this much to minimize the traffic, there will be things that are harder to manage.

Regarding the device actions, we’ll need to discuss that internally. There are some changes coming there, but I’m not exactly sure how much will change in this respect.

Regarding the “checking in”, yes, we’re working on ways to display such data of “last contact by device X mins ago”, or similar, that’s part of our re-thinking of the “online/offline” status. It’s still a work in progress, thanks for the feedback! And will be checking out more of the things you mentioned.

By the way, what sort of environment are you working in that you need data transfer reduction like this? Just so that we can understand your case better.

Thank you for this thorough investigation!

I had the same questions as @jpmeijers, so I’m lucky he/she already asked :slight_smile:

Does this mean that if VPN is off (RESIN_SUPERVISOR_VPN_CONTROL = false), then the value of poll interval (RESIN_SUPERVISOR_POLL_INTERVAL) doesn’t matter because it never polls for updates? Does it mean if we set VPN control off we cannot update the device remotely but would have to power cycle it to apply any changes? (new environment variables, updates, etc.).

Or to make it more concrete, which are the necessary settings to keep remote control of the device?

Another question… In our case, we call the Resin API from our own management dashboard and show if the device is online. When testing we realized that for getting the online/offline status we also need VPN control on. Is that correct?

In our tests we are getting considerably bigger numbers. We’ll try one test with the “ultra saving” mode that you suggest and compare.
We are using a Beaglebone Black with Resin OS 2.0.6+rev2 and testing on 3G. We can see the bandwidth consumed on the 3G connector we use. Is there a difference in bandwidth consumption between Ethernet and 3G? (more or less?)

Thanks again, sorry for so many questions!

Flavia

I think our explanation is still not clear enough, will try to further clarify

VPN: the VPN is used to send data to and from the resin.io services to the device, so RESIN_SUPERVISOR_VPN_CONTROL=false will result in blink/restart application/reboot/resinhup/remote support not being available, and the device will not show “online”, as “online” for the dashboard means connected to the VPN

Polling: the supervisor requests info from the resin backend on its own initiative, such as environment variables and configuration, and whether there are application updates. Thus the poll interval gives the maximum time it takes for those changes to be received by the device. This setup is necessary because if the env vars were sent over the VPN and we turned off the VPN, there wouldn’t be any way to turn things back on. Instead, these settings go through the API, hence the effect of the polling interval

Logging: the logs are sent to PubNub, and the dashboard pulls the information from there. If logging is disabled, this part will be turned off and the dashboard will show no logs.
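The three points above can be summarized in a small Python sketch. This is only a restatement of the explanation in this post, not supervisor code; the function and key names are invented for illustration.

```python
def dashboard_features(vpn_control: bool, log_control: bool) -> dict:
    """Which dashboard features remain available for a given pair of settings.

    vpn_control  mirrors RESIN_SUPERVISOR_VPN_CONTROL
    log_control  mirrors RESIN_SUPERVISOR_LOG_CONTROL
    """
    return {
        # "online" on the dashboard means "connected to the VPN"
        "shows_online": vpn_control,
        # blink / restart application / reboot / resinhup / remote support
        "remote_actions": vpn_control,
        # logs go to PubNub and the dashboard reads them from there
        "logs_in_dashboard": log_control,
        # env vars and configuration are fetched by the supervisor via the API
        "settings_via_polling": True,
    }


print(dashboard_features(vpn_control=False, log_control=False))
```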

So to answer your direct questions:

  • won’t be able to power cycle
  • changes will be applied, taking at most as long as the poll interval

Can you define “keeping remote control”? As there are degrees to it, and not sure what level you’d like to have.

That is correct, because “online/offline” status is in fact “connected to VPN / not connected to VPN”. We are working on being clearer about this, and have some more granularity as well.

What 3G modem and service are you using? How do you see how much data is sent on the 3G connector? Do you also connect to the device some other way (e.g. Ethernet)? It would be useful to have a description of your test setup. Also, do you see the bigger numbers with the default settings, or with some other arrangement?

Well the API calls are several kilobytes in size and consist of > 20 packets. We don’t want our application and the API calls to share the available bandwidth fairly, but ideally prioritize our application or at least apply a hard limit for the API calls.

We connect via a GPRS modem so our available bandwidth can be as small as a few kbit/s.
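To put rough numbers on how long such a burst occupies the link, here is a back-of-the-envelope Python sketch. The payload size and link speed in the example call are illustrative placeholders, not measurements, and the calculation ignores latency, protocol overhead and retransmissions, so real transfers will take longer.

```python
def transfer_seconds(payload_bytes: int, link_kbit_per_s: float) -> float:
    """Idealized time to push a payload through a link of the given speed."""
    return payload_bytes * 8 / (link_kbit_per_s * 1000)


# Hypothetical example: a 5000-byte exchange over a 4 kbit/s link.
print(f"{transfer_seconds(5000, 4):.1f} s")
```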

Ok, that makes it more clear, thank you.

What I mean by “keep control of the device” is that we can, for example, update it, see if it’s online, ssh to it. But, if I understand right, the minimum we need is to be able to change the VPN property back to ON, because if we can do that, we can do all the rest I mentioned, correct? So if we set a polling interval which is good enough for us (say, half an hour), we should be fine, because within half an hour we could have control of the device again (and do everything I mentioned in the first sentence :slight_smile:). Does this make sense?

About having more granularity when showing the status of the device, we would be very interested in this. For example if we could see if the device is online only because it polls regularly. Or what other ideas you have in mind?

About our test setup, we are using this 3G modem: http://consumer.huawei.com/en/mobile-broadband/e5770/. We can see the data consumption on the device screen itself.

The numbers we get are:

saving mode: 4.6 MB per month (vs the 716 kB you got)
baseline (all default options): 633 MB per month (vs the 358 MB you got)

(In the second case) when I log in to the Terminal and run ps, I don’t see much running:

root@29dc210-29dc210:~/radarvirt# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  1.9  0.5   4284  2920 ?        Ss   12:32   0:03 /sbin/init quiet systemd.show_status=0
root        42  0.4  0.4   7380  2180 ?        Ss   12:32   0:00 /lib/systemd/systemd-journald
root        50  0.3  0.4  10864  2228 ?        Ss   12:32   0:00 /lib/systemd/systemd-udevd
root       126  0.0  0.7   6540  3696 ?        Ss   12:32   0:00 /usr/sbin/sshd -D
message+   131  0.0  0.4   4584  2416 ?        Ss   12:32   0:00 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
root       253 16.3  0.5   4588  2540 ?        Ss   12:35   0:00 /bin/bash
root       257  0.0  0.3   4320  1636 ?        R+   12:35   0:00 ps aux

Do you see anything out of the ordinary there?

Our image is not a basic one, but modified to run our software, but we are not running any extra software when running this test.

Thanks!

I have a couple of more questions, I hope you don’t mind :blush:

According to the original page about bandwidth usage, having the VPN on is “cheaper” than having it off, see:

VPN enabled	                        43 Bytes / second
TCP check cost (When VPN disabled)	47.36 Bytes / second

Is this correct?

And the question, to confirm if I understand this right. If we leave VPN ON, is it necessary to do any polling at all? Does it add anything?

Thank you!

Hi, the VPN and the polls are doing different things, as mentioned before. The VPN is there to “push” commands to the device, and for the web terminal functionality, which needs direct communication. The polls are for the supervisor to ask “what do I need to change in this device’s state? Is there anything new to download or variables to set?”.

These are pretty orthogonal things, and intentionally so. Turning off the VPN will disable the resin team’s access to the device, while it does not affect the base functionality, so you can be assured of what we can and cannot do remotely.

If you see the table, the polls are another line (and controlled by RESIN_SUPERVISOR_POLL_INTERVAL) compared to the VPN (RESIN_SUPERVISOR_VPN_CONTROL) and TCP check (RESIN_SUPERVISOR_CONNECTIVITY_CHECK).

Regarding your earlier question, it’s hard to say anything without knowing your exact setup (including the code you are running, etc). We have tested it with a Soracom 3G connection, and got very similar numbers to the LAN test mentioned above. Not sure what your modifications mean in this case, or whether your software sends any logs, or whether you log in to the device remotely while doing the test (both would inflate the numbers)…

Ok, then there is something I don’t understand.

I made the following test:

  1. Change RESIN_SUPERVISOR_POLL_INTERVAL to 86400000 (so it will poll for changes once per day), I leave RESIN_SUPERVISOR_VPN_CONTROL with the default value (true).
  2. Change an environment variable.
  3. (After it’s done with handling 2) change again an environment variable.

What I understand from your explanation is that (2) and (3) will not both be taken into account within the day, because each has to wait until the next polling moment. So at the minimum I would have to wait for one day.

However, both changes occur immediately: I make a change (2), the application is killed and started again, done. I make the second change (3), the application is killed and started again.

What am I not understanding?

Thanks
Flavia

Hey, okay, let me take a step back and refer to our documentation, which actually has most of the info, and elaborate on a few things that I misunderstood as well

The main thing is that env var / configuration changes are applied instantly when VPN is on, but when the VPN is off, they will be still applied through polling. So I guess you were in the right ballpark earlier, sorry for the confusion:

Going back to your earlier question:

Talking to the team, it seems like the main purpose of the poll is notifying the device that there’s an application update it can download. So setting a long poll interval will affect how quickly the device realizes that there’s something it should pull.
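Putting the corrected explanation into a few lines of Python: the sketch below models only the worst-case delay for env var / configuration changes as described in this thread (instant with the VPN on, up to one poll interval with it off). It is a simplified model with an invented function name, not a description of the supervisor internals, and it does not cover application updates.

```python
def max_config_delay_ms(vpn_enabled: bool, poll_interval_ms: int) -> int:
    """Worst-case wait before an env var / config change reaches the device."""
    if vpn_enabled:
        # With the VPN on, changes are applied right away.
        return 0
    # With the VPN off, the device only learns about changes when it polls.
    return poll_interval_ms


# The test from this thread: VPN on, poll interval of one day.
print(max_config_delay_ms(vpn_enabled=True, poll_interval_ms=86_400_000))
# The "minimal traffic" setup: VPN off, poll interval of one day.
print(max_config_delay_ms(vpn_enabled=False, poll_interval_ms=86_400_000))
```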