Balena OS Upgrade and Cell Data Usage

Hello,

Around June 15th I pushed an OS update to most of our Balena Fin devices (over a cellular connection) which generally updated the OS from version 2.58.3+rev1 to version 2.77.0+rev1. This caused a very large (~80 to 100MB) download event for our devices, which is fine, but unexpectedly large. Does that seem like a reasonable size?

At the same time, I changed the configuration setting in our devices to “Disable logs from being sent to Balena” to try and reduce the upload traffic.

Notice in the following graphs, our overall upload/download data usage has actually doubled since the OS update and since this setting was applied.

Is this increase expected? Any idea what might be causing it? Our application did not change at all, so I’m wondering if there is something in the OS that is causing the increase in upload & download data. I have seen a similar data increase in most of our devices after the OS update. Prior to the update, you can see this cell modem typically downloaded 0.5MB data every 2 hours, and after, it downloaded 1.5 to 2MB every 2 hours. The upload rate also doubled. (Our application software doesn’t download any data, and it periodically uploads a few kB of data. VPN control is enabled.)

I should also add that I have not updated the supervisor, which is still at 12.5.10. Not sure if that would make a difference or not.

Hello @dstewart thanks for your detailed message.

We are working internally on visualizing the network data exchanged by each balenaOS release to understand why this happened. In the meantime, could you please confirm which balena services you use? Do you use the VPN? API calls?

Our devices have VPN control and the VPN connectivity check enabled. Our application doesn’t use any Balena API calls.

Hi, about the size of the OS update, this depends on the differences between the two versions. I just tried to update a balenaFin device between 2.58.3+rev1 and 2.77.0+rev1 and the size of the delta image is actually 155M.

And regarding the bandwidth used by the OS, as my colleague states we are working on features that will make it possible to measure the bandwidth consumption. Until then the only documentation we have is Reduce bandwidth usage - Balena Documentation which has not been recently reviewed.

Nothing in the OS should have increased the bandwidth usage, let alone doubled it. We will check this internally to see if we can find the cause.

Great, thank you! I really appreciate your help!

Hi, just a quick update. We have an open ticket to track progress on the bandwidth measurement mechanism I mentioned, so you can follow it for updates: Include IP table bandwidth usage in device metrics · Issue #1724 · balena-os/balena-supervisor · GitHub. It is also linked to this ticket, so we will update here when it closes.

Hi, something that could affect the bandwidth usage is the metrics feature that was recently introduced in the dashboard. Could you check whether disabling these metrics has any effect? See Reduce bandwidth usage - Balena Documentation for details.

Disabling metrics had a significant impact on our data usage - typically reducing usage from 20MB/day to 4MB/day. That was very helpful. I’ll restart my data usage testing with the other bandwidth reduction settings and see if I can reduce it further.

Posting here to +1 that we have also hit this issue:

Impacted Configuration:
BalenaOS 2.72.0+rev1
Supervisor 12.3.5

Upgraded From (March 23 2021):
BalenaOS 2.58.3+rev1
Supervisor 11.14.0

We had a fleet of 32 devices for an alpha product, 11 of which remain online today. We upgraded to the above balenaOS and supervisor in March. We didn’t initially notice the incremental data usage because we did not include data usage as part of our testing and validation procedure for our alpha, and because our fleet was small and shrinking as alpha devices were decommissioned from that program, keeping costs in check.

We started deploying a new fleet of devices and observed that our data costs were getting out of control. The investigation of that issue led to the discovery of this forum topic. We observed that disabling the HW monitoring feature (as suggested by another user) did result in a significant reduction of data usage, but unfortunately it required us to upgrade to a supervisor version that supports the necessary BALENA_SUPERVISOR_HARDWARE_METRICS flag (minimum 12.8.0).

After upgrading to supervisor 12.10.1 (the latest at the time of this posting), and disabling the HW metrics, our downloads in our alpha environment decreased by an average of 35.10 MB/day and uploads decreased by 11.92 MB/day (which corresponds to a data cost of $0.94 per device per day in excess data usage from this feature alone). The interim period from March → August represents a very large per-device data cost from our cell network provider. If we had encountered this issue after deploying our solution at scale, at a minimum it would have resulted in a major outage due to meeting and exceeding data caps, and potentially could have introduced crippling data costs. On top of that, had we deployed at scale in the interim period between the release of the supervisor (12.3.5 - Feb 9th) and the patched supervisor (12.8.0 - May 13th) [ A time span of over 3 months ] we would have had very limited options to stop the bleeding.

We investigated this issue further internally and found that the payload required to transmit HW status is indeed small, typically between 25 and 80 bytes per message. It is JSON-formatted data and looks like this:


{
    "local": {
        "cpu_temp": 50,
        "cpu_usage": 44,
        "memory_usage": 596,
        "storage_usage": 1335
    }
}
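
As a rough check on the upper end of that range, here is a minimal Python sketch that serializes the sample above and counts its bytes. Compact serialization (no whitespace) is an assumption on my part about how the payload goes over the wire, and the variable names are only illustrative.

```python
import json

# Sample HW status report, copied from the message above.
sample_report = {
    "local": {
        "cpu_temp": 50,
        "cpu_usage": 44,
        "memory_usage": 596,
        "storage_usage": 1335,
    }
}

# Assumption: the payload is sent as compact JSON (no spaces or newlines).
compact_json = json.dumps(sample_report, separators=(",", ":"))
payload_bytes = len(compact_json.encode("utf-8"))

print(compact_json)
print(f"{payload_bytes} bytes")
```

Under that assumption the sample comes to 80 bytes, which matches the top of the observed range.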

The trouble is that every 10 seconds the device is not sending only this data; it is also completing a TLS 1.2 handshake with

https://api.balena-cloud.com/device/v2/<device-uuid>/state

which includes 5-8KB of transmissions (most of the cost is downloading a 4960 byte certificate chain). In fact, as a sanity check on my estimation of the cost of a single HW status report, I took the amount of observed average data savings per device (47.02 MB/day) and divided by the number of HW status updates per day (8,640) and came up with an average data cost of 5,706 bytes/update.
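
That sanity check can be sketched in a few lines of Python. The variable names are illustrative, and the 1 MB = 1,048,576 bytes convention is an assumption; it is the one that reproduces the 5,706 figure (with 1 MB = 1,000,000 bytes the result would be about 5,442 bytes/update).

```python
BYTES_PER_MB = 1024 * 1024  # assumption: binary megabytes

# Average daily savings per device after disabling HW metrics (from above).
download_saved_mb = 35.10
upload_saved_mb = 11.92

# One HW status report every 10 seconds.
report_interval_s = 10
updates_per_day = 24 * 60 * 60 // report_interval_s

total_saved_mb = download_saved_mb + upload_saved_mb
bytes_per_update = total_saved_mb * BYTES_PER_MB / updates_per_day

print(f"{total_saved_mb:.2f} MB/day saved")
print(f"{updates_per_day} updates/day")
print(f"{bytes_per_update:.0f} bytes/update")
```

The result falls inside the 5-8KB range observed for a single handshake, so the two estimates are consistent with each other.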

So the table for bandwidth costs located here should be updated to include the TLS handshake cost, which is much much greater than the stated ~66 bytes/10 seconds.

This experience has highlighted for us the need to carefully evaluate new features coming from balena before deploying them. We are also concerned that the feature could not be disabled when it was originally introduced, and we have questions about the quality effort that went into this feature before it was deployed to the field, and about how its true cost of consumption was measured and validated before it was published in the public documentation.

Thanks for that feedback jweide!

It’s not a good one, but it is the kind of feedback that allows us to improve. In fact, related to this topic, we are already working in several directions:

  • We are going to automate bandwidth usage tests to catch problems and give better advice about consumption. (More info here: Measure bandwidth consumption on every release · Issue #1756 · balena-os/meta-balena · GitHub)
  • We are including a migration from aufs to overlayfs that will allow moving aufs containers to overlay2 without having to re-download them (please keep an eye on the changelogs and be aware of the extra bandwidth usage if this migration doesn’t work properly)
  • We are studying how to give more control to users, on cell vs landline traffic
  • We are analyzing how to reduce bandwidth usage.

And of course, your message pushed us to check and see how to improve our release cycle. We definitely want to improve and avoid unwanted situations like yours.

Thanks again!