Posting here to +1 that we have also hit this issue:
Impacted Configuration:
BalenaOS 2.72.0+rev1
Supervisor 12.3.5
Upgraded From (March 23 2021):
BalenaOS 2.58.3+rev1
Supervisor 11.14.0
We had a fleet of 32 devices for an alpha product, 11 of which remain online today. We upgraded to the above balenaOS and supervisor in March. We didn’t initially notice the incremental data usage because we did not include data usage as part of our testing and validation procedure for our alpha, and because our fleet was small and shrinking as alpha devices were decommissioned from that program, keeping costs in check.
We started deploying a new fleet of devices and observed that our data costs were getting out of control. The investigation of that issue led to the discovery of this forum topic. We observed that disabling the HW monitoring feature (as suggested by another user) did result in a significant reduction of data usage, but unfortunately it required us to upgrade to a supervisor version that supports the necessary BALENA_SUPERVISOR_HARDWARE_METRICS flag (minimum 12.8.0).
After upgrading to supervisor 12.10.1 (the latest at the time of this posting) and disabling the HW metrics, downloads in our alpha environment decreased by an average of 35.10 MB/day and uploads by 11.92 MB/day (which corresponds to $0.94 per device per day in excess data usage from this feature alone). The interim period from March → August represents a very large per-device data cost from our cell network provider. If we had encountered this issue after deploying our solution at scale, at a minimum it would have resulted in a major outage from hitting and exceeding data caps, and it could have introduced crippling data costs. On top of that, had we deployed at scale in the interim period between the release of the affected supervisor (12.3.5, Feb 9th) and the patched supervisor (12.8.0, May 13th), a span of over 3 months, we would have had very limited options to stop the bleeding.
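For anyone who wants to plug in their own numbers, here is a rough sketch of the arithmetic behind those figures. The measurements are the ones quoted above; the per-MB rate is only what they imply (cost divided by volume), not a quoted carrier tariff, and the function name is made up for illustration.

```python
# Measured averages from our alpha fleet (figures quoted in this post).
DOWNLOAD_MB_PER_DAY = 35.10
UPLOAD_MB_PER_DAY = 11.92
COST_PER_DEVICE_PER_DAY = 0.94  # USD, excess cost attributed to HW metrics

# Total excess traffic per device per day.
total_mb_per_day = DOWNLOAD_MB_PER_DAY + UPLOAD_MB_PER_DAY

# Rate implied by the two figures above. This is an inference,
# not a published price from any carrier.
implied_usd_per_mb = COST_PER_DEVICE_PER_DAY / total_mb_per_day


def excess_cost(devices: int, days: int) -> float:
    """Hypothetical helper: projected excess cost in USD for a fleet."""
    return devices * days * COST_PER_DEVICE_PER_DAY


if __name__ == "__main__":
    print(f"excess traffic: {total_mb_per_day:.2f} MB/device/day")
    print(f"implied rate:   ${implied_usd_per_mb:.4f}/MB")
    print(f"32 devices, 30 days: ${excess_cost(32, 30):.2f}")
```

Swapping in your own fleet size and time window gives a quick feel for how this scales.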
We further investigated this issue internally and discovered that, indeed, the payload of data required to transmit HW status is typically between 25 and 80 bytes per message. It is JSON-formatted data and looks like this:
{
  "local": {
    "cpu_temp": 50,
    "cpu_usage": 44,
    "memory_usage": 596,
    "storage_usage": 1335
  }
}
The trouble is that every 10 seconds it is not sending only this data; it is also completing a TLS 1.2 handshake to
https://api.balena-cloud.com/device/v2/<device-uuid>/state
which includes 5-8 KB of transmissions (most of the cost is downloading a 4,960-byte certificate chain). In fact, as a sanity check on my estimate of the cost of a single HW status report, I took the observed average data savings per device (47.02 MB/day, counting 1 MB as 1,048,576 bytes), divided by the number of HW status updates per day (8,640), and came up with an average data cost of 5,706 bytes/update.
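That sanity check can be reproduced with a few lines. This is only a sketch of my own arithmetic: the constant and function names are made up for illustration, and it assumes 1 MB = 1,048,576 bytes, which is the convention that yields the figure above.

```python
REPORT_INTERVAL_S = 10           # one HW status report every 10 seconds
SECONDS_PER_DAY = 24 * 60 * 60
BYTES_PER_MB = 1024 * 1024       # assumption: binary megabytes
DOCUMENTED_BYTES_PER_UPDATE = 66 # the ~66 bytes/10 seconds from the docs


def updates_per_day(interval_s: int = REPORT_INTERVAL_S) -> int:
    """Number of status reports sent in 24 hours."""
    return SECONDS_PER_DAY // interval_s


def bytes_per_update(saved_mb_per_day: float,
                     interval_s: int = REPORT_INTERVAL_S) -> float:
    """Average bytes attributable to each report, from daily savings."""
    return saved_mb_per_day * BYTES_PER_MB / updates_per_day(interval_s)


if __name__ == "__main__":
    saved_mb = 35.10 + 11.92     # download + upload savings per device
    per_update = bytes_per_update(saved_mb)
    ratio = per_update / DOCUMENTED_BYTES_PER_UPDATE
    print(f"{updates_per_day()} updates/day")
    print(f"{per_update:.0f} bytes/update, about {ratio:.0f}x the documented figure")
```

The result lands inside the 5-8 KB range seen for a single handshake, which is why I believe the handshake, not the payload, accounts for nearly all of the usage.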
So the table for bandwidth costs located here should be updated to include the TLS handshake cost, which is far greater than the stated ~66 bytes/10 seconds.
This experience has highlighted for us the need to carefully evaluate new features coming from balena before deploying them. We are also concerned that the feature could not be disabled when it was originally introduced, and we have questions about the quality effort for this feature before it was deployed to the field and about how its true cost of consumption was measured and validated before it was published to the public documentation.