Either of these limiters would work in isolation, but together they create a potential issue. When open-balena-api rejects a performance metric, it still returns a 200 success code on the PATCH call, so balena-supervisor has no idea that its updated metric did not take, and it does not retry the update unless isSignificantChange is triggered again from the new level. This is a problem because the baseline balena-supervisor uses for isSignificantChange now differs from the value stored in open-balena-api, so the stat remains misreported (often significantly) until another significant change is triggered.
For example, this is the type of behavior we are seeing on our devices:
- T12:00:00: Device supervisor correctly detects high CPU usage (say 99%) from some process that happened to be running at the time of the test, and sends a PATCH call to the API to report this; the API accepts the call and updates cpu_usage to 99%.
- T12:00:30: Device CPU usage drops to 20% as the process completes; the supervisor detects and reports this to the API, but the API ignores the update because 60 seconds have not elapsed. The API-reported CPU usage remains at 99%. The supervisor, however, thinks the API accepted the update and changes its cached value to 20% for determining significant changes from this point forward.
- Many hours or days pass while the device remains between 0% and 40% CPU usage, so no significant change is triggered.
- open-balena-api continues to report CPU usage at 99% and never changes.
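The interaction above can be sketched as a small simulation. All names, the 20-point bucket width, and the timing logic here are illustrative assumptions, not the actual balena-supervisor or open-balena-api code; the point is only to show how the supervisor's cached baseline diverges from what the API stored:

```typescript
// Illustrative simulation of the two interacting rate limiters.
// Names and thresholds are assumptions, not the real balena source.

const BUCKET_SIZE = 20;                // assumed supervisor bucket width
const MAX_REPORT_INTERVAL_MS = 60_000; // API-side minimum gap between writes

// Supervisor-side gate: only report when the value crosses into a new bucket.
function isSignificantChange(prev: number, curr: number): boolean {
  return Math.floor(prev / BUCKET_SIZE) !== Math.floor(curr / BUCKET_SIZE);
}

// API-side state: silently drops early updates but still returns 200.
let storedCpuUsage = 0;
let lastAcceptedAt = -Infinity;
function apiPatch(cpu: number, now: number): number {
  if (now - lastAcceptedAt >= MAX_REPORT_INTERVAL_MS) {
    storedCpuUsage = cpu;
    lastAcceptedAt = now;
  }
  return 200; // success either way, so the caller cannot tell
}

// Supervisor-side state: caches what it *sent*, not what the API kept.
let cachedCpuUsage = 20;
function supervisorTick(cpu: number, now: number): void {
  if (isSignificantChange(cachedCpuUsage, cpu)) {
    apiPatch(cpu, now);
    cachedCpuUsage = cpu; // updated even if the API ignored the PATCH
  }
}

// Replay the timeline from the example:
supervisorTick(99, 0);         // T12:00:00 spike is accepted, API stores 99
supervisorTick(20, 30_000);    // T12:00:30 drop is silently ignored by the API
supervisorTick(35, 3_600_000); // hours later: same bucket as 20, never sent
// storedCpuUsage is still 99; cachedCpuUsage is 20
```

Because the API returned 200 for the 30-second PATCH, the supervisor moved its baseline to 20% and the stored 99% is never corrected.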
I believe the solution to this is either to remove the shouldUpdateMetrics function from open-balena-api entirely, or to default METRICS_MAX_REPORT_INTERVAL_SECONDS to zero and let users raise it, understanding the issues that might cause. I believe balena cloud is also using a non-zero value here, which would explain why users are seeing this issue in that environment per the thread linked above. Setting this to zero makes the primary governor of performance updates the isSignificantChange function in balena-supervisor, which works correctly on its own.
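To make the proposal concrete, here is a hedged sketch of what the API-side check reduces to when the interval defaults to zero (the function and constant names are illustrative, not the actual open-balena-api code):

```typescript
// With the interval at 0, the API-side throttle becomes a pass-through and
// the supervisor's own isSignificantChange gating governs update frequency.
// Names are illustrative, not the actual open-balena-api implementation.

const METRICS_MAX_REPORT_INTERVAL_SECONDS = 0; // proposed default

function shouldUpdateMetrics(lastAcceptedAtMs: number, nowMs: number): boolean {
  // With a zero interval, every PATCH that reaches this point is accepted.
  return nowMs - lastAcceptedAtMs >= METRICS_MAX_REPORT_INTERVAL_SECONDS * 1000;
}
```

Operators who need to protect the backend could still set a positive value explicitly, accepting the stale-metric behavior described above.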
In the meantime, to alleviate this issue in our environment we have manually set METRICS_MAX_REPORT_INTERVAL_SECONDS to zero. For whatever reason, open-balena-api would not pick up this value from an env var in the API container (when we set the METRICS_MAX_REPORT_INTERVAL_SECONDS env var to zero, the constant in the app remained at 60, which we verified via console-log tests), so we had to hardwire the constant to zero in the code. The lingering CPU-usage issue is now solved.
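One common reason an env var appears to be ignored is that the value is parsed once at module load with a silent fallback, so a variable that is missing, malformed, or set after the process started keeps the default. This is a hypothetical sketch of that pattern, not the actual open-balena-api wiring:

```typescript
// Hypothetical env-override pattern: a missing or unparsable value silently
// falls back to 60, which would look like the env var "not taking".
function resolveMetricsInterval(
  env: Record<string, string | undefined>,
): number {
  const raw = env['METRICS_MAX_REPORT_INTERVAL_SECONDS'];
  const parsed = raw === undefined ? NaN : parseInt(raw, 10);
  return Number.isNaN(parsed) ? 60 : parsed; // default kept on any failure
}
```

If the real code follows a pattern like this, a console log of the resolved constant at startup (as tested above) is the right way to confirm whether the override was actually picked up.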
Great job with the deep dive into the source code! You are spot on about why metrics are not being updated as expected. On startup, often there is high CPU usage, but then the device drops to a consistent level that doesn’t pass the significant change calculations in the Supervisor code, thus the Supervisor never reports another CPU usage metric. In practice, balenaCloud’s actual METRICS_MAX_REPORT_INTERVAL is greater than 60 seconds afaik, so the situation could be exacerbated even more. The solution for openBalena could indeed be to set the metrics interval to 0, but this will not work in practice for balenaCloud as the backend will quickly become overwhelmed by PATCH requests.
We are working on an improvement related to device metrics which is more aware of backend load. Part of this improvement could also be to expose metrics reporting to users outside of the database, where users can define their own update intervals and use more elegant monitoring services like Grafana. As the first step towards this improvement, we have an open GitHub issue for improving the significant change calculation to use relative buckets instead of absolute buckets. Feel free to follow this issue for updates: balenaCloud CPU Usage metric not updating as expected · Issue #1907 · balena-os/balena-supervisor · GitHub
As for your specific issue, setting the metrics interval in the backend will lessen instances of it, but restarting the Supervisor on devices would work as well.
Hi @cywang117, thanks for the feedback. If I could suggest another possible solution: I think balena cloud users will care a lot more about absolute CPU usage than relative changes. If you simply send back a response with the PATCH update request indicating that the hardware metrics were not accepted, the supervisor will know it needs to include them in the next update by not updating its internally cached values for them.
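The protocol change being suggested can be sketched as follows. The response shape and field name are hypothetical, invented here to illustrate the idea; the real API would need to add something equivalent:

```typescript
// Sketch of the suggested fix: if the PATCH response said whether metrics
// were accepted, the supervisor could keep its old baseline and retry later.
// The response shape and `metricsAccepted` field are hypothetical.

interface PatchResponse {
  status: number;
  metricsAccepted: boolean; // hypothetical field the API would need to add
}

function nextCachedValue(
  cached: number,
  sent: number,
  response: PatchResponse,
): number {
  // Only move the significant-change baseline when the API actually kept the
  // value; otherwise the next comparison uses the old baseline and the
  // metric is effectively retried.
  return response.metricsAccepted ? sent : cached;
}
```

In the 99%-to-20% scenario above, a rejected PATCH would leave the cached baseline at 99%, so the next report of 20% would still register as a significant change and be retried once the API's interval had elapsed.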
I think there was a misunderstanding about my reference to absolute vs relative buckets. The CPU util displayed to the user would be the same as before, but the underlying logic for deciding when the Supervisor should send metrics in a current state PATCH would change. Without going too deep into implementation details that may be subject to change, this will hopefully make metrics reporting more true to the condition of the device.
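For readers following along, here is one way the absolute-vs-relative distinction could look. The thresholds are made up for illustration; the actual Supervisor logic may differ and is subject to change per the linked issue:

```typescript
// Illustration of absolute vs. relative bucketing for "significant change".
// Bucket width and ratio are invented for this example.

// Absolute buckets: 15% -> 25% crosses the fixed 20% boundary and reports,
// but 25% -> 35% does not, even though both are 10-point moves.
function significantAbsolute(prev: number, curr: number, width = 20): boolean {
  return Math.floor(prev / width) !== Math.floor(curr / width);
}

// Relative buckets: report when the value moved by some fraction of the
// previous reading, so sensitivity scales with the current level.
function significantRelative(prev: number, curr: number, ratio = 0.2): boolean {
  if (prev === 0) return curr !== 0;
  return Math.abs(curr - prev) / prev >= ratio;
}
```

Under the relative scheme, equal-sized moves are treated consistently regardless of which fixed boundary they happen to straddle, which is the asymmetry the issue describes.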
Regarding the API sending back a response of “not accepted” for current state PATCHes, this would be a solution in theory, but any changes in the balena API need to be backwards compatible with legacy Supervisors, and legacy Supervisors don’t have the logic in place to handle unexpected code paths. While we do strongly encourage Supervisor / OS upgrades whenever possible, the reality is that there are plenty of devices on outdated Supervisor versions. (Side note, with regards to the Supervisor, I’d consider anything that’s not the latest semver major as “outdated”.) We are working on an improvement to make the process of upgrading the Supervisor more automatic to alleviate this in the long run.
We recognize this as a hard problem and are working towards lessening the friction on this front.
I’m curious to hear: for your situation with openBalena, did you set METRICS_MAX_REPORT_INTERVAL_SECONDS to 0, and how did that work out?
@cywang117 sounds like there are a lot of factors you need to take into account. After we bypassed METRICS_MAX_REPORT_INTERVAL_SECONDS the problem was solved for us: all hardware metrics are being correctly reported and we have not seen the issue recur. We also tested manual PATCH calls to update hardware metrics at the API, and the new values were always reflected instantly. So the governor for us is now just the device reporting frequency, coupled with the bucketing system to limit noise.