Hi!
I’m using the otel-collector-device-prom image in my docker compose.
I followed the README in this repo: balena-io-experimental/otel-collector-device-prom (OpenTelemetry Collector balenaBlock for device monitoring to a Prometheus backend).
I used the Dockerfile.template from the example in otel-collector-device-prom/docs/example (master branch) of that repo and added a custom entrypoint script.
I used to run the docker compose with the Dockerfile as is (i.e. with this Dockerfile line: FROM bh.cr/ken8/otel-collector-device-prom-%%BALENA_ARCH%%) and my entrypoint.sh script.
This works perfectly when using ‘balena push’. However, I don’t want to keep copying the Dockerfile and entrypoint.sh into every project that uses them and rebuilding the image each time in the docker compose. So I want to build the image once and then just pull it. The problem is that the otel-collector service keeps crashing and restarting, even though I think I’m using the correct images. This is my GitHub Action to build the image:
name: Create Otel Collector Docker Image
on:
  workflow_dispatch: # Allows you to run this workflow manually from the Actions tab
I’m building for aarch64 Raspberry Pi 4.
In the Dockerfile I’m using for this image, I put FROM bh.cr/ken8/otel-collector-device-prom-aarch64 instead of the line with %%BALENA_ARCH%%.
A little update: the problem is not with our Harbor registry where we save our images. When I replace %%BALENA_ARCH%% with aarch64, the container also keeps restarting, both when doing balena push and when building the image locally.
The problem is also not with my custom script, as this works when using %%BALENA_ARCH%%.
I’m also certain I’m pulling the correct image from balena. When using %%BALENA_ARCH%%, these are my logs:
[otel-collector] Step 1/4 : FROM bh.cr/ken8/otel-collector-device-prom-aarch64
[otel-collector] ---> d959ba305b63
[otel-collector] Step 2/4 : COPY entrypoint.sh /usr/local/bin/entrypoint.sh
[otel-collector] Using cache
[otel-collector] ---> 7d8b7e98292f
[otel-collector] Step 3/4 : RUN chmod +x /usr/local/bin/entrypoint.sh
[otel-collector] Using cache
[otel-collector] ---> 8fdbd17cf520
[otel-collector] Step 4/4 : ENTRYPOINT ["/usr/local/bin/entrypoint.sh"]
[otel-collector] Using cache
[otel-collector] ---> f49f8e0f3c45
[otel-collector] Successfully built f49f8e0f3c45
Edit:
Another update: if I first use %%BALENA_ARCH%%, then switch to the aarch64 image, and then switch back to %%BALENA_ARCH%%, the container also keeps restarting, and I have to flash my Raspberry Pi again to get it working.
Okay, I found the issue. It turns out the problem was in the entrypoint script that I was so certain worked …
After the script is done running, the container has no more running processes and shuts down. The docker compose has the service on ‘restart: always’ and that’s why it restarts every few seconds. Sorry for wasting your time!
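For anyone who lands here with the same restart loop, here is a minimal sketch of an entrypoint that avoids it. It assumes the long-running command (for example, the collector start command) is passed to the script as arguments; the setup function and its message are placeholders, not something taken from the otel-collector-device-prom image.

```shell
#!/bin/sh
# Sketch of an entrypoint that does not leave the container without a process.
# Assumption: the long-running command is passed to this script as arguments.

setup() {
    # Hypothetical one-time setup work goes here.
    echo "setup complete"
}

main() {
    setup
    # exec replaces this shell with the given command, so the container's
    # main process is the long-running service, not a script that has finished.
    exec "$@"
}

# Hand over only when a command was actually supplied. With no command the
# script simply ends, and the container would exit as described above.
if [ "$#" -gt 0 ]; then
    main "$@"
fi
```

The key point is the final exec: as long as the command it starts keeps running, the container stays up and ‘restart: always’ has nothing to restart.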
Hey @ferrelaridon, I just updated the otel-collector-device-prom repo with the latest from upstream, and device logs work pretty well now. Let us know how it works when you get the chance.
Hey @kb2ma! I saw the update and pulled the new image. For my use case, I would rather not use Loki. If I leave the URL and User empty, my device metrics are not collected. Is there a way I can keep using the image on my balena devices without creating a Loki account? Thanks in advance!
Thanks for the feedback. I did not test the scenario of not using the logging part. I suspect a good approach is to build the config.yaml dynamically from a script at container startup, based on the definition of those LOKI… variables. Hopefully it won’t be too much work, and it adds flexibility for future cases as well. I’ll get back to you.
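A rough sketch of what such a startup script could look like, not the actual implementation in the repository. The variable names LOKI_URL and PROMETHEUS_URL, the https:// prefix on both endpoints, and the YAML content itself are assumptions for illustration only.

```shell
#!/bin/sh
# Sketch: print a collector config, adding the journald/Loki parts only when
# LOKI_URL is defined and non-empty.
# Assumptions: LOKI_URL and PROMETHEUS_URL are illustrative variable names,
# both hold scheme-less URLs, and the YAML is not the config shipped in the repo.

build_config() {
    echo "receivers:"
    echo "  hostmetrics:"
    echo "    scrapers:"
    echo "      cpu:"
    echo "      memory:"
    if [ -n "${LOKI_URL:-}" ]; then
        echo "  journald:"
        echo "    directory: /var/log/journal"
    fi

    echo "exporters:"
    echo "  prometheusremotewrite:"
    echo "    endpoint: \"https://${PROMETHEUS_URL:-}\""
    if [ -n "${LOKI_URL:-}" ]; then
        echo "  loki:"
        echo "    endpoint: \"https://${LOKI_URL}\""
    fi

    echo "service:"
    echo "  pipelines:"
    echo "    metrics:"
    echo "      receivers: [hostmetrics]"
    echo "      exporters: [prometheusremotewrite]"
    if [ -n "${LOKI_URL:-}" ]; then
        # The logs pipeline exists only when Loki is configured.
        echo "    logs:"
        echo "      receivers: [journald]"
        echo "      exporters: [loki]"
    fi
}
```

At container startup the output could be redirected to a file, for example build_config > /tmp/config.yaml, before the collector is started with that file.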
I tested some things by removing the loki variables from the config.yaml on container startup and also by just filling in localhost in the url env variable. Both times I got this error:
Error: invalid configuration: exporters::loki: "endpoint" must be a valid URL
Then I tried running a Loki container alongside my other containers and using loki:3100 as the URL, but I still got the same error.
FWIW, if I do not define the LOKI_… variables, metrics are still sent. To be clear, I don’t define those variables at all, as opposed to defining them and leaving them blank/empty. I do see the “Exporting failed” error about sending Loki messages.
At any rate, I am working on an update to avoid the attempt to send journal messages at all if the variables are not defined.
I have pushed updates to the otel-collector-device-prom repository and blocks to disable use of journald/loki reporting if the LOKI_* variables are not defined. Let me know if that works for you.
Thank you! After testing around some more, I ran into these errors:
otel-collector /app/node-metrics.yaml
otel-collector 2024-11-12T10:36:53.311Z info service@v0.112.0/service.go:135 Setting up own telemetry...
otel-collector 2024-11-12T10:36:53.313Z info telemetry/metrics.go:70 Serving metrics {"address": "localhost:8888", "metrics level": "Normal"}
otel-collector 2024-11-12T10:36:53.316Z warn prometheusremotewriteexporter@v0.112.0/factory.go:46 Currently, remote_write_queue.num_consumers doesn't have any effect due to incompatibility with Prometheus remote write API. The value will be ignored. Please see https://github.com/open-telemetry/opentelemetry-collector/issues/2949 for more information. {"kind": "exporter", "data_type": "metrics", "name": "prometheusremotewrite"}
otel-collector Error: failed to build pipelines: failed to create "prometheusremotewrite" exporter for data type "metrics": invalid endpoint
otel-collector 2024/11/12 10:36:53 collector server run finished with error: failed to build pipelines: failed to create "prometheusremotewrite" exporter for data type "metrics": invalid endpoint
To me, it doesn’t look like the config.yaml has changed (except for the name). Am I missing a configuration somewhere?
otel-collector Error: failed to build pipelines: failed to create "prometheusremotewrite" exporter for data type "metrics": invalid endpoint
The error above indicates to me that your PROMETHEUS_URL is not valid. If you are using Grafana Cloud, does it have the same format as this screenshot from the README? In particular, it does not include the protocol prefix (https://).
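A quick way to rule this out is a small check like the sketch below. It only encodes the hint above (non-empty value, no protocol prefix); the function name is made up for illustration, and the rest of the expected format is not verified.

```shell
#!/bin/sh
# Sketch: sanity-check a PROMETHEUS_URL value before starting the collector.
# Assumption (from the README hint above): the value must be non-empty and
# must not include a protocol prefix such as https://.

check_prometheus_url() {
    url="$1"
    if [ -z "$url" ]; then
        echo "PROMETHEUS_URL is empty or not set" >&2
        return 1
    fi
    case "$url" in
        *://*)
            echo "PROMETHEUS_URL should not include a protocol prefix: $url" >&2
            return 1
            ;;
    esac
    return 0
}
```

In an entrypoint script it could be called as check_prometheus_url "${PROMETHEUS_URL:-}" || exit 1, so that a bad value shows up as a clear message rather than only as the collector's "invalid endpoint" error.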
Otherwise, please review or provide the full output of logging from the collector startup. This script shows the expected sequence of actions. You should see the echo statements in that file. Does the full config.yaml print out in the logs? Does it have the content you expect?