otel-collector-device-prom image not working when building without 'balena push'

Hi!
I’m using the otel-collector-device-prom image in my docker compose.
I followed the README in the balena-io-experimental/otel-collector-device-prom repo (OpenTelemetry Collector balenaBlock for device monitoring to a Prometheus backend).
I used the Dockerfile.template from the otel-collector-device-prom/docs/example directory and added a custom entrypoint script.
I used to run the docker compose with the Dockerfile as-is (i.e. with this Dockerfile line: FROM bh.cr/ken8/otel-collector-device-prom-%%BALENA_ARCH%%) and my entrypoint.sh script.
This works perfectly when using 'balena push'. However, I don't want to keep copying the Dockerfile and entrypoint.sh into every project that uses them and rebuilding the image in the docker compose every time. Instead, I want to build the image once so I can just pull it. The problem is that the otel-collector service keeps crashing and restarting, even though I think I'm using the correct images. This is my GitHub Action to build the image:

name: Create Otel Collector Docker Image

on:
  workflow_dispatch: # Allows you to run this workflow manually from the Actions tab

jobs:
  build_docker_images:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Login to Harbor Registry
        uses: docker/login-action@v2
        with:
          registry: redacted
          username: redacted
          password: redacted

      - name: Set up QEMU
        uses: docker/setup-qemu-action@v3

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Build and push Docker image
        uses: docker/build-push-action@v6
        with:
          context: ./docker/images/otel-collector
          file: ./docker/images/otel-collector/Dockerfile.template
          push: true
          tags: redacted
          cache-from: type=gha
          cache-to: type=gha,mode=max
          platforms: linux/arm64

I'm building for an aarch64 Raspberry Pi 4.
In the Dockerfile I’m using for this image, I put FROM bh.cr/ken8/otel-collector-device-prom-aarch64 instead of the line with %%BALENA_ARCH%%.
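For reference, the full Dockerfile is just the example one with the arch pinned (this matches the build steps in the logs further down):

  FROM bh.cr/ken8/otel-collector-device-prom-aarch64

  # Custom entrypoint layered on top of the block's image
  COPY entrypoint.sh /usr/local/bin/entrypoint.sh
  RUN chmod +x /usr/local/bin/entrypoint.sh
  ENTRYPOINT ["/usr/local/bin/entrypoint.sh"]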

I appreciate any help!
Thank you

Hi, thanks for the feedback and welcome to the forums! I should have some time to review the week of Oct. 24.

A little update: the problem is not with our Harbor registry, where we store our images. When I replace %%BALENA_ARCH%% with aarch64, the image also keeps restarting when doing balena push and building the image locally.
The problem is also not with my custom script, since it works when using %%BALENA_ARCH%%.

I'm also certain I'm pulling the correct image from balena. When using %%BALENA_ARCH%%, these are my logs:

[otel-collector] Step 1/4 : FROM bh.cr/ken8/otel-collector-device-prom-aarch64
[otel-collector]  ---> d959ba305b63
[otel-collector] Step 2/4 : COPY entrypoint.sh /usr/local/bin/entrypoint.sh
[otel-collector] Using cache
[otel-collector]  ---> 7d8b7e98292f
[otel-collector] Step 3/4 : RUN chmod +x /usr/local/bin/entrypoint.sh
[otel-collector] Using cache
[otel-collector]  ---> 8fdbd17cf520
[otel-collector] Step 4/4 : ENTRYPOINT ["/usr/local/bin/entrypoint.sh"]
[otel-collector] Using cache
[otel-collector]  ---> f49f8e0f3c45
[otel-collector] Successfully built f49f8e0f3c45

Edit:
Another update: if I first use %%BALENA_ARCH%%, then switch to the aarch64 image, and then back to %%BALENA_ARCH%%, the container also keeps restarting, and I have to flash my Raspberry Pi again to get it working.

Okay, I found the issue. It turns out the problem lay in the entrypoint script I was so certain worked…
After the script finishes running, the container has no more running processes and shuts down. The docker compose has the service on restart: always, and that's why it restarts every few seconds. Sorry for wasting your time!
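For anyone who hits the same symptom, here is a minimal sketch of an entrypoint that avoids it. The collector binary name and config path are assumptions; check the base image's original ENTRYPOINT/CMD for the real command:

  #!/bin/sh
  # ... custom setup goes here ...

  # Replace the shell with the long-running collector process so the
  # container keeps a foreground process (PID 1) and doesn't exit.
  exec /otelcol-contrib --config /app/config.yaml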

No worries, good to hear you were able to figure it out. Let us know if you have any suggestions for improvements to the collector block.

Hey @ferrelaridon , I just updated the otel-collector-device-prom repo with the latest from upstream, and device logs work pretty well now. Let us know how it works when you get the chance.

Hey @kb2ma! I saw the update and pulled the new image. For my use case, I would rather not use Loki. If I leave the URL and user empty, my device metrics are not collected. Is there a way I can keep using the image on my balena devices without making a Loki account? Thanks in advance!

Thanks for the feedback. I did not test the scenario of not using the logging part. I suspect a good approach is to build the config.yaml dynamically from a script at container startup, based on whether those LOKI_* variables are defined. Hopefully it won't be too much work, and it adds flexibility for future cases as well. I'll get back to you.
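Roughly what I have in mind (file names and paths here are hypothetical, just to illustrate the idea):

  #!/bin/sh
  # Pick a collector config at startup based on whether the LOKI_*
  # variables are defined; the file names below are placeholders.
  if [ -n "$LOKI_URL" ] && [ -n "$LOKI_USER" ]; then
    CONFIG=/app/config-with-loki.yaml
  else
    CONFIG=/app/config-metrics-only.yaml
  fi
  echo "$CONFIG"
  exec /otelcol-contrib --config "$CONFIG"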

I tested some things: removing the Loki variables from the config.yaml on container startup, and also just filling in localhost in the URL env variable. Both times I got this error:
Error: invalid configuration: exporters::loki: "endpoint" must be a valid URL

Then I tried running a Loki container alongside my other containers and using loki:3100 as the URL, but still got the same error.

FWIW, if I do not define the LOKI_* variables, metrics are still sent. To be clear, I don't define those variables at all, as opposed to defining them and leaving them blank/empty. I do see the "Exporting failed" error about sending Loki messages.

At any rate, I am working on an update to avoid the attempt to send journal messages at all if the variables are not defined.

I have pushed updates to the otel-collector-device-prom repository and blocks to disable journald/Loki reporting if the LOKI_* variables are not defined. Let me know if that works for you.

Thank you! After testing some more, I ran into these errors:

 otel-collector  /app/node-metrics.yaml
 otel-collector  2024-11-12T10:36:53.311Z       info    service@v0.112.0/service.go:135 Setting up own telemetry...
 otel-collector  2024-11-12T10:36:53.313Z       info    telemetry/metrics.go:70 Serving metrics {"address": "localhost:8888", "metrics level": "Normal"}
 otel-collector  2024-11-12T10:36:53.316Z       warn    prometheusremotewriteexporter@v0.112.0/factory.go:46    Currently, remote_write_queue.num_consumers doesn't have any effect due to incompatibility with Prometheus remote write API. The value will be ignored. Please see https://github.com/open-telemetry/opentelemetry-collector/issues/2949 for more information.    {"kind": "exporter", "data_type": "metrics", "name": "prometheusremotewrite"}
 otel-collector  Error: failed to build pipelines: failed to create "prometheusremotewrite" exporter for data type "metrics": invalid endpoint
 otel-collector  2024/11/12 10:36:53 collector server run finished with error: failed to build pipelines: failed to create "prometheusremotewrite" exporter for data type "metrics": invalid endpoint

To me, it doesn’t look like the config.yaml has changed (except for the name). Am I missing a configuration somewhere?

 otel-collector  Error: failed to build pipelines: failed to create "prometheusremotewrite" exporter for data type "metrics": invalid endpoint

The error above indicates to me that your PROMETHEUS_URL is not valid. If you are using Grafana Cloud, does it have the same format as this screenshot from the README? In particular, it does not include the protocol prefix (https://).

Otherwise, please review or provide the full log output from the collector startup. This script shows the expected sequence of actions. You should see the echo statements in that file. Does the full config.yaml print out in the logs? Does it have the content you expect?

We are using a custom integration for Prometheus. For the environment variables PROMETHEUS_USER and PROMETHEUS_PASSWORD, we fill in "" (an empty string). This worked fine before, but now the Prometheus URL is no longer valid. Did you by chance add some extra validation to catch empty strings? The https:// prefix is left out, and the PROMETHEUS_URL environment variable is filled in. I think the error happens when the three variables are combined into one string?

These are the logs I get when starting up the collector:

 otel-collector  /app/node-metrics.yaml
 otel-collector  2024-11-15T13:10:43.875Z       info    service@v0.112.0/service.go:135  Setting up own telemetry...
 otel-collector  2024-11-15T13:10:43.876Z       info    telemetry/metrics.go:70  Serving metrics {"address": "localhost:8888", "metrics level": "Normal"}
 otel-collector  2024-11-15T13:10:43.877Z       warn    prometheusremotewriteexporter@v0.112.0/factory.go:46     Currently, remote_write_queue.num_consumers doesn't have any effect due to incompatibility with Prometheus remote write API. The value will be ignored. Please see https://github.com/open-telemetry/opentelemetry-collector/issues/2949 for more information.      {"kind": "exporter", "data_type": "metrics", "name": "prometheusremotewrite"}
 otel-collector  Error: failed to build pipelines: failed to create "prometheusremotewrite" exporter for data type "metrics": invalid endpoint
 otel-collector  2024/11/15 13:10:43 collector server run finished with error: failed to build pipelines: failed to create "prometheusremotewrite" exporter for data type "metrics": invalid endpoint

Thank you!

Update:
I think I got it working! I used some dummy values for the username and password and everything seems to be working now.
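My guess is that the endpoint is assembled from the three variables, so empty credentials produce an invalid URL. A hypothetical sketch of the failure mode (the block's actual script may combine them differently):

  # hypothetical assembly of the remote-write endpoint
  endpoint="https://${PROMETHEUS_USER}:${PROMETHEUS_PASSWORD}@${PROMETHEUS_URL}"
  # with USER and PASSWORD empty this becomes "https://:@<host>",
  # which URL validation rejects; non-empty dummy values avoid that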

@ferrelaridon Thanks for sharing that you got it working!