ClickStack + HyperDX Observability with Kubernetes Operators

by Afanasy Barbarov

The journey from basic ClickStack to production-grade observability with Kubernetes operators.

The starting point

I began with the simplest possible ClickStack deployment - a single Helm install that bundles everything: ClickHouse, OpenTelemetry Collector, MongoDB, and the HyperDX UI. Four pods, zero configuration headaches.

helm install clickstack clickstack/clickstack \
  --namespace observability \
  -f k8s/clickstack-values-echo.yaml

It worked. I could see the HyperDX dashboard, poke around the UI. But this wasn't production-ready: single ClickHouse pod (no replication), single OTel Collector (can't collect node-specific metrics from multiple nodes), no high availability.

Why operators?

ClickStack's docs recommend using Kubernetes operators for production. The idea: instead of Helm managing individual pods, specialized operators manage complex stateful applications via Custom Resource Definitions (CRDs).

Two operators are needed:

OpenTelemetry Operator - Creates and manages OTel Collectors from OpenTelemetryCollector custom resources. Supports DaemonSets (one collector per node) and Deployments (cluster-wide collectors).

Altinity ClickHouse Operator - Creates and manages ClickHouse clusters from ClickHouseInstallation custom resources. Handles replication, sharding, user management.

Installing the operators

First, the OpenTelemetry Operator. I don't use cert-manager, so I let the chart auto-generate the admission webhook certificate:

helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts

helm install opentelemetry-operator open-telemetry/opentelemetry-operator \
  --namespace observability \
  --set admissionWebhooks.certManager.enabled=false \
  --set admissionWebhooks.autoGenerateCert.enabled=true \
  --set manager.resources.requests.cpu=100m \
  --set manager.resources.requests.memory=128Mi \
  --set manager.resources.limits.cpu=200m \
  --set manager.resources.limits.memory=256Mi \
  --wait

Then the ClickHouse Operator. I named the release cho to avoid ridiculously long pod names (the default altinity-clickhouse-operator creates pods like clickhouse-operator-altinity-clickhouse-operator-xyz):

helm repo add altinity https://helm.altinity.com

helm install cho altinity/altinity-clickhouse-operator \
  --namespace observability \
  --wait

PodSecurity warnings appear but pods run fine - the namespace is labeled privileged.

Creating the ClickHouse cluster

With the operator running, create a 2-replica ClickHouse cluster from a ClickHouseInstallation (chi) custom resource:

kubectl apply -f k8s/clickhouse-cluster.yaml

The operator creates the ClickHouse pods. Check status:

kubectl get chi -n observability

Service endpoint: clickhouse-echo.observability.svc.cluster.local
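
For orientation, here is a minimal sketch of what k8s/clickhouse-cluster.yaml might contain. The file itself is not shown in this post, so treat everything beyond the resource name echo, the namespace, the otel user, and the 2-replica layout as an assumption: the cluster name, the Keeper hostname, and the password placeholder are hypothetical. Replicated tables also need ZooKeeper or ClickHouse Keeper, deployed separately, and a real manifest would normally add storage templates.

```yaml
# Sketch only, not the actual manifest from this setup.
apiVersion: clickhouse.altinity.com/v1
kind: ClickHouseInstallation
metadata:
  name: echo                      # the operator exposes this as Service "clickhouse-echo"
  namespace: observability
spec:
  configuration:
    zookeeper:
      nodes:
        # Hypothetical Keeper/ZooKeeper service; replication needs one.
        - host: keeper.observability.svc.cluster.local
          port: 2181
    users:
      otel/password: "<your-password>"   # placeholder, use a secret in practice
    clusters:
      - name: main                # hypothetical cluster name
        layout:
          shardsCount: 1
          replicasCount: 2        # two replicas of a single shard
```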

The OTel Collector challenge

This is where things got interesting. I needed collectors to gather:

  • Container logs - from /var/log/pods on each node
  • Host metrics - CPU, memory, disk, network per node
  • Kubelet stats - pod/container resource usage from kubelet API
  • Kubernetes events - cluster-wide events (pod created, failed, etc.)
  • Pod status - which pods are Running, Pending, Failed

The catch: some data is node-specific (logs, host metrics, kubelet stats), some is cluster-wide (events, pod status).

DaemonSet vs Deployment

I learned the hard way: you can't collect node-specific data from a single Deployment. The kubeletstats receiver talks to the local kubelet on its own node. A single-replica Deployment runs on one node, so it only ever sees that node's kubelet, logs, and host metrics.

Solution: two collectors.

DaemonSet (otel-collector.yaml) - runs on every node:

  • filelog receiver (container logs)
  • hostmetrics receiver (CPU, memory, disk, network)
  • kubeletstats receiver (pod/container metrics from local kubelet)
  • otlp receiver (for apps to send traces/metrics)

Deployment (otel-collector-cluster.yaml) - runs once, cluster-wide:

  • k8s_cluster receiver (pod status, deployment info)
  • k8sobjects receiver (Kubernetes events)

kubectl apply -f k8s/otel-collector-rbac.yaml
kubectl apply -f k8s/otel-collector.yaml
kubectl apply -f k8s/otel-collector-cluster.yaml
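
The collector manifests are not reproduced in this post, so the following is only a sketch of how the node-level collector could be declared, assuming the v1beta1 API of the OpenTelemetry Operator and the contrib collector image. The resource name otel matches the cleanup commands later on; the image reference, intervals, scraper selection, and the ClickHouse exporter settings are assumptions. The filelog and kubeletstats receivers are left out here to keep it short.

```yaml
# Sketch only: node-level collector, one pod per node.
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel
  namespace: observability
spec:
  mode: daemonset
  # Assumption: the contrib distribution, which ships the receivers used here.
  image: otel/opentelemetry-collector-contrib   # pin a version tag in practice
  config:
    receivers:
      otlp:                        # apps send traces/metrics/logs here
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
      hostmetrics:                 # full host visibility may need host mounts
        collection_interval: 30s
        scrapers:
          cpu: {}
          memory: {}
          disk: {}
          network: {}
    processors:
      batch: {}
    exporters:
      clickhouse:
        endpoint: tcp://clickhouse-echo.observability.svc.cluster.local:9000
        username: otel
        password: "<your-password>"   # placeholder
    service:
      pipelines:
        metrics:
          receivers: [otlp, hostmetrics]
          processors: [batch]
          exporters: [clickhouse]
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [clickhouse]
```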

Fixing kubeletstats

The kubeletstats receiver was a pain. First it couldn't resolve the node hostname - Talos Linux nodes have hostnames that don't resolve in DNS. Fixed by injecting the node IP as a K8S_NODE_IP environment variable and using it in the receiver endpoint instead of the hostname:

env:
  - name: K8S_NODE_IP
    valueFrom:
      fieldRef:
        fieldPath: status.hostIP
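
On the receiver side, the variable is then referenced in the kubelet endpoint. A sketch of that fragment, assuming the kubelet's secure port 10250 and service-account authentication; the interval and the TLS setting are illustrative choices rather than values taken from this setup:

```yaml
# Fragment of the collector config (sits under spec.config in the CR).
receivers:
  kubeletstats:
    collection_interval: 30s
    auth_type: serviceAccount                       # use the pod's service account token
    endpoint: "https://${env:K8S_NODE_IP}:10250"    # node IP instead of the hostname
    insecure_skip_verify: true                      # assumption: kubelet cert is not verifiable
```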

Then: 403 Forbidden. The service account didn't have permission to access kubelet's /stats endpoint. Fixed by creating RBAC with nodes/stats and nodes/proxy permissions.
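
A sketch of the RBAC that grants this access. The ClusterRole and ClusterRoleBinding names match the ones deleted in the cleanup section; the service account name is an assumption, based on the operator's convention of naming it after the CR with a -collector suffix.

```yaml
# Sketch only: kubelet access for the node-level collector.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-collector
rules:
  - apiGroups: [""]
    resources: ["nodes/stats", "nodes/proxy"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-collector
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: otel-collector
subjects:
  - kind: ServiceAccount
    name: otel-collector          # assumed service account of the "otel" CR
    namespace: observability
```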

Then: deprecation warnings about CPU utilization metrics. Fixed with a feature gate passed through the collector args:

args:
  feature-gates: "+receiver.kubeletstats.enableCPUUsageMetrics"

Fixing filelog

Container logs weren't appearing. It turned out that start_at: end (the receiver's default) only collects logs written after the collector starts. I changed it to start_at: beginning to pick up existing logs. The collector also needs /var/log/pods mounted from the host.
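
Both changes can be expressed on the OpenTelemetryCollector resource. This is a sketch, not the actual manifest: the volume name and the include glob are assumptions, and any log parsing operators are omitted.

```yaml
# Fragment of the DaemonSet collector CR.
spec:
  volumes:
    - name: varlogpods            # hypothetical volume name
      hostPath:
        path: /var/log/pods
  volumeMounts:
    - name: varlogpods
      mountPath: /var/log/pods
      readOnly: true
  config:
    receivers:
      filelog:
        include:
          - /var/log/pods/*/*/*.log   # <namespace>_<pod>_<uid>/<container>/<n>.log
        start_at: beginning           # also read logs written before startup
        include_file_path: true
```

One design consequence worth knowing: without persisted read offsets, start_at: beginning makes the receiver re-read existing files whenever the collector pod restarts.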

Fixing pod status

Pod status showed "Unknown" in HyperDX. The kubeletstats receiver doesn't include pod phase - it only has resource metrics. The k8s_cluster receiver was missing. Added it in a separate Deployment (it needs cluster-wide view, not per-node).
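
A sketch of what the cluster-wide collector could look like, under the same assumptions as before (v1beta1 API, contrib image, ClickHouse exporter settings not taken from this setup). The name otel-cluster matches the cleanup commands; the intervals and the events group are illustrative.

```yaml
# Sketch only: single cluster-wide collector.
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-cluster
  namespace: observability
spec:
  mode: deployment
  replicas: 1                     # more replicas would duplicate cluster-wide data
  image: otel/opentelemetry-collector-contrib   # assumption; pin a version tag
  config:
    receivers:
      k8s_cluster:                # pod phase, deployment and node state as metrics
        collection_interval: 30s
      k8sobjects:                 # Kubernetes events as log records
        objects:
          - name: events
            mode: watch
            group: events.k8s.io
    processors:
      batch: {}
    exporters:
      clickhouse:
        endpoint: tcp://clickhouse-echo.observability.svc.cluster.local:9000
        username: otel
        password: "<your-password>"   # placeholder
    service:
      pipelines:
        metrics:
          receivers: [k8s_cluster]
          processors: [batch]
          exporters: [clickhouse]
        logs:
          receivers: [k8sobjects]
          processors: [batch]
          exporters: [clickhouse]
```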

Also hit RBAC issues: k8s_cluster receiver needs access to replicationcontrollers, services, resourcequotas, and more. Updated the ClusterRole.
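
For reference, the rules below follow the permissions listed in the k8s_cluster receiver's documentation, plus read access to events for k8sobjects. This is a sketch of rules to append to the ClusterRole, not the exact contents of k8s/otel-collector-rbac.yaml, and the list may need adjusting for a given collector version.

```yaml
# Fragment: additional ClusterRole rules for the cluster-wide collector.
rules:
  - apiGroups: [""]
    resources:
      - events
      - namespaces
      - namespaces/status
      - nodes
      - nodes/spec
      - pods
      - pods/status
      - replicationcontrollers
      - replicationcontrollers/status
      - resourcequotas
      - services
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["daemonsets", "deployments", "replicasets", "statefulsets"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["batch"]
    resources: ["jobs", "cronjobs"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["autoscaling"]
    resources: ["horizontalpodautoscalers"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["events.k8s.io"]
    resources: ["events"]
    verbs: ["get", "list", "watch"]
```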

Reconfiguring ClickStack

Finally, I pointed ClickStack at the operator-managed ClickHouse:

helm upgrade clickstack clickstack/clickstack \
  --namespace observability \
  -f k8s/clickstack-values-operators.yaml \
  --wait

The values file disables the built-in ClickHouse and OTel Collector, and overrides the connection string:

clickhouse:
  enabled: false
otel:
  enabled: false
hyperdx:
  defaultConnections: |
    [{"name": "Local ClickHouse", "host": "http://clickhouse-echo:8123", ...}]

I had to delete the MongoDB PVC once because MongoDB still held the old connection config. Starting from a fresh volume fixed it.

The final architecture

After all the fixes:

  • 3x otel-collector (DaemonSet) - one per node, collecting logs, host metrics, kubelet stats
  • 1x otel-cluster-collector (Deployment) - cluster-wide events and pod status
  • 2x ClickHouse replicas - data replication
  • 1x HyperDX - the UI
  • 1x MongoDB - HyperDX metadata

Access the UI

kubectl port-forward svc/clickstack-app -n observability 3000:3000

Open http://localhost:3000

Verification

Check logs are flowing:

kubectl exec -n observability <clickhouse-pod> -- \
  clickhouse-client --user otel --password <your-password> \
  -q "SELECT count() FROM otel_logs"

Check all pods healthy:

kubectl get pods -n observability

Files

File                                   Purpose
k8s/clickhouse-cluster.yaml            ClickHouse cluster custom resource (2 replicas)
k8s/otel-collector.yaml                OTel Collector DaemonSet (logs, host metrics, kubelet stats)
k8s/otel-collector-cluster.yaml        OTel Collector Deployment (k8s events, pod status)
k8s/otel-collector-rbac.yaml           RBAC for all collectors
k8s/clickstack-values-operators.yaml   ClickStack Helm values (operators mode)

Clean reinstall

If you need to wipe and reinstall (order matters - delete CRs before operators):

# Delete CRs first (operators handle finalizers)
kubectl delete opentelemetrycollector otel otel-cluster -n observability
kubectl delete chi echo -n observability

# Delete cluster-wide RBAC
kubectl delete clusterrolebinding otel-collector
kubectl delete clusterrole otel-collector

# Uninstall Helm releases
helm uninstall clickstack -n observability
helm uninstall cho -n observability
helm uninstall opentelemetry-operator -n observability

# Delete data
kubectl delete pvc --all -n observability

Then reinstall in order: operators first, RBAC, CRs, ClickStack last.

Written by Afanasy Barbarov, Tech Lead with 15+ years shipping production systems in Rust, Go, and TypeScript.