ClickStack + HyperDX Observability with Kubernetes Operators
by Afanasy Barbarov
The journey from basic ClickStack to production-grade observability with Kubernetes operators.
The starting point
I began with the simplest possible ClickStack deployment - a single Helm install that bundles everything: ClickHouse, OpenTelemetry Collector, MongoDB, and the HyperDX UI. Four pods, zero configuration headaches.
helm install clickstack clickstack/clickstack \
  --namespace observability \
  -f k8s/clickstack-values-echo.yaml

It worked. I could see the HyperDX dashboard, poke around the UI. But this wasn't production-ready: single ClickHouse pod (no replication), single OTel Collector (can't collect node-specific metrics from multiple nodes), no high availability.
Why operators?
ClickStack's docs recommend using Kubernetes operators for production. The idea: instead of Helm managing individual pods, specialized operators manage complex stateful applications via Custom Resource Definitions (CRDs).
Two operators are needed:
- OpenTelemetry Operator - creates and manages OTel Collectors from OpenTelemetryCollector custom resources. Supports DaemonSets (one collector per node) and Deployments (cluster-wide collectors).
- Altinity ClickHouse Operator - creates and manages ClickHouse clusters from ClickHouseInstallation custom resources. Handles replication, sharding, user management.
Installing the operators
First, the OpenTelemetry Operator. I don't use cert-manager, so I let the chart auto-generate the webhook certificates:
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm install opentelemetry-operator open-telemetry/opentelemetry-operator \
  --namespace observability \
  --set admissionWebhooks.certManager.enabled=false \
  --set admissionWebhooks.autoGenerateCert.enabled=true \
  --set manager.resources.requests.cpu=100m \
  --set manager.resources.requests.memory=128Mi \
  --set manager.resources.limits.cpu=200m \
  --set manager.resources.limits.memory=256Mi \
  --wait

Then the ClickHouse Operator. I named the release cho to avoid ridiculously long pod names (the default altinity-clickhouse-operator creates pods like clickhouse-operator-altinity-clickhouse-operator-xyz):
helm repo add altinity https://helm.altinity.com
helm install cho altinity/altinity-clickhouse-operator \
  --namespace observability \
  --wait

PodSecurity warnings appear but pods run fine - the namespace is labeled privileged.
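For reference, a namespace labeled privileged under Pod Security Admission looks roughly like the sketch below. The label keys are the standard Pod Security Admission ones. Whether the cluster also applies a stricter default for the warn level (which would explain warnings on an otherwise privileged namespace) is an assumption, not something shown in this setup.

```yaml
# Sketch: namespace with Pod Security Admission enforcement relaxed.
apiVersion: v1
kind: Namespace
metadata:
  name: observability
  labels:
    # Pods are admitted regardless of their security context.
    pod-security.kubernetes.io/enforce: privileged
    # Assumption: if the cluster sets a stricter default "warn" level,
    # uncommenting this line should silence the warnings as well.
    # pod-security.kubernetes.io/warn: privileged
```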
Creating the ClickHouse cluster
With the operator running, create a 2-replica ClickHouse cluster by applying a ClickHouseInstallation resource:
kubectl apply -f k8s/clickhouse-cluster.yaml

The operator creates the ClickHouse pods. Check status:
kubectl get chi -n observability

Service endpoint: clickhouse-echo.observability.svc.cluster.local
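The contents of k8s/clickhouse-cluster.yaml are not shown here, so the following is only a minimal sketch of what a 2-replica ClickHouseInstallation might look like. The resource name echo matches the clickhouse-echo service name (the operator's default service naming), and otel is the user queried later in the post. The cluster name, the network rule, and the Keeper host are hypothetical placeholders.

```yaml
# Sketch only: not the author's actual k8s/clickhouse-cluster.yaml.
apiVersion: clickhouse.altinity.com/v1
kind: ClickHouseInstallation
metadata:
  name: echo                 # by default exposed as service clickhouse-echo
  namespace: observability
spec:
  configuration:
    users:
      # "otel" is the user used for verification later; password is a placeholder.
      otel/password: "<your-password>"
      otel/networks/ip:
        - "::/0"             # assumption: allow connections from any address
    zookeeper:
      nodes:
        # Assumption: replicated tables need a ZooKeeper or ClickHouse Keeper
        # endpoint. This host is a placeholder, not taken from the post.
        - host: keeper.observability.svc.cluster.local
          port: 2181
    clusters:
      - name: main           # hypothetical cluster name
        layout:
          shardsCount: 1
          replicasCount: 2   # two replicas of a single shard
```

A real manifest would usually also define pod and volume claim templates for storage and resources. Those are left out to keep the sketch short.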
The OTel Collector challenge
This is where things got interesting. I needed collectors to gather:
- Container logs - from /var/log/pods on each node
- Host metrics - CPU, memory, disk, network per node
- Kubelet stats - pod/container resource usage from kubelet API
- Kubernetes events - cluster-wide events (pod created, failed, etc.)
- Pod status - which pods are Running, Pending, Failed
The catch: some data is node-specific (logs, host metrics, kubelet stats), some is cluster-wide (events, pod status).
DaemonSet vs Deployment
I learned the hard way: you can't collect node-specific data from a single Deployment. Logs and host metrics live on each node, and the kubeletstats receiver talks to the local kubelet. A single-replica Deployment runs on one node, so it only sees that node's files and kubelet.
Solution: two collectors.
DaemonSet (otel-collector.yaml) - runs on every node:
- filelog receiver (container logs)
- hostmetrics receiver (CPU, memory, disk, network)
- kubeletstats receiver (pod/container metrics from local kubelet)
- otlp receiver (for apps to send traces/metrics)
Deployment (otel-collector-cluster.yaml) - runs once, cluster-wide:
- k8s_cluster receiver (pod status, deployment info)
- k8sobjects receiver (kubernetes events)
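As a rough sketch of the DaemonSet half of this split, an OpenTelemetryCollector resource might look like the following. It assumes the v1beta1 API (structured config), a contrib collector image, and collectors that write straight to ClickHouse through the contrib clickhouse exporter. The image tag, service account name, and intervals are hypothetical. The filelog and kubeletstats receivers are omitted here for brevity.

```yaml
# Sketch only: not the author's actual k8s/otel-collector.yaml.
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel                      # pods come up as otel-collector-*
  namespace: observability
spec:
  mode: daemonset                 # one collector pod per node
  # Assumption: contrib build, since these receivers are not in the core image.
  # Pin a specific version in practice.
  image: otel/opentelemetry-collector-contrib:latest
  serviceAccount: otel-collector  # hypothetical name, bound by the RBAC manifest
  config:
    receivers:
      otlp:                       # apps send traces/metrics here
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
      hostmetrics:
        collection_interval: 30s
        scrapers:
          cpu: {}
          memory: {}
          disk: {}
          network: {}
    processors:
      batch: {}
    exporters:
      clickhouse:
        endpoint: tcp://clickhouse-echo.observability.svc.cluster.local:9000
        username: otel
        password: "<your-password>"
    service:
      pipelines:
        metrics:
          receivers: [otlp, hostmetrics]
          processors: [batch]
          exporters: [clickhouse]
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [clickhouse]
```

The cluster-wide collector would follow the same shape with mode: deployment and the k8s_cluster and k8sobjects receivers instead. To report host rather than container values, hostmetrics typically also needs the host filesystem mounted and root_path set, which this sketch leaves out.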
kubectl apply -f k8s/otel-collector-rbac.yaml
kubectl apply -f k8s/otel-collector.yaml
kubectl apply -f k8s/otel-collector-cluster.yaml

Fixing kubeletstats
The kubeletstats receiver was a pain. First it couldn't resolve the node hostname - Talos Linux nodes have hostnames that don't resolve in DNS. Fixed by injecting the node IP as a K8S_NODE_IP environment variable and using it in place of the hostname:
env:
  - name: K8S_NODE_IP
    valueFrom:
      fieldRef:
        fieldPath: status.hostIP

Then: 403 Forbidden. The service account didn't have permission to access kubelet's /stats endpoint. Fixed by creating RBAC with nodes/stats and nodes/proxy permissions.
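For the hostname part of the fix, the receiver still has to be pointed at that variable. A minimal sketch of the receiver config, assuming the kubelet's secure port 10250 and service-account authentication (the interval and TLS setting are illustrative choices, not taken from the actual manifest):

```yaml
# Fragment of the collector config; endpoint and TLS settings are assumptions.
receivers:
  kubeletstats:
    collection_interval: 30s
    # Authenticate with the pod's service account token (this is what RBAC gates).
    auth_type: serviceAccount
    # Node IP from the environment instead of an unresolvable hostname.
    endpoint: "https://${env:K8S_NODE_IP}:10250"
    # Assumption: the kubelet serving certificate is not trusted by the collector.
    insecure_skip_verify: true
```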
Then: deprecation warnings about CPU utilization metrics. Fixed with a feature gate:
args:
  feature-gates: "+receiver.kubeletstats.enableCPUUsageMetrics"

Fixing filelog
Container logs weren't appearing. Turned out start_at: end only collects new logs written after the collector starts. Changed to start_at: beginning to catch existing logs. Also needed to mount /var/log/pods from the host.
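Both changes could be sketched as a fragment of the OpenTelemetryCollector spec like the one below. The volume name, glob pattern, and the container parsing operator are assumptions. Only start_at: beginning and the /var/log/pods host mount come from the fix described above.

```yaml
# Fragment of an OpenTelemetryCollector spec (v1beta1-style structured config assumed).
spec:
  volumes:
    - name: varlogpods            # hypothetical volume name
      hostPath:
        path: /var/log/pods
  volumeMounts:
    - name: varlogpods
      mountPath: /var/log/pods
      readOnly: true
  config:
    receivers:
      filelog:
        include:
          - /var/log/pods/*/*/*.log
        start_at: beginning       # read files from the top, not only new lines
        include_file_path: true
        operators:
          # Parses container runtime log formats; needs a recent contrib release.
          - type: container
```

One trade-off to keep in mind: with start_at: beginning and no persisted offsets, a restarted collector may re-read files it has already shipped. The filelog receiver can checkpoint through a storage extension if duplicates become a problem.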
Fixing pod status
Pod status showed "Unknown" in HyperDX. The kubeletstats receiver doesn't include pod phase - it only has resource metrics. The k8s_cluster receiver was missing. Added it in a separate Deployment (it needs cluster-wide view, not per-node).
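A minimal sketch of the receivers for that separate Deployment follows. The intervals and the single replica are assumptions. The k8s_cluster receiver reports pod phase as the k8s.pod.phase metric, which is the signal kubeletstats lacks.

```yaml
# Fragment of a Deployment-mode OpenTelemetryCollector (v1beta1-style config assumed).
spec:
  mode: deployment
  replicas: 1                     # a single instance avoids duplicate cluster-level data
  config:
    receivers:
      k8s_cluster:
        auth_type: serviceAccount
        collection_interval: 30s
      k8sobjects:
        auth_type: serviceAccount
        objects:
          - name: events
            mode: watch           # stream events as they happen
```

In the service section, k8s_cluster would feed a metrics pipeline and k8sobjects a logs pipeline. Exporters are omitted here.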
Also hit RBAC issues: k8s_cluster receiver needs access to replicationcontrollers, services, resourcequotas, and more. Updated the ClusterRole.
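Putting the kubeletstats and k8s_cluster permissions together, the ClusterRole might look roughly like this. The role and binding name otel-collector matches the cleanup commands later in the post. The service account name and the exact resource list are assumptions based on what these receivers typically watch.

```yaml
# Sketch only: not the author's actual k8s/otel-collector-rbac.yaml.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-collector
rules:
  # kubeletstats: read stats from the kubelet API
  - apiGroups: [""]
    resources: ["nodes/stats", "nodes/proxy"]
    verbs: ["get"]
  # k8s_cluster and k8sobjects: watch core objects
  - apiGroups: [""]
    resources: ["events", "namespaces", "nodes", "pods", "services", "replicationcontrollers", "resourcequotas"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["daemonsets", "deployments", "replicasets", "statefulsets"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["batch"]
    resources: ["jobs", "cronjobs"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["autoscaling"]
    resources: ["horizontalpodautoscalers"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-collector
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: otel-collector
subjects:
  - kind: ServiceAccount
    name: otel-collector          # hypothetical service account name
    namespace: observability
```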
Reconfiguring ClickStack
Finally, I pointed ClickStack at the operator-managed ClickHouse:
helm upgrade clickstack clickstack/clickstack \
  --namespace observability \
  -f k8s/clickstack-values-operators.yaml \
  --wait

The values file disables built-in ClickHouse and OTel, and overrides the connection string:
clickhouse:
  enabled: false
otel:
  enabled: false
hyperdx:
  defaultConnections: |
    [{"name": "Local ClickHouse", "host": "http://clickhouse-echo:8123", ...}]

I had to delete the MongoDB PVC once because it cached the old connection config. Fresh start fixed it.
The final architecture
After all the fixes:
- 3x otel-collector (DaemonSet) - one per node, collecting logs, host metrics, kubelet stats
- 1x otel-cluster-collector (Deployment) - cluster-wide events and pod status
- 2x ClickHouse replicas - data replication
- 1x HyperDX - the UI
- 1x MongoDB - HyperDX metadata
Access the UI
kubectl port-forward svc/clickstack-app -n observability 3000:3000

Open http://localhost:3000
Verification
Check logs are flowing:
kubectl exec -n observability <clickhouse-pod> -- \
  clickhouse-client --user otel --password <your-password> \
  -q "SELECT count() FROM otel_logs"

Check all pods healthy:
kubectl get pods -n observability

Files
| File | Purpose |
|---|---|
| k8s/clickhouse-cluster.yaml | ClickHouse cluster CRD (2 replicas) |
| k8s/otel-collector.yaml | OTel Collector DaemonSet (logs, host metrics, kubelet stats) |
| k8s/otel-collector-cluster.yaml | OTel Collector Deployment (k8s events, pod status) |
| k8s/otel-collector-rbac.yaml | RBAC for all collectors |
| k8s/clickstack-values-operators.yaml | ClickStack Helm values (operators mode) |
Clean reinstall
If you need to wipe and reinstall (order matters - delete CRs before operators):
# Delete CRs first (operators handle finalizers)
kubectl delete opentelemetrycollector otel otel-cluster -n observability
kubectl delete chi echo -n observability
# Delete cluster-wide RBAC
kubectl delete clusterrolebinding otel-collector
kubectl delete clusterrole otel-collector
# Uninstall Helm releases
helm uninstall clickstack -n observability
helm uninstall cho -n observability
helm uninstall opentelemetry-operator -n observability
# Delete data
kubectl delete pvc --all -n observability

Then reinstall in order: operators first, RBAC, CRs, ClickStack last.