Observability is not an afterthought in this homelab; it's a first-class concern. Every workload is monitored, every log line is indexed, and every alert has a defined routing policy. The stack is designed to answer three questions: What is the system doing right now? What happened in the past? What should I be notified about?
The stack runs entirely in the monitoring namespace and is composed of well-established open-source projects, deployed via Helm.
```
┌───────────────────────────────────────────────────────────────────────┐
│                        grafana.kubefurlan.com                         │
│                   (dashboards, explore, alerting)                     │
├─────────────────────┬───────────────┬─────────────────────────────────┤
│       Metrics       │     Logs      │             Probes              │
│      Prometheus     │     Loki      │        Blackbox Exporter        │
├─────────────────────┴───────────────┴─────────────────────────────────┤
│ Grafana Alloy (DaemonSet) ◀─ K8s pod logs + host journal              │
│ Node Exporter             ◀─ host metrics (CPU, mem, disk, systemd)   │
│ kube-state-metrics        ◀─ K8s object metrics                       │
│ Pushgateway               ◀─ batch job metrics (backups, OCI bucket)  │
├───────────────────────────────────────────────────────────────────────┤
│ AlertManager ──▶ Telegram bot + Gmail                                 │
└───────────────────────────────────────────────────────────────────────┘
```
| Component | Helm Chart | Version | Type | Purpose |
|---|---|---|---|---|
| kube-prometheus-stack | prometheus-community/kube-prometheus-stack | 81.6.4 | Deployment | Prometheus, Grafana, AlertManager, node-exporter, kube-state-metrics |
| Loki | grafana/loki | – | StatefulSet | Log storage and querying (monolithic mode) |
| Grafana Alloy | grafana/alloy | – | DaemonSet | Log collection (K8s pods + host systemd journal) |
| Pushgateway | prometheus-community/prometheus-pushgateway | – | Deployment | Receives metrics from batch jobs |
| Blackbox Exporter | prometheus-community/prometheus-blackbox-exporter | – | Deployment | HTTP endpoint probing |
Prometheus is the time-series metrics database at the heart of the stack. It scrapes metrics from all configured targets using a pull model and stores them with a 5-day retention window in a 10 GiB PVC backed by local-path storage.
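The retention window and PVC sizing correspond to a small slice of the kube-prometheus-stack Helm values; roughly the following sketch (the `local-path` storage class name is an assumption based on the storage setup described later):

```yaml
# Sketch of the relevant kube-prometheus-stack values.
prometheus:
  prometheusSpec:
    retention: 5d                     # drop samples older than 5 days
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: local-path
          resources:
            requests:
              storage: 10Gi           # TSDB volume
```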
Host Metrics
via Node Exporter (DaemonSet)
CPU usage per core and aggregate
Memory usage and availability
Disk I/O and filesystem usage
Network traffic
System load average
Systemd service states: AdGuardHome, restic-k8s-backup, restic-host-backup, pg-dump, oci-bucket-stats
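Exposing those systemd unit states requires the node exporter's optional systemd collector. A possible values sketch for the prometheus-node-exporter subchart bundled with kube-prometheus-stack (the unit-include regex is illustrative, and the collector additionally needs access to the host's D-Bus socket):

```yaml
# Sketch: enable the systemd collector and limit it to the units of interest.
prometheus-node-exporter:
  extraArgs:
    - --collector.systemd
    - --collector.systemd.unit-include=(AdGuardHome|restic-k8s-backup|restic-host-backup|pg-dump|oci-bucket-stats)\.service
```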
Kubernetes Metrics
via kube-state-metrics
Pod status, restarts, resource requests/limits
Deployment and StatefulSet health
PersistentVolume usage
Node conditions
API server request rates
Batch Job Metrics
via Pushgateway
Backup scripts and cron jobs push metrics after completion:
restic_k8s_data: success, duration, files_new, files_changed
restic_host: success, duration, timestamp
pg_dump: success, duration, dump_size_bytes
oci_bucket (per prefix): size_bytes, object_count, limit_bytes
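A push from one of these scripts boils down to an HTTP `PUT` of metrics in the Prometheus text exposition format. A minimal Python sketch, assuming the Pushgateway is reachable at the chart's default in-cluster service name and port (metric names taken from the list above; values illustrative):

```python
# Hypothetical sketch of how a backup script could push completion metrics
# to the Pushgateway. Service DNS name and metric values are assumptions.
from urllib import request

PUSHGATEWAY = "http://prometheus-pushgateway.monitoring.svc.cluster.local:9091"

def format_metrics(metrics: dict) -> bytes:
    """Render metrics in the Prometheus text exposition format."""
    lines = [f"{name} {value}" for name, value in metrics.items()]
    return ("\n".join(lines) + "\n").encode()

def push(job: str, metrics: dict) -> None:
    """PUT replaces the whole metric group for `job` atomically."""
    req = request.Request(
        f"{PUSHGATEWAY}/metrics/job/{job}",
        data=format_metrics(metrics),
        method="PUT",
    )
    request.urlopen(req)

# Example payload a restic wrapper might push after a run:
payload = format_metrics({
    "restic_k8s_data_success": 1,
    "restic_k8s_data_duration_seconds": 182.4,
    "restic_k8s_data_files_new": 37,
})
```

Using `PUT` rather than `POST` means each run replaces the previous push for that job, so a stale success metric cannot linger after a failure.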
HTTP Probes
via Blackbox Exporter
Synthetic monitoring of external endpoints:
Jellyfin: http://192.168.1.133:8096/health → HTTP 200
Cloudflare Tunnel: http://192.168.1.133:20241/ready → HTTP 200
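One way to register such a target with the Prometheus Operator is a `Probe` resource. A sketch for the Jellyfin endpoint, assuming the blackbox exporter chart's default service name, port, and `http_2xx` module:

```yaml
# Sketch: operator-managed probe of the Jellyfin health endpoint.
apiVersion: monitoring.coreos.com/v1
kind: Probe
metadata:
  name: jellyfin
  namespace: monitoring
spec:
  module: http_2xx                 # expect any 2xx response
  prober:
    url: prometheus-blackbox-exporter.monitoring.svc.cluster.local:9115
  targets:
    staticConfig:
      static:
        - http://192.168.1.133:8096/health
```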
The Prometheus Operator manages scrape configurations through ServiceMonitor resources. The following monitors are active in the monitoring namespace:
| ServiceMonitor | Purpose |
|---|---|
| kube-prometheus-stack-alertmanager | AlertManager metrics |
| kube-prometheus-stack-apiserver | Kubernetes API server |
| kube-prometheus-stack-coredns | CoreDNS query metrics |
| kube-prometheus-stack-grafana | Grafana health |
| kube-prometheus-stack-kube-state-metrics | Kubernetes object state |
| kube-prometheus-stack-kubelet | Kubelet and cAdvisor metrics |
| kube-prometheus-stack-operator | Prometheus Operator itself |
| kube-prometheus-stack-prometheus | Prometheus self-monitoring |
| kube-prometheus-stack-prometheus-node-exporter | Host metrics |
| prometheus-pushgateway | Batch job metrics |
| traefik | Traefik request metrics (used by Argo Rollouts canary analysis) |
Note
Four kube-prometheus-stack monitors were intentionally disabled: kubeControllerManager, kubeScheduler, kubeEtcd, and kubeProxy. On a single-node kubeadm cluster, these components bind to 127.0.0.1, making them unreachable from the Prometheus pod.
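Disabling them is a matter of flipping the corresponding top-level toggles in the chart values:

```yaml
# kube-prometheus-stack values: skip control-plane components that bind
# to 127.0.0.1 on a single-node kubeadm cluster.
kubeControllerManager:
  enabled: false
kubeScheduler:
  enabled: false
kubeEtcd:
  enabled: false
kubeProxy:
  enabled: false
```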
Log aggregation is handled by the Loki + Alloy combination. Loki stores logs in a compact, indexed format queryable with LogQL. Alloy is the modern replacement for Promtail: a fully programmable pipeline agent.
```
K8s pods (all namespaces)      Host systemd journal
            └──────────┬──────────┘
                       ▼
          Grafana Alloy (DaemonSet)
          - adds labels (namespace, pod, app)
                       ▼
                     Loki
          5-day retention / 5 GiB PVC
                       ▼
         Grafana Explore (LogQL queries)
```
Alloy runs as a DaemonSet, ensuring it's present on every node. The loki-canary DaemonSet is a companion service that continuously writes test log entries to Loki and verifies they can be read back, providing a live health check of the log ingestion pipeline.
```logql
# All logs from a namespace
{namespace="monitoring"}

# Host systemd journal logs
{job="systemd-journal"} |= "AdGuardHome"

# Error-level journal entries
{job="systemd-journal"} | detected_level="error"

# Logs from a specific pod
{pod="n8n-c955cdc45-cxr6b"}
```
Grafana serves as the single pane of glass for all metrics and logs, accessible at https://grafana.kubefurlan.com via Cloudflare tunnel.
| Name | Type | Endpoint |
|---|---|---|
| Prometheus | prometheus | Auto-configured by kube-prometheus-stack |
| Loki | loki | http://loki.monitoring.svc.cluster.local:3100 |
| Alertmanager | alertmanager | Auto-configured |
A fully custom dashboard tracks the backup and storage posture of the cluster. It ingests metrics pushed to Pushgateway by all four backup layers (Velero + Kopia, restic PVC data, pg_dumpall, and host configuration) and surfaces backup status, duration trends, PostgreSQL dump size over time, and OCI Object Storage usage against the 20 GB free tier limit.
Full dashboard documentation and panel breakdown →
| Dashboard | Purpose |
|---|---|
| Node Exporter / Nodes | Host CPU, memory, disk, network |
| Kubernetes / Compute Resources / Cluster | Cluster-wide resource usage |
| Kubernetes / Compute Resources / Namespace (Pods) | Per-namespace breakdown |
| Kubernetes / Persistent Volumes | PV usage tracking |
| CoreDNS | DNS query rates and errors |
| Grafana Overview | Grafana self-health |
AlertManager receives fired alerts from Prometheus and routes them to the appropriate notification channels.
A custom PrometheusRule resource defines homelab-specific alerts:
| Alert | Condition | Severity | For |
|---|---|---|---|
| BackupFailed | backup_success == 0 | critical | 5m |
| BackupStale | Time since last run > 48h | warning | 5m |
| HighMemoryUsage | Memory > 90% | warning | 5m |
| HighCPUUsage | CPU > 90% | warning | 10m |
| HighDiskUsage | Root filesystem > 90% | critical | 5m |
| HostRestarted | Uptime < 5 minutes | warning | 0m |
| JellyfinDown | HTTP probe fails on :8096 | critical | 3m |
| CloudflareTunnelDown | HTTP probe fails on :20241 | critical | 3m |
| OCIBucketNearLimit | Bucket usage > 85% of 20 GB | warning | 5m |
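As a sketch, one of these rules expressed as a PrometheusRule resource might look like the following (alert name, expression, severity, and duration are from the table; the resource name and annotations are illustrative):

```yaml
# Sketch: homelab alert rule picked up by the Prometheus Operator.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: homelab-alerts
  namespace: monitoring
spec:
  groups:
    - name: backups
      rules:
        - alert: BackupFailed
          expr: backup_success == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Backup job {{ $labels.job }} reported failure"
```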
| Route | Receiver | Destination |
|---|---|---|
| severity = critical | critical | Telegram bot + Gmail |
| All other alerts | default | Telegram bot + Gmail |
| Watchdog | null | Silenced |
| InfoInhibitor | null | Silenced |
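In Alertmanager configuration terms, this routing tree could be declared roughly as follows (receiver names from the table; the matcher strings are a sketch using Alertmanager's matcher syntax):

```yaml
# Sketch: route critical alerts to the critical receiver, silence the
# built-in Watchdog/InfoInhibitor alerts, default everything else.
route:
  receiver: default
  routes:
    - matchers:
        - severity = "critical"
      receiver: critical
    - matchers:
        - alertname =~ "Watchdog|InfoInhibitor"
      receiver: "null"
receivers:
  - name: default      # Telegram bot + Gmail
  - name: critical     # Telegram bot + Gmail
  - name: "null"       # intentionally no notification config
```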
Critical alerts are sent immediately to both a Telegram bot (musashi_homelab_bot) and a Gmail address. The dual-channel approach ensures notifications are received even if one channel is unavailable.
The observability stack doesn't just monitor; it actively gates deployments. During every canary rollout of Backstage, Argo Rollouts queries Prometheus for real-time application health metrics derived from Traefik's ServiceMonitor:
```
Argo Rollouts creates AnalysisRun
┌────────────────────────────────────────────────┐
│ traefik_service_requests_total{code="5xx"}     │
│   ÷ traefik_service_requests_total             │
│   → must be ≤ 0.25                             │
│ histogram_quantile(0.95,                       │
│   traefik_service_request_duration_bucket)     │
│   → must be ≤ 5s                               │
│ kube_pod_container_status_restarts_total       │
└────────────────────────────────────────────────┘
Pass: continue canary → 100% promotion
Fail: automatic rollback to stable version
```
This turns Prometheus from a passive observer into an active safety gate, automatically protecting production from bad deployments.
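The error-rate gate above could be captured in an Argo Rollouts AnalysisTemplate along these lines (template name, interval, and the Prometheus service address are assumptions; the threshold matches the diagram, and `code=~"5.."` stands in for the shorthand `code="5xx"`):

```yaml
# Sketch: Prometheus-backed canary analysis for the error-rate check.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: backstage-health
spec:
  metrics:
    - name: error-rate
      interval: 30s
      successCondition: result[0] <= 0.25
      provider:
        prometheus:
          address: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090
          query: |
            sum(rate(traefik_service_requests_total{code=~"5.."}[2m]))
            /
            sum(rate(traefik_service_requests_total[2m]))
```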
| Component | Retention | Storage |
|---|---|---|
| Prometheus | 5 days | 10 GiB PVC (local-path) |
| Loki | 5 days (120h) | 5 GiB PVC (local-path) |
| AlertManager | N/A (config only) | 1 GiB PVC (local-path) |
| Grafana | N/A (dashboards + config) | 5 GiB PVC (local-path) |
All PVCs are stored under /opt/local-path-provisioner/ on the host SSD and are included in the daily restic backup to OCI Object Storage.
| Component | Memory |
|---|---|
| Prometheus | ~400–600 MB |
| Grafana | ~150–200 MB |
| Loki | ~256–400 MB |
| Alloy | ~128–200 MB |
| AlertManager | ~50 MB |
| kube-state-metrics | ~50 MB |
| node-exporter | ~30 MB |
| Pushgateway | ~20 MB |
| Blackbox Exporter | ~20 MB |
| **Total** | ~1.1–1.6 GB |