
Backup Strategy & Monitoring

Running a single-node Kubernetes cluster on bare metal means there is no built-in redundancy: no multi-node replication, no managed cloud snapshots. If the disk dies or a namespace is accidentally deleted, recovery depends entirely on backups. This article documents the full backup strategy: what is backed up, how, where, and how every backup run is monitored in real time.

A single backup approach fails in predictable ways. A Velero cluster snapshot doesn't help if the host OS is corrupted. A raw file copy doesn't capture the Kubernetes resource definitions needed to reassemble the cluster. The strategy is designed so that each layer covers a gap the others leave:

| Layer | What | Tool | Target |
| --- | --- | --- | --- |
| 1 | Kubernetes resource definitions + PV data | Velero + Kopia | OCI Object Storage |
| 2 | Raw PVC data (`/opt/local-path-provisioner/`) | restic | OCI Object Storage |
| 3 | PostgreSQL logical dump | `pg_dumpall` | Host disk → included in Layer 2 |
| 4 | Host configuration (`/home/ifurlan`, `/etc`) | restic | OCI Object Storage |

All remote backup data lands in Oracle Cloud Infrastructure (OCI) Object Storage, São Paulo region (sa-saopaulo-1). OCI's Always Free tier includes 20 GB of object storage, enough to hold compressed, deduplicated backup repositories across all four layers.

| Property | Value |
| --- | --- |
| Provider | Oracle Cloud Infrastructure |
| Region | sa-saopaulo-1 (São Paulo) |
| Storage class | Standard |
| Free tier | 20 GB |
| Cost | $0 |

The bucket is organized by prefix, one per backup job:

| Prefix | Job |
| --- | --- |
| `velero/` | Velero-managed Kopia repository |
| `restic-k8s/` | Raw PVC data (restic) |
| `restic-host/` | Host configuration (restic) |
| `pg-dump/` | PostgreSQL dump files |

OCI Object Storage bucket showing prefix breakdown and total usage against the 20 GB free tier

OCI access is configured through an API key pair. Credentials are stored as environment variables in the respective systemd service unit files and as a Kubernetes Secret for Velero's BackupStorageLocation.
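For the restic jobs, those credentials are the standard S3-compatible environment variables restic reads for `s3:` repositories. A sketch (the values are placeholders, and the env-file layout is an assumption; only the variable names are restic's real interface):

```shell
# S3-compatible credentials for OCI Object Storage (values are placeholders).
# restic reads AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY for s3: repositories.
export AWS_ACCESS_KEY_ID="customer-secret-key-id"
export AWS_SECRET_ACCESS_KEY="customer-secret-key"

# Repository = region endpoint + bucket + prefix (see the prefix table above).
export RESTIC_REPOSITORY="s3:https://objectstorage.sa-saopaulo-1.oraclecloud.com/homelab-backups/restic-k8s"
export RESTIC_PASSWORD="repository-encryption-passphrase"
```

In a systemd unit, the same variables would typically live in an `EnvironmentFile` rather than being exported inline.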


Layer 1: Velero + Kopia (Kubernetes Resources + Volumes)


Velero is the industry-standard Kubernetes backup tool. It captures two categories of data:

  • Resource definitions: every Kubernetes object in the targeted namespaces (Deployments, Services, ConfigMaps, Secrets, PVCs, CRDs, etc.)
  • PersistentVolume data: the actual bytes inside PVCs, using a pluggable backup driver

Velero originally used restic as its volume backup driver. Starting with Velero v1.10, Kopia became the recommended (and now default) driver. The reasons:

| Feature | restic | Kopia |
| --- | --- | --- |
| Deduplication | Chunk-level | Chunk-level |
| Encryption | AES-256 | AES-256-GCM |
| Compression | No | Yes (zstd) |
| Parallel uploads | Limited | Yes (concurrent) |
| Repository locking | Pessimistic (slow) | Optimistic (fast) |
| Maintenance jobs | Manual | Automated |

Kopia's parallel upload capability and optimistic locking make it significantly faster for large PVCs. Automated maintenance (compaction, GC) keeps the repository size bounded without cron jobs.

```
┌─────────────────────────────────────────────────────────┐
│ velero namespace                                        │
│                                                         │
│  ┌──────────────────────────┐   ┌─────────────────┐     │
│  │ velero (pod)             │   │ node-agent      │     │
│  │ - Schedules              │   │ (DaemonSet)     │     │
│  │ - Orchestrates           │   │ - Mounts PVCs   │     │
│  │ - BackupStorageLocation  │   │ - Kopia         │     │
│  └────────┬─────────────────┘   └───────┬─────────┘     │
│           │                             │               │
└───────────┼─────────────────────────────┼───────────────┘
            │                             │
            ▼                             ▼
    K8s API (resource              PVC host paths
    snapshots → S3)                → Kopia repo
            │                             │
            └──────────┬──────────────────┘
                       ▼
              OCI Object Storage
               (velero/ prefix)
```

The velero pod orchestrates backup scheduling and communicates with the Kubernetes API to snapshot resource definitions. The node-agent DaemonSet runs on the host and mounts PVC directories directly from the local-path storage paths, bypassing the need for storage-provider snapshots.

```yaml
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: community.openshift.io/object-store
  objectStorage:
    bucket: homelab-backups
    prefix: velero
  config:
    region: sa-saopaulo-1
    s3Url: https://objectstorage.sa-saopaulo-1.oraclecloud.com
    s3ForcePathStyle: "true"
```

Separate schedules are defined for each critical namespace, allowing independent retention policies and granular restore points:

| Schedule | Namespace | Cadence |
| --- | --- | --- |
| backstage-daily | backstage | Daily |
| harbor-daily | harbor | Daily |
| n8n-daily | n8n | Daily |
| postgres-daily | postgres | Daily |
| cnpg-system-daily | cnpg-system | Daily |
| velero-daily | velero | Daily (includes sealed-secrets backup) |
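Each entry is a Velero `Schedule` resource. A sketch of what one might look like (the cron expression and TTL here are illustrative assumptions, not the live values):

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: backstage-daily
  namespace: velero
spec:
  schedule: "0 3 * * *"      # assumed cadence: daily at 03:00
  template:
    includedNamespaces:
      - backstage
    ttl: 720h0m0s            # assumed retention: 30 days
```

On each trigger, Velero stamps out a Backup object from the `template`, which is what `velero backup get` later lists.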

After each backup, Velero triggers a maintenance job that runs compaction and garbage collection on the Kopia repository. These appear as short-lived pods in the velero namespace:

```
velero   backstage-default-kopia-maintain-job     Completed
velero   harbor-default-kopia-maintain-job        Completed
velero   n8n-default-kopia-maintain-job           Completed
velero   postgres-default-kopia-maintain-job      Completed
velero   cnpg-system-default-kopia-maintain-job   Completed
velero   velero-default-kopia-maintain-job        Completed
```

Maintenance keeps the repository lean by removing unreferenced data chunks left by expired or deleted backups.


Layer 2: restic (Raw PVC Data)

Velero covers Kubernetes-aware backups, but a second restic job backs up the raw PVC data on the host independently. This provides a safety net that doesn't depend on Velero or the Kubernetes API:

```
/opt/local-path-provisioner/  ←── all PVC data, all namespaces
              │
              ▼
        restic backup
  (systemd timer: daily)
              │
              ▼
OCI Object Storage (restic-k8s/ prefix)
```

The backup uses a restic repository initialized directly in the OCI bucket. Restic deduplicates and compresses data before uploading, so only changed chunks are sent on each run.
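The unit pair behind this job might look like the following sketch (paths, the env-file location, and the exact `OnCalendar` value are assumptions):

```ini
# /etc/systemd/system/restic-k8s-backup.timer (sketch)
[Unit]
Description=Daily restic backup of PVC data

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target

# /etc/systemd/system/restic-k8s-backup.service (sketch)
[Unit]
Description=restic backup of /opt/local-path-provisioner

[Service]
Type=oneshot
EnvironmentFile=/etc/restic/k8s.env
ExecStart=/usr/local/bin/restic-k8s-backup.sh
```

`Persistent=true` makes systemd fire a missed run at the next boot, which matters on a single host that may be powered off at the scheduled time.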

The job is managed by a systemd timer (restic-k8s-backup.timer) that fires the restic-k8s-backup.service unit daily. After completing, the service pushes metrics to Prometheus Pushgateway:

```
# Metrics pushed after each run
backup_success{job="restic_k8s_data"} 1
backup_duration_seconds{job="restic_k8s_data"} 42.3
backup_files_new{job="restic_k8s_data"} 12
backup_files_changed{job="restic_k8s_data"} 87
backup_timestamp{job="restic_k8s_data"} 1748000000
```

These metrics power the backup monitoring dashboard in Grafana (covered below).
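The push step at the end of the backup script could be sketched like this (the stand-in values, script shape, and Pushgateway URL are assumptions; the `/metrics/job/<name>` path is Pushgateway's real push API):

```shell
#!/bin/sh
# Sketch of the metrics-push step; in the real unit, $rc and the counters
# come from the surrounding restic run. Job name matches the dashboard queries.
set -eu

rc=0                 # exit code of `restic backup` (stand-in)
duration=42.3        # seconds, measured around the restic call (stand-in)
files_new=12
files_changed=87

if [ "$rc" -eq 0 ]; then success=1; else success=0; fi

payload="backup_success{job=\"restic_k8s_data\"} $success
backup_duration_seconds{job=\"restic_k8s_data\"} $duration
backup_files_new{job=\"restic_k8s_data\"} $files_new
backup_files_changed{job=\"restic_k8s_data\"} $files_changed
backup_timestamp{job=\"restic_k8s_data\"} $(date +%s)"

printf '%s\n' "$payload"
# The real unit then ships it (URL assumed):
#   printf '%s\n' "$payload" | curl --data-binary @- \
#     http://localhost:9091/metrics/job/restic_k8s_data
```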


Layer 3: PostgreSQL Logical Dump (pg_dumpall)

Application data lives in the CloudNativePG-managed PostgreSQL cluster (postgres namespace). Even though the PVC data is backed up by both Velero (Layer 1) and restic (Layer 2), a logical dump adds an extra recovery option: restoring into a clean database instance without needing to replay a PVC snapshot.

A daily systemd timer runs pg_dumpall and writes the dump to the host filesystem. The dump file is then automatically included in the Layer 2 restic backup.

```
pg_dumpall (all databases)
            │
            ▼
/var/backups/pg_dump/dump.sql
            │
            ▼
Included in Layer 2 restic backup
            │
            ▼
OCI Object Storage (restic-k8s/ prefix)
```
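The dump step can be sketched as follows (the real path is `/var/backups/pg_dump`; here `pg_dumpall` is stubbed out so the sketch runs without a live cluster). Writing to a temp file and renaming means the restic job never picks up a half-written dump:

```shell
#!/bin/sh
set -eu
DUMP_DIR=$(mktemp -d)                               # real unit: /var/backups/pg_dump
dump() { echo "-- stand-in for: pg_dumpall -U postgres"; }

# Write atomically: rename is the commit point, so dump.sql is always complete.
dump > "$DUMP_DIR/dump.sql.tmp"
mv "$DUMP_DIR/dump.sql.tmp" "$DUMP_DIR/dump.sql"
echo "wrote $(wc -c < "$DUMP_DIR/dump.sql") bytes to $DUMP_DIR/dump.sql"
```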

After each run, the dump size is pushed to Pushgateway:

```
pg_dump_size_bytes{job="pg_dump"} 4823042
pg_dump_success{job="pg_dump"} 1
pg_dump_duration_seconds{job="pg_dump"} 8.2
```

The dump size trend in Grafana serves as an early warning: an unexpected drop suggests a database or dump problem, while an unexpected spike points to unusual data growth.
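The same signal can be checked programmatically. A sketch of the trend check (the function name and thresholds are illustrative, not the live alert values):

```python
def classify_dump_trend(prev_bytes: int, curr_bytes: int,
                        drop_pct: float = 0.30, spike_pct: float = 0.50) -> str:
    """Flag suspicious day-over-day changes in the pg_dump size."""
    change = (curr_bytes - prev_bytes) / prev_bytes
    if change <= -drop_pct:
        return "drop"    # possible database or dump failure
    if change >= spike_pct:
        return "spike"   # unusual data growth
    return "ok"

print(classify_dump_trend(4_800_000, 4_823_042))  # small growth -> ok
print(classify_dump_trend(4_800_000, 1_200_000))  # -75% -> drop
```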


Layer 4: Host Configuration (restic)

The host OS and user configuration are backed up to a separate restic repository (restic-host/ prefix in OCI). This covers the files that wouldn't exist in a Kubernetes PVC:

| Path | Contents |
| --- | --- |
| `/home/ifurlan` | Shell configs, SSH keys, scripts, dotfiles |
| `/etc` | System configuration, network settings, systemd units |

A separate restic repository is used (not the same as Layer 2) so that retention and exclusion policies can be tuned independently. Large binary files and cache directories are excluded via a .resticignore file.
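A sketch of what such an exclude file might contain (the entries are illustrative; restic is pointed at a file like this with `--exclude-file`):

```
# caches and large regenerable data, excluded from the host backup
/home/ifurlan/.cache
/home/ifurlan/.local/share/containers
/home/ifurlan/**/node_modules
```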

After each run, metrics are pushed to Pushgateway:

```
backup_success{job="restic_host"} 1
backup_duration_seconds{job="restic_host"} 15.6
backup_timestamp{job="restic_host"} 1748000000
```

OCI Bucket Statistics

A fifth systemd service (oci-bucket-stats.service) runs after all backup jobs and queries the OCI API to collect bucket-level statistics. These are pushed to Pushgateway:

```
oci_bucket_size_bytes{prefix="velero"} 1234567890
oci_bucket_size_bytes{prefix="restic-k8s"} 987654321
oci_bucket_size_bytes{prefix="restic-host"} 456789012
oci_bucket_size_bytes{prefix="pg-dump"} 123456789
oci_bucket_object_count{prefix="velero"} 4821
oci_bucket_limit_bytes 21474836480
```

The oci_bucket_limit_bytes metric represents the 20 GB OCI free tier ceiling. Tracking the total usage against this limit drives the OCIBucketNearLimit alert, which fires when usage exceeds 85% of 20 GB.
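The threshold arithmetic, as a quick check:

```python
limit_bytes = 20 * 1024**3           # 21474836480, matches oci_bucket_limit_bytes
alert_at = int(0.85 * limit_bytes)   # OCIBucketNearLimit fires above this
print(alert_at)                      # 18253611008 bytes, i.e. exactly 17 GiB
```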


Monitoring Dashboard

Every backup layer feeds metrics into Prometheus Pushgateway, making the full backup posture visible in Grafana. The custom "Homelab Backup & Storage" dashboard provides an at-a-glance view of backup health across all jobs.

Custom Grafana "Homelab Backup & Storage" dashboard β€” all backup layers showing green status

Backup Status

Type: Stat panels (one per job)

Shows OK (green) or FAILED (red) for each backup job. Backed by backup_success from Pushgateway. These are the most important panels: any red is immediately actionable.

Jobs tracked: restic_k8s_data, restic_host, pg_dump

Time Since Last Backup

Type: Stat panels

Displays how long ago the last successful backup ran, derived from backup_timestamp. Color thresholds:

  • Green: < 25 hours
  • Yellow: 25–48 hours
  • Red: > 48 hours

Red here triggers the BackupStale alert in AlertManager.
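The threshold logic behind those colors is simple enough to sketch (function name is illustrative; the hour boundaries are the ones listed above):

```python
def backup_freshness(hours_since_last: float) -> str:
    """Map time-since-last-backup to the dashboard color."""
    if hours_since_last < 25:
        return "green"
    if hours_since_last <= 48:
        return "yellow"
    return "red"     # BackupStale territory

print(backup_freshness(20))   # green
print(backup_freshness(72))   # red
```

The 25-hour green boundary deliberately leaves an hour of slack over the daily cadence, so a slightly late run doesn't flap the panel.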

Backup Duration

Type: Bar chart (time series)

Historical backup durations per job. Useful for spotting regressions: a job that suddenly takes 3× longer than usual may indicate a large new file, a slow OCI connection, or repository fragmentation.

Backup Files Changed

Type: Stacked bars

Shows files_new and files_changed per restic run. Helps understand backup churn: high change volume correlates with longer backup durations and higher OCI egress.

PostgreSQL Dump Size

Type: Line chart

Tracks pg_dump_size_bytes over time. The trend line shows database growth. A sudden drop is a red flag (backup or dump failure); gradual growth is expected and healthy.

OCI Bucket Usage vs Limit

Type: Gauge

Shows total OCI bucket usage against the 20 GB free tier ceiling. The gauge turns yellow at 85% and red at 95%, matching the OCIBucketNearLimit alert threshold.

OCI Bucket Size by Prefix

Type: Pie chart (donut)

Breaks down bucket usage by prefix (velero/, restic-k8s/, restic-host/, pg-dump/). Makes it easy to see which backup layer is consuming the most storage.

OCI Bucket Objects by Prefix

Type: Bar gauge

Shows the object count per prefix. Kopia/Velero tends to produce many small objects (chunks); a rapidly growing object count can indicate that maintenance jobs are not running.


Alerting

Backup alerts are defined in the homelab-alerts PrometheusRule and routed by AlertManager to Telegram and Gmail:

| Alert | Condition | Severity |
| --- | --- | --- |
| BackupFailed | `backup_success == 0` for any job | critical |
| BackupStale | `time() - backup_timestamp > 172800` (48 h) | warning |
| OCIBucketNearLimit | Total bucket usage > 85% of 20 GB | warning |
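Translated into a PrometheusRule, those conditions might look like this sketch (group name, label set, and the `scalar()` division are assumptions; the expressions mirror the table):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: homelab-alerts
spec:
  groups:
    - name: backups
      rules:
        - alert: BackupFailed
          expr: backup_success == 0
          labels:
            severity: critical
        - alert: BackupStale
          expr: time() - backup_timestamp > 172800
          labels:
            severity: warning
        - alert: OCIBucketNearLimit
          expr: sum(oci_bucket_size_bytes) / scalar(oci_bucket_limit_bytes) > 0.85
          labels:
            severity: warning
```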

Critical alerts fire immediately to both the Telegram bot (musashi_homelab_bot) and Gmail. Warning alerts follow the same path.


Restore Procedures

```shell
# List available backups
velero backup get

# Restore a specific namespace from a backup
velero restore create --from-backup <backup-name> \
  --include-namespaces backstage

# Watch restore progress
velero restore describe <restore-name>
```

velero backup get output showing recent successful backups across all scheduled namespaces

```shell
# List snapshots in the restic repository
restic -r s3:https://objectstorage.sa-saopaulo-1.oraclecloud.com/homelab-backups/restic-k8s \
  snapshots

# Restore a specific snapshot to a target directory
restic -r <repo> restore <snapshot-id> \
  --target /opt/local-path-provisioner/
```

```shell
# Restore into a running PostgreSQL instance
psql -U postgres < /var/backups/pg_dump/dump.sql
```

```shell
# Restore the host configuration
restic -r s3:https://objectstorage.sa-saopaulo-1.oraclecloud.com/homelab-backups/restic-host \
  restore latest --target /
```

| Layer | RPO (max data loss) | RTO (recovery time) |
| --- | --- | --- |
| Velero + Kopia | ~24 hours | Minutes (velero restore) |
| restic PVC backup | ~24 hours | Minutes to hours (depends on PVC size) |
| pg_dumpall | ~24 hours | Minutes |
| Host config | ~24 hours | Minutes |

All jobs run daily. RPO is therefore bounded at approximately 24 hours across all layers.