
GitOps with ArgoCD & Argo Rollouts

GitOps is the operational model where Git is the single source of truth for both application and infrastructure state. Every change to a running system is made through a Git commit, never by running kubectl apply directly against the cluster. The cluster continuously reconciles its actual state against the desired state declared in Git.

This setup implements GitOps for the Backstage Internal Developer Portal, using a two-repository pattern, ArgoCD for continuous deployment, GitHub Actions for CI, and Argo Rollouts for safe progressive delivery.

The GitOps workflow is split across two GitHub repositories:

| Repository | Purpose |
|---|---|
| `igfurlan/backstage` | Application source code, Dockerfile, GitHub Actions workflows |
| `igfurlan/backstage-gitops` | Kubernetes manifests, Kustomize overlays, Argo Rollout definition |

This separation is intentional: it decouples the application lifecycle (builds, tests) from the deployment lifecycle (manifest changes, environment promotions). ArgoCD only watches the GitOps repo; it never has access to the application source code.

The backstage-gitops repository uses a Kustomize overlay pattern:

```
backstage-gitops/
├── base/
│   ├── kustomization.yaml      # Assembles all base resources
│   ├── namespace.yaml          # backstage namespace
│   ├── rollout.yaml            # Argo Rollout (replaces Deployment)
│   ├── service.yaml            # ClusterIP service
│   ├── httproute.yaml          # Gateway API HTTPRoute
│   ├── analysis-template.yaml  # Prometheus-based canary analysis
│   └── sealed-secret.yaml      # Encrypted DB credentials
└── overlays/
    └── production/
        ├── kustomization.yaml  # Extends base
        └── patches/
            └── resources.yaml  # Image tag + resource limits patch
```

The base/ layer defines the universal configuration. The production/ overlay applies environment-specific patches, most importantly the container image tag, which is updated automatically by the CI pipeline on every successful push to main.
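To make the overlay pattern concrete, here is a sketch of what the production kustomization could look like. This is illustrative only: the resource name and patch target are assumptions, and the actual file in igfurlan/backstage-gitops may differ.

```yaml
# overlays/production/kustomization.yaml -- illustrative sketch,
# not the verbatim file from the repository.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base              # pull in everything defined in base/
patches:
  - path: patches/resources.yaml   # image tag + resource limits patch
    target:
      kind: Rollout
      name: backstage       # hypothetical resource name
```

With this shape, CI only ever rewrites the image tag inside patches/resources.yaml; the base manifests never change on a routine deploy.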

ArgoCD monitors the overlays/production path in the GitOps repository:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: backstage
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/igfurlan/backstage-gitops.git
    targetRevision: main
    path: overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: backstage
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
```

Key policies:

  • automated – ArgoCD syncs automatically when the GitOps repo changes; no manual approval needed
  • prune: true – resources removed from Git are also removed from the cluster
  • selfHeal: true – if someone manually changes a resource in the cluster, ArgoCD reverts it to match Git within seconds

ArgoCD showing the backstage application in Synced/Healthy state with full resource tree

The CI pipeline in backstage/.github/workflows/ci.yaml runs two jobs depending on the event type:

Job: validate

Triggered when a PR is opened against main. Builds the Docker image using the BuildKit cache but does not push it. This validates that the Dockerfile is correct and the build succeeds, without consuming registry storage on unmerged changes.

Steps:

  1. Checkout code
  2. Set up Docker Buildx
  3. Log in to GitHub Container Registry (GHCR)
  4. Build image (no push), using the GitHub Actions cache (type=gha)

The pipeline uses GitHub Actions cache (type=gha) for Docker layer caching. Build times drop significantly after the first run because unchanged layers are reused from the cache.
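A build step with this caching setup typically looks like the following sketch. The action version and image tag here are assumptions, not copied from the actual ci.yaml; the cache-from/cache-to values are the documented way to enable the GitHub Actions cache backend.

```yaml
# Illustrative validate-job build step; the real ci.yaml may use
# different action versions and tag patterns.
- name: Build image (no push)
  uses: docker/build-push-action@v5
  with:
    context: .
    push: false                  # validate only, never publish from a PR
    tags: ghcr.io/igfurlan/backstage:pr-check   # hypothetical tag
    cache-from: type=gha         # reuse layers cached by earlier runs
    cache-to: type=gha,mode=max  # cache all intermediate layers too
```

mode=max caches every intermediate layer rather than only the final stage, which is what makes subsequent builds of a multi-stage Dockerfile fast.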

```yaml
# Image tag update step (from ci.yaml)
- name: Update image tag
  run: |
    cd gitops
    sed -i "s|value: ghcr.io/igfurlan/backstage:sha-.*|value: ghcr.io/igfurlan/backstage:sha-${{ steps.sha.outputs.short }}|" \
      overlays/production/patches/resources.yaml
    git add .
    git diff --staged --quiet || \
      git commit -m "deploy: backstage sha-${{ steps.sha.outputs.short }}"
    git push
```

GitHub Actions CI pipeline – validate job confirming the app repo → CI → gitops repo flow

GitHub Actions CI run after merge to main – all steps passing: build, push to GHCR, gitops update

Instead of a standard Kubernetes Deployment, Backstage uses an Argo Rollout resource. This enables canary deployments: gradually shifting traffic to the new version while monitoring error rates and latency in real time.
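A Rollout is deliberately shaped like a Deployment: same selector, same pod template, but with a strategy block that Argo Rollouts understands. The skeleton below is a sketch with placeholder replica count and image tag, not the real base/rollout.yaml.

```yaml
# Minimal Rollout skeleton -- illustrative only.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: backstage
  namespace: backstage
spec:
  replicas: 2                     # placeholder value
  selector:
    matchLabels:
      app: backstage
  template:                       # identical shape to a Deployment pod template
    metadata:
      labels:
        app: backstage
    spec:
      containers:
        - name: backstage
          image: ghcr.io/igfurlan/backstage:sha-abc1234   # placeholder tag
  strategy:
    canary: {}                    # the canary steps from base/rollout.yaml live here
```

Because the spec mirrors a Deployment, migrating is mostly a matter of changing apiVersion/kind and adding the strategy.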

The canary strategy is defined in base/rollout.yaml:

```yaml
strategy:
  canary:
    analysis:
      templates:
        - templateName: backstage-canary-check
      startingStep: 1
    steps:
      - setWeight: 20            # 20% of traffic to new version
      - pause: { duration: 60s }
      - setWeight: 50            # 50% of traffic
      - pause: { duration: 60s }
      - setWeight: 80            # 80% of traffic
      - pause: { duration: 30s }
      # Full 100% promotion happens automatically if analysis passes
```

The rollout progresses through three traffic-weight stages (20%, 50%, 80%), pausing at each to observe metrics. At step 1, a background AnalysisRun begins and runs continuously for the remainder of the rollout.

Argo Rollouts dashboard showing canary deployment progress with traffic weight stages

Argo Rollouts UI showing active AnalysisRun with Prometheus metric evaluation during canary

The AnalysisTemplate (backstage-canary-check) queries Prometheus at 30-second intervals throughout the canary rollout. Three metrics are evaluated:

Error Rate

  • Metric: HTTP 5xx error rate on Traefik service requests
  • Query: sum(rate(traefik_service_requests_total{...code=~"5.."}[2m])) / sum(rate(...))
  • Threshold: error rate ≤ 25% (result[0] <= 0.25)
  • Failure limit: 3 consecutive failures before rollback

p95 Latency

  • Metric: 95th percentile request duration via Traefik histogram
  • Query: histogram_quantile(0.95, sum(rate(traefik_service_request_duration_seconds_bucket{...}[2m])) by (le))
  • Threshold: ≤ 5 seconds
  • Failure limit: 3 consecutive failures before rollback

Pod Restarts

  • Metric: container restart count increase in the backstage namespace
  • Query: sum(increase(kube_pod_container_status_restarts_total{namespace="backstage",...}[2m]))
  • Threshold: < 2 restarts
  • Failure limit: 2 consecutive failures before rollback

If any metric exceeds its threshold too many times, the AnalysisRun fails and Argo Rollouts automatically rolls back to the previous stable version, with no human intervention.

```yaml
# From analysis-template.yaml
metrics:
  - name: error-rate
    interval: 30s
    successCondition: len(result) == 0 || isNaN(result[0]) || result[0] <= 0.25
    failureLimit: 3
    provider:
      prometheus:
        address: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090
        query: |
          sum(rate(traefik_service_requests_total{exported_service=~"backstage-backstage-.*",code=~"5.."}[2m]))
          /
          sum(rate(traefik_service_requests_total{exported_service=~"backstage-backstage-.*"}[2m]))
```
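The p95 latency check described earlier would follow the same metric shape. The fragment below is a sketch that reuses the error-rate metric's label selector and Prometheus address as assumptions; the real analysis-template.yaml may differ.

```yaml
# Hedged sketch of the p95 latency metric, mirroring the error-rate entry.
- name: p95-latency
  interval: 30s
  # Pass when there is no data yet, or p95 latency is at most 5 seconds.
  successCondition: len(result) == 0 || isNaN(result[0]) || result[0] <= 5
  failureLimit: 3
  provider:
    prometheus:
      address: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090
      query: |
        histogram_quantile(0.95,
          sum(rate(traefik_service_request_duration_seconds_bucket{exported_service=~"backstage-backstage-.*"}[2m])) by (le))
```

The len(result) == 0 and isNaN guards matter early in a canary, when the new pods have served too little traffic for the query to return a meaningful value.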

This integration closes the loop between the observability stack (Prometheus + Traefik metrics) and the deployment pipeline, creating a true automated feedback loop.

Storing Kubernetes secrets in Git is normally a security anti-pattern: secrets would be visible to anyone with repository access. Sealed Secrets (by Bitnami) solves this by allowing secrets to be encrypted before being committed to Git.

The backstage database credentials (POSTGRES_HOST, POSTGRES_USER, POSTGRES_PASSWORD, POSTGRES_PORT, BACKEND_SECRET) are stored as a SealedSecret resource in base/sealed-secret.yaml.

The workflow:

  1. Create a plain Kubernetes Secret locally (never committed)
  2. Encrypt it with kubeseal, using the cluster's public key: kubeseal --format yaml < secret.yaml > sealed-secret.yaml
  3. Commit the SealedSecret YAML to Git (safe, since it is only ciphertext)
  4. When ArgoCD applies it, the sealed-secrets-controller decrypts it and creates the real Secret in the cluster
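The committed file contains only ciphertext under spec.encryptedData. Structurally, the result looks like the sketch below; the values shown are placeholders, not real ciphertext, and only two of the five keys are shown.

```yaml
# Shape of base/sealed-secret.yaml -- placeholder values, not real ciphertext.
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: backstage-db
  namespace: backstage
spec:
  encryptedData:
    POSTGRES_USER: AgB4x...       # placeholder ciphertext
    POSTGRES_PASSWORD: AgCk9...   # placeholder ciphertext
  template:                       # metadata for the Secret the controller creates
    metadata:
      name: backstage-db
      namespace: backstage
```

By default the ciphertext is bound to the secret's name and namespace, so a sealed value cannot simply be copied into a different namespace and decrypted there.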

The sealed data can only be decrypted by the cluster that holds the corresponding private key. Even if the GitOps repository is public, the encrypted secrets are worthless to an attacker without access to the cluster.

SealedSecret resource – the backstage-db sealed secret as seen in the cluster

GitOps flow diagram