Runbook: Incident resizing of ArgoCD-managed StatefulSets

Last updated: 2026-02-26
Author: John Adedigba
JIRA: DEVOPSBLN-7015

Applies to any ArgoCD-managed StatefulSet: Thanos, Kafka (via Strimzi), Druid, Aerospike, etc. For background on why these procedures work, see ADR-0001.

Prerequisites

  • ArgoCD UI access
  • kubectl access to the target GKE cluster
  • Know which ArgoCD app manages the target StatefulSet

Decision tree

What do you need?                          | Disable auto-sync? | Procedure
CPU / memory change (OOMKill, throttling)  | Yes                | Procedure A
Replica count change (traffic spike)       | Yes                | Procedure A
PVC size (disk full)                       | No                 | Procedure B
VolumeClaimTemplate change (immutable)     | Yes                | Procedure C

Procedure A

Resource or replica resize (CPU, memory, env, replica count).

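Downtime: Yes for resource changes. Patching the pod template (CPU, memory, env) triggers a rolling restart of the pods under the default RollingUpdate strategy. A replica-count change only adds or removes pods and does not restart the existing ones.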

1. Disable auto-sync

ArgoCD UI (recommended): App Details -> Sync Policy -> Disable Auto-Sync

CLI: argocd app set <app-name> --sync-policy none

2. Apply the fix

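# Resource change
# Use a strategic merge patch: it merges the containers list by container name.
# A JSON merge patch (--type merge) would replace the whole containers list.
kubectl patch sts <name> -n <ns> --type strategic -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container>","resources":{"requests":{"memory":"<new>"},"limits":{"memory":"<new>"}}}]}}}}'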

# Replica scale
kubectl scale sts <name> -n <ns> --replicas=<N>
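
Hand-editing the nested patch JSON during an incident is error-prone. A minimal sketch of a helper that builds the resource patch, assuming a single container and a memory-only change; `build_memory_patch` is a hypothetical name, not an existing tool. The patch is meant for a strategic merge patch, which merges the containers list by name, whereas a JSON merge patch would replace the list.

```shell
# Hypothetical helper: prints a strategic merge patch that sets the memory
# request and limit of one container, identified by name.
build_memory_patch() {
  container="$1"
  memory="$2"
  printf '{"spec":{"template":{"spec":{"containers":[{"name":"%s","resources":{"requests":{"memory":"%s"},"limits":{"memory":"%s"}}}]}}}}' \
    "$container" "$memory" "$memory"
}

# Example (not executed here): preview the change server-side before applying.
#   kubectl patch sts <name> -n <ns> --type strategic \
#     -p "$(build_memory_patch <container> 4Gi)" --dry-run=server -o yaml
```

Dropping `--dry-run=server -o yaml` applies the change for real.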

3. Verify

  • Pods running with new resources/replica count: kubectl get pods -n <ns> -w
  • Service healthy (check dashboards, health endpoints)
  • ArgoCD shows OutOfSync but does NOT revert
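
As an alternative to watching pods by hand, the verification can be scripted. A minimal sketch, assuming the StatefulSet uses the RollingUpdate strategy (which `kubectl rollout status` requires) and a 10-minute timeout chosen for illustration; `verify_memory_limit` is a hypothetical helper name.

```shell
# Hypothetical helper: waits for the rollout to finish, then prints the
# memory limit that the named container of one pod is running with.
verify_memory_limit() {
  sts="$1"; ns="$2"; pod="$3"; container="$4"
  # Progress messages go to stderr so stdout carries only the limit value.
  kubectl rollout status "sts/$sts" -n "$ns" --timeout=10m >&2 || return 1
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath="{.spec.containers[?(@.name==\"$container\")].resources.limits.memory}"
}

# Example (not executed here):
#   verify_memory_limit <name> <ns> <name>-0 <container>
```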

4. Reconcile Git

Open a GitLab MR that updates values.yaml to match the values now running in the cluster, then merge it.

5. Re-enable auto-sync

ArgoCD UI: App Details -> Sync Policy -> Enable Auto-Sync (tick Self Heal + Prune)

CLI: argocd app set <app-name> --sync-policy automated --self-heal --auto-prune

6. Confirm

  • ArgoCD shows Synced + Healthy
  • argocd app diff <app-name> shows no diff
  • No pod restarts during reconciliation
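
The diff check can be scripted through the exit code of `argocd app diff`. A minimal sketch, assuming the CLI's documented convention of 0 for no diff, 1 for a diff, and 2 for a general error; confirm against the installed CLI version. `check_app_diff` is a hypothetical helper name.

```shell
# Hypothetical helper: maps the exit code of "argocd app diff" to a label.
check_app_diff() {
  app="$1"
  rc=0
  argocd app diff "$app" >/dev/null 2>&1 || rc=$?
  case "$rc" in
    0) echo "in-sync" ;;
    1) echo "diff-found" ;;
    *) echo "error" ;;
  esac
}

# Example (not executed here):
#   [ "$(check_app_diff <app-name>)" = "in-sync" ] && echo "safe to close out"
```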

Procedure B

PVC resize (disk full).

Downtime: Yes, per pod. On GKE standard-rwo the filesystem expansion requires a pod restart, so use this procedure only on apps that can tolerate one. Auto-sync stays enabled throughout.

1. Patch the PVC

kubectl patch pvc <pvc-name> -n <ns> \
  -p '{"spec":{"resources":{"requests":{"storage":"<new-size>"}}}}'
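
A StatefulSet has one PVC per replica, so the patch usually has to be repeated for each of them. Kubernetes names these PVCs `<volumeClaimTemplate>-<statefulset>-<ordinal>`. A minimal sketch that lists them, assuming ordinals start at 0 (the default); `sts_pvc_names` is a hypothetical helper, and the names in the example are illustrative.

```shell
# Hypothetical helper: prints the PVC name for every replica of a
# StatefulSet, following <volumeClaimTemplate>-<statefulset>-<ordinal>.
sts_pvc_names() {
  vct="$1"; sts="$2"; replicas="$3"
  i=0
  while [ "$i" -lt "$replicas" ]; do
    echo "${vct}-${sts}-${i}"
    i=$((i + 1))
  done
}

# Example (not executed here): resize the volume of every replica.
#   for pvc in $(sts_pvc_names data thanos-store 3); do
#     kubectl patch pvc "$pvc" -n <ns> \
#       -p '{"spec":{"resources":{"requests":{"storage":"<new-size>"}}}}'
#   done
```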

2. Delete the pod to trigger filesystem expansion

kubectl delete pod <pod-name> -n <ns>

GKE standard-rwo requires a remount. The PVC shows FileSystemResizePending until the new pod mounts the volume.
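
The PVC's condition types show where the resize stands. A minimal sketch that turns them into a next action, assuming the standard Kubernetes condition types `Resizing` and `FileSystemResizePending`; `pvc_resize_state` and its output labels are hypothetical.

```shell
# Hypothetical helper: maps a PVC's condition types to the next action.
# Pass it the output of:
#   kubectl get pvc <pvc-name> -n <ns> -o jsonpath='{.status.conditions[*].type}'
pvc_resize_state() {
  case " $1 " in
    *" FileSystemResizePending "*) echo "restart-pod" ;;
    *" Resizing "*)                echo "wait" ;;
    *)                             echo "no-resize-in-progress" ;;
  esac
}
```

An empty condition list yields `no-resize-in-progress`, which covers both a finished resize and one that was never requested, so confirm the capacity separately.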

3. Verify

  • PVC capacity shows new size: kubectl get pvc <pvc-name> -n <ns>
  • ArgoCD stays Synced + Healthy throughout

4. Reconcile Git

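Update persistence.size in values.yaml via MR so future pods get the correct size. If that value renders into the StatefulSet's volumeClaimTemplates, Kubernetes rejects the change as an immutable-field update and ArgoCD cannot sync it in place; roll it out with Procedure C.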

Procedure C

VolumeClaimTemplate resize (immutable field change).

Downtime: None. Existing pods keep running throughout.

VolumeClaimTemplates are immutable -- Kubernetes rejects in-place updates. This procedure deletes the StatefulSet while keeping pods alive.

1. Disable auto-sync

Same as Procedure A step 1.

2. Delete the StatefulSet (keep pods)

kubectl delete sts <name> -n <ns> --cascade=orphan

Pods and PVCs stay running.
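
Before moving on, it helps to confirm the pods really survived the delete. With orphan deletion, Kubernetes removes the owner reference from the pods, so the OWNER column should read `<none>` while the pods stay in the Running phase. A minimal sketch, assuming the pods share a label selector you know; `list_pod_owners` is a hypothetical helper name.

```shell
# Hypothetical helper: lists pods with their owner and phase.
list_pod_owners() {
  ns="$1"; selector="$2"
  kubectl get pods -n "$ns" -l "$selector" \
    -o 'custom-columns=NAME:.metadata.name,OWNER:.metadata.ownerReferences[*].name,PHASE:.status.phase'
}

# Example (not executed here):
#   list_pod_owners <ns> app.kubernetes.io/name=<name>
```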

3. Update Git

Update the VCT size in values.yaml via MR and merge it.

4. Re-enable auto-sync

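ArgoCD recreates the StatefulSet with the new VCT. The StatefulSet controller adopts the orphaned pods, which match its selector and ordinal names -- no new pods created, no restarts. The new VCT size applies only to PVCs created from now on; existing PVCs keep their size unless they are resized with Procedure B.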

5. Confirm

  • ArgoCD shows Synced + Healthy
  • Pod age unchanged (same pod, not recreated)
  • Zero downtime
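
Pod age is a reasonable signal, but comparing pod UIDs is stricter, since a recreated pod always gets a new UID. A minimal sketch, assuming you save one snapshot before step 2 and one after step 4; `snapshot_pods` and `same_pods` are hypothetical helper names.

```shell
# Hypothetical helper: prints "name uid" for every pod matching a selector.
snapshot_pods() {
  ns="$1"; selector="$2"
  kubectl get pods -n "$ns" -l "$selector" \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.metadata.uid}{"\n"}{end}' | sort
}

# Hypothetical helper: compares two snapshot files byte for byte.
same_pods() {
  if cmp -s "$1" "$2"; then
    echo "unchanged"
  else
    echo "changed"
  fi
}

# Example (not executed here):
#   snapshot_pods <ns> <selector> > before.txt   # before step 2
#   snapshot_pods <ns> <selector> > after.txt    # after step 4
#   same_pods before.txt after.txt
```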

Operator-managed workloads (Strimzi / Kafka)

ArgoCD manages the Custom Resource (e.g. Kafka CR), not the StatefulSet directly. The operator manages the STS.

  • Disable ArgoCD auto-sync before editing the CR (same as Procedure A)
  • Disabling ArgoCD sync does NOT pause the operator -- Strimzi continues reconciling independently, so make the change in the CR; direct edits to operator-managed objects are overwritten
  • Reconcile CR changes back to Git, then re-enable sync
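
A minimal sketch of the equivalent resource change on a Kafka CR. Custom resources do not support strategic merge patches, so this uses `--type merge`. The `spec.kafka.resources` path is an assumption about how the cluster is defined (clusters using KafkaNodePools may set resources on the node pool instead), and `build_kafka_memory_patch` is a hypothetical helper name.

```shell
# Hypothetical helper: prints a JSON merge patch that sets the broker memory
# request and limit on a Strimzi Kafka CR.
# Assumption: resources are defined under spec.kafka.resources.
build_kafka_memory_patch() {
  memory="$1"
  printf '{"spec":{"kafka":{"resources":{"requests":{"memory":"%s"},"limits":{"memory":"%s"}}}}}' \
    "$memory" "$memory"
}

# Example (not executed here):
#   kubectl patch kafka <cluster> -n <ns> --type merge \
#     -p "$(build_kafka_memory_patch 8Gi)"
```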

Rollback

Resource/replica changes: Re-enable auto-sync with Self Heal and without merging a Git MR. ArgoCD reverts the live objects to the Git state.

PVC resize: Not reversible (PVCs can grow but not shrink). To reduce, create a new PVC and migrate data.

Cascade=orphan: If the STS was not recreated, trigger a manual sync: argocd app sync <app-name>. If pods are stuck, delete them -- the STS controller recreates them.

Post-incident checklist

  • [ ] All in-cluster changes reconciled to Git via merged MRs
  • [ ] Auto-sync re-enabled on all affected apps
  • [ ] All affected apps show Synced + Healthy
  • [ ] argocd app diff <app-name> shows no diff
  • [ ] Incident documented (timeline, what was changed, MR links)