Runbook: Incident resizing of ArgoCD-managed StatefulSets¶

Last updated: 2026-02-26
Author: John Adedigba
JIRA: DEVOPSBLN-7015

Applies to any ArgoCD-managed StatefulSet: Thanos, Kafka (via Strimzi), Druid, Aerospike, etc. For background on why these procedures work, see ADR-0001.

Prerequisites¶

ArgoCD UI access
kubectl access to the target GKE cluster
Know which ArgoCD app manages the target StatefulSet
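
If the owning app is not known, it can usually be read off the StatefulSet itself. The sketch below assumes ArgoCD tracks resources either with the argocd.argoproj.io/tracking-id annotation or with the app.kubernetes.io/instance label; which one applies depends on how the ArgoCD instance is configured. The helper name owning_app is hypothetical.

```shell
# Hypothetical helper: print the ArgoCD app that tracks a StatefulSet.
# Tries the tracking-id annotation first, then falls back to the instance label.
owning_app() {
  sts="$1"; ns="$2"
  # Assumed annotation format: <app>:<group>/<kind>:<namespace>/<name>
  id=$(kubectl get sts "$sts" -n "$ns" \
    -o jsonpath='{.metadata.annotations.argocd\.argoproj\.io/tracking-id}')
  if [ -n "$id" ]; then
    # Keep everything before the first colon, i.e. the app name.
    printf '%s\n' "${id%%:*}"
    return 0
  fi
  kubectl get sts "$sts" -n "$ns" \
    -o jsonpath='{.metadata.labels.app\.kubernetes\.io/instance}'
}

# Example: owning_app <name> <ns>
```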

Decision tree¶

What do you need?	Disable auto-sync?	Procedure
CPU / memory change (OOMKill, throttling)	Yes	Procedure A
Replica count change (traffic spike)	Yes	Procedure A
PVC size (disk full)	No	Procedure B
VolumeClaimTemplate change (immutable)	Yes	Procedure C

Procedure A¶

Resource or replica resize (CPU, memory, env, replica count).

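Downtime: Depends on the change. A resource or env change edits the pod template and triggers a rolling restart of the pods. A replica count change only adds or removes pods; existing pods keep running.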

1. Disable auto-sync¶

ArgoCD UI (recommended): App Details -> Sync Policy -> Disable Auto-Sync

CLI: argocd app set <app-name> --sync-policy none
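
Before patching anything, it is worth confirming that auto-sync is really off, otherwise ArgoCD may revert the fix. A minimal sketch, assuming Application objects live in the argocd namespace and that a disabled policy shows up as a missing spec.syncPolicy.automated field (some ArgoCD versions may represent this differently). The function name and the ARGOCD_NS variable are hypothetical.

```shell
# Hypothetical guard: succeeds only if the Application has no automated sync policy.
# ARGOCD_NS is an assumed override for the namespace holding Application objects.
autosync_disabled() {
  app="$1"
  policy=$(kubectl get applications.argoproj.io "$app" -n "${ARGOCD_NS:-argocd}" \
    -o jsonpath='{.spec.syncPolicy.automated}')
  # An empty result means the automated block is absent.
  [ -z "$policy" ]
}

# Example: autosync_disabled <app-name> && echo "safe to patch"
```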

2. Apply the fix¶

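# Resource change
# Uses a strategic merge patch, which merges the containers list by name.
# A JSON merge patch (--type merge) would replace the whole list and drop image, ports, etc.
kubectl patch sts <name> -n <ns> --type strategic -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container>","resources":{"requests":{"memory":"<new>"},"limits":{"memory":"<new>"}}}]}}}}'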

# Replica scale
kubectl scale sts <name> -n <ns> --replicas=<N>
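
Hand-typing the patch JSON during an incident is error-prone. As a sketch, a small helper can build the patch body for one container; it is meant for a strategic merge patch, which matches containers by name and leaves the other fields of the container untouched. The helper name memory_patch is hypothetical.

```shell
# Hypothetical helper: build a patch body that sets the memory request and
# limit of one container (matched by name) to the same value.
memory_patch() {
  container="$1"; mem="$2"
  printf '{"spec":{"template":{"spec":{"containers":[{"name":"%s","resources":{"requests":{"memory":"%s"},"limits":{"memory":"%s"}}}]}}}}\n' \
    "$container" "$mem" "$mem"
}

# Example (server-side dry run first, then drop --dry-run to apply):
#   kubectl patch sts <name> -n <ns> --type strategic --dry-run=server \
#     -p "$(memory_patch <container> <new>)"
```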

3. Verify¶

Pods running with new resources/replica count: kubectl get pods -n <ns> -w
Service healthy (check dashboards, health endpoints)
ArgoCD shows OutOfSync but does NOT revert
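
To check the first item without eyeballing pod specs, the live value can be read back from the StatefulSet pod template. A minimal sketch; the function name is hypothetical and only the memory limit is queried.

```shell
# Hypothetical check: print the live memory limit of one container in the
# StatefulSet pod template, to compare against the value just patched in.
live_memory_limit() {
  sts="$1"; ns="$2"; container="$3"
  kubectl get sts "$sts" -n "$ns" \
    -o jsonpath="{.spec.template.spec.containers[?(@.name==\"$container\")].resources.limits.memory}"
}

# Example:
#   kubectl rollout status sts/<name> -n <ns>   # wait for the rolling restart
#   live_memory_limit <name> <ns> <container>
```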

4. Reconcile Git¶

Open a GitLab MR updating values.yaml to match. Merge.

5. Re-enable auto-sync¶

ArgoCD UI: App Details -> Sync Policy -> Enable Auto-Sync (tick Self Heal + Prune)

CLI: argocd app set <app-name> --sync-policy automated --self-heal --auto-prune

6. Confirm¶

ArgoCD shows Synced + Healthy
argocd app diff <app-name> shows no diff
No pod restarts during reconciliation
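
The Synced + Healthy check can also be scripted against the Application object. This is a sketch that assumes Applications live in the argocd namespace; the function names and the ARGOCD_NS variable are hypothetical.

```shell
# Hypothetical helpers: read sync and health status from the Application object.
# ARGOCD_NS is an assumed override for the namespace holding Application objects.
app_status() {
  kubectl get applications.argoproj.io "$1" -n "${ARGOCD_NS:-argocd}" \
    -o jsonpath='{.status.sync.status}/{.status.health.status}'
}

# Succeeds only when the app reports Synced and Healthy.
app_is_converged() {
  [ "$(app_status "$1")" = "Synced/Healthy" ]
}

# Example: app_is_converged <app-name> || echo "not converged yet"
```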

Procedure B¶

PVC resize (disk full).

Downtime: Yes, for each pod whose volume is resized. Filesystem expansion on GKE standard-rwo requires a pod restart, so use this only on apps that can tolerate one. Auto-sync stays enabled throughout.

1. Patch the PVC¶

kubectl patch pvc <pvc-name> -n <ns> \
  -p '{"spec":{"resources":{"requests":{"storage":"<new-size>"}}}}'

2. Delete the pod to trigger filesystem expansion¶

kubectl delete pod <pod-name> -n <ns>

GKE standard-rwo requires a remount. The PVC shows FileSystemResizePending until the new pod mounts the volume.
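
The pending state can be read from the PVC conditions rather than from the describe output. A minimal sketch with a hypothetical function name:

```shell
# Hypothetical check: succeeds while the PVC still reports the
# FileSystemResizePending condition, i.e. the new pod has not mounted it yet.
pvc_resize_pending() {
  pvc="$1"; ns="$2"
  state=$(kubectl get pvc "$pvc" -n "$ns" \
    -o jsonpath='{.status.conditions[?(@.type=="FileSystemResizePending")].status}')
  [ "$state" = "True" ]
}

# Example: pvc_resize_pending <pvc-name> <ns> && echo "waiting for remount"
```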

3. Verify¶

PVC capacity shows new size: kubectl get pvc <pvc-name> -n <ns>
ArgoCD stays Synced + Healthy throughout
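
The capacity check amounts to comparing the requested size with the size the PVC reports. The sketch below does a plain string comparison, which assumes both values use the same unit notation (for example 200Gi on both sides); the function name is hypothetical.

```shell
# Hypothetical check: succeeds once the reported capacity equals the requested size.
# Plain string comparison; assumes both sides use the same unit notation.
pvc_expanded() {
  pvc="$1"; ns="$2"
  requested=$(kubectl get pvc "$pvc" -n "$ns" \
    -o jsonpath='{.spec.resources.requests.storage}')
  actual=$(kubectl get pvc "$pvc" -n "$ns" \
    -o jsonpath='{.status.capacity.storage}')
  [ -n "$requested" ] && [ "$requested" = "$actual" ]
}

# Example: pvc_expanded <pvc-name> <ns> && echo "expansion complete"
```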

4. Reconcile Git¶

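Update persistence.size in values.yaml via MR so that newly created PVCs get the correct size. If that value renders into the StatefulSet's volumeClaimTemplates, Kubernetes rejects the in-place update; roll the change out with Procedure C.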

Procedure C¶

VolumeClaimTemplate resize (immutable field change).

Downtime: None. Existing pods keep running throughout.

VolumeClaimTemplates are immutable -- Kubernetes rejects in-place updates. This procedure deletes the StatefulSet while keeping pods alive.

1. Disable auto-sync¶

Same as Procedure A step 1.

2. Delete the StatefulSet (keep pods)¶

kubectl delete sts <name> -n <ns> --cascade=orphan

Pods keep running and PVCs stay bound.
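
With an orphaning delete, the garbage collector removes the owner references from the pods instead of deleting them. A quick way to confirm a pod survived as an orphan, sketched with a hypothetical function name:

```shell
# Hypothetical check: succeeds if the pod has no owner references left,
# which is the expected state after an orphaning delete of its StatefulSet.
pod_is_orphaned() {
  pod="$1"; ns="$2"
  owners=$(kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{.metadata.ownerReferences[*].name}')
  [ -z "$owners" ]
}

# Example: pod_is_orphaned <pod-name> <ns> && echo "orphaned, still running"
```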

3. Update Git¶

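Update the VCT size in values.yaml, then push and merge to Git. The new VCT only applies to PVCs created from now on; resize existing PVCs with Procedure B.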

4. Re-enable auto-sync¶

Same as Procedure A step 5. ArgoCD recreates the StatefulSet with the new VCT. The StatefulSet controller adopts orphaned pods by name -- no new pods created, no restarts.

5. Confirm¶

ArgoCD shows Synced + Healthy
Pod age unchanged (same pod, not recreated)
Service stayed healthy throughout (check dashboards, health endpoints)
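
Pod age is easy to misread under pressure; comparing pod UIDs before and after is a stricter version of the same check, since a recreated pod always gets a new UID. A minimal sketch, where the function name and the label selector argument are assumptions:

```shell
# Hypothetical helper: print a sorted name=uid line per pod matching a selector.
# Capture once before deleting the StatefulSet and once after re-sync;
# identical output means no pod was recreated.
pod_uids() {
  ns="$1"; selector="$2"
  kubectl get pods -n "$ns" -l "$selector" \
    -o jsonpath='{range .items[*]}{.metadata.name}={.metadata.uid}{"\n"}{end}' | sort
}

# Example:
#   before=$(pod_uids <ns> <label-selector>)
#   ... run the procedure ...
#   [ "$before" = "$(pod_uids <ns> <label-selector>)" ] && echo "same pods"
```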

Operator-managed workloads (Strimzi / Kafka)¶

ArgoCD manages the Custom Resource (e.g. Kafka CR), not the StatefulSet directly. The operator manages the STS.

Disable ArgoCD auto-sync before editing the CR (same as Procedure A)
Disabling ArgoCD sync does NOT pause the operator -- Strimzi continues reconciling independently, so make the change on the CR; direct edits to the StatefulSet are overwritten by the operator
Reconcile CR changes back to Git, then re-enable sync
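
For a CR the patch type differs from Procedure A: custom resources do not support strategic merge patches, so --type merge is the one to use. The sketch below builds a merge patch for broker memory and assumes the resources sit under spec.kafka.resources of the Kafka CR; clusters that configure brokers elsewhere (for example per node pool) need a different path. The helper name is hypothetical.

```shell
# Hypothetical helper: build a JSON merge patch that sets the broker memory
# request and limit on a Strimzi Kafka CR (assumed path: spec.kafka.resources).
kafka_memory_patch() {
  mem="$1"
  printf '{"spec":{"kafka":{"resources":{"requests":{"memory":"%s"},"limits":{"memory":"%s"}}}}}\n' \
    "$mem" "$mem"
}

# Example:
#   kubectl patch kafka <cluster-name> -n <ns> --type merge \
#     -p "$(kafka_memory_patch <new>)"
```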

Rollback¶

Resource/replica changes: Re-enable auto-sync with Self Heal, without merging a Git MR. ArgoCD reverts the live state to what is in Git.

PVC resize: Not reversible (PVCs can grow but not shrink). To reduce, create a new PVC and migrate data.

Cascade=orphan: If the STS was not recreated, trigger a manual sync: argocd app sync <app-name>. If pods are stuck, delete them -- the STS controller recreates them.

Post-incident checklist¶

[ ] All in-cluster changes reconciled to Git via merged MRs
[ ] Auto-sync re-enabled on all affected apps
[ ] All affected apps show Synced + Healthy
[ ] argocd app diff <app-name> shows no diff
[ ] Incident documented (timeline, what was changed, MR links)