Runbook: Incident resizing of ArgoCD-managed StatefulSets¶
Last updated: 2026-02-26
Author: John Adedigba
JIRA: DEVOPSBLN-7015
Applies to any ArgoCD-managed StatefulSet: Thanos, Kafka (via Strimzi), Druid, Aerospike, etc. For background on why these procedures work, see ADR-0001.
Prerequisites¶
- ArgoCD UI access
- kubectl access to the target GKE cluster
- Know which ArgoCD app manages the target StatefulSet
Decision tree¶
| What do you need? | Disable auto-sync? | Procedure |
|---|---|---|
| CPU / memory change (OOMKill, throttling) | Yes | Procedure A |
| Replica count change (traffic spike) | Yes | Procedure A |
| PVC size (disk full) | No | Procedure B |
| VolumeClaimTemplate change (immutable) | Yes | Procedure C |
Procedure A¶
Resource or replica resize (CPU, memory, env, replica count).
Downtime: Yes for resource changes. Patching the StatefulSet pod template triggers a rolling restart of pods. A replica-only scale does not restart existing pods.
1. Disable auto-sync¶
ArgoCD UI (recommended): App Details -> Sync Policy -> Disable Auto-Sync
CLI: argocd app set <app-name> --sync-policy none
2. Apply the fix¶
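# Resource change (strategic merge patch: the container is merged by name; a JSON merge patch would replace the whole containers list)
kubectl patch sts <name> -n <ns> --type strategic -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container>","resources":{"requests":{"memory":"<new>"},"limits":{"memory":"<new>"}}}]}}}}'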
# Replica scale
kubectl scale sts <name> -n <ns> --replicas=<N>
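The patch body is easy to mistype during an incident. A minimal sketch of a helper that builds it, assuming only the memory request and limit of a single container change; `build_memory_patch` is a hypothetical name, not part of any tooling referenced in this runbook. It is shown with `--type strategic` because a JSON merge patch would replace the whole `containers` list.

```shell
# Hypothetical helper: print a strategic merge patch that sets the memory
# request and limit of one container to the same value.
#   $1 = container name, $2 = new memory size (e.g. 4Gi)
build_memory_patch() {
  printf '{"spec":{"template":{"spec":{"containers":[{"name":"%s","resources":{"requests":{"memory":"%s"},"limits":{"memory":"%s"}}}]}}}}' \
    "$1" "$2" "$2"
}

# Preview the result on the API server without changing anything:
#   kubectl patch sts <name> -n <ns> --type strategic \
#     -p "$(build_memory_patch <container> <new>)" --dry-run=server -o yaml
# Remove --dry-run=server to apply the change.
```

Running the server-side dry run first is a cheap way to catch a wrong container name or an invalid quantity before the rolling restart starts.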
3. Verify¶
- Pods running with new resources/replica count:
kubectl get pods -n <ns> -w
- Service healthy (check dashboards, health endpoints)
- ArgoCD shows OutOfSync but does NOT revert
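As an alternative to watching pods by hand, the rollout can be awaited with `kubectl rollout status`. This is a sketch, assuming the StatefulSet uses the RollingUpdate strategy; `wait_for_rollout` and the 10m default timeout are illustrative choices, not values from this runbook.

```shell
# Hypothetical wrapper: block until the StatefulSet rollout finishes or times out.
#   $1 = StatefulSet name, $2 = namespace, $3 = optional timeout (default 10m)
wait_for_rollout() {
  kubectl rollout status "statefulset/$1" -n "$2" --timeout="${3:-10m}"
}

# Example (placeholders):
#   wait_for_rollout <name> <ns> && echo "rollout complete"
```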
4. Reconcile Git¶
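Open a GitLab MR updating values.yaml to match the in-cluster change, and merge it before re-enabling auto-sync. Otherwise ArgoCD reverts the fix to the Git state.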
5. Re-enable auto-sync¶
ArgoCD UI: App Details -> Sync Policy -> Enable Auto-Sync (tick Self Heal + Prune)
CLI: argocd app set <app-name> --sync-policy automated --self-heal --auto-prune
6. Confirm¶
- ArgoCD shows Synced + Healthy
- argocd app diff <app-name> shows no diff
- No pod restarts during reconciliation
Procedure B¶
PVC resize (disk full).
Downtime: Yes, briefly. A pod restart is required for filesystem expansion on GKE standard-rwo, so use this only on apps that can tolerate it. Auto-sync stays enabled throughout.
1. Patch the PVC¶
kubectl patch pvc <pvc-name> -n <ns> \
-p '{"spec":{"resources":{"requests":{"storage":"<new-size>"}}}}'
2. Delete the pod to trigger filesystem expansion¶
GKE standard-rwo requires a remount. The PVC shows FileSystemResizePending until the new pod mounts the volume.
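A sketch of this step, assuming the pod that mounts the PVC is deleted with a plain `kubectl delete pod` and the StatefulSet controller recreates it. `<pod-name>` is a placeholder and `pvc_resize_pending` is a hypothetical helper for checking whether the remount is still outstanding.

```shell
# Hypothetical helper: succeed (exit 0) while the PVC still reports the
# FileSystemResizePending condition, fail once the condition is gone.
#   $1 = PVC name, $2 = namespace
pvc_resize_pending() {
  kubectl get pvc "$1" -n "$2" -o jsonpath='{.status.conditions[*].type}' \
    | grep -q 'FileSystemResizePending'
}

# Example (placeholders):
#   pvc_resize_pending <pvc-name> <ns> && kubectl delete pod <pod-name> -n <ns>
#   kubectl get pods -n <ns> -w        # wait for the replacement pod to be Running
#   pvc_resize_pending <pvc-name> <ns> || echo "filesystem expanded"
```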
3. Verify¶
- PVC capacity shows new size:
kubectl get pvc <pvc-name> -n <ns>
- ArgoCD stays Synced + Healthy throughout
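The capacity check can also be scripted by comparing the requested size with the size the PVC reports. This is only a sketch: `pvc_resized` is a hypothetical helper, and it assumes both values are printed in the same unit format.

```shell
# Hypothetical helper: succeed when the PVC's reported capacity equals the
# requested size.   $1 = PVC name, $2 = namespace
pvc_resized() {
  sizes="$(kubectl get pvc "$1" -n "$2" \
    -o jsonpath='{.spec.resources.requests.storage} {.status.capacity.storage}')"
  requested="${sizes% *}"   # text before the space
  actual="${sizes#* }"      # text after the space
  echo "requested=$requested actual=$actual"
  [ -n "$actual" ] && [ "$requested" = "$actual" ]
}

# Example (placeholders):
#   pvc_resized <pvc-name> <ns> || echo "resize not finished yet"
```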
4. Reconcile Git¶
Update persistence.size in values.yaml via MR so future pods get the correct size.
Procedure C¶
VolumeClaimTemplate resize (immutable field change).
Downtime: None. Existing pods keep running throughout.
VolumeClaimTemplates are immutable -- Kubernetes rejects in-place updates. This procedure deletes the StatefulSet while keeping pods alive.
1. Disable auto-sync¶
Same as Procedure A step 1.
2. Delete the StatefulSet (keep pods)¶
Pods and PVCs stay running.
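A sketch of the delete, assuming the orphan cascade mode that the Rollback section refers to as Cascade=orphan. `orphan_delete_sts` is a hypothetical wrapper name.

```shell
# Hypothetical wrapper: delete only the StatefulSet object. With
# --cascade=orphan its pods are left running without an owner.
#   $1 = StatefulSet name, $2 = namespace
orphan_delete_sts() {
  kubectl delete statefulset "$1" -n "$2" --cascade=orphan
}

# Example (placeholders):
#   orphan_delete_sts <name> <ns>
#   kubectl get pods -n <ns>     # pods should still be Running
```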
3. Update Git¶
Update the VolumeClaimTemplate (VCT) size in values.yaml and merge the change to Git via MR.
4. Re-enable auto-sync¶
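ArgoCD recreates the StatefulSet with the new VCT. The StatefulSet controller adopts orphaned pods by name -- no new pods created, no restarts. The new VCT size applies only to PVCs created afterwards; existing PVCs keep their current size until resized with Procedure B.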
5. Confirm¶
- ArgoCD shows Synced + Healthy
- Pod age unchanged (same pod, not recreated)
- Zero downtime
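Pod age is easy to misread at a glance; comparing pod UIDs before and after the procedure is a stricter check that the same pods survived. A minimal sketch, assuming the pods can be selected by a label; the selector and the `/tmp` file names are placeholders, and `pod_uids` is a hypothetical helper.

```shell
# Hypothetical helper: print "<pod-name> <uid>" for every matching pod, sorted.
#   $1 = namespace, $2 = label selector
pod_uids() {
  kubectl get pods -n "$1" -l "$2" \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.metadata.uid}{"\n"}{end}' \
    | sort
}

# Example (placeholders):
#   pod_uids <ns> <selector> > /tmp/uids.before    # before step 2
#   pod_uids <ns> <selector> > /tmp/uids.after     # after step 4
#   diff /tmp/uids.before /tmp/uids.after && echo "same pods, none recreated"
```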
Operator-managed workloads (Strimzi / Kafka)¶
ArgoCD manages the Custom Resource (e.g. Kafka CR), not the StatefulSet directly. The operator manages the STS.
- Disable ArgoCD auto-sync before editing the CR (same as Procedure A)
- Disabling ArgoCD sync does NOT pause the operator -- Strimzi continues reconciling independently
- Reconcile CR changes back to Git, then re-enable sync
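A sketch of a CR edit for a Strimzi-managed cluster, assuming broker resources are set under `spec.kafka.resources` in the Kafka CR; clusters that use KafkaNodePools may set them on the node pool instead. Custom resources do not support strategic merge patches, so `--type merge` is used here. `build_kafka_memory_patch` is a hypothetical helper.

```shell
# Hypothetical helper: print a JSON merge patch that sets the broker memory
# request and limit on a Strimzi Kafka CR.   $1 = new memory size (e.g. 8Gi)
build_kafka_memory_patch() {
  printf '{"spec":{"kafka":{"resources":{"requests":{"memory":"%s"},"limits":{"memory":"%s"}}}}}' \
    "$1" "$1"
}

# Example (placeholders), after disabling ArgoCD auto-sync:
#   kubectl patch kafka <cluster-name> -n <ns> --type merge \
#     -p "$(build_kafka_memory_patch <new>)"
# The operator then rolls the brokers itself.
```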
Rollback¶
Resource/replica changes: Re-enable auto-sync without merging a Git MR. ArgoCD reverts to Git state.
PVC resize: Not reversible (PVCs can grow but not shrink). To reduce, create a new PVC and migrate data.
Cascade=orphan: If the STS was not recreated, trigger a manual sync: argocd app sync <app-name>. If pods are stuck, delete them -- the STS controller recreates them.
Post-incident checklist¶
- [ ] All in-cluster changes reconciled to Git via merged MRs
- [ ] Auto-sync re-enabled on all affected apps
- [ ] All affected apps show Synced + Healthy
- [ ] argocd app diff <app-name> shows no diff
- [ ] Incident documented (timeline, what was changed, MR links)
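The Synced + Healthy item can be checked for several apps at once by reading the Application objects directly. This is a sketch under assumptions: Application objects are assumed to live in the `argocd` namespace, and `app_is_clean` is a hypothetical helper.

```shell
# Hypothetical helper: succeed only when the app is both Synced and Healthy.
#   $1 = ArgoCD app name, $2 = namespace of Application objects (assumed: argocd)
app_is_clean() {
  state="$(kubectl get applications.argoproj.io "$1" -n "${2:-argocd}" \
    -o jsonpath='{.status.sync.status} {.status.health.status}')"
  echo "$1: $state"
  [ "$state" = "Synced Healthy" ]
}

# Example (placeholders):
#   for app in <app-1> <app-2>; do app_is_clean "$app" || echo "check $app"; done
```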