ArgoCD StatefulSet incident test results (DEVOPSBLN-7015)
Date: 2026-02-26
Target: Thanos Storegateway on gke-infra-growth-dev-useast1
Config: selfHeal: true, prune: true, ServerSideApply=true, RespectIgnoreDifferences=true. PVC and STS VCT fields ignored via ignoreDifferences.
Related: ADR-0001 | Incident runbook
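For orientation, the configuration listed above corresponds roughly to an Application spec fragment like the one below. This is a sketch rather than the deployed manifest: the file name and the exact ignored paths are assumptions, since this report only states that PVC and STS VCT fields are ignored.

```shell
# Write an illustrative Application spec fragment to a scratch file.
# The file name and the ignored paths are assumptions, not the deployed config.
cat > thanos-app-fragment.yaml <<'EOF'
spec:
  syncPolicy:
    automated:
      selfHeal: true
      prune: true
    syncOptions:
      - ServerSideApply=true
      - RespectIgnoreDifferences=true
  ignoreDifferences:
    # StatefulSet volumeClaimTemplates (immutable after creation)
    - group: apps
      kind: StatefulSet
      jqPathExpressions:
        - .spec.volumeClaimTemplates[]?.spec.resources
    # Requested PVC size, so a manual resize does not register as drift
    - group: ""
      kind: PersistentVolumeClaim
      jsonPointers:
        - /spec/resources/requests/storage
EOF
```

RespectIgnoreDifferences=true matters here because, without it, ignoreDifferences only affects the diff and a sync would still write the ignored fields.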
Test 1: Anti-pattern -- patching without disabling sync
Goal: prove the race condition between kubectl and selfHeal.
1a. Patching a field ArgoCD does NOT own (resources, absent from Helm template)
- Patched the STS to add memory requests (field not in Helm template):
- ArgoCD showed Synced / Progressing within 4 seconds
- Pod came up with new resources after ~60 seconds
- Waited 3+ minutes -- selfHeal did not revert the change
- ArgoCD stayed Synced + Healthy the entire time
Result: fields not in the Helm template are invisible to selfHeal under SSA.
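The original patch command was not preserved in this report. A minimal sketch of an equivalent patch is shown below, wrapped in functions so nothing runs until called. The namespace, StatefulSet name, and container name are hypothetical placeholders.

```shell
# Hypothetical identifiers; substitute the real namespace, StatefulSet, and container.
NS="monitoring"
STS="thanos-storegateway"
CONTAINER="storegateway"

# Add a memory request with a strategic merge patch.
# Containers are merged by "name", so only the named container is touched.
patch_memory_request() {
  patch=$(printf '{"spec":{"template":{"spec":{"containers":[{"name":"%s","resources":{"requests":{"memory":"%s"}}}]}}}}' \
    "$CONTAINER" "$1")
  kubectl -n "$NS" patch statefulset "$STS" --type strategic -p "$patch"
}

# Dump the StatefulSet including metadata.managedFields, which records
# which field manager owns which fields under server-side apply.
show_managed_fields() {
  kubectl -n "$NS" get statefulset "$STS" -o yaml --show-managed-fields
}
```

Calling `patch_memory_request 2Gi` (the value is an example) and then `show_managed_fields` should make the ownership split visible: the patched field would be listed under the kubectl manager rather than under the ArgoCD one.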
1b. Patching a field ArgoCD DOES own (image tag, defined in Helm template)
- Patched the STS to change the image tag:
- ArgoCD detected OutOfSync within ~2 seconds
- selfHeal reverted the image back to v0.34.1 within 10 seconds
- Pod was recreated with the original image
Result: template-defined fields are reverted almost immediately.
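One way to time a revert like this is to poll the pod template image until it returns to the expected value. The sketch below assumes hypothetical namespace and StatefulSet names and is not the method used during the test.

```shell
# Hypothetical identifiers; substitute the real namespace and StatefulSet.
NS="monitoring"
STS="thanos-storegateway"

# Print the image of the first container in the StatefulSet pod template.
current_image() {
  kubectl -n "$NS" get statefulset "$STS" \
    -o jsonpath='{.spec.template.spec.containers[0].image}'
}

# Poll once per second until the image equals $1 or $2 seconds have elapsed
# (default 60). Prints the seconds waited; returns 1 on timeout.
wait_for_image() {
  expected="$1"
  timeout="${2:-60}"
  waited=0
  while [ "$(current_image)" != "$expected" ]; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  echo "$waited"
}
```

After changing the tag (for example with `kubectl set image`), running `wait_for_image "<repository>:v0.34.1" 60` reports roughly how long selfHeal took to restore the template. The repository part of the image is left as a placeholder because it is not stated here.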
Test 2: Golden Rule -- disable sync, patch, reconcile
Goal: validate the recommended incident workflow end to end.
Blocker discovered: the first attempt to disable sync was overridden by the ApplicationSet controller within ~3 seconds:
argocd app set thanos --sync-policy none
# returned successfully, but sync policy was immediately re-applied by the ApplicationSet controller
Fix: deployed ignoreApplicationDifferences with jsonPointers: [/spec/syncPolicy] on both infra-cluster-infra-apps and core-cluster-infra-apps ApplicationSets. After deploying, sync disable persisted.
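The fix amounts to a small addition to each ApplicationSet spec. The fragment below is a sketch of that addition. The file name and the `argocd` namespace used in the check are assumptions.

```shell
# ApplicationSet spec fragment: tell the ApplicationSet controller to leave
# /spec/syncPolicy alone on the Applications it generates.
# The file name is hypothetical; merge the fragment into the real manifests.
cat > appset-ignore-syncpolicy.yaml <<'EOF'
spec:
  ignoreApplicationDifferences:
    - jsonPointers:
        - /spec/syncPolicy
EOF

# Print the setting as stored on a live ApplicationSet.
# Assumption: ApplicationSets live in the "argocd" namespace.
show_appset_ignores() {
  kubectl -n argocd get applicationset "$1" \
    -o jsonpath='{.spec.ignoreApplicationDifferences}'
}
```

Running `show_appset_ignores infra-cluster-infra-apps` and `show_appset_ignores core-cluster-infra-apps` would confirm the setting is present on both ApplicationSets before relying on a manual sync disable.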
After the fix:
- Disabled auto-sync:
- Patched memory resources on the STS:
- Waited 30 seconds -- the patched resources persisted, no revert. ArgoCD showed Synced (non-owned field under SSA)
- Patched image to test an owned field:
- ArgoCD showed OutOfSync but did not revert -- sync was disabled. Waited 15 seconds to confirm.
- Reverted in-cluster state to match Git:
- Re-enabled auto-sync:
- Final state: Synced + Healthy, zero drift, zero pod restarts during reconciliation
Result: ~2 minutes to fix. Workflow is safe with ignoreApplicationDifferences in place.
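The sync-related steps of this workflow can be sketched with the argocd CLI as below. The application name `thanos` is taken from the command shown earlier. Using a manual `argocd app sync` for the revert step is an assumption, since the report does not record how in-cluster state was brought back in line with Git.

```shell
APP="thanos"

# Step 1: turn off automated sync so manual patches are not reverted.
disable_autosync() {
  argocd app set "$APP" --sync-policy none
}

# Step 2 (after the incident): bring the cluster back in line with Git.
# Assumption: a manual sync; a reverse kubectl patch would also work.
reconcile_to_git() {
  argocd app sync "$APP"
}

# Step 3: restore automated sync with selfHeal and prune, matching the test config.
enable_autosync() {
  argocd app set "$APP" --sync-policy automated --self-heal --auto-prune
}
```

The intended order is `disable_autosync`, then the incident patches, then `reconcile_to_git`, then `enable_autosync`. Reconciling before re-enabling keeps selfHeal from triggering an unplanned rollout the moment sync comes back.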
Test 3: PVC resize with sync enabled
Goal: confirm ignoreDifferences covers PVC resize without any ArgoCD interference.
- Patched the PVC with auto-sync enabled:
- PVC showed Resizing + FileSystemResizePending conditions within 5 seconds. Capacity still reported 8Gi.
- ArgoCD stayed Synced + Healthy -- ignoreDifferences prevented it from seeing PVC drift.
- Deleted the pod to trigger filesystem expansion (GKE standard-rwo requires a remount):
- Pod came back after ~36 seconds. PVC capacity now showed 10Gi.
- ArgoCD still Synced + Healthy.
Result: PVC resize is invisible to ArgoCD. No sync disable needed. Pod restart required on GKE for filesystem expansion.
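The resize steps above can be sketched as follows. The namespace, PVC name, and pod name are hypothetical placeholders, and the functions only run when called.

```shell
# Hypothetical identifiers; substitute the real namespace, PVC, and pod.
NS="monitoring"
PVC="data-thanos-storegateway-0"
POD="thanos-storegateway-0"

# Request a larger volume, e.g. resize_pvc 10Gi.
resize_pvc() {
  patch=$(printf '{"spec":{"resources":{"requests":{"storage":"%s"}}}}' "$1")
  kubectl -n "$NS" patch pvc "$PVC" --type merge -p "$patch"
}

# Condition types currently set on the PVC (e.g. FileSystemResizePending).
pvc_conditions() {
  kubectl -n "$NS" get pvc "$PVC" -o jsonpath='{.status.conditions[*].type}'
}

# Capacity as reported in status; changes only after the filesystem is expanded.
pvc_capacity() {
  kubectl -n "$NS" get pvc "$PVC" -o jsonpath='{.status.capacity.storage}'
}

# Delete the pod so the StatefulSet recreates it and the volume is remounted.
restart_pod() {
  kubectl -n "$NS" delete pod "$POD"
}
```

A typical sequence would be `resize_pvc 10Gi`, `pvc_conditions` until the pending condition appears, `restart_pod`, and finally `pvc_capacity` to confirm the new size.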
Test 4: VolumeClaimTemplate update via cascade=orphan
Goal: time the immutable-field resize procedure. PVC was already 10Gi from Test 3, but the STS VCT still said 8Gi (chart default).
- Disabled auto-sync:
- Deleted the StatefulSet, keeping pods alive:
- Pod kept running (14 minutes uptime, zero restarts). PVC stayed bound at 10Gi.
- ArgoCD showed OutOfSync / Missing (expected -- STS was gone).
- Re-enabled auto-sync:
- ArgoCD recreated the STS within ~15 seconds. The STS controller adopted the orphaned pod by name -- same pod, now 15 minutes uptime, no restart.
- ArgoCD showed Synced + Healthy. Zero drift.
Result: ~2 minutes total. Zero downtime. STS controller adopts orphaned pods by name convention.
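A sketch of the procedure follows, assuming hypothetical namespace, StatefulSet, and pod names. The application name `thanos` comes from the earlier command. It chains the three mutating steps without the observation pauses taken during the test.

```shell
# Hypothetical identifiers, except APP, which matches the earlier argocd command.
NS="monitoring"
STS="thanos-storegateway"
POD="thanos-storegateway-0"
APP="thanos"

# Delete only the StatefulSet object; its pods and PVCs are left in place.
orphan_delete_sts() {
  kubectl -n "$NS" delete statefulset "$STS" --cascade=orphan
}

# Restart counts of the pod's containers; should be unchanged across the procedure.
pod_restart_counts() {
  kubectl -n "$NS" get pod "$POD" \
    -o jsonpath='{.status.containerStatuses[*].restartCount}'
}

# Disable sync, orphan-delete the StatefulSet, then let ArgoCD recreate it.
# Stops at the first failing step.
recreate_sts_via_orphan() {
  argocd app set "$APP" --sync-policy none &&
    orphan_delete_sts &&
    argocd app set "$APP" --sync-policy automated --self-heal --auto-prune
}
```

Comparing `pod_restart_counts` before and after `recreate_sts_via_orphan` gives a quick check that the recreated StatefulSet adopted the existing pod instead of replacing it.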
Summary
| Scenario | Time | Downtime | Viable? |
|---|---|---|---|
| Anti-pattern: patch owned field, sync enabled | ~10s revert | Risk of cascading restarts | No |
| Anti-pattern: patch non-owned field, sync enabled | Instant, sticks | None | Risky |
| Golden Rule: disable sync + patch | ~2 min | None | Yes |
| PVC resize: patch pvc, sync enabled | ~30s + pod restart | Brief | Yes |
| Cascade=orphan: VCT update | ~2 min | None | Yes |
Config changes made during testing
- ignoreApplicationDifferences added to infra-cluster-infra-apps and core-cluster-infra-apps ApplicationSets (dev + prod)
- ArgoCD version confirmed as v3.2.5 (not v2.6.7 from local CLI)