ArgoCD StatefulSet incident test results (DEVOPSBLN-7015)

Date: 2026-02-26
Target: Thanos Storegateway on gke-infra-growth-dev-useast1
Config: selfHeal: true, prune: true, ServerSideApply=true, RespectIgnoreDifferences=true. PVC fields and the StatefulSet (STS) volumeClaimTemplates (VCT) fields are ignored via ignoreDifferences.

Related: ADR-0001 | Incident runbook

Test 1: Anti-pattern -- patching without disabling sync

Goal: prove the race condition between kubectl and selfHeal.

1a. Patching a field ArgoCD does NOT own (resources, absent from Helm template)

Patched the STS to add memory requests (field not in Helm template):

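# Strategic merge patch (kubectl's default for built-in types) merges containers by name;
# a JSON merge patch (--type merge) would replace the whole containers list.
kubectl patch sts thanos-storegateway -n monitoring --type strategic \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"storegateway","resources":{"requests":{"memory":"256Mi"}}}]}}}}'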

ArgoCD showed Synced / Progressing within 4 seconds
Pod came up with new resources after ~60 seconds
Waited 3+ minutes -- selfHeal did not revert the change
ArgoCD stayed Synced + Healthy the entire time

Result: fields not in the Helm template are invisible to selfHeal under SSA.
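
One way to see why the patched field is invisible is to list the field managers recorded on the StatefulSet. The following is a minimal sketch, not the procedure used in the test: `list_field_managers` is a hypothetical helper name, and the expectation that ArgoCD appears as `argocd-controller` (its default SSA manager name) is an assumption about this cluster.

```shell
# List each field manager recorded on an object, together with the operation
# it used. Fields written by `kubectl patch` are recorded with operation
# "Update"; fields applied by ArgoCD under SSA are recorded with "Apply".
# Arguments: $1 = resource kind, $2 = resource name, $3 = namespace.
list_field_managers() {
  kubectl get "$1" "$2" -n "$3" --show-managed-fields \
    -o jsonpath='{range .metadata.managedFields[*]}{.manager}{"\t"}{.operation}{"\n"}{end}'
}

# Usage (names taken from the test above):
#   list_field_managers sts thanos-storegateway monitoring
```

If the memory request shows up only under a kubectl manager and not under ArgoCD's manager, that would be consistent with selfHeal ignoring it.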

1b. Patching a field ArgoCD DOES own (image tag, defined in Helm template)

Patched the STS to change the image tag:

# --type strategic merges the containers list by name instead of replacing it
kubectl patch sts thanos-storegateway -n monitoring --type strategic \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"storegateway","image":"quay.io/thanos/thanos:v0.34.0"}]}}}}'

ArgoCD detected OutOfSync within ~2 seconds
selfHeal reverted the image back to v0.34.1 within 10 seconds
Pod was recreated with the original image

Result: template-defined fields are reverted almost immediately.
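
Timings like the ones above can be captured by sampling the Application status once per second while the patch is applied. This is a sketch under assumptions: `watch_sync_status` is a hypothetical helper, and the Application CR is assumed to live in the `argocd` namespace, which the source does not state.

```shell
# Print a timestamped "sync/health" status line once per second.
# Arguments: $1 = Application name, $2 = number of samples to take.
# Assumption: the Application CR lives in the "argocd" namespace.
watch_sync_status() {
  app=$1
  samples=$2
  i=0
  while [ "$i" -lt "$samples" ]; do
    status=$(kubectl get applications.argoproj.io "$app" -n argocd \
      -o jsonpath='{.status.sync.status}/{.status.health.status}')
    printf '%s %s\n' "$(date -u +%H:%M:%S)" "$status"
    i=$((i + 1))
    sleep 1
  done
}

# Usage: start sampling in one terminal, then apply the patch in another.
#   watch_sync_status thanos 30
```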

Test 2: Golden Rule -- disable sync, patch, reconcile

Goal: validate the recommended incident workflow end to end.

Blocker discovered: the first attempt to disable sync was overridden by the ApplicationSet controller within ~3 seconds:

argocd app set thanos --sync-policy none
# returned successfully, but sync policy was immediately re-applied by the ApplicationSet controller

Fix: deployed ignoreApplicationDifferences with jsonPointers: [/spec/syncPolicy] on both infra-cluster-infra-apps and core-cluster-infra-apps ApplicationSets. After deploying, sync disable persisted.
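
For reference, the fix corresponds to an ApplicationSet spec fragment along these lines. The sketch below only prints the stanza so it can be compared with the deployed manifests; `print_ignore_sync_policy_fragment` is a hypothetical helper name, and the generators and template sections of the real ApplicationSets are omitted.

```shell
# Emit the ApplicationSet stanza that tells the ApplicationSet controller to
# leave /spec/syncPolicy alone on the Applications it generates.
# Only ignoreApplicationDifferences is shown; the rest of the spec is omitted.
print_ignore_sync_policy_fragment() {
  cat <<'EOF'
spec:
  ignoreApplicationDifferences:
    - jsonPointers:
        - /spec/syncPolicy
EOF
}

print_ignore_sync_policy_fragment
```

To check a live object, something like `kubectl get applicationset infra-cluster-infra-apps -n argocd -o jsonpath='{.spec.ignoreApplicationDifferences}'` should show the same pointer, assuming the ApplicationSets live in the `argocd` namespace.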

After the fix:

Disabled auto-sync:

argocd app set thanos --sync-policy none --grpc-web

Patched memory resources on the STS:

# --type strategic merges the containers list by name instead of replacing it
kubectl patch sts thanos-storegateway -n monitoring --type strategic \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"storegateway","resources":{"requests":{"memory":"1Gi"},"limits":{"memory":"2Gi"}}}]}}}}'

Waited 30 seconds -- the patched resources persisted with no revert. ArgoCD showed Synced (non-owned field under SSA).

Patched image to test an owned field:

# --type strategic merges the containers list by name instead of replacing it
kubectl patch sts thanos-storegateway -n monitoring --type strategic \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"storegateway","image":"quay.io/thanos/thanos:v0.34.0"}]}}}}'

ArgoCD showed OutOfSync but did not revert -- sync was disabled. Waited 15 seconds to confirm.

Reverted in-cluster state to match Git:

# "resources": null removes the resources block added during the test
kubectl patch sts thanos-storegateway -n monitoring --type strategic \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"storegateway","image":"quay.io/thanos/thanos:v0.34.1","resources":null}]}}}}'

Re-enabled auto-sync:

argocd app set thanos --sync-policy automated --self-heal --auto-prune

Final state: Synced + Healthy, zero drift, zero pod restarts during reconciliation

Result: ~2 minutes to fix. Workflow is safe with ignoreApplicationDifferences in place.
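
The workflow can be wrapped in small helpers so the disable step is verified before any patching starts. This is a sketch, not the team's tooling: the helper names are hypothetical, the `argocd` namespace is assumed, and the persistence check assumes that disabling sync removes the `automated` stanza from the Application spec.

```shell
# Hypothetical wrappers around the two argocd commands used in Test 2.
# Argument: $1 = Application name.
disable_auto_sync() {
  argocd app set "$1" --sync-policy none --grpc-web
}

enable_auto_sync() {
  argocd app set "$1" --sync-policy automated --self-heal --auto-prune --grpc-web
}

# Succeeds only if the Application still has no automated sync policy after a
# short wait. In Test 2 the ApplicationSet controller re-applied the policy
# within ~3 seconds, so 5 seconds is used here as a margin.
# Assumptions: Application CR is in the "argocd" namespace; a disabled policy
# shows up as an absent .spec.syncPolicy.automated field.
sync_disable_persisted() {
  sleep 5
  automated=$(kubectl get applications.argoproj.io "$1" -n argocd \
    -o jsonpath='{.spec.syncPolicy.automated}')
  [ -z "$automated" ]
}

# Usage:
#   disable_auto_sync thanos
#   sync_disable_persisted thanos || echo "sync policy was re-applied; do not patch yet"
```

Re-enable sync with `enable_auto_sync thanos` only after the in-cluster state matches Git again, as in the test above.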

Test 3: PVC resize with sync enabled

Goal: confirm ignoreDifferences covers PVC resize without any ArgoCD interference.

Patched the PVC with auto-sync enabled:

kubectl patch pvc data-thanos-storegateway-0 -n monitoring \
  -p '{"spec":{"resources":{"requests":{"storage":"10Gi"}}}}'

PVC showed Resizing + FileSystemResizePending conditions within 5 seconds. Capacity still reported 8Gi.
ArgoCD stayed Synced + Healthy -- ignoreDifferences prevented it from seeing PVC drift.
Deleted the pod to trigger filesystem expansion (GKE standard-rwo requires a remount):
```
kubectl delete pod thanos-storegateway-0 -n monitoring
```
Pod came back after ~36 seconds. PVC capacity now showed 10Gi.
ArgoCD still Synced + Healthy.

Result: PVC resize is invisible to ArgoCD. No sync disable needed. Pod restart required on GKE for filesystem expansion.
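
Because the reported capacity lags behind the request until the pod is restarted, it can help to poll the PVC rather than check it by hand. A minimal sketch, assuming a hypothetical helper named `wait_for_pvc_capacity` and a 2-second poll interval chosen arbitrarily:

```shell
# Poll a PVC until .status.capacity.storage equals the requested size or the
# attempt budget runs out. Returns 0 on success, 1 on timeout.
# Arguments: $1 = PVC name, $2 = namespace, $3 = expected size,
#            $4 = max attempts (optional, default 60).
wait_for_pvc_capacity() {
  pvc=$1
  namespace=$2
  want=$3
  attempts=${4:-60}
  have=""
  while [ "$attempts" -gt 0 ]; do
    have=$(kubectl get pvc "$pvc" -n "$namespace" \
      -o jsonpath='{.status.capacity.storage}')
    if [ "$have" = "$want" ]; then
      echo "capacity is $have"
      return 0
    fi
    attempts=$((attempts - 1))
    sleep 2
  done
  echo "timed out; capacity is still $have" >&2
  return 1
}

# Usage (after deleting the pod):
#   wait_for_pvc_capacity data-thanos-storegateway-0 monitoring 10Gi
```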

Test 4: VolumeClaimTemplate update via cascade=orphan

Goal: time the procedure for changing the immutable volumeClaimTemplates (VCT) field. The PVC was already 10Gi from Test 3, but the STS VCT still said 8Gi (chart default).

Disabled auto-sync:

argocd app set thanos --sync-policy none --grpc-web

Deleted the StatefulSet, keeping pods alive:
```
kubectl delete sts thanos-storegateway -n monitoring --cascade=orphan
```
Pod kept running (14 minutes uptime, zero restarts). PVC stayed bound at 10Gi.
ArgoCD showed OutOfSync / Missing (expected -- STS was gone).

Re-enabled auto-sync:

argocd app set thanos --sync-policy automated --self-heal --auto-prune

ArgoCD recreated the STS within ~15 seconds. The STS controller adopted the orphaned pod by name -- same pod, now 15 minutes uptime, no restart.
ArgoCD showed Synced + Healthy. Zero drift.

Result: ~2 minutes total. Zero downtime. The STS controller adopts orphaned pods whose labels match its selector and whose names follow the <statefulset>-<ordinal> convention.
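
Uptime and restart count are indirect evidence of adoption. A more direct check is to compare UIDs before and after the procedure: an adopted pod keeps its own UID, while its owner reference points at the newly created StatefulSet. The helpers below are a hypothetical sketch of that comparison, not something run during the test.

```shell
# Print the pod's UID and the UID of its first owner reference, separated by
# a space. Arguments: $1 = pod name, $2 = namespace.
pod_and_owner_uid() {
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{.metadata.uid} {.metadata.ownerReferences[0].uid}'
}

# Compare two snapshots from pod_and_owner_uid. Succeeds when the pod UID is
# unchanged (same pod) and the owner UID differs (new StatefulSet object).
# Arguments: $1 = snapshot taken before deleting the STS, $2 = snapshot after.
was_adopted() {
  before_pod=${1%% *}
  before_owner=${1#* }
  after_pod=${2%% *}
  after_owner=${2#* }
  [ "$before_pod" = "$after_pod" ] && [ "$before_owner" != "$after_owner" ]
}

# Usage:
#   before=$(pod_and_owner_uid thanos-storegateway-0 monitoring)
#   ... delete the STS with --cascade=orphan, re-enable sync, wait for Synced ...
#   after=$(pod_and_owner_uid thanos-storegateway-0 monitoring)
#   was_adopted "$before" "$after" && echo "pod adopted without recreation"
```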

Summary

| Scenario | Time | Downtime | Viable? |
|---|---|---|---|
| Anti-pattern: patch owned field, sync enabled | ~10s revert | Risk of cascading restarts | No |
| Anti-pattern: patch non-owned field, sync enabled | Instant, sticks | None | Risky |
| Golden Rule: disable sync + patch | ~2 min | None | Yes |
| PVC resize: patch pvc, sync enabled | ~30s + pod restart | Brief | Yes |
| Cascade=orphan: VCT update | ~2 min | None | Yes |

Config changes made during testing

ignoreApplicationDifferences added to infra-cluster-infra-apps and core-cluster-infra-apps ApplicationSets (dev + prod)
ArgoCD server version confirmed as v3.2.5 (the v2.6.7 seen earlier was the local CLI version, not the server)