ArgoCD StatefulSet incident test results (DEVOPSBLN-7015)

Date: 2026-02-26
Target: Thanos Storegateway on gke-infra-growth-dev-useast1
Config: selfHeal: true, prune: true, ServerSideApply=true, RespectIgnoreDifferences=true. PVC and STS volumeClaimTemplates (VCT) fields are ignored via ignoreDifferences.

Related: ADR-0001 | Incident runbook


Test 1: Anti-pattern -- patching without disabling sync

Goal: prove the race condition between kubectl and selfHeal.

1a. Patching a field ArgoCD does NOT own (resources, absent from Helm template)

  1. Patched the STS to add memory requests (field not in Helm template):
    kubectl patch sts thanos-storegateway -n monitoring --type strategic \
      -p '{"spec":{"template":{"spec":{"containers":[{"name":"storegateway","resources":{"requests":{"memory":"256Mi"}}}]}}}}'
    
  2. ArgoCD showed Synced / Progressing within 4 seconds
  3. Pod came up with new resources after ~60 seconds
  4. Waited 3+ minutes -- selfHeal did not revert the change
  5. ArgoCD stayed Synced + Healthy the entire time

Result: fields not in the Helm template are invisible to selfHeal under SSA.
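Field ownership under SSA is recorded in metadata.managedFields, so the result above can be cross-checked by listing the field managers on the StatefulSet. The helper below is a minimal sketch: the function name is hypothetical, and the manager names to look for (something like argocd-controller for ArgoCD and kubectl-patch for the manual patch) are assumptions to confirm against the actual output.

```shell
# Hypothetical helper: print one line per field manager on the StatefulSet,
# as "<manager><TAB><operation>". Resource names are taken from the test above.
list_field_managers() {
  kubectl get sts thanos-storegateway -n monitoring \
    -o jsonpath='{range .metadata.managedFields[*]}{.manager}{"\t"}{.operation}{"\n"}{end}'
}
```

Running list_field_managers before and after the patch should show whether the manual change was recorded under a manager other than ArgoCD's.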

1b. Patching a field ArgoCD DOES own (image tag, defined in Helm template)

  1. Patched the STS to change the image tag:
    kubectl patch sts thanos-storegateway -n monitoring --type strategic \
      -p '{"spec":{"template":{"spec":{"containers":[{"name":"storegateway","image":"quay.io/thanos/thanos:v0.34.0"}]}}}}'
    
  2. ArgoCD detected OutOfSync within ~2 seconds
  3. selfHeal reverted the image back to v0.34.1 within 10 seconds
  4. Pod was recreated with the original image

Result: template-defined fields are reverted almost immediately.
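To put a number on the revert, one option is to poll the image field right after patching. This is a sketch only: wait_for_image is a hypothetical helper, and it assumes the container is named storegateway as in the patch above.

```shell
# Hypothetical helper: poll the storegateway container image until it equals
# the expected value, then print the elapsed time.
# Usage: wait_for_image <expected-image> [timeout-seconds]
wait_for_image() {
  expected="$1"
  timeout="${2:-60}"
  start=$(date +%s)
  while :; do
    current=$(kubectl get sts thanos-storegateway -n monitoring \
      -o jsonpath='{.spec.template.spec.containers[?(@.name=="storegateway")].image}')
    elapsed=$(( $(date +%s) - start ))
    if [ "$current" = "$expected" ]; then
      echo "image is $expected after ${elapsed}s"
      return 0
    fi
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "timed out after ${elapsed}s, image is still $current" >&2
      return 1
    fi
    sleep 1
  done
}
```

For this test the call would be wait_for_image quay.io/thanos/thanos:v0.34.1, started immediately after the patch.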


Test 2: Golden Rule -- disable sync, patch, reconcile

Goal: validate the recommended incident workflow end to end.

Blocker discovered: the first attempt to disable sync was overridden by the ApplicationSet controller within ~3 seconds:

argocd app set thanos --sync-policy none
# returned successfully, but sync policy was immediately re-applied by the ApplicationSet controller

Fix: deployed ignoreApplicationDifferences with jsonPointers: [/spec/syncPolicy] on both infra-cluster-infra-apps and core-cluster-infra-apps ApplicationSets. After deploying, sync disable persisted.
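Before relying on the fix during an incident, it may be worth confirming that both ApplicationSets actually carry the setting. The sketch below assumes the ApplicationSets live in the argocd namespace, which is not stated in this report; the function name is hypothetical.

```shell
# Hypothetical helper: print the ignoreApplicationDifferences JSON pointers
# configured on each ApplicationSet.
# Assumption: the ApplicationSets are in the "argocd" namespace.
check_appset_ignores() {
  for appset in infra-cluster-infra-apps core-cluster-infra-apps; do
    pointers=$(kubectl get applicationset "$appset" -n argocd \
      -o jsonpath='{.spec.ignoreApplicationDifferences[*].jsonPointers[*]}')
    echo "$appset: ${pointers:-<none>}"
  done
}
```

Each line of output should include /spec/syncPolicy; a line showing <none> would mean the sync disable can still be overridden for applications generated by that ApplicationSet.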

After the fix:

  1. Disabled auto-sync:
    argocd app set thanos --sync-policy none --grpc-web
    
  2. Patched memory resources on the STS:
    kubectl patch sts thanos-storegateway -n monitoring --type strategic \
      -p '{"spec":{"template":{"spec":{"containers":[{"name":"storegateway","resources":{"requests":{"memory":"1Gi"},"limits":{"memory":"2Gi"}}}]}}}}'
    
  3. Waited 30 seconds -- the new resources persisted, no revert. ArgoCD showed Synced (non-owned field under SSA)
  4. Patched image to test an owned field:
    kubectl patch sts thanos-storegateway -n monitoring --type strategic \
      -p '{"spec":{"template":{"spec":{"containers":[{"name":"storegateway","image":"quay.io/thanos/thanos:v0.34.0"}]}}}}'
    
  5. ArgoCD showed OutOfSync but did not revert -- sync was disabled. Waited 15 seconds to confirm.
  6. Reverted in-cluster state to match Git:
    kubectl patch sts thanos-storegateway -n monitoring --type strategic \
      -p '{"spec":{"template":{"spec":{"containers":[{"name":"storegateway","image":"quay.io/thanos/thanos:v0.34.1","resources":null}]}}}}'
    
  7. Re-enabled auto-sync:
    argocd app set thanos --sync-policy automated --self-heal --auto-prune
    
  8. Final state: Synced + Healthy, zero drift, zero pod restarts during reconciliation

Result: ~2 minutes end to end. The workflow is safe once ignoreApplicationDifferences is in place.
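The last two steps of the workflow depend on ordering: if auto-sync is re-enabled while owned fields still differ from Git, selfHeal will revert them as in Test 1b. A possible guard is sketched below. The function name is hypothetical, and it assumes the Application object is named thanos and lives in the argocd namespace; the reported status may also lag until ArgoCD refreshes the application.

```shell
# Hypothetical guard: re-enable auto-sync only if the Application already
# reports Synced, so selfHeal has nothing to revert.
# Assumptions: Application "thanos" in namespace "argocd".
reenable_sync_if_clean() {
  status=$(kubectl get applications.argoproj.io thanos -n argocd \
    -o jsonpath='{.status.sync.status}')
  if [ "$status" != "Synced" ]; then
    echo "refusing to re-enable auto-sync: status is ${status:-unknown}" >&2
    return 1
  fi
  argocd app set thanos --sync-policy automated --self-heal --auto-prune
}
```

The guard only covers fields ArgoCD compares; non-owned fields such as the resources patch still need to be reverted or committed to Git by hand.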


Test 3: PVC resize with sync enabled

Goal: confirm ignoreDifferences covers PVC resize without any ArgoCD interference.

  1. Patched the PVC with auto-sync enabled:
    kubectl patch pvc data-thanos-storegateway-0 -n monitoring \
      -p '{"spec":{"resources":{"requests":{"storage":"10Gi"}}}}'
    
  2. PVC showed Resizing + FileSystemResizePending conditions within 5 seconds. Capacity still reported 8Gi.
  3. ArgoCD stayed Synced + Healthy -- ignoreDifferences prevented it from seeing PVC drift.
  4. Deleted the pod to trigger filesystem expansion (GKE standard-rwo requires a remount):
    kubectl delete pod thanos-storegateway-0 -n monitoring
    
  5. Pod came back after ~36 seconds. PVC capacity now showed 10Gi.
  6. ArgoCD still Synced + Healthy.

Result: PVC resize is invisible to ArgoCD. No sync disable needed. Pod restart required on GKE for filesystem expansion.
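The PVC status fields observed in steps 2 and 5 can be read directly to decide whether the pod restart is still outstanding. This is a minimal sketch with a hypothetical function name; it only inspects the capacity and condition types mentioned above.

```shell
# Hypothetical helper: report the PVC's current capacity and whether a
# filesystem resize is still waiting for the pod to be restarted.
pvc_resize_state() {
  pvc="data-thanos-storegateway-0"
  capacity=$(kubectl get pvc "$pvc" -n monitoring \
    -o jsonpath='{.status.capacity.storage}')
  conditions=$(kubectl get pvc "$pvc" -n monitoring \
    -o jsonpath='{.status.conditions[*].type}')
  case " $conditions " in
    *" FileSystemResizePending "*)
      echo "capacity=$capacity, restart the pod to finish the resize" ;;
    *)
      echo "capacity=$capacity, no resize pending" ;;
  esac
}
```

In this test it would be expected to report 8Gi with a pending resize after step 2, and 10Gi with nothing pending after step 5.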


Test 4: VolumeClaimTemplate update via cascade=orphan

Goal: time the immutable-field resize procedure. PVC was already 10Gi from Test 3, but the STS VCT still said 8Gi (chart default).

  1. Disabled auto-sync:
    argocd app set thanos --sync-policy none --grpc-web
    
  2. Deleted the StatefulSet, keeping pods alive:
    kubectl delete sts thanos-storegateway -n monitoring --cascade=orphan
    
    Pod kept running (14 minutes uptime, zero restarts). PVC stayed bound at 10Gi.
  3. ArgoCD showed OutOfSync / Missing (expected -- STS was gone).
  4. Re-enabled auto-sync:
    argocd app set thanos --sync-policy automated --self-heal --auto-prune
    
  5. ArgoCD recreated the STS within ~15 seconds. The STS controller adopted the orphaned pod by name -- same pod, now 15 minutes uptime, no restart.
  6. ArgoCD showed Synced + Healthy. Zero drift.

Result: ~2 minutes total. Zero downtime. STS controller adopts orphaned pods by name convention.
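Adoption can be confirmed more directly than by uptime, by comparing the pod's owner reference with the UID of the recreated StatefulSet. The helper below is a sketch with a hypothetical name; it assumes the StatefulSet is the pod's first owner reference and that storegateway is the first container in the pod status.

```shell
# Hypothetical helper: verify that the pod's owner reference points at the
# current StatefulSet object, and print the container restart count.
check_pod_adopted() {
  sts_uid=$(kubectl get sts thanos-storegateway -n monitoring \
    -o jsonpath='{.metadata.uid}')
  owner_uid=$(kubectl get pod thanos-storegateway-0 -n monitoring \
    -o jsonpath='{.metadata.ownerReferences[0].uid}')
  restarts=$(kubectl get pod thanos-storegateway-0 -n monitoring \
    -o jsonpath='{.status.containerStatuses[0].restartCount}')
  if [ -n "$sts_uid" ] && [ "$sts_uid" = "$owner_uid" ]; then
    echo "adopted by current StatefulSet, restarts=$restarts"
  else
    echo "not adopted: sts uid=${sts_uid:-none}, owner uid=${owner_uid:-none}" >&2
    return 1
  fi
}
```

Run between steps 2 and 5, the check should fail because the StatefulSet no longer exists; after step 5 it should succeed with an unchanged restart count.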


Summary

Scenario                                          | Time               | Downtime                   | Viable?
--------------------------------------------------|--------------------|----------------------------|--------
Anti-pattern: patch owned field, sync enabled     | ~10s revert        | Risk of cascading restarts | No
Anti-pattern: patch non-owned field, sync enabled | Instant, sticks    | None                       | Risky
Golden Rule: disable sync + patch                 | ~2 min             | None                       | Yes
PVC resize: patch pvc, sync enabled               | ~30s + pod restart | Brief                      | Yes
Cascade=orphan: VCT update                        | ~2 min             | None                       | Yes

Config changes made during testing

  • ignoreApplicationDifferences added to infra-cluster-infra-apps and core-cluster-infra-apps ApplicationSets (dev + prod)
  • ArgoCD server version confirmed as v3.2.5 (v2.6.7 is the local CLI version, not the server's)