
Canary

Archived (pre-2022)

Preserved for reference only -- likely outdated. Last updated: December 2019

For canary deployments we use the Kubernetes operator Flagger.

Flagger is a Kubernetes operator that automates the promotion of canary deployments using NGINX (optionally Istio, Linkerd, App Mesh or Gloo) routing for traffic shifting and Prometheus metrics for canary analysis. The canary analysis can be extended with webhooks for running system integration/acceptance tests, load tests, or any other custom validation.

Flagger takes a Kubernetes deployment and optionally a horizontal pod autoscaler (HPA) and creates a series of objects (Kubernetes deployments, ClusterIP services, virtual service, traffic split or ingress) to drive the canary analysis and promotion.

Flagger implements a control loop that gradually shifts traffic to the canary while measuring key performance indicators like HTTP requests success rate, requests average duration and pods health. Based on the analysis of the KPIs a canary is promoted or aborted, and the analysis result is published to Slack.
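The gradual traffic shifting described above is driven by the canaryAnalysis section of the Canary resource. A minimal sketch with illustrative values (not our production settings):

```yaml
  canaryAnalysis:
    interval: 1m     # how often the analysis loop runs
    threshold: 5     # failed checks allowed before automatic rollback
    maxWeight: 50    # maximum traffic share the canary can receive
    stepWeight: 5    # traffic increment added per successful interval
```

With these values, traffic to the canary grows 5% at a time up to 50%; if five checks fail, Flagger rolls back.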

[Image: flagger-canary-overview.png — Flagger overview diagram]


Install

Flagger requires a Kubernetes cluster v1.11 or newer and NGINX ingress 0.24 or newer. Both our production and staging clusters satisfy these prerequisites; for details, check this link.

Flagger is deployed as a Kubernetes operator via helmfile; the related chart can be found at v0.20.0 (Bitbucket).

# run from bln-k8s-common-helm/helm/config/flagger
helmfile --interactive --environment production_eu_west_1 --file helmfile.yaml apply

To generate traffic during canary analysis, we deploy the flagger-loadtester service with its Helm chart - v0.9.0 (Bitbucket).

# run from bln-k8s-common-helm/helm/config/flagger_loadtester
helmfile --interactive --environment production_eu_west_1 --file helmfile.yaml apply
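After applying both releases, it is worth confirming that the operator and loadtester pods are up. A hedged sketch (the deployment names are assumptions; the kube-system namespace matches the loadtester URL used in the webhook config below):

```
kubectl -n kube-system get deploy flagger
kubectl -n kube-system get deploy flagger-loadtester
```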

Configure

To configure the canary workflow for your project, you simply need to deploy the following Kubernetes resources:

  • A HorizontalPodAutoscaler targeting the deployment that should be scaled. For example, the acp-edge-ui HPA:

# Source: acp-edge-ui/templates/hpa.yaml

apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: acp-edge-ui-main
  namespace: acp-edge-ui
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: acp-edge-ui-main-acp-edge-ui
  minReplicas: 2
  maxReplicas: 4
  metrics:
  - type: Resource
    resource:
      name: cpu
      targetAverageUtilization: 99
  • A Canary resource, which describes which workload to drive the canary deployment workflow for. For example, the acp-edge-ui canary, which points to the deployment, ingress, and autoscaler:

# Source: acp-edge-ui/templates/canary.yaml

apiVersion: flagger.app/v1alpha3
kind: Canary
metadata:
  name: acp-edge-ui-main
  namespace: acp-edge-ui
spec:
  provider: nginx
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: acp-edge-ui-main-acp-edge-ui
  ingressRef:
    apiVersion: extensions/v1beta1
    kind: Ingress
    name: acp-edge-ui-main-acp-edge-ui
  autoscalerRef:
    apiVersion: autoscaling/v2beta1
    kind: HorizontalPodAutoscaler
    name: acp-edge-ui-main
  progressDeadlineSeconds: 300
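For Flagger to generate the ClusterIP services it routes traffic through, the Canary spec also carries a service section; the chart's full manifest is not shown on this page, so this is a hypothetical fragment, with the port assumed from the acceptance-test curl target used in the webhook config:

```yaml
# Hypothetical fragment -- actual chart values not shown on this page
spec:
  service:
    port: 3000   # container port, matching the acceptance-test curl target
```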

Analysis of the canary deployment and traffic routing is performed through Prometheus metrics; the logic is described in this block:

  metrics:
    - name: request-success-rate
      threshold: 99
      interval: 2m
  webhooks:
    - name: acceptance-test
      type: pre-rollout
      url: http://flagger-loadtester.kube-system/
      timeout: 30s
      metadata:
        type: bash
        cmd: "curl -I -L http://acp-edge-ui-main-acp-edge-ui.acp-edge-ui:3000"
    - name: load-test
      url: http://flagger-loadtester.kube-system/
      timeout: 10s
      metadata:
        type: cmd
        cmd: "hey -z 1m -q 10 -c 2 http://acp-edge.fyber.com/"
    - name: promotion approve
      type: confirm-promotion
      url: http://flagger-loadtester.kube-system/gate/check

There are two stages of testing: acceptance tests (run pre-rollout) and load tests (run during the rollout).
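To make the request-success-rate check above concrete, here is a small illustrative sketch (not Flagger's actual code -- Flagger evaluates this from a Prometheus query) of how a 99% threshold gates promotion:

```python
def success_rate(total: int, failed: int) -> float:
    """Percentage of successful requests in the analysis interval."""
    if total == 0:
        return 100.0  # no traffic yet: treat as healthy
    return 100.0 * (total - failed) / total


def passes_check(total: int, failed: int, threshold: float = 99.0) -> bool:
    """True if the canary meets the success-rate threshold."""
    return success_rate(total, failed) >= threshold


# 1000 requests with 5 failures -> 99.5% >= 99, check passes
print(passes_check(1000, 5))   # True
# 1000 requests with 20 failures -> 98.0% < 99, check fails
print(passes_check(1000, 20))  # False
```

In Flagger, a failed check increments the failure counter; once the counter exceeds the analysis threshold, the canary is rolled back.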

Canary → Jenkins

A Flagger canary deployment can be promoted manually or automatically. By default, automatic promotion is enabled via the approve flag in the Jenkins job pipeline, and the canary deployment is promoted to production automatically if all automated tests pass.


For manual approval of a canary deployment, you can use the confirm-rollout and confirm-promotion webhooks. The confirmation rollout hooks are executed before the pre-rollout hooks. Flagger will halt the canary traffic shifting and analysis until the confirm webhook returns HTTP status 200.

      - name: promotion approve
        type: confirm-promotion
        url: http://flagger-loadtester.kube-system/gate/check
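The gate behind confirm-promotion can also be driven directly. A hedged sketch, assuming the loadtester's standard gate endpoints (/gate/open to approve, /gate/close to hold) and the canary name/namespace from the example above:

```
# Approve the promotion manually (normally Jenkins does this)
curl -d '{"name": "acp-edge-ui-main", "namespace": "acp-edge-ui"}' \
  http://flagger-loadtester.kube-system/gate/open
```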

Once this flag is disabled, Flagger processes the deployment as usual until the promotion step; at that point it holds the canary deployment at 5% of traffic until manual approval is sent (this can be done via Jenkins).


The automatic/manual approval logic is configured through a Jenkins shared library (method canaryProcessAction) - canaryProcessAction.groovy (Bitbucket).

Slack

After a successful canary promotion, a notification is sent to Slack.
