Canary deployments are a popular way to reduce the risk of deploying software by exposing a new version to a small subset of traffic before rolling it out more broadly. Building the machinery for this kind of controlled rollout, monitoring for problems, and rolling back when necessary can be difficult. Flagger is a CNCF incubating project that takes care of much of the undifferentiated heavy lifting of canary deployments on Kubernetes. It supports canary and blue/green deployments, and even A/B testing, for a number of ingress controllers and service meshes. Additionally, Flagger works with CI/CD tools that deploy to Kubernetes, as it kicks off each canary rollout once a Deployment resource has been updated in the Kubernetes API server.
In this post, we explain how to perform canary deployments on Kubernetes using Flagger to orchestrate the rollout, promotion, and rollback of deployments. We take advantage of Flagger’s built-in Prometheus support to use Amazon Managed Service for Prometheus, a Prometheus-compatible monitoring service, to handle metrics. Also, we use its integration with AWS App Mesh for traffic control. Finally, we show how to observe the canary deployment process with Amazon Managed Grafana, a fully managed service for open source Grafana developed in collaboration with Grafana Labs, using the pre-created Grafana dashboards provided by the Flagger project.
Overview
The following diagram demonstrates the high-level interaction between each component discussed in the post. During a deployment, the numbered arrows show how all the components work together to create a feedback cycle to govern the rollout process.
Let’s briefly walk through each step in the process:
- Traffic arrives from the internet at a Network Load Balancer, and is sent to the App Mesh Ingress Gateway, composed of a Kubernetes deployment of Envoy proxies.
- According to the weights configured in App Mesh, traffic is distributed between the primary and canary application versions, starting with a small percentage.
- The AWS Distro for OpenTelemetry Collector, deployed as a DaemonSet, collects metrics about the behavior of both primary and canary versions. In the example, we will use metrics from the Envoy proxy sidecars deployed by App Mesh to monitor basic stats, such as HTTP status codes, but it is possible to use metrics from any Prometheus exporter.
- Metrics are sent to Amazon Managed Service for Prometheus using the Prometheus remote write API.
- Metrics are monitored by Flagger through the Prometheus query API.
- If the metrics are within specified thresholds, Flagger updates App Mesh custom resource definitions (CRDs) in the Kubernetes API server, which the App Mesh Controller will propagate to the service mesh.
- The Envoy proxies in the App Mesh Ingress Gateway will receive updated configuration from App Mesh, causing them to shift more traffic to the canary version.
- Throughout the process, all metrics can be monitored and visualized in Amazon Managed Grafana.
Now that we know what the end state will be like, let’s walk through the steps for setting up a complete example.
Prerequisites
Before beginning, the following tools must be installed and configured (all of them are used later in the walkthrough):
- The AWS CLI
- eksctl
- kubectl
- Helm
- awscurl
We will use a Bash shell, which is available on macOS and Linux, but the specific commands required will be different on Windows.
Walkthrough
Launch an Amazon EKS cluster
For this walkthrough, we will be working in the us-east-1 Region, which is one of many Regions in which all required services are available. First, we will prepare a configuration file for use with eksctl, and then we can launch an Amazon Elastic Kubernetes Service (Amazon EKS) cluster with the necessary AWS infrastructure. We will conclude this step by installing the App Mesh Controller and creating a mesh to enable traffic shifting.
Let’s create an AWS Identity and Access Management (IAM) policy for the AWS Distro for OpenTelemetry Collector to access Amazon Managed Service for Prometheus:
cat <<EOF > PermissionPolicyIngest.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "aps:RemoteWrite",
        "aps:GetSeries",
        "aps:GetLabels",
        "aps:GetMetricMetadata"
      ],
      "Resource": "*"
    }
  ]
}
EOF
INGEST_POLICY_ARN=$(aws iam create-policy --policy-name AMPIngest \
--policy-document file://PermissionPolicyIngest.json \
--query 'Policy.Arn' --output text)
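As an optional sanity check, we can confirm that the policy exists and that its ARN was captured in the environment variable:
echo "$INGEST_POLICY_ARN"
aws iam get-policy --policy-arn "$INGEST_POLICY_ARN"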
Prepare a configuration file for eksctl:
cat <<EOF > cluster.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: mycluster
  version: "1.21"
  region: us-east-1
managedNodeGroups:
  - name: managed-ng-1
    instanceType: t3.medium
    minSize: 1
    maxSize: 2
    desiredCapacity: 2
    iam:
      attachPolicyARNs:
        - "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy"
        - "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy"
        - "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly"
        - "arn:aws:iam::aws:policy/AWSAppMeshEnvoyAccess"
iam:
  withOIDC: true
  serviceAccounts:
    - metadata:
        name: appmesh-controller
        namespace: appmesh-system
      attachPolicyARNs:
        - "arn:aws:iam::aws:policy/AWSCloudMapFullAccess"
        - "arn:aws:iam::aws:policy/AWSAppMeshFullAccess"
    - metadata:
        name: amp-iamproxy-ingest-service-account
        namespace: adot-col
      attachPolicyARNs:
        - "$INGEST_POLICY_ARN"
    - metadata:
        name: flagger-service-account
        namespace: flagger-system
      attachPolicyARNs:
        - "arn:aws:iam::aws:policy/AmazonPrometheusQueryAccess"
EOF
Launch the cluster, which will take a few minutes. Note that both the Namespaces and ServiceAccounts specified in the configuration file will be precreated in the cluster:
eksctl create cluster -f cluster.yaml
List nodes to make sure that the cluster is accessible:
kubectl get nodes
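Optionally, we can also verify that eksctl precreated the service accounts listed in the configuration file and annotated them with IAM roles (IRSA). Each of the following should print an eks.amazonaws.com/role-arn annotation; the role names themselves will differ per account:
kubectl -n appmesh-system get serviceaccount appmesh-controller -o yaml | grep role-arn
kubectl -n adot-col get serviceaccount amp-iamproxy-ingest-service-account -o yaml | grep role-arn
kubectl -n flagger-system get serviceaccount flagger-service-account -o yaml | grep role-arn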
Once the cluster is available, install the required App Mesh Controller:
kubectl apply -k github.com/aws/eks-charts/stable/appmesh-controller//crds?ref=master
helm repo add eks https://aws.github.io/eks-charts
helm repo update
helm upgrade -i appmesh-controller eks/appmesh-controller \
--namespace appmesh-system \
--set region=us-east-1 \
--set serviceAccount.create=false \
--set serviceAccount.name=appmesh-controller
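The controller can take a minute to start. As a quick check that it is running and that the App Mesh CRDs were registered (a sketch, assuming the default deployment name from the chart):
kubectl -n appmesh-system rollout status deployment appmesh-controller
kubectl get crds | grep appmesh.k8s.aws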
Create a mesh:
cat << EOF | kubectl apply -f -
apiVersion: appmesh.k8s.aws/v1beta2
kind: Mesh
metadata:
  name: global
spec:
  namespaceSelector:
    matchLabels:
      appmesh.k8s.aws/sidecarInjectorWebhook: enabled
EOF
Confirm that the mesh appears as expected in the App Mesh service. We should see output showing an ACTIVE status:
aws appmesh describe-mesh --mesh-name global
Set up Amazon Managed Service for Prometheus
Now that the Amazon EKS cluster is ready, we can set up Amazon Managed Service for Prometheus and related tools to enable the canary deployment process. This will include installing the AWS Distro for OpenTelemetry Collector using the OpenTelemetry Operator, enabling the scrape configurations needed to collect metrics for Flagger, and configuring Amazon Managed Service for Prometheus as a remote write endpoint.
Let’s start by creating an Amazon Managed Service for Prometheus workspace:
WORKSPACE_ID=$(aws amp create-workspace --alias myworkspace \
--query workspaceId --output text)
Now wait for the workspace to have an active status, as shown by this command:
aws amp describe-workspace --workspace-id $WORKSPACE_ID
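If we would rather block until the workspace is ready than re-run the command by hand, a small polling loop works; the statusCode field should read ACTIVE once the workspace is usable:
until [ "$(aws amp describe-workspace --workspace-id "$WORKSPACE_ID" \
    --query 'workspace.status.statusCode' --output text)" = "ACTIVE" ]; do
  echo "Waiting for workspace to become ACTIVE..."
  sleep 5
done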
Next we can install the OpenTelemetry Operator and cert-manager, on which it depends:
helm repo add jetstack https://charts.jetstack.io
helm repo update
kubectl apply -f https://github.com/jetstack/cert-manager/releases/download/v1.4.0/cert-manager.crds.yaml
helm install cert-manager jetstack/cert-manager \
--namespace cert-manager --create-namespace --version v1.4.0
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update
helm install opentelemetry-operator open-telemetry/opentelemetry-operator
Check that it was installed correctly:
kubectl get pod -n opentelemetry-operator-system
We should be shown something like the following:
NAME READY STATUS RESTARTS AGE
opentelemetry-operator-controller-manager-74fb5cd456-x89l4 2/2 Running 0 94m
Now that the operator is deployed, we can create the AWS Distro for OpenTelemetry Collector instance. Let’s start by getting the Prometheus endpoint to write to:
PROM_ENDPOINT=$(aws amp describe-workspace --workspace-id $WORKSPACE_ID \
--query 'workspace.prometheusEndpoint' --output text)
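The endpoint value ends with a trailing slash, which is why the remote write and query paths used later in this post are appended without a leading slash. We can confirm the value looks right before continuing:
echo "$PROM_ENDPOINT"
# Expected form (the workspace ID will differ):
# https://aps-workspaces.us-east-1.amazonaws.com/workspaces/<workspace-id>/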
Now let’s give the precreated AWS Distro for OpenTelemetry Collector Service Account permission to access various metrics resources:
kubectl apply -f - <<EOF
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: adotcol-admin-role
rules:
  - apiGroups: [""]
    resources:
      - nodes
      - nodes/proxy
      - services
      - endpoints
      - pods
    verbs: ["get", "list", "watch"]
  - apiGroups:
      - extensions
    resources:
      - ingresses
    verbs: ["get", "list", "watch"]
  - nonResourceURLs: ["/metrics"]
    verbs: ["get"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: adotcol-admin-role-binding
subjects:
  - kind: ServiceAccount
    name: amp-iamproxy-ingest-service-account
    namespace: adot-col
roleRef:
  kind: ClusterRole
  name: adotcol-admin-role
  apiGroup: rbac.authorization.k8s.io
EOF
Then we can create the AWS Distro for OpenTelemetry Collector instance:
kubectl apply -f - <<EOF
kind: OpenTelemetryCollector
apiVersion: opentelemetry.io/v1alpha1
metadata:
  name: my-collector
  namespace: adot-col
spec:
  image: "public.ecr.aws/aws-observability/aws-otel-collector:v0.11.0"
  serviceAccount: amp-iamproxy-ingest-service-account
  mode: daemonset
  config: |
    receivers:
      prometheus:
        config:
          global:
            scrape_interval: 15s
            scrape_timeout: 10s
          scrape_configs:
            - job_name: 'appmesh-envoy'
              metrics_path: /stats/prometheus
              kubernetes_sd_configs:
                - role: pod
              relabel_configs:
                - source_labels: [__meta_kubernetes_pod_container_name]
                  action: keep
                  regex: '^envoy$'
                - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
                  action: replace
                  regex: ([^:]+)(?::\d+)?;(\d+)
                  replacement: \$\${1}:9901
                  target_label: __address__
                - action: labelmap
                  regex: __meta_kubernetes_pod_label_(.+)
                - source_labels: [__meta_kubernetes_namespace]
                  action: replace
                  target_label: kubernetes_namespace
                - source_labels: [__meta_kubernetes_pod_name]
                  action: replace
                  target_label: kubernetes_pod_name
              # Exclude high cardinality metrics
              metric_relabel_configs:
                - source_labels: [ cluster_name ]
                  regex: '(outbound|inbound|prometheus_stats).*'
                  action: drop
                - source_labels: [ tcp_prefix ]
                  regex: '(outbound|inbound|prometheus_stats).*'
                  action: drop
                - source_labels: [ listener_address ]
                  regex: '(.+)'
                  action: drop
                - source_labels: [ http_conn_manager_listener_prefix ]
                  regex: '(.+)'
                  action: drop
                - source_labels: [ http_conn_manager_prefix ]
                  regex: '(.+)'
                  action: drop
                - source_labels: [ __name__ ]
                  regex: 'envoy_tls.*'
                  action: drop
                - source_labels: [ __name__ ]
                  regex: 'envoy_tcp_downstream.*'
                  action: drop
                - source_labels: [ __name__ ]
                  regex: 'envoy_http_(stats|admin).*'
                  action: drop
                - source_labels: [ __name__ ]
                  regex: 'envoy_cluster_(lb|retry|bind|internal|max|original).*'
                  action: drop
            - job_name: 'kubernetes-cadvisor'
              scheme: https
              tls_config:
                ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
              bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
              kubernetes_sd_configs:
                - role: node
              relabel_configs:
                - action: labelmap
                  regex: __meta_kubernetes_node_label_(.+)
                - target_label: __address__
                  replacement: kubernetes.default.svc:443
                - source_labels: [__meta_kubernetes_node_name]
                  regex: (.+)
                  target_label: __metrics_path__
                  replacement: /api/v1/nodes/\$\${1}/proxy/metrics/cadvisor
            - job_name: 'kubernetes-apiservers'
              kubernetes_sd_configs:
                - role: endpoints
                  namespaces:
                    names:
                      - default
              scheme: https
              tls_config:
                ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
              bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
              relabel_configs:
                - source_labels: [__meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
                  action: keep
                  regex: kubernetes;https
    exporters:
      awsprometheusremotewrite:
        endpoint: "${PROM_ENDPOINT}api/v1/remote_write"
        aws_auth:
          region: "us-east-1"
          service: "aps"
      logging:
        loglevel: debug
    extensions:
      health_check:
      pprof:
        endpoint: :1888
      zpages:
        endpoint: :55679
    service:
      extensions: [pprof, zpages, health_check]
      pipelines:
        metrics:
          receivers: [prometheus]
          exporters: [logging, awsprometheusremotewrite]
EOF
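Because the collector runs in daemonset mode, there should be one collector pod per node in the adot-col namespace. A quick check (exact pod names will differ):
kubectl -n adot-col get pods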
Confirm that metrics are being collected by doing a test query with awscurl:
awscurl --service="aps" --region="us-east-1" \
"${PROM_ENDPOINT}api/v1/query?query=kubernetes_build_info"
A non-empty result shows that metrics are being collected from the cluster. The result should contain data in the result field:
{"status":"success","data":{"resultType":"vector","result":[...this should not be empty...]}}
Set up Amazon Managed Grafana (optional)
Amazon Managed Grafana provides advanced metric visualization and alerting on top of the data stored in Amazon Managed Service for Prometheus, and it will help us observe the canary deployment process. Note that Amazon Managed Grafana requires that either AWS Single Sign-On or another SAML IdP be available for authentication. Setting up Amazon Managed Grafana is not required by Flagger.
Let’s walk through the steps in the documentation for getting started with Amazon Managed Grafana, making sure to select the option to enable Amazon Managed Prometheus as a data source.
Next we will add Amazon Managed Service for Prometheus as a data source as described in the Amazon Managed Grafana documentation. Let’s select the us-east-1 Region and the same Amazon Managed Service for Prometheus workspace we created previously:
Next, in Settings, rename the data source to prometheus. This will help when importing a premade dashboard later.
Finally, let’s import a precreated dashboard from the Flagger project by selecting the + icon and Import:
Then we can add the JSON config from the Flagger project and choose Save. The dashboard won’t show much yet, but we will come back to it soon.
Install Flagger
Now we’re ready to install Flagger. First we must install the AWS SIGv4 Proxy Admission Controller to allow Flagger to authenticate with Amazon Managed Service for Prometheus via IAM.
Install the admission controller:
helm upgrade -i aws-sigv4-proxy-admission-controller \
eks/aws-sigv4-proxy-admission-controller \
--namespace kube-system
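Before continuing, we can verify that the admission controller pod is running in kube-system; the exact pod name depends on the Helm release:
kubectl -n kube-system get pods | grep aws-sigv4-proxy-admission-controller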
To simplify Helm configuration, create a values file:
cat <<EOF > values.yaml
podAnnotations:
  appmesh.k8s.aws/sidecarInjectorWebhook: disabled
  sidecar.aws.signing-proxy/inject: "true"
  sidecar.aws.signing-proxy/host: "aps-workspaces.us-east-1.amazonaws.com"
  sidecar.aws.signing-proxy/name: aps
metricsServer: "http://localhost:8005/workspaces/$WORKSPACE_ID"
meshProvider: "appmesh:v1beta2"
serviceAccount:
  create: false
  name: flagger-service-account
EOF
Then install Flagger:
helm repo add flagger https://flagger.app
helm repo update
kubectl apply -f https://raw.githubusercontent.com/fluxcd/flagger/main/artifacts/flagger/crd.yaml
helm upgrade -i flagger flagger/flagger \
--namespace=flagger-system \
-f values.yaml
Check that Flagger is installed and running:
kubectl get pods -n flagger-system
We should get output like the following. (Note that there are two containers in the pod: Flagger and the SigV4 proxy sidecar.)
NAME READY STATUS RESTARTS AGE
flagger-77c65ff87-ftwbm 2/2 Running 0 20s
Deploy the sample app
Now that the cluster, metrics infrastructure, and Flagger are installed, we can install the sample application itself. We will use the standard Podinfo application used in the Flagger project and the accompanying loadtester tool. We also will create the Canary API resource that defines the rollout. We will use a minimal canary configuration. Find details on how to configure Flagger Canaries in the Flagger documentation.
First, let’s create a Namespace called test and enable App Mesh in it:
cat << EOF | kubectl apply -f -
apiVersion: v1
kind: Namespace
metadata:
  name: test
  labels:
    appmesh.k8s.aws/sidecarInjectorWebhook: enabled
EOF
Install the sample application:
helm upgrade -i podinfo flagger/podinfo \
--namespace test \
--set hpa.enabled=false
Install the loadtester:
helm upgrade -i flagger-loadtester flagger/loadtester \
--namespace=test \
--set appmesh.enabled=true \
--set "appmesh.backends[0]=podinfo" \
--set "appmesh.backends[1]=podinfo-canary"
Create the Canary resource:
cat <<EOF | kubectl apply -f -
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: podinfo
  namespace: test
spec:
  provider: appmesh:v1beta2
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  progressDeadlineSeconds: 60
  service:
    port: 9898
  analysis:
    interval: 1m
    # max number of failed metric checks before rollback
    threshold: 5
    # canary increment step percentages (0-100)
    stepWeights: [2,5,10,20,30,50]
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 1m
EOF
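Flagger takes a minute or two to initialize the canary: it creates a podinfo-primary deployment, routes traffic to it, and scales the original podinfo deployment down. As a rough sketch of what to expect (column layout may vary by Flagger version), the Canary resource should eventually report an Initialized status:
kubectl -n test get canary podinfo
# NAME      STATUS        WEIGHT
# podinfo   Initialized   0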
Finally, install the App Mesh Gateway and create a gateway route:
helm upgrade -i appmesh-gateway eks/appmesh-gateway \
--namespace test
cat << EOF | kubectl apply -f -
apiVersion: appmesh.k8s.aws/v1beta2
kind: GatewayRoute
metadata:
  name: podinfo
  namespace: test
spec:
  httpRoute:
    match:
      prefix: "/"
    action:
      target:
        virtualService:
          virtualServiceRef:
            name: podinfo
EOF
Once the load balancer is provisioned and health checks pass, we can find the sample application at the load balancer’s public address:
kubectl get svc -n test --field-selector 'metadata.name==appmesh-gateway' \
-o=jsonpath='{.items[0].status.loadBalancer.ingress[0].hostname}{"\n"}'
Opening that address in a browser (or fetching it with curl, as sketched below) should return a response from the Podinfo application.
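For a command line check, the following sketch fetches the root path through the gateway; it assumes the appmesh-gateway Service exposes port 80 and that DNS for the new load balancer has already propagated (this can take a few minutes):
GATEWAY_HOST=$(kubectl get svc -n test --field-selector 'metadata.name==appmesh-gateway' \
  -o=jsonpath='{.items[0].status.loadBalancer.ingress[0].hostname}')
curl -s "http://${GATEWAY_HOST}/"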
Test canary deployment
Now that both Flagger and the sample app are deployed, we can see how canary rollout (and rollback) works in practice.
First, let’s trigger an update to the Podinfo deployment and watch it be promoted successfully. Then, let’s make another change and force the canary checks to fail, causing a rollback.
Start by making a trivial change to the PodSpec:
kubectl -n test set image deployment/podinfo \
podinfo=stefanprodan/podinfo:5.0.1
Then make sure that some load is applied to the Podinfo app, split between canary and primary according to the rollout process. This command will run for a while without producing output while generating traffic:
kubectl -n test exec -it deploy/flagger-loadtester \
-- hey -z 30m -c 2 -q 20 http://podinfo.test:9898
Then we can watch for new events on the Canary resource in another shell:
kubectl get event --namespace test --field-selector involvedObject.kind=Canary -w
We are shown the following as the rollout progresses:
LAST SEEN TYPE REASON OBJECT MESSAGE
0s Normal Synced canary/podinfo New revision detected! Scaling up podinfo.test
0s Normal Synced canary/podinfo Starting canary analysis for podinfo.test
0s Normal Synced canary/podinfo Advance podinfo.test canary weight 2
0s Normal Synced canary/podinfo Advance podinfo.test canary weight 5
0s Normal Synced canary/podinfo Advance podinfo.test canary weight 10
0s Normal Synced canary/podinfo Advance podinfo.test canary weight 20
0s Normal Synced canary/podinfo Advance podinfo.test canary weight 50
0s Normal Synced canary/podinfo Copying podinfo.test template spec to podinfo-primary.test
0s Normal Synced canary/podinfo Routing all traffic to primary
0s Normal Synced canary/podinfo Promotion completed! Scaling down podinfo.test
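Once the promotion completes, the Canary resource itself should report a Succeeded status, which we can confirm with:
kubectl -n test get canary podinfo
# NAME      STATUS      WEIGHT
# podinfo   Succeeded   0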
Now if we go back to the App Mesh canary dashboard that we imported into Amazon Managed Grafana, we can see graphs of the metrics used by Flagger, with the canary receiving an increasing share of traffic over time. We may need to refresh the page and confirm that the values selected for Namespace, Primary, and Canary are test, podinfo-primary, and podinfo, respectively.
Let’s repeat the process one more time:
kubectl -n test set image deployment/podinfo \
podinfo=stefanprodan/podinfo:5.0.2
Apply load again:
kubectl -n test exec -it deploy/flagger-loadtester \
-- hey -z 30m -c 2 -q 20 http://podinfo.test:9898
This time, however, let's generate HTTP 500 errors against the canary from a separate shell so that its success rate drops below the configured threshold:
kubectl -n test exec -it deploy/flagger-loadtester \
-- hey -z 30m -c 1 -q 5 http://podinfo-canary.test:9898/status/500
When we monitor canary events again, we are shown warnings like the following, and the rollout ultimately fails:
LAST SEEN TYPE REASON OBJECT MESSAGE
0s Normal Synced canary/podinfo New revision detected! Scaling up podinfo.test
0s Normal Synced canary/podinfo Starting canary analysis for podinfo.test
0s Normal Synced canary/podinfo Advance podinfo.test canary weight 2
0s Normal Synced canary/podinfo Advance podinfo.test canary weight 5
0s Normal Synced canary/podinfo Advance podinfo.test canary weight 10
0s Warning Synced canary/podinfo Halt podinfo.test advancement success rate 46.04% < 99%
0s Warning Synced canary/podinfo Halt podinfo.test advancement success rate 64.85% < 99%
[...abridged...]
0s Warning Synced canary/podinfo Rolling back podinfo.test failed checks threshold reached 5
0s Warning Synced canary/podinfo Canary failed! Scaling down podinfo.test
Again, we can view the corresponding metrics in Amazon Managed Grafana, including the decrease in success rate on the canary that caused the rollout to fail:
This shows how Flagger can safely roll out healthy new versions and prevent the release of versions containing errors detectable by Prometheus metrics.
Cleanup
Now that we are done trying out everything, let’s walk through deleting all the resources used.
Start with the Kubernetes resources that need to be deleted to clean up associated App Mesh resources:
kubectl delete ns test
kubectl delete mesh global
Then, using eksctl, delete the Amazon EKS cluster:
eksctl delete cluster --name mycluster --region us-east-1
Clean up the Amazon Managed Service for Prometheus workspace and ingest policy:
aws amp delete-workspace --workspace-id $WORKSPACE_ID
aws iam delete-policy --policy-arn $INGEST_POLICY_ARN
Refer to the documentation to delete the Amazon Managed Grafana workspace in the AWS Management Console.
Conclusion
This post explained the process for setting up and using Flagger for canary deployments together with Amazon EKS, Amazon Managed Service for Prometheus, Amazon Managed Grafana, and AWS App Mesh. We only scratched the surface of what’s possible with canary deployment and metrics-driven rollback, but we hope this helps show the value that canary rollouts provide.
Beyond what was covered in this post, you can also consider Flagger’s Slack integration and webhook notifications for failed canary deployments. Amazon Managed Grafana also allows you to alert on a wide variety of metrics data and supports notifying an Amazon Simple Notification Service (Amazon SNS) topic as well as Slack, PagerDuty, and more.
To learn more, read the Amazon Builders’ Library post about automating safe, hands-off deployments, which discusses how we gradually deploy to production at AWS to minimize risk. You can also check out the Flagger hands-on guide that includes additional Weaveworks tools for GitOps.