Graceful shutdown and zero downtime deployment in Kubernetes
TL;DR: In this article, you will learn how to prevent dropped connections when a Pod starts up or shuts down. You will also learn how to shut down long-running tasks gracefully.
In Kubernetes, creating and deleting Pods is one of the most common tasks.
Pods are created when you perform rolling updates, scale Deployments, release a new version, run a Job or CronJob, and so on.
Pods are also deleted and recreated after evictions, for example when you mark a node as unschedulable.
If Pods are so ephemeral, what happens when a Pod is told to shut down while it is in the middle of responding to a request?
Is the request completed before the Pod shuts down?
And what about subsequent requests? Are they redirected elsewhere?
Before discussing what happens when a Pod is deleted, it is necessary to discuss what happens when a Pod is created.
Suppose you want to create the following Pod in the cluster:
pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
    - name: web
      image: nginx
      ports:
        - name: web
          containerPort: 80
You can submit the YAML definition to the cluster with:
kubectl apply -f pod.yaml
After entering the command, kubectl submits the Pod definition to the Kubernetes API.
This is the starting point of the journey.
Save the state of the cluster in the database
The API receives and checks the Pod definition, and then stores it in the database etcd.
The Pod will also be added to the scheduler's queue.
The scheduler:
- Inspects the definition.
- Collects details about the workload, such as CPU and memory requests, and then
- Decides which node is best suited to run it (through a process called Filters and Predicates).
At the end of the process:
- The Pod is marked as Scheduled in etcd.
- The Pod has a node assigned to it.
- The state of the Pod is stored in etcd.
But the Pod still does not exist.
- When you type kubectl apply -f, the YAML configuration is sent to the Kubernetes API.
- The API saves the Pod definition in the etcd database.
- The scheduler assigns the best node to the Pod, and the Pod's status changes to Pending. The Pod exists only in etcd.
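You can observe this intermediate state yourself. If the node hasn't started the container yet, the STATUS column shows Pending and the IP column is still empty:
kubectl get pod my-pod -o wide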
So who creates the Pod on your node?
The kubelet, the Kubernetes agent
The job of kubelet is to poll the control plane for updates.
You can imagine kubelet constantly asking the master node: "I take care of worker node 1, do you have any new pods for me?".
When there is a new Pod, the kubelet creates it.
Well, more or less.
The kubelet doesn't create the Pod by itself. Instead, it delegates the work to three other components:
- The Container Runtime Interface (CRI): the component that creates the containers for the Pod.
- The Container Network Interface (CNI): the component that connects the containers to the cluster network and assigns IP addresses.
- The Container Storage Interface (CSI): the component that mounts volumes in the containers.
In most cases, the work of the Container Runtime Interface (CRI) is similar to:
docker run -d <my-container-image>
The Container Network Interface (CNI) is a bit more interesting, because it is responsible for:
- Generating a valid IP address for the Pod.
- Connecting the container to the rest of the network.
As you can imagine, there are several ways to connect the container to the network and assign a valid IP address (you can choose between IPv4 or IPv6, or you can assign multiple IP addresses).
For example, Docker creates a virtual Ethernet pair and connects it to a bridge, while AWS-CNI connects Pods directly to the rest of the virtual private cloud (VPC).
When the Container Network Interface finishes its job, the Pod is connected to the rest of the network and has a valid IP address assigned.
There is only one problem.
The kubelet knows the IP address (because it invoked the Container Network Interface), but the control plane does not.
No one has told the master node that the Pod has an IP address assigned and is ready to receive traffic.
As far as the control plane is concerned, the Pod is still being created.
It is the kubelet's job to collect all the details of the Pod (such as the IP address) and report them back to the control plane.
If you were to inspect etcd, it would reveal not only where the Pod is running, but also its IP address.
- Kubelet polls the control plane for updates.
- When a new Pod is assigned to its Node, the kubelet retrieves the details.
- The kubelet doesn't create the Pod itself. It relies on three components: the Container Runtime Interface, Container Network Interface and Container Storage Interface.
- Once all three components have successfully completed, the Pod is Running in your Node and has an IP address assigned.
- The kubelet reports the IP address back to the control plane.
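You can verify the reported IP address yourself; for the my-pod example above:
kubectl get pod my-pod -o jsonpath='{.status.podIP}'
The command prints the Pod's IP once the kubelet has reported it back to the control plane; until then, the field is empty.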
If the Pod isn't part of any Service, this is the end of the journey.
The Pod has been created and can be used.
If the Pod is part of the service, there are a few more steps to perform.
Pods and Services
When creating a service, you usually need to pay attention to two pieces of information:
- The selector, used to specify the Pods that will receive the traffic.
- The targetPort, the port used by the Pods to receive the traffic.
A typical YAML definition of a service is as follows:
service.yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  ports:
    - port: 80
      targetPort: 3000
  selector:
    name: app
When you submit the Service to the cluster with kubectl apply, Kubernetes finds all the Pods that have the same label as the selector (name: app) and collects their IP addresses, but only if they have passed the Readiness probe.
Then, for each IP address, it concatenates the IP address and the port.
If the IP address is 10.0.0.3 and the targetPort is 3000, Kubernetes concatenates the two values and calls the result an endpoint.
IP address + port = endpoint
---------------------------------
10.0.0.3 + 3000 = 10.0.0.3:3000
The endpoints are stored in etcd in another object called Endpoints.
Confused?
Kubernetes refers to:
- An IP address + port pair (10.0.0.3:3000) as an endpoint (called a lowercase-e endpoint in this article and in the Learnk8s material).
- A collection of endpoints as Endpoints (called the uppercase-E Endpoints object in this article and in the Learnk8s material).
The Endpoints object is a real object in Kubernetes, and Kubernetes automatically creates one for every Service.
You can verify that with:
kubectl get services,endpoints
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S)
service/my-service-1 ClusterIP 10.105.17.65 <none> 80/TCP
service/my-service-2 ClusterIP 10.96.0.1 <none> 443/TCP
NAME ENDPOINTS
endpoints/my-service-1 172.17.0.6:80,172.17.0.7:80
endpoints/my-service-2 192.168.99.100:8443
The Endpoints object collects all the IP addresses and ports from the Pods.
But not just once.
The Endpoints object is refreshed with a new list of endpoints when:
- A Pod is created.
- A Pod is deleted.
- A label is modified on a Pod.
So, you can imagine that every time a Pod is created and after the kubelet publishes its IP address to the master node, Kubernetes will update all endpoints to reflect the change:
kubectl get services,endpoints
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S)
service/my-service-1 ClusterIP 10.105.17.65 <none> 80/TCP
service/my-service-2 ClusterIP 10.96.0.1 <none> 443/TCP
NAME ENDPOINTS
endpoints/my-service-1 172.17.0.6:80,172.17.0.7:80,172.17.0.8:80
endpoints/my-service-2 192.168.99.100:8443
Excellent, the endpoints are stored in the control plane, and the Endpoints object has been updated.
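You can watch this happen live. Assuming the my-service-1 Service from above exists, the following prints a new line every time its list of endpoints changes:
kubectl get endpoints my-service-1 --watch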
- In this picture, there is a single Pod deployed in the cluster, and it is part of a Service. If you inspect etcd, you find both the Pod's details and the Service.
- What happens when a new Pod is deployed?
- Kubernetes has to keep track of the Pod and its IP address. The Service should route traffic to the new endpoint, so the IP address and port should be propagated.
- What happens when another Pod is deployed?
- The exact same process. A new "row" for the Pod is created in the database, and the endpoint is propagated.
- What happens when a Pod is deleted, though?
- The Service immediately removes the endpoint, and, eventually, the Pod is removed from the database too.
- Kubernetes reacts to every small change in your cluster.
Is the Pod ready to be used, then?
There's more.
A lot more!
Using endpoints in Kubernetes
Endpoints are used by several components in Kubernetes.
Kube-proxy uses endpoints to set iptables rules on nodes.
Therefore, every time a change is made to the endpoint (object), kube-proxy retrieves a new list of IP addresses and ports and writes new iptables rules.
- Let's consider this three-node cluster with two Pods and no Services. The state of the Pods is stored in etcd.
- What happens when you create a Service?
- Kubernetes creates an Endpoints object and collects all the endpoints (IP address and port pairs) from the Pods.
- The kube-proxy daemon subscribes to changes to Endpoints.
- When an Endpoints object is added, removed or updated, kube-proxy retrieves the new list of endpoints.
- Kube-proxy uses the endpoints to create iptables rules on each Node of your cluster.
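If you are curious, you can inspect those rules on any node. Assuming kube-proxy runs in its default iptables mode, the following lists the Service chains (the exact output varies from cluster to cluster):
sudo iptables -t nat -L KUBE-SERVICES -n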
The Ingress controller uses the same endpoint list.
The Ingress controller is the component in the cluster that routes external traffic into the cluster.
When you set up an Ingress manifest, you usually specify a Service as the destination:
ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-ingress
spec:
  rules:
    - http:
        paths:
          - backend:
              service:
                name: my-service
                port:
                  number: 80
            path: /
            pathType: Prefix
In reality, however, the traffic is not routed to the Service.
Instead, the Ingress controller sets up a subscription that will be notified every time the endpoint of the service changes.
The Ingress controller routes the traffic directly to the Pods, skipping the Service.
As you can imagine, every time there is a change to the endpoints (object), the Ingress controller retrieves the new list of IP addresses and ports and reconfigures itself to include the new Pods.
In this picture, there is an Ingress controller, a Deployment with two replicas, and a Service.
There are more examples of Kubernetes components that subscribe to changes to endpoints.
CoreDNS, the DNS component in the cluster, is another example.
If you use headless Services, CoreDNS has to subscribe to changes to the endpoints and reconfigure itself every time an endpoint is added or removed.
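As a reminder, a headless Service is an ordinary Service whose clusterIP is set to None; a minimal sketch (the selector and ports mirror the earlier example):
apiVersion: v1
kind: Service
metadata:
  name: my-headless-service
spec:
  clusterIP: None # headless: DNS returns the Pod IPs directly, not a single cluster IP
  selector:
    name: app
  ports:
    - port: 80
      targetPort: 3000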
The same endpoints are consumed by service meshes such as Istio or Linkerd, by cloud providers to create Services of type: LoadBalancer, and by countless operators.
You must remember that several components subscribe to changes to endpoints, and they might receive notifications about endpoint updates at different times.
Is that it, or does anything else happen after the Pod is created?
This time, you're done!
A quick recap of what happens when a Pod is created:
- The Pod is stored in etcd.
- The scheduler assigns a node and writes the node to etcd.
- The kubelet is notified of a new, scheduled Pod.
- The kubelet delegates creating the container to the Container Runtime Interface (CRI).
- The kubelet delegates attaching the container to the network to the Container Network Interface (CNI).
- The kubelet delegates mounting volumes in the container to the Container Storage Interface (CSI).
- The Container Network Interface assigns an IP address.
- The kubelet reports the IP address to the control plane.
- The IP address is stored in etcd.
And if your Pod is part of a Service:
- The kubelet waits for the Readiness probe to succeed.
- All the relevant Endpoints (objects) are notified of the change.
- The Endpoints object adds the new endpoint (IP address + port pair) to its list.
- Kube-proxy is notified of the endpoint change and updates the iptables rules on every node.
- The Ingress controller is notified of the endpoint change and routes traffic to the new IP address.
- CoreDNS is notified of the endpoint change. If the Service is of type Headless, the DNS entry is updated.
- The cloud provider is notified of the endpoint change. If the Service is of type: LoadBalancer, the new endpoint is configured as part of the load balancer pool.
- Any service mesh installed in the cluster is notified of the endpoint change.
- Any other operator subscribed to endpoint changes is notified too.
It's a surprisingly long list for such a common task: creating a Pod.
The Pod is Running, though. It's time to discuss what happens when it is deleted.
Deleting a Pod
You might have guessed it already, but when a Pod is deleted, the same steps are followed, just in reverse.
First, the endpoint should be removed from the Endpoints object.
This time, the Readiness probe is ignored, and the endpoint is removed from the control plane immediately.
That, in turn, triggers all the events for kube-proxy, the Ingress controller, the DNS, the service mesh, etc.
Those components update their internal state and stop routing traffic to the IP address.
Since the components might be busy doing something else, there is no guarantee of how long it will take for the IP address to be removed from their internal state.
For some, it could take less than a second; for others, it could take more.
If you delete a Pod with kubectl delete pod, the command reaches the Kubernetes API first.
At the same time, the Pod's status in etcd is changed to Terminating.
The kubelet is notified of the change and delegates:
- Unmounting any volumes from the container to the Container Storage Interface (CSI).
- Detaching the container from the network and releasing the IP address to the Container Network Interface (CNI).
- Destroying the container to the Container Runtime Interface (CRI).
In other words, Kubernetes follows exactly the same steps as when creating a Pod, but in reverse.
However, there is a subtle but essential difference.
When a Pod is terminated, removing its endpoint and sending the signal to the kubelet happen at the same time.
When a Pod is first created, Kubernetes waits for the kubelet to report the IP address and only then kicks off the endpoint propagation.
However, when you delete a Pod, the events start in parallel.
This could result in quite a few race conditions.
What if the Pod is deleted before the endpoint removal is propagated?
Deleting the endpoint and deleting the Pod happen at the same time.
Graceful shutdown
When a Pod terminates before its endpoint is removed from kube-proxy or the Ingress controller, you might experience downtime.
And, if you think about it, it makes sense.
Kubernetes still routes traffic to the IP address, but the Pod no longer exists.
The Ingress controller, kube-proxy, CoreDNS, etc. haven't had enough time to remove the IP address from their internal state.
Ideally, before deleting a Pod, Kubernetes should wait for all components in the cluster to have an updated endpoint list.
But Kubernetes doesn't work like that.
Kubernetes offers robust primitives to distribute the endpoints (i.e. the Endpoints object and more advanced abstractions, such as EndpointSlices).
However, Kubernetes does not verify that the components subscribing to endpoint changes are up to date with the state of the cluster.
So, how can you avoid this race condition and make sure that the Pod is deleted only after the endpoint removal is propagated?
You should wait.
When a Pod is about to be deleted, it receives the SIGTERM signal.
Your application can catch this signal and start shutting down.
Since the endpoint is unlikely to be removed from all components in Kubernetes immediately, you can:
- Wait a bit longer before exiting.
- Keep processing incoming traffic, despite the SIGTERM.
- Finally, close any existing long-lived connections (perhaps a database connection or WebSockets).
- Shut down the process.
A sketch of what this looks like in code follows.
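Here is a minimal sketch of that pattern in Go. The language, the port, and the 15-second delay are all illustrative choices, not requirements of Kubernetes:
package main

import (
    "context"
    "log"
    "net/http"
    "os"
    "os/signal"
    "syscall"
    "time"
)

func main() {
    srv := &http.Server{Addr: ":8080"}
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("hello"))
    })

    // Serve in the background so main can block waiting for SIGTERM.
    go func() {
        if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
            log.Fatal(err)
        }
    }()

    // Kubernetes sends SIGTERM when the Pod is about to be deleted.
    stop := make(chan os.Signal, 1)
    signal.Notify(stop, syscall.SIGTERM)
    <-stop

    // Keep serving traffic while the endpoint removal propagates to
    // kube-proxy, the Ingress controller, CoreDNS, etc.
    // The 15 seconds is an illustrative starting point, not a rule.
    time.Sleep(15 * time.Second)

    // Drain in-flight requests and close long-lived connections.
    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()
    if err := srv.Shutdown(ctx); err != nil {
        log.Printf("graceful shutdown failed: %v", err)
    }
}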
How long should you wait?
By default, Kubernetes sends the SIGTERM signal and waits 30 seconds before forcefully terminating the process.
So you could use the first 15 seconds to continue operating as if nothing had happened.
Hopefully, that interval is long enough for the endpoint removal to propagate to kube-proxy, the Ingress controller, CoreDNS, etc.
As a consequence, less and less traffic will reach your Pod until it stops entirely.
After the 15 seconds, it is safe to close the connection to the database (or any persistent connections) and terminate the process.
If you think you need more time, you can stop the process at 20 or 25 seconds instead.
However, you should remember that Kubernetes forcefully terminates the process after 30 seconds (unless you change terminationGracePeriodSeconds in the Pod definition).
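For reference, here is where the field lives in the Pod spec; the 60-second value is purely illustrative:
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  terminationGracePeriodSeconds: 60 # default is 30 seconds
  containers:
    - name: web
      image: nginx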
What if you cannot change the code to wait longer?
You could invoke a script that waits for a fixed amount of time and then lets the application exit.
Before the SIGTERM is invoked, Kubernetes exposes a preStop hook in the Pod.
You can set the preStop hook to wait for 15 seconds.
Let's look at an example:
pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
    - name: web
      image: nginx
      ports:
        - name: web
          containerPort: 80
      lifecycle:
        preStop:
          exec:
            command: ["sleep", "15"]
The preStop hook is one of the Pod LifeCycle hooks.
Is a 15-second delay the recommended value?
It depends, but it might be a sensible place to start testing from.
Here is an overview of the options you can choose:
As you already know, when a Pod is deleted, the kubelet is notified of the change.
Grace period and rolling updates
Graceful shutdown applies to Pods that are about to be deleted.
But what if you don't delete the Pods?
Even if you don't, Kubernetes deletes Pods all the time.
In particular, Kubernetes creates and deletes Pods every time you deploy a newer version of your application.
When you change the image in your Deployment, Kubernetes rolls out the change incrementally.
deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 3
  selector:
    matchLabels:
      name: app
  template:
    metadata:
      labels:
        name: app
    spec:
      containers:
        - name: app
          # image: nginx:1.18 OLD
          image: nginx:1.19
          ports:
            - containerPort: 3000
If you have three replicas, as soon as you submit the new YAML resource, Kubernetes:
- Creates a Pod with the new container image.
- Destroys an existing Pod.
- Waits for the new Pod to be ready.
Kubernetes repeats these steps until all the Pods are migrated to the newer version.
Kubernetes moves on to the next cycle only after the new Pod is ready to receive traffic (in other words, it passes the Readiness check).
Does Kubernetes wait for the Pod to be deleted before moving to the next Pod?
No.
If you have 10 Pods, and each Pod takes 2 seconds to become ready and 20 seconds to shut down, this is what happens:
- The first new Pod is created, and a previous Pod is terminated.
- The new Pod takes 2 seconds to become ready, after which Kubernetes creates the next one.
- Meanwhile, each terminated Pod keeps terminating for 20 seconds.
After 20 seconds, all the new Pods are live (10 Pods, each ready after 2 seconds), and all 10 previous Pods are still terminating (the first terminated Pod is only just exiting).
In total, for a short period of time, you have double the number of Pods (10 running, 10 terminating).
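If the temporary doubling is a concern, you can bound how many Pods are created and destroyed at once through the Deployment strategy. This is a sketch; the maxSurge and maxUnavailable values are illustrative, not recommendations:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1 # at most one extra Pod above the desired replica count
      maxUnavailable: 0 # never drop below the desired replica count
  selector:
    matchLabels:
      name: app
  template:
    metadata:
      labels:
        name: app
    spec:
      containers:
        - name: app
          image: nginx:1.19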
Rolling update and graceful shutdown
Compared to "ready" probes, the longer the grace period, the more pods you have "running" (and "terminated") at the same time.
Is it not good
Not necessarily, because you have to be careful not to disconnect.
Terminating long-running tasks
And what about long-running jobs?
If you are transcoding a large video, is there any way to delay stopping the Pod?
Suppose you have a Deployment with three replicas.
Each replica is assigned a video to transcode, and the task could take several hours to complete.
When you trigger a rolling update, the Pod has 30 seconds to complete the task before it is killed.
How can you avoid killing the Pod before the task is complete?
You could increase terminationGracePeriodSeconds to a couple of hours.
However, the Pod's endpoint becomes unreachable from that moment onwards.
Unreachable pods
If you expose metrics to monitor your Pod, your instrumentation won't be able to reach it.
Why?
Tools such as Prometheus rely on Endpoints to scrape the Pods in your cluster.
However, as soon as the Pod is deleted, the endpoint removal is propagated across the cluster, even to Prometheus!
Instead of increasing the grace period, you should consider creating a new Deployment for every new release.
When you create a brand-new Deployment, the existing Deployment is left untouched.
The long-running jobs can continue processing the video as usual.
Once they are done, you can delete them manually.
If you want to delete them automatically, you might want to set up an autoscaler that can scale the Deployment to zero replicas when it runs out of tasks.
An example of such an autoscaler is Osiris, a general-purpose, scale-to-zero component for Kubernetes.
This technique is sometimes referred to as a Rainbow Deployment, and it is useful every time you have to keep the previous Pods running for longer than the grace period.
Another good example is WebSockets.
If you are streaming real-time updates to your users, you might not want to terminate the WebSockets every time there is a release.
If you release frequently during the day, that could lead to several interruptions to the real-time feeds.
Creating a new Deployment for every release is a less obvious but better choice.
Existing users can keep streaming updates, while the most recent Deployment serves new users.
As users disconnect from the old Pods, you can gradually decrease the replicas and retire the past Deployment.
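In practice, a Rainbow Deployment can be driven with a few kubectl commands. A sketch, assuming the releases are named app-v1 and app-v2 (both names, and the file deployment-v2.yaml, are illustrative):
# Deploy the new release alongside the old one instead of updating it in place.
kubectl apply -f deployment-v2.yaml
# Once the old release has drained its connections (or finished its tasks),
# scale it down and delete it.
kubectl scale deployment app-v1 --replicas=0
kubectl delete deployment app-v1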
To sum up
You should pay attention when Pods are removed from the cluster, because their IP addresses might still be used to route traffic.
Instead of immediately shutting down your Pods, you should consider waiting a little bit longer in your application or setting up a preStop hook.
The Pod should be removed only after the endpoint removal has propagated to, and been processed by, kube-proxy, the Ingress controller, CoreDNS, etc.
If your Pods run long-lived tasks such as transcoding videos or serving real-time updates with WebSockets, you should consider using Rainbow Deployments.
In a Rainbow Deployment, you create a new Deployment for every release and delete the previous one when its connections (or tasks) are drained.
You can manually remove the older Deployments as soon as the long-running tasks are completed.
Or you can automatically scale the Deployment to zero replicas.