Background

As Kubernetes has become the de facto standard for cloud-native infrastructure, more and more traditional services are migrating from virtual machines and physical machines to Kubernetes. Cloud vendors, including Tencent's self-developed cloud, are also encouraging businesses to deploy their services on Kubernetes and enjoy the benefits it brings, such as elastic scaling, high availability, automated scheduling, and multi-platform support. However, most services currently deployed on Kubernetes are stateless. Why is containerizing stateful services harder than containerizing stateless ones? What are the difficulties, and what are the corresponding solutions?

Combining my understanding of Kubernetes with experience in stateful service development, governance, and containerization, this article analyzes the difficulties of containerizing stateful services and the corresponding solutions. I hope it helps you understand these difficulties, choose solutions flexibly for your own stateful service scenarios, and containerize stateful services on Kubernetes efficiently and stably, improving development and operations efficiency as well as product competitiveness.

Stateful service containerization challenges

To simplify the problem and avoid excessive abstraction, I will use the widely used Redis cluster as a concrete case, explain in detail how to containerize a Redis cluster, and use this case to analyze and extend to the common problems of stateful service scenarios.

The following figure is the overall architecture diagram of the Redis cluster solution codis (quoted from the Codis project).

codis is a proxy-based distributed Redis cluster solution that consists of the following core components:

  • zookeeper/etcd, the stateful metadata store, generally deployed with an odd number of nodes
  • codis-proxy, a stateless component that computes the crc16 hash of each key and forwards the request to the corresponding back-end codis-group according to the preshard routing table stored in zookeeper/etcd
  • codis-group, a group of Redis master and standby nodes (one master, multiple standbys), responsible for data reads, writes, and storage
  • codis-dashboard, the cluster control-plane API service, through which you can add or delete nodes, migrate data, and so on
  • redis-sentinel, the cluster high-availability component, responsible for detecting the liveness of Redis masters and initiating a master-standby switch when a master fails

So how do we containerize a codis cluster on Kubernetes, so that a codis cluster can be created with one click and managed efficiently through kubectl and Kubernetes resources?

To containerize a stateful service like codis, we need to solve the following problems:

  • How to describe your stateful service in the language of Kubernetes?
  • How to choose the right workload to deploy your stateful service?
  • When the built-in workload of kubernetes cannot directly describe the business scenario, what Kubernetes extension mechanism should you choose?
  • How to make security changes to stateful services?
  • How to ensure that the master and standby instance Pods of your stateful service are scheduled into different fault domains?
  • How does a stateful service instance self-heal after a failure?
  • How to meet the high network performance requirements of stateful services after containerization?
  • How to meet the high storage performance requirements of stateful services after containerization?
  • How to verify the stability of stateful services after containerization?

The mind map below gives a systematic analysis of the technical difficulties of containerizing stateful services. Next, I will explain the containerization solutions from each of the above aspects.

Load type

The first question in containerizing a stateful service is: how do you describe your stateful service with Kubernetes-style APIs and resources?

Kubernetes abstracts the various business scenarios of the complex software world into built-in workloads such as Pod, Deployment, and StatefulSet. So what are the usage scenarios of each workload?

A Pod is the smallest unit of scheduling and deployment. It consists of a group of containers that share resources such as the network namespace and data volumes. Why is a Pod designed as a group of containers instead of a single one? Because in real, complex business scenarios, a single business container often cannot complete certain functions on its own. For example, you may want an auxiliary container to download cold-backup snapshot files or forward logs. Thanks to the Pod design, an auxiliary container can share the same network namespace and data volumes with your stateful containers such as Redis, MySQL, etcd, or zookeeper and help the main business container complete such work. This auxiliary container is called a sidecar in Kubernetes; it is widely used in scenarios such as logging, forwarding, and service mesh, and has become a standard Kubernetes design pattern. The Pod design distills more than ten years of Google's internal Borg operating experience and can significantly reduce the cost of containerizing complex businesses.
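
For example, a minimal sketch of a Redis Pod with a log-forwarding sidecar could look like the following; the images and the shared log path are illustrative assumptions, not values from the codis project.

apiVersion: v1
kind: Pod
metadata:
  name: redis-with-sidecar
spec:
  containers:
  - name: redis                       # main business container
    image: redis:6.2
    volumeMounts:
    - name: redis-log                 # volume shared with the sidecar
      mountPath: /var/log/redis
  - name: log-forwarder               # sidecar container forwarding Redis logs
    image: fluent/fluent-bit:1.9      # illustrative log-forwarding image
    volumeMounts:
    - name: redis-log
      mountPath: /var/log/redis
      readOnly: true
  volumes:
  - name: redis-log
    emptyDir: {}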

With Pods, a business process can be containerized. However, a Pod by itself has no high availability, automatic scaling, or rolling update capabilities. To solve these problems, Kubernetes provides a higher-level workload, Deployment, through which you get Pod failure self-healing and rolling updates, and, combined with the HPA component, automatic scaling based on CPU, memory, or custom metrics. Deployment is generally used to describe stateless services, so it is particularly suitable for the stateless components in the stateful cluster we discussed above, such as the proxy component of the codis cluster.
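
As an illustration, a Deployment for the stateless codis-proxy component could be sketched as follows; the image name, replica count, port, and resources are assumptions for the example rather than values taken from the codis project.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: codis-proxy
spec:
  replicas: 3                          # stateless, can be scaled freely or via HPA
  selector:
    matchLabels:
      app: codis-proxy
  template:
    metadata:
      labels:
        app: codis-proxy
    spec:
      containers:
      - name: codis-proxy
        image: codis-proxy:example     # illustrative image
        ports:
        - containerPort: 19000         # illustrative proxy listening port
        resources:
          requests:
            cpu: "1"
            memory: 1Gi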

So why is Deployment not suitable for stateful services? The main reasons are that the Pod names generated by a Deployment change on rebuild, there is no stable network identity, no stable persistent storage, and the order of Pods cannot be controlled during rolling updates, all of which matter for stateful services. On the one hand, stateful services rely on stable network identities to communicate with each other; this is a basic requirement for their high availability and data reliability. For example, in etcd, a log entry must be confirmed and persisted by more than half of the nodes in the cluster before it is committed, and in Redis, master and slave establish the replication relationship based on a stable network identity. On the other hand, whether it is etcd, Redis, or another component, the business usually expects that the persistent data of a Pod is not lost after the Pod is rebuilt due to an exception.

To solve these pain points of stateful service scenarios, Kubernetes designed StatefulSet to describe them. It provides each Pod with a unique name, a stable network identity, persistent data storage, and an ordered rolling update mechanism. Based on StatefulSet you can conveniently containerize a single stateful service such as etcd or zookeeper.
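
A minimal StatefulSet sketch for a three-node etcd cluster might look like the following, assuming a headless Service named etcd already exists and reusing the cbs StorageClass that appears in the storage section below; the image and sizes are illustrative.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: etcd
spec:
  serviceName: etcd                    # headless Service that gives each Pod a stable DNS name
  replicas: 3
  selector:
    matchLabels:
      app: etcd
  template:
    metadata:
      labels:
        app: etcd
    spec:
      containers:
      - name: etcd
        image: quay.io/coreos/etcd:v3.4.13   # illustrative image
        ports:
        - containerPort: 2379          # client port
        - containerPort: 2380          # peer port
        volumeMounts:
        - name: data
          mountPath: /var/lib/etcd
  volumeClaimTemplates:                # each Pod gets its own PVC: data-etcd-0, data-etcd-1, ...
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: cbs
      resources:
        requests:
          storage: 10Gi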

With Deployment and StatefulSet, we can quickly containerize most services in real business scenarios. But business requirements are diverse, and technology stacks and scenarios vary greatly. Some want a fixed IP per Pod so they can easily plug into traditional load balancers; some want Pods to be updated in place during a release instead of being rebuilt; some want to update an arbitrary, specified StatefulSet Pod. So how does Kubernetes meet these diverse demands?

Extension mechanisms

Kubernetes provides a powerful external extension system, as shown in the figure below (quoted from the Kubernetes blog): kubectl plugins, the Aggregated API Server, CRDs, custom schedulers, operators, network plug-ins (CNI), and storage plug-ins (CSI). Almost everything can be extended, fully empowering businesses so that each one can customize Kubernetes through these extension mechanisms to meet its specific scenario requirements.

CRD and Aggregated API Server

When Deployment and StatefulSet cannot meet your demands, Kubernetes provides mechanisms such as CRD, the Aggregated API Server, and operators to extend API resources, so that you can combine your specific domain and application knowledge to automate resource management and operations tasks.

CRD, that is, CustomResourceDefinition, is the built-in resource extension mechanism of Kubernetes. It is implemented by the apiextensions-apiserver integrated inside the apiserver, so no additional apiserver needs to run in the cluster. It provides the conventional Kubernetes API operations (CRUD, Watch, etc.) for your custom resources and supports kubectl, authentication, authorization, and auditing, but it does not support custom subresources such as log/exec, and it does not support custom storage: custom resources are stored in the Kubernetes cluster's own etcd, so storing a large number of CRD resources has a certain impact on the performance of the cluster etcd and limits the ability to migrate service data between clusters.
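
For example, a minimal sketch of a CRD declaring a hypothetical CodisCluster resource type; the group name, kind, and fields are all assumptions made for illustration.

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: codisclusters.example.tencent.com    # must be <plural>.<group>
spec:
  group: example.tencent.com                 # illustrative API group
  scope: Namespaced
  names:
    kind: CodisCluster
    plural: codisclusters
    singular: codiscluster
  versions:
  - name: v1alpha1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              proxyReplicas:                 # illustrative fields
                type: integer
              redisGroups:
                type: integer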

Aggregated API Server, that is, an aggregated apiserver, is the mechanism behind components such as the commonly used metrics-server. With this feature, Kubernetes splits a single huge apiserver into multiple aggregated apiservers by resource category, further improving extensibility: adding new APIs no longer requires modifying Kubernetes code. Developers write their own apiserver, deploy it in the Kubernetes cluster, and register the group name of the custom resource and the service name of the apiserver with the cluster through an APIService resource. When the Kubernetes apiserver receives a request for the custom resource, it forwards the request to the custom apiserver based on the APIService information. This mechanism supports kubectl, configurable authentication, authorization, and auditing, supports custom third-party etcd storage, and supports customized development of advanced features such as the log/exec subresources.
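
For example, metrics-server registers itself with the aggregation layer through an APIService object roughly like the following (a sketch based on the standard metrics-server deployment):

apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1beta1.metrics.k8s.io          # <version>.<group>
spec:
  group: metrics.k8s.io
  version: v1beta1
  groupPriorityMinimum: 100
  versionPriority: 100
  service:                              # requests for this group/version are proxied to this Service
    name: metrics-server
    namespace: kube-system
  insecureSkipTLSVerify: true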

In general, CRD provides simple resource extension and storage capabilities without any programming, while the Aggregated API Server gives you finer-grained control over API behavior, allowing you to customize storage, use the Protobuf protocol, and so on.

Enhanced Workload

To meet advanced requirements such as in-place updates and updating a specified Pod, both Tencent internally and the community provide corresponding solutions. Tencent has StatefulSetPlus (not open source) and tkestack's TAPP (open source), both validated in large-scale production environments. In the community there is also Alibaba's open source project OpenKruise, and PingCAP has launched the advanced-statefulset project, still in an experimental state, to solve the problem of updating a specified StatefulSet Pod.

StatefulSetPlus was designed to bring a large number of Tencent's traditional businesses onto Kubernetes. It is compatible with all StatefulSet features and additionally supports in-place container upgrades, integrates with TKE's ipamd component to provide fixed IPs, supports HPA, supports automatic Pod migration for self-healing when a Node becomes unavailable, and supports manual batched upgrades, among other features.

OpenKruise contains a series of enhanced Kubernetes controller components, including CloneSet, Advanced StatefulSet, and SidecarSet. CloneSet is a workload focused on the pain points of stateless services; it supports in-place updates, deleting a specified Pod, and partition-based rolling updates. Advanced StatefulSet, as the name suggests, is an enhanced version of StatefulSet that also supports in-place updates, pausing updates, and a maximum-unavailable setting.

With an enhanced workload component, your stateful service gains features it had under traditional virtual machine and physical machine deployment, such as in-place updates and fixed IPs. At this point, however, should you containerize your service directly on workloads such as StatefulSetPlus and TAPP, or should you define a custom resource through the Kubernetes extension mechanism to describe the components of your stateful service and write your own operator on top of workloads such as StatefulSetPlus and TAPP?

The former is suitable for simple stateful service scenarios with few, easy-to-manage components, and it requires no Kubernetes programming knowledge or development work. The latter is suitable for more complex scenarios and requires you to understand the Kubernetes programming model and know how to define custom resources and write controllers. By combining your knowledge of the stateful service domain, you can write a powerful controller on top of enhanced workloads such as StatefulSetPlus and TAPP and complete the creation and management of a complex, multi-component stateful service with one click, including features such as high availability and auto-scaling.

Operator-based extension

In our codis cluster case above, you can use the Kubernetes CRD extension mechanism to define a custom resource that describes a complete codis cluster, as shown in the following figure.
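
A custom resource instance for such a cluster could be sketched roughly as follows; the group, kind, and all field names are hypothetical illustrations that match the CRD sketch above, not an actual codis operator API.

apiVersion: example.tencent.com/v1alpha1   # hypothetical group/version
kind: CodisCluster
metadata:
  name: codis-demo
spec:
  etcd:
    replicas: 3                            # odd number of metadata-store nodes
  proxy:
    replicas: 3
  redisGroups: 4                           # number of codis-groups (data shards)
  redisReplicasPerGroup: 2                 # one master plus one standby per group
  sentinel:
    replicas: 3
  dashboard:
    replicas: 1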

After describing your stateful business objects declaratively through a CRD, you still need to implement the business logic through the operator mechanism provided by Kubernetes. The core principle of a Kubernetes operator is the controller pattern: it obtains and watches the desired state and the actual state of the business object from the apiserver, compares the difference between the two, and performs reconciliation so that the actual state converges to the desired state.

Its core working principle is shown in the figure above (quoted from the community).

  • The Reflector component obtains the initial state data (CRDs, etc.) from kube-apiserver through a List operation.
  • The ResourceVersion of the resource is taken from the List response, and the Watch mechanism, given that ResourceVersion, monitors data changes after the List in real time.
  • Received events are added to the Delta FIFO queue and processed by the Informer component.
  • The Informer forwards events from the Delta FIFO queue to the Indexer component, which stores them in the local cache.
  • Operator developers can register callback functions for Add, Update, and Delete events through the Informer component; after the Informer receives an event, it calls back into the business code. In a typical controller, each event is added to a WorkQueue, and each reconciling goroutine of the operator takes a message from the queue, parses the key, and reads data from the local cache maintained by the Informer mechanism using that key.
  • For example, after receiving the event that a Codis CRD object has been created and finding that no Deployment/TAPP components related to this object are actually running, the operator can create the proxy service through the Deployment API and the Redis service through the TAPP API.

Scheduling

Having solved how to describe and host your stateful service with the built-in workloads of Kubernetes and their extension mechanisms, the second problem you face is how to ensure that the "equivalent" Pods of a stateful service are deployed across fault domains, so that the service remains highly available.

How to understand the "equivalent" Pod first? In the codis and TDSQL clusters, a group of Redis/MySQL active and standby instances are responsible for processing requests for the same data shard, and achieve high availability through active and standby. Since the active and standby instance Pods are responsible for the same data sharding, we call them equivalent Pods. The production environment expects that they should be deployed across fault domains.

Second, how should we understand the fault domain? A fault domain represents the scope of a potential failure and can be at the host level, rack level, switch level, or availability zone level. A group of Redis master/standby instances should achieve at least host-level high availability: if the node hosting the master instance of any shard fails, the standby instance should be automatically promoted to master, and all shards of the Redis cluster can continue to serve. Similarly, in a TDSQL cluster, a group of MySQL instances should achieve at least switch-level and availability-zone-level disaster recovery to ensure the high availability of the core storage service.

So how to achieve the above-mentioned equivalent Pod deployment across fault domains?

The answer is scheduling. The built-in Kubernetes scheduler automatically assigns Pods to the best nodes according to the resources the Pod requests and the configured scheduling policies, and it also provides powerful extension mechanisms that let you implement custom scheduling strategies. In simple stateful service scenarios, you can deploy Pods across fault domains using the affinity and anti-affinity scheduling features provided by Kubernetes.

Suppose we want to deploy a highly available three-node etcd cluster through containers, the fault domain is the availability zone, and each etcd node must be placed in a different availability zone. How do we achieve this cross-availability-zone deployment with the affinity and anti-affinity features provided by Kubernetes?

Affinity and Anti-Affinity

It is very simple: add the following anti-affinity configuration to the workload that deploys etcd, declaring the label of the target etcd cluster Pods, setting the topology domain to the node availability zone, and using the hard (required) rule, so that a Pod that does not satisfy the rule cannot be scheduled.

So how does the scheduler schedule when it encounters a Pod with anti-affinity configuration added?

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: etcd_cluster
          operator: In
          values: ["etcd-test"]
      topologyKey: failure-domain.beta.kubernetes.io/zone

First, after the scheduler observes the pending Pod generated by the etcd workload, it uses the labels in the anti-affinity configuration to look up the nodes and availability zones of the Pods that have already been scheduled. Then, in the filtering stage, any candidate node whose availability zone matches that of an already scheduled Pod selected by the rule is eliminated, so the nodes that enter the scoring stage all satisfy the cross-availability-zone constraint. According to the configured scoring policies, the scheduler selects an optimal node and binds the Pod to it, finally achieving cross-availability-zone deployment and disaster recovery for the Pods.

However, in complex scenarios such as codis clusters and TDSQL distributed clusters, the built-in Kubernetes scheduler may not meet your needs. Kubernetes therefore provides the following extension mechanisms to help you customize scheduling strategies and satisfy complex scheduling requirements.

Custom scheduling policies, extended schedulers, etc.

First, you can adjust the scheduler's filtering (predicates) and scoring (priorities) policies to configure a scheduling strategy that meets your business needs. For example, if you want to reduce costs and support all services in the cluster with the smallest number of nodes, you need Pods to be scheduled preferentially onto nodes that meet their resource requests and already have the most allocated resources. In this scenario you can modify the priority policy and raise the weight of the MostRequestedPriority strategy.
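
A minimal sketch of the legacy scheduler policy file that raises the MostRequestedPriority weight, passed to kube-scheduler via the deprecated --policy-config-file flag; note that a real policy file should also list the predicates and other priorities you want to keep, and newer releases express the same intent with the NodeResourcesFit MostAllocated scoring strategy in KubeSchedulerConfiguration.

apiVersion: v1
kind: Policy
priorities:
- name: MostRequestedPriority      # prefer nodes that already have more resources allocated
  weight: 10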

Next, you can implement an extended scheduler (scheduler extender): in the predicates and priorities phases, the scheduler calls back into your extended scheduling service to satisfy your scheduling requirements. For example, if you want a group of MySQL or Redis master/standby instances serving the same data shard to achieve cross-node disaster recovery, you can implement your own predicates function that removes the nodes already hosting scheduled Pods of the same group from the candidate list, ensuring that the nodes entering the priorities phase all satisfy your business requirements.

You can also implement your own independent scheduler. After deploying it to the cluster, you only need to set the Pod's schedulerName to the name of your independent scheduler.
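
For example (the scheduler name below is an illustrative placeholder):

apiVersion: v1
kind: Pod
metadata:
  name: redis-master-0
spec:
  schedulerName: my-stateful-scheduler   # handled by the independent scheduler instead of default-scheduler
  containers:
  - name: redis
    image: redis:6.2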

Scheduler framework

Finally, Kubernetes introduced a new scheduling framework in version 1.15, which adds hooks before and after the core phases of the scheduler. The queueing phase supports custom sorting algorithms, the filtering phase provides the PreFilter and Filter interfaces, the scoring phase adds interfaces such as PreScore, Score, and NormalizeScore, and the binding phase provides the PreBind, Bind, and PostBind interfaces. Based on the scheduling framework, businesses can control scheduling more precisely and at lower cost, and custom scheduling strategies become simpler and more efficient.
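
A sketch of a KubeSchedulerConfiguration that enables hypothetical custom plugins at the Filter and Score extension points; the plugin names are placeholders, and the plugins themselves must be compiled into the scheduler binary.

apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: stateful-scheduler        # Pods select this profile via spec.schedulerName
  plugins:
    filter:
      enabled:
      - name: ShardAntiAffinity            # hypothetical custom Filter plugin
    score:
      enabled:
      - name: ShardSpreadScore             # hypothetical custom Score plugin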

High availability

Once scheduling is solved, our stateful service can be deployed in a highly available way. But highly available deployment does not mean the service can serve traffic with high availability. After containerization we may face more stability challenges than with traditional physical machine and virtual machine deployments. These challenges may come from the operator written by the business, Kubernetes components, runtime components such as docker/containerd, the Linux kernel, and so on. How do we deal with the stability problems caused by these factors?

In our design we should treat Pod exceptions as a normal, expected case: after any Pod becomes abnormal, a containerized deployment should have a self-healing mechanism. For a stateless service, we only need to add reasonable liveness and readiness probes to the business Pod: an abnormal Pod is automatically rebuilt, and after a node failure the Pods automatically drift to other nodes. In the stateful service scenario, however, even if the workload hosting your stateful service supports automatic Pod drift after node failure, this may still fail to meet the business requirements because of the long Pod self-healing time and data safety concerns. Why?
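
As an illustration, liveness and readiness probes for a Redis container could be sketched as follows; the probe commands and thresholds are illustrative and should be tuned per service.

containers:
- name: redis
  image: redis:6.2
  livenessProbe:                     # restart the container if redis stops responding
    exec:
      command: ["redis-cli", "ping"]
    initialDelaySeconds: 15
    periodSeconds: 10
    failureThreshold: 3
  readinessProbe:                    # remove the Pod from Service endpoints until it is ready
    exec:
      command: ["redis-cli", "ping"]
    initialDelaySeconds: 5
    periodSeconds: 5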

Suppose that in a codis cluster the node hosting a Redis master suddenly loses connectivity. If you have to wait five minutes before the self-healing process starts, the service is unavailable for five minutes, which is obviously unacceptable for important stateful services. And even if you shorten the self-healing time after a node loses connectivity, you cannot guarantee data safety: if the cluster network is split-brained and the disconnected node is still serving traffic, there will be two masters accepting writes, which may eventually cause data loss.

So what is a safe high-availability solution for stateful services? It depends on the high-availability mechanism of the stateful service itself; the Kubernetes platform layer alone cannot provide a safe solution. Commonly used high-availability mechanisms for stateful services include master-standby replication, decentralized replication, and consensus algorithms such as raft/paxos. Below I briefly explain the differences, advantages, and disadvantages of the three, and the points to pay attention to during containerization.

Master-standby replication

Both the codis cluster and the TDSQL cluster discussed above achieve high availability through master-standby replication, which is simpler to implement than decentralized replication and consensus algorithms. Master-standby replication can be further divided into fully synchronous replication, asynchronous replication, and semi-synchronous replication.

Fully synchronous replication means that after the master receives a write request, it must wait for all standby nodes to confirm before returning success to the client; if one standby node fails, the whole system becomes unavailable. This scheme sacrifices availability to guarantee the consistency of all replicas and is rarely used.

Asynchronous replication means that after the master receives a write request, it returns to the client immediately and forwards the request to the replicas asynchronously. If the master fails before the request is forwarded to a replica, data may be lost, but availability is the highest of the three.

Semi-synchronous replication lies between fully synchronous and asynchronous replication: after the master receives a write request, it returns success to the client once at least one replica has received the data, striking a balance between data consistency and availability.

For stateful services implemented with master-standby replication, the business needs to implement and deploy an HA service to perform master-standby switchover. HA services can be divided by architecture into active-reporting and distributed-detection types. For the active-reporting type, take the MySQL master/standby switchover in TDSQL as an example: an agent is deployed on each MySQL node and reports heartbeats to the metadata storage cluster (zookeeper/etcd); if the master's heartbeat is lost, the HA service quickly initiates a master/standby switchover. For the distributed-detection type, take Redis sentinel as an example: an odd number of sentinel nodes is deployed, each sentinel periodically probes the availability of the Redis master and standby instances and exchanges the probe results through the gossip protocol; if a majority agrees that a Redis master has failed, one of the sentinels initiates the master-standby switchover process.

Generally speaking, for stateful services based on master-standby replication in the traditional deployment mode, replacing a failed node relies on operations staff and manual intervention. After containerization, the operator can automatically replace failed nodes and quickly scale vertically, which significantly reduces operational complexity. However, Pods may be rebuilt or rescheduled, so the HA service responsible for master/standby switchover is still required to promote a standby Pod promptly and keep the service available. If the business is very sensitive to data consistency, frequent switchover may increase the probability of data loss; instability can be reduced by using dedicated nodes and stable, relatively new runtime and Kubernetes versions.

Decentralized replication

The opposite of master-standby replication is decentralized replication, in which any of the n replica nodes in a cluster can accept write requests, but a write succeeds only after w nodes have confirmed it, and a read must query at least r nodes. You can set appropriate w/r parameters according to how sensitive the business scenario is to data consistency; as long as w + r > n, every read quorum overlaps every write quorum. For example, if you want every read to see the most recently written value and n is 3 replicas, you can set both w and r to 2, so that of the two nodes you read, at least one must contain the newly written value. This kind of read is called a quorum read.

AWS's Dynamo system is implemented on a decentralized replication algorithm. Its advantages are that node roles are equal, operational complexity is reduced, availability is higher, containerization is easier, and no HA component needs to be deployed. Its disadvantage is that decentralized replication inevitably produces write conflicts, so the business has to handle conflict resolution.

Consensus algorithm

To guarantee availability, databases based on the replication algorithms above mostly provide only eventual consistency. Whether it is master-standby replication or decentralized replication, each has shortcomings and cannot satisfy strong data consistency and high availability at the same time.

How to solve the dilemma of the above replication algorithm?

The answer is the raft/paxos family of consensus algorithms, which were first proposed in the context of the replicated state machine. A replicated state machine consists of a consensus module, a log module, and a state machine, as shown in the figure below (quoted from the Raft paper). The consensus module ensures that the logs of all nodes are consistent; each node then executes the instructions in the same order based on the same log, so the results of all replicated state machines end up consistent. Taking the raft algorithm as an example, it consists of leader election, log replication, and safety. After the leader node fails, the follower nodes can quickly elect a new leader while keeping data safe; after a follower node fails, the overall availability of the cluster is unaffected as long as a majority of nodes survive.

Typical stateful services implemented on consensus algorithms are etcd, zookeeper, tikv, and so on. In this architecture the service itself integrates leader election and data synchronization, so both operations and containerization are significantly simpler than for master-standby replication services, and containerization is safer: even if a bug in the containerization process takes down the leader node, data safety and service availability are, thanks to the consensus algorithm, almost unaffected. Therefore, stateful services built on consensus algorithms are the first ones I recommend containerizing.

High performance

After achieving more stable operation of stateful services on Kubernetes, the next goal is higher performance and speed. The high performance of a stateful service relies on the underlying container network solution and disk IO solution. In the traditional physical machine and virtual machine deployment model, stateful services enjoy fixed IPs, a high-performance underlay network, and high-performance local SSD disks. How do we match that performance after containerization? I will briefly explain the Kubernetes solutions from the network side and the storage side.

Scalable network solutions

The first is the scalable, pluggable network solution. Drawing on Google's years of experience and lessons from containerized operations with Borg, the Kubernetes network model gives each Pod an independent IP, each Pod can communicate across hosts without NAT, and Pods can also communicate with nodes. This network model is well compatible with the network design of traditional physical machine and virtual machine businesses, which makes migrating traditional businesses to Kubernetes simpler. Most importantly, Kubernetes defines the open CNI plug-in standard, which describes how to assign an IP to a Pod and achieve cross-node container communication; open source projects and cloud vendors can implement high-performance, low-latency CNI plug-ins for their own business scenarios and underlying networks.

Among the Kubernetes network solutions implemented on CNI, the packet forwarding mode can be divided into underlay and overlay. The former interconnects containers directly on the underlying network and has good performance; the latter builds a virtual network on top of the underlying network using tunneling technology, so there is some performance loss.

Here I take the open source flannel and the TKE cluster network solutions as examples to illustrate their respective designs, advantages, and disadvantages.

flannel is designed to support multiple backend forwarding modes such as udp, vxlan, and host-gw. The udp and vxlan modes are overlay tunnel forwarding: the original packet is encapsulated in a udp or vxlan packet and forwarded to the destination container over the underlay network. In udp mode, encapsulation and decapsulation are performed in user space, so performance is poor; it is generally used for debugging or on old kernels that do not support the vxlan protocol. In vxlan mode, encapsulation and decapsulation are done in the kernel, so the performance loss is smaller. The host-gw mode directly distributes the routing information of each subnet to every node to achieve cross-host Pod communication without packet encapsulation; it has the best performance of the three, but it requires the host nodes to be reachable on the same Layer 2 network.
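
The backend mode is selected in flannel's net-conf.json, normally delivered through the kube-flannel ConfigMap; a sketch choosing the vxlan backend is shown below (replace vxlan with host-gw when all nodes share a Layer 2 network).

apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-flannel-cfg
  namespace: kube-system
data:
  net-conf.json: |
    {
      "Network": "10.244.0.0/16",
      "Backend": { "Type": "vxlan" }
    }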

The TKE cluster network solution also supports multiple communication modes, and it has gone through the evolution of three modes: global route, VPC-CNI, and Pod-exclusive ENI. In global route mode, when each node joins the cluster it is assigned a unique Pod CIDR, and TKE issues a global route into the user's VPC through the VPC interface; when traffic from containers or nodes in the user's VPC targets an IP in this Pod CIDR, it matches the global routing rule and is forwarded to the target node. In this solution the Pod CIDR is not a VPC resource, so Pods are not first-class citizens of the VPC and cannot use VPC features such as security groups, but the solution is simple, requires no packet encapsulation or decapsulation at the user's VPC layer, and the performance loss is small.

To solve the problem that a series of VPC features cannot be used because the container Pod IP is not a first-class citizen of the VPC, TKE implemented the VPC-CNI network mode, in which Pod IPs come from the user's VPC subnets. Cross-node container communication and node-to-container communication work on the same principles as CVM nodes inside the VPC; the underlying implementation is based on the VPC's GRE tunnel routing and forwarding, and packets are forwarded to the target container inside the node through policy routing. With this solution, container Pod IPs enjoy VPC features, enabling advanced capabilities such as CLB direct connection to Pods and fixed Pod IPs.

Recently, to meet the more extreme container network performance demands of games, storage, and other businesses, the TKE team launched a next-generation network solution: the Pod-exclusive ENI variant of VPC-CNI mode. Traffic no longer passes through the node's network protocol stack, which greatly shortens the access path and latency between containers and allows PPS to reach the limit of the whole machine. Based on this solution we have implemented Pod-bound EIP/NAT without relying on the node's external network access, Pod-bound security groups, and Pod-level security isolation. For details, you can refer to the related TKE articles.

Based on the scalable Kubernetes network model, businesses can implement high-performance network plug-ins for specific scenarios. For example, Tencent's internal tenc platform implements the sriov-cni plug-in based on SR-IOV technology, which provides Kubernetes with a high-performance Layer 2 VLAN network solution, especially for scenarios with high network performance requirements such as distributed machine learning training and game back-end services.

Scalable storage solutions

After the scalable network solution, the other core bottleneck of stateful services is the demand for high-performance storage IO. In the traditional deployment model, stateful services generally use local disks, choosing different disk types such as HDD and SSD according to the service type, specification, and external SLA. So how does Kubernetes satisfy storage demands in different scenarios?

The Kubernetes storage system decomposes this problem into several sub-problems and solves them elegantly, with good scalability and maintainability. Whether it is a local disk, a cloud disk, or a network file system such as NFS, corresponding plug-ins can be implemented on its extension mechanisms, and the responsibilities of development and operations are separated.

So how is the storage system of Kubernetes built?

Taking how to mount a data disk into your stateful Pod as an example, the scalable Kubernetes storage system can be broken down into the following steps:

  • How to apply for a storage disk? (consumer)
  • How does the Kubernetes storage system describe a storage disk? Is it manual creation of storage disks or automated on-demand creation of storage disks? (Producer)
  • How to match the storage resource pool disks with the requirements of the storage disk applicant? (Controller)
  • How to describe the type of storage disk, the data deletion strategy, and the service provider information of this type of disk? (storageClass)
  • How to implement the corresponding storage data volume plug-in? (FlexVolume, CSI)

First, Kubernetes provides a resource named PVC that describes the type, access mode, and capacity of the storage disk the application is requesting. For example, to request a 100G cloud disk of storage class cbs for the etcd service, you can create a PVC like the following.

apiVersion: v1
kind: PersistentVolumeClaim
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: cbs

Second, Kubernetes provides a resource named PV that describes the type, access mode, and capacity of an actual storage disk. It corresponds to a real disk and supports both manual and automatic creation. The example below describes a 100G cbs disk.

apiVersion: v1
kind: PersistentVolume
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 100Gi
  persistentVolumeReclaimPolicy: Delete
  qcloudCbs:
    cbsDiskId: disk-r1z73a3x
  storageClassName: cbs
  volumeMode: Filesystem

Then, when the application creates a PVC resource, the Kubernetes controller tries to match it with a PV: whether the storage class matches and whether the PV's capacity satisfies the PVC's request. If the match succeeds, the PV's status becomes Bound, the controller attaches the storage resource behind this PV to the node where the application Pod is scheduled, and after the attach succeeds the kubelet on that node mounts the disk into the corresponding data directory so the Pod can read and write it.

The above is the process of requesting a disk. So how does the PV/PVC system support multiple types of block storage and network file systems in containers? For example, block storage services include ordinary HDD cloud disks, SSD high-performance cloud disks, SSD cloud disks, and local disks, and remote network file systems include NFS. And how does the Kubernetes controller dynamically create PVs on demand?

To support multiple types of storage demands, Kubernetes provides the StorageClass resource to describe a storage class: the type of storage disk, the binding and reclaim policies, and which service component provisions the resources. For example, the high-performance and basic editions of a MySQL service may rely on different types of storage disks; you only need to fill in the corresponding storageClassName when creating the PVC.

Finally, to support the storage volumes of the open source community and cloud vendors, Kubernetes provides storage volume extension mechanisms, evolving from the early in-tree volumes, to the FlexVolume plug-in, and now to the GA container storage interface, CSI. Through them, storage providers can integrate any storage system into Kubernetes. For example, in the StorageClass below, the provisioner of cbs storage is cloud.tencent.com/qcloud-cbs: the TKE team, based on the Kubernetes FlexVolume/CSI extension mechanisms, creates and deletes cbs disks through the Tencent Cloud CBS API.

apiVersion: storage.k8s.io/v1
kind: StorageClass
parameters:
  type: cbs
provisioner: cloud.tencent.com/qcloud-cbs
reclaimPolicy: Delete
volumeBindingMode: Immediate

To meet stateful services' extreme pursuit of disk IO performance, Kubernetes implements the local PV mechanism on top of the PV/PVC system introduced above, which avoids network IO overhead and gives your service higher IO read and write performance. The core of local PV is to abstract a local disk or lvm partition as a PV; a Pod using a local PV relies on the delayed binding feature to be scheduled accurately onto the target node.
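
A sketch of local PV usage: a StorageClass with no provisioner and delayed binding, plus a PV that pins a local disk path to a specific node; the disk path and node name are illustrative.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-ssd
provisioner: kubernetes.io/no-provisioner    # local PVs are created statically
volumeBindingMode: WaitForFirstConsumer      # delay binding until the Pod is scheduled
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv-node1
spec:
  capacity:
    storage: 200Gi
  accessModes: ["ReadWriteOnce"]
  storageClassName: local-ssd
  local:
    path: /mnt/disks/ssd1                    # illustrative local disk mount path
  nodeAffinity:                              # required: ties the PV to the node that owns the disk
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values: ["node-1"]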

The key technical points of local PV are capacity isolation (lvm, xfs quota), IO isolation (cgroup v1 generally requires a customized kernel; cgroup v2 supports buffered IO throttling, etc.), and dynamic provisioning. To solve all or some of these pain points, the community has produced a series of open source projects such as TopoLVM (supports dynamic provisioning on lvm) and sig-storage-local-static-provisioner, and cloud vendors such as Tencent also have their own local PV solutions. In general, local PV is suitable for disk-IO-sensitive storage services such as etcd, MySQL, and tidb; for example, PingCAP's tidb project recommends local PV in production environments.

The disadvantage of local PV is that after the node fails, the data cannot be accessed and may be lost, and the volume cannot be expanded beyond the node's disk capacity. This places higher requirements on the stateful service itself and its operator: the service must ensure data safety through master-standby replication or a consensus algorithm, and after any node fails, the operator must promptly scale out a new node and restore data from cold backups or a leader snapshot. For example, when tidb's tikv service detects an abnormal instance, it automatically scales out a new instance and completes data replication through the raft protocol.

Chaos Engineering

With the above technical solutions, having solved a series of pain points such as workload selection, custom resource extension, scheduling, high availability, high-performance networking, and high-performance storage, we can build stable, highly available, elastically scalable stateful services on Kubernetes.

So how to verify the stability of the stateful service after containerization?

The community provides several Kubernetes-based chaos engineering open source projects, such as PingCAP's chaos-mesh, which offers multiple fault injection types such as PodChaos, NetworkChaos, and IOChaos. Based on chaos mesh, you can quickly inject Pod failures, disk IO anomalies, network IO anomalies, and other faults into any Pod in the cluster, helping you quickly discover bugs in the stateful service itself and its operator, and verify the stability of the cluster. In the TKE team we use chaos mesh to investigate and reproduce etcd bugs and to stress-test the stability of etcd clusters, which greatly reduces the difficulty of reproducing complex bugs and helps us improve the stability of the etcd kernel.
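
A sketch of a chaos-mesh PodChaos experiment that kills one Pod of a codis Redis group; the namespace and label selector are illustrative.

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: redis-pod-kill
  namespace: default
spec:
  action: pod-kill            # kill the selected Pod and let the operator/HA service recover it
  mode: one                   # pick one Pod at random from the matching set
  selector:
    namespaces:
    - default
    labelSelectors:
      app: codis-redis        # illustrative label on the Redis Pods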

Summary

This article introduced how to describe and deploy each component of a stateful cluster on Kubernetes through workload selection and extension mechanisms. Because of their particularity, stateful services take data safety, high availability, and high performance as their core goals. High availability can be achieved through scheduling and HA services: through the multiple scheduler extension mechanisms of Kubernetes you can deploy the equivalent Pods of your stateful service across fault domains, and through a master/standby switchover service or a consensus algorithm the standby node is automatically promoted to master after the primary fails, keeping the service available. High performance mainly depends on network and storage performance; Kubernetes provides the CNI network model, the PV/PVC storage system, and the CSI extension mechanism to meet the customization requirements of various business scenarios. Finally, the article introduced the application of chaos engineering to stateful services: through chaos engineering you can simulate various failure scenarios, test the fault tolerance of your stateful service, and verify and improve the stability of the system.

