Author
Wang Cheng is a Tencent Cloud R&D engineer and Kubernetes contributor. He works on the containerization of database products and on resource management and control, focusing on Kubernetes, Go, and cloud native.
Overview
Entering the world of K8s, you will find many interfaces designed for easy extension, including CSI, CNI, and CRI. These interfaces are abstracted out to provide openness, extensibility, and standardization.
K8s persistent storage has migrated from in-tree Volume plugins to out-of-tree CSI plugins. On the one hand, this decouples the K8s core code from Volume-related code for easier maintenance; on the other hand, it lets cloud vendors implement a unified interface and offer their own cloud storage capabilities, building an open, win-win cloud storage ecosystem.
This article analyzes the CSI implementation mechanism by following the core life cycle of a persistent volume (PV): create (Create), attach (Attach), detach (Detach), mount (Mount), unmount (Unmount), and delete (Delete).
Related terms
Term | Definition |
---|---|
CSI | Container Storage Interface. |
CNI | Container Network Interface. |
CRI | Container Runtime Interface. |
PV | Persistent Volume. |
PVC | Persistent Volume Claim. |
StorageClass | A resource object, defined per provisioner (i.e. Storage Provider), that groups the parameters used to create a Volume. |
Volume | A unit of storage that will be made available inside of a CO-managed container, via the CSI. |
Block Volume | A volume that will appear as a block device inside the container. |
Mounted Volume | A volume that will be mounted using the specified file system and appear as a directory inside the container. |
CO | Container Orchestration system, communicates with Plugins using CSI service RPCs. |
SP | Storage Provider, the vendor of a CSI plugin implementation. |
RPC | Remote Procedure Call. |
Node | A host where the user workload will be running, uniquely identifiable from the perspective of a Plugin by a node ID. |
Plugin | Aka “plugin implementation”, a gRPC endpoint that implements the CSI Services. |
Plugin Supervisor | Process that governs the lifecycle of a Plugin, MAY be the CO. |
Workload | The atomic unit of "work" scheduled by a CO. This MAY be a container or a collection of containers. |
This article and subsequent related articles are based on K8s v1.22.
Process overview
PV creation core process:
- apiserver creates the Pod and the Volumes declared in PodSpec.Volumes;
- PVController watches the PVC/PV informers, adds the related annotation (such as pv.kubernetes.io/provisioned-by), and reconciles until the PVC and PV are bound (Bound);
- StorageClass.volumeBindingMode is checked: WaitForFirstConsumer waits until the Pod has been successfully scheduled to a Node before creating the PV, while Immediate calls the PV creation logic right away without waiting for Pod scheduling;
- external-provisioner watches the PVC/PV informers and calls the CreateVolume RPC to create the Volume;
- AttachDetachController passes the successfully bound (Bound) PVC/PV through the InTreeToCSITranslator converter, and internal logic then creates the VolumeAttachment resource;
- external-attacher watches the VolumeAttachment informer and calls the ControllerPublishVolume RPC to implement AttachVolume;
- kubelet reconciles continuously: based on controllerAttachDetachEnabled || PluginIsAttachable and the current Volume state, it finally mounts the Volume to the specified directory of the Pod for use by the Container.
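The volumeBindingMode decision in the list above can be sketched in a few lines of Go. This is only an illustration of the rule as described here: the `shouldProvision` helper and its parameters are hypothetical names, not functions from the K8s code base.

```go
package main

import "fmt"

// BindingMode mirrors the two StorageClass.volumeBindingMode values named above.
type BindingMode string

const (
	Immediate            BindingMode = "Immediate"
	WaitForFirstConsumer BindingMode = "WaitForFirstConsumer"
)

// shouldProvision is a hypothetical helper: it reports whether PV creation may
// start, given the binding mode and the Node the Pod was scheduled to
// (an empty string means "not scheduled yet").
func shouldProvision(mode BindingMode, scheduledNode string) bool {
	switch mode {
	case WaitForFirstConsumer:
		// Wait until the Pod has been scheduled to a Node.
		return scheduledNode != ""
	default:
		// Immediate: create the PV without waiting for Pod scheduling.
		return true
	}
}

func main() {
	fmt.Println(shouldProvision(Immediate, ""))                  // true
	fmt.Println(shouldProvision(WaitForFirstConsumer, ""))       // false
	fmt.Println(shouldProvision(WaitForFirstConsumer, "node-1")) // true
}
```

Delaying creation until scheduling lets the storage backend pick a location that is reachable from the Node the Pod actually runs on.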
Speaking from CSI
CSI (Container Storage Interface) is an industry-standard interface specification ( https://github.com/container-storage-interface/spec ) jointly formulated by community members from Kubernetes, Mesos, Docker, and others. Its goal is to expose arbitrary storage systems to containerized applications.
The CSI specification defines the minimum operation set and deployment recommendations for storage providers to implement a CSI-compatible Volume Plugin. The main focus of the CSI specification is to declare the interfaces that Volume Plugin must implement.
First look at the life cycle of Volume:
   CreateVolume +------------+ DeleteVolume
 +------------->|  CREATED   +--------------+
 |              +---+----^---+              |
 |       Controller |    | Controller       v
+++         Publish |    | Unpublish       +++
|X|          Volume |    | Volume          | |
+-+             +---v----+---+             +-+
                | NODE_READY |
                +---+----^---+
               Node |    | Node
              Stage |    | Unstage
             Volume |    | Volume
                +---v----+---+
                | VOL_READY  |
                +---+----^---+
               Node |    | Node
            Publish |    | Unpublish
             Volume |    | Volume
                +---v----+---+
                | PUBLISHED  |
                +------------+
The lifecycle of a dynamically provisioned volume, from
creation to destruction, when the Node Plugin advertises the
STAGE_UNSTAGE_VOLUME capability.
As can be seen from the volume life cycle, a persistent volume needs to go through the following stages to reach the Pod usable state:
CreateVolume -> ControllerPublishVolume -> NodeStageVolume -> NodePublishVolume
When deleting a Volume, it will go through the following reverse phases:
NodeUnpublishVolume -> NodeUnstageVolume -> ControllerUnpublishVolume -> DeleteVolume
Each step of the above process corresponds to a standard interface defined by CSI. Cloud storage vendors only need to implement their own cloud storage plug-in against these standard interfaces to connect seamlessly with the underlying K8s orchestration system and provide diversified cloud storage, backup, snapshot, and other capabilities.
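The lifecycle diagram can be read as a small state machine. The sketch below encodes its arrows as a transition table; the state and RPC names come from the diagram, while the `State`, `Next`, and `Run` names are illustrative and not part of the CSI specification or any K8s package.

```go
package main

import (
	"errors"
	"fmt"
)

// State is one box in the lifecycle diagram; None stands for the "X" box
// (the volume does not exist).
type State string

const (
	None      State = "NONE"
	Created   State = "CREATED"
	NodeReady State = "NODE_READY"
	VolReady  State = "VOL_READY"
	Published State = "PUBLISHED"
)

type step struct {
	from State
	rpc  string
}

// transitions lists every arrow of the diagram: (state, RPC) -> next state.
var transitions = map[step]State{
	{from: None, rpc: "CreateVolume"}:                   Created,
	{from: Created, rpc: "ControllerPublishVolume"}:     NodeReady,
	{from: NodeReady, rpc: "NodeStageVolume"}:           VolReady,
	{from: VolReady, rpc: "NodePublishVolume"}:          Published,
	{from: Published, rpc: "NodeUnpublishVolume"}:       VolReady,
	{from: VolReady, rpc: "NodeUnstageVolume"}:          NodeReady,
	{from: NodeReady, rpc: "ControllerUnpublishVolume"}: Created,
	{from: Created, rpc: "DeleteVolume"}:                None,
}

// ErrInvalidTransition is returned when an RPC has no arrow from the current state.
var ErrInvalidTransition = errors.New("RPC is not valid in the current state")

// Next applies one RPC to a state.
func Next(s State, rpc string) (State, error) {
	to, ok := transitions[step{from: s, rpc: rpc}]
	if !ok {
		return s, fmt.Errorf("%s in state %s: %w", rpc, s, ErrInvalidTransition)
	}
	return to, nil
}

// Run applies a sequence of RPCs and stops at the first invalid one.
func Run(start State, rpcs ...string) (State, error) {
	s := start
	for _, rpc := range rpcs {
		next, err := Next(s, rpc)
		if err != nil {
			return s, err
		}
		s = next
	}
	return s, nil
}

func main() {
	s, err := Run(None, "CreateVolume", "ControllerPublishVolume", "NodeStageVolume", "NodePublishVolume")
	fmt.Println(s, err) // prints "PUBLISHED <nil>"
}
```

Running the reverse sequence from PUBLISHED returns the volume to the non-existent state, and skipping a step (for example NodePublishVolume directly from CREATED) is rejected.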
Multi-component collaboration
To provide highly extensible, out-of-tree persistent volume management, the K8s CSI implementation relies on several cooperating components, introduced below.
Component introduction
- kube-controller-manager: the K8s resource controller; mainly through PVController and AttachDetachController, it implements binding (Bound) / unbinding (Unbound) and attaching (Attach) / detaching (Detach) of persistent volumes;
- CSI-plugin: separated out of the K8s core; it implements the control logic for, and the calls to, the CSI standard interfaces, and is the hub of the entire CSI control logic;
- node-driver-registrar: a sidecar container (Sidecar) maintained by the official K8s SIG team; it registers the plug-in with kubelet through kubelet's plug-in registration mechanism, and queries the plug-in's CSI Identity service to obtain the plug-in information;
- external-provisioner: a sidecar container (Sidecar) maintained by the official K8s SIG team; its main function is to create (Create) and delete (Delete) persistent volumes;
- external-attacher: a sidecar container (Sidecar) maintained by the official K8s SIG team; its main function is to attach (Attach) and detach (Detach) persistent volumes;
- external-snapshotter: a sidecar container (Sidecar) maintained by the official K8s SIG team; its main function is to implement snapshots (VolumeSnapshot), backup, and restore of persistent volumes;
- external-resizer: a sidecar container (Sidecar) maintained by the official K8s SIG team; its main function is to implement elastic resizing of persistent volumes, for which the cloud vendor's plug-in must provide the corresponding capability;
- kubelet: the control hub running on each Node in K8s; its main function is to reconcile the attachment, mounting, monitoring, probing, and reporting of Pods and Volumes on the node;
- cloud-storage-provider: plug-ins implemented by cloud storage vendors against the CSI standard interfaces, including the Identity service, Controller service, and Node service.
Component communication
Because CSI plugin code is considered untrusted in K8s, the CSI Controller Server and the external CSI sidecars, as well as the CSI Node Server and kubelet, communicate over a Unix socket, while communication with the Storage Service provided by the cloud storage vendor uses gRPC (HTTP/2).
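As an illustration of the Unix socket side, the sketch below parses an endpoint string of the form `unix:///path/to/csi.sock` and opens a listener on it using only the Go standard library. The `parseEndpoint` and `listen` helpers and the socket path are assumptions made for this example; a real plug-in would hand the listener to a gRPC server, which is omitted here.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"path/filepath"
	"strings"
)

// parseEndpoint splits an endpoint such as "unix:///csi/csi.sock" into the
// network and address arguments expected by net.Listen.
func parseEndpoint(endpoint string) (network, addr string, err error) {
	scheme, rest, found := strings.Cut(endpoint, "://")
	if !found || rest == "" {
		return "", "", fmt.Errorf("invalid endpoint %q", endpoint)
	}
	switch scheme {
	case "unix", "tcp":
		return scheme, rest, nil
	default:
		return "", "", fmt.Errorf("unsupported scheme %q", scheme)
	}
}

// listen opens a listener for the endpoint, removing a stale socket file
// left behind by a previous run.
func listen(endpoint string) (net.Listener, error) {
	network, addr, err := parseEndpoint(endpoint)
	if err != nil {
		return nil, err
	}
	if network == "unix" {
		if err := os.Remove(addr); err != nil && !os.IsNotExist(err) {
			return nil, err
		}
	}
	return net.Listen(network, addr)
}

func main() {
	// A temporary directory stands in for the directory shared between
	// the plug-in container and its sidecars.
	dir, err := os.MkdirTemp("", "csi")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	l, err := listen("unix://" + filepath.Join(dir, "csi.sock"))
	if err != nil {
		panic(err)
	}
	defer l.Close()
	fmt.Println("listening on", l.Addr().Network(), l.Addr().String())
}
```

Because the socket is a file, access to the plug-in can be limited to the processes that share that directory, which fits the "untrusted plug-in" model described above.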
RPC call
According to the CSI specification, a cloud storage vendor that wants to plug into the K8s container orchestration system must implement the interfaces the specification defines. They fall into three groups:
- Identity service: both the Node Plugin and the Controller Plugin must implement this set of RPCs, which negotiates version information between K8s and CSI and exposes information about the plug-in.
- Controller service: the Controller Plugin must implement this set of RPCs, which creates and manages Volumes and corresponds to the attach/detach volume operations in K8s.
- Node service: the Node Plugin must implement this set of RPCs, which mounts a Volume to the specified directory and corresponds to the mount/unmount volume operations in K8s.
Related RPC interface functions are as follows:
Create/Delete PV
The creation and deletion of persistent volumes (PV) in K8s are implemented by the external-provisioner component. Its source code is at https://github.com/kubernetes-csi/external-provisioner
First, get the command line parameters through the standard cmd method and execute the newController -> Run() logic. The relevant code is as follows:
// external-provisioner/cmd/csi-provisioner/csi-provisioner.go
func main() {
    ...
    // Initialize the provisioner, which implements the Volume create/delete interface
    csiProvisioner := ctrl.NewCSIProvisioner(
        clientset,
        *operationTimeout,
        identity,
        *volumeNamePrefix,
        *volumeNameUUIDLength,
        grpcClient,
        snapClient,
        provisionerName,
        pluginCapabilities,
        controllerCapabilities,
        ...
    )
    ...
    // The actual ProvisionController, which wraps the CSIProvisioner above
    provisionController = controller.NewProvisionController(
        clientset,
        provisionerName,
        csiProvisioner,
        provisionerOptions...,
    )
    ...
    run := func(ctx context.Context) {
        ...
        // Start the controller
        provisionController.Run(ctx)
    }
}
Next, the PV creation/deletion call chains are:
PV creation: runClaimWorker -> syncClaimHandler -> syncClaim -> provisionClaimOperation -> Provision -> CreateVolume
PV deletion: runVolumeWorker -> syncVolumeHandler -> syncVolume -> deleteVolumeOperation -> Delete -> DeleteVolume
Related interfaces are abstracted by sigs.k8s.io/sig-storage-lib-external-provisioner:
// sigs.k8s.io/sig-storage-lib-external-provisioner is pulled in through vendoring
// external-provisioner/vendor/sigs.k8s.io/sig-storage-lib-external-provisioner/v7/controller/volume.go
type Provisioner interface {
    // Calls the CreateVolume RPC to create the PV
    Provision(context.Context, ProvisionOptions) (*v1.PersistentVolume, ProvisioningState, error)
    // Calls the DeleteVolume RPC to delete the PV
    Delete(context.Context, *v1.PersistentVolume) error
}
Controller reconciliation
The controllers related to PV in K8s include PVController and AttachDetachController.
PVController
PVController adds the related annotation (such as pv.kubernetes.io/provisioned-by) to the PVC, and the external-provisioner component creates or deletes the corresponding PV. Once PVController observes that the PV has been created successfully, it binds the PV to the PVC (Bound), and its reconcile task is complete. The flow then passes to AttachDetachController for the next step.
It is worth mentioning that PVController uses a local cache to handle status updates and binding events for PVCs and PVs efficiently: in addition to the K8s informer mechanism, it maintains its own local store for handling Add/Update/Delete events.
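One way to picture such a local store is a cache that only accepts an object if it is not older than the copy it already holds. The sketch below is a simplified illustration under that assumption; `versionedStore`, `object`, and the integer `ResourceVersion` are hypothetical names for this example, not the controller's actual types.

```go
package main

import "fmt"

// object is a simplified stand-in for a cached PV or PVC.
type object struct {
	Key             string
	ResourceVersion int64
	Phase           string
}

// versionedStore keeps the newest known version of each object.
type versionedStore struct {
	items map[string]object
}

func newVersionedStore() *versionedStore {
	return &versionedStore{items: make(map[string]object)}
}

// Update stores obj unless the cache already holds a strictly newer version.
// It reports whether the cache was written, so callers can skip processing
// stale events. Letting an equal version through is a choice of this sketch,
// so that periodic resync events are still processed.
func (s *versionedStore) Update(obj object) bool {
	if cur, ok := s.items[obj.Key]; ok && obj.ResourceVersion < cur.ResourceVersion {
		return false
	}
	s.items[obj.Key] = obj
	return true
}

// Get returns the cached object for key, if any.
func (s *versionedStore) Get(key string) (object, bool) {
	o, ok := s.items[key]
	return o, ok
}

func main() {
	s := newVersionedStore()
	fmt.Println(s.Update(object{Key: "pv-1", ResourceVersion: 2, Phase: "Bound"}))     // true
	fmt.Println(s.Update(object{Key: "pv-1", ResourceVersion: 1, Phase: "Available"})) // false
	o, _ := s.Get("pv-1")
	fmt.Println(o.Phase) // Bound
}
```

With a check like this, a delayed informer event cannot overwrite state that the controller has already advanced.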
First, through the standard newController -> Run() logic:
// kubernetes/pkg/controller/volume/persistentvolume/pv_controller_base.go
func NewController(p ControllerParameters) (*PersistentVolumeController, error) {
    ...
    // Initialize the PVController
    controller := &PersistentVolumeController{
        volumes:                       newPersistentVolumeOrderedIndex(),
        claims:                        cache.NewStore(cache.DeletionHandlingMetaNamespaceKeyFunc),
        kubeClient:                    p.KubeClient,
        eventRecorder:                 eventRecorder,
        runningOperations:             goroutinemap.NewGoRoutineMap(true /* exponentialBackOffOnError */),
        cloud:                         p.Cloud,
        enableDynamicProvisioning:     p.EnableDynamicProvisioning,
        clusterName:                   p.ClusterName,
        createProvisionedPVRetryCount: createProvisionedPVRetryCount,
        createProvisionedPVInterval:   createProvisionedPVInterval,
        claimQueue:                    workqueue.NewNamed("claims"),
        volumeQueue:                   workqueue.NewNamed("volumes"),
        resyncPeriod:                  p.SyncPeriod,
        operationTimestamps:           metrics.NewOperationStartTimeCache(),
    }
    ...
    // Watch PV add/update/delete events
    p.VolumeInformer.Informer().AddEventHandler(
        cache.ResourceEventHandlerFuncs{
            AddFunc:    func(obj interface{}) { controller.enqueueWork(controller.volumeQueue, obj) },
            UpdateFunc: func(oldObj, newObj interface{}) { controller.enqueueWork(controller.volumeQueue, newObj) },
            DeleteFunc: func(obj interface{}) { controller.enqueueWork(controller.volumeQueue, obj) },
        },
    )
    ...
    // Watch PVC add/update/delete events
    p.ClaimInformer.Informer().AddEventHandler(
        cache.ResourceEventHandlerFuncs{
            AddFunc:    func(obj interface{}) { controller.enqueueWork(controller.claimQueue, obj) },
            UpdateFunc: func(oldObj, newObj interface{}) { controller.enqueueWork(controller.claimQueue, newObj) },
            DeleteFunc: func(obj interface{}) { controller.enqueueWork(controller.claimQueue, obj) },
        },
    )
    ...
    return controller, nil
}
Then, the PVC/PV binding/unbinding call chains are:
PVC/PV binding: claimWorker -> updateClaim -> syncClaim -> syncBoundClaim -> bind
PVC/PV unbinding: volumeWorker -> updateVolume -> syncVolume -> unbindVolume
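The binding step can be illustrated with a small matching routine. The sketch assumes a "smallest volume that fits" policy with a StorageClass check; the `PV` and `PVC` structs and the `findBestMatch` and `bind` helpers are simplified, hypothetical stand-ins and leave out access modes, selectors, and the other criteria a real controller evaluates.

```go
package main

import "fmt"

// PV and PVC are simplified stand-ins for the K8s objects.
type PV struct {
	Name          string
	CapacityBytes int64
	StorageClass  string
	ClaimRef      string // name of the bound claim; empty means unbound
}

type PVC struct {
	Name         string
	RequestBytes int64
	StorageClass string
}

// findBestMatch returns the index of the smallest unbound volume of the same
// StorageClass that satisfies the request, or -1 if none fits.
func findBestMatch(claim PVC, volumes []PV) int {
	best := -1
	for i, v := range volumes {
		if v.ClaimRef != "" || v.StorageClass != claim.StorageClass || v.CapacityBytes < claim.RequestBytes {
			continue
		}
		if best == -1 || v.CapacityBytes < volumes[best].CapacityBytes {
			best = i
		}
	}
	return best
}

// bind records the claim on the chosen volume and returns the volume name.
func bind(claim PVC, volumes []PV) (string, bool) {
	i := findBestMatch(claim, volumes)
	if i < 0 {
		return "", false
	}
	volumes[i].ClaimRef = claim.Name
	return volumes[i].Name, true
}

func main() {
	const gi = int64(1) << 30
	volumes := []PV{
		{Name: "pv-10", CapacityBytes: 10 * gi, StorageClass: "fast"},
		{Name: "pv-5", CapacityBytes: 5 * gi, StorageClass: "fast"},
		{Name: "pv-20", CapacityBytes: 20 * gi, StorageClass: "fast"},
	}
	fmt.Println(bind(PVC{Name: "claim-a", RequestBytes: 4 * gi, StorageClass: "fast"}, volumes)) // pv-5 true
	fmt.Println(bind(PVC{Name: "claim-b", RequestBytes: 4 * gi, StorageClass: "fast"}, volumes)) // pv-10 true
}
```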
AttachDetachController
AttachDetachController passes the successfully bound PVC/PV through the InTreeToCSITranslator converter, which translates Volumes managed in-tree into Volumes managed by out-of-tree CSI plug-ins.
Internal logic then creates or deletes the VolumeAttachment resource, and the reconcile task is complete. The flow then passes to the external-attacher component for the next step.
The relevant core code, driven by reconciler.Run(), is as follows:
// kubernetes/pkg/controller/volume/attachdetach/reconciler/reconciler.go
func (rc *reconciler) reconcile() {
    // Detach first, so that Volumes of Pods rescheduled to other nodes are detached (Detach) in advance
    for _, attachedVolume := range rc.actualStateOfWorld.GetAttachedVolumes() {
        // If the Volume is not in the desired state, call DetachVolume to delete the VolumeAttachment object
        if !rc.desiredStateOfWorld.VolumeExists(
            attachedVolume.VolumeName, attachedVolume.NodeName) {
            ...
            err = rc.attacherDetacher.DetachVolume(attachedVolume.AttachedVolume, verifySafeToDetach, rc.actualStateOfWorld)
            ...
        }
    }
    // Call AttachVolume to create the VolumeAttachment object
    rc.attachDesiredVolumes()
    ...
}
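The "detach first, then attach" pattern can be reduced to a comparison between a desired and an actual state. The following sketch is a simplified model of that idea; `stateOfWorld`, `attachment`, and `recorder` are hypothetical types for this example and not the controller's real data structures.

```go
package main

import "fmt"

// attachment identifies a Volume attached to a Node.
type attachment struct{ Volume, Node string }

// stateOfWorld is a simplified set of attachments.
type stateOfWorld map[attachment]bool

// attacherDetacher performs the actual operations.
type attacherDetacher interface {
	Attach(a attachment) error
	Detach(a attachment) error
}

// recorder is a fake attacherDetacher that only records what was asked of it.
type recorder struct{ ops []string }

func (r *recorder) Attach(a attachment) error {
	r.ops = append(r.ops, "attach "+a.Volume+"@"+a.Node)
	return nil
}

func (r *recorder) Detach(a attachment) error {
	r.ops = append(r.ops, "detach "+a.Volume+"@"+a.Node)
	return nil
}

// reconcile drives actual towards desired: detach what is no longer wanted
// first, then attach what is missing.
func reconcile(desired, actual stateOfWorld, ad attacherDetacher) {
	for a := range actual {
		if !desired[a] {
			if err := ad.Detach(a); err == nil {
				delete(actual, a)
			}
		}
	}
	for a := range desired {
		if !actual[a] {
			if err := ad.Attach(a); err == nil {
				actual[a] = true
			}
		}
	}
}

func main() {
	// The Pod using vol-1 was rescheduled from node-a to node-b.
	actual := stateOfWorld{{Volume: "vol-1", Node: "node-a"}: true}
	desired := stateOfWorld{{Volume: "vol-1", Node: "node-b"}: true}
	r := &recorder{}
	reconcile(desired, actual, r)
	fmt.Println(r.ops) // [detach vol-1@node-a attach vol-1@node-b]
}
```

Detaching before attaching matters for volumes that can be attached to only one node at a time: the attach on the new node would otherwise fail while the old attachment still exists.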
Attach/detach Volume
Attaching (Attach) and detaching (Detach) of persistent volumes (PV) in K8s are implemented by the external-attacher component. Its source code is at https://github.com/kubernetes-csi/external-attacher
The external-attacher component watches the VolumeAttachment objects created by AttachDetachController in the previous step. If the driver name in an object's .spec.attacher field identifies the CSI plug-in running in the same Pod, it calls the plug-in's ControllerPublishVolume interface to attach the Volume.
First, get the command line parameters through the standard cmd method, and execute the newController -> Run() logic. The relevant code is as follows:
// external-attacher/cmd/csi-attacher/main.go
func main() {
    ...
    ctrl := controller.NewCSIAttachController(
        clientset,
        csiAttacher,
        handler,
        factory.Storage().V1().VolumeAttachments(),
        factory.Core().V1().PersistentVolumes(),
        workqueue.NewItemExponentialFailureRateLimiter(*retryIntervalStart, *retryIntervalMax),
        workqueue.NewItemExponentialFailureRateLimiter(*retryIntervalStart, *retryIntervalMax),
        supportsListVolumesPublishedNodes,
        *reconcileSync,
    )
    run := func(ctx context.Context) {
        stopCh := ctx.Done()
        factory.Start(stopCh)
        ctrl.Run(int(*workerThreads), stopCh)
    }
    ...
}
Next, the Volume attach/detach call chains are:
Volume attach (Attach): syncVA -> SyncNewOrUpdatedVolumeAttachment -> syncAttach -> csiAttach -> Attach -> ControllerPublishVolume
Volume detach (Detach): syncVA -> SyncNewOrUpdatedVolumeAttachment -> syncDetach -> csiDetach -> Detach -> ControllerUnpublishVolume
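The decision the attacher makes for each VolumeAttachment can be sketched as follows. This is a simplified model, assuming a reduced `VolumeAttachment` struct and a fake controller client; the function is named `syncVA` after the first step in the chains above, but its body is illustrative rather than the component's actual code.

```go
package main

import "fmt"

// VolumeAttachment is a reduced stand-in for the K8s object.
type VolumeAttachment struct {
	Attacher string // driver name from .spec.attacher
	VolumeID string
	NodeID   string
	Deleting bool // the object is being deleted
	Attached bool // status reported back after a successful attach
}

// controllerClient stands in for the plug-in's Controller service.
type controllerClient interface {
	ControllerPublishVolume(volumeID, nodeID string) error
	ControllerUnpublishVolume(volumeID, nodeID string) error
}

// fakeClient counts calls instead of talking to a real plug-in.
type fakeClient struct{ published, unpublished int }

func (f *fakeClient) ControllerPublishVolume(volumeID, nodeID string) error {
	f.published++
	return nil
}

func (f *fakeClient) ControllerUnpublishVolume(volumeID, nodeID string) error {
	f.unpublished++
	return nil
}

// syncVA handles one VolumeAttachment. It returns false when the object
// belongs to a different driver and is therefore ignored.
func syncVA(driverName string, va *VolumeAttachment, c controllerClient) (bool, error) {
	if va.Attacher != driverName {
		return false, nil
	}
	if va.Deleting {
		if err := c.ControllerUnpublishVolume(va.VolumeID, va.NodeID); err != nil {
			return true, err
		}
		va.Attached = false
		return true, nil
	}
	if !va.Attached {
		if err := c.ControllerPublishVolume(va.VolumeID, va.NodeID); err != nil {
			return true, err
		}
		va.Attached = true
	}
	return true, nil
}

func main() {
	c := &fakeClient{}
	va := &VolumeAttachment{Attacher: "example.csi.driver", VolumeID: "vol-1", NodeID: "node-a"}
	handled, err := syncVA("example.csi.driver", va, c)
	fmt.Println(handled, err, va.Attached, c.published)
}
```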
kubelet mount/unmount Volume
Mounting and unmounting of persistent volumes (PV) in K8s are implemented by the kubelet component.
The kubelet starts a reconcile loop through VolumeManager. When it observes that a new Pod using a PV whose PersistentVolumeSource is CSI has been scheduled to this node, it calls the reconcile function to perform the related Attach/Detach/Mount/Unmount logic.
// kubernetes/pkg/kubelet/volumemanager/reconciler/reconciler.go
func (rc *reconciler) reconcile() {
    // Unmount first, so that Volumes re-attached to other Pods after a Pod deletion are unmounted (Unmount) in advance
    rc.unmountVolumes()
    // Next, based on controllerAttachDetachEnabled || PluginIsAttachable and the current Volume state,
    // perform AttachVolume / MountVolume / ExpandInUseVolume
    rc.mountAttachVolumes()
    // Unmount or detach (Detach) Volumes that are no longer needed (Pod deleted)
    rc.unmountDetachDevices()
}
The related call logic is as follows:
Volume mount (Mount): reconcile -> mountAttachVolumes -> MountVolume -> SetUp -> SetUpAt -> NodePublishVolume
Volume Unmount (Unmount): reconcile -> unmountVolumes -> UnmountVolume -> TearDown -> TearDownAt -> NodeUnpublishVolume
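On the node side, the order of the CSI calls follows the lifecycle diagram shown earlier: stage before publish when mounting, and unpublish before unstage when unmounting, with the stage steps applying only if the plug-in advertises STAGE_UNSTAGE_VOLUME. The sketch below models just that ordering; `nodeClient`, `callLog`, `mountVolume`, and `unmountVolume` are hypothetical names, and the real kubelet splits these steps across separate device and volume operations.

```go
package main

import "fmt"

// nodeClient stands in for the plug-in's Node service.
type nodeClient interface {
	NodeStageVolume(volumeID, stagingPath string) error
	NodeUnstageVolume(volumeID, stagingPath string) error
	NodePublishVolume(volumeID, targetPath string) error
	NodeUnpublishVolume(volumeID, targetPath string) error
}

// callLog is a fake Node service that records the order of calls.
type callLog struct{ calls []string }

func (l *callLog) NodeStageVolume(volumeID, stagingPath string) error {
	l.calls = append(l.calls, "NodeStageVolume")
	return nil
}

func (l *callLog) NodeUnstageVolume(volumeID, stagingPath string) error {
	l.calls = append(l.calls, "NodeUnstageVolume")
	return nil
}

func (l *callLog) NodePublishVolume(volumeID, targetPath string) error {
	l.calls = append(l.calls, "NodePublishVolume")
	return nil
}

func (l *callLog) NodeUnpublishVolume(volumeID, targetPath string) error {
	l.calls = append(l.calls, "NodeUnpublishVolume")
	return nil
}

// mountVolume stages the volume (if supported) and then publishes it into
// the Pod's target directory.
func mountVolume(c nodeClient, stageUnstage bool, volumeID, stagingPath, targetPath string) error {
	if stageUnstage {
		if err := c.NodeStageVolume(volumeID, stagingPath); err != nil {
			return err
		}
	}
	return c.NodePublishVolume(volumeID, targetPath)
}

// unmountVolume reverses mountVolume: unpublish first, then unstage.
func unmountVolume(c nodeClient, stageUnstage bool, volumeID, stagingPath, targetPath string) error {
	if err := c.NodeUnpublishVolume(volumeID, targetPath); err != nil {
		return err
	}
	if stageUnstage {
		return c.NodeUnstageVolume(volumeID, stagingPath)
	}
	return nil
}

func main() {
	l := &callLog{}
	// The paths are placeholders for this example.
	_ = mountVolume(l, true, "vol-1", "/staging/vol-1", "/pods/demo/volumes/vol-1")
	_ = unmountVolume(l, true, "vol-1", "/staging/vol-1", "/pods/demo/volumes/vol-1")
	fmt.Println(l.calls)
}
```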
Summary
This article analyzed the CSI implementation mechanism by following the core life cycle of a persistent volume (PV) in K8s: create (Create), attach (Attach), detach (Detach), mount (Mount), unmount (Unmount), and delete (Delete). To make the K8s CSI workflow easier to understand, the relevant logic was explained through source code and diagrams.
K8s opens up its storage capabilities through out-of-tree CSI plug-ins. On the one hand, this decouples the K8s core code from Volume-related code for easier maintenance; on the other hand, under the interfaces defined by the CSI specification, cloud vendors can implement the relevant interfaces according to their business requirements and offer their own cloud storage capabilities, building an open, win-win cloud storage ecosystem.
PS: For more content, follow k8s-club.