
1. Background

At present, containers have become the mainstream choice for enterprises moving to the cloud. After in-depth development and promotion in the second half of 2019, OPPO basically achieved large-scale use of Kubernetes-based containers and full-service cloudification in 2020. The advantages of containers are agility and high performance; however, because containers share the host kernel and are not fully isolated, it is hard for them to serve users who need to tune many custom kernel parameters, run a container built for a newer Linux version on a host with an older kernel, or simply require stronger isolation. For historical reasons, some businesses within the company still require strongly isolated virtual machines, so providing a virtual machine service was imperative.

After investigation, we found that companies that had already built a container platform mostly managed virtual machines by maintaining a separate OpenStack or similar system. However, OpenStack is large and heavy, its maintenance cost is high, and it cannot manage the underlying resources together with the container platform, which brings much inconvenience to hybrid scheduling. We therefore made unified control plane management, with unified scheduling and management of both containers and virtual machines, the main direction of our technology selection.

2. Solution selection: KubeVirt or Virtlet

The best-known projects in the industry for managing virtual machines alongside containers through the k8s platform are KubeVirt and Virtlet.
KubeVirt is an open source Red Hat project that runs virtual machines inside containers. As a k8s add-on, it uses k8s CRDs to add the resource type VirtualMachineInstance (VMI), uses the container image registry to distribute virtual machine images, and provides VM lifecycle management.

Virtlet is an implementation of the Kubernetes CRI (Container Runtime Interface) that runs virtual-machine-based Pods on Kubernetes. (The CRI is what enables Kubernetes to run non-Docker runtimes, such as rkt.)
Below is part of the comparison between KubeVirt and Virtlet that we made during selection in early 2020. As it shows, Virtlet describes both containers and virtual machines with the same resource type, Pod, so a virtual machine used natively can only be in the Running or Deleted state; VM-specific states such as pause/unpause and start/stop cannot be supported, which obviously does not meet user needs. Supporting those states would require deep customization of kubelet, coupling virtual machine management far too tightly with container management. Considering in addition that the Virtlet community was not as active as the KubeVirt community at the time, we finally chose KubeVirt.

3. KubeVirt introduction

3.1 Correspondence between the VMI CRD, Pod, and Domain
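
In native KubeVirt, each VMI object corresponds one-to-one to a virt-launcher pod (named virt-launcher-<vmi-name>-<suffix>) and, once running, to a libvirt domain (named <namespace>_<vmi-name>). The manifest below is a minimal illustrative VMI (the apiVersion matches KubeVirt releases of that period; the registry address is a made-up example):

```yaml
apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachineInstance
metadata:
  name: vmi-demo              # pod: virt-launcher-vmi-demo-<suffix>, domain: default_vmi-demo
  namespace: default
spec:
  domain:
    resources:
      requests:
        cpu: "2"
        memory: 4Gi
    devices:
      disks:
      - name: rootdisk        # exposed to the guest as a virtio disk
        disk:
          bus: virtio
  volumes:
  - name: rootdisk
    containerDisk:            # VM image packaged in a container image (see section 5.1)
      image: image-center.example.com/vm-images/centos7:latest
```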

3.2 Introduction to components

All KubeVirt component services are deployed on k8s. virt-api and virt-controller are Deployments and can run multiple replicas for high availability: virt-api is stateless and can be scaled out arbitrarily, while virt-controller elects one instance to provide service. virt-handler is deployed as a DaemonSet, so one virt-handler runs on every virtual machine node. Each virtual machine has its own virt-launcher service: whenever a virtual machine is created, a corresponding virt-launcher pod is created for it.

virt-api:
1) The KubeVirt API service. KubeVirt works in CRD mode, and virt-api handles the custom API requests, such as the synchronous commands virtctl vnc/pause/unpause/stop/start issued through the virtctl CLI.

virt-controller:
1) Communicates with the k8s api-server, watches VMI resource creation and deletion events, and triggers the corresponding actions
2) Creates a virt-launcher pod according to the VMI definition; the virtual machine runs inside that pod
3) Monitors the pod status and updates the VMI status accordingly

virt-handler:
1) Runs on each node alongside kubelet, updates its heartbeat periodically, and maintains the "kubevirt.io/schedulable" node label
2) Watches the k8s apiserver; when it finds a VMI whose nodeName matches its own node, it takes over lifecycle management of that virtual machine

virt-launcher:
1) Runs as a pod
2) Generates the virtual machine (domain) template from the VMI definition and creates the virtual machine through the libvirt API
3) Each virtual machine corresponds to an independent libvirtd
4) Communicates with libvirt to provide virtual machine lifecycle management

4. KubeVirt architecture transformation

4.1 Native architecture


In the native architecture, the management plane is coupled with the data plane: the virtual machine runs inside the virt-launcher pod, so whenever the virt-launcher container exits for some reason (a Docker or physical machine problem, virt-launcher itself crashing or being upgraded, etc.), the virtual machine exits with it. This affects users and increases the stability risk of virtual machines, so we transformed the architecture on top of the original design.

Modification points:
1) Move the data-plane kvm and libvirtd processes out of the management-plane virt-launcher container, so that one libvirtd process on each physical machine manages all virtual machines on that machine.
2) Add a new virt-start-hook component to integrate with network and storage components, rewrite paths in the domain XML, and so on.
3) Rebuild the way virtual machine images are produced and distributed, using OCS object storage to distribute images quickly.

Besides separating the management plane from the data plane, we also did a great deal of stability hardening. For example, we ensured that the failure, crash, or abnormal behavior of any KubeVirt component, at any time and under any circumstances, does not affect normally running virtual machines, and required tests to cover these component failure scenarios. We also met production-level requirements such as resuming normal virtual machine lifecycle management after a physical machine restart, which further guarantees the stability of the whole virtual machine management system.

4.2 Architecture after transformation

4.3 Process of creating a virtual machine after architecture transformation

1) The user creates a VMI CRD object: kubectl create -f vmi.yaml.
2) virt-controller watches the new VMI object and creates a corresponding virt-launcher pod for it.
3) After the virt-launcher pod is created, the k8s scheduler kube-scheduler schedules it to an eligible KubeVirt node.
4) virt-controller then writes the virt-launcher pod's nodeName back to the VMI object.
5) The virt-handler on that node watches that the VMI has been scheduled to its node, mounts the virtual machine's base image to the specified location, and calls virt-launcher's syncVMI interface to create the domain.
6) On receiving the creation request, virt-launcher converts the VMI object into a domain object and calls virt-start-hook, which creates the virtual machine's qcow2 incremental image disk based on the backingFile, rewrites the relevant paths in the domain XML into paths on the physical machine, requests the network, finishes configuring the XML, and returns the final XML to virt-launcher.
7) After virt-start-hook returns, virt-launcher calls libvirtd on the physical machine to define the domain XML and create the domain.

4.4 Process of deleting virtual machines after architecture transformation

1) The user deletes the VMI: kubectl delete -f vmi.yaml.
2) virt-handler watches the VMI update event; since the VMI's deletionTimestamp is now set, it calls virt-launcher's shutdownDomain, and virt-launcher calls virt-start-hook to release the network and then calls libvirtd to shut the domain down.
3) virt-launcher watches the domain-shutdown event and reports it to virt-handler. Based on the VMI state and the domain having been shut down, virt-handler calls virt-launcher's deleteDomain, and virt-launcher calls virt-start-hook to delete the network and then calls libvirtd to undefine the domain.
4) virt-launcher watches the domain-undefined event and reports it to virt-handler. Based on the VMI state and the domain having been deleted, virt-handler updates the VMI with a domain-deleted condition and then cleans up the domain's leftover files and paths.
5) virt-controller watches that the VMI's deletionTimestamp is set and its DomainDeleted condition is True, deletes the virt-launcher pod, and once the pod is gone removes the VMI finalizer, after which the VMI itself is deleted.

5. Storage solution

5.1 Native image storage solution

In native KubeVirt, the virtual machine's original image file is ADDed to the /disk path of a Docker base image and pushed to the image registry, to be pulled when a virtual machine is created.

Creating a virtual machine creates a VMI CRD object, which records the name of the virtual machine image to use. After the VMI is created, virt-controller creates the corresponding virt-launcher pod, which contains two containers: compute, which runs the virt-launcher process, and container-disk, which is responsible for holding the virtual machine image. The image name of the container-disk container is the virtual machine image name recorded in the VMI. Once the virt-launcher pod is created, kubelet pulls the container-disk image and starts the container-disk container, which then keeps listening on the disk_0.sock file under its --copy-path; that sock file is mapped via a hostPath volume to /var/run/kubevirt/container-disk/vmiUUID/ on the physical machine.

The virt-handler pod uses hostPID, so the physical machine's pids and mount information are visible inside the virt-handler container. When creating a virtual machine, virt-handler finds the pid of the container-disk process through the VMI's disk_0.sock file (call it Cpid), reads /proc/Cpid/mountInfo to obtain the device number of the container-disk container's root filesystem, then combines that device number with the physical machine's mount information (/proc/1/mountInfo) to locate the container-disk root filesystem on the physical machine. Appending the image path /disk/xxx.qcow2 gives the actual location, sourceFile, of the virtual machine's original image on the physical machine. virt-handler then mounts sourceFile onto targetFile, which is later used as the backingFile when creating the virtual machine.

5.2 Local Disk Storage

In native KubeVirt, the incremental image file xxx.qcow2 created on top of the base image backingFile can only be placed on an emptyDir, whereas the data disks of our containers generally use LVM. Keeping these two usage patterns side by side in a mixed container/virtual machine deployment would make unified planning and scheduling of physical machine disks difficult. We therefore extended the native behavior to also support storing the virtual machine's incremental image file on an LVM volume requested by the virt-launcher container, keeping disk usage consistent between virtual machines and containers. In addition, we support creating a separate empty qcow2 disk for a virtual machine and attaching it as a data disk, likewise stored on another LVM volume requested by the virt-launcher container; a configuration sketch follows.
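
Native KubeVirt expresses such an extra empty qcow2 data disk as an emptyDisk volume; the fragment below is a minimal sketch (disk name and capacity are illustrative). In our modified version the resulting qcow2 files live on the LVM volume requested by the virt-launcher container rather than on an emptyDir:

```yaml
# VMI spec fragment: attach an empty qcow2 disk as a data disk
spec:
  domain:
    devices:
      disks:
      - name: datadisk
        disk:
          bus: virtio
  volumes:
  - name: datadisk
    emptyDisk:
      capacity: 20Gi    # KubeVirt creates an empty qcow2 of this capacity
```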

5.3 Cloud Disk Storage

We have connected both the system disk and the data disks of virtual machines to cloud storage, which is convenient for users in migration and some other scenarios.

5.3.1 System disk access to cloud disk

To put the system disk on cloud storage, we first upload the virtual machine base image into a PVC under the basic namespace and create a VolumeSnapshot from that PVC. When creating a virtual machine in some namespace, the base image's VolumeSnapshot is copied from the basic namespace into that namespace, and a new PVC is created from the copied snapshot for the virtual machine to use. We built a tool that uploads base images into PVCs under the basic namespace and snapshots them, for unified management; and because preparing the system disk PVC and mounting it into the VMI is a series of operations, we automate it uniformly through a newly defined CRD and its controller. The sketch below shows the two k8s objects involved.
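
A minimal sketch of the snapshot and the PVC created from it, assuming a CSI driver with snapshot support (names, namespaces, and storage/snapshot classes are illustrative):

```yaml
# Snapshot of the base-image PVC under the basic namespace
apiVersion: snapshot.storage.k8s.io/v1beta1
kind: VolumeSnapshot
metadata:
  name: centos7-base-snap
  namespace: basic
spec:
  volumeSnapshotClassName: csi-snapclass
  source:
    persistentVolumeClaimName: centos7-base-image
---
# System-disk PVC in the user's namespace, created from the snapshot
# copied into that namespace (a PVC can only reference a snapshot in
# its own namespace, hence the copy step)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vm-demo-rootdisk
  namespace: user-ns
spec:
  storageClassName: csi-storageclass
  dataSource:
    name: centos7-base-snap-copy
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 40Gi
```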

5.3.2 Data disk access to cloud disk

To put a data disk on cloud storage, we first create a PVC in the namespace where the virtual machine lives and reference that PVC in the VMI's yaml. When virt-controller creates the VMI's virt-launcher pod, it configures the PVC volume into the pod according to the VMI; the storage component then mounts a directory carrying the PVC information into the pod, and virt-start-hook uses the information in that PVC directory inside the virt-launcher pod to configure the cloud disk into the domain XML for the virtual machine to use. A minimal fragment follows.
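
A minimal VMI fragment referencing a pre-created PVC as a data disk (the claim name is illustrative); this is the native persistentVolumeClaim volume type, which our virt-start-hook then wires into the domain XML:

```yaml
# VMI spec fragment: cloud data disk backed by a PVC
spec:
  domain:
    devices:
      disks:
      - name: clouddisk
        disk:
          bus: virtio
  volumes:
  - name: clouddisk
    persistentVolumeClaim:
      claimName: vm-demo-datadisk   # PVC created beforehand in the VMI's namespace
```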

6. Extended functions

6.1 Support for virtual machine stop/start/reboot

Native KubeVirt provides some synchronous interfaces, such as pause and unpause, which suspend and resume a virtual machine respectively. The native stop and start operate on the VM CRD and destroy and rebuild the virtual machine, which does not meet our needs. Furthermore, because the native architecture does not support shutting down and starting a virtual machine in place, it offers no direct stop/start/reboot interfaces for a VMI (stop here corresponding to shutdown), yet our users need them. Since the transformed KubeVirt does support shutting down and starting virtual machines, we defined and developed stop/start/reboot interfaces for VMIs alongside pause/unpause, and added intermediate states such as stopping, starting, and rebooting so users can observe progress.

6.2 Support for static scale-up/scale-down of virtual machine CPU/memory/local disk

Static scaling of CPU/memory/local disk is likewise provided as a synchronous interface that resizes the virtual machine while it is stopped. When scaling up, before the virtual machine's XML configuration is finally modified, the virt-launcher pod's resources must be dynamically expanded first, in order to verify that the node hosting the virtual machine still has enough resources; if the node lacks resources, the scale-up request must be rejected and the related changes to the VMI, pod, and other configuration rolled back. Dynamically changing a pod's resource configuration is not supported by native Kubernetes; that is another capability we provide in our internal k8s.

6.3 Support for virtual machine CPU pinning and huge page memory

CPU pinning is implemented mainly on top of kubelet's cpuset feature, which requires the kubelet configuration --cpu-manager-policy=static to enable container core binding. The flow is roughly as follows: the VMI carries the CPU pinning settings such as dedicatedCpuPlacement: "true", which yields a Guaranteed-QoS virt-launcher pod; that pod is scheduled to a node whose kubelet has core binding enabled, and the kubelet allocates dedicated cpu cores to it; the virt-launcher process then checks which cores its own container owns and writes those cores into the virtual machine XML. By letting kubelet manage cpu in this way, we unify how cpu quota and cpuset allocations are managed for virtual machines and containers. Virtual machine huge page memory is likewise combined with k8s resource management: the existing huge page resources in k8s are requested by the pod and then handed to the virtual machine. A minimal fragment follows.
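
A minimal VMI fragment enabling CPU pinning and 2Mi huge pages (values are illustrative); requesting dedicated cpus makes the virt-launcher pod Guaranteed, so kubelet's static cpu manager assigns it exclusive cores:

```yaml
# VMI spec fragment: pinned CPUs plus huge page memory
spec:
  domain:
    cpu:
      cores: 4
      dedicatedCpuPlacement: true   # requires kubelet --cpu-manager-policy=static
    memory:
      hugepages:
        pageSize: 2Mi               # served from the node's hugepages-2Mi resource
    resources:
      requests:
        memory: 4Gi
```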

6.4 Other functions

Besides the extended functions described above, we have also implemented, per user requirements, static and dynamic attach/detach of cloud disks for virtual machines, password reset, viewing the virtual machine XML, read-only restrictions on cloud disks, GPU passthrough, physical disk passthrough, virtio-net multi-queue, IP display optimization, and more.

Summary

At present we provide virtual machine and container services side by side in multiple clusters, realizing hybrid cluster management. Virtual machines produced with this solution already serve many businesses in our private cloud, with strong guarantees of stability and performance. The main task of the next step is mixed deployment of containers and virtual machines on the same node, so that they are not only scheduled uniformly on the control plane but also managed together on the data plane.

Beyond the work described in this article, we have also implemented virtual machine snapshots, image production and distribution, static migration, and other solutions. Our team will continue to publish and share in the future.

Author profile
Weiwei, OPPO Senior Backend Engineer
Mainly works on scheduling, containerization, hybrid cloud, and related areas.
