Hybrid management practice for virtual machines and containers

1. Background

Containers have become the mainstream choice for enterprises moving to the cloud. After in-depth development and promotion starting in the second half of 2019, OPPO largely achieved large-scale use of Kubernetes-based containers and full-service cloudification in 2020. Containers offer agility and high performance, but because they share the host kernel and provide incomplete isolation, they fall short when users need to modify many custom kernel parameters, run a newer Linux version on an older host kernel, or simply require stronger isolation. For historical reasons, some businesses within the company still require strongly isolated virtual machines, so providing a virtual machine service became imperative.

Our investigation found that, among companies that have built container platforms, most manage virtual machines by maintaining a separate OpenStack or similar system. However, OpenStack is large and heavy, carries high maintenance costs, and cannot manage the underlying resources uniformly, which makes hybrid scheduling inconvenient. We therefore made unified control-plane management, that is, unified scheduling and management of both containers and virtual machines, the main direction of our technology selection.

2. Solution selection: KubeVirt or Virtlet

The two best-known projects for managing virtual machines alongside containers on a Kubernetes platform are KubeVirt and Virtlet.
KubeVirt is a Red Hat-initiated open source project that runs virtual machines inside containers. Deployed as a Kubernetes add-on, it introduces the resource type VirtualMachineInstance (VMI) via a Kubernetes CRD, uses the container image registry to deliver virtual machine images, and provides VM lifecycle management.

Virtlet is an implementation of the Kubernetes Container Runtime Interface (CRI) that runs virtual-machine-based Pods on Kubernetes. (The CRI is what enables Kubernetes to run non-Docker runtimes, such as rkt.)
The picture below shows part of the comparison between KubeVirt and Virtlet that we made during selection in early 2020. As it shows, Virtlet describes both containers and virtual machines with the same resource type, Pod, so a natively managed virtual machine can only be in the Running or Deleted state; VM-specific states such as pause/unpause and start/stop are not supported, which clearly cannot meet user needs. Supporting those states would require deep customization of the kubelet, coupling virtual machine management too tightly with container management. In addition, the Virtlet community was less active than the KubeVirt community at the time, so the solution we finally chose was KubeVirt.

3. KubeVirt introduction

3.1 Correspondence between the VMI CRD, Pod, and Domain

3.2 Introduction to components

KubeVirt's components are deployed on Kubernetes. virt-api and virt-controller are Deployments and can run with multiple replicas for high availability: virt-api is stateless and can be scaled arbitrarily, while virt-controller elects one instance as leader to provide service. virt-handler is deployed as a DaemonSet, with one instance running on each virtual machine node. Each virtual machine has its own virt-launcher service: whenever a virtual machine is created, a corresponding virt-launcher pod is created for it.

virt-api:
1) The KubeVirt API service. KubeVirt works in CRD mode; virt-api handles the custom API requests, such as the synchronous commands virtctl vnc/pause/unpause/stop/start issued through the virtctl CLI.

virt-controller:
1) Communicates with the k8s api-server to watch VMI create and delete events and trigger the corresponding operations
2) Creates a virt-launcher pod according to the VMI definition; the virtual machine runs inside that pod
3) Monitors the pod status and updates the VMI status accordingly

virt-handler:
1) Runs on every node alongside the kubelet, updates its heartbeat regularly, and maintains the "kubevirt.io/schedulable" label
2) Watches the k8s apiserver; when it finds a VMI whose nodeName matches its own node, it takes over lifecycle management of that virtual machine

virt-launcher:
1) Runs as a pod
2) Generates the virtual machine template from the VMI definition and creates the virtual machine through the libvirt API
3) Each virtual machine corresponds to an independent libvirtd
4) Communicates with libvirt to provide virtual machine lifecycle management
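Tying the components together: a VMI is just another Kubernetes resource that virt-controller, virt-handler, and virt-launcher cooperate to realize. A minimal manifest might look like the following sketch (the image name and sizes are illustrative, not from our environment, and the apiVersion may differ across KubeVirt versions):

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: vmi-demo
spec:
  domain:
    cpu:
      cores: 2
    resources:
      requests:
        memory: 4Gi
    devices:
      disks:
        - name: rootdisk
          disk:
            bus: virtio
  volumes:
    - name: rootdisk
      containerDisk:
        image: registry.example.com/vm-images/centos7:latest   # hypothetical image
```

Creating this object with kubectl is what triggers the whole chain: virt-controller creates the virt-launcher pod, and virt-handler on the scheduled node drives libvirt to start the domain.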

4. KubeVirt architecture transformation

4.1 Native architecture


In the native architecture, the management plane is coupled with the data plane: the virtual machine runs inside the virt-launcher pod. When the virt-launcher container exits for whatever reason (a Docker or physical machine problem, a crash of virt-launcher itself, an upgrade, and so on), the virtual machine exits with it, which affects users and increases the stability risk of the virtual machine. We therefore reworked the original architecture.

Modification points:
1) Move the data-plane kvm and libvirtd processes out of the management-plane virt-launcher container; a single libvirtd process on the physical machine manages all virtual machines on that machine.
2) Add a new virt-start-hook component to handle integration with network components and storage components, xml path rewriting, and so on.
3) Rebuild the way virtual machine images are produced and distributed, using OCS object storage to achieve rapid image distribution.

Besides separating the management plane from the data plane, we did a lot of stability-hardening work. For example, we ensured that the failure, crash, or abnormal behavior of any KubeVirt component, at any time and under any circumstances, does not affect running virtual machines, and we required testing to cover these component-failure scenarios. We also met production-level requirements such as restoring normal lifecycle management of virtual machines after a physical machine reboot, further guaranteeing the stability of the whole virtual machine management system.

4.2 Architecture after transformation

4.3 Process of creating a virtual machine after architecture transformation

1) The user creates a VMI CRD: kubectl create -f vmi.yaml.
2) virt-controller watches the new vmi object and creates a corresponding virt-launcher pod for it.
3) After the virt-launcher pod is created, the k8s scheduler kube-scheduler schedules it to an eligible KubeVirt node.
4) virt-controller then writes the virt-launcher pod's nodeName back to the vmi object.
5) When the KubeVirt node watches that the vmi has been scheduled to itself, virt-handler mounts the virtual machine's base image to the specified location and calls virt-launcher's syncVMI interface to create the domain.
6) On receiving the creation request, virt-launcher converts the vmi object into a domain object and calls virt-start-hook, which creates the virtual machine's incremental qcow2 disk based on the backingFile, rewrites the relevant paths in the domain xml into paths on the physical machine, requests the network, completes the xml configuration, and returns the final xml to virt-launcher.
7) When virt-start-hook returns, virt-launcher calls libvirtd on the physical machine to define the domain xml and create the domain.
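The incremental disk in step 6 is a standard qcow2 copy-on-write overlay on top of the read-only base image. The following sketch builds the qemu-img command line involved (the qemu-img options are standard; the file paths and the helper name are purely illustrative, not our actual code):

```python
# Sketch of the overlay-disk creation performed by virt-start-hook in step 6.

def build_overlay_cmd(backing_file, overlay_path):
    """Build the qemu-img command that creates a qcow2 copy-on-write
    overlay on top of a read-only base image (the backingFile)."""
    return [
        "qemu-img", "create",
        "-f", "qcow2",        # format of the new incremental disk
        "-b", backing_file,   # shared read-only base image
        "-F", "qcow2",        # format of the backing file
        overlay_path,
    ]

# Hypothetical paths, for illustration only:
cmd = build_overlay_cmd("/var/lib/kubevirt/base/centos7.qcow2",
                        "/var/lib/kubevirt/vms/vm-001/disk.qcow2")
print(" ".join(cmd))
```

Because the overlay only records blocks that diverge from the base image, many virtual machines on one node can share a single base image file, which is what makes the fast image distribution in section 5 worthwhile.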

4.4 Process of deleting virtual machines after architecture transformation

1) The user executes the delete command: kubectl delete -f vmi.yaml.
2) virt-handler watches the vmi update event; since the vmi's deletionTimestamp is now set, it calls virt-launcher's shutdownDomain, and virt-launcher calls virt-start-hook to release the network and then calls libvirtd to shut the domain down.
3) virt-launcher watches the domain shutdown event and forwards it to virt-handler; based on the vmi state and the stopped domain, virt-handler calls virt-launcher's deleteDomain, and virt-launcher calls virt-start-hook to delete the network and then calls libvirtd's undefineDomain.
4) virt-launcher watches the domain undefine event and forwards it to virt-handler; based on the vmi state and the deleted domain, virt-handler updates the vmi, adds the DomainDeleted condition, and cleans up the domain's leftover files and paths.
5) virt-controller watches that the vmi's deletionTimestamp is set and its DomainDeleted condition is True, so it deletes the virt-launcher pod; once the pod is gone, it removes the vmi finalizer and the vmi is deleted automatically.
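The ordering above matters: the domain must be shut down before it is undefined, and undefined before files are cleaned up. A toy model of that progression (hypothetical names, not the real KubeVirt code) is:

```python
# Toy state machine for the teardown steps 2-4 above: each observed
# libvirt domain state determines the next action virt-handler drives.

def next_teardown_action(domain_state):
    """Map the domain state seen by virt-handler to the next teardown step."""
    transitions = {
        "running":   "shutdownDomain",   # step 2: release network, then shut down
        "shutoff":   "undefineDomain",   # step 3: delete network, undefine domain
        "undefined": "cleanupFiles",     # step 4: add DomainDeleted, remove leftovers
    }
    return transitions.get(domain_state, "noop")

for state in ("running", "shutoff", "undefined"):
    print(state, "->", next_teardown_action(state))
```

Driving teardown from observed state rather than from a fixed script is what lets the flow resume correctly if a component restarts midway.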

5. Storage solution

5.1 Native image storage solution

In native KubeVirt, the virtual machine's original image file is ADDed to the /disk path of a Docker base image and pushed to the image registry, to be used when creating a virtual machine.

Creating a virtual machine creates a vmi CRD that records the name of the virtual machine image to use. After the vmi is created, virt-controller creates the corresponding virt-launcher pod for it. The virt-launcher pod contains two containers: compute, which runs the virt-launcher process, and container-disk, which is responsible for holding the virtual machine image; the container-disk container's imageName is the virtual machine image name recorded in the vmi. After the virt-launcher pod is created, the kubelet pulls the container-disk image and starts the container-disk container, which then keeps listening on the disk_0.sock file under --copy-path; the sock file is mapped via hostPath to /var/run/kubevirt/container-disk/vmiUUID/ on the physical machine.

The virt-handler pod uses hostPID, so the physical machine's pid and mount information are visible inside the virt-handler container. When creating a virtual machine, virt-handler finds the pid of the container-disk process (call it Cpid) from the vmi's disk_0.sock file, then finds the device number of the container-disk container's root filesystem from /proc/Cpid/mountinfo, and matches that device number against the physical machine's mount information (/proc/1/mountinfo) to locate the container-disk root filesystem on the physical machine. Appending the image file path /disk/xxx.qcow2 yields the actual location, sourceFile, of the original virtual machine image on the physical machine; sourceFile is then mounted to targetFile for later use as the backingFile when creating the virtual machine.
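A simplified sketch of that lookup, using the standard /proc/&lt;pid&gt;/mountinfo format (the third field is the major:minor device number, the fifth the mount point). The sample lines below are synthetic, and the real virt-handler logic additionally has to handle overlay filesystems and optional mountinfo fields:

```python
# Resolve where a container's root filesystem lives on the host by
# matching device numbers between two mountinfo views.

def device_of(mountinfo, mount_point):
    """Device number (major:minor) of the filesystem at mount_point."""
    for line in mountinfo.splitlines():
        fields = line.split()
        if fields[4] == mount_point:
            return fields[2]
    raise LookupError(mount_point)

def host_path_of(host_mountinfo, device):
    """Host mount point backed by the given device number."""
    for line in host_mountinfo.splitlines():
        fields = line.split()
        if fields[2] == device:
            return fields[4]
    raise LookupError(device)

# Synthetic /proc/Cpid/mountinfo as seen inside the container-disk container:
container_mi = "41 25 253:3 / / rw - ext4 /dev/mapper/vg-disk rw"
# Synthetic /proc/1/mountinfo as seen on the physical machine:
host_mi = "77 20 253:3 / /var/lib/docker/rootfs/abc rw - ext4 /dev/mapper/vg-disk rw"

dev = device_of(container_mi, "/")
source_file = host_path_of(host_mi, dev) + "/disk/xxx.qcow2"
print(source_file)  # -> /var/lib/docker/rootfs/abc/disk/xxx.qcow2
```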

5.2 Local Disk Storage

In native KubeVirt, the incremental image file xxx.qcow2 created on top of the base image backingFile can only be placed in an emptyDir, while the data disks of our containers generally use LVM. Keeping both usage patterns would make unified planning and scheduling of physical machine disks harder in mixed virtual-machine and container deployments. We therefore extended the native behavior to also support storing the virtual machine's incremental image file on an LVM volume requested by the virt-launcher container, keeping disk usage consistent between virtual machines and containers. We also support creating a separate empty qcow2 disk for the virtual machine and mounting it as a data disk, likewise stored on another LVM volume requested by the virt-launcher container.

5.3 Cloud Disk Storage

We have connected both the system disk and the data disk of the virtual machine to cloud storage, which is convenient for users in migration and some other scenarios.

5.3.1 Connecting the system disk to cloud disks

To put the system disk on cloud storage, we first upload the virtual machine's base image to a PVC in the basic namespace and create a volume snapshot from that PVC. When creating a virtual machine in some namespace, the base image's volume snapshot is copied from the basic namespace into that namespace, and a new PVC is created from the copied snapshot for the virtual machine to use. For uploading base images to PVCs in the basic namespace and taking snapshots, we built an image-upload tool for unified management; and since creating the system-disk PVC and mounting it into the vmi is a series of operations, we automate it uniformly through a newly defined CRD and its controller.
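The snapshot-then-clone flow can be expressed with standard CSI snapshot resources. A sketch of the two objects involved (all names, classes, and sizes below are hypothetical):

```yaml
# Snapshot of the base-image PVC, copied into the user's namespace.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: centos7-base-snap
  namespace: user-ns
spec:
  volumeSnapshotClassName: csi-snapclass
  source:
    persistentVolumeClaimName: centos7-base-image
---
# System-disk PVC for one virtual machine, cloned from the snapshot.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vm-001-rootdisk
  namespace: user-ns
spec:
  dataSource:
    name: centos7-base-snap
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 40Gi
```

Our custom CRD controller mentioned above essentially automates creating these objects and wiring the resulting PVC into the vmi.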

5.3.2 Connecting the data disk to cloud disks

To put a data disk on cloud storage, we first create a PVC in the virtual machine's namespace and then configure that PVC in the vmi's yaml. When virt-controller creates the virt-launcher pod for the vmi, it configures the PVC volume into the pod according to the vmi's PVC configuration; the storage component then mounts a directory containing the PVC information into the pod, and virt-start-hook uses the information in that PVC directory inside the virt-launcher pod to configure the cloud disk into the domain xml for the virtual machine to use.
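In the vmi yaml this amounts to referencing the PVC as a volume. A fragment of the relevant fields might look like this (claim and disk names hypothetical):

```yaml
spec:
  domain:
    devices:
      disks:
        - name: datadisk
          disk:
            bus: virtio
  volumes:
    - name: datadisk
      persistentVolumeClaim:
        claimName: vm-001-datadisk
```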

6. Extended functions

6.1 Supporting virtual machine stop/start/reboot

Native KubeVirt provides some synchronous interfaces, such as pause and unpause, which suspend and wake up the virtual machine respectively. The native stop and start operate on the vm CRD and destroy and rebuild the virtual machine, which cannot meet our needs. Moreover, because the native architecture does not support shutting down and restarting a virtual machine in place, it offers no direct stop, start, or reboot interfaces (stop corresponding to shutdown), while our users do need them. Since the reworked architecture does support shutting down and starting virtual machines, we defined and developed stop/start/reboot interfaces for the vmi alongside pause/unpause, and added intermediate states such as stopping, starting, and rebooting so that users can observe and use them.

6.2 Supporting static expansion and shrinking of virtual machine CPU/memory/local disk

Expanding or shrinking CPU/memory/local disk while the virtual machine is stopped is also provided as a synchronous interface. Before finally modifying the virtual machine's xml configuration, this feature must dynamically resize the virt-launcher pod's resources in order to verify that the node hosting the virtual machine has enough resources for the expansion. If the node lacks resources, the expansion request must be intercepted, and the related changes to the vmi, pod, and other configurations rolled back. Native Kubernetes does not support resizing a pod's resources in place; that is another capability we provide in our internal k8s.

6.3 Supporting virtual machine CPU pinning and huge page memory

CPU pinning is implemented mainly by combining it with the kubelet's cpuset feature; the kubelet must be configured with --cpu-manager-policy=static to enable container core pinning. The process is roughly as follows: the vmi is configured with CPU pinning options such as dedicatedCpuPlacement: "true", which produces a Guaranteed-QoS virt-launcher pod; the pod is scheduled to a node whose kubelet has core pinning enabled, and that kubelet allocates dedicated cpu cores to it; the virt-launcher process then inspects which cores its own container owns and writes those cores into the virtual machine xml. In this way, by letting the kubelet manage cpu allocation, we unify the cpuquota and cpuset allocation of virtual machines and containers. Virtual machine huge page memory likewise builds on k8s resource management: the pod claims existing huge page resources in k8s, which are then assigned to the virtual machine.
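In KubeVirt's API, both features are plain fields on the vmi. A fragment with illustrative sizes (not our production values):

```yaml
spec:
  domain:
    cpu:
      cores: 4
      dedicatedCpuPlacement: true   # request pinned cores from the kubelet
    memory:
      hugepages:
        pageSize: 1Gi               # back guest memory with 1Gi huge pages
    resources:
      requests:
        memory: 8Gi
```

With dedicatedCpuPlacement set, virt-controller emits a Guaranteed-QoS virt-launcher pod, so the static CPU manager on the node assigns it exclusive cores.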

6.4 Other functions

Beyond the extended features introduced above, we have also implemented static and dynamic attaching and detaching of cloud disks, password reset, viewing the virtual machine xml, cloud disk read-only restriction, GPU passthrough, physical machine disk passthrough, virtio-net multi-queue support, IP display optimization, and other user requirements.

Summary

At present, we provide virtual machine and container services simultaneously in multiple clusters, realizing hybrid cluster management. Virtual machines produced with this solution already serve many businesses in our private cloud, with strong guarantees of stability and performance. The next step is to implement hybrid deployment of containers and virtual machines on the same nodes, so that they are not only scheduled uniformly on the control plane but also managed together on the data plane.

Beyond the work described in this article, we have also implemented virtual machine snapshots, image production and distribution, static migration, and other solutions. Our team will continue to post and share in the future.

About the author
Weiwei, OPPO Senior Backend Engineer
Mainly engaged in scheduling, containerization, hybrid cloud, and related work.
