This article is based on the author's talk at the GOPS Global Operations Conference, organized by the Efficient Ops community.

This talk covers two things. First, a brief analysis of the impact and challenges that the rapid development and adoption of cloud-native technology brings when building on a traditional O&M system.

Second, what we have done in the face of those challenges. I will share some of our internal practice, which I hope can serve as a reference.

For O&M, the core concerns are efficiency, stability, and cost. Whether it is stability or O&M efficiency, it ultimately comes down to money: higher efficiency reduces labor costs, and good stability assurance and monitoring and alerting mean fewer failures and less money lost. Security is of course also very important for O&M, but it is out of scope today.

Before getting started, let me briefly introduce the technical landscape at NetEase. The technology stacks of NetEase's business units are often very different — games, music, Yanxuan (e-commerce), and so on can basically be regarded as completely different industries, each with its own characteristics and background. That makes building a single unified platform unrealistic, so we focus more on finer-grained capabilities.

1. New challenges for O&M

New technology stack

NetEase Shufan has been working with containers since Kubernetes 1.0 was released, and started using them at scale around 2018. The first challenge a business faces after adopting containers is the new technology stack: how should the O&M system be planned? Technology choices such as the network and storage solutions, and cluster-scale capacity planning, are often made short-sightedly at the beginning and cause a lot of pain later.

In our containerization scenario, we place no restrictions on how Kubernetes is used; business teams can call the Kubernetes API freely. The diversity of usage patterns makes O&M assurance harder, and many problems caused by business misuse still end up being handled by O&M.

The early infrastructure (Docker, the kernel, Kubernetes itself) produced a constant stream of bugs. For example, around 2018 we hit many of the classic Docker bugs. The newer versions of the last couple of years have far fewer problems.

The mainstream operating system distribution in our department is Debian, which may differ from most of you here who use CentOS. The advantage of Debian is that the kernel and software versions are relatively new; the flip side is that all the kernel problems we encounter have to be analyzed and fixed by ourselves.

In addition, at the beginning of containerization the stack was new, so it was hard to recruit people for the matching positions and labor costs were relatively high.

Technical inertia

Technical inertia is easy to understand: many companies have their own traditional O&M platforms. Moving from a traditional O&M platform to release management with Kubernetes, we found large gaps in mindset, workflow, and implementation, and bridging them is painful.

This leads developers to feel that virtual machines worked just fine, so why bother with containers — and now whenever there is a problem, blaming the container is always the safe answer.

In a word, the traditional O&M and development approach is not ready for cloud native.

Knowledge base

Regarding the knowledge base: during cloud-native adoption, much of the existing knowledge base is still incomplete.

Because our team has handled a large number of real-world problems and accumulated rich experience, we have also produced a lot of documentation.

But we found that when a business team actually hits a problem, it does not read the documentation at all and simply throws the problem at us. The reason may be a lack of cloud-native background and understanding, or simply a matter of attitude. In general, the cost of passing on the knowledge base is high, and for some problems the resolution process is extremely inefficient.

We believe that in the process of promoting cloud-native containerization, this is one of the areas where O&M currently faces the greatest challenges.

Organization and personnel structure

In a traditional development scenario, the top layer is development and testing; in the middle are the business architecture team, application O&M, system O&M, and platform development; below is the infrastructure of the IDC. The whole structure is clear.

But once a company goes cloud-native with containers, the middle layer becomes a mess and everyone's work overlaps. If platform developers do not know how the business side is using containers, the business may come up with all kinds of strange usage that causes problems, and in the end O&M has to pick up the pieces.

The root cause is that each engineer's role has changed. For example, SAs used to manage only machines; now they have to deal with container networking and storage and need to understand how Kubernetes works. As a result, many people have to learn Kubernetes.

Capacity management

On capacity, there are several aspects: the business side applies for resources unreasonably, or it cannot predict its own load.

Another critical capacity issue when operating Kubernetes is the resource consumption of the control-plane components themselves, whose capacity assessment is often overlooked. Customers deploy a Kubernetes cluster, configure alerts, and then keep adding nodes — until one day things suddenly collapse. They come to O&M asking what happened, and it may turn out that the control plane has run out of capacity.

Here is a screenshot: when the Kubernetes apiserver restarted, its memory grew by more than 20% in a very short time, because a flood of requests comes in at the moment of restart and consumes a lot of memory. Normally you configure a threshold alert for this. If you do not handle this kind of problem once it is triggered, there may be an avalanche effect, and the apiserver cannot be brought back up until you increase its resource capacity.

Next, let's briefly review our practices in the three areas mentioned above: O&M efficiency, stability assurance, and cost.

2. Improving O&M efficiency

First, our clusters are centrally hosted. Of course, not all departments' clusters are under our control; we only manage the ones closely related to us. The whole permission system authenticates directly against the company's internal employee authentication system, and authorization is still done per person with RBAC. Second, because we spend a lot of manpower helping customers troubleshoot — different people from different departments come to us again and again and are unwilling to read the documentation — we automate the common troubleshooting procedures. Finally, monitoring data is not stored in an open-source system directly but in an internally developed TSDB, so the data can be consumed more easily.

Now let's talk about automated diagnosis and operations. The two previous speakers shared similar content; we also have a knowledge base and pipeline execution. Many companies build their pipelines as an internal platform tied to their other internal systems, which solves their own problem but is not very general — nobody else will use it precisely because it is your platform, and it is painful for other departments to adapt to it. We wanted a more general solution. In the Kubernetes scenario there is CRD support, so we abstract things like O&M diagnosis and performance inspection into CRDs.

We abstract an O&M action into an atomic Operation — for example, marking a node unschedulable, or checking whether the current state matches a known bug. Multiple Operations are orchestrated into an O&M pipeline, an OperationSet. For the diagnostic context, we have a Diagnosis abstraction.
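To make the abstraction concrete, here is a minimal sketch of what such CRD types might look like in Go. This is illustrative only — it is not the exact KubeDiag API, and the field names are assumptions.

```go
// Illustrative CRD type sketch (not the exact KubeDiag API): an atomic
// Operation, an OperationSet that chains operations into a pipeline, and a
// Diagnosis that carries the context of one diagnostic run.
package v1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// Operation describes a single atomic O&M action, e.g. cordoning a node or
// checking whether the current state matches a known bug.
type Operation struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              OperationSpec `json:"spec"`
}

type OperationSpec struct {
	// Processor points at the component that actually performs the action,
	// e.g. an HTTP endpoint served by an agent on the node.
	Processor ProcessorSpec `json:"processor"`
}

type ProcessorSpec struct {
	ExternalURL    string `json:"externalURL,omitempty"`
	TimeoutSeconds int32  `json:"timeoutSeconds,omitempty"`
}

// OperationSet orchestrates multiple Operations into an O&M pipeline.
type OperationSet struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              OperationSetSpec `json:"spec"`
}

type OperationSetSpec struct {
	// Operations are executed in order; a real implementation may allow a
	// graph (branching on success/failure) instead of a flat list.
	Operations []string `json:"operations"`
}

// Diagnosis is the context of one diagnostic run: which pipeline to run and
// against which node or pod.
type Diagnosis struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              DiagnosisSpec   `json:"spec"`
	Status            DiagnosisStatus `json:"status,omitempty"`
}

type DiagnosisSpec struct {
	OperationSet string `json:"operationSet"`
	NodeName     string `json:"nodeName,omitempty"`
	PodName      string `json:"podName,omitempty"`
}

type DiagnosisStatus struct {
	Phase string `json:"phase,omitempty"` // e.g. Pending, Running, Succeeded, Failed
}
```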

There are several ways to trigger the diagnostic pipeline. First, users can manually create a Diagnosis object themselves.

We also use Bubble (NetEase's internal IM) chatbots to implement ChatOps, triggering the related pipelines by chatting with the bot. We do not want the chatbot to do complex natural-language understanding; we keep it direct: you send it a relatively structured sentence and ask it to take a look at a problem. Because the company's authentication system is unified, the Bubble bot also goes through it, so it knows your permissions and prevents you from doing anything beyond them. With this ChatOps entry point you can trigger pipeline execution.

An even bigger trigger source is monitoring alerts. For example, for a business application, once the container's CPU or memory usage reaches a threshold, a stack dump can be triggered automatically, and the stack information and the dumped memory file are uploaded to object storage; a later step in the pipeline can then pull the dumped data out for analysis.
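As an illustration of alert-triggered diagnosis, here is a hedged sketch of a small webhook receiver that turns an Alertmanager-style notification into a Diagnosis object via the dynamic client. The group/version (`diagnosis.example.com/v1`), the `memory-dump` OperationSet name, and the payload fields are assumptions, not the actual internal implementation.

```go
// alertwebhook.go: sketch of turning a monitoring alert into a Diagnosis CR.
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"log"
	"net/http"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

// GVR of the (hypothetical) Diagnosis CRD.
var diagnosisGVR = schema.GroupVersionResource{
	Group: "diagnosis.example.com", Version: "v1", Resource: "diagnoses",
}

type alert struct {
	Labels map[string]string `json:"labels"` // expects "pod" and "namespace"
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	http.HandleFunc("/alert", func(w http.ResponseWriter, r *http.Request) {
		var a alert
		if err := json.NewDecoder(r.Body).Decode(&a); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		// One Diagnosis per firing alert: run the "memory-dump" pipeline
		// against the pod named in the alert labels.
		d := &unstructured.Unstructured{Object: map[string]interface{}{
			"apiVersion": "diagnosis.example.com/v1",
			"kind":       "Diagnosis",
			"metadata": map[string]interface{}{
				"generateName": "alert-",
				"namespace":    a.Labels["namespace"],
			},
			"spec": map[string]interface{}{
				"operationSet": "memory-dump",
				"podName":      a.Labels["pod"],
			},
		}}
		_, err := client.Resource(diagnosisGVR).Namespace(a.Labels["namespace"]).
			Create(context.TODO(), d, metav1.CreateOptions{})
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		fmt.Fprintln(w, "diagnosis created")
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```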

There are also cases like middleware, which often needs stability guarantees: if something happens to a middleware instance, what should be done? The same logic can be orchestrated, so other teams' operators can create their own Diagnosis objects and plug into this mechanism.

To put it simply, the whole thing is an application running on Kubernetes: the apiserver accepts the relevant CRDs, and an operator executes them.

For this part, we hope to eventually turn it into an internal platform and make it more general: trigger a pipeline from an event, perform O&M actions and diagnosis, and fully inherit the legacy scripts that are left over. See: Analysis of KubeDiag Framework Technology.

Because Kubernetes exposes a standard API, if you are also on Kubernetes, some of our experience may be directly useful to you; many things are common. For example, you may have hit the problematic memcg reclamation on kernels before 4.19, which leads to large-scale leaks, or the problem in early Docker versions where a large number of containers cannot be deleted.

We have a series of workarounds for all of these, and we can make them quite intelligent. For example, if monitoring reports that a Pod has been exiting for more than 15 minutes and still has not been killed, we may trigger a diagnosis to check whether it matches a known bug, and if so, automatically recover it by some means. This type of diagnosis is generic.
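The "stuck terminating for more than 15 minutes" check, for instance, can be expressed with a few lines of client-go. A hedged sketch — the known-bug matching and recovery steps are deliberately left as a log line:

```go
// stuckpods.go: list pods that have been terminating for more than 15 minutes,
// which in practice often points at a known container-runtime bug.
package main

import (
	"context"
	"log"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, p := range pods.Items {
		// DeletionTimestamp is set when the pod starts terminating.
		if p.DeletionTimestamp == nil {
			continue
		}
		if time.Since(p.DeletionTimestamp.Time) > 15*time.Minute {
			// A real pipeline would now check for known-bug signatures
			// (runtime version, dmesg patterns, ...) before any recovery action.
			log.Printf("pod %s/%s stuck terminating for %s",
				p.Namespace, p.Name, time.Since(p.DeletionTimestamp.Time).Round(time.Minute))
		}
	}
}
```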

In traditional scenarios, different companies have different ways for O&M staff to log in to machines, so there is no universal approach. In the Kubernetes scenario we can be universal, because Kubernetes RBAC handles permission control, and we run a DaemonSet that can operate on your processes and collect a lot of information for you.

There are also bigger headaches. Many AI and big-data workloads — mainly AI training — contain C/C++ code that can core dump. Core dumps bring several problems: local disk usage spikes, which affects other services on the same node. We use some techniques to achieve globally unified core dump collection that never touches the local disk. There are also things like packet captures and flame graphs; performance diagnosis and workarounds for well-known software bugs are very general, and these form the underlying capabilities.
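One simple way to collect core dumps without writing them to the local disk is to pipe them from the kernel straight into a handler. Below is a hedged sketch, assuming the node's kernel.core_pattern is set to something like `|/usr/local/bin/core-handler %e %p` and that `/etc/core-handler.conf` (a hypothetical config file) holds the upload base URL — not our actual collection agent.

```go
// core-handler.go: receives a core dump on stdin (kernel core_pattern pipe)
// and streams it to an HTTP endpoint instead of writing it to the local disk.
package main

import (
	"fmt"
	"log"
	"net/http"
	"os"
	"strings"
)

func main() {
	// %e and %p from core_pattern arrive as argv[1] and argv[2].
	exe, pid := "unknown", "0"
	if len(os.Args) > 2 {
		exe, pid = os.Args[1], os.Args[2]
	}

	// Hypothetical config file containing the upload base URL; the pipe handler
	// is spawned by the kernel with an empty environment, so env vars won't do.
	data, err := os.ReadFile("/etc/core-handler.conf")
	if err != nil {
		log.Fatalf("no upload target configured: %v", err)
	}
	target := strings.TrimSpace(string(data))

	url := fmt.Sprintf("%s/%s-%s.core", target, exe, pid)
	// Stream stdin directly into the PUT body: the core never touches the
	// node's local filesystem.
	req, err := http.NewRequest(http.MethodPut, url, os.Stdin)
	if err != nil {
		log.Fatal(err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Printf("uploaded core of %s (pid %s): %s", exe, pid, resp.Status)
}
```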

Going further up to the business level, there are also very general capabilities — for example, isolating a node that is in a half-dead state. We hope to make this more complete in the future, with more scenarios and more people participating.

The whole project is a framework plus a large set of rules, and it has been open-sourced. One short-term plan is to improve the interface, and then to export much of the experience we have accumulated as code, built in Go into the pipelines, to provide an out-of-the-box experience.

What effect does this have on efficiency? Our team is no longer overloaded, and nobody has to dig through documentation to figure out what is going on — problems are handled automatically, which removes that labor cost. At the same time, everything is codified: we no longer depend on some particularly skilled engineer whose knowledge walks out the door when he leaves, because the code stays behind.

3. Monitoring and alerting

On stability, we all know the triad of tracing, logging, and metrics. A few years ago tracing and logging were relatively uncommon; now the many problems of distributed systems have made distributed tracing necessary. But metrics have always been the gold standard, and today I will focus mainly on metrics.

For system metrics, the traditional approach is *stat-style data collected from procfs; now we want finer-grained data through eBPF. The main reason for collecting fine-grained data is to clarify responsibility. For example, when the business side sees jitter, it may suspect the infrastructure or the machine room, and it is hard for us to prove our innocence. With data such as per-connection TCP RTT, we can show whether it really is a system problem.
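To give a feel for the kind of RTT signal involved, here is a hedged user-space illustration (Linux only) that reads the kernel's smoothed RTT for a single connection via the TCP_INFO socket option. Our production pipeline collects this passively with eBPF; this sketch and its default target are only illustrative.

```go
// rttprobe.go: user-space illustration of a per-connection RTT reading.
package main

import (
	"fmt"
	"log"
	"net"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	target := "example.com:443" // illustrative default; pass your own as argv[1]
	if len(os.Args) > 1 {
		target = os.Args[1]
	}

	conn, err := net.Dial("tcp", target)
	if err != nil {
		log.Fatalf("dial %s: %v", target, err)
	}
	defer conn.Close()

	raw, err := conn.(*net.TCPConn).SyscallConn()
	if err != nil {
		log.Fatal(err)
	}

	// Read the kernel's smoothed RTT estimate for this socket via TCP_INFO.
	var info *unix.TCPInfo
	ctrlErr := raw.Control(func(fd uintptr) {
		info, err = unix.GetsockoptTCPInfo(int(fd), unix.IPPROTO_TCP, unix.TCP_INFO)
	})
	if ctrlErr != nil || err != nil {
		log.Fatalf("TCP_INFO: %v %v", ctrlErr, err)
	}

	// Rtt and Rttvar are reported in microseconds.
	fmt.Printf("%s srtt=%dus rttvar=%dus retrans=%d\n",
		target, info.Rtt, info.Rttvar, info.Total_retrans)
}
```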

On memory, the metrics are mostly about reclamation: memory cgroup reclaim, page cache reclaim, and so on — we collect many related alert metrics. For CPU scheduling, we mainly care about scheduling delay: how long a task waits between becoming runnable and actually starting to run.

There is also file I/O. Sometimes the disk jitters; business applications that write logs asynchronously are not affected, but fully synchronous ones are. Then we need to find out what caused it, so we also monitor things like VFS latency — metrics that have to come from the lower layers.

eBPF now also supports uprobes, which opens up another area. Traditionally, to monitor an application's database access — say a Java program talking to MySQL — you inject bytecode enhancement through a client-side agent to see whether MySQL calls are fast or slow. With uprobes, since the code of MySQL or Kafka is fixed, you can hook the server-side functions directly and no longer have to catch the problem from the client.

This also lets you prove your innocence. For example, a user has a bug in his own code but thinks your backend Redis has a problem; you can show that the request never reached you and that the function execution you checked is fine. The results are good, and we are still experimenting with and rolling out this part.

In the tracing field, the usual approach is to analyze and record logs through SDK code. For some very old legacy applications where that kind of integration is impractical, you can use uprobes to capture dynamic information, or analyze protocols such as Redis at the network level and extract error and latency information from the outgoing traffic. The implementation is relatively complex, and we are still trying to land it.

Of course, the very fine-grained metrics we are building have relatively high requirements on the kernel version; one of our internal departments is already on 5.13, which is fairly aggressive.

With all these metrics collected, they are used for alerting — and alerting is not easy to do well. The previous two speakers also raised this question: what problem should alerting solve? First, do not fire false alarms; second, do not miss real ones. Neither is easy.

Alert management

First, what is wrong with traditional threshold alerts? Things change: today's request volume differs from tomorrow's for whatever reason. Either the threshold is not updated quickly enough, or you get startled by a false alarm. Threshold-free alerting is something we have been exploring since around 2017-2018, and we are now porting it to the cloud-native environment.

Another issue is alert suppression. For example, if a disk becomes slow, what does that cause? Logging gets stuck; in a Java program a stuck log call makes processing threads pile up, the thread pool fills, response latency rises, and a whole chain of effects follows. If you think about these alerts, they are correlated: we may only need to report that the disk is slow, not all the downstream alerts. So we need alert correlation as a suppression strategy.
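A hedged sketch of what rule-based suppression might look like: a hand-maintained map from a root-cause alert to the downstream alerts it explains, used to filter a batch of firing alerts. The alert names here are made up for illustration.

```go
// suppress.go: drop downstream alerts when their configured root cause is
// also firing, so only "disk_slow" is reported instead of the whole chain.
package main

import "fmt"

// suppresses maps a root-cause alert to the alerts it explains away.
// These relationships are supplied from operator experience, not learned.
var suppresses = map[string][]string{
	"disk_slow": {"log_write_stuck", "threadpool_exhausted", "latency_high"},
}

func filter(firing []string) []string {
	active := make(map[string]bool, len(firing))
	for _, a := range firing {
		active[a] = true
	}
	drop := make(map[string]bool)
	for root, children := range suppresses {
		if !active[root] {
			continue
		}
		for _, c := range children {
			drop[c] = true
		}
	}
	var out []string
	for _, a := range firing {
		if !drop[a] {
			out = append(out, a)
		}
	}
	return out
}

func main() {
	firing := []string{"latency_high", "disk_slow", "threadpool_exhausted"}
	fmt.Println(filter(firing)) // [disk_slow]
}
```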

How is this done? There were a lot of people in the AIOps sessions today, and everyone is interested in such topics. Based on both our practice and my personal feeling, relying entirely on AI is not realistic; at least at this stage there do not seem to be good results in practice.

What we do now is mostly mathematical statistics, plus a small number of machine learning algorithms to help us build models. The modeling analyzes a single metric or a set of related metrics, and the relationships between metrics are still supplied from experience rather than discovered automatically — to this day I still do not believe AI can really be that intelligent.

We currently rely on statistical methods such as the normal distribution and very basic machine learning algorithms, generating models through offline computation and then doing anomaly detection in a real-time computation module, which gives us threshold-free alerting. There is also correlation-based alert suppression; the correlations are input by hand, not learned.
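As a hedged illustration of the statistical side (a toy, not our actual offline/real-time pipeline), here is a detector that models a metric with a rolling mean and standard deviation and flags samples more than three standard deviations away:

```go
// zscore.go: a toy threshold-free detector — flag a sample when it deviates
// from the rolling window's mean by more than 3 standard deviations.
package main

import (
	"fmt"
	"math"
)

type detector struct {
	window []float64
	size   int
}

func (d *detector) observe(x float64) bool {
	anomalous := false
	if len(d.window) == d.size {
		mean, std := stats(d.window)
		// Guard against a flat baseline where std is ~0.
		if std > 1e-9 && math.Abs(x-mean) > 3*std {
			anomalous = true
		}
		d.window = d.window[1:]
	}
	d.window = append(d.window, x)
	return anomalous
}

func stats(xs []float64) (mean, std float64) {
	for _, x := range xs {
		mean += x
	}
	mean /= float64(len(xs))
	for _, x := range xs {
		std += (x - mean) * (x - mean)
	}
	std = math.Sqrt(std / float64(len(xs)))
	return
}

func main() {
	d := &detector{size: 10}
	series := []float64{100, 102, 99, 101, 100, 98, 103, 100, 101, 99, 180}
	for i, v := range series {
		if d.observe(v) {
			fmt.Printf("sample %d (%.0f) looks anomalous\n", i, v)
		}
	}
}
```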

There is also alert feedback. After an anomaly is detected and sent to you, if the alert is inappropriate or wrong, you can adjust the parameters. For example, if we fit a periodic function to a metric and the fitted period turns out to be wrong, we can adjust or change it manually; and if some alerts are simply false positives, we can suppress them.

We hope to build a complete closed loop — from monitoring, to problem triggering, to offline computation, real-time computation, and manual feedback — so that the system can keep improving.

That covers the monitoring and alerting part.

4. Cost savings

Cost-saving measures

Now for cost. One of the biggest problems NetEase faced with cost management is that a business unit suddenly arrives with a batch of tasks. We have content security workloads, for example, and at some point a large amount of work suddenly comes in; the team does not have enough machines, so it borrows and coordinates machines everywhere, wasting a lot of human energy.

What we do now is try to make the internal resource pools better. The main lever for cost saving is resource allocation recommendation — the previous guest also mentioned this, including scaling requests up and down, which is VPA. VPA is actually quite meaningful.

In addition, co-location (hybrid deployment of online and offline workloads) is a hot topic in the industry. It is genuinely difficult to land, and may not be suitable for many small and medium-sized companies.

There is also the resource pool integration just mentioned: we hope to build one large pool that every business department can use.

There is also the issue of pushing this to the business, from the boss down to the individual engineer. Whenever you do something new, there will inevitably be instability. From our experience, every problem you hit during containerization is, from the business side's point of view, a container problem; every problem you hit during co-location is, in their view, caused by co-location. The business side always thinks its own code is fine, and when it finally turns out to be the business side's problem, you have to prove it to them.

Let me briefly cover two topics: co-location and resource integration.

Co-location

NetEase started trying co-location the year before last, and it has now reached a certain scale.

There are two parts: scheduling and isolation. Scheduling is relatively simple for us. Although the industry has papers and practices on machine learning, resource profiling, and predictive scheduling, we currently base it only on real-time data collection.

In practice, scheduling on real-time monitored resource usage works reasonably well so far. An online service is unlikely to fluctuate wildly all the time, so the offline jobs can at least run relatively smoothly for a period of time.

Isolation measures

There are several main isolation measures; for compute, they are CPU scheduling and memory management. For CPU scheduling, traditional CFS share-based isolation has a problem: it is fair scheduling, so even a very low-priority offline task is still guaranteed its share. There is also the minimum scheduling granularity — a task runs for at least a certain slice before the CPU switches away — so when offline jobs such as big data workloads drive the CPU to 100%, online services see extra latency. There are good implementations in the industry: Tencent developed its own scheduling class, and Alibaba Cloud has Group Identity. We have referred to this thinking and combined it into our own version.

There is also hyperthreading. You may know that two hyperthreads actually share a single physical core. The numbers differ by scenario, but we generally assume a hyperthreaded physical core gives roughly 120%-130% of a single core, not the compute power of two cores. So if an offline job and an online service run on the two hyperthreads of the same physical core, the online service is heavily disturbed; the scheduler can be made to avoid this.

For L3 cache isolation, we basically just limit how much the offline tasks can use.

Page cache reclamation is basically controllable for online services, but basically uncontrollable for middleware services. If it is not handled properly, unexpected reclamation causes a big performance drop, or the service OOMs because nothing is reclaimed. Because our kernel is relatively new, we have also added some automatic and proactive reclamation mechanisms.

Finally, the biggest challenge of co-location is that offline jobs usually need storage-compute separation, which depends on the IDC infrastructure. In a self-built IDC, the facility may have been built N years ago, and whether the old equipment can support full-link QoS is a question mark. I have stepped on a switch pit before: when offline jobs accessing remote storage filled up the network, a switch bug caused random latency, which was very painful to troubleshoot.

For I/O isolation, although cgroup v2 offers many buffered-I/O isolation strategies, the simplest and crudest approach — using separate block devices — is still the easiest.

Overall, NetEase's internal isolation measures remain simple and general; we do not pursue the ultimate co-location efficiency.

Resource pool integration

Departments within NetEase are relatively independent, which means each department runs multiple Kubernetes clusters. What if one business has a lot of idle cluster resources while another is completely short? We built a huge resource pool called KubeMiner. Business Kubernetes clusters are divided into Consumers and Providers — resource consumers and resource providers — and KubeMiner bridges them with virtual nodes.

The consumer sees a virtual node; the provider is completely unaware, because KubeMiner runs the consumer's workloads in the provider's cluster and performs the conversion in the middle.

In this way, the resources of a single cluster are extended to all other Kubernetes clusters. If a business has not planned its capacity well, or has spare resources for a period of time, it can sell them to others; if a business temporarily needs resources, it can see who has them.

Why this structure? First, it is general: for the consumer, it is nothing more than changing the scheduling constraints so that workloads land on the virtual node. The provider is completely unaware — it just sees a bunch of Pods it does not recognize running in its cluster.
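As a much-simplified, hedged sketch of the consumer-side view (the real system is closer to a virtual-kubelet-style implementation and must also keep the node's status and heartbeats up to date), registering a tainted virtual node that advertises borrowed capacity might look like this; the node name, taint key, and capacity figures are illustrative:

```go
// virtualnode.go: register a "virtual node" in the consumer cluster that
// represents capacity borrowed from provider clusters.
package main

import (
	"context"
	"log"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	node := &corev1.Node{
		ObjectMeta: metav1.ObjectMeta{
			Name:   "kubeminer-virtual-0", // hypothetical name
			Labels: map[string]string{"type": "virtual-kubelet"},
		},
		Spec: corev1.NodeSpec{
			// Taint the node so only workloads that explicitly tolerate it
			// are scheduled onto remote provider capacity.
			Taints: []corev1.Taint{{
				Key:    "kubeminer.example.com/provider",
				Value:  "remote",
				Effect: corev1.TaintEffectNoSchedule,
			}},
		},
		Status: corev1.NodeStatus{
			// Advertise the capacity borrowed from provider clusters; a real
			// implementation would keep this updated as providers change.
			Capacity: corev1.ResourceList{
				corev1.ResourceCPU:    resource.MustParse("200"),
				corev1.ResourceMemory: resource.MustParse("800Gi"),
				corev1.ResourcePods:   resource.MustParse("500"),
			},
		},
	}

	if _, err := client.CoreV1().Nodes().Create(context.TODO(), node, metav1.CreateOptions{}); err != nil {
		log.Fatal(err)
	}
	log.Println("virtual node registered")
}
```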

Another advantage is that the business is completely unaware and needs almost no adaptation: the Consumer stays in its own cluster, and the Provider does not care. What we do need is a settlement system between them, so the money can be calculated clearly — that is our job.

Difficulties

Three difficulties stand out. The organizational structure leaves us with a very fragmented landscape: each party's Kubernetes cluster uses a different network solution, and there are all kinds of them. To build a unified resource pool we need interoperability between them, which we consider the biggest challenge in this work.

There is also cross-cluster object synchronization. Even though the Pod is scheduled elsewhere, objects such as PVs, Service access, and ConfigMaps must be synchronized into the provider cluster, because we want the workload's experience there to be the same as in its own cluster.
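As a hedged sketch of one small piece of that synchronization, here is a one-shot copy of a ConfigMap from the consumer cluster into the provider cluster. The kubeconfig paths, namespace, and name are placeholders; the real system watches and continuously reconciles many object types rather than copying once.

```go
// cmsync.go: copy one ConfigMap from the consumer cluster to the provider
// cluster so the relocated Pod sees the same configuration.
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func clientFor(kubeconfig string) *kubernetes.Clientset {
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		log.Fatal(err)
	}
	c, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	return c
}

func main() {
	consumer := clientFor("/path/to/consumer.kubeconfig") // placeholder paths
	provider := clientFor("/path/to/provider.kubeconfig")

	const ns, name = "demo", "app-config" // placeholder namespace and name

	cm, err := consumer.CoreV1().ConfigMaps(ns).Get(context.TODO(), name, metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}

	// Strip fields owned by the source cluster before recreating the object.
	cm.ResourceVersion = ""
	cm.UID = ""
	cm.OwnerReferences = nil

	if _, err := provider.CoreV1().ConfigMaps(ns).Create(context.TODO(), cm, metav1.CreateOptions{}); err != nil {
		log.Fatal(err)
	}
	log.Printf("synced ConfigMap %s/%s to the provider cluster", ns, name)
}
```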

And how do we find the best match? What does that mean? The business side needs 10 Pods, but no single provider cluster has capacity for all 10, so we have to split them across multiple clusters. How do we do this well, given that the provider clusters are changing dynamically and are also being used by others? Our current approach is relatively simple: we configure some parameters based on experience and match against them.

That is roughly it. There is no particularly complex algorithm at present; we may adopt some of these techniques in the scheduling mechanism later.

Results

That is roughly today's sharing. To close with the results: in terms of overall cost savings we have gained real value, with average CPU utilization reaching about 55%; one business unit's video transcoding workload consumes no dedicated resources, running entirely as low-priority co-located tasks; and the elastic resource pool aggregates the compute resources of the various Kubernetes clusters, providing elastic capacity for other business lines.

That's all for today, thank you all.

About the Author

Wang Xinyong, head of the NetEase Shufan container orchestration team, has participated in building NetEase Group's load balancing, SDN, and other projects. He currently focuses on driving the adoption of cloud-native technology across NetEase's Internet businesses, with an emphasis on stability assurance and cost optimization in cloud-native environments.

Learn more

Analysis of KubeDiag Framework Technology

