The popularity of cloud native continues to rise, and the trend extends to middleware. With the help of cloud native technology, middleware is solving its own problems of elasticity, resilience, operation and maintenance, and delivery. At the same time, the way developers use middleware is becoming increasingly cloud-native.

So, in the cloud-native era, how should middleware complete its own technological "evolution"? On May 30, Feng Changjian, Chief Architect of NetEase Shufan Cloud Native, appeared on "A Date with Geeks" to discuss this topic. The following content is edited from the live broadcast without changing the original intent. For the complete content, [click to view the playback video](https://www.infoq.cn/video/Zq2P94aVHmGbKiGs9qfh).

June 22, 19:00-20:30, Digital Basic Software Independent Innovation Sharing Week · Middleware Technology Forum: three experts interpret the practice of highly stable, high-performance, and highly available cloud native middleware. Welcome to follow: Diversified Symbiosis, Connecting New Opportunities - 2022 NetEase Shufan Digital Basic Software Independent Innovation Sharing Week

Why middleware needs cloudification

Q: Mr. Feng, you have been at NetEase for 12 years and have more than a decade of R&D experience. When did you start focusing on cloud native?

A: I have been working on development and architecture design for more than ten years. Fortunately, I have experienced the whole process of cloud native, from its germination through its development to its current gradual maturity, where large-scale innovation can land in production environments. In 2014, before Kubernetes 1.0 was released, we built a container cloud platform based on Docker and Kubernetes. Currently, I work on cloud-native projects in the company, including microservices, containerization, service mesh, middleware, and DevOps.

Q: In your opinion, what major progress has been made in cloud native from the beginning to the present?

A: The most intuitive change is the impact cloud native has had on the R&D and operations process. Five or six years ago, there were few concrete implementation paths for DevOps; the emphasis was more on the concept of integrating development and operations. As cloud-native technologies developed, everyone gradually discovered that the agility, flexibility, and speed advocated by DevOps can all be realized through cloud-native technology. It can be said that cloud native has truly reshaped the entire enterprise development and operations process.

Q: Why is it necessary to cloudify middleware?

A: First of all, the fundamental feature of the cloud is resource elasticity, which allows it to cope with sudden traffic fluctuations. Behind that elasticity is resource pooling: computing, network, storage, and so on are organized into resource pools that can be scheduled flexibly to improve resource utilization.

Second, automated operation and maintenance cannot be bypassed. In the traditional approach, without cloudification and platformization, our engineering experience is that 30 to 50 clusters can be managed with scripts or semi-automated methods. However, once the number reaches three to five hundred clusters, which is common in Internet businesses, it is difficult for scripts or semi-automation to keep so many instances running stably. Cloudification is necessary because it solves the automated operation and maintenance problem well.

Third, standardization also reflects the necessity of cloudification. Without cloud native, we face many versions, especially in the middleware field: the same middleware may have many versions and many different deployment architectures, which makes unified automation very difficult.

Cloudification implies relative standardization, which helps unify the technology stack. We can focus the entire stack on a few selections and further standardize enterprise-level technology. Once everyone agrees to use a certain version, subsequent maintenance and unified upgrades become much easier.

Another point is that cloudification changes the way middleware services are provided. In a large organization, a microservice architecture requires middleware such as caches and message queues in addition to the applications themselves. To use such middleware, the business must first go through an application process. If the operations team and the business team are not in the same cost center, capacity planning, procurement planning, and so on must be done in advance, a process involving a large number of procedural interactions among multiple parties.

A cloud-based middleware platform provides a technical hub, or portal, through which resources can be requested independently. After the necessary approvals, each business department can use middleware on demand, which greatly shortens the process and makes resource management and control far more efficient.

Q: What kind of middleware can be called cloud-native middleware?

A: I think middleware that runs on K8s and whose architecture is designed in the K8s-native way is cloud-native middleware.

Q: Is there a fundamental difference between cloud-native middleware and on-premises PaaS middleware?

A: The source of the difference is that cloud native middleware is built on K8s. Many basic capabilities of middleware are delegated to K8s, and a large number of routine, high-frequency operation and maintenance actions are abstracted into K8s resource definitions. The high-frequency actions operators often perform, such as scaling, migration, rebuilding, and failure recovery, can be executed uniformly through kubectl or the corresponding GUI operations, which greatly lowers the operations threshold and workload. This is a benefit brought by standardization.

In addition, because of the Kubernetes abstraction layer, cloud native middleware can naturally be deployed across clouds or across heterogeneous infrastructure, with a completely consistent user experience throughout. In other words, in the cloud computing era we can choose among mainstream cloud vendors at home and abroad, run middleware services entirely on the cloud, and migrate at any time. All of this is, in principle, enabled by the decoupling the Kubernetes layer provides.

The cloudification path of middleware

Q: What does the technology stack of cloud native middleware mainly cover? How does it help enterprises reduce costs and increase efficiency?

A: Our technology stack is mainly based on K8s and the frameworks built on top of it, such as the Operator.

As for reducing costs and increasing efficiency, it is really a question of efficiency versus cost. On the efficiency side, the gains come mainly from process and automation: we can automate a great deal, including fault root cause analysis, fault self-healing, and even proactive fault detection and stability inspection.

Cost has always been a constant issue in cloud computing. The cloud keeps improving cost and efficiency, yet after years of practice many find that costs become uncontrollable once they move to the cloud. In my opinion, enterprises still need finer-grained control to use resources rationally.

Q: Do you internally classify cloud-native middleware?

A: Yes, mainly divided by state, that is, stateful versus stateless. Middleware responsible for load balancing, such as Nginx and API gateways, is stateless, while message queues and caches are weakly stateful middleware.

Q: For stateful middleware, such as Kafka, what is their cloud-native path? What is the difficulty in it?

A: Cloud-native middleware is defined on K8s. Middleware includes its own runtime, engines such as Redis, and the supporting management and control systems. I think the cloudification path mainly includes two aspects: a general technical framework and some differentiated management and control processes.

At present, the mainstream solution for state management on K8s is the Operator framework and its hosting platform OLM, which manage the life cycle of a middleware cluster. This also involves managing and scheduling K8s resources such as CPU, memory, disk, and network. In addition, unified access to middleware is very important; it must be combined with the complex network architecture inside the enterprise, covering access from inside and outside the cluster, with many details to consider.

Next come the relatively advanced capabilities, such as automatic elastic scaling, instance migration, and dynamic configuration distribution, to achieve higher-level features such as high availability, and finally the supporting facilities related to observability.

Life cycle management, scheduling, access methods, and observability are all common concerns: designs that Kafka, Redis, and other middleware cannot avoid.

In addition, there are differentiated management and control processes. "Differentiation" means that different resource dependencies must be managed on top of the common basic framework. Some middleware is disk-intensive, some memory-intensive, and some network-intensive, so different middleware has different resource requirements and thus needs different technical solutions.

There are also various solutions for middleware high availability, that is, their cluster topologies are managed differently. For example, Kafka relies on ZooKeeper to manage its cluster, which is relatively easy to automate, while Redis needs role assignment after the entire cluster starts, so its control logic is somewhat heavier.

Overall, the cloudification of various types of middleware is achieved through common technical frameworks plus differentiated management.

Difficulties in operation and maintenance

Q: Qingzhou has a dedicated product for operation and maintenance stability control. Why build a dedicated product? What are the difficulties in operation and maintenance?

A: The advantage of cloud native middleware is that it reshapes how technology is supplied, but if stability problems are not solved, that advantage cannot show. Middleware operation and maintenance takes a lot of energy, and many factors affect stability. A few examples may make this clear.

For example, there is the typical resource water level problem: capacity planning is hard. There are roughly two types of water level risk. One is throughput risk driven by real-time traffic, such as CPU, disk IOPS, and NIC bandwidth; the other is capacity risk that accumulates gradually, mainly disk space and memory. Many factors affect these capacity-related risks, such as the application's business traffic and the middleware's internal support mechanisms.
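The two kinds of water level risk can be sketched as a simple threshold check. This is an illustrative example, not NetEase's implementation: the `classify` function and the threshold values are hypothetical, chosen only to show why throughput-type and capacity-type metrics might use different warning lines.

```go
package main

import "fmt"

// RiskLevel classifies a resource water level reading.
type RiskLevel string

const (
	Healthy  RiskLevel = "healthy"
	Warning  RiskLevel = "warning"
	Critical RiskLevel = "critical"
)

// classify maps a usage ratio to a risk level using warning and
// critical thresholds expressed as fractions of capacity.
func classify(used, capacity, warn, crit float64) RiskLevel {
	ratio := used / capacity
	switch {
	case ratio >= crit:
		return Critical
	case ratio >= warn:
		return Warning
	default:
		return Healthy
	}
}

func main() {
	// Throughput-type metrics (CPU, disk IOPS, NIC bandwidth) spike with
	// real-time traffic, so tighter thresholds make sense.
	fmt.Println(classify(850, 1000, 0.7, 0.9)) // NIC: 850 of 1000 Mbps

	// Capacity-type metrics (disk space, memory) grow gradually, so a
	// warning leaves time to expand before the cluster is at risk.
	fmt.Println(classify(450, 1000, 0.6, 0.85)) // disk: 450 of 1000 GB
}
```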

Middleware typically runs as a multi-replica cluster. After a node fails, there is often a self-rebuilding process, but rebuilding puts load pressure on the remaining surviving nodes, which is hard to control and can even amplify the problem. A failed node must first be restored, and during recovery it pulls data from surviving nodes, which raises their resource water level and can trigger further problems. Other factors such as imbalance, hotspots, hardware failures, and abnormal configuration parameters can also cause stability issues.

The NetEase Shufan stability control product I just mentioned is the accumulation of more than ten years of our operations team's experience, providing users with monitoring and alerting, stability and capacity management, daily troubleshooting, and other services. We also try to establish a stability improvement cycle: find problems, analyze and fix them, then add the accumulated experience to a rule engine so the same problem is avoided next time. Middleware stability management normally requires very senior people and experience; we hope this platform can solidify that experience.

Q: A viewer asks: how does a small company, where one person wears many hats, do operation and maintenance?

A: I think such enterprises need the cloud-native transformation even more. All kinds of middleware now have their own application scenarios and problems they are good at solving, so operations staff constantly face new middleware requirements submitted by the business side.

Redis and Kafka are very common middleware, and some systems, such as graph databases or time series databases, are sometimes also classified as middleware. For small companies without many operations staff, or companies where one person manages many systems, operations can easily get out of control; as a result, these middleware are simply no longer offered, and technology selection becomes constrained.

So what are the benefits of having such a middleware platform? One essential feature of cloud-native middleware, as mentioned earlier, is sufficient standardization. The necessary skills for operations staff are understanding K8s, the declarative API, and how CRDs define resources. Whether it is Kafka or Redis, the learning path and technical mental model are then consistent. For example, scaling operations for Kafka and Redis can both be declared as the same kind of CR, so even with ten different middleware they can all be handled in almost the same way.

This is a great advantage of cloud native: it decouples concerns and deposits many technical details into the operator. For operations staff, the working interface is the K8s command line, which helps liberate their productivity.

Technology selection of middleware cloudification

Q: When did Qingzhou start working on cloud native middleware?

A: We started cloud-based middleware relatively early. NetEase's entire Internet business has been doing private cloud since 2012. At that time, we provided cloud hosts, networks, and disks based on OpenStack, and on that virtualized environment we provided PaaS middleware based on virtual machines. So I would say our real cloud-based middleware started in 2012. It took us about a year or two to migrate 95% of NetEase's Internet business onto the cloud platform, using our cloud hosts and some cloudified middleware.

Around 2014 or 2015, NetEase entered the cloud-native stage. We started containerizing stateless applications, running some of the group's stateless businesses on K8s, and also working on service mesh. By around 2019, containerization was basically complete, and a large number of core businesses were running on the service mesh platform.

From the second half of 2018, we started building cloud-native middleware based on operators in the real sense. After more than three years, we have gradually built up some in-depth product capabilities. In the early days, we focused on middleware life cycle management, that is, capabilities such as creation, deletion, and scaling. Now we are thinking more about high availability across multiple data centers and clusters, and adaptation to domestic heterogeneous software and hardware platforms.

At the beginning of last year, we gradually productized the stability inspection just mentioned, and also started building "two-site, three-center" disaster recovery and multi-active capabilities. Having passed the life cycle management phase, we are slowly building deeper capabilities.

Q: Are all our businesses going to the cloud now?

A: Our Internet business is basically on the cloud: stateless applications are all containerized, and stateful applications are migrating gradually. Message queues and caches move faster; databases are relatively slower.

Q: You mentioned the Operator framework many times before, why did you choose this framework at that time? How does an enterprise make technology selection?

A: There are two ways of application delivery on K8s, one is based on Helm, and the other is based on Operator.

Helm is more suitable for stateless applications. It relies on K8s native resource controllers to provide self-healing and scaling for the various K8s resources, but its management and control capabilities are relatively weak, and it cannot guarantee relationships between native K8s resources. For example, in middleware, state changes under one Pod may depend on another: a Pod can depend on the results of another Pod, and that information needs to be injected into the corresponding Kubernetes resources. This is complex business logic that Helm cannot handle.

The other approach is management based on Operators and user-defined resources. If we understand Kubernetes as a distributed operating system over the hardware, the Operator is a framework for developing cloud-native applications on that operating system. It is more suitable for middleware, or stateful applications in general.

Middleware has two core characteristics. One is state, represented by storage, network IPs, and so on. The other is that the replicas of a middleware cluster are related to each other; for example, there is a master-slave relationship between two Redis replicas, so they cannot be treated identically. An Operator is an abstraction that encapsulates operation and maintenance knowledge in a specific domain and can be customized to do many things, so choosing Operator plus CRD to manage middleware is a very good fit.
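The control-loop idea behind an Operator can be sketched in plain Go, with no Kubernetes libraries: compare declared desired state with observed state and take one corrective step at a time. The `Cluster` type and the actions here are illustrative only, a minimal sketch rather than a real controller.

```go
package main

import "fmt"

// Cluster holds the observed state of a hypothetical middleware cluster.
type Cluster struct {
	Name     string
	Replicas int
}

// reconcile moves observed state one step toward the declared desired
// state, mimicking what an Operator's control loop does on each event.
func reconcile(c *Cluster, desired int) string {
	switch {
	case c.Replicas < desired:
		c.Replicas++ // add a replica (in reality: create a Pod, PVC, etc.)
		return "scaled up"
	case c.Replicas > desired:
		c.Replicas-- // remove a replica (after draining and rebalancing data)
		return "scaled down"
	default:
		return "in sync"
	}
}

func main() {
	c := &Cluster{Name: "redis-sessions", Replicas: 3}
	desired := 5

	// The loop keeps acting until observed state matches the declaration;
	// editing the CR's desired replica count is all an operator user does.
	for {
		action := reconcile(c, desired)
		fmt.Printf("%s: %s -> %d replicas\n", c.Name, action, c.Replicas)
		if action == "in sync" {
			break
		}
	}
}
```

In a real Operator the same loop is driven by watch events from the API server, and each step manipulates actual K8s resources instead of a struct field.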

Q: With the continuous expansion of business scale, how can cloud-native middleware ensure its own high performance and high stability?

A: This is also a common concern when enterprises containerize stateful applications. Performance and stability matter most for middleware, so let me share our specific ideas and solutions in this regard.

In terms of performance, load balancing comes first: to exploit more distributed processing capacity, you need load balancing and proxy components. There is also rapid scaling for elasticity, a way to horizontally expand and improve the performance and capacity of the whole cluster. If IO or disk requirements are particularly high, consider high-performance local disks. Most importantly, do parameter-level tuning of the middleware for cloud-native scenarios. These are some of our options for improving performance.
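As a toy illustration of spreading requests across middleware proxy endpoints, here is a minimal round-robin picker. This is a generic sketch, not NetEase's component; the endpoint names are invented, and real proxies would add health checks and weighting.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// roundRobin is a minimal balancer over middleware proxy endpoints.
type roundRobin struct {
	endpoints []string
	next      uint64
}

// pick returns the next endpoint in rotation. The atomic increment keeps
// the picker safe when called from many goroutines at once.
func (r *roundRobin) pick() string {
	n := atomic.AddUint64(&r.next, 1) - 1
	return r.endpoints[n%uint64(len(r.endpoints))]
}

func main() {
	rr := &roundRobin{endpoints: []string{
		"proxy-0:6379", "proxy-1:6379", "proxy-2:6379",
	}}
	// Four picks wrap around the three endpoints.
	for i := 0; i < 4; i++ {
		fmt.Println(rr.pick())
	}
}
```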

There are also many approaches to stability. One is the stability inspection platform, which helps use the platform's resources more reasonably and keeps resource water levels healthy. We have also adopted automated development frameworks such as the Operator so that middleware can self-heal, plus observability methods and chaos engineering. If faults can be handled effectively, everyone can trust the system more.

Q: The entire cloud native framework is becoming more and more complex, and everyone is paying more and more attention to observability. What is the current state of observability in the industry?

A: With observability, at first everyone had their own standards. Then some mainstream vendors took the lead, a consensus gradually formed in the community, and new industry standards emerged.

Now, the metrics, traces, and logs of various applications largely follow community standards. For tracing, for example, there are community standards such as OpenTelemetry, which the community is still improving. A lot of open source cloud-native software claims to follow at least these community standards, so you do not have to worry too much about compatibility and interoperability. That is the stage we have reached.

Q: What explorations has NetEase Shufan made on observable lines?

A: We have also built an observability product with integrated monitoring. Microservice scenarios need to be application-centric: we collect the metrics and information scattered across the various components and display them logically on one screen.

So, through a set of underlying data systems, we aggregate the monitoring system and data from the various sub-products, such as service frameworks, service meshes, API gateways at the traffic entry, and middleware logs and traces, onto this unified monitoring platform, and then connect the data.

Unified data collection is only the first step; there is still a lot of room for imagination, such as deeper root cause analysis. Previous root cause analysis might cover only one dimension, such as east-west traffic calls, north-south traffic calls, or a particular middleware. Now the root cause localization process can follow the entire chain: if access is slow, for example, you can drill down directly to the container layer, or further down to the network or memory.

We also do a lot in subdivided areas such as logging. We used open source log collectors such as Fluentd before, but in some large-scale Internet scenarios the performance was insufficient; for example, when logs were lost it was hard to tell whether they had been collected at all. Sharing the same pain points, we co-developed and open-sourced the log collection software Loggie with Industrial and Commercial Bank of China.

Landing practice

Q: What middleware cloudization has NetEase implemented?

A: In our group's Internet business, Redis, Kafka, RocketMQ, ES, and ZK have all landed, and MySQL is landing now. These are relatively mature, and most new businesses use cloud native.

Q: I saw an introduction earlier that NetEase Cloud Music has reduced costs by more than 30% after introducing the Redis of the Qingzhou middleware. What exactly did you do to achieve this effect?

A: First, after 2012 we built PaaS middleware services on KVM virtualization. When applications were later containerized, removing the virtualization overhead naturally saved more than 10% in cost.

Second, we developed a management and control system on K8s mainly responsible for co-locating offline and online business. For example, when the water level is low, offline computing jobs, such as the typical music transcoding, are scheduled alongside Redis. Through this co-location, resource utilization can be raised to more than 60%, whereas it might otherwise be 10% or 20% or even less. There are other co-location attempts as well, such as mixing Redis with other middleware.

The third aspect is resource overcommitment, which cloud platforms basically all do: overselling K8s resources, for example by reasonably designing the overcommit ratio between requests and limits, and pushing instance density higher in the test environment. Our online and offline application scale is about 1:1, and we have done more aggressive overcommitment offline.
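The requests/limits arithmetic behind overcommitment can be shown with a couple of small functions. The numbers are hypothetical, chosen only to illustrate how scheduling by request rather than by limit raises instance density.

```go
package main

import "fmt"

// overcommitRatio returns limit/request: how far a pod's resource limit
// exceeds its guaranteed request, i.e. the overselling factor.
func overcommitRatio(requestMilliCPU, limitMilliCPU int) float64 {
	return float64(limitMilliCPU) / float64(requestMilliCPU)
}

// maxInstances estimates how many such pods fit on a node when the
// scheduler packs by requests rather than by limits.
func maxInstances(nodeMilliCPU, requestMilliCPU int) int {
	return nodeMilliCPU / requestMilliCPU
}

func main() {
	// A pod requesting 500m CPU but limited to 2000m is oversold 4x.
	fmt.Println(overcommitRatio(500, 2000))

	// A 16-core (16000m) node fits 32 such pods when packed by request,
	// versus only 8 if it were packed by limit.
	fmt.Println(maxInstances(16000, 500), maxInstances(16000, 2000))
}
```

The trade-off is that if all pods burst to their limits at once, the node is oversubscribed, which is why test environments tolerate higher ratios than production.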

Q: What scenarios are cloud-native middleware more suitable for?

A: Actually, cloud native middleware is a replacement for traditional middleware. Cache middleware serves data caching and fast access scenarios, message queue middleware serves decoupling and peak shaving scenarios, and API gateways serve traffic management scenarios. It is not that cloud native suits certain scenarios and traditional middleware does not; rather, localized PaaS or traditional middleware can no longer meet some requirements of today's microservice architectures, which may only be satisfied by cloud native middleware.

For example, the financial industry needs multi-active architectures. The service framework supports this, but if the middleware and database cannot, there is no way to advance the architecture as a whole, which forces you to adopt cloud-native middleware to support a cloud-based architecture.

Q: What difficulties or challenges will you face during the entire landing process?

A: From an industry perspective, the options are self-development, external procurement, or a combination of the two. NetEase mainly self-develops. Self-development is certainly based on open-source frameworks, but no open-source community will truly guarantee stability for you, and professional support is scarce; communities are driven by shared values or technical enthusiasm, and may not respond to an ordinary enterprise's specific needs. In general, the degree of scenario adaptation is low and enterprise-level management capabilities are weak. These are the problems of open-source-based self-development, and enterprises must solve high availability, stability, and large-scale application issues one by one.

Procurement is a very fast path, but the main difficulty is that the buyer cannot fully control the middleware: it was bought from outside, and the in-house technical ability to master it is lacking. Here training matters, to ensure the product can be used day to day.

In the end, you still need people. Whether self-developed or procured, building the whole ecosystem requires technical talent that understands cloud native and K8s.

Q: Do developers need to understand the business?

A: For middleware developers, yes. Only by understanding the business can you design well, because what you build is used by others, even by business staff.

It can be understood from two aspects. On one hand, you need to understand the enterprise's IT processes: the positioning of the various technology platforms, what each platform does, where the boundaries are, and how the platforms work together.

On the other hand, you need to understand best practices in business scenarios. Supporting large-scale applications requires deposited best practices, and middleware developers must understand how the business uses the middleware. For example: what is the traffic model behind the API gateway, and is the traffic pattern random or fixed? What does the business require of Kafka or Redis in terms of data sharding, performance, scalability, and high availability? All of these affect the middleware's internal architecture and the selection of the resource types it depends on, and a reasonable assessment is only possible when you understand the business type. Likewise, when building geo-distributed multi-active or unitized architectures, an unimportant business does not need a complex setup; a same-city architecture is fine. All of this must be understood; it is hard to have one universal recipe for every scenario.

Q: Mr. Wang Yuan once mentioned in a talk that for the same middleware, the amount of code when developed on K8s was reduced by 50% to 80%. How was this achieved?

A: I think this may be the most intuitive benefit of middleware based on Operator or containerization.

Back then, we had several teams building PaaS middleware on KVM virtual machines: the six or seven middleware I just listed were built by six or seven teams. Everyone had to understand the middleware's entire management and control process, especially high availability and disaster recovery. The process lacked abstraction and the code was hard to reuse, so R&D efficiency was low. A middleware PaaS service took about two or three months from project initiation through development, testing, and launch, which was already fairly fast, and that did not include enterprise-level capabilities such as billing and metering, permission control, and resource pooling.

After adopting the Operator framework, the reduced code size may just be the final result; what did we do along the way? As mentioned earlier, operators divide into a general part and a differentiated part. The general part is reflected in design specifications, including design principles, best practices, and CRD design.

At that time, we largely switched to Golang as the development language, used the Operator SDK as the development framework, and followed the declarative API design specification. We set standards and constraints across the R&D process, covering design, development, deployment, operations, and testing specifications, plus online stability management and high availability requirements for middleware. The end result was a 50%+ code reduction.

Building cloud-native middleware is not primarily about reducing code size, but the reduction is definitely a by-product. The first generation of PaaS middleware was about 50,000 to 60,000 lines of code; after containerization it was only about 20,000. Others are similar: Kafka was originally about 50,000 lines and later just over 20,000.

Future development

Q: What imagination does cloud native middleware have in the future?

A: Everyone has their own definition of middleware, but I think many stateful applications can be made cloud-native, including some middleware considered outdated.

For example, our group still uses an early cache middleware, Memcached. In many businesses it may be considered middleware on its way out, but in one of our businesses it is still heavily used by some applications, because it fits many specific scenarios. Based on our Operator SDK framework, we quickly implemented a Memcached Operator and replaced it with cloud-native Memcached.

Because we had built these standardization capabilities on K8s, including the various specifications mentioned above, developing and rolling out replacements was very fast. Without such accumulated frameworks and technical reserves, these systems would remain historical liabilities that cannot be dropped or replaced and must be maintained by someone. We made these seemingly outdated middleware cloud-native, and they can now be maintained uniformly at a small cost.

This is just one example. My point is that much stateful middleware can improve operational efficiency through cloud-native methods.

Of course, there is also a cost advantage. Competition among public cloud IaaS offerings at the resource level is already fierce: offerings are homogeneous and gross margins are very low. PaaS middleware, by contrast, has a relatively large profit margin, so vendors try hard to get businesses onto their PaaS services. Cost-conscious users can instead build their own middleware platform on public cloud IaaS, and the overall cost will be much lower. Since everyone builds on the standard K8s base, this is quite feasible.

Q: What plans does Qingzhou have in the cloudification of middleware in the future?

A: In terms of technology, we have kept up with cloud native; our multi-cluster, multi-active, and related capabilities are still under construction. Although we have accumulated a lot at the technical level, further scenario validation at the product level is still needed. The industry as a whole is in the same position and continues to evolve.

In addition, regarding the middleware mesh that is much discussed now: although there is no concrete production scenario demanding it yet, the vision it describes is indeed attractive, further improving development and operations efficiency, and we will keep watching it.

We also hope to deliver more cloud-native middleware supporting microservice architectures. With an open-source, open technology stack, we hope to achieve localized replacement of core middleware. Beyond NetEase's own Internet scenarios, we hope to export the technology to industries such as finance, manufacturing, and energy.


This article is reproduced from InfoQ: How should middleware "evolve" in the cloud-native era?


NetEase Shufan

NetEase Shufan, under NetEase Digital Intelligence, is a full-link big data productivity platform focusing on full-link data development, governance, and analysis. It builds stable, controllable, and innovative data productivity platforms for enterprises, serving business scenarios such as viewing, managing, and using data, activating data assets and releasing data value.