
Text | Ma Zhenjun (alias: Gujin)

Many years of experience in the infrastructure field

Currently on the middleware team at Ant Group

Works on the development of MOSN, Layotto, and other projects

Proofreading | Zhuo Yu, Qi Tian

This article is 9,053 words, an 18-minute read

|Foreword|

In the past few years, Ant Group's infrastructure has undergone a large-scale Service Mesh transformation that drew wide attention and became a benchmark for service meshing in the industry. By taking ownership of the data plane, the infrastructure team gained business agility; at the same time, separating the infrastructure SDK from applications made both the applications and the infrastructure easier to maintain.

However, Service Mesh is no silver bullet, and new problems have emerged after its large-scale rollout.

Around the same time, Dapr, a Microsoft-led project, was born, bringing the concept of the distributed application runtime into public view. We have also tried to apply this idea to the problems left over after meshization.

This article reviews Ant Group's entire evolution from a microservice architecture to Service Mesh and on to a distributed application runtime, discusses the problems encountered and lessons learned along the way in production, and tries to explore possible directions for the cloud-native runtime over the next five years.

PART. 1 From Service Mesh to Application Runtime

In 2018, when Service Mesh was just beginning to gain popularity, Ant Group invested heavily in this direction, and more than three years have passed since. Service Mesh has now been deployed at scale inside the company, supporting the daily operation of hundreds of thousands of containers in the production environment.

In the second half of 2019, the Dapr project was officially open sourced and grew steadily in popularity. The concept of the application runtime began to attract attention, and Ant Group set out on the evolution from Service Mesh to application runtime.

A. Gains and remaining issues from Ant's Service Mesh practice

img

Under the traditional microservice architecture, the infrastructure team generally provides applications with an SDK that encapsulates the various service governance capabilities. Although this approach keeps applications running smoothly, its shortcomings are obvious: every time the infrastructure team iterates, a new feature requires the business side to upgrade before it can be used, and when a bugfix release ships, the business side often has to be pushed to upgrade. Every member of the infrastructure team knows this pain all too well.

Because upgrades are so difficult, the SDK versions used by applications diverge widely, and the production environment runs many versions at the same time. New features therefore have to account for all kinds of compatibility, and over time the code becomes very hard to maintain; some "ancestral" logic nobody even dares to change or delete.

At the same time, this "heavy" SDK development model makes it hard for heterogeneous languages to match the governance capabilities of the primary language, so the various capabilities that ensure high availability cannot be applied to applications written in those languages.

Later, the concept of Service Mesh was proposed. It aims to decouple service governance from business logic and let the two interact through inter-process communication. Under this architecture, the various service governance capabilities are stripped out of the application and run in an independent process, allowing the business team and the infrastructure team to iterate independently and greatly improving efficiency.

At the same time, with functionality moved out, the SDK becomes "lighter", which lowers the barrier to entry for heterogeneous languages and gives applications developed in them the chance to enjoy the same governance capabilities as the primary language.

img

After seeing the huge potential of the Service Mesh concept, Ant Group quickly invested heavily in this direction. As shown in the figure above, we first used the Go language to build our own data plane, MOSN, benchmarked against Envoy. We then sank the various governance capabilities of RPC into MOSN, so that the RPC SDK became "light", while the other infrastructure SDKs remained unchanged.

After completing the mesh transformation of the RPC capability, we rolled it out quickly, and it has now reached a production scale of thousands of applications and hundreds of thousands of containers. Meanwhile, site-wide upgrades can happen as often as one to two times per month, a qualitative leap from the one to two times per year typical under the traditional microservice architecture.

B. Ant's initial exploration of pan-meshization

img

After the RPC capability completed its mesh transformation, verifying the feasibility of this architecture and demonstrating the substantial improvement in iteration efficiency that meshing brings, we formally set out on the path of pan-meshizing the entire infrastructure.

As shown in the figure above, under the general trend of pan-meshization, common infrastructure capabilities beyond RPC, such as caching, messaging, and configuration, were quickly stripped out of applications and sunk into MOSN. This architecture greatly improved the iteration efficiency of the entire infrastructure team.

img

Just as there is no silver bullet in software engineering, as the scale of pan-meshization gradually expanded, we became aware of its remaining problems, shown in the figure above.

Under this architecture, although a layer of network proxy sits between the application and the infrastructure, the handling of infrastructure protocols is still kept in the SDK, so the application is essentially still developed against one specific piece of infrastructure. For example, to use Redis as the cache implementation, the application must import the Redis SDK; to switch to another cache implementation such as Memcache later, the application itself must be modified.

Beyond replacing the SDK, such a switch may even require adjusting the calling APIs, so this architecture simply cannot meet the company's current need to deploy the same application across multiple platforms.

img

A related observation: after the pan-mesh transformation, the low development cost of the "light" SDK gives all kinds of heterogeneous languages the chance to plug into the entire infrastructure system and enjoy the dividends of years of infrastructure investment.

However, because the SDK still retains the logic for communication, serialization, and other protocol handling, the development cost remains non-negligible as the set of onboarded languages grows. In other words, compared with the traditional microservice architecture, the "light" SDK brought by pan-meshization has already lowered the barrier for heterogeneous languages to access the infrastructure; but as the languages grow more diverse and the middleware capabilities they depend on grow richer, we need to try to lower this barrier even further.

Abstracting one level above these two problems, both can be attributed to the boundary between the application and the infrastructure not being clear enough. Put differently, the application always embeds some processing logic specific to a particular infrastructure implementation, coupling the two together.

Therefore, how to define the boundary between the application and the infrastructure so that the two can be completely decoupled is a problem we must think through and solve right now.

PART. 2 Redefining the Boundary of the Infrastructure

A. How to view Dapr

The Dapr project, led by Microsoft, was officially open sourced in the second half of 2019. As an implementation of the distributed application runtime, it stepped onto the stage and drew widespread attention, showing us one way to define the boundary between applications and infrastructure.

img

The picture above is the architecture diagram officially provided by Dapr. Like the Service Mesh architecture, Dapr uses the sidecar model and is deployed between the application and the infrastructure. The difference is that Dapr provides the application with a set of clearly-defined, capability-oriented APIs over standard protocols such as HTTP and gRPC, so that the application no longer needs to care about the implementation details of the infrastructure and only needs to focus on which capabilities its business depends on.

At present, Dapr provides a fairly rich set of APIs, including common infrastructure capabilities such as state management, publish/subscribe, and service invocation, which can largely cover the needs of daily business development. Each capability is backed by multiple concrete infrastructure implementations; developers can switch between them as needed, and the switch is completely transparent to the application.

img

Beyond capabilities, Dapr also officially summarizes the similarities and differences between Dapr and Service Mesh, as shown in the figure above.

Although the two overlap in places, they are essentially different. Service Mesh emphasizes a transparent network proxy and does not care about the data itself, whereas Dapr emphasizes the provision of capabilities and genuinely thinks, from the application's perspective, about how to reduce application development cost.

Dapr has obvious strengths of its own, but the rich network governance capabilities provided by Service Mesh are also key to keeping applications stable in production.

At the same time, Dapr's interaction with the infrastructure still has to go over the network. So is there a solution that combines the two swords of Service Mesh and application runtime, reducing application development cost while retaining rich network governance capabilities?

B. Layotto: Service Mesh & Application Runtime, Two Swords Combined

img

As an application runtime implementation besides Dapr, Layotto aims to combine the advantages of both the application runtime and Service Mesh. Layotto is therefore built on top of MOSN: in the division of labor, MOSN handles the network, while Layotto is responsible for providing the various middleware capabilities to the application.

In addition, drawing on Ant Group's internal production and operations experience, Layotto also abstracts a set of PaaS-oriented APIs whose main purpose is to expose the running state of the application, and of Layotto itself, to the PaaS platform, so that SREs can quickly understand how applications are running, reducing the cost of day-to-day operations.

C. API standardization: a cross-platform deployment tool

img

For the APIs used to interact with applications, Layotto hopes to extend and adapt Dapr's APIs based on real production scenarios, and at the same time to work with Alibaba and the Dapr community to define a set of standard APIs that are general-purpose and cover a wide range of scenarios.

Once standardization is complete, applications developed against this set of APIs will not need to worry about adapting to differences between platforms; they will even be able to switch seamlessly between Layotto and Dapr, completely removing commercial users' concerns about vendor lock-in.

D. Why not solve everything in one Sidecar?

img

The biggest inspiration the Dapr project gives us is that it defines the boundary between the application and the infrastructure, but what applications need goes beyond that. Dapr offers a good idea and a good start, yet it cannot fully cover what we want. We hope to completely define the boundary between the application and all the resources it depends on, covering system resources, infrastructure, resource limits, and more. When Layotto becomes the application's "real" runtime, the application will not need to pay attention to any resource other than its business logic.

Judging from current sidecar-style implementations, whether Dapr, MOSN, or Envoy, they solve the problem of the application accessing the infrastructure. System calls, resource limits, and other such operations are still performed by the application itself; they do not pass through any intermediary, and without being taken over they are hard to govern in a unified way, just as network traffic without a unified entry and exit point is naturally hard to govern. Moreover, if the system resources an application can access cannot be finely controlled, security risks will always remain.

E. A unified boundary: Layotto's ambition

img

Although Layotto in its initial phase is similar to Dapr, existing as an application runtime, its larger goal is to try to define the boundary between the application and all the resources it depends on, which we collectively call the three boundaries of security, service, and resources. In the future, we hope to evolve into the "real" runtime of applications.

The direct benefit of a clearly defined boundary is that it can completely free business developers and let them focus on the business itself.

Today, a business developer who wants to write code must be familiar not only with their own business logic but also with the implementation details of various infrastructure such as caching, messaging, and configuration, which is very costly. Once the boundary is clearly defined, the barrier to entry for business developers will drop, and with it the overall development cost.

Although the goal is clear, the first question we face is in what form Layotto should exist in order to achieve it.

Driven by Service Mesh, people have gradually accepted the benefits of applications interacting with the infrastructure through a sidecar. But continuing to use a sidecar to mediate interaction with the operating system and to cap the maximum resources an application may use may not be so simple. We therefore urgently need a brand-new deployment model to achieve this goal, and after repeated discussion, the development model of function compute entered our field of vision.

PART. 3 The next five years: Is the function the next stop?

A. Layotto and the future of functionalization at Ant

I believe function compute is not unfamiliar to you, but besides running as an independent process, is there a better way to run a function? To answer this question, let's first review the development of virtualization technology.

img

As shown in the figure above, in the earlier virtual machine era, people ran multiple operating systems independently on one set of hardware, a model that can be abstracted as virtualizing the hardware. Today's red-hot container technology instead runs multiple containers on one operating system through mechanisms such as namespaces and cgroups; compared with virtual machines, this can be seen as virtualizing the operating system. But because containers share the kernel, the technology has been criticized on security grounds, which is part of the background for the birth of secure containers such as Kata.

In addition, the community is also developing Unikernel technology. One of its main ideas is that an application can have the kernel to itself, but this kernel is not a complete operating system: it contains only the parts the application needs to run. After development is complete, the application is compiled together with the kernel into an image and runs directly on the hardware.

The reason multiple virtualization solutions exist is that virtualizing different resources yields different benefits. For example, compared with virtual machines, containers start faster and achieve higher resource utilization.

Comparing the three technologies above, we can conclude that engineers have been trying to strike a balance among isolation, security, and lightness, hoping to combine their respective advantages as much as possible. The function model we expect should likewise combine the advantages of all three.

B. Leaping forward: can the function become a first-class citizen of the cloud-native era?

img

The function model we ultimately expect is shown in the figure above:

Looking upward:

1. The function itself can be developed in any language, to better meet increasingly diverse business demands.

2. Multiple functions run on one runtime base, all within the same process. Under this model, isolation between functions is bound to be a key consideration.

Looking downward:

1. A function cannot access lower-level resources directly while running; all requests, including system calls and infrastructure access, must be initiated through the base.

2. At runtime, the base can finely control the resources a function may use during execution, ensuring it uses only what it needs.

To achieve these goals, we need to find a technology to serve as the function carrier, one that gives different functions within a single process good isolation, portability, and security. For exactly this reason, the increasingly popular WebAssembly technology became our key candidate.

C. WebAssembly (wasm) on the cusp

img

WebAssembly is abbreviated as Wasm.

Although its initial positioning was to let server-side programming languages run in the browser, solving JavaScript's performance problems, the technology's many excellent properties made people eager to run it outside the browser as well. Just as Node.js lets JavaScript run on the server, the WebAssembly community provides a variety of runtimes that support running *.wasm files on the server.

As a much-hyped technology darling, WebAssembly was born with advantages that other technologies cannot replace:

1. Language-independent, cross-platform

  • As an instruction set, WebAssembly can in theory be compiled from any language, and running on different CPU architectures was a basic design goal from the very beginning.

2. Security and small size

  • The system calls a Wasm module may make and the disk files it may access at runtime require explicit authorization from the host, which brings good security.
  • The compiled .wasm file itself is small, which brings faster transmission and loading.

3. Sandboxed execution environment

  • Multiple Wasm modules each run in their own sandbox, well isolated from one another and unable to affect each other.

Although this technology has huge development potential, for now there are still many shortcomings standing in the way of actual adoption in back-end production environments:

1. Multi-language support

  • WebAssembly aims to support compilation from many languages, but at present mainstream languages support it to very different degrees. Compiled languages such as C/C++/Rust are supported well, but unfortunately their cost of entry for writing ordinary business logic is a real problem. As for Java and Go, the mainstream languages in business scenarios, their WebAssembly support is still very limited, not enough to carry this technology into a production environment.

2. Ecological construction

  • In a real production environment, diagnosing problems online is something we face every day. Java has its own rich set of commands plus third-party tools such as Arthas, and Go's pprof is an excellent performance analysis tool, but how do we troubleshoot a running Wasm module? Even gracefully printing an error stack, or debugging, is still at an early stage.

3. Uneven capabilities across runtimes

  • As mentioned under "a unified boundary" above, a running function needs to make system calls safely and have the maximum resources it can use limited. At present, the mainstream Wasm runtimes support these capabilities to varying degrees, each covering only part of them. These problems must be solved before real production scenarios can land.

Although WebAssembly still has many shortcomings even at the height of its popularity, we believe that, as the community develops, the problems above will be solved step by step. What matters is that we believe in this technology's prospects, and we will take part in promoting and building the entire WebAssembly community.

D. Layotto and the tomorrow of functionalized applications at Ant

If functions are to become another basic R&D model in the future, with the same status as today's microservice architecture, we need to think about the ecosystem of the entire function model. That construction in fact revolves around ultimate iteration efficiency, including but not limited to the following points:

1. Basic framework

  • Thanks to WebAssembly, the function itself can be developed in a variety of mainstream languages, but to manage functions well, business developers still need to follow a certain template during development: for example, a start() method is executed when the function is loaded, where initialization work can be done, and a destroy() method is executed when it is unloaded, where cleanup work can be done.

2. Development and debugging

  • Today, most development and debugging is still done in a local IDE out of habit, but local development has many inconveniences, such as needing all kinds of configuration, or having to pull colleagues in via screen sharing when collaborating. Cloud IDEs are becoming more and more mature, and we believe that as they develop, these problems will be solved much better.

3. Package deployment

  • Today, mainstream applications are packaged for deployment into war or jar files, or compiled directly into executables for the target operating system. In the function system, applications are compiled into *.wasm files, which by default can run on any operating system.

4. Life cycle management + resource scheduling

  • K8s has become the de facto standard for container management and scheduling, and how to integrate function scheduling into the K8s ecosystem is also a major focus of our exploration.

img

For the running model, as shown in the figure above, in addition to the sidecar mode, Layotto uses WebAssembly to let functions in Wasm form run directly on top of Layotto. In this mode, the interaction between a function and Layotto is a local call. We call this layer of API the Runtime ABI; it is evolved from the Runtime API. For example, for a function to look up a key in the cache, it only needs to call the local method proxy_get_state.

As for scheduling, K8s has become the de facto standard, so what needs to be solved is how functions in Wasm form can be integrated into the K8s ecosystem. Two issues need attention here:

1. What is the relationship between Wasm and images?

K8s creates Pods from images, while the product of compiling a function is a .wasm file, so we need a proper way to bring the two together.

2. How do we let K8s manage Wasm deployments?

K8s's scheduling unit is the Pod. How to gracefully bridge from scheduling Pods to scheduling Wasm, letting multiple Wasm functions run in one process, is also a tricky problem.

img

After investigating some exploratory schemes in the community, we arrived at our own implementation. Overall, K8s allows developers to extend the container runtime based on Containerd and the OCI specification, and well-known secure container implementations such as Kata and gVisor are built on it. In Layotto, we likewise implement our container runtime solution by extending Containerd. There are two key points in the overall plan:

1. Image build phase

We put the compiled *.wasm file into an image and push it to an image registry for later use in scheduling.
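To make the build phase concrete, a minimal image could contain nothing but the compiled artifact. The file names and base image below are illustrative, not the exact layout used by the Layotto quickstart:

```dockerfile
# Illustrative only: an image whose sole content is the Wasm artifact.
FROM scratch
COPY function.wasm /function.wasm
```

The image is then built and pushed as usual (e.g. `docker build -t <registry>/demo/function:v1 .` followed by `docker push`), after which K8s can pull it like any other image.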

2. Scheduling and deployment phase

We implemented our own plug-in called containerd-shim-layotto-v2. After K8s receives a request to schedule a Pod, the real processing logic is handed to the Kubelet, then forwarded via Containerd to our custom plug-in, which extracts the *.wasm file from the target image for Layotto to load and run. Layotto currently integrates wasmer as its Wasm runtime.

img

The end-to-end experience of the whole scheduling scheme is shown in the figure above. For a developed function, first compile it into a *.wasm file, then build that into an image. At deployment time, you only need to specify runtimeClassName as layotto in the YAML file. Subsequent operations such as creating the container, checking its status, and deleting it all keep the K8s semantics, so there is no extra learning cost for SREs.
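Put into a concrete manifest, only `runtimeClassName` distinguishes this from an ordinary Pod. The sketch assumes a RuntimeClass named `layotto` has been registered for the containerd-shim-layotto-v2 handler, and the image name is a placeholder:

```yaml
# Illustrative Pod spec: everything except runtimeClassName is standard K8s.
apiVersion: v1
kind: Pod
metadata:
  name: wasm-function-demo
spec:
  runtimeClassName: layotto   # routes this Pod to the Layotto containerd shim
  containers:
    - name: function
      image: registry.example.com/demo/function:v1   # image built from the *.wasm file
```

`kubectl apply`, `kubectl get pod`, and `kubectl delete` then work on it exactly as on any other Pod.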

At present, the entire process has been open sourced in the Layotto community; interested readers can follow our QuickStart [1] document to try it out.

img

Finally, let's imagine what a future R&D model might look like. First, in the development stage, developers can freely choose a language suited to the business scenario to write code.

As for development tools, besides local IDEs, more and more people may choose a Cloud IDE, which will greatly improve developer collaboration efficiency. Then comes the deployment stage: some lightweight business scenarios may be deployed on the function model, while traditional businesses may keep the BaaS model; where security requirements are higher, a feasible option is to deploy into a secure container such as Kata.

As Unikernel technology matures, more and more people may also try that direction, for example putting Layotto into the kernel and compiling and deploying it together with the application.

More importantly, no matter which deployment model is used in the future, thanks to Layotto's inherent portability, operators will be able to deploy applications on any platform at will, and the switch will be completely transparent to developers!

Finally, in the stage of serving users, as function services start faster and faster, it becomes possible to load and run a function only after a request arrives, and to strictly and precisely control the resources it may use, achieving truly on-demand billing.

PART. 4 Open source and win-win

The future R&D model described above in fact depends on the development of many technical fields, and the maturity of those technologies depends on the whole technical community. This is also an important part of the background to Layotto's choice of open source. We have therefore been communicating with multiple communities, hoping to jointly advance the technologies the future R&D model relies on.

A. Dapr community: API standardization

img

Beyond defining the service boundary between the application and the infrastructure, Dapr also has a set of Runtime APIs that are widely accepted. The picture above shows the various correction suggestions we made for these APIs during our internal rollout; we hope to work together with the Dapr community and Alibaba to standardize this set of APIs.

B. WebAssembly community: ecological construction

For the WebAssembly community, we will continue to follow the development of the technology's entire ecosystem, roughly in the following categories:

1. Multi-language support

As mentioned earlier, the languages that currently support WebAssembly well are costly for developing business logic, so we hope that as the community develops, common business languages such as Java, Go, and JS will gain better support.

2. Wasm ABI

This mainly defines the API used for interaction between a Wasm function and Layotto. There have already been some attempts in the community, and we hope to add the Runtime ABI definition on top of them, so that functions can call the infrastructure more conveniently.

3. Ecosystem construction

We hope WebAssembly will gain better troubleshooting and problem-locating capabilities, finer-grained control over usable resources, and more practical advanced features.

C. Layotto Community: Exploration of Microservices & Functions

img

The Layotto community will focus on exploring future R&D models, mainly in the following two categories:

1. Sidecar model

  • In this model, the application and Layotto interact through the Runtime API over the gRPC protocol; this is also the easiest model to adopt at the moment.

2. FaaS model

  • In this model, Layotto uses WebAssembly to let multiple functions run in the same process, and on that basis tries to define the three boundaries of security, service, and resources on which functions run.

|Postscript|

Based on the current development of microservice architecture theory and the valuable experience accumulated in solving all kinds of problems in real production, we have tried to think about how the cloud-native runtime will develop over the next five years.

But this is only one direction we are currently exploring. We believe there are other possibilities too, but one thing is very clear: the cloud-native runtime of the next five years will not arrive by waiting; it will come from the hard exploration of many engineers.

"refer to"

【1】QuickStart
