About the Author:
Ma Zhenjun (alias: Gujin) has worked in the infrastructure field for many years and has deep hands-on experience with Service Mesh. He is currently responsible for the development of MOSN, Layotto and other projects in the middleware team of Ant Group.

Layotto official GitHub address: https://github.com/mosn/layotto

Click the link to watch the live video: https://www.bilibili.com/video/BV1hq4y1L7FY/

Service Mesh has become very popular in the microservices field, and more and more companies are starting to adopt it internally. Ant has invested heavily in this direction since Service Mesh first emerged. By now, the internal Mesh solution covers thousands of applications and hundreds of thousands of containers, and has been battle-tested through many major promotion events. The service decoupling and smooth upgrades brought by Service Mesh have greatly improved the iteration efficiency of middleware.

After this large-scale rollout, we ran into new problems. This article reviews and summarizes the adoption of Service Mesh within Ant, and shares our solutions to the new problems encountered afterwards.

1. Service Mesh review and summary

A. The original intention of Service Mesh

Under a microservice architecture, the infrastructure team generally provides applications with an SDK that encapsulates various service governance capabilities. Although this approach keeps applications running normally, its shortcomings are obvious: every time the infrastructure team iterates a new feature, the business side has to participate in an upgrade before it can be used, and bugfix releases in particular often require forcing the business side to upgrade. Every member of an infrastructure team knows deeply how painful this is.

Because upgrades are so difficult, the SDK versions used by applications diverge widely, and the production environment runs many versions of the SDK at the same time. As a result, every new feature must account for all kinds of compatibility, like marching forward in shackles. With continuous iteration, code maintenance becomes very difficult, and a careless change to some piece of "ancestral" legacy logic can easily cause an incident.

At the same time, this "heavy" SDK development model makes governance of heterogeneous languages very weak. The cost of providing a full-featured, continuously iterated SDK for every programming language can well be imagined.

In 2018, Service Mesh gained popularity in China. This architectural concept aims to decouple service governance capabilities from the business, letting the two interact through inter-process communication. Under this model, service governance capabilities run in an independent process separate from the application, so their iteration and upgrades no longer involve the business process. This allows governance capabilities to iterate quickly, and because the upgrade cost is low, every instance can be kept on the latest version, solving the problem of historical baggage. Meanwhile, the "thin" SDK directly lowers the governance threshold for heterogeneous languages: there is no longer the headache of developing the same service governance SDK for each language.

B. The current status of Service Mesh landing

Ant quickly recognized the value of Service Mesh and invested heavily in this direction. The team developed MOSN in Go as a data plane benchmarked against the excellent Envoy, taking full responsibility for service routing, load balancing, circuit breaking, rate limiting and other capabilities, which greatly accelerated the internal rollout of Service Mesh.

Today MOSN covers thousands of applications and hundreds of thousands of containers inside Ant, and newly created applications connect to MOSN by default, forming a closed loop. MOSN has also delivered a satisfying answer on the resource usage and performance overhead everyone cares about most:

  1. RT overhead is less than 0.2 ms
  2. CPU usage increases by 0% to 2%
  3. Memory consumption grows by less than 15 MB

As Service Mesh lowers the service governance threshold for heterogeneous languages, heterogeneous technology stacks such as NodeJS and C++ have also been connecting to MOSN one after another.

After seeing the huge benefits of moving RPC capabilities into the Mesh, Ant also moved MQ, Cache, Config and other middleware capabilities down into MOSN, improving the overall iteration efficiency of middleware products.

C. New challenges

  1. Strong binding between application and infrastructure

A modern distributed application often relies on various distributed capabilities such as RPC, Cache, MQ, and Config to implement its business logic.

After seeing the benefits of sinking RPC into the Mesh, various other capabilities were quickly sunk as well. In the early stage, each team developed in the way it was most familiar with, so there was no unified planning and management. As shown in the figure above, the application depends on SDKs for all kinds of infrastructure, and each SDK interacts with MOSN in its own particular way, often using the proprietary protocol of the underlying infrastructure product. As a direct result, although complex middleware capabilities have been sunk, the application is still essentially bound to the infrastructure. For example, migrating the cache from Redis to Memcache still requires the business side to upgrade its SDK. This problem is even more prominent under the trend of moving applications to the cloud: since the application depends on all this infrastructure, deploying the application to the cloud means moving the entire infrastructure along with it, and the cost can be imagined.
Therefore, how to untie the application from the infrastructure, make it portable, and allow it to be deployed across platforms without awareness is the first problem we face.

  2. High cost of heterogeneous language access

Facts have proved that Service Mesh does lower the access threshold for heterogeneous languages. But as more and more basic capabilities sink into MOSN, we gradually realized that for an application to interact with MOSN, every language's SDK still has to implement communication and serialization protocols, and providing the same features across many heterogeneous languages makes the maintenance burden grow exponentially.

Service Mesh has made heavy SDKs history, but in a world where programming languages bloom in all varieties and applications depend heavily on infrastructure, we found the existing SDKs are still not thin enough and the access threshold for heterogeneous languages is still not low enough. How to further lower that threshold is the second problem we face.

2. Overview of Multi Runtime Theory

A. What is Runtime?

In early 2020, Bilgin Ibryam published an article titled "Multi-Runtime Microservices Architecture", discussing the next stage of microservice architecture.

As shown in the figure above, the author abstracts the needs of distributed services into four categories:

  1. Lifecycle
    Mainly refers to application compilation, packaging, deployment and so on, which under the cloud-native trend are basically taken over by Docker and Kubernetes.
  2. Networking
    A reliable network is the basic guarantee of communication between microservices. Service Mesh is an attempt in this area, and the stability and practicality of popular data planes such as MOSN and Envoy have been fully verified.
  3. State
    Operations a distributed system requires such as service orchestration, workflow, distributed singletons, scheduling, idempotency, stateful error recovery, and caching can all be classified as underlying state management.
  4. Binding
    A distributed system not only needs to communicate with other internal systems but also to integrate various external systems, so it depends strongly on capabilities such as protocol conversion, multiple interaction models, and error recovery.

After clarifying the requirements and drawing on the idea of Service Mesh, the author summarized the evolution of the distributed service architecture as follows:

The first stage is to separate the various infrastructure capabilities from the application, decoupling them into independent sidecars that run alongside the application.

The second stage is to unify and abstract the capabilities provided by the various sidecars into several Runtimes. Development thus evolves from integrating basic components to declaring the distributed capabilities required, with the underlying implementation details completely shielded. And because it is capability-oriented, the application no longer needs to depend on the SDKs of various infrastructure; it only needs to call the APIs of the capabilities it requires.

The author's thinking coincides with the problems we hoped to solve, so we decided to use the Runtime concept to address the new problems encountered after adopting Service Mesh.

B. Service Mesh vs Runtime

To give everyone a clearer understanding of Runtime, the figure above compares Service Mesh and Runtime in terms of positioning, interaction, communication protocol, and richness of capabilities. Compared with Service Mesh, Runtime provides semantically clear, capability-rich APIs, which makes the interaction between the application and the sidecar simpler and more direct.

3. MOSN sub-project Layotto

A. Dapr research

Dapr is a well-known Runtime implementation in the community with relatively high activity, so we investigated it first and found that Dapr has the following advantages:

  1. It provides a variety of distributed capabilities with clearly defined APIs, which can basically satisfy common usage scenarios.
  2. For each capability it offers multiple implementation components, basically covering commonly used middleware products, and users can freely choose according to their needs.

When considering how to adopt Dapr within the company, we proposed the two solutions shown in the figure above:

  1. Replacement: abandon the current MOSN and replace it with Dapr. There are two problems with this scheme:

a. Although Dapr provides many distributed capabilities, it currently lacks the rich service governance capabilities included in Service Mesh.

b. MOSN has been deployed at large scale within the company and has been through several major promotion events; directly replacing it with Dapr would still have to be verified.

  2. Coexistence: add a Dapr container and deploy it alongside MOSN in a two-sidecar model. This scheme also has two problems:

a. Introducing a new sidecar requires supporting facilities for upgrade, monitoring, injection and so on, and operations costs soar.

b. Maintaining one more container adds another layer of failure risk, which reduces the availability of the current system.

Similarly, anyone currently using Envoy as the data plane faces the same problems.
Therefore, we hoped to combine Runtime with Service Mesh and deploy them as one complete sidecar, reusing the existing Mesh capabilities as much as possible while guaranteeing stability and keeping operations costs unchanged. In addition, we hope that this Runtime capability can in the future be combined not only with MOSN but also with Envoy, solving problems in more scenarios. Layotto was born against this background.

B. Layotto architecture

As shown in the figure above, Layotto is built on top of MOSN. It connects to various infrastructure underneath and provides upper-layer applications with unified, standard APIs for various distributed capabilities. For applications accessing Layotto, developers no longer need to care about the implementation differences of the underlying components; they only need to know which capabilities the application requires and call the corresponding APIs, thus completely decoupling from the underlying infrastructure.

For an application, the interaction is divided into two parts: one is to call Layotto's standard APIs as a gRPC client; the other is to implement Layotto's callbacks as a gRPC server. Thanks to gRPC's excellent cross-language support, applications no longer need to care about details such as communication and serialization, which further lowers the threshold for heterogeneous technology stacks.

Besides being application-facing, Layotto also provides a unified interface to operations platforms. Through these interfaces, the running state of the application and the sidecar can be fed back to the platform, so that SRE engineers can stay informed of the application's state and take different actions for different states. Since this function needs to integrate with existing platforms such as k8s, we expose it over HTTP.

Beyond the design of Layotto itself, the project involves two standardization efforts. First, developing a set of APIs that are semantically clear and applicable to a wide range of scenarios is not easy; for this we cooperate with Alibaba and the Dapr community, hoping to advance the standardization of the Runtime API. Second, for the capability components the Dapr community has already implemented, our principle is reuse first, develop second, trying not to waste energy reinventing existing wheels.

Finally, although Layotto is currently built on MOSN, we hope it will also run on Envoy in the future, so that as long as an application is connected to a Service Mesh, Runtime capabilities can be added to it regardless of whether the data plane is MOSN or Envoy.

C. Portability of Layotto

As shown in the figure above, once the Runtime API is standardized, any application accessing Layotto is naturally portable. It can be deployed on private clouds and various public clouds without modification, and because it uses the standard APIs, it can also switch freely between Layotto and Dapr without any code changes.

D. Name meaning

As the architecture diagram above shows, Layotto aims to shield the implementation details of the infrastructure and provide distributed capabilities to upper-layer applications. This is like inserting a layer of abstraction between the application and the infrastructure, so we borrowed the idea of the OSI seven-layer network model and hope Layotto can serve applications as the eighth layer. "Otto" means eight in Italian, and "Layer otto" means the eighth layer, hence Layotto. The project codename L8 also means the eighth layer, and was the inspiration for the project's logo.

After this overview of the project, the following sections describe the implementation details of its four main features.

E. Configuration primitives

The first feature is configuration, which is commonly used in distributed systems: applications generally use a configuration center to toggle features or dynamically adjust their runtime behavior. The configuration module in Layotto consists of two parts: how to define the API for the configuration capability, and the concrete implementation. Let's look at them in turn.

Defining a configuration API that satisfies most real production requirements is no easy task, and Dapr currently lacks this capability, so we worked with Alibaba and the Dapr community, going through intense discussions to define a reasonable configuration API.

The discussion has not yet been finalized, so Layotto's implementation is based on the first version of the draft we submitted to the community. The following is a brief description of that draft.

We first defined the basic elements required for general configuration:

  1. appId: which application the configuration belongs to
  2. key: the key of the configuration
  3. content: the value of the configuration
  4. group: the group the configuration belongs to; if one appId has too many configurations, they can be grouped for easier maintenance

In addition, we have added two advanced features to adapt to more complex configuration usage scenarios:

  1. label: used to tag a configuration, for example with the environment it belongs to; when querying, label + key is used together.
  2. tags: extra information attached by the user, such as description, creator, and last modification time, to ease configuration management and auditing.
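Putting these elements together, a configuration item under this draft could be modeled roughly as follows. This is an illustrative Go sketch; the field names mirror the draft above, but the actual message layout in the final API may differ:

```go
package main

import "fmt"

// ConfigurationItem models one configuration entry from the draft API.
// The fields follow the basic and advanced elements described above;
// the real protobuf definition may differ.
type ConfigurationItem struct {
	AppID   string            // which application the configuration belongs to
	Key     string            // the configuration key
	Content string            // the configuration value
	Group   string            // optional grouping for easier maintenance
	Label   string            // e.g. the environment; queried together with Key
	Tags    map[string]string // extra metadata: description, creator, mtime...
}

func main() {
	item := ConfigurationItem{
		AppID:   "demo-app",
		Key:     "timeout",
		Content: "3000",
		Group:   "rpc",
		Label:   "production",
		Tags:    map[string]string{"creator": "gujin"},
	}
	fmt.Printf("%s/%s[%s] = %s\n", item.AppID, item.Group, item.Label, item.Content)
}
```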

On top of the API defined above, the current implementation supports five operations: query, subscribe, delete, create, and modify. Pushes after a subscribed configuration changes use gRPC's stream feature. As the underlying component implementing these capabilities we chose Apollo, which is popular in China; other implementations will be added later as demand arises.
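To make the five operations concrete, here is a minimal in-memory sketch in Go. It is purely illustrative: in Layotto these calls travel over gRPC and the backing component is Apollo, while here query/create/modify collapse into Get/Set, and a buffered channel plays the role of the gRPC stream used for subscription pushes:

```go
package main

import (
	"fmt"
	"sync"
)

// ConfigStore is an illustrative in-memory stand-in for a configuration
// component such as Apollo. Get/Set/Delete cover query, create, modify,
// and delete; Subscribe plays the role of the gRPC stream push.
type ConfigStore struct {
	mu   sync.Mutex
	data map[string]string
	subs map[string][]chan string
}

func NewConfigStore() *ConfigStore {
	return &ConfigStore{data: map[string]string{}, subs: map[string][]chan string{}}
}

// Set creates or modifies a configuration and pushes the change to subscribers.
func (s *ConfigStore) Set(key, value string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.data[key] = value
	for _, ch := range s.subs[key] {
		ch <- value
	}
}

// Get queries a configuration value.
func (s *ConfigStore) Get(key string) (string, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	v, ok := s.data[key]
	return v, ok
}

// Delete removes a configuration.
func (s *ConfigStore) Delete(key string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.data, key)
}

// Subscribe returns a channel on which changes to key are pushed.
func (s *ConfigStore) Subscribe(key string) <-chan string {
	s.mu.Lock()
	defer s.mu.Unlock()
	ch := make(chan string, 1)
	s.subs[key] = append(s.subs[key], ch)
	return ch
}

func main() {
	store := NewConfigStore()
	updates := store.Subscribe("timeout")
	store.Set("timeout", "3000")
	fmt.Println(<-updates) // prints 3000
}
```

A real implementation would also carry appId, group, and label in the lookup key and propagate deletions to subscribers; they are omitted here for brevity.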

F. Pub/Sub primitives

For Pub/Sub support, we investigated the current Dapr implementation and found it basically meets our needs, so we directly reused Dapr's API and components, adapting them inside Layotto. This saved us a great deal of repeated labor; we hope to keep cooperating with the Dapr community rather than reinventing wheels.

For Pub, the app calls the PublishEvent interface provided by Layotto. For Sub, the application implements the ListTopicSubscriptions and OnTopicEvent interfaces as a gRPC server: the former tells Layotto which topics the application subscribes to, and the latter receives Layotto's callback events when a topic changes.

Dapr's definition of Pub/Sub basically meets our needs, but there are still gaps in some scenarios. Dapr adopts the CloudEvent standard, so the Pub interface has no return value, which cannot satisfy our production requirement that the server return a corresponding messageID after a message is published. We have submitted this requirement to the Dapr community and are waiting for feedback. Given the community's asynchronous collaboration mechanism, we may first add the returned result ourselves and then discuss a better-compatible scheme with the community.
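The messageID requirement can be illustrated with a small in-memory sketch. This is not Dapr's actual API; it simply shows the shape of a Publish call that returns a server-assigned ID, which is what the current CloudEvent-based interface lacks:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Event is a published message with a server-assigned ID.
type Event struct {
	ID    int64
	Topic string
	Data  string
}

// Broker is an illustrative in-memory pub/sub broker (hypothetical sketch).
type Broker struct {
	nextID int64
	topics map[string]chan Event
}

func NewBroker() *Broker { return &Broker{topics: map[string]chan Event{}} }

func (b *Broker) topic(name string) chan Event {
	if _, ok := b.topics[name]; !ok {
		b.topics[name] = make(chan Event, 16)
	}
	return b.topics[name]
}

// Publish delivers the event and, unlike the current Dapr Pub interface,
// returns the server-assigned message ID to the caller.
func (b *Broker) Publish(topic, data string) int64 {
	id := atomic.AddInt64(&b.nextID, 1)
	b.topic(topic) <- Event{ID: id, Topic: topic, Data: data}
	return id
}

// Subscribe mirrors the OnTopicEvent callback: events arrive on a channel.
func (b *Broker) Subscribe(topic string) <-chan Event {
	return b.topic(topic)
}

func main() {
	b := NewBroker()
	events := b.Subscribe("orders")
	id := b.Publish("orders", "order-created")
	fmt.Println(id, (<-events).Data) // prints: 1 order-created
}
```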

G. RPC primitives

Everyone is familiar with RPC; it may be the most basic requirement of a microservice architecture. For the RPC interface definition, we also referred to the Dapr community's definition and found it fully meets our needs, so we reused it directly. However, Dapr's current RPC implementation is still relatively weak, while MOSN has matured through years of iteration. We therefore boldly combined Runtime with Service Mesh, using MOSN itself as the component implementing our RPC capability: after Layotto receives an RPC request, it hands the actual data transmission over to MOSN. With this solution we can dynamically change routing rules, rate limiting, degradation and other settings through Istio, which amounts to directly reusing the various capabilities of Service Mesh. It also shows that Runtime is not meant to overthrow Service Mesh, but to take one step further on top of it.

In terms of implementation details, to integrate better with MOSN we added a layer called Channel to the RPC implementation, which supports three common RPC protocols by default: dubbo, bolt, and HTTP. If that still does not cover a user's scenario, we provide two kinds of extension points, Before and After filters, which let users make custom extensions such as protocol conversion.
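The Channel plus Before/After filter design can be sketched as a simple chain. The types below are hypothetical and only illustrate the extension point; Layotto's real filter signatures differ:

```go
package main

import "fmt"

// Filter transforms a raw RPC payload. Before filters run prior to handing
// the request to the underlying transport (e.g. protocol conversion into
// bolt/dubbo/http); After filters run on the response.
type Filter func(payload []byte) []byte

// Channel wraps the underlying transport (MOSN, in Layotto's case) with
// user-supplied Before/After filter chains.
type Channel struct {
	before []Filter
	after  []Filter
	invoke func([]byte) []byte // the underlying transport
}

// Use registers a Before and/or After filter; nil entries are skipped.
func (c *Channel) Use(before, after Filter) {
	if before != nil {
		c.before = append(c.before, before)
	}
	if after != nil {
		c.after = append(c.after, after)
	}
}

// Call runs Before filters, invokes the transport, then runs After filters.
func (c *Channel) Call(payload []byte) []byte {
	for _, f := range c.before {
		payload = f(payload)
	}
	resp := c.invoke(payload)
	for _, f := range c.after {
		resp = f(resp)
	}
	return resp
}

func main() {
	ch := &Channel{invoke: func(b []byte) []byte { return append([]byte("echo:"), b...) }}
	// A toy "protocol conversion": wrap the request before sending.
	ch.Use(func(b []byte) []byte { return append([]byte("bolt("), append(b, ')')...) }, nil)
	fmt.Println(string(ch.Call([]byte("req")))) // prints echo:bolt(req)
}
```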

H. Actuator primitives

In a real production environment, besides the distributed capabilities applications need, operations platforms such as PaaS often need to understand the running state of an application. Based on this requirement, we abstracted a set of Actuator interfaces. Dapr does not provide this yet, so we designed it according to internal demand scenarios, aiming to expose information from the application's startup and runtime phases so that PaaS can understand how the application is running.

Layotto divides exposure information into two categories:

  1. Health: this module determines whether the application's current running state is healthy. For example, if a strongly depended-on component fails to initialize, the state must be reported as unhealthy. For the types of health check we referred to k8s and divided them into:

a. Readiness: indicates the application has started and can begin handling requests.

b. Liveness: indicates whether the application is alive; if not, measures such as cutting traffic are needed.

  2. Info: this module exposes some of the application's dependency information, such as the services it depends on and the configurations it subscribes to, for troubleshooting.

The health states exposed by Health fall into three types:

  1. INIT: the application is still starting. If this value is returned during a release, the PaaS platform should keep waiting for startup to finish.
  2. UP: the application has started normally. If this value is returned during a release, the PaaS platform can start directing traffic to it.
  3. DOWN: the application failed to start. If this value is returned during a release, PaaS should stop the release and notify the application owner.
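As a rough illustration of how a PaaS or k8s probe might consume these three states over HTTP, here is a hedged Go sketch. The status names come from the list above, while the endpoint shape and the status-code mapping are assumptions for illustration, not Layotto's actual Actuator wire format:

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Status models the three health states exposed by the Actuator.
type Status string

const (
	StatusInit Status = "INIT" // still starting: PaaS should keep waiting
	StatusUp   Status = "UP"   // started normally: traffic can be admitted
	StatusDown Status = "DOWN" // startup failed: stop the release and alert
)

// HTTPCode maps a health state onto an HTTP status code so that platforms
// such as k8s can consume the endpoint directly (illustrative mapping).
func HTTPCode(s Status) int {
	if s == StatusUp {
		return http.StatusOK
	}
	return http.StatusServiceUnavailable
}

// healthHandler serves the current health state as plain text.
func healthHandler(current func() Status) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		s := current()
		w.WriteHeader(HTTPCode(s))
		fmt.Fprint(w, string(s))
	}
}

func main() {
	h := healthHandler(func() Status { return StatusUp })
	srv := httptest.NewServer(h)
	defer srv.Close()
	resp, _ := http.Get(srv.URL)
	fmt.Println(resp.StatusCode) // prints 200
}
```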

This concludes Layotto's current exploration in the Runtime direction. With semantically well-defined APIs and a standard interaction protocol like gRPC, we solve the two big problems of strong infrastructure binding and the high access cost of heterogeneous languages. As API standardization progresses, applications accessing Layotto will on one hand be deployable on various private and public clouds without awareness, and on the other hand be able to switch freely between Layotto and Dapr, improving R&D efficiency.

At present, the serverless field is also blooming with many approaches and no unified solution, so in addition to the Runtime work above, Layotto has made some attempts in the serverless direction, introduced below.

4. The exploration of WebAssembly

A. Introduction to WebAssembly

WebAssembly, or WASM for short, is a binary instruction format. It originally ran in the browser to solve JavaScript's performance problems, but thanks to its excellent security, isolation, and language independence, people soon began running it outside the browser. With the emergence of the WASI definition, a WASM file can execute anywhere given only a WASM runtime.

Since WebAssembly can run outside the browser, can we use it in the serverless field? Some attempts have been made in this area, but for such a solution to really land, the first question to consider is how to run WebAssembly on top of all kinds of infrastructure.

B. WebAssembly landing principle

Currently MOSN integrates a WASM runtime so that WASM can run on MOSN, satisfying the need for custom extensions to MOSN. Since Layotto is also built on MOSN, we considered combining the two. The implementation scheme is shown in the following figure:

Developers can write application code in languages they like such as Go, C++, or Rust, compile it into a WASM file, and run it on MOSN. When such a WASM-form application needs various distributed capabilities while processing a request, it can call the standard APIs provided by Layotto through local function calls, which directly solves the dependency problem of WASM-form applications.

At present, Layotto provides WASM implementations for Go and Rust. Although they only support demo-level functions, that is enough to show the potential value of this approach.

In addition, the WASM community is still in its early days and many areas need improvement. We have also submitted some PRs to the community to help build it and contribute to the landing of WASM technology.

C. Prospects for the implementation of WebAssembly

Although the use of WASM in Layotto is still experimental, we hope it can eventually become a form of Serverless implementation. As shown in the figure above, applications are developed in various programming languages, uniformly compiled into WASM files, and run on Layotto + MOSN, while their operations and management are handled uniformly by products such as k8s, Docker, and Prometheus.

5. Community planning

Finally, let's take a look at some of the things Layotto has done in the community.

A. Layotto vs Dapr

The figure above compares the existing capabilities of Layotto and Dapr. During Layotto's development we borrowed from Dapr's ideas and always followed the principle of reuse first, develop second, aiming for co-construction. For capabilities to be built in the future, we plan to land them in Layotto first, then propose them to the community and merge them into the standard API. Given the community's asynchronous collaboration mechanism and higher communication cost, in the short term Layotto's API may run ahead of the community's, but in the long term they will converge.

B. API co-construction plan

Regarding how to define a set of standard APIs and how to make Layotto run on Envoy, we have had in-depth discussions in multiple communities and will continue to push these forward.

C. Roadmap

Layotto currently supports four major features: RPC, Config, Pub/Sub, and Actuator. We expect to invest in distributed locks, State, and observability in September. In December we plan to support running Layotto as a plugin, that is, making it run on Envoy, and we hope the WebAssembly exploration will have further output by then.

D. Officially open source

The Layotto project has been introduced in detail above. Most importantly, the project is officially open sourced today as a sub-project of MOSN. We provide detailed documentation and demo examples to help you get started quickly.

Building API standardization is a long-term effort. Standardization means satisfying not just one or two scenarios but as many usage scenarios as possible. We therefore hope more people will join the Layotto project, describe their usage scenarios, discuss the API definitions, submit them to the community together, and ultimately achieve the goal of "write once, run anywhere"!


