Under a microservice architecture, splitting services multiplies the number of APIs, and using an API gateway to manage them has gradually become the trend. Shepherd, Meituan's unified API gateway service, was born against this background. It is tailored to Meituan's business, developed entirely in-house, and replaces traditional web-layer gateway applications: business developers can expose functions and data to the outside simply through configuration. This article introduces the background behind Meituan's unified API gateway, its key technical design and implementation, and the future plans for the gateway. We hope it offers some help or inspiration.
1. Background introduction
1.1 What is an API gateway?
The API gateway is an architectural pattern that emerged alongside microservices. What was once a single, huge all-in-one business system is split into many microservice systems that are maintained and deployed independently. As a result of this split, the number of APIs multiplies and managing them becomes increasingly difficult, so publishing and managing APIs through an API gateway has gradually become the trend. Generally speaking, an API gateway is a traffic entry point that sits between external requests and internal services, and implements common functions for those requests such as protocol conversion, authentication, flow control, parameter validation, and monitoring.
1.2 Why is the Shepherd API gateway needed?
Before Shepherd existed, when Meituan business developers wanted to expose an internal service as an external HTTP API, they typically built a web application to handle the basics: authentication, rate limiting, monitoring and logging, parameter validation, protocol conversion, and so on. They also had to maintain that code and keep the underlying components upgraded, so R&D efficiency was relatively low. In addition, every such web application needed its own machines, configuration, databases, and so on, so resource utilization was poor as well.
Some internal business lines at Meituan, lacking a ready-made solution, developed their own business-specific API gateways. Across the industry, companies such as Amazon, Alibaba, and Tencent also offer mature API gateway solutions.
Therefore, the Shepherd API gateway project was officially launched. Our goal is to provide Meituan with a high-performance, highly available, and extensible unified API gateway solution, so that business developers can expose functions and data to the outside world through configuration.
1.3 What are the benefits of using Shepherd?
From the perspective of business developers, what benefits does connecting to the Shepherd API gateway bring? In short, there are three.
Improved R&D efficiency
- Business developers can quickly expose a service interface through configuration alone.
- Shepherd provides the non-business basic capabilities in one place, such as authentication, rate limiting, and circuit breaking.
- Shepherd lets business developers extend the gateway's capabilities by developing custom components.
Reduced communication costs
- Once business developers have configured an API, the front-end/back-end interaction document and the client SDK for that API can be generated automatically, which makes collaboration and joint debugging between front-end and back-end developers easier.
Improved resource utilization
- Following the Serverless architecture idea, APIs are fully hosted, and business developers do not need to care about machine resources.
2. Technical design and implementation
2.1 Overall architecture
Let's first take a look at the overall architecture of the Shepherd API gateway, as shown in the following figure:
Control plane: The control plane of the Shepherd API gateway consists of the Shepherd management platform and the Shepherd monitoring center. The management platform handles full life cycle management of APIs and the delivery of configuration; the monitoring center collects monitoring data for API requests and raises business alarms.
Configuration center: The configuration center of the Shepherd API gateway handles the exchange of information between the control plane and the data plane. It is implemented on Lion, Meituan's unified configuration service.
Data plane: The data plane of the Shepherd API gateway is the Shepherd server. A complete API request may originate from a mobile application, a web application, a partner, or an internal system, and reaches the server after passing through the Nginx load balancing system. The server integrates a series of basic functional components and business-defined custom components, calls back-end RPC services, HTTP services, function services, or service orchestration services through generic invocation, and finally returns the response.
Below we will give a detailed introduction to these three main modules.
2.1.1 Control plane
Using the control plane of the API gateway, business developers can easily complete the full life cycle management of the API, as shown in the following figure:
Business developers start by creating an API, entering its parameters, and generating the DSL script. They can then test the API using the documentation and mock features. Once testing is complete, the Shepherd management platform provides a series of safeguards to keep the launch stable, such as release approval, grayscale launch, and version rollback. While the API is running, failed calls are monitored, request logs are recorded, and an alarm is raised promptly when an anomaly is detected. Finally, when an API that is no longer used is taken offline, the resources it occupied are reclaimed, and the API waits to be re-enabled.
Business developers manage the entire life cycle themselves through configuration and workflows. Getting started generally takes less than 10 minutes, which greatly improves R&D efficiency.
2.1.2 Configuration Center
The configuration center of the API gateway stores the configuration of each API, described in a custom DSL (Domain-Specific Language). It is used to deliver changes to API routing, rules, components, and other configuration to the data plane of the API gateway.
The configuration center combines Lion, the unified configuration management service, with a local cache to achieve dynamic configuration and releases without downtime. The configuration of an API is shown in the figure below:
The fields of an API configuration are:
- Name, Group: the API's name and the group it belongs to.
- Request: the requested domain name, path, parameters, and related information.
- Response: how the response is assembled, exception handling, and Header and Cookie information.
- Filters, FilterConfigs: the functional components used by the API and their configuration.
- Invokers: the request rules and orchestration information for back-end services (RPC/HTTP/Function).
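To make these fields concrete, here is a purely illustrative sketch of what such a configuration might contain, written as a Python dictionary. Only the top-level field names come from the list above. Every value (API name, domain, path, component names, appkey, method) is a hypothetical placeholder, and Shepherd's real DSL syntax is not shown in this article.

```python
# Hypothetical API configuration. Only the top-level field names follow the
# article; all values and nested keys are made up for illustration.
api_config = {
    "name": "get_order_detail",
    "group": "demo_group",
    "request": {
        "domain": "api.example.com",
        "path": "/orders/{orderId}",
        "method": "GET",
    },
    "response": {
        "headers": {"Content-Type": "application/json"},
        "onError": {"code": 500, "message": "internal error"},
    },
    "filters": ["auth", "rate_limit"],
    "filterConfigs": {
        "rate_limit": {"qps": 100},
    },
    "invokers": [
        {"type": "RPC", "appkey": "com.example.order", "method": "getOrder"},
    ],
}

REQUIRED_FIELDS = ("name", "group", "request", "response",
                   "filters", "filterConfigs", "invokers")


def missing_fields(config):
    """Return the top-level fields that the configuration does not define."""
    return [field for field in REQUIRED_FIELDS if field not in config]
```

A check like `missing_fields` is the kind of validation a management platform could run before delivering a configuration to the data plane.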
2.1.3 Data plane
API routing
Once the data plane of the API gateway picks up an API configuration, it builds the request path and routing information for that configuration in memory. HTTP request paths often contain path variables. For performance reasons, Shepherd does not use regular-expression matching; instead it stores routes in two data structures, as shown below:
The first is a direct-mapping map for paths without path variables. The key is the full domain name and path, and the value is the corresponding API configuration.
The second is a prefix tree for paths that contain path variables. Matching proceeds by prefix: exact (literal) nodes are tried first, and each visited node is pushed onto a stack. If the match fails, the top node is popped and the variable node at the same level is pushed instead. If that also fails, the search keeps backtracking until a matching path node is found or every candidate has been exhausted.
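The two structures can be sketched as follows. This is a minimal illustration under assumptions, not Shepherd's implementation: the class and method names are hypothetical, path variables are written as `{name}`, and the backtracking uses an explicit stack on which the variable branch is pushed first so that the literal branch is tried first.

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # literal path segment -> TrieNode
        self.var_child = None  # child node for a path variable such as {id}
        self.var_name = None   # variable name bound by this node (if any)
        self.api = None        # API configuration stored at a terminal node


class Router:
    def __init__(self):
        self.exact = {}         # "domain/path" -> API config (no variables)
        self.root = TrieNode()  # prefix tree for paths with variables

    @staticmethod
    def _segments(domain, path):
        return [domain] + [s for s in path.split("/") if s]

    def add(self, domain, path, api):
        segments = self._segments(domain, path)
        if "{" not in path:
            self.exact["/".join(segments)] = api
            return
        node = self.root
        for seg in segments:
            if seg.startswith("{") and seg.endswith("}"):
                if node.var_child is None:
                    node.var_child = TrieNode()
                    node.var_child.var_name = seg[1:-1]
                node = node.var_child
            else:
                node = node.children.setdefault(seg, TrieNode())
        node.api = api

    def match(self, domain, path):
        segments = self._segments(domain, path)
        api = self.exact.get("/".join(segments))
        if api is not None:
            return api, {}
        # Each stack entry: (node, index of next segment, variables bound so far).
        stack = [(self.root, 0, {})]
        while stack:
            node, i, variables = stack.pop()
            if i == len(segments):
                if node.api is not None:
                    return node.api, variables
                continue  # dead end: backtrack to the next candidate
            seg = segments[i]
            # Push the variable branch first so the literal branch is popped first.
            if node.var_child is not None:
                bound = {**variables, node.var_child.var_name: seg}
                stack.append((node.var_child, i + 1, bound))
            literal = node.children.get(seg)
            if literal is not None:
                stack.append((literal, i + 1, variables))
        return None, {}
```

Paths without variables cost a single dictionary lookup, and the tree is only walked for variable paths, which is the performance argument for avoiding regular expressions.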
Functional components
When request traffic matches an API request path and enters the server, the actual processing is carried out by the series of functional components configured in the DSL. The gateway integrates a rich set of functional components, including distributed tracing, real-time monitoring, access logs, parameter validation, authentication, rate limiting, circuit breaking and degradation, and grayscale traffic splitting, as shown in the following figure:
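One way to picture such a component pipeline is the minimal sketch below. It assumes each component is a function that receives the request context and its own configuration and returns whether processing should continue. The component names, the `Context` class, and the configuration keys are all hypothetical.

```python
class Context:
    """Per-request state shared by the components (hypothetical)."""

    def __init__(self, request):
        self.request = request
        self.response = None
        self.trace = []  # names of the components that ran, in order


def auth_filter(ctx, config):
    ctx.trace.append("auth")
    if ctx.request.get("token") != config["expected_token"]:
        ctx.response = {"status": 401, "body": "unauthorized"}
        return False  # stop the chain
    return True


def param_check_filter(ctx, config):
    ctx.trace.append("param_check")
    params = ctx.request.get("params", {})
    missing = [name for name in config["required"] if name not in params]
    if missing:
        ctx.response = {"status": 400, "body": "missing: " + ",".join(missing)}
        return False
    return True


# Registry of available components, keyed by the name used in the configuration.
FILTERS = {"auth": auth_filter, "param_check": param_check_filter}


def run_chain(ctx, filter_names, filter_configs):
    """Run the configured components in order; stop at the first rejection."""
    for name in filter_names:
        if not FILTERS[name](ctx, filter_configs.get(name, {})):
            return ctx.response
    ctx.response = {"status": 200, "body": "forwarded to backend"}
    return ctx.response
```

Because the chain is driven by a list of names plus per-component configuration, adding or removing a capability for one API becomes a configuration change rather than a code change.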
Protocol conversion & service call
The last step of an API call is protocol conversion and the service call. The gateway has to read the HTTP request parameters and the local Context parameters, assemble the back-end service parameters, convert from HTTP to the back-end service's protocol, call the back-end service to obtain the response, and convert that response into an HTTP response.
The figure above uses a call to a back-end RPC service as an example: JsonPath expressions extract parameter values from different parts of the HTTP request, those values replace the corresponding parts of the RPC request parameters to produce the service-parameter DSL, and the service call is finally made through RPC generic invocation.
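The parameter mapping step can be illustrated with a small sketch. It does not use a real JsonPath library: `resolve` is a hypothetical helper that supports only plain dotted paths such as `$.path.orderId`, and the layout of `http_request` (path, query, header sections) is an assumption made for this example.

```python
def resolve(expr, source):
    """Resolve a dotted expression such as '$.query.lang' against nested dicts.

    Only a tiny subset of JsonPath (plain key access) is supported here.
    """
    value = source
    for key in expr[2:].split("."):
        value = value[key]
    return value


def build_rpc_params(template, http_request):
    """Fill an RPC parameter template from the parts of an HTTP request.

    String values that start with '$.' are treated as mapping expressions;
    everything else is copied as a constant.
    """
    if isinstance(template, dict):
        return {k: build_rpc_params(v, http_request) for k, v in template.items()}
    if isinstance(template, list):
        return [build_rpc_params(v, http_request) for v in template]
    if isinstance(template, str) and template.startswith("$."):
        return resolve(template, http_request)
    return template


http_request = {
    "path": {"orderId": "42"},
    "query": {"lang": "en"},
    "header": {"X-User": "u1"},
}
template = {
    "orderId": "$.path.orderId",
    "options": {"lang": "$.query.lang", "source": "gateway"},
    "operators": ["$.header.X-User"],
}
rpc_params = build_rpc_params(template, http_request)
```

The resulting `rpc_params` structure is what a generic RPC invocation would receive in place of a compiled client stub.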
2.2 High-availability design
As a foundational component of the access layer, the Shepherd API gateway's high availability has always been a major concern for business developers. Let's look at Shepherd's practices in high-availability design.
2.2.1 Eliminate performance hazards
For a highly available system, to prevent failures, we must first eliminate hidden performance hazards and ensure high performance.
Shepherd processes API requests fully asynchronously. The Jetty IO thread hands each request off asynchronously to the business processing thread pool, and back-end services are called asynchronously through the RPC or HTTP framework. This frees threads that would otherwise be blocked waiting on the network, so the number of threads is no longer the gateway's bottleneck. The following figure shows the server's request thread processing logic when the Jetty container is used:
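The effect of asynchronous calls can be illustrated with Python's `asyncio`, which is only an analogy for the Jetty and RPC framework setup described here. In this sketch `call_backend` is a hypothetical stand-in for a non-blocking RPC or HTTP call, and the point is that many in-flight requests can wait on the network without each one holding a thread.

```python
import asyncio
import threading


async def call_backend(api, delay=0.01):
    # Stand-in for a non-blocking RPC/HTTP call: while waiting, control
    # returns to the event loop instead of blocking a thread.
    await asyncio.sleep(delay)
    return f"{api}:ok"


async def handle_request(api, threads_seen):
    threads_seen.add(threading.get_ident())
    result = await call_backend(api)
    return {"status": 200, "body": result}


async def serve(apis):
    """Handle all requests concurrently and report which threads were used."""
    threads_seen = set()
    responses = await asyncio.gather(
        *(handle_request(api, threads_seen) for api in apis)
    )
    return responses, threads_seen


responses, threads_seen = asyncio.run(serve([f"api{i}" for i in range(50)]))
```

All 50 requests are in flight at the same time on a single thread, so the number of concurrent requests is no longer tied to the number of threads.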
We load-tested the end-to-end QPS of a single gateway node through its domain name and found that once QPS exceeded 2,000 there were many timeout errors, even though the gateway server still had ample load and performance headroom. Further investigation showed that other web applications in the company had the same problem. A joint investigation with the Oceanus team found that persistent (keep-alive) connections between Nginx and web applications were not enabled and could not be configured. After the Oceanus team urgently scheduled, developed, and launched persistent-connection support, Shepherd's end-to-end QPS rose to more than 10,000.
In addition, we added API request warm-up to the Shepherd server so that the gateway reaches its best performance as soon as it starts, reducing latency spikes. Then, by profiling CPU hot spots during load tests, we located further bottlenecks: we reduced local log printing on the main path and changed request logging to asynchronous, remote reporting. Shepherd's end-to-end QPS rose by more than 30% again.
A year after Shepherd went live and had been running stably, we optimized performance once more by upgrading the network framework, replacing the Jetty container entirely with the Netty network framework. Performance improved by more than 10%, and Shepherd's end-to-end QPS rose to more than 15,000. The following figure shows the server's request thread processing logic when the Netty framework is used:
2.2.2 Service isolation
Cluster isolation
Drawing on the experience of mature components such as the company's cache and task scheduling systems, Shepherd was designed from the start with cluster isolation by business line, and it also supports independent deployment for important businesses, as shown below:
Request isolation
At the service-node level, Shepherd supports isolating requests into fast and slow thread pools. This is mainly for APIs that use synchronous, blocking components, such as SSO authentication or custom authentication, which could otherwise block the shared business thread pool for long periods.
Fast/slow isolation works by measuring the processing time of API requests: requests for APIs whose processing time exceeds the tolerance threshold are routed to the slow thread pool so that they do not affect other, normal API calls.
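A minimal sketch of the fast/slow idea, under stated assumptions: the dispatcher keeps a short window of recent latencies per API and sends an API to the slow pool when its average exceeds a threshold. The class name, pool sizes, window size, and the use of an average as the statistic are all hypothetical; the article does not say which statistic Shepherd uses.

```python
import threading
import time
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


class IsolatingDispatcher:
    def __init__(self, slow_threshold_ms=100.0, window=20):
        self.fast_pool = ThreadPoolExecutor(max_workers=8, thread_name_prefix="fast")
        self.slow_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="slow")
        self.slow_threshold_ms = slow_threshold_ms
        self.window = window
        self.samples = defaultdict(list)  # api -> recent latencies in ms
        self.lock = threading.Lock()

    def record(self, api, elapsed_ms):
        with self.lock:
            samples = self.samples[api]
            samples.append(elapsed_ms)
            del samples[:-self.window]  # keep only the most recent samples

    def pool_for(self, api):
        with self.lock:
            samples = list(self.samples.get(api, ()))
        if samples and sum(samples) / len(samples) > self.slow_threshold_ms:
            return self.slow_pool
        return self.fast_pool  # unknown or fast APIs share the fast pool

    def submit(self, api, handler, *args):
        def timed():
            start = time.perf_counter()
            try:
                return handler(*args)
            finally:
                self.record(api, (time.perf_counter() - start) * 1000)

        return self.pool_for(api).submit(timed)
```

With this arrangement a blocking component can exhaust only the small slow pool, while the threads that serve normal APIs stay available.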
In addition, Shepherd lets business developers configure custom thread pools for isolation. The thread isolation model is shown in the following figure:
2.2.3 Stability guarantee
Shepherd provides several conventional stability safeguards to protect the availability of both itself and back-end services, as shown below:
- Flow control: provides traffic protection along multiple dimensions, such as rate limits by user-defined UUID, by App, by IP, and by cluster.
- Request caching: for requests that are idempotent, frequently queried, and not sensitive to stale data, business developers can enable request caching.
- Timeout management: every API has a processing timeout; requests that exceed it fail fast so that they do not keep occupying resources.
- Circuit breaking and degradation: request statistics are monitored in real time; once the configured failure threshold is reached, the circuit breaker trips automatically and a default value is returned.
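As an illustration of the last item, here is a minimal circuit breaker sketch. It is a generic textbook variant, not Shepherd's component: it counts consecutive failures, opens after a hypothetical `failure_threshold`, returns the fallback while open, and allows a trial call after `reset_after_s` seconds. The injectable `clock` parameter exists only to make the behavior easy to test.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0      # consecutive failures seen so far
        self.opened_at = None  # time the breaker tripped, or None if closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback    # open: fail fast with the default value
            self.opened_at = None  # cool-down elapsed: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback
        self.failures = 0  # a success closes the breaker again
        return result
```

While the breaker is open the back-end function is not invoked at all, which is what protects a struggling downstream service from further load.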
2.2.4 Request security
Request security is a very important capability of an API gateway. Shepherd integrates a rich set of security-related system components, including basic request signing, SSO single sign-on, UAC/UPM access control based on SSO authentication, Passport user authentication, EPassport merchant authentication, merchant permission authentication, and anti-crawling. Business developers only need to configure them to use them.
2.2.5 Grayscale
As the entry point for requests, the API gateway often carries the important task of grayscale verification of request traffic.
Grayscale scenarios
In terms of grayscale capabilities, Shepherd supports grayscale release of an API's own logic, grayscale release of downstream services, and both at the same time, as shown below:
For an API's own logic, grayscale is implemented by splitting traffic across different API versions. For downstream services, traffic is tagged and routed to the designated downstream grayscale unit.
Grayscale strategies
Shepherd supports a rich set of grayscale strategies: traffic can be selected by percentage or by specific conditions.
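Both strategies can be sketched in a few lines. This is an assumption-laden illustration: the rule format, the `user_id` field, and the use of an MD5 hash for stable bucketing are hypothetical choices, not details from Shepherd.

```python
import hashlib


def in_percentage(user_id, percent):
    """Stable bucketing: the same user always lands in the same bucket (0-99)."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < percent


def choose_version(request, rule):
    """Pick the API version for a request according to a grayscale rule."""
    # Condition-based grayscale: every key/value in the rule must match.
    conditions = rule.get("conditions")
    if conditions and all(request.get(k) == v for k, v in conditions.items()):
        return rule["gray_version"]
    # Percentage-based grayscale: a fixed share of users sees the new version.
    if in_percentage(request["user_id"], rule.get("percent", 0)):
        return rule["gray_version"]
    return rule["stable_version"]
```

Hashing the user identifier instead of drawing a random number keeps each user on the same version across requests, which makes grayscale verification easier to reason about.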
2.2.6 Monitoring and alarms
Multi-layer monitoring
Shepherd provides comprehensive, multi-layer monitoring that watches business metrics, machine metrics, and JVM metrics around the clock (7x24), as shown in the following table:
No. | Monitoring module | Main function |
---|---|---|
1 | Raptor unified monitoring | Reports request call information and system metrics in real time; responsible for application-layer (JVM) monitoring and system-layer (CPU, IO, network) monitoring |
2 | Mtrace distributed tracing | Responsible for passing parameters transparently along the full call chain, and for full-chain tracing and monitoring |
3 | Logscan log monitoring | Monitors local logs for exception keywords, such as 5xx status codes and null pointer exceptions |
4 | Remote log center | API request logs, debug logs, component logs, and so on can be reported to the remote log center |
5 | Health check Scanner | Performs heartbeat checks and API status checks on gateway nodes to find abnormal nodes and abnormal APIs promptly |
Multi-dimensional alarms
A comprehensive monitoring system naturally needs a matching alarm mechanism. The main alarm capabilities are:
No. | Alarm type | Trigger |
---|---|---|
1 | Rate limit alarm | Triggered when API requests reach the threshold of a rate limiting rule |
2 | Request failure alarm | Triggered by authentication failures, request timeouts, back-end service exceptions, and so on |
3 | Component exception alarm | Triggered when a custom component takes too long to process or has a high failure rate |
4 | API exception alarm | Triggered when an API release fails or an API check finds an abnormality |
5 | Health check failure alarm | Triggered when an API heartbeat check fails or a gateway node is unavailable |
2.2.7 Fault self-healing
The Shepherd server is connected to an elastic scaling module and can quickly scale out and in based on CPU and other metrics. It also supports quickly removing faulty nodes and, at a finer granularity, removing faulty components.
2.2.8 Migration support
For web services that already expose APIs externally, business developers may consider migrating them to the Shepherd API gateway to reduce operation and maintenance costs and improve future R&D efficiency.
For non-core APIs, Oceanus's grayscale release feature can be used to migrate directly. For core APIs, however, that feature works at machine level: its granularity is too coarse and inflexible to support the grayscale verification process well.
Solution
Shepherd provides business developers with a grayscale SDK. A web service that integrates the SDK identifies grayscale traffic and forwards it to the Shepherd API gateway for verification.
Which APIs are in grayscale, and at what percentage, can be adjusted dynamically on the Shepherd management platform and takes effect in real time. Business developers can also customize the grayscale strategy through an SPI. Only after grayscale verification passes is the API migrated to the Shepherd API gateway, which keeps the migration process stable.
Grayscale process
Before grayscale: Create an API group on the Shepherd management platform and set its domain name to the one currently in use. On Oceanus, the original domain name rules remain unchanged.
During grayscale: Turn on the grayscale feature on the Shepherd management platform; the grayscale SDK then forwards grayscale traffic to the gateway service for verification.
After grayscale: Once grayscale traffic has verified that the API configuration on Shepherd behaves as expected, complete the migration.
2.3 Design for ease of use
The Shepherd API gateway is powerful but also complex, so ease of use matters a great deal to business developers. Here we focus on two solutions: automatic DSL generation and more efficient API operations.
2.3.1 Automatically generate DSL
On the gateway management platform, we try to spare business developers from writing DSL by offering graphical page configuration. However, the DSL for service parameter conversion still has to be written by hand. Generally speaking, producing the service-parameter DSL involves these steps:
- Add a dependency on the service's interface package.
- Obtain the class definition of the service parameters.
- Write a test case to generate a JSON template.
- Fill in the parameter mapping rules.
- Finally, enter the result manually on the management platform and publish the API.
The whole process is cumbersome and error-prone. When dozens or hundreds of APIs need to be entered, doing so manually is very inefficient.
Solution
So can the generation of the service-parameter DSL be automated? Yes. A business developer only needs to enter the API's documentation information in the gateway, followed by the service's Appkey, service name, and method name. The Shepherd management platform then fetches the JSON Schema of the service parameters from the console of the newly released service framework. JSON Schema defines the types and structure of the service parameters, and from it the management platform can automatically generate JSON mock data for them. Combined with the information in the API documentation, values whose parameter names match are then replaced automatically. In use, this automatic DSL generation is transparent and standardized for the business: the business side only needs to upgrade to the latest version of the service framework. It greatly improves R&D efficiency and has been well received by business developers.
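The two automated steps, generating mock data from a JSON Schema and replacing values whose names match API parameters, can be sketched as follows. This handles only a small subset of JSON Schema (`object`, `array`, and a few scalar types), and the schema, the parameter names, and the `$.`-style mapping expressions are hypothetical examples.

```python
# Placeholder values for scalar JSON Schema types.
DEFAULTS = {"string": "", "integer": 0, "number": 0.0, "boolean": False}


def mock_from_schema(schema):
    """Build JSON mock data from a (simplified) JSON Schema."""
    schema_type = schema.get("type")
    if schema_type == "object":
        properties = schema.get("properties", {})
        return {name: mock_from_schema(sub) for name, sub in properties.items()}
    if schema_type == "array":
        return [mock_from_schema(schema.get("items", {}))]
    return DEFAULTS.get(schema_type)


def fill_mappings(template, api_params):
    """Replace scalar values whose key matches an API parameter name."""
    if isinstance(template, dict):
        filled = {}
        for key, value in template.items():
            if key in api_params and not isinstance(value, (dict, list)):
                filled[key] = api_params[key]
            else:
                filled[key] = fill_mappings(value, api_params)
        return filled
    if isinstance(template, list):
        return [fill_mappings(value, api_params) for value in template]
    return template


schema = {
    "type": "object",
    "properties": {
        "orderId": {"type": "integer"},
        "user": {"type": "object", "properties": {"userId": {"type": "string"}}},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
}
api_params = {"orderId": "$.path.orderId", "userId": "$.query.userId"}
mock = mock_from_schema(schema)
dsl = fill_mappings(mock, api_params)
```

Fields with no matching API parameter (here `tags`) keep their mock value, so a developer would only need to review the few entries that could not be mapped automatically.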
2.3.2 More efficient API operations
Quick API creation
The gateway's core capabilities are all built on API configuration, but that power brings considerable complexity. Many business developers complained that configuring an API was too cumbersome and costly to learn, which is how quick API creation came about: business developers only need to provide a small amount of information to create an API. Quick creation currently covers four types (back-end RPC service API, back-end HTTP service API, SSO CallBack API, and Nest API), and more types will be added for different business scenarios.
Batch operations
Business developers manage many business groups on the API gateway, and each business group can hold up to 200 API configurations. Many APIs share identical settings, such as component configuration, error code configuration, and cross-origin configuration, yet each API has to be configured separately, which is highly repetitive. Shepherd therefore supports batch operations: after selecting multiple APIs, their configurations can be updated in one go through the [Batch Operation] feature, reducing the cost of repeated configuration.
API import and export
Shepherd can import and export APIs across R&D environments. After offline testing is complete, business developers only need to use the import and export feature to carry the configuration over to the online production environment, avoiding repeated configuration.
2.4 Extensibility design
A well-designed basic component needs good extensibility in addition to strong basic capabilities. Shepherd's extensibility lies mainly in its support for custom components and service orchestration.
2.4.1 Custom components
Shepherd provides a rich set of system components for authentication, rate limiting, monitoring, and so on, which cover most business needs. Some special requirements remain, however, such as custom validation or custom result processing. Shepherd supports these custom logic extensions by allowing custom components to be loaded.
The figure below is an example of a custom component implementation. getName returns the name that was filled in when applying for the custom component, and the invoke method implements the component's business logic, such as continuing execution, redirecting to a page, returning a result directly, or throwing an exception.
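As a rough illustration of this contract, here is a hypothetical sketch in Python. It mirrors only the two methods named in the text (as `get_name` and `invoke`). The base class, the `Action` constants, the shape of `context`, and the example component are assumptions and do not reflect Shepherd's actual interface.

```python
from abc import ABC, abstractmethod


class Action:
    """Possible outcomes of a component (hypothetical names)."""
    CONTINUE = "continue"  # go on to the next component
    REDIRECT = "redirect"  # redirect to another page
    RETURN = "return"      # return a result directly


class CustomComponent(ABC):
    @abstractmethod
    def get_name(self):
        """Name registered when applying for the custom component."""

    @abstractmethod
    def invoke(self, context):
        """Run the business logic; return (action, payload) or raise."""


class RegionCheckComponent(CustomComponent):
    def get_name(self):
        return "region_check"

    def invoke(self, context):
        region = context.get("region")
        if region is None:
            raise ValueError("region is required")  # abort with an exception
        if region == "blocked":
            return Action.RETURN, {"code": 403, "message": "region not served"}
        if region == "legacy":
            return Action.REDIRECT, "https://example.com/legacy"
        return Action.CONTINUE, None
```

The gateway would look the component up by the name from `get_name` and act on the outcome of `invoke`, so the business team only writes the logic inside `invoke`.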
So far, Shepherd has supported important businesses such as Meituan Select, food delivery, dining, and ride-hailing through custom components, with more than 200 custom components connected.
2.4.2 Service Orchestration
Normally, one API configured on the gateway corresponds to one RPC or HTTP service on the back end. If the caller needs to aggregate and orchestrate back-end services, it has to issue as many HTTP requests as there are back-end services. This causes several problems: the caller makes too many HTTP requests, efficiency is low, and the aggregation logic on the caller's side becomes too heavy.
The need for service orchestration arose from this. Service orchestration means arranging and calling existing services and processing the data they return. It is mainly used for data aggregation, where the data returned by one HTTP request requires calling multiple services (RPC or HTTP), or the same service multiple times, to assemble the complete result.
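A data aggregation scenario of this kind can be sketched as follows. The three back-end functions are hypothetical stand-ins for RPC or HTTP calls, and the thread pool is just one simple way to run independent calls in parallel; the sketch says nothing about how the orchestration framework described here works internally.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical back-end services; real ones would be RPC or HTTP calls.
def get_order(order_id):
    return {"orderId": order_id, "shopId": 7}


def get_shop(shop_id):
    return {"shopId": shop_id, "name": "Demo Shop"}


def get_delivery(order_id):
    return {"orderId": order_id, "status": "delivering"}


def order_detail(order_id):
    """Aggregate several back-end calls into one response."""
    # Step 1: the order is needed first because it carries the shop id.
    order = get_order(order_id)
    # Step 2: the remaining calls are independent, so run them in parallel.
    with ThreadPoolExecutor(max_workers=2) as pool:
        shop_future = pool.submit(get_shop, order["shopId"])
        delivery_future = pool.submit(get_delivery, order_id)
        return {
            "order": order,
            "shop": shop_future.result(),
            "delivery": delivery_future.result(),
        }
```

The caller then issues a single request for the aggregated result instead of three separate ones, and the dependency between calls lives on the server side.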
After preliminary research, we found that the company already had a mature service orchestration framework: Pirate, a component developed by the customer service team (see "Pirate Middleware: Best Practice for Service Experience Platform Docking Business Data"), which is also an internal public service at Meituan.
We therefore worked with the Pirate team to design Shepherd's service orchestration support. Pirate is deployed independently and provides the orchestration capability, and Shepherd calls Pirate over RPC. This decouples Shepherd from Pirate and prevents service orchestration from affecting other services on the cluster, while the one extra RPC call does not add noticeable latency. It is also transparent to business developers and convenient to use: they configure a service orchestration API on the management platform, the configuration is delivered to both the Shepherd server and the Pirate service through the configuration center, and the orchestration capability is then ready to use. The overall interaction architecture is shown below:
3. Future planning
At present, more than 18,000 APIs are connected to the Shepherd API gateway, there are more than 90 online clusters, and total daily calls number in the tens of billions. As the business scale of the Shepherd API gateway continues to grow, it will inevitably face higher requirements for availability, ease of use, and extensibility. In the coming year, Shepherd's planning focuses on the evolution to a cloud native architecture, static website hosting, and a component market.
3.1 Evolution of Cloud Native Architecture
The evolution of the Shepherd API gateway to a cloud native architecture has three goals: simplify the steps for connecting to the gateway and improve developers' efficiency; reduce the size of the server-side WAR package to improve security and stability; and adopt Serverless elasticity to reduce costs and improve resource utilization.
To achieve these goals, we plan to migrate the gateway service as a whole to Nest, the company's Serverless service (see the article "Exploration and Practice of Serverless Platform Nest"), and to integrate Shepherd's core functions into an SDK. For business gateway clusters, business developers can select only the custom components they need, greatly reducing the size of the server-side WAR package.
3.2 Static website hosting
The goal of static website hosting on the Shepherd API gateway is to build a general-purpose hosting solution that gives developers a convenient, stable, and highly scalable static website hosting service.
The main functions this solution can offer business developers include hosting static website resources (storage and access), managing the application life cycle (including custom domain configuration, authentication, and authorization), and CI/CD integration.
3.3 Component Market
The goal of the Shepherd API gateway component market is win-win cooperation and a development ecosystem in which business developers can offer the custom components they have built to other business teams that need them.
We hope business developers will take part in developing custom components, complete the usage documentation, and publish them as public components open to everyone who uses Shepherd, so that nobody has to reinvent the wheel.
About the Author
Chongze, Zhiyang, Li Min, and others, all from the Infrastructure Team of Meituan's Basic Technology Department.
This article was produced by the Meituan technical team, and the copyright belongs to Meituan. You are welcome to reprint or use its content for non-commercial purposes such as sharing and communication; please indicate that "the content is reproduced from the Meituan technical team". This article may not be reproduced or used commercially without permission. For any commercial use, please email tech@meituan.com to apply for authorization.