Gaode (AutoNavi) Serverless Platform: Construction and Practice

Introduction: Why did AutoNavi adopt Serverless/FaaS? How was it built, and what is the technical solution? Where does it stand today, and what comes next? This article shares our answers.


Author | Deng Xuexiang (Xiang Yi)
Source | Serverless official account

AutoNavi began building its serverless platform in FY21. A year later, its serverless business peaks at over 100,000 QPS. The platform went from zero to one, and traffic grew from zero to 100,000 QPS, making AutoNavi the BU with the largest serverless deployment in Alibaba Group. What did that journey look like, and what problems did we run into along the way? The sections below cover why we adopted Serverless/FaaS, how we built it, the technical solution, our current progress, and our plans.

1. Why: Why AutoNavi adopted Serverless

Why did AutoNavi adopt Serverless? The background is the client cloud project that AutoNavi launched in FY21. Its main purpose is to speed up client development iteration.

Previously, all of the client's business logic lived on the client, so every change in product requirements had to ship in a new client version. Each client version must pass a series of tests and gray releases to guard against problems such as crashes. As a result, releases currently happen only once a month.

Once the client moves to the cloud, frequently changing business logic lives in the cloud. New product requirements are developed there and no longer wait for the monthly release. This speeds up iteration and brings us another step closer to the ideal of product and engineering moving at the same cadence. (We say "another" because AutoNavi has already made several efforts in this direction. We hope device-cloud integrated development will be the most effective technical enabler.)

1.1 Goal: a client-style development model with device-cloud integration

The development model has changed from pure client development to cloud + client development, but the developers should remain the people who already own each business. Server-side development, however, differs markedly from client-side development. Client development targets a single machine. Server development usually targets a cluster and must deal with complex issues such as distributed coordination, load balancing, failover, and degradation. Asking client developers to adopt the traditional server-side model would make this transition considerably riskier.

FaaS solves this problem well. AutoNavi's client already had xbus, a framework for registering and invoking services locally on the client. We extended it with an xbus-cloud component, so that developing in the cloud feels just like developing on the device. The goal is one codebase running in two places: the same business code runs on both the client and the server.
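The "one codebase, two places" idea can be sketched in Go (the language of Go-Faas) as a name-based service bus. This is a minimal illustration under stated assumptions. `Bus`, `Register`, and `Call` are hypothetical names, not the actual xbus or xbus-cloud API. The cloud-side bus stands in for a FaaS function that would receive requests over the network.

```go
package main

import (
	"errors"
	"fmt"
)

// Handler is business logic shared by the client and the cloud.
// In this sketch, requests and responses are plain string maps.
type Handler func(req map[string]string) (map[string]string, error)

// Bus is a hypothetical stand-in for an xbus-style registry:
// services are registered by name and invoked by name.
type Bus struct {
	services map[string]Handler
}

func NewBus() *Bus { return &Bus{services: make(map[string]Handler)} }

// Register binds a service name to its implementation.
func (b *Bus) Register(name string, h Handler) { b.services[name] = h }

// Call looks up a service by name and invokes it.
func (b *Bus) Call(name string, req map[string]string) (map[string]string, error) {
	h, ok := b.services[name]
	if !ok {
		return nil, errors.New("service not found: " + name)
	}
	return h(req)
}

// greeting is the shared business logic: the same function is registered
// on the on-device bus and on the bus inside a FaaS function.
func greeting(req map[string]string) (map[string]string, error) {
	return map[string]string{"text": "Hello, " + req["user"]}, nil
}

func main() {
	local := NewBus() // on-device bus
	cloud := NewBus() // bus inside the FaaS function (xbus-cloud role)
	local.Register("greeting", greeting)
	cloud.Register("greeting", greeting)

	a, _ := local.Call("greeting", map[string]string{"user": "Ada"})
	b, _ := cloud.Call("greeting", map[string]string{"user": "Ada"})
	fmt.Println(a["text"], b["text"])
}
```

Callers depend only on a service name and a request/response shape. The same `greeting` function can therefore be registered on either side, and choosing which bus serves a call becomes a deployment decision rather than a code change.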

The AutoNavi client runs on three main platforms: iOS, Android, and in-car head units (a Linux-like operating system). It is written mainly in two languages: C++ and Node.js. Traditional map features such as map display, route display, and voice navigation prompts must run on all three platforms, so they are developed in C++. Map applications built on top of navigation, such as pre-trip and post-trip cards and destination recommendations, are developed mainly in Node.js.

In FY20, Taobao's front-end team developed a Node.js FaaS runtime. For the Node.js part of AutoNavi's client cloud project, we reused this existing Taobao runtime and connected it to the group's FaaS platform, which moved a number of Node.js services to the cloud. During the 2020 National Day holiday, it reliably supported AutoNavi's National Day Travel Festival business.

There was no existing solution for C++ FaaS. We therefore decided to build a new C++ FaaS platform on top of the group's existing infrastructure to help move the AutoNavi client to the cloud.

1.1.1 The key to device-cloud integration: abstracting the interface between the client and FaaS

We either move existing client logic to the FaaS server or develop new requirements there. In both cases, success hinges on the interface protocol between the client and FaaS, that is, the FaaS API definition. A good API definition makes the system easier to maintain, and it is also essential for supporting later iterations of the business. For what makes a good API definition, see Gu Pu's article "API Design Best Practice Thinking".

Ideally, the client becomes a browser that renders the result data returned by FaaS. Once a browser's protocol is defined, it rarely changes, which is why you seldom need to update IE or Chrome. Our browser is more complex, of course: it is a map browser. Later product iterations show whether the client-FaaS interface is well defined. If some requirements can be delivered entirely on the FaaS side without any client changes, the interface abstraction has succeeded.
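A hypothetical "map browser" protocol makes the idea concrete. In this sketch, FaaS returns a list of typed cards, and the client renders only the card types it knows. The server can then add or reorder content without a client release. The card schema here is an assumption, not AutoNavi's actual protocol.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Card is a hypothetical generic unit of UI returned by a FaaS function.
// The client knows how to render card types; the server decides content and order.
type Card struct {
	Type   string            `json:"type"`   // e.g. "weather", "tip"
	Fields map[string]string `json:"fields"` // data the renderer displays
}

// Response is the whole payload the client "browser" parses.
type Response struct {
	Version int    `json:"version"`
	Cards   []Card `json:"cards"`
}

// render stands in for the client renderer. Unknown card types are skipped,
// so the server can introduce new types without breaking older clients.
func render(payload []byte, known map[string]bool) ([]string, error) {
	var resp Response
	if err := json.Unmarshal(payload, &resp); err != nil {
		return nil, err
	}
	var shown []string
	for _, c := range resp.Cards {
		if known[c.Type] {
			shown = append(shown, c.Type+":"+c.Fields["title"])
		}
	}
	return shown, nil
}

func main() {
	payload := []byte(`{"version":1,"cards":[
		{"type":"weather","fields":{"title":"Rain at destination"}},
		{"type":"new_feature","fields":{"title":"Not yet supported"}}]}`)
	shown, err := render(payload, map[string]bool{"weather": true})
	fmt.Println(shown, err)
}
```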

1.2 Improving development efficiency at the BFF layer

When people think of AutoNavi, they first think of it as a tool: a navigation app. (That is no longer quite accurate. In recent years AutoNavi has been transforming from a tool into a platform that aims to do everything. Its transaction business is growing, and services such as ride-hailing, tickets, and hotels are expanding rapidly.)

Compared with other businesses in the group, such as e-commerce, AutoNavi's business stands out technically for its large number of read-only scenarios. Many of these are BFF (Backend For Frontend) scenarios. The reason is that core navigation functions such as routing, traffic, and ETA are relatively stable. The main work there is continuously refining algorithms so that AutoNavi's traffic information is more accurate and its routes are better. The interfaces and functions of these core services change little, whereas front-end requirements change often, for example adding alerts for width restrictions along a route.

FaaS is particularly well suited to BFF development. FaaS functions call the relatively stable backend BaaS services, encapsulate the data and call logic, and can be developed and released quickly. Across the industry, BFF is also the most common FaaS scenario (another name for it is SFF, Service For Frontend).
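A typical BFF function fans out to several stable backends and reshapes the results for the front end. The sketch below assumes two hypothetical backends (`RouteService`, `WeatherService`) with in-memory fakes. Real calls would go through the SDK's HSF or HTTP clients.

```go
package main

import (
	"fmt"
	"sync"
)

// RouteService and WeatherService are hypothetical stable backend (BaaS) interfaces.
type RouteService interface{ ETA(dest string) (int, error) }
type WeatherService interface{ Forecast(city string) (string, error) }

// In-memory fakes so the sketch runs on its own.
type fakeRoute struct{}

func (fakeRoute) ETA(dest string) (int, error) { return 42, nil }

type fakeWeather struct{}

func (fakeWeather) Forecast(city string) (string, error) { return "sunny", nil }

// TripCard is the front-end-shaped view the BFF function returns.
type TripCard struct {
	Dest       string
	ETAMinutes int
	Weather    string
}

// BuildTripCard calls both backends concurrently and merges the results.
func BuildTripCard(r RouteService, w WeatherService, dest string) (TripCard, error) {
	var (
		wg         sync.WaitGroup
		eta        int
		weather    string
		errR, errW error
	)
	wg.Add(2)
	go func() { defer wg.Done(); eta, errR = r.ETA(dest) }()
	go func() { defer wg.Done(); weather, errW = w.Forecast(dest) }()
	wg.Wait() // both goroutines have finished writing their results
	if errR != nil {
		return TripCard{}, errR // the route is essential: fail the card
	}
	if errW != nil {
		weather = "unknown" // weather is optional: degrade instead of failing
	}
	return TripCard{Dest: dest, ETAMinutes: eta, Weather: weather}, nil
}

func main() {
	card, err := BuildTripCard(fakeRoute{}, fakeWeather{}, "Hangzhou")
	fmt.Printf("%+v %v\n", card, err)
}
```

Running the backend calls concurrently keeps the BFF's latency close to that of the slowest backend rather than the sum of both. Treating weather as optional shows how a BFF can degrade part of a response instead of failing all of it.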

1.3 Serverless is the high-level language of the cloud era

In FY21, AutoNavi became the first BU in the group to move fully to the cloud. Moving to the cloud, however, is not the end of the cloud era. So far, it has mainly meant full containerization (Pouch) and cloud migration, with standardized containers. In scale and resource utilization we fully enjoy the benefits of the cloud. The business development model is basically unchanged, though: we are still building large distributed systems, so development has not yet benefited from the cloud. By analogy, we are still writing cloud services in assembly language. Serverless and cloud native can be seen as the high-level languages of the cloud era. They truly make the cloud a computer, so developers can focus on business logic without handling the complexity of large-scale distributed systems.

1.4 Go-Faas completes the Go ecosystem

As mentioned earlier, for the client cloud project we worked with the Alibaba Cloud FC (Function Compute) team to build the C++ FaaS runtime. We also built Go-Faas. Why? Some background: peak traffic to the Go portion of AutoNavi's server side already exceeds one million QPS. Together with the group's middleware department, AutoNavi has built Go clients for Alibaba's middleware, and the observability and automated testing systems are largely in place. The Go ecosystem is now essentially complete. With Go-Faas, we can write both BaaS services and FaaS services in Go and choose the implementation style that suits each business scenario. Go-Faas is used mainly in the BFF scenarios described above.

2. How: the technical solution, built on the group's existing infrastructure

2.1 Overall technical architecture

Having covered why we did this, we now turn to how: the approach and the specific technical design.

We followed the principle of building on the group's existing infrastructure and middleware. Working with the CSE and Alibaba Cloud FC (Function Compute) teams, we developed the C++ FaaS runtime and the Go FaaS runtime. The figure below shows the overall technical architecture, integrated end to end with the group's platforms. It has three parts: development, runtime, and operations and maintenance.

2.1.1 Runtime

Let's start with the runtime. Business traffic enters through our gateway, reaches the FC API Server, and is forwarded to the C++/Go FaaS runtime, which executes the user function. The runtime architecture is described in detail in the next section.

Several sidecars for monitoring, logging, and Dapr are deployed alongside the runtime container. The monitoring and logging sidecars collect and report logs, and the Dapr sidecar handles calls to the group's middleware.

Dapr is still in the pilot stage, however. Middleware calls currently go mainly through the broker and the proxies for each middleware, such as HSF, Tair, MetaQ, and Diamond.

Finally, the autoscaling module scales function instances out and in automatically. Several scaling policies are available, such as scaling on request concurrency or on instance CPU usage. A number of reserved instances can also be configured in advance to avoid cold starts after scaling down to zero.
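Concurrency-based scaling with reserved instances can be expressed as a simple rule: desired instances = ceil(in-flight requests / target concurrency per instance), clamped between the reserved minimum and a maximum. This is an illustrative sketch, not FC's actual scheduling algorithm. `ScalePolicy` and its fields are hypothetical.

```go
package main

import "fmt"

// ScalePolicy is a hypothetical concurrency-based policy; the field names
// are illustrative, not FC configuration keys.
type ScalePolicy struct {
	TargetConcurrency int // in-flight requests one instance should handle
	Reserved          int // minimum warm instances (avoids cold starts at zero)
	Max               int // upper bound on instances
}

// DesiredInstances returns ceil(inFlight / TargetConcurrency), clamped to [Reserved, Max].
func (p ScalePolicy) DesiredInstances(inFlight int) int {
	n := (inFlight + p.TargetConcurrency - 1) / p.TargetConcurrency // integer ceiling
	if n < p.Reserved {
		n = p.Reserved
	}
	if n > p.Max {
		n = p.Max
	}
	return n
}

func main() {
	p := ScalePolicy{TargetConcurrency: 10, Reserved: 2, Max: 100}
	for _, load := range []int{0, 15, 95, 5000} {
		fmt.Println(load, "->", p.DesiredInstances(load))
	}
}
```

With `TargetConcurrency: 10`, `Reserved: 2`, and `Max: 100`, zero load still keeps 2 warm instances. 95 in-flight requests yield 10 instances, and very high load is capped at 100.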

At the bottom, the platform relies on the group's ASI, which can be thought of simply as the group's Kubernetes plus Sigma (the group's scheduling system). FC calls ASI to deploy function instances and scale them elastically. The smallest unit of deployment is the pod shown in the figure above. Each pod contains the runtime container and the sidecar set containers.

2.1.2 Development

Now for development. The runtime determines how a function runs. Development is about the developer experience: how easily developers can develop, debug, deploy, and test a function.

C++ FaaS faces a difficult cross-platform problem. The C++ FaaS runtime depends on several libraries, and C++ dependency management is far less convenient than Java's, so installing dependencies is tedious. The FaaS scaffolding solves this: with one command it generates a sample C++ FaaS project and installs the required dependency packages. For local debugging, we also developed a C++ FaaS Runtime Boot module. It contains the runtime's startup entry point and links the runtime with the user's FaaS functions, so developers can step through the runtime in a debugger.

We worked with the group's Aone team to integrate function releases into Aone, so Go or C++ FaaS functions are easy to publish there. Aone can also generate an example code repository with one click.

Compiling C++ and Go FaaS functions requires the corresponding build environment, and Aone supports custom build images. We upload build images to the group's public image registry, and each function's code repository specifies which build image to use. The FaaS dependency libraries, SDK, and so on are preinstalled in these build images.

2.1.3 Operations and maintenance

Finally, operations and monitoring. The runtime integrates EagleEye and Sunfire log collection. The runtime writes the logs, and an agent in the sidecar ships them to the EagleEye or Sunfire monitoring platform (FC itself collects logs through SLS). As a result, the group's existing monitoring platforms can monitor FaaS, and FaaS can also be connected to the group's GOC alerting platform.

2.2 C++/Go FaaS runtime architecture

The previous section described the overall architecture, integrated with Aone, FC/CSE, and ASI. The runtime is one part of it. This section describes the runtime architecture and how it is designed and implemented.

At the top, user FaaS code depends only on the FaaS SDK. To write a FaaS function, users simply implement the Function interface in the SDK. To call an external system, they use the SDK's HTTP client. To call middleware, they use the SDK's Diamond/Tair/HSF/MetaQ clients. These SDK interfaces hide the complexity of the underlying implementation. Users need not care how the calls are ultimately carried out or how the runtime works.

The SDK layer contains the function definition described above and the interface definitions for the various middleware calls. FaaS users develop against it. The SDK is deliberately thin: it consists mainly of interface definitions and contains no concrete implementations. The runtime implements the middleware calls in two ways.

The figure shows the overall architecture of the runtime. The Starter is the runtime's startup module. Once started, the runtime itself is a server. At startup it configures itself from the Function Config module, and it then begins serving invocation requests as well as management and monitoring requests.

Below that is the Service layer, which implements the middleware call interfaces defined in the SDK in two ways: RSocket and Dapr. With RSocket, middleware is called through the RSocket broker. The runtime also integrates Dapr (Distributed Application Runtime), so middleware can alternatively be called through Dapr. During the early Dapr pilot, if a middleware call through Dapr fails, the runtime falls back to calling the middleware through RSocket.
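The Dapr-first, RSocket-fallback behavior during the pilot follows the classic fallback-wrapper pattern. In this sketch, `Invoker` and `FallbackInvoker` are hypothetical names and the transports are stubs. A real implementation would wrap the Dapr gRPC client and the RSocket broker client.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// Invoker is a hypothetical transport-neutral interface for middleware calls;
// user code never sees which transport is used.
type Invoker interface {
	Invoke(service, method string, payload []byte) ([]byte, error)
}

// FallbackInvoker tries the primary transport (Dapr in the pilot) and
// falls back to the secondary (RSocket broker) if the primary call fails.
type FallbackInvoker struct {
	Primary, Secondary Invoker
	fallbacks          int64 // degraded-call counter, updated atomically
}

func (f *FallbackInvoker) Invoke(service, method string, payload []byte) ([]byte, error) {
	out, err := f.Primary.Invoke(service, method, payload)
	if err == nil {
		return out, nil
	}
	atomic.AddInt64(&f.fallbacks, 1)
	return f.Secondary.Invoke(service, method, payload)
}

// Fallbacks reports how many calls were served by the secondary transport.
func (f *FallbackInvoker) Fallbacks() int64 { return atomic.LoadInt64(&f.fallbacks) }

// stubInvoker simulates a transport for this example.
type stubInvoker struct {
	name string
	fail bool
}

func (s stubInvoker) Invoke(service, method string, _ []byte) ([]byte, error) {
	if s.fail {
		return nil, errors.New(s.name + " unavailable")
	}
	return []byte(s.name + ":" + service + "." + method), nil
}

func main() {
	inv := &FallbackInvoker{
		Primary:   stubInvoker{name: "dapr", fail: true},
		Secondary: stubInvoker{name: "rsocket"},
	}
	out, err := inv.Invoke("UserService", "getProfile", nil)
	fmt.Println(string(out), err, inv.Fallbacks())
}
```

Counting fallbacks matters in practice. A rising count signals that the primary path is unhealthy, even though callers still see successful responses.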

Below the Service layer is the RSocket protocol layer, which wraps the RSocket metadata for each kind of protocol call. Dapr calls are made over gRPC.

The bottom layer integrates RSocket and Dapr.

RSocket calls also require choosing a broker. The upstream module manages the broker cluster, including broker registration and deregistration and keepalive checks. The LoadBalance module handles load-balanced broker selection, event management, connection management, reconnection, and so on.
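Broker registration, keepalive results, and load-balanced selection can be sketched as a small, mutex-protected pool that uses round-robin over healthy brokers. Round-robin is an assumed policy here, because the article does not say which balancing strategy the LoadBalance module uses. `BrokerPool` and its methods are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// BrokerPool tracks registered brokers, their keepalive status, and a
// round-robin cursor. All methods are safe for concurrent use.
type BrokerPool struct {
	mu      sync.Mutex
	addrs   []string
	healthy map[string]bool
	next    int
}

func NewBrokerPool() *BrokerPool { return &BrokerPool{healthy: map[string]bool{}} }

// Register adds a broker (for example after discovery) and marks it healthy.
func (p *BrokerPool) Register(addr string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if _, seen := p.healthy[addr]; !seen {
		p.addrs = append(p.addrs, addr)
	}
	p.healthy[addr] = true
}

// Deregister removes a broker that has gone offline.
func (p *BrokerPool) Deregister(addr string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	delete(p.healthy, addr)
	for i, a := range p.addrs {
		if a == addr {
			p.addrs = append(p.addrs[:i], p.addrs[i+1:]...)
			break
		}
	}
	if p.next >= len(p.addrs) {
		p.next = 0
	}
}

// MarkHealth records the result of a keepalive check for a known broker.
func (p *BrokerPool) MarkHealth(addr string, ok bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if _, seen := p.healthy[addr]; seen {
		p.healthy[addr] = ok
	}
}

// Pick returns the next healthy broker in round-robin order.
func (p *BrokerPool) Pick() (string, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for i := 0; i < len(p.addrs); i++ {
		idx := (p.next + i) % len(p.addrs)
		if p.healthy[p.addrs[idx]] {
			p.next = (idx + 1) % len(p.addrs)
			return p.addrs[idx], nil
		}
	}
	return "", errors.New("no healthy broker")
}

func main() {
	pool := NewBrokerPool()
	for _, a := range []string{"broker-a:9000", "broker-b:9000", "broker-c:9000"} {
		pool.Register(a)
	}
	pool.MarkHealth("broker-b:9000", false) // keepalive failed
	for i := 0; i < 4; i++ {
		addr, _ := pool.Pick()
		fmt.Println(addr)
	}
}
```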

Finally, the runtime's metrics module integrates EagleEye tracing. It uses the filter pattern to measure time spent along the FaaS call chain and writes EagleEye logs. It also writes Sunfire logs for the sidecar to collect. The following figure shows the Sunfire monitoring dashboard of a real business:

2.2.1 Dapr

The Dapr architecture is shown in the figure below. For details, see the official Dapr documentation.

The runtime originally called middleware through RSocket, but the RSocket broker is a centralized component. To decentralize outbound traffic, we worked with the group's middleware team to introduce the Dapr architecture. Dapr is integrated at the runtime level and is transparent to user FaaS code, which does not need to know whether a given middleware call goes through RSocket or Dapr. When the runtime switches its middleware calls to Dapr, user FaaS code requires no changes.

3. How: how businesses onboard to Serverless

As mentioned earlier, onboarding is unified on Aone. We provide onboarding documentation for C++ FaaS and Go FaaS. We also provide an example code repository that covers a range of scenarios, including code for calling the group's various middleware. C++/Go FaaS is open to developers across the group, and several BUs besides AutoNavi already use it in their own businesses. Node.js FaaS is onboarded with the runtime and templates provided by Taobao. Java FaaS is onboarded with the runtime and templates provided by Alibaba Cloud FC.

3.1 Onboarding specification: the three axes of stability (monitorable, gray-releasable, rollback-capable)

Adopting new technology naturally raises stability concerns. Our answer is Alibaba Group's three axes of stability: monitoring, gray release, and rollback. We set up a FaaS link assurance group that brings together the upstream and downstream business teams and the underlying platforms. Together we work to the group's 1-5-10 requirement: respond to an online alert and begin investigating within 1 minute, handle it within 5 minutes, and recover within 10 minutes.

We wrote a FaaS onboarding specification and checklist that help business teams adopt FaaS quickly. They standardize onboarding and prevent mistakes that could cause online failures.

Monitoring, gray release, and rollback are hard requirements, and it is even better if the business also supports degradation. Our C++ client cloud services were built to degrade from the very start of the pilot: if a FaaS call fails, the client automatically falls back to the local implementation. This basically preserves client functionality at the cost of some extra response latency. The local version of a function on the client may lag slightly behind the server's. Functions are kept forward compatible, however, so client usage is essentially unaffected.

4. Now: current status

4.1 Infrastructure construction

The Go/C++ FaaS runtimes are complete and integrated with FC-Ginkgo/CSE and Aone, and stable version 1.0 has been released.
We have done extensive work on stability, graceful shutdown, performance optimization, and compiler optimization. Compiling C++ FaaS with the approach provided by the compiler optimization team of Alibaba Cloud's Basic Software Department has brought significant performance gains.
C++/Go FaaS is integrated with EagleEye, Sunfire monitoring is in place, and functions are observable.
Pooling is complete and provides elasticity within seconds. With pooled runtime images onboarded to CSE, bringing up a new instance now takes seconds instead of minutes.
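Second-level elasticity from pooling rests on keeping generic runtime instances pre-started and specializing one only when capacity is needed. The sketch below models the pool as a buffered channel. `WarmPool`, `Instance`, and `Acquire` are hypothetical. A real pool would also refill itself in the background and load the function's code package during specialization.

```go
package main

import (
	"errors"
	"fmt"
)

// Instance is a hypothetical pre-started runtime. The slow step (starting the
// runtime) happens ahead of time; assigning it to a function is the fast step.
type Instance struct {
	ID       int
	Function string // empty while the instance is generic and idle in the pool
}

// WarmPool holds generic instances ready to be assigned to any function.
type WarmPool struct {
	idle chan *Instance
}

// NewWarmPool pre-starts size generic instances.
func NewWarmPool(size int) *WarmPool {
	p := &WarmPool{idle: make(chan *Instance, size)}
	for i := 0; i < size; i++ {
		p.idle <- &Instance{ID: i}
	}
	return p
}

// Acquire takes a warm instance without blocking and specializes it for fn.
// An empty pool means the caller must fall back to a slow cold start.
func (p *WarmPool) Acquire(fn string) (*Instance, error) {
	select {
	case inst := <-p.idle:
		inst.Function = fn // in a real pool, the function's code would be loaded here
		return inst, nil
	default:
		return nil, errors.New("pool empty: cold start required")
	}
}

func main() {
	pool := NewWarmPool(2)
	for i := 0; i < 3; i++ {
		inst, err := pool.Acquire("navTips")
		fmt.Println(inst, err)
	}
}
```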

4.2 Implementation of AutoNavi's Serverless Business

C++ FaaS, Go FaaS, and Node.js FaaS are already widely used at AutoNavi. A few examples:

In the screenshots above, the first two services were developed with C++ FaaS: long-distance weather and search along the route. The last two were developed with Go-Faas: navigation tips and the footprint map.

AutoNavi is the BU with the largest serverless deployment in Alibaba Group. Its deployed serverless applications reach a daily peak of over 100,000 QPS.

4.3 Main benefits

What has AutoNavi gained from running the group's largest serverless deployment?

The first and most important benefit is development efficiency. Our Serverless-based client-cloud integration component has helped move client logic to the cloud, removed the dependency on client releases, and sped up client iteration. Likewise, building the BFF layer on Serverless has sped up iteration for BFF scenarios.

The second benefit is operations efficiency. With Serverless's automatic elastic scaling, AutoNavi can handle travel peaks more calmly. Annual peaks come around the National Day Travel Festival, May Day, Qingming Festival, and Spring Festival. Operations staff and business developers no longer need to scale out before these holidays and scale back in afterwards. AutoNavi's peaks also differ from e-commerce flash sales: travel traffic does not surge within a single second. The second-level elasticity provided by our current pooling technology therefore fully meets AutoNavi's needs.

The third benefit is lower cost. AutoNavi's traffic is high during the day and low at night, with a large gap between peak and trough and clearly defined time periods. Serverless automatically scales in during the low-traffic night hours, which greatly reduces server costs.

5. Next: follow-up plans

Optimize the use of Function Compute in Alibaba's internal environment, continuing to improve its performance, stability, and experience together with the FC team. Use the group's many high-traffic business scenarios to keep refining the C++/Go FaaS runtimes, and eventually offer them on the public cloud to benefit more companies undergoing digital transformation.
Roll out Dapr to decentralize outbound traffic, gradually launching C++/Go FaaS functions that call group middleware through Dapr.
Build FaaS chaos engineering, fault drills, and escape capabilities. In the new fiscal year, FaaS will take part in our BU's fault drills, and we will fix the problems the drills uncover one by one.
Integrate edge computing. In device-cloud integration scenarios, FaaS plus edge computing can provide lower latency and a better user experience.

Much work remains on all of the above. In addition, in FY22 our department will pilot and roll out cloud native. As any engineer knows, there is a long way to go from technology selection and prototyping to real business adoption.

If you are interested in serverless, cloud native, or Go development and want to build something together, join us, whatever your current technology stack. Send your resume to gdtech@alibaba-inc.com with the subject line "Name-Technical Direction-From Serverless". We offer large-scale production scenarios and a simple, open technical culture. Self-referrals and referrals are both welcome!

This article is adapted from a talk by Xiang Yi, Senior Technical Expert at Alibaba, at the Alibaba Cloud Serverless Developer Meetup (Shanghai).
Live replay: https://developer.aliyun.com/live/246653