
AutoNavi has been building up its Go capabilities for some time, mainly in three areas: landing Go applications, building Go middleware, and cloud native. After sustained effort, good progress has been made in all of them. How has Go been put into practice at AutoNavi, what problems did we run into, and how did we solve them? This article shares the relevant experience, in the hope that it will be useful to interested readers.

2. Why we want to land Go applications

At present the mainstream language at AutoNavi is still Java. Java has the most applications and runs on a very large number of machines. AutoNavi's business is also growing fast, and costs are rising quickly with it. When it comes to reducing machine load, Go has considerable advantages over Java at the language level. Reducing machine cost is our first reason for landing Go applications.

Secondly, Go has developed rapidly in recent years. Both within Alibaba Group and within AutoNavi, calls to adopt Go keep growing. Landing Go applications is a good way to verify the stability of the Go middleware. We can of course verify it with chaos engineering and similar means, but the test of a production environment is the most convincing. Verifying the stability of the Go middleware is our second reason for landing Go applications.

Finally, Go is the language most widely used in cloud-native infrastructure. Landing Go applications in advance removes a lot of resistance from the later cloud-native rollout. The Serverless/FaaS deployment at AutoNavi is already quite large. Paving the way for the subsequent cloud-native rollout is our third reason for landing Go applications.

3. Landing a Go application in a high-traffic scenario

3.1 Introduction to the rendering gateway

The AutoNavi rendering gateway discussed in this article ranks among the top of our Go applications in business traffic, migration difficulty, risk, and benefit. It sits at the access layer and carries half of AutoNavi's total traffic, so its importance is easy to see.

Next, a brief introduction to the services the rendering gateway handles, to give a more concrete picture.

The rendering gateway serves all map rendering for sources such as AutoNavi's mobile apps, in-car systems, and open platform. When you use AutoNavi, all the graphics you see (buildings, terrain, names, routes, subway stations, bus stops, traffic lights, and so on) are delivered to the client by the rendering engine through the rendering gateway. Here are a few pictures to make this more tangible.

Picture 1 above shows the map before a trip, picture 2 during a trip, picture 3 the taxi-hailing page, and picture 4 a hand-drawn map of a scenic spot. The rendering gateway serves many other businesses; these are only examples, and the rest are not shown here.

3.2 Refactoring difficulties

Anyone who has done a refactoring project will know its two main difficulties well: one is ensuring business correctness, the other is ensuring service stability.

On business correctness: most services that get rewritten are old services. The biggest problems with old services are complex historical logic, staff turnover, and missing documentation. These are the blockers in any rewrite.

The rendering gateway is no exception. It serves many business lines, including AutoNavi's mobile clients, in-car clients, open platform, and taxi-hailing, across all historical versions. Combined with the factors above, ensuring business correctness is a very difficult task.

On service stability: anyone who has worked on gateways knows that, by its nature, a gateway does not see frequent business iterations. Stability is its first requirement. We need to ensure that the gateway stays highly available whether or not its external environment and dependencies are healthy. Because the Go middleware had not yet been sufficiently verified in high-traffic scenarios, this difficulty needed careful evaluation. We used appropriate methods to verify as many boundary conditions as possible in a simulation environment, so that problems would not appear in production.

3.3 Technical Solution

When refactoring the AutoNavi rendering gateway, our overall technical solution is divided into three steps:

3.3.1 Online traffic comparison

How to verify the business correctness of the new service? We adopted the method of online traffic comparison.

We did a lot of research early on, hoping to find a tool that supports (near) real-time, binary-level comparison, but unfortunately found none that met the requirements. Because of the nature of rendering services, most rendering gateway interfaces return binary vector data, so an ideal tool must support binary-level comparison as well as conventional data comparison.

Another advantage of binary-level comparison is that it eliminates differences caused by character sets and by library functions in different languages, which makes the comparison more accurate. Some readers may think of writing logs and comparing them offline. That approach has many drawbacks.

First, the traffic cannot be replayed to a designated machine. Secondly, it generally relies on a fixed corpus, which is not complete enough to fully simulate the online environment. In addition, differences in character sets and language library functions have a larger impact on the accuracy of log-based comparison, especially for special characters (more so when the layer-7 protocol is binary). What to do when no handy ready-made tool is available? As the saying goes, "cut a road when you meet a mountain, build a bridge when you meet a river".

We developed our own (near) real-time traffic comparison tool, ln. It guaranteed the correctness of this rewrite and can also serve other rewrites at AutoNavi. The technical details involve a lot of TCP/IP and are very interesting. Interested readers can skip directly to the section on the technical details of the traffic comparison tool (ln).

3.3.2 Simulation environment pressure test

Anyone who runs services will know that reaching five-nines availability is not easy. All kinds of situations can occur in a real production environment. To ensure high availability, we must find a way to verify the stability of the service under all kinds of boundary conditions. For a newly rewritten service, that requires a simulation environment.

To build a simulation environment, we need to keep the machine baseline, external dependencies, and external traffic consistent with production (for example, by diverting traffic from production). The simulation environment must be able to reproduce abnormal conditions as well as normal ones.

Abnormal conditions include network disconnection, packet loss, and so on. As the saying goes, 20% of the code implements the function and 80% handles abnormal situations. In practice, our main means of constructing abnormal states is chaos engineering, which can simulate anomalies from the operating-system level (such as network disconnection and packet loss) up to the application layer (such as message middleware backlog, or hooks before and after JVM methods to simulate business exceptions).

In the simulation environment we also run long-duration stress tests at the limit. The corpus is diverted from production, the tests run under both normal and abnormal conditions, and we observe the service over a long period to reach a conclusion about its stability and availability.

The indicators observed include basic indicators, such as CPU, disk utilization, memory utilization, and number of connections, and business indicators, such as interface success rate, success count, total count, and TP99. In this way nearly all foreseeable situations are covered, which gives strong assurance of service stability and high availability.
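As a small illustration of the TP99 indicator mentioned above, the following sketch computes a nearest-rank percentile over a batch of latency samples. The function name `Percentile` and the use of plain integer milliseconds are assumptions made for this example; the article does not describe how its monitoring system computes TP99.

```go
package main

import (
	"fmt"
	"sort"
)

// Percentile returns the p-th percentile (1..100) of latencies using the
// nearest-rank method. Latencies are in milliseconds; the input is not modified.
func Percentile(latenciesMs []int, p int) int {
	if len(latenciesMs) == 0 {
		return 0
	}
	sorted := append([]int(nil), latenciesMs...)
	sort.Ints(sorted)
	// Nearest rank: ceil(p/100 * n), computed in integers to avoid float rounding.
	rank := (p*len(sorted) + 99) / 100
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

func main() {
	samples := make([]int, 0, 100)
	for ms := 100; ms >= 1; ms-- { // 100 requests taking 1..100 ms
		samples = append(samples, ms)
	}
	fmt.Println("TP99 (ms):", Percentile(samples, 99))
	fmt.Println("TP50 (ms):", Percentile(samples, 50))
}
```

TP99 is the latency below which 99% of requests complete, so it reflects tail behaviour that an average would hide.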

3.3.3 Smooth gray-release traffic switching

The sections above covered how to ensure business correctness and service stability. Next is how to switch traffic smoothly with a gray release. We strictly follow Alibaba's three principles for safe releases: every change must be gray-releasable, monitorable, and rollback-able.

In practice, we switched traffic in the following steps:

a. Leave the original Java cluster untouched and provision a new Go cluster. Modify the routing rules so that some whitelisted users are served by the Go cluster.

b. Change the routing rules to the Go cluster one interface at a time, ramping up slowly while closely watching machine status, business logs, and monitoring indicators. If anything goes wrong, switch back to the Java cluster with one click.

c. After all interfaces have been switched to the Go cluster, keep the Java and Go clusters running side by side for a period of time.

d. Gradually take the Java cluster machines offline.
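The routing decisions in steps a and b can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Router` type, its fields, and the per-interface percentage ramp are hypothetical and are not taken from AutoNavi's gateway; the article only says that whitelisted users go to the Go cluster first, that interfaces are switched one by one, and that a one-click rollback exists.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Router is a hypothetical routing rule set for a gray release.
type Router struct {
	Whitelist map[string]bool // users always served by the Go cluster (step a)
	Percent   map[string]int  // per-interface share of users sent to Go, 0..100 (step b)
	Rollback  bool            // one-click switch back to the Java cluster
}

// Pick returns "go" or "java" for a request to the given interface.
func (r *Router) Pick(api, userID string) string {
	if r.Rollback {
		return "java"
	}
	if r.Whitelist[userID] {
		return "go"
	}
	// Hash the user ID so that the same user consistently lands on the same cluster.
	h := fnv.New32a()
	h.Write([]byte(userID))
	if int(h.Sum32()%100) < r.Percent[api] {
		return "go"
	}
	return "java"
}

func main() {
	r := &Router{
		Whitelist: map[string]bool{"tester-1": true},
		Percent:   map[string]int{"/render/tile": 100}, // this interface is fully switched
	}
	fmt.Println(r.Pick("/render/label", "tester-1")) // whitelisted user
	fmt.Println(r.Pick("/render/tile", "user-42"))   // switched interface
	fmt.Println(r.Pick("/render/label", "user-42"))  // interface not switched yet
	r.Rollback = true
	fmt.Println(r.Pick("/render/tile", "tester-1")) // after rollback
}
```

Keeping rollback as a single flag checked first is what makes the switch back to the Java cluster a one-step operation in this sketch.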

3.4 Main benefits

The first important benefit is cost reduction and efficiency improvement. After the AutoNavi rendering gateway moved from Java to Go, the number of machines was cut by nearly half. The same work is done with half of the original resources, which greatly reduces cost, improves resource utilization, supports business growth better, and greatly slows the growth in access-layer machines driven by rapidly growing traffic.

The second important benefit is that it verified the stability of the Go middleware co-built by AutoNavi and the Group, which to some extent has improved and enriched the Group's Go ecosystem. Having stood the test of a high-traffic scenario, the stability of this middleware has been well verified.

The third important benefit is that it paves the way for moving the gateway to cloud native. Moving the gateway to Go is only the first step. Go is the language most used in cloud-native infrastructure, so this step smooths out language differences in advance, which brings many benefits for the later cloud-native gateway and reduces the risk and cost of that migration.

In addition, the rewrite of the AutoNavi rendering gateway produced a number of very useful tools that can provide key safeguards for later rewrites, such as the self-developed traffic comparison tool ln.

4. Technical deep dive

4.1 Technical details of the traffic comparison tool (ln)

First, a question: what functions does a (near) real-time traffic comparison tool need? Traffic replication, traffic parsing, traffic replay, and traffic comparison, certainly. In practice it is more than that: it is a closed loop of traffic regression, as shown in the following figure:

4.1.1 Traffic replication

To support all layer-7 protocols, traffic must be captured at layer 3 or 4. Some readers will immediately think of tcpdump, and that is indeed what we use. The files written by tcpdump are the real traffic, so the replication step is done. As for (near) real-time operation, two or three tcpdump processes can be started at staggered times with overlapping capture windows.

Another design consideration for this tool is that it must not put much load on the online machines, so as not to affect their stability. This way of replicating traffic is very lightweight, and the extra load on the online machine is small enough to ignore.

4.1.2 Traffic upload & traffic pull

Both traffic upload and traffic pull use internal file services.

4.1.3 Traffic comparison

To keep the comparison rigorous and to rule out interference from character sets and from differing library function implementations, ln natively supports comparing binary streams.
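A byte-level comparison of two responses can be sketched as below. The function name `FirstDiff` is made up for this example and is not part of ln; the point is only that comparing raw bytes, rather than decoded text, sidesteps character-set and library differences and can report exactly where two responses diverge.

```go
package main

import (
	"bytes"
	"fmt"
)

// FirstDiff returns the offset of the first byte at which a and b differ,
// or -1 if the two byte streams are identical.
func FirstDiff(a, b []byte) int {
	if bytes.Equal(a, b) {
		return -1
	}
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	for i := 0; i < n; i++ {
		if a[i] != b[i] {
			return i
		}
	}
	return n // one stream is a strict prefix of the other
}

func main() {
	oldResp := []byte{0x0A, 0x10, 0xFF, 0x7F} // response from the old service
	newResp := []byte{0x0A, 0x10, 0xFE, 0x7F} // response from the new service
	if off := FirstDiff(oldResp, newResp); off >= 0 {
		fmt.Printf("responses differ at offset %d: %#02x vs %#02x\n", off, oldResp[off], newResp[off])
	}
}
```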

4.1.4 Local replay of problem traffic for debugging

During traffic regression, some comparisons may turn out inconsistent. In that case we want to replay only that specific traffic to a specified machine, for debugging or other operations. ln natively supports this.

4.1.5 Traffic parsing

Traffic parsing is very interesting. The pure joy of it comes from "playing" with network protocols.

Concretely, the task is to parse the tcpdump file, extract the TCP payloads, and restore the HTTP requests.

There are two key points here: how to get the TCP payload out of the tcpdump file, and how to reassemble the layer-4 TCP payloads into layer-7 HTTP requests.

4.1.5.1 tcpdump file format

First, how to get the TCP payload from the tcpdump file. If we know the format of the tcpdump file, we know where each TCP payload is and how long it is. So let us look at the file format.

First, an overview of the tcpdump file:

The format and length of the file header are fixed, as follows:

After opening the tcpdump file we can skip the file header (24 bytes in the classic pcap format) and then process the packets one by one. The format of each packet is as follows:

For each packet we skip, in turn, the packet header, the data link header, the IP header, and the TCP header, and arrive at the first byte of the TCP payload. More implementation details (checking header field values at each layer, handling different lengths, handling byte order, matching request packets to response packets, and so on) are not covered here. Only the general idea is introduced; interested readers can dig deeper into the network protocols.
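The header-skipping described above can be sketched as follows. This is a minimal illustration, not ln's implementation. It assumes the classic pcap format (24-byte file header, 16-byte per-packet header), Ethernet II frames without VLAN tags, and IPv4, and it ignores nanosecond-resolution captures, IPv6, and truncated packets. The names `TCPPayloads`, `tcpPayload`, and `sampleCapture` are invented for the example, and `sampleCapture` only fabricates a one-packet capture in memory so the sketch can run without a real file.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
	"io"
)

const (
	fileHeaderLen   = 24 // classic pcap global header
	recordHeaderLen = 16 // per-packet header: ts_sec, ts_usec, incl_len, orig_len
	ethHeaderLen    = 14 // Ethernet II, no VLAN tag (assumption)
)

// TCPPayloads returns the non-empty TCP payloads found in a classic pcap stream.
func TCPPayloads(r io.Reader) ([][]byte, error) {
	hdr := make([]byte, fileHeaderLen)
	if _, err := io.ReadFull(r, hdr); err != nil {
		return nil, err
	}
	// The magic number reveals the byte order of the per-packet headers.
	var order binary.ByteOrder
	switch binary.LittleEndian.Uint32(hdr[0:4]) {
	case 0xa1b2c3d4:
		order = binary.LittleEndian
	case 0xd4c3b2a1:
		order = binary.BigEndian
	default:
		return nil, errors.New("pcap: unsupported magic number")
	}
	var payloads [][]byte
	rec := make([]byte, recordHeaderLen)
	for {
		if _, err := io.ReadFull(r, rec); err != nil {
			if err == io.EOF {
				return payloads, nil // clean end of file
			}
			return nil, err
		}
		pkt := make([]byte, order.Uint32(rec[8:12])) // incl_len = captured bytes
		if _, err := io.ReadFull(r, pkt); err != nil {
			return nil, err
		}
		if p := tcpPayload(pkt); len(p) > 0 {
			payloads = append(payloads, p)
		}
	}
}

// tcpPayload skips the Ethernet, IPv4 and TCP headers of one captured frame.
// It returns nil for anything that is not a complete IPv4/TCP packet.
func tcpPayload(pkt []byte) []byte {
	if len(pkt) < ethHeaderLen+20 || binary.BigEndian.Uint16(pkt[12:14]) != 0x0800 {
		return nil
	}
	ip := pkt[ethHeaderLen:]
	ihl := int(ip[0]&0x0f) * 4                     // IPv4 header length in bytes
	total := int(binary.BigEndian.Uint16(ip[2:4])) // IPv4 total length
	if ip[9] != 6 || ihl < 20 || total > len(ip) || total < ihl+20 {
		return nil // protocol 6 is TCP
	}
	tcp := ip[ihl:total]
	off := int(tcp[12]>>4) * 4 // TCP data offset in bytes
	if off < 20 || off > len(tcp) {
		return nil
	}
	return tcp[off:]
}

// sampleCapture builds a one-packet little-endian capture in memory (test data only).
func sampleCapture(payload []byte) io.Reader {
	le, be := binary.LittleEndian, binary.BigEndian
	buf := make([]byte, fileHeaderLen+recordHeaderLen+ethHeaderLen+40)
	le.PutUint32(buf[0:], 0xa1b2c3d4) // magic number
	le.PutUint32(buf[20:], 1)         // link type 1 = Ethernet
	n := uint32(ethHeaderLen + 40 + len(payload))
	le.PutUint32(buf[32:], n) // incl_len
	le.PutUint32(buf[36:], n) // orig_len
	pkt := buf[fileHeaderLen+recordHeaderLen:]
	be.PutUint16(pkt[12:], 0x0800)                  // EtherType IPv4
	pkt[14] = 0x45                                  // version 4, IHL 5 words
	be.PutUint16(pkt[16:], uint16(40+len(payload))) // IPv4 total length
	pkt[23] = 6                                     // protocol TCP
	pkt[46] = 5 << 4                                // TCP data offset 5 words
	return bytes.NewReader(append(buf, payload...))
}

func main() {
	payloads, err := TCPPayloads(sampleCapture([]byte("GET / HTTP/1.1\r\n\r\n")))
	fmt.Printf("%q %v\n", payloads, err)
}
```

The header lengths are read from the packets themselves (IHL and data offset) rather than hard-coded, because both the IPv4 and TCP headers can carry options.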

4.1.5.2 Restoring HTTP requests from TCP payloads

This part explains how to restore HTTP requests from TCP payloads (HTTP here means HTTP/1.0 and HTTP/1.1, not HTTP/2). The full implementation in ln restores both the request and its corresponding response; for ease of understanding, only request parsing is explained here. Once a request has been parsed, it can be re-sent to the old and the new service, and the two binary response streams compared.

A single TCP connection carries multiple payloads (this is a simplification; handling packet loss, retransmission, and the like are code details not covered here). One HTTP request may span several payloads, and the first part of a payload may belong to one HTTP request while the rest belongs to the next. What we have to do is read the byte stream formed by the payloads and split it into HTTP requests according to HTTP message framing. Note that HTTP/2 requests cannot be reassembled this way.
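The reassembly idea can be sketched with the Go standard library, which already knows HTTP/1.x framing. This is an illustration only and not how ln is implemented: it assumes the payloads of one connection are already in order with no loss or retransmission, and the names `ParsedRequest` and `SplitRequests` are hypothetical.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"net/http"
)

// ParsedRequest is a hypothetical summary type used only in this sketch.
type ParsedRequest struct {
	Method, Path string
	Body         []byte
}

// SplitRequests joins the in-order TCP payloads of one connection into a byte
// stream and cuts it into HTTP/1.x requests using HTTP message framing.
func SplitRequests(payloads [][]byte) ([]ParsedRequest, error) {
	br := bufio.NewReader(bytes.NewReader(bytes.Join(payloads, nil)))
	var out []ParsedRequest
	for {
		req, err := http.ReadRequest(br) // parses the request line and headers
		if err == io.EOF {
			return out, nil // stream ended exactly on a request boundary
		}
		if err != nil {
			return out, err
		}
		// Reading the body consumes exactly the bytes that belong to this
		// request, which leaves br positioned at the start of the next one.
		body, err := io.ReadAll(req.Body)
		req.Body.Close()
		if err != nil {
			return out, err
		}
		out = append(out, ParsedRequest{req.Method, req.URL.Path, body})
	}
}

func main() {
	// One request split across two payloads, and a second request that
	// starts in the middle of the second payload.
	payloads := [][]byte{
		[]byte("POST /a HTTP/1.1\r\nHost: x\r\nContent-"),
		[]byte("Length: 3\r\n\r\nabcGET /b HTTP/1.1\r\nHost: x\r\n\r\n"),
	}
	reqs, err := SplitRequests(payloads)
	if err != nil {
		fmt.Println("error:", err)
	}
	for _, r := range reqs {
		fmt.Printf("%s %s body=%q\n", r.Method, r.Path, r.Body)
	}
}
```

The payload boundaries play no role once the bytes are joined: only the request line, the headers, and the declared body length decide where one request ends and the next begins.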

4.2 Some Go language best practices

4.2.1 sync.Pool practice

Go and Java manage memory differently, so the cost of allocating and freeing memory differs too.

In Go, sync.Pool is a powerful tool for reusing memory. It has many advantages, such as fewer memory allocations, fewer system calls, and less GC pressure. But everything has two sides: objects stored in a sync.Pool may be reclaimed without notice, so resources such as database connections are not suitable for it.

In short, sync.Pool reuses memory and reduces machine load, and is well suited to temporary objects.
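A typical use of sync.Pool for temporary objects looks like the following sketch. The pool of `bytes.Buffer` values and the `TileKey` function are assumptions chosen for illustration; the article does not say which objects the gateway pools.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable buffers; New runs only when the pool is empty.
var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

// TileKey builds a cache key using a pooled buffer instead of a fresh allocation.
func TileKey(layer string, x, y int) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()            // a pooled buffer may still hold data from its last use
	defer bufPool.Put(buf) // return it for reuse; never touch buf after this
	fmt.Fprintf(buf, "%s/%d/%d", layer, x, y)
	return buf.String() // String copies the bytes, so the result outlives the buffer
}

func main() {
	fmt.Println(TileKey("road", 12, 34))
	fmt.Println(TileKey("label", 5, 6))
}
```

Because the pool may drop its contents at any time, the code never assumes a buffer obtained from `Get` is one it stored earlier, which is also why stateful resources such as connections do not belong in it.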

4.2.2 Go's byte type

Go's byte type is unsigned (an alias for uint8), while Java's byte type is signed. When migrating a Java service to Go, pay attention to any Java code that compares a byte with zero or tests whether it is positive or negative.
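The difference is easy to see in a few lines. In Java, a check such as `b < 0` on a byte is true for values 0x80 to 0xFF; in Go the same comparison on a `byte` is never true, so the value has to be reinterpreted as `int8` first. The helper name `IsNegativeAsJava` is made up for this example.

```go
package main

import "fmt"

// IsNegativeAsJava reports whether b would be negative if read as a Java byte.
func IsNegativeAsJava(b byte) bool {
	return int8(b) < 0 // equivalent to b >= 0x80
}

func main() {
	var b byte = 0xF0
	// The same bit pattern is 240 as a Go byte and -16 as a signed 8-bit value.
	fmt.Println(b, int8(b), IsNegativeAsJava(b)) // prints: 240 -16 true
}
```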

4.2.3 Efficient conversion between byte slices and strings

Byte slice to string

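// Bytes2String reinterprets b's slice header as a string header; no copy is made.
// Requires: import "unsafe". The result aliases b, so b must not be modified afterwards.
func Bytes2String(b []byte) string {
    return *(*string)(unsafe.Pointer(&b))
}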

String to byte slice

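// String2Bytes builds a slice header (data, len, cap) from s's string header (data, len); no copy is made.
// Treat the result as read-only: string data may live in read-only memory, and writing to it can crash.
func String2Bytes(s string) []byte {
    x := (*[2]uintptr)(unsafe.Pointer(&s))
    h := [3]uintptr{x[0], x[1], x[1]}
    return *(*[]byte)(unsafe.Pointer(&h))
}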

Conversion done this way is very fast, because no new memory is allocated or copied underneath. However, in both directions the string and the byte slice share the same memory, so a change to the bytes changes the string. The user must judge from the business logic whether that is acceptable, and control the lifetime of the data more precisely.

4.2.4 Rewriting Go library functions

For a gateway, the parts that consume the most CPU are hash functions, encoding and decoding, encryption and decryption, serialization and deserialization, and so on. In practice we rewrote the relevant library functions and made many optimizations to reduce CPU load.

To write code that reduces CPU load, we first need to know how the CPU works. Here is a rough outline.

The steps of the CPU pipeline are:

  • Instruction fetch (IF)
  • Instruction decode (ID)
  • Execute (EXE)
  • Memory access (MEM)
  • Register write-back (WB)

We mainly optimize the MEM step: use the CPU cache to cut the clock cycles spent in MEM as far as possible, and thereby reduce CPU load.

Techniques such as NUMA-aware placement and CPU affinity follow the same idea: reduce the clock cycles needed to load data as far as possible.

There are two ways to improve a Go library function: optimize the algorithm itself, and optimize its CPU cache affinity.

We focus on the second. Take base64 encoding and decoding as an example. The byte slice passed in and the byte slice returned are not backed by the same array, so they occupy different memory. This costs extra CPU clock cycles in two places: one is allocating and freeing memory, the other is the CPU cache contention caused by accessing two separate memory blocks (not quite the same thing as false sharing).

What if we reuse the memory passed in, that is, overwrite the input buffer while decoding? Then the problems above go away, and the same work is done in fewer clock cycles. Note that because the function's input and output share the same memory, this places higher demands on the developer: the lifetime of data flowing through the program must be controlled more precisely, and the code needs careful polishing.
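The in-place idea can be illustrated with a small hand-written decoder. This is a sketch under stated assumptions, not the rewritten library function from the article: `DecodeInPlace` is a hypothetical name, it handles only padded standard-alphabet base64, and it does not reject non-canonical padding bits. It works because every 4 input characters produce at most 3 output bytes, so the write position never overtakes the read position.

```go
package main

import (
	"encoding/base64"
	"errors"
	"fmt"
)

// decodeTable maps a base64 character to its 6-bit value; 0xFF marks invalid input.
var decodeTable = func() [256]byte {
	const alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
	var t [256]byte
	for i := range t {
		t[i] = 0xFF
	}
	for i := 0; i < len(alphabet); i++ {
		t[alphabet[i]] = byte(i)
	}
	return t
}()

// DecodeInPlace decodes padded standard base64 held in buf, writes the result
// over the beginning of buf, and returns the decoded prefix. No memory is allocated.
func DecodeInPlace(buf []byte) ([]byte, error) {
	if len(buf)%4 != 0 {
		return nil, errors.New("base64: input length is not a multiple of 4")
	}
	w := 0 // write position, always <= read position r
	for r := 0; r < len(buf); r += 4 {
		// Copy the quad into locals first, so later writes cannot clobber it.
		c0, c1, c2, c3 := buf[r], buf[r+1], buf[r+2], buf[r+3]
		pad := 0
		if r+4 == len(buf) && c3 == '=' { // padding is only legal in the last quad
			pad, c3 = 1, 'A'
			if c2 == '=' {
				pad, c2 = 2, 'A'
			}
		}
		v0, v1, v2, v3 := decodeTable[c0], decodeTable[c1], decodeTable[c2], decodeTable[c3]
		if v0|v1|v2|v3 == 0xFF {
			return nil, errors.New("base64: invalid character")
		}
		n := uint32(v0)<<18 | uint32(v1)<<12 | uint32(v2)<<6 | uint32(v3)
		out := [3]byte{byte(n >> 16), byte(n >> 8), byte(n)}
		w += copy(buf[w:], out[:3-pad])
	}
	return buf[:w], nil
}

func main() {
	buf := []byte(base64.StdEncoding.EncodeToString([]byte("hello, gateway")))
	decoded, err := DecodeInPlace(buf)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s\n", decoded) // buf's original contents are gone at this point
}
```

After the call, the caller must treat the original encoded bytes as destroyed, which is the lifetime discipline the paragraph above warns about.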

5. Future Outlook

The next step for the gateway is cloud native, implemented with a Service Mesh. First, this addresses the shortcomings of the current centralized gateway. Decentralization improves the stability of the access layer, shrinks the blast radius, strengthens isolation, and allows finer-grained control.

Secondly, it reduces machine cost. According to our current internal stress tests and stress-test results published in the industry, cost falls further after meshing, and further still once the overhead of the existing RPC framework itself is taken into account. In addition, the data-plane proxy keeps being optimized, so performance will improve and the machine load added by the two extra hops will shrink further.

Finally, the network-layer capability set is greatly enhanced. The gateway can drive the meshing of upstream services, eventually producing a superset of capabilities across the whole network layer.

The capabilities of existing Service Mesh frameworks can be summarized as Connect, Secure, Control, and Observe. They are a superset of the current gateway's capabilities and make possible things that could not be done before, most obviously Observe. It greatly improves the observability of services across the full call chain, which is a great help for service stability work and for quickly locating full-link faults.

There is a long way to go on all of the above. We will also run more cloud-native pilots and rollouts. Engineers know that it is a long road from technology selection to prototype to real business adoption. But if the direction is right, the distance is nothing to fear.

Sincerely recruit fellow travelers

The author's team is looking for talent, and hopes that enthusiastic engineers will join us to build interesting things together. All technology stacks are welcome. If you are interested, please send your resume to gdtech@alibaba-inc.com with the subject line: Name-technical direction-from Gaode technology.

Happy Hacking!

