Routing Service Design in a Multi-Active Architecture

1. Background

As the company's business grows, each stability failure has a larger and larger impact. Providing stable services and keeping the system highly available has become a problem for the entire technical department. Against this background, the company launched multi-cloud/multi-active technology projects, and I had the honor of taking part in designing the remote active-active transformation of the "Next Day Delivery" project [1]. In this article I would like to share my understanding of some technical solutions for multi-active and even globalized deployment.

This series on multi-active architecture is organized into five major parts: the overall technical solution, active-active/global regionalized deployment technology, network scheduling technology, performance optimization, and SRE. This post focuses on the overall technical solution and on the routing service design module within the active-active/global regionalized deployment technology. Follow-up posts will gradually fill in the rest of the multi-active technical solution.

2. Technical requirements for multi-active/globalization

Beyond meeting users' basic requirements for performance and availability, network services in a multi-active/globalized setting must also satisfy compliance and data isolation requirements, and each of these requirements faces new challenges.

1. Performance

The shorter the time from the user initiating a request to receiving the response, the better the performance. In an active-active/globalized setting, however, the user may be in Japan while the data center is in China. The physical distance is longer, and the response time grows with it. Test data shows that for cross-country or long-distance cross-data-center calls, the network RTT increases by about 1s, and this 1s may reduce the transaction "conversion rate" or even lose users.

2. Availability

A multi-active/global business spans time zones, so our services must be available 24 hours a day. This is a challenge not only for the system but also for the people who run it.

3. Interconnection

Interconnection refers to the physical connection between telecommunication networks that allows users of one operator to communicate with users of another. Such cross-operator communication is not a big problem in China, but in many other countries the quality of network interconnection is still far from ideal.

4. Data Consistency

When data is shared by users around the world, users in multiple locations can read and write it. How do we guarantee data consistency?

5. Privacy Protection

Global business must comply with the GDPR (General Data Protection Regulation).

6. Scalability

Following Wikipedia's explanation: the ability of a system, network, or process to cope when its workload increases or decreases.

3. The overall architecture of multi-active/globalization

3.1 Domain Modeling

The core of the system is handling the relationships between domain models. We start from the domain model so that the whole system can meet both current and future needs, and so that the project team can collaborate better. The core objects of the system are listed below.

User: A user of the website platform.

UserGroup: User group. Users with the same characteristics generally share the same network link and data center scheduling policy, and therefore belong to the same user group. The user group is the basic scheduling unit of the multi-active system. A user is a member of a user group, and we can schedule either by user group or for a specific user.

EdgeNode: Edge node. It can be understood as a node that serves static resources such as images, JS files, and CSS. In general, an edge node means a CDN node.

NetLink: Network link.

NetNode: Network node.

PoP: Point of presence, a network service access point (nodes such as routers and switches).

DSA: Dynamic Site Acceleration, an accelerator for accessing dynamic content provided by CDN vendors.

IDC: Internet data center (the "computer room" that hosts the services).

DNS: DNS server.

HTTP-DNS: DNS resolution over HTTP. It can be understood as the DNS server for the App side.

DomainName: Domain name.

VIP: Virtual IP.

The relationships between the domain models above are shown in the figure.

3.2 Overall Architecture

From the description above, a request travelling from the user to the data center passes through edge node scheduling, network scheduling, and data center scheduling. In addition, there is the execution of scheduling (the use of routes) and the control of scheduling (the generation of routes). Here, routing means the assignment of each user or user group to a data center. The structure is shown in the figure.

Global users: The system divides global users into user groups by region: Continent A, Continent B, Continent C, and Continent D. When scheduling is executed, the usual process is to push the scheduling information to the App of each user group.
Edge scheduling: Scheduling of static content. It determines which edge node each user group should use.
Network scheduling: Based on big data, statistics for each feasible link are collected in real time, and a decision model determines which link to take. (This is unrelated to the routing discussed in this article.)
Data center: The origin that actually provides the service.
Scheduling execution: PCs are scheduled through DNS, and Apps are scheduled through HTTP-DNS and PUSH. (Use of routes.)
Scheduling control: Determines the specific scheduling through real-time data calculation. (Generation and configuration of routes.)

4. Multi-active/globalized regional deployment technology

4.1 Overall Architecture

4.1.1 Function priority

The specific scheduling strategy is determined by business requirements. Generally, compliance, data consistency, scalability, cost, capacity, performance, and stability are all considered. The usual order of importance is:

Compliance > Data Consistency > Scalability > Other

4.1.2 Deployment Architecture

At present we have four data centers (Continent A, Continent B, Continent C, Continent D), each serving the users of the region in which it is located. Data is replicated between data centers on demand. All applications and databases are deployed in every data center, so the data centers are peers. Once the data is backed up, every data center can act as a disaster recovery site for the others. Data consistency and scalability are introduced later.

4.1.3 Problem Analysis

Take cross-region buying and selling, nearest access, and remote disaster recovery in an e-commerce scenario as examples. When the buyer and the seller are in different regions, a transaction inevitably touches shared data and raises consistency problems. Remote disaster recovery faces data consistency problems too. And when a user moves to another region, nearest access must still be guaranteed, which involves the consistency of that single user's data.

Our strategies are as follows:

Nearest access: Under normal circumstances a user is routed to a fixed data center, and the user's data is kept closed within that data center.
Remote disaster recovery: Since the applications are peers, it is only necessary to ensure that the data also exists in the backup data center.
Global buying and selling: Data is synchronized on demand, and product information is synchronized to all data centers.
Data consistency: Follow the single-master principle, that is, a given piece of data is modified in only one data center. Business priority is preserved (buyers > sellers > operations).

4.1.4 Solutions

From an application layering perspective, the solution is shown in the figure.

4.2 Routing Service

The essence of regionalized deployment is multi-layer routing: at every layer, calls are routed according to the home data center of the user. The role of the routing service is to tell the caller which data center a user belongs to.

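Routing service structure

- In-memory routing table: can be thought of as a HashMap whose key is the user ID and whose value is the user's home data center and the user's status.
- RPC service.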

How routing tables are used

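- The first application a user request reaches when it enters a data center is the unified access layer. Nginx serves as the unified access application. The routing table is embedded in Nginx and shared among its processes. The first thing Nginx does after accepting a request is obtain the user ID and then look up the routing table to get the user's home data center and status. If the user belongs to the local data center, the request is passed on downstream together with this routing information.
- Downstream services that need routing information read it directly from what the upper layer passed down, as shown in the figure.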

Passed-through routing information has a time limit: once the limit is exceeded, the passed-through content becomes invalid. What happens if the user's route changes while it is being passed through? This is answered later in this post.

4.2.1 Principle of Routing Table

The routing table design must satisfy the following points:

- It must be kept in memory.
- Performance and throughput must be guaranteed.
- It cannot depend on third-party systems.
- The routing design should be freely upgradable.

4.2.1.1 Scheme comparison

The candidates compared below are a distributed cache, a HashMap, and a Bloom filter. Each has its own drawbacks.

4.2.1.1.1 Introducing Distributed Cache

Drawbacks

- Every system must call the remote cache, which creates heavy dependence on it.
- When a user's home data center changes, both the client cache and the remote cache need to be updated.
- Every system gains an additional strong dependency.

4.2.1.1.2 HashMap

Drawbacks

- Storing 50 million records takes about 2 GB of memory.

4.2.1.1.3 Bloom Filter

Drawbacks

- There are false positives.

None of the existing solutions fits as is, so the routing table has to be customized for this scenario.

4.2.1.2 Routing Table Design

Inspired by the above, we choose a bit array to store the routing information, using 4 bits per user, as shown in the figure.

In this way, storing 100 million users takes only about 47 MB of memory.
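As a rough illustration of the 4-bit layout, the sketch below packs two users per byte in a Python `bytearray`. The class name `NibbleRoutingTable` and its methods are hypothetical. The split into 3 bits of home data center plus 1 write-forbidden bit follows the note in section 4.2.2.1, and placing the data center in the upper three bits of each 4-bit group is an assumption.

```python
from __future__ import annotations


class NibbleRoutingTable:
    """Hypothetical routing table that stores 4 bits per user ID."""

    def __init__(self, max_user_id: int) -> None:
        # IDs 0..max_user_id inclusive; two users share one byte, so round up.
        self._data = bytearray((max_user_id + 2) // 2)

    def set(self, user_id: int, room: int, write_forbidden: bool = False) -> None:
        if not 0 <= room < 8:
            raise ValueError("room must fit in 3 bits")
        # Assumed layout: 3 bits of data center, then 1 write-forbidden bit.
        nibble = (room << 1) | int(write_forbidden)
        index, odd = divmod(user_id, 2)
        shift = 4 if odd else 0
        # Clear the old 4 bits, then write the new ones.
        self._data[index] = (self._data[index] & ~(0x0F << shift) & 0xFF) | (nibble << shift)

    def get(self, user_id: int) -> tuple[int, bool]:
        index, odd = divmod(user_id, 2)
        nibble = (self._data[index] >> (4 if odd else 0)) & 0x0F
        return nibble >> 1, bool(nibble & 1)

    def size_bytes(self) -> int:
        return len(self._data)


table = NibbleRoutingTable(max_user_id=999)
table.set(7, room=3)
table.set(6, room=5, write_forbidden=True)
print(table.get(7), table.get(6), table.size_bytes())
```

At 4 bits per user, 100 million users need 50,000,000 bytes, which is roughly 47.7 MiB and matches the figure quoted above.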

But what if user IDs are distributed in segments, for example:

- 0 ~ 80000000
- 100000000 ~ 300000000
- 700000000 ~ 800000000
- 2000000000 ~ 2000100000

Although the real number of users is only about 100 million, the IDs are spread over such a wide range that about 900 MB of memory is consumed.

Based on this, segment mode is introduced:

Segment mode

Segment mode is shown in the figure. The core idea is to build a segment index table in which each index entry points to a bit sequence that stores user information (for example, one segment per 1 million users). Index entries whose ID range contains no users are set to a NULL segment, and no storage is allocated for their bit sequences, which saves a great deal of memory. With this scheme, the same set of registered users consumes only about 58 MB.
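The segment index can be sketched as follows. This is a minimal illustration, not the original implementation: the class `SegmentedRoutingTable` is hypothetical, and a Python `dict` stands in for the segment index table, with a missing key playing the role of a NULL segment. Segments are allocated lazily on first write.

```python
from __future__ import annotations

SEGMENT_SIZE = 1_000_000  # users per segment, as in the example above


class SegmentedRoutingTable:
    """Hypothetical segment index over lazily allocated 4-bit arrays."""

    def __init__(self, segment_size: int = SEGMENT_SIZE) -> None:
        self._segment_size = segment_size
        # Segment number -> packed 4-bit entries; a missing key is a NULL segment.
        self._segments: dict[int, bytearray] = {}

    def set(self, user_id: int, value: int) -> None:
        if not 0 <= value <= 0x0F:
            raise ValueError("value must fit in 4 bits")
        seg_no, offset = divmod(user_id, self._segment_size)
        segment = self._segments.get(seg_no)
        if segment is None:
            # Allocate storage only when the segment receives its first user.
            segment = bytearray((self._segment_size + 1) // 2)
            self._segments[seg_no] = segment
        index, odd = divmod(offset, 2)
        shift = 4 if odd else 0
        segment[index] = (segment[index] & ~(0x0F << shift) & 0xFF) | (value << shift)

    def get(self, user_id: int) -> int | None:
        seg_no, offset = divmod(user_id, self._segment_size)
        segment = self._segments.get(seg_no)
        if segment is None:
            return None  # NULL segment: no storage and no routing entry
        index, odd = divmod(offset, 2)
        return (segment[index] >> (4 if odd else 0)) & 0x0F

    def allocated_bytes(self) -> int:
        return sum(len(segment) for segment in self._segments.values())


table = SegmentedRoutingTable()
table.set(2_000_000_050, 0b0110)
print(table.get(2_000_000_050), table.get(900_000_000), table.allocated_bytes())
```

Only the one segment that actually holds a user is allocated here, so memory follows the number of populated segments rather than the largest user ID. Note that a user never written inside an allocated segment reads as 0 in this sketch.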

4.2.1.3 Routing table related design

The previous section settled the basic storage scheme for the routing table, but some scenarios still need further design. Consider two questions:

When a data center fails and a disaster recovery switchover is needed, implementing it directly on the existing routing table means changing the routing information of every user in that data center. This may involve tens or hundreds of millions of user changes, which is very costly.

In the Double Eleven scenario, the distribution of user behavior can be planned with big data, but Double Eleven happens only once a year and training samples are few, so actual user behavior can easily deviate from expectations. As a result, the data center in Continent A may run short of capacity while the data center in the United States has plenty to spare, and some Continent A users need to be offloaded to the US data center. How can the current routing table support this? Based on these scenarios, we propose a concept called the "logical data center".

- When everything is normal, a logical data center maps directly to its original physical data center.
- During a disaster recovery switchover, the logical data center maps directly to the disaster recovery data center.
- When some users need to be offloaded, a hash modulo is computed on the user ID, and users with different hash results are mapped to different physical data centers.
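The three cases above can be sketched with a single lookup. Everything named here is an assumption for illustration: the configuration shape, the names `logical-A` and `dc-A`, the percentage weights, and the use of CRC32 as the hash. The source only states that a hash modulo on the user ID decides the physical data center.

```python
from __future__ import annotations

import zlib

# Hypothetical centrally managed configuration: each logical data center maps
# to a list of (physical data center, weight) pairs.
LOGICAL_TO_PHYSICAL: dict[str, list[tuple[str, int]]] = {
    "logical-A": [("dc-A", 70), ("dc-B", 30)],  # offloading: part of A's users go to B
    "logical-B": [("dc-B", 100)],               # normal: one-to-one mapping
    "logical-C": [("dc-D", 100)],               # disaster recovery: C remapped to D
}


def resolve_physical(logical: str, user_id: int) -> str:
    """Map a user's logical data center to a physical one."""
    targets = LOGICAL_TO_PHYSICAL[logical]
    total = sum(weight for _, weight in targets)
    # CRC32 is stable across processes; Python's built-in hash() of str is salted.
    bucket = zlib.crc32(str(user_id).encode()) % total
    for physical, weight in targets:
        if bucket < weight:
            return physical
        bucket -= weight
    raise AssertionError("unreachable: weights cover every bucket")


print(resolve_physical("logical-B", 42), resolve_physical("logical-C", 42))
```

With this indirection, a disaster recovery switchover or a partial offload changes one configuration entry instead of rewriting the routing entries of every affected user.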

The specific mapping can be managed centrally in whatever configuration system each company uses.

4.2.2 Routing table update mechanism

The routing table update mechanism must satisfy the following design constraints:

Data consistency: While the routing table is being changed, a user's home information may be inconsistent across data centers or machines, and this must not lead to inconsistent business data.
Recoverable and rollback-capable: Whatever state the system is in, it can be deterministically restored to a desired state.
Fast changes: Ensuring consistency, recovering, or rolling back affects the user experience and may even make the system unavailable, so the change must complete in a very short time.

4.2.2.1 Data Consistency Ideas

Distributed systems often have to solve one problem: how to make a modification to a record take effect in all (or several) data centers at the same time. The solution is not complicated and is quite general. Although we cannot guarantee that a change takes effect everywhere at the same instant, we can know whether it has taken effect in each data center. On that basis we introduce an intermediate state that is compatible with both the state before the change and the state after it, which solves the problem.

As shown in the figure, state A is the state before the change and state C is the target state after it. A and C must never coexist. So state A is first changed to state B, we wait until every relevant machine in every data center has switched to B, and only then change B to C. In this way A and C never appear at the same time.

To keep business data globally consistent during a routing update, we introduce a "write-forbidden" transition version. Before switching to the target data center, we first set the route to the "write-forbidden" transition version. In this state the user cannot perform any action that modifies the relevant business data, in the current data center or in any other. Before the "write-forbidden" transition version is changed to the new version, every local copy of the routing table must already have been upgraded to the "write-forbidden" transition version. The transition version strictly separates the effective periods of the old and new routing versions, so there is never a moment at which both are in effect, which guarantees the global consistency of business data.

(Note: While a user is in the "write-forbidden" transition version, the availability of some services for that user is affected. The business should accept this impact and regard it as a local, temporary degradation of availability. The degradation is scheduled for periods when the user is inactive, so it usually has little effect on the user experience.)

Returning to the routing table: the user ID itself does not need to be stored. Of the 4 bits for each user, the first three hold the home data center and the fourth is the write flag: 0 means the user can write, and 1 means the user is write-forbidden.
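The A -> B -> C transition can be checked with a small simulation. This is only a sketch under simplifying assumptions: the nodes are entries in a list, they switch one at a time, and the function name `migrate` is hypothetical. It records which states coexist after every single-node switch.

```python
from __future__ import annotations

OLD, FORBIDDEN, NEW = "A", "B", "C"  # state B is the write-forbidden transition


def migrate(nodes: list[str]) -> list[set[str]]:
    """Move every node A -> B -> C and record the states that coexist after each step."""
    history: list[set[str]] = []
    for target in (FORBIDDEN, NEW):
        # Barrier: C may only be published once every node reports B.
        if target == NEW and any(state != FORBIDDEN for state in nodes):
            raise RuntimeError("refusing to publish C before every node reports B")
        for i in range(len(nodes)):
            nodes[i] = target  # nodes switch one at a time, never simultaneously
            history.append(set(nodes))
    return history


history = migrate([OLD, OLD, OLD])
# A and C never appear in the same snapshot.
print(any({OLD, NEW} <= snapshot for snapshot in history))  # False
```

Every intermediate snapshot contains either A with B or B with C, which are the compatible pairs, so no request can observe the old and the new route at the same time.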

4.2.2.2 Solutions

4.2.2.2.1 Separation of data preparation and activation process

The "write-forbidden" state affects users: a write-forbidden user cannot, for example, place an order. Making changes while the user is inactive lowers the probability of affecting them, and this probability can be reduced further by separating the data preparation process from the activation process.

Data preparation writes the users' home information into a distributed persistent database.
Because fast rollback is required, the write must be multi-version, which means the data in the persistence layer must be versioned. Once the data is prepared, ZooKeeper's watch mechanism is used to trigger the version activation process. The process is as follows:

- When a routing change is required, the routing change control program writes the data into the database and assigns it a version number.
- After the data is ready, the version number is written to the watched ZooKeeper node, and every watcher receives a push.
- Each machine that needs to load the routing table reads the data for that version from the database and loads the new routing table.
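The separation of preparation and activation can be sketched with in-memory stand-ins. None of these classes is a real ZooKeeper or database API: `VersionedRouteStore` plays the multi-version database, `VersionNode` plays the watched node, and `RouteLoader` plays a machine that loads the routing table. Real ZooKeeper watches are traditionally one-time triggers that must be re-registered, whereas this stand-in keeps callbacks registered for brevity.

```python
from __future__ import annotations

from typing import Callable


class VersionedRouteStore:
    """Stand-in for the multi-version persistent database."""

    def __init__(self) -> None:
        self._versions: dict[int, dict[int, int]] = {}

    def write(self, version: int, routes: dict[int, int]) -> None:
        self._versions[version] = dict(routes)  # preparation only, nothing goes live

    def read(self, version: int) -> dict[int, int]:
        return dict(self._versions[version])


class VersionNode:
    """Stand-in for the watched node that holds the current version number."""

    def __init__(self) -> None:
        self._watchers: list[Callable[[int], None]] = []

    def watch(self, callback: Callable[[int], None]) -> None:
        self._watchers.append(callback)

    def publish(self, version: int) -> None:
        for callback in self._watchers:  # activation: push the version to every watcher
            callback(version)


class RouteLoader:
    """A machine that reloads its routing table when a version is pushed."""

    def __init__(self, store: VersionedRouteStore, node: VersionNode) -> None:
        self._store = store
        self.version: int | None = None
        self.routes: dict[int, int] = {}
        node.watch(self._on_version)

    def _on_version(self, version: int) -> None:
        self.routes = self._store.read(version)
        self.version = version


store, node = VersionedRouteStore(), VersionNode()
loader = RouteLoader(store, node)
store.write(1, {42: 3})   # prepare version 1
node.publish(1)           # activate version 1
store.write(2, {42: 5})   # prepare version 2; loaders are unaffected so far
node.publish(2)           # activate version 2
node.publish(1)           # fast rollback: republish the old version number
print(loader.version, loader.routes)
```

Because old versions stay in the store, a rollback only needs to publish the previous version number again.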

4.2.2.2.2 Specific consistency scheme

ZooKeeper is a high-performance distributed coordination tool used for communication between nodes and is often used for distributed configuration management. Most companies use this kind of data consistency solution when building routing tables.

Distributed coordination often relies on ephemeral nodes. An ephemeral node is bound to the session that created it: when the session ends, the node disappears. This mechanism is commonly used for heartbeat checks. When building the routing table, every machine that needs to watch the routing table creates an ephemeral node, which serves as the heartbeat of that routing-table-loading machine. A single change proceeds as follows:

- Every node sets a watcher on the currentVersion node in ZK to receive pushes of the latest version.
- Every node creates an ephemeral node named after its machine name under the SessionList directory, indicating that it is listening for changes. When the session ends, the ephemeral node disappears, indicating that the machine is no longer listening.
- When a node is pushed a new version, it uses the version number to query the data in the distributed database (the data prepared in 4.2.2.2.1).
- After the data has been fetched and the routing table initialized in local memory, the node writes its machine name as a node into the subdirectory for the current version under the AckList directory, indicating that it has been updated to that version.
- The change program checks whether the machine nodes in the current version's subdirectory under AckList cover all machine nodes in the SessionList directory. If they do, every node has been updated to the latest version.

Because we know whether all nodes have been updated, and because the "write-forbidden" state is compatible with both the old and the new state, we can switch to the new version's routing information only after all nodes have reached the "write-forbidden" state. The states that coexist at any moment are therefore always mutually compatible, which guarantees data consistency.
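The coverage check performed by the change program reduces to a set comparison. In the sketch below the two directories are passed in as callables returning sets of machine names, so no ZooKeeper client is involved. The function names, the polling loop, and the timeout values are assumptions for illustration.

```python
from __future__ import annotations

import time
from typing import Callable


def missing_acks(sessions: set[str], acks: set[str]) -> set[str]:
    """Machines that are listening (SessionList) but have not acknowledged (AckList)."""
    return sessions - acks


def wait_for_full_ack(
    list_sessions: Callable[[], set[str]],
    list_acks: Callable[[], set[str]],
    timeout_s: float = 5.0,
    poll_s: float = 0.01,
) -> bool:
    """Poll until every listening machine has acknowledged, or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        # Re-read SessionList on every pass: ephemeral nodes vanish when a machine dies.
        if not missing_acks(list_sessions(), list_acks()):
            return True
        time.sleep(poll_s)
    return False


sessions = {"host-1", "host-2"}
acks = {"host-1"}
print(missing_acks(sessions, acks))  # {'host-2'}
acks.add("host-2")
print(wait_for_full_ack(lambda: sessions, lambda: acks))  # True
```

Only when this check succeeds for the "write-forbidden" version would the change program go on to publish the target version.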

The directory node structure of ZK described in the above steps is as follows:

4.2.2.3 Overall Architecture

The key technical details have been covered above, so the overall architecture is described here. A management and control system is responsible for the regionalized management of all data centers, including the routing table change process. Each data center runs a control Agent, and the management and control system manages the data centers by calling these Agents. During a routing change, each Agent writes the routing data into the distributed database of its data center, and ZK then pushes the version to the machines that load it, as already described.

4.2.2.4 Change Process

When the routing table changes, the complete process is as follows:

1. Save the current version number V1 for use in rollback.
2. Get the list of all current data centers and call the Agent of each data center in a loop, executing the following steps in turn.
3. The Agent in each data center uses the scheme described above to write the data into the distributed database.
4. If this fails, go directly to step 8.
5. Get the list of all current data centers again and call the Agent of each data center in a loop to modify the user status.
6. Use the specific consistency scheme (4.2.2.2.2) to change all user states to the final state.
7. If this fails, go directly to step 8. If it succeeds, the process ends.
8. Call the Agent of each data center in a loop to roll back the version. If the rollback fails, intervene manually.
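The control flow of the eight steps can be sketched as below. `RecordingAgent` is a hypothetical test double rather than a real Agent interface, the action names are invented, and reading "modify the user status" as setting the write-forbidden state is an assumption based on section 4.2.2.1.

```python
from __future__ import annotations

from typing import Iterable


class RecordingAgent:
    """Hypothetical per-data-center Agent that records calls and can be told to fail."""

    def __init__(self, name: str, fail_on: Iterable[str] = ()) -> None:
        self.name = name
        self.fail_on = set(fail_on)
        self.calls: list[str] = []

    def run(self, action: str, version: int) -> bool:
        self.calls.append(action)
        return action not in self.fail_on


def change_routes(agents: list[RecordingAgent], current: int, target: int) -> str:
    saved = current  # step 1: remember V1 for rollback
    # steps 2-3: prepare data; step 5: set write-forbidden; step 6: final state
    for action in ("prepare_data", "forbid_writes", "finalize"):
        for agent in agents:
            if not agent.run(action, target):
                # steps 4 and 7: any failure goes straight to step 8.
                # Roll back every data center, even if one rollback fails.
                results = [a.run("rollback", saved) for a in agents]
                return "rolled_back" if all(results) else "manual_intervention"
    return "done"


agents = [RecordingAgent("dc-A"), RecordingAgent("dc-B", fail_on={"forbid_writes"})]
print(change_routes(agents, current=1, target=2))  # rolled_back
print(agents[0].calls)  # ['prepare_data', 'forbid_writes', 'rollback']
```

The rollback loop deliberately calls every Agent instead of stopping at the first failure, so a single failing data center does not leave the others on the new version.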

4.2.3 User routing update scheme

The update mechanism of the routing table was introduced above. But how do we determine which data center a user belongs to? How do we change a user's home data center? How are the website's existing users added to the routing table? And how are new users added?

4.2.3.1 Determine the user's home data center

In real applications, the home data center is mostly chosen on a performance-first principle, which is basically equivalent to assigning each user to the data center with the lowest access latency. In most scenarios, the data center with the lowest latency is also the physically closest one.

We determine a user's home data center as follows:

- Each user asynchronously accesses all data centers to measure the latency between that user and each data center.
- Statistics are aggregated at the granularity of the user's region to find the most stable data center for that region.
- In the routing table, every user in the region is associated with the data center that performs best for the region as a whole.
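The region-level aggregation might look like the sketch below. The source does not say which statistic defines "most stable" or "best performing", so using the median latency is an assumption, as are the function name and the sample format.

```python
from __future__ import annotations

from collections import defaultdict
from statistics import median


def best_room_per_region(samples: list[tuple[str, str, float]]) -> dict[str, str]:
    """samples holds (region, data_center, latency_ms) probes collected from users."""
    by_region: dict[str, dict[str, list[float]]] = defaultdict(lambda: defaultdict(list))
    for region, data_center, latency_ms in samples:
        by_region[region][data_center].append(latency_ms)
    # For each region, pick the data center with the lowest median latency.
    return {
        region: min(rooms, key=lambda dc: median(rooms[dc]))
        for region, rooms in by_region.items()
    }


probes = [
    ("JP", "dc-A", 40.0), ("JP", "dc-A", 60.0), ("JP", "dc-B", 120.0),
    ("EU", "dc-A", 200.0), ("EU", "dc-B", 30.0),
]
print(best_room_per_region(probes))  # {'JP': 'dc-A', 'EU': 'dc-B'}
```

Every user in a region would then be written to the routing table with that region's chosen data center.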

4.2.3.2 Change the user's home data center

Once the user's home data center has been determined and differs from the original one, the user's home data center must be changed. As described above, during the change the routing table entry is first rewritten to the forward- and backward-compatible "write-forbidden" state, which ensures that changing the routing table itself does not cause data inconsistency. However, before the user goes from "write-forbidden" back to "writable", the user's data must also be copied from the original data center to the target data center, and the copy must be confirmed complete. The data replication technology involved is not discussed here and will be described in later articles.

4.2.3.3 Change Optimization - Time-sharing Change

Since write-forbidden may affect users, we need to choose the time of the change so as to minimize the probability of affecting them. The main method is to find the time when the user is most likely to be idle.

(1) Divide the day into one-hour periods and give each period an ID.

| Period ID | Period |
| --- | --- |
| 0 | 0-1 |
| 1 | 1-2 |
| 2 | 2-3 |
| 3 | 3-4 |
| 4 | 4-5 |
| ... | ... |
| 23 | 23-24 |

(2) Set a weight for each kind of user behavior. The weight represents how strongly write-forbidden affects the user during that behavior.

| Event | Weight |
| --- | --- |
| browse | 0.2 |
| ordering | 0.8 |

(3) Suppose the operation records of user abc over a period of time are as follows. The conflict value of each time period is calculated with the method below.

| userid | event | Count in period 0 | Count in period 1 | ... | Count in period 23 | Total |
| --- | --- | --- | --- | --- | --- | --- |
| abc | browse | 1 | 2 | | 1 | 4 |
| abc | ordering | 4 | 2 | | 2 | 8 |

$P(0) = \frac{1}{1+2+1} \times 0.2 + \frac{4}{4+2+2} \times 0.8 = 0.45$, so the conflict value of period 0 is 0.45.

$P(1) = \frac{2}{1+2+1} \times 0.2 + \frac{2}{4+2+2} \times 0.8 = 0.3$, so the conflict value of period 1 is 0.3.

$P(23) = \frac{1}{1+2+1} \times 0.2 + \frac{2}{4+2+2} \times 0.8 = 0.25$, so the conflict value of period 23 is 0.25.

The higher the value, the more this time period should be avoided.
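The calculation can be reproduced in a few lines. The function name `conflict_values` and the input layout are hypothetical. The counts and weights are the ones from the tables above, restricted to the three periods whose counts are shown.

```python
from __future__ import annotations


def conflict_values(
    counts: dict[str, dict[int, int]], weights: dict[str, float]
) -> dict[int, float]:
    """counts maps event -> {period_id: count}; returns period_id -> conflict value."""
    result: dict[int, float] = {}
    for event, per_period in counts.items():
        total = sum(per_period.values())
        if total == 0:
            continue  # an event that never happened contributes nothing
        for period_id, n in per_period.items():
            # Share of this event falling in the period, scaled by the event weight.
            result[period_id] = result.get(period_id, 0.0) + weights[event] * n / total
    return result


counts = {
    "browse": {0: 1, 1: 2, 23: 1},
    "ordering": {0: 4, 1: 2, 23: 2},
}
weights = {"browse": 0.2, "ordering": 0.8}
values = conflict_values(counts, weights)
print({period: round(value, 2) for period, value in values.items()})
# {0: 0.45, 1: 0.3, 23: 0.25}
print(min(values, key=values.get))  # 23
```

Among the three periods shown, period 23 has the lowest conflict value and is therefore the best candidate window for changing this user's route.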

4.2.3.4 Stock update plan

The stock update plan covers two scenarios:

- The program has just been launched.
- A machine has just started.

Both scenarios amount to recalculating the home data center of every user that currently exists in the system. Based on what was introduced above, the solution currently adopted is the one described in "Determine the user's home data center" (4.2.3.1).

One optimization deserves special mention: default attribution. We designate one data center as the default. Users who belong to it do not need to be added to the routing table. When the routing service is asked for such a user's route, the routing table returns a null value and the routing service simply returns the default data center. This greatly reduces the size of the routing table.

4.2.3.5 Incremental update plan

The incremental update plan also covers two scenarios:

- User registration
- User migration

In the first case, a newly registered user is assigned to the default data center and no routing table change is made. The subsequent process is the same as in the second case.

In the second case, multi-data-center probing for the new user may show that the user does not belong to the default data center, that is, the default data center is not the fastest one for the newly registered user. The user then needs to be migrated, which is an incremental update. Once the new home is confirmed, the incremental update proceeds in the same way as the stock update. The difference is that an incremental update changes far fewer users, while the stock update program only needs to be run a few times.

5. Summary

This article introduced the basic concepts involved in a remote multi-active/globalization transformation, along with the domain modeling and the storage optimization of the routing system. We will continue to publish more on remote multi-active/globalization. You are welcome to follow the "Dewu Technology" public account.

Notes:

[1] Next-day delivery (Leadtime, LT) is a fulfillment commitment product launched by Dewu. Its core logic is to match the routes configured in the back end against the shipping park, the receiving city, and the product attributes, and thereby promise users whether a product supports next-day delivery.

Text｜FUGUOFENG

