1. Background

This article will not explain what geo-distributed multi-active architecture is or why you would build it. There are many related patterns, such as same-city dual-active, three data centers in two locations, five data centers in three locations, and so on. If you are interested in these disaster recovery architecture patterns, see the earlier article "On the Architecture Patterns of Business-Level Disaster Recovery".

Before diving in, let's clarify some background so that the rest of the article is easier to follow.

1.1 Computer room division

The first phase of Dewu's multi-active transformation involves two computer rooms, computer room A and computer room B. Most of the figures in this article are labeled accordingly, indicating the two different computer rooms.

We define computer room A as the central computer room, i.e. the room that was already in use before multi-active went online; whenever the central computer room is mentioned, it refers to room A. Computer room B is described as the unit computer room; whenever the unit computer room is mentioned, it refers to room B.

1.2 Unitization

Unitization is simple to picture: think of a unit as a computer room within which a business flow completes a closed loop. For example, a user opens the app, browses products, confirms an order, places the order, pays, and checks the order details. The whole flow completes within one unit, and the data is stored in that unit as well.

There are two reasons to unitize: disaster tolerance and higher system concurrency. The scale of computer room construction and the cost of technology, hardware, and other investment must also be weighed; I won't go into the details here.

2. Transformation points

Before going through the transformation points, let's look at the current single computer room setup, which helps explain why these changes are needed.

As shown in the figure above, a request from the client first goes to SLB (load balancing), then to our internal gateway, and is then distributed by the gateway to specific business services. Business services depend on middleware such as Redis, MySQL, MQ, Nacos, and so on.

Since we are doing multi-active across sites, there must be computer rooms in different regions, i.e. the central computer room and the unit computer room. What we want to achieve is shown below:

Looking at the picture above, you may think this is simple: it's just common middleware plus one extra computer room, how hard can it be? If you think so, I can only say: you're thinking too small.

2.1 Traffic Scheduling

A user's request starts from the client. Which computer room should it go to? That is the first thing we need to solve.

Before multi-active, the domain name resolved to a single computer room. With multi-active, the domain name resolves randomly to different computer rooms. Purely random resolution is clearly problematic: calls to the services themselves don't matter because they are stateless, but the storage the services depend on is stateful.

We run an e-commerce business. Suppose a user places an order in the central computer room and then jumps to the order details page, and that request lands in the unit computer room. Because the underlying data is synchronized with some delay, an error is returned: the order does not exist. The user is stunned: the money was paid, but the order is gone.

Therefore, for a given user, the business flow should close within one computer room as much as possible. To solve traffic scheduling, we developed the DLB traffic gateway based on OpenResty. DLB connects to the multi-active control center and knows which computer room the current user belongs to; if the user does not belong to the current computer room, DLB routes the request directly to the DLB in the computer room the user does belong to.

If every request first lands in an arbitrary computer room and is then corrected by DLB, cross-computer-room hops are inevitable and latency increases. We therefore made an optimization together with the client: after DLB corrects a request, it returns the IP of the user's computer room to the client in a response header, so subsequent requests from the client can go directly to that IP.

If the computer room the user is currently accessing goes down, the client falls back to the original domain-name access and resolves to the surviving computer room through DNS.

2.2 RPC framework

Once a user's request reaches the unit computer room, in theory all subsequent operations complete in that computer room. As mentioned earlier, a user's request should close within one computer room as much as possible, but not always.

Some business scenarios, such as inventory deduction, are not suitable for unitization. That is why one computer room is designated as the central computer room: businesses that are not made multi-active are deployed only there, so deducting inventory requires a cross-computer-room call.

How does a request in the central computer room learn about the services in the unit computer room? Our registry (Nacos) performs two-way synchronization, so every computer room has the service information of all computer rooms.

With registration information replicated in both directions, central services are simply called across computer rooms. Unit services, however, have service information from multiple computer rooms, and without control a call could land in the wrong computer room, so the RPC framework needs to be transformed.

2.2.1 Define the route type

  1. Default routing

A request to the central computer room first calls the service in the central computer room. If the central computer room does not have the service, the service in the unit computer room is called; if the unit computer room does not have it either, an error is reported.

  2. Unit routing

If the request lands in the unit computer room, the traffic rules have placed that user there, and all subsequent RPC calls go only to services in the unit computer room; if the service does not exist there, an error is reported.

  3. Central routing

If the request lands in the unit computer room, the call goes directly to the service in the central computer room; if the central computer room does not have the service, an error is reported. If the request lands in the central computer room, the local computer room's service is called.

2.2.2 Business Transformation

The business side needs to mark the routing type of its Java interface by adding @HARoute to the interface. Once marked, the routing type is written into the interface's metadata when the Dubbo interface registers, and can be viewed in the Nacos console. All subsequent RPC calls to methods of that interface are routed according to the marked type.

If an interface is marked as unit routing, our internal convention is that the first parameter of every method is a primitive long buyerId, which the RPC framework uses to determine the user's computer room when routing.
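As a rough illustration (the @HARoute annotation comes from our framework as described above, but the attribute name and enum value below are assumed for the example, and OrderQueryApi/OrderDTO are illustrative names), a unit-routed order interface might look like this:

@HARoute(type = RouteType.UNIT)   // routing type is written into the Dubbo interface metadata on registration
public interface OrderQueryApi {

    // internal convention: the first parameter is the primitive long buyerId used for routing
    OrderDTO getOrderDetail(long buyerId, String orderNo);
}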

The routing logic is as follows:

2.2.3 Transformation process

  1. Copy the interface and name it UnitApi, adding long buyerId as the first parameter. The implementation of the new interface calls the old one, and the old and new interfaces coexist.
  2. Release UnitApi to production; at this point it carries no traffic.
  3. The calling side upgrades the API package of the other domain and switches calls from the old interface to the new UnitApi, guarded by a switch.
  4. After going online, UnitApi is called through the switch; if anything goes wrong, the switch can be turned off.
  5. Take the old API offline and complete the migration.

2.2.4 Problems encountered

2.2.4.1 Switching other call paths to the unit interface

Besides interfaces called directly via RPC, a large portion of calls go through Dubbo generic invocation. After this part goes online, its traffic also needs to be switched to UnitApi, and the old interface can be taken offline once it receives no more requests.

2.2.4.2 Interface Classification

Interfaces need to be reclassified. Before multi-active there were no such constraints, so a single Java interface might contain all kinds of methods. If an interface is now a unit route, every method must have buyerId as its first parameter, and methods with no buyerId scenario have to be moved out into another interface.

2.2.4.3 Business-level adjustments

Business-level adjustments are also needed. For example, previously an order could be queried with only the order number, but now buyerId is required for routing, so upstream callers of the interface have to be adjusted.

2.3 Database

Once the request has reached the service layer, the next thing to deal with is the database. We define the following types of databases:

  1. Unit library

A unit library is deployed in both computer rooms at the same time. Each computer room holds the full data set, and the data is synchronized in both directions.

  2. Central library

A central library is deployed only in the central computer room.

  3. Central-unit library

A central-unit library is deployed in both computer rooms at the same time. The center can read and write, while the other computer room can only read; data written in the center is replicated one-way to the other computer room.

2.3.1 Proxy middleware

Currently, all business teams use client-side sharding middleware, and the versions differ from team to team. During a multi-active traffic switch, writes to the database must be prohibited to keep business data correct; without unified middleware this would be very troublesome.

Therefore we built the database proxy middleware Rainbow Bridge by deeply customizing ShardingSphere. Each business team migrates to Rainbow Bridge to replace its previous sharding approach. For how to keep that migration stable and smooth, and how to recover quickly when problems occur, we have a proven set of practices; see my earlier article on migrating smoothly from client-side sharding to proxy sharding.

2.3.2 Distributed ID

For a unit library, the data layer replicates in both directions. If the table's auto-increment ID is used directly, conflicts like the following occur:

This can be solved by giving each computer room a different auto-increment offset and step size (for example, computer room A generates odd IDs and computer room B even IDs), but that is awkward to manage, especially if more computer rooms are added later. We took the once-and-for-all approach of using globally unique distributed IDs to avoid primary key conflicts.
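The article does not detail the internals of the infrastructure's ID service; as a minimal sketch of the idea, a Snowflake-style generator can reserve a few bits for the computer room so that IDs minted in different rooms never collide (bit widths and field names below are assumptions, not Dewu's actual implementation):

// Minimal Snowflake-style sketch: timestamp | roomId | workerId | sequence.
public class UnitIdGenerator {
    private static final long EPOCH = 1640995200000L; // custom epoch, e.g. 2022-01-01
    private final long roomId;       // 2 bits: which computer room generated the ID
    private final long workerId;     // 8 bits: which node inside the room
    private long lastTimestamp = -1L;
    private long sequence = 0L;      // 12 bits: per-millisecond counter

    public UnitIdGenerator(long roomId, long workerId) {
        this.roomId = roomId;
        this.workerId = workerId;
    }

    public synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now == lastTimestamp) {
            sequence = (sequence + 1) & 0xFFF;           // 12-bit sequence
            if (sequence == 0) {                          // sequence exhausted, wait for the next millisecond
                while ((now = System.currentTimeMillis()) <= lastTimestamp) { }
            }
        } else {
            sequence = 0L;
        }
        lastTimestamp = now;
        return ((now - EPOCH) << 22) | (roomId << 20) | (workerId << 12) | sequence;
    }
}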

2.3.2.1 Client Access

There are currently two ways to use distributed IDs. One is for the application to use the jar package provided by the infrastructure team; the logic is as follows:

2.3.2.2 Rainbow Bridge Access

The other is to configure the ID generation method for specific tables in Rainbow Bridge, which supports integration with the distributed ID service.

2.3.3 Business Transformation

2.3.3.1 Unit library write requests must carry the ShardingKey

When the DAO layer operates on a table, the ShardingKey of the current call is set in a ThreadLocal, and a MyBatis interceptor injects it into the SQL as a hint so that it reaches Rainbow Bridge. Rainbow Bridge then checks whether the ShardingKey belongs to the current computer room; if not, the write is blocked and an error is reported.
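A minimal sketch of this mechanism (class names, the hint comment format, and the interceptor wiring are illustrative, not the real implementation; it assumes a MyBatis version where rewriting BoundSql via MetaObject works):

import java.sql.Connection;
import org.apache.ibatis.executor.statement.StatementHandler;
import org.apache.ibatis.mapping.BoundSql;
import org.apache.ibatis.plugin.*;
import org.apache.ibatis.reflection.MetaObject;
import org.apache.ibatis.reflection.SystemMetaObject;

// Holds the buyerId (ShardingKey) of the current call; set by business code before the DAO call.
final class ShardingKeyHolder {
    private static final ThreadLocal<Long> KEY = new ThreadLocal<>();
    static void set(Long buyerId) { KEY.set(buyerId); }
    static Long get() { return KEY.get(); }
    static void clear() { KEY.remove(); }
}

// Prepends the ShardingKey to the SQL as a hint comment so the proxy (Rainbow Bridge) can read it.
@Intercepts(@Signature(type = StatementHandler.class, method = "prepare",
        args = {Connection.class, Integer.class}))
public class ShardingHintInterceptor implements Interceptor {
    @Override
    public Object intercept(Invocation invocation) throws Throwable {
        Long buyerId = ShardingKeyHolder.get();
        if (buyerId != null) {
            StatementHandler handler = (StatementHandler) invocation.getTarget();
            BoundSql boundSql = handler.getBoundSql();
            String hintedSql = "/* shardingKey:" + buyerId + " */ " + boundSql.getSql();
            MetaObject meta = SystemMetaObject.forObject(boundSql);
            meta.setValue("sql", hintedSql);   // rewrite the SQL that will be sent to the proxy
        }
        return invocation.proceed();
    }

    @Override
    public Object plugin(Object target) { return Plugin.wrap(target, this); }
}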

Let me briefly explain why writes are prohibited during a traffic switch. It is a bit like a stop-the-world pause in JVM garbage collection: if writes are not stopped, new data keeps being produced, and before the new traffic rules take effect we must make sure all data in the current computer room has been synchronized. Otherwise the user is switched to the other computer room before the data arrives, and business problems follow. Besides Rainbow Bridge prohibiting writes, the RPC framework also blocks calls according to the traffic rules.

2.3.3.2 Data sources must specify a connection mode

There are two definitions of connection mode, namely center and unit.

If an application's data source specifies the center connection mode, the data source is initialized normally in the central computer room but is not initialized in the unit computer room.

If the data source specifies the unit connection mode, it is initialized normally in both the central and the unit computer rooms.

Why does the connection mode exist at all?

A project may connect to two libraries at once: a unit library and a central library. Without a connection mode, the same code would be deployed to both the central and the unit computer rooms, and both data sources would be created in both places.

In reality, the central library only needs to be connected in the central computer room, because all operations on the central library go through central interfaces and that traffic always goes to the center; connecting to it from the unit computer room is pointless. It also means the unit computer room does not have to maintain the central library's database information: without the connection mode, the Rainbow Bridge in the unit computer room would also need the central library's information, simply because the project tries to connect.
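The article does not show the data source configuration format; purely as a hypothetical sketch modeled on the Redis configuration shown later in section 2.4, it could look like this, with the framework deciding at startup whether to initialize each data source based on its mode and the computer room it runs in:

# hypothetical keys, mirroring the style of the Redis config in section 2.4
spring.datasource.sources.trade-unit.mode=unit        # initialized in both computer rooms
spring.datasource.sources.trade-center.mode=center    # initialized only in the central computer room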

2.3.4 Problems encountered

2.3.4.1 Unit interfaces cannot access the central library

If an interface is marked as a unit interface, it can only operate on the unit library. Before the multi-active transformation there was no notion of center and unit, and all tables lived together; after the transformation, the databases are split according to business scenario.

After the split, the central library is used only by programs in the central computer room, and connecting to it from the unit computer room is not allowed. So if a unit interface touches the central library, an error is reported; that logic has to be changed to call a central RPC interface instead.

2.3.4.2 Central interfaces cannot access the unit library

The mirror image of the problem above: a central interface cannot operate on the unit library. Requests to a central interface are forced to the central computer room, and any operation that involves the other computer room must also go through an RPC interface for correct routing, because the central computer room cannot operate the other computer room's database.

2.3.4.3 Batch query adjustment

For example, a batch query by order numbers may span orders from different buyers. If one order's buyer is used as the routing parameter, the other orders may actually belong to the other unit, so stale data may be returned.

Such batch queries are only allowed for a single buyer; if multiple buyers are involved, the orders must be grouped by buyer and queried in separate calls, as sketched below.
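A rough caller-side sketch of that adjustment (OrderRequest, OrderDTO, and orderQueryApi.batchGetOrders are illustrative names, not real interfaces): group the order numbers by buyer and issue one routed call per buyer.

// Illustrative only: group requests by buyerId so each RPC call routes to a single unit.
public List<OrderDTO> batchQuery(List<OrderRequest> requests) {
    Map<Long, List<String>> orderNosByBuyer = requests.stream()
            .collect(Collectors.groupingBy(OrderRequest::getBuyerId,
                    Collectors.mapping(OrderRequest::getOrderNo, Collectors.toList())));

    List<OrderDTO> result = new ArrayList<>();
    orderNosByBuyer.forEach((buyerId, orderNos) ->
            // buyerId is the first parameter so the RPC framework can route to the correct unit
            result.addAll(orderQueryApi.batchGetOrders(buyerId, orderNos)));
    return result;
}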

2.4 Redis

Redis is used heavily in the business, and the multi-active transformation touches it in many places. First, a few definitions:

  1. No two-way synchronization

Unlike the databases, Redis is not synchronized in both directions: the central computer room has its own Redis clusters and the unit computer room has its own. Each computer room's cluster holds only part of the users' cached data, not the full set.

  2. Redis types

Redis clusters are classified as center or unit. A center cluster is deployed only in the central computer room; a unit cluster is deployed in both the central and the unit computer rooms.

2.4.1 Business Transformation

2.4.1.1 Redis Multi-Data Source Support

Before the multi-active transformation, each application had a single Redis cluster. After the transformation, because applications have not been split into separate unit and central services, one application may need to connect to two Redis clusters: a central Redis and a unit Redis.

The Redis package provided by the infrastructure team therefore supports multiple data sources with a common configuration format. The business side only needs to specify the cluster and the connection mode in its configuration; the connection mode works the same way as for the database.

The actual Redis instance information is maintained centrally in the configuration center, so the business side does not need to care about it and nothing has to change when a new computer room is added. The configuration looks like this:

spring.redis.sources.carts.mode=unit
spring.redis.sources.carts.cluster-name=cartsCluster

When using Redis, the corresponding data source must also be specified, as follows:

 @Autowired 
@Qualifier(RedisTemplateNameConstants.REDIS_TEMPLATE_UNIT) 
private RedisTemplate<String, Object> redisTemplate;
2.4.1.2 Data Consistency

In the database-plus-cache scenario, because Redis is not synchronized across computer rooms, data can become inconsistent. For example, a user starts in the central computer room, where some data is cached; traffic is then switched to the unit computer room, which caches its own copy; when traffic switches back to the central computer room, its cache still holds the old data rather than the latest.

Therefore, when the underlying data changes we must invalidate the cache to guarantee eventual consistency; relying on cache expiration time alone is not an adequate solution.

Our solution is to subscribe to the database binlog and invalidate the cache from it. By subscribing to the binlog of the local computer room as well as that of the other computer room, caches in all computer rooms can be invalidated.
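A minimal sketch of the invalidation consumer (BinlogEvent and the subscription wiring are hypothetical stand-ins for whatever the binlog component delivers; table, column, and key names are illustrative):

// Minimal stand-in for a binlog row-change event delivered by the subscription component.
interface BinlogEvent {
    String getTableName();
    String getColumnValue(String column);
}

public class OrderCacheInvalidator {

    private final RedisTemplate<String, Object> unitRedisTemplate;

    public OrderCacheInvalidator(RedisTemplate<String, Object> unitRedisTemplate) {
        this.unitRedisTemplate = unitRedisTemplate;
    }

    // Called for every row change on the order table, from the local and the remote binlog streams.
    public void onRowChanged(BinlogEvent event) {
        if (!"order".equals(event.getTableName())) {
            return;
        }
        String orderNo = event.getColumnValue("order_no");
        // Delete rather than update: the next read repopulates the cache from the database.
        unitRedisTemplate.delete("order:detail:" + orderNo);
    }
}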

2.4.2 Problems encountered

2.4.2.1 Serialization Protocol Compatibility

After switching to the new Redis client package, some old data in the test environment turned out to be incompatible. Most applications were fine, but a few, despite using the unified base package, had customized their serialization; once Redis was assembled the new way, the custom protocol was no longer applied. This was fixed by adding per-data-source protocol customization.

2.4.2.2 Use of distributed locks

Distributed locks in our projects are implemented on Redis, so with multiple Redis data sources the lock also has to be adapted, distinguishing the scenarios in which it is used; the default is to lock on the central Redis.

Operations inside unit interfaces, however, are all buyer scenarios, so they are adjusted to use the lock object backed by the unit Redis, which also improves performance. Scenarios that lock global resources keep using the lock object backed by the central Redis.
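A simplified sketch of choosing the lock's data source (this is not the real lock component; a production lock would also need an owner token and safe release):

import java.time.Duration;
import org.springframework.data.redis.core.RedisTemplate;

// Buyer-scoped locks go to the unit Redis; global resources go to the central Redis.
public class MultiSourceLock {

    private final RedisTemplate<String, Object> centerRedis;
    private final RedisTemplate<String, Object> unitRedis;

    public MultiSourceLock(RedisTemplate<String, Object> centerRedis,
                           RedisTemplate<String, Object> unitRedis) {
        this.centerRedis = centerRedis;
        this.unitRedis = unitRedis;
    }

    /** Lock for buyer-dimension operations inside unit interfaces. */
    public boolean tryBuyerLock(long buyerId, String resource, Duration ttl) {
        String key = "lock:buyer:" + buyerId + ":" + resource;
        return Boolean.TRUE.equals(unitRedis.opsForValue().setIfAbsent(key, "1", ttl));
    }

    /** Lock for global resources; always taken on the central Redis. */
    public boolean tryGlobalLock(String resource, Duration ttl) {
        String key = "lock:global:" + resource;
        return Boolean.TRUE.equals(centerRedis.opsForValue().setIfAbsent(key, "1", ttl));
    }
}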

2.5 RocketMQ

After the request reaches the service layer and interacts with the database and the cache, the next step is usually to send a message that other business services listen to for further processing.

A message produced in the unit computer room goes to that room's MQ, and programs in the unit computer room can consume it without any problem. But what if a program in the central computer room needs to consume it? So MQ, like the database, must be synchronized: messages are replicated to the MQ of the other computer room, and whether consumers there actually consume them is decided by the business scenario.

2.5.1 Defining consumption types

2.5.1.1 Center Subscription

Central subscription means the message is consumed only in the central computer room, regardless of whether it was produced in the central or the unit computer room. If it was produced in the unit computer room, it is replicated to the center for consumption.

2.5.1.2 Ordinary subscription

Ordinary subscription is the default behavior: consume nearby. Messages produced in the central computer room are consumed by consumers in the central computer room, and messages produced in the unit computer room are consumed by consumers in the unit computer room.

2.5.1.3 Unit Subscription

Unit subscription means messages are filtered by ShardingKey. No matter which computer room the message was produced in, it is replicated to the other computer room, so both rooms hold the message. The ShardingKey then determines which computer room should consume it: only matching consumers process it, and non-matching ones acknowledge it automatically.
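The framework applies this filtering for you based on the subscription type; purely as a rough sketch of the equivalent consumer-side behavior (RouteUtil.isLocalUnit is a hypothetical helper for "does this buyer belong to the current computer room", and handleBusinessMessage is a placeholder):

// Sketch of what unit subscription effectively does, using the RocketMQ listener API.
consumer.registerMessageListener((MessageListenerConcurrently) (msgs, context) -> {
    for (MessageExt msg : msgs) {
        long buyerId = Long.parseLong(msg.getUserProperty("buyerId"));
        if (!RouteUtil.isLocalUnit(buyerId)) {
            continue;   // owned by the other computer room: skip the business logic, message is simply ACKed
        }
        handleBusinessMessage(msg);   // process messages owned by this computer room
    }
    return ConsumeConcurrentlyStatus.CONSUME_SUCCESS;
});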

2.5.1.4 Full unit subscription

Full-unit subscription means that no matter which computer room the message is produced in, it is consumed in every computer room.

2.5.2 Business Transformation

2.5.2.1 Message sender adjustment

Message producers need to be adjusted based on the business scenario. For buyer-scenario messages, the buyerId must be put into the message when it is sent; how it is consumed is then up to the consumer. If the consumer uses unit subscription, it depends entirely on the sender-supplied buyerId, otherwise there is no way to know which computer room should consume the message.
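A small producer-side sketch (topic name, property name, and surrounding wiring are illustrative):

// Carry the buyerId in the message so unit-subscribed consumers can tell which computer room owns it.
Message msg = new Message("order_paid_topic", orderJson.getBytes(StandardCharsets.UTF_8));
msg.putUserProperty("buyerId", String.valueOf(order.getBuyerId()));
producer.send(msg);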

2.5.2.2 The message consumer specifies the consumption mode

As mentioned above, there are several modes: central subscription, ordinary subscription, unit subscription, and full-unit subscription. Which one to choose depends on the business scenario, and the choice is specified in the MQ configuration.

For example, central subscription suits a service that is entirely central and not deployed in other computer rooms. Cache invalidation, on the other hand, suits full-unit subscription: whenever the data changes, the caches in all computer rooms are cleared.

2.5.3 Problems encountered

2.5.3.1 Message idempotent consumption

Strictly speaking this is not specific to multi-active: even without it, message consumption must be idempotent, because messages have a retry mechanism. It is called out here because multi-active adds another source of duplicate consumption. During a traffic switch, the messages belonging to the switched users are replicated to the other computer room and re-consumed, and the re-delivery is based on a point in time, so messages that were already consumed may be consumed again. This must be handled.

Why do consumption failures occur during a traffic switch, requiring messages to be replicated to the other computer room? See the figure below:

A user's business operation in the current computer room produces a message, and because it is a unit subscription, it is consumed in the current computer room. If a traffic switch happens during consumption, the consumption logic still reads and writes the database, and all unit-table operations carry the ShardingKey. Rainbow Bridge checks the ShardingKey against the current rules, and once it no longer matches, the write is blocked and an error is thrown, so all messages of the users being switched fail to consume. If those messages were not re-delivered after traffic moves to the other computer room, they would be lost; that is why they are replicated to the other computer room and re-delivered.

2.5.3.2 Message ordering during traffic switching

As mentioned above, during a traffic switch messages are replicated to the other computer room and replayed from a point in time. If the business uses an ordinary topic, then when the messages are replayed, multiple messages for the same scenario are not necessarily consumed in their original order, so ordering becomes a problem.

If the business already used ordered messages, there is no issue; if it did not, problems may appear. An example:

In one business scenario, triggering a feature produces a message. The messages are per user, i.e. one user generates N messages. The consumer stores their content, but instead of writing one row per message, it keeps only one row per user; the message carries a status field, and the stored row is updated based on that status.

For example, three messages are delivered as follows; consuming them in the original order leaves the final result status=valid.

 10:00:00  status=valid 
10:00:01  status=invalid 
10:00:02  status=valid

If the messages are re-delivered in the other computer room in the following order, the final result becomes status=invalid.

 10:00:00  status=valid 
10:00:02  status=valid 
10:00:01  status=invalid

The solutions are as follows:

  1. Switch the topic to ordered messages partitioned by user, so each user's messages are consumed strictly in the order they were sent.
  2. Make consumption idempotent so a message that has already been consumed is not processed again. Unlike the ordinary case there are N messages per user; storing every msgId to detect duplicates puts too much pressure on storage, although keeping only the most recent N messages can reduce it.
  3. An optimized form of idempotency: the sender attaches a monotonically increasing version to every message, and the consumer stores the latest version it has consumed. Before consuming, it checks that the message's version is greater than the stored version and only then processes it, which avoids the storage pressure while still meeting the business need (a sketch follows this list).
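A minimal sketch of option 3, using the unit Redis as the version store (key names and message fields are illustrative; in production the read-compare-write would need to be made atomic, e.g. with a Lua script or the distributed lock described earlier):

public class VersionedStatusConsumer {

    private final RedisTemplate<String, Object> unitRedisTemplate;

    public VersionedStatusConsumer(RedisTemplate<String, Object> unitRedisTemplate) {
        this.unitRedisTemplate = unitRedisTemplate;
    }

    /** Returns true if the message was applied, false if it was stale and skipped. */
    public boolean consume(long buyerId, long version, String status) {
        String versionKey = "user:status:version:" + buyerId;
        Object stored = unitRedisTemplate.opsForValue().get(versionKey);
        long storedVersion = stored == null ? 0L : Long.parseLong(stored.toString());

        if (version <= storedVersion) {
            return false;   // older or duplicate message replayed out of order: ignore it
        }
        updateUserStatus(buyerId, status);                                   // apply the business update first
        unitRedisTemplate.opsForValue().set(versionKey, String.valueOf(version)); // then record the new version
        return true;
    }

    private void updateUserStatus(long buyerId, String status) {
        // persist one row per user keyed by buyerId (details omitted)
    }
}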

2.6 Jobs

Jobs are not used much on our side, and only by old logic: there are just a handful of early-morning statistics tasks. New scheduled work is handled by our self-developed TOC (timeout center).

2.6.1 Business Transformation

2.6.1.1 Central computer room execution

Since the job system is legacy and currently runs only a single-digit number of tasks, no multi-active support was added at the framework level; the job logic will be migrated to TOC later.

So the transformation has to happen at the business level. There are two options, described below:

  1. Both computer rooms run the job at the same time. When processing data, e.g. user data, the capability provided by the infrastructure is used to check whether each user belongs to the current computer room: if so the record is processed, otherwise it is skipped (see the sketch after this list).
  2. Start from the business scenario: these jobs run in the early morning, are not part of online traffic, and do not need strict data consistency, so processing the data without unit awareness is acceptable. We therefore run the job only in the central computer room and configure it not to take effect in the other computer room.

This second option, however, requires reviewing the job's data operations: operations on the central library are fine since the job runs in the central computer room, but any operation on the unit library has to be changed to go through an RPC interface.
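A small sketch of option 1, referenced in the list above (RouteUtil.isLocalUnit is the same hypothetical helper standing in for the infrastructure capability, and aggregateStatisticsFor is a placeholder):

// Both computer rooms run the job; each processes only the users it owns.
public void runDailyStatisticsJob(List<Long> allBuyerIds) {
    for (long buyerId : allBuyerIds) {
        if (!RouteUtil.isLocalUnit(buyerId)) {
            continue;   // owned by the other computer room: skip to avoid double processing
        }
        aggregateStatisticsFor(buyerId);   // business logic, details omitted
    }
}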

2.7 TOC

TOC is our internal timeout center. When a business action needs to fire at a certain point in time, it can be registered with the timeout center.

For example: after an order is created, it is automatically cancelled if it is not paid within N minutes. If the business side implemented this itself, it would either scan tables periodically or use MQ delayed messages. With TOC, a timeout task is registered right after the order is created, specifying the time at which TOC should call back; the callback then checks whether the order has been paid and cancels it if not.

2.7.1 Business Transformation

2.7.1.1 Task registration adjustment

When registering a task with the timeout center, the business side must decide whether the task has to follow the unitization rules. If the task only touches the central database, its callback can run in the central computer room. If it operates on the unit database, the buyerId must be provided at registration time, so that when the timeout center triggers the callback it can route to the user's computer room based on the buyerId.

At present the timeout center is deployed only in the central computer room, so all tasks are scheduled there. If no buyerId is provided at registration, the timeout center does not know which computer room to call back and defaults to the central one. For the callback to follow the multi-active routing rules, the buyerId must be supplied at registration time.

3. Service division

After reading about all these transformations, you probably still wonder: how should I divide my services? Should they be unitized?

3.1 Overall direction

First, sort services out according to the overall goal and direction of the multi-active effort. For example, our overall direction is that the core buyer transaction link must be unitized, so every upstream and downstream dependency of that link needs to be transformed.

A user browses products, enters order confirmation, places the order, pays, and queries the order. This core link involves many business domains: products, bidding, orders, payment, merchants, and so on.

Beneath these business domains there may be other supporting domains, so the whole link has to be sorted out and transformed together. Of course, not everything must be unitized; it depends on the business scenario. Inventory, for instance, is definitely on the core transaction link but does not need to be unitized: it stays in the center.

3.2 Service Type

3.2.1 Central Services

A central service is deployed only in the central computer room, and its database must be a central library. The whole application can be marked as central, so that any external access to its interfaces is routed to the central computer room.

3.2.2 Unit Services

A unit service is deployed in both the central and the unit computer rooms, and its database must be a unit library. Unit services cover buyer-dimension business, such as order confirmation and order placement.

For buyer-dimension business, the first parameter of every interface must be buyerId because routing depends on it. The user's requests are already distributed to computer rooms according to the rules, and each request only operates on the database of its own computer room.

3.2.3 Central unit service

A central-unit service contains both central interfaces and unit interfaces, and uses two sets of databases. Such a service is deployed in both computer rooms, but in the unit computer room only the unit interfaces receive traffic; the central interfaces receive none.

Some underlying supporting businesses, such as products and merchants, are central-unit services. Supporting-dimension business has no buyerId: a product is universal and does not belong to any particular buyer.

The underlying database for supporting business is the central-unit library: writes happen in the center and reads happen in every unit. Write requests such as product creation and modification go to the center and are then synchronized to the other computer room's database. The benefit is lower latency on the core link: if products were not deployed in the units, product information would have to be read from the central computer room whenever a user browses products or places an order. Now the interface is routed to the nearest deployment: requests in the central computer room call the central service, and requests in the unit computer room call the unit service.

In the long run, central business and unit business should still be split into separate services, which is clearer and makes it easier for newcomers to define interfaces and operate databases and caches. Today they are mixed together, so you always have to know whether a given interface's business belongs to the unit or the center.

The split is not absolute; as always, it starts from the business scenario. Order business has both buyer and seller roles, so splitting it makes sense and eases later maintenance. Products, however, have no such two roles; creating, deleting, and modifying products is convenient to maintain in one project, and it is enough to classify the interfaces and mark the create/update/delete interfaces as central.

4. Traffic switching solution

We mentioned earlier that during a traffic switch writes are disabled and MQ messages are replicated to the other computer room for re-consumption. Here is our traffic switching procedure, which should give you a deeper understanding of how the whole multi-active setup handles these scenarios.

  1. Issue write-prohibit rules

When traffic needs to be switched, the operator works from the multi-active control center console. Before switching, existing traffic must be drained and a write-prohibit rule issued. The rule is pushed to the configuration centers of both the center and the unit, which notify the programs that listen for it.

  2. Rainbow Bridge enforces the write prohibition

Rainbow Bridge consumes the write-prohibit rule: as soon as the rule changes in the configuration center, Rainbow Bridge picks it up, checks the ShardingKey carried in each SQL statement to see whether it belongs to this computer room, and blocks the write if it does not.

  3. Report that the write prohibition has taken effect

When the configuration change is pushed to Rainbow Bridge, the configuration center observes the result of the push and reports back to the multi-active control center that the rule has taken effect.

  4. Push the write-prohibit effective time to Otter

Once the multi-active control center has received all the feedback, it tells Otter the effective time point via an MQ message.

  5. Otter synchronizes the data

When Otter receives the message, it synchronizes data up to that time point.

  6. Otter reports that synchronization is complete

Once all data before the effective time point has been synchronized, Otter reports back to the multi-active control center via an MQ message.

  7. Issue the new traffic rules

After the multi-active control center receives Otter's completion message, it issues the new traffic rules, which are delivered to DLB, the RPC framework, and Rainbow Bridge.

Subsequent user requests will be routed directly to the correct computer room.

5. Summary

After reading this article you should have a basic understanding of a multi-active transformation. Of course, it does not cover everything, because the scope of the whole effort is simply too large; it focuses on the transformation points and processes at the middleware and business levels. Other topics are left out, for example building the inter-room network, making the release system support multiple computer rooms, extending the monitoring system to full-link monitoring across computer rooms, data verification and inspection, and so on.

Multi-active is a high-availability disaster recovery approach, but the implementation cost and the demands on the engineering team are very high. It should be designed around business scenarios: not every system and every feature has to be multi-active, and there is no such thing as 100% availability. In extreme scenarios it comes down to trade-offs, with priority given to keeping core functions running.

These are some of the lessons the Dewu order domain team learned from taking part in the multi-active transformation. I hope they are helpful to you.

Text / YINJIHUAN

Follow Dewu Technology and be the most fashionable tech person!

