This article was originally shared by Kaito, author of the public WeChat account "Water Drops and Silver Bullets."
1. Introduction
A few days ago, someone in a tech group asked me whether I had any articles on geo-distributed (multi-site active-active) architectures for IM systems. Thinking it over, apart from a few articles shared by the WeChat team that briefly mention disaster recovery and multi-site deployment (only in passing, without detail), there really is no systematic reference material on multi-site active-active architecture. I took this opportunity to organize this article shared by Kaito for everyone to learn from.
Starting from a simple example system, this article walks step by step through single-machine architecture, master-slave replication, same-city disaster recovery, same-city active-active, remote active-active, and finally multi-site active-active, explaining the technical principles and basic implementation ideas behind the disaster-recovery architecture of large distributed systems. It is well suited for beginners.
Study and exchange:
- Instant messaging / push technology development group 5: 215477170 [recommended]
- Introduction to Mobile IM Development: "One entry is enough for novices: Develop mobile IM from scratch"
- Open source IM framework source code: https://github.com/JackJiang2011/MobileIMSDK
(This article was published simultaneously at: http://www.52im.net/thread-3742-1-1.html)
2. Series of articles
This article is the tenth article in a series of articles on the basic knowledge of IM development:
"IM Development Basic Knowledge Supplementary Lesson (1): Correctly Understand the Principle of the Front-end HTTP SSO Single Sign-on Interface"
"IM Development Basic Knowledge Supplementary Lesson (2): How to Design a Server-side Storage Architecture for a Large Number of Image Files?"
"IM Development Basic Knowledge Supplementary Lesson (3): Quickly Understand Server-side Database Read/Write Separation, with Practical Suggestions"
"IM Development Basic Knowledge Supplementary Lesson (4): Correctly Understand Cookie, Session and Token in Short HTTP Connections"
"IM Development Basic Knowledge Supplementary Lesson (5): An Easy Guide to Correctly Understanding and Using MQ Message Queues"
"IM Development Basic Knowledge Supplementary Lesson (6): NoSQL or SQL for the Database? Read This and You'll Know!"
"IM Development Basic Knowledge Supplementary Lesson (7): Principles and Design Ideas of Mainstream Mobile Account Login Methods"
"IM Development Basic Knowledge Supplementary Lesson (8): The Most Accessible Guide Ever to Thoroughly Understanding Garbled Characters"
"IM Development Basic Knowledge Supplementary Lesson (9): Want to Develop an IM Cluster? First Understand What RPC Is!"
"IM Development Basic Knowledge Supplementary Lesson (10): How Hard Is a Large-scale IM System? A Ten-thousand-character Deep Dive into Multi-site Active-Active!" (* this article)
The following articles also touch on disaster recovery and multi-site active-active:
"Rapid Fission: Witness the evolution of WeChat's powerful back-end architecture from 0 to 1 (2)"
"IM Message ID Technology Topic (2): Practice of Generating Massive IM Chat Message Serial Numbers in WeChat (Disaster Recovery Plan)"
"Taobao Technology Sharing: The Evolution of Mobile Taobao's Billion-level Mobile Access Layer Gateway"
3. Content overview
In software development, multi-site active-active is one of the peaks of distributed system architecture design. Many people have heard of it, but few understand the principles behind it.
What is multi-site active-active? Why do we need it? What problem does it solve, and how?
These are the questions every programmer wants answered on first encountering the term.
I was once fortunate to be deeply involved in designing and implementing the multi-site active-active system of a mid-sized Internet company, so today I'd like to walk you through the principles behind it.
After reading this article carefully, I believe you will have a much deeper understanding of multi-site active-active architecture.
4. What is system availability
To understand multi-site active-active, we need to start from the principles of architecture design.
The requirements we place on software systems keep growing. If you know anything about architecture design, you know a good software architecture should satisfy three properties.
They are:
1) High performance;
2) High availability;
3) Easy scalability.
Where:
1) High performance means the system has high throughput and low response latency (for example, handling 100,000 concurrent requests per second with an interface response time of 5 ms);
2) Easy scalability means the system can be extended at minimal cost when iterating on new features, and can be scaled out under traffic pressure without code changes.
"High availability" sounds abstract. How do we make it concrete?
Two metrics are usually used to measure it:
1) MTBF (Mean Time Between Failures): the average time between two failures, i.e. how long the system runs "normally". The longer it is, the more stable the system;
2) MTTR (Mean Time To Repair): the average time it takes to recover after a failure. The smaller it is, the less impact a failure has on users.
Availability relates to the two as follows:
Availability = MTBF / (MTBF + MTTR) * 100%
The result of this formula is a percentage, and we usually describe a system's availability in "number of nines".
To reach four nines (99.99%) or better, the average daily downtime must be kept within roughly 10 seconds.
In other words, the shorter the downtime, the higher the overall availability, and every extra nine places much higher demands on the system.
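The formula above can be sanity-checked with a quick calculation (a minimal sketch; the MTBF/MTTR numbers are illustrative, not from the article):

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR), expressed as a percentage."""
    return mtbf_hours / (mtbf_hours + mttr_hours) * 100

# Four nines (99.99%) allows about 8.6 seconds of downtime per day on average:
SECONDS_PER_DAY = 24 * 60 * 60
downtime_per_day = SECONDS_PER_DAY * (1 - 0.9999)  # ~8.64 seconds

# Example: a system that runs 30 days between failures and takes
# 10 minutes to recover each time.
a = availability(mtbf_hours=30 * 24, mttr_hours=10 / 60)
# a is roughly 99.977% -- not quite four nines; shrinking MTTR is the lever.
```

Notice that with a month between failures, even a 10-minute recovery already misses four nines, which is why the article keeps returning to shortening recovery time.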
We all know failures are inevitable, and the larger the system, the higher the probability that something goes wrong.
Failures generally come from three directions:
1) Hardware: CPU, memory, disk, network card, switch, router;
2) Software: code bugs, version releases;
3) Force majeure: earthquake, flood, fire, war.
Any of these risks can strike at any time. So in the face of failure, whether our system can recover as fast as possible becomes the key to availability.
How do we achieve rapid recovery?
The multi-site active-active architecture this article covers is an effective solution to exactly this problem.
In the rest of this article, I will start from the simplest possible system and evolve it step by step into an architecture that supports multi-site active-active.
Along the way you will see what availability problems a system runs into and why the architecture must evolve the way it does, and thereby understand the meaning of each step toward multi-site active-active.
5. Stand-alone architecture
Let's start with the simplest case.
Assume your business is in its infancy and very small in scale. Your architecture looks like this:
The model is simple: a client request comes in, the business application reads and writes the database, and the result is returned.
Note, however, that the database here is deployed on a single machine, which has a fatal weakness: any accident (disk damage, operating system failure, accidental data deletion) means all data is lost, and the loss is enormous.
How do we avoid this? A solution comes to mind easily: backup.
You can back up the data periodically, e.g. cp the database files to another machine on a schedule. Even if the original machine loses its data, you can restore from the backup and keep the data safe.
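A minimal sketch of such a periodic backup (paths and names here are illustrative; a real setup would ship the copy to a *different* machine, e.g. via rsync or scp, and run it from cron):

```python
import shutil
import time
from pathlib import Path

def backup_database(db_file: str, backup_dir: str) -> Path:
    """Copy the database file into backup_dir under a timestamped name."""
    dest_dir = Path(backup_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = dest_dir / f"{Path(db_file).name}.{stamp}.bak"
    shutil.copy2(db_file, dest)  # copy2 preserves file metadata
    return dest

# The backup interval bounds how much data a failure can lose,
# which is exactly the "incomplete data" problem described below.
```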
Although simple to implement, this scheme has two problems:
1) Recovery time: the business must be stopped before data is restored; downtime depends on restore speed, and the service is unavailable throughout;
2) Incomplete data: since backups are periodic, the backup is never fully up to date; data completeness depends on the backup interval.
Clearly, the larger your database, the longer recovery from a failure takes. Against the high-availability standard above, this scheme may not reach even one nine, far below our availability requirements.
Is there a better scheme that restores the business quickly while keeping data as complete as possible?
Yes: master-slave replication.
6. Master-slave replication architecture
To address the single-machine problem from the previous section, deploy another database instance on a second machine and make it a replica of the original, kept in real-time sync.
Like this:
We generally call the original instance the master and the new instance the slave.
The advantages of this scheme:
1) Better data completeness: master and slave sync in real time, so the data gap is small;
2) Better fault tolerance: if anything goes wrong with the master, the slave can be promoted at any time and continue serving;
3) Better read performance: business applications can read directly from the slave, sharing the master's read load.
This is a solid scheme: it greatly improves database availability and also improves the system's read performance.
By the same token, your business application can be deployed on additional machines to avoid a single point. Business applications are usually stateless (they don't store data the way a database does), so deploying more copies is straightforward.
With multiple application instances deployed, you now need an access layer to load-balance requests (usually nginx or LVS), so that when one machine goes down the others can take over all the traffic and keep the service up.
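The access layer's job can be illustrated with a toy round-robin balancer that skips unhealthy backends (a sketch only; in production nginx or LVS does this for you, with real health checks):

```python
import itertools

class RoundRobinBalancer:
    """Rotate requests across backends, skipping ones marked down."""

    def __init__(self, backends):
        self.backends = backends
        self.down = set()
        self._cycle = itertools.cycle(backends)

    def mark_down(self, backend):
        self.down.add(backend)

    def pick(self):
        # Try each backend at most once per request.
        for _ in range(len(self.backends)):
            b = next(self._cycle)
            if b not in self.down:
                return b
        raise RuntimeError("no healthy backends")

lb = RoundRobinBalancer(["app-1:8080", "app-2:8080"])
lb.mark_down("app-1:8080")  # app-1 crashes...
# ...and every request now goes to app-2, so service continues.
```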
From this scheme you can see that the key idea for improving availability is redundancy.
Exactly: worried about an instance failing? Deploy multiple instances. Worried about a machine going down? Deploy multiple machines.
At this point your architecture has evolved into the mainstream baseline; new business applications can be deployed in this mode.
But does this scheme carry any remaining risk?
7. An easily overlooked risk
Now let's lower our perspective and focus on the deployment details.
In the previous analysis we deployed the application on multiple machines to avoid a single point, but we never asked where those machines actually sit.
A data center contains many servers, and servers are grouped into racks. If the machines you use all happen to sit in one rack, risk remains:
if the switch/router serving that rack fails, your application is still at risk of becoming unavailable.
Even with redundant routing on the switch/router side, there is no guarantee nothing goes wrong.
Deploying within one rack is risky, so what if you spread the machines across different racks?
That does greatly reduce the probability of trouble. But we still can't relax, because however scattered they are, they still share one environment: the data center.
So keep asking: can the data center itself fail?
Generally speaking, the bar for building a data center is very high: location, temperature and humidity control, backup power and so on are all carefully engineered by the operator.
Even so, every once in a while we see news like this:
On May 27, 2015, a fiber-optic cable was cut in Hangzhou, and nearly 300 million users could not reach Alipay for 5 hours;
On July 13, 2021, part of Bilibili's server room failed and the whole site was unreachable for 3 hours;
On October 9, 2021, a power failure in Futu Securities' server room left users unable to log in or trade for 2 hours;
...
As you can see, even when data-center-level protection is done well, as long as there is any probability of failure, reality will eventually deliver it (Murphy's law). The probability is small, but when it happens, the impact is obvious.
Reading this you may think: the probability of a data-center failure is tiny, I've worked for years and never hit one, is it really worth this much worry?
But consider this question: what do systems of different sizes focus on?
1) A small system focuses on user scale and growth; at that stage, acquiring users is everything;
2) Once user volume grows, the focus shifts to performance: optimizing interface response times, page load speed and so on, i.e. user experience;
3) Once the volume reaches a certain size, you find availability becomes paramount. For nation-scale applications like WeChat and Alipay, a data-center failure has an enormous blast radius.
So however small the risk, we cannot ignore it when pushing system availability higher.
With the risks analyzed, back to our architecture: how do we handle data-center-level failures?
Right, redundancy again.
8. Same-city disaster recovery architecture
To withstand data-center-level risk, the countermeasures cannot stay inside one data center.
You now need data-center-level redundancy: build another data center and deploy your services there.
For simplicity, build it in the same city. Call the original data center A and the new one B, with the two connected by a dedicated network line.
With a new data center in hand, how should we use it? Again, data risk comes first.
To avoid losing data if data center A fails, we need a copy of the data in data center B. The simplest scheme is the one mentioned before: backup.
Back up A's data to B periodically (by copying data files). Even if A is destroyed outright, the data in B survives, and the service can be restored from the backup and restarted.
We call this scheme "cold standby".
Why "cold"? Because data center B only holds backups and serves no live traffic; it sits cold and is only activated when A fails.
But backup has the same problems as before: data is incomplete, the business is unavailable during restoration, and overall availability cannot be guaranteed.
So we again turn to master-slave replication and deploy replicas of A's data in data center B.
The architecture becomes like this:
Now even if all of data center A goes down, we still have relatively complete data in B.
The data is safe, but there is another issue: if A really does go down, keeping the service alive means urgently doing all of the following in B.
For example:
1) Promote all slave libraries in B to masters;
2) Deploy the applications in B and start the services;
3) Deploy the access layer and configure forwarding rules;
4) Point DNS at B, admit traffic, and restore the business.
See? After A fails, B has to do all this work before the business fully recovers.
The whole process needs human intervention and takes a long time, and the service stays down until it finishes. This is still not good enough: it would be far better to switch over immediately after a failure.
So to shorten the recovery time, you must do this work in B ahead of time: pre-deploy the access layer and the business applications in B, ready to switch at any moment.
The architecture becomes like this:
Now if data center A goes down completely, we only need to do two things:
1) Promote all slave libraries in B to masters;
2) Point DNS at B, admit traffic, and restore the business.
Recovery is now much faster.
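The two-step hot-standby switchover can be sketched as a small runbook script. Everything here is hypothetical scaffolding: the two helpers merely stand in for real database promotion commands and a DNS provider's API, neither of which the article specifies:

```python
def promote_slaves_to_master(datacenter: str, databases: list) -> list:
    """Step 1: promote every slave library in the standby DC to master.
    (Stand-in for real tooling, e.g. MySQL's STOP SLAVE / RESET SLAVE ALL.)"""
    return [f"{db}@{datacenter}:master" for db in databases]

def switch_dns(domain: str, target_dc_ip: str) -> dict:
    """Step 2: point DNS at the standby DC so traffic flows there.
    (Stand-in for a DNS provider API call; record TTLs delay the cutover.)"""
    return {"domain": domain, "A": target_dc_ip}

# Data center A has failed -- run the two steps against B:
promoted = promote_slaves_to_master("dc-b", ["mysql", "redis"])
dns = switch_dns("api.example.com", "10.2.0.1")
```

Even in sketch form it shows why hot standby is faster than cold: only these two steps remain at failure time, because everything else was deployed in advance.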
Notice that data center B has evolved from empty to an almost exact mirror of A, from the access layer at the top, through the business applications in the middle, down to the storage at the bottom.
The only difference between the two data centers is that A's storage is the master and B's is the slave.
We call this scheme "hot standby".
"Hot" means data center B is on standby: after A fails, B can take over the traffic at any time and keep serving.
The biggest advantage of hot standby over cold standby is that it can be switched to at any moment.
Whether cold or hot, because the second data center is in a standby role, we collectively call both schemes same-city disaster recovery.
Its biggest benefit: we no longer fear data-center-level failures. If one data center is at risk, we just switch traffic to the other, and availability rises another notch. Pretty cool, right? (It gets cooler later...)
9. Same-city active-active architecture
Let's continue from the architecture of the previous section.
Although we now have an answer to data-center failure, there is a question we cannot ignore: when A goes down and all traffic is cut over to B, can B really serve as we hope?
It's worth thinking about.
It's like having two armies, A and B. Army A is battle-hardened with rich combat experience, while Army B is merely a reserve force: basic soldiering skills, but essentially zero real combat experience.
If Army A loses its fighting capacity and Army B must step in immediately, as the commander you would naturally worry whether B can really shoulder the burden, right?
Our architecture is the same: data center B is nominally on standby, but if A actually fails and we cut all traffic to B, we honestly cannot guarantee 100% that it will work as planned.
Think about it: even deploying services within one data center constantly throws up problems, such as inconsistent release versions, insufficient system resources, mismatched operating system parameters, and so on. With a whole extra data center deployed, those problems only multiply.
Moreover, from a cost perspective, the new data center's servers, memory, disks and bandwidth were all paid for; leaving it as a mere reserve is a huge waste.
So we should let data center B receive traffic and serve in real time too. This brings two benefits:
One, the reserve force gets trained in live combat, reaching the same readiness as data center A and becoming switchable at any time;
Two, once B carries traffic, it shares A's load.
This is the best way to extract full value from data center B!
How do we get traffic into data center B? Simple: add the IP of B's access layer to DNS, and traffic will flow into B from above.
But there is a catch: don't forget, B's storage is currently a slave of A's, and a slave is read-only by default. A write request arriving in B and hitting B's local storage will simply error out, which is not what we want. What to do?
You need to make changes at the business-application layer.
When a business application touches the database, it must apply read/write separation (usually via middleware): read traffic in either data center may read that data center's local storage, but write traffic may only be written to data center A, because the master is in A.
This involves every store your project uses: MySQL, Redis, MongoDB, and so on. Every access must distinguish reads from writes, so there is a real retrofitting cost to the business.
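The routing rule just described can be sketched in a few lines (illustrative only; in practice a database proxy or middleware does this transparently, and the data-center names are made up):

```python
class ReadWriteRouter:
    """Route writes to the master DC; serve reads from the local DC."""

    def __init__(self, local_dc: str, master_dc: str):
        self.local_dc = local_dc
        self.master_dc = master_dc

    def route(self, sql: str) -> str:
        verb = sql.lstrip().split(None, 1)[0].upper()
        is_write = verb in ("INSERT", "UPDATE", "DELETE", "REPLACE")
        # Writes must cross to the master DC; reads stay local.
        return self.master_dc if is_write else self.local_dc

# Data center B's applications keep reads local but send writes to A:
router_b = ReadWriteRouter(local_dc="dc-b", master_dc="dc-a")
router_b.route("SELECT * FROM users")         # served locally from dc-b
router_b.route("UPDATE users SET name='x'")   # forwarded to the master in dc-a
```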
Because the storage in A is the master, we call A the "primary data center" and B the "secondary data center".
The two are deployed in the same city, physically close and linked by a dedicated line. Cross-data-center access is slower than staying within one data center, but the overall latency is still acceptable.
Once the retrofit is done, B can gradually take on traffic (ramping from 10% to 30%, 50% and eventually 100%) while you watch for problems in B's handling of the business, fix them promptly, and bring B's operational readiness up to A's level.
Now, because B carries live traffic, if A goes down we can boldly cut all of A's traffic over to B and complete the switchover quickly!
Note that although B is physically separate from A, logically we plan the two data centers as one whole; it is as if the two data centers were a single one.
This scheme goes one step further than same-city disaster recovery: B serves traffic in real time and can also absorb a failover at any moment. It is called "same-city active-active".
Because both data centers can handle requests, internal maintenance, refactoring and upgrades gain much more room to maneuver (traffic can be shifted at any time), and the whole system becomes far more flexible. Not bad, right?
So what is wrong with this architecture?
10. Two-location three-center architecture
Let's return to risk.
As said in the previous section, although we plan the two data centers as one whole, physically they still sit in a single city. If the city suffers a natural disaster such as an earthquake or flood, both data centers can still be wiped out together.
Hard to guard against indeed. What to do? No way around it: more redundancy.
But this time the redundant data center cannot be in the same city; it must be placed farther away, deployed in a different location.
It is usually recommended that the two sites be more than 1000 kilometers apart, to withstand city-level disasters.
If data centers A and B are in Beijing, the new data center C can be placed in Shanghai.
Following the earlier playbook, the simplest, crudest use for C is cold backup: periodically back up the data of A and B into C to guard against total loss.
This scheme is the often-heard "two locations, three centers".
Specifically: two locations means two cities, and three centers means three data centers. Two of the data centers sit in one city and serve traffic together, while the third sits in another city purely for data disaster recovery.
This architecture is common in banking, finance and government projects. Its problem is the one already discussed: bringing the disaster recovery data center online takes time, and whether it will work as planned once activated is uncertain.
That is why, to truly withstand city-level failures, more and more Internet companies are adopting "remote active-active".
11. Pseudo-remote active-active architecture
Here we again analyze a two-data-center architecture.
This time we no longer put data centers A and B in the same city but deploy them apart: say, A in Beijing and B in Shanghai.
We covered same-city active-active earlier. Can remote active-active simply copy that deployment model?
Things are not that simple.
If deployed along the same-city active-active lines, the remote active-active architecture looks like this:
Note: the two data centers are connected by a cross-city dedicated line.
Both data centers now serve traffic, so a request in the Shanghai data center may need to read and write storage in Beijing. And here lies a big problem: network latency.
The two data centers are far apart and bound by physical distance; inter-city network latency has become a factor that cannot be ignored.
Beijing and Shanghai are about 1,300 kilometers apart. Even over a high-speed dedicated line, with light propagating through the fiber, a round trip costs close to 10 ms.
On top of that, the path contains routers, switches and other network equipment, so actual latency may reach 30 ms to 100 ms, and under network jitter it can even reach a second.
And beyond latency: a long-haul dedicated line is nowhere near as reliable as an intra-data-center network; it routinely suffers delay, packet loss, even outright interruption. In short, do not over-trust or over-rely on the inter-city line.
You might ask, does this latency really hurt the business? Enormously!
Imagine a client calls the Shanghai data center, which must read and write Beijing's storage. One cross-data-center access costs 30 ms, roughly 60 times (30 ms / 0.5 ms) the intra-data-center access time of 0.5 ms; each such request is 60 times slower, and with a round trip more than 100 times slower.
Opening one page in an app may touch dozens of back-end APIs. If every call crosses data centers, the page's total latency may reach seconds, which is dreadful and unacceptable.
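The latency figures above check out with back-of-the-envelope arithmetic (illustrative constants: light travels at roughly 200,000 km/s in optical fiber, about two-thirds of its vacuum speed):

```python
FIBER_KM_PER_MS = 200.0   # 200,000 km/s expressed as km per millisecond
DISTANCE_KM = 1300        # Beijing <-> Shanghai, approximately

one_way_ms = DISTANCE_KM / FIBER_KM_PER_MS   # 6.5 ms one way
round_trip_ms = 2 * one_way_ms               # 13 ms, i.e. "close to 10 ms"

# With routers/switches in the path, take 30 ms per cross-DC call
# versus ~0.5 ms inside one data center:
slowdown = 30 / 0.5                          # each call is 60x slower

# A page that fans out to 20 sequential cross-DC API calls:
page_latency_ms = 20 * 30                    # 600 ms from the network alone
```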
See the problem? We merely moved one data center far away while keeping the same-city active-active model, which simply does not apply here. Deploying it this way yields only "pseudo-remote active-active"!
So how do we achieve true remote active-active?
12. A true remote active-active architecture
Since the "cross-computer room" call delay is a factor that cannot be ignored, we can only try to avoid cross-computer room "call" to avoid this delay problem.
That is to say: The application in the Shanghai computer room can no longer "cross the computer room" to read and write the storage in the Beijing computer room. Only the local storage in Shanghai is allowed to read and write, so as to achieve "nearby access", so as to avoid the problem of delay.
Still the problem mentioned before: Shanghai computer room storage is from the library, and writing is not allowed. Unless we only allow Shanghai computer room to access "read traffic" and not receive "write traffic", it will not be able to meet the requirement of no longer crossing computer rooms. .
Obviously: The plan of only letting the Shanghai computer room receive the read traffic is unrealistic, because there are very few projects that only have read traffic and no write traffic. So this scheme still doesn't work, what should I do?
At this point, you must make a transformation in the "storage layer".
If you want the Shanghai computer room to read and write the storage in the local computer room, the storage in the Shanghai computer room can no longer be the slave library of the Beijing computer room, but must become the "master library".
You read that right: the storage of the two computer rooms must be the "main database", and the data of the two computer rooms must be "synchronized with each other", that is, no matter which computer room the client writes, the data can be synchronized to the other. A computer room.
Because only two computer rooms have "full data", can they support arbitrary switching of computer rooms and continue to provide services.
How to realize this "dual master" architecture? How do they synchronize data with each other?
If you know MySQL, you may know that it ships with a dual-master mode that supports two-way replication, but it is rarely used in practice. Moreover, databases such as Redis and MongoDB do not provide this capability at all. Therefore, you must build "data synchronization middleware" to implement two-way synchronization yourself.
In addition to databases, your project usually depends on other stateful services such as message queues (RabbitMQ, Kafka, and so on). These likewise need two-way synchronization middleware, so that data written in either data center is replicated to the other.
As you can see, complexity shoots up here: building synchronization middleware for every database and queue takes considerable effort.
The industry has open-sourced a number of such middleware, for example Alibaba's Canal for MySQL, and RedisShake and MongoShake for synchronizing Redis and MongoDB data between two data centers.
Many companies with the engineering capacity build their own (Ele.me, Ctrip, and Meituan, for example, have all developed in-house synchronization middleware).
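To make two-way synchronization concrete, here is a minimal, hypothetical sketch (not any of the middleware named above). The core trick it illustrates: every change is tagged with the data center where it originated, so the replicator can skip changes that came from the peer and avoid an endless replication loop.

```python
# Toy model of a dual-primary setup with origin-tagged changes.
# All names and structures are illustrative assumptions, not a real middleware.

class DataCenter:
    def __init__(self, name):
        self.name = name
        self.store = {}          # key -> value
        self.changelog = []      # (key, value, origin) entries awaiting replication

    def write(self, key, value):
        """A local client write: apply it and log it for replication."""
        self.store[key] = value
        self.changelog.append((key, value, self.name))

def replicate(src, dst):
    """Apply src's logged changes to dst, skipping entries dst originated
    (loop prevention: replicated data is applied but never re-logged)."""
    for key, value, origin in src.changelog:
        if origin != dst.name:
            dst.store[key] = value

beijing, shanghai = DataCenter("beijing"), DataCenter("shanghai")
beijing.write("X", 1)        # written in Beijing
shanghai.write("Y", 2)       # written in Shanghai

replicate(beijing, shanghai)
replicate(shanghai, beijing)

# After two-way sync, both data centers hold the full data set.
assert beijing.store == {"X": 1, "Y": 2}
assert shanghai.store == {"X": 1, "Y": 2}
```

A real middleware would tail the database's change stream (e.g. the MySQL binlog, as Canal does) instead of an in-process list, and would retry over the dedicated link until the peer acknowledges, but the origin-tagging idea is the same.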
Now the whole architecture looks like this:
Note: the storage layers of the two data centers synchronize with each other.
With data synchronization middleware in place, you get this effect:
1) The Beijing data center writes X = 1;
2) The Shanghai data center writes Y = 2;
3) The middleware synchronizes the data in both directions;
4) Both the Beijing and Shanghai data centers end up with X = 1 and Y = 2.
Because the middleware synchronizes data in both directions, we also no longer need to worry about the dedicated link: if the link fails, the middleware retries automatically until it succeeds and the data becomes eventually consistent.
But one problem remains: both data centers accept writes. If they modify different records, all is well; but what if they modify the same record and the writes conflict?
1) A user sends 2 modification requests in quick succession, both targeting the same record;
2) One request lands in the Beijing data center and sets X = 1 (not yet synchronized to Shanghai);
3) The other request lands in the Shanghai data center and sets X = 2 (not yet synchronized to Beijing);
4) Which data center's value should prevail?
In other words, when the same user modifies the same record within a short window, the two data centers cannot tell which write came first: the data "conflicts".
This is a very serious problem: a system outage is not what's frightening; what's frightening is the data being "wrong", because the cost of correcting bad data is far too high. We must prevent this from happening.
There are 2 ways to solve this problem.
The first solution: give the data synchronization middleware the ability to automatically "merge" data and resolve "conflicts".
This is complicated to implement. To merge data, you must determine the order of the writes. The solution that comes to mind most easily is to use "time" as the yardstick and let the "later" request win.
However, this requires the "clocks" of the two data centers to be strictly consistent, otherwise problems easily arise.
For example:
1) The first request lands in the Beijing data center, whose clock reads 10:01, and sets X = 1;
2) The second request lands in the Shanghai data center, whose clock reads 10:00, and sets X = 2.
Because Beijing's timestamp is "later", the final result is X = 1. But the second request should have prevailed: X = 2 is the correct value.
As you can see, conflict resolution that "relies" entirely on clocks is not rigorous.
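The clock-skew failure above can be reproduced in a few lines. This is an illustrative sketch of a "last-write-wins" merge, with timestamps mirroring the Beijing/Shanghai example:

```python
# Why "latest local timestamp wins" breaks under clock skew.
# Hypothetical sketch for illustration, not production conflict resolution.

def lww_merge(write_a, write_b):
    """Last-write-wins: keep the write with the larger local timestamp."""
    return write_a if write_a["ts"] >= write_b["ts"] else write_b

# Real-world order: the Beijing write happened FIRST, the Shanghai write
# SECOND, but Shanghai's clock runs one minute behind Beijing's.
first  = {"value": 1, "ts": "10:01"}   # Beijing, local clock reads 10:01
second = {"value": 2, "ts": "10:00"}   # Shanghai, local clock reads 10:00

winner = lww_merge(first, second)
assert winner["value"] == 1   # LWW picks X = 1, but the correct answer is X = 2
```

The merge function is perfectly deterministic; the data is still wrong, because the ordering it trusts is wrong. That is why the second solution below avoids the conflict instead of resolving it.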
Therefore, the second solution is usually adopted: avoid data conflicts at the "source".
Let's continue...
13. A better remote active-active architecture and implementation ideas
Continuing from the previous section: since automatically merging data is costly to implement, we have to ask, can we "avoid" data conflicts at the source instead?
That is exactly the right idea!
13.1 Basic Ideas
The idea of avoiding conflicts at the source is this: prevent conflicts from ever arising, at the layer where traffic first enters.
Concretely, "partition" the users at the top level: some users' requests are fixed to the Beijing data center, and the rest are fixed to Shanghai. Once a user's request enters a data center, all subsequent business operations complete inside that data center, eliminating "cross-data-center" access at the root.
To do this, you deploy a "routing layer" above the access layer (usually on cloud servers), with configurable routing rules that "distribute" users across the data centers.
But how do you choose the routing rule?
There are many approaches; I have summarized the 3 most common:
1) Sharding by business type;
2) Direct hash sharding;
3) Sharding by geographic location.
13.2 Sharding by Business Type
This scheme partitions traffic by the application's "business type".
Example: suppose we have 4 applications in total, each deployed in both the Beijing and Shanghai data centers. Applications 1 and 2 take traffic only in Beijing, with Shanghai as a hot standby; applications 3 and 4 take traffic only in Shanghai, with Beijing as the hot standby.
This way, all requests for applications 1 and 2 read and write only Beijing's storage, and all requests for applications 3 and 4 read and write only Shanghai's.
Sharding by business type likewise prevents the same user from modifying the same record in two places.
When routing traffic by business type, you must also consider the dependencies between applications: deploy applications serving "related" businesses in the same data center wherever possible, to avoid cross-data-center calls.
For example, if the order and payment services depend on each other and call each other, both should take traffic in data center A; if the community and posting services are interdependent, both take traffic in data center B.
13.3 Direct Hash Sharding
In this scheme, the top-level routing layer computes a "hash" of the user ID, looks up the target data center in a routing table, and forwards the request there.
Example: with 200 users in total, the routing rule might send users 1-100 to the Beijing data center and users 101-200 to Shanghai based on the hash of their IDs. Since each user always lands in the same data center, the same user can never modify the same record in two places.
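A routing rule of this kind might look like the sketch below. The data center names and the modulo split are illustrative assumptions, not a real routing table:

```python
# Sketch of a top-level hash-sharding router: hash the user ID and map it
# to a fixed data center, so all of that user's reads and writes stay in
# one place. Names and the 50/50 split are made up for illustration.

import hashlib

DATA_CENTERS = ["beijing", "shanghai"]

def route(user_id: str) -> str:
    """Deterministically map a user to a data center."""
    # Use a stable hash (not Python's built-in hash(), which varies per run).
    h = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return DATA_CENTERS[h % len(DATA_CENTERS)]

# The same user always routes to the same data center:
assert route("user-42") == route("user-42")
assert route("user-42") in DATA_CENTERS
```

In production the routing table is usually configurable (so users can be migrated between data centers during a failover) rather than a pure modulo, but the deterministic user-to-site mapping is the essence.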
13.4 Sharding by Geographical Location
This scheme suits businesses closely tied to geography, such as ride-hailing and food delivery.
Take food delivery as an example: you always order from "nearby" merchants. The whole transaction involves a merchant, a user, and a rider who are all in the same geographic area.
Given this property, you can split traffic across data centers at the top level by the user's "geographical location".
For example, users in Beijing and Hebei who order food call only the Beijing data center, while users in Shanghai and Zhejiang call only Shanghai. This sharding rule likewise avoids data conflicts.
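A geographic routing rule can be as simple as a static region-to-data-center map. The sketch below is hypothetical; the region names and the fallback choice are invented for illustration:

```python
# Sketch of geo-based sharding: route each user to the data center that
# covers their region. The mapping and default are illustrative only.

REGION_TO_DC = {
    "beijing":  "dc-beijing",
    "hebei":    "dc-beijing",
    "shanghai": "dc-shanghai",
    "zhejiang": "dc-shanghai",
}

def route_by_region(region: str) -> str:
    # Fall back to a default data center for unmapped regions.
    return REGION_TO_DC.get(region, "dc-beijing")

assert route_by_region("hebei") == "dc-beijing"
assert route_by_region("zhejiang") == "dc-shanghai"
```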
A reminder: these 3 common sharding rules are not obvious at first glance; it helps to go over them several times alongside the diagrams. Only once you understand all three can you truly understand how remote multi-active works.
In short, the core idea of sharding is that all of a given user's related requests form a complete business "closed loop" within a single data center, so no "cross-data-center" access ever occurs.
When Alibaba implemented this scheme, they gave it a name: "unitization".
Of course, once the top-level routing layer shards users, in theory the same user always lands in the same data center; in practice, program bugs can still cause a user to "drift" between the two.
To be safe, each data center needs a mechanism that checks "data ownership" when writing to storage: the application layer writes through middleware that acts as a safety net, rejecting writes that should not happen in this data center. (For reasons of space I won't expand on this here; just grasp the idea.)
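As an illustration of that "data ownership" safety net, here is a hypothetical write guard. The even/odd sharding rule and all names are invented for the example; the point is only that the storage layer re-checks ownership before accepting a write:

```python
# Sketch of an ownership check at the storage layer: before a write, the
# middleware re-applies the same sharding rule the routing layer uses and
# rejects writes that drifted into the wrong data center. Illustrative only.

LOCAL_DC = "beijing"   # which data center this process runs in

def owner_dc(user_id: int) -> str:
    """The sharding rule (a toy even/odd split here)."""
    return "beijing" if user_id % 2 == 0 else "shanghai"

def guarded_write(store: dict, user_id: int, key: str, value):
    if owner_dc(user_id) != LOCAL_DC:
        # In production: reject, alert, or forward to the owning data center.
        raise PermissionError(f"user {user_id} belongs to {owner_dc(user_id)}")
    store[key] = value

store = {}
guarded_write(store, 2, "X", 1)        # user 2 is owned here: accepted
try:
    guarded_write(store, 3, "X", 2)    # user 3 belongs to Shanghai: rejected
except PermissionError:
    pass
assert store == {"X": 1}               # the stray write never landed
```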
Now both data centers accept "read and write" traffic (sharded by request), the underlying storage stays synchronized in "both directions", and both hold the full data set. If either data center fails, the other can "take over" all the traffic with a fast switchover. Pretty cool, isn't it?
Better still, because the data centers sit in different regions, we can "refine" the routing rules further and send each user to the nearest data center, which also improves the performance of the whole system considerably.
There is one kind of data that cannot be sharded: global data, such as system configuration and commodity inventory, which require strong consistency. Services of this type must still write to a designated primary data center and read locally; they cannot go active-active.
The priority of active-active is to ensure the "core" businesses go active-active first, not "all" businesses.
At this point, we have achieved true "remote active-active"!
From here you can also see that building such an architecture is enormously expensive.
Routing rules, request forwarding, data synchronization middleware, and data verification all require powerful middleware, plus a series of cooperative changes on the business side (drawing business boundaries, splitting dependencies). Without sufficient manpower and resources, this architecture is very hard to land.
14. Remote multi-active architecture
Once you understand remote active-active, "remote multi-active" is, as the name implies, deploying additional data centers on top of it.
The architecture becomes this:
The services are deployed in a "unitized" fashion: each data center can sit in any region, and new data centers can be added at any time; you only need to define the sharding rules at the top level.
But a small problem arises: as more data centers are added, each write must be synchronized to more and more peers, and the implementation complexity grows accordingly.
The industry has therefore optimized this structure further, upgrading the "mesh" topology to a "star":
This scheme designates a "central data center". After any data center writes data, it synchronizes only to the central one, which then relays the change to every other data center.
The benefit: a data center writing data only needs to sync with the center, regardless of how many data centers are deployed in total, which greatly "simplifies" the complexity.
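The star topology can be sketched as a toy model (hypothetical structure, not real replication code): each site syncs only with the hub, and the hub fans the change out to everyone else.

```python
# Toy model of star-topology replication: N sites, one central hub.
# Adding a new site requires no changes at the other sites.

class Site:
    def __init__(self, name):
        self.name = name
        self.store = {}

class Hub(Site):
    def __init__(self, sites):
        super().__init__("hub")
        self.sites = sites

    def publish(self, origin, key, value):
        """A site wrote (key, value); record it and relay to every other site."""
        self.store[key] = value
        for site in self.sites:
            if site.name != origin:    # don't echo back to the writer
                site.store[key] = value

a, b, c = Site("a"), Site("b"), Site("c")
hub = Hub([a, b, c])

a.store["X"] = 1           # local write at site a
hub.publish("a", "X", 1)   # a only talks to the hub; the hub relays to b and c

assert b.store == {"X": 1} and c.store == {"X": 1}
```

Compare the mesh alternative: with N sites, a mesh needs on the order of N*(N-1) replication channels, while the star needs only N, at the cost of the hub becoming a stability-critical component.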
At the same time, this places higher demands on the "stability" of the central data center. Fortunately, even if the center fails, any other data center can be promoted to central, and service continues under the same structure.
At this point, our system has fully achieved "remote multi-active"!
The advantage of multi-active is that data centers can be added arbitrarily and deployed "near" users, and when any one fails, a fast "switchover" completes, which greatly improves the availability of the system.
We also no longer need to worry about the system outgrowing itself, because this architecture has very strong "scalability".
So, starting from the simplest application and optimizing step by step to the final architecture, have we helped you thoroughly understand remote multi-active?
15. Summary of this article
Well, let's summarize the key points of this article.
1) A good software architecture should follow three principles: high performance, high availability, and easy extensibility. Among them, "high availability" grows ever more important as the system grows.
2) A failure itself is not what's frightening; recovering at the "fastest" possible speed is the goal of high availability, and remote multi-active is an effective means to achieve it.
3) The core of high availability is "redundancy". Backups, primary-replica replication, intra-city disaster recovery, intra-city active-active, two-locations-three-centers, remote active-active, and remote multi-active are all forms of redundancy.
4) Intra-city disaster recovery comes in "cold standby" and "hot standby" flavors: cold standby only backs up data and serves no traffic, while hot standby synchronizes data in real time and is ready to switch over at any moment.
5) The advantage of intra-city active-active over disaster recovery is that both data centers accept "read and write" traffic, improving both availability and performance. Although there are physically two data centers, "logically" they still operate as one.
6) Two-locations-three-centers adds a remote "disaster recovery" data center on top of intra-city active-active to withstand "city-level" disasters, but activating that data center takes time.
7) Remote active-active is the better defense against "city-level" disasters: two data centers serve traffic simultaneously and can switch over at any time, giving high availability. It is also the most complex to implement, and only after understanding remote active-active can you fully understand remote multi-active.
8) Remote multi-active extends remote active-active with arbitrarily many data centers, further improving availability and handling larger-scale traffic. It has the strongest scalability and is the ultimate solution for high availability.
16. Write at the end
From a "macro" perspective, this article has walked you through the "core" ideas of remote multi-active architecture. The article carries a lot of information; if it doesn't all sink in at once, I suggest reading it several times.
Due to space limits I did not expand on many details. This article is more about the "way" of remote multi-active; the actual "techniques" of implementation involve many more considerations, because powerful "infrastructure" must be developed to land it.
Beyond that, truly achieving remote multi-active means following a set of principles: business inventory, business grading, data classification, guaranteeing eventual data consistency, guaranteeing consistency during data center switchover, exception handling, and so on. The accompanying operations facilities and monitoring systems must keep up as well.
At the macro level you must consider the business (microservice deployment, dependencies, splitting, SDKs, web frameworks) and the infrastructure (service discovery, traffic scheduling, continuous integration, synchronization middleware, self-developed storage); at the micro level you must develop the various middleware while attending to their performance, availability, and fault tolerance.
I was fortunate to take part in designing and developing storage-layer synchronization middleware, implementing "cross-data-center" synchronization for MySQL, Redis, and MongoDB, and stepped on plenty of pitfalls along the way. The design ideas behind these middleware are quite interesting; when I have time, I'll share them separately.
One thing worth stressing: only by truly understanding "remote active-active" can you fully understand "remote multi-active".
In my view, the evolution from intra-city active-active to remote active-active is the most complex step. Its core elements are business unitization, two-way synchronization at the storage layer, and the sharding logic at the very top; these are the keys to achieving remote multi-active.
I hope the architecture experience shared here inspires you.
Appendix: More IM technical resources
[1] Articles on IM architecture design:
"On the architecture design of IM system"
"A brief description of the pits of mobile IM development: architecture design, communication protocol and client"
"A set of mobile IM architecture design practice sharing for massive online users (including detailed graphics and text)"
"An Original Distributed Instant Messaging (IM) System Theoretical Architecture Plan"
"A set of high-availability, easy-scalable, and high-concurrency IM group chat and single chat architecture design practices"
"From guerrilla to regular army (1): the evolution of the IM system architecture of Mafengwo Travel Network"
"The data architecture design of Guazi IM intelligent customer service system (organized from the on-site speech, with supporting PPT)"
"Ali DingTalk Technology Sharing: Enterprise-level IM King-DingTalk's outstanding features in the back-end architecture"
"A set of IM architecture technical dry goods for hundreds of millions of users (Part 1): overall architecture, service split, etc."
"A set of IM architecture technical dry goods for hundreds of millions of users (Part 2): reliability, orderliness, weak network optimization, etc."
"From novice to expert: How to design a distributed IM system with billions of messages"
"The Secret of the IM Architecture Design of Enterprise WeChat: Message Model, Ten Thousands of People, Read Receipt, Message Withdrawal, etc."
"Rongyun Technology Sharing: Fully Revealing the Reliable Delivery Mechanism of 100 Million-level IM Messages"
"IM Development Technology Learning: Demystifying the System Design Behind the Information Push of WeChat Moments"
"Alibaba IM Technology Sharing (3): The Road to the Evolution of the Architecture of Xianyu's Billion-level IM Message System"
"Alibaba IM Technology Sharing (4): Reliable Delivery Optimization Practice of Xianyu's 100-million-level IM Message System"
"Ali IM Technology Sharing (5): Timeliness Optimization Practice of Xianyu's Billion-level IM Message System
[2] Other IM technology comprehensive articles:
"One entry is enough for novices: develop mobile IM from scratch"
"Mobile IM developers must read (1): easy to understand, understand the "weak" and "slowness" of mobile networks"
"A Must-Read for Mobile IM Developers (2): Summary of the Most Complete Mobile Weak Network Optimization Method in History"
"From the perspective of the client to talk about the message reliability and delivery mechanism of the mobile terminal IM"
"Summary of optimization methods for short connection of modern mobile network: request speed, weak network adaptation, security assurance"
"How to ensure the efficiency and real-time performance of large-scale group message push in mobile IM? 》
"Technical issues that need to be faced in mobile IM development"
"Implementation of IM Message Delivery Guarantee Mechanism (1): Guarantee the reliable delivery of online real-time messages"
"Implementation of IM Message Delivery Guarantee Mechanism (2): Guaranteeing the Reliable Delivery of Offline Messages"
"How to ensure the "sequence" and "consistency" of IM real-time messages? 》
"A low-cost method to ensure the timing of IM messages"
"Should I use "push" or "pull" for online status synchronization in IM single chat and group chat? 》
"IM group chat messages are so complicated, how to ensure that they are not lost or repetitive? 》
"Talk about the optimization of login request in mobile IM development"
"How to save data by pulling data during IM login on the mobile terminal? 》
"On the principle of multi-sign-in and message roaming of IM on the mobile terminal"
"How to design a "failure retry" mechanism for a completely self-developed IM? 》
"Is it so difficult to develop IM yourself? Teach you to teach yourself an Andriod version of simple IM (with source code) "
"Suitable for novices: develop an IM server from scratch (based on Netty, with complete source code)"
"Pick up the keyboard and do it: work with me to develop a distributed IM system by hand"
"IM Message ID Technology Topic (1): Practice of Generating Massive IM Chat Message Sequence Numbers on WeChat (Principles of Algorithms)"
"IM Development Collection: The most complete in history, a summary of various function parameters and logic rules of WeChat"
"IM development dry goods sharing: how do I solve a large number of offline messages causing the client to freeze"
"Introduction to zero-based IM development (1): What is an IM system? 》
"Introduction to zero-based IM development (2): What is the real-time nature of the IM system? 》
"Introduction to zero-based IM development (3): What is the reliability of the IM system? 》
"Introduction to zero-based IM development (4): What is the message timing consistency of the IM system? 》
"IM development dry goods sharing: how to elegantly realize the reliable delivery of a large number of offline messages"
"IM Scan Code Login Technology Topic (3): Easy to understand, one detailed principle of IM scan code login function is enough"
"IM Scan Code Login Technology Topic (4): Do you really understand QR codes? Get to the bottom of the question and master it in one article! 》
"Understanding the "Reliability" and "Consistency" Issues of IM Messages and Discussion of Solutions"
"IM Development and Dry Goods Sharing: Long 10,000-character text, detailed explanation of IM "message" list lagging optimization practice"
"IM development and dry goods sharing: Netease Yunxin IM client's chat message full-text retrieval technology practice"
"IM Development Technology Learning: Demystifying the System Design Behind the Information Push of WeChat Moments"
This article has been simultaneously published on the official account of "Instant Messaging Technology Circle".
The synchronous publishing link is: http://www.52im.net/thread-3742-1-1.html