The day I was terrified by the Redis master-slave architecture

Interviewer : What have you been reading about recently? Bring it up and we can discuss it together (I don’t know what to ask today)

Candidate : I've been reading Redis-related material recently

Interviewer : Right, I remember I already asked you about Redis basics and persistence

Interviewer : What does your company's Redis architecture look like?

Candidate : My former company's Redis architecture was a "sharded cluster", which used a "Proxy" layer to distribute keys to different Redis servers

Candidate : It supported dynamic scaling, failure recovery, and so on...

Interviewer : Then can you talk about the architecture and basic implementation principles of the Proxy layer?

Candidate : Sorry, the middleware team is responsible for that part, and I haven’t studied it closely.

Candidate :...

Interviewer :...

Candidate : However, I can tell you about the common open source Redis architectures (:

Interviewer : Fine, that will have to do. Go ahead

Candidate : So let me start with the basics?

Candidate : As mentioned earlier, Redis has a persistence mechanism: even if Redis restarts, it can reload its data from the RDB or AOF files

Candidate : But in that setup a single Redis server stores all the data. If that server cannot be repaired quickly, every service that relies on Redis is down

Candidate : Therefore, for "high availability", Redis is now almost always "backed up": start one more Redis server to form a "master-slave architecture"

Candidate : The data of the "slave server" is copied from the "master server", so the master and slave servers hold the same data

Candidate : If the master server goes down, you can manually promote the "slave server" to "master server" to shorten the downtime

Interviewer : How does the "master server" "copy" its own data to the "slave server"?

Candidate : "Copying" is also called "synchronization". In Redis, the "PSYNC" command is used for synchronization. The command has two modes: full resynchronization and partial resynchronization

Candidate : Put simply: if this is the first "synchronization" (the slave server has never replicated from any master server), or the master server to be replicated differs from the one replicated last time, then the "full resynchronization" mode is used

Candidate : If the slave was only disconnected for a "short time", for example because of a network interruption, then the "partial resynchronization" mode is used

Candidate : (If the data gap between the master and slave servers is too large, the "full resynchronization" mode is still used)

Interviewer : Can we talk a little about the principle of "synchronization"?

Candidate : Sure, no problem

Candidate : Before the master server can copy data to the slave server, the two must first establish a socket "connection". This step includes some information checks, authentication, and so on

Candidate : Then the slave server sends the "PSYNC" command to the master server to request synchronization (carrying the "server ID" (RUNID) and the "replication progress" (offset) as parameters; a brand-new slave server has neither)

Candidate : The master server sees that this is a new slave server (because the parameters are missing), so it uses the "full resynchronization" mode and sends its own "server ID" (RUNID) and "replication progress" (offset) to the slave server, which records them

Interviewer : Well...

Candidate : Then the master server generates an RDB file in the background and sends it to the slave server over the established connection

Candidate : After receiving the RDB file, the slave server first clears its own data, and then loads the RDB file to restore the data

Candidate : During this process the master server is not idle either (it keeps accepting client requests)

Interviewer : Well...

Candidate : The write commands that arrive after the master server starts generating the RDB file are recorded in a "buffer". After the slave server finishes loading the RDB file, the master server sends all the commands recorded in the "buffer" to the slave server

Candidate : In this way the master and slave servers reach data consistency (the replication process is asynchronous, so this is "eventual consistency")
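
The full resynchronization steps described above can be sketched as a toy in-memory simulation. This is only a minimal sketch under stated assumptions, not how Redis is implemented: `ToyMaster`, `ToySlave` and their methods are hypothetical names, a copied dict stands in for the RDB file, a Python list stands in for the buffer, and the offset counts commands rather than bytes.

```python
import copy


class ToySlave:
    """Hypothetical stand-in for a slave server."""

    def __init__(self):
        self.data = {}
        self.master_run_id = None
        self.offset = 0

    def load_snapshot(self, snapshot):
        # Clear existing data first, then load the snapshot (the "RDB file").
        self.data.clear()
        self.data.update(snapshot)

    def apply(self, command):
        key, value = command
        self.data[key] = value
        self.offset += 1  # toy offset: counts commands, not bytes


class ToyMaster:
    """Hypothetical stand-in for a master server."""

    def __init__(self, run_id):
        self.run_id = run_id
        self.data = {}
        self.offset = 0
        self.buffer = None  # writes recorded while a snapshot is in flight

    def write(self, key, value):
        self.data[key] = value
        self.offset += 1
        if self.buffer is not None:
            self.buffer.append((key, value))

    def begin_full_resync(self, slave):
        # Answer a new slave: hand over run ID and offset, take a
        # point-in-time snapshot, and start buffering later writes.
        slave.master_run_id = self.run_id
        slave.offset = self.offset
        self.buffer = []
        return copy.deepcopy(self.data)

    def finish_full_resync(self, slave):
        # Replay the writes that arrived after the snapshot was taken.
        for command in self.buffer:
            slave.apply(command)
        self.buffer = None


master = ToyMaster(run_id="master-1")
master.write("a", 1)
slave = ToySlave()

snapshot = master.begin_full_resync(slave)
master.write("b", 2)            # the master keeps serving clients meanwhile
slave.load_snapshot(snapshot)   # the slave holds only the snapshot contents
master.finish_full_resync(slave)
```

After the last call the slave holds the same keys and the same offset as the master, even though one write arrived while the snapshot was in transit.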

Interviewer : Well...

Interviewer : And the "partial resynchronization" process?

Candidate : Well, partial resynchronization relies on the "offset". Every time the master server propagates a command to the slave server, the "offset" advances accordingly

Candidate : Both the master server and the slave server keep an "offset" (if the offsets on the two sides differ, the master and slave data are not fully synchronized)

Candidate : After the slave server reconnects, it sends the "PSYNC" command to the master server again, this time carrying the RUNID and offset (the slave still has this information after reconnecting)

Interviewer : Well...

Candidate : After the master server receives the command, it checks whether the RUNID matches its own. A match means the slave server has replicated part of this master's data before

Candidate : Then it checks whether the slave's "offset" is still within the range of offsets the master server has recorded

Candidate : (To explain: the master server records recent commands and their offsets in a ring buffer. When the buffer is full, new records overwrite the oldest ones)

Candidate : If the offset is found, the master server starts from the missing offset and sends the corresponding write commands to the slave server

Candidate : If the offset is not found in the ring buffer, the only option is to fall back to the "full resynchronization" mode and replicate again
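
The decision between partial and full resynchronization can be illustrated with a toy ring buffer. This is a minimal sketch, assuming a hypothetical `ToyBacklog` class, a deliberately tiny capacity, and offsets that count bytes from zero. The reply strings `CONTINUE` and `FULLRESYNC` are used here only as labels, and the real offset bookkeeping in Redis differs in detail.

```python
class ToyBacklog:
    """Hypothetical fixed-size ring buffer of the most recent bytes sent."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buf = bytearray(capacity)
        self.master_offset = 0  # total number of bytes ever fed in

    def feed(self, payload: bytes):
        # Write byte by byte, wrapping around and overwriting the oldest data.
        for byte in payload:
            self.buf[self.master_offset % self.capacity] = byte
            self.master_offset += 1

    def first_available_offset(self):
        # Oldest offset that has not been overwritten yet.
        return max(0, self.master_offset - self.capacity)

    def psync(self, master_run_id, slave_run_id, slave_offset):
        """Return ("CONTINUE", missing_bytes) or ("FULLRESYNC", None)."""
        if slave_run_id != master_run_id:
            return ("FULLRESYNC", None)  # slave replicated a different master
        in_range = self.first_available_offset() <= slave_offset <= self.master_offset
        if not in_range:
            return ("FULLRESYNC", None)  # the gap was already overwritten
        missing = bytes(
            self.buf[i % self.capacity]
            for i in range(slave_offset, self.master_offset)
        )
        return ("CONTINUE", missing)


backlog = ToyBacklog(capacity=8)
backlog.feed(b"SET a 1;")  # 8 bytes, fills the buffer
backlog.feed(b"DEL a;")    # 6 more bytes, overwrites the 6 oldest

short_gap = backlog.psync("m1", "m1", slave_offset=8)     # only missed "DEL a;"
long_gap = backlog.psync("m1", "m1", slave_offset=2)      # gap already overwritten
other_master = backlog.psync("m1", "m0", slave_offset=8)  # RUNID mismatch
```

Only `short_gap` can be served from the buffer. The other two cases fall back to full resynchronization, which is why a larger buffer tolerates longer disconnections.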

Interviewer : I understand master-slave replication now. You said that if the Redis master goes down, you still have to "manually" promote a slave to master

Interviewer : Do you know any way to achieve "automatic" failure recovery?

Candidate : Of course. This is where "Sentinel" makes its debut

Interviewer : Start your performance.

Candidate : "Sentinel" mainly does four things: monitoring (watching the status of the master server), master election (when the master server goes down, one of the slave servers is chosen as the new master), notification (sending failure messages to the administrator), and configuration (acting as a configuration center that provides information about the current master server)

Candidate : A "Sentinel" can be regarded as a Redis server running in a "special" mode. For "high availability", Sentinel is also deployed as a cluster

Candidate : First, each sentinel needs to establish connections with the Redis master and slave servers (to get their information)

Candidate : Each sentinel keeps sending the PING command to check whether the master server is offline. If the master server does not respond normally within the "configured time", that sentinel "subjectively" considers the master server offline

Candidate : The other "sentinels" also ping the master server. If "enough" sentinels (how many depends on the configuration) consider the master server offline, it is considered "objectively offline", and a failover is performed for the master server
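
The difference between "subjectively offline" and "objectively offline" can be sketched in a few lines. This is a simplified illustration, not Sentinel's code: the function names, the millisecond timestamps and the dict of per-sentinel observations are all assumptions, and real sentinels exchange their opinions over the network instead of sharing one dict.

```python
def subjectively_down(last_ok_reply_ms, now_ms, down_after_ms):
    """One sentinel's own view: no valid reply within the configured time."""
    return now_ms - last_ok_reply_ms > down_after_ms


def objectively_down(last_ok_replies_ms, now_ms, down_after_ms, quorum):
    """Cluster view: at least `quorum` sentinels see the master as down."""
    votes = sum(
        subjectively_down(last_ok, now_ms, down_after_ms)
        for last_ok in last_ok_replies_ms.values()
    )
    return votes >= quorum


# Hypothetical observations: when each sentinel last got a valid PING reply.
last_ok_replies_ms = {"sentinel-1": 60_000, "sentinel-2": 65_000, "sentinel-3": 95_000}
now_ms = 100_000
down_after_ms = 30_000

with_quorum_2 = objectively_down(last_ok_replies_ms, now_ms, down_after_ms, quorum=2)
with_quorum_3 = objectively_down(last_ok_replies_ms, now_ms, down_after_ms, quorum=3)
```

Two of the three sentinels have waited longer than the configured time, so a quorum of 2 triggers failover while a quorum of 3 does not.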

Interviewer : Well...

Candidate : A "leader" is elected among the "sentinels". There are quite a few rules for electing the leader; generally speaking it is first come, first served (whoever is fastest gets chosen)

Candidate : The "leader sentinel" then performs the failover for the master server that has gone offline

Interviewer : Well...

Candidate : First it selects one of the "slave servers" to become the master server

Candidate : (This selection is also done carefully, based on things such as the slave's configured priority, which slave server has the largest replication offset, the RunID ordering, how long the slave has been disconnected from the master...)

Candidate : Then the remaining slave servers need to perform "master-slave replication" with the new master server

Candidate : When the old master server that went offline reconnects, it is turned into a slave server of the new master server
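
The slave-selection criteria mentioned above can be expressed as a filter followed by a sort key. This is a minimal sketch with hypothetical names (`SlaveInfo`, `pick_new_master`, `max_disconnected_ms`). It assumes that a lower priority number is preferred and that priority 0 means "never promote", then that a larger replication offset wins, then the lexicographically smaller RunID. The disconnection threshold is an illustrative parameter, not the exact rule Sentinel applies.

```python
from dataclasses import dataclass


@dataclass
class SlaveInfo:
    name: str
    priority: int         # lower is preferred; 0 means "never promote"
    repl_offset: int      # how much of the master's stream was received
    run_id: str
    disconnected_ms: int  # time since the link to the old master was lost


def pick_new_master(slaves, max_disconnected_ms):
    # Step 1: drop slaves that must not or should not be promoted.
    candidates = [
        s for s in slaves
        if s.priority > 0 and s.disconnected_ms <= max_disconnected_ms
    ]
    if not candidates:
        return None
    # Step 2: best priority, then most data, then smallest RunID as tie-break.
    return min(candidates, key=lambda s: (s.priority, -s.repl_offset, s.run_id))


slaves = [
    SlaveInfo("slave-1", priority=100, repl_offset=5000, run_id="bbb", disconnected_ms=0),
    SlaveInfo("slave-2", priority=100, repl_offset=7000, run_id="ccc", disconnected_ms=0),
    SlaveInfo("slave-3", priority=100, repl_offset=7000, run_id="aaa", disconnected_ms=0),
    SlaveInfo("slave-4", priority=50, repl_offset=9000, run_id="ddd", disconnected_ms=60_000),
]
chosen = pick_new_master(slaves, max_disconnected_ms=10_000)
```

Here `slave-4` is excluded for being disconnected too long despite its better priority. Of the rest, `slave-2` and `slave-3` tie on offset, and the RunID breaks the tie in favour of `slave-3`.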

Interviewer : Hmm... Let me ask: can Redis lose data during master-slave replication and failover?

Candidate : Obviously it can. As the "master-slave replication" process above shows, replication is asynchronous (the master server keeps accepting requests and only afterwards sends the write commands to the slave server)

Candidate : If the master server dies before its commands have been sent to the slave server, and the slave server is then promoted to master, the data on that slave server is incomplete (:

Candidate : There is another situation: the sentinels think the master server is down, but it is actually still alive (network jitter), and they elect a slave server as the new master. At this point the "client" has not yet switched over and keeps writing data to the old master server

Candidate : When the old master server reconnects, it is made a slave server of the new master server... so the data the client wrote to the old master server during that time is lost

Candidate : For both cases (master-slave replication delay && split-brain), you can use configuration to avoid data loss "as much as possible"

Candidate : (When a certain threshold is reached, the master server simply refuses write requests, in an attempt to reduce the risk of data loss)
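
Redis exposes thresholds of this kind through the `min-slaves-to-write` and `min-slaves-max-lag` options (named `min-replicas-to-write` and `min-replicas-max-lag` in newer versions). The check they imply can be sketched as follows. The function name and the list of per-slave lags are assumptions for illustration, and the lag is taken to mean seconds since the slave's last acknowledgement.

```python
def master_accepts_writes(slave_lags_s, min_slaves_to_write, min_slaves_max_lag_s):
    """Accept writes only if enough slaves acknowledged recently enough."""
    healthy = sum(1 for lag in slave_lags_s if lag <= min_slaves_max_lag_s)
    return healthy >= min_slaves_to_write


# Normal operation: both slaves acknowledged within the last 10 seconds.
normal = master_accepts_writes([1, 2], min_slaves_to_write=1, min_slaves_max_lag_s=10)

# Replication delay: every slave is lagging too far behind.
lagging = master_accepts_writes([15, 40], min_slaves_to_write=1, min_slaves_max_lag_s=10)

# Split-brain: the old master is isolated and sees no slaves at all.
isolated = master_accepts_writes([], min_slaves_to_write=1, min_slaves_max_lag_s=10)
```

In the last two cases the master refuses writes, so the client cannot pile up data that would later be discarded. This shrinks the loss window but does not remove it, because writes accepted just before the threshold is crossed can still be lost.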

Interviewer : Can you talk about the Redis sharded cluster?

Candidate : Well... in a sharded cluster each Redis server stores part of the data, and the data of all the Redis servers together makes up the complete data set (distributed)

Candidate : To form a sharded cluster, you need to "route" (shard) different keys to different servers

Candidate : There are currently two general routing schemes: "client routing" (SDK) and "server routing" (Proxy)

Candidate : The representative of client-side routing is Redis Cluster, and the representative of server-side routing is Codis
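
As one concrete example of what "routing a key" means, Redis Cluster maps every key to one of 16384 hash slots using a CRC16 checksum, and each server owns a range of slots. The sketch below is an illustration rather than a client implementation: `hash_slot` is a hypothetical helper name, and it assumes that Python's `binascii.crc_hqx` with an initial value of 0 computes the same CRC16 variant, which is consistent with the check value 0x31C3 for the input "123456789".

```python
import binascii

SLOT_COUNT = 16384


def hash_slot(key: bytes) -> int:
    """Map a key to a slot number in the range 0..16383."""
    # Hash tag rule: if the key contains "{...}" with non-empty content,
    # only that content is hashed, so related keys land in the same slot.
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        if end != -1 and end > start + 1:
            key = key[start + 1:end]
    return binascii.crc_hqx(key, 0) % SLOT_COUNT


slot_a = hash_slot(b"{user1000}.following")
slot_b = hash_slot(b"{user1000}.followers")
```

Both keys share the tag `user1000`, so `slot_a` and `slot_b` are equal and the two keys are stored on the same server. With client routing the SDK performs this calculation itself; with server routing the proxy does the equivalent work on the client's behalf.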

Interviewer : What are the differences between them?

Candidate : I'm a little sleepy today, how about next time?

Summary of this article :

How Redis achieves high availability :
- AOF/RDB persistence mechanism
- Master-slave architecture (when the master server goes down, a slave server is promoted manually)
- The Sentinel mechanism, which provides automatic failover
Master-slave replication principle :
- The PSYNC command has two modes: full resynchronization and partial resynchronization
- Full resynchronization: the master and slave servers establish a connection; the master server generates an RDB file and sends it to the slave server without blocking (write commands that arrive in the meantime are recorded in a buffer); the master server then sends the buffered commands to the slave server
- Partial resynchronization: the slave server reconnects after a disconnection and sends the RunID and offset to the master server; the master server checks the RunID and offset and sends the slave server the commands it has not yet received
Sentinel mechanism :
- A sentinel can be understood as a special Redis server; sentinels are generally deployed as a cluster
- The main jobs of the sentinels are monitoring, notification, configuration, and master election
- When the master server is judged "objectively offline", a slave server is "selected" to take over, and the "leader sentinel" performs the switch
Data loss :
- Data loss may occur during Redis master-slave replication and failover (configuration can reduce it as much as possible)