
Hello everyone, I am Xiao Fu~

My WeChat official account: Programmer's Internal Affairs. You're welcome to follow it to learn and exchange ideas.

A couple of days ago I saw some folks in a technical group chat discussing the consistent hash algorithm, and since I happened to be short on topics to write about, I'll take the chance to briefly introduce how it works. Below we use the classic distributed-caching scenario as an example, one that also comes up often in interviews, to see what the consistent hash algorithm is and what it brings to the table.

Setting the scene

Suppose we have three cache servers numbered node0 , node1 , node2 , and 30 million keys that we want cached evenly across the three machines. What scheme comes to mind?

The first solution we probably think of is the modulo algorithm hash(key) % N : hash the key, then take the result modulo N, the number of machines. With three machines, the hashed key modulo 3 is always 0, 1, or 2, corresponding to servers node0 , node1 , node2 . To read a key, compute the same value and go straight to that server. Simple and crude, and it completely solves the problem above.
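A minimal sketch of this modulo routing. Note it uses MD5 rather than Python's built-in hash(), which is salted per process and would give different routes across restarts; the key name is just an example:

```python
import hashlib

def hash_key(key: str) -> int:
    # A stable hash (MD5 here) so results are reproducible across runs,
    # unlike Python's built-in hash(), which is salted per process.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def pick_server(key: str, num_servers: int) -> str:
    # hash(key) % N: the classic modulo routing scheme
    return f"node{hash_key(key) % num_servers}"

print(pick_server("user:1001", 3))  # always one of node0 / node1 / node2
```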

The problem with modulo hashing

Although the modulo algorithm is simple to use, taking the modulo of the number of machines has serious limitations when the cluster scales out or in, and adjusting the number of servers to match business volume is routine in production. When the server count N changes, the result of hash(key) % N changes with it.

For example: if one server goes down, the formula changes from hash(key) % 3 to hash(key) % 2 , and the results change accordingly. Now when you look up a key, its computed cache location has most likely moved, so the previously cached data can no longer be found; the cache loses its whole purpose.

When a large number of cache entries become unreachable at the same time, the result is a cache avalanche, which in turn can make the entire cache system unavailable. That is basically unacceptable, and the consistent hash algorithm was born to address exactly this situation.
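We can actually measure how bad the churn is. This small simulation (key names are made up) routes 100,000 keys with modulo 3, then with modulo 2 after one server is lost, and counts how many keys end up pointing at a different server:

```python
import hashlib

def stable_hash(key: str) -> int:
    # Stable hash so the simulation is reproducible
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

keys = [f"key-{i}" for i in range(100_000)]
before = {k: stable_hash(k) % 3 for k in keys}   # three servers
after  = {k: stable_hash(k) % 2 for k in keys}   # one server lost

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved / len(keys):.0%} of keys changed servers")  # roughly two thirds
```

A key stays put only when its hash gives the same result modulo 3 and modulo 2, which happens for just 2 of every 6 residues, so about two thirds of the cache is invalidated by losing a single node.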

So, how does the consistent hashing algorithm solve the above problems?

Consistent hash

The consistent hash algorithm is, in essence, still a modulo algorithm. But instead of taking the modulo of the number of servers as above, consistent hashing takes the modulo of a fixed value: 2^32.

An IPv4 address is composed of four groups of 8-bit binary numbers, so using 2^32 ensures that every IP address has a unique mapping.

hash ring

We can picture these 2^32 values arranged on a ring (the exact shape doesn't matter; it's just a mental aid). The point at the very top represents 0, and the values 1, 2, 3, 4, 5, 6... follow clockwise, all the way up to 2^32 - 1. This ring made up of 2^32 points is called the hash ring.

So what does this hash ring have to do with the consistent hash algorithm? Let's continue with the scenario above: three cache servers numbered node0 , node1 , node2 , and 30 million keys.

Mapping servers onto the hash ring

Now the formula changes from hash(key) % N to hash(server ip) % 2^32 : hash the server's IP address and take the result modulo 2^32. The result is always an integer between 0 and 2^32 - 1, and the position of that integer on the hash ring represents one server. Map node0 , node1 , and node2 onto the ring in turn.

Mapping object keys onto the hash ring

Next, map each object key that needs to be cached onto the same ring with hash(key) % 2^32 . With both the server nodes and the cached keys mapped onto the hash ring, which server should a given object key be cached on?

Mapping object keys to servers

Start from the position of the cached object key and move clockwise; the first server encountered is the one the object will be cached on.

Because the hash values of both the cached objects and the servers are fixed, as long as the servers stay unchanged, each object key is always cached on the same server. Applying this rule gives the mapping relationships shown in the figure below:

  • key-1 -> node-1
  • key-3 -> node-2
  • key-4 -> node-2
  • key-5 -> node-2
  • key-2 -> node-0

To read a key, just run the same calculation to find which server it is cached on.
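The lookup rule above maps neatly onto a sorted list plus a binary search. Here is a minimal sketch of a hash ring; the node names and IPs are made up for illustration:

```python
import bisect
import hashlib

RING_SIZE = 2 ** 32

def ring_hash(s: str) -> int:
    # Hash a string and take it modulo 2^32 to get a position on the ring
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % RING_SIZE

class ConsistentHashRing:
    def __init__(self, nodes):
        # nodes: {name: ip}; sorted (position, name) pairs form the ring
        self.ring = sorted((ring_hash(ip), name) for name, ip in nodes.items())
        self.points = [p for p, _ in self.ring]

    def get_node(self, key: str) -> str:
        # Walk clockwise from the key's position to the first server point,
        # wrapping back to the start of the ring past 2^32 - 1
        idx = bisect.bisect_right(self.points, ring_hash(key)) % len(self.ring)
        return self.ring[idx][1]

# Hypothetical IPs for the three servers in the running example
ring = ConsistentHashRing({
    "node0": "10.0.0.1",
    "node1": "10.0.0.2",
    "node2": "10.0.0.3",
})
print(ring.get_node("key-1"))  # the same key always lands on the same server
```

The sorted-list-plus-bisect layout makes each lookup O(log n) in the number of nodes, which is why this structure is the standard way to implement the "first server clockwise" rule.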

Advantages of Consistent Hash

Now that we have a basic understanding of how consistent hashing works, how does it fix the large-scale cache unavailability that the plain modulo algorithm suffers when nodes are added to or removed from the cluster?

Let's look at scale-out first. Suppose business volume surges and the system needs an extra server, node-4 , which happens to map between node-1 and node-2 . Since objects find their server by walking clockwise, the keys key-4 and key-5 that used to map to node-2 are remapped to node-4 . Only the small slice of data between node-4 and node-1 is affected during the entire scale-out.

On the other hand, if node-1 goes down, objects still map clockwise to the next node, so key-1 , which used to land on node-1 , is remapped to node-4 . The affected data is only the small slice between node-0 and node-1 .

From these two cases we can see that when the number of servers in the cluster changes, consistent hashing affects only a small portion of the data, so the cache system as a whole can keep serving requests.
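We can verify this claim the same way we measured modulo churn earlier. The sketch below (node names and IPs are made up) routes 100,000 keys on a three-node ring, adds a fourth node, and counts how many keys move; only the keys in the arc claimed by the new node should change servers:

```python
import bisect
import hashlib

def ring_hash(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2 ** 32)

def locate(key, sorted_ring):
    # sorted_ring: sorted list of (position, node) pairs
    points = [p for p, _ in sorted_ring]
    i = bisect.bisect_right(points, ring_hash(key)) % len(sorted_ring)
    return sorted_ring[i][1]

keys = [f"key-{i}" for i in range(100_000)]
nodes = {"node0": "10.0.0.1", "node1": "10.0.0.2", "node2": "10.0.0.3"}
ring3 = sorted((ring_hash(ip), n) for n, ip in nodes.items())

nodes["node4"] = "10.0.0.4"                     # scale out by one server
ring4 = sorted((ring_hash(ip), n) for n, ip in nodes.items())

moved = sum(1 for k in keys if locate(k, ring3) != locate(k, ring4))
print(f"{moved / len(keys):.0%} of keys moved")
```

Unlike the modulo case, where roughly two thirds of keys moved after losing one of three servers, here only the keys between the new node and its counterclockwise neighbor are remapped; the rest of the cache stays valid.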

The data skew problem

To make the principle easy to follow, the nodes in the diagrams are drawn ideally, evenly spaced, but reality rarely matches the ideal. Much like how I bought a one-year gym membership, went twice, and only ever took a shower there.


When there are too few server nodes, their uneven distribution on the ring easily causes data skew. As shown in the figure below, most cached objects land on the node-4 server, which wastes the resources of the other nodes and concentrates most of the system's load on node-4 . Such a cluster is very unhealthy.

The fix for data skew is also simple: we need a way to make the nodes' mappings spread relatively evenly around the hash ring.

The consistent hash algorithm introduces a virtual node mechanism: multiple hash values are computed for each server node and all of them are mapped onto the hash ring, while the object keys that land on these virtual nodes are ultimately cached on the corresponding real node.

The hash for a virtual node is usually computed by appending a numeric suffix to the node's IP address. For example, if node-1 's IP is 10.24.23.227, one of its virtual node positions is hash(10.24.23.227#1) % 2^32 .

Suppose we give node-1 three virtual nodes, node-1#1 , node-1#2 , node-1#3 , and hash and take the modulo for each:

  • hash(10.24.23.227#1)% 2^32
  • hash(10.24.23.227#2)% 2^32
  • hash(10.24.23.227#3)% 2^32

As the figure below shows, after adding virtual nodes the original nodes are distributed relatively evenly around the hash ring, and the load is shared with the remaining nodes.

Note that the more virtual nodes you allocate, the more uniform the mapping on the hash ring becomes; with too few, the effect is hard to see.

Introducing virtual nodes also adds a new task: maintaining the mapping between virtual nodes and real nodes, so that a lookup goes object key -> virtual node -> real node.
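The virtual-to-real mapping is easy to keep alongside the ring itself: each virtual point simply remembers which real node it belongs to. A minimal sketch, with made-up node names and IPs (the first IP matches the 10.24.23.227 example above):

```python
import bisect
import hashlib

def ring_hash(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2 ** 32)

class VirtualNodeRing:
    def __init__(self, nodes, replicas=3):
        # Each physical node is hashed `replicas` times with a "#i" suffix,
        # e.g. hash("10.24.23.227#1"); every virtual point stores the name
        # of its real node, which is the virtual -> real mapping.
        self.ring = sorted(
            (ring_hash(f"{ip}#{i}"), name)
            for name, ip in nodes.items()
            for i in range(1, replicas + 1)
        )
        self.points = [p for p, _ in self.ring]

    def get_node(self, key: str) -> str:
        i = bisect.bisect_right(self.points, ring_hash(key)) % len(self.ring)
        return self.ring[i][1]  # resolves straight to the real node

ring = VirtualNodeRing({"node-1": "10.24.23.227", "node-2": "10.24.23.228"},
                       replicas=100)
counts = {}
for i in range(10_000):
    n = ring.get_node(f"key-{i}")
    counts[n] = counts.get(n, 0) + 1
print(counts)  # with enough replicas the split is close to even
```

With only two physical nodes but 100 virtual points each, the key distribution comes out far more balanced than two single points on the ring would give, which is exactly the skew fix described above.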

Application scenarios of consistent hashing

Consistent hashing is arguably the go-to load-balancing algorithm in distributed systems. Its implementation is flexible: it can live on the client side or in middleware, and cache middleware such as memcached and Redis clusters make use of it.

A memcached cluster is rather special. Strictly speaking it is only a pseudo-cluster, because its servers cannot communicate with each other; request routing depends entirely on the client computing which server a cached object should land on, and that routing algorithm is a consistent hash.

Redis clusters have the concept of hash slots. Although the implementation differs, the idea is the same, and after reading about consistent hashing in this article, Redis slots will be much easier to understand.

There are many other application scenarios:

  • The RPC framework Dubbo uses it to select service providers
  • Sharding in distributed relational databases (sub-database, sub-table): mapping data to nodes
  • The LVS load-balancing scheduler
  • ...

Summary

That's a brief explanation of consistent hashing; if anything here is wrong, leave a message to correct me. No technology is perfect, and the consistent hash algorithm has some potential weak points: if the ring holds a very large number of nodes, or the nodes are updated frequently, lookup performance degrades. The distributed cache as a whole also needs a routing service for load balancing, and once that routing service goes down, the entire cache becomes unavailable, so high availability must be considered as well.

That said, any technology that solves the problem is good technology, and a few side effects are tolerable.


程序员小富 (Programmer Xiao Fu)
2.7k reputation, 5.3k followers