[Engineering] Consistent Hashing

Introduction

Consistent hashing was first proposed by David Karger et al. in the paper "Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web". It is a caching protocol designed to reduce or eliminate hot spots in distributed networks.

The paper identifies four properties of consistent hashing:

Balance: the hash results should be spread across all caches as evenly as possible, so that all of the cache space is used. The balance property is what is prized about standard hash functions: they distribute items among buckets in a balanced fashion.
Monotonicity: suppose some items have already been assigned to buckets by hashing, and new buckets are then added to the system. The hash result should guarantee that every previously assigned item maps either to its original bucket or to a new bucket, never to a different bucket in the old set. This property says that if items are initially assigned to a set of buckets V1 and then some new buckets are added to form V2, then an item may move from an old bucket to a new bucket, but not from one old bucket to another. This reflects one intuition about consistency: when the set of usable buckets changes, items should only move if necessary to preserve an even distribution.
Spread: in a distributed environment, a client may not see all of the buckets, only a subset of them. When clients map content to buckets by hashing, different clients may see different sets of buckets, so their hash results can disagree, and the same content ends up mapped to different buckets by different clients. This should be avoided, because storing the same content in several buckets lowers the storage efficiency of the system. Spread measures how severe this situation is. A good hash algorithm should avoid such inconsistencies as far as possible, that is, keep spread low. The idea behind spread is that there are V people, each of whom can see at least a constant fraction (1/t) of the buckets that are visible to anyone. Each person tries to assign an item i to a bucket using a consistent hash function. The property says that across the entire group, there are at most σ(i) different opinions about which bucket should contain the item. Clearly, a good consistent hash function should have low spread over all items.
Load: load looks at the spread problem from another angle. Since different clients may map the same content to different buckets, a particular bucket may likewise be assigned different content by different clients. As with spread, this should be avoided, so a good hash algorithm should keep the load on each bucket as low as possible. The load property is similar to spread. The same V people are back, but this time we consider a particular bucket b instead of an item. The property says that there are at most λ(b) distinct items that at least one person thinks belongs in the bucket. A good consistent hash function should also have low load.

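Common uses

Some known scenarios:

Distributed cache access with memcached
Load balancing, such as dubbo's ConsistentHashLoadBalance
Distributed hash tables (DHT, Distributed Hash Table), which implement a (key, value) mapping across a group of nodes. Distributed systems such as Cassandra use a DHT.

Consistent hashing is most often applied to load balancing.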

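The problem

In distributed system design, when we want requests from certain users, or data for certain cities, to go to one specific machine, the usual approach is to hash the key and take the result modulo the number of machines: hash(key) % N.
Suppose we have three machines A, B, and C, and later add a fourth machine D. The mapping between indexes and machines is as follows:

For keys whose hashcodes are 1 to 10, the modulo results are: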

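Comparing hash % 3 with hash % 4 for these keys shows that adding machine D makes most keys miss: only hashcodes 1 and 2 keep their index, a hit rate of just 20%. It is a simple example, but it is enough to show that when machines are added or removed, this algorithm leaves a large share of the data unreachable at its old location. In a caching architecture, every request that misses the cache falls through to the DB and puts it under pressure. In a distributed service call, the machine originally accessed may hold configuration data or cached context, so a miss means the call fails.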
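The arithmetic behind that 20% figure can be checked with a few lines of Java. This is only a sketch: the class and method names are made up for illustration, and it assumes, as the example does, that the keys' hashcodes are exactly 1 to 10.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical demo class; the names are illustrative and not from the article. */
final class ModuloRemapDemo {

    /** Hashcodes in [1, maxHash] whose index stays the same when N changes from oldN to newN. */
    static List<Integer> unchangedKeys(int maxHash, int oldN, int newN) {
        List<Integer> same = new ArrayList<>();
        for (int h = 1; h <= maxHash; h++) {
            if (h % oldN == h % newN) {
                same.add(h);
            }
        }
        return same;
    }

    public static void main(String[] args) {
        int maxHash = 10;
        for (int h = 1; h <= maxHash; h++) {
            System.out.println("hash=" + h + "  %3=" + (h % 3) + "  %4=" + (h % 4));
        }
        List<Integer> same = unchangedKeys(maxHash, 3, 4);
        System.out.println("unchanged: " + same
                + ", hit rate " + (same.size() * 100 / maxHash) + "%");
    }
}
```

Only the keys returned by unchangedKeys still resolve to the same index after the resize; every other key is looked up on a machine that never stored it.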

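In the case above, the algorithm itself violates the monotonicity property.

Monotonicity: suppose some items have already been assigned to buckets by hashing, and new buckets are then added to the system. The hash result should guarantee that every previously assigned item maps either to its original bucket or to a new bucket, never to a different bucket in the old set.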

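The Consistent Hashing algorithm

First construct an integer ring of size 2^32 (this ring is called the consistent hash ring). Place each cache server node on the ring according to the hash of its node name (a value in [0, 2^32-1]). Then compute the hash of the key of the data to be cached (also a value in [0, 2^32-1]) and search clockwise on the ring for the server node nearest to that hash value. That completes the lookup from key to server.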

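Using a specific hash function f = h(x):
(1) Compute the hash of each Node and place the Node on the consistent hash ring:

Hash values of the Nodes:
h(Node1)=K1
h(Node2)=K2
h(Node3)=K3

(2) Compute the hash of each object, then move clockwise from that value and store the object on the nearest machine.

h(object1)=key1
h(object2)=key2
h(object3)=key3
h(object4)=key4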

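When a machine Node is added or removed:

(1) A Node is added: new node Node4
Compute h(Node4)=K4 and map it onto the consistent hash ring as follows:

Under the clockwise rule, object3 migrates to Node4 and all other objects keep their original location.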

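(2) A Node is removed: Node2 is deleted

Under the clockwise rule, object2 migrates to Node3 and all other objects keep their original location.

This analysis of adding and removing nodes shows that consistent hashing preserves monotonicity while also keeping data migration to a minimum. That makes it a good fit for distributed clusters: it avoids large-scale data migration and reduces the pressure on the servers.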
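The two cases can be replayed with a small sketch built on a sorted map. The ring positions and object hashes below are made-up numbers chosen so the outcome matches the description (object3 moves when Node4 is added, object2 moves when Node2 is removed); they are not values from the original figures, and the class is hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of a hash ring; positions are supplied by the caller. */
final class RingChangeDemo {
    private final TreeMap<Long, String> ring = new TreeMap<>();

    void addNode(long position, String name) {
        ring.put(position, name);
    }

    void removeNode(long position) {
        ring.remove(position);
    }

    /** First node at or clockwise after keyHash, wrapping to the smallest position. Assumes a non-empty ring. */
    String locate(long keyHash) {
        Map.Entry<Long, String> e = ring.ceilingEntry(keyHash);
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    void print(String label, long[] keyHashes) {
        System.out.println(label);
        for (int i = 0; i < keyHashes.length; i++) {
            System.out.println("  object" + (i + 1) + " -> " + locate(keyHashes[i]));
        }
    }

    public static void main(String[] args) {
        RingChangeDemo ring = new RingChangeDemo();
        ring.addNode(100, "Node1");   // stands in for K1 (assumed value)
        ring.addNode(200, "Node2");   // stands in for K2 (assumed value)
        ring.addNode(300, "Node3");   // stands in for K3 (assumed value)
        long[] objects = {50, 150, 220, 280};   // assumed key1..key4

        ring.print("initial", objects);
        ring.addNode(250, "Node4");             // only object3 (220) changes owner
        ring.print("after adding Node4", objects);

        ring.removeNode(250);                   // back to the original three nodes
        ring.removeNode(200);                   // only object2 (150) changes owner
        ring.print("after removing Node2", objects);
    }
}
```

In both cases only the keys on the arc next to the changed node move; every other lookup returns the same node as before.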

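Implementing the algorithm

Following the description above, the Nodes are placed in order of their hash values on the ring [0, 2^32-1]. Then, given the hash of an object, we look for:
a. A Node whose hash equals it: return that Node.
b. Otherwise, the first Node whose hash is greater than it: return that Node.

1) Choose a suitable data structure:

The paper mentions:

The paper suggests implementing this with a balanced binary tree, such as an AVL tree or a red-black tree.

2) Choose a suitable hash function, one that disperses values well enough.

First, look at the hashCode of Java's String: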

public class StringHashCodeDemo {
    public static void main(String[] args) {
        System.out.println("192.168.0.1:1111".hashCode());
        System.out.println("192.168.0.2:1111".hashCode());
        System.out.println("192.168.0.3:1111".hashCode());
        System.out.println("192.168.0.4:1111".hashCode());
    }
}

Hash values:
1874499238
1903128389
1931757540
1960386691

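2^32 = 4294967296
If we place these four Nodes on the ring [0, 2^32-1], their hash values occupy only a very narrow interval, so well over 90% of requests would land on Node1. Such a distribution is far too skewed.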
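To put a number on the skew, the sketch below takes the four hash values printed above as ring positions and computes the share of a 2^32 ring that the first node would own under the clockwise rule. It assumes keys are spread uniformly over the ring; the class name is hypothetical.

```java
/** Hypothetical sketch: share of the ring owned by the node with the smallest position. */
final class HashCodeSpreadDemo {
    static final long RING_SIZE = 1L << 32;

    /** Expects positions sorted in ascending order, all within [0, 2^32 - 1]. */
    static double shareOfFirstNode(long[] sortedPositions) {
        long first = sortedPositions[0];
        long last = sortedPositions[sortedPositions.length - 1];
        // Keys in [0, first] map to the first node directly; keys in (last, 2^32 - 1]
        // find no node clockwise and wrap around to the first node as well.
        long owned = (first + 1) + (RING_SIZE - 1 - last);
        return (double) owned / RING_SIZE;
    }

    public static void main(String[] args) {
        // The four String.hashCode() values listed in the article.
        long[] positions = {1874499238L, 1903128389L, 1931757540L, 1960386691L};
        double share = shareOfFirstNode(positions);
        System.out.println(String.format("Node1 owns %.1f%% of the ring", share * 100));
    }
}
```

With these positions the first node ends up with roughly 98% of the ring, because the other three nodes sit only about 28.6 million apart from one another.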

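We therefore need a hash function with few collisions and sufficiently even dispersion. Candidates include CRC32_HASH, FNV1_32_HASH, KETAMA_HASH, and MYSQL_HASH. Below is a comparison of these hash algorithms (unverified, taken from the web):

At a glance FNV1_32_HASH looks good; KETAMA_HASH is the consistent hashing algorithm recommended for memcached.

Code implementation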

import java.util.SortedMap;
import java.util.TreeMap;

public class ConsistentHashingWithoutVirtualNode {

    /**
     * Key: the hash of a server; value: the name of that server.
     */
    private static SortedMap<Integer, String> sortedMap =
            new TreeMap<Integer, String>();

    /**
     * Computes a hash with the FNV1_32_HASH algorithm. hashCode is not overridden here;
     * the end result is the same. The result is non-negative, so this demo effectively
     * uses the range [0, 2^31-1].
     */
    private static int getFNV1_32_HASHHash(String str) {
        final int p = 16777619;
        int hash = (int) 2166136261L;
        for (int i = 0; i < str.length(); i++)
            hash = (hash ^ str.charAt(i)) * p;
        hash += hash << 13;
        hash ^= hash >> 7;
        hash += hash << 3;
        hash ^= hash >> 17;
        hash += hash << 5;

        // If the result is negative, take its absolute value.
        // Math.abs(Integer.MIN_VALUE) is still negative, so map that case to 0.
        if (hash < 0)
            hash = (hash == Integer.MIN_VALUE) ? 0 : Math.abs(hash);
        return hash;
    }

    /**
     * Servers to be added to the hash ring (the duplicated entry has been removed).
     */
    private static String[] servers = {"192.168.0.1:111", "192.168.0.2:111",
            "192.168.0.3:111", "192.168.0.4:111"};

    /**
     * Initialization: put every server into sortedMap.
     */
    static {
        for (int i = 0; i < servers.length; i++) {
            int hash = getFNV1_32_HASHHash(servers[i]);
            System.out.println("[" + servers[i] + "] added to the ring, hash = " + hash);
            sortedMap.put(hash, servers[i]);
        }
        System.out.println();
    }

    /**
     * Returns the server that the given key should be routed to.
     */
    private static String getServer(String node) {
        // Hash of the key to be routed
        int hash = getFNV1_32_HASHHash(node);
        // tailMap(hash) holds every entry whose key is >= hash, so an exact match is covered too
        SortedMap<Integer, String> tailMap = sortedMap.tailMap(hash);
        if (tailMap.isEmpty()) {
            // Nothing at or after this hash: wrap around to the first server on the ring
            return sortedMap.get(sortedMap.firstKey());
        }
        // The first key is the nearest server clockwise from the key
        return tailMap.get(tailMap.firstKey());
    }

    public static void main(String[] args) {
        String[] nodes = {"hello1", "hello2", "hello3"};
        for (int i = 0; i < nodes.length; i++)
            System.out.println("[" + nodes[i] + "] has hash " +
                    getFNV1_32_HASHHash(nodes[i]) + ", routed to server [" + getServer(nodes[i]) + "]");
    }
}

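Shortcomings of the algorithm

Although consistent hashing satisfies monotonicity and load, along with the spread property, it does not satisfy balance.

Balance: the hash results should be spread across all caches as evenly as possible, so that all of the cache space is used.

In this algorithm the hash function cannot guarantee balance. As analyzed above, when a node is added to the cluster it takes over part of the data access, and when a node P is removed, all of the data it was responsible for falls on the next node Q, which inevitably increases the load on Q. That is the imbalance.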

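Solution

Introduce virtual nodes. A virtual node is a replica of a real node.
For example, suppose the cluster now has two nodes, Node1 and Node3, as in the figure after Node2 was removed.

Give each node two replicas: Node1-1, Node1-2, Node3-1, Node3-2.

With virtual nodes in place, objects are distributed more evenly. The mapping between physical nodes and virtual nodes is as follows:

This completes the improvement to the algorithm. To use it in a real project, though, a few questions remain:

How many virtual nodes should one real node map to?
How do we find the real node that a virtual node belongs to?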

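Solutions

1) In theory, the fewer physical nodes there are, the more virtual nodes are needed. From the description of the ketama algorithm:

ketama uses 160 points per node by default

2) The hash of a "virtual node" can be computed from the node's IP address plus a numeric suffix. For example, for "192.168.0.0:111" the two replicas are "192.168.0.0:111-VN1" and "192.168.0.0:111-VN2".
Tip: when placing the virtual nodes on the consistent hash ring during initialization, you can map h(192.168.0.0:111-VN2) directly to the real node "192.168.0.0:111".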
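Both answers can be combined in a short sketch: each real server is hashed several times under the "-VN" suffix scheme described above, and every ring entry stores the real server's name so a lookup never has to translate a virtual node back. The class, the sample keys, and the hash function (a plain 32-bit FNV-1a masked to a non-negative int, not the FNV1_32_HASH variant used earlier) are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of a hash ring with virtual nodes. */
final class VirtualNodeRing {
    // Key: hash of a virtual node name; value: the REAL server it stands for.
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    VirtualNodeRing(String[] servers, int replicas) {
        for (String server : servers) {
            for (int i = 1; i <= replicas; i++) {
                ring.put(hash(server + "-VN" + i), server);
            }
        }
    }

    /** Plain 32-bit FNV-1a over the UTF-8 bytes, masked to a non-negative int (an assumed choice). */
    static int hash(String s) {
        int h = (int) 2166136261L;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xFF);
            h *= 16777619;
        }
        return h & 0x7FFFFFFF;
    }

    /** Real server for the key: first virtual node clockwise, wrapping around. Assumes a non-empty ring. */
    String locate(String key) {
        Map.Entry<Integer, String> e = ring.ceilingEntry(hash(key));
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    /** Counts how many of the sample keys "key0".."key(n-1)" land on each real server. */
    Map<String, Integer> distribution(int keyCount) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 0; i < keyCount; i++) {
            counts.merge(locate("key" + i), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] servers = {"192.168.0.1:111", "192.168.0.2:111", "192.168.0.3:111"};
        System.out.println("1 replica each:    " + new VirtualNodeRing(servers, 1).distribution(10000));
        System.out.println("160 replicas each: " + new VirtualNodeRing(servers, 160).distribution(10000));
    }
}
```

Running main prints the per-server key counts for 1 and for 160 replicas, so the effect of the replica count can be compared directly. How even the result is depends on the hash function and the keys, so treat the numbers as an experiment, not a guarantee.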

Ketama implementation

Below is the setKetamaNodes() method from net.spy.memcached.KetamaNodeLocator.java:

protected void setKetamaNodes(List<MemcachedNode> nodes) {
    TreeMap<Long, MemcachedNode> newNodeMap =
            new TreeMap<Long, MemcachedNode>();
    int numReps = config.getNodeRepetitions();
    int nodeCount = nodes.size();
    int totalWeight = 0;

    if (isWeightedKetama) {
        for (MemcachedNode node : nodes) {
            totalWeight += weights.get(node.getSocketAddress());
        }
    }

    for (MemcachedNode node : nodes) {
      if (isWeightedKetama) {

          int thisWeight = weights.get(node.getSocketAddress());
          float percent = (float)thisWeight / (float)totalWeight;
          int pointerPerServer = (int)((Math.floor((float)(percent * (float)config.getNodeRepetitions() / 4 * (float)nodeCount + 0.0000000001))) * 4);
          for (int i = 0; i < pointerPerServer / 4; i++) {
              for(long position : ketamaNodePositionsAtIteration(node, i)) {
                  newNodeMap.put(position, node);
                  getLogger().debug("Adding node %s with weight %s in position %d", node, thisWeight, position);
              }
          }
      } else {
          // Ketama does some special work with md5 where it reuses chunks.
          // Check to be backwards compatible, the hash algorithm does not
          // matter for Ketama, just the placement should always be done using
          // MD5
          if (hashAlg == DefaultHashAlgorithm.KETAMA_HASH) {
              for (int i = 0; i < numReps / 4; i++) {
                  for(long position : ketamaNodePositionsAtIteration(node, i)) {
                    newNodeMap.put(position, node);
                    getLogger().debug("Adding node %s in position %d", node, position);
                  }
              }
          } else {
              for (int i = 0; i < numReps; i++) {
                  newNodeMap.put(hashAlg.hash(config.getKeyForNode(node, i)), node);
              }
          }
      }
    }
    assert newNodeMap.size() == numReps * nodes.size();
    ketamaNodes = newNodeMap;
}
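The `numReps / 4` loops follow from the comment about reusing MD5 chunks: one MD5 digest is 16 bytes, which can be cut into four 32-bit ring positions, so 40 digests give the default 160 points. The sketch below illustrates that idea only. It is not the library's ketamaNodePositionsAtIteration(); the class name, the byte order, and the "address-iteration" key format are assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical sketch: four ring positions from one MD5 digest. */
final class KetamaPointsSketch {

    /** Splits the 16-byte MD5 of nodeKey into four unsigned 32-bit values (little-endian per 4-byte chunk). */
    static long[] positionsFor(String nodeKey) {
        byte[] digest = md5(nodeKey);
        long[] positions = new long[4];
        for (int h = 0; h < 4; h++) {
            positions[h] = ((long) (digest[3 + h * 4] & 0xFF) << 24)
                    | ((long) (digest[2 + h * 4] & 0xFF) << 16)
                    | ((long) (digest[1 + h * 4] & 0xFF) << 8)
                    | (digest[h * 4] & 0xFF);
        }
        return positions;
    }

    static byte[] md5(String s) {
        try {
            return MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        int numReps = 160;                       // default mentioned in the article
        int count = 0;
        for (int i = 0; i < numReps / 4; i++) {
            // "address-i" is an assumed key format, used here only for illustration.
            for (long position : positionsFor("192.168.0.1:111-" + i)) {
                if (count < 4) {
                    System.out.println("position " + position);
                }
                count++;
            }
        }
        System.out.println(count + " positions generated for one server");
    }
}
```

Each position fits in [0, 2^32-1], which is why the node map in the method above is keyed by Long and not by a signed int.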

For a detailed walkthrough and analysis of the algorithm, see this article.