How does HashMap work? A detailed walkthrough with diagrams and text


1 How does HashMap work in Java?

It is based on the principle of hashing.

2 What is a hash?

In its simplest form, hashing means applying a formula or algorithm to a variable or to an object's properties in order to assign it a code, ideally a distinct one.

A proper hash function must obey the following rule:

The hash function must return the same hash code every time it is applied to the same or to equal objects. In other words, two equal objects must consistently produce the same hash code.

Every Java object can be hashed: all objects inherit the default hashCode() method defined in the Object class. This default implementation typically derives the hash code from the internal address of the object, so distinct objects usually get distinct hash codes.
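The rule above can be illustrated with a small sketch. The class names HashContractDemo and Point are hypothetical and exist only for this example; the point is that a class overriding equals() should also override hashCode() so that equal objects hash alike.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

class HashContractDemo {
    // Hypothetical value class used only for illustration.
    static final class Point {
        private final int x;
        private final int y;

        Point(int x, int y) {
            this.x = x;
            this.y = y;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof Point)) return false;
            Point other = (Point) o;
            return x == other.x && y == other.y;
        }

        // Hash exactly the fields that equals() compares,
        // so equal points always return equal hash codes.
        @Override
        public int hashCode() {
            return Objects.hash(x, y);
        }
    }

    static String lookup() {
        Map<Point, String> labels = new HashMap<>();
        labels.put(new Point(1, 2), "A");
        // A different but equal instance finds the entry because the hash codes agree.
        return labels.get(new Point(1, 2));
    }

    public static void main(String[] args) {
        System.out.println(lookup());
    }
}
```

If hashCode() were left at the default from Object, the second Point would most likely land in a different bucket and the lookup would return null.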

3 Node class in HashMap

The definition of Map is: an object that maps keys to values.

Therefore, HashMap must have some mechanism for storing these key-value pairs. It does: HashMap has a nested class Node, as shown below:

static class Node<K,V> implements Map.Entry<K,V> {
    final int hash; // cached hash value, so it need not be recomputed when rehashing
    final K key;
    V value;
    Node<K,V> next;

    ... // remaining code
}

As expected, the Node class stores the key and the value as fields.

The key has been marked as final, and there are two more fields: next and hash.

In the following, we will understand the necessity of these attributes.

4 How are key-value pairs stored in HashMap

The key-value pairs are stored in HashMap as an array of the nested class Node, as shown below:

transient Node<K,V>[] table;

After the hash code is calculated, it is converted into an index of the array, and the key-value pair is stored at that index. Hash collisions are not explained in detail at this point.

The length of the array is always a power of 2. This is ensured by the following function:

static final int tableSizeFor(int cap) {
    int n = cap - 1; // without this step, a cap that is already a power of 2 would yield twice the expected value
    n |= n >>> 1;
    n |= n >>> 2;
    n |= n >>> 4;
    n |= n >>> 8;
    n |= n >>> 16;
    return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
}

The principle is to set every bit below the highest set bit of the incoming parameter (cap) to 1, and finally add 1, which yields the smallest power of 2 greater than or equal to cap as the array length.
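To see the rounding in action, here is a small sketch that copies the function above into a hypothetical class TableSizeDemo. MAXIMUM_CAPACITY is assumed to be 1 << 30, the value indicated by the 2^30 comment in the resize() listing.

```java
class TableSizeDemo {
    // Assumed to match the JDK constant (2^30).
    static final int MAXIMUM_CAPACITY = 1 << 30;

    static int tableSizeFor(int cap) {
        int n = cap - 1;
        // Smear the highest set bit into every lower position.
        n |= n >>> 1;
        n |= n >>> 2;
        n |= n >>> 4;
        n |= n >>> 8;
        n |= n >>> 16;
        return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
    }

    public static void main(String[] args) {
        for (int cap : new int[] {0, 1, 10, 16, 17, 1000}) {
            System.out.println(cap + " -> " + tableSizeFor(cap));
        }
    }
}
```

For these inputs the results are 1, 1, 16, 16, 32 and 1024: a value that is already a power of 2 is returned unchanged, anything else is rounded up.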

Why use a power of 2 as the capacity of the array?

This involves HashMap's hash function and the calculation of the array index. The hash code computed from a key may be larger than the capacity of the array. What then?

A simple remainder operation would work, but it is comparatively slow. In HashMap, the following expression is used to ensure that the computed index is less than the capacity of the array.

(n - 1) & hash

This also explains why a power of 2 is needed as the capacity of the array. Since n is a power of 2, n - 1 acts as a low-bit mask.

The AND operation resets all high-order bits of the hash to zero and keeps only the low-order bits, so every resulting value is less than n. In addition, it makes recalculating the array index of each Node very simple in a later resize() operation. The details are explained later.

Take the default initial value of 16 as an example:

    01010011 00100101 01010100 00100101
&   00000000 00000000 00000000 00001111
----------------------------------
    00000000 00000000 00000000 00000101    // high bits all zeroed, only the last four bits kept
    // guarantees that the computed value is less than the array length n
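The masking step can be checked with a short sketch. IndexMaskDemo and indexFor are hypothetical names; the hash value 0x53255425 is the bit pattern from the example above written in hexadecimal. For a power-of-2 length, the mask gives the same result as Math.floorMod, including for negative hash codes.

```java
class IndexMaskDemo {
    // n is assumed to be a power of 2, as in HashMap.
    static int indexFor(int hash, int n) {
        return (n - 1) & hash;
    }

    public static void main(String[] args) {
        int hash = 0x53255425; // 01010011 00100101 01010100 00100101
        System.out.println(indexFor(hash, 16));        // low four bits: 0101
        System.out.println(Math.floorMod(hash, 16));   // same value via remainder
        System.out.println(indexFor(-7, 16) == Math.floorMod(-7, 16));
    }
}
```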

However, because only the low bits are used, hash collisions become more frequent. This is where the "disturbance function" comes in:

static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}

This function shifts the hash code right by 16 bits, so that its upper 16 bits move into the lower half, and XORs the result with the original hash code. Using the hash code from the example above:

[Figure: XOR of the hash code with its upper 16 bits]

This leaves the upper 16 bits unchanged, while the lower 16 bits become the XOR of the original lower and upper halves. The calculated array index changes from 5 to 0.
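The arithmetic can be reproduced with a few lines of Java. SpreadDemo and spread are hypothetical names for this sketch; the computation is the same as in hash(Object key) above, applied directly to the example hash code 0x53255425 with a table length of 16.

```java
class SpreadDemo {
    // Same mixing step as in hash(Object key): fold the upper 16 bits into the lower 16.
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    public static void main(String[] args) {
        int h = 0x53255425;
        int mixed = spread(h);
        System.out.println(Integer.toHexString(mixed)); // upper half 5325 is unchanged
        System.out.println((16 - 1) & h);               // index without mixing
        System.out.println((16 - 1) & mixed);           // index with mixing
    }
}
```

The lower half becomes 0x5425 ^ 0x5325 = 0x0700, so the mixed value is 0x53250700 and the index moves from 5 to 0.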

With the "disturbance function", the probability of hash collisions decreases. Tests done by others suggest that although this function does not reduce collisions as far as theoretically possible, it balances computation cost against collision probability, which is why the JDK uses it and applies it only once.

5 Hash collision and its handling

In an ideal situation, the hash function maps each key to a unique bucket. In practice this cannot be achieved: even with a well-designed hash function, hash collisions will occur.

Many methods for resolving hash collisions have been studied. Wikipedia summarizes four categories:

[Figure: hash collision resolution]

Java's HashMap uses the first approach, separate chaining (also known as the chaining or "zipper" method), implemented with linked lists and red-black trees.

[Figure: HashMap structure in JDK 8]

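In HashMap, when a hash collision occurs, the next field of the Node class (Node<K,V> next;) links the colliding nodes into a linked list. Once a bucket's list grows beyond 8 nodes it is converted to a red-black tree (in JDK 8, only if the table already has at least 64 buckets; otherwise the table is resized instead), and a tree bucket is converted back to a linked list when it shrinks to 6 nodes or fewer. This is how collisions are resolved.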

static final int TREEIFY_THRESHOLD = 8;

static final int UNTREEIFY_THRESHOLD = 6;
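To make separate chaining concrete, here is a deliberately simplified sketch. ChainedMap is a hypothetical class, not the JDK implementation: it has a fixed table of 16 buckets, uses linked lists only, never converts to trees, never resizes, and prepends new nodes instead of appending them.

```java
import java.util.Objects;

// Minimal separate-chaining map for illustration only.
class ChainedMap<K, V> {
    private static final class Entry<K, V> {
        final int hash;
        final K key;
        V value;
        Entry<K, V> next;

        Entry(int hash, K key, V value, Entry<K, V> next) {
            this.hash = hash;
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    @SuppressWarnings("unchecked")
    private final Entry<K, V>[] table = (Entry<K, V>[]) new Entry[16];

    private static int hash(Object key) {
        int h = (key == null) ? 0 : key.hashCode();
        return h ^ (h >>> 16);
    }

    V put(K key, V value) {
        int hash = hash(key);
        int i = (table.length - 1) & hash;
        // Walk the chain: an equal key means "replace the value".
        for (Entry<K, V> e = table[i]; e != null; e = e.next) {
            if (e.hash == hash && Objects.equals(e.key, key)) {
                V old = e.value;
                e.value = value;
                return old;
            }
        }
        // Simplification: prepend to the chain (JDK 8 appends at the tail).
        table[i] = new Entry<>(hash, key, value, table[i]);
        return null;
    }

    V get(Object key) {
        int hash = hash(key);
        for (Entry<K, V> e = table[(table.length - 1) & hash]; e != null; e = e.next) {
            if (e.hash == hash && Objects.equals(e.key, key)) {
                return e.value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        ChainedMap<Integer, String> m = new ChainedMap<>();
        m.put(1, "one");
        m.put(17, "seventeen"); // 1 and 17 both map to bucket 1 in a 16-slot table
        System.out.println(m.get(1) + " " + m.get(17));
    }
}
```

Both keys collide in the same bucket, yet each lookup still finds its own value because the chain is searched by hash and equals().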

6 Initialization of HashMap

public HashMap();
public HashMap(int initialCapacity);
public HashMap(Map<? extends K, ? extends V> m);
public HashMap(int initialCapacity, float loadFactor); 

HashMap has four constructors, most of which only initialize the capacity and the load factor. Take public HashMap(int initialCapacity, float loadFactor) as an example:

public HashMap(int initialCapacity, float loadFactor) {
    // the initial capacity must not be negative
    if (initialCapacity < 0)
        throw new IllegalArgumentException("Illegal initial capacity: " +
                                           initialCapacity);
    // the initial capacity is capped at the maximum capacity
    if (initialCapacity > MAXIMUM_CAPACITY)
        initialCapacity = MAXIMUM_CAPACITY;
    // the load factor must be positive and must not be NaN
    if (loadFactor <= 0 || Float.isNaN(loadFactor))
        throw new IllegalArgumentException("Illegal load factor: " +
                                           loadFactor);
    this.loadFactor = loadFactor;
    this.threshold = tableSizeFor(initialCapacity);
}

This constructor initializes the capacity and the load factor. With the other constructors, any parameter that is not supplied takes its default value (default load factor = 0.75, default capacity = 16).

At this point the storage container table has not been initialized yet; its initialization is deferred until first use.

7 Initialization or dynamic expansion of the hash table in HashMap

The hash table here refers to the following table field, an array of Node:

transient Node<K,V>[] table;

Being an array, its length must be specified when it is created. In practice we may store more entries than this length, so HashMap defines a threshold: when the number of stored entries exceeds it, the table must be expanded.

In my view, initialization is also a kind of dynamic expansion, one that grows the capacity from 0 to the value set by the constructor (16 by default), with no need to rehash any elements.

7.1 Conditions for expansion

Initialization is performed whenever the table is null or the length of the array is 0. Expansion is triggered when the number of elements is greater than the threshold.

threshold = loadFactor * capacity

For example, the default loadFactor=0.75, capacity=16 in HashMap

threshold = loadFactor * capacity = 0.75 * 16 = 12

Then when the number of elements is greater than 12, the capacity will be expanded. The capacity and threshold after expansion will also change accordingly.
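The threshold formula can be turned into a small calculation. ThresholdDemo, threshold and resizesFor are hypothetical helpers for this sketch; they model only the doubling rule described here and do not touch a real HashMap.

```java
class ThresholdDemo {
    static int threshold(int capacity, float loadFactor) {
        return (int) (capacity * loadFactor);
    }

    // Counts how many doublings are needed before `entries` fits under the threshold.
    // The loop stops at 2^30, the maximum capacity mentioned in this article.
    static int resizesFor(int entries, int initialCapacity, float loadFactor) {
        int capacity = initialCapacity;
        int resizes = 0;
        while (entries > threshold(capacity, loadFactor) && capacity < (1 << 30)) {
            capacity <<= 1;
            resizes++;
        }
        return resizes;
    }

    public static void main(String[] args) {
        System.out.println(threshold(16, 0.75f));       // 12
        System.out.println(resizesFor(12, 16, 0.75f));  // 12 entries still fit
        System.out.println(resizesFor(13, 16, 0.75f));  // the 13th entry forces one doubling
        System.out.println(resizesFor(100, 16, 0.75f)); // 16 -> 32 -> 64 -> 128 -> 256
    }
}
```

With the defaults, storing 100 entries takes four doublings after the initial allocation, which is one reason to pass an initial capacity when the size is known in advance.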

The load factor affects the trigger threshold. When its value is small, there are few hash collisions in the HashMap, so access performance is high, at the cost of more memory. When its value is large, there are many hash collisions in the HashMap, so access performance is lower, but less memory is needed. Changing the default value is not recommended; if you do want to change it, choose the value after appropriate testing.

7.2 Powers of 2 and the array index calculation, revisited

As mentioned earlier, the capacity of the array is a power of 2. At the same time, the subscript of the array is calculated by the following code

index = (table.length - 1) & hash

Besides computing the array index quickly, this approach also makes it easy to compute the new index of each element when rehashing after an expansion. Since the capacity doubles, n - 1 gains exactly one more significant bit than before, and that extra bit sits at the same position as the single set bit of the old capacity. Example:

[Figure: index bits before and after expansion]

In this way, the new index can be calculated quickly
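A short sketch of this shortcut, with the hypothetical names ResizeIndexDemo and newIndex: the new index is either the old index or the old index plus the old capacity, depending on the single bit hash & oldCap. The hash values below are arbitrary examples that all share index 5 in a table of 16.

```java
class ResizeIndexDemo {
    // oldCap is assumed to be a power of 2; the new capacity is oldCap * 2.
    static int newIndex(int hash, int oldCap) {
        int oldIndex = (oldCap - 1) & hash;
        return (hash & oldCap) == 0 ? oldIndex : oldIndex + oldCap;
    }

    public static void main(String[] args) {
        int oldCap = 16;
        for (int hash : new int[] {5, 21, 37, 53}) {
            int direct = (2 * oldCap - 1) & hash; // index recomputed from scratch
            System.out.println(hash + ": old=" + ((oldCap - 1) & hash)
                    + " new=" + newIndex(hash, oldCap) + " direct=" + direct);
        }
    }
}
```

Hashes 5 and 37 stay at index 5, while 21 and 53 move to 5 + 16 = 21, matching the index recomputed directly with the new mask.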

7.3 Procedure

  1. First determine whether this is an initialization or an expansion; newCap and newThr are calculated differently in the two cases.
  2. Calculate the capacity and the threshold after expansion.
  3. Set the HashMap's threshold to the new threshold.
  4. Create a new array with the new capacity, then point the HashMap's table reference to the new array.
  5. Copy the elements of the old array into the new table. Several cases must be handled separately here (a bucket with a single node, a linked list, a red-black tree).

The code in detail:

final Node<K, V>[] resize() {
        // keep a reference to the table as it was before resizing
        Node<K, V>[] oldTab = table;
        // length of the old array
        int oldCap = (oldTab == null) ? 0 : oldTab.length;
        // resize threshold of the old array
        int oldThr = threshold;
        int newCap, newThr = 0;
        // the capacity before resizing is > 0
        if (oldCap > 0) {
            // the old array length has reached the maximum (2^30)
            if (oldCap >= MAXIMUM_CAPACITY) {
                // raise the threshold to Integer.MAX_VALUE
                threshold = Integer.MAX_VALUE;
                // cannot grow any further, return the old array
                return oldTab;
                // otherwise, if twice the current capacity is below MAXIMUM_CAPACITY and the current capacity is at least DEFAULT_INITIAL_CAPACITY
            } else if ((newCap = oldCap << 1) < MAXIMUM_CAPACITY &&
                    oldCap >= DEFAULT_INITIAL_CAPACITY)
                // the threshold doubles
                newThr = oldThr << 1;
        } else if (oldThr > 0) // old capacity <= 0 and old threshold > 0
            // the new capacity is the value stored in the old threshold
            newCap = oldThr;
        else { // old capacity <= 0 and old threshold <= 0: use the default capacity, and DEFAULT_LOAD_FACTOR * DEFAULT_INITIAL_CAPACITY as the threshold
            newCap = DEFAULT_INITIAL_CAPACITY; // default initial capacity
            newThr = (int) (DEFAULT_LOAD_FACTOR * DEFAULT_INITIAL_CAPACITY); // threshold for the default capacity
        }
        // compute the new resize threshold
        if (newThr == 0) { // newThr is still 0 when initializing with oldCap == 0 and oldThr > 0, or when the doubling branch above did not set it
            // ft is a tentative threshold; it becomes the real threshold if it is valid
            float ft = (float) newCap * loadFactor;
            // if newCap < MAXIMUM_CAPACITY and ft < (float) MAXIMUM_CAPACITY, the new threshold is ft, otherwise Integer.MAX_VALUE
            newThr = (newCap < MAXIMUM_CAPACITY && ft < (float) MAXIMUM_CAPACITY ?
                    (int) ft : Integer.MAX_VALUE);
        }
        // set the HashMap's threshold to newThr
        threshold = newThr;
        // create the new table with capacity newCap
        @SuppressWarnings({"rawtypes", "unchecked"})
        Node<K, V>[] newTab = (Node<K, V>[]) new Node[newCap];
        // point the HashMap's table to the new array
        table = newTab;
        // if the old table is not null, move its elements into the new table
        if (oldTab != null) {
            // visit every bucket of the old table and move it into the new table
            for (int j = 0; j < oldCap; ++j) {
                Node<K, V> e;
                // if the old bucket is not null, keep it in e
                if ((e = oldTab[j]) != null) {
                    // clear the old bucket
                    oldTab[j] = null;
                    // the old bucket holds a single node
                    if (e.next == null)
                        // put e (the former oldTab[j]) at index e.hash & (newCap - 1) of newTab
                        newTab[e.hash & (newCap - 1)] = e;
                        // the old bucket is a red-black tree
                    else if (e instanceof TreeNode)
                        // split the nodes of the tree
                        ((TreeNode<K, V>) e).split(this, newTab, j, oldCap);
                    else {  // the old bucket is a linked list: split it while preserving order (a JDK 1.8 optimization)
                        Node<K, V> loHead = null, loTail = null;
                        Node<K, V> hiHead = null, hiTail = null;
                        Node<K, V> next;
                        // walk every node of the list
                        do {
                            next = e.next;
                            // stays at the original index
                            if ((e.hash & oldCap) == 0) {
                                if (loTail == null)
                                    loHead = e;
                                else
                                    loTail.next = e;
                                loTail = e;
                            } else { // moves to original index + oldCap
                                if (hiTail == null)
                                    hiHead = e;
                                else
                                    hiTail.next = e;
                                hiTail = e;
                            }
                        } while ((e = next) != null);
                        // place the "original index" list into its bucket
                        if (loTail != null) {
                            loTail.next = null;
                            newTab[j] = loHead;
                        }
                        // place the "original index + oldCap" list into its bucket
                        if (hiTail != null) {
                            hiTail.next = null;
                            newTab[j + oldCap] = hiHead;
                        }
                    }
                }
            }
        }
        return newTab;
}

7.4 Matters needing attention

Although HashMap is very well designed, resize() should be avoided as much as possible, because it can be very time consuming.

Also note that a HashMap never reduces its capacity automatically. If your HashMap has a large capacity and you then perform many remove operations, the capacity does not decrease. If you need a smaller capacity, create a new HashMap.
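Both pieces of advice can be sketched as follows. PresizeDemo, capacityFor and compact are hypothetical helpers; the sketch assumes the default load factor of 0.75 and relies only on the standard HashMap constructors listed earlier.

```java
import java.util.HashMap;
import java.util.Map;

class PresizeDemo {
    // Initial capacity large enough that `expectedEntries` stays within the
    // threshold, assuming the default load factor of 0.75.
    static int capacityFor(int expectedEntries) {
        return (int) Math.ceil(expectedEntries / 0.75);
    }

    // "Shrinks" a map by copying it into a fresh HashMap sized for its current contents.
    static <K, V> Map<K, V> compact(Map<K, V> big) {
        return new HashMap<>(big);
    }

    public static void main(String[] args) {
        // Pre-size for 100 entries so that filling the map does not trigger resize().
        Map<Integer, Integer> map = new HashMap<>(capacityFor(100));
        for (int i = 0; i < 100; i++) {
            map.put(i, i);
        }
        // Remove most entries; the table keeps its capacity.
        for (int i = 10; i < 100; i++) {
            map.remove(i);
        }
        Map<Integer, Integer> small = compact(map);
        System.out.println(small.size());
    }
}
```

The copy holds the same mappings as the original but allocates a table based on the current number of entries rather than on the historical peak.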

8 How does the HashMap.put() function work internally?

After using HashMap many times, the principle of adding an element can be roughly described as follows: compute the hash value of the key, derive its position in the hash table from that hash, and put the key-value pair at that position. If a hash collision occurs, the collision handling is performed.

The working principle is as follows (I saved the picture a long time ago and have forgotten the source):

The source code is as follows:

/* @param hash         hash value of the given key
 * @param key          the key
 * @param value        the value
 * @param onlyIfAbsent if true, an existing value for the key is not replaced
 * @param evict        if false, the table is in creation mode
 * @return the old value if a value was replaced, otherwise null. Note that the value mapped to the key may itself be null.
 */
final V putVal(int hash, K key, V value, boolean onlyIfAbsent,
               boolean evict) {
    Node<K, V>[] tab;
    Node<K, V> p;
    int n, i;
    // if the hash table is empty, call resize() to create it and record its length in n
    if ((tab = table) == null || (n = tab.length) == 0)
        n = (tab = resize()).length;
    /**
     * If no bucket exists for the given hash, there is no collision.
     * (n - 1) & hash computes the slot in which the key will be placed.
     * (n - 1) & hash is essentially hash % n, but the bit operation is faster.
     */
    if ((p = tab[i = (n - 1) & hash]) == null)
        // insert the key-value pair directly
        tab[i] = newNode(hash, key, value, null);
    else { // the bucket already contains elements
        Node<K, V> e;
        K k;
        // the first element of the bucket (the node in the array) has an equal hash and an equal key
        if (p.hash == hash &&
                ((k = p.key) == key || (key != null && key.equals(k))))
            // remember the first element in e
            e = p;
            // the first node does not match and the bucket is a red-black tree: insert into the tree
        else if (p instanceof TreeNode)
            e = ((TreeNode<K, V>) p).putTreeVal(this, tab, hash, key, value);
            // the first node does not match and the bucket is a linked list: insert at the tail
        else {
            for (int binCount = 0; ; ++binCount) {
                // reached the tail of the list
                if ((e = p.next) == null) {
                    p.next = newNode(hash, key, value, null);
                    // if the list length has reached the threshold, convert this bucket to a red-black tree
                    if (binCount >= TREEIFY_THRESHOLD - 1) // -1 for 1st
                        treeifyBin(tab, hash);
                    break;
                }
                // a node with an equal key already exists in the list: stop without inserting a duplicate
                if (e.hash == hash &&
                        ((k = e.key) == key || (key != null && key.equals(k))))
                    break;
                p = e;
            }
        }
        // e != null means a mapping with an equal key and hash was found
        if (e != null) { // existing mapping for key
            // remember the value of e
            V oldValue = e.value;
            /**
             * Replace the old value if onlyIfAbsent is false or the old value is null;
             * otherwise leave it unchanged.
             */
            if (!onlyIfAbsent || oldValue == null)
                e.value = value;
            // callback after access
            afterNodeAccess(e);
            // return the old value
            return oldValue;
        }
    }
    // record the structural modification
    ++modCount;
    // resize when the number of key-value pairs exceeds the threshold
    if (++size > threshold)
        resize();
    // callback after insertion
    afterNodeInsertion(evict);
    return null;
}

In this process, the resolution of hash collisions will be involved.
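The return value and the onlyIfAbsent flag of putVal are visible through the public API. The sketch below uses only standard java.util.HashMap methods; PutDemo and results are hypothetical names. It relies on putIfAbsent being the public method that keeps an existing non-null value, which corresponds to onlyIfAbsent being true.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PutDemo {
    // Collects the values returned by successive calls, then the final stored value.
    static List<Integer> results() {
        Map<String, Integer> m = new HashMap<>();
        List<Integer> out = new ArrayList<>();
        out.add(m.put("a", 1));         // no previous mapping: returns null
        out.add(m.put("a", 2));         // replaces 1 and returns the old value
        out.add(m.putIfAbsent("a", 3)); // key present: value kept, current value returned
        out.add(m.get("a"));            // still the value from the second put
        return out;
    }

    public static void main(String[] args) {
        System.out.println(results());
    }
}
```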

9 How does the HashMap.get() method work internally?

/**
 * Returns the value to which the given key is mapped; returns null if the value is null.
 * get can be divided into three steps:
 * 1. Compute the hash of the key with hash(Object key).
 * 2. Look up the node with getNode(int hash, Object key).
 * 3. If the node is null, return null; otherwise return node.value.
 *
 * @see #put(Object, Object)
 */
public V get(Object key) {
    Node<K, V> e;
    // look up the node by key and hash; if it exists, return its value
    return (e = getNode(hash(key), key)) == null ? null : e.value;
}

It ultimately calls the getNode function. The logic is as follows:

[Figure: getNode working logic]

The source code is as follows:

/**
 * @param hash hash value of the given key
 * @param key  the key
 * @return the node, or null if there is none
 */
final Node<K, V> getNode(int hash, Object key) {
    Node<K, V>[] tab;
    Node<K, V> first, e;
    int n;
    K k;
    // the hash table is not empty and the bucket for the key is not empty
    if ((tab = table) != null && (n = tab.length) > 0 &&
            (first = tab[(n - 1) & hash]) != null) {
        // the first node in the bucket matches the given hash and key
        if (first.hash == hash && // always check first node
                ((k = first.key) == key || (key != null && key.equals(k))))
            // return the first node in the bucket
            return first;
        // the first node does not match and there are further nodes
        if ((e = first.next) != null) {
            // the bucket is a red-black tree: use the tree lookup
            if (first instanceof TreeNode)
                return ((TreeNode<K, V>) first).getTreeNode(hash, key);
            // the bucket is not a red-black tree, so its nodes form a linked list
            do {
                // walk the list until the key matches
                if (e.hash == hash &&
                        ((k = e.key) == key || (key != null && key.equals(k))))
                    return e;
            } while ((e = e.next) != null);
        }
    }
    // the hash table is empty or no node was found: return null
    return null;
}
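Because get returns null both when the node is missing and when the node's value is null, the two cases cannot be told apart from the return value alone. A small sketch using only standard HashMap methods (GetDemo and sample are hypothetical names) shows containsKey resolving the ambiguity.

```java
import java.util.HashMap;
import java.util.Map;

class GetDemo {
    // A map in which one key is explicitly mapped to null.
    static Map<String, String> sample() {
        Map<String, String> m = new HashMap<>();
        m.put("present", null);
        return m;
    }

    public static void main(String[] args) {
        Map<String, String> m = sample();
        System.out.println(m.get("present"));         // null: the key exists, its value is null
        System.out.println(m.get("missing"));         // null: no node for this key
        System.out.println(m.containsKey("present")); // true
        System.out.println(m.containsKey("missing")); // false
    }
}
```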

Author: A Jin's Desk

Source: https://www.cnblogs.com/homejim/
