
Hash table

A hash table is a data structure that maps keys to values. Data is stored in an array, where each data value has its own unique index, and that index is computed by the hash table's hash function.

(Figure: hash table structure)

The following two steps convert a key into an index in the hash table:

  • Hash value = hash function(key)
  • Index value = hash value % hash table length
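
A minimal Java sketch of these two steps (a hypothetical helper, assuming String keys and Java's built-in hashCode as the hash function):

// Map a key to a bucket index.
static int bucketIndex(String key, int tableLength) {
    int hashValue = key.hashCode();                // hash value = hash function(key)
    return Math.floorMod(hashValue, tableLength);  // index = hash value % table length
    // Math.floorMod keeps the index non-negative even for negative hash values
}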

Conflict resolution

For a hash table of finite length, collisions are inevitable. There are two ways to resolve them: separate chaining (the "zipper method") and open addressing.

Separate chaining

The elements at a colliding position are linked into a list; when newly added data collides, the new element is appended to that list. As shown in the figure below, when "Sandra Dee" is added, its computed index 152 collides with "John Smith", so it is appended to the linked list at that slot.

(Figure: separate chaining)
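
A minimal separate-chaining sketch (a hypothetical class, not the JDK's implementation; for brevity, put appends without checking for an existing key):

import java.util.LinkedList;

// Each bucket holds a linked list; colliding entries are appended to it.
class ChainedHashTable {
    private final LinkedList<String[]>[] buckets;  // each entry is a {key, value} pair

    @SuppressWarnings("unchecked")
    ChainedHashTable(int capacity) {
        buckets = new LinkedList[capacity];
    }

    void put(String key, String value) {
        int index = Math.floorMod(key.hashCode(), buckets.length);
        if (buckets[index] == null) {
            buckets[index] = new LinkedList<>();
        }
        buckets[index].add(new String[] {key, value});  // append on collision
    }

    String get(String key) {
        int index = Math.floorMod(key.hashCode(), buckets.length);
        if (buckets[index] == null) return null;
        for (String[] entry : buckets[index]) {         // walk the chain
            if (entry[0].equals(key)) return entry[1];
        }
        return null;
    }
}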

Open addressing

Starting from the colliding position, probe for an empty slot according to some rule and insert the element there. The simplest approach is linear probing, which searches cyclically at a fixed interval (usually 1).

As shown in the figure below, when "Sandra Dee" is added it collides with "John Smith" and, after probing for an empty position, is inserted at slot 153; when "Ted Baker" is then added it collides with "Sandra Dee", and probing finds the empty slot 154 for it.

(Figure: open addressing with linear probing)
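
A minimal linear-probing sketch (a hypothetical class; it assumes the table is never completely full, otherwise put would loop forever):

// Open addressing with linear probing: on a collision, scan forward one
// slot at a time (wrapping around) until an empty slot is found.
class LinearProbingTable {
    private final String[] keys;
    private final String[] values;

    LinearProbingTable(int capacity) {
        keys = new String[capacity];
        values = new String[capacity];
    }

    void put(String key, String value) {
        int index = Math.floorMod(key.hashCode(), keys.length);
        while (keys[index] != null && !keys[index].equals(key)) {
            index = (index + 1) % keys.length;  // probe the next slot
        }
        keys[index] = key;
        values[index] = value;
    }

    String get(String key) {
        int index = Math.floorMod(key.hashCode(), keys.length);
        while (keys[index] != null) {
            if (keys[index].equals(key)) return values[index];
            index = (index + 1) % keys.length;  // keep probing
        }
        return null;
    }
}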

Performance

Load factor

The load factor is the ratio of the number of entries to the number of buckets; when it exceeds its threshold, the hash table is expanded. For example, if the threshold is 0.75 and the initial capacity is 16, then once the number of entries exceeds 12, the hash table is expanded and re-hashed. 0.6 and 0.75 are usually reasonable load factors.

$$ \text{load factor}\ (\alpha) = \frac{n}{k} $$

  • $n$: the number of entries in the hash table.
  • $k$: the number of buckets.
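
A concrete example of the threshold calculation, assuming HashMap-like defaults:

int capacity = 16;
double loadFactor = 0.75;
int threshold = (int) (capacity * loadFactor);  // 16 * 0.75 = 12
// once the 13th entry is added (size > threshold), the capacity
// doubles to 32 and every entry is re-hashed into the new table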

Two main factors affecting hash table performance

  • Cache misses. As the load factor increases, cache misses increase, and search and insert performance drops significantly as a result.
  • Expansion and re-hashing. Resizing is extremely time-consuming; setting an appropriate load factor controls how often it happens.

The figure below shows that as the load factor increases, the number of cache misses rises, climbing rapidly after 0.8.

(Figure: cache misses vs. load factor)

HashMap

For HashMap, let's examine its hash() method and its treeification of colliding buckets.

About hash()

Take the key's hashCode value, XOR its high 16 bits with its low 16 bits, and finally take the result modulo the table length (implemented with a bit mask).

static final int hash(Object key) {   // jdk1.8
     int h;
     // h = key.hashCode()   take the key's hashCode value
     // h ^ (h >>> 16)       XOR the high 16 bits with the low 16 bits
     return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}

// jdk1.7
static int indexFor(int h, int length) {
     return h & (length - 1);
}
// jdk1.8 inlines the equivalent computation at the point of use:
// (n - 1) & hash

XORing the upper 16 bits with the lower 16 bits mixes the high bits into the low bits, increasing the randomness of the low-order bits, which are the only bits the index mask actually uses.
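
Why (length - 1) & hash can replace the modulo: when the length is a power of two, length - 1 is a mask of low-order ones, so the AND keeps exactly the bits a modulo would. A quick check:

int hash = "John Smith".hashCode();
int length = 16;                        // a power of two
int viaMod  = Math.floorMod(hash, length);
int viaMask = hash & (length - 1);      // branch-free, no division
System.out.println(viaMod == viaMask);  // true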

Regarding randomness, there is a test example online: the author randomly selected 352 strings and measured collision counts for arrays of different lengths.

The results show that with a HashMap array of length 2^9 = 512, using the hashCode directly produces 103 collisions, while applying the high-low XOR produces 92 collisions.

(Table: collision counts with and without the high-low XOR)

https://www.todaysoftmag.com/article/1663/an-introduction-to-optimising-a-hashing-strategy

Conflict treeification

HashMap resolves collisions with separate chaining. In jdk1.8, when a bucket's linked list grows beyond TREEIFY_THRESHOLD = 8 nodes, the list is converted to a red-black tree. When removals or re-hashing shrink the bucket below UNTREEIFY_THRESHOLD = 6 nodes, the red-black tree is converted back to an ordinary linked list.
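
For reference, the relevant constants declared in the jdk1.8 HashMap source (MIN_TREEIFY_CAPACITY means a bucket is only treeified once the table itself has at least 64 buckets; below that, the table is resized instead):

static final int TREEIFY_THRESHOLD = 8;
static final int UNTREEIFY_THRESHOLD = 6;
static final int MIN_TREEIFY_CAPACITY = 64;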

A linked list must be traversed from the head node to the target node, which is O(N). A red-black tree is a balanced binary search tree, so lookups are O(log N). Therefore, when a bucket holds many elements, storing them in a red-black tree improves search efficiency. However, a tree node occupies roughly twice the space of an ordinary node, so switching between tree and linked list is a time-space trade-off.

Why is the tree threshold 8?

The HashMap source contains the following comment; in essence, it says that the number of nodes in a hash bucket follows a Poisson distribution.

Ideally, under random hashCodes, the frequency of
* nodes in bins follows a Poisson distribution
* (http://en.wikipedia.org/wiki/Poisson_distribution) with a
* parameter of about 0.5 on average for the default resizing
* threshold of 0.75, although with a large variance because of
* resizing granularity. Ignoring variance, the expected
* occurrences of list size k are (exp(-0.5) * pow(0.5, k) /
* factorial(k)). The first values are:
*
* 0:    0.60653066
* 1:    0.30326533
* 2:    0.07581633
* 3:    0.01263606
* 4:    0.00157952
* 5:    0.00015795
* 6:    0.00001316
* 7:    0.00000094
* 8:    0.00000006
* more: less than 1 in ten million

What is the Poisson distribution?

The Poisson distribution describes the probability of an event occurring a given number of times within a fixed interval of time. Knowing only the average rate, it lets us estimate the probability of any specific count.

$$ P(N(t) = n) = \frac{(\lambda t)^n e^{-\lambda t}}{n!} $$

  • $P$: the probability;
  • $N(t)$: the number of occurrences within time $t$;
  • $\lambda$: the average number of occurrences per unit time;
  • $n$: the number of occurrences.

For example, if a programmer writes an average of 3 bugs per day, the probability that he writes exactly 3 bugs tomorrow is expressed as \( P(N(1) = 3) \) with \( \lambda = 3 \). From this we can also obtain:

The probability that he writes 1 bug tomorrow: 0.1493612051
The probability that he writes 2 bugs tomorrow: 0.2240418077
The probability that he writes 3 bugs tomorrow: 0.2240418077
The probability that he writes 10 bugs tomorrow: 0.0008101512

These values can be computed with the following method (IntMath.factorial comes from Guava):

import java.math.BigDecimal;
import com.google.common.math.IntMath;
import static java.math.BigDecimal.ROUND_HALF_UP;

/**
 * @param n the number of occurrences (nodes)
 * @param r the average rate (λt)
 */
public static String poisson(int n, double r) {
    double value = Math.exp(-r) * Math.pow(r, n) / IntMath.factorial(n);
    return new BigDecimal(value).setScale(10, ROUND_HALF_UP).toPlainString();
}
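
The bug-count probabilities above can be reproduced with this method (λt = 3):

System.out.println(poisson(1, 3));   // 0.1493612051
System.out.println(poisson(2, 3));   // 0.2240418077
System.out.println(poisson(3, 3));   // 0.2240418077
System.out.println(poisson(10, 3));  // 0.0008101512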

Assume the HashMap holds \( n \) entries and the load factor is \( k \). The minimum capacity is then \( \frac{n}{k} \) and the maximum is about \( \frac{2n}{k} \) (the capacity must be a power of 2), so the average capacity is \( \frac{3n}{2k} \), and the average number of nodes per bucket is

$$ n \div \frac{3n}{2k} = \frac{2k}{3} = \frac{2 \times 0.75}{3} = 0.5 $$

The default load factor of HashMap is 0.75, so the average number of nodes per bucket is 0.5. Substituting this into the Poisson formula gives the following data:

Probability of 1 node in a bucket: 0.3032653299
Probability of 2 nodes in a bucket: 0.0758163325
Probability of 3 nodes in a bucket: 0.0126360554
Probability of 4 nodes in a bucket: 0.0015795069
Probability of 5 nodes in a bucket: 0.0001579507
Probability of 6 nodes in a bucket: 0.0000131626
Probability of 7 nodes in a bucket: 0.0000009402
Probability of 8 nodes in a bucket: 0.0000000588
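
These values can likewise be reproduced with the poisson method above:

for (int n = 1; n <= 8; n++) {
    System.out.println("Probability of " + n + " node(s) in a bucket: " + poisson(n, 0.5));
}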

Treeification is a last resort for extremely bad hash distributions. Since the probability of 8 nodes landing in one bucket is less than one in ten million, TREEIFY_THRESHOLD is set to 8.

Summary

A hash table is a data structure that maps keys to values. There are two ways to resolve collisions: separate chaining and open addressing. Set the load factor and initial capacity sensibly to avoid excessive resizing and cache misses. Finally, we examined HashMap's hash() method and its treeification of colliding buckets.

