Introduction

There are two commonly used hash-based classes in Java: Hashtable and HashMap. The underlying storage of both classes is an array, but not an ordinary one: a structure called a hash table.

A hash table is a data structure that maps keys to values. It uses a hash function to map keys into a small range of indices (usually [0..table size - 1]). At the same time, it must provide a way to detect and resolve the collisions this mapping causes.

Today we will look at the characteristics and operations of hash tables.

There is a link to the code at the end of the article; feel free to download it.

Key concepts of hash tables

The three key concepts behind a hash table are hashing, the hash function, and collision resolution.

Hashing is an algorithm that, through a hash function, maps a large, variable-length data set to a smaller, fixed-length set of integers.

A hash table is a data structure that uses a hash function to map keys to values, supporting efficient search/retrieval, insertion, and deletion.

Hash tables are widely used in a variety of computer software, especially for associative arrays, database indexes, caches, and sets.

The hash table must support at least the following three operations and be as efficient as possible:

Search(v) - determine whether v exists in the hash table,
Insert(v) - insert v into the hash table,
Delete(v) - delete v from the hash table.

Because hashing maps a large data set onto a smaller one, collisions can occur during insertion. Depending on how these collisions are resolved, the techniques fall into linear probing, quadratic probing, double hashing, and separate chaining.

Arrays and hash tables

Consider the problem of finding the first repeated character in a given string.

How do we solve this problem? The naive way is to traverse the string n times: the first pass checks whether any other character equals the first character, the second pass checks the second character, and so on.

Because this performs n*n comparisons, the time complexity is O(n²).

Is there an easier way?

Note that the set of characters in a string is finite. If the string contains only ASCII characters, we can construct an array of length 256 and traverse the string just once.

Specifically, as we traverse, we increment the value at the array index corresponding to each character. When we find that the value at some index is already 1, we know that character is repeated.
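The counting approach above can be sketched as follows; the class and method names are illustrative, not taken from the article's repository:

```java
// A minimal sketch of the single-pass approach: count occurrences of each
// ASCII character in a 256-slot array and stop at the first character whose
// count is already 1.
public class FirstRepeat {
    // Returns the first repeated character, or '\0' if none repeats.
    static char firstRepeated(String s) {
        int[] count = new int[256];      // one slot per ASCII character
        for (char c : s.toCharArray()) {
            if (count[c] == 1) {         // seen before: first repeat found
                return c;
            }
            count[c]++;
        }
        return '\0';                     // no character repeats
    }

    public static void main(String[] args) {
        System.out.println(firstRepeated("abcdb")); // prints "b"
    }
}
```

One pass over the string plus a fixed 256-entry array brings the time complexity down to O(n).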

Array problem

So what is wrong with this array-based implementation?

The problems with the array approach:

The range of the keys must be small; if the key range is very large, the memory usage will also be very large.
The keys must be dense, that is, there should not be too many gaps between key values; otherwise the array will contain too many empty cells.

We can use a hash function to solve this problem.

By using hash functions, we can:

Map non-integer keys to integer keys,
Map large integers to smaller integers.

By using the hash function, we can effectively reduce the size of the storage array.
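As a hedged sketch of these two mappings, the hypothetical class below first folds a string key into an integer (polynomial hashing with the common multiplier 31) and then reduces any integer into the index range [0..M-1] with a modulo; the table size 17 is an arbitrary illustrative choice:

```java
// Illustrative sketch: turn a non-integer key into an integer, then map a
// large integer into a small index range. Names and constants are assumptions.
public class SimpleHash {
    static final int M = 17;             // hash table size (illustrative)

    // Fold a string key into a non-negative integer (polynomial hashing).
    static int stringToInt(String key) {
        int h = 0;
        for (char c : key.toCharArray()) {
            h = 31 * h + c;
        }
        return h & 0x7fffffff;           // clear the sign bit
    }

    // Reduce any non-negative integer key into [0..M-1].
    static int hash(int key) {
        return key % M;
    }

    public static void main(String[] args) {
        // Any string key ends up at some index in [0..16].
        System.out.println(hash(stringToInt("apple")));
    }
}
```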

Hash problems

Everything has trade-offs. Although a hash function lets us map a large data set into a small one, it may, and very likely will, map different keys to the same integer slot: a many-to-one mapping rather than a one-to-one mapping.

Especially when the hash table is densely filled, such collisions occur frequently.

This brings in a concept that measures how full the table is: the load factor α = N / M, where N is the number of keys and M is the size of the hash table.

In fact, the probability of collisions is greater than we might think. Consider the birthday paradox:

How many students must a class have for the probability that at least two share a birthday to exceed 50%?

Let's calculate the above problem.

Suppose Q(n) is the probability that all n people in the class have different birthdays.

Q(n) = 365/365 × 364/365 × 363/365 × ... × (365-n+1)/365; that is, the first person's birthday can be any of the 365 days, the second person's birthday can be any of the remaining 364 days, and so on.

Let P(n) be the probability that at least two of the n people share a birthday; then P(n) = 1 - Q(n).

By calculation, when n = 23, P(23) = 0.507 > 0.5 (50%).

In other words, once a class has 23 people, the probability that at least two of them share a birthday already exceeds 50%. This paradox tells us that events an individual finds rare are common in a group.
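The birthday calculation above can be checked directly; the class name here is illustrative:

```java
// Compute P(n) = 1 - Q(n) exactly as derived above.
public class Birthday {
    // P(n): probability that at least two of n people share a birthday.
    static double sameBirthdayProb(int n) {
        double q = 1.0;                  // Q(n): all n birthdays distinct
        for (int i = 0; i < n; i++) {
            q *= (365.0 - i) / 365.0;    // i-th person avoids i earlier birthdays
        }
        return 1.0 - q;
    }

    public static void main(String[] args) {
        System.out.printf("P(23) = %.3f%n", sameBirthdayProb(23)); // ≈ 0.507
    }
}
```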

Back to hash collisions: we need to construct a good hash function to minimize them.

What is a good hash function?

It can be computed quickly, that is, its time complexity is O(1),
It uses as small a hash table as possible,
It distributes the keys as evenly as possible across the base addresses ∈[0..M-1],
It minimizes collisions as much as possible.

Before discussing the implementation of the hash function, let us discuss the ideal situation: a perfect hash function.

A perfect hash function is a one-to-one mapping between keys and hash values, that is, there are no collisions at all. This is very rare in general, but if we know in advance the keys to be stored, it can still be achieved.

Okay, let's discuss several common methods of resolving conflicts in hashes.

Linear probing

First, the linear probing formula: the probe index is i = (base + step * 1) % M, where base is the hash value of key v, that is, h(v), and step is the probe step starting from 1.

The probe sequence of linear probing can be formally described as follows:

h(v)               // base address
(h(v) + 1 * 1) % M // 1st probing step, if there is a collision
(h(v) + 2 * 1) % M // 2nd probing step, if there is still a collision
(h(v) + 3 * 1) % M // 3rd probing step, if there is still a collision
...
(h(v) + k * 1) % M // k-th probing step, etc.

Let's look at an example first. In the array above, the table size M is 9, and the array already contains the three elements 1, 3, and 5.

Now we want to insert 10 and 12. Their hash values are 1 and 3, but slots 1 and 3 are already occupied, so we probe forward linearly one slot at a time and finally insert them just after 1 and 3.

The example above shows deleting 10. Similarly, we first compute the hash value of 10, which is 1, then check whether the element at slot 1 is 10; if it is not, we probe forward linearly.

Here is the key code for linear probing:

    // insert a node
    void insertNode(int key, int value)
    {
        HashNode temp = new HashNode(key, value);

        // compute the hash code of the key
        int hashIndex = hashCode(key);

        // find the next free slot (key == -1 marks a lazily deleted slot)
        while(hashNodes[hashIndex] != null && hashNodes[hashIndex].key != key
            && hashNodes[hashIndex].key != -1)
        {
            hashIndex++;
            hashIndex %= capacity;
        }
        // inserting into an empty or deleted slot grows the size
        if(hashNodes[hashIndex] == null || hashNodes[hashIndex].key == -1) {
            size++;
        }
        // place the new node into the array
        hashNodes[hashIndex] = temp;
    }
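The insert code above probes past slots whose key is -1; that sentinel comes from lazy deletion. The self-contained sketch below shows search and delete following the same linear probe sequence (the names mirror the article's code, but the class itself and its fixed capacity of 8 are assumptions):

```java
// Linear probing with lazy deletion: delete only marks the slot with key -1,
// so probe chains that pass through it remain intact for later searches.
public class LinearProbeTable {
    static class HashNode {
        int key, value;
        HashNode(int key, int value) { this.key = key; this.value = value; }
    }

    int capacity = 8;                    // illustrative fixed capacity
    HashNode[] hashNodes = new HashNode[capacity];

    int hashCode(int key) { return key % capacity; }

    // Assumes the table is not full.
    void insertNode(int key, int value) {
        int i = hashCode(key);
        while (hashNodes[i] != null && hashNodes[i].key != key
                && hashNodes[i].key != -1) {
            i = (i + 1) % capacity;      // linear probe: step forward by 1
        }
        hashNodes[i] = new HashNode(key, value);
    }

    // Probe from h(key) until the key or an empty slot is found.
    Integer search(int key) {
        int i = hashCode(key);
        for (int step = 0; step < capacity; step++) {
            if (hashNodes[i] == null) return null;   // empty slot: not present
            if (hashNodes[i].key == key) return hashNodes[i].value;
            i = (i + 1) % capacity;      // deleted or other key: keep probing
        }
        return null;
    }

    // Lazy delete: mark the slot instead of clearing it.
    void deleteNode(int key) {
        int i = hashCode(key);
        for (int step = 0; step < capacity; step++) {
            if (hashNodes[i] == null) return;
            if (hashNodes[i].key == key) { hashNodes[i].key = -1; return; }
            i = (i + 1) % capacity;
        }
    }
}
```

If delete cleared the slot to null instead, a later search for a key stored further along the probe chain would stop early at the hole and wrongly report "not found".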

If we call a contiguous run of occupied slots sharing the same base address h(v) a cluster, linear probing tends to create large primary clusters, which increases the running time of the Search(v)/Insert(v)/Delete(v) operations.

To mitigate this problem, we introduce quadratic probing.

Quadratic probing

First, the quadratic probing formula: the probe index is i = (base + step * step) % M, where base is the hash value of key v, that is, h(v), and step is the probe step starting from 1.

h(v)               // base address
(h(v) + 1 * 1) % M // 1st probing step, if there is a collision
(h(v) + 2 * 2) % M // 2nd probing step, if there is still a collision
(h(v) + 3 * 3) % M // 3rd probing step, if there is still a collision
...
(h(v) + k * k) % M // k-th probing step, etc.

In short, the probe jumps quadratically, wrapping around the hash table as needed.

Look at an example of quadratic probing. In the example above, the table already contains three elements: 38, 3, and 18. Now insert 10 and 12; you can trace the probe paths yourself.

Let's look at an example of deleting a node under quadratic probing.

Here is the key code for quadratic probing:

    // insert a node
    void insertNode(int key, int value)
    {
        HashNode temp = new HashNode(key, value);

        // compute the hash code of the key
        int base = hashCode(key);
        int hashIndex = base;

        // find the next free slot (key == -1 marks a lazily deleted slot)
        int i = 1;
        while(hashNodes[hashIndex] != null && hashNodes[hashIndex].key != key
            && hashNodes[hashIndex].key != -1)
        {
            // the k-th probe lands at (h(v) + k * k) % M
            hashIndex = (base + i * i) % capacity;
            i++;
        }

        // inserting into an empty or deleted slot grows the size
        if(hashNodes[hashIndex] == null || hashNodes[hashIndex].key == -1) {
            size++;
        }
        // place the new node into the array
        hashNodes[hashIndex] = temp;
    }

Under quadratic probing, clusters form along the probe paths rather than around the base address as in linear probing. These clusters are called secondary clusters.
They form because all keys follow the same probing pattern.

Secondary clusters under quadratic probing are not as harmful as the primary clusters of linear probing, because in theory the hash function should already scatter the keys across different base addresses ∈[0..M-1].

To reduce both primary and secondary clustering, we introduce double hashing.

Double hashing

First, the double hashing formula: the probe index is i = (base + step * h2(v)) % M, where base is the hash value of key v, that is, h(v), and step is the probe step starting from 1.

h(v)                   // base address
(h(v) + 1 * h2(v)) % M // 1st probing step, if there is a collision
(h(v) + 2 * h2(v)) % M // 2nd probing step, if there is still a collision
(h(v) + 3 * h2(v)) % M // 3rd probing step, if there is still a collision
...
(h(v) + k * h2(v)) % M // k-th probing step, etc.

In short, the probe jumps by the value of the second hash function h2(v), wrapping around the hash table as needed.

Here is the key code for double hashing:

    // insert a node
    void insertNode(int key, int value)
    {
        HashNode temp = new HashNode(key, value);

        // compute the base address with the first hash function
        int base = hash1(key);
        int hashIndex = base;

        // find the next free slot (key == -1 marks a lazily deleted slot)
        int i = 1;
        while(hashNodes[hashIndex] != null && hashNodes[hashIndex].key != key
            && hashNodes[hashIndex].key != -1)
        {
            // the k-th probe lands at (h1(v) + k * h2(v)) % M
            hashIndex = (base + i * hash2(key)) % capacity;
            i++;
        }

        // inserting into an empty or deleted slot grows the size
        if(hashNodes[hashIndex] == null || hashNodes[hashIndex].key == -1) {
            size++;
        }
        // place the new node into the array
        hashNodes[hashIndex] = temp;
    }

If h2(v) = 1, double hashing behaves exactly like linear probing, so we usually want h2(v) > 1 to avoid primary clusters.

If h2(v) = 0, double hashing does not work at all, since the probe never moves.

For integer keys, a common choice is h2(v) = M' - v % M', where M' is a prime smaller than M. This guarantees h2(v) ∈ [1..M'].

Using a secondary hash function makes it theoretically difficult for primary or secondary clustering to arise.
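The suggested secondary hash can be sketched directly; M = 11 and M' = 7 below are illustrative values, not taken from the article:

```java
// Double hashing probe sequence with h2(v) = M' - v % M', M' prime and < M.
public class DoubleHash {
    static final int M = 11;             // table size (illustrative)
    static final int M2 = 7;             // prime M' < M (illustrative)

    static int h1(int v) { return v % M; }

    // Always in [1..M'], so the probe step is never 0.
    static int h2(int v) { return M2 - v % M2; }

    // Index probed at step k for key v: (h1(v) + k * h2(v)) % M.
    static int probe(int v, int k) { return (h1(v) + k * h2(v)) % M; }

    public static void main(String[] args) {
        // Print the first few probe positions for key 25.
        for (int k = 0; k < 4; k++) {
            System.out.println(probe(25, k));
        }
    }
}
```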

Separate chaining

The separate chaining (SC) collision resolution technique is very simple: if two keys a and b have the same hash value i, both are appended to a linked list at slot i.

Because the slot where a key is inserted depends entirely on the hash function itself, separate chaining is also called a closed addressing collision resolution technique.

The example above shows separate-chaining insertion: the two elements 12 and 3 are inserted into an existing HashMap.

The example above shows deletion under separate chaining: the element 10 is removed from its chain.

Here is the key code for separate chaining:

    // add an element
    public void add(int key, int value)
    {
        int index = hashCode(key);
        HashNode head = hashNodes[index];
        HashNode toAdd = new HashNode(key, value);
        if(head == null)
        {
            // empty bucket: the new node becomes the head
            hashNodes[index] = toAdd;
            size++;
        }
        else
        {
            // walk the chain; if the key exists, just update its value
            while(head != null)
            {
                if(head.key == key)
                {
                    head.value = value;
                    break;
                }
                head = head.next;
            }
            // key not found: prepend the new node to the chain
            if(head == null)
            {
                toAdd.next = hashNodes[index];
                hashNodes[index] = toAdd;
                size++;
            }
        }
        // dynamic resizing when the load factor exceeds 0.7
        if((1.0 * size) / capacity > 0.7)
        {
            HashNode[] tmp = hashNodes;
            hashNodes = new HashNode[capacity * 2];
            capacity = 2 * capacity;
            size = 0; // re-adding below recounts every element
            for(HashNode headNode : tmp)
            {
                while(headNode != null)
                {
                    add(headNode.key, headNode.value);
                    headNode = headNode.next;
                }
            }
        }
    }

Rehashing

When the load factor α gets high, hash table performance degrades. For (standard) quadratic probing, insertion may even fail once α > 0.5.
If this happens, we can rehash: build another hash table about twice as large with a new hash function, traverse all keys in the original table, recompute their hash values, reinsert the key-value pairs into the new, larger table, and finally discard the old, smaller table.
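These rehashing steps can be sketched for a minimal open-addressing table of plain int keys (all names are illustrative; -1 marks an empty slot):

```java
// Rehash: allocate a table roughly twice as large, recompute each key's index
// against the new size, and reinsert with linear probing.
public class Rehash {
    static int[] rehash(int[] table) {
        int newCapacity = table.length * 2;  // roughly double the size
        int[] bigger = new int[newCapacity];
        java.util.Arrays.fill(bigger, -1);   // -1 = empty slot
        for (int key : table) {
            if (key == -1) continue;         // skip empty slots
            int i = key % newCapacity;       // recompute the hash for the new size
            while (bigger[i] != -1) {        // linear probing on collision
                i = (i + 1) % newCapacity;
            }
            bigger[i] = key;
        }
        return bigger;                       // the old table can now be dropped
    }
}
```

Doubling the capacity halves the load factor, restoring the table's performance.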

The code address of this article:

learn-algorithm

This article has been included in http://www.flydean.com/14-algorithm-hashtable/


Welcome to follow my official account "程序那些事" ("Program those things"): know technology, know you better!

