[PHP data structure] hash table search

硬核项目经理
中文

Was the search in the previous article an unfinished feeling? Because we are really exposed to the optimization of time complexity. The direct optimization from O(n) for linear search to O(logN) for binary search is definitely a qualitative leap. But what is the core requirement of our binary search? That is, the original data must be in order. This is a troublesome thing, after all, if the amount of data is very large, sorting will become very troublesome. But don't worry, the hash table search we are going to learn today is another form of search. To what extent can it be done?

O(1), yes, you read that right. In the best case, hash table search can achieve such a constant level of search efficiency. Isn't it amazing?

Hash hashing (except for the remainder method)

First look at a very simple hashing algorithm through a practical example. When the amount of data is relatively large, we often have to perform table operations on the data table. The simplest solution is to modulo a certain field, such as ID. In other words, if we want to divide 20 tables, then divide the ID of the data by 20, and then get the remainder. Then add this piece of data to the table corresponding to the remainder. We simulate this operation through code.

or($i=0;$i<100;$i++){
    $arr[] = $i+1;
}

$hashKey = 7;
$hashTable = [];
for($i=0;$i<100;$i++){
    $hashTable[$arr[$i]%$hashKey][] = $arr[$i];
}

print_r($hashTable);

Here, we assume that 100 pieces of data are put into 7 tables, that is, the modulus operator% is used directly to obtain the remainder, and then the data is put into the corresponding array subscript. These 100 data are respectively placed in the subscripts 0-6 in the array. In this way, we have realized the simplest idea of data sub-table. Of course, it is much more complicated than this in actual business development. Because we consider a variety of different scenarios to determine what form to divide the table, how many tables to divide, and the subsequent expansion, that is to say, the real situation is much more complicated than what we wrote here.

As a demonstration code, the hash form of this sub-table is actually the most classic and most used method of dividing and leaving the remainder in the hash table lookup. In fact, there are other methods, such as the square method, the folding method, and the numerical analysis method. Their core idea is to act as a hash algorithm, let the original data correspond to a new value (location).

In fact, the most typical similar idea is the hash operation of md5(). Different content will produce different values. In addition, key-value pair cache databases such as Redis and Memcached will actually hash the Key value we set and store it in memory to achieve fast search capabilities.

Hash collision problem (linear detection method)

In the above example, we will actually find a problem, that is, if the value of the hash algorithm is small, there will be a lot of repeated conflicting data. If it is a real hash table of stored data, such storage actually cannot help us quickly and accurately find the data we need. Search and search, its core ability is actually search. So if we randomly give some data, then how to save them in the same length range and avoid conflicts? This is the problem of hash conflict that we will learn next.

$arr = [];
$hashTable = [];
for($i=0;$i<$hashKey;$i++){
    $r = rand(1,20);
    if(!in_array($r, $arr)){
        $arr[] = $r;
    }else{
        $i--;
    }
}

print_r($arr);
for($i=0;$i<$hashKey;$i++){
    if(!$hashTable[$arr[$i]%$hashKey]){
        $hashTable[$arr[$i]%$hashKey] = $arr[$i];
    }else{
        $c = 0;
        echo '冲突位置:', $arr[$i]%$hashKey, ',值:',$arr[$i], PHP_EOL;
        $j=$arr[$i]%$hashKey+1;
        while(1){
            if($j>=$hashKey){
                $j = 0;
            }
            if(!$hashTable[$j]){
                $hashTable[$j] = $arr[$i];
                break;
            }
            $c++;
            $j++;
            if($c >= $hashKey){
                break;
            }
        }
    }
}
print_r($hashTable);

This time we only generate 7 random data, let them still divide by 7 as the modulo. At the same time, we also need to store them in another array as the result of the hash. This new array can be regarded as a space in memory. If there are data with the same hash, then of course they can't be placed in the same space. If there are two pieces of data in a different space, we don't know which data we really want to fetch.

In this code, we use the linear detection method in the open address method. This is the easiest way to handle hash collisions. Let's take a look at the output and then analyze what we did during the conflict.

// Array
// (
//     [0] => 17     // 3
//     [1] => 13     // 6
//     [2] => 9      // 2
//     [3] => 19     // 5
//     [4] => 2      // 2 -> 3 -> 4
//     [5] => 20     // 6 -> 0
//     [6] => 12     // 5 -> 6 -> 0 -> 1
// )
// 冲突位置:2,值:2
// 冲突位置:6,值:20
// 冲突位置:5,值:12
// Array
// (
//     [3] => 17
//     [6] => 13
//     [2] => 9
//     [5] => 19
//     [4] => 2
//     [0] => 20
//     [1] => 12
// )
  • First of all, the numbers we generated are the seven numbers 17, 13, 9, 19, 2, 20, and 12.
  • 17%7=3, 17 is saved in subscript 3.
  • 13%7=6, 13 is saved in subscript 6.
  • 9%7=2, 9 is saved in subscript 2.
  • 19%7=5, 19 is saved in subscript 5.
  • 2%7=2, okay, the conflict has occurred, the result of 2%7 is also 2, but the subscript of 2 is already someone, then we will start from 2 and look at the subscript of 3 to see if there is anyone, the same 3 is also occupied, so to 4, at this time 4 is empty, so 2 is stored in the subscript 4.
  • 20%7=6, as above, 6 is already occupied, so we go back to the 0 subscript at the beginning and find that 0 is not yet occupied, so 20 is saved in subscript 0.
  • The last 12%7=5, it will go through the subscripts 5, 6, 0, 1 and then put it at the subscript 1.

The final result is the result of our final array output. It can be seen that linear detection is actually to search down one by one if the position is found to be occupied by people. So its time complexity is not very good. Of course, the best case is that the total length of the data matches the length of the hash key, so that it can reach the O(1) level.

Of course, in addition to linear detection, there are algorithms such as secondary detection (square) and pseudo-random detection. In addition, you can also use a linked list to implement the chain address method to solve the problem of hash conflicts. You can check the relevant documents or books by yourself for these contents.

Summarize

The final lookup function of the hash hash is actually the same as the process of generating the hash table above. It is found that the conflict resolution is also the same, so I won't talk about it here. Regarding Hash, whether it is textbooks or various learning materials, the content of the introduction is not particularly large. Therefore, we also use an introductory mentality to briefly understand the knowledge of Hash. You can study more and share more content by yourself!

Test code:

https://github.com/zhangyue0503/Data-structure-and-algorithm/blob/master/6. lookup/source/6.2 hash table

Reference documents:

"Data Structure" Second Edition, Yan Weimin

"Data Structure" Second Edition, Chen Yue

Searchable on their respective media platforms [Hardcore Project Manager]

阅读 143
71 声望
14 粉丝
0 条评论
你知道吗?

71 声望
14 粉丝
宣传栏