Abstract: A hash table maps a set of keys onto a limited, contiguous range of addresses using a chosen hash function H(key) and a chosen collision-handling method. The "image" of each key in that address range serves as the storage location of the corresponding record. A lookup table built this way is called a "hash table".

This article is shared from the Huawei Cloud Community post "Search-HASH"; original author: ruochen.

For a frequently used lookup table, we would like ASL (average search length) = 0, meaning a record is found without any key comparisons.
This requires a fixed relationship between a record's position in the table and its key.

HASH

Definition

Using a chosen hash function H(key) and a chosen collision-handling method, a set of keys is mapped onto a limited, contiguous range of addresses, and the "image" of each key in that range is used as the storage location of the corresponding record. A lookup table built this way is called a "hash table".

Constructing the hash function

  • Construction principles

    • The function itself is easy to compute
    • The computed addresses are evenly distributed: for any key k, f(k) is equally likely to land on each address, which keeps collisions to a minimum

Direct addressing

  • The hash function is a linear function of the key

    • H(key) = key
    • H(key) = a * key + b
  • This method is only suitable when:
    the size of the address set equals the size of the key set
  • Advantages: a linear function of the key is used as the hash address, so different keys never collide
  • Disadvantages: occupies a contiguous address space, so space efficiency is low
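
The linear form H(key) = a * key + b can be sketched in a few lines. The function name `direct_address` and the year-based example are illustrative assumptions, not part of the original article.

```cpp
#include <cstddef>

// Direct addressing: H(key) = a * key + b.
// The caller must pick a and b so that the result is never negative.
// Example: with a = 1 and b = -1949, the years 1949, 1950, ... map to
// slots 0, 1, ... and no two different years share a slot.
std::size_t direct_address(int key, int a, int b) {
    return static_cast<std::size_t>(a * key + b);
}
```

Because each key owns exactly one slot, the table needs as many slots as there are possible keys, which is where the poor space efficiency comes from.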

Digit analysis

  • Assume each key in the key set consists of s digits (u1, u2, …, us). Analyze the whole key set and extract the digit positions whose values are evenly distributed, or a combination of them, as the address
  • This method is only suitable when:
    the frequency of each digit value at each position can be estimated in advance for all keys
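
As a rough sketch of the extraction step only: once the analysis has picked the evenly distributed digit positions, the address is built from those digits. The function name `digit_select_hash` and the convention of counting positions from the right are assumptions made for this example.

```cpp
#include <vector>

// Build an address from selected decimal digits of `key`.
// `positions` lists digit positions counted from the right (0 = units digit),
// with the most significant chosen digit first.
unsigned digit_select_hash(unsigned long key, const std::vector<int>& positions) {
    unsigned address = 0;
    for (int pos : positions) {
        unsigned long shifted = key;
        for (int i = 0; i < pos; ++i) shifted /= 10;  // drop the lower digits
        address = address * 10 + static_cast<unsigned>(shifted % 10);
    }
    return address;
}
```

For the key 81346532 and positions {3, 1}, the digits 6 and 3 are picked, giving the address 63. Which positions to pick is decided beforehand by analyzing the whole key set.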

Mid-square method

  • Use the middle digits of the square of the key as the storage address. Squaring the key "magnifies the differences" between keys, and the middle digits of the square are affected by every digit of the key
  • This method is suitable when:
    certain digit values occur frequently in every digit position of the key
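
A minimal decimal sketch of the idea, assuming the key fits in 32 bits and at least one digit is requested. The name `mid_square_hash` and the choice to centre the window on the decimal string are illustrative assumptions.

```cpp
#include <cstddef>
#include <string>

// Mid-square hash: square the key and keep `n_digits` decimal digits from
// the middle of the square. Assumes n_digits >= 1 and key < 2^32.
unsigned long long mid_square_hash(unsigned long long key, std::size_t n_digits) {
    std::string sq = std::to_string(key * key);
    if (sq.size() <= n_digits) return std::stoull(sq);  // square is already short
    std::size_t start = (sq.size() - n_digits) / 2;     // centre the window
    return std::stoull(sq.substr(start, n_digits));
}
```

For example, 1234 squared is 1522756, and the middle three digits give the address 227.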

Folding method

  • Split the key into several parts and take their sum as the hash address. The parts can be added in two ways: shift folding and boundary folding
  • This method is suitable when:
    the key has a very large number of digits
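
Both kinds of folding can be sketched with one function. This version assumes the key is given as a string of decimal digits, splits it into groups of d digits from the right, and drops the carry out of the top digit; the name `fold_hash` and these conventions are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Sum the d-digit groups of `digits`, taken from the right.
// Shift folding adds the groups as they are; boundary folding reverses
// every second group first. The carry beyond d digits is dropped.
// Assumes `digits` holds only '0'..'9' and 1 <= d <= 9.
unsigned long fold_hash(const std::string& digits, std::size_t d, bool boundary) {
    if (d == 0) return 0;
    unsigned long modulus = 1;
    for (std::size_t i = 0; i < d; ++i) modulus *= 10;

    unsigned long sum = 0;
    bool reverse_this = false;
    std::size_t end = digits.size();
    while (end > 0) {
        std::size_t begin = end > d ? end - d : 0;
        std::string part = digits.substr(begin, end - begin);
        if (boundary && reverse_this) std::reverse(part.begin(), part.end());
        sum += std::stoul(part);
        reverse_this = !reverse_this;
        end = begin;
    }
    return sum % modulus;
}
```

With the digits 0442205864 and d = 4, the groups are 5864, 4220 and 04. Shift folding gives 5864 + 4220 + 04 = 10088, which becomes 88 after dropping the carry; boundary folding gives 5864 + 0224 + 04 = 6092.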

Division-remainder method

  • Hash(key) = key mod p (p is an integer)
    p ≤ m (table length)

    • p should be the largest prime number less than or equal to m
    • Why impose restrictions on p?

Given the set of keys 12, 39, 18, 24, 33, 21: if p = 9, the corresponding hash values are
3, 3, 0, 6, 6, 3

So if p contains the prime factor 3, every key that also contains the prime factor 3 is mapped to an address that is a multiple of 3, which increases the likelihood of collisions
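
The effect of the choice of p can be checked directly on the key set above. The helper names `mod_hash` and `distinct_addresses` are hypothetical, and p = 11 is a prime picked here only for comparison; it does not appear in the original.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// H(key) = key mod p, for non-negative keys and p > 0.
int mod_hash(int key, int p) { return key % p; }

// Count how many distinct addresses a key set occupies for a given p.
std::size_t distinct_addresses(const std::vector<int>& keys, int p) {
    std::set<int> used;
    for (int k : keys) used.insert(mod_hash(k, p));
    return used.size();
}
```

For the keys 12, 39, 18, 24, 33, 21, p = 9 uses only the three addresses 0, 3 and 6, while p = 11 spreads the six keys over six different addresses (1, 6, 7, 2, 0, 10).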

Random number method

  • H(key) = Random(key) (Random is a pseudo-random function)
  • This method is used to construct a hash function for keys of unequal length

Considerations

  1. Execution speed (that is, the time required to calculate the hash function)
  2. Keyword length
  3. The size of the hash table
  4. Keyword distribution
  5. Lookup frequency of the records

The method chosen to construct the hash function depends on the key set of the table.
The guiding principle is to make collisions as unlikely as possible

Ways to handle collisions

1. Open addressing method

Basic idea

When a collision occurs, look for the next empty hash address. As long as the hash table is large enough, an empty address can always be found and the data element stored there

Linear probing

Hi = (Hash(key) + di) mod m (1 ≤ i < m)
where m is the length of the hash table and
di is the increment sequence 1, 2, …, m-1, that is, di = i
When a collision occurs, probe forward to the next empty address and store the element there

  • Advantages: as long as the hash table is not full, an empty address is guaranteed to be found
  • Disadvantages: a synonym of the i-th hash address may be stored at address i+1, so an element that belongs at address i+1 is pushed on to compete for address i+2, and so on. This "clustering" reduces search efficiency
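
A minimal sketch of insertion with linear probing, assuming non-negative integer keys, Hash(key) = key mod m, and a hypothetical marker `kEmptySlot` for unused slots. None of these names come from the original.

```cpp
#include <vector>

const int kEmptySlot = -1;  // assumed marker for an unused slot; keys must be >= 0

// Insert `key` using Hash(key) = key mod m and linear probing (d_i = i).
// Returns the slot used, or -1 if the table has no free slot.
int linear_probe_insert(std::vector<int>& table, int key) {
    const int m = static_cast<int>(table.size());
    if (m == 0) return -1;
    const int home = key % m;
    for (int i = 0; i < m; ++i) {          // i = 0 is the home address itself
        int slot = (home + i) % m;
        if (table[slot] == kEmptySlot) {
            table[slot] = key;
            return slot;
        }
    }
    return -1;
}
```

With m = 11, inserting 12 and then 23 (both hash to 1) puts 23 in slot 2. A later key 13, whose home address is 2, is then pushed to slot 3 even though it is not a synonym of the other two, which is the clustering effect described above.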

Quadratic probing

di = 1², -1², 2², -2², …, ±k²
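
The increment sequence 1², -1², 2², -2², … can be generated from the probe index i. The function names below are illustrative assumptions; the extra "+ m" step only keeps negative offsets inside the range 0 to m-1.

```cpp
// d_i for quadratic probing: 1, -1, 4, -4, 9, -9, ... for i = 1, 2, 3, ...
int quadratic_increment(int i) {
    int k = (i + 1) / 2;                     // 1, 1, 2, 2, 3, 3, ...
    return (i % 2 == 1) ? k * k : -(k * k);
}

// Address of the i-th probe for a key whose home address is `home`,
// folded into the range [0, m).
int quadratic_probe(int home, int i, int m) {
    return ((home + quadratic_increment(i)) % m + m) % m;
}
```

For a home address of 3 in a table of length 11, the first four probes are 4, 2, 7 and 10.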

Pseudo-random probing

Hi = (Hash(key) + di) mod m (1 ≤ i < m)
where m is the length of the hash table and
di is a pseudo-random number sequence

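Steps for building a hash table with open addressing

  1. Take the key of the data element and compute its hash function value (address). If the storage slot at that address is not yet occupied, store the element there; otherwise go to step 2 to resolve the collision
  2. Using the chosen collision-handling method, compute the next storage address for the key. If that address is also occupied, repeat step 2 until a usable address is found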
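
The two steps can be sketched with the collision-handling method passed in as a function that returns the increment di. The names `open_address_insert` and `kFreeSlot`, the use of key mod m as the hash function, and the restriction to non-negative integer keys are all assumptions made for this example.

```cpp
#include <functional>
#include <vector>

const int kFreeSlot = -1;  // assumed marker for an unoccupied slot; keys must be >= 0

// Step 1: try the home address key mod m.
// Step 2: on a collision, take d_i from `increment(i)` and try
// (home + d_i) mod m, repeating for i = 1, 2, ..., m - 1.
// Returns the slot used, or -1 if no free slot was reached.
int open_address_insert(std::vector<int>& table, int key,
                        const std::function<int(int)>& increment) {
    const int m = static_cast<int>(table.size());
    if (m == 0) return -1;
    const int home = key % m;
    if (table[home] == kFreeSlot) {                      // step 1
        table[home] = key;
        return home;
    }
    for (int i = 1; i < m; ++i) {                        // step 2
        int slot = ((home + increment(i)) % m + m) % m;  // stay inside [0, m)
        if (table[slot] == kFreeSlot) {
            table[slot] = key;
            return slot;
        }
    }
    return -1;
}
```

Passing a function that returns i gives linear probing; passing one that returns the quadratic or a pseudo-random sequence changes only step 2, which is why the build procedure is described independently of the probing method.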

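Storage structure of an open-addressing hash table

/* ------------- Storage structure of an open-addressing hash table ------------- */
// ElemType, KeyType, Status, NULLKEY, EQ, Hash and collision are assumed to be declared elsewhere

int hashsize[] = {997, ...};  // table capacities; the list is elided in the source
typedef struct{
    ElemType* elem;  // base address of the element array
    int count;  // current number of data elements
    int sizeindex;  // hashsize[sizeindex] is the current capacity
} HashTable;

#define SUCCESS 1
#define UNSUCCESS 0
#define DUPLICATE -1

Status SearchHash(HashTable H, KeyType K, int &p, int &c){
    // Search the open-addressing hash table H for the record whose key is K.
    // On success p is the record's position; otherwise p is the empty slot where it could be inserted.
    // c counts collisions and should be 0 on entry.
    p = Hash(K);  // compute the hash address
    while(H.elem[p].key != NULLKEY && !EQ(K, H.elem[p].key))
        collision(p, ++c);  // compute the next probe address p
    if(EQ(K, H.elem[p].key)) return SUCCESS;  // search succeeded; p is the position of the element
    else return UNSUCCESS;  // search failed
}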

2. Rehashing method

H2(key) is a second hash function, and its value should be relatively prime to m
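
The text does not show the probe formula, so the sketch below assumes the common double-hashing form Hi = (H1(key) + i * H2(key)) mod m. The choices H1(key) = key mod m and H2(key) = 1 + key mod (m - 1) are also assumptions; with m prime, H2 always lies between 1 and m - 1 and is therefore relatively prime to m.

```cpp
// Assumed probe formula: H_i = (H1(key) + i * H2(key)) mod m.
// Requires key >= 0 and m >= 2; m should be prime so that every value of
// H2 in [1, m - 1] is relatively prime to m.
int double_hash_probe(int key, int i, int m) {
    int h1 = key % m;
    int h2 = 1 + key % (m - 1);   // never 0, so the probe always moves
    return (h1 + i * h2) % m;
}
```

For key = 25 and m = 11, H1 = 3 and H2 = 6, so the probe sequence starts 3, 9, 4, 10. Because H2 and m share no common factor, the sequence reaches every slot before it repeats.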

3. Chaining (chain address method)

Basic idea

  • Records with the same hash address are linked into a singly linked list. With m hash addresses there are m singly linked lists, and an array stores the m head pointers, forming a dynamic structure

Advantages:

  • Keys that are not synonyms never collide, so there is no "clustering"
  • Node space is allocated dynamically on the linked lists, which suits tables whose length is not known in advance
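
A minimal sketch of the structure, assuming non-negative integer keys, H(key) = key mod m with m ≥ 1, and `std::forward_list` standing in for the hand-built singly linked lists. The class name `ChainedTable` is hypothetical.

```cpp
#include <cstddef>
#include <forward_list>
#include <vector>

// m singly linked lists, one per hash address, with H(key) = key mod m.
class ChainedTable {
public:
    explicit ChainedTable(std::size_t m) : heads_(m) {}

    // Insert `key` at the head of its chain; returns false if it is already stored.
    bool insert(int key) {
        if (contains(key)) return false;
        heads_[slot(key)].push_front(key);
        return true;
    }

    // Walk only the chain that `key` hashes to.
    bool contains(int key) const {
        for (int k : heads_[slot(key)])
            if (k == key) return true;
        return false;
    }

private:
    std::size_t slot(int key) const {
        return static_cast<std::size_t>(key) % heads_.size();
    }
    std::vector<std::forward_list<int>> heads_;  // array of m list heads
};
```

A search only ever compares against synonyms of the key, since keys with other hash addresses live in other lists.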

Hash table lookup

For a given value K, calculate the hash address i = H(K)

  • If r[i] = NULL, the search is unsuccessful
  • If r[i].key = K, the search is successful. Otherwise compute the next address Hi and repeat until r[Hi] = NULL (unsuccessful) or r[Hi].key = K (successful)

Case v01

Linear probing is used to resolve collisions

Case v02

Chaining is used to handle collisions

Analysis of hash table lookup

The search process shows that, in practice, the average search length (ASL) of a hash table lookup is not zero

Factors that determine the ASL of a hash table lookup

  • The hash function chosen
  • The collision-handling method chosen
  • How full the hash table is, measured by the load factor α = n/m (n is the number of records, m is the table length)

The larger α is, the more records the table holds and the fuller it is, so collisions become more likely and a search needs more comparisons
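
To make the dependence on α concrete, the sketch below computes the load factor together with two standard textbook approximations for the ASL of a successful search. These formulas are not taken from this article: they are the usual results for linear probing, ½(1 + 1/(1 - α)), and for chaining, 1 + α/2, and the function names are hypothetical.

```cpp
// Load factor alpha = n / m (n records in a table of length m, m > 0).
double load_factor(int n, int m) { return static_cast<double>(n) / m; }

// Approximate ASL of a successful search with linear probing, for alpha < 1.
double asl_linear_probing(double alpha) {
    return 0.5 * (1.0 + 1.0 / (1.0 - alpha));
}

// Approximate ASL of a successful search with chaining.
double asl_chaining(double alpha) {
    return 1.0 + alpha / 2.0;
}
```

At α = 0.5 these give 1.5 comparisons for linear probing and 1.25 for chaining; at α = 0.9 the linear probing estimate rises to 5.5 while chaining stays at 1.45. Both depend only on α, not on the number of records n.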

  1. Hash tables have very good average performance, better than some traditional lookup techniques
  2. Chaining is better than open addressing
  3. The division-remainder method is better than the other types of hash function

Hash table application example

Compilers mostly use hash tables to manage identifiers

  • Constructing the hash function

    • Convert each character of the identifier to a non-negative integer
    • Combine these integers into a single integer (for example, add the values of the first, middle and last characters, or add the values of all characters)
    • Reduce the result to the range 0 to M-1, for example with the modulo method Ki % M (M is a prime number)
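
The three steps can be sketched as follows, using the "add all characters" variant. The function name `identifier_hash` is hypothetical, and treating each character as its unsigned byte value is an assumption about how characters are converted to integers.

```cpp
#include <string>

// Hash an identifier: convert each character to a non-negative integer,
// add the values together, then reduce the sum to 0..M-1 with sum % M.
// M should be a prime number greater than 0.
unsigned identifier_hash(const std::string& name, unsigned M) {
    unsigned long sum = 0;
    for (unsigned char c : name) sum += c;   // character -> non-negative integer
    return static_cast<unsigned>(sum % M);
}
```

With M = 13, "abc" sums to 294 and hashes to 8. One limitation of a plain sum is that identifiers made of the same characters in a different order, such as "ab" and "ba", always collide.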
