Along with arrays, the hash table is one of the most common data structures. Go's map is built on top of a hash table and provides convenient key-value mapping.

Features

The value of an uninitialized map is nil. Adding an element to a nil map triggers a panic, a mistake novices easily make.
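A minimal example of this pitfall (reading a nil map is safe, writing panics):

package main

func main() {
    var m map[string]int // declared but never initialized: m is nil
    _ = m["a"]           // reading a nil map is safe: returns the zero value
    m["a"] = 1           // panic: assignment to entry in nil map

    m2 := make(map[string]int) // correct: initialize with make (or a literal) first
    m2["a"] = 1
    _ = m2
}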

Map operations are not atomic. When multiple goroutines operate on the same map concurrently, a read-write conflict may be detected; this triggers a panic and the program exits. If you need concurrent reads and writes, protect the map with a lock, or use sync.Map from the standard library's sync package.
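A sketch of the two common remedies, a mutex-guarded map and sync.Map:

package main

import (
    "fmt"
    "sync"
)

func main() {
    // Option 1: protect a plain map with a lock.
    var mu sync.Mutex
    counters := map[string]int{}
    var wg sync.WaitGroup
    for i := 0; i < 4; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            mu.Lock() // without the lock, concurrent writes may panic
            counters["requests"]++
            mu.Unlock()
        }()
    }
    wg.Wait()
    fmt.Println(counters["requests"]) // 4

    // Option 2: sync.Map, best suited to read-heavy workloads with stable keys.
    var sm sync.Map
    sm.Store("requests", 1)
    if v, ok := sm.Load("requests"); ok {
        fmt.Println(v.(int)) // 1
    }
}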

Implementation principle

Data structure

The underlying data structure of map is hmap, defined in runtime/map.go:

type hmap struct {
    count     int             // number of elements; len(map) returns this value directly
    flags     uint8
    B         uint8           // log2 of the length of the buckets array
    noverflow uint16          // approximate number of overflow buckets
    hash0     uint32          // hash seed; passed to the hash function to add randomness

    buckets    unsafe.Pointer // points to the buckets array of size 2^B; nil when the element count is 0
    oldbuckets unsafe.Pointer // holds the old buckets array during growth; half the size of buckets
    nevacuate  uintptr        // evacuation progress counter; buckets below this index have been migrated
    extra *mapextra           // optional fields
}

The buckets array stores the map's buckets (often translated as 桶), which are the real carriers of the map's key-value pairs. The bucket's data structure is defined as follows:

type bmap struct {
    tophash [bucketCnt]uint8 // stores the high 8 bits of each key's hash value
}

At runtime the bmap structure actually holds more than the tophash field. Because a hash table may store key-value pairs of different types, and Go (before 1.17) did not support generics, the memory occupied by the key-value pairs can only be deduced at compile time. The other fields of bmap are therefore accessed at runtime by computing memory offsets, so they do not appear in the definition. At runtime, bmap effectively looks like this:

type bmap struct {
    topbits  [8]uint8     // high 8 bits of each key's hash (tophash)
    keys     [8]keytype   // all 8 keys, stored contiguously
    values   [8]valuetype // all 8 values, stored contiguously
    pad      uintptr
    overflow uintptr      // pointer to the next overflow bucket
}

The overall map structure diagram is roughly as follows:

[figure: overall map structure]

The internal composition of bmap is similar to the following figure:

[figure: internal layout of a bmap]

HOB Hash refers to the top hash. Note that keys and values are each stored together, not interleaved as key/value/key/value/.... The advantage is that padding can be omitted in some cases, saving memory: for example, in map[int64]int8 an interleaved layout would need 7 padding bytes after every value, while the grouped layout needs padding only at the end.

Each bucket is designed to hold at most 8 key-value pairs. If a ninth key falls into the current bucket, another bucket is created and linked through the overflow pointer.

Related operations

Lookup

Hashing the key produces a hash value, 64 bits in total on a 64-bit platform. To decide which bucket the key falls into, only the lowest B bits are used. Remember the B mentioned earlier? It is the logarithm of the length of the buckets array, i.e., there are 2^B buckets.

For example, suppose hashing a key produces the following result:

10010111 | 000011110110110010001111001010100010010110010101010 | 00110

The lowest 5 bits, 00110, locate bucket number 6. This is effectively a modulo operation, but modulo is expensive, so the implementation uses a bit mask instead. The high 8 bits of the hash are then used to find the key's slot within the bucket, as shown below:

[figure: locating the bucket via the low B bits and the slot via the tophash]
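A simplified sketch of the two selections (the helper names lowBits and topHash are illustrative, not the runtime's; the real code in runtime/map.go also reserves the smallest tophash values as internal markers):

package main

import "fmt"

// lowBits picks the bucket index: hash % 2^B computed as a bit mask.
func lowBits(hash uint64, B uint8) uint64 {
    return hash & (1<<B - 1)
}

// topHash extracts the high 8 bits used to locate the key inside the bucket.
func topHash(hash uint64) uint8 {
    return uint8(hash >> 56)
}

func main() {
    var hash uint64 = 0x97<<56 | 0x06      // toy value: high byte 10010111, low 5 bits 00110
    fmt.Println(lowBits(hash, 5))          // 6 -> bucket number 6
    fmt.Printf("%08b\n", topHash(hash))    // 10010111 -> the tophash byte
}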

Because hash collisions are possible, after locating the key's slot in the bucket you must fetch the stored key and compare it with the key being queried. If they are not equal, the probing continues. If the key is not found in the current bucket and the overflow pointer is not empty, the search continues in the overflow buckets, until the key is found or every slot has been checked. If the key is not found, the lookup does not return nil; it returns the zero value of the value type.

Note: if the map is in the middle of growing (migration in progress), the lookup checks oldbuckets first.
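Because a miss returns the zero value rather than nil, the comma-ok form is needed to tell an absent key from a stored zero:

package main

import "fmt"

func main() {
    m := map[string]int{"a": 0}

    v := m["b"] // key absent: v is the zero value 0
    fmt.Println(v)

    v, ok := m["a"] // key present with a stored zero: v == 0, ok == true
    fmt.Println(v, ok)

    _, ok = m["b"] // key absent: ok == false
    fmt.Println(ok)
}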

Assignment

The assignment operation eventually calls the mapassign function. Its initial flow is similar to the lookup described above: the key's location in the corresponding bucket is found first. Two pointers are prepared: inserti points to the slot for the key's hash in the tophash array, and insertk points to the cell where the key will be placed. During the loop, inserti and insertk are set to the first free cell found; if the key is not found anywhere, it is inserted at that position. If the current bucket and its overflow chain are full, newoverflow is called to create a new bucket (or to reuse one that hmap preallocated); the new bucket is appended to the end of the overflow chain, and hmap's noverflow counter is incremented.

If the key already exists in the map, the memory address of its value slot is returned directly; if not, a slot for the new key-value pair is allocated, the key is copied into place with typedmemmove, and the address of the corresponding value slot is returned. Note that mapassign does not copy the value into the bucket: the runtime function only returns the slot's memory address, and the actual store of the value is inserted by the compiler.
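Conceptually, a statement like m[k] = v is lowered by the compiler into something like the following (pseudocode for illustration only, not the exact compiler output):

// m[k] = v is compiled roughly into:
//
//     p := runtime.mapassign(maptype, h, unsafe.Pointer(&k)) // returns the value slot's address
//     *(*V)(p) = v                                           // this store is emitted by the compiler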

Expansion

The description of the write path above omitted the expansion operation. As elements accumulate in the map, performance gradually degrades, so more buckets and more memory are needed to keep the map fast. When a new key is inserted, a condition check is performed, and expansion is triggered if it passes:

// in mapassign: grow when not already growing and either trigger condition holds
if !h.growing() && (overLoadFactor(h.count+1, h.B) || tooManyOverflowBuckets(h.noverflow, h.B)) {
    hashGrow(t, h)
    goto again
}

// the load factor threshold is loadFactorNum/loadFactorDen = 13/2 = 6.5
func overLoadFactor(count int, B uint8) bool {
    return count > bucketCnt && uintptr(count) > loadFactorNum*(bucketShift(B)/loadFactorDen)
}

func tooManyOverflowBuckets(noverflow uint16, B uint8) bool {
    // noverflow is approximate; cap the comparison at 2^15
    if B > 15 {
        B = 15
    }
    return noverflow >= uint16(1)<<(B&15)
}

The source code shows that expansion is triggered when either of two conditions holds:

  1. The load factor exceeds the threshold, defined in the source as loadFactorNum/loadFactorDen = 13/2 = 6.5. For example, with B = 5 (32 buckets), the check count > 13*(32/2) fires once the map holds more than 208 elements, an average of 6.5 per bucket;
  2. There are too many overflow buckets: when B < 15, i.e., the total number of buckets is less than 2^15, expansion triggers once the number of overflow buckets exceeds 2^B; when B >= 15, i.e., the total number of buckets is at least 2^15, it triggers once the number of overflow buckets exceeds 2^15.

Condition one means the buckets are overloaded and their number must grow; this is called incremental expansion. Condition two means the key-value pairs are spread too sparsely across overflow buckets, and the bucket count does not need to grow; this is called same-size (equal) expansion. In either case a new bucket array is allocated and the key-value pairs in the old array are migrated to it, but only incremental expansion doubles the number of buckets. Let's look at the core code of the expansion entry point, the hashGrow function:

func hashGrow(t *maptype, h *hmap) {
    // bigger is 1 for incremental (doubling) growth, 0 for same-size growth
    bigger := uint8(1)
    if !overLoadFactor(h.count+1, h.B) {
        bigger = 0
        h.flags |= sameSizeGrow
    }
    oldbuckets := h.buckets
    newbuckets, nextOverflow := makeBucketArray(t, h.B+bigger, nil)

    // (computation of flags from h.flags elided)
    h.B += bigger
    h.flags = flags
    h.oldbuckets = oldbuckets
    h.buckets = newbuckets
    h.nevacuate = 0
    h.noverflow = 0

    ...
}

We can see that hashGrow only allocates the new buckets and hangs the old ones on the oldbuckets field; it does not actually migrate any data. This is because migrating a large number of key-value pairs at once would hurt performance badly. The map therefore grows "incrementally": existing keys are not migrated all at once, and at most 2 buckets are migrated per operation. When a key is inserted, modified, or deleted, the runtime first checks whether the old buckets have finished migrating, concretely by checking whether oldbuckets is nil. If it is not nil, a migration step is performed by calling the growWork function. The code of growWork is as follows:

func growWork(t *maptype, h *hmap, bucket uintptr) {
    // first migrate the old bucket that the current operation is about to use
    evacuate(t, h, bucket&h.oldbucketmask())

    // then migrate one more bucket to keep the overall migration moving
    if h.growing() {
        evacuate(t, h, h.nevacuate)
    }
}

The evacuate function performs the actual migration, redistributing the elements of the given old bucket. The general logic: it creates evacDst structures to hold the destination context, one for same-size expansion and two for incremental expansion, each pointing to a new bucket. Under same-size expansion, every key is migrated to the new bucket with the same sequence number as before; under incremental expansion, keys are split between two buckets according to their hash value and the new mask. Finally, the map's nevacuate counter is advanced, and once all the old buckets have been drained, the map's oldbuckets and oldoverflow are cleared.
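An illustrative sketch of the split rule for incremental expansion (not the runtime's code): after growing from 2^B to 2^(B+1) buckets, one extra hash bit becomes significant, and that bit decides which of the two destinations a key goes to.

package main

import "fmt"

// newIndex returns the bucket a key lands in after doubling, given its hash,
// the old B, and its old bucket index. The key either keeps the same sequence
// number ("x" half) or moves up by the old bucket count ("y" half).
func newIndex(hash uint64, oldB uint8, oldIndex uint64) uint64 {
    newbit := uint64(1) << oldB // the bit that becomes significant after doubling
    if hash&newbit != 0 {
        return oldIndex + newbit
    }
    return oldIndex
}

func main() {
    // Two keys that shared old bucket 6 (B = 5) split across buckets 6 and 38.
    fmt.Println(newIndex(0x06, 5, 6)) // 6: bit 5 is 0, stays in place
    fmt.Println(newIndex(0x26, 5, 6)) // 38: bit 5 is 1, moves to 6 + 32
}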

