From the public account: Gopher

In a performance analysis, the CompareVersion function of an online service was found to consume a large amount of CPU time, as shown below.

Among its callees, the strings.Split function takes the longest. This function should be very familiar to every Gopher, and CompareVersion is built on top of strings.Split. Let's take a look at the implementation of CompareVersion.

// zeroRune reports whether the string consists only of '0' and '.' runes
func zeroRune(s []rune) bool {
    for _, r := range s {
        if r != '0' && r != '.' {
            return false
        }
    }
    return true
}
// CompareVersion compares two app version strings
// return 0 means ver1 == ver2
// return 1 means ver1 > ver2
// return -1 means ver1 < ver2
func CompareVersion(ver1, ver2 string) int {
    // fast path
    if ver1 == ver2 {
        return 0
    }
    // slow path
    vers1 := strings.Split(ver1, ".")
    vers2 := strings.Split(ver2, ".")
    var (
        v1l, v2l = len(vers1), len(vers2)
        i        = 0
    )
    for ; i < v1l && i < v2l; i++ {
        a, e1 := strconv.Atoi(vers1[i])
        b, e2 := strconv.Atoi(vers2[i])
        res := 0
        // fall back to Go's default string comparison when either part is non-numeric
        if e1 != nil || e2 != nil {
            res = strings.Compare(vers1[i], vers2[i])
        } else {
            res = a - b
        }
        // return according to the comparison result; res == 0 means this part is equal
        if res > 0 {
            return 1
        } else if res < 0 {
            return -1
        }
    }
    // whichever version still has non-zero parts remaining is greater
    if i < v1l {
        for ; i < v1l; i++ {
            if !zeroRune([]rune(vers1[i])) {
                return 1
            }
        }
    } else if i < v2l {
        for ; i < v2l; i++ {
            if !zeroRune([]rune(vers2[i])) {
                return -1
            }
        }
    }
    return 0
}
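To make the comparison semantics concrete, here is a small self-contained harness exercising CompareVersion. The function and zeroRune are reproduced from above; the sample version strings are made up for illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// zeroRune reports whether the string consists only of '0' and '.' runes.
func zeroRune(s []rune) bool {
	for _, r := range s {
		if r != '0' && r != '.' {
			return false
		}
	}
	return true
}

// CompareVersion returns 0 if ver1 == ver2, 1 if ver1 > ver2, -1 if ver1 < ver2.
func CompareVersion(ver1, ver2 string) int {
	if ver1 == ver2 {
		return 0
	}
	vers1 := strings.Split(ver1, ".")
	vers2 := strings.Split(ver2, ".")
	v1l, v2l := len(vers1), len(vers2)
	i := 0
	for ; i < v1l && i < v2l; i++ {
		a, e1 := strconv.Atoi(vers1[i])
		b, e2 := strconv.Atoi(vers2[i])
		res := 0
		// non-numeric parts fall back to string comparison
		if e1 != nil || e2 != nil {
			res = strings.Compare(vers1[i], vers2[i])
		} else {
			res = a - b
		}
		if res > 0 {
			return 1
		} else if res < 0 {
			return -1
		}
	}
	// any remaining non-zero parts make that side greater
	if i < v1l {
		for ; i < v1l; i++ {
			if !zeroRune([]rune(vers1[i])) {
				return 1
			}
		}
	} else if i < v2l {
		for ; i < v2l; i++ {
			if !zeroRune([]rune(vers2[i])) {
				return -1
			}
		}
	}
	return 0
}

func main() {
	fmt.Println(CompareVersion("7.0.09", "7.0.9")) // 0: "09" and "9" parse to the same integer
	fmt.Println(CompareVersion("1.2.3", "1.2.10")) // -1: 3 < 10 numerically
	fmt.Println(CompareVersion("1.2.0.0", "1.2"))  // 0: the extra parts are all zeros
}
```

Note that "7.0.09" compares equal to "7.0.9" and trailing ".0" parts are ignored, which matters for the caching variants later.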

Try to optimize the strings.Split function

The logic of CompareVersion is clear and simple, and the flame graph shows that the time is spent mainly inside strings.Split, so Lao Xu's first goal was to try to optimize the strings.Split call.

The first methods Lao Xu reached for were, of course, Baidu and Google. He eventually found the strings.FieldsFunc function in an article which claimed that strings.FieldsFunc splits strings much faster than strings.Split. Can strings.FieldsFunc replace strings.Split here? See the benchmark results below.

func BenchmarkSplit(b *testing.B) {
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        strings.Split("7.0.09.000", ".")
        strings.Split("7.0.09", ".")
        strings.Split("9.01", ".")
    }
}

func BenchmarkFieldsFunc(b *testing.B) {
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        strings.FieldsFunc("7.0.09.000", func(r rune) bool { return r == '.' })
        strings.FieldsFunc("7.0.09", func(r rune) bool { return r == '.' })
        strings.FieldsFunc("9.01", func(r rune) bool { return r == '.' })
    }
}

The results of the above benchmark test run on Lao Xu's machine are as follows:

cpu: Intel(R) Core(TM) i7-7567U CPU @ 3.50GHz
BenchmarkSplit-4                 3718506               303.2 ns/op           144 B/op          3 allocs/op
BenchmarkSplit-4                 4144340               287.6 ns/op           144 B/op          3 allocs/op
BenchmarkSplit-4                 3859644               304.5 ns/op           144 B/op          3 allocs/op
BenchmarkSplit-4                 3729241               287.9 ns/op           144 B/op          3 allocs/op
BenchmarkFieldsFunc-4            3459463               336.5 ns/op           144 B/op          3 allocs/op
BenchmarkFieldsFunc-4            3604345               335.5 ns/op           144 B/op          3 allocs/op
BenchmarkFieldsFunc-4            3411564               313.9 ns/op           144 B/op          3 allocs/op
BenchmarkFieldsFunc-4            3661268               309.6 ns/op           144 B/op          3 allocs/op

The output shows that strings.FieldsFunc is not as fast as advertised; it does not even beat strings.Split. With this road blocked, Lao Xu had to find another way.

Trying to introduce a cache

Consider even the most aggressively paced company: at one release per week, every week of the year, it would still take about 19 years (1000 / (365 / 7)) to ship 1000 versions. Given how small the set of distinct versions is, caching the parsed versions before comparing them should improve execution speed qualitatively.

Implementing an expiring cache by hand

When introducing a cache, the first thing Lao Xu thought of was an expiring cache, and to keep the dependency footprint as light as possible, implementing the expiring cache himself seemed an obviously good solution.

1. Define a structure containing the data and its expiration time

type cacheItem struct {
    data      interface{}
    expiredAt int64
}

// IsExpired reports whether the cached item has expired
func (c *cacheItem) IsExpired() bool {
    return c.expiredAt > 0 && time.Now().Unix() >= c.expiredAt
}

2. Use sync.Map as a concurrent safe cache

var (
    cacheMap sync.Map
)

// Set adds an entry to the cache
func Set(key string, val interface{}, expiredAt int64) {
    cv := &cacheItem{val, expiredAt}
    cacheMap.Store(key, cv)
}

// Get reads a value from the cache
func Get(key string) (interface{}, bool) {
    // no cache entry
    cv, isExists := cacheMap.Load(key)
    if !isExists {
        return nil, false
    }
    // unexpected cached type
    citem, ok := cv.(*cacheItem)
    if !ok {
        return nil, false
    }
    // lazily delete the entry on read once it has expired
    if citem.IsExpired() {
        cacheMap.Delete(key)
        return nil, false
    }
    // finally return the result
    return citem.data, true
}

3. Define a structure that stores each part of a version split on .

// a complete version is cached as a slice of these values
type cmVal struct {
    iv int
    sv string
    // whether sv can be parsed as an integer
    canInt bool
}

4. Convert an app version into slices for caching

func strs2cmVs(strs []string) []*cmVal {
    cmvs := make([]*cmVal, 0, len(strs))
    for _, v := range strs {
        it, e := strconv.Atoi(v)
        // keep every component, whether or not the conversion succeeded
        cmvs = append(cmvs, &cmVal{it, v, e == nil})
    }
    return cmvs
}
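A quick check of what strs2cmVs produces (the "-beta" suffix is an invented example of a non-numeric component):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// cmVal caches one dot-separated component of a version string.
type cmVal struct {
	iv     int    // parsed integer value, valid only when canInt is true
	sv     string // original string component
	canInt bool   // whether sv parsed as an integer
}

func strs2cmVs(strs []string) []*cmVal {
	cmvs := make([]*cmVal, 0, len(strs))
	for _, v := range strs {
		it, e := strconv.Atoi(v)
		// keep every component, whether or not the conversion succeeded
		cmvs = append(cmvs, &cmVal{it, v, e == nil})
	}
	return cmvs
}

func main() {
	// "09-beta" fails Atoi, so its canInt is false and iv stays 0
	for _, cv := range strs2cmVs(strings.Split("7.0.09-beta", ".")) {
		fmt.Printf("%q -> iv=%d canInt=%v\n", cv.sv, cv.iv, cv.canInt)
	}
}
```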

5. Compare versions using the cached data

func CompareVersionWithCache1(ver1, ver2 string) int {
    // fast path
    if ver1 == ver2 {
        return 0
    }
    // slow path
    var (
        cmv1, cmv2             []*cmVal
        cmv1Exists, cmv2Exists bool
        expire                 int64 = 200 * 60
    )
    // read cache 1
    cmv, cmvExists := Get(ver1)
    if cmvExists {
        cmv1, cmv1Exists = cmv.([]*cmVal)
    }
    if !cmv1Exists {
        // set val and cache
        cmv1 = strs2cmVs(strings.Split(ver1, "."))
        Set(ver1, cmv1, time.Now().Unix()+expire)
    }
    // read cache 2
    cmv, cmvExists = Get(ver2)
    if cmvExists {
        cmv2, cmv2Exists = cmv.([]*cmVal)
    }
    if !cmv2Exists {
        // set val and cache
        cmv2 = strs2cmVs(strings.Split(ver2, "."))
        Set(ver2, cmv2, time.Now().Unix()+expire)
    }
    // compare ver str
    var (
        v1l, v2l = len(cmv1), len(cmv2)
        i        = 0
    )
    for ; i < len(cmv1) && i < len(cmv2); i++ {
        res := 0
        // can use int compare
        if cmv1[i].canInt && cmv2[i].canInt {
            res = cmv1[i].iv - cmv2[i].iv
        } else {
            res = strings.Compare(cmv1[i].sv, cmv2[i].sv)
        }
        if res > 0 {
            return 1
        } else if res < 0 {
            return -1
        }
    }
    if i < v1l {
        for ; i < v1l; i++ {
            if cmv1[i].canInt && cmv1[i].iv != 0 {
                return 1
            }
            if !zeroRune([]rune(cmv1[i].sv)) {
                return 1
            }
        }
    } else if i < v2l {
        for ; i < v2l; i++ {
            if cmv2[i].canInt && cmv2[i].iv != 0 {
                return -1
            }
            if !zeroRune([]rune(cmv2[i].sv)) {
                return -1
            }
        }
    }
    return 0
}

The CompareVersionWithCache1 function proceeds as follows:

  • Return immediately if the two version strings are equal
  • Read the cached data for each version; if a version is not cached yet, build its cache entry and store it
  • Compare the two []*cmVal slices and return the result

Finally, performance verification. Below is the benchmark comparison between CompareVersionWithCache1 and the original CompareVersion.

cpu: Intel(R) Core(TM) i7-7567U CPU @ 3.50GHz
BenchmarkCompareVersion-4                  1642657           767.6 ns/op         304 B/op           6 allocs/op
BenchmarkCompareVersionWithCache1-4        1296520           844.9 ns/op           0 B/op           0 allocs/op

Analyzing these results, the only gain from the cache is the elimination of memory allocations. This result left Lao Xu full of doubts. After profiling with pprof, he finally found the reason the performance did not improve. Below is the flame graph of BenchmarkCompareVersionWithCache1 during the benchmark.

Because the number of app versions is small, expired entries are removed lazily: each read checks whether the entry has expired. According to the flame graph, this expiry check is exactly where the biggest cost lies, since every check calls time.Now().Unix() to obtain the current timestamp. In other words, this single time.Now() call is what sank the optimization.

Introducing an LRU cache

Since the number of versions is small and the commonly used versions should stay cached as close to permanently as possible, an LRU cache was introduced for the next round of optimization attempts.

1. Introduce an open source LRU cache; the library used is github.com/hashicorp/golang-lru

2. Based on CompareVersionWithCache1, replace the cache reads and writes with the LRU cache

Finally, perform performance verification. The following is the benchmark comparison between the CompareVersionWithCache2 function and the CompareVersion function.

cpu: Intel(R) Core(TM) i7-7567U CPU @ 3.50GHz
BenchmarkCompareVersion-4                  1583202           841.7 ns/op         304 B/op           6 allocs/op
BenchmarkCompareVersionWithCache2-4        1671758           633.9 ns/op          96 B/op           6 allocs/op

Now this result finally looks like something, but the improvement is still modest and there is room to do better.

Implementing an LRU cache by hand

Choosing an LRU cache clearly works. On that basis, Lao Xu decided to implement a minimalist LRU cache himself.

1. Define a cache node structure

type lruCacheItem struct {
    // doubly linked list pointers
    prev, next *lruCacheItem
    // cached data
    data       interface{}
    // key under which the data is cached
    key        string
}

2. Define a structure that operates the LRU cache

type lruc struct {
    // head and tail sentinels of the linked list
    head, tail *lruCacheItem
    // a map from key to list node, so reads are O(1)
    lruMap     map[string]*lruCacheItem
    rw         sync.RWMutex
    size       int64
}

func NewLRU(size int64) *lruc {
    if size <= 0 {
        size = 100
    }
    lru := &lruc{
        head:   new(lruCacheItem),
        tail:   new(lruCacheItem),
        lruMap: make(map[string]*lruCacheItem),
        size:   size,
    }
    lru.head.next = lru.tail
    lru.tail.prev = lru.head
    return lru
}

3. Set method of LRU cache

func (lru *lruc) Set(key string, v interface{}) {
    // fast path: reading the map without any lock would be a data race
    lru.rw.RLock()
    _, exist := lru.lruMap[key]
    lru.rw.RUnlock()
    if exist {
        return
    }
    lru.rw.Lock()
    // double check under the write lock
    if _, exist := lru.lruMap[key]; !exist {
        node := &lruCacheItem{
            data: v,
            prev: lru.head,
            next: lru.head.next,
            key:  key,
        }
        lru.lruMap[key] = node
        lru.head.next = node
        node.next.prev = node
    }
    if len(lru.lruMap) > int(lru.size) {
        // evict the least recently used node, just before the tail sentinel
        prev := lru.tail.prev
        prev.prev.next = lru.tail
        lru.tail.prev = prev.prev
        delete(lru.lruMap, prev.key)
    }
    lru.rw.Unlock()
}

4. Get method of LRU cache

func (lru *lruc) Get(key string) (interface{}, bool) {
    lru.rw.RLock()
    v, ok := lru.lruMap[key]
    lru.rw.RUnlock()
    if ok {
        // move to head.next
        lru.rw.Lock()
        v.prev.next = v.next
        v.next.prev = v.prev

        v.prev = lru.head
        v.next = lru.head.next
        lru.head.next = v
        // keep the backward links consistent, or eviction breaks
        v.next.prev = v
        lru.rw.Unlock()
        return v.data, true
    }
    return nil, false
}
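As a cross-check, here is a self-contained variant of the hand-rolled LRU cache with a small eviction demo. For brevity it takes the write lock directly instead of the RLock fast paths above, and the keys and cache size are invented for illustration; the promotion step keeps both directions of the list consistent.

```go
package main

import (
	"fmt"
	"sync"
)

type lruCacheItem struct {
	prev, next *lruCacheItem // doubly linked list pointers
	data       interface{}   // cached data
	key        string        // key under which the data is cached
}

type lruc struct {
	head, tail *lruCacheItem            // list sentinels
	lruMap     map[string]*lruCacheItem // key -> node, for O(1) lookup
	rw         sync.RWMutex
	size       int64
}

func NewLRU(size int64) *lruc {
	if size <= 0 {
		size = 100
	}
	lru := &lruc{
		head:   new(lruCacheItem),
		tail:   new(lruCacheItem),
		lruMap: make(map[string]*lruCacheItem),
		size:   size,
	}
	lru.head.next = lru.tail
	lru.tail.prev = lru.head
	return lru
}

func (lru *lruc) Set(key string, v interface{}) {
	lru.rw.Lock()
	defer lru.rw.Unlock()
	if _, exist := lru.lruMap[key]; exist {
		return
	}
	node := &lruCacheItem{data: v, key: key, prev: lru.head, next: lru.head.next}
	lru.lruMap[key] = node
	lru.head.next = node
	node.next.prev = node
	if len(lru.lruMap) > int(lru.size) {
		// evict the least recently used node, just before the tail sentinel
		victim := lru.tail.prev
		victim.prev.next = lru.tail
		lru.tail.prev = victim.prev
		delete(lru.lruMap, victim.key)
	}
}

func (lru *lruc) Get(key string) (interface{}, bool) {
	lru.rw.Lock()
	defer lru.rw.Unlock()
	v, ok := lru.lruMap[key]
	if !ok {
		return nil, false
	}
	// unlink, then relink right after the head sentinel
	v.prev.next = v.next
	v.next.prev = v.prev
	v.prev = lru.head
	v.next = lru.head.next
	lru.head.next = v
	v.next.prev = v
	return v.data, true
}

func main() {
	lru := NewLRU(2)
	lru.Set("a", 1)
	lru.Set("b", 2)
	lru.Get("a")    // touch "a" so "b" becomes the eviction candidate
	lru.Set("c", 3) // over capacity: evicts "b"
	_, okA := lru.Get("a")
	_, okB := lru.Get("b")
	fmt.Println(okA, okB) // true false
}
```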

5. Based on the CompareVersionWithCache1 function, replace the read-write cache with a self-implemented LRU cache

Finally, perform performance verification. The following is the benchmark comparison between the CompareVersionWithCache3 function and the CompareVersion function:

cpu: Intel(R) Core(TM) i7-7567U CPU @ 3.50GHz
BenchmarkCompareVersion-4                  1575007           763.1 ns/op         304 B/op           6 allocs/op
BenchmarkCompareVersionWithCache3-4        3285632           317.6 ns/op           0 B/op           0 allocs/op

After switching to the hand-rolled LRU cache, performance more than doubled. At this point Lao Xu was almost ready to show off at the office, but a voice in his head kept asking whether the cache could be read without taking a lock.

Reducing LRU cache lock contention

A truly lock-free approach did not present itself, but two ways to reduce lock contention did:

  • It is not necessary to move a node to the head of the list on every read; only when the number of cached entries approaches the size limit do newly read nodes need to be promoted
  • In an LRU cache, the more frequently a key is accessed, the closer its node sits to the head of the list. Based on this property, each read can roll a random number to decide whether to promote the node: hot keys get many chances to pass the check, so most reads avoid the write lock entirely

The implementation after adding random numbers is as follows:

func (lru *lruc) Get(key string) (interface{}, bool) {
    lru.rw.RLock()
    v, ok := lru.lruMap[key]
    lru.rw.RUnlock()
    if ok {
        // promote on roughly 1 in 100 reads; 100 is an arbitrary choice
        if rand.Int()%100 == 1 {
            lru.rw.Lock()
            v.prev.next = v.next
            v.next.prev = v.prev

            v.prev = lru.head
            v.next = lru.head.next
            lru.head.next = v
            // keep the backward links consistent, or eviction breaks
            v.next.prev = v
            lru.rw.Unlock()
        }
        return v.data, true
    }
    return nil, false
}

The benchmark comparison between CompareVersionWithCache3 and CompareVersion after adding the random check is as follows:

cpu: Intel(R) Core(TM) i7-7567U CPU @ 3.50GHz
BenchmarkCompareVersion-4                  1617837           761.5 ns/op         304 B/op           6 allocs/op
BenchmarkCompareVersionWithCache3-4        4817722           251.3 ns/op           0 B/op           0 allocs/op

With the random check, CompareVersionWithCache3 improves by roughly another 20%. The optimization is not over yet: when the number of cached entries is far below the size limit, there is no need to move nodes to the head of the list at all.

func (lru *lruc) Get(key string) (interface{}, bool) {
    lru.rw.RLock()
    v, ok := lru.lruMap[key]
    lru.rw.RUnlock()

    if ok {
        // promote only when near capacity, and only for a sample of reads
        if len(lru.lruMap) > int(lru.size)-1 && rand.Int()%100 == 1 {
            lru.rw.Lock()
            v.prev.next = v.next
            v.next.prev = v.prev

            v.prev = lru.head
            v.next = lru.head.next
            lru.head.next = v
            // keep the backward links consistent, or eviction breaks
            v.next.prev = v
            lru.rw.Unlock()
        }
        return v.data, true
    }
    return nil, false
}

After introducing the above optimization, the benchmark comparison is as follows:

cpu: Intel(R) Core(TM) i7-7567U CPU @ 3.50GHz
BenchmarkCompareVersion-4                1633576               793.2 ns/op           304 B/op          6 allocs/op
BenchmarkCompareVersion-4                1619822               882.7 ns/op           304 B/op          6 allocs/op
BenchmarkCompareVersion-4                1639792               737.2 ns/op           304 B/op          6 allocs/op
BenchmarkCompareVersion-4                1630004               758.3 ns/op           304 B/op          6 allocs/op
BenchmarkCompareVersionWithCache3-4      7538025               155.9 ns/op             0 B/op          0 allocs/op
BenchmarkCompareVersionWithCache3-4      7514742               150.1 ns/op             0 B/op          0 allocs/op
BenchmarkCompareVersionWithCache3-4      8357704               162.9 ns/op             0 B/op          0 allocs/op
BenchmarkCompareVersionWithCache3-4      7748578               148.0 ns/op             0 B/op          0 allocs/op

So far, under ideal conditions (sufficient cache space), the final version achieves about 4 times the performance of the original.

Some people are simply born for this

Lao Xu was all set to go show off at the office, but he never expected that a colleague had already developed a more reasonable and robust version comparison algorithm, which left him rather ashamed.

The algorithm idea is as follows:

  • Do not split the version with strings.Split. Instead, compare the two strings character by character from left to right until a differing character is found, recording the indexes i and j
  • From i and j, advance through each string until the first . (or the end of the string), and compare the two substrings as integers
  • If the versions are still equal after the first two steps, whichever string has characters remaining is greater
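The colleague's code is not shown here (the full implementation is in the repository linked at the end), but the three steps can be sketched as follows. This is a reconstruction from the description, not the actual implementation; in particular, the fallback to string comparison for non-numeric chunks is my assumption.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// compareVersionNoSplit is a sketch of the algorithm described above.
// Since the two strings share a common prefix, a single index i serves
// as both i and j from the description.
func compareVersionNoSplit(ver1, ver2 string) int {
	// Step 1: skip the common prefix character by character.
	i := 0
	for i < len(ver1) && i < len(ver2) && ver1[i] == ver2[i] {
		i++
	}
	if i == len(ver1) && i == len(ver2) {
		return 0
	}
	// Step 2: extend both sides from the divergence point to the next '.'
	// (or end of string) and compare the two chunks as integers.
	end1, end2 := i, i
	for end1 < len(ver1) && ver1[end1] != '.' {
		end1++
	}
	for end2 < len(ver2) && ver2[end2] != '.' {
		end2++
	}
	c1, c2 := ver1[i:end1], ver2[i:end2]
	n1, e1 := strconv.Atoi(c1)
	n2, e2 := strconv.Atoi(c2)
	res := 0
	if e1 != nil || e2 != nil {
		// non-numeric chunk: fall back to string comparison (assumption)
		res = strings.Compare(c1, c2)
	} else {
		res = n1 - n2
	}
	if res > 0 {
		return 1
	}
	if res < 0 {
		return -1
	}
	// Step 3: the chunks are equal, so whoever still has characters wins.
	if end1 < len(ver1) {
		return 1
	}
	if end2 < len(ver2) {
		return -1
	}
	return 0
}

func main() {
	fmt.Println(compareVersionNoSplit("1.9.0", "1.10.0")) // -1: 9 < 10
	fmt.Println(compareVersionNoSplit("7.0.09", "7.0.9")) // 0: chunks "09" and "9" parse equal
	fmt.Println(compareVersionNoSplit("2.0", "2.0.1"))    // -1: ver2 has characters remaining
}
```

One behavioral difference from the earlier CompareVersion: under the "whoever has remaining characters is greater" rule, no zero-suffix check is applied, so trailing components are handled as literally described in the steps above.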

The three algorithm benchmarks are as follows:

cpu: Intel(R) Core(TM) i7-7567U CPU @ 3.50GHz
BenchmarkCompareVersion-4                1803190               674.8 ns/op           304 B/op          6 allocs/op
BenchmarkCompareVersion-4                1890308               630.9 ns/op           304 B/op          6 allocs/op
BenchmarkCompareVersion-4                1855741               631.8 ns/op           304 B/op          6 allocs/op
BenchmarkCompareVersion-4                1850410               629.4 ns/op           304 B/op          6 allocs/op
BenchmarkCompareVersionWithCache3-4      8877466               132.2 ns/op             0 B/op          0 allocs/op
BenchmarkCompareVersionWithCache3-4      8489661               132.6 ns/op             0 B/op          0 allocs/op
BenchmarkCompareVersionWithCache3-4      8358210               132.6 ns/op             0 B/op          0 allocs/op
BenchmarkCompareVersionWithCache3-4      8456853               131.9 ns/op             0 B/op          0 allocs/op
BenchmarkCompareVersionNoSplit-4         6309705               178.9 ns/op             8 B/op          2 allocs/op
BenchmarkCompareVersionNoSplit-4         6228823               181.2 ns/op             8 B/op          2 allocs/op
BenchmarkCompareVersionNoSplit-4         6370544               177.8 ns/op             8 B/op          2 allocs/op
BenchmarkCompareVersionNoSplit-4         6351043               180.0 ns/op             8 B/op          2 allocs/op

The CompareVersionNoSplit function needs no cache, and it does not degrade the way CompareVersionWithCache3 does as the number of cached entries approaches the limit. It is just about the most ideal version comparison scheme I have found so far.

Lao Xu won't hide behind sour excuses like "the player sees less than the bystander"; he has to admit that some people are simply born for this. Meeting such a colleague is a stroke of luck: as long as he has a bite of food, Lao Xu can trail behind and scrounge a sip of soup. For the complete implementation of the version comparison algorithm mentioned at the end of this article, see the following GitHub repository:

https://github.com/Isites/ares/tree/main/strs

Finally, I sincerely hope that this article can be helpful to all readers.

Note:

At the time of writing, the Go version used by the author is go1.16.6

Full example used in the article: https://github.com/Isites/go-coder/tree/master/strs

