From the WeChat public account: Gopher
During a performance analysis, the online service's CompareVersion function was found to consume a long stretch of CPU time (the flame graph is omitted here). Most of that time went to the strings.Split function, which should be very familiar to any Gopher, and CompareVersion builds its version comparison on top of strings.Split. Let's take a look at the implementation of CompareVersion.
```go
// zeroRune reports whether s consists only of '0' and '.' characters.
func zeroRune(s []rune) bool {
	for _, r := range s {
		if r != '0' && r != '.' {
			return false
		}
	}
	return true
}
```
```go
// CompareVersion compares two app versions.
// return 0 means ver1 == ver2
// return 1 means ver1 > ver2
// return -1 means ver1 < ver2
func CompareVersion(ver1, ver2 string) int {
	// fast path
	if ver1 == ver2 {
		return 0
	}
	// slow path
	vers1 := strings.Split(ver1, ".")
	vers2 := strings.Split(ver2, ".")
	var (
		v1l, v2l = len(vers1), len(vers2)
		i        = 0
	)
	for ; i < v1l && i < v2l; i++ {
		a, e1 := strconv.Atoi(vers1[i])
		b, e2 := strconv.Atoi(vers2[i])
		res := 0
		// fall back to Go's string comparison when a part is not numeric
		if e1 != nil || e2 != nil {
			res = strings.Compare(vers1[i], vers2[i])
		} else {
			res = a - b
		}
		// res == 0 means this part is equal; otherwise return the result
		if res > 0 {
			return 1
		} else if res < 0 {
			return -1
		}
	}
	// whichever side still has a non-zero remainder is greater
	if i < v1l {
		for ; i < v1l; i++ {
			if !zeroRune([]rune(vers1[i])) {
				return 1
			}
		}
	} else if i < v2l {
		for ; i < v2l; i++ {
			if !zeroRune([]rune(vers2[i])) {
				return -1
			}
		}
	}
	return 0
}
```
Try to optimize the strings.Split function
The logic of CompareVersion is clear and simple, and according to the flame graph the cost is concentrated in the strings.Split function, so Lao Xu's first goal was to try to optimize that call.
The first method Lao Xu thought of was the time-honored Baidu-and-Google search. He eventually found an article claiming that the strings.FieldsFunc function splits strings much faster than strings.Split. Whether strings.FieldsFunc can really replace strings.Split here, the test results below will tell.
```go
func BenchmarkSplit(b *testing.B) {
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		strings.Split("7.0.09.000", ".")
		strings.Split("7.0.09", ".")
		strings.Split("9.01", ".")
	}
}

func BenchmarkFieldsFunc(b *testing.B) {
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		strings.FieldsFunc("7.0.09.000", func(r rune) bool { return r == '.' })
		strings.FieldsFunc("7.0.09", func(r rune) bool { return r == '.' })
		strings.FieldsFunc("9.01", func(r rune) bool { return r == '.' })
	}
}
```
The results of the above benchmark test run on Lao Xu's machine are as follows:
cpu: Intel(R) Core(TM) i7-7567U CPU @ 3.50GHz
BenchmarkSplit-4 3718506 303.2 ns/op 144 B/op 3 allocs/op
BenchmarkSplit-4 4144340 287.6 ns/op 144 B/op 3 allocs/op
BenchmarkSplit-4 3859644 304.5 ns/op 144 B/op 3 allocs/op
BenchmarkSplit-4 3729241 287.9 ns/op 144 B/op 3 allocs/op
BenchmarkFieldsFunc-4 3459463 336.5 ns/op 144 B/op 3 allocs/op
BenchmarkFieldsFunc-4 3604345 335.5 ns/op 144 B/op 3 allocs/op
BenchmarkFieldsFunc-4 3411564 313.9 ns/op 144 B/op 3 allocs/op
BenchmarkFieldsFunc-4 3661268 309.6 ns/op 144 B/op 3 allocs/op
According to the output, the strings.FieldsFunc function is not as fast as advertised; here it is in fact no better than strings.Split. With this road blocked, Lao Xu had to find another way.
Try to introduce a cache
Even at the most hard-driving company, shipping one version per week all year round means it takes about 19 years (1000 / (365 / 7)) to release 1000 versions. Given how few distinct versions can exist, caching the parsed versions before comparing them should give a qualitative improvement in execution speed.
Self-implementing an expiring cache
When introducing a cache, the first thing Lao Xu thought of was a cache with expiry, and to keep things as lightweight as possible, implementing one by hand is undoubtedly a good option.
1. Define a structure containing the expiration time and the data
```go
type cacheItem struct {
	data      interface{}
	expiredAt int64
}

// IsExpired reports whether the cached item has expired.
func (c *cacheItem) IsExpired() bool {
	return c.expiredAt > 0 && time.Now().Unix() >= c.expiredAt
}
```
2. Use sync.Map as a concurrency-safe cache
```go
var cacheMap sync.Map

// Set adds a value to the cache.
func Set(key string, val interface{}, expiredAt int64) {
	cv := &cacheItem{val, expiredAt}
	cacheMap.Store(key, cv)
}

// Get reads a value from the cache.
func Get(key string) (interface{}, bool) {
	// no cache entry
	cv, isExists := cacheMap.Load(key)
	if !isExists {
		return nil, false
	}
	// unexpected value type in the cache
	citem, ok := cv.(*cacheItem)
	if !ok {
		return nil, false
	}
	// lazily delete expired entries on read
	if citem.IsExpired() {
		cacheMap.Delete(key)
		return nil, false
	}
	// finally return the result
	return citem.data, true
}
```
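For a quick sanity check, the pieces above can be assembled into a standalone program (the types below simply re-declare the article's code; the keys and version strings are made up for the demo):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Re-declarations of the article's expiring-cache pieces for a standalone demo.
type cacheItem struct {
	data      interface{}
	expiredAt int64
}

// IsExpired reports whether the cached item has expired.
func (c *cacheItem) IsExpired() bool {
	return c.expiredAt > 0 && time.Now().Unix() >= c.expiredAt
}

var cacheMap sync.Map

// Set adds a value to the cache.
func Set(key string, val interface{}, expiredAt int64) {
	cacheMap.Store(key, &cacheItem{val, expiredAt})
}

// Get reads a value from the cache, lazily deleting expired entries.
func Get(key string) (interface{}, bool) {
	cv, isExists := cacheMap.Load(key)
	if !isExists {
		return nil, false
	}
	citem, ok := cv.(*cacheItem)
	if !ok {
		return nil, false
	}
	if citem.IsExpired() {
		cacheMap.Delete(key)
		return nil, false
	}
	return citem.data, true
}

func main() {
	Set("live", "7.0.9", time.Now().Unix()+60) // expires in a minute
	Set("dead", "7.0.8", time.Now().Unix()-1)  // already expired
	v, ok := Get("live")
	_, deadOK := Get("dead")
	fmt.Println(v, ok, deadOK) // 7.0.9 true false
}
```

Note that an expiredAt of 0 (or any non-positive value) means the entry never expires.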
3. Define a structure that stores each part of the version split by '.'
```go
// a slice of cmVal caches one complete version
type cmVal struct {
	iv int
	sv string
	// whether the part can be converted to an integer
	canInt bool
}
```
4. Convert the app version to tiles for easy caching
func strs2cmVs(strs []string) []*cmVal {
cmvs := make([]*cmVal, 0, len(strs))
for _, v := range strs {
it, e := strconv.Atoi(v)
// 全部数据都保存
cmvs = append(cmvs, &cmVal{it, v, e == nil})
}
return cmvs
}
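As a quick illustration (the version string "7.0.9-rc" is invented for this demo), a non-numeric part keeps its string form and gets canInt == false, so the comparison can later fall back to string comparison:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Re-declarations of the article's cmVal and strs2cmVs for a standalone demo.
type cmVal struct {
	iv     int
	sv     string
	canInt bool // whether the part converts to an integer
}

func strs2cmVs(strs []string) []*cmVal {
	cmvs := make([]*cmVal, 0, len(strs))
	for _, v := range strs {
		it, e := strconv.Atoi(v)
		// keep every part, numeric or not
		cmvs = append(cmvs, &cmVal{it, v, e == nil})
	}
	return cmvs
}

func main() {
	for _, cv := range strs2cmVs(strings.Split("7.0.9-rc", ".")) {
		fmt.Println(cv.sv, cv.iv, cv.canInt)
	}
	// prints:
	// 7 7 true
	// 0 0 true
	// 9-rc 0 false
}
```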
5. Compare versions using the cache
```go
func CompareVersionWithCache1(ver1, ver2 string) int {
	// fast path
	if ver1 == ver2 {
		return 0
	}
	// slow path
	var (
		cmv1, cmv2             []*cmVal
		cmv1Exists, cmv2Exists bool
		expire                 int64 = 200 * 60
	)
	// read cache 1
	cmv, cmvExists := Get(ver1)
	if cmvExists {
		cmv1, cmv1Exists = cmv.([]*cmVal)
	}
	if !cmv1Exists {
		// set val and cache
		cmv1 = strs2cmVs(strings.Split(ver1, "."))
		Set(ver1, cmv1, time.Now().Unix()+expire)
	}
	// read cache 2
	cmv, cmvExists = Get(ver2)
	if cmvExists {
		cmv2, cmv2Exists = cmv.([]*cmVal)
	}
	if !cmv2Exists {
		// set val and cache
		cmv2 = strs2cmVs(strings.Split(ver2, "."))
		Set(ver2, cmv2, time.Now().Unix()+expire)
	}
	// compare ver str
	var (
		v1l, v2l = len(cmv1), len(cmv2)
		i        = 0
	)
	for ; i < v1l && i < v2l; i++ {
		res := 0
		// compare as ints when both parts are numeric
		if cmv1[i].canInt && cmv2[i].canInt {
			res = cmv1[i].iv - cmv2[i].iv
		} else {
			res = strings.Compare(cmv1[i].sv, cmv2[i].sv)
		}
		if res > 0 {
			return 1
		} else if res < 0 {
			return -1
		}
	}
	// whichever side still has a non-zero remainder is greater
	if i < v1l {
		for ; i < v1l; i++ {
			if cmv1[i].canInt && cmv1[i].iv != 0 {
				return 1
			}
			if !zeroRune([]rune(cmv1[i].sv)) {
				return 1
			}
		}
	} else if i < v2l {
		for ; i < v2l; i++ {
			if cmv2[i].canInt && cmv2[i].iv != 0 {
				return -1
			}
			if !zeroRune([]rune(cmv2[i].sv)) {
				return -1
			}
		}
	}
	return 0
}
```
The comparison steps of the CompareVersionWithCache1 function are:
- Return directly if the version strings are equal
- Read the cached data for each of the two versions; if a version has no cached data, generate and cache it
- Compare the []*cmVal data of the two versions and return the result
Finally, performance verification. Below is the benchmark comparison between the CompareVersionWithCache1 function and the CompareVersion function.
cpu: Intel(R) Core(TM) i7-7567U CPU @ 3.50GHz
BenchmarkCompareVersion-4 1642657 767.6 ns/op 304 B/op 6 allocs/op
BenchmarkCompareVersionWithCache1-4 1296520 844.9 ns/op 0 B/op 0 allocs/op
Analyzing the above results, the only gain from caching is a slight reduction in memory allocations. This result left Lao Xu full of doubts; only after profiling with pprof did he find the reason the performance did not improve (the flame graph of the BenchmarkCompareVersionWithCache1 run is omitted here).
Since the number of app versions is small, lazy eviction is used for expired entries: every read checks whether the cache entry has expired. According to the flame graph, that expiry check is the biggest performance cost, because each check calls time.Now().Unix() to fetch the current timestamp. In other words, this single time.Now() call is what sank the optimization.
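One common mitigation for this kind of hot-path time.Now() cost (a hypothetical alternative, not the route the article takes next) is a coarse clock: a background goroutine refreshes a shared timestamp, and expiry checks read it atomically instead of calling time.Now() on every lookup:

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// coarseNow holds a unix timestamp refreshed in the background, so hot paths
// can read an atomic instead of calling time.Now() on every cache lookup.
var coarseNow int64

// coarseUnix returns the cached timestamp.
func coarseUnix() int64 {
	return atomic.LoadInt64(&coarseNow)
}

// startCoarseClock seeds the timestamp and refreshes it every interval.
// The cached value can lag real time by up to one interval, which is fine
// for expiry windows measured in minutes.
func startCoarseClock(interval time.Duration) {
	atomic.StoreInt64(&coarseNow, time.Now().Unix())
	go func() {
		for range time.Tick(interval) {
			atomic.StoreInt64(&coarseNow, time.Now().Unix())
		}
	}()
}

func main() {
	startCoarseClock(100 * time.Millisecond)
	// an expiry check would now compare expiredAt against coarseUnix()
	fmt.Println(coarseUnix() > 0) // true
}
```

The trade-off is expiry precision: entries may live up to one refresh interval past their deadline.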
Introduce an LRU cache
Considering that the number of versions is small and commonly used versions can stay cached more or less permanently, an LRU cache was introduced for a further optimization attempt.
1. Introduce an open-source LRU cache; the library used is github.com/hashicorp/golang-lru
2. Based on the CompareVersionWithCache1 function, replace the cache reads and writes with the LRU cache
Finally, performance verification. Below is the benchmark comparison between the CompareVersionWithCache2 function and the CompareVersion function.
cpu: Intel(R) Core(TM) i7-7567U CPU @ 3.50GHz
BenchmarkCompareVersion-4 1583202 841.7 ns/op 304 B/op 6 allocs/op
BenchmarkCompareVersionWithCache2-4 1671758 633.9 ns/op 96 B/op 6 allocs/op
Now this result finally looks like something. Still, the improvement is modest, and there is room to do better.
Self-implementing an LRU cache
Since an LRU cache proved effective, Lao Xu decided to build on that and implement a minimalist LRU cache himself.
1. Define a cache node structure
```go
type lruCacheItem struct {
	// doubly linked list pointers
	prev, next *lruCacheItem
	// cached data
	data interface{}
	// key of the cached data
	key string
}
```
2. Define a structure that manages the LRU cache
```go
type lruc struct {
	// head and tail sentinels of the linked list
	head, tail *lruCacheItem
	// a map of list nodes, giving O(1) reads
	lruMap map[string]*lruCacheItem
	rw     sync.RWMutex
	size   int64
}

func NewLRU(size int64) *lruc {
	if size <= 0 {
		size = 100
	}
	lru := &lruc{
		head:   new(lruCacheItem),
		tail:   new(lruCacheItem),
		lruMap: make(map[string]*lruCacheItem),
		size:   size,
	}
	lru.head.next = lru.tail
	lru.tail.prev = lru.head
	return lru
}
```
3. Set method of the LRU cache
```go
func (lru *lruc) Set(key string, v interface{}) {
	// fast path: read lock only
	lru.rw.RLock()
	_, exist := lru.lruMap[key]
	lru.rw.RUnlock()
	if exist {
		return
	}
	lru.rw.Lock()
	// double check under the write lock, then insert at the head
	if _, exist := lru.lruMap[key]; !exist {
		node := &lruCacheItem{
			data: v,
			prev: lru.head,
			next: lru.head.next,
			key:  key,
		}
		lru.lruMap[key] = node
		lru.head.next = node
		node.next.prev = node
	}
	if len(lru.lruMap) > int(lru.size) {
		// evict the least recently used node at the tail
		victim := lru.tail.prev
		victim.prev.next = lru.tail
		lru.tail.prev = victim.prev
		delete(lru.lruMap, victim.key)
	}
	lru.rw.Unlock()
}
```
4. Get method of the LRU cache
```go
func (lru *lruc) Get(key string) (interface{}, bool) {
	lru.rw.RLock()
	v, ok := lru.lruMap[key]
	lru.rw.RUnlock()
	if ok {
		// move the node to head.next
		lru.rw.Lock()
		v.prev.next = v.next
		v.next.prev = v.prev
		v.prev = lru.head
		v.next = lru.head.next
		v.next.prev = v
		lru.head.next = v
		lru.rw.Unlock()
		return v.data, true
	}
	return nil, false
}
```
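Put together, the minimalist LRU can be exercised end to end. The sketch below condenses the article's types into one runnable program (its Set takes the write lock up front, a slight simplification of the double-checked version above), and shows the tail node being evicted once the size limit is exceeded:

```go
package main

import (
	"fmt"
	"sync"
)

// Condensed re-declaration of the article's LRU types for a standalone demo.
type lruCacheItem struct {
	prev, next *lruCacheItem
	data       interface{}
	key        string
}

type lruc struct {
	head, tail *lruCacheItem // list sentinels
	lruMap     map[string]*lruCacheItem
	rw         sync.RWMutex
	size       int64
}

func NewLRU(size int64) *lruc {
	if size <= 0 {
		size = 100
	}
	lru := &lruc{
		head:   new(lruCacheItem),
		tail:   new(lruCacheItem),
		lruMap: make(map[string]*lruCacheItem),
		size:   size,
	}
	lru.head.next = lru.tail
	lru.tail.prev = lru.head
	return lru
}

func (lru *lruc) Set(key string, v interface{}) {
	lru.rw.Lock()
	defer lru.rw.Unlock()
	if _, exist := lru.lruMap[key]; exist {
		return
	}
	node := &lruCacheItem{data: v, prev: lru.head, next: lru.head.next, key: key}
	lru.lruMap[key] = node
	lru.head.next.prev = node
	lru.head.next = node
	if len(lru.lruMap) > int(lru.size) {
		// evict the least recently used node at the tail
		victim := lru.tail.prev
		victim.prev.next = lru.tail
		lru.tail.prev = victim.prev
		delete(lru.lruMap, victim.key)
	}
}

func (lru *lruc) Get(key string) (interface{}, bool) {
	lru.rw.RLock()
	v, ok := lru.lruMap[key]
	lru.rw.RUnlock()
	if !ok {
		return nil, false
	}
	return v.data, true
}

func main() {
	lru := NewLRU(2)
	lru.Set("a", 1)
	lru.Set("b", 2)
	lru.Set("c", 3) // capacity is 2, so "a" is evicted
	_, aOK := lru.Get("a")
	vb, _ := lru.Get("b")
	fmt.Println(aOK, vb) // false 2
}
```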
5. Based on the CompareVersionWithCache1 function, replace the cache reads and writes with the self-implemented LRU cache
Finally, performance verification. Below is the benchmark comparison between the CompareVersionWithCache3 function and the CompareVersion function:
cpu: Intel(R) Core(TM) i7-7567U CPU @ 3.50GHz
BenchmarkCompareVersion-4 1575007 763.1 ns/op 304 B/op 6 allocs/op
BenchmarkCompareVersionWithCache3-4 3285632 317.6 ns/op 0 B/op 0 allocs/op
With the self-implemented LRU cache in place, performance has fully doubled. At this point Lao Xu was almost ready to go show off at the company, but a voice in his head kept asking whether the cache could be read without taking a lock.
Reduce LRU cache lock contention
A truly lock-free approach did not come to mind, only two ways to reduce lock contention:
- There is no need to move a node to the head of the list on every read; only when the number of cached entries approaches the size limit does freshly read data need to be moved to the head
- In an LRU cache, the more frequently a node is accessed, the closer it sits to the head of the list. Based on this property, each access can roll a random number to decide whether to move the node: the more often a key is read, the more chances it has to pass the random check and be moved to the head, while most reads skip the write lock entirely
The implementation with the random check added is as follows:
```go
func (lru *lruc) Get(key string) (interface{}, bool) {
	lru.rw.RLock()
	v, ok := lru.lruMap[key]
	lru.rw.RUnlock()
	if ok {
		// 100 is an arbitrary choice here
		if rand.Int()%100 == 1 {
			lru.rw.Lock()
			v.prev.next = v.next
			v.next.prev = v.prev
			v.prev = lru.head
			v.next = lru.head.next
			v.next.prev = v
			lru.head.next = v
			lru.rw.Unlock()
		}
		return v.data, true
	}
	return nil, false
}
```
After adding the random check, the benchmark comparison between the CompareVersionWithCache3 function and the CompareVersion function is as follows:
cpu: Intel(R) Core(TM) i7-7567U CPU @ 3.50GHz
BenchmarkCompareVersion-4 1617837 761.5 ns/op 304 B/op 6 allocs/op
BenchmarkCompareVersionWithCache3-4 4817722 251.3 ns/op 0 B/op 0 allocs/op
With the random check, the performance of the CompareVersionWithCache3 function improves by roughly another 20%. The optimization is not over yet: while the number of cached entries stays far below the configured limit, there is no need to move nodes to the head of the list at all.
```go
func (lru *lruc) Get(key string) (interface{}, bool) {
	lru.rw.RLock()
	v, ok := lru.lruMap[key]
	lru.rw.RUnlock()
	if ok {
		// move to head.next only when the cache is nearly full
		if len(lru.lruMap) > int(lru.size)-1 && rand.Int()%100 == 1 {
			lru.rw.Lock()
			v.prev.next = v.next
			v.next.prev = v.prev
			v.prev = lru.head
			v.next = lru.head.next
			v.next.prev = v
			lru.head.next = v
			lru.rw.Unlock()
		}
		return v.data, true
	}
	return nil, false
}
```
After introducing the above optimization, the benchmark comparison is as follows:
cpu: Intel(R) Core(TM) i7-7567U CPU @ 3.50GHz
BenchmarkCompareVersion-4 1633576 793.2 ns/op 304 B/op 6 allocs/op
BenchmarkCompareVersion-4 1619822 882.7 ns/op 304 B/op 6 allocs/op
BenchmarkCompareVersion-4 1639792 737.2 ns/op 304 B/op 6 allocs/op
BenchmarkCompareVersion-4 1630004 758.3 ns/op 304 B/op 6 allocs/op
BenchmarkCompareVersionWithCache3-4 7538025 155.9 ns/op 0 B/op 0 allocs/op
BenchmarkCompareVersionWithCache3-4 7514742 150.1 ns/op 0 B/op 0 allocs/op
BenchmarkCompareVersionWithCache3-4 8357704 162.9 ns/op 0 B/op 0 allocs/op
BenchmarkCompareVersionWithCache3-4 7748578 148.0 ns/op 0 B/op 0 allocs/op
So far, under ideal conditions (sufficient cache space), the final version achieves about 4x the performance of the original.
Some people are simply gifted
Lao Xu was about to go show off at the company when, completely unexpectedly, it turned out a colleague had developed a more reasonable and stable version comparison algorithm, which left Lao Xu rather ashamed.
The idea of the algorithm is as follows:
- Do not use the strings.Split function to split the version on '.'; instead, compare characters from left to right until a differing character is found, recording the indices i and j
- Starting from i and j, walk the rest of each version string until the first '.', convert the two segments to integers, and compare them
- If the two versions are still equal after the first two steps, whichever has characters remaining is greater
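The real implementation lives in the repository linked at the end of the article; the split-free idea can be sketched roughly like this (a simplified version that assumes purely numeric, dot-separated segments, so it skips the colleague's common-prefix fast path and any fallback for non-numeric parts):

```go
package main

import "fmt"

// compareVersionNoSplit walks both strings segment by segment without
// allocating: each numeric segment is folded into an int on the fly, and a
// missing segment counts as 0 (so "7.0" compares equal to "7.0.00").
func compareVersionNoSplit(ver1, ver2 string) int {
	i, j := 0, 0
	for i < len(ver1) || j < len(ver2) {
		a := 0
		for ; i < len(ver1) && ver1[i] != '.'; i++ {
			a = a*10 + int(ver1[i]-'0')
		}
		b := 0
		for ; j < len(ver2) && ver2[j] != '.'; j++ {
			b = b*10 + int(ver2[j]-'0')
		}
		if a > b {
			return 1
		}
		if a < b {
			return -1
		}
		i++ // skip the '.'
		j++
	}
	return 0
}

func main() {
	fmt.Println(compareVersionNoSplit("7.0.09.000", "7.0.9")) // 0
	fmt.Println(compareVersionNoSplit("1.2", "1.10"))         // -1
	fmt.Println(compareVersionNoSplit("2.0", "1.9"))          // 1
}
```

Because nothing is split out, there are no intermediate slices, which is what drives the allocation counts toward zero in the benchmarks below.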
The benchmarks of the three algorithms are as follows:
cpu: Intel(R) Core(TM) i7-7567U CPU @ 3.50GHz
BenchmarkCompareVersion-4 1803190 674.8 ns/op 304 B/op 6 allocs/op
BenchmarkCompareVersion-4 1890308 630.9 ns/op 304 B/op 6 allocs/op
BenchmarkCompareVersion-4 1855741 631.8 ns/op 304 B/op 6 allocs/op
BenchmarkCompareVersion-4 1850410 629.4 ns/op 304 B/op 6 allocs/op
BenchmarkCompareVersionWithCache3-4 8877466 132.2 ns/op 0 B/op 0 allocs/op
BenchmarkCompareVersionWithCache3-4 8489661 132.6 ns/op 0 B/op 0 allocs/op
BenchmarkCompareVersionWithCache3-4 8358210 132.6 ns/op 0 B/op 0 allocs/op
BenchmarkCompareVersionWithCache3-4 8456853 131.9 ns/op 0 B/op 0 allocs/op
BenchmarkCompareVersionNoSplit-4 6309705 178.9 ns/op 8 B/op 2 allocs/op
BenchmarkCompareVersionNoSplit-4 6228823 181.2 ns/op 8 B/op 2 allocs/op
BenchmarkCompareVersionNoSplit-4 6370544 177.8 ns/op 8 B/op 2 allocs/op
BenchmarkCompareVersionNoSplit-4 6351043 180.0 ns/op 8 B/op 2 allocs/op
The CompareVersionNoSplit function needs no cache at all, and it avoids the performance loss that CompareVersionWithCache3 suffers as the number of cached entries approaches the limit. It is just about the most ideal version comparison scheme Lao Xu has seen so far.
Call it "the player sees less clearly than the bystander", or call it sour grapes; either way Lao Xu has to admit that some people are simply born for this work, and meeting such a colleague is his good luck: as long as that colleague has food to eat, Lao Xu can always trail behind and scrounge a mouthful of soup. For the complete implementation of the version comparison algorithm mentioned at the end of the article, see the following GitHub repository:
https://github.com/Isites/ares/tree/main/strs
Finally, I sincerely hope that this article can be helpful to all readers.
Note:
At the time of writing, the Go version used by the author was go1.16.6.
Full examples used in this article: https://github.com/Isites/go-coder/tree/master/strs