
Text | Zhang Xihong (alias: Zhiyu)

Ant Group Technical Expert

Responsible for building high-availability capabilities under Ant Group's cloud-native architecture.
His main technical areas include Service Mesh, Serverless, etc.

This article is 3,631 words and takes about 8 minutes to read.

PART. 1 Background

After this year's Double Eleven promotion, we followed our usual practice and analyzed the system's operating data from the promotion period in detail. Compared with the same period last year, we found that MOSN's CPU utilization had increased by about 1%.

Why has it increased?

Is it reasonable?

Can it be optimized?

Is it an inevitable increase in entropy, or is it man-made waste?

With this series of soul-searching questions, we set out to analyze the system.


PART. 2 Problem location

Monitoring showed that this additional overhead is present even when the system is idle and does not decrease as stress-test traffic increases. Total CPU consumption increased by 1.2%, of which 0.8% came from cpu_sys.

perf analysis showed that the new version of MOSN issues significantly more syscalls than the old version.

[Figure: perf output, old version vs. new version]

Peeling the problem back layer by layer, we found that part of the overhead comes from the Sleep inside the StartTimeTicker goroutine of sentinel-golang, a library MOSN depends on, which generates a large number of system calls. So what is this logic?

PART. 3 Theoretical Analysis


Looking at the source code, there is a millisecond-level timestamp caching mechanism, designed to reduce performance overhead at high call frequencies. However, fetching the timestamp and calling Sleep in a loop, even when the system is idle, generates a large number of system calls and drives up cpu sys util. Let's first analyze theoretically why this kind of optimization is usually ineffective in practice, starting with Sentinel's code:

package util

import (
  "sync/atomic"
  "time"
)

// UnixTimeUnitOffset converts nanoseconds to milliseconds.
const UnixTimeUnitOffset = uint64(time.Millisecond / time.Nanosecond)

var nowInMs = uint64(0)

// StartTimeTicker starts a background task that caches current timestamp per millisecond,
// which may provide better performance in high-concurrency scenarios.
func StartTimeTicker() {
  atomic.StoreUint64(&nowInMs, uint64(time.Now().UnixNano())/UnixTimeUnitOffset)
  go func() {
    for {
      now := uint64(time.Now().UnixNano()) / UnixTimeUnitOffset
      atomic.StoreUint64(&nowInMs, now)
      time.Sleep(time.Millisecond)
    }
  }()
}

func CurrentTimeMillsWithTicker() uint64 {
  return atomic.LoadUint64(&nowInMs)
}

As the code above shows, Sentinel runs a goroutine in a loop that fetches the current timestamp, stores it in an atomic variable, and then Sleeps for 1 ms, thereby caching a millisecond-level timestamp. An external switch controls whether this logic is enabled, and it is on by default. Judging from the code, Sleep should be the biggest performance cost, because Sleep generates syscalls, and as we all know, syscalls are relatively expensive.

So how do the overheads of time.Sleep and time.Now actually compare?

After checking the references (1), I found a counter-intuitive fact: because of Golang's scheduling mechanism, a single time.Sleep in Go may generate up to 7 syscalls, while time.Now is implemented via vDSO. The question then becomes: how much of an improvement is vDSO over 7 system calls?

I found corroborating material in a Golang optimization (2). It mentions that older versions of Golang (1.9 and earlier) had no vDSO optimization on Linux/386, and time.Now then cost 2 syscalls. The optimized version improves theoretical performance by 5~7x+, so the cost of one time.Now can be approximated as <= 0.3 syscall.
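To observe this yourself, one approach (a sketch; the figures above come from the linked issue, and exact counts vary by Go version, platform, and scheduler state) is to run a trivial sleep loop under strace and count the syscalls:

package main

import "time"

// A trivial sleep loop. Build this and run it under `strace -cf ./sleeper`
// to tally the syscalls (futex, nanosleep, epoll_pwait, and friends) that
// the Go runtime issues on behalf of time.Sleep.
func main() {
  for i := 0; i < 1000; i++ {
    time.Sleep(time.Millisecond)
  }
}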

The cache is designed to reduce time.Now calls, so in theory there can be a net benefit if the call volume is large enough. Based on the analysis above, assume the syscall cost ratio of one time.Now to one ticker iteration is 0.3 : 7.3 (7 for the Sleep plus 0.3 for the time.Now inside the loop). Since Sleep executes 1000 times per second (ignoring loss of timer precision), CurrentTimeMillsWithTicker must be called more than about 24,000 times per second before the cache pays off.
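Writing the break-even arithmetic out explicitly (a sketch that just encodes the assumed cost ratio above; these are not measured constants):

package main

import "fmt"

func main() {
  const (
    timeNowCost = 0.3    // assumed relative cost of one direct time.Now (vDSO)
    tickerCost  = 7.3    // assumed cost of one ticker iteration: 7 (Sleep) + 0.3 (time.Now)
    ticksPerSec = 1000.0 // the ticker goroutine wakes up ~1000 times per second
  )
  // The cache pays off only when the direct calls it replaces would cost
  // more than the ticker itself: calls * timeNowCost > ticksPerSec * tickerCost.
  breakEven := ticksPerSec * tickerCost / timeNowCost
  fmt.Printf("break-even: %.0f CurrentTimeMillsWithTicker calls/sec\n", breakEven) // ≈ 24333
}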

So let's analyze the call volume of CurrentTimeMillsWithTicker. I added a counter at this point for verification (see the sketch after Note 0 below) and then simulated requests calling Sentinel's Entry. The tests showed that:

  1. When a resource point is created for the first time, the amplification ratio between Entry and CurrentTimeMillsWithTicker is about 20, mainly because creating the underlying sliding window requires a large amount of timestamp computation
  2. Subsequent Entry calls on the same resource have an amplification ratio⁰ of 5:1

|Note 0: The MOSN version used internally is customized on top of upstream Sentinel; the amplification ratio of the community version should theoretically be lower than this.
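For reference, the instrumentation was along these lines (a hypothetical sketch building on the util snippet above; the actual counter we added is not shown in this article):

package util

import "sync/atomic"

// tickerReads is a hypothetical instrumentation counter, not part of Sentinel.
var tickerReads uint64

// countedCurrentTimeMills wraps the cached read and tallies every call, so
// the ratio of Entry calls to timestamp reads can be computed after a test run.
func countedCurrentTimeMills() uint64 {
  atomic.AddUint64(&tickerReads, 1)
  return atomic.LoadUint64(&nowInMs)
}

// TickerReads reports how many cached-timestamp reads have happened so far.
func TickerReads() uint64 {
  return atomic.LoadUint64(&tickerReads)
}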

Since resource-point creation is low-frequency, we can approximate the amplification ratio as 5. In theory, then, the cache only starts to pay off when a single machine exceeds roughly 4,800 QPS... We hear about the C10K, C100K, and C1000K problems all the time, so this number may not sound high. But in a real business system it is actually a very large volume.

I sampled several applications with relatively large daily request volumes and checked their QPS (QPS here covers all types of resource points: inbound and outbound calls, sub-resource points, and so on; in short, everything that passes through a Sentinel Entry call). None of their daily peaks exceeded 4,800 QPS. In real business systems, a single machine exceeding this value is quite rare.¹

|Note 1: The monitoring here is minute-level data, which may differ from second-level monitoring; it is only used to gauge daily request volumes.

[Figure: minute-level QPS of the sampled applications]

The cache has one other potential benefit: it shortens the synchronous timestamp fetch on the request path. So let's compare reading the cached value from the atomic variable against fetching the timestamp via time.Now().
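A minimal Go benchmark sketch for this comparison (illustrative; the numbers in the figure below come from our environment and will vary by hardware):

package util

import (
  "sync/atomic"
  "testing"
  "time"
)

var cachedMs uint64

// BenchmarkAtomicLoad measures reading the cached millisecond timestamp.
func BenchmarkAtomicLoad(b *testing.B) {
  for i := 0; i < b.N; i++ {
    _ = atomic.LoadUint64(&cachedMs)
  }
}

// BenchmarkTimeNow measures fetching the timestamp directly; on modern
// Linux this goes through the vDSO rather than a full syscall.
func BenchmarkTimeNow(b *testing.B) {
  for i := 0; i < b.N; i++ {
    _ = uint64(time.Now().UnixNano()) / uint64(time.Millisecond)
  }
}

Run it with go test -bench=. -run='^$' to execute only the benchmarks.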

[Figure: benchmark results, atomic load vs. time.Now()]

A single direct timestamp fetch is indeed much more expensive than a read from memory, but it is still at the nanosecond level, and an increase of this magnitude is negligible for a request.

[Figure: per-call cost difference]

The difference is about 0.06 microseconds; even multiplied by 5, that is only a 0.3-microsecond increase per request. We can also look at actual MOSN RT at the 4000 QPS load level.

[Figure: MOSN RT of the two machines at 4000 QPS]

There is no obvious difference in MOSN RT between the two machines; after all, the gap is only 0.3 microseconds...

PART. 4 Test conclusion

We also took two machines and ran tests with this cache disabled and enabled. The results corroborate the analysis above.

[Figure: cpu sys util with the cache enabled vs. disabled, across load levels]

As the figure shows, cpu sys util is consistently higher with the cache enabled than without it. The gap narrows as request volume grows, but even at 4000 QPS there is still no net benefit.

Testing and theoretical analysis both show that in typical scenarios, Sentinel's cache feature brings no benefit and instead causes a performance loss, especially under low load. Even under high load, we can infer that running without the cache would have little impact on the system.

This performance analysis also made us aware of several things:

  1. Don't optimize prematurely; as the saying goes, premature optimization is the root of all evil;
  2. Use objective data to prove that an optimization is a net gain, rather than relying on intuition;
  3. Ground the analysis in real scenarios, and don't give priority to low-probability ones;
  4. Low-level implementations can differ between languages; evaluate them carefully when porting.

PART. 5 Is it necessary?

Didn't we just say not to optimize prematurely? Doesn't this count as premature optimization? Isn't that a double standard?

"Premature optimization is the root of all evil" is actually misused, and it has context.

We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%. —— Donald Knuth

Donald Knuth's point is that many optimizations are unnecessary: we can easily spend a lot of time on work with a low input-output ratio. But he also emphasized the necessity of certain key optimizations. In short, weigh cost against benefit; we cannot optimize blindly and without data. "Premature" is perhaps better translated as "immature and blind", so the original meaning of the sentence is closer to "blind optimization is the root of all evil". Here, only a one-line code change is needed to eliminate this unnecessary overhead; the cost-benefit ratio is extremely high, so why not do it?
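For illustration, the shape of the fix is simply falling back from the cached read to a direct one (a sketch building on the util snippet above; in practice this is done through Sentinel's configuration switch rather than by editing the library):

package util

import "time"

// CurrentTimeMillis returns the current Unix timestamp in milliseconds via a
// direct time.Now call. With the ticker disabled, this is the whole trade:
// one vDSO-backed read per request instead of a background goroutine issuing
// ~7 syscalls every millisecond.
func CurrentTimeMillis() uint64 {
  return uint64(time.Now().UnixNano()) / UnixTimeUnitOffset
}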

Looking at the data, this change only reduces cpu sys util by 0.7%. Do we really care about that 0.7%?

From the perspective of system headroom, perhaps not: to be safe we provision more resources than we actually need, and this 0.7% will not be the last straw that breaks the system. But from the perspective of environmental protection, it is worth it! This year we have emphasized green computing: improving efficiency and reducing cost. This single line of code runs as a Sidecar in hundreds of thousands of business Pods, backed by tens of thousands of servers.


Let's make a rough, admittedly non-rigorous estimate. Take a common server CPU, the Xeon E5, as an example: its TDP² is 120W. Then 120W x 0.7% x 24 x 365 / 1000 ≈ 7.36 kWh per machine per year, or about 73,584 kWh per year for 10,000 machines. This does not even count the extra heat-exchange losses needed to keep the machine room cool (put simply, the air-conditioning bill; a conventional machine room has a PUE of about 1.5). According to an expert estimate of unknown reliability, saving 1 kWh of electricity reduces CO2 emissions by 0.997 kg, so that is roughly 73,000 kg of CO2 per 10,000 machines per year, which rounds up to 100,000 kg.
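Spelling the estimate out (same rough assumptions as above; PUE is left out, and including it at ~1.5 would scale the numbers up further):

package main

import "fmt"

func main() {
  const (
    tdpWatts    = 120.0 // Xeon E5 TDP, used as a rough proxy for power draw
    savedRatio  = 0.007 // the 0.7% cpu sys util saved
    hoursPerYr  = 24.0 * 365.0
    machines    = 10000.0
    kgCO2PerKWh = 0.997 // the emission factor cited above
  )
  perMachineKWh := tdpWatts * savedRatio * hoursPerYr / 1000.0 // ≈ 7.36 kWh/year
  fleetKWh := perMachineKWh * machines                         // ≈ 73,584 kWh/year
  fmt.Printf("per machine: %.2f kWh/year\n", perMachineKWh)
  fmt.Printf("per 10,000 machines: %.0f kWh/year ≈ %.0f kg CO2\n", fleetKWh, fleetKWh*kgCO2PerKWh)
}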

This is also a line of open-source community code. The community has adopted our suggestion (3), and this feature is now off by default. Perhaps thousands of companies and tens of thousands of servers will benefit from it as well.


|Note 2: TDP stands for Thermal Design Power, which is not equivalent to electrical power consumption. It refers to the maximum heat a processor generates when running real applications and is mainly used to size a heatsink that can cool the processor effectively. A processor's TDP does not represent its true power draw and bears no fixed arithmetic relationship to it, though the actual power consumption can usually be assumed to be greater than the TDP.

"Extended reading"

(1) Verification information: https://github.com/golang/go/issues/25471

(2) Golang optimization: https://go-review.googlesource.com/c/go/+/69390

(3) Our suggestion: https://github.com/alibaba/sentinel-golang/issues/441

Thanks to Yigang, Maoxiu, Hao Ye, Yongpeng, Zhuo You, and other colleagues for their contributions to locating this problem. This article quotes some data from the MOSN big-promotion version performance comparison document. Thanks also to Su He and other members of the Sentinel community for their active support on the related issues and PRs.


