
Project Background

The Feature service produces feature data for consumption by upstream business services.
Service load: roughly 100k QPS on the API module and 200k QPS on the computing module at peak.
Local caching mechanism of the service:

  • The computing module has a local cache with a fairly high hit rate, around 50%;
  • The computing module's local cache expires in full at second 0 of every minute, at which point all traffic falls through to the downstream Codis;
  • Key name in Codis = feature name + geographic grid ID + minute-level time string;
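
For illustration, a small sketch of how such a key could be assembled (the separator, grid ID type, and time layout are assumptions, not the service's real format):

package main

import (
	"fmt"
	"time"
)

// codisKey builds a key of the shape described above:
// feature name + geographic grid ID + minute-level time string.
// The separator and time layout are illustrative assumptions.
func codisKey(feature string, gridID int64, t time.Time) string {
	return fmt.Sprintf("%s:%d:%s", feature, gridID, t.Format("200601021504"))
}

func main() {
	fmt.Println(codisKey("grid_density", 1234, time.Now()))
}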

[Figure: Feature service module diagram]

The Problem

The API side of the service shows severe P99 latency spikes, always during seconds 0-10 of every minute, which pushes the access error rate of upstream services above 1‰ and hurts business metrics.
Goal: eliminate the latency spikes and bring overall P99 latency below 15 ms.
[Figure: P99 latency of the API module as seen by upstream]

Solution

Service CPU optimization

Background

During an incidental production change, we noticed that the CPU utilization of the Feature service strongly affects its latency, so we first attacked the latency spikes from the angle of raising the service's CPU idle.

Optimization

Inspection of the pprof CPU profile showed that JSON deserialization accounted for a large share of CPU time (over 50%), so we optimized by reducing the number of deserialization operations and replacing the JSON library with json-iterator.
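
For reference, switching to json-iterator can be as small as redefining a package-level json handle (a minimal sketch; the Feature struct below is made up for illustration, not the service's real type):

package main

import (
	"fmt"

	jsoniter "github.com/json-iterator/go"
)

// Drop-in replacement for encoding/json: existing json.Marshal / json.Unmarshal
// call sites keep compiling against this package-level variable.
var json = jsoniter.ConfigCompatibleWithStandardLibrary

// Feature is an illustrative payload type, not the service's actual schema.
type Feature struct {
	Name  string  `json:"name"`
	Value float64 `json:"value"`
}

func main() {
	raw, _ := json.Marshal(Feature{Name: "grid_density", Value: 0.42})
	fmt.Println(string(raw))

	var f Feature
	_ = json.Unmarshal(raw, &f)
	fmt.Printf("%+v\n", f)
}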

Effect

Benefit: CPU idle rose by 5%, and the P99 latency spike dropped from 30 ms to under 20 ms.
[Figure: latency curves after optimization (red and green lines)]

About CPU and Latency

Why does latency fall when CPU idle rises?

  • Less deserialization work means less computation per request;
  • Shorter per-request processing time means fewer requests are being handled concurrently, which reduces scheduling switches, goroutine/thread queuing, and resource contention;
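
A simple queueing model makes the second point concrete (an illustrative assumption, not a measurement of this service). In an M/M/1 queue with arrival rate λ and service rate μ, the mean time a request spends in the system is

    W = 1 / (μ - λ) = (1/μ) / (1 - ρ),  where ρ = λ/μ is the utilization.

As ρ approaches 1, W grows without bound, and tail percentiles grow even faster than the mean. Conversely, a small increase in CPU idle (a small drop in ρ) near saturation produces a large drop in queuing delay, which is consistent with the latency reduction observed after the CPU optimization.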

About the json-iterator library

Why is json-iterator fast? The standard library's encoding/json uses reflect.Value to read and assign values. A reflect.Value is not reusable: a new one has to be constructed for every variable, so performance is poor.
json-iterator instead uses the type information obtained via reflect.Type to read and assign fields directly through "object pointer address + field offset", without going through reflect.Value. reflect.Type is reusable: values of the same type share the same reflect.Type, so it can be cached per type and reused.
Overall, json-iterator reduces memory allocations and reflection calls, and with them the system-call, lock, and GC costs caused by allocation, as well as the overhead of reflection itself.

Details can be found here: https://cloud.tencent.com/developer/article/1064753
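
A minimal sketch of the "object pointer + field offset" idea (illustrative only, not json-iterator's actual implementation; requires Go 1.17+ for unsafe.Add):

package main

import (
	"fmt"
	"reflect"
	"unsafe"
)

type User struct {
	ID   int64
	Name string
}

// stringFieldReader caches the field offset obtained from reflect.Type once,
// then reads the field directly through the object's pointer, so no new
// reflect.Value is created per call.
type stringFieldReader struct {
	offset uintptr
}

func newStringFieldReader(t reflect.Type, field string) stringFieldReader {
	f, ok := t.FieldByName(field)
	if !ok || f.Type.Kind() != reflect.String {
		panic("field missing or not a string")
	}
	return stringFieldReader{offset: f.Offset}
}

func (r stringFieldReader) read(ptr unsafe.Pointer) string {
	return *(*string)(unsafe.Add(ptr, r.offset))
}

func main() {
	// reflect.Type is computed once per type and can be cached and reused.
	reader := newStringFieldReader(reflect.TypeOf(User{}), "Name")

	u := &User{ID: 1, Name: "feature"}
	fmt.Println(reader.read(unsafe.Pointer(u))) // feature
}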

Call Pattern Optimization: Hedged Requests

Background

  1. P99 latency of the Feature service API module's calls to the computing module is significantly higher than P95;
    [Figure: P99 and P95 latency of API module calls to the computing module]
  2. The spikes on different machines of the computing module appear at different times; on any single machine a spike is an occasional event, but aggregated across all machines the spikes show up at regular times;
    [Figure: P99 latency of computing-module responses to the API module (unaggregated)]

    [Figure: P99 latency of computing-module responses to the API module (mean-aggregated)]

Optimization

  1. Because P99 is so much higher than P95, we adopted a hedged-request scheme to deal with the spikes;

    Hedged request: split a downstream request into two. Send the first immediately; if it has not returned after a timeout of n milliseconds, send the second; whichever of the two returns first is used;
    Hedged requests.
    A simple way to curb latency variability is to issue the same request to multiple replicas and use the results from whichever replica responds first. We term such requests "hedged requests" because a client first sends one request to the replica believed to be the most appropriate, but then falls back on sending a secondary request after some brief delay. The client cancels remaining outstanding requests once the first result is received. Although naive implementations of this technique typically add unacceptable additional load, many variations exist that give most of the latency-reduction effects while increasing load only modestly.
    One such approach is to defer sending a secondary request until the first request has been outstanding for more than the 95th-percentile expected latency for this class of requests. This approach limits the additional load to approximately 5% while substantially shortening the latency tail. The technique works because the source of latency is often not inherent in the particular request but rather due to other forms of interference.
    From: the paper "The Tail at Scale"
  2. Research

    • Read Google's paper "The Tail at Scale";
    • Open-source implementations: BRPC, RPCX;
    • Industry practice: enabled by default inside Baidu; Grab's LBS service (whose downstream is a pure in-memory database) saw a very clear effect; the Google paper also reports practical results;
  3. Implementation: adapted from the RPCX open-source implementation
package backuprequest

import (
	"sync/atomic"
	"time"

	"golang.org/x/net/context"
)

// inflight tracks the number of hedged (backup) requests currently in flight;
// it is compared against BackupLimit to cap the extra load on the downstream.
// logger, metric, helpers, and BackupLimit come from the service's internal
// packages and are not shown here.
var inflight int64

// call represents an active RPC.
type call struct {
	Name  string
	Reply interface{} // The reply from the function (*struct).
	Error error       // After completion, the error status.
	Done  chan *call  // Strobes when call is complete.
}

func (call *call) done() {
	select {
	case call.Done <- call:
	default:
		logger.Debug("rpc: discarding Call reply due to insufficient Done chan capacity")
	}
}

// BackupRequest runs fn once immediately; if it has not returned within
// backupTimeout, a second copy of fn is launched, and whichever copy
// finishes first wins.
func BackupRequest(backupTimeout time.Duration, fn func() (interface{}, error)) (interface{}, error) {
	ctx, cancelFn := context.WithCancel(context.Background())
	defer cancelFn()
	callCh := make(chan *call, 2) // capacity 2 so both copies can deliver a result
	call1 := &call{Done: callCh, Name: "first"}
	call2 := &call{Done: callCh, Name: "second"}

	go func(c *call) {
		defer helpers.PanicRecover()
		c.Reply, c.Error = fn()
		c.done()
	}(call1)

	t := time.NewTimer(backupTimeout)
	select {
	case <-ctx.Done(): // canceled by context
		return nil, ctx.Err()
	case c := <-callCh: // first copy returned before the timer fired
		t.Stop()
		return c.Reply, c.Error
	case <-t.C: // timer fired: launch the backup copy
		go func(c *call) {
			defer helpers.PanicRecover()
			defer atomic.AddInt64(&inflight, -1)
			if atomic.AddInt64(&inflight, 1) > BackupLimit {
				// Too many hedged requests in flight; skip the backup to protect the downstream.
				metric.Counter("backup", map[string]string{"mark": "limited"})
				return
			}

			metric.Counter("backup", map[string]string{"mark": "trigger"})
			c.Reply, c.Error = fn()
			c.done()
		}(call2)
	}

	select {
	case <-ctx.Done(): // canceled by context
		return nil, ctx.Err()
	case c := <-callCh: // whichever copy finishes first wins
		metric.Counter("backup_back", map[string]string{"call": c.Name})
		return c.Reply, c.Error
	}
}
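
For illustration, a hypothetical call site in the API module could wrap its call to the computing module like this (client.GetFeature, ctx, req, and FeatureResponse are assumed names, not the service's real API):

// Hedge the downstream call with a 5 ms backup timeout (roughly the P95 of
// this call). Whichever of the two attempts returns first is used.
reply, err := backuprequest.BackupRequest(5*time.Millisecond, func() (interface{}, error) {
	return client.GetFeature(ctx, req)
})
if err != nil {
	return nil, err
}
resp := reply.(*FeatureResponse) // use resp as the normal response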

Effect

Benefit: overall P99 latency dropped from 20-60 ms to 6 ms and all spikes were eliminated (backupTimeout = 5 ms).
[Figure: latency of the API module as seen by upstream services]

Excerpts and Interpretation of "The Tail at Scale"

Text in parentheses is my own interpretation.
Why does variability (response times with a long tail) exist?

  • Variability in response times (the source of long-tail latency) that produces high tail latency in individual components of a service can arise for many reasons, including:
  • Shared resources. Machines may be shared by different applications that compete for shared resources (CPU cores, processor caches, memory bandwidth, network bandwidth); this is even worse in cloud environments, for example contention between containers or interference from sidecar processes. Within a single application, different requests may also compete for resources.
  • Daemons. Background daemons may use only limited resources on average, yet cause hiccups of a few milliseconds when they are scheduled.
  • Global resource sharing. Applications running on different machines may compete for global resources (such as network switches and shared file systems or databases).
  • Maintenance activities. Background activities, such as data reconstruction in distributed file systems, periodic log compaction in storage systems like BigTable (i.e. LSM compaction; RocksDB-based databases have the same issue), and periodic garbage collection in GC languages (GC affects both the service itself and its upstream and downstream: 1. the Codis proxy is written in Go and has its own GC issues; 2. the latency spikes of this Feature service turned out to be caused by the service's own GC, see below), cause periodic latency and queuing. Multiple layers of queuing at intermediate servers and network switches amplify this variability.

Reduce component variability

  • Background tasks can generate significant CPU, disk, or network load; examples are log compaction in log-structured storage systems and collector activity in garbage-collected languages.
  • Throttling, breaking heavyweight operations into smaller ones (such as Go's incremental GC and the incremental key relocation during Redis rehash), and triggering these operations when overall load is low (for example, scheduling RocksDB compactions for the early morning hours) often reduce the impact of background activity on interactive request latency.

About eliminating sources of variation

  • It is not practical to eliminate all sources of latency variability in large-scale systems, especially in shared environments.
  • Using an approach similar to fault-tolerant computing (here, hedged requests), tail-tolerant software techniques build a predictable whole out of less predictable parts (model the downstream latency curve and optimize probabilistically).
  • Measurements of a real Google service whose structure matches this idealized scenario (a root server fans a request out to a large number of leaf servers through intermediate servers) show the effect of large fan-out on the latency distribution: the 99th-percentile latency for a single random request to complete, measured at the root, is 10 ms, but the 99th-percentile latency for all requests to complete is 140 ms, and the 99th-percentile latency for 95% of the requests to complete is 70 ms. In other words, waiting for the slowest 5% of requests accounts for half of the total 99th-percentile latency. Techniques that target these slow outliers can therefore dramatically reduce overall service latency.
  • Since it is not feasible to eliminate every source of variability, tail-tolerant techniques are developed for large-scale services. Although addressing specific sources of latency variation is useful, the most robust tail-tolerant techniques reduce latency regardless of the root cause. They let designers keep optimizing for the common case while still providing resilience against the uncommon one.

Hedged Request Principle

[Figure: a typical hedged-request scenario]
  • The principle is probabilistic: given the latency distribution of the downstream service, the probability that the smaller of two samples drawn from that curve is below x is much higher than the probability that a single sample is below x, so latency can be cut substantially (see the short calculation after this list);
  • However, sending too many extra requests (say, doubling the traffic) sharply increases downstream pressure, the latency distribution itself degrades, and the expected benefit disappears. Keeping the extra traffic to around 5% means the downstream latency curve does not degrade and the smooth part of the curve below the 95th percentile can still be exploited, so choosing the hedging timeout is also something to pay attention to;
  • When a request has already taken longer than the 95th percentile, one more request is sent. The remaining latency is then the smaller of two samples: one drawn from the whole curve (the new request) and one drawn from the part of the curve beyond the 95th percentile (the original request). Viewed probabilistically, the resulting curve beyond the 95th percentile becomes much smoother than before;
  • The trade-off is quite elegant: only 5% more requests essentially eliminates the long tail;
  • Limitations

    • The request must be idempotent, otherwise it will cause data inconsistency;
    • In general, hedged requests remove the influence of incidental factors probabilistically and thereby solve the long-tail problem, so you must consider whether the latency is instead caused by deterministic, business-side factors, for example:

      • On an mget interface, fetching 100 keys and fetching 10,000 keys take very different amounts of time; hedged requests can do nothing about that. Requests to the same interface should therefore be of similar cost, so that downstream latency does not depend on the content of the request itself;
      • The latency spikes caused by the Feature service's computing module punching through to Codis when its cache expires cannot be fixed by hedged requests either; in some circumstances they can even make latency worse;
    • The hedging timeout is set by hand rather than adjusted dynamically, so in extreme cases there is an avalanche risk; see the section below for mitigations;
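
To make the first point above concrete, a short back-of-the-envelope calculation (assuming the two copies behave independently, which is only an approximation):

    If a single downstream call exceeds some latency x with probability q, then
    P(min(T1, T2) > x) = P(T1 > x) * P(T2 > x) = q^2.
    Taking x to be the original P95 gives q = 0.05 and q^2 = 0.0025, so only 0.25% of hedged calls exceed the old P95; the old P95 becomes roughly the new P99.75. The price is the extra traffic, which the 95th-percentile trigger keeps at about 5%.
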
Name origin
"Backup request" appears to be the name BRPC used when it shipped the feature. The paper itself calls them hedged requests, which is the name translated directly here; gRPC also uses the paper's name.

About Avalanche Risk

Because the hedging timeout is set manually rather than adjusted dynamically, there is an avalanche risk in extreme cases.
[Figure: excerpt from the Google SRE book]

Without a limit there is an avalanche risk; the following mitigations exist:

  • BRPC's practice: a hedged request consumes part of the downstream retry budget;
  • bilibili's practice:

    • The downstream of a retried request must block further cascading retries;
    • The service itself must have circuit breaking;
    • The middleware layer keeps windowed statistics to cap total retried traffic, e.g. at 1.1x the original;
  • The service implements circuit breaking toward its downstream, and the downstream rate-limits upstream traffic so it cannot be overwhelmed, protecting stability from both sides;
  • The Feature service's practice: atomically increment a counter when each hedged request is sent and decrement it when it returns; if the counter exceeds a threshold (request latency x QPS x 5%), no hedged request is issued. This limits load by bounding the number of concurrent hedged requests (the inflight/BackupLimit check in the code above); a sketch of the threshold calculation follows below;
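
A minimal sketch of how that threshold could be derived, with illustrative numbers (the constants are assumptions, not the service's real figures; in practice BackupLimit would live next to the hedging code shown earlier):

package backuprequest

// Little's law: concurrent hedged requests ≈ hedged QPS × request latency,
// with hedged QPS capped at about 5% of total QPS.
var (
	peakQPS        = 100000.0 // illustrative peak QPS of the caller
	p95Latency     = 0.005    // seconds, roughly the 5 ms backupTimeout
	hedgedFraction = 0.05     // allow at most ~5% extra traffic
)

// BackupLimit caps the number of hedged requests in flight (≈ 25 here).
var BackupLimit = int64(peakQPS * hedgedFraction * p95Latency)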

Language GC optimization

Background

After the hedged-request mechanism was introduced, latency improved dramatically; to close the remaining gap we carried out a final round of optimization.

Optimization

Step 1: Observe the phenomenon and form an initial hypothesis. Analyzing latency in the trace view during the Feature service's morning peak showed that, during the spike window, the program's GC pause time (the total overlap between GC cycles and request lifetimes) reached nearly 50+ ms (left figure), and during GC most goroutines spent a long time in assist marking (mark assist, the light-green segments in the right figure). The GC problem was severe, so we suspected the latency spikes were caused by GC.
[Figures: trace views during the spike window (left: GC pause time; right: goroutines in mark assist, shown in light green)]

Step 2: Analyze the suspected cause in a targeted way

  • The computing-module service runs about 2 GC cycles every 10 seconds on average. The GC frequency is low, but within the first 10 s of every minute there is a clear gap between the pressure of the first GC and the second (measured by the number of goroutines doing mark assist), so we suspected that the high pressure of the first GC in the first 10 s of each minute was causing the latency spikes.
  • From the Go GC design, a goroutine is recruited for assist marking when it allocates heap memory too quickly, and the computing module's per-minute cache expiry triggers a flood of downstream accesses and therefore many more object allocations. Together this explains why the first GC in the first 10 s of each minute is under unusual pressure.
About GC assist marking (mark assist)
To prevent other goroutines from allocating heap memory faster than the collector can mark during the marking phase, allocating goroutines are required to take on part of the marking work; this is called mutator assist. During marking, every time a goroutine allocates memory its "debt" (gcAssistBytes) is updated: the faster it allocates, the larger gcAssistBytes becomes. Multiplying this debt by the global assist ratio (assistWorkPerByte), a computation called revise, gives the amount of marking work (gcAssistAlloc) the goroutine must perform for this allocation.
Quoted from: https://wudaijun.com/2020/01/go-gc-keypoint-and-monitor/
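
One way to watch for this kind of GC pressure in production (an illustrative sketch; the analysis in this article used the execution trace and pprof) is to run with GODEBUG=gctrace=1, or to sample runtime.MemStats periodically:

package main

import (
	"fmt"
	"runtime"
	"time"
)

// Sample GC statistics every 10 seconds; a pause-time or GC-CPU jump that
// lines up with the per-minute cache expiry would support the hypothesis
// that the latency spikes are GC-driven.
func main() {
	var prev runtime.MemStats
	for range time.Tick(10 * time.Second) {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		fmt.Printf("gc_cycles=%d pause_delta=%v gc_cpu=%.2f%% heap=%dMB\n",
			m.NumGC-prev.NumGC,
			time.Duration(m.PauseTotalNs-prev.PauseTotalNs),
			m.GCCPUFraction*100,
			m.HeapAlloc>>20)
		prev = m
	}
}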

Step 3: Design optimizations based on the analysis. To reduce the number of object allocations, we examined the pprof heap profile:

  • Under the inuse_objects metric, the cache library accounts for the largest share;
  • Under the alloc_objects metric, JSON serialization accounts for the most;

It was impossible to tell which of the two was actually driving the allocations, so we optimized both separately.
[Figures: pprof heap profiles (inuse_objects and alloc_objects)]
After surveying the open-source JSON and cache libraries in the industry (survey notes: https://segmentfault.com/a/1190000041591284 ), we replaced the original libraries with GJSON (better performance, fewer allocations) and BigCache (zero-GC).
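
A small sketch of how the two replacements are typically used (the key and payload are illustrative, not the service's real code):

package main

import (
	"fmt"
	"time"

	"github.com/allegro/bigcache"
	"github.com/tidwall/gjson"
)

func main() {
	// GJSON reads fields straight out of the raw JSON bytes without
	// unmarshalling into intermediate structs, so far fewer objects are allocated.
	raw := `{"feature":"grid_density","value":0.42}`
	fmt.Println(gjson.Get(raw, "value").Float())

	// BigCache stores entries in large pre-allocated byte slices, so the GC
	// marker sees almost no per-entry pointers (hence "zero-GC").
	cache, err := bigcache.NewBigCache(bigcache.DefaultConfig(time.Minute))
	if err != nil {
		panic(err)
	}
	_ = cache.Set("grid_density:1234:202206011200", []byte(raw))
	if b, err := cache.Get("grid_density:1234:202206011200"); err == nil {
		fmt.Println(string(b))
	}
}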

Effect

  • Replacing the JSON library with GJSON had no effect;
  • Replacing the cache library with BigCache had an obvious effect: inuse_objects dropped from 2-3 million to 120k, and the spikes largely disappeared;

[Figure: computing-module latency (light: GJSON, dark: BigCache)]

[Figure: latency of the API module as seen by upstream]

About Golang GC

It is commonly believed that Go triggers GC when the heap has grown to twice its size at the end of the previous GC. In practice, the Pacer algorithm uses the heap growth rate, the object marking rate, and other factors to predict when to start, so GC begins before the heap actually doubles; in the ideal case it consumes only 25% of the CPU and finishes marking exactly as the heap reaches twice its previous size.

More on the Pacer pacing algorithm: https://golang.design/under-the-hood/zh-cn/part2runtime/ch08gc/pacing/

However, Pacer can only hold GC CPU usage to 25% in steady state. Once the service hits a transient, such as a scheduled task or a cache expiry, Pacer's steady-state prediction breaks down and the marking rate falls behind the allocation rate. To still meet the GC goal (finish before the heap doubles), large numbers of goroutines are recruited into mark assist, which blocks their normal work. This is why, at present, the marking phase of Go's GC has the largest impact on latency.

About the GC Pacer redesign
[Figure: from the GC Pacer redesign proposal]
Quoted from: https://go.googlesource.com/proposal/+/a216b56e743c5b6b300b3ef1673ee62684b5b63b/design/44167-gc-pacer-redesign.md

Final Effect

P99 latency of the API module dropped from 20-50 ms to 6 ms, and the access error rate dropped from 1‰ to 1‱.
[Figure: latency of the API module as seen by upstream services]

Summary

  1. When analyzing a latency problem, you may observe two metrics in monitoring or logs whose trends match perfectly and mistake them for cause and effect, when in fact both are surface symptoms driven by a third variable: correlated, but not causal;
  2. Compared with services whose latency is measured in hundreds of milliseconds, low-latency services are strongly affected by CPU usage (thread queuing, scheduling overhead, resource contention, and so on); do not ignore this when optimizing performance;
  3. For high-concurrency, low-latency services, latency may appear to depend only on the downstream, but service-internal overheads such as serialization and GC can affect latency substantially;
  4. Performance optimization starts with observability: distributed tracing, standardized metrics, Go pprof tools, and so on lay the groundwork for investigation. Analyze and hypothesize from reliable data gathered from multiple sources, and only then optimize and verify, instead of groping blindly and hoping to solve the problem by luck;
  5. Knowing a little simple modeling helps a great deal in the analysis and hypothesis stage of latency optimization;
  6. Combine theory with the actual problem; read more articles, take part in sharing sessions, communicate, learn more technologies, and broaden your horizons. Every discussion and question is an opportunity for deeper thinking. Thanks to @李心宇, @刘奇, and @Gong Xun for the practice that followed those discussions;
  7. Not all performance optimization is alike: latency optimization is more complex and more difficult than optimizing resources such as CPU and memory. Go ships the convenient and easy-to-use pprof tool for resource optimization, but latency optimization, especially of long-tail problems, is hard, so stay calm and observe patiently throughout the process;
  8. Pay attention to latency caused by contention for shared resources between requests, and not only in downstream services: the service's own CPU and memory (which drives GC) are shared resources too.

References

All references are internal articles and are omitted.

