
Author | Junlong Liu

Shopee Digital Purchase & Local Services Engineering

This article is 1,743 words; reading time is about 6 minutes.

Contributor Foreword

I came across Holmes during development work. To keep our system stable, I needed a performance troubleshooting tool, specifically one that monitors performance and preserves the scene (saves profiles) when something goes wrong. When I searched for open source libraries for this, few were available. I then found Holmes in the MOSN community. It is largely feature-complete and highly extensible, and its GCHeapDump feature, which stands out in the industry, is very useful for troubleshooting memory growth.

I first learned about Holmes in late 2021 and then began learning about the MOSN community behind it. As a performance troubleshooting tool, Holmes's core function is to detect abnormal performance indicators promptly and profile the system.

Since Holmes was still in its early days, there was little documentation beyond the README. Holmes also lacked some features at the time, such as dynamic configuration adjustment and reporting, and it had not yet published its first release. I was interested in these areas and had some understanding of them, so I opened several issues on GitHub for discussion, and the community responded very quickly. Later, with guidance from senior community members, I submitted PRs, and through Holmes's code I learned a lot about how open source components are designed.

So I decided to get involved in the open source community and contribute code that addresses real needs. After gaining some understanding and experience, I discussed it with Rende, a senior community member, and wrote up this article.

This article introduces Holmes's usage scenarios, a quick start example, the monitoring types, the design principles, the extended features, and how to build a simple performance inspection system with Holmes. Comments and suggestions are welcome.

Holmes usage scenarios

For performance spikes, we usually analyze with pprof, Go's official built-in profiling package. The difficulty is that spikes pass quickly and developers can rarely capture the scene in time: by the time you receive the alert, get out of bed, turn on your computer, and connect to the VPN, the system may already have restarted three or four times.

Holmes, from the MOSN community, is a lightweight performance monitoring tool written in Go. When an application's performance indicators fluctuate abnormally, Holmes captures the scene immediately, so that the next day at work you can calmly sip your wolfberry tea while tracking down the root cause.

Quick Start

Using Holmes is as simple as adding the following code to your system initialization logic:

    // Requires the imports "time" and "mosn.io/holmes".
    // Configure the rules.
    h, _ := holmes.New(
        holmes.WithCollectInterval("5s"), // metric collection interval
        holmes.WithDumpPath("/tmp"),      // directory where profiles are saved

        holmes.WithCPUDump(10, 25, 80, 2*time.Minute),                     // CPU monitoring rule
        holmes.WithMemDump(30, 25, 80, 2*time.Minute),                     // heap memory monitoring rule
        holmes.WithGCHeapDump(10, 20, 40, 2*time.Minute),                  // GC-cycle-based heap memory monitoring rule
        holmes.WithGoroutineDump(500, 25, 20000, 100*1000, 2*time.Minute), // goroutine count monitoring rule
    )
    // Enable all profile types and start monitoring.
    h.EnableCPUDump().
        EnableGoroutineDump().
        EnableMemDump().
        EnableGCHeapDump().
        Start()

An API call such as holmes.WithGoroutineDump(min, diff, abs, max, 2*time.Minute) means:

A dump is triggered when the goroutine count is at least min and exceeds (100+diff)% of the historical value, or when it exceeds abs.

When the goroutine count is greater than max, Holmes skips the dump, because a goroutine dump is very expensive when there are too many goroutines.

2*time.Minute is the minimum interval between two dumps, which keeps frequent profiling from hurting performance.
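The minimum interval between two dumps can be pictured as a simple cooldown gate. The following is a minimal sketch using only the standard library; cooldownGate, Allow, and simulate are hypothetical names chosen for illustration and are not part of the Holmes API.

```go
package main

import (
	"fmt"
	"time"
)

// cooldownGate allows an action at most once per cooldown period.
// It is a hypothetical helper, not a Holmes type.
type cooldownGate struct {
	cooldown time.Duration
	last     time.Time // zero value means the action has never fired
}

// Allow reports whether the action may run at time now and, if so,
// records now as the time of the latest run.
func (g *cooldownGate) Allow(now time.Time) bool {
	if !g.last.IsZero() && now.Sub(g.last) < g.cooldown {
		return false
	}
	g.last = now
	return true
}

// simulate replays trigger times (seconds after an arbitrary start)
// through a fresh gate and returns the decision for each trigger.
func simulate(cooldownSeconds int, triggerSeconds []int) []bool {
	g := &cooldownGate{cooldown: time.Duration(cooldownSeconds) * time.Second}
	start := time.Unix(1000000, 0)
	decisions := make([]bool, len(triggerSeconds))
	for i, s := range triggerSeconds {
		decisions[i] = g.Allow(start.Add(time.Duration(s) * time.Second))
	}
	return decisions
}

func main() {
	// With a 2-minute cooldown, triggers at 30s and 119s are suppressed.
	fmt.Println(simulate(120, []int{0, 30, 119, 120, 300}))
}
```

The gate compares against the time of the last dump that actually ran, so suppressed triggers do not extend the cooldown.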

For more use cases, see the Holmes use cases documentation at the end of this article.

Profile Types

Holmes supports the following five Profile types, which users can configure as needed.

Mem: memory allocation

CPU: CPU usage

Thread: number of threads

Goroutine: number of goroutines

GCHeap: memory allocation based on GC cycle monitoring

Indicator collection

The Mem, CPU, Thread, and Goroutine types collect the application's current performance indicators at a fixed interval, the CollectInterval configured by the user, while GCHeap collects its indicators once per GC cycle.

This section describes both collection methods.

Collection at the CollectInterval period

Holmes collects application metrics at regular intervals and stores them in a fixed-size circular linked list.

[Image]
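To illustrate the idea of a fixed-size circular store, here is a minimal sketch that uses a slice-backed ring instead of a linked list. The ring type and its methods are assumptions made for this example; they do not reproduce Holmes's internal data structure.

```go
package main

import "fmt"

// ring is a fixed-size circular buffer of metric samples.
// Once full, each new sample overwrites the oldest one.
type ring struct {
	data []int
	next int // index that the next sample will overwrite
	size int // number of valid samples, at most len(data)
}

// newRing creates a ring; capacity must be greater than zero.
func newRing(capacity int) *ring {
	return &ring{data: make([]int, capacity)}
}

// push stores a sample, evicting the oldest one if the ring is full.
func (r *ring) push(v int) {
	r.data[r.next] = v
	r.next = (r.next + 1) % len(r.data)
	if r.size < len(r.data) {
		r.size++
	}
}

// avg returns the integer average of the stored samples, or 0 if empty.
func (r *ring) avg() int {
	if r.size == 0 {
		return 0
	}
	sum := 0
	for i := 0; i < r.size; i++ {
		sum += r.data[i]
	}
	return sum / r.size
}

func main() {
	r := newRing(3)
	for _, sample := range []int{10, 20, 30} {
		r.push(sample)
	}
	fmt.Println("average of 10, 20, 30:", r.avg())
	r.push(70) // overwrites the oldest sample, 10
	fmt.Println("average of 20, 30, 70:", r.avg())
}
```

A fixed capacity keeps memory use constant no matter how long the process runs, and the average over the stored samples gives a baseline to compare new samples against.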

Collection based on GC cycle

In some scenarios, a timed memory dump cannot capture the scene. For example, an application may allocate a large amount of memory within one CollectInterval period and release it quickly. The memory usage that Holmes samples before and after that period then shows little fluctuation, which does not reflect what actually happened.

To handle this case, Holmes provides a profile type based on the GC cycle. It dumps a profile in each of the two GC cycles before and after heap usage peaks, and developers can then use the pprof --base command to compare the two profiles and see how heap memory differs between the two moments.

Data collected per GC cycle is also stored in a circular linked list.
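One common way to observe GC cycles in Go without polling is a finalizer on a throwaway object that re-arms itself each time it is collected. The sketch below shows that technique with the standard library only. It is offered as an illustration of per-GC-cycle sampling, not as a description of Holmes's implementation, and gcSentinel, armGCNotifier, and observeGCCycles are hypothetical names.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// gcSentinel is a small heap object whose only purpose is to be collected.
type gcSentinel struct {
	notify chan<- struct{}
}

// armGCNotifier allocates an unreachable sentinel with a finalizer.
// The next GC cycle finds it unreachable and schedules the finalizer.
func armGCNotifier(notify chan<- struct{}) {
	s := &gcSentinel{notify: notify}
	runtime.SetFinalizer(s, onSentinelCollected)
}

// onSentinelCollected runs after a GC cycle has found the sentinel dead.
// It re-arms first, so a new sentinel exists before anyone is notified.
func onSentinelCollected(s *gcSentinel) {
	armGCNotifier(s.notify)
	select {
	case s.notify <- struct{}{}:
	default: // receiver is busy; drop this notification
	}
}

// observeGCCycles forces n collections and returns how many of them
// were observed through the finalizer-based notifier.
func observeGCCycles(n int) int {
	notify := make(chan struct{}, 1)
	armGCNotifier(notify)
	seen := 0
	for i := 0; i < n; i++ {
		runtime.GC() // stand-in for a naturally occurring GC cycle
		select {
		case <-notify:
			var m runtime.MemStats
			runtime.ReadMemStats(&m)
			fmt.Printf("GC cycle observed: NumGC=%d HeapAlloc=%d bytes\n", m.NumGC, m.HeapAlloc)
			seen++
		case <-time.After(5 * time.Second):
			return seen // finalizer did not run in time
		}
	}
	return seen
}

func main() {
	fmt.Println("observed cycles:", observeGCCycles(3))
}
```

Sampling heap statistics at this point yields one data point per GC cycle, which is what makes short-lived allocation bursts between two timer ticks visible.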

Rule judgment

This section describes how Holmes uses the rules to decide that the system is abnormal.

Threshold meaning

Each profile type can be configured with four parameters, min, diff, abs, and coolDown, with the following meanings:

When the current indicator is less than min, it is not regarded as abnormal.

When the current indicator is greater than (100+diff)% of the historical indicator, the system has fluctuated, which is regarded as abnormal.

When the current indicator is greater than abs (the absolute value), it is regarded as abnormal.

coolDown is the minimum interval between two dumps.

The CPU and Goroutine profile types also provide a max parameter, for the following reasons:

CPU profiling costs roughly 5% in performance, so when CPU usage is already too high, profiling should not run, or it may drag the system down.

When there are too many goroutines, a goroutine dump is very expensive and triggers a stop-the-world (STW) pause, which can bring the system down. (For details, see the references at the end of this article.)
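The thresholds above can be combined into a single decision function. The sketch below is one possible reading of the rules: it assumes the historical indicator is an average passed in by the caller, and that the max check comes first. The rule type and shouldDump are hypothetical names, not Holmes's internal code.

```go
package main

import "fmt"

// rule holds the thresholds described in the text.
// A max of 0 means that no upper limit is configured.
type rule struct {
	min, diff, abs, max int
}

// shouldDump decides whether the current value cur is abnormal compared
// with the historical average avg, and returns a short reason.
func (r rule) shouldDump(cur, avg int) (bool, string) {
	switch {
	case r.max > 0 && cur > r.max:
		return false, "above max: dump skipped because it would be too costly"
	case cur < r.min:
		return false, "below min: not regarded as abnormal"
	case cur > r.abs:
		return true, "above abs"
	case cur*100 > avg*(100+r.diff):
		return true, "more than (100+diff)% of the historical average"
	}
	return false, "within the normal range"
}

func main() {
	// Thresholds taken from the goroutine rule in the quick start example.
	r := rule{min: 500, diff: 25, abs: 20000, max: 100 * 1000}
	samples := []struct{ cur, avg int }{
		{400, 300}, {600, 500}, {700, 500}, {25000, 24000}, {150000, 1000},
	}
	for _, s := range samples {
		dump, reason := r.shouldDump(s.cur, s.avg)
		fmt.Printf("cur=%d avg=%d dump=%v (%s)\n", s.cur, s.avg, dump, reason)
	}
}
```

Comparing cur*100 with avg*(100+diff) keeps the percentage check in integer arithmetic and avoids rounding surprises from floating-point division.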

Warming up

When Holmes starts, it collects each indicator ten times at the CollectInterval period. Samples collected during this warm-up are only stored in the circular linked list; no rule judgment is performed on them.
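The effect of the warm-up can be sketched as follows: samples are always stored, but judgment only begins once ten samples exist. The detector type is hypothetical, keeps an unbounded history for brevity rather than a fixed-size circular list, and uses only a diff-style check.

```go
package main

import "fmt"

// warmupSamples is the number of samples stored before any judgment.
const warmupSamples = 10

// detector stores samples and judges them only after the warm-up.
// For brevity the history is unbounded; a real monitor would cap it.
type detector struct {
	history []int
	diff    int // allowed growth over the historical average, in percent
}

// observe stores the sample and reports whether it was judged abnormal.
// During the warm-up it always returns false.
func (d *detector) observe(cur int) bool {
	abnormal := false
	if len(d.history) >= warmupSamples {
		sum := 0
		for _, v := range d.history {
			sum += v
		}
		avg := sum / len(d.history)
		abnormal = cur*100 > avg*(100+d.diff)
	}
	d.history = append(d.history, cur)
	return abnormal
}

func main() {
	d := &detector{diff: 25}
	// The spike at the third sample falls inside the warm-up and is ignored.
	for i, v := range []int{100, 100, 900, 100, 100, 100, 100, 100, 100, 100, 900, 100} {
		fmt.Printf("sample %d value=%d abnormal=%v\n", i+1, v, d.observe(v))
	}
}
```

Without the warm-up, the first few samples would be compared against an empty or nearly empty history, and ordinary startup activity could trigger dumps.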

Extensions

In addition to basic monitoring, Holmes provides some extended features:

Incident reporting

You can achieve the following functions by implementing Reporter:

Send an alert message when Holmes triggers a Dump operation.

Upload profiles elsewhere, either for analysis or so that they are not lost if the instance is destroyed.

    type ReporterImpl struct{}

    // Report is called by Holmes when a dump is triggered.
    func (r *ReporterImpl) Report(pType string, buf []byte, reason string, eventID string) error {
        // do something, e.g. send an alert or upload buf
        return nil
    }
    ......
    r := &ReporterImpl{} // an implementation of the holmes.ProfileReporter interface
    h, _ := holmes.New(
        holmes.WithProfileReporter(r),
        holmes.WithDumpPath("/tmp"),
        holmes.WithLogger(holmes.NewFileLog("/tmp/holmes.log", mlog.INFO)),
        holmes.WithBinaryDump(),
        holmes.WithMemoryLimit(100*1024*1024), // 100MB
        holmes.WithGCHeapDump(10, 20, 40, time.Minute),
    )

Dynamic configuration

You can update Holmes's configuration at runtime with the Set method. It is used the same way as the New method at initialization.

Some settings cannot be changed dynamically, such as the number of cores. Changing that parameter while the system is running would cause large swings in the measured CPU usage and trigger a dump.

    h.Set(
        holmes.WithCollectInterval("2s"),
        holmes.WithGoroutineDump(10, 10, 50, 90, time.Minute),
    )
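A constructor and a runtime setter that accept the same options are commonly built with Go's functional options pattern. The sketch below shows the pattern in isolation; monitor, newMonitor, set, and withCollectInterval are hypothetical names, and the code makes no claim about how Holmes implements New and Set.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// options is the configuration that both construction and updates modify.
type options struct {
	collectInterval time.Duration
}

// option mutates the configuration and may reject invalid input.
type option func(*options) error

// withCollectInterval parses a duration string such as "5s".
func withCollectInterval(s string) option {
	return func(o *options) error {
		d, err := time.ParseDuration(s)
		if err != nil {
			return err
		}
		o.collectInterval = d
		return nil
	}
}

// monitor is a hypothetical component configured through options.
type monitor struct {
	mu   sync.RWMutex
	opts options
}

// newMonitor applies the options on top of the defaults.
func newMonitor(opts ...option) (*monitor, error) {
	m := &monitor{opts: options{collectInterval: 5 * time.Second}}
	if err := m.set(opts...); err != nil {
		return nil, err
	}
	return m, nil
}

// set applies options at runtime. It works on a copy, so a failing
// option leaves the current configuration untouched.
func (m *monitor) set(opts ...option) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	next := m.opts
	for _, apply := range opts {
		if err := apply(&next); err != nil {
			return err
		}
	}
	m.opts = next
	return nil
}

// collectInterval returns the interval currently in effect.
func (m *monitor) collectInterval() time.Duration {
	m.mu.RLock()
	defer m.mu.RUnlock()
	return m.opts.collectInterval
}

func main() {
	m, err := newMonitor(withCollectInterval("5s"))
	if err != nil {
		panic(err)
	}
	fmt.Println("initial interval:", m.collectInterval())
	if err := m.set(withCollectInterval("2s")); err != nil {
		panic(err)
	}
	fmt.Println("updated interval:", m.collectInterval())
}
```

Because the constructor delegates to the setter, every option is validated by the same code path, whether it arrives at startup or from a later configuration push.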

Practical case

With the Set method, Holmes can easily be connected to your company's configuration center, with Holmes acting as the data plane and the configuration center as the control plane. Connected to an alerting system (email, SMS, and so on) as well, it forms a simple monitoring system.

The specific structure is as follows:

[Image]

Holmes V1.0 released

This article briefly introduces the usage and principles of Holmes. I hope Holmes helps you improve the stability of your applications.

Holmes V1.0 was officially released a few weeks ago. As a contributor and user, I highly recommend trying this small library. If you have any questions, please come and ask in the community.

Holmes is an open source continuous profiling component for Go from the MOSN community. It automatically detects anomalies in resources such as CPU, memory, and goroutines, and dumps profiles of the scene for later analysis and troubleshooting. It also supports uploading profiles to an automatic analysis platform for automated diagnosis and alerting.

"Release Report": https://github.com/mosn/holmes/releases/tag/v1.0.0

"Introduction to the principle of Holmes": https://mosn.io/blog/posts/mosn-holmes-design/


"References"

[1] "Holmes Documentation" https://github.com/mosn/holmes

[2] "Unattended automatic dump (1)" https://xargin.com/autodumper-for-go/

[3] "Unattended automatic dump (2)" https://xargin.com/autodumper-for-go-ii/

[4] "Pprof heap profile implementation mechanism in go language" https://uncledou.site/2022/go-pprof-heap/

[5] "goroutines pprofiling STW" https://github.com/golang/go/issues/33250

[6] "Holmes Use Case Documentation" https://github.com/mosn/holmes/tree/master/example

[7] "go pprof performance loss" https://medium.com/google-cloud/continuous-profiling-of-go-programs-96d4416af77b
