Article | Junlong Liu

Shopee Digital Purchase & Local Services Engineering

Read this article in 1743 words in 6 minutes

Preface by Contributor

I learned about Holmes during development. To keep our system stable we needed a performance troubleshooting tool, one that could also preserve the scene when something went wrong. When I searched for open source libraries, there were not many to choose from. I then found Holmes in the MOSN community: it is feature-complete and highly extensible. In particular, GCHeapDump, an industry-leading feature, is very useful for tracking down memory growth.

I first came across the Holmes component at the end of 2021 and then got to know it better through the MOSN community. As a performance troubleshooting tool, Holmes' core function is to detect anomalies in performance metrics in time and to profile the system.

Because Holmes was still in its infancy, there was not much documentation apart from the README. Some features were also missing at the time, such as dynamic configuration and reporting. Holmes had not yet published its first release, but I was interested in this area and knew something about it, so I opened several issues on GitHub, and the community responded very quickly. I then submitted PRs under the guidance of senior community members, and learned a lot about the design of open source components from Holmes' code.

So I decided to join the open source community and contribute code that addresses real needs. Once I had gained some understanding and experience, and after discussing it with Rende, a senior member of the community, I wrote this article to share what I learned.

This article covers Holmes' usage scenarios, a quick start example, the monitoring types, design principles, extended features, and how to build a simple performance troubleshooting system with Holmes. Comments and suggestions are welcome.

Usage Scenarios of Holmes

When a system has performance problems, we usually analyze them with Go's official built-in pprof package. The difficulty lies with short-lived performance spikes, where it is hard for developers to preserve the scene in time: by the time you receive the alert, climb out of bed, open your computer, and connect to the VPN, the system may already have restarted three or four times.

Holmes, from the MOSN community, is a lightweight performance monitoring tool written in Go. When an application's performance metrics fluctuate abnormally, Holmes preserves the scene right away, so that you can calmly sip wolfberry tea at work the next day while tracing the root cause of the problem.

Quick Start

Using Holmes is as simple as adding the following code to your system initialization logic:

    // Configure rules
    h, _ := holmes.New(
        holmes.WithCollectInterval("5s"), // Metric collection interval
        holmes.WithDumpPath("/tmp"),      // Directory where Profiles are saved

        holmes.WithCPUDump(10, 25, 80, 2*time.Minute),                     // CPU monitoring rule
        holmes.WithMemDump(30, 25, 80, 2*time.Minute),                     // Heap memory monitoring rule
        holmes.WithGCHeapDump(10, 20, 40, 2*time.Minute),                  // Heap memory monitoring rule based on GC cycles
        holmes.WithGoroutineDump(500, 25, 20000, 100*1000, 2*time.Minute), // Rule for the number of Goroutines
    )
    // Enable all monitors
    h.EnableCPUDump().
        EnableGoroutineDump().
        EnableMemDump().
        EnableGCHeapDump().Start()

holmes.WithGoroutineDump(min, diff, abs, max, 2*time.Minute)

A Dump is triggered when the current number of Goroutines is greater than min and either exceeds the historical value by more than diff percent or is greater than abs.

When the number of Goroutines is greater than max, Holmes skips the Dump, because a Goroutine Dump is expensive when there are too many Goroutines.

2*time.Minute is the minimum interval between two Dumps, which avoids profiling too frequently.
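The minimum interval between Dumps can be pictured as a small gate in front of the Dump operation. The following is only a sketch of that idea: dumpGate, allow, and demoGate are names made up for this example and are not part of the Holmes API.

```go
package main

import (
	"fmt"
	"time"
)

// dumpGate is a hypothetical helper (not part of Holmes) that enforces a
// minimum interval between two dumps.
type dumpGate struct {
	coolDown time.Duration
	lastDump time.Time // the zero value means "no dump yet"
}

// allow reports whether a dump may run at time now and, if so, records it.
func (g *dumpGate) allow(now time.Time) bool {
	if !g.lastDump.IsZero() && now.Sub(g.lastDump) < g.coolDown {
		return false // still cooling down
	}
	g.lastDump = now
	return true
}

// demoGate replays three trigger times against a 2-minute cooldown.
func demoGate() [3]bool {
	g := &dumpGate{coolDown: 2 * time.Minute}
	start := time.Date(2022, 1, 1, 0, 0, 0, 0, time.UTC)
	return [3]bool{
		g.allow(start),                       // first trigger: allowed
		g.allow(start.Add(30 * time.Second)), // too soon: rejected
		g.allow(start.Add(3 * time.Minute)),  // cooldown elapsed: allowed
	}
}

func main() {
	fmt.Println(demoGate()) // [true false true]
}
```

Passing the current time in as an argument, rather than calling time.Now inside allow, keeps the gate easy to test.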

See the Holmes Use Case documentation at the end of this article for more use cases.

Profile Types

Holmes supports the following five Profile types. You can configure them as required.

Mem: memory allocation

CPU: CPU usage

Thread: number of threads

Goroutine: number of Goroutines

GCHeap: memory allocation, monitored per GC cycle

Metric Collection

Mem, CPU, Thread, and Goroutine sample the current value at a fixed interval set by the user-defined CollectInterval, while GCHeap samples once per GC cycle.

This section looks at these two ways of collecting metrics.

Collect data periodically according to CollectInterval

Holmes collects application metrics at regular intervals and stores them using a fixed size circular linked list.
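As an illustration of this storage scheme, here is a minimal sketch built on the standard library's container/ring, which is itself a circular linked list. The history type and its methods are hypothetical names for this example, and Holmes' internal data structure may be implemented differently.

```go
package main

import (
	"container/ring"
	"fmt"
)

// history keeps the most recent samples in a fixed-size circular linked list.
type history struct {
	cur *ring.Ring // element that the next sample will overwrite
}

func newHistory(size int) *history {
	return &history{cur: ring.New(size)}
}

// push stores a sample, overwriting the oldest one once the ring is full.
func (h *history) push(v int) {
	h.cur.Value = v
	h.cur = h.cur.Next()
}

// avg returns the integer average of the stored samples, or 0 if there are none.
func (h *history) avg() int {
	sum, n := 0, 0
	h.cur.Do(func(x interface{}) {
		if v, ok := x.(int); ok { // slots that were never written hold nil
			sum += v
			n++
		}
	})
	if n == 0 {
		return 0
	}
	return sum / n
}

func main() {
	h := newHistory(3)
	for _, v := range []int{10, 20, 30, 40} {
		h.push(v) // the fourth push overwrites the oldest sample (10)
	}
	fmt.Println(h.avg()) // 30, the average of 20, 30 and 40
}
```

Because the ring has a fixed size, memory use stays constant no matter how long the application runs.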

Collect data according to GC cycle

In some scenarios, a timer-based memory dump cannot capture the scene. For example, an application may allocate a large amount of memory within one CollectInterval and release it quickly. The memory usage Holmes samples before and after that interval then shows no large fluctuation, which does not reflect what actually happened.

To handle this, Holmes provides a Profile type based on GC cycles, which dumps a Profile in each of the two GC cycles before and after a spike in heap memory usage. The developer can then use the go tool pprof -base command to compare the heap between the two moments.

Data collected per GC cycle is also stored in the circular linked list.
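To make the idea of sampling per GC cycle rather than per timer tick concrete, the sketch below uses a well-known Go trick: a finalizer that re-arms itself, so a callback runs roughly once per GC cycle. This is an assumption made for illustration. watchGC, sentinel, heapInuse, and demoGCWatch are names invented here, and the sketch does not claim to mirror Holmes' internals.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// sentinel is a throwaway heap object. Its finalizer runs after a GC cycle
// has found the object unreachable.
type sentinel struct {
	onGC func()
}

// watchGC arranges for onGC to be called roughly once per GC cycle.
func watchGC(onGC func()) {
	s := &sentinel{onGC: onGC}
	runtime.SetFinalizer(s, func(s *sentinel) {
		s.onGC()
		watchGC(s.onGC) // re-arm with a fresh object for the next cycle
	})
}

// heapInuse returns the number of bytes in in-use heap spans.
func heapInuse() uint64 {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	return ms.HeapInuse
}

// demoGCWatch forces one GC cycle and reports whether a notification arrived.
func demoGCWatch() bool {
	cycles := make(chan struct{}, 1)
	watchGC(func() {
		select {
		case cycles <- struct{}{}:
		default: // never block the finalizer goroutine
		}
	})
	runtime.GC() // force a cycle so the demo does not depend on allocation rate
	select {
	case <-cycles:
		fmt.Println("heap in use after GC:", heapInuse(), "bytes")
		return true
	case <-time.After(5 * time.Second):
		return false
	}
}

func main() {
	fmt.Println("GC notification received:", demoGCWatch())
}
```

The callback only signals a channel. The heavier work of reading memory statistics happens on the receiving side, which keeps the finalizer goroutine free.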

Rule Evaluation

This section describes how Holmes uses rules to decide whether the system is behaving abnormally.

Threshold Meanings

Each Profile can be configured with four parameters: min, diff, abs, and coolDown. coolDown is the minimum interval between two Dumps.

If the current value is smaller than min, it is not treated as an anomaly.

If the current value is greater than (100 + diff)% of the historical value, the system is fluctuating and the value is treated as an anomaly.

If the current value is greater than abs (an absolute threshold), it is treated as an anomaly.

The CPU and Goroutine Profile types also provide a max parameter, for the following reasons:

CPU Profiling has a performance cost of about 5%, so it should not be run when CPU usage is already too high, or it will drag the system down.

When there are too many Goroutines, a Goroutine Dump is expensive and triggers STW (stop-the-world), which can bring the system down. (See the reference article at the end of this article for details.)
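The thresholds above can be combined into a single check. The sketch below is one possible reading of these rules, not Holmes' actual code: the rule type and shouldDump are hypothetical names, and the historical value is assumed to be the average of the stored samples.

```go
package main

import "fmt"

// rule holds the thresholds described above; the field names are illustrative.
type rule struct {
	min  int // below this value nothing is treated as abnormal
	diff int // allowed growth over the historical average, in percent
	abs  int // values above this are always treated as abnormal
	max  int // above this value the dump is skipped; 0 disables the check
}

// shouldDump reports whether the current value cur is abnormal given the
// historical average avg (assumed here to come from the stored samples).
func (r rule) shouldDump(cur, avg int) bool {
	if cur < r.min {
		return false // too small to matter
	}
	if r.max > 0 && cur > r.max {
		return false // dumping now would be too expensive
	}
	if cur > r.abs {
		return true // absolute threshold exceeded
	}
	return cur*100 > avg*(100+r.diff) // grew by more than diff percent
}

func main() {
	r := rule{min: 500, diff: 25, abs: 20000, max: 100 * 1000}
	fmt.Println(r.shouldDump(400, 100))      // false: below min
	fmt.Println(r.shouldDump(1000, 700))     // true: more than 125% of the average
	fmt.Println(r.shouldDump(1000, 900))     // false: within 125% of the average
	fmt.Println(r.shouldDump(25000, 24000))  // true: above abs
	fmt.Println(r.shouldDump(150000, 24000)) // false: above max, dump skipped
}
```

The values in main reuse the Goroutine thresholds from the quick start example (500, 25, 20000, 100*1000).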

Warm-Up

After Holmes starts, it collects metrics ten times at the CollectInterval. Metrics collected during this period are only stored in the circular linked list and are not evaluated against the rules.
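A warm-up phase like this can be expressed as a simple counter in front of rule evaluation. The following sketch assumes that reading. warmup, observe, and evaluatedCount are hypothetical names, not Holmes identifiers.

```go
package main

import "fmt"

// warmup counts samples until enough history has been stored.
type warmup struct {
	need int // number of samples to store before rules are evaluated
	seen int // number of warm-up samples observed so far
}

// observe records one sample and reports whether rules should be evaluated
// for it. It returns false for the first need samples.
func (w *warmup) observe() bool {
	if w.seen < w.need {
		w.seen++
		return false
	}
	return true
}

// evaluatedCount feeds n samples through a 10-sample warm-up and returns how
// many of them would be checked against the rules.
func evaluatedCount(n int) int {
	w := &warmup{need: 10}
	evaluated := 0
	for i := 0; i < n; i++ {
		if w.observe() {
			evaluated++
		}
	}
	return evaluated
}

func main() {
	fmt.Println(evaluatedCount(12)) // 2: only samples 11 and 12 are evaluated
}
```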

Extended Features

In addition to basic monitoring, Holmes offers some extended features:

Event Reporting

By implementing a Reporter you can:

Send an alert when Holmes triggers a Dump.

Upload Profiles elsewhere, either for analysis or so that they are not lost if the instance is destroyed.

    type ReporterImpl struct{}

    func (r *ReporterImpl) Report(pType string, buf []byte, reason string, eventID string) error {
        // do something
        return nil
    }
    ...
    r := &ReporterImpl{} // an implementation of the holmes.ProfileReporter interface
    h, _ := holmes.New(
        holmes.WithProfileReporter(r),
        holmes.WithDumpPath("/tmp"),
        holmes.WithLogger(holmes.NewFileLog("/tmp/holmes.log", mlog.INFO)),
        holmes.WithBinaryDump(),
        holmes.WithMemoryLimit(100*1024*1024), // 100MB
        holmes.WithGCHeapDump(10, 20, 40, time.Minute),
    )
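As a rough illustration of what a Reporter might do, the sketch below saves each Profile to a separate directory and prints an alert line. So that it compiles without importing Holmes, it declares a local profileReporter interface that copies the Report signature shown in this article. dirReporter and demoReporter are hypothetical names, and the file naming scheme is an assumption.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// profileReporter is a local stand-in that copies the Report signature shown
// in the article, so this sketch compiles without importing Holmes.
type profileReporter interface {
	Report(pType string, buf []byte, reason string, eventID string) error
}

// dirReporter is a hypothetical reporter that copies every profile into dir
// (for example a mounted shared volume) and prints an alert line.
type dirReporter struct {
	dir string
}

func (d dirReporter) Report(pType string, buf []byte, reason string, eventID string) error {
	name := fmt.Sprintf("%s-%s.prof", pType, eventID)
	if err := os.WriteFile(filepath.Join(d.dir, name), buf, 0o644); err != nil {
		return fmt.Errorf("save %s profile: %w", pType, err)
	}
	fmt.Printf("alert: %s dump triggered (%s), saved as %s\n", pType, reason, name)
	return nil
}

// demoReporter sends one fake profile through the reporter and reads it back.
func demoReporter() (string, error) {
	dir, err := os.MkdirTemp("", "profiles")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	var r profileReporter = dirReporter{dir: dir}
	if err := r.Report("goroutine", []byte("fake-profile"), "sample reason", "evt-1"); err != nil {
		return "", err
	}
	saved, err := os.ReadFile(filepath.Join(dir, "goroutine-evt-1.prof"))
	return string(saved), err
}

func main() {
	got, err := demoReporter()
	fmt.Println(got, err)
}
```

In a real service the write to a local directory would typically be replaced by an upload to object storage, and the print by a call to the alerting system.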

Dynamic Configuration

You can update the Holmes configuration while the application is running with the Set method. It is as simple to use as the New method at initialization.

Some settings, such as the number of cores, do not support dynamic changes. If you change this parameter while the system is running, the measured CPU usage fluctuates sharply and triggers a Dump.

    h.Set(
        holmes.WithCollectInterval("2s"),
        holmes.WithGoroutineDump(10, 10, 50, 90, time.Minute))
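New and Set can accept the same kind of arguments because both take option functions. The sketch below shows this general pattern (often called functional options) with a lock around the configuration. It illustrates the pattern only, not Holmes' internals: monitor, config, option, and the with... helpers are made-up names.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// config holds the settings that can be changed at runtime.
type config struct {
	collectInterval time.Duration
	goroutineMin    int
}

// option mutates a config; the same options serve the constructor and set.
type option func(*config) error

func withCollectInterval(s string) option {
	return func(c *config) error {
		d, err := time.ParseDuration(s)
		if err != nil {
			return err
		}
		c.collectInterval = d
		return nil
	}
}

func withGoroutineMin(n int) option {
	return func(c *config) error {
		c.goroutineMin = n
		return nil
	}
}

// monitor guards its config with a lock so set can run while it is in use.
type monitor struct {
	mu  sync.RWMutex
	cfg config
}

func newMonitor(opts ...option) (*monitor, error) {
	m := &monitor{cfg: config{collectInterval: 5 * time.Second}}
	if err := m.set(opts...); err != nil {
		return nil, err
	}
	return m, nil
}

// set applies the options to a copy and swaps it in only if all succeed.
func (m *monitor) set(opts ...option) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	next := m.cfg
	for _, o := range opts {
		if err := o(&next); err != nil {
			return err
		}
	}
	m.cfg = next
	return nil
}

func (m *monitor) interval() time.Duration {
	m.mu.RLock()
	defer m.mu.RUnlock()
	return m.cfg.collectInterval
}

// demoSet updates the interval at runtime and then tries an invalid value.
func demoSet() (seconds int, rejected bool) {
	m, err := newMonitor(withCollectInterval("5s"), withGoroutineMin(10))
	if err != nil {
		return 0, false
	}
	if err := m.set(withCollectInterval("2s")); err != nil {
		return 0, false
	}
	rejected = m.set(withCollectInterval("not-a-duration")) != nil
	return int(m.interval() / time.Second), rejected
}

func main() {
	fmt.Println(demoSet()) // 2 true
}
```

Applying options to a copy first means a bad value leaves the running configuration untouched.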

Real-World Use Case

With Holmes' Set method you can easily hook Holmes up to your company's configuration center, with Holmes as the data plane and the configuration center as the control plane. Connect it to an alerting system (email/SMS) as well and you have a simple monitoring system.

The specific structure is as follows:

Holmes V1.0 Released

This article has briefly introduced how to use Holmes and how it works. I hope Holmes helps you as you improve the stability of your applications.

Holmes V1.0 was released a few weeks ago, and as a contributor and user I highly recommend trying out this small library. If you have any questions, please feel free to join us in the community.

Holmes is an open source Go profiling component from the MOSN community. It automatically detects anomalies in CPU, memory, Goroutine, and other resources, and automatically dumps a Profile of the scene for later analysis and troubleshooting. It also supports uploading Profiles to an automatic analysis platform for automated diagnosis and alerting.

“Release report” : github.com/mosn/holmes…

IO /blog/posts/…


References

[1] Holmes Documentation: github.com/mosn/holmes

[2] Unattended Automatic Dump (1): xargin.com/autodumper-…

[3] Unattended Automatic Dump (2): xargin.com/autodumper-…

[4] Uncledou.site /2022/ Go-PPr…

[5] Goroutines Pprofiling STW: github.com/golang/go/i…

[6] Holmes Use Case Documentation: github.com/mosn/holmes…

[7] Go Pprof Performance Loss: medium.com/google-clou…
