Preface

Recently, Lao Huang has been busy with Double 11 related work, so this blog and GitHub have not seen many updates. Quite a few things got built at the company during this period.

Let me share a bit about service monitoring.

Let’s start with the background.

During the first wave of the Double 11 promotion, a certain business line had no real-time business dashboard. At the review meeting, the technical colleagues were called out by the relevant leaders and business staff, who said they could not get a clear picture of the live situation and could not adjust their strategies promptly and effectively.

Later, Lao Huang learned that this was an old business with limited resources, so nobody dared run real-time queries against the (single-point) database, for fear the business itself would fall over if the database went down.

To avoid this embarrassing situation and not get criticized again, something had to be done.

Analysis of the status quo

There are three applications, all .NET Framework projects, running on Windows servers with no containerization.

With Double 11 only a few days away, not much could be changed, and the business department's new requests still had to be handled.

What Lao Huang could think of at the time:

  1. Instrument the business code, expose metrics to Prometheus, and visualize with Grafana
  2. Have the business send messages to MQ, consume them into ES, and build a front-end dashboard
  3. Instrument the business code, write logs to Log Service, and build a dashboard on top

General analysis

  • Option 1: the business team has virtually no Prometheus experience, and the concepts take time to absorb; pass
  • Option 2: the MQ currently in use is Tencent Cloud's CMQ, which has burned us twice already, and the team could not really handle ES either; pass
  • Option 3: logs follow internal conventions; the business side only needs to add one line of logging at key places, and Logtail collects the logs and uploads them to Log Service

So out of the three options, Lao Huang chose option 3.

First, Log Service is already connected to all the internal systems, so the team is relatively familiar with it. Second, it does not touch the main business flow; we only instrument key places by adding log lines.

Although this is intrusive to the business code, it is undoubtedly the best option at this stage.

The overall implementation logic is as follows.

Instrumenting the business

Instrumenting the business code is actually very simple, but it is also the most important step.

We have a corresponding logging helper class, so all we need to deal with here is the content of the log.

SerilogHelper.Info($"[field1] [field2] [field3]", "metrics_name");

The convention Lao Huang set here is that each field's content goes inside [], with fields separated by spaces.
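For instance, at the point where an order is paid, a call might look like this (orderId, amount, and status are hypothetical variables, used only to illustrate the convention):

// Hypothetical values for illustration only.
var orderId = "20211111001"; var amount = 199.00m; var status = "paid";
SerilogHelper.Info($"[{orderId}] [{amount}] [{status}]", "order_metrics");
// Writes a line like: [20211111001] [199.00] [paid]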

The log is then stored in a specific directory for Logtail to collect.

One pitfall here is the log file's encoding. The file is supposed to be plain UTF-8, but the generated file turned out to be UTF-8 with BOM.

This causes the first line of the log to fail parsing, so be careful.
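For reference, here is a minimal sketch of what such a helper could look like, assuming Serilog's file sink; the path and names are illustrative, not the real internal class, and explicitly passing UTF8Encoding(false) is what sidesteps the BOM pitfall above:

using System.Collections.Concurrent;
using System.Text;
using Serilog;
using Serilog.Core;

public static class SerilogHelper
{
    // One logger per metric name, each writing a daily-rolling file
    // under the directory that Logtail is configured to watch.
    private static readonly ConcurrentDictionary<string, Logger> Loggers =
        new ConcurrentDictionary<string, Logger>();

    public static void Info(string message, string metricsName)
    {
        var logger = Loggers.GetOrAdd(metricsName, name =>
            new LoggerConfiguration()
                .WriteTo.File(
                    $@"D:\applogs\{name}\{name}-.log",   // illustrative path
                    rollingInterval: RollingInterval.Day,
                    outputTemplate: "{Timestamp:yyyy-MM-dd HH:mm:ss.fff} {Message:lj}{NewLine}",
                    encoding: new UTF8Encoding(false))   // UTF-8 without BOM
                .CreateLogger());
        logger.Information(message);
    }
}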

Data access and presentation

The code side was handled in the previous step; next comes getting the logs into Log Service.

Based on the business scenario, each application tracks one metric, so three Logstores need to be created. Choose the retention period as needed; the default is permanent.

After creating the Logstores, the next step is connecting the data source; Lao Huang chose the "RegEx - Text Log" option here.

Then there’s a bunch of general configurations.

The most important part is the Logtail configuration.

The log path is where the program writes its logs; Logtail collects from whatever path is set here.

The regular expression is the crucial piece: Logtail parses each log line against it and extracts the capture groups into fields, which is what makes subsequent queries possible.
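As a sanity check, the extraction regex can be tried locally before pasting it into the Logtail config. Here is a sketch, assuming the bracket-and-space convention above (the sample line and field names are made up):

using System;
using System.Text.RegularExpressions;

class RegexCheck
{
    static void Main()
    {
        // A sample line in the agreed format: timestamp, then bracketed fields.
        var line = "2021-11-11 10:05:32.123 [20211111001] [199.00] [paid]";

        // Each capture group becomes one extracted field in Logtail
        // (mapped to names like time, order_id, amount, status in the config).
        var pattern = @"^(\S+ \S+) \[([^\]]*)\] \[([^\]]*)\] \[([^\]]*)\]";

        var m = Regex.Match(line, pattern);
        Console.WriteLine(m.Success
            ? $"time={m.Groups[1].Value}, order_id={m.Groups[2].Value}, " +
              $"amount={m.Groups[3].Value}, status={m.Groups[4].Value}"
            : "no match");
    }
}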

Next comes the configuration of query analysis.

This is where you specify the fields to index for statistics, and also where the full-text index gets turned off: in this scenario a full-text index adds nothing and just wastes money.

At this point, we’re ready to collect the data.

The last thing to do is query the results. With some simple SQL, statistics in Log Service are no problem; the difficulty is fairly low.
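For example, a per-minute count for the panel can be written along these lines (the field names are hypothetical; in Log Service a search expression is piped into SQL, and __time__ is the built-in log timestamp):

* | SELECT date_format(from_unixtime(__time__ - __time__ % 60), '%H:%i') AS minute,
       count(1) AS orders
  GROUP BY minute
  ORDER BY minute

With auto-refresh turned on, a query like this is what sits behind each panel.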

Here is what the end result looks like.

The screenshots are heavily redacted, so please bear with them.

Since the console provides auto-refresh and full-screen modes, the dashboard can be left up on a screen without human intervention.

Conclusion

Complained at on Monday night, plan out Tuesday morning, coding Tuesday afternoon, results on Wednesday: this wave of operations was truly as fierce as a tiger.

It has to be said that Alibaba Cloud's Log Service really does simplify a lot of tedious operations.

However, the way logs are extracted is not perfect yet and is still a bit awkward.