The logs of a distributed system are scattered across individual servers, which makes monitoring and troubleshooting difficult. We built a complete log collection, analysis, and display system on top of ELK (Elasticsearch, Logstash, Kibana).

Architecture diagram

The main idea

1. Organize Rails logs

We care most about Rails access logs, but the default format of Rails logs is problematic. For example:

Started GET "/" for 10.1.1.11 at 2017-07-19 17:21:43 +0800
Cannot render console from 10.1.1.11! Allowed networks: 127.0.0.1, ::1, 127.0.0.0/127.255.255.255
Processing by Rails::WelcomeController#index as HTML
  Rendering /home/vagrant/.rvm/gems/[email protected]/gems/railties-5.1.2/lib/rails/templates/rails/welcome/index.html.erb
  Rendered /home/vagrant/.rvm/gems/[email protected]/gems/railties-5.1.2/lib/rails/templates/rails/welcome/index.html.erb (2.5ms)
Completed 200 OK in 184ms (Views: 10.9ms)

As you can see, the log of a single request is scattered across multiple lines, and under concurrency the lines of different requests are interleaved. To solve this, we used Logstasher to additionally emit each request as structured JSON events, one per line:

{"identifier":"/ home/vagrant /. RVM/gems/[email protected] / gems/railties - 5.1.2 / lib/rails/templates/rails/welcome/index. The HTML. The erb." "."layout":null,"name":"render_template.action_view"."transaction_id":"35c707dd9d4cd1a79f37"."duration": 2.34."request_id":"bc291df8-8681-47d3-8e10-bd5d93a021a0"."source":"unknown"."tags": []."@timestamp":"The 2017-07-19 T09:29:05. 969 z"."@version":"1"}
{"method":"GET"."path":"/"."format":"html"."controller":"rails/welcome"."action":"index"."status": 200,"duration": 146.71."view": 5.5."ip":"10.1.1.11"."route":"rails/welcome#index"."request_id":"bc291df8-8681-47d3-8e10-bd5d93a021a0"."source":"unknown"."tags": ["request"]."@timestamp":"The 2017-07-19 T09:29:05. 970 z"."@version":"1"}
Copy the code
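
For completeness, here is a minimal sketch of how Logstasher can be wired into the Rails app. The option names follow the logstasher gem's README and should be verified against the installed version; the 'blog' source name is only an example.

# Gemfile
gem 'logstasher'

# config/environments/development.rb (or whichever environment you instrument)
# Turn on the JSON event stream; in this setup the events end up in
# log/logstash_development.log, which is the file Logstash reads below.
config.logstasher.enabled = true
# Optional: tag events with an application name (the sample events above show "unknown")
config.logstasher.source = 'blog'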

2. Use Logstash to collect logs

Logstash is driven by a configuration file that describes where the data comes from, what processing it goes through, and where it is sent. These three stages correspond to the input, filter, and output sections of the configuration.

Let’s verify this with a simple configuration

input {
  file {
    path => "/home/vagrant/blog/log/logstash_development.log"
    start_position => beginning
    ignore_older => 0
  }
}

output {
  stdout {}
}

In this configuration, we read from the log file generated in the previous step and print each event to stdout, which produces output like this:

2017-07-19T09:59:01.520Z precise64 {"method":"GET","path":"/","format":"html","controller":"rails/welcome","action":"index","status":200,"duration":4.85,"view":3.28,"ip":"10.1.1.11","route":"rails/welcome#index","request_id":"27b8e5a5-dd1d-4957-9c91-435347d50888","source":"unknown","tags":["request"],"@timestamp":"2017-07-19T09:59:01.030Z","@version":"1"}
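
The configurations in this post do not use a filter stage, but since each Logstasher line is already JSON, Logstash's json filter could parse the message string into structured fields (status, duration, route, and so on) before the events are shipped. A minimal sketch:

filter {
  json {
    # parse the JSON text in the "message" field into top-level event fields
    source => "message"
  }
}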

Then modify the Logstash configuration file so that the output goes to Elasticsearch:

input {
  file {
    path => "/vagrant/blog/log/logstash_development.log"
    start_position => beginning
    ignore_older => 0
  }
}

output {
  elasticsearch {
    hosts => [ "localhost:9200" ]
    user => 'xxx'
    password => 'xxx'
  }
}

As you can see, the whole configuration file is very readable: the input is the JSON log file we prepared earlier, and the output is Elasticsearch.
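
To try a configuration like this, point Logstash at the file with -f; the config file name here is only a placeholder, and the binary's location depends on how Logstash was installed:

bin/logstash -f rails_logs.conf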

Then you can use Kibana for log analysis.

3. Some Kibana practices

With Kibana, we can build custom Elasticsearch queries on top of this data to answer some very valuable questions, for example (rough query sketches follow the list):

  • Query the requests hitting a particular endpoint
  • Find slow requests that take more than 500 ms
  • Find requests that return HTTP 500
  • Rank endpoints by request frequency, and so on
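
As a rough illustration (not taken from the original setup), once an index pattern such as logstash-* has been created in Kibana, queries like these can be written in the search bar using Lucene query syntax, reusing the fields from the Logstasher events shown earlier:

  • Requests for one endpoint: tags:request AND route:"rails/welcome#index"
  • Requests slower than 500 ms: tags:request AND duration:>500
  • Requests that returned HTTP 500: tags:request AND status:500

High-frequency endpoints are easier to see with a visualization, for example a terms aggregation on the route field.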

4. Future

With the data provided by ELK, we can now conveniently track down errors across the distributed system and gather statistics on high-frequency endpoints, which gives us guidance for the next round of optimization. Instead of guessing from the business logic which 20% of the code paths are hot, we have actual data to back it up.

5. Problems

Of course, we have run into some problems along the way. During one event, a traffic surge caused Elasticsearch to consume a large amount of memory and brought down both machines. We worked around it temporarily by shutting down Logstash on several of the web servers; further JVM tuning is still needed.
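
One common starting point for that tuning is to cap the Elasticsearch heap explicitly so that a traffic spike cannot grow it without bound. Depending on the Elasticsearch version this is set in config/jvm.options or via the ES_HEAP_SIZE environment variable; a minimal sketch assuming jvm.options, with 4g only as a placeholder to be sized to the actual machine:

# config/jvm.options
# Set the min and max heap to the same value, conventionally no more than
# about half of the machine's RAM
-Xms4g
-Xmx4g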