
This post was published on the Cloud + Community by Michelmu

Elasticsearch is a mainstream distributed big data storage and search engine. It provides users with powerful full-text search capabilities and is widely used in log search and site-wide search. Logstash, a common real-time data collection engine for Elasticsearch, collects data from different data sources, processes it, and outputs it to multiple destinations. Logstash is an important part of the Elastic Stack. This article covers the working principle, usage examples, deployment modes and performance tuning of Logstash, providing a quick way to get started with it. At the end of the article, some in-depth links are provided for readers who want to learn more about Logstash.

1 Logstash working principle

1.1 Processing Procedure

As shown in the figure above, the Logstash data processing pipeline mainly consists of Inputs, Filters and Outputs. In addition, Codecs can be used in Inputs and Outputs to handle data formats. All four parts exist in the form of plug-ins. By defining a pipeline configuration file, users can set the input, filter, output and codec plug-ins needed to implement specific data collection, data processing and data output functions.

  • (1) Inputs: used to obtain data from data sources, such as file, syslog, Redis, beats, etc.
  • (2) Filters: used to process data, such as format conversion and data derivation. Common plug-ins include grok, mutate, drop, clone, geoip, etc. [for details]
  • (3) Outputs: used to output data, for example to Elasticsearch, file, Graphite, statsd, etc.
  • (4) Codecs: not a separate processing stage, but a module used within input and output plug-ins to encode and decode data. Common plug-ins include json and multiline (a usage sketch follows below). [for details]

You can click the detailed reference link at the end of each module to see the list of plug-ins and corresponding functions for that module.
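To make the role of codecs concrete, here is a minimal sketch (not from the original article) that attaches the json codec to a stdin input, so each incoming line is decoded as a JSON document instead of a plain string, and the rubydebug codec to the output for readable printing:

    input {
        stdin {
            # decode each incoming line as a JSON document instead of plain text
            codec => json
        }
    }
    output {
        stdout {
            # pretty-print events while debugging
            codec => rubydebug
        }
    }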

1.2 Execution Model:

  • (1) Each input starts a thread to fetch data from its data source.
  • (2) Inputs write data to a queue. The default is a bounded in-memory queue (an unexpected stop can result in data loss). To prevent data loss, Logstash provides two features: Persistent Queues, which persist events on disk, and Dead Letter Queues, which hold events that could not be processed (currently only supported when Elasticsearch is the output). A configuration sketch for both follows this list.
  • (3) Logstash runs multiple pipeline workers. Each worker fetches a batch of events from the queue and then executes the filter and output stages (the number of workers and the batch size are determined by the configuration).
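As a rough illustration of the queue-related options mentioned above, the following logstash.yml snippet is a minimal sketch (not from the original article; the size value is a placeholder to be tuned) that enables the persistent queue and the dead letter queue:

    # logstash.yml (sketch) -- queue settings discussed above
    queue.type: persisted            # default is "memory"; "persisted" survives unexpected stops
    queue.max_bytes: 1gb             # cap for the on-disk queue
    dead_letter_queue.enable: true   # keep events rejected by the Elasticsearch output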

2 Logstash Example

2.1 Logstash Hello world

In this first example, Logstash uses standard input and standard output as its input and output, and no filter is specified.

  • (1) Download Logstash and unpack it (JDK 8 must be installed in advance)
  • (2) cd to the Logstash root directory and run the following startup command:
    cd logstash-6.4.0
    bin/logstash -e 'input { stdin { } } output { stdout {} }'
  • (3) At this point Logstash has started successfully. -e means the pipeline configuration is specified directly on the command line; the configuration can also be written into a configuration file and Logstash started by pointing at that file (a sketch is given at the end of this section)
  • (4) Enter "hello world" on the console and you will see output similar to the following:
    {
          "@version" => "1",
              "host" => "localhost",
        "@timestamp" => 2018-09-18T12:39:38.514Z,
           "message" => "hello world"
    }

Logstash automatically adds the @version, host, and @timestamp fields to the data.

In this case, Logstash takes data from standard input, adds a few simple fields to it, and writes it to standard output.
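As mentioned in step (3), the same pipeline can be kept in a configuration file instead of being passed with -e. A minimal sketch (the file name hello.conf is arbitrary):

    # hello.conf -- the same pipeline as above, kept in a file
    input { stdin { } }
    output { stdout { } }

Start it by pointing at the file:

    bin/logstash -f hello.conf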

2.2 Log Collection

This example uses Filebeat (a lightweight data collector in the Elastic Stack) to collect a local log file and the beats input plug-in to receive it in Logstash, outputting the result to standard output.

  • (1) Download the sample log file used by the example [address], unzip it and place the log in a suitable location
  • (2) Install Filebeat, configure it and start it [reference]

The filebeat.yml configuration is as follows (change paths to the actual log location; the Beats configuration may vary between versions, so adjust it as needed):

    filebeat.prospectors:
    - input_type: log
      paths:
        - /path/to/file/logstash-tutorial.log
    output.logstash:
      hosts: ["localhost:5044"]

Start command:

    ./filebeat -e -c filebeat.yml -d "publish"
  • (3) Configure Logstash and start it

1) Create a first-pipeline.conf file with the following contents (this is the pipeline configuration file, used to specify the input, filter and output):

    input {
        beats {
            port => "5044"
        }
    }
    #filter {
    #}
    output {
        stdout { codec => rubydebug }
    }

codec => rubydebug is used to pretty-print the output [reference]

2) Verify the configuration (note that the path to the configuration file must be specified):

    ./bin/logstash -f first-pipeline.conf --config.test_and_exit

3) Start command:

    ./bin/logstash -f first-pipeline.conf --config.reload.automatic

--config.reload.automatic enables automatic reloading of the configuration

4) Expected Results:

As you can see from the Logstash terminal output, the log file is read and processed into multiple events in the following format:

    {
        "@timestamp" => 2018-10-09T12:22:39.742Z,
            "offset" => 24464,
          "@version" => "1",
        "input_type" => "log",
              "beat" => {
                "name" => "VM_136_9_centos",
            "hostname" => "VM_136_9_centos",
             "version" => "5.6.10"
        },
              "host" => "VM_136_9_centos",
            "source" => "/data/home/michelmu/workspace/logstash-tutorial.log",
           "message" => "86.1.76.62 - - [04/Jan/2015:05:30:37 +0000] \"GET /style2.css HTTP/1.1\" 200 4877 \"http://www.semicomplete.com/projects/xdotool/\" \"Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20140205 Firefox/24.0 Iceweasel/24.3.0\"",
              "type" => "log",
              "tags" => [
            [0] "beats_input_codec_plain_applied"
        ]
    }

Compared with example 2.1, this example uses the beats input plug-in to fetch log lines shipped by Filebeat, which is the most common way the Elastic Stack collects log data. It also uses the rubydebug codec to pretty-print the output.

2.3 Log Format Processing

As you can see, although example 2.2 uses Filebeat to read data from the log and output it to standard output, the whole log line is stored in the message field, which makes subsequent storage and queries very inconvenient. A grok filter can be added to the pipeline to parse the log format.

  • (1) Add a filter to first-pipeline.conf as follows:
    input {
        beats {
            port => "5044"
        }
    }
    filter {
        grok {
            match => { "message" => "%{COMBINEDAPACHELOG}" }
        }
    }
    output {
        stdout { codec => rubydebug }
    }
  • (2) Go to the Filebeat root directory, delete the registry of previously reported data (so that the data is reported again), and restart Filebeat:
    sudo rm data/registry
    sudo ./filebeat -e -c filebeat.yml -d "publish"
  • (3) Since automatic configuration reloading was enabled when Logstash was started earlier, there is no need to restart Logstash. The parsed log data now looks like this:
    {
            "request" => "/style2.css",
              "agent" => "\"Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20140205 Firefox/24.0 Iceweasel/24.3.0\"",
             "offset" => 24464,
               "auth" => "-",
              "ident" => "-",
         "input_type" => "log",
               "verb" => "GET",
             "source" => "/data/home/michelmu/workspace/logstash-tutorial.log",
            "message" => "86.1.76.62 - - [04/Jan/2015:05:30:37 +0000] \"GET /style2.css HTTP/1.1\" 200 4877 \"http://www.semicomplete.com/projects/xdotool/\" \"Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20140205 Firefox/24.0 Iceweasel/24.3.0\"",
               "type" => "log",
               "tags" => [
            [0] "beats_input_codec_plain_applied"
        ],
           "referrer" => "\"http://www.semicomplete.com/projects/xdotool/\"",
         "@timestamp" => 2018-10-09T12:24:21.276Z,
           "response" => "200",
              "bytes" => "4877",
           "clientip" => "86.1.76.62",
           "@version" => "1",
               "beat" => {
                "name" => "VM_136_9_centos",
            "hostname" => "VM_136_9_centos",
             "version" => "5.6.10"
        },
               "host" => "VM_136_9_centos",
        "httpversion" => "1.1",
          "timestamp" => "04/Jan/2015:05:30:37 +0000"
    }

You can see that the content of the message field has been parsed out into individual fields.
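Since the parsed fields now carry all of the information, a common optional refinement (not part of the original walkthrough) is to use the mutate filter mentioned in section 1.1 to drop the raw message field after grok has parsed it, for example:

    filter {
        grok {
            match => { "message" => "%{COMBINEDAPACHELOG}" }
        }
        # optional: drop the raw log line once it has been parsed into fields
        mutate {
            remove_field => [ "message" ]
        }
    }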

2.4 Data Derivation and Enhancement

Some Logstash filters can generate new data from existing data. For example, geoip can derive latitude and longitude information from an IP address.

  • (1) Add the geoip configuration to first-pipeline.conf as follows:
    input {
        beats {
            port => "5044"
        }
    }
     filter {
        grok {
            match => { "message" => "%{COMBINEDAPACHELOG}" }
        }
        geoip {
            source => "clientip"
        }
    }
    output {
        stdout { codec => rubydebug }
    }
  • (2) Clear the Filebeat history data and restart it as described in 2.3
  • (3) Logstash still does not need to be restarted; the output changes to the following:
    {
            "request" => "/style2.css",
              "agent" => "\"Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20140205 Firefox/24.0 Iceweasel/24.3.0\"",
              "geoip" => {
                  "timezone" => "Europe/London",
                        "ip" => "86.1.76.62",
                  "latitude" => 51.5333,
            "continent_code" => "EU",
                 "city_name" => "Willesden",
              "country_name" => "United Kingdom",
             "country_code2" => "GB",
             "country_code3" => "GB",
               "region_name" => "Brent",
                  "location" => {
                "lon" => 0.2333,
                "lat" => 51.5333
            },
               "postal_code" => "NW10",
               "region_code" => "BEN",
                 "longitude" => 0.2333
        },
             "offset" => 24464,
               "auth" => "-",
              "ident" => "-",
         "input_type" => "log",
               "verb" => "GET",
             "source" => "/data/home/michelmu/workspace/logstash-tutorial.log",
            "message" => "86.1.76.62 - - [04/Jan/2015:05:30:37 +0000] \"GET /style2.css HTTP/1.1\" 200 4877 \"http://www.semicomplete.com/projects/xdotool/\" \"Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20140205 Firefox/24.0 Iceweasel/24.3.0\"",
               "type" => "log",
               "tags" => [
            [0] "beats_input_codec_plain_applied"
        ],
           "referrer" => "\"http://www.semicomplete.com/projects/xdotool/\"",
         "@timestamp" => 2018-10-09T12:37:46.686Z,
           "response" => "200",
              "bytes" => "4877",
           "clientip" => "86.1.76.62",
           "@version" => "1",
               "beat" => {
                "name" => "VM_136_9_centos",
            "hostname" => "VM_136_9_centos",
             "version" => "5.6.10"
        },
               "host" => "VM_136_9_centos",
        "httpversion" => "1.1",
          "timestamp" => "04/Jan/2015:05:30:37 +0000"
    }

You can see a lot of geolocation data derived from the IP address.
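If only part of this information is needed, the geoip filter's fields option can restrict which fields are generated. A minimal sketch (the chosen field names are just an illustration, not from the original article):

    filter {
        geoip {
            source => "clientip"
            # keep only a subset of the derived geo fields
            fields => [ "city_name", "country_name", "location" ]
        }
    }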

2.5 Importing Elasticsearch Data

As an important part of the Elastic Stack, Logstash is most commonly used to import data into Elasticsearch. Sending Logstash data to Elasticsearch is also very easy: simply add an elasticsearch output to the pipeline configuration file.

  • (1) Start an Elasticsearch instance that Logstash can reach
  • (2) Add an elasticsearch output to first-pipeline.conf as follows:
    input {
        beats {
            port => "5044"
        }
    }
    filter {
        grok {
            match => { "message" => "%{COMBINEDAPACHELOG}" }
        }
        geoip {
            source => "clientip"
        }
    }
    output {
        elasticsearch {
            hosts => [ "localhost:9200" ]
        }
    }
  • (3) Clear the Filebeat history data and restart it
  • (4) Query Elasticsearch to confirm that the data has been uploaded properly:
    curl -XGET 'http://172.16.16.17:9200/logstash-2018.10.09/_search?pretty&q=response=200'
  • (5) If Elasticsearch is associated with Kibana, you can also use Kibana to check whether the data is reported properly
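By default the data ends up in an index named logstash-<date>, as the query above shows. If a different index name is wanted, the elasticsearch output's index option can be set. A minimal sketch (the index name weblog-%{+YYYY.MM.dd} is just an illustrative choice, not from the original article):

    output {
        elasticsearch {
            hosts => [ "localhost:9200" ]
            # write to a custom date-based index instead of the default logstash-*
            index => "weblog-%{+YYYY.MM.dd}"
        }
    }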

Logstash provides a large number of input, filter, output and codec plug-ins. Users can combine one or more of them according to their own needs, and can also develop custom plug-ins for more specialized functions. For custom plug-in development, see [Logstash input plugin development].

3 Deploying Logstash

Now that we’ve demonstrated how to use Logstash quickly, let’s take a closer look at how it can be deployed.

3.1 installation

  • Install the JDK: Logstash is written in JRuby and requires a JVM to run, so a JDK must be installed before installing Logstash (Logstash 6.4 currently supports only JDK 8)
  • Install Logstash: you can install it by downloading the archive directly, or through APT or YUM; Logstash can also be run in Docker (a download-and-unpack sketch follows this list). [Logstash installation reference]
  • Install X-Pack: from version 6.3 onwards X-Pack is bundled with Logstash; earlier versions need it installed manually [see link]
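As a minimal sketch of the archive-based installation (version 6.4.0 matches the one used earlier in this article; the download URL follows Elastic's usual artifact layout and should be checked against the official installation page):

    # download and unpack the Logstash archive
    wget https://artifacts.elastic.co/downloads/logstash/logstash-6.4.0.tar.gz
    tar -zxvf logstash-6.4.0.tar.gz
    cd logstash-6.4.0
    bin/logstash --version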

3.2 Directory Structure

The Logstash directory layout includes the home directory, bin directory, settings directory, log directory, plug-in directory and data directory.

The default locations of these directories for different installation methods are listed [here].

3.3 Configuration Files

  • Pipeline configuration files: their names can be customized and are explicitly specified when Logstash is started; they are written in the same way as the previous examples, and the configuration of a specific plug-in can be found in that plug-in's documentation. A pipeline configuration file must be provided when using Logstash: it defines the pipeline's inputs, data processing and outputs
  • Settings files:
    - logstash.yml: controls how Logstash itself runs, such as the worker count and batch size
    - pipelines.yml: used to run multiple pipelines in one Logstash instance (a sketch follows this list) [reference]
    - jvm.options: JVM configuration, such as heap size
    - log4j2.properties: Log4j 2 configuration for Logstash's own run logs
    - startup.options: used only on Linux, to build the script for starting Logstash as a system service
  • To keep sensitive configuration values secure, Logstash provides a mechanism for storing them in encrypted form
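A minimal pipelines.yml sketch for running two pipelines in one Logstash instance (the pipeline ids and paths are placeholders, not from the original article):

    # pipelines.yml (sketch) -- two independent pipelines in one Logstash instance
    - pipeline.id: web-logs
      path.config: "/path/to/first-pipeline.conf"
      pipeline.workers: 2
    - pipeline.id: app-logs
      path.config: "/path/to/another-pipeline.conf"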

3.4 Startup and Shutdown Mode

3.4.1 Startup

  • Start from the command line
  • Start as a service on Debian and RPM-based systems
  • Start in Docker

3.4.2 Shutdown

  • When Logstash is shut down, it first closes the inputs to stop ingesting, processes all in-flight events, and only then stops completely, which prevents data loss but can also make the shutdown slow or cause it to fail.

3.5 Scaling Logstash

When a single Logstash instance cannot meet the performance requirements, it can be scaled horizontally to increase processing capacity. Horizontally scaled Logstash instances are independent of each other and use the same pipeline configuration, and a load balancer can be placed in front of them to distribute traffic across the instances. A Beats-side sketch of this follows.
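One simple way to achieve this on the collector side is sketched below: Filebeat can balance its output across several Logstash instances itself (the hostnames are placeholders; the load balancer mentioned above could equally be an external one such as Nginx or LVS in front of the Logstash nodes):

    # filebeat.yml (sketch) -- fan events out across two Logstash instances
    output.logstash:
      hosts: ["logstash-a:5044", "logstash-b:5044"]
      loadbalance: true   # distribute batches across all listed hosts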

4 Performance Tuning

[Detailed tuning reference]

  • (1) Performance of inputs and outputs: if the input or output source has already reached its performance limit, the bottleneck is not Logstash, so tune the input and output sources first.

  • (2) System performance indicators:

    • CPU: check whether CPU usage is excessive. If it is, first look at the JVM heap usage to see whether GC is frequent; if GC is normal, the problem can usually be addressed by adjusting the Logstash worker settings.
    • Memory: Logstash runs on the JVM, so make sure the heap limit gives it enough space to run. Also check whether other applications on the same machine use large amounts of memory and cause frequent swapping to disk.
    • I/O utilization:
      1) Disk I/O: disk I/O saturation may be caused by a file output, and a large number of error logs generated by a Logstash failure can also saturate disk I/O. On Linux, iostat and dstat can be used to check disk I/O status.
      2) Network I/O: network I/O saturation generally occurs when plug-ins that perform a lot of network operations are used. On Linux, dstat or iftop can be used to view network I/O.
  • (3) Check the JVM heap:

    • Setting the JVM heap too small leads to frequent GC, which in turn causes high CPU utilization.
    • A quick way to verify this is to double the heap size and see whether performance improves. Always leave at least 1 GB of memory for the operating system.
    • To locate problems precisely, tools such as jmap or VisualVM can be used. [reference]
    • Set Xms and Xmx to the same value to prevent the heap from being resized at run time, which is very costly.
  • (4) Logstash worker settings: the worker-related configuration lives in logstash.yml and mainly includes the following three options (see the sketch after this list):

    • pipeline.workers: the number of threads that execute the filter and output stages. If CPU usage has not reached its limit, this can be increased to get more throughput out of Logstash; values above the number of CPU cores are acceptable because they reduce the impact of I/O wait time. In practice you can first experiment with -w on the command line and then write the chosen value into the configuration file.
    • pipeline.batch.size: the number of events a single worker thread processes in one filter/output batch. Increasing it reduces the number of I/O operations and improves throughput, but also increases memory and other resource consumption. When used with Elasticsearch, this value determines the size of the Elasticsearch bulk operation.
    • pipeline.batch.delay: how long a worker waits for new events. If pipeline.batch.size events have not arrived within this time, the filter and output stages are executed with whatever has been collected instead of waiting longer.
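A minimal logstash.yml sketch of these three settings (the values are placeholders to be tuned for the actual workload, not recommendations from the original article):

    # logstash.yml (sketch) -- worker tuning knobs discussed above
    pipeline.workers: 4        # threads executing filter + output
    pipeline.batch.size: 250   # events per batch; also drives the Elasticsearch bulk size
    pipeline.batch.delay: 50   # milliseconds to wait before flushing an incomplete batch

The same worker count can be tried on the command line first, for example bin/logstash -f first-pipeline.conf -w 4, before writing it into the file.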

Conclusion

As an important part of the Elastic Stack, Logstash plays an important role in collecting and processing data for Elasticsearch. Through simple examples and basic knowledge of Logstash, this article hopes to help first-time users form an overall understanding of Logstash and learn it quickly. For more advanced usage, readers should consult the relevant resources as needed in actual use. Comments and corrections on any mistakes in the article are also very welcome.

MORE:

  • Common examples of Logstash data processing
  • Logstash configuration and logging reference
  • Managing Logstash pipeline configuration with Kibana
  • Logstash modules
  • Monitoring Logstash

