As the ELK stack grows in popularity, Elasticsearch's powerful search and analysis capabilities make it attractive for a wide variety of data types. The question that follows is how to load all kinds of massive data into ES in a general, convenient, and efficient way. Logstash has traditionally filled this role, but it has inherent drawbacks: its processing performance is low, and it is difficult to develop and debug.

We urgently need a universal data processing method that covers the whole path from data source to ES. In the end, it should achieve:

  1. Versatility: rich interfaces, flow control, data type conversion, data processing
  2. Ease of development: support for rapid development and debugging
  3. Ease of management: easy to find data problems and performance bottlenecks, with clear processes
  4. High performance: high data-processing throughput with low resource consumption

Of course, we could develop a dedicated Spark Streaming application to move data into ES. However, because it is limited in versatility and ease of use, this approach has a high cost and a long development cycle, and its performance and stability are hard to guarantee.

Therefore, we selected several general data processing methods for testing, compared their advantages and disadvantages, and identified the scenarios each is suited for:

  1. Filebeat + ingest node
  2. Logstash
  3. Apache NiFi
  4. StreamSets Data Collector (SDC)
The tests cover two scenarios:

  1. Parsing simple Apache logs, representing general log data

A) Use the same Apache access log data as the test sample

B) Perform simple pattern matching: parse the data with the common COMBINEDAPACHELOG Grok pattern and write the results to ES (a sample log line is shown after this list)

  2. Processing Twitter data with a complex JSON structure, transforming, computing on, and filtering its fields, representing data with large and complex structures

A) Use the same Twitter JSON data as the test sample

B) Since the raw Twitter data is not standard JSON, it is converted into a standard JSON array before NiFi processing, producing the data file tweet.nifi.json
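For reference, a combined-format line from the test data (taken from the ES data samples later in this article) looks like the following; the COMBINEDAPACHELOG pattern extracts clientip, ident, auth, timestamp, verb, request, httpversion, response, bytes, referrer, and agent from it:

103.255.6.250 - - [09/Mar/2018:02:24:56 -0800] "GET /twitter-icon.png HTTP/1.1" 200 27787 "http://www.secrepo.com/" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:58.0) Gecko/20100101 Firefox/58.0"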

The test procedure:

  1. Each test is repeated three times, and the completion time and related operating metrics are recorded
  2. ES performance data is collected with the monitoring module built into Kibana, watching the corresponding index
  3. Metricbeat collects operating-system performance data, which is displayed in Kibana with its bundled template (a minimal Metricbeat configuration is sketched after this list)
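A minimal sketch of a Metricbeat system-module configuration for this kind of collection; the period, metricsets, and credentials here are illustrative, not the exact setup used in the tests:

metricbeat.modules:
  - module: system
    period: 10s
    metricsets: ["cpu", "memory", "network", "diskio"]

output.elasticsearch:
  hosts: ["lzyl1:9200"]
  username: "elastic"
  password: "zylelk"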

3.1. Software and hardware environment

  1. VMware® Workstation 10.0.1 build-1379776
  2. Ubuntu 16.04.1 LTS, Linux kernel 4.4.0-59-generic
  3. 2 CPUs / 10 GB RAM
  4. ELK 6.2.2 + xpack + 2 nodes
  5. Apache Nifi 1.5.0
  6. SDC 3.1.0.0

3.2. Data preparation

1. The Apache log data used in the test came from www.secrepo.com

See:

Github.com/zhan-yl/ELK…

Usage:

python3 download_data.py --start_date 2018-03-15 --output_folder ./data

Consolidate the downloaded log data into a log file:

-rw-rw-r-- 1 zhanyl zhanyl 169984631 Mar 16 11:45 access.log

You can also use a log generator to generate data:

Github.com/kiritbasu/F…

2. The Twitter data used in the test came from Twitter and was downloaded with a Python script

See:

Github.com/zhan-yl/ELK…

Usage:

python2.7 tweet.py @realDonaldTrump

python2.7 tweet.py @BBCWorld

python2.7 tweet.py @BBCBreaking

python2.7 tweet.py @TIME

python2.7 tweet.py @PDChina

python2.7 tweet.py @CNN

python2.7 tweet.py @CBSNews

python2.7 tweet.py @elastic

python2.7 tweet.py @golang

python2.7 tweet.py @Docker

python2.7 tweet.py @streamsets

The generated log files are as follows:

-rw-rw-r-- 1 zhanyl zhanyl 277508582 Apr 2 10:11 tweet.json

-rw-rw-r-- 1 zhanyl zhanyl 277561344 Apr 2 15:56 tweet.nifi.json

3.3. Environment Configuration

3.3.1. Apache log test

3.3.1.1. Filebeat + ingest node

  1. Create the ingest pipeline

See:

Github.com/zhan-yl/ELK…

Import:

curl -XPUT -H 'Content-Type: application/json' 'lzyl1:9200/_ingest/pipeline/secrepo_pipeline?pretty' -d @secrepo_pipeline.json -u elastic:zylelk

Check:

curl -XGET "http://lzyl1:9200/_ingest/pipeline?pretty&filter_path=secrepo_pipeline" -u elastic:zylelk
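For reference, a minimal sketch of what such a pipeline body can look like; the actual secrepo_pipeline.json is in the linked repo, and the geoip and user_agent processors require the ingest-geoip and ingest-user-agent plugins:

{
  "description": "Parse secrepo Apache access logs",
  "processors": [
    { "grok": { "field": "message", "patterns": ["%{COMBINEDAPACHELOG}"] } },
    { "date": { "field": "timestamp", "target_field": "@timestamp", "formats": ["dd/MMM/yyyy:HH:mm:ss Z"] } },
    { "remove": { "field": "timestamp" } },
    { "geoip": { "field": "clientip" } },
    { "user_agent": { "field": "agent" } }
  ]
}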

  2. Filebeat configuration file

See:

Github.com/zhan-yl/ELK…
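A minimal sketch of such a configuration, reading the consolidated log and routing it through the ingest pipeline (the actual filebeat_secrepo.yml is in the linked repo; note that overriding the index in Filebeat 6.x also requires matching setup.template.name/pattern settings):

filebeat.prospectors:
  - type: log
    paths:
      - /home/zhanyl/examples-master/Graph/apache_logs_security_analysis/data/access.log

output.elasticsearch:
  hosts: ["lzyl1:9200"]
  username: "elastic"
  password: "zylelk"
  pipeline: "secrepo_pipeline"
  index: "ingest-pipeline"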

3.3.1.2. Logstash

1. Create a configuration file

See:

Github.com/zhan-yl/ELK…
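A minimal sketch of an equivalent configuration (the actual secrepo_logstash.conf is in the linked repo); the sincedb path matches the run commands later in this article, and the index matches the secrepo-logstash samples shown there:

input {
  file {
    path => "/home/zhanyl/examples-master/Graph/apache_logs_security_analysis/data/access.log"
    start_position => "beginning"
    sincedb_path => "/data/logstash/.sincedb"
  }
}

filter {
  grok { match => { "message" => "%{COMBINEDAPACHELOG}" } }
  date { match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ] }
  geoip { source => "clientip" }
  useragent { source => "agent" }
}

output {
  elasticsearch {
    hosts => ["lzyl1:9200"]
    user => "elastic"
    password => "zylelk"
    index => "secrepo-logstash"
  }
}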

3.3.1.3. Apache NiFi

1. Import template

See:

Github.com/zhan-yl/ELK…

2. Notes

To avoid OutOfMemoryErrors and having a large number of files open at the same time, the flow chains several SplitText processors.

GeoIP address resolution is not included in the flow.

Java heap usage and file handles in NiFi are issues that need careful handling; a reference tuning sketch follows.
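For reference, the heap is set in NiFi's conf/bootstrap.conf (512 MB by default) and the open-file limit in /etc/security/limits.conf; the values below are illustrative, not the exact ones used in the tests:

# conf/bootstrap.conf: raise the JVM heap (defaults are -Xms512m/-Xmx512m)
java.arg.2=-Xms4g
java.arg.3=-Xmx4g

# /etc/security/limits.conf: raise the open-file limit for the user running NiFi
nifi soft nofile 50000
nifi hard nofile 50000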

3.3.1.4. SDC

1. Import pipeline

See:

Github.com/zhan-yl/ELK…

3.3.2. Twitter log test

3.3.2.1. Filebeat + ingest node

Due to the complexity of the required data processing, this option is not tested

3.3.2.2. Logstash

Although similar operations could be performed with the filter plugins, this option is not tested because the operations are complex and hard to debug

3.3.2.3. Apache NiFi

1. Import the template

See:

Github.com/zhan-yl/ELK…

2. Notes

Since the downloaded Twitter data is in a non-standard JSON format, sed is used to convert it into a standard JSON array before testing, to make the data easier to split (a sketch follows below).

GeoIP address resolution is not included in the flow.
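A minimal sketch of the conversion, assuming the download script writes one tweet object per line:

sed -e '1s/^/[/' -e '$!s/$/,/' -e '$s/$/]/' tweet.json > tweet.nifi.json

This wraps the file in [ ] and appends a comma to every line except the last, which is consistent with the small size difference between tweet.json and tweet.nifi.json listed earlier.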

3.3.2.4. SDC

1. Import pipeline

See:

Github.com/zhan-yl/ELK…

4.1. Apache log testing

4.1.1. Filebeat + ingest node

4.1.1.1. Run commands

Initialize file state so the test can be repeated:

rm /var/lib/filebeat/registry

Start:

/usr/share/filebeat/bin/filebeat -c /etc/filebeat/filebeat_secrepo.yml -path.home /usr/share/filebeat -path.config /etc/filebeat -path.data /var/lib/filebeat -path.logs /var/log/filebeat

4.1.1.2. Data monitoring

1. Total records

897123

2. Completion time

 

Run | Start time | End time | Duration (min)
1 | - | 10:40 | 20
2 | 10:45 | 11:03 | 18
3 | - | - | 19

3. Performance charts

A) ES performance

B) Operating system performance

C) ES data samples

{
  "_index": "ingest-pipeline",
  "_type": "doc",
  "_id": "5AZoLmIBmeYOsZrYcv72",
  "_version": 1,
  "_score": null,
  "_source": {
    "request": "/twitter-icon.png",
    "geoip": {
      "continent_name": "Asia",
      "city_name": "Attock",
      "country_iso_code": "PK",
      "region_name": "Punjab",
      "location": { "lon": 72.3873, "lat": 33.5937 }
    },
    "offset": 8706062,
    "auth": "-",
    "ident": "-",
    "verb": "GET",
    "source": "/home/zhanyl/examples-master/Graph/apache_logs_security_analysis/data/access.log",
    "message": "103.255.6.250 - - [09/Mar/2018:02:24:56 -0800] \"GET /twitter-icon.png HTTP/1.1\" 200 27787 \"http://www.secrepo.com/\" \"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:58.0) Gecko/20100101 Firefox/58.0\"",
    "referrer": "\"http://www.secrepo.com/\"",
    "@timestamp": "2018-03-09T10:24:56.000Z",
    "response": "200",
    "bytes": "27787",
    "clientip": "103.255.6.250",
    "beat": { "hostname": "zylxpack", "name": "zylxpack", "version": "6.2.2" },
    "httpversion": "1.1",
    "user_agent": {
      "major": "58",
      "minor": "0",
      "os": "Ubuntu",
      "name": "Firefox",
      "os_name": "Ubuntu",
      "device": "Other"
    }
  },
  "fields": { "@timestamp": [ "2018-03-09T10:24:56.000Z" ] },
  "sort": [ 1520591096000 ]
}

4.1.2. Logstash

4.1.2.1. Run commands

Initialize file state so the test can be repeated:

rm /data/logstash/.sincedb

Start:

/usr/share/logstash/bin/logstash -f /home/zhanyl/examples-master/Graph/apache_logs_security_analysis/logstash/secrepo_logstash.conf --path.settings=/etc/logstash --path.data /data/logstash

4.1.2.2. Data monitoring

1. Total records

897123

2. Completion time

 

Run | Start time | End time | Duration (min)
1 | - | - | 11
2 | - | - | 11
3 | 14:13 | - | 11

3. Performance charts

A) ES performance

B) Operating system performance

C) ES data samples

{
  "_index": "secrepo-logstash",
  "_type": "doc",
  "_id": "vVswKGIBmeYOsZrYlb8C",
  "_version": 1,
  "_score": null,
  "_source": {
    "geoip": {
      "city_name": "Islamabad",
      "continent_code": "AS",
      "timezone": "Asia/Karachi",
      "longitude": 73.0113,
      "latitude": 33.6957,
      "country_name": "Pakistan",
      "region_code": "IS",
      "ip": "103.255.6.250",
      "postal_code": "44000",
      "country_code2": "PK",
      "region_name": "Islamabad Capital Territory",
      "country_code3": "PK",
      "location": { "lon": 73.0113, "lat": 33.6957 }
    },
    "os_name": "Ubuntu",
    "device": "Other",
    "message": "103.255.6.250 - - [09/Mar/2018:02:24:56 -0800] \"GET /twitter-icon.png HTTP/1.1\" 200 27787 \"http://www.secrepo.com/\" \"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:58.0) Gecko/20100101 Firefox/58.0\"",
    "path": "/home/zhanyl/examples-master/Graph/apache_logs_security_analysis/data/access.log",
    "major": "58",
    "@timestamp": "2018-03-09T10:24:56.000Z",
    "referrer": "\"http://www.secrepo.com/\"",
    "clientip": "103.255.6.250",
    "verb": "GET",
    "minor": "0",
    "os": "Ubuntu",
    "request": "/twitter-icon.png",
    "@version": "1",
    "host": "zylxpack",
    "build": "",
    "ident": "-",
    "auth": "-",
    "response": "200",
    "httpversion": "1.1",
    "bytes": "27787",
    "name": "Firefox"
  },
  "fields": { "@timestamp": [ "2018-03-09T10:24:56.000Z" ] },
  "sort": [ 1520591096000 ]
}

4.1.3. Apache NiFi

4.1.3.1. Run commands

Initialize file state so the test can be repeated.

Start: select the relevant process group and click Run.

4.1.3.2. Data monitoring

1. Total records

897123

2. Completion time

 

Run | Start time | End time | Duration (min)
1 | 14:08 | - | 26
2 | - | 15:07 | 29
3 | 15:08 | - | 29

3. Performance charts

A) ES performance

B) Operating system performance

C) ES data samples

{
  "_index": "nifi",
  "_type": "nifi",
  "_id": "C5XrLGIBmeYOsZrYiU7w",
  "_version": 1,
  "_score": null,
  "_source": {
    "clientip": "103.255.6.250",
    "ident": "-",
    "auth": "-",
    "verb": "GET",
    "request": "/twitter-icon.png",
    "httpversion": "1.1",
    "rawrequest": null,
    "response": 200,
    "bytes": 27787,
    "referrer": "http://www.secrepo.com/",
    "agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:58.0) Gecko/20100101 Firefox/58.0",
    "@timestamp": "2018-03-09T10:24:56.000Z"
  },
  "fields": { "@timestamp": [ "2018-03-09T10:24:56.000Z" ] },
  "sort": [ 1520591096000 ]
}

4.1.4. SDC

4.1.4.1. Run commands

Create the mapping first.

Note: SDC writes @timestamp as an epoch value rather than a standard timestamp string (see the ES data sample below); if it emitted a standard timestamp string, the mapping could be created dynamically.

PUT /sdc
{
  "mappings": {
    "sdc": {
      "properties": {
        "@timestamp": { "type": "date" },
        "geo": { "type": "geo_point" },
        "city": { "type": "text", "index": false }
      }
    }
  }
}
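Assuming the body above is saved to a file (sdc_mapping.json is an illustrative name), it can be imported with the same curl pattern used earlier:

curl -XPUT -H 'Content-Type: application/json' 'lzyl1:9200/sdc?pretty' -d @sdc_mapping.json -u elastic:zylelk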

Initialize file state so the test can be repeated.

Start the pipeline.

4.1.4.2. Data monitoring

1. Total records

897070

The 53 missing records failed because their client IPs could not be resolved to a geographic location, for example:

clientip: 193.200.150.82

clientip: 193.200.150.152

address '91.197.234.102'

address '103.234.188.37'

2. Completion time

 

Run | Start time | End time | Duration (min)
1 | - | - | 16
2 | - | - | 11
3 | - | - | 14

An exception occurred during the third run, but processing was not interrupted.

See:

Github.com/zhan-yl/ELK…

3. Performance charts

A) ES performance

B) Operating system performance

C) ES data samples

{
  "_index": "sdc",
  "_type": "sdc",
  "_id": "Wur8LWIBmeYOsZrYnL-S",
  "_version": 1,
  "_score": null,
  "_source": {
    "request": "/twitter-icon.png",
    "agent": "\"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:58.0) Gecko/20100101 Firefox/58.0\"",
    "auth": "-",
    "ident": "-",
    "verb": "GET",
    "referrer": "\"http://www.secrepo.com/\"",
    "response": "200",
    "bytes": "27787",
    "clientip": "103.255.6.250",
    "httpversion": "1.1",
    "rawrequest": null,
    "geo": { "lat": 33.6957, "lon": 73.0113 },
    "@timestamp": 1520591096000,
    "city": "Islamabad"
  },
  "fields": { "@timestamp": [ "2018-03-09T10:24:56.000Z" ] },
  "sort": [ 1520591096000 ]
}

4.2. Twitter log test

4.2.1. Apache NiFi

4.2.1.1. Run commands

Initialize file state so the test can be repeated.

Start: select the relevant process group and click Run.

4.2.1.2. Data monitoring

1. Total records

52759

2. Completion time

 

Run | Start time | End time | Duration (min)
1 | - | - | 5
2 | - | - | 4
3 | - | - | 4

3. Performance charts

A) ES performance

B) Operating system performance

C) ES data samples

{
  "_index": "twitter-nifi",
  "_type": "doc",
  "_id": "u7NThmIBROKaR930KYYi",
  "_version": 1,
  "_score": null,
  "_source": {
    "geo": null,
    "id": 980626639842357200,
    "id_str": "980626639842357249",
    "lang": "en",
    "user_mentions": [],
    "favorite_count": 23,
    "favorited": false,
    "retweet_count": 25,
    "text": "Cops: Man arrested with weapons cache, bump stock claimed secret government mission https://t.co/XaHdZOmV8z https://t.co/N3ca4nAUSv",
    "user": {
      "contributors_enabled": false,
      "created_at": "Thu Jun 05 00:54:31 +0000 2008",
      "default_profile": false,
      "default_profile_image": false,
      "description": "Your source for original reporting and trusted news.",
      "entities": {
        "description": { "urls": [] },
        "url": { "urls": [ { "display_url": "CBSNews.com", "expanded_url": "http://CBSNews.com", "indices": [ 0, 23 ], "url": "https://t.co/VGut7r2Vg5" } ] }
      },
      "favourites_count": 270,
      "follow_request_sent": false,
      "followers_count": 6490476,
      "following": false,
      "friends_count": 431,
      "geo_enabled": false,
      "has_extended_profile": false,
      "id": 15012486,
      "id_str": "15012486",
      "is_translation_enabled": true,
      "is_translator": false,
      "lang": "en",
      "listed_count": 47812,
      "location": "New York, NY",
      "name": "CBS News",
      "notifications": false,
      "profile_background_color": "D9DADA",
      "profile_background_image_url": "http://pbs.twimg.com/profile_background_images/736106551/37bf1f784305fe4a9c7e9105772c6e1a.jpeg",
      "profile_background_image_url_https": "https://pbs.twimg.com/profile_background_images/736106551/37bf1f784305fe4a9c7e9105772c6e1a.jpeg",
      "profile_background_tile": false,
      "profile_banner_url": "https://pbs.twimg.com/profile_banners/15012486/1519827973",
      "profile_image_url": "http://pbs.twimg.com/profile_images/645966750941626368/d0Q4voGK_normal.jpg",
      "profile_image_url_https": "https://pbs.twimg.com/profile_images/645966750941626368/d0Q4voGK_normal.jpg",
      "profile_link_color": "B12124",
      "profile_sidebar_border_color": "FFFFFF",
      "profile_sidebar_fill_color": "EAEDF0",
      "profile_text_color": "000000",
      "profile_use_background_image": true,
      "protected": false,
      "screen_name": "CBSNews",
      "statuses_count": 168688,
      "time_zone": "Eastern Time (US & Canada)",
      "translator_type": "none",
      "url": "https://t.co/VGut7r2Vg5",
      "utc_offset": -14400,
      "verified": true
    },
    "created_at": "Mon Apr 02 02:03:04 +0000 2018",
    "place": null,
    "total_count": 48,
    "@timestamp": "2018-04-02T02:03:04.000Z"
  },
  "fields": { "@timestamp": [ "2018-04-02T02:03:04.000Z" ] },
  "sort": [ 1522634584000 ]
}

4.2.2. SDC

4.2.2.1. Run commands

Initialize file state so the test can be repeated.

Start the pipeline.

4.2.2.2. Data monitoring

1. Total records

52759

2. Completion time

 

Run | Start time | End time | Duration (min)
1 | 21:00 | 21:05 | 5
2 | 21:06 | - | 6
3 | - | - | 5

3. Performance charts

A) ES performance

B) Operating system performance

C) ES data samples

{
  "_index": "twitter-sdc",
  "_type": "doc",
  "_id": "VbVuhmIBROKaR930GV0_",
  "_version": 1,
  "_score": null,
  "_source": {
    "created_at": "Mon Apr 02 02:03:04 +0000 2018",
    "entities": { "user_mentions": [] },
    "favorite_count": 23,
    "favorited": false,
    "geo": null,
    "id": 980626639842357200,
    "id_str": "980626639842357249",
    "lang": "en",
    "place": null,
    "retweet_count": 25,
    "text": "Cops: Man arrested with weapons cache, bump stock claimed secret government mission https://t.co/XaHdZOmV8z https://t.co/N3ca4nAUSv",
    "user": {
      "description": "Your source for original reporting and trusted news.",
      "favourites_count": 270,
      "followers_count": 6490476,
      "friends_count": 431,
      "id": 15012486,
      "location": "New York, NY",
      "name": "CBS News"
    },
    "@timestamp": "2018-04-02T10:03:04.000+08:00",
    "lat": null,
    "lon": null,
    "total_count": 48
  },
  "fields": { "@timestamp": [ "2018-04-02T02:03:04.000Z" ] },
  "sort": [ 1522634584000 ]
}

5.1. Data results

After the data finally enters ES, the record counts are essentially equal across tools, but the indexed data size and field content differ, which is also a reason for the differences in execution time.

5.2. Comparison

5.2.1. Filebeat + ingest node

1. Advantages

A) Relatively lightweight, with low system resource consumption

B) The ingest node is built into the ES cluster, so no additional deployment is needed

C) Pipeline processing can be defined with a simple JSON document

2. Disadvantages

A) Limited functionality: only simple data can be processed, and the available processors are currently a subset of what Logstash offers

B) An ingest node cannot pull data by itself; other tools must write data to it. If the data source is, say, Kafka, you must also deploy a shipper that can read from it and write to ES

C) No visualization support

5.2.2. Logstash

1. Advantages

A) Abundant plugins for all kinds of data processing

B) Broad interface support and relatively mature technology

2. Disadvantages

A) Difficult to develop, debug, and trace, with no visualization

B) Processing more complex data structures is hard to implement and tends to rely on embedded Ruby code

C) Performance bottlenecks cannot be detected through visual monitoring; the new pipeline viewer in Kibana alleviates this to some extent

5.2.3. Apache NiFi

1. Advantages

A) The whole processing flow can be edited and monitored graphically

B) Single-step debugging is possible: each processing step can be verified before moving on to the next, and the complete handling of the data can be traced through Data Provenance

C) A back-pressure mechanism for data

D) While running, each step in the flow can be modified at any time to adjust resource usage, without interrupting the whole flow

E) Rich interfaces, an expression language, and data type conversion mechanisms

F) A REST interface, through which further application development can be done

G) Because files are used as the data carrier, the size of the data processed is essentially unlimited

H) Well suited to large data structures and batch processing

2. Disadvantages

A) Because files are used, data must be written to disk and files are frequently opened and closed, which lowers performance and demands high-performance disks

B) It consumes a lot of memory and generates a large number of attributes, making OOM errors easy to trigger

C) It is also sensitive to the limit on simultaneously open files and requires a high file-descriptor (nofile) limit

5.2.4. SDC

1. Advantages

A) Visual processing, with monitoring metrics displayed for each stage

B) The processing flow can be previewed and the results inspected before running

C) Many built-in processors and stages that can be combined into complex pipelines, with support for an expression language, Groovy, JavaScript, Jython, etc.

D) A REST interface, which can be called directly and used for further application development

E) A built-in exception alerting mechanism

F) High processing performance, thanks to in-memory batch processing

G) Well suited to small records and real-time processing

2. Disadvantages

A) Because data is processed in memory, record size is limited to avoid OOM: the buffer is capped at 1048576 bytes, and anything beyond that is truncated

B) Single-step debugging is not possible, and a running pipeline cannot be modified

About the author

Zhan Yulin is a big data engineer at the System Management Center of China Minsheng Banking Corporation. He has worked as an R&D engineer developing the bank's core system and as an IBM database support engineer, and now focuses on real-time big data solutions.
