If your log entries are message-based but machine-generated, you first need to group them into similar message types before you can use them for anomaly detection. This process is called categorization, and Elastic ML can do it for you. Categorization imposes structure on semi-structured data so that it can be analyzed. The advantage is that you can find anomalies in a log without knowing beforehand what the messages contain.

Which kinds of messages are suitable for categorization

We need to be a little stricter about which message-based log lines we consider here. We do not consider completely free-form log lines, events, or documents, which are most likely written by humans (emails, tweets, comments, and so on). Such messages are arbitrary and vary widely in structure and content.

Instead, we focus on machine-generated messages that an application emits when it encounters particular conditions or exceptions. This limits their construction and content to a relatively discrete set of possibilities (although some parts of a message may still vary). For example, let’s look at the following lines from an application log:

18/05/2016 15:16:00 S ACME6 DB Not Updated [Master] Table
18/05/2016 15:16:00 S ACME6 REC Not INSERTED [DB TRAN] Table
18/05/2016 15:16:07 S ACME6 Using: 10.16.1.63! svc_prod#uid=demo; pwd=demo
18/05/2016 15:16:07 S ACME6 Opening Database = DRIVER={SQL Server}; SERVER = 10.16.1.63; network=dbmssocn; Address = 10.16.1.63, 1433; DATABASE =svc_prod; uid=demo; pwd=demo; AnsiNPW=No
18/05/2016 15:16:29 S ACME6 DBMS ERROR: db=10.16.1.63! svc_prod#uid=demo; pwd=demo Err=-11 [Microsoft][ODBC SQL Server Driver][TCP/IP Sockets]General network error. Check your network documentation.

Here, we can see that each message has different text, but there is some structure. After the date/time stamp and the server name (in this case, ACME6) comes the actual content of the message, in which the application tells the outside world what is happening at that moment: whether something is being attempted or an error has occurred.

 

Categorization process

To detect order in unordered log files, Elastic ML uses string similarity clustering to group similar messages together. The heuristics of the algorithm are roughly as follows:

  • Focus on dictionary words rather than mutable words (for example, network and address are dictionary words, but dbmssocn is probably a mutable string)
  • Pass the immutable dictionary words through a string similarity algorithm (similar to Levenshtein distance) to determine how similar a log line is to past log lines
  • If the difference between the current log line and an existing category is small, group the current log line into that category
  • Otherwise, create a new category for the current log line

As a simple example, consider the following three messages:

Error writing file "foo" on host "acme6"
Error writing file "bar" on host "acme5"
Opening database on host "acme7"

The algorithm puts the first two messages into the same category because both are treated as an "Error writing file on" message type, while the third message is given its own (new) category.

The naming of these categories is simple: ML calls them mlcategory N, where N is an incrementing integer. In this example, the first two lines are associated with mlcategory 1 and the third with mlcategory 2. In real machine logs, thousands (or even tens of thousands) of categories might be generated because of the diversity of log messages, but the set of possible categories should be finite. If the number of categories starts to reach hundreds of thousands, however, it becomes clear that the log messages are not a finite set of message types and are therefore not suitable for this type of analysis.
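The heuristics above can be sketched in a few lines of Python. This is only an illustration under loud assumptions, not Elastic ML's actual implementation: purely alphabetic tokens stand in for "dictionary words", quoted values are treated as mutable, Python's difflib.SequenceMatcher stands in for the similarity measure, and the 0.7 threshold and the function names are hypothetical.

```python
import re
from difflib import SequenceMatcher


def dictionary_tokens(message):
    """Keep only tokens that look like dictionary words (assumption:
    purely alphabetic tokens; quoted values are treated as mutable)."""
    without_quoted = re.sub(r'"[^"]*"', " ", message)
    return [t.lower() for t in without_quoted.split() if t.isalpha()]


def categorize(messages, threshold=0.7):
    """Assign each message a category number, creating categories on demand."""
    categories = []   # token list of the first message seen in each category
    assignments = []
    for message in messages:
        tokens = dictionary_tokens(message)
        best_id, best_score = None, 0.0
        for idx, example in enumerate(categories):
            score = SequenceMatcher(None, example, tokens).ratio()
            if score > best_score:
                best_id, best_score = idx, score
        if best_id is not None and best_score >= threshold:
            # Small difference: reuse the existing category.
            assignments.append(best_id + 1)
        else:
            # Otherwise open a new category for this message.
            categories.append(tokens)
            assignments.append(len(categories))
    return assignments


messages = [
    'Error writing file "foo" on host "acme6"',
    'Error writing file "bar" on host "acme5"',
    'Opening database on host "acme7"',
]
assignments = categorize(messages)
print(assignments)  # [1, 1, 2]
```

The first two messages reduce to the same token sequence and share category 1, while the third only shares "on host" with them and falls below the threshold, so it opens category 2.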

 

How does it work?

Suppose we have a set of log messages:

Above is a set of Linux log messages. Let’s look at how categorization works on them.

The first step is to strip out the mutable words, that is, the parts that change from message to message:

The date, the IP address, and the hidden fields above change constantly, so they are removed. What remains is shown above.
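As a rough illustration of this first step, the following sketch strips the obviously variable parts from sshd lines like the ones in the secure.log file used later in this article. The regular expressions and the name strip_mutable are assumptions made for this example; Elastic ML uses its own tokenization rules.

```python
import re

# (pattern, replacement) pairs for parts that vary from line to line.
# These patterns are illustrative assumptions, not Elastic ML's rules.
MUTABLE_PATTERNS = [
    (re.compile(r"^[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s+"), ""),  # syslog timestamp
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), " "),                      # IPv4 address
    (re.compile(r"\[\d+\]"), ""),                                           # process id
    (re.compile(r"\b\d+\b"), " "),                                          # remaining numbers
]


def strip_mutable(line):
    """Remove variable tokens and collapse the leftover whitespace."""
    for pattern, replacement in MUTABLE_PATTERNS:
        line = pattern.sub(replacement, line)
    return " ".join(line.split())


line = "Oct 22 15:02:19 localhost sshd[8860]: Disconnected from 58.218.92.41 port 26062 [preauth]"
print(strip_mutable(line))
# localhost sshd: Disconnected from port [preauth]
```

Two disconnect messages from different addresses and ports reduce to the same string, which is what makes the later grouping step possible.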

The second step is to aggregate similar messages:

If we take a closer look, we can see the following categories:

As shown above, there are six different categories. The messages within each category are very similar in structure. Some categories contain a single document, while others, such as category 6, contain many.

The third step is to compute statistics for each time bucket:

The number of events in each category is counted for every time bucket. ML can then perform anomaly analysis on these counts.
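The per-bucket counting can be sketched as follows. The category numbers, the last timestamp, and the 15-minute bucket span are made up for the example, and the helper names are hypothetical; in a real job the bucket span comes from the job configuration.

```python
from collections import Counter
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def bucket_start(timestamp, span):
    """Round a timestamp down to the start of its time bucket."""
    return EPOCH + ((timestamp - EPOCH) // span) * span


def count_per_bucket(events, span=timedelta(minutes=15)):
    """Count events per (bucket start, category) pair."""
    counts = Counter()
    for timestamp, category in events:
        counts[(bucket_start(timestamp, span), category)] += 1
    return counts


# (timestamp, category) pairs; the category ids here are hypothetical.
events = [
    (datetime(2018, 10, 22, 15, 2, 19), 1),
    (datetime(2018, 10, 22, 15, 2, 19), 2),
    (datetime(2018, 10, 22, 18, 17, 58), 3),
    (datetime(2018, 10, 22, 18, 17, 58), 4),
    (datetime(2018, 10, 22, 18, 20, 1), 4),
]
counts = count_per_bucket(events)
for (start, category), n in sorted(counts.items()):
    print(start, "category", category, "count", n)
```

A category that normally shows a count near zero in every bucket and then suddenly shows several events in one bucket is the kind of deviation the anomaly analysis picks up.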

 

Hands-on practice

Import experimental data

In the following exercise, we use a concrete example to demonstrate categorization. You can download the data and configuration files from the following address:

git clone https://github.com/liu-xiao-guo/ml_varlogsecure

After downloading the repository above, I use Filebeat to import the data into Elasticsearch:

filebeat_advanced.yml

filebeat.inputs:
- type: log
  paths:
    - /Users/liuxg/data/ml_data/advanced/secure.log
  
output.elasticsearch:
  hosts: ["http://localhost:9200"]
  index: varlogsecure
  pipeline: varlogsecure
 
setup.ilm.enabled: false
setup.template.name: varlogsecure
setup.template.pattern: varlogsecure

The index we import into is named varlogsecure. Remember to change the path of secure.log above to match your own machine.

The event content of the secure.log file is as follows:

Oct 22 15:02:19 localhost sshd[8860]: Received disconnect from 58.218.92.41 port 26062:11: [preauth]
Oct 22 15:02:19 localhost sshd[8860]: Disconnected from 58.218.92.41 port 26062 [preauth]
Oct 22 18:17:58 localhost sshd[8903]: reverse mapping checking getaddrinfo for host-41.43.112.199.tedata.net [41.43.112.199] failed - POSSIBLE BREAK-IN ATTEMPT!
Oct 22 18:17:58 localhost sshd[8903]: Invalid user admin from 41.43.112.199 port 41805

As we can see above, the timestamps contain no year. To be able to process these events, we must first define an ingest pipeline in Elasticsearch:

PUT /_ingest/pipeline/varlogsecure
{
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": [
          "%{MONTH:month} %{MONTHDAY:day} %{TIME:time}"
        ]
      }
    },
    {
      "set": {
        "field": "timestamp",
        "value": "2018 {{month}} {{day}} {{time}}"
      }
    },
    {
      "date": {
        "field": "timestamp",
        "target_field": "@timestamp",
        "formats": [
          "yyyy MMM dd HH:mm:ss"
        ]
      }
    },
    {
      "remove": {
        "field": ["timestamp", "day", "time", "month"]
      }
    }
  ]
}

Above, we define a pipeline called varlogsecure. This pipeline is also referenced in filebeat_advanced.yml above.
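To make it clearer what the pipeline does to each document, here is a small Python sketch that mirrors its steps: extract month, day, and time from the start of the message, prepend the year 2018, and parse the result into @timestamp. The regular expression only approximates the grok patterns, and the function name add_timestamp is hypothetical; this is an offline illustration, not how Elasticsearch executes the pipeline.

```python
import re
from datetime import datetime

# Rough equivalent of "%{MONTH:month} %{MONTHDAY:day} %{TIME:time}"
# (an approximation; the real grok patterns accept more variants).
SYSLOG_PREFIX = re.compile(
    r"^(?P<month>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+(?P<time>\d{2}:\d{2}:\d{2})"
)


def add_timestamp(doc, year=2018):
    """Mimic the grok, set, date, and remove processors on one document."""
    match = SYSLOG_PREFIX.match(doc["message"])
    if match is None:
        raise ValueError("message does not start with a syslog timestamp")
    # Same layout as the pipeline's "yyyy MMM dd HH:mm:ss" format.
    text = f"{year} {match['month']} {match['day']} {match['time']}"
    # %b depends on the locale; an English month abbreviation is assumed.
    doc["@timestamp"] = datetime.strptime(text, "%Y %b %d %H:%M:%S")
    return doc


doc = add_timestamp(
    {"message": "Oct 22 15:02:19 localhost sshd[8860]: Disconnected from 58.218.92.41 port 26062 [preauth]"}
)
print(doc["@timestamp"])  # 2018-10-22 15:02:19
```

The temporary month, day, and time values never end up in the document here, which corresponds to the remove processor at the end of the pipeline.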

You can import data into Elasticsearch by starting Filebeat:

./filebeat -e -c filebeat_advanced.yml 

After running Filebeat, we can find the newly created varlogsecure index in Elasticsearch:

GET _cat/indices

Next, we create an index pattern for varlogsecure. I won’t go into the details here. We can then view the data in Discover:

As shown above, our data falls between October 22 and October 26, 2018. We can also see a field called message, which contains the full text of each original log line. This message field is used for categorization below.

 

Machine learning: categorization

Open the Machine Learning app:

Click Manage jobs, or Create job if you have never created a machine learning job.

Click Create job:

Select the varlogsecure index:

Select Advanced:

Click Next:

In the Categorization field, we choose the message field. Click Add detector:

In the By field above, we select mlcategory. Click the Save button:

Click Next:

Click Next again:

Click Next:

Click Create job:

Click Start:

Go to the Anomaly Explorer:

Here, we can see that some anomalies have occurred. Open the anomaly recorded on October 25, 2018:

We can see that there is an anomaly for mlcategory 7. Its typical value is 0.00192, but the actual number of events in this bucket is 7, so it is flagged as an anomalous event.
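The wizard steps above amount to a fairly small job definition. As a hedged sketch, the following Python snippet builds a request body that could be sent to the create anomaly detection job API (PUT _ml/anomaly_detectors/<job_id>). The count function and the 15m bucket span are assumptions, since the article does not state which values were chosen in the wizard, and the description text is made up.

```python
import json

# Sketch of a categorization job definition. The field names follow the
# Elasticsearch ML job API; the values marked "assumed" are not taken
# from the article.
job_config = {
    "description": "Categorize secure.log messages and count events per category",
    "analysis_config": {
        "bucket_span": "15m",                     # assumed
        "categorization_field_name": "message",  # the field chosen in the wizard
        "detectors": [
            {
                "function": "count",              # assumed
                "by_field_name": "mlcategory",    # split the count by category
            }
        ],
    },
    "data_description": {"time_field": "@timestamp"},
}

print(json.dumps(job_config, indent=2))
```

Using mlcategory as the by field is what ties the detector to the categories produced from the message field, so each category gets its own baseline of expected counts.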

 

Conclusion

In today’s exercise, we used Elastic’s categorization feature to analyze our log messages. The advantage is that machine learning can find the anomalies in the logs without our having to know anything about the log contents in advance.