Like the Grok processor, the Dissect processor extracts structured fields from a single text field in a document. Unlike Grok, however, it does not use regular expressions for parsing. This makes Dissect's syntax simpler and, in some cases, faster than the Grok processor.

Dissect matches a single text field against a defined pattern. I covered the Grok and Dissect processors in my previous article "Elastic Observability – Structuring Data using Pipeline." In today's article, we take a closer look at the Dissect processor through a series of examples.


Hands-on practice

A simple example

Let’s start with a simple example:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "Example using dissect processor",
    "processors": [
      {
        "dissect": {
          "field": "message",
          "pattern": "%{@timestamp} [%{loglevel}] %{status}"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "2019-09-29T00:39:02.912Z [Debug] MyApp stopped"
      }
    }
  ]
}

Above, we extract fields from the message using a pattern. In Dissect, whitespace deserves particular attention: if the spaces do not match exactly, parsing fails with an error. The result of the above is:

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "@timestamp" : "2019-09-29T00:39:02.912Z",
          "loglevel" : "Debug",
          "message" : "2019-09-29T00:39:02.912Z [Debug] MyApp stopped",
          "status" : "MyApp stopped"
        },
        "_ingest" : {
          "timestamp" : "2020-12-09T04:40:40.894589Z"
        }
      }
    }
  ]
}

It extracts @timestamp, loglevel, and status. Note that the [ and ] characters are not included in the extracted values.
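To make the mechanics concrete, here is a rough, hypothetical sketch in Python of how dissect-style matching works: the pattern is split into %{key} placeholders and the literal delimiters between them, and the input is then consumed with plain string searches rather than regular expression matching. This is only an illustration of the idea, not the actual Elasticsearch implementation.

```python
import re

def dissect(pattern: str, text: str) -> dict:
    # Split the pattern into alternating literals and key names:
    # "%{a} [%{b}]" -> literals ["", " [", "]"], keys ["a", "b"]
    parts = re.split(r"%\{(.*?)\}", pattern)
    literals, keys = parts[0::2], parts[1::2]
    if not text.startswith(literals[0]):
        raise ValueError("pattern does not match")
    pos = len(literals[0])
    result = {}
    for key, lit in zip(keys, literals[1:]):
        if lit:
            # Plain substring search for the next literal delimiter -- no regex.
            end = text.index(lit, pos)
        else:
            # The last key consumes the remainder of the input.
            end = len(text)
        result[key] = text[pos:end]
        pos = end + len(lit)
    return result

print(dissect("%{@timestamp} [%{loglevel}] %{status}",
              "2019-09-29T00:39:02.912Z [Debug] MyApp stopped"))
```

Because matching is driven purely by literal delimiters, a single mismatched space makes the whole parse fail, which is exactly the strictness we will see below.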


Skipping a field

Since Dissect performs exact matching, there may be parts of a message that we can parse but do not want to keep in our document. Let's look at the following example:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "Example using dissect processor",
    "processors": [
      {
        "dissect": {
          "field": "message",
          "pattern": "%{@timestamp} [%{?loglevel}] %{status}"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "2019-09-29T00:39:02.912Z [Debug] MyApp stopped"
      }
    }
  ]
}

In the example above, we used %{?loglevel}, which indicates that we do not want loglevel to appear in the result:

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "@timestamp" : "2019-09-29T00:39:02.912Z",
          "message" : "2019-09-29T00:39:02.912Z [Debug] MyApp stopped",
          "status" : "MyApp stopped"
        },
        "_ingest" : {
          "timestamp" : "2020-12-09T04:47:24.7823Z"
        }
      }
    }
  ]
}

Obviously, the loglevel field is missing from this output.


Handling multiple spaces

The Dissect processor is very strict: whitespace must match exactly, otherwise parsing will not succeed. For example:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "Example using dissect processor",
    "processors": [
      {
        "dissect": {
          "field": "message",
          "pattern": "%{@timestamp} %{status}"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "2019-09-29  MyApp stopped"
      }
    }    
  ]
}

Note that in this document there are two spaces between the date and MyApp stopped, while the pattern expects only one. The result is:

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "@timestamp" : "2019-09-29",
          "message" : "2019-09-29  MyApp stopped",
          "status" : ""
        },
        "_ingest" : {
          "timestamp" : "2020-12-09T05:01:58.065065Z"
        }
      }
    }
  ]
}

As you can see from the results above, the message is not parsed correctly: the status field is empty. So how do we deal with this? We can use the right padding modifier -> to ignore the extra padding:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "Example using dissect processor",
    "processors": [
      {
        "dissect": {
          "field": "message",
          "pattern": "%{@timestamp->} %{status}"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "2019-09-29  MyApp stopped"
      }
    }    
  ]
}

The result of the above run is:

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "@timestamp" : "2019-09-29",
          "message" : "2019-09-29  MyApp stopped",
          "status" : "MyApp stopped"
        },
        "_ingest" : {
          "timestamp" : "2020-12-09T05:07:23.294188Z"
        }
      }
    }
  ]
}
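Conceptually, the right padding modifier tells Dissect that after a key matches, any repeated trailing delimiter characters should be swallowed before the next key starts. A minimal Python sketch of that idea (purely illustrative, not the real implementation):

```python
def split_with_padding(text: str, delimiter: str = " ") -> tuple:
    # Match the first key up to the delimiter...
    first, _, rest = text.partition(delimiter)
    # ...then, as %{@timestamp->} would, strip any leftover delimiters
    # before the next key begins.
    return first, rest.lstrip(delimiter)

# Works the same whether the document has one space or several:
print(split_with_padding("2019-09-29  MyApp stopped"))
print(split_with_padding("2019-09-29 MyApp stopped"))
```

Both calls yield ("2019-09-29", "MyApp stopped"), which is exactly why the -> modifier makes the pattern tolerant of variable padding.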

We can also use an empty key to skip unwanted spaces:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "Example using dissect processor",
    "processors": [
      {
        "dissect": {
          "field": "message",
          "pattern": "[%{@timestamp}]%{->}[%{status}]"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "[2019-09-29] [MyApp stopped]"
      }
    },
    {
      "_source": {
        "message": "[2019-09-29]  [MyApp stopped]"
      }
    }    
  ]
}

Above, we used %{->} to match the unwanted spaces, and we used two documents: one containing a single space and one containing two spaces. The results are as follows:

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "@timestamp" : "2019-09-29",
          "message" : "[2019-09-29] [MyApp stopped]",
          "status" : "MyApp stopped"
        },
        "_ingest" : {
          "timestamp" : "2020-12-09T05:21:14.752694Z"
        }
      }
    },
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "@timestamp" : "2019-09-29",
          "message" : "[2019-09-29]  [MyApp stopped]",
          "status" : "MyApp stopped"
        },
        "_ingest" : {
          "timestamp" : "2020-12-09T05:21:14.752701Z"
        }
      }
    }
  ]
}


Appending fields

In many cases, we may want to append several matches into a single field, for example:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "Example using dissect processor",
    "processors": [
      {
        "dissect": {
          "field": "message",
          "pattern": "%{@timestamp} %{+@timestamp} %{+@timestamp} %{loglevel} %{status}",
          "append_separator": " "
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "Oct 29 00:39:02 Debug MyApp stopped"
      }
    }    
  ]
}

Above, our time expression is Oct 29 00:39:02, which is made up of three strings. We combine these three strings into a single @timestamp field using %{@timestamp} %{+@timestamp} %{+@timestamp}. The result of running the above is:

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "@timestamp" : "Oct 29 00:39:02",
          "loglevel" : "Debug",
          "message" : "Oct 29 00:39:02 Debug MyApp stopped",
          "status" : "MyApp stopped"
        },
        "_ingest" : {
          "timestamp" : "2020-12-09T05:27:29.785206Z"
        }
      }
    }
  ]
}

Note that in the example above we set append_separator to a single space. Otherwise the three strings would be concatenated in the result as Oct2900:39:02, which is probably not what we want in practice.
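The append behavior can be sketched as collecting every %{+key} match and joining the pieces with append_separator. A small illustrative Python snippet (an analogy for the behavior, not the real implementation — Dissect's actual default separator is empty):

```python
def append_matches(pieces, append_separator=""):
    # All matches for the same appended key are joined with the separator.
    return append_separator.join(pieces)

# The three matches from %{@timestamp} %{+@timestamp} %{+@timestamp}:
pieces = ["Oct", "29", "00:39:02"]

print(append_matches(pieces))        # default: pieces run together
print(append_matches(pieces, " "))   # with a space separator
```

The first call produces Oct2900:39:02, the second Oct 29 00:39:02 — mirroring the two behaviors described above.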


Extracting key-value pairs

We can use %{*field} to capture a key name and %{&field} to capture its value:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "Example using dissect processor key-value",
    "processors": [
      {
        "dissect": {
          "field": "message",
          "pattern": "%{@timestamp} %{*field1}=%{&field1} %{*field2}=%{&field2}"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "2019-09-29T00:39:02.912Z host=AppServer status=STATUS_OK"
      }
    }
  ]
}

The result of the above run is:

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "@timestamp" : "2019-09-29T00:39:02.912Z",
          "host" : "AppServer",
          "message" : "2019-09-29T00:39:02.912Z host=AppServer status=STATUS_OK",
          "status" : "STATUS_OK"
        },
        "_ingest" : {
          "timestamp" : "2020-12-09T05:34:30.47561Z"
        }
      }
    }
  ]
}
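The pairing logic behind reference keys can be sketched like this: %{*name} captures a key and %{&name} captures a value, and matches sharing the same reference name are joined into one output field. A hypothetical Python illustration of just that pairing step:

```python
def pair_references(starred, amped):
    # starred: reference name -> matched key text   (from %{*name})
    # amped:   reference name -> matched value text (from %{&name})
    return {starred[name]: amped[name] for name in starred}

# What the pattern above would capture from
# "2019-09-29T00:39:02.912Z host=AppServer status=STATUS_OK":
starred = {"field1": "host", "field2": "status"}
amped = {"field1": "AppServer", "field2": "STATUS_OK"}
print(pair_references(starred, amped))
```

The output {"host": "AppServer", "status": "STATUS_OK"} matches the host and status fields in the simulated result above.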


Challenge yourself

From the above exercises, you may have noticed that the Dissect processor is both very useful and very simple to use. So let's work through a more realistic example:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
    ]
  },
  "docs": [
    {
      "_source": {
        "message": """Mar 22 01:27:39 localhost haproxy[14415]: Server updates /appServer02 is UP, reason: Layer7 check passed, code:2000, info:"OK", check duration:3ms."""
      }
    }
  ]
}

This is an example of an HAProxy log. It's a long message. How can we use processors to turn this information into a structured document?

We can use the Dissect processor. Based on what we have learned above, we can do this first:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "dissect": {
          "field": "message",
          "pattern": "%{timestamp} %{+timestamp} %{+timestamp}"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": """Mar 22 01:27:39 localhost haproxy[14415]: Server updates /appServer02 is UP, reason: Layer7 check passed, code:2000, info:"OK", check duration:3ms."""
      }
    }
  ]
}

Above, we concatenate the first three strings into a timestamp field. Run the command above:

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "message" : """Mar 22 01:27:39 localhost haproxy[14415]: Server updates /appServer02 is UP, reason: Layer7 check passed, code:2000, info:"OK", check duration:3ms.""",
          "timestamp" : """Mar2201:27:39 localhost haproxy[14415]: Server updates /appServer02 is UP, reason: Layer7 check passed, code:2000, info:"OK", check duration:3ms."""
        },
        "_ingest" : {
          "timestamp" : "2020-12-09T05:38:44.674567Z"
        }
      }
    }
  ]
}

Clearly, the first three strings are concatenated into one, and the match is greedy: the final %{+timestamp} consumes all of the remaining text. We need to refine the pattern:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "dissect": {
          "field": "message",
          "pattern": "%{timestamp} %{+timestamp} %{+timestamp} %{host}",
          "append_separator": " "
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": """Mar 22 01:27:39 localhost haproxy[14415]: Server updates /appServer02 is UP, reason: Layer7 check passed, code:2000, info:"OK", check duration:3ms."""
      }
    }
  ]
}

We add append_separator and use %{host} to match all subsequent strings:

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "host" : """localhost haproxy[14415]: Server updates /appServer02 is UP, reason: Layer7 check passed, code:2000, info:"OK", check duration:3ms.""",
          "message" : """Mar 22 01:27:39 localhost haproxy[14415]: Server updates /appServer02 is UP, reason: Layer7 check passed, code:2000, info:"OK", check duration:3ms.""",
          "timestamp" : "Mar 22 01:27:39"
        },
        "_ingest" : {
          "timestamp" : "2020-12-09T05:41:53.667182Z"
        }
      }
    }
  ]
}

Obviously, this time we can see the timestamp field clearly, but the host field is still a long string. We continue with:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "dissect": {
          "field": "message",
          "pattern": "%{timestamp} %{+timestamp} %{+timestamp} %{host} %{process}[%{id}]:%{rest}",
          "append_separator": " "
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": """Mar 22 01:27:39 localhost haproxy[14415]: Server updates /appServer02 is UP, reason: Layer7 check passed, code:2000, info:"OK", check duration:3ms."""
      }
    }
  ]
}

Above, we extract the process name and its ID, and put the remainder into %{rest}:

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "rest" : """ Server updates /appServer02 is UP, reason: Layer7 check passed, code:2000, info:"OK", check duration:3ms.""",
          "process" : "haproxy",
          "host" : "localhost",
          "id" : "14415",
          "message" : """Mar 22 01:27:39 localhost haproxy[14415]: Server updates /appServer02 is UP, reason: Layer7 check passed, code:2000, info:"OK", check duration:3ms.""",
          "timestamp" : "Mar 22 01:27:39"
        },
        "_ingest" : {
          "timestamp" : "2020-12-09T05:46:11.833548Z"
        }
      }
    }
  ]
}

Looking at the rest field above, the first part is a status and the remainder is key-value data, which we can handle with the KV processor.

Let’s first extract status:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "dissect": {
          "field": "message",
          "pattern": "%{timestamp} %{+timestamp} %{+timestamp} %{host} %{process}[%{id}]:%{status}, %{rest}",
          "append_separator": " "
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": """Mar 22 01:27:39 localhost haproxy[14415]: Server updates /appServer02 is UP, reason: Layer7 check passed, code:2000, info:"OK", check duration:3ms."""
      }
    }
  ]
}

Run the command above:

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "rest" : """reason: Layer7 check passed, code:2000, info:"OK", check duration:3ms.""",
          "process" : "haproxy",
          "host" : "localhost",
          "id" : "14415",
          "message" : """Mar 22 01:27:39 localhost haproxy[14415]: Server updates /appServer02 is UP, reason: Layer7 check passed, code:2000, info:"OK", check duration:3ms.""",
          "status" : " Server updates /appServer02 is UP",
          "timestamp" : "Mar 22 01:27:39"
        },
        "_ingest" : {
          "timestamp" : "2020-12-09T05:50:18.300969Z"
        }
      }
    }
  ]
}

Now we have the status field. The remaining rest field is clearly key-value data, which we can process with the KV processor:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "dissect": {
          "field": "message",
          "pattern": "%{timestamp} %{+timestamp} %{+timestamp} %{host} %{process}[%{id}]:%{status}, %{rest}",
          "append_separator": " "
        }
      },
      {
        "kv": {
          "field": "rest",
          "field_split": ", ",
          "value_split": ":"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": """Mar 22 01:27:39 localhost haproxy[14415]: Server updates /appServer02 is UP, reason: Layer7 check passed, code:2000, info:"OK", check duration:3ms."""
      }
    }
  ]
}

Above, we added a kv processor:

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "rest" : """reason: Layer7 check passed, code:2000, info:"OK", check duration:3ms.""",
          "reason" : " Layer7 check passed",
          "process" : "haproxy",
          "code" : "2000",
          "check duration" : "3ms.",
          "message" : """Mar 22 01:27:39 localhost haproxy[14415]: Server updates /appServer02 is UP, reason: Layer7 check passed, code:2000, info:"OK", check duration:3ms.""",
          "host" : "localhost",
          "id" : "14415",
          "status" : " Server updates /appServer02 is UP",
          "timestamp" : "Mar 22 01:27:39",
          "info" : "\"OK\""
        },
        "_ingest" : {
          "timestamp" : "2020-12-09T06:00:37.990909Z"
        }
      }
    }
  ]
}
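The splitting behavior of the kv processor with field_split ", " and value_split ":" can be sketched in a few lines of Python, purely as an illustration (not the real implementation): split the field into pairs, then split each pair at the first occurrence of the value separator.

```python
def kv(text, field_split, value_split):
    result = {}
    for pair in text.split(field_split):
        # Split each pair at the FIRST value_split occurrence;
        # everything after it (including quotes) stays in the value.
        key, _, value = pair.partition(value_split)
        result[key] = value
    return result

rest = 'reason: Layer7 check passed, code:2000, info:"OK", check duration:3ms.'
print(kv(rest, ", ", ":"))
```

Note how this reproduces the quirks in the result above: the reason value keeps its leading space (the text after "reason:" starts with one), and info keeps its surrounding quotes.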

From the above results, we can see all the fields we want. Let's remove the now-unneeded message and rest fields:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "dissect": {
          "field": "message",
          "pattern": "%{timestamp} %{+timestamp} %{+timestamp} %{host} %{process}[%{id}]:%{status}, %{rest}",
          "append_separator": " "
        }
      },
      {
        "kv": {
          "field": "rest",
          "field_split": ", ",
          "value_split": ":"
        }
      },
      {
        "remove": {
          "field": "message"
        }
      },
      {
        "remove": {
          "field": "rest"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": """Mar 22 01:27:39 localhost haproxy[14415]: Server updates /appServer02 is UP, reason: Layer7 check passed, code:2000, info:"OK", check duration:3ms."""
      }
    }
  ]
}

Above, I used the remove processor to delete the message and rest fields:

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "reason" : " Layer7 check passed",
          "process" : "haproxy",
          "code" : "2000",
          "check duration" : "3ms.",
          "host" : "localhost",
          "id" : "14415",
          "status" : " Server updates /appServer02 is UP",
          "timestamp" : "Mar 22 01:27:39",
          "info" : "\"OK\""
        },
        "_ingest" : {
          "timestamp" : "2020-12-09T05:59:44.138394Z"
        }
      }
    }
  ]
}

This step-by-step process shows how to turn unstructured data into a structured document.