Elasticsearch is a full-text search engine with all the benefits you would expect, such as relevance scoring, stemming, synonyms, and more. And because it is a horizontally scalable distributed document store, it can process billions of rows of data with ease. Elasticsearch specialists mostly prefer the Query DSL for searching, but for everyone else SQL is the more familiar language, and letting those users query Elasticsearch data is a real problem. With Elasticsearch SQL, you can run full-text searches using familiar query syntax while keeping Elasticsearch's speed and easy scalability. X-Pack includes an SQL feature that executes SQL queries against Elasticsearch indices and returns the results in tabular form.

In today’s article, we will take a brief look at how to use Elasticsearch SQL to query our data. A concrete use case for Elasticsearch SQL is also described in the previous article “Kibana: A Canvas Primer”.

 

Installation

For developers who don’t already have Elasticsearch installed, see my previous post “Elastic: A Beginner’s Guide” for instructions on installing Elasticsearch and Kibana. I won’t repeat them here.

 

Preparing the data

Open Kibana first:

Click “Load a Data Set and a Kibana Dashboard”:

Click “Add Data” to complete the import of the sample data. In Elasticsearch we will then find an index called kibana_sample_data_flights.

 

SQL in action

Querying the available indices

For Elasticsearch, you can use the following command:

POST /_sql? format=txt { "query": "SHOW tables" }Copy the code

The command above shows the result:

Retrieve Elasticsearch Schema information: DSL vs SQL

First, we determine the schema for the table/index and the fields available to us. We will do this via the REST interface:

POST /_sql
{
  "query": """
    DESCRIBE kibana_sample_data_flights
  """
}

The result of the above command:

{
  "columns" : [
    {
      "name" : "column",
      "type" : "keyword"
    },
    {
      "name" : "type",
      "type" : "keyword"
    },
    {
      "name" : "mapping",
      "type" : "keyword"
    }
  ],
  "rows" : [
    [
      "AvgTicketPrice",
      "REAL",
      "float"
    ],
    [
      "Cancelled",
      "BOOLEAN",
      "boolean"
    ],
    [
      "Carrier",
      "VARCHAR",
      "keyword"
    ],
    [
      "Dest",
      "VARCHAR",
      "keyword"
    ],
    [
      "DestAirportID",
      "VARCHAR",
      "keyword"
    ],
    [
      "DestCityName",
      "VARCHAR",
      "keyword"
    ],
    [
      "DestCountry",
      "VARCHAR",
      "keyword"
    ],
    [
      "DestLocation",
      "GEOMETRY",
      "geo_point"
    ],
    [
      "DestRegion",
      "VARCHAR",
      "keyword"
    ],
    [
      "DestWeather",
      "VARCHAR",
      "keyword"
    ],
    [
      "DistanceKilometers",
      "REAL",
      "float"
    ],
    [
      "DistanceMiles",
      "REAL",
      "float"
    ],
    [
      "FlightDelay",
      "BOOLEAN",
      "boolean"
    ],
    [
      "FlightDelayMin",
      "INTEGER",
      "integer"
    ],
    [
      "FlightDelayType",
      "VARCHAR",
      "keyword"
    ],
    [
      "FlightNum",
      "VARCHAR",
      "keyword"
    ],
    [
      "FlightTimeHour",
      "VARCHAR",
      "keyword"
    ],
    [
      "FlightTimeMin",
      "REAL",
      "float"
    ],
    [
      "Origin",
      "VARCHAR",
      "keyword"
    ],
    [
      "OriginAirportID",
      "VARCHAR",
      "keyword"
    ],
    [
      "OriginCityName",
      "VARCHAR",
      "keyword"
    ],
    [
      "OriginCountry",
      "VARCHAR",
      "keyword"
    ],
    [
      "OriginLocation",
      "GEOMETRY",
      "geo_point"
    ],
    [
      "OriginRegion",
      "VARCHAR",
      "keyword"
    ],
    [
      "OriginWeather",
      "VARCHAR",
      "keyword"
    ],
    [
      "dayOfWeek",
      "INTEGER",
      "integer"
    ],
    [
      "timestamp",
      "TIMESTAMP",
      "datetime"
    ]
  ]
}

You can also format the above response as a table with the URL parameter format=txt. For example:

POST /_sql? format=txt { "query": "DESCRIBE kibana_sample_data_flights" }Copy the code

The result of the above command query is:

      column      |     type      |    mapping    
------------------+---------------+---------------
AvgTicketPrice    |REAL           |float          
Cancelled         |BOOLEAN        |boolean        
Carrier           |VARCHAR        |keyword        
Dest              |VARCHAR        |keyword        
DestAirportID     |VARCHAR        |keyword        
DestCityName      |VARCHAR        |keyword        
DestCountry       |VARCHAR        |keyword        
DestLocation      |GEOMETRY       |geo_point      
DestRegion        |VARCHAR        |keyword        
DestWeather       |VARCHAR        |keyword        
DistanceKilometers|REAL           |float          
DistanceMiles     |REAL           |float          
FlightDelay       |BOOLEAN        |boolean        
FlightDelayMin    |INTEGER        |integer        
FlightDelayType   |VARCHAR        |keyword        
FlightNum         |VARCHAR        |keyword        
FlightTimeHour    |VARCHAR        |keyword        
FlightTimeMin     |REAL           |float          
Origin            |VARCHAR        |keyword        
OriginAirportID   |VARCHAR        |keyword        
OriginCityName    |VARCHAR        |keyword        
OriginCountry     |VARCHAR        |keyword        
OriginLocation    |GEOMETRY       |geo_point      
OriginRegion      |VARCHAR        |keyword        
OriginWeather     |VARCHAR        |keyword        
dayOfWeek         |INTEGER        |integer        
timestamp         |TIMESTAMP      |datetime       

Doesn’t it feel like the SQL era? 🙂

Moving forward, we will use the tabular response structure shown above whenever we provide a sample response from the REST API. To run the same queries from a console, launch the SQL CLI with the following command:

./bin/elasticsearch-sql-cli http://localhost:9200

The SQL CLI banner then appears on the screen.

It’s amazing: we are looking directly at an SQL command prompt. At the prompt, type the following command:

DESCRIBE kibana_sample_data_flights;

This result is the same as we got in Kibana.

The schema above is also returned with any query on the fields listed in the SELECT clause, giving any potential driver the type information it needs to format or manipulate the results. For example, consider a simple SELECT with a LIMIT clause to keep the response short. By default, a response returns at most 1000 rows.
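As a sketch of how a client might work with this row limit, the snippet below builds the request bodies for the SQL endpoint. The /_sql endpoint and its fetch_size and cursor parameters are documented Elasticsearch SQL features; the helper function names themselves are made up for illustration.

```python
import json

def sql_request_body(query, fetch_size=1000):
    # Elasticsearch returns at most `fetch_size` rows per SQL response
    # (1000 by default) plus a `cursor` token when more rows remain.
    return json.dumps({"query": query, "fetch_size": fetch_size})

def next_page_body(cursor):
    # Follow-up request: send only the cursor to fetch the next page.
    return json.dumps({"cursor": cursor})

body = sql_request_body("SELECT FlightNum FROM flights", fetch_size=100)
```

POSTing such bodies to /_sql and looping while the response contains a cursor pages through the full result set.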

The index name kibana_sample_data_flights is long, so for convenience we will create an alias:

PUT /kibana_sample_data_flights/_alias/flights

From now on, when we use flights, we will actually be operating on the index kibana_sample_data_flights.

We execute the following command:

POST /_sql? format=txt { "query": "SELECT FlightNum FROM flights LIMIT 1" }Copy the code

Display result:

   FlightNum   
---------------
9HY9SWR        

The same REST request/response is used by the JDBC driver and console:

sql> SELECT OriginCountry, OriginCityName FROM flights LIMIT 1;
 OriginCountry | OriginCityName  
---------------+-----------------
DE             |Frankfurt am Main

Note that if a requested field does not exist (field names are case sensitive), the semantics of a tabular, strongly typed store mean that an error is returned, unlike regular Elasticsearch behavior, where the field is simply absent from the results. For example, modifying the above to use the field “OrigincityName” instead of “OriginCityName” yields a helpful error message:

sql> SELECT OriginCountry, OrigincityName FROM flights LIMIT 1;
Bad request [Found 1 problem(s)
line 1:23: Unknown column [OrigincityName], did you mean any of [OriginCityName, DestCityName]?]

Similarly, if we try to apply a function or expression to an incompatible field, we get a corresponding error. In general, the analyzer fails early when validating the AST. To do this, Elasticsearch must know the index mapping and the capabilities of each field; therefore, any client with access to the SQL interface needs the appropriate permissions.
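The “did you mean” hint in that error can be mimicked client side. The sketch below uses Python’s difflib to suggest close column names for a typo; it only illustrates the idea and is not how Elasticsearch actually computes its suggestions.

```python
import difflib

# A subset of the columns from DESCRIBE kibana_sample_data_flights.
COLUMNS = ["OriginCityName", "DestCityName", "OriginCountry", "DestCountry"]

def suggest(unknown, columns=COLUMNS):
    # Return the closest-matching known column names for a misspelling.
    return difflib.get_close_matches(unknown, columns, n=2, cutoff=0.6)

print(suggest("OrigincityName")[0])
```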

If we continue to provide every request and reply accordingly, we will end up with a lengthy blog post! For brevity, here are some increasingly complex queries with interesting annotations.

SELECT with WHERE and ORDER BY

“Find the 10 longest flights out of the US that are longer than 5 hours.”

POST /_sql? format=txt { "query": """ SELECT OriginCityName, DestCityName FROM flights WHERE FlightTimeHour > 5 AND OriginCountry='US' ORDER BY FlightTimeHour DESC LIMIT 10 """ }Copy the code

The result is:

OriginCityName |   DestCityName    
---------------+-------------------
Chicago        |Oslo               
Cleveland      |Seoul              
Denver         |Chitose / Tomakomai
Nashville      |Verona             
Minneapolis    |Tokyo              
Portland       |Treviso            
Spokane        |Vienna             
Kansas City    |Zurich             
Kansas City    |Shanghai           
Los Angeles    |Zurich       

The operator that limits the number of rows varies from one SQL implementation to another. Elasticsearch SQL implements the LIMIT operator, following PostgreSQL/MySQL.

Math

Just some made-up numbers…

sql> SELECT ((1 + 3) * 1.5 / (7 - 6)) * 2 AS random;
    random     
---------------
12.0           

This is an example of post-processing performed server side. There is no equivalent Elasticsearch DSL query.
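The same expression evaluated locally in Python confirms the arithmetic the server performs:

```python
# SELECT ((1 + 3) * 1.5 / (7 - 6)) * 2 AS random, computed client side.
# Elasticsearch SQL evaluates this entirely server side, touching no index.
random_value = ((1 + 3) * 1.5 / (7 - 6)) * 2
print(random_value)  # 12.0
```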

Functions & Expressions

“Find all flights after February that are longer than one hour, sorted by the longest first.”

POST /_sql? format=txt { "query": """ SELECT MONTH_OF_YEAR(timestamp), OriginCityName, DestCityName FROM flights WHERE FlightTimeHour > 1 AND MONTH_OF_YEAR(timestamp) > 2 ORDER BY FlightTimeHour DESC LIMIT 10 ""}"Copy the code

The result is:

MONTH_OF_YEAR(timestamp)|OriginCityName | DestCityName  
------------------------+---------------+---------------
4                       |Chicago        |Oslo           
4                       |Osaka          |Spokane        
4                       |Quito          |Tucson         
4                       |Shanghai       |Stockholm      
5                       |Tokyo          |Venice         
5                       |Tokyo          |Venice         
5                       |Tokyo          |Venice         
5                       |Buenos Aires   |Treviso        
5                       |Amsterdam      |Birmingham     
5                       |Edmonton       |Milan     

Achieving the same effect in the Elasticsearch DSL usually requires Painless scripting, whereas the SQL function declarations avoid any scripting. Notice also how we use the function in both the WHERE and SELECT clauses. The WHERE clause component is pushed down to Elasticsearch because it affects the result count; the SELECT function component is handled server side when the results are presented.

Note that the list of available functions can be retrieved with “SHOW FUNCTIONS”:

sql> SHOW FUNCTIONS;
      name       |     type      
-----------------+---------------
AVG              |AGGREGATE      
COUNT            |AGGREGATE      
FIRST            |AGGREGATE      
FIRST_VALUE      |AGGREGATE      
LAST             |AGGREGATE      
LAST_VALUE       |AGGREGATE      
MAX              |AGGREGATE 
 ...

Combining this with the mathematical capabilities shown earlier, we can start to formulate queries that would be very complex to express in the DSL for most users.

“Find the distance and average speed of the two fastest flights departing between 9am and 11am on a Monday, Tuesday, or Wednesday that cover more than 500 km. Round the distance and speed to the nearest whole number. If the speeds are equal, show the greater distance first.”

First, in the output of the DESCRIBE kibana_sample_data_flights command above, we can see that FlightTimeHour is mapped as a keyword. That is clearly wrong, because it is a number; it was probably left that way in the original design. We need to change this field to float, so we create a new index with a corrected mapping:

PUT flight1
{
  "mappings": {
    "properties": {
      "AvgTicketPrice": {
        "type": "float"
      },
      "Cancelled": {
        "type": "boolean"
      },
      "Carrier": {
        "type": "keyword"
      },
      "Dest": {
        "type": "keyword"
      },
      "DestAirportID": {
        "type": "keyword"
      },
      "DestCityName": {
        "type": "keyword"
      },
      "DestCountry": {
        "type": "keyword"
      },
      "DestLocation": {
        "type": "geo_point"
      },
      "DestRegion": {
        "type": "keyword"
      },
      "DestWeather": {
        "type": "keyword"
      },
      "DistanceKilometers": {
        "type": "float"
      },
      "DistanceMiles": {
        "type": "float"
      },
      "FlightDelay": {
        "type": "boolean"
      },
      "FlightDelayMin": {
        "type": "integer"
      },
      "FlightDelayType": {
        "type": "keyword"
      },
      "FlightNum": {
        "type": "keyword"
      },
      "FlightTimeHour": {
        "type": "float"
      },
      "FlightTimeMin": {
        "type": "float"
      },
      "Origin": {
        "type": "keyword"
      },
      "OriginAirportID": {
        "type": "keyword"
      },
      "OriginCityName": {
        "type": "keyword"
      },
      "OriginCountry": {
        "type": "keyword"
      },
      "OriginLocation": {
        "type": "geo_point"
      },
      "OriginRegion": {
        "type": "keyword"
      },
      "OriginWeather": {
        "type": "keyword"
      },
      "dayOfWeek": {
        "type": "integer"
      },
      "timestamp": {
        "type": "date"
      }
    }
  }
}  

Next, we reindex the data into the new index:

POST _reindex
{
  "source": {
    "index": "flights"
  },
  "dest": {
    "index": "flight1"
  }
}

Now the FlightTimeHour field in flight1’s data is of type float. We repoint the flights alias accordingly:

POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "flight1",
        "alias": "flights"
      }
    },
    {
      "remove": {
        "index": "kibana_sample_data_flights",
        "alias": "flights"
      }
    }
  ]
}

Now flights is an alias pointing to flight1.
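Because both actions in an _aliases request take effect atomically, this add-then-remove pattern is the standard way to repoint an alias without downtime. A small Python helper (a hypothetical name, for illustration) can build that body for any index pair:

```python
def swap_alias_body(alias, old_index, new_index):
    # Build the POST _aliases body that moves `alias` from old_index to
    # new_index; both actions apply in a single atomic request.
    return {
        "actions": [
            {"add": {"index": new_index, "alias": alias}},
            {"remove": {"index": old_index, "alias": alias}},
        ]
    }

body = swap_alias_body("flights", "kibana_sample_data_flights", "flight1")
```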

We use the following SQL statement to query:

sql> SELECT timestamp, FlightNum, OriginCityName, DestCityName,
     ROUND(DistanceMiles) AS distance, ROUND(DistanceMiles/FlightTimeHour) AS speed,
     DAY_OF_WEEK(timestamp) AS day_of_week
     FROM flights
     WHERE DAY_OF_WEEK(timestamp) >= 0 AND DAY_OF_WEEK(timestamp) <= 2
       AND HOUR_OF_DAY(timestamp) >= 9 AND HOUR_OF_DAY(timestamp) <= 10
     ORDER BY speed DESC, distance DESC
     LIMIT 2;

        timestamp        |   FlightNum   |OriginCityName | DestCityName  |   distance    |     speed     |  day_of_week  
-------------------------+---------------+---------------+---------------+---------------+---------------+---------------
2020-05-17T10:53:52.000Z |LAJSKLT        |through        |Lima           |11398.0        |783.0          |1              
2020-04-27T09:30:39.000Z |VLUDO2H        |Buenos Aires   |Moscow         |8377.0         |783.0          |2              

A rather convoluted question, but hopefully you get the point. Also notice how we create column aliases and reference them in the ORDER BY clause.

Also note that you do not need to include every field used in WHERE and ORDER BY in the SELECT clause. This may differ from SQL implementations you have used in the past. For example, the following is perfectly valid:

POST /_sql
{
  "query":"SELECT timestamp, FlightNum FROM flights WHERE AvgTicketPrice > 500 ORDER BY AvgTicketPrice"
}

It shows:

{ "columns" : [ { "name" : "timestamp", "type" : "datetime" }, { "name" : "FlightNum", "type" : "text" } ], "rows" : [[the 2020-04-26 T09:04:20. 000 "z", "QG5DXD3"], [2020-05-02 T23: depart. 000 "z", "NXA71BT"], [the 2020-04-17 T01: will. 000 z, "VU8K9DM"], [" : the 2020-04-24 T08 46:45. 000 z ", "UM8IKF8"],...Copy the code

Converting SQL queries into DSL

We have all struggled to express an SQL query in the Elasticsearch DSL, or wondered whether the query we wrote was optimal. One of the compelling features of the new SQL interface is its ability to help new Elasticsearch adopters with exactly this problem. Using the REST interface, we simply append /translate to the _sql endpoint to obtain the Elasticsearch query the driver would issue.

Let’s consider some previous queries:

POST /_sql/translate
{
  "query": "SELECT OriginCityName, DestCityName FROM flights WHERE FlightTimeHour > 5 AND OriginCountry='US' ORDER BY FlightTimeHour DESC LIMIT 10"
}

The equivalent DSL should be obvious to any experienced Elasticsearch user:

{ "size" : 10, "query" : { "bool" : { "must" : [ { "range" : { "FlightTimeHour" : { "from" : 5, "to" : Null, "include_lower" : false, "include_upper" : false, "boost" : 1.0}}}, {"term" : {"OriginCountry. Keyword ": {" value ":" US ", "boost" : 1.0}}}], "adjust_pure_negative" : true, "boost" : 1.0}}, "_source" : {" includes ": [ "OriginCityName", "DestCityName" ], "excludes" : [ ] }, "sort" : [ { "FlightTimeHour" : { "order" : "desc", "missing" : "_first", "unmapped_type" : "float" } } ] }Copy the code

The WHERE clause is converted into range and term queries, as you would expect. Notice how the OriginCountry.keyword sub-field is used for the exact match rather than the parent OriginCountry (of type text). The user does not need to know the behavior of the underlying mapping; the correct field type is selected automatically. Interestingly, the interface tries to optimize retrieval performance by using docvalue_fields rather than _source where possible, for example for exact types (numbers, dates, keywords) with doc values enabled. We can rely on Elasticsearch SQL to generate the optimal DSL for the query requested.
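A quick sketch of driving this endpoint from a client. The builder below only assembles the path and JSON body (the helper name is invented for illustration); sending it requires any HTTP client and a running cluster:

```python
import json

def translate_request(query):
    # The SQL-to-DSL endpoint: POST the statement to /_sql/translate and
    # the response is the DSL query the driver would execute.
    path = "/_sql/translate"
    body = json.dumps({"query": query})
    return path, body

path, body = translate_request(
    "SELECT OriginCityName, DestCityName FROM flights LIMIT 10")
```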

Now consider the most complex query we used last time:

POST /_sql/translate { "query": """ SELECT timestamp, FlightNum, OriginCityName, DestCityName, ROUND(DistanceMiles) AS distance, ROUND(DistanceMiles/FlightTimeHour) AS speed, DAY_OF_WEEK(timestamp) AS day_of_week FROM flights WHERE DAY_OF_WEEK(timestamp) >= 0 AND DAY_OF_WEEK(timestamp) <= 2 AND  HOUR_OF_DAY(timestamp) >=9 AND HOUR_OF_DAY(timestamp) <= 10 ORDER BY speed DESC, distance DESC LIMIT 2 """ }Copy the code

The above response is:

{ "size" : 2, "query" : { "bool" : { "must" : [ { "script" : { "script" : { "source" : "InternalSqlScriptUtils.nullSafeFilter(InternalSqlScriptUtils.and(InternalSqlScriptUtils.gte(InternalSqlScriptUtils.date TimeChrono(InternalSqlScriptUtils.docValue(doc,params.v0), params.v1, params.v2), params.v3), InternalSqlScriptUtils.lte(InternalSqlScriptUtils.dateTimeChrono(InternalSqlScriptUtils.docValue(doc,params.v4), params.v5, params.v6), params.v7)))", "lang" : "painless", "params" : { "v0" : "timestamp", "v1" : "Z", "v2" : "HOUR_OF_DAY", "v3" : 9, "v4" : "timestamp", "v5" : "Z", "v6" : "HOUR_OF_DAY", "v7" : 10 } }, "boost" : 1.0}}, {"script" : {"script" : {"source" : "InternalSqlScriptUtils.nullSafeFilter(InternalSqlScriptUtils.and(InternalSqlScriptUtils.gte(InternalSqlScriptUtils.dayO fWeek(InternalSqlScriptUtils.docValue(doc,params.v0), params.v1), params.v2), InternalSqlScriptUtils.lte(InternalSqlScriptUtils.dayOfWeek(InternalSqlScriptUtils.docValue(doc,params.v3), params.v4), params.v5)))", "lang" : "painless", "params" : { "v0" : "timestamp", "v1" : "Z", "v2" : 0, "v3" : "timestamp", "v4" : "Z", "the v5" : 2}}, "boost" : 1.0}}], "adjust_pure_negative" : true, "boost" : 1.0}}, "_source" : {" includes ": [ "FlightNum", "OriginCityName", "DestCityName", "DistanceMiles", "FlightTimeHour" ], "excludes" : [ ] }, "docvalue_fields" : [ { "field" : "timestamp", "format" : "epoch_millis" } ], "sort" : [ { "_script" : { "script" : { "source" : "InternalSqlScriptUtils.nullSafeSortNumeric(InternalSqlScriptUtils.round(InternalSqlScriptUtils.div(InternalSqlScriptUti ls.docValue(doc,params.v0),InternalSqlScriptUtils.docValue(doc,params.v1)),params.v2))", "lang" : "painless", "params" : { "v0" : "DistanceMiles", "v1" : "FlightTimeHour", "v2" : null } }, "type" : "number", "order" : "desc" } }, { "_script" : { "script" : { "source" : "InternalSqlScriptUtils.nullSafeSortNumeric(InternalSqlScriptUtils.round(InternalSqlScriptUtils.docValue(doc,params.v0), params.v1))", "lang" : 
"painless", "params" : { "v0" : "DistanceMiles", "v1" : null } }, "type" : "number", "order" : "desc" } } ] }Copy the code

Don’t you think it’s complicated?

Our WHERE and ORDER BY clauses have been converted into Painless scripts, used in the script queries and script-based sorts that Elasticsearch provides. The scripts are even parameterized to avoid recompilation and to take advantage of the script cache.

As a side note, while the above represents the optimal translation of the SQL statement, it is not the best solution to the broader problem. In practice we would want to encode the day of the week, the hour of the day, and the speed into each document at index time, so that simple range queries could be used instead. That would likely perform better than Painless scripts for this particular problem; indeed, some of these fields already exist in the documents for exactly this reason. This is a common theme users should be aware of: although we can rely on the Elasticsearch SQL implementation to provide the optimal translation, it can only use the fields specified in the query, so it does not necessarily provide the best solution to the larger problem. Getting the best approach requires thinking about the strengths of the underlying platform, and the _translate API can be the first step in that process.
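As a sketch of that index-time approach, suppose we enrich each document before indexing it. The field names hour_of_day, day_of_week, and speed are invented for illustration, and note that Python’s weekday() numbers Monday as 0, which need not match Elasticsearch’s DAY_OF_WEEK numbering:

```python
from datetime import datetime

def enrich(doc):
    # Precompute at ingest time what the SQL above derived with Painless,
    # so the query becomes plain range filters on stored fields.
    out = dict(doc)
    ts = datetime.fromisoformat(doc["timestamp"])
    out["hour_of_day"] = ts.hour
    out["day_of_week"] = ts.weekday()  # Monday = 0 (Python convention)
    if doc.get("FlightTimeHour"):
        out["speed"] = round(doc["DistanceMiles"] / doc["FlightTimeHour"])
    return out

doc = enrich({"timestamp": "2020-05-17T10:53:52",
              "DistanceMiles": 11398.0, "FlightTimeHour": 14.56})
```

With fields like these in the mapping, the earlier query reduces to range filters on hour_of_day and day_of_week plus a sort on speed, with no scripting at all.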

If you want to learn more about Elasticsearch, please refer to the article “Elasticsearch: Introduction to Elasticsearch SQL”.

 
