As mentioned in the last article, there is a requirement to modify the existing storage structure, which involves consideration of query conditions and query efficiency. I read several articles related to index and HBase, recalled relevant knowledge, and discussed my understanding and summary based on project requirements.

The general table of contents is as follows. The first three sections are introduced in the previous part, which analyzes why index is fast, summarizes its advantages and classification, and the evolution process of index. The middle part will focus on index analysis methods and common index optimization.

  • Why you need an index
  • Category of index
  • MySQL > alter database alter database
  • MySQL > select * from database;
  • HBase is introduced
  • HBase Storage Structure
  • HBase Index
  • Business requirements and design

Part of the content is excerpted from several bloggers’ posts, with links at the end thanking them for their excellent analysis.

Through the introduction of the middle part, you will know:

  • MySQL > select * from ‘MySQL’;
  • Advanced query related concepts
  • Explain command details
  • Index optimization recommendations

MySQL > select * from ‘MySQL’;

To better optimize the query, we must first understand its overall query process, from the client to send the query request, to receive the query result, MySQL server has done a lot of work.

Logical architecture

The overall logical architecture of MySQL is divided into three layers, namely the client layer, the core service layer and the storage engine layer.

The top layer is the client layer, such as connection processing, authorization, security and other functions.

The middle layer is the core services of MySQL, including query parsing, analysis, optimization, caching, built-in functions (e.g., timing, math, encryption, etc.), in addition, all cross-storage engine functions are implemented in this layer: stored procedures, triggers, views, etc.

The lowest layer is the storage engine, responsible for data storage and extraction, and the service layer in the middle communicates with the storage engine through apis that mask the differences between the different storage engines.

Specific Execution process

Take a look at how MySQL optimizes and executes queries. A lot of query optimization is just following a few principles to make the MySQL optimizer work the way it’s supposed to.

Let’s start with the overall process:

  • The client sends a query SQL to the server;
  • The server checks the query cache first and returns the results stored in the cache immediately if a hit is made.
  • SQL parsing and preprocessing are performed on the server, and then the optimizer generates the corresponding execution plan.
  • The query execution engine invokes the storage engine’s API to execute the query according to the execution plan generated by the optimizer.
  • Returns the result to the client;

1. Client/server communication protocol

The communication protocol between the MySQL client and server is “half duplex” : at any one time, either the server sends data to the client or the client sends data to the server, which means that there is no traffic control.

Client with a single packet will query requests sent to the server, the server response to the user data is usually a lot of, is composed of multiple data packets, it is important to note when the server response to a client request, the client must complete receiving the returned results, not just only take a few front as a result, then stop the server to send.

2. Query the cache

If the query cache is open, the query statement is checked to see if it matches the data in the query cache, and if so, the result in the cache is directly returned after the user permissions are checked.

The query cache system keeps track of every table involved in the query, and MySQL must invalidate all caches of the corresponding table during any write operation. If the query cache is very large or fragmented, this operation can be very costly to the system.

In addition, any query statement must be checked before it is started. Even if the SQL statement will never hit the cache, if the query result can be cached, it will be cached after execution, resulting in additional system consumption.

Therefore, be careful when enabling caching. Caching improves system performance only when it saves more resources than it consumes. You can set query_cache_type to DEMAND.

3. Syntax parsing and preprocessing

The SQL statement is parsed by keywords to generate a parse tree. Pre-processing further checks whether the parse tree is valid according to MySQL rules.

4. Query optimization

A query can be executed in many ways, and the optimizer’s job is to find the best execution plan. MySQL uses a cost-based optimizer, which tries to predict the cost of a query using a certain execution plan and selects the one with the least cost.

5. Query the execution engine

The storage engine interface provides a lot of functionality, but there are only a few dozen interfaces underneath that do most of the work of a query like building blocks.

6. Return the result to the client

The return of the result set to the client is an incremental and gradual process, so that the server does not have to store too many results and consume too much memory, and the client can get the results in the first place.

SELECT execution order

Let’s look at the order in which the SQL query is executed. Each step generates a virtual temporary table as input to the next step.

The standard SQL syntax is as follows:

SELECT DISTINCT
    < select_list >
FROM
    < left_table > < join_type >
JOIN < right_table > ON < join_condition >
WHERE
    < where_condition >
GROUP BY
    < group_by_list >
HAVING
    < having_condition >
ORDER BY
    < order_by_condition >
LIMIT < limit_number >
Copy the code

But this is the order of execution:

FROM 
    <left_table>
ON <join_condition> <join_type> 
JOIN <right_table>
WHERE 
    <where_condition>
GROUP BY 
    <group_by_list>
HAVING 
    <having_condition>
SELECT 
DISTINCT 
    <select_list>
ORDER BY 
    <order_by_condition>
LIMIT 
    <limit_number>
Copy the code

1.FROM

When multiple tables are involved, the output of the left table is used as the input of the right table, and a virtual table VT1 is generated:

  • Calculate the Cartesian product (CROSS JOIN) of the two associated tables to generate the virtual table vt1-j1.
  • Based ON virtual table vt1-j1, the virtual table VT1-j2 is generated by filtering out all rows that meet ON predicate conditions.
  • If LEFT,RIGHT, and FULL join is used, the main table (reserved table) columns that do not meet the ON condition will be added to vt1-j2, resulting in virtual table vt1-j3.

2.WHERE

Temporary tables generated during VT1 are filtered, and columns that satisfy the WHERE clause are inserted into the VT2 table:

  • If there is an external join, ON filters the associated table. The main table returns all columns. If there is no external join, the effect is the same.
  • Filtering for the main table should be placed in WHERE;
  • For associated tables, ON is used for conditional query followed by join, and WHERE is used for conditional query followed by join.

3.GROUP BY

This clause groups the tables generated in VT2 BY the columns in GROUP BY to generate VT3 tables:

  • Columns used in subsequent processing statements, such as SELECT and HAVING, must be included in GROUP BY. For those that are not present, aggregate functions must be used.

4.HAVING

Different groups in VT3 table are filtered only for grouped data. Clauses that meet the HAVING condition are added to VT4 table.

5.SELECT

This clause processes the elements in the SELECT clause to generate the VT5 table:

  • Evaluate the expression in the SELECT clause to generate vt5-j1;
  • DISTINCT: A temporary memory table VT5-J2 is created, which is the same as the virtual table VT5-J1. The difference is that a unique index is added to the DISTINCT columns to divide the duplicate data.

6.ORDER BY

From the tables in VT5-j2, the VT6 table is generated BY sorting the results according to the conditions of the ORDER BY clause, which is the only place where aliases in SELECT can be used.

7.LIMIT

Selects the specified row starting at the specified location from the VT6 virtual table obtained in the previous step.

Advanced query related concepts

This section describes common advanced query concepts.

Join queries

A number of tables in accordance with a specified condition for data splicing, SQL will join query into four categories: internal join, external join, natural join, cross join, which natural join and cross join rarely used, but more introduction.

1. Inner join

Obtain each record from the left table and match it with all records in the right table. The matching record must meet the conditions in both the left and right tables. The matching record is retained; otherwise, the matching record is not retained.

2. Left /right join

There are two types of external connections:

  • Left join: left outer join (left join). The left table is the primary table
  • Right Join: the right outer join (right join), with the right table as the primary table

Take a certain table as the main table, fetch all records inside, no matter whether the condition can match, the main table will be retained, and then connect with another table, if not match, other table fields are NULL.

The subquery

A query is performed on top of a query result, that is, a SELECT statement contains another SELECT statement.

According to the location of the sub-query, it can be divided into:

  • From subquery: the subquery follows From;
  • Where subquery: subquery Where condition;
  • Exists subquery: a subquery occurs in exists.

Here are a few examples:

Find all employees whose department name prefix is “xiaomi” :

 SELECT name , sex ,  sal
        FROM emp
        WHERE no in ( 
            SELECT no FROM dept 
                WHERE name LIKE 'millet %');Copy the code

View the salaries of all employees and sort them by salary:

SELECT name , sal
       FROM (
           SELECT name , sal 
              FROM emp ORDER BYSal);Copy the code
The joint query

The query will be repeated, the result will be concatenated, the number of fields will not increase, each SELECT statement must obtain the same number of fields.

The syntax is as follows:

Selectstatements1

Union [unionOptions]Selectstatements2..Copy the code

The Union options:

  • All: keep All;
  • Distinct: deduplication, default option.

Write much again, add one more, medium to be continued…

Reference article:

  1. MySQL Optimization Principle
  2. Step by step: SQL parsing order

Please scan the QR code below and follow my personal wechat official account for more articles