This article is shared from Huawei cloud community “DO SQL performance optimization is really staring”, author: Shi Zhenzhen’s grocery store.

A lot of big data calculations are implemented in SQL, run slowly to optimize SQL, but often encounter people stare. For example, the stored procedure has three statements that look something like this executing slowly:

Select a,b,sum(x) from T group by a,b where... ; Select c,d, Max (y) from T group by c,d where... ; Select a,c,avg(y),min(z) from T group by a,c... ;Copy the code

Here T is a huge table with hundreds of millions of rows, grouped in three different ways, each with a small result set.

Group operation to traverse the table, the three SQL statements will traverse the large table three times, traversal hundreds of millions of rows of data a time is not short, let alone three times. In this grouping operation, the CPU computation time is negligible relative to the time spent traversing the hard disk. If you can compute multiple subtotals in one traversal, you can multiply your speed by dramatically reducing the amount of data your disk reads, even though the CPU doesn’t do less. If SQL supports syntax like this:

Select a,b,sum(x) group by a,b where (x, x, x) Select c,d, Max (y) from c,d where Max (y) = Max (y) where Max (y) = Max (y) where Max (y) = Max (y) where Max (y) = Max (y) Select a,c,avg(y),min(z) group by a,c where avg(y),min(z) ; Third grouping in traversalCopy the code

The ability to return more than one result set at a time can significantly improve performance.

Unfortunately, without the SQL syntax, write such statements, can only use a flexible approach, is to give the group a, b, c, d more detailed grouping method to calculate the first result set, but should be put first into a temporary table, to use SQL to calculate the target results further. SQL is as follows:

Create table T_temp as select a,b,c,d, sum(case when... Then x else 0 end) sumx, Max (case when... Then Y else NULL end) sum(case when... Then y else 0 end) sumy, case when... Then 1 else NULL end) county, min(case when... then z else null end) minz group by a,b,c,d; Select a,b,sum(sumx) from T_temp group by a,b where... ; Select c,d, Max (maxy) from T_temp group by c,d where... ; Select a,c,sum(sumy)/sum(county),min(minz) from T_temp group by a,c where... ;Copy the code

This only takes one walk, but it’s a lot more complicated and computation-intensive to transfer different WHERE conditions to the previous case WHEN. In addition, when calculating a temporary table, the number of grouped fields becomes too large, resulting in a large result set. In the end, the temporary table is traversed multiple times, and the calculation performance is not fast. Group calculation of large result sets requires hard disk caching, and its performance is poor.

You can also use the database cursor of the stored procedure to fetch data out one by one, but this requires a full implementation of WHERE and GROUP actions, too cumbersome to write, database cursor traversal data performance will only be worse! Can only stare!

TopN operations will also encounter this kind of frustration. For example, using Oracle SQL to write top5 looks like this:

select * from (select x from T order by x desc) where rownum<=5
Copy the code

SQL > select * from table T where 1 billion entries are stored; select * from table T where 1 billion entries are stored; Big sort cost is very high, the amount of data is very large memory can not fit, there will be many hard disk data switching, computing performance will be very poor!

Is not difficult to avoid big sort in memory to keep a small set of 5 records, traversing the data, will have been calculated data stored in the top five small collection, take to the new data if is greater than the current 5th, is inserted into and lose now fifth, if smaller than the current fifth, do not do the action. By doing this, you only need to traverse a billion pieces of data once, and the memory footprint is very small, and the performance is greatly improved.

In essence, this algorithm treats TopN as an aggregation operation like summation and counting, but returns a set instead of a single value. Select top(x,5) from T;

Unfortunately, SQL has no explicit collection data type, and aggregate functions can only return single values. You can’t write such statements!

However, the good news is that the TopN of the whole set is relatively simple. Although SQL is written that way, databases are usually optimized in engineering to avoid large sorts. So Oracle is not slow to compute that SQL.

However, if the TopN case is too complex, the optimization engine usually doesn’t work when used in subqueries or mixed with joins. For example, to calculate the TopN of each group after grouping is a little difficult to write in SQL. Oracle SQL:

select * from
       (select y,x,row_number() over (partition by y order by x desc) rn from T)
where rn<=5
Copy the code

At this time, the database optimization engine is dizzy, will not use the above TopN to understand the method of aggregation operation. Can only go to do sort, the result of the operation speed drop!

If the SQL grouping TopN could be written like this:

select y,top(x,5) from T group by y
Copy the code

Treating top as an aggregate function like sum is not only more readable, but also easier to compute at high speed. Unfortunately, no.

Still staring!

Associative computing is also very common. For example, if an order is associated with multiple tables, then the filter calculation is performed. SQL looks like this:

select o.oid,o.orderdate,o.amount from orders o left join city ci on o.cityid = ci.cityid left join shipper sh on o.shid=sh.shid left join employee e on o.eid=e.eid left join supplier su on o.suid=su.suid where ci.state='New York' and  e.title = 'manager' and ...Copy the code

Order tables have tens of millions of data, cities, shippers, employees, suppliers, and so on. Filter criteria fields may come from these tables and are passed from the front end to the back end, changing dynamically.

SQL generally implements these associations using the HASH JOIN algorithm, which computs HASH values and compares them. Only one JOIN can be resolved at a time. There are N joins to perform N times, and the intermediate results need to be kept for the next round after each association. The calculation process is complicated, and the data will be traversed for many times, resulting in poor calculation performance.

Typically, these associated code tables are small enough to be read into memory first. If each associated field in the order table is pre-ordinated, for example, the value of the employee number field is converted to the serial number of the corresponding employee table record. When calculating, you can use the employee number field value (i.e. the employee table number) to fetch the employee table in memory, which is much faster than HASH JOIN. Moreover, you only need to iterate through the order table once, and the speed can be significantly improved. That is, you can write SQL like this:

Select o.o id, o.o rderdate, o.a mount from the orders o left the join city c on o. chua id = c # - orders table cities number by serial number # associated table left join shipper Sh on o.hid =sh.# left join employee e on =e.# left join supplier su Where ci.state='New York' and e.test =' manager' and...Copy the code

Unfortunately, SQL using the concept of unordered collection, even if the number is the serial number, the database can’t use this feature, not on the corresponding relation table the unordered collection using serial number fast positioning mechanism, will have to search for, using the index database and does not know what number is the serial number, will still be to compute the HASH value and the comparison, Performance is still poor!

There are good ways can not be implemented, can only stare again!

There are also high concurrent account queries, which are quite simple:

Select id, amt, tdate,... from T where id='10100' and tdate>= to_date('2021-01-10', 'yyyy-MM-dd') and tdate<to_date('2021-01-25', '- dd yyyy - MM) and...Copy the code

In the hundreds of millions of historical data of T table, it can quickly find several to thousands of details of an account. It is not complicated to write SQL, but the difficulty is that the response speed should reach the second level or even faster when large concurrency. In order to improve the response speed of the query, the id field of the T table is generally indexed:

create index index_T_1 on T(id)
Copy the code

Indexing individual account lookups in a database is fast, but becomes significantly slower when there are many concurrent transactions. The reason is based on the SQL disorder theory mentioned above. The total amount of data is too large to be read into memory, and the database cannot guarantee that the data of the same account is stored continuously physically. The hard disk has a minimum read unit. When discontinuous data is read, a lot of irrelevant content is retrieved, which slows down the query. Each query with high concurrent access is a little slower, and the overall performance will be poor. At a time of great emphasis on experience, who dares to make users wait more than ten seconds?

It is easy to think of a way to prioritize hundreds of millions of data in advance by account, to ensure that the data of the same account is stored continuously, and when querying, the data blocks read from the hard disk are almost always the target value, and the performance will be greatly improved.

However, relational databases using SQL architecture do not have this awareness and do not enforce the physical order of data storage! This problem is not caused by SQL syntax, but it is also related to the theoretical basis of SQL, and it is still impossible to implement these algorithms in a relational database.

To do that? Can only stare?

Instead of SQL and relational databases, use another computing engine.

The open source aggregator SPL, based on innovative theoretical foundations, supports more data types and operations to describe new algorithms in the scenarios described above. Using simple and convenient SPL to write code can greatly improve computing performance in a short time!

The above problems are written with SPL code sample for example:

A traverse computes multiple groupings

Calculate the Top5 by aggregation

Complete Set Top5 (Multi-threaded parallel computing)

Grouping Top5 (multi-threaded parallel computing)

SPL code associated with serial numbers:

System initialization

The query

SPL code for high concurrent account queries:

Data preprocessing, ordered storage

Account query

More in addition to these simple example, SPL can achieve high performance algorithm, such as orderly merge to realize the link between the order and detail, pre associated technology of multi-dimensional analysis of multi-layer associated dimension table, a storage technology to realize the thousands of tag statistics, Boolean set filter technology to implement multiple enumerated values of query speed, temporal grouping techniques to implement complex funnel analysis and so on.

For those who are struggling with SQL performance optimization, let’s discuss it with us:

SPL download address:…

SPL open Source address:…

Click to follow, the first time to learn about Huawei cloud fresh technology ~