Abstract:This article first describes the commonly used indexes, and introduces the two indexes of B-Tree and PSORT in detail. The advantages and disadvantages of indexes are given below. In addition to index, it also introduces partitioning, PCK and other ways to speed up queries. Finally, the use scenarios of various indexing and tuning methods are given.

This article is shared from the Huawei cloud community “DWS Index of the correct” open posture “, the original author: Hoholy.

What does an index do, in a word: query acceleration. Common indexes include the following:

1. Introduction to common indexes

1.1 B – btree index

B-tree storage structure is shown as follows:

  • B-tree is a balanced tree, which stores index KEY and TID in order.
  • For the filter condition on the index, the corresponding leaf node can be found quickly through the KEY, and then the actual record can be found through the TID.
  • The data in the index is stored in a non-descending order (both between pages and within pages), and the data pages in the same level are joined by a bidirectional linked list.
  • Support single-column index and composite index (multi-column), multi-column composite index is suitable for multi-column combined query, B-tree index has requirements for the order of query conditions;
  • B-tree index can handle the query of equivalence and range;
  • Index pages do not store transaction information;

Here’s an example of how to create a B-tree index in a database:

1.2 Psort index

The PSORT index data structure is shown in the following figure:

  • PSORT index itself is an inventory table, containing index columns and TIDs. It is locally sorted on index columns, and uses MIN/MAX block filtering to accelerate TID acquisition.
  • The Psort index itself is visible, but deleting or updating data does not affect the Psort index.
  • PSORT index is more suitable for range filtering, and the point query speed is slower.
  • Valid for bulk import scenarios, but not for single import;

Horizontal comparison of B-tree and Psort is as follows:

1.3 Special Indexes

Expression index

Select * from test1 where lower(col1) = ‘value’; select * from test1 where lower(col1) = ‘value’; Create index on test1(Lower (col1)); create index on test1(Lower (col1)); For subsequent filtering conditions like the lower(col1) expression, this index can be used directly to speed up the query. For other expressions, the index does not apply to the query. Note, however, that index expressions are expensive to maintain because they have to be recalculated for each row that is inserted or updated.

Part of the index

Create index idx2 on test1(IP) where not (IP > ‘10.185.178.100’ and IP < ‘10.185.178.200’); Select from test1 where IP = ‘10.185.178.150’ ‘select from test1 where IP =’ 10.185.178.150 ‘ ‘ This index cannot be used for the query “select from test1 where IP = ‘10.185.178.50’”. Partial indexes are used to reduce the size of the index, exclude data that is not of interest to the query, and speed up the retrieval efficiency of the index.

The only index

(1) Only B-tree index supports unique index;

(2) When an index is declared as unique, it is not allowed for multiple table rows to have the same index value;

(3) Null values are considered different, and a multi-column unique index will reject rows that have the same combined values on all index columns;

(4) A unique index is automatically created for the primary key column;

(5) Uniqueness check will affect index insert performance;

1.4 Advantages and Disadvantages of Indexes

The advantages of indexing are as follows:

  • Point query speed significantly, direct positioning to the desired location, reduce invalid IO;
  • Multi-condition combined query, filtering a large number of data, reduce the scanning range;
  • Using inverted index to accelerate full text retrieval;
  • The efficiency of multi-table join can be improved by using the advantage of fast query speed of equivalent conditional index and NESTLOOP.
  • Provide primary keys and unique constraints to meet business needs;
  • Optimize the query plan by using the natural order characteristic of btree index;

The index’s disadvantages are as follows:

  • Index pages take up extra space, causing some disk bloat.
  • Every time the data is imported, the index needs to be updated, which affects the import performance.
  • The index page has no visibility, there is garbage data, need to clean up periodically;
  • Indexed scan performance is not always better than sequential scan performance, and if the optimizer is wrong, query performance may deteriorate in reverse.
  • The index needs to record Xlog, increase the log amount;
  • Each index at least one file, increase the cost of backup, recovery, capacity expansion and other operations;
  • Given that the use of indexes can be a double-edged sword, be careful to create indexes only for columns that are needed and not for conditional columns that filter large amounts of data

Do not create indexes. Are there any other optimizations for the storage layer besides indexes to optimize query efficiency? Here are a few more ways to speed up DWS queries.

2. Speed up DWS queries

2.1 partition

Partition is one of the most commonly used means to speed up, and the effect is very good, we are recommended to use more combined with the scene.

  • At present, the partition supported is Range partition, partition support Merge, Split, Exchange and other operations;
  • Partitions are created on the columns with certain data rules such as time dimension or space dimension. The filtering conditions on the partition column will be partitioned pruning first to reduce the amount of physical scanning.
  • In contrast to indexing, partitioning physically divides the raw data directly. Once partition pruning takes effect, it can significantly reduce IO.
  • There is no conflict between using a partition and using an index. You can create an index for a partition.

Note the following when using partitions:

  • The effect of partitioning on imports is to increase memory usage (when running out of memory, the footer), but not to incur additional disk usage;
  • When using partitions, we must pay attention to the selection of partition columns and the control of the number of partitions. Too many partitions will lead to small file problems. The number of partitions is recommended not to exceed 1600.
  • Regional pruning is suitable for range query, but the efficiency of point query is limited.

For example, if you create a partitioned table and a non-partitioned table with the same data type, and import 6.4 million pieces of the same data, you will see a 7-fold improvement in the performance of partition pruning using the same query.

Test1 is a partitioned table, Test2 is a non-partitioned table, Test1 can scan the query 6ms, Test2 can scan the query 46ms, the difference is more than 7 times:

2.2 PCK (partial cluster key)

The essence of PCK is to improve the efficiency of query filtering through sorting. When creating a table, specify PCK column, and the data on this column will be locally sorted. Orded data will bring better data clustering, and sparse indexes such as min/ Max of each data block can play a better role, coarse filtering out a large number of data and improving IO efficiency. By default, 4.2 million rows of data are locally sorted.

Notes are as follows:

  • Only the inventory table supports PCK, and partial sorting will take effect on the batch data imported each time, instead of full sorting.
  • PCK is more suitable for range query, and the combination of PCK and index can achieve the best results in point checking scenarios.
  • Imports with PCK will use more memory due to sorting, which will affect the import speed and need to balance the import and query performance.

SELECT * FROM TAB WHERE COL > 65 SELECT * FROM TAB WHERE COL > 65 SELECT * FROM TAB WHERE COL > 65 SELECT * FROM TAB WHERE COL > 65 SELECT * FROM TAB WHERE COL > 65 SELECT * FROM TAB WHERE COL > 65

Then compare the performance of PCK horizontally with the data from the above partition:

(1) List storage table, non-partitioned, no PCK, scan time 46ms

(2) List storage table, non-partitioned, with PCK, scan time 1.7ms

(3) Save the table with PCK, then create the Btree index, and the scan takes 0.1ms

PCK, in combination with indexing, can improve the performance of such queries by more than 100 times.

2.3 Intelligent filtering

The inventory table data will be read from the file and fed back to the implementation layer, which will be intelligently identified and automatically multi-layer filtering and completely transparent to the user, as shown in the figure below:

3. Recommended index usage scenarios

Click on the attention, the first time to understand Huawei cloud fresh technology ~