GIN, GiST, SP-GIST, RUM Indexing principles and technical background

background

Fuzzy query is a big demand, but also a very difficult demand for database.

For pre-obfuscation (like ‘% XXX ‘), you can use the inverted B-tree index; for post-obfuscation (like ‘% XXX ‘), you can use the b-tree index.

The usual queries supported by b-tree indexes include >, <, =, <=, >=, and sorting. Currently, most databases support the B-tree index method.

But for a forward-and-backward blur (like ‘% XXXX %’), for a forward-and-backward blur regular expression (~ ‘.ab? CD [e-f]{1,10}-0.’), many databases have no way to start, can not optimize, can only scan the whole table, each record for separate processing.

The open nature of the PostgreSQL database makes this possible, and it makes it possible to perform index searches of fuzzy-forward, regular-expression queries in the database.

The reason is PostgreSQL open index interface, data type. Support for GIN, GIST, RUM, custom indexing methods, for example, is one of PG’s unique feats, and no other database currently supports this feature.

In fact, I have introduced a lot of fuzzy query index optimization methods before, many first contact students will think that opened the door to a new world.

PostgreSQL 9.3 PG_TRGM Imporve Support Multi-bytes char and gist,gin Index for REG-exp Search PostgreSQL 9.3 PG_TRGM Imporve Support Multi-bytes char and gist, Gin Index for REG-exp Search

Chinese Fuzzy Query Performance Optimization by PostgreSQL TRGM

“PostgreSQL Full Text Search Speeds up to Friendless RUM Indexing Interface (Pandora’s Box)”

Talk about the technology behind Singles’ Day – What is millisecond word segmentation? Try regularity and similarity.

PostgreSQL internal

PostgreSQL index internal

At present, there are three indexes that can participate in the optimization of fuzzy query, which one is better?

This article will share with you the principle behind gin, GIST, RUM index, everyone know how to choose, look at the next year to welcome the New Year, wish everyone a happy New Year in advance, next year and step by step promotion.

GIN

Gin indexes take values from columns (such as arrays, full-text search types) and store them in a TREE structure (similar to B+TREE, value + row number S). For high frequency values, row number S is stored in a separate page to reduce TREE depth.

GIN fashupdate

Because GIN stores element indexes, when a record is inserted or updated, many elements may be involved, and in the case of GIN indexes, many ITEM changes may be involved.

To improve insert, update, and delete performance, PostgreSQL supports buffers, similar to MySQL’s index organization tables, that are first written to buffers and then merged into trees.

Better than MySQL index organizing tables, queries do not block merges, nor do they block writes. Instead of waiting for the BUFFER to merge into the tree, you query the BUFFER directly (if the BUFFER is very large, the query speed may be affected). ,

The user can control the size of buffers with parameters, and GIN will automatically merge buffers when they grow to a certain level. Or VACUUM to merge.

So a complete GIN index would look like this

GIN Usage Note

1. In order to improve the UPDATE speed, the FASTER UPDATE technology is used. When the BUFFER is very large (you can set it yourself), the query speed may be slow. Therefore, it is recommended to set a reasonable BUFFER size to balance insertion and query.

2. Only support bitmap query, that is, get all the row numbers, sort, and then retrieve. The advantage is obviously to reduce the random HEAP PAGE scan, but the negative is that when the rows involved are very large (for example, each row contains a certain element), the sorting is expensive and time-consuming. The time between execution and obtaining the first item is long, and if the user uses LIMIT, it will have to wait for the sorting to end.

GIN Application Examples

“Welcome trillion-level Marketing (circle people) Gracefully into the millisecond era – Trillion-user_Tags Real-time recommendation System Database Design”

Effective Similarity Search in PostgreSQL

GiST

The Generalized Search Tree, or inductive Tree, is used to solve some of the data reduction problems that B-Tree Gin has difficulty solving, such as whether ranges intersect, whether there are points that intersect in a geographic location, or Search nearby points by points, but of course it can do more than that.

Using range types as an example, each line segment in the following figure represents the range covered by a range type stored above a record or field.

Find the record S whose storage range intersects

Sort by minimum value of range (lvalue) (please ignore the red box)

Sort by maximum value of range (rvalue) (please ignore the existence of red boxes)

GiST is an organization

First group them into different groups (somewhat similar to what K-means does) (please ignore the red boxes)

The aggregated data, you can think of as the information contained in a single index page of GiST.

GiST Single index page looks like this:

Key + line number (index and record one-to-one correspondence)
Unordered store in index.

In the blue box, the value in the left column represents KEY, and the value in the right column represents the row number (HEAP PAGE, number of records within it).

GiST two-level indexes look like this, with the upper level representing a large range of individual INDEX pages in the lower level.

For example, how to search the range [55,60]?

GiST summary

GiST GiST GiST GiST GiST GiST GiST GiST GiST GiST GiST GiST GiST GiST GiST GiST GiST GiST GiST GiST GiST GiST GiST GiST GiST GiST

The aggregated scope is stored as a level 1 structure in GiST entries for easy retrieval.

Since GiST is the soul of clustering, the performance of GiST is closely related to its clustering algorithm. PostgreSQL leaves this interface up to users to customize GiST data types and implement GiST indexes themselves.

GIST already has the range, Geometry, and other types built into PostgreSQL. You just need to add new types. For example, if you add a type to store body structures, a type to store pictures, or a type to store X-rays, how to quickly retrieve them. That’s the GIST index aggregation part you want to implement.

Performance depends on how well the userdefined  
Picksplit and Choose functions can  
group keys  
Copy the code

GiST Application Examples

Application of PostgreSQL in Video, Image deduplication and Image Search

Talk about the technology behind Singles’ Day – Logistics, Dynamic Path Planning

PostgreSQL Geolocation Data Nearest Neighbor Query Performance

SP-GiST

Space-Partitioned GIST

This can be interpreted as an extension of GiST, which has the following features

1. Nodes have no intersection, GiST has intersection, just aggregation, but nodes have content that has intersection.

2. Index depth is variable

3. Each physical Index page may correspond to multiple Nodes

The search types supported by SP-GIST

1. Kd-tree , points only ; ( because shapes might overlap )

2. prefix tree for text

Sp-gist application example

This is similar to the GiST scenario

RUM

RUM references GIN’s implementation and addresses some of GIN’s weaknesses in full text retrieval, such as:

GIN does not store lexem location information for full text retrieval, so it cannot support indexing-level ranking. You need to scan the HEAP PAGE and get it by CPU.

It is need position information about lexems to ranking. GIN index doesn’t store positions of lexems. So after index scan we need additional heap scan to retreive lexems positions.

GIN does not support indexing-level phrase search, for example, ‘speed’ <1> ‘passion’. GIN does not support indexing-level phrase search. GIN does not support indexing-level phrase search. Or ‘China :100’ does not support index-level searches.)

This problem relates with previous problem. It is need position information to perform phrase search.

3. Slow ordering by timestamp. GIN only stores tsvector TOKEN without any field information (such as full text retrieval + index field double field index). Both require heap page scanning and CPU processing.

GIN index can’t store some related information in index with lexemes. So it is necessary to perform additional heap scan.

RUM’s improved method supports the above query by adding additional information in INDEX, such as TOKEN location. Support for dual field indexes (e.g. Tsvector +timestamp)

RUM solves this problems by storing additional information in posting tree.

For example, positional information of lexemes or timestamps.

You can get an idea of RUM by the following picture:

The problems brought by RUM, such as the time for index establishment and data change, are longer than GIN. This has been implemented in RUM’s TODO and will be the focus of improvement in the future.

Drawback of RUM is that it has slower build and insert time than GIN.

It is because we need to store additional information besides keys and because RUM uses generic WAL.

RUM Scenario Example

Talk about the technology behind Singles’ Day – What is millisecond word segmentation? Try regularity and similarity.

Talk about the technology behind Singles’ Day – Word segmentation and Search

What indexing methods does PostgreSQL currently support

B-Tree

HASH

GIN

GiST

SP-GiST

BRIN

Streaming Applications for “Internet of Things” – Real-time Processing with PostgreSQL (trillions per day)

How does PostgreSQL handle hundreds of terabytes of data per day

PostgreSQL 9.5 New feature – BRIN (Block Range Index) index PostgreSQL 9.5 New feature – BRIN (Block range index) Index

PostgreSQL 9.5 New feature-lets BRIN be Used with R-tree-like indexing Strategies For “inclusion” opclasses

RUM

“PostgreSQL Full Text Search Speeds up to Friendless RUM Indexing Interface (Pandora’s Box)”

BLOOM

PostgreSQL 9.6 Black Technology Bloom algorithm index, one index to support arbitrary column combination query

There are also open interfaces that allow you to customize your indexing methods, see the bloom index implementation.

Back to the requirement for fuzzy queries

PostgreSQL can also use b-tree to search for pre-obfuscation and post-obfuscation.

We also have index support for before and after obfuscation and regular expressions

Blurring before and after:

gin, gist, rum

Regular expression:

gin, gist

Here are some examples

Talk about the technology behind Singles’ Day – What is millisecond word segmentation? Try regularity and similarity.

Let’s compare the performance of GIN, GIST and RUM under different input conditions, and how do we choose

To test this, we need to create multiple indexes on a column, using PG_HINt_plan to select the appropriate index.

Multiple indexes can be built into a column

PostgreSQL allows you to create multiple indexes on a single column, for example in our next test, we will use gin, GIST, and RUM to create different indexes.

Select the index freely, pg_HINt_plan

Usage of PostgreSQL PG_HINt_plan plug-in in Aliyun

Hints at Critical Moments – Parameter Optimization and Execution Plan Solidification CASE for PG Optimizer

The Use of PostgreSQL SQL Hints

PostgreSQL Plan Hint

Converts a string into a lexeme for full-text retrieval

In order to test rum’s support for fuzzy query, we need to convert the string to the full-text retrieval type. Meanwhile, tsquery is required for retrieval, so it also needs to be converted to the tsquery type.

Find the ASCII value of the escape character \

Select CHR (I) from generate_series as t(I);Copy the code

Convert a string to a tsvector marked with position

create or replace function string_to_tsvector(v text) returns tsvector as ?  
declare  
  x int := 1;  
  res text := '';  
  i text;  
begin  
  for i in select regexp_split_to_table(v,'')   
  loop  
    res := res||' '||chr(92)||i||':'||x;  
    x := x+1;  
  end loop;  
    return res::tsvector;  
end;  
? language plpgsql strict immutable;  
Copy the code

Converts a string to a positional tsquery

create or replace function string_to_tsquery(v text) returns tsquery as ?  
declare  
  x int := 1;  
  res text := '';  
  i text;  
begin  
  for i in select regexp_split_to_table(v,'')   
  loop  
    if x>1 then  
      res := res||' <-> '||chr(92)||i;  
    else  
      res := chr(92)||i;  
    end if;  
    x := x+1;  
  end loop;  
    return res::tsquery;  
end;  
? language plpgsql strict immutable;  
Copy the code

Prepare the environment

Pg_trgm, pg_hint_plan, rum

postgres=# \dx List of installed extensions Name | Version | Schema | Description -- -- -- -- -- -- -- -- -- -- -- -- -- -- + + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- dblink | | 1.2 Public | connect to other PostgreSQL databases from within a database pg_hint_plan | 1.1.3 | hint_plan | pg_trgm | 1.3 | public | text similarity measurement and the index searching -based on trigrams PLPGSQL | | 1.0 pg_catalog | PL/pgSQL Procedural language rum | | 1.0 public | rum index access method (5 rows)Copy the code

The test table

The string in the info field is the field we will be fuzzing next

create table test(id int , info text);   
Copy the code

The test data

Write 1 million test data

Insert into test select id, md5(random()::text) from generate_series(1,1000000) t(id);Copy the code

Create gin index

CREATE INDEX trgm_idx1 ON test USING GIN (info gin_trgm_ops); This operation takes 19 seconds and occupies 102MB spaceCopy the code

Creating gist Indexes

CREATE INDEX trgm_idx2 ON test USING GiST (info gist_trgm_ops); It takes 31 seconds and occupies 177MB spaceCopy the code

Create rum function index

CREATE INDEX rum_idx1 ON test USING rum ( string_to_tsvector(info) rum_tsvector_ops); This operation takes 133 seconds and occupies 86MB spaceCopy the code

What does the transformed tsVector look like?

create table test(id int , c1 text, c2 tsvector); insert into test select id, md5(rn), string_to_tsvector(md5(rn)) from (select random()::text as rn, Id from generate_series(1,1000000) t(id) t; select * from test limit 10; postgres=# select * from test limit 10; id | c1 | c2 ----+----------------------------------+-------------------------------------------------------------------------------- ------------------------------------------------------------------------ 1 | f09f48ac93fbb69169f63973a516dcd1 | '0':2 16,27,32 '1', '3' : 10,21,24 '4', 5 '5', '6' 26:14,17,20,28 '7', '8' : 6 '9' : 3,9,15,18,22 'a' : 7, 25 'b' : 12, 13 'c' : 8, 30 'd' : 29, 31 'f' : 1,4,11,19 2 7554 a9e3b95c7ff01a54587f914ea9af | | '0', '1' 16, 17, 26: '3' 8 '4', 4,20,27 '5' : 2,3,11,19,21 '7' : 1,13,23 '8', '9' 22:6,10,25,30 'a' : 5,18,29,31 'b', 'c' : 'e' 12:7, 28 'f' : 14,15,24,32 | 3 Aa21991b6507275b86e43f2ec67ffc57 | '0', '1' 11:4, 7 '2', 3,13,23: '3' and '4', '5' 20:10,15,31 '6' : 9,18,26 '7' : 12,14,27,32 '8' : 17 '9' 5, 6 'a' : 1, 2, 'b', 8 'c' : 25, 30 'e' : 19, 24 'f' : 22,28,29 4 | c5e0ff17b63ab43c1b17d7a8c457c341 | '0' : 4 ,17,19,32 '1', 7 '3' : 11,15,30 '4' : 14,26,31 '5' : 2, 27 '6', '7' : 8,20,22,28 '8' : 'a' : 24 June 12 'b' : 9,13,18 'c' : 1,16,25,29 'd' : 21 'e', 3 'f' : 5 or 6 5 70 b83685fc0a4cbaf08423f686845977 | | '0' : 2,11,18 '2', '3' 21:5, 2 '4' : 13,20,28 '5' : August 29 ,24,26 '6', 6 '7' : 1,31,32 '8' :,7,19,25,27 4 '9', 'a' : 30 12 dec 'b' : 3, 15 'c' : 10, 14 'f' : 9,17,23 | 6 5,8,19,20,22,24 98761 ca17c8665787a11c1819a93cf44 | '1', '3', '4' : 28 to 31, 32 '5', 14 '6' : 4,12,13 '7' : 3,9,15,17 '8' : 2,11,16,23 1,25,27 '9' : 'a' : 7,18,26 'c' : six,10,21,29 'f' : 30 7 | 9 e432254f5de26ddd881709ac49f435b | '0' : 22 '1', '2' 20:5,6,13 '3', 4, 30 '4' : 3,8,26,29 '5' : 7,10,31 '6', 14 '7' : 21 '8', '9' 3:18 1,23,27 'a', 'b' 24, 32 'c' : 25 'd' : 11,15,16,17 'e' : 2, 2 'f' 12:8 | 9 p 383 a710642cfe6ff9d65a587e3158371 | '0', 7 '1' :,27,32 6 '2', '3' 10:1,3,26,30: '4' 9 '5', 20,22,28 '6', 8,14,19 '7' : 5,24,31 '8:2,23,29' 9 ', 'a' : 4, 'c' : 'd' : 18 'e' : 13, 25 'f' : 12,15,16 9 | 9 aa45e3355c7d0564e990466e06b9a49 | '0' : 14,21,26 '3' : 7, 8 '4', 4,17,22,31 '5' : 5,9,10,15 '6' : 16,23,24,27 '7', '9' : 1,19,20,29,32 'a' : 2,3,30 'b' : 28 'c' : 'd' : 13 'e' : 6,18,25 10 | f61a4af88c1e9f08af6597d9f93c25f4 | '0', 15 '1', 3, '2', '3' 29:27 '4', 5 '5' : 20, 30 ', 6 ', 2 '7' 3:22 '8' : 8,9,16 '9' : 13,21,24,26 'a' : 4,6,17 'c' : 10, 28 'd' : 23: 'e' 12 'f' : 1,7,14,18,25,31 (10 rows)Copy the code

A, the test case1

The intermediate result set is very large (low matching accuracy, many records that meet the conditions), and the returned result is also very large, so LIMIT is not used

1. gin

postgres=# set pg_hint_plan.debug_print=on;  
postgres=# set client_min_messages ='log';  
Copy the code

postgres=# /*+ BitmapScan(test trgm_idx1) */ select count(*) from test where info ~ 'a';  
LOG:  available indexes for BitmapScan(test): trgm_idx1  
LOG:  pg_hint_plan:  
used hint:  
BitmapScan(test trgm_idx1)  
not used hint:  
duplication hint:  
error hint:  

 count    
--------  
 873555  
(1 row)  
Time: 3426.308 ms  
Copy the code

Complete execution plan, pay attention to the number of evaluation rows, evaluation cost, later can be used as we can evaluate which index method is good criteria.

postgres=# explain (analyze,verbose,timing,costs,buffers) /*+ BitmapScan(test trgm_idx1) */ select * from test where info ~ 'a'; LOG: available indexes for BitmapScan(test): trgm_idx1 LOG: pg_hint_plan: used hint: BitmapScan(test trgm_idx1) not used hint: duplication hint: error hint: QUERY PLAN ------------------------------------------------------------------------------------------------------------------------ -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- Bitmap Heap Scan on public. The test (cost = 28878.04.. 4794.37 ROWS =858586 width=37) (actual time=2393.340.. Loops =1) Output: id, info Recheck Cond: (test.info ~ 'a'::text) Rows Removed by Index Recheck: 126445 Heap Blocks: exact=8334 Buffers: shared hit=21335 -> Bitmap Index Scan on trgm_IDx1 (cost=0.00.. 28663.39 rows=858586 width=0) (actual time= 2391.634.. 2391.128 rows=100000 loops=1) Index Cond: (test.info ~ 'a'::text) Buffers: shared hit=13001 Planning time: 0.464 MS Execution Time: 3513.761 ms (11 rows)Copy the code

2. gist

postgres=# /*+ IndexScan(test trgm_idx2) */ select count(*) from test where info ~ 'a';  
LOG:  available indexes for IndexScan(test): trgm_idx2  
LOG:  pg_hint_plan:  
used hint:  
IndexScan(test trgm_idx2)  
not used hint:  
duplication hint:  
error hint:  

 count    
--------  
 873555  
(1 row)  
Time: 1692.881 ms  
Copy the code

Complete execution plan, pay attention to the number of evaluation rows, evaluation cost, later can be used as we can evaluate which index method is good criteria.

postgres=# explain (analyze,verbose,timing,costs,buffers) /*+ IndexScan(test trgm_idx2) */ select * from test where info  ~ 'a'; LOG: available indexes for IndexScan(test): trgm_idx2 LOG: pg_hint_plan: used hint: IndexScan(test trgm_idx2) not used hint: duplication hint: error hint: QUERY PLAN ------------------------------------------------------------------------------------------------------------------------ -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- the Index Scan using trgm_idx2 on public. The test (cost = 0.41.. Rows = 426562 width= 426562) (actual time= 0.267.. Loops =1) Output: id, info Index Cond: (test.info ~ 'a'::text) Rows Removed by Index Recheck: 126445 Buffers: shared hit=962577 Planning time: 0.353ms Execution time: 9629426ms (7 rows)Copy the code

3. rum

postgres=# /*+ IndexScan(test rum_idx1) */ select count(*) from test where string_to_tsvector(info) @@  string_to_tsquery('a');  
LOG:  pg_hint_plan:  
no hint  
 count    
--------  
 873555  
(1 row)  
Time: 623.930 ms  
Copy the code

Complete execution plan, pay attention to the number of evaluation rows, evaluation cost, later can be used as we can evaluate which index method is good criteria.

postgres=# explain (analyze,verbose,timing,costs,buffers) /*+ IndexScan(test rum_idx1) */ select * from test where string_to_tsvector(info) @@ string_to_tsquery('a'); LOG: pg_hint_plan: used hint: not used hint: IndexScan(test rum_idx1) duplication hint: error hint: LOG: pg_hint_plan: used hint: not used hint: IndexScan(test rum_idx1) duplication hint: error hint: LOG: pg_hint_plan: used hint: not used hint: IndexScan(test rum_idx1) duplication hint: error hint: LOG: available indexes for IndexScan(test): rum_idx1 LOG: pg_hint_plan: used hint: IndexScan(test rum_idx1) not used hint: duplication hint: error hint: QUERY PLAN ------------------------------------------------------------------------------------------------------------------------ -- -- -- -- -- -- -- -- -- -- -- -- -- -- the Index Scan using rum_idx1 on public. The test (cost = 4.00.. 3940.50 rows = 5000 width = 37) (actual time = 303.718. Loops =1) Output: id, info Index Cond: (string_to_tsvector(test.info) @@ "" a" ::tsquery) Buffers: Shared hit=9027, temp read=1315 written=1315 Planning time: 1.165 ms Execution Time: 674.612 ms (6 rows)Copy the code

Second, the test case2

The intermediate result set is very large (low matching accuracy, many records that meet the conditions), and few results are returned, so LIMIT is used

If this is good, then using paging is also effective

1. gin

Complete execution plan, pay attention to the number of evaluation rows, evaluation cost, later can be used as we can evaluate which index method is good criteria.

postgres=# explain (analyze,verbose,timing,costs,buffers) /*+ BitmapScan(test trgm_idx1) */ select * from test where info ~ 'a' limit 100; LOG: available indexes for BitmapScan(test): trgm_idx1 LOG: pg_hint_plan: used hint: BitmapScan(test trgm_idx1) not used hint: duplication hint: error hint: QUERY PLAN ------------------------------------------------------------------------------------------------------------------------ -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - Limit (cost = 28878.04.. 28880.26 rows=100 width=37) (actual time=2404.025.. Rows =100 loops=1) Output: id, info Buffers: Shared HIT =13002 -> Bitmap Heap Scan on public.test (cost=28878.04.. 47944.37 ROWS =858586 width=37) (actual time=2404.023.. 2404.139 Rows =100 loops=1) Output: id, info Recheck Cond: (test.info ~ 'a'::text) Rows Removed by Index Recheck: 10 Heap Blocks: exact=1 Buffers: shared hit=13002 -> Bitmap Index Scan on trgm_IDx1 (cost=0.00.. 28663.39 rows=858586 width=0) (actual time=2402.336.. Rows =1000000 loops=1) Index Cond: (test.info ~ 'a'::text) Buffers: shared hit=13001 Planning time: 0.437 MS Execution Time: 2404.330 ms (14 rows)Copy the code

2. gist

Complete execution plan, pay attention to the number of evaluation rows, evaluation cost, later can be used as we can evaluate which index method is good criteria.

postgres=# explain (analyze,verbose,timing,costs,buffers) /*+ IndexScan(test trgm_idx2) */ select * from test where info  ~ 'a' limit 100; LOG: available indexes for IndexScan(test): trgm_idx2 LOG: pg_hint_plan: used hint: IndexScan(test trgm_idx2) not used hint: duplication hint: error hint: QUERY PLAN ------------------------------------------------------------------------------------------------------------------------ -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- Limit (cost = 0.41.. Rows =100 width= 100) (actual time= 0.352.. 0.69 rows=100 loops=1) Output: id, info Buffers: Shared hit=125 -> Index Scan using trgm_IDx2 on public.test (cost=0.41.. Rows = 426562 width= 426562) (actual time= 0.262.. 0.593 rows=100 loops=1) Output: id, info Index Cond: (test.info ~ 'a'::text) Rows Removed by Index Recheck: 12 Buffers: Shared hit=125 Planning time: 0.359 ms Execution time: 0.680 ms (10 rows)Copy the code

3. rum

Complete execution plan, pay attention to the number of evaluation rows, evaluation cost, later can be used as we can evaluate which index method is good criteria.

postgres=# explain (analyze,verbose,timing,costs,buffers) /*+ IndexScan(test rum_idx1) */ select * from test where string_to_tsvector(info) @@ string_to_tsquery('a') limit 100; LOG: pg_hint_plan: no hint QUERY PLAN ------------------------------------------------------------------------------------------------------------------------ -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- Limit (cost = 4.00.. Rows =100 width= 100) (actual time= 299.392.. Rows =100 loops=1) Output: id, info Buffers: Shared hit=693, temp read=2 written=1315 -> Index Scan using rum_idx1 on public.test (cost=4.00.. 3940.50 rows = 5000 width = 37) (actual time = 299.323. 299.49 Rows =100 loops=1) Output: id, info Index Cond: (string_to_tsvector(test.info) @@ "a" ::tsquery) Buffers: Shared hit=693, temp read=2 written=1315 Planning time: 0.353 ms Execution time: 303.170 ms (9 rows)Copy the code

Third, the test case3

Small intermediate result set (high matching accuracy)

postgres=# select * from test where info ~ '2e9a2c';  
  id   |               info                 
-------+----------------------------------  
    31 | e5e51acb3802e9a2c6318f6f2554ee1e  
 47924 | b1562e9a2cced50c2ea8629d03b64416  
(2 rows)  
Time: 295.222 ms  
Copy the code

1. gin

postgres=# /*+ BitmapScan(test trgm_idx1) */ select count(*) from test where info ~ '2e9a2c';  
LOG:  available indexes for BitmapScan(test): trgm_idx1  
LOG:  pg_hint_plan:  
used hint:  
BitmapScan(test trgm_idx1)  
not used hint:  
duplication hint:  
error hint:  

 count   
-------  
     2  
(1 row)  
Time: 3.835 ms  
Copy the code

Complete execution plan, pay attention to the number of evaluation rows, evaluation cost, later can be used as we can evaluate which index method is good criteria.

postgres=# explain (analyze,verbose,timing,costs,buffers) /*+ BitmapScan(test trgm_idx1) */ select * from test where info ~ '2e9a2c'; LOG: available indexes for BitmapScan(test): trgm_idx1 LOG: pg_hint_plan: used hint: BitmapScan(test trgm_idx1) not used hint: duplication hint: error hint: QUERY PLAN --------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on public.test (cost=20.77.. 122.03 rows=100 width= 100) (actual time=2.488.. Rows =2 loops=1) Output: id, info Recheck Cond: (test.info ~ '2e9a2c'::text) Heap Blocks: exact=2 Buffers: Shared hit=26 -> Bitmap Index Scan on trgm_IDx1 (cost=0.00.. Rows =100 width=0) (actual time=2.467.. Rows =2 loops=1) Index Cond: (test.info ~ '2e9a2c'::text) Buffers: shared hit=24 Planning time: 0.549 ms Execution Time: 2.543 ms (10 rows)Copy the code

2. gist

postgres=# /*+ IndexScan(test trgm_idx2) */ select count(*) from test where info ~ '2e9a2c';  
LOG:  available indexes for IndexScan(test): trgm_idx2  
LOG:  pg_hint_plan:  
used hint:  
IndexScan(test trgm_idx2)  
not used hint:  
duplication hint:  
error hint:  

 count   
-------  
     2  
(1 row)  
Time: 296.805 ms  
Copy the code

Complete execution plan, pay attention to the number of evaluation rows, evaluation cost, later can be used as we can evaluate which index method is good criteria.

postgres=# explain (analyze,verbose,timing,costs,buffers) /*+ IndexScan(test trgm_idx2) */ select * from test where info  ~ '2e9a2c'; LOG: available indexes for IndexScan(test): trgm_idx2 LOG: pg_hint_plan: used hint: IndexScan(test trgm_idx2) not used hint: duplication hint: error hint: QUERY PLAN ------------------------------------------------------------------------------------------------------------------------ ------- Index Scan using trgm_IDx2 on public.test (cost=0.41.. 105.16 rows=100 width=37) (actual time=55.568.. 293.998 rows=2 loops=1) Output: id, info Index Cond: (test.info ~ '2e9a2c'::text) Buffers: Shared hit=15888 Planning time: 0.423 ms Execution time: 294.103 ms (6 rows) time: 295.444 msCopy the code

3. rum

postgres=# /*+ IndexScan(test rum_idx1) */ select count(*) from test where string_to_tsvector(info) @@  string_to_tsquery('2e9a2c');  
LOG:  pg_hint_plan:  
no hint  
 count   
-------  
     2  
(1 row)  
Time: 891.093 ms  
Copy the code

Complete execution plan, pay attention to the number of evaluation rows, evaluation cost, later can be used as we can evaluate which index method is good criteria.

postgres=# explain (analyze,verbose,timing,costs,buffers) /*+ IndexScan(test rum_idx1) */ select * from test where string_to_tsvector(info) @@ string_to_tsquery('2e9a2c'); LOG: available indexes for IndexScan(test): rum_idx1 LOG: pg_hint_plan: used hint: IndexScan(test rum_idx1) not used hint: duplication hint: error hint: QUERY PLAN ------------------------------------------------------------------------------------------------------------------------ ----- Index Scan using rum_idx1 on public.test (cost=20.00.. 22.01 rows=1 width=37) (actual time=833.852.. Loops =2 rows= 1) Output: id, info Index Cond: (string_to_tsvector(test.info) @@ '''2'' <-> ''e'' <-> ''9'' <-> ''a'' <-> ''2'' <-> ''c'''::tsquery) Buffers: Shared hit=5027 Planning time: 0.432 ms Execution time: 864.190 ms (6 rows)Copy the code

How do you evaluate which index method to choose

GIN indexes are recommended when there are few intermediate result sets (high precision of input criteria).

GIST indexing is recommended when the intermediate result set is large (the input criteria are less precise), whether it is paginated, whether LIMIT is used, or whether a cursor is used.

When should I use RUM? RUM is recommended when full-text retrieval is really required or when tsVector + TIMESTAMP composite query is required.

Proposed approach

Setting the Statistics Granularity

alter table test alter column SET STATISTICS 1000;

vacuum analyze test;
Copy the code

Find the corresponding NODE and evaluate the number of intermediate results

explain $QUERY;  
Copy the code

View the rows of the corresponding NODE, and if there is no LIMIT, select the rows of the top-level NODE, or if there is LIMIT, select the rows of the second NODE.

For more complex queries, such as multiple conditional queries, it is best to use hint to find the node using the index of the hint.

Such as

postgres=# explain /*+ BitmapScan(test trgm_idx1) */ select * from test where info ~ '2e9a2c'; LOG: available indexes for BitmapScan(test): trgm_idx1 LOG: pg_hint_plan: used hint: BitmapScan(test trgm_idx1) not used hint: duplication hint: error hint: QUERY PLAN --------------------------------------------------------------------------- Bitmap Heap Scan on test (cost = 20.77.. 122.02 ROWS =100 width=37) Recheck Cond: (info ~ '2e9a2c'::text) -> Bitmap Index Scan on trgm_IDx1 (cost=0.00.. Rows =100 width=0) Index Cond: (info ~ '2e9a2c'::text) (4 rows) postgres=# explain /*+ BitmapScan(test trgm_idx1) */ select * from test where info ~ '2e9'; LOG: available indexes for BitmapScan(test): trgm_idx1 LOG: pg_hint_plan: used hint: BitmapScan(test trgm_idx1) not used hint: duplication hint: error hint: QUERY PLAN ---------------------------------------------------------------------------- Bitmap Heap Scan on test (cost = 59.30.. 5079.87 ROWS =7006 width=37) Recheck Cond: (info ~ '2E9 '::text) -> Bitmap Index Scan on trgm_IDx1 (cost=0.00.. 57.54 rows=7006 width=0) Index Cond: (info ~ '2e9'::text) (4 rows)Copy the code

Following the intermediate result count, as well as the advice I gave earlier, select the appropriate HINT and start executing the QUERY.

reference

PostgreSQL 9.3 PG_TRGM Imporve Support Multi-bytes char and gist,gin Index for REG-exp Search PostgreSQL 9.3 PG_TRGM Imporve Support Multi-bytes char and gist, Gin Index for REG-exp Search

Chinese Fuzzy Query Performance Optimization by PostgreSQL TRGM

“PostgreSQL Full Text Search Speeds up to Friendless RUM Indexing Interface (Pandora’s Box)”

Talk about the technology behind Singles’ Day – What is millisecond word segmentation? Try regularity and similarity.

PostgreSQL internal

PostgreSQL index internal

“Welcome trillion-level Marketing (circle people) Gracefully into the millisecond era – Trillion-user_Tags Real-time recommendation System Database Design”

Effective Similarity Search in PostgreSQL

Application of PostgreSQL in Video, Image deduplication and Image Search

Talk about the technology behind Singles’ Day – Logistics, Dynamic Path Planning

PostgreSQL Geolocation Data Nearest Neighbor Query Performance

Talk about the technology behind Singles’ Day – What is millisecond word segmentation? Try regularity and similarity.

Talk about the technology behind Singles’ Day – Word segmentation and Search

Streaming Applications for “Internet of Things” – Real-time Processing with PostgreSQL (trillions per day)

How does PostgreSQL handle hundreds of terabytes of data per day

PostgreSQL 9.5 New feature – BRIN (Block Range Index) index PostgreSQL 9.5 New feature – BRIN (Block range index) Index

PostgreSQL 9.5 New feature-lets BRIN be Used with R-tree-like indexing Strategies For “inclusion” opclasses

“PostgreSQL Full Text Search Speeds up to Friendless RUM Indexing Interface (Pandora’s Box)”

PostgreSQL 9.6 Black Technology Bloom algorithm index, one index to support arbitrary column combination query

Talk about the technology behind Singles’ Day – What is millisecond word segmentation? Try regularity and similarity.

Usage of PostgreSQL PG_HINt_plan plug-in in Aliyun

Hints at Critical Moments – Parameter Optimization and Execution Plan Solidification CASE for PG Optimizer

The Use of PostgreSQL SQL Hints

PostgreSQL Plan Hint

GIN, GiST, SP-GIST, RUM Indexing principles and technical background

background

GIN

GIN fashupdate

GIN Usage Note

GiST

GiST is an organization

GiST summary

GiST Application Examples

SP-GiST

Sp-gist application example

RUM Scenario Example

What indexing methods does PostgreSQL currently support

Back to the requirement for fuzzy queries

Multiple indexes can be built into a column

Converts a string into a lexeme for full-text retrieval

The test table

Create gin index

Create rum function index

1. gin

3. rum

1. gin

3. rum

1. gin

3. rum

Proposed approach

Related Posts