Advanced SQL features PostgreSQL users should master

PostgreSQL databases have many rich features in SQL and NoSQL. This article will start with advanced SQL features.

PostgreSQL advanced SQL features **

This section introduces PostgreSQL’s advanced SQL features, such as WITH queries, batch inserts, RETURNING modified data, UPSERT, data sampling, aggregation functions, and window functions.

PostgreSQL supports one of the advanced SQL features. This feature is often called Common Table Expressions (CTE). The WITH query defines an auxiliary statement in a complex query. This feature is often used in complex or recursive query scenarios.

Take a look at the WITH query WITH a simple CTE example, as follows:

WITH t as (SELECT generate_series(1,3)) SELECT * FROM t; Run the following command: generate_series ----------------- 1 2 3Copy the code

(3 rows)

In this simple CTE example, you define a helper statement t to fetch a number, and then query T in the main query. The helper statement is defined like a temporary table. For complex queries, if CTE is not used, you can simplify SQL by creating views.

An important attribute of WITH queries is RECURSIVE, which allows you to reference your own output to achieve recursion, and is typically used in hierarchical or tree-structured scenarios.

For example, there is a table that contains the following data:

Id Name FatherID 1 China 0 2 Liaoning 1 3 Shandong 1 4 Shenyang 2 5 Dalian 2 6 Jinan 3 7 Heping District 4 8 Shenhe District 4Copy the code

Use the PostgreSQL WITH query to retrieve all parent nodes WITH ID 7 and above as follows:

WITH RECURSIVE r AS (

       SELECT * FROM test_area WHERE id = 7

     UNION   ALL

       SELECT test_area.* FROM test_area, r WHERE test_area.id = r.fatherid

     )

 SELECT * FROM r ORDER BY id;
Copy the code

The query results are as follows:

Id | name | fatherid - + + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- 1 | 0 2 | | China liaoning | 1 2 7 | | | shenyang heping district 4 | 4 (4 rows)Copy the code

2. Batch insert

Batch insert is used to insert multiple data at a time to improve data insertion efficiency. PostgreSQL can implement batch insert in several ways:

INSERT INTO... SELECT.Copy the code

Most relational databases support this method. The syntax is as follows:

INSERT INTO table_name SELECT... INSERT INTO VALUES (),(),... (a)Copy the code

This type of batch INSERT involves inserting multiple records in an INSERT statement through the VALUES keyword, which is easily understood by an example like the following:

mydb=> CREATE TABLE tbl_batch3(id int4,info text);

CREATE TABLE



mydb=> INSERT INTO tbl_batch3(id,info) VALUES (1,'a'),(2,'b'),(3,'c');

INSERT 0 3
Copy the code

The data are as follows:

mydb=> SELECT * FROM tbl_batch3;

    id  | info

  ----+------

      1 | a

      2 | b

      3 | c

(3 rows)
Copy the code

This batch insert method is very unique. Compared with the method of inserting multiple rows of data into one SQL database, this method can reduce the interaction with the database, reduce the generation of WAL (write-Ahead Logging) logs, and improve the insert efficiency. Few developers are generally aware of PostgreSQL’s bulk inserts.

Method 3: COPY or \COPY meta-commandCopy the code

The COPY or COPY meta-command can import file data in a certain format to the database. It is more efficient than the INSERT command. Usually, files with a large amount of data can be imported by using the COPY command as the PostgreSQL superuser on the database server host.

Import the 10 million data from the file tBL_batch4.txt into the table as follows:

mydb=# TRUNCATE TABLE pguser.tbl_batch4;

TRUNCATE TABLE



mydb=# COPY pguser.tbl_batch4 FROM '/home/pg10/tbl_batch4.txt';

COPY 10000000
Copy the code

3. RETURNING modified data

The RETURNING OF PostgreSQL RETURNING RETURNING DATA on RETURNING SITE (DML) is described in three scenarios: INSERT statement RETURNING data on RETURNING site (RETURNING DATA on RETURNING site) and UPDATE statement RETURNING new data on RETURNING site (RETURNING data on RETURNING site). The DELETE statement returns deleted data followed by a RETURNING attribute. The advantage of this feature is that no additional SQL is required to obtain these values, which can be easily developed for application development and then demonstrated by example.

RETURNING inserted data

The INSERT statement returns the values inserted following the RETURNING property, and the following creates the test table and returns the full row of data that has been inserted.

mydb=> CREATE TABLE test_r1(id serial,flag char(1)); CREATE TABLE mydb=> INSERT INTO test_r1(flag) VALUES ('a') RETURNING *; Id | flag - | + -- -- -- -- -- - 1 a (row 1) INSERT 0 1 after RETURNING back to update the dataCopy the code

Return the updated value of the UPDATE statement as follows:

mydb=> SELECT * FROM test_r1 WHERE id=1; id | flag ----+------ 1 | a (1 row) mydb=> UPDATE test_r1 SET flag='p' WHERE id=1 RETURNING *; Id | flag - | + -- -- -- -- -- - 1 p (row 1) UPDATE 1 RETURNING back to delete dataCopy the code

The DELETE followed by RETURNING property returns deleted data, as shown below:

mydb=> DELETE FROM test_r1 WHERE id=2 RETURNING *;

   id | flag

----+------

    2 | b

(1 row)

DELETE 1
Copy the code

4, UPSERT

PostgreSQL’s UPSERT feature refers to INSERT… ON CONFLICT UPDATE (ON CONFLICT UPDATE) is used to resolve data conflicts during data insertion, such as user – defined constraint violation. If one of the data in a batch insert transaction violates the constraint, the entire insert transaction will be rolled back. PostgreSQL’s UPSERT feature solves this problem.

To understand UPSERT’s capabilities, use an example to define a user login log table and insert a piece of data as follows:

mydb=> CREATE TABLE user_logins(user_name text primary key,

login_cnt int4,

last_login_time timestamp(0) without time zone);

CREATE TABLE



mydb=> INSERT INTO user_logins(user_name,login_cnt) VALUES ('francs',1);

INSERT 0 1
Copy the code

Alter table user_logins alter table user_logins alter table user_logins alter table user_name alter table user_logins alter table user_name

mydb=> INSERT INTO user_logins(user_name,login_cnt)

VALUES ('matiler',1),('francs',1);

ERROR:  duplicate key value violates unique constraint "user_logins_pkey"

DETAIL:  Key (user_name)=(francs) already exists.
Copy the code

The SQL above attempts to insert two data. While you would know that matiler data does not violate primary key conflict, you would say this data violates primary key conflict. As a result, you cannot insert either data. PostgreSQL UPSERT can handle conflicting data, such as inserting conflicting data without reporting an error and updating conflicting data, as shown below:

mydb=> INSERT INTO user_logins(user_name,login_cnt)

VALUES ('matiler',1),('francs',1)

ON CONFLICT(user_name)    

DO UPDATE SET

login_cnt=user_logins.login_cnt+EXCLUDED.login_cnt,last_login_time=now();

INSERT 0 2
Copy the code

The above INSERT statement inserts two pieces of data and sets the rule:

Last_login_time, ON CONFLICT(user_name) Where CONFLICT type is user_name, DO UPDATE SET where CONFLICT type is user_name Note that the SET command references the user_loins table and the built-in table EXCLUDED. By referring to the original table user_Loins to access conflicts that already exist in the table. The built-in table EXCLUDED references values that it is trying to insert. As follows:

mydb=> SELECT * FROM user_logins ;

 user_name | login_cnt |   last_login_time   

  -----------+-----------+---------------------

     matiler   |              1 |

      francs    |              2 | 2017-08-08 15:23:13

    (2 rows)
Copy the code

On the one hand, you’d know this data of testing is updated with the LOGIN_CNT and last_LOGin_TIME fields. On the other hand, the new data matiler record has been inserted normally.

5. Data sampling

TABLESAMPLE Data sampling Is often used in data processing. It is common to query a certain number of records in a table. PostgreSQL has provided TABLESAMPLE data sampling in version 9.5. Before version 9.5, data sampling was usually implemented BY ORDER BY Random (). Although this method functionally meets the requirement of randomly returning specified row data, its performance is very low, as follows:

After 9.5, PostgreSQL supports TABLESAMPLE data sampling. The syntax is as follows:

The SELECT... FROM table_name TABLESAMPLE sampling_method ( argument [, ...]  ) [ REPEATABLE ( seed ) ]Copy the code

Sampling_method refers to two sampling methods: SYSTEM and BERNOULLI. And I’m going to talk more about these two methods of sampling.

SYSTEM sampling mode

The SYSTEM sampling method is to randomly extract data from data blocks in the table. Theoretically, each data block in the sampled table has the same probability of being retrieved. The SYSTEM sampling method is based on the level of data blocks, followed by sampling parameters, and all data on the selected blocks will be retrieved.

Create test_SAMPLE test table, insert 1.5 million data, set the sampling factor to 0.01, which means return 1500000*0.01%=150 records, execute the following SQL:

On the one hand, the Sample Scan (sampling mode is SYSTEM) is performed, and the execution time is 0.166 ms, showing good performance.

The optimizer, on the other hand, expected to access 150 records and actually returned 107.

BERNOULLI sampling method

BERNOULLI sampling randomly selects rows from a table and returns a specified percentage of data. BERNOULLI sampling is based on the level of rows. Theoretically, each row in the sampled table has the same probability of being retrieved, so BERNOULLI sampling is more random than SYSTEM sampling. However, the performance is much lower than the SYSTEM sampling method. The BERNOULLI sampling method is also based on the test_SAMPLE test table.

Set the sampling method to BERNOULLI and sampling factor to 0.01, as shown below:

Execute the following query several times to see the change in the number of returned records, as shown below:

Mydb => SELECT count(*) FROM test_sample TABLESAMPLE BERNOULLI(0.01); Count ------- 151 (1 row) mydb=> SELECT count(*) FROM test_sample TABLESAMPLE BERNOULLI(0.01); count ------- 147 (1 row)Copy the code

It can be seen from the above that the amount of data returned by BERNOULLI sampling method is very close to the percentage of sampled data, while the data returned by SYSTEM sampling method is in the unit of data blocks, and all data on the sampled blocks are returned. Therefore, the amount of data returned by SYSTEM sampling method has a large deviation.

The SYSTEM and BERNOULLI sampling methods are demonstrated here. Based on the data block level, the SYSTEM sampling method randomly extracts records on table data blocks. Therefore, the randomness of records extracted in this method is not very good, but the returned data is in the unit of data blocks. For example, the sample size of the log table is GB. BERNOULLI sampling method is based on data lines. Compared with SYSTEM sampling method, BERNOULLI sampling method has better randomness of data, but worse performance than SYSTEM sampling method. BERNOULLI sampling method is suitable for scenarios where randomness of sampling is preferred.

Aggregate function

Aggregation functions can be used to calculate the result set. Common aggregation functions include AVg (), sum(), min(), Max (), and count(). This section describes two special aggregation functions of PostgreSQL and provides test examples.

Before introducing the two aggregate functions, let’s consider a scenario where a table has the following data:

Country | city -- -- -- -- -- -- -- -- - + -- -- -- -- -- - | | Hong Kong Chinese Taipei China Osaka, Tokyo, Japan | Japan | | Shanghai (5 rows)Copy the code

The following result set is required:

Taipei, Hong Kong, Shanghai, Tokyo, OsakaCopy the code

So how do you write this SQL?

String_agg function

Let’s start with the string_agg function, which has the following syntax:

string_agg(expression, delimiter)
Copy the code

Simply put, the string_agg function concatenates all lines of a field in the result set into a string, delimited by the specified delimiter. Expression indicates the character type data to be processed. The type of the parameter is (text, text) or (bytea, bytea). The type returned by the function is the same as the type of the input parameter. Bytea is a binary type and is rarely used.

Concatenate the city field into a string as follows:

mydb=> SELECT string_agg(city,',') FROM city; String_agg -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - Taipei, Hong Kong, Shanghai, Tokyo, Osaka (1 row)Copy the code

As you can see, the string_agg function concatenates the output result set into strings separated by the specified comma delimiter. Back to the problem at the beginning of this article, it is implemented with the following SQL, as shown below:

mydb=> SELECT country,string_agg(city,',') FROM city GROUP BY country;
Copy the code

Country | string_agg + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - | Japan Tokyo, Osaka | Chinese Taipei, Hong Kong and ShanghaiCopy the code

The array_agg function

Array_agg is similar to string_agg except that it returns an array of the same type as the input parameter. Array_agg supports two syntax types, the first of which is as follows:

Array_agg (expression) – The input parameter is any non-array type

The input argument can be any non-array type, and the return result is a one-dimensional array. The array_AGg function joins all rows of a field in the result set into an array, performing the following query:

mydb=> SELECT country,array_agg(city) FROM city GROUP BY country; Country | the array_agg + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - | Japan |} {Tokyo, Osaka China {}, Hong Kong, Taipei, ShanghaiCopy the code

The array_agg function outputs an array of character types, otherwise indistinguishable. The main advantage of using array_agg is that you can use array-related functions and operators.

7. Window functions

PostgreSQL provides built-in window functions, such as row_num(), rank(), and lag(). In addition to built-in window functions, aggregate functions and custom functions followed by the OVER attribute can also be used as window functions.

The window function call syntax is slightly more complex, as follows:

function_name ([expression [, expression ... ]]) [ FILTER ( WHERE filter_clause ) ] OVER ( window_definition )
Copy the code

Where window_definition syntax is as follows:

[ existing_window_name ]

[ PARTITION BY expression [, ...] ]

[ ORDER BY expression [ ASC | DESC | USING operator ] [ NULLS { FIRST | LAST } ] [, ...] ]

[ frame_clause ]
Copy the code

OVER represents the keyword of the window function.

The PARTITON BY attribute groups the result set returned BY the query, after which the window function processes the grouped data.

The ORDER BY property sets the ordering of grouped data in the result set.

Row_number () window function

Create a score table and insert the test data as follows:

CREATE TABLE score ( id serial primary key, subject character varying(32), stu_name character varying(32), Score numeric (3, 0)); INSERT INTO score ( subject,stu_name,score ) VALUES ('Chinese','francs',70); INSERT INTO score ( subject,stu_name,score ) VALUES ('Chinese','matiler',70); INSERT INTO score ( subject,stu_name,score) VALUES ('Chinese','tutu',80); INSERT INTO score ( subject,stu_name,score ) VALUES ('English','matiler',75); INSERT INTO score ( subject,stu_name,score ) VALUES ('English','francs',90); INSERT INTO score ( subject,stu_name,score ) VALUES ('English','tutu',60); INSERT INTO score ( subject,stu_name,score ) VALUES ('Math','francs',80); INSERT INTO score ( subject,stu_name,score ) VALUES ('Math','matiler',99); INSERT INTO score ( subject,stu_name,score ) VALUES ('Math','tutu',65);Copy the code

Avg () OVER() window function

A window function with an aggregate function followed by the OVER attribute represents applying an aggregate function on a query result set. This section demonstrates a window function with an avG () aggregate function followed by the OVER attribute, which is used to calculate the average value of grouped data.

Query each student’s academic performance and show the average score of the course, usually calculate the average score of the course, and then the score table and the average score table associated query, as shown below:

Rank () window function

The rank() window function is similar to the row_number() window function. The main difference is that when the field values of a row in the group are the same, the row numbers are repeated and gaps (interpreted as gaps in the manual) are created, as follows:

Dense_rank () window function

The dense_rank () window function is similar to the rank () window function. The main difference is that when the values of a row in the group are the same, although the row numbers are repeated, there is no gap (interpreted as gaps in the manual) in the row numbers, as follows:

PostgreSQL also supports many other built-in window functions, such as, LAG (), first_values(), and last_values().

Second, the summary

PostgreSQL supports advanced SQL features such as WITH queries, batch inserts, RETURNING DML modified data, UPSERT, data sampling, aggregation functions, and window functions. These features can simplify SQL code and improve development efficiency. In addition, PostgreSQL implements functions that are not easy to implement in ordinary queries. We hope that by reading this chapter, you will be able to apply the advanced SQL features in your actual work and explore other advanced SQL features in PostgreSQL.

This article is from the cloud community partner “DBAplus Community”. For relevant information, you can pay attention to “DBAplus Community”.

Advanced SQL features PostgreSQL users should master

Related Posts

To the World in 2025

A basic guide to colors, color Spaces, and color transformations

Time series analysis of mathematical modeling