Recently in writing SQL, food such as old dog I learned to use some convenient SQL functions to replace the previous complex writing method under the guidance of the elder brother, for the common data scene is still very useful.

SELECT, COUNT(), SUM(), JOIN, GROUP BY, etc. For more complex data scenarios, always try to use GROUP BY and JOIN to achieve, but do not know the similar function of SQL functions.

Here’s an example of some SQL functions I learned and ways to simplify SQL, using Hive SQL as a template. Because SQL functions and syntax are mostly similar, the principle of general, in the use of other SQL reference.

A chestnut

(Pretend) there is an order table t_ORDER_detail, and its structure looks like this:

The field name The field
order_id Order ID, the unique identification of the order
user_id User ID, which identifies the user to which the order belongs
merchant_id Merchant ID, identifying the merchant ID to which the order belongs
state Order status: confirmed (0), Completed (1), cancelled (2), etc
create_time The creation time of the record
data_version Version number: When inserting or updating the same order, the version number will be updated accordingly

Every time an order is inserted or updated in the database, the system will report it and add a new flow to T_ORDER_detail. For example, if an order with order_id 1001 is created and the state is subsequently reversed to completed, the following record will exist in the flow table:

order_id user_id merchant_id state create_time data_version
10001 user_1 merchant_1 0 1622249031 1
10001 user_1 merchant_1 1 1622249082 2

Then, the product dad wants to know about orders and users, and he needs the following data:

  • Total orders for the day
  • Total number of users purchased on that day
  • The total order status of the day and the order status grouped by merchants are the completed order number and the number of users
  • The order status for the day in total and by merchant group is the number of canceled orders and the number of users

We translate the demand of the product father into the following data of the day:

  • Total number of records to be deduplicated based on order_ID, total_ORDER_day.
  • The total number of records to be deduplicated by user_ID is total_user_day.
  • State =1, group by merchant_id and total:
    • Total number of records to be deduplicated based on order_ID, total_finish_ORDER_day.
    • Total_finish_user_day Specifies the total number of records to be deduplicated based on the user_id.
  • State =2, group by merchant_id and total:
  • Total_cancel_order_day Specifies the total number of records to be deduplicated based on order_ID.
  • Based on the user_id, the total number of records to be deduplicated is total_cancel_user_day.

Suppose the record in our table is like this now. There are three orders in total. User_1 has two orders under merchant_1 and merchant_2, and the final state of the orders is 1 and 2 respectively. User_2 has one order under merchant_1, and the final state of the order is 2:

order_id user_id merchant_id state create_time data_version
10001 user_1 merchant_1 1 1622249082 2
10001 user_1 merchant_1 0 1622249031 1
10002 user_1 merchant_2 0 1622249011 1
10002 user_1 merchant_2 2 1622249022 2
10003 user_2 merchant_1 0 1622249031 1
10003 user_2 merchant_1 2 1622249082 2

According to the above requirements, it can be divided into the following steps:

  • Since we only focus on the most recent record of each order (that is, the record with the largest version number), we can first clean the table to get the new table data, T_clean_ORDER_detail.
  • Get the data from the new table.

How do I get the latest record for each order_ID

The ideal data after cleaning is:

order_id user_id merchant_id state create_time data_version
10001 user_1 merchant_1 1 1622249082 2
10002 user_1 merchant_2 2 1622249022 2
10003 user_2 merchant_1 2 1622249082 2

The idea is to get the group based on order_ID, and fetch the one with the largest data_version from each group.

When it comes to grouping, GROUP BY is the only thing in my barren SQL database.

Use the GROUP BY notation

-- t_latest_record: retrieve data for the day, grouped by order_id, order_ID within each group, and maximum version number. -- t_total_record: obtains the flow data of all orders in the day. -- ${today_BEGIN_time} : a variable that represents the time stamp of 00:00 on the day. -- ${today_end_time} : variable that represents the timestamp for 23:59 on the day. SELECT * FROM ( SELECT ${today} AS today, order_id, MAX(data_version) AS data_version FROM t_order_detail WHERE create_time >= ${today_begin_time} AND create_time <= ${today_end_time} GROUP BY order_id ) t_latest_record LEFT JOIN ( SELECT * FROM t_order_detail WHERE create_time >= ${today_begin_time} AND create_time <= ${today_end_time} ) t_total_record ON t_latest_record.order_id = t_total_record.order_id AND t_latest_record.data_version = t_total_record.data_version;Copy the code

The overall idea is to group each order by order_ID, get the order_id and the maximum data_version, and then use the left link to get the record where order_id and data_version are equal, that is, the latest record we want.

That’s not too complicated, but we did it twice, we had two temporary tables do a left join. It looks a little tedious. Is there a function that directly replaces this?

Use the ROW_NUMBER() OVER notation

-- t_sorted_order_detail: Order flow data grouped by order_id and sorted by row_num SELECT * FROM ( SELECT *, ROW_NUMBER() OVER ( PARTITION BY order_id ORDER BY data_version DESC ) AS row_num FROM t_order_detail WHERE create_time >= ${today_begin_time} AND create_time <= ${today_end_time} )t_sorted_order_detail WHERE row_num = 1;Copy the code

Does SQL look much cleaner?

instructions

  • ROW_NUMBER ( ) OVER ( [query_partition_clause] order_by_clause )

What it does is, it groups by a certain field, sorts by the field, and gets the first record sorted. PARTITION BY assumes the role of GROUP BY, that is, GROUP according to certain fields. ORDER BY is to sort, which is to sort the data of each group according to certain fields. The ROW_NUMBER() OVER function then returns the sorted number within the group for each record.

So, we get t_sorted_ORDER_detail table as follows:

order_id user_id merchant_id state create_time data_version row_num
10001 user_1 merchant_1 1 1622249082 2 1
10001 user_1 merchant_1 0 1622249031 1 2
10002 user_1 merchant_2 2 1622249022 2 1
10002 user_1 merchant_2 0 1622249011 1 2
10003 user_2 merchant_1 2 1622249082 2 1
10003 user_2 merchant_1 0 1622249031 1 2

Select row_num from each group where row_num is 1.

Obtain data from the cleaned table

Now we have the cleaned table: T_clean_ORDER_detail. Its data is as follows (row_num is useless here, so it is removed) :

order_id user_id merchant_id state create_time data_version
10001 user_1 merchant_1 1 1622249082 2
10002 user_1 merchant_2 2 1622249022 2
10003 user_2 merchant_1 2 1622249082 2

Let’s move on to the second step: get the statistics from the new table.

How do I reduce redundant operations

First look at how the two totals are obtained:

  • Total number of records to be deduplicated based on order_ID, total_ORDER_day.
  • The total number of records to be deduplicated by user_ID is total_user_day.

Before me: two data, two operations, each temporary table corresponding to a data, and then two temporary table JOIN to obtain two data.

In fact, the two data sources are the same and the calculation logic is similar. When this happens, you can merge operations. This not only improves efficiency, but also makes the SQL more concise.

For example, the SQL to get these two totals:

SELECT
        COUNT(1) AS total_order_day,
        COUNT(DISTINCT (user_id)) AS total_user_day
    FROM
        t_clean_order_detail
    WHERE
        create_time >= ${today_begin_time}
        AND create_time <= ${today_end_time};
Copy the code

instructions

  • COUNT()

COUNT(1) returns the total number of rows.

  • DISTINCT

De-weight by field.

How to do volume analysis of data

The first volume, in human terms, aggregated the data to get the total value. In the next four pieces of data, you need to get not only the subtotal grouped by merchant_id, but also the total. So let’s see how we get that.

Because they are also logically similar and come from the same data. So let’s just take two of them:

  • State =2, group by merchant_id and total:
  • Total_cancel_order_day Specifies the total number of records to be deduplicated based on order_ID.
  • Based on the user_id, the total number of records to be deduplicated is total_cancel_user_day.

For example, for t_clean_order_detail, our ideal data is:

total_cancel_order_day total_cancel_user_day merchant_id
1 1 mechant_1
1 1 merchant_2
2 2 The total

I used to think: GROUP BY; I want a total, and I want to compute a total, and I want to UNION it, perfect.

However, when I learned about the multidimensional analysis statements and functions that SQL comes with, I realized how naive I was: I didn’t have to do two calculations, and I had to integrate the results myself. With a combination, a single calculation is enough.

First take a look at the SQL to get total_cancel_user_day and total_cancel_user_day regardless of merchant_id grouping:

SELECT COUNT(IF ( state = 2, 1, null )) AS total_cancel_order_day, COUNT(DISTINCT ( IF ( state = 2, user_id, null ) )) AS total_cancel_user_day FROM t_clean_order_detail WHERE update_time >= ${today_begin_time} AND update_time <=  ${today_begin_time};Copy the code

The SQL above is actually very similar.

Next consider the SQL for grouping and getting totals based on merchant_id. Using the multi-component parsing functions provided BY SQL, we used GROUP BY to GROUP according to merchant_id, WITH ROLLUP to get the subtotals and totals for each GROUP, and GROUPING to distinguish the subtotals and totals rows for each GROUP. Use DECODE to name the counting dimensions (merchant_id and total).

SQL:

SELECT COUNT(IF ( state = 2, 1, null )) AS total_cancel_order_day, COUNT(DISTINCT ( IF ( state = 2, user_id, Null)) AS total_cancel_user_day, DECODE(GROUPING(merchant_id), 1, merchant_id ) AS merchant_id FROM t_clean_order_detail WHERE update_time >= ${today_begin_time} AND update_time <= ${today_begin_time} GROUP BY merchant_id WITH ROLLUP;Copy the code

instructions

  • IF(boolean testCondition, T valueTrue, T valueFalseOrNull)

Conditional return. Return valueTrue if testCondition is true or not NULL; Otherwise, valueFalseOrNull is returned.

  • DECODE(expression , search , result [, search , result]… [, default])

The DECODE function is similar to a nested series of if-then-else statements. Expression is compared with search. If expression matches the ith search entry, the ith result is returned. If expression does not match any search value, default is returned.

  • ROLLUP

ROLLUP extends the groupbyClause to make the SELECT statement calculate multiple subtotals based on the dimension of the group and calculate the total.

  • GROUPING

GROUPING uses a column in ROLLUP as an argument. The GROUPING function returns 1 when it encounters a NULL value generated by ROLLUP. GROUPING returns 1 if the column is a subtotal or a total, and 0 otherwise. It can only be used in a ROLLUP or CUBE query.

conclusion

Let’s review the previous three points

  • How do I get the latest record for each order_ID
  • How do I reduce redundant operations
  • How to do volume analysis of data

These three questions, do you have any ideas? If not, I’m sorry. (low)

Finally, I am full of enthusiasm and patience to teach my girlfriend to write SQL, false one compensate ten bags of teaching package will, I would like to ask my girlfriend where to get?