Introduction to SQL

This series is the beginning of the SQL series, covering some macro and basic topics.

What is SQL?

SQL is a structured query language used to manage relational databases. 90% of what we are exposed to is query syntax, but in fact it contains full add, delete, change, query and transaction processing capabilities.

Declarative property

SQL is a declarative programming language, whereas modern general-purpose programming languages are generally imperative. But don’t idolize declarative languages, for example, as they will replace low-level imperative languages in the future, because declarative languages have their own shortcomings and similarities with imperative languages.

Why do we think declarative programming languages are more advanced? Because the declarative language is more abstract, such as SELECT * from Table1 only describes to query data from Table1, but the specific steps of the query is not mentioned, behind this may exist complex index optimization and locking mechanism, but we do not need to care, this is the highest level of programming.

So why is all common business code now imperative? Because imperative gives us the opportunity to describe concrete implementations, programming in the general domain needs to be based on rigorous implementation details. For example, in the case of verifying user permissions, even though AI programming provides the ability to translate “logged in users can only access resources with permissions” into code, it is not clear what resources are or who owns them during the transfer of permissions.

SQL retains declarative features entirely because it locks in the specific domain of relational data management, where the need to standardize and enumerate makes declarative possible.

Declarative capabilities can also be extended based on imperative languages. For example, many ORMs provide syntax such as SELECT ({}).from({}).where({}). Even a login() function is a form of declarative programming because the caller does not care how he or she logged in. Anyway, that’s the whole point of declarative, isn’t it?

The grammar classification

As a relational database management tool, SQL needs to define, manipulate, and control data.

Data definition is the modification of database and table level structures, which are data structures, or data metadata, that do not represent specific data but describe attributes of the data.

Data manipulation means modifying lines of specific data, adding, deleting, changing and checking.

Data control refers to the management and control of transactions and user permissions.

Data definition

Data Definition Language (DDL) Data Definition, including the CREATE DROP ALTER method.

Data manipulation

Data Manipulation Language (DML) Data Manipulation, including SELECT INSERT UPDATE DELETE.

Data control

Data Control Language (DCL) Data Control, including COMMIT and ROLLBACK.

All SQL operations are around these three types, in which data manipulation accounts for almost 90% of the code, after all, the appeal of data query is far more than writing, data writing corresponds to data collection, and data query corresponds to data analysis, data analysis can play far more patterns than data collection.

PS: In some cases, the most important SELECT is referred to the Data Query Language (DQL) classification, so that the classification becomes four.

Set operations

The first citizen of the SQL world is a collection, just as the first citizen of the JAVA world is an object. SQL can only be better understood by looking at it from a collection perspective.

What is the collective perspective, that is, all queries and operations are performed in two-dimensional data structures, rather than the addition, subtraction, multiplication and division of individual numbers in elementary school arithmetic.

The operations of sets generally include UNION, EXCEPT and INTERSECT, all of which operate in the unit of behavior, while various JOIN statements operate in the unit of column, which is also the JOIN query mentioned later.

As long as you think in a two-dimensional data structure, operations are nothing more than horizontal or vertical operations.

The data model

The data paradigm is divided into five layers, each more demanding than the last, and therefore a paradigm that can be followed progressively. The data paradigm requires data to be increasingly decoupled and less redundant.

For example, the first normal form requires that each column be atomic, that is, the smallest unit of data that is indivisible. If a column as a string data acquisition, storage, and with “|” said provinces, segmentation, then it does not have atomicity.

Of course, the actual production process often does not follow this standard, because the table is not isolated, in the data processing flow, the column may be atomized in a certain link, and the original data in order to compress the volume, column merge processing.

Want to violate the paradigm also is not only the underlying table, big data processing scenarios, now more and more business with big wide table structure, data redundancy and even deliberately to improve the query efficiency, column storage engine is designed for this scenario, so data paradigm in big data scenarios can be flexible, but still worth learning.

The aggregation

When grouping aggregated data using GROUP BY, if you want to filter against aggregate values, you cannot qualify with WHERE because WHERE filters on a row basis, not on a combination. GROUP BY groups data into groups we call “groups”, so we need to use filter statements HAVING for groups:

SELECT SUM(pv) FROM table
GROUP BY city
HAVING AVG(uv) > 100
Copy the code

In this example, it makes no sense to replace HAVING with WHERE, because adding aggregate criteria to WHERE requires all data to be merged and does not match the level of detail of the current view. The view detail level is explained in detail in my close reading what LOD Expressions are.

Aggregation is so important because we have to look at the data at a high LEVEL, and we can’t see trends in the detailed data. Complex requirements are often accompanied by filtering criteria with aggregation, and it is important to understand how SQL supports this.

CASE expression

CASE expressions are divided into simple and search CASE expressions. Simple expressions:

SELECT CASE pv WHEN 1 THEN 'low' ELSE 'high' END AS quality
Copy the code

The above example uses a simple CASE expression to form a new field. This pattern is equivalent to generating a business custom temporary field, which is useful when processing data from the current table. The ability to search for CASE expressions completely overrides simple CASE expressions:

SELECT CASE WHEN pv < 100 THEN 'low' ELSE 'high' END AS quality
Copy the code

As you can see, search CASE expressions can be used to describe conditions “expression”, can easily complete more complex tasks, and even in the expression of the use of sub-query, aggregation and other means, these are masters of SQL idiomatic skills, so CASE expressions are very worth in-depth learning.

Complex queries

SELECT is the most complex part of SQL, which contains three complex query modes, namely join query and subquery.

Join queries

LEFT JOIN, RIGHT JOIN, INNER JOIN

When we introduced aggregation, we mentioned that join queries are essentially column extensions, and two tables do not combine for no reason, so there must be a foreign key as a relational bond:

SELECT A.pv, B.uv
FROM table1 as t1 LEFT JOIN table2 AS P t2
ON t1.productId = t2.productId
Copy the code

Join queries extend not only columns but also rows, depending on the type of query being joined. In addition to joining a query from another table, you can also join a query from yourself, for example:

SELECT t1.pv AS pv1, P2.pv AS pv2
FROM tt t1, tt t2
Copy the code

The result of the seed join query is its own Cartesian product, which can be filtered by WHERE.

Subqueries and views

A subquery is a SELECT within a SELECT. Generally, a SELECT is executed from inside to outside, and only in associative subquery mode is executed from outside to inside.

If the subquery is saved, it is a view, which is not an entity table, so it is flexible and changes with the original table data:

CREATE VIEW countryGDP (country, gdp)
AS
SELECT country, SUM(gdp)
FROM tt
GROUP BY country
Copy the code

Then the countryGDP view can be used as a temporary table.

This pattern actually goes against the declarative nature of SQL, because defining a view is like defining a variable, and if you continue writing, you can’t avoid some imperative logic.

The transaction

When SQL executes a series of operations, it is inevitable to encounter the problem of dirty data before the execution is complete, so transactions can ensure the atomicity of operations. While each DML operation is typically a built-in TRANSACTION, THE START TRANSACTION provided by SQL allows us to customize the TRANSACTION scope so that a series of business operations can be wrapped together as an atomic operation.

For SQL, atomic operations are very safe, that is, if they fail, they leave no trace, and if they succeed, they all succeed, with no intermediate state.

OLAP

OLAP (OnLine Analytical Processing), namely real-time data analysis, is the basis for the implementation of the computing engine behind BI tools.

More and more SQL databases now support window function implementations for business functions such as runningSum or runningAvg, which are common in data analysis.

Take runningSum as an example. For example, the data in the real-time table of Double Eleven is real-time GMV in minutes. If we want to make a GMV summary line chart accumulated to the current time, the expression running_sum(GMV) should be supported on the Y-axis, which may be realized through window functions.

Of course, not all business functions are directly provided by SQL, the business layer still needs to implement a large number of in-memory functions, which are calculated in the JAVA layer, some of which need to be pushed down to SQL execution, only the combination of in-memory functions and push-down functions can form the complex calculation field effect we see in BI tools.

conclusion

SQL is a declarative language, a seemingly simple query statement often has a complex implementation at the engine level, which is why SQL is so important and so popular.

Although SQL is easy to learn, to systematically understand it requires a mental shift from the concept of structured data and collections.

Do not look down upon CASE syntax, it is not only easy to be confused with the CASE syntax of programming languages, itself combined with expressions for conditional branching judgment, is the longest routine used by many data analysts in their daily work.

There are fewer and fewer scenarios using simple SQL to create applications, but there are more and more SQL-based enhanced expression scenarios in BI scenarios. This series was created with the goal of understanding query expressions in BI scenarios, hoping to apply what I have learned.

The discussion address is: Intensive reading introduction to SQL · Issue #398 · Ascoders /weekly

If you’d like to participate in the discussion, pleaseClick here to, with a new theme every week, released on weekends or Mondays. Front end Intensive Reading – Helps you filter the right content.

Pay attention to the front end of intensive reading wechat public account

Copyright Notice: Freely reproduced – Non-commercial – Non-derivative – Remain signed (Creative Commons 3.0 License)