background

Recently, I plan to learn some data analysis content. Although there are a lot of advertisements in the picture below, many skills do not overpower me. They all say that everyone should have a little product thinking and be sensitive to data.

I have read the introduction of some training institutions, which involves a lot of knowledge, including tools, thinking, practical operation and the final report. You can’t eat a fat man in one mouthful, so learn it slowly.

Data analysis framework

The following is a complete data analysis program, which is divided into five steps: problem identification, data acquisition, data cleaning, data analysis and report presentation.

SQL language plays an important role in the field of data analysis, including data access, data cleaning, data mining will use SQL statement. Today I’m going to share how I learned SQL. Learning SQL language is divided into three stages: primary, middle and advanced:

Primary, roughly learn the basic grammar of SQL language, and then combined with some scenarios for practice, can solve many problems in daily work and life
Intermediate and senior SQL syntax, some commonly used SQL statement specification, here also does not need to know about the kernel itself is how to store data, how to create an index, how to parse SQL statements, and we only need according to the previous experience and the specification to write complex SQL can avoid a lot of pit, at this stage in terms of data analysis, researchers also has very advanced, You can handle most scenarios
Advanced, have a certain understanding of the underlying operating system kernel, file system, database kernel, know how to efficiently define the structure to create indexes, know how to execute the executor more efficiently, can analyze and tune the length of SQL execution. This is left to a professional computer or database system engineer to do, after all, data analysts have their own professional things to do.

It can be summed up in one sentence: keep learning the theory into practice, and then learn and practice again. So the following content is divided into two parts: theory and practice

Theory (SQL language content more, do not have to look up, keep the line, you can directly jump to the following practice)

SQL does more than just query things, including defining data structures, modifying data in a database, and specifying security constraints

An overview of SQL query languages

The American National Organization for Standardization (ANSI) and the International Organization for Standardization (ISO) published the SQL standard. The SQL language has several parts:

Data-Definition LanguageDL, which defines, modifies, and removes relational schemas
Data manipulation language (DML) to query information, insert tuples, modify tuples, and delete tuples
Integrity. SQL DDL contains commands to define integrity constraints. Data stored in the database must meet the definition of integrity constraints
View Definition. SQL DDL contains commands to define views
Transaction Control, SQL includes commands that define the start and end of a transaction
Embedded SQL and Dynamic SQL define how SQL is embedded into a general purpose programming language
Authorization, which defines access to relationships and views

The SQL features described in this chapter are based on the SQL-92 standard

SQL data definition

Basic types of

Char (n) : A fixed length string, of length n specified by the user. You can also use the full term character
Varchar (n) : A variable length string, the maximum length specified by the user is n, which is equivalent to the full character varying
Int: Integer type (a finite self of machine-specific integers), equivalent to the full integer
Unmeric (p, d) : Fixed point number, longitude specified by user. This number has p-bits (plus a sign bit), where the d-bit is to the right of the decimal point.
Real, double precision: Floating-point numbers and double precision numbers. Longitude depends on the machine
Float (n) : A floating point char of at least n bits accuracy, with trailing Spaces padded when the actual length is less than the specified length, whereas varchar does not

Basic schema definition

The integrity constraints are listed below:

primary key
Foreign Key (A1, A2, An) Reference
not null

The order of values in the INSERT statement needs to be the same as the order of attributes in the relational table. DROP TABLE deletes the entire relational table, while DELETE TABLE deletes only the records in the table

The basic structure of an SQL query

The section structure of an SQL query consists of three clauses: SELECT, FROM, and WHERE. The relationship that the input of the query is listed in the FROM clause

Single relation query

SQL allows for duplication in relationships as well as in the results of SQL expressions. SELECT DISTINCT (A1, A2, AN) FROM TABLE ··· SELECT DISTINCT (A1, A2, AN) FROM TABLE ··· ·

The SELECT clause can take arithmetic expressions containing the +, -, *, / operators, which can be constants or properties of a tuple. The WHERE clause allows you to pick out those tuples that satisfy a specific predicate in the result relationship of the FROM clause allowing the logical conjunctions AND, OR, AND NOT in the WHERE clause. The comparison operators can be used to compare strings, arithmetic expressions, and special types, such as date types

Multirelational query

When a property appears in multiple relational tables at the same time, the relational specification is used as a prefix to indicate which property we are using. This convention can be problematic in some cases, such as when you need to combine information from two different tuples of the same relationship. An SQL query can return three types of clauses:

The SELECT clause is used to list the attributes required in the query results
The FROM clause is a list of relationships that need to be accessed in a query evaluation
The WHERE clause is a predicate applied to the properties of the relationship in the FROM clause. The order of operations of a query is first FROM, then WHERE, and finally SELECT. Generally, the meaning of an SQL query can be understood as follows:
Generates Cartesian product (all possible permutations of each tuple in multiple relationships) for the relationship listed in the FROM clause
The predicate specified in the WHERE clause is applied to the result of Step 1
For each tuple in the result of Step 2, the output is the attribute specified in the SELECT clause and in the actual implementation of SQL, it would be optimized by (as far as possible) producing only Cartesian product elements that satisfy the WHERE clause predicate.

A natural connection

The match conditions in the FROM clause most commonly need to be equal on the attributes of all match names. To simplify writing SQL statements in such cases, SQL supports an operation called natural join. The Natural Join operation operates on two relationships and produces one as the result. Unlike the Cartesian product on two relations, he connects every tuple of the first relation to all tuples of the second relation; Natural join considers only pairs of tuples that have the same value on attributes that occur in both relational patterns. To avoid the danger of unnecessary attribute columns with the same name, SQL allows the user to specify which columns are required to be equal by joining R2 using (A1, A2)

Additional basic operations

Its operation

The attribute name in the result is derived from the attribute name of the relationship in the FROM clause, but there are cases where the name cannot be derived using this method. The reasons are:

First, there may be properties of the same name in both relationships of the FROM clause;
Second, arithmetic expressions are used in the SELECT clause, so the result attribute has no name.
Again, you want to change the attribute name in the result. SQL provides a way to rename attributes in the result relationship using the AS clause
```
old-name as new-name
```

The AS clause can also rename relational tables, which are renamed for two reasons:

Replace long relational names with short ones that can be used elsewhere in the query
In order to apply to the case where you need to compare tuples in the same relationship, such as doing a Cartesian product between yourself and yourself

The identifiers used to rename relationships are referred to in the SQL standard as correlation names, or table alias, or correlation variables, or tuple variables.

String operation

SQL uses a pair of single quotes to mark a string, or two single quote characters if a single quote is part of a string. In the SQL standard, equality operations on strings are case sensitive. Some database systems are implemented in a case-insensitive manner, which by default can be changed at the database level or at the specific attribute level. SQL also allows multiple functions on strings, such as concatenation, extraction of Zichean, calculation of string length, conversion of case, and removal of whitespace after the string for pattern matching:

Percent sign (%) : Matches any substring
Underline (_) : Matches any character

The pattern uses the escape keyword to define the escape character in case-sensitive LIKE comparisons, for example

Like 'ab\%d' where 'ab\% CD '= 'ab\% CD' where 'ab\%d' = 'ab\% CD

SQL allows you to use the NOT LINKE comparison operator to search for mismatches. SQL1999 also provides similar to to operations, with syntax similar to regular expressions in UNIX.

Property description in the SELECT clause

The type “*” means “All Properties”, and the asterisk can be preceded by the table name. Represents all properties in a table, for example

select instructor.* from instructor, teaches where instructor.ID = teaches.ID;

The display order of the array tuple

By default, the ORDER BY clause uses ascending order, which can be represented by DESC or ASC for ascending order

WHERE clause predicate

The between comparison operator indicates that a value is less than or equal to one value and greater than or equal to another. The NOT BETWEEN WHERE clause adds an additional condition to allow comparison operators to be used on tuples in lexicographic order, such as:

select name, course_id
from instructor,teaches
where (instructor.ID, dept_name) = (teaches.ID, 'Biology');

Set operations

The UNION, INTERSECT, EXCEPT operations performed by SQL on relationships correspond to the UNION, INTERSECTION, AND DIFFERENCE operations in mathematical set theory

And operation

UNION automatically removes duplicates; if you want to keep all duplicates, you need to use UNION ALL

(select course_id
from section
where semster = 'Fall' and year = 2009)
union all
(select course_id
from secition
where semestr = 'Spring' and year = 2010);

In operation

If you want to preserve duplicates, you must use INTERSECT ALL instead of INTERSECT

(select course_id
from section
where semster = 'Fall' and year = 2009)
intersect all
(select course_id
from section
where semester = 'Spring' and year = 2010);

Poor operation

If you want to preserve duplicates, you must use except all instead of except

(select course_id
from section
where semster = 'Fall' and year = 2009)
except all
(select course_id
from section
where semester = 'Spring' and year = 2010);

A null value

The result of an arithmetic expression (+, -, *, /) is null if either of its inputs is null
SQL treats the result of any comparison operation controlled by the design as unknown, which creates a third logical value in addition to true and false
In Boolean operation:
- And: true and unknown result is unknown, false and unknown result is false, unknown and unknown result is unknown
- Or: true or unknown results in true, false or unknown results in unknown, unknown or unknown results in unknown
- The result of NOT: NOT UNKNOWN is UNKNOWN
If the WHERE clause predicate evaluates to FALSE or UNKNOWN for a tuple, that tuple cannot be added to the result set
The special keyword null is used in the predicate to test the null value
The predicate IS NOT NULL is true if it operates on a value that is not null
Some SQL implementations also support IS UNKNOWN and IS NOT UNKNOWN to test whether the result of an expression is UNKNOWN
SELECT DISTINCT: If two values compared are non-null and equal, or both are null, then they are considered to be the same, while “null=null” in the predicate ranges UNKNOWN. This applies to union, intersection, and difference operations of sets

Aggregation function

A function that takes a collection (set or multiple sets) of values as input and returns a single value. SQL provides five intrinsic aggregation functions

Average value: AVG
Minimum value: min
Maximum value: Max
Synthesis: the sum
Count: the count

Basic gathered

select count(*)
from course;

select count(distinct ID)
from teaches
where semester = 'Spring' and year = 2010;

select avg(aalary)
from instructor
where dept_name = 'Comp.Sci.';

Grouped together

Tuples that have the same value on all the properties in the GROUP BY clause will be grouped into one group and any properties that do not appear in the GROUP BY clause if they appear in the SELECT clause, It can only appear inside the aggregation function. Each teacher in a particular group (defined by Dept_Name) has a different ID, and since each group outputs only one tuple, there is no way to determine which ID value to choose as the output. As a result, SQL does not allow this to happen

Having clause

The predicates in the HAVING clause come into play after grouping is formed. Select dept_name, dept_name, dept_name, dept_name, dept_name, dept_name, dept_name avg(salary) as org_salary from instructor group by dept_name having avg(salary) > 42000; …

The meaning of a query with a GROUP BY or HAVING clause can be defined by the following sequence of operations:

Similar to queries without clustering, a relationship is first computed based on the FROM clause
If a WHERE clause is present, the predicates in the WHERE clause are applied to the result relationship of the FROM clause
If a GROUP BY clause is present, the tuples that satisfy the WHERE predicate are grouped by the GROUP BY clause. Without a GROUP BY clause, the entire set of tuples satisfying the WHERE predicate is treated as a grouping
If there is a clause having, it is applied to each group; Groups that do not satisfy the HAVING clause predicate will be discarded
The SELECT clause uses the remaining groups to produce a tuple in the query result, that is, an aggregation function is applied to the group to get a single result group

The aggregation of null and Boolean values

Rule of Null Handling: All aggregate functions except count(*) ignore null values in the input set. Specifies that the count operation for an empty set has a value of 0. All other aggregate operations return a null value for an empty set. SQL1999 has three types of Boolean input: true, false, and unknown. There are two aggregation functions: some and every

Nested subqueries

The subquery checks into the select-from-where expression in another query.

Set membership

The subquery appears in the WHERE clause and tests whether the tuple is a member of the collection by using the conjunction IN or NOT IN.

select distinct course_id
from section
where semester = 'Fall' and year = 2009 and
    course_id in ( select course_id
                           from section
                           where semester = 'Spring' and year = 2010);

Comparison of sets

SQL > allow

=some = < > and some some. (at least bigger, smaller, equal to) =some =some <> =some = not in Find the names of all the teachers who make at least more money than one of the teachers in the Biology department
,>

select name
from instructor
where salary > some (select salay
                                   from instructor
                                   where dept_name = 'Biology');

SQL allows < ALL, <= ALL, >= ALL, = ALL and <> ALL, <> ALL are equivalent to NOT IN, but = ALL are not equivalent to IN

select name
from instructor
where salary >all (select salay
                                   from instructor
                                   where dept_name = 'Biology');

Space relation test

The exists structure returns true if the subquery as an argument is not null. Example: Find all courses that are offered in both the Fall 2009 and Spring 2010 semesters

select corse_id
from section as S
where semester = 'Fall' and year = 2009 and
       exists (select *
                  from section as T
                  where semester = 'Spring' and year = 2010 and 
                  S.course_id = T.course_id);

A related name from the outer query (S in the above query) can be used in the subquery of the WHERE clause. Subqueries that use related names from outer queries are referred to as correlated subqueries. We can use related names within a subquery only that are defined by the subquery itself, or in any query that contains it. If a related name is defined in both the subquery and the query that contains the subquery, the definition in the subquery is valid. This rule is similar to the common variable rules used in programming languages. The NOT EXISTS structure tests whether there are no tuples in the result set of a subquery. The NOT EXISTS structure can be used to simulate a collection inclusion (that is, superset) operation. You could write “not exists (B except A)” instead of “relationship A includes relationship B.” For example, find the students who are taking all the courses offered in the Biology department

select S.ID, S.name
from student as S
where not exists((select course_id
                             from course
                             where dept_name = 'Biology')
                              except
                              (select T.course_id
                                from takes as T
                                 where S.ID = T.ID));

In this case, subquery

(select course_id
from course
where dept_name = 'Biology')

Find the collection of all courses starting in the Biology department, subquery

(select T.course_id
from takes as T
where S.ID = T.ID)

Find out all the courses that S.ID is taking. Thus, the outer SELECT tests each student to see if all the sets of courses he or she elects include all the sets of courses started by the Biology department

This logical feeling cannot be understood!!

Repeat the tuple existence test

The unique structure is used to determine if there are duplicate tuples in the result of the subquery as an argument. If it does not return true, it returns flase. Example: Find all classes that started at most once in 2009

select T.course_id
from course as T
where unique (select R.course_id
                      from section as R
                      where T.course_id = R.course_id and
                              R.year = 2009);

Note that if there are two tuples in the relationship t1 and t2, t1=t2 is false if a field in t1 or t2 is empty. Although there are multiple copies of a tuple, as long as a tuple is empty, The Unique test could be true

A subquery in the FROM clause

Any select-from-where expression returns a result that is a relationship, so it can be inserted into another select-from-where where any relationship can occur. For example: Find out which departments have teachers whose average salary is more than 4200

select dept_name, avg_salary
from (select dept_name, avg(salary) as avg_salary
    from instructor
    group by dept_name)
where avg_salary > 42000;

For example: find out which class has the largest total payroll of all classes, in which case having clause cannot be resolved. Use the subquery from clause

select max(tot_salary)
from (select dept_name, sum(salary)
         from instructor
         group by dept_name) as dept_total(dept_name, totl_salary);

SQL2003 allows subqueries in the FROM clause to be prefixed with the keyword lateral to access properties in the table preceding it in the FROM clause or in the subquery. Example: Print the name of each teacher, along with their salary and the average job in the department

select name, salary, avg_salary
from instructor I1, lateral(select avg(salary) as avg_salary
                                         from instructor I2
                                         where I2.dept_name = I1.dept_name);

The with clause

The WITH clause provides a way to define temporary relationships only for queries that contain the WITH clause. For example, find the department with the largest budget value

with max_budget(value) as
       (select max(budget)
        from department)
select budget
from department, max_budget
where department, budget = max_budget.value;

Example: find all the departments whose total payroll is greater than the average payroll of all the departments

with dept_total(dept_name, value) as
        (select dept_name, sum(salary)
         from instructor
         group by dept_name),
        dept_total_avg(value) as
        (select avg(value)
          from dept_total)
select dept_name
from dept_total, dept_total_avg
where dept_total.value >= dept_total_avg.value;

Scalar quantum query

SQL allows a subquery to appear anywhere an expression that returns a single value can, as long as the subquery returns only a single tuple containing a single attribute; Select * from dept_name where dept_name = dept_name where dept_name = dept_name where dept_name = dept_name where dept_name = dept_name

      (select count(*)
        from instructor
         where department.dept_name = instructor.dept_name )
        as num_instructors

from department; ··· This example guarantees that only a single value will be returned because it uses the count(*) aggregation function without group by. Scalar quantum queries can appear in SELECT, WHERE, HAVING clauses. It is not always possible to determine at compile time whether a subquery returns multiple tuples or not. If a subquery is executed and results in more than one tuple, a runtime error is generated.

Note that technically the result type of a scalar quantum query is still a relation, even though it contains only a single tuple. Then, when a scalar query is used in an expression where a single value appears, the SQL retrieves the corresponding value from the unit group that contains a single attribute in the relationship and returns it.

Database modification

delete

You can only delete a single tuple, not just values on certain attributes

delete from r
where P;

The delete command intelligently applies to one relationship. If you want to remove tuples from more than one relationship, you must use one delete command on each relationship. You can nest select-from-where in the WHERE clause of DELETE

Example: delete records of teachers whose salaries are below the average university salary

delete from instructor
where salary < (select avg(salary)
                           from instructor);

It is important to test all tuples before deleting them. In the above example, if some tuples are deleted before the rest are tested, the average salary will change, so that the final result of delte will depend on the order in which the feet are processed.

insert

You can specify the tuple to be inserted, or you can write a query to generate a collection of tuples to be inserted. SQL allows you to insert into course(course_id, title, dept_name, credis) by specifying attributes in an INSERT statement in order that users may not remember the sequence of relational attributes.

            values('CS-437', 'Database System', 'comp Sei', 4);

··· The salary of each MUSIC student who completes 144 credits to become a MUSIC teacher is $18,000

insert into instructor
         select ID, name, dept_name, 18000
         from student
         where dept_name = 'Music' and tot_crd > 144;

update

Update Instructor Set Salary = Salary * 1.05 WHERE Salary > 100000; Update Instructor Set Salary = Salary * 1.05 WHERE Salary <= 100000;

You can replace it with a syntax with a case

Update instructor set salary = case when salary <= 100000 then salary * 1.05 else salary * 1.03 end

Case statements generally have the following format:

Case when pred1 then result1 when pred2 then result2...... when predn then resultn else result0; end

A scalar quantum query can be used as an example in the SET clause: Set the tot_cred property of each Student tuple to the sum of the credits for the courses that the student successfully completed, assuming that the student’s grade in a course is neither “F” nor empty

update student S
set tot_cred = (
                   select sum(credits)
                   from takes natural join course
                   where S.ID = takes.ID and
                             takes.grade <> 'F' and
                             takes.grade is not null);

If you do not want to set the totl_cred attribute to null, you can use the following method

select case 
        when sum(credis) is not null then sum(credits)
         else 0
         end

conclusion

SQL consists of several parts
- Data Definition Language (DDL) to define, delete, and modify relational schemas
- Data manipulation language (DML), including the query language, and the command to insert tuples into the database, remove tuples from the database, and modify tuples from the database
SQL’s data definition language is used to create relationships with specific schemas, and allows you to declare integrity constraints, such as primary and foreign code constraints, in addition to declaring the names and types of relationship attributes
SQL provides a variety of language constructs for querying databases, including SELECT, FROM, AND WHERE clauses. SQL supports natural join operations
SQL also provides a mechanism to rename attributes and relationships, and to sort query results by specific attributes
SQL supports basic geometric operations on relationships, including union, intersection, and search operations
SQL handles relationships and queries that contain null values by adding “unknown” to the generic true and false values
SQL supports aggregation, you can group relationships, you can apply aggregation on each group individually, and SQL also supports set operations on groups
SQL supports nested subqueries within the WHERE and FROM clauses of the outer query. Scalar quantum queries are also supported anywhere an expression returns a single value
SQL provides structures for updating, inserting, and deleting information

practice

We need to find a large number of scenarios to do exercises, their structure is too difficult for many platform still have question, I was using a cow from the question bank, he compiled SQL project https://www.nowcoder.com/ta/sql, there are more than 90 items can be done, I also made the study column https://juejin.cn/column/6973… To record the process of their own practice, you can also use different ways of writing to solve the problem, convenient for later summary and improvement,

Here’s a concrete example

Find all employees who are not the department leader EMP_NO

For the sake of practice and history, I used MemfireDB, a free cloud database. Once I logged in, there was a web version of the editor that I could write code directly to execute without having to install a bunch of software.

Click “SQL Query” to enter the Web version of the SQL editor

Use case content

drop table if exists  "dept_manager" ; 
drop table if exists  "employees" ; 
CREATE TABLE "dept_manager" (
"dept_no" char(4) NOT NULL,
"emp_no" int NOT NULL,
"from_date" date NOT NULL,
"to_date" date NOT NULL,
PRIMARY KEY ("emp_no","dept_no"));
CREATE TABLE "employees" (
"emp_no" int NOT NULL,
"birth_date" date NOT NULL,
"first_name" varchar(14) NOT NULL,
"last_name" varchar(16) NOT NULL,
"gender" char(1) NOT NULL,
"hire_date" date NOT NULL,
PRIMARY KEY ("emp_no"));
INSERT INTO dept_manager VALUES('d001',10002,'1996-08-03','9999-01-01');
INSERT INTO dept_manager VALUES('d002',10003,'1990-08-05','9999-01-01');
INSERT INTO employees VALUES(10001,'1953-09-02','Georgi','Facello','M','1986-06-26');
INSERT INTO employees VALUES(10002,'1964-06-02','Bezalel','Simmel','F','1985-11-21');
INSERT INTO employees VALUES(10003,'1959-12-03','Parto','Bamford','M','1986-08-28');

The answer

select emp_no from employees where emp_no not in (
  select emp_no from dept_manager
);

You can see how long my SQL has been running

SQL language is also a programming language, can not be learned in a day, to practice makes perfect, continuous practice, and often used in daily work, I believe that through this method, everyone can be called SQL master.

Do you have any good methods, welcome to share!!

Data analysis -SQL daily practice