This article is also published on Zhihu and my personal website, welcome to follow

I am limited to “getting started” because I am not an SQL Server expert, but I have accumulated some experience in recent performance optimization. Although it is not deep, it is enough to deal with some SQL statement performance problems that we need to solve daily, so I share it for your reference. The selfish point is that even if you don’t use this skill for a long time, you can quickly pick it up by reviewing your own articles after being out of practice.

I’m limited to “query statements” because there is room for optimizing memory usage, query compilation, deadlocks, and so on in SQL Server, which is not covered in this article.

Another meaning of “getting started” is that I’m going to focus on principles. I’ve found that neither front-end performance optimization nor SQL performance optimization can match WYSIWYG coding, either because of the limited information provided by the tools, or because the performance bottleneck is a mountain of inherited code, and most of the time you have to improvise. Sometimes you can get to the root of the problem by pulling the strings, and sometimes you have to rewrite the code in a big way. Either way, you need to understand the tools behind it. I hope you can understand this article even with simple SELECT, UPDATE, DELETE

This article covers two parts: an Index and an Execution Plan. While indexes solve more than 90 percent of our performance problems, we also need to know when and where to add indexes, so read the execution plan for hints on this.

Person.person, Person.PersonPhone, person.emailAddress, person.EmailAddress, person.EmailAddress All three tables contain a BusinessEntityID field, which can be used to correlate information about the same person.

Be careful of Scan

It is no exaggeration to compare a database to a book. Imagine if you had to look up a line of text in a book without a table of contents and the only thing you could do was look it up page by page. A database works the same way. For a table without any indexes, it can only scan the entire table to find a match

For example, if I delete all indexes in the PersonPhone table to find a specific phone number:

  SELECT *
  FROM Person.PersonPhone
  WHERE PhoneNumber = '156-555-0199';
Copy the code

The implementation plan shows us the following process:

Since we’ll talk about execution plans later, for now you can think of execution plans as the execution of SQL statements. The Table Scan in the figure above tells us that it scans the entire Table. In addition, this step consumes the most resources in the entire execution process: Cost: 100%. Cost here is only an abstract unit, which does not represent the consumption of CPU or I/O in a single dimension, but the result of statistics of various resources.

In fact, 100% in the above process does not mean that scan operation is inefficient, because it only involves the query operation of a single table. Even for such a simple query of an indexed table, you will see Cost: 100%. For example, if I query the Person table with the [PK_Person_BusinessEntityID] index:

  SELECT *
  FROM Person.Person
  WHERE BusinessEntityID = 10;
Copy the code

The resulting execution process is as follows:

A non-SCAN Clustered Index Seek (I’ll explain later, but you can think of it here as a superior operation to scan) operation also consumes 100%.

But if we do a joint query on the PersonPhone and Person tables, the efficiency of the query becomes significantly higher:

  SELECT *
  FROM Person.PersonPhone AS PersonPhone
  JOIN Person.Person AS Person ON PersonPhone.BusinessEntityID = Person.BusinessEntityID
  WHERE PhoneNumber = '156-555-0199';
Copy the code

Scan consumes 91% of all operations

So scan is an optimization point that we can identify. When you find that a table is missing an index or you see a SCAN operation in the execution plan, try to fix the performance problem by adding an index.

Key Logical Reads

SQL Server queries data in the buffer cache first, and if it does not find the data in the buffer cache, it continues to search for the data in disk. The former is called logical read, and the latter is called physical read. Given that reads and writes from memory are more efficient than from disk, we certainly want to avoid any physical reads whenever possible.

And what exactly does logical Read read? Is page, page is the smallest unit of data organization in the database, we only need to understand this depth, how the page is organized, the page data structure is not important. So the logical reads should be as small as possible. By default, you don’t see logical read output. To turn this ON, use SET STATISTICS IO ON. For example, to query a PersonPhone table with no index, we would use the following statement:

SET STATISTICS IO ON
GO

  SELECT *
  FROM Person.PersonPhone
  WHERE PhoneNumber = '156-555-0199';

SET STATISTICS IO OFF
GO
Copy the code

The resulting logical read information is as follows:

(1 row affected) Table ‘PersonPhone’. Scan count 1, logical reads 158, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Once I’ve added a Clustered Index for PersonPhone with PhoneNumber as key (this is just an optimization if you don’t know anything about Index), the result of the above statement would be:

(1 row affected) Table ‘PersonPhone’. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Logical reads are too high and may (but not necessarily) suggest something:

  • Missing indexes cause multiple rows to be scanned
  • A higher value may indicate too much pressure on the disk
  • Even query operations can lock data (depending on the isolation level), and excessively long queries can plan subsequent read and write operations, causing a ripple effect.

Another benefit of logical Read is that, as a measure of performance, it doesn’t fluctuate as much as, say, Duration or CPU time

Logical Read is not as useful as an execution plan, on the one hand because it is one-way, meaning you can read logical reads values from SQL statements but cannot read them backwards. In this sense, an execution plan is better for troubleshooting problems. On the other hand, it’s not always accurate, and if you remove the WHERE statement from the above query, you’ll see that logical reads doesn’t change much before and after adding index.

However, Logical reads can be used as one of our reference indicators.

Index

Clustered Index & Nonclustered Index

Now, finally, index. The way Index works is very simple. If we compare a database to a book, the Index is the table of contents of the book, and it helps you locate the data quickly.

In the above table, if we want to find rows for a particular company, we need to check each row in the table to see if it matches that expected value. This is a full table scan, which is very inefficient, and if the table is large and only a few rows match the search criteria, the whole scan process is extremely inefficient.

We can add an index to this table:

The index contains one entry for each row in the AD table and is sorted by company_num value. Now, instead of searching the entire table line by line for matches, we can use this index. Let’s say we want to find all the rows of company number 13. We start scanning the index and find three values that belong to the company. We then reach the index value of company number 14, which is a bit larger than the value we are looking for. Since the index values are already in order, when we read the index row containing 14, we know we can’t find any more matches to 13 and can exit the search. Thus, one way to improve efficiency with an index is to know where the matching line ends and skip the rest; Another way to use indexes to improve efficiency is to use a positioning algorithm to find the first match without a linear scan from the index position (for example, binary search is much faster than scanning). In this way, we can quickly locate the first match value, saving a lot of search time.

But in SQL Server, indexes are divided into several classes. Clustered Index is the most commonly used: the data in the table is physically sorted by Clustered Index. Because there can only be one physical order relationship, only one clustered index is allowed for a table. When you add a primary key constraint to a table, the database automatically creates a clustered index with the primary key for you.

We can add an index to the PersonPhone with PhoneNumber as key, and then execute the statement above to query PhoneNumber again

  SELECT *
  FROM Person.PersonPhone
  WHERE PhoneNumber = '156-555-0199';
Copy the code

You can see the execution plan changed to Clustered Index Seek as shown below

Seek is the most efficient, we should try to make the query statement to perform Seek operation, it is no longer like scan line by line, but similar to the book directory directly to the destination of the required data.

But clustered Index seek does not work in all cases, such as the BusinessEntityID condition in the case of PhoneNumber index above:

  SELECT *
  FROM Person.PersonPhone
  WHERE BusinessEntityID = 4511
Copy the code

You will find that the execution plan is Clustered Index Scan

Index scan means that the database retrieves all rows through the index before scanning. If you compare Index scan and Table scan, the logical reads for both are about the same.

Configuring nonclustered Index is no different from staging index. You will also see non-clustered index Seek in the staging plan. The obvious difference is that there is no effect on the order of the original table. Although they may look the same, there are actually many connections behind them, and understanding these connections can help us determine which index should be added appropriately and when.

How Index works

Imagine a single column of 27 rows of data, divided into nine pages because of the limited page size

After you add a Clustered Index for them, the Index data structure looks like this

When you’re looking for a value of 5, the search will start at the top node, and since 5 is between 1 and 10, the search will continue to the next node on the left branch, and since 5 falls between 4 and 7, the search will go down the level to the node that starts with 4. And then finally you find the 5 on the leaf

In fact, we’ve overlooked some details. The clustered Index is structured as follows:

It is not difficult to see from the figure that each layer node is a bidirectional list, and the real data of the table is stored on the leaf node.

But nonclustered Index has a slightly different storage structure, with leaf nodes consisting of index pages rather than data pages. Nonclustered index uses row locator to locate the corresponding rows (Pointers). For heap tables, clustered index indexes are clustered. Row locator refers to the ROW identifier of each row. For non-Heap tables, row locator points to a clustered index.

Given the limited space available, we can summarize what index should be used when:

  • You should create a clustered index before creating a nonclustered index
  • If you are always querying data that needs to be sorted by a certain column, add a clustered index for that column
  • Do not add a clustered index to columns that will be updated frequently, as this will cause row lockers for all related Noneclusterd indexes to be updated frequently, which can cause deadlock issues.
  • Instead, you can add nonclustered index to frequently updated columns because it only affects the current nonclustered index
  • Nonclusterd index is not suitable for large queries as they may cause additional lookup operations, in which case you should turn the index into a covering index.

Covering Index

Delete all indexes under PersonPhone add PhoneNumber to nonclustered index and execute the initial query statement:

  SELECT *
  FROM Person.PersonPhone
  WHERE PhoneNumber = '156-555-0199';
Copy the code

You will get the following execution plan:

In addition to Index Seek, the bottom right lookup operations account for the largest proportion. The reason for triggering a lookup is very simple: A lookup is triggered when the database decides to use nonclustered Index for the query and the column information to be queried is not in nonclustered Index (neither as the key of the index nor in the includes list). Lookup means that it will locate row data based on the row locator associated with index (clustered Index for non-Heap table, RID for Heap table) and read the desired column data from it. In addition to consuming logical Reads on the index page, the entire process also consumes additional logical Read operations on the data page. As you can imagine, a database that uses a clustered Index lookup never needs a lookup because the leaf node of a clustered Index is the Data Page

An index of this type can be called a covering index if all the column information needed for a query is provided, which means that access to the data page can be omitted.

We can place columns other than the index key that need to be queried into the includes list. This will solve the problem of lookup above:

conclusion

I also wanted to write about join efficiency (compared to hash join/nested loop/merge join), but I think what kind of join database uses is out of our control. Whether or not the database actually uses our index is out of our control. The execution plan is calculated by its internal Optimizer, and each execution plan can vary depending on the state of the resource, data, and index. But at least the index is more controllable. Most performance problems can be solved with Index