Database optimization under large data volume and high concurrency

If a reasonable database model can not be designed, it will not only increase the difficulty of programming and maintenance of client and server programs, but also affect the performance of the actual operation of the system. Therefore, a complete database model design is necessary before a system can be implemented. In a system analysis, design stage, because the amount of data is small, the load is low. We tend to notice the function implementation, and it is hard to notice the performance of the weaknesses, wait until the system into operation after a period of time, only to find that the performance of the system in the lower, then to consider to improve the system performance is to spend more manpower material resources, and the whole system is the inevitable form a patch project. Therefore, when considering the process of the whole system, we must consider whether our system will have extreme situations in the case of high concurrency and large data access. (For example, abnormal data occurred in the external statistics system on July 16, caused by concurrent access of large amounts of data, and the response time of the database could not keep up with the speed of data refreshing. The specific situation is: when the date is critical (00:00:00), judge whether there is a record of the current date in the database, if not, insert a record of the current date. In the case of low concurrent access, there is no problem, but when the number of visits at the critical date is quite large, there will be multiple conditions when making this judgment, and multiple records of the current date will be inserted into the database, resulting in data errors.) After the database model is determined, it is necessary to make a data flow diagram in the system to analyze the possible bottlenecks. In order to ensure the consistency and integrity of database, too many associations between tables are often designed during logical design to reduce data redundancy as much as possible. (For example, the region of the user table can be stored in another region table.) If the data redundancy is low, the data integrity can be easily guaranteed, which improves the data throughput speed, ensures the data integrity, and clearly expresses the relationship between data elements. For the associated query between multiple tables (especially big data tables), its performance will be reduced and the programming difficulty of the client program will also be increased. Therefore, the physical design needs to compromise. According to the business rules, the data volume of the associated table and the access frequency of data items should be determined. In order to improve the response time of the system, reasonable data redundancy is also necessary. Designers should consider the type and frequency of system operation in the design stage. In addition, it is best not to associate child tables with self-increasing attribute fields as primary keys. This is inconvenient for system migration and data recovery. The mapping relationship of the external statistics system is lost. The original table must be able to be rebuilt from the table it was separated from. The advantage of using this rule is that you can make sure that you don’t introduce extra columns into the split table and that all the table structures you create are as large as they really need to be. It’s a good habit to apply this rule, but unless you’re dealing with a very large amount of data, you won’t need it. (For example, in a pass system, I can separate USERID, USERNAME, USERPASSWORD as a table and use USERID as a foreign key for other tables.)

Table design specific attention to the problem:

1. The length of the data line should not exceed 8020 bytes. If this length is exceeded, the data will occupy two lines in the physical page, causing storage fragmentation and reducing query efficiency.

2. The ability to use numeric fields to select numeric types rather than string (phone number), which reduces query and join performance and increases storage overhead. This is because the engine is processing queries and concatenating back to compare each character in the string one by one, whereas for numeric types it only needs to compare once.

For immutable char and mutable vARCHar, both 8000 bytes. Char is fast, but consumes storage space. Varchar is slow, but saves storage space. You can make flexible choices when designing fields. For example, you can select CHAR for fields with small length changes such as user name and password, and VARCHAR for fields with large length changes such as comments.

4, the length of the field should be set as short as possible under the premise of meeting the possible needs as far as possible, so as to improve the efficiency of the query, but also to reduce the consumption of resources in the establishment of the index.

Second, the optimization of the query to ensure that on the basis of the implementation of functions, as far as possible to reduce the number of database access; By searching parameters, the number of rows accessed to the table is minimized and the result set is minimized, so as to reduce the network burden. The operations that can be separated are handled separately as far as possible to improve the response speed of each time. When SQL is used in the data window, try to put the index used in the first column of the selection; The structure of the algorithm is as simple as possible. When querying, do not use too many wildcards such as SELECT * FROM T1. SELECT several columns such as SELECT COL1,COL2 FROM T1. SELECT TOP 300 COL1,COL2,COL3 FROM T1 where the user does not need that much data. In the absence of an index, the database searches for a certain piece of data and must perform a full table scan, traversing all data once to find the records that meet the conditions. In the case of a small amount of data, the difference may not be noticeable, but in the case of a large amount of data, it is extremely bad. How SQL statements are executed in SQL SERVER? They are worried that their written SQL statements will be misunderstood by SQL SERVER. Select * from table1 where name=’zhangsan’ and tID > 10000 Select * from table1 where tID > 10000 and name=’zhangsan’ If tID is an aggregated index, the next sentence will simply look up the table’s 10000 entries later; Select tID>10000 from zhangsan where name=’zhangsan’ and condition = tID>10000 In fact, such worries are unnecessary. SQL SERVER has a “Query analysis optimizer” that calculates the search criteria in the WHERE clause and determines which index reduces the search space for table scans, that is, it can be automatically optimized. Although the query optimizer can automatically optimize queries based on the WHERE clause, sometimes the query optimizer will not perform queries as quickly as you intended. During the query analysis phase, the query optimizer looks at each stage of the query and decides whether it is useful to limit the amount of data that needs to be scanned. If a stage can be used as a scan parameter (SARG), it is said to be optimizable and the desired data can be obtained quickly using the index. Definition of SARG: An operation used to restrict search because it usually refers to a specific match, a match within a range of values or an AND join of more than two conditions. The column name operators < constant or variable > or < constant or variable > column names may appear on one side of the operator, while constants or variables appear on the other side of the operator. Such as: If an expression does not satisfy the SARG form, it cannot limit the scope of the search, That is, SQL SERVER must determine for each row whether it satisfies all the conditions in the WHERE clause. So an index is useless for expressions that do not satisfy SARG form. Therefore, the most important thing to optimize a query is to try to make the statement conform to the query optimizer’s rules and avoid full table scans and use index queries.

Specific points to note:

1. Avoid null values in the WHERE clause. Otherwise, the engine will abandon the use of index and perform full table scan. Select id from t where num is null select id from t where num is null select id from t where num=0

2. Avoid using it in where clauses! = or <> otherwise the engine will abandon the index for a full table scan. The optimizer will not be able to determine the number of rows that will die by index, so it needs to search all rows in the table.

Select id from t where num=10 or num=20 select id from t where num=10 or num=20 select id from t where num=10 union all select id from t where num=20

4. Use in and not in with caution, because in prevents the system from using the index and only searches the table directly. Select id from t where num between 1 and 3 select id from t where num between 1 and 3

5. Do not search for characters that have been indexed using non-initial letters. This also prevents the engine from taking advantage of the index. See the following example: SELECT * FROM T1 WHERE NAME LIKE ‘%L%’ SELECT * FROM T1 WHERE SUBSTING(NAME,2,1)= ‘L’ SELECT * FROM T1 WHERE NAME LIKE ‘L%’ even if the NAME field has an index, the first two queries still cannot use the index to complete the accelerated operation, the engine has to complete the task of all the data in the whole table. The third query can use indexes to speed up operations.

6. Forcing the query optimizer to use an index if necessary, such as using parameters in the WHERE clause, also results in a full table scan. Because SQL resolves local variables only at run time, the optimizer cannot defer the choice of an access plan until run time; It must be selected at compile time. However, if an access plan is established at compile time, the value of the variable is unknown and therefore cannot be used as an input for index selection. Select id from t where num=@num; select id from t with(index) where num=@num

7. Avoid expression operations on fields in the WHERE clause, which will cause the engine to abandon indexes and perform full table scans. SELECT * FROM T1 WHERE F1=100*2 SELECT * FROM T1 WHERE F1=100*2

SELECT * FROM RECORD WHERE SUBSTRING(CARD_NO,1,4)= ‘5378’

SELECT member_number, first_name, last_name FROM members WHERE DATEDIFF(yy,datofbirth,GETDATE()) > 21 SELECT member_number, first_name, last_name FROM members WHERE dateofbirth < DATEADD(yy,-21,GETDATE()) Any operation on a column will result in a table scan, which includes database functions, computed expressions, and so on, moved to the right of the equals sign whenever possible.

8. Avoid functional manipulation of fields in the WHERE clause, which will cause the engine to abandon indexes and perform a full table scan. Such as: Select id from t where substring(name,1,3)=’ ABC ‘–name id starting with ABC select id from t where substring(name,1,3)=’ ABC Datediff (day,createdate,’2005-11-30′)=0– ‘2005-11-30’ select id from t where name like ‘abc%’ select id from t where createdate>=’2005-11-30′ and createdate<‘2005-12-1’

9. Do not perform functions, arithmetic operations, or other expression operations to the left of the “=” in the WHERE clause, or the system may not use the index properly.

10. When using an index field as a condition, if the index is a compound index, the first field in the index must be used as a condition to ensure that the system can use the index. Otherwise, the index will not be used, and the field order should be as consistent as possible with the index order.

11. In many cases it is a good choice to use exists: elect num from a where num in(select num from b) select num from a where exists(select 1 from b where num=a.num)

SELECT SUM(T1.C1)FROM T1 WHERE( (SELECT COUNT(*)FROM T2 WHERE T2.C2=T1.C2>0) SELECT SUM(T1.C1) FROM T1WHERE EXISTS( SELECT * FROM T2 WHERE t2.c2 = t1.c2) SELECT * FROM T2 WHERE t2.c2 = t1.c2 The latter does not result in a large number of locked table scans or index scans.

If you want to verify the existence of a table, don’t use count(), which is inefficient and wastes server resources. The word EXISTS can be used instead. IF (SELECT COUNT() FROM table_name WHERE column_name = ‘XXX ‘) IF EXISTS (SELECT * FROM table_name WHERE column_name = ‘xxx’)

It is often necessary to write a T_SQL statement to compare a parent result set to a child result set to find if there are records in the parent result set that are not in the child result set, as in: SELECT a.dr_key FROM hdr_tbl a—- TBL a indicates that TBL uses the alias A instead of WHERE NOT EXISTS (SELECT * FROM dtl_tbl b WHERE a.dr_key = b.hdr_key) SELECT a.hdr_key FROM hdr_tbl a LEFT JOIN dtl_tbl b ON a.hdr_key = b.hdr_key WHERE b.hdr_key IS NULL SELECT Hdr_key FROM hdr_tbl WHERE hdr_key NOT IN (SELECT hdr_key FROM dTL_tbL) The three notations can obtain the same correct result, but the efficiency decreases IN sequence.

12. Use table variables instead of temporary tables whenever possible. If the table variables contain a lot of data, be aware that the indexes are very limited (only primary key indexes).

13. Avoid frequent creation and deletion of temporary tables to reduce the consumption of system table resources.

14. Temporary tables are not unusable, and their proper use can make some routines more efficient, for example, when a large table or a data set in a commonly used table needs to be referenced repeatedly. However, for one-off events, it is best to use exported tables.

15. When creating a temporary table, if a large amount of data is inserted at a time, you can use Select INTO instead of create table to avoid creating a large number of logs and improve the speed. If the amount of data is small, to reduce the resources of the system table, create table first, then insert.

16. If temporary tables are used, you must explicitly delete all temporary tables at the end of the stored procedure, truncate TABLE first, and then DROP TABLE. In this way, system tables cannot be locked for a long time.

17. SET SET NOCOUNT ON at the beginning of all stored procedures and triggers and SET NOCOUNT OFF at the end. There is no need to send a DONE_IN_PROC message to the client after each statement of the stored procedure and trigger is executed.

18. Avoid large transaction operations and improve system concurrency.

19. Try not to return a large amount of data to the client. If the amount of data is too large, consider whether the corresponding requirements are reasonable.

Avoid using incompatible data types. For example, float and int, char and varchar, binary and varbinary are incompatible. Data type incompatibilities may prevent the optimizer from performing optimizations that could otherwise be performed. For example, SELECT name FROM employee WHERE salary > 60000. In this statement, if salary is money, the optimizer is difficult to optimize because 60000 is an integer. We should convert integers to coins at programming time, not at runtime.

21. Make full use of join conditions. In some cases, there may be more than one join condition between two tables. Ex. : SELECT SUM(A.AMOUNT) FROM ACCOUNT A,CARD B WHERE A.CARD_NO = B.CARD_NO SELECT SUM(A.AMOUNT) FROM ACCOUNT A,CARD B WHERE A.card_no = b.card_no AND a.account_no = b.account_no The second sentence will execute much faster than the first.

22. Using view-accelerated queries to sort a subset of tables and create views can sometimes speed up queries. It helps avoid multiple sort operations and, in other ways, simplifies the optimizer’s work. For example, SELECT cust.name, rcvbles.balance,… Other columns FROM CUST, Rcvbles WHERE ust. Customer_id = rcvbls. customer_id AND rcvbls. balance>0 AND UST. Postcode > “98000” ORDER BY UST

If the query is to be executed multiple times instead of once, you can find all the unpaid customers in a view and sort them by their names: CREATE VIEW DBo.v_CUST_RCVLbes AS SELECT CUst.name, rcvbles. Balance,… Other columns FROM CUST, Rcvbles WHERE ust. Customer_id = rcvbls.customer_id AND rcvbls.balance >0 ORDER BY ust. Name then query in the view as follows: SELECT * FROM V_CUST_RCVLBES WHERE postcode> “98000” has fewer rows than the main table, and the physical order is the required order, reducing disk I/O, so the query work can be greatly reduced.

GROUP BY SELECT OrderID FROM Details WHERE UnitPrice > 10 GROUP BY OrderID SELECT DISTINCT OrderID FROM Details WHERE UnitPrice > 10

Do not perform the SELECT DISTINCT function on a UNION ALL. This reduces unnecessary resources

35. Avoid using SELECT INTO statements. The SELECT INOT statement results in a table lock that prevents other users from accessing the table.

Third, algorithm optimization

Avoid cursors because they are inefficient, and you should consider rewriting them if they operate on more than 10,000 rows. Before using a cursor based approach or a temporary table approach, look for a set based solution to solve the problem, which is usually more efficient. As with temporary tables, cursors are not unusable. Using the FAST_FORWARD cursor for small data sets is generally preferable to other line-by-line processing methods, especially if you have to reference several tables to get the data you want. Routines that include “totals” in the result set are generally faster to execute than those that use cursors. If development time allows, both the vernior-based approach and the set-based approach can be tried to see which works better. Cursors provide a means of line-by-line scanning in a specific set. Cursors are generally used to traverse data line-by-line, and perform different operations according to different conditions of the data taken out. In particular, cursor (large data set) loops defined for multiple tables and large tables can easily cause the program to enter a long queue or even crash. In some cases, it is necessary to use cursors. At this time, you can also consider transferring qualified data rows into temporary tables, and then defining cursors for temporary tables to perform operations, which can significantly improve performance. (for example, internal statistics first edition) encapsulates stored procedures

Build efficient indexes

Indexes are generally created for two purposes: to maintain the uniqueness of indexed columns and to provide a policy for fast access to the data in the table. There are two large database index cluster index and the cluster index, namely a no cluster index table is stored according to heap structure data, all data are added in the end of the table, and has set up a cluster index table, the data on physical storage, in accordance with the order of the cluster index key a table allows only one cluster index, as a result, according to the structure of B tree, It is understandable that adding any kind of index improves the speed of queries by indexed column, but reduces the performance of insert, update, and delete operations, especially when the Fill Factor is large. Therefore, frequent insertion, update and deletion operations are carried out on tables with many indexes, and smaller padding factors are set during table building and index building, so as to leave more free space in each data page and reduce the work of page splitting and reorganizing. Indexing is one of the most efficient ways to get data from a database. Ninety-five percent of database performance problems can be solved by indexing. As a rule, I usually use unique grouped indexes for logical primary keys, unique ungrouped indexes for system keys (as stored procedures), and ungrouped indexes for any foreign key columns [fields]. But indexes are like salt: too much of a dish becomes salty. You need to consider how much space the database has, how tables are accessed, and whether those accesses are primarily used for reads and writes. In fact, you can think of an index as a special kind of directory. Microsoft SQL SERVER provides two types of indexes: clustered and nonclustered indexes. Below, we illustrate the difference between clustered index and non-clustered index: in fact, the body of our Chinese dictionary is itself a clustered index. For example, when we look up the word “an”, we naturally turn to the first few pages of the dictionary, because the pinyin for “an” is “an”, and the dictionary that sorts Chinese characters according to pinyin starts with “A” and ends with “Z”, so “an” naturally comes first. If you go through all the sections starting with “A” and still can’t find the word, you don’t have it in your dictionary. Similarly, if you look up “Zhang”, you will also turn to the last part of your dictionary, because “Zhang” is spelled zhang in pinyin. That is, the body of the dictionary is itself a table of contents, and you do not need to look in another table of contents to find what you are looking for. We call this kind of text content itself a kind of regular arrangement of the table of contents “clustered index”. If you know a word, you can quickly look it up from the auto. But you may also encounter you do not know the word, do not know its pronunciation, at this time, you can not find the word you want to check in accordance with the method just now, and need to go according to the “radical” to find the word you are looking for, and then according to the page after the word directly turn to a page to find the word you are looking for. But you combined with “radical directory” and “index of Chinese characters and to check the word” sort is not true of the text sorting method, such as you look up “a” word, we can see that after check radical index of Chinese characters “a” in the page number is 672 pages, index of Chinese characters in “a” above is “chi”, but the page number is 63 – page, under “a” is “crossbow” word, The page is 390. Obviously, these words are not really above or below “zhang”. The successive “chi”, “zhang” and “crossbow” that you see now are actually their ordering in the non-clustered index, which is a mapping of the words in the body of the dictionary in the non-clustered index. We can find the word you want in this way, but it takes two steps, first finding the result in the table of contents, and then turning to the page number you want. We call this kind of sorting method in which the catalogue is purely a catalogue and the body is purely a text “non-clustered index”. By extension, we can easily understand that there can only be one clustered index per table, because directories can only be sorted in one way.

(a) when to use clustered index or non-clustered index

The following table summarizes when to use clustered or non-clustered indexes (important).

Action Description Use clustered indexes Use non-clustered indexes

Columns are often grouped and sorted

Should or should not return data in a range

One or very few different values should not

Should different values of decimals

Different values of large numbers should not be applied

Frequently updated columns should not

The foreign key column should

Primary key column should

Frequent changes to index columns should not be allowed

In fact, we can understand the above table from the previous example of the definition of clustered and non-clustered indexes. For example, return an item of data in a range. For example, if you have a table with a time column and you create the aggregate index on that column, you can query all the data between January 1, 2004 and October 1, 2004 very quickly because your dictionary text is sorted by date. The clustering index only needs to find the beginning and end data in all the data to be retrieved; Unlike a non-clustered index, you must first look up the page number of each item in the table of contents, and then look up the specific content by page number.

(2) Combined with reality, talk about the misunderstanding of index use

The purpose of theory is application. Although we have just listed when clustered or non-clustered indexes should be used, in practice these rules can easily be ignored or cannot be analyzed in a comprehensive way. Below we will talk about the misunderstanding of index use according to the actual problems encountered in practice, so that we can master the method of index establishment. 1. The idea that a primary key is a clustered index is extremely wrong and a waste of clustered indexes. SQL SERVER builds clustered indexes on primary keys by default. In general, we create an ID column in each table to distinguish each piece of data, and this ID column is automatically increased, usually with a step of 1. This is the case with the COLUMN Gid in our office Automation example. At this point, if we set this column as the primary key, SQL SERVER defaults this column to the clustered index. This has the advantage of having your data physically sorted by ID in the database, but I don’t think it makes much sense. Obviously, the advantages of clustered indexes are obvious, and the rule that there can only be one clustered index per table makes clustered indexes even more valuable. From the definition of clustered indexes we discussed earlier, we can see that the biggest benefit of using clustered indexes is the ability to quickly narrow down queries based on query requirements and avoid full table scans. In practice, because ID numbers are automatically generated, we do not know the ID number of each record, so it is difficult for us to use ID numbers to query in practice. This makes it a waste of resources to have a primary key, the ID number, as a clustered index. Secondly, it does not comply with the rule that “aggregation indexes should not be established under different values of large numbers” to have fields with different ID numbers as clustered indexes. Of course, this is only for users who frequently modify the content of the record, especially the index entries, but it does not affect the query speed. In the office automation system, no matter the documents that need to be signed and received displayed on the home page of the system, the meeting or the file query, the data query under any circumstances can not be separated from the field “date” and the user’s own “user name”. Typically, the front page of OA shows documents or meetings that each user hasn’t signed in yet. Although we where clause can be limited to the current user has yet to sign for it, but if your system has been established for a long time, and the data quantity is very big, so, every time when each user opened the home page for a full table scan, do the meaning is not big, the vast majority of users a month before the file has been visited by, Doing so only increases the overhead of the database. In fact, when the user opens the home page of the system, the database only queries the files that the user has not read in the past three months. The field of “date” is used to limit table scanning and improve the speed of query. If your OFFICE automation system is two years old, your home page will theoretically be eight times faster, or even faster.

In fact, we can see that in the above example, statements 2 and 3 are identical, and the indexed fields are the same; The only difference is that the former builds a non-aggregate index on the Fariqi field, while the latter builds an aggregate index on the fariqi field, but the query speed is completely different. Therefore, it is not necessary to simply index any field to improve query speed. From the table statement, we can see that the fariQI field has 5003 different records in the table with 10 million data. It is a good place to build an aggregate index on this field. In reality, we post several files every day with the same date, which perfectly conforms to the rule of “neither most of them are the same, nor only a few of them are the same” for clustered indexing. From this point of view, it is very important that we build “appropriate” aggregate indexes to speed up our queries.

3. Add all the fields that need to improve the query speed into the aggregate index to improve the query speed. As mentioned above, the “date” and the “username” of the user are indispensable fields in the data query. Since both fields are so important, we can combine them to create a compound index. Many people think that adding any field to the clustered index will speed up the query, while others wonder if the query speed will be slowed if the composite clustered index field is queried separately. With that in mind, let’s look at the following query speed (both result sets are 250,000 pieces of data) : (The date column fariqi is first placed in the start column of the composite clustered index, and the user name neibuyonghu is second.) Even slightly faster than using all composite index columns (with the same number of query result sets); If only the non-start column of the composite clustered index is used as the query condition, the index is useless. Of course, statements 1 and 2 are the same because the number of items queried is the same. If all columns of the composite index are used and the number of query results is small, this results in “index coverage” and thus optimal performance. Also, keep in mind that the leading column must be the most frequently used column, regardless of whether or not you use the other columns in the aggregate index.

(3) Other Notes: “Water can carry a boat, it can also overturn a boat.” The same goes for indexes. Indexes can help improve retrieval performance, but too many or inappropriate indexes can also lead to system inefficiencies. Because every time a user adds an index to a table, the database does more work. Too many indexes can even lead to index fragmentation. Therefore, we need to establish a “proper” index system, especially for the creation of aggregated indexes, should be more refined, so that your database can get high performance

Database optimization under large data volume and high concurrency

Related Posts

Do you believe that one document can give you a salary increase of 10K? I bought it. You got it. You can do it

For Elasticsearch!

Decision trees and random forests