Good design is the cornerstone of high performance. It should be designed, optimized and re-adjusted according to the actual business requirements and usage scenarios of the system. In this process, various factors are often weighed, such as how to divide database tables, how to choose the appropriate data types of fields and so on.

MySQL supports a wide variety of data types, which can be a pain for choke-phobic people. When creating a database table, most people basically use the two types of INT and VARCHAR. As for the length, they will choose enough size to avoid insufficient in the future. Only at that time, after the pit who is uncomfortable.

If you’re an extreme, productive developer, you don’t want this to happen. In the face of many data types, how to choose the right data type is critical to high performance. This article will introduce how to select the optimized data types to improve the performance of MySQL. The most commonly used data types will be selected to explain, which is helpful for creating tables and optimizing table field types in actual development.

First, the principle of choice

No matter what type of data you store, a few simple rules will help you make a better choice.

Smaller is usually better

In general, you should choose the smallest data type to store the data correctly as much as possible. Smaller data types are generally faster because they take up less disk space, memory, and require fewer CPU cycles to process.

However, when choosing smaller data types, it is important not to underestimate the range of stored values, because changing the data type and length later can be a very painful and time-consuming operation. If you are not sure which data type is best, choose the smallest type that you think will not exceed the range.

2. Keep it simple

Simple data-type operations typically require fewer CPU cycles. For example, integers are less costly than character operations because character sets and varsity rules (e.g., collation rules) make character comparisons more complicated than integer comparisons.

3. Avoid NULL

NULL is a very common value, and it is often a bad habit to default to NULL for certain fields. If the field values in the query happen to be NULL values set, optimization is more difficult for MySQl because nullable fields complicate indexing and value comparison.

NULL values cannot be indexed, affecting index statistics and optimizer judgment. Any column in a composite index that contains a NULL value is invalid for the composite index.

2. String type

The string type is the most frequently used data type in the database. VARCHAR and CHAR are the two most important string types that can be used to store strings, but they are saved and retrieved in different ways. VARCHAR is a variable length character type, while CHAR is a fixed length character type. The following is an explanation and comparison of the two types.

1.VARCHAR

The VARCHAR type is used to store variable-length strings, which is more space-efficient than the fixed-length type because it uses only the necessary space (for example, the shorter the string uses the least space).

VARCHAR requires 1 or 2 extra bytes to record the length of the string (if the maximum length of the field is 255 bytes or less, only 1 byte is used for length, otherwise 2 bytes are used for length). For example, a VARCHAR(10) field requires 11 bytes of storage space, and VARCHAR(1000) requires 1002 bytes of storage space, of which 2 bytes are required to store length.

2.CHAR

The CHAR type is fixed length. When the data type is CHAR, MySQL removes all trailing whitespace.

The CHAR type is suitable for very short strings, or if all values are close to the same length. For example, the CHAR type is ideal for storing the MD5 value of a password because it is a fixed-length value. The CHAR type is also better than the VARCHAR type for frequently changing data because fixed-length CHAR types are less prone to fragmentation. For very short columns, the CHAR type is more space-efficient than VARCHAR. For example, using CHAR(1) to store only Y and N values would require 2 bytes if using VARCHAR(1), because there would be an extra byte of record length.

To help you understand, CHAR is illustrated by the following concrete example. Create a CHAR(10) CHAR(10) CHAR(10) CHAR(10) CHAR(10) CHAR(10)

mysql> create table char_test(ch char(10));
Query OK, 0 rows affected

mysql> insert into char_test(ch) values('xcbeyond'), (' xcbeyond'), ('xcbeyond ');
Query OK, 3 rows affected
Records: 3  Duplicates: 0  Warnings: 0
Copy the code

The odd thing is that when we query, we find that the space at the end of the third string is automatically truncated. To better show whether there are Spaces, concatenate the ‘characters before and after the CH field for easy comparison.

mysql> select concat("'",ch,"'") from char_test;
+--------------------+
| concat("'",ch,"'") |
+--------------------+
| 'xcbeyond'         |
| '  xcbeyond'       |
| 'xcbeyond'         |
+--------------------+
3 rows in set
Copy the code

If you store the same value with the VARCHAR(10) field, Spaces at the end of the string are not truncated.

Date type

MySQL provides two similar date types: DATETIME and TIMESTAMP.

They all represent dates well for applications, but in some cases, they differ. Let’s take a look.

1.DATETIME

The DATETIME type can hold a wide range of dates, from 1001 to 9999, with an accuracy of seconds. It encapsulates the date and time into an integer format YYYYMMDDHHMMSS, irrespective of time zone, using 8 bytes of storage space.

By default, MySQL displays DATETIME values in a sortable, unambiguous format, such as 2020-03-05 22:38:40.

2.TIMESTAMP

TIMESTAMP, as the name suggests, is the same as a UNIX TIMESTAMP, and holds the number of seconds since 00:00.0 on January 1, 1970. TIMESTAMP uses only four bytes of storage space, so it has a much smaller range than DATETIME, representing only from 1970 to 2038.

The value displayed in TIMESTAMP depends on the time zone and is set by the MySQL server, operating system, and client connection. Thus, TIMESTAMP with a stored value of 0 is displayed as 1969-12-31 19:00:00 in the EASTERN TIME zone of the United States, 5 hours from Greenwich.

You should generally try to use TIMESTAMP because it is more storage efficient than DATETIME and is more appropriate for business across time zones.

What if you need to store date and time values that are less granular than seconds? MySQL does not currently provide a suitable data type, but other workarounds can be used. For example, you can use your own storage format: you can use BIGINT to store subtler timestamp, or use DOUBLE to store decimal parts after seconds. Alternatively, you can use the MariaDB database instead of MySQL.

TEXT and BLOB types

We typically choose the CHAR or VARCHAR type for small strings, and TEXT and BLOB for data such as large TEXT.

The TEXT and BLOB types are string data types designed to store very large amounts of data, stored as strings and binary, respectively. For example, TEXT is usually used to hold string content such as article content and logs, while BLOB is usually used to hold binary data content such as images and videos. It has the following characteristics:

  • The TEXT type has a character set and collation.

  • BLOB types store binary data and have no collation or character set.

  • Columns of type TEXT and BLOB cannot be indexed in MySQL, nor can these indexes be used to eliminate sorting.

Unlike other data types, MySQL treats each TEXT and BLOB type value as a separate object. When values become too large, InnoDB uses a special “external” storage area. Each value needs 1 to 4 bytes in a row to store a pointer, and then stores the actual value in the external storage area.

When choosing between TEXT and BLOB, you should select the minimum storage type that meets your requirements. The following describes some common problems of the TEXT and BLOB types.

1. When a large number of delete operations are performed,TEXTandBLOBThis can cause some performance problems

Delete operations leave large “holes” in database tables, and records that fill these “holes” later can have an impact on insert performance. To improve performance, you are advised to periodically use the OPTIMZE TABLE function to defragment such tables to avoid performance problems caused by empty tables.

The actual combat demonstration verification instructions are as follows:

Create test table text_test where id and context are type int(11) and text:

mysql> create table text_test(id int(11),context text);
Query OK, 0 rows affected
Copy the code

Insert a large number of data into the text_test table.

The repeat function is used to copy strings

mysql> insert into text_test(id.context) values(1.repeat('xcbeyond'.1000));
Query OK, 1 row affected

mysql> insert into text_test(id.context) values(2.repeat('xcbeyond'.1000));
Query OK, 1 row affected

mysql> insert into text_test(id.context) values(3.repeat('xcbeyond'.1000));
Query OK, 1 row affected

mysql> insert into text_test(id.context) values(4.repeat('xcbeyond'.1000));
Query OK, 1 row affected

mysql> insert into text_test(id.context) values(5.repeat('xcbeyond'.1000));
Query OK, 1 row affected

mysql> insert into text_test(id.context) values(6.repeat('xcbeyond'.1000)); Query OK, 1 row affected......Copy the code

Select * from text_test; select * from text_test;

2020/03/07 Sat 15:58 540,672 text_test.ibdCopy the code

The data file size shown here is 540Kb.

Alter table text_test alter table text_test alter table text_test alter table text_test

mysql> delete from text_test where id < 10;
Query OK, 9 rows affected
Copy the code

5) Check the physical file size of text_test here:

2020/03/07 Sat 16:05 573,440 text_test.ibdCopy the code

Strangely, the data file size did not decrease as a result of deleting data, but actually increased a bit.

Select table TEXt_test with OPTIMIZE:

mysql> optimize table text_test;
+----------------+----------+----------+-------------------------------------------------------------------+
| Table          | Op       | Msg_type | Msg_text                                                          |
+----------------+----------+----------+-------------------------------------------------------------------+
| test.text_test | optimize | note     | Table does not support optimize, doing recreate + analyze instead |
| test.text_test | optimize | status   | OK                                                                |
+----------------+----------+----------+-------------------------------------------------------------------+
2 rows in set
Copy the code

Select * from text_test; select * from text_test;

2020/03/07 Sat 16:08 458,752 text_test.ibdCopy the code

You can see that the data file size of the table is reduced, indicating that the “empty” space has been reclaimed.

2. Use synthetic indexes to improve large text fields (TEXT,BLOBType) query performance

A composite index is a hash value created from the contents of a large text field and stored in a separate column of data, which can then be retrieved to find the row.

Note, however, that this technique can only be used for queries that match exactly (hash values are not useful for range search operators such as < or >=). You can use the MD5() function to generate the hash value, use SHA1() or CRC32(), or use your own application logic to calculate the hash value. Remember that numeric hash values can be stored efficiently. Similarly, if the hash algorithm generates strings with trailing whitespace, do not store them in CHAR or VARCHAR columns, which are affected by trailing whitespace removal. Synthesized hash indexes are especially useful for BLOB or TEXT data columns. Lookup with hash identifier values is much faster than searching the BLOB column itself.

The actual combat demonstration verification instructions are as follows:

Create test table text_test2 (context, text, varchar, 40); create test table text_test2 (context, text, varchar, 40);

mysql> create table text_test2(id int(11),context text,hashValue varchar(40));
Query OK, 0 rows affected
Copy the code

Select * from text_test2 where hashValue = MD5;

mysql> insert into text_test2 values(1.repeat('xcbeyond'.10),md5(context));
Query OK, 1 row affected

mysql> insert into text_test2 values(2.repeat('xcbeyond'.10),md5(context));
Query OK, 1 row affected

mysql> select * from text_test2;
+----+----------------------------------------------------------------------------------+-------------------------------- --+
| id | context                                                                          | hashValue                        |
+----+----------------------------------------------------------------------------------+-------------------------------- --+
|  1 | xcbeyondxcbeyondxcbeyondxcbeyondxcbeyondxcbeyondxcbeyondxcbeyondxcbeyondxcbeyond | 537f6020f5b2b59456a61271a2b3f285 |
|  2 | xcbeyondxcbeyondxcbeyondxcbeyondxcbeyondxcbeyondxcbeyondxcbeyondxcbeyondxcbeyond | 537f6020f5b2b59456a61271a2b3f285 |
+----+----------------------------------------------------------------------------------+-------------------------------- --+
2 rows in set
Copy the code

3) If you need to query the value of context column, use hashValue to query the value of context column:

mysql> select * from text_test2 where hashValue = md5(repeat('xcbeyond'.10));
+----+----------------------------------------------------------------------------------+-------------------------------- --+
| id | context                                                                          | hashValue                        |
+----+----------------------------------------------------------------------------------+-------------------------------- --+
|  1 | xcbeyondxcbeyondxcbeyondxcbeyondxcbeyondxcbeyondxcbeyondxcbeyondxcbeyondxcbeyond | 537f6020f5b2b59456a61271a2b3f285 |
|  2 | xcbeyondxcbeyondxcbeyondxcbeyondxcbeyondxcbeyondxcbeyondxcbeyondxcbeyondxcbeyond | 537f6020f5b2b59456a61271a2b3f285 |
+----+----------------------------------------------------------------------------------+-------------------------------- --+
2 rows in set
Copy the code

The above example demonstrates the use of synthetic indexes, which reduce I/O to some extent and improve query efficiency because they can only be used for exact matching.

3. Avoid retrieval when it’s unnecessaryTEXT,BLOBThe value of the type

For example, a SELECT * query is not a good operation unless you can be sure that the WHERE clause as a constraint will only find the rows you need. Otherwise, a large number of values are likely to be transferred aimlessly across the network. This is another example where BLOB or TEXT identifier information stored in a composite index column can be helpful to the user. The user can search index columns, determine which data rows are required, and then retrieve BLOB or TEXT values from the eligible data rows.

4. Separate BLOB or TEXT columns into separate tables

In some circumstances, it may make sense to move these columns to a second table to convert the columns in the original table to a fixed-length row format. This reduces fragmentation in the main table and gives you the performance advantage of fixed-length rows. It also enables the master table to run SELECT * queries without transferring a large number of BLOB or TEXT values over the network.

Select a unique identifier

Unique identifiers, often referred to as primary keys, are used to act as the sole judge of table records. Unique identifiers. Choosing the right data type is very important.

Unique identifiers are often used more for comparison with other values or values of other tables (for example, in associative queries). Identification columns may also be used as foreign keys in other tables, so when selecting data types for identification columns, you should select the same type as the corresponding column in the root associated table.

When selecting a unique identifier type, you need to consider not only the storage type, but also how MySQL performs calculations and comparisons on this type, since comparisons are used most often in SQL queries and are the biggest performance constraints.

Once you have selected a type, it is important to ensure that the same type is used in all associated tables. Because type directness is often an exact match, mixing different data types can cause performance problems, and even if there is no performance impact, implicit type conversions can cause errors that are hard to spot when comparing operations.

The smallest data type should be selected as long as the range of values can be satisfied and there is room for future growth.

Here are some tips:

1. Integer type

Integers are usually the best choice for identifying columns because they are fast and can be used with AUTO_INCREMENT.

2. The value is a string

If you can avoid it, avoid using string types as identification column types whenever possible, as they are space-consuming and generally slower than numeric types. Be especially careful when using strings as identity columns in MyISAM storage engine tables. MyISAM uses compressed indexes for strings by default, which can make queries much slower.

Be careful with completely “random” strings, too, such as those generated by MD5(), SHA1(), or UUID(). The new values generated by these functions can be arbitrarily distributed over a large space, causing insert and some select operations to be slow:

  • Because insert values are written randomly to different positions in the index, it makesinsertStatements are slower. This results in page splitting and random disk access.
  • selectStatements become slower because logically non-contiguous data is distributed in different places on disk and in memory.
  • Random values make caching poor for all types of query statements because it invalidates the principle of access locality on which the cache works.

Six, summarized

In practical development, there are many tools that automatically generate table building scripts and so on. Automatic generation brings great convenience to development in the early stage, but at the same time, it leads to serious performance problems. Some tools generate things that use large VARCHAR types when storing any data, which is often incorrect. If it is automatically generated, be sure to double-check to make sure it is reasonable.

For example, some ORM frameworks (such as MyBatis and Hibernate) will store any type of data column to any type of back-end data, which usually means that there is no design to use a better data type for storage, and later security risks are great, and problems are difficult to troubleshoot. In general, be sure to double check to make sure that it makes sense. That’s why I personally don’t like to use tools like this to generate code, because checking is a waste of my time.

Most of the commonly used data types have been covered here, what their characteristics are, where they can seriously affect performance, and so on. When choosing a data type, you are half way there. The rest of the details should be taken into account in your daily development contacts.

The brief summary is as follows:

  • For string types, the best strategy is to allocate only as much space as you really need.

  • Date type: Select a date type that meets the minimum storage requirements.

  • OPTIMIZE TABLE for tables with TEXT and BLOB fields whose operations are frequently performed