Background

In day-to-day work, most of us run into scenarios that require producing data in bulk: generating test data in batches, simple data migrations, statistical reports, and so on. When a single run inserts or updates a particularly large amount of data, the execution time grows far beyond expectations, which is a real headache.

So how do you generate millions of rows quickly? We usually reach for multithreading to speed up code, but let's look at it the other way around first: how do you squeeze the most performance out of a single thread? That way, when threads are added later, each one gets the biggest possible improvement. Below I compare several approaches to see which is fastest.

Technology used

  • Java 1.8
  • MyBatis
  • MySQL 5.7

Preparation

First, we prepare a simple user information table for the test; its structure is as follows.

create table zcy_user
(
    id          int auto_increment primary key,
    user_name   varchar(32)  null,
    sex         int          null,
    email       varchar(50)  null,
    address     varchar(100) null,
    create_time timestamp    null
);
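For reference, here is a minimal entity class that the batch mapping below could bind to. The class and field names (ZcyUser, userName, and so on) are assumptions for illustration, not taken from the original project.

import java.sql.Timestamp;

// Hypothetical entity matching the zcy_user table; names are assumed.
public class ZcyUser {
    private Integer id;           // auto_increment, filled in by the database
    private String userName;
    private Integer sex;
    private String email;
    private String address;
    private Timestamp createTime; // left null here, as in the table definition

    public ZcyUser(String userName, Integer sex, String email, String address) {
        this.userName = userName;
        this.sex = sex;
        this.email = email;
        this.address = address;
    }

    // Getters used by the MyBatis #{emp.xxx} bindings below.
    public String getUserName() { return userName; }
    public Integer getSex() { return sex; }
    public String getEmail() { return email; }
    public String getAddress() { return address; }
}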

Add a batch insert statement to the MyBatis XML mapper file.

insert into zcy_user(`user_name`, `sex`, `email`, `address`) VALUES
<foreach collection="list" item="emp" separator=",">
  (#{emp.userName}, #{emp.sex}, #{emp.email}, #{emp.address})
</foreach>
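For completeness, here is a sketch of how this statement might be driven from Java. The mapper interface, the statement id it binds to, and the chunking helper are assumptions for illustration; in a real project the mapper instance would come from your MyBatis/Spring configuration.

import java.util.List;
import org.apache.ibatis.annotations.Param;

// Hypothetical mapper bound to an XML <insert> wrapping the snippet above.
interface ZcyUserMapper {
    int batchInsert(@Param("list") List<ZcyUser> list);
}

class BatchInsertRunner {
    // Insert the full data set in fixed-size chunks; chunkSize is the
    // "rows per insert" variable being measured in the tests below.
    static void insertInChunks(ZcyUserMapper mapper, List<ZcyUser> users, int chunkSize) {
        for (int from = 0; from < users.size(); from += chunkSize) {
            int to = Math.min(from + chunkSize, users.size());
            mapper.batchInsert(users.subList(from, to));
        }
    }
}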

Timing the tests

Let's test with controlled variables: a single thread inserts one million rows in total, we vary how many rows go into each insert statement, and we measure how long each run takes.

Insert one at a time

Inserting one row at a time takes too long to run to completion, so we extrapolate: 10,000 rows took 311.65 seconds, so 1 million rows would take about 31,165.70 seconds (311.65 s × 100, roughly 8.7 hours).

Insert 100 at a time

Inserting 100 rows per statement, the million rows take 373.58 seconds.

Insert 1,000 at a time

Inserting 1,000 rows per statement takes 62.08 seconds.

Insert 10,000 at a time

Inserting 10,000 rows per statement takes 31.83 seconds.

Insert 50,000 at a time

Inserting 50,000 rows per statement takes 29.30 seconds.

Insert 100,000 at a time

Inserting 100,000 rows per statement fails with an error: the packet to be sent, 5,900,064 bytes, exceeds the configured max_allowed_packet limit of 4,194,304 bytes (4 MB).
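The limit comes from MySQL's max_allowed_packet setting. Below is a minimal sketch of checking it, and raising it for new connections, over JDBC; the connection URL and credentials are placeholders, and SET GLOBAL requires the appropriate privilege (the change can also be made persistent in my.cnf).

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PacketLimitCheck {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/test", "user", "password");
             Statement st = conn.createStatement()) {
            // Read the current packet limit in bytes (4194304 = 4 MB here).
            try (ResultSet rs = st.executeQuery(
                    "show variables like 'max_allowed_packet'")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " = " + rs.getString(2));
                }
            }
            // With sufficient privileges, raise it for new connections,
            // e.g. to 64 MB; otherwise set it in my.cnf and restart.
            st.execute("set global max_allowed_packet = 67108864");
        }
    }
}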

Comparison

Rows per insert    Time (s)    Memory use    Speed     Speedup
1                  31,165      low           slow      baseline
100                373         low           slow      83×
1,000              62          low           medium    502×
10,000             31          medium        fast      1,005×
50,000             29          high          fast      1,074×
100,000            /           very high     /         / (packet limit exceeded)

Related knowledge

MySQL memory structure

The MySQL documentation describes the InnoDB storage engine architecture, which consists of in-memory structures and on-disk structures (see the architecture diagram in the official docs). The in-memory structures include the Buffer Pool and the Log Buffer. Since random disk reads and writes are inefficient, MySQL is designed to read and write data through the Buffer Pool first and let background I/O threads flush it to disk.

Buffer Pool

To avoid disk I/O on every access to table and index data, MySQL loads frequently accessed pages from disk into the Buffer Pool to speed up access. The Buffer Pool uses an LRU (Least Recently Used) algorithm to decide which pages stay cached. With a plain LRU, however, a single SQL statement that scans a whole table (deep pagination, for instance, touches page after page exactly once) would fill the cache with pages that are never requested again, while the genuinely most frequently accessed data gets evicted, so the cache space is not really being used effectively.

Because of this, MySQL optimizes the LRU design: on top of the basic LRU list it adds a split between cold and hot data, with 3/8 of the space holding cold (old) pages and the other 5/8 holding hot (young) pages. As the MySQL configuration below shows, the dividing line is an access interval of 1 second: a page whose repeat access comes less than 1 second after it entered the cold region just moves to the front of the cold region, while a page accessed again after more than 1 second is promoted to the hot region. This change keeps genuinely hot data in the cache.

# Check the cold/hot data split threshold (1000 ms)
mysql> show variables like 'innodb_old_blocks_time';
+------------------------+-------+
| Variable_name          | Value |
+------------------------+-------+
| innodb_old_blocks_time | 1000  |
+------------------------+-------+
1 row in set (0.00 sec)
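To make the idea concrete, here is a toy sketch of midpoint insertion with an innodb_old_blocks_time-style promotion window. It only illustrates the algorithm described above; it is not InnoDB's actual implementation.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Toy midpoint LRU: new pages enter the cold ("old") sublist; a page is
// promoted to the hot ("young") sublist only if it is touched again more
// than oldBlocksTimeMs after its first access.
class MidpointLru<K> {
    private final Deque<K> young = new ArrayDeque<>();   // hot region, ~5/8
    private final Deque<K> old = new ArrayDeque<>();     // cold region, ~3/8
    private final Map<K, Long> firstTouch = new HashMap<>();
    private final int capacity;
    private final long oldBlocksTimeMs;

    MidpointLru(int capacity, long oldBlocksTimeMs) {
        this.capacity = capacity;
        this.oldBlocksTimeMs = oldBlocksTimeMs;
    }

    void access(K page, long nowMs) {
        if (young.remove(page)) {
            young.addFirst(page);              // already hot: move to head
        } else if (old.remove(page)) {
            if (nowMs - firstTouch.get(page) > oldBlocksTimeMs) {
                young.addFirst(page);          // survived the window: promote
            } else {
                old.addFirst(page);            // scan-like re-access: stays cold
            }
        } else {
            firstTouch.put(page, nowMs);       // miss: insert at cold head
            old.addFirst(page);
            if (old.size() + young.size() > capacity) {
                firstTouch.remove(old.pollLast());  // evict the coldest page
            }
        }
    }
}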

Change Buffer

Take an update as an example. When the data page to be updated is already in the Buffer Pool, it is updated directly in memory. If the page is not in memory, InnoDB caches the update in the Change Buffer without compromising data consistency, so the page does not have to be read from disk first. The next time a query needs to access that page, the page is read into memory and the Change Buffer operations for it are applied. Merging multiple operations and committing them together in this way improves execution performance. If execution fails or the system is interrupted unexpectedly, the Redo Log can be replayed to ensure the data is correct.
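As a rough mental model of the behavior just described (a simplification, not InnoDB's code): updates to pages that are not in memory are queued, and the queued changes are merged the next time the page is read.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy change buffering: pages in the pool are updated in place; changes to
// uncached pages are queued and merged on the next read of that page.
class ChangeBufferModel {
    private final Map<Integer, String> bufferPool = new HashMap<>();
    private final Map<Integer, List<String>> changeBuffer = new HashMap<>();

    void update(int pageId, String change) {
        if (bufferPool.containsKey(pageId)) {
            applyChange(pageId, change);        // page cached: update in memory
        } else {
            changeBuffer.computeIfAbsent(pageId, k -> new ArrayList<>())
                        .add(change);           // page not cached: buffer the change
        }
    }

    String read(int pageId) {
        if (!bufferPool.containsKey(pageId)) {
            bufferPool.put(pageId, loadFromDisk(pageId));   // read page into memory
            List<String> pending = changeBuffer.remove(pageId);
            if (pending != null) {
                pending.forEach(c -> applyChange(pageId, c)); // merge buffered ops
            }
        }
        return bufferPool.get(pageId);
    }

    private void applyChange(int pageId, String change) {
        bufferPool.merge(pageId, change, (page, c) -> page + "+" + c);
    }

    private String loadFromDisk(int pageId) { return "page-" + pageId; }
}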

Why the performance improves

Merging transactions

Looking at the test data above, a multi-row INSERT ... VALUES statement effectively merges many single-row insert transactions into one commit. In MySQL, each transaction generates redo log writes, so combining multiple commands into one transaction reduces disk I/O and improves performance. However, packet and log sizes have configured upper limits, and beyond the maximum configured size performance no longer improves significantly.

-- One multi-row statement is one transaction
insert into zcy_user(`user_name`, `sex`, `email`, `address`) values
(1, 1, 1, 1), (2, 2, 2, 2), ... , (n, n, n, n);

-- Multiple statements wrapped in one explicit transaction
begin;
insert into zcy_user(`user_name`, `sex`, `email`, `address`) values (1, 1, 1, 1);
insert into zcy_user(`user_name`, `sex`, `email`, `address`) values (2, 2, 2, 2);
...
insert into zcy_user(`user_name`, `sex`, `email`, `address`) values (n, n, n, n);
commit;
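The same merging can be done from plain JDBC by turning off auto-commit and committing once per batch. A sketch under assumed connection details; rewriteBatchedStatements=true additionally asks MySQL Connector/J to rewrite the batch into multi-row inserts.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

public class TxBatchInsert {
    // Many inserts share a single transaction: one commit, one redo-log flush.
    static void insert(List<ZcyUser> users) throws Exception {
        String url = "jdbc:mysql://localhost:3306/test?rewriteBatchedStatements=true";
        String sql = "insert into zcy_user (user_name, sex, email, address) values (?, ?, ?, ?)";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = conn.prepareStatement(sql)) {
            conn.setAutoCommit(false);          // begin one big transaction
            for (ZcyUser u : users) {
                ps.setString(1, u.getUserName());
                ps.setInt(2, u.getSex());
                ps.setString(3, u.getEmail());
                ps.setString(4, u.getAddress());
                ps.addBatch();
            }
            ps.executeBatch();
            conn.commit();                      // commit everything at once
        }
    }
}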

Ensure data order

The InnoDB engine stores data in order in a B+ tree index. If keys are inserted sequentially, the rows land in contiguous space and the disk work during writes is sequential I/O; if keys are inserted out of order, the engine has to seek to scattered positions in the tree, which means random I/O at a much higher cost. Therefore, when generating data, make the index value an auto-increment ID or otherwise regular, monotonically increasing data, so that new rows can be stored in contiguous space.
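A small sketch of the contrast, assuming we control the key ourselves: a monotonically increasing key appends next to the previous row, while a random key such as a UUID scatters inserts across the tree.

import java.util.UUID;
import java.util.concurrent.atomic.AtomicLong;

public class KeyOrderDemo {
    public static void main(String[] args) {
        AtomicLong seq = new AtomicLong(1);
        // Sequential key: each insert lands next to the previous one
        // in the B+ tree, so page writes stay sequential.
        long orderedKey = seq.getAndIncrement();
        // Random key: inserts hit arbitrary leaf pages, causing random
        // I/O and page splits all over the index.
        String randomKey = UUID.randomUUID().toString();
        System.out.println(orderedKey + " vs " + randomKey);
    }
}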

Usage scenarios

  • Importing report data
  • Initializing data in batches
  • Migrating millions of rows of data
  • Writing log-table records

Conclusion

In many engineering projects today, data volumes in the millions are already common, but performance still has to be guaranteed. Based on the analysis of the test results above, when producing data in bulk you should, as far as the business scenario allows, merge requests into the same transaction. Doing so improves performance considerably, reduces the pressure on the database and the service, and makes the program more stable.
