Requirements

The project has an Excel import requirement: importing payment records.

Implementation staff or users fill data from other systems into the Excel template provided by our system; the application then reads, verifies, and converts the file content to generate arrears records, bills, and bill details, and stores them in the database.

Before I took over, the import volume was small, so the original code probably never aimed for efficiency. For version 4.0, however, I expect to import Excel files on the order of 10W+ (100,000+) rows, and each imported row generates more than three database rows; that is, 10W Excel rows mean at least 30W rows inserted into the database. Optimizing the original import code was therefore imperative. I analyzed and optimized the import code step by step until it completed within 100 seconds. (The final performance bottleneck was the database itself: the test server has only 4 GB of memory and hosts not only the database but also many microservice applications, so its processing capacity is limited.) The detailed process is as follows; at each step, the problems affecting performance are listed and solutions are provided.

Importing Excel is still a very common need in systems, and my optimization approach may not be optimal; readers are welcome to offer better ideas in the comments section.

Some of the details

  • Data import: the template is provided by the system, in XLSX format (which supports more than 65,535 rows). Users fill in the corresponding columns according to the table header

  • Data verification: There are two types of data verification:

    • Field length and regular-expression checks. These run purely in memory with no external data access, so their impact on performance is small (a minimal sketch follows this list)
    • Duplicate-data checks, for example whether a bill number already exists in the system (these require querying the database, which affects performance)
  • Data insertion: the test environment uses MySQL 5.7 (databases and tables are not sharded) and Druid as the connection pool
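
To make the first, in-memory type of check concrete, here is a minimal sketch; the field name and limits are hypothetical, not the project's actual template rules:

import java.util.regex.Pattern;

public class RowValidator {

    // Hypothetical rule: a bill number is 1-32 alphanumeric characters
    private static final Pattern BILL_NO_PATTERN = Pattern.compile("^[A-Za-z0-9]{1,32}$");

    /**
     * Returns an error message, or null if the value passes.
     * Pure in-memory check: no database or network access involved.
     */
    public static String checkBillNo(String billNo) {
        if (billNo == null || billNo.isEmpty()) {
            return "Bill number is empty";
        }
        if (billNo.length() > 32) {
            return "Bill number exceeds 32 characters";
        }
        if (!BILL_NO_PATTERN.matcher(billNo).matches()) {
            return "Bill number contains illegal characters";
        }
        return null;
    }
}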

Iteration record

First edition: POI + line-by-line query proofreading + line-by-line insertion

This version, the oldest, uses native POI to manually map Excel rows to objects and collect them into a List. The code performs the following steps (a rough sketch of this edition follows the list):

  1. Manually read the Excel file into a List
  2. Loop over the rows, performing the following steps for each row:
    1. Check field lengths
    2. Run database-backed checks, for example verifying that the house referenced by the current arrears row exists in the system, which requires querying the house table
    3. Insert the current row into the database
  3. Return the execution result; if an error occurs or verification fails, show a prompt message and roll back the data
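
A rough sketch of what that first edition looked like; the class and the commented-out mapper calls are illustrative, not the project's real code:

import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

public class FirstEditionImport {

    public void importExcel(InputStream in) throws Exception {
        // 1. Manually read the whole Excel file into a List, cell by cell
        List<List<String>> rows = new ArrayList<>();
        try (Workbook workbook = WorkbookFactory.create(in)) {
            Sheet sheet = workbook.getSheetAt(0);
            for (Row row : sheet) {
                List<String> cells = new ArrayList<>();
                for (int c = 0; c < row.getLastCellNum(); c++) {
                    cells.add(row.getCell(c) == null ? "" : row.getCell(c).toString());
                }
                rows.add(cells);
            }
        }
        // 2. One query and one insert per row; this is what kills performance
        for (List<String> cells : rows) {
            // checkFieldLength(cells);                       // in-memory length check
            // Long houseId = houseMapper.findHouseId(cells); // one SELECT per row
            // feeMapper.insertOne(cells, houseId);           // one INSERT per row
        }
    }
}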

Obviously, this implementation must have been rushed out, and subsequent usage was probably so light that no performance issues were ever noticed; it only works for single- or double-digit row counts. It has the following obvious problems:

  • The database is queried once for every row, so the number of network I/O round trips between the application and the database is multiplied by n, and so is the time spent
  • Data is also written row by row, with the same problem as above
  • Reading data with native POI makes the code very verbose and hard to maintain

Second edition: EasyPOI + Cache database query operations + bulk inserts

To address the three problems identified in the first edition, the following three approaches are used to optimize them respectively.

Cache data, trading space for time

The time cost of querying the database row by row lies mainly in the back-and-forth network I/O, and the optimization is very simple: cache all of the validation data in a HashMap and look it up there directly.

For example, to verify that the house referenced by a row exists, area + building + unit + room number is used to query the house table for the matching house ID; if an ID is found, verification succeeds and the ID is stored in the generated arrears bill. Housing information is not updated during the arrears import, and a community does not contain many houses (under 5,000), so I use a single SQL query to load all houses of the community into a HashMap, with area/building/unit/room number as the key and the house ID as the value. Subsequent verification then only needs a HashMap lookup.
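
A minimal sketch of that lookup, assuming the key is the plain concatenation described above (the project's actual key format may differ):

import java.util.Map;

public class HouseCheck {

    /**
     * Verify a row against the cached house map. The map is loaded once per
     * community with a single SQL query, keyed by area + building + unit + room number.
     * Returns the house ID, or null if the house does not exist (verification fails).
     */
    public static Long findHouseId(Map<String, Long> houseMap,
                                   String areaName, String buildName,
                                   String unitName, String houseNum) {
        String key = areaName + buildName + unitName + houseNum;
        return houseMap.get(key);   // pure in-memory hit, no network I/O
    }
}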

Custom SessionMapper

MyBatis does not natively support writing query results directly into a HashMap, so a custom SessionMapper is required.

The SessionMapper runs the SQL query and hands its result set to a MapResultHandler for processing.

@Repository
public class SessionMapper extends SqlSessionDaoSupport {

    @Resource
    public void setSqlSessionFactory(SqlSessionFactory sqlSessionFactory) {
        super.setSqlSessionFactory(sqlSessionFactory);
    }

    // Query all houses of an area; the result is a Map whose key is
    // area + building + unit + room number and whose value is the house ID
    public Map<String, Long> getHouseMapByAreaId(Long areaId) {
        MapResultHandler handler = new MapResultHandler();
        this.getSqlSession().select(BaseUnitMapper.class.getName() + ".getHouseMapByAreaId", areaId, handler);
        Map<String, Long> map = handler.getMappedResults();
        return map;
    }
}

MapResultHandler, the handler that puts the result set into a HashMap:

public class MapResultHandler implements ResultHandler {

    private final Map mappedResults = new HashMap();

    @Override
    public void handleResult(ResultContext context) {
        @SuppressWarnings("rawtypes")
        Map map = (Map) context.getResultObject();
        mappedResults.put(map.get("key"), map.get("value"));
    }

    public Map getMappedResults() {
        return mappedResults;
    }
}

Sample Mapper

@Mapper
@Repository
public interface BaseUnitMapper {
    // Key is area + building + unit + room number, value is the house ID
    Map<String, Long> getHouseMapByAreaId(@Param("areaId") Long areaId);
}

Sample Mapper XML

<select id="getHouseMapByAreaId" resultMap="mapResultLong">
    SELECT
        CONCAT( h.bulid_area_name, h.build_name, h.unit_name, h.house_num ) k,
        h.house_id v
    FROM
        base_house h
    WHERE
        h.area_id = #{areaId}
    GROUP BY
        h.house_id
</select>
            
<resultMap id="mapResultLong" type="java.util.HashMap">
    <result property="key" column="k" javaType="string" jdbcType="VARCHAR"/>
    <result property="value" column="v" javaType="long" jdbcType="INTEGER"/>
</resultMap>        

Then call the method corresponding to the SessionMapper class in your code.
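
For example, a minimal usage sketch (the service class and field names here are illustrative):

@Service
public class ArrearsImportService {

    @Resource
    private SessionMapper sessionMapper;

    public Map<String, Long> loadHouseMap(Long areaId) {
        // One query per community; every subsequent row check hits the map in memory
        return sessionMapper.getHouseMapByAreaId(areaId);
    }
}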

Batch insert using values

The MySQL INSERT statement supports the values (),(),() syntax for inserting multiple rows at once, and MyBatis's foreach can be used to generate such a statement:

<insert id="insertList">
    insert into table_name (column1, column2)
    values
    <foreach collection="list" item="item" index="index" separator=",">
        ( #{item.column1}, #{item.column2} )
    </foreach>
</insert>
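
The corresponding mapper method simply takes the list; a sketch (Fee stands for the arrears entity behind the feeList used later, and its real name may differ):

import java.util.List;
import org.apache.ibatis.annotations.Mapper;
import org.apache.ibatis.annotations.Param;

@Mapper
public interface ArrearageMapper {

    // Fee is an assumed entity name; MyBatis binds the list to the <foreach>
    // above and expands it into a single multi-row VALUES statement
    int insertList(@Param("list") List<Fee> list);
}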

Use EasyPOI to read and write Excel

EasyPOI supports annotation-based import and export, so changing the Excel layout only means changing annotations, which is very convenient and keeps the code easy to maintain.
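
A minimal sketch of an EasyPOI import model; the column headers and fields are hypothetical, not the project's real template:

import cn.afterturn.easypoi.excel.annotation.Excel;
import java.math.BigDecimal;

public class FeeImportRow {

    // The name must match the column header text in the template
    @Excel(name = "Room number")
    private String houseNum;

    @Excel(name = "Amount owed")
    private BigDecimal amount;

    public String getHouseNum() { return houseNum; }
    public void setHouseNum(String houseNum) { this.houseNum = houseNum; }
    public BigDecimal getAmount() { return amount; }
    public void setAmount(BigDecimal amount) { this.amount = amount; }
}

Reading is then typically a single call along the lines of ExcelImportUtil.importExcel(inputStream, FeeImportRow.class, new ImportParams()), and adjusting a column comes down to editing an annotation.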

Third edition: EasyExcel + cache database query operations + batch inserts

With EasyPOI adopted in the second edition, importing a few thousand or tens of thousands of Excel rows became easy, although it still took a while (about 10 minutes to write 5W rows to the database). Since imports were mostly run while watching the logs during development, no further optimization was done. But the good times did not last long: a new community moved in whose bill Excel had 41W rows. Reading it with EasyPOI in the development environment went straight to OOM; after increasing the JVM memory parameters there was no more OOM, but CPU usage stayed at 100% for 20 minutes and the data still had not been read completely. So reading large Excel files had to be sped up. Should I dive into POI optimization myself? No need to panic; first check GitHub for other open source projects. That is when Alibaba's EasyExcel came into view:

Emmm, I thought this was made for me! Give it a try. EasyExcel uses annotations similar to EasyPOI's to read and write Excel, so switching from EasyPOI is easy and can be done in minutes. It also lives up to Alibaba's description: reading the 41W-row, 25-column, 45.5 MB file takes about 50 seconds on average, so EasyExcel is recommended for reading large Excel files.
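
Switching is mostly a matter of swapping the annotations and the read call. A minimal sketch with made-up model and column names (EasyExcel streams rows to a listener instead of building the whole workbook in memory):

import com.alibaba.excel.EasyExcel;
import com.alibaba.excel.annotation.ExcelProperty;
import com.alibaba.excel.context.AnalysisContext;
import com.alibaba.excel.event.AnalysisEventListener;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public class EasyExcelReadDemo {

    public static class FeeRow {
        @ExcelProperty("Room number")
        private String houseNum;

        @ExcelProperty("Amount owed")
        private String amount;

        public String getHouseNum() { return houseNum; }
        public void setHouseNum(String houseNum) { this.houseNum = houseNum; }
        public String getAmount() { return amount; }
        public void setAmount(String amount) { this.amount = amount; }
    }

    public static List<FeeRow> readAll(InputStream in) {
        List<FeeRow> rows = new ArrayList<>();
        // Rows are pushed to the listener one by one, so a 41W-row file never has
        // to live in memory as a full POI object tree
        EasyExcel.read(in, FeeRow.class, new AnalysisEventListener<FeeRow>() {
            @Override
            public void invoke(FeeRow row, AnalysisContext context) {
                rows.add(row);
            }

            @Override
            public void doAfterAllAnalysed(AnalysisContext context) {
                // all rows have been read
            }
        }).sheet().doRead();
        return rows;
    }
}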

Fourth edition: Optimize data insertion speed

In the second edition I had already replaced row-by-row inserts with values batch inserts, concatenating one long SQL and inserting 30,000 rows at a time sequentially. This was still by far the most time-consuming part of the whole import. Later I reduced the batch size to 10,000, 5,000, 3,000, 1,000 and 500 rows at a time and found that 1,000 was the fastest. My guess is that each overly long SQL exceeded a memory threshold related to innodb_buffer_pool_size and caused disk swapping, which limited the speed; besides, the test server's database performance was not great and it could not handle overly large inserts. So we ended up with 1,000 rows per insert.

With 1,000 rows per insert, keeping the database CPU fully utilized means putting the time spent waiting on network I/O to work, and that calls for multithreading. The simplest form of multithreading is a parallel stream, so I changed the code to use a parallel stream and tested it:

10W rows of Excel, producing 42W arrears bills, 42W bill details and 2W records, inserted into the database by 16 threads in parallel, 1,000 rows at a time: the insert time is 72 s and the total import time is 95 s.

Parallel insert utility classes

I have wrapped the parallel-insert code in a functional-style utility class, shared here as well:

import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/**
 * Utility class for inserting data in parallel batches.
 *
 * @author Keats
 * @date 2020/7/1 9:25
 */
public class InsertConsumer {

    /**
     * Number of rows per batch
     */
    private final static int SIZE = 1000;

    /*
     * To adjust the concurrency, modify the second argument of the method below
     */
    static {
        System.setProperty("java.util.concurrent.ForkJoinPool.common.parallelism", "4");
    }

    /**
     * Insert method
     *
     * @param list     the data to insert
     * @param consumer the insert operation, e.g. a Mapper's batch-insert method
     * @param <T>      element type
     */
    public static <T> void insertData(List<T> list, Consumer<List<T>> consumer) {
        if (list == null || list.size() < 1) {
            return;
        }
        List<List<T>> streamList = new ArrayList<>();
        for (int i = 0; i < list.size(); i += SIZE) {
            int j = Math.min((i + SIZE), list.size());
            List<T> subList = list.subList(i, j);
            streamList.add(subList);
        }
        // A parallel stream uses as many threads as there are CPU cores by default;
        // this cannot be changed per stream, only globally (see the static block above),
        // so keep that in mind
        streamList.parallelStream().forEach(consumer);
    }
}

It uses quite a few Java 8 APIs; if you are not familiar with them, check out my earlier Java blog posts. The method is simple to use:

InsertConsumer.insertData(feeList, arrearageMapper::insertList);

Other items that affect performance

Logging

Avoid printing too many INFO logs in the for loop

In the process of tuning, I also found one thing that hurts performance in particular: info logs. The test still used the 41W-row, 25-column, 45.5 MB file. An info log was printed every 1,000 rows between the start and the end of reading, and 3+ info logs were printed for every row while verifying against the cached data; the logs are printed and persisted to disk. Below is the difference in efficiency between printing and not printing these logs.

Printing logs vs. not printing logs (the two timing screenshots are not reproduced here)

I thought I had chosen the wrong Excel file, so I chose it again, and the result was the same

Verifying against the cached data with logging off takes only about 1/10 of the time it does with the logs being printed.
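
A simple way to keep some progress visibility without paying that cost is to log per-row details at debug level and only emit an info line every N rows; a sketch:

import java.util.List;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ImportLogging {

    private static final Logger log = LoggerFactory.getLogger(ImportLogging.class);

    public static void process(List<String> rows) {
        for (int i = 0; i < rows.size(); i++) {
            // Per-row detail only at debug level, normally disabled during imports
            if (log.isDebugEnabled()) {
                log.debug("row {} -> {}", i, rows.get(i));
            }
            // Progress at info level, but only once every 1000 rows
            if (i % 1000 == 0) {
                log.info("processed {} / {} rows", i, rows.size());
            }
        }
    }
}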

Conclusion

Methods to improve Excel import speed:

  • Use a faster Excel reading framework (Alibaba's EasyExcel is recommended)
  • For checks that require database interaction, cache data appropriately based on the business logic, trading space for time
  • Concatenate long SQL with values(),(),() to insert multiple rows at a time
  • Use multithreading to insert data and take advantage of network I/O wait time (parallel streams are recommended; they are easy to use)
  • Avoid printing useless logs in loops