What problems do you encounter with Java concatenation of "100 million line strings"?

This article covers the following topics:

1. Performance comparison of four implementations in a 100,000-iteration for loop

2. Scaling the for loop up to 100 million iterations

00. Requirements

The original plan was to generate 100 million rows of simulated data. The detailed requirements are as follows:

Create 100 million INSERT SQL statements, such as:

INSERT INTO products (`id`, `code`) VALUES (1, '000000000');

Here the id type is INT(11) and the code type is VARCHAR(9). The code value ranges from 00000000 to 99999999. If it has fewer than 9 digits, it is left-padded with zeros to a length of 9.
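The padding rule can be sketched with String.format(), where %09d left-pads a number with zeros to 9 digits. This is only an illustration: the class and method names are made up here, and the mapping of id 1 to code 000000000 is taken from the example statement above.

```java
import java.util.Locale;

// Hypothetical helper, not the article's original code.
final class InsertSqlLine {

    // Builds one statement; %09d left-pads the code with zeros to 9 digits.
    static String insertFor(int id, int code) {
        return String.format(Locale.ROOT,
                "INSERT INTO products (`id`, `code`) VALUES (%d, '%09d');", id, code);
    }

    public static void main(String[] args) {
        // prints: INSERT INTO products (`id`, `code`) VALUES (1, '000000000');
        System.out.println(insertFor(1, 0));
        System.out.println(insertFor(100_000_000, 99_999_999));
    }
}
```

Locale.ROOT is passed so that the digits do not depend on the machine's default locale.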

01. Performance comparison of four implementations for a 100,000-time for loop

The ultimate goal is 100 million rows, but generating that many takes time, so the plan is to start with 100,000 rows and compare how efficient each implementation is. Once an approach has been chosen, the data volume is scaled up to 100 million rows to meet the requirement.

We chose four implementations that are common in everyday code:

1. Use "+" to concatenate strings;

2. Use StringBuffer | StringBuilder to concatenate strings;

3. Use String.format() to format strings, and concatenate them with "+";

4. Use String.format() to format strings, and concatenate them with StringBuffer | StringBuilder.
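The four approaches might look roughly like the sketch below. It is a reconstruction for illustration, not the author's benchmark code: the class name, the method names, and the pad9 helper are all assumptions. Each method produces the same text, so only the concatenation strategy differs.

```java
import java.util.Locale;

// Hypothetical reconstruction of the four approaches.
final class FourWays {
    // Shared template for the two String.format() variants.
    static final String TEMPLATE =
            "INSERT INTO products (`id`, `code`) VALUES (%d, '%09d');\n";

    // Left-pads a non-negative number with zeros to 9 digits without String.format().
    static String pad9(int value) {
        String digits = Integer.toString(value);
        return "000000000".substring(Math.min(digits.length(), 9)) + digits;
    }

    // 1. "+" only: every += builds a brand-new String.
    static String plusOnly(int rows) {
        String sql = "";
        for (int i = 0; i < rows; i++) {
            sql += "INSERT INTO products (`id`, `code`) VALUES ("
                    + (i + 1) + ", '" + pad9(i) + "');\n";
        }
        return sql;
    }

    // 2. StringBuilder only: one growing buffer for the whole loop.
    static String builderOnly(int rows) {
        StringBuilder sql = new StringBuilder();
        for (int i = 0; i < rows; i++) {
            sql.append("INSERT INTO products (`id`, `code`) VALUES (").append(i + 1)
               .append(", '").append(pad9(i)).append("');\n");
        }
        return sql.toString();
    }

    // 3. String.format() for each row, "+" to join the rows.
    static String formatPlus(int rows) {
        String sql = "";
        for (int i = 0; i < rows; i++) {
            sql += String.format(Locale.ROOT, TEMPLATE, i + 1, i);
        }
        return sql;
    }

    // 4. String.format() for each row, StringBuilder to join the rows.
    static String formatBuilder(int rows) {
        StringBuilder sql = new StringBuilder();
        for (int i = 0; i < rows; i++) {
            sql.append(String.format(Locale.ROOT, TEMPLATE, i + 1, i));
        }
        return sql.toString();
    }

    public static void main(String[] args) {
        int rows = 1_000; // kept small on purpose; the article used 100,000
        String expected = builderOnly(rows);
        boolean same = expected.equals(plusOnly(rows))
                && expected.equals(formatPlus(rows))
                && expected.equals(formatBuilder(rows));
        System.out.println("all four produce identical output: " + same);
    }
}
```

To time them, each call can be wrapped in a pair of System.nanoTime() readings. Such timings are only indicative unless JIT warm-up is taken into account.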

How efficient is each of them in the current scenario with 100,000 rows of data? The measured comparison is as follows:

The code structure is as follows:

If you are interested in the code, look here.

The overall performance comparison is clear:

StringBuilder > "+" concatenation > String.format()

Although the compiler optimizes "+" concatenation to use StringBuilder under the hood (javac does this up to Java 8), repeatedly concatenating with "+" is not the same as using a single StringBuilder directly.

Each "+" concatenation statement is equivalent to executing:

new StringBuilder(str)
    .append(newStr)
    .toString();

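For example (a single expression made only of literals, such as "P" + "a" + "g" + "e", would simply be folded into "Page" at compile time, so consider appending one piece per statement, as a loop does):

String name = "P";
name += "a";
name += "g";
name += "e";

is equivalent to: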

String name = new StringBuilder(
    new StringBuilder(
        new StringBuilder("P")
            .append("a")
            .toString()
        ).append("g")
         .toString()
    ).append("e")
     .toString();

Timing this approach at different orders of magnitude gives the following reference data:

New objects are created over and over and toString() is called frequently, which ultimately degrades performance.
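A rough way to see why this hurts is to count the characters that have to be copied. The sketch below is a simplified model, not a measurement: it assumes every row has the same length (65 characters is an assumed figure) and it counts only the characters of each newly produced string, so the real cost of "+" is at least this high.

```java
// Simplified cost model; class and method names are illustrative.
final class CopyCost {

    // With "+=", the i-th append produces a new string of i * rowLength characters,
    // so the total is rowLength * (1 + 2 + ... + rows) = rowLength * rows * (rows + 1) / 2.
    static long plusCopies(long rows, long rowLength) {
        return rowLength * rows * (rows + 1) / 2;
    }

    // A single StringBuilder copies each row once (internal array growth is ignored here).
    static long builderCopies(long rows, long rowLength) {
        return rows * rowLength;
    }

    public static void main(String[] args) {
        long rows = 100_000;
        long rowLength = 65; // assumption: average characters per INSERT line
        System.out.println("\"+\" copies at least " + plusCopies(rows, rowLength) + " characters");
        System.out.println("StringBuilder copies about " + builderCopies(rows, rowLength) + " characters");
    }
}
```

Under this model the "+" approach grows quadratically with the number of rows, while the StringBuilder approach grows linearly.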

Based on the above results, we can draw the following conclusions:

1. For requirements of this complexity and fewer than 1,000 rows of data, prefer String.format().

2. For requirements of this complexity and more than 1,000 rows of data, prefer StringBuilder or String.format(). String.format() is generally better for readability, but if performance matters, consider StringBuilder.

3. Unless the scenario is very simple, avoid concatenating strings with "+", as it offers neither good readability nor good performance.

02. Target 100 million

The data volume above is only 100,000 rows, while our target is 100,000,000 (100 million) rows.

From 100 to 100,000 rows, string concatenation time did not grow linearly. At 100,000 rows:

(1) "+" took 38 s, so 100 million rows would take at least 38 s × 1000 = 38,000 s ≈ 10.6 h.

(2) StringBuilder took only 43 ms, which may come as a surprise.

(3) String.format + StringBuilder took 532 ms, so 100 million rows would take at least 0.532 s × 1000 = 532 s (about 9 minutes).
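The extrapolation above is plain linear scaling by a factor of 1000 (100 million / 100,000). A minimal sketch of that arithmetic follows, using the three timings quoted in this section. The class and method names are illustrative, and for "+" the result is only a lower bound because its cost grows faster than linearly.

```java
import java.util.Locale;

// Illustrative arithmetic only; names are not from the article.
final class Extrapolation {
    // 100,000,000 target rows divided by the 100,000 rows that were measured.
    static final long SCALE = 100_000_000L / 100_000L;

    // Linear projection of a measured time (in milliseconds) to the target size.
    static long millisAtTarget(long millisAt100k) {
        return millisAt100k * SCALE;
    }

    public static void main(String[] args) {
        String[] labels = {"\"+\"", "StringBuilder", "String.format + StringBuilder"};
        long[] measuredMillis = {38_000, 43, 532};
        for (int i = 0; i < labels.length; i++) {
            long projected = millisAtTarget(measuredMillis[i]);
            System.out.printf(Locale.ROOT, "%-30s %,d s (%.2f h)%n",
                    labels[i], projected / 1000, projected / 3_600_000.0);
        }
    }
}
```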

So the natural choice is approach 2: generate the SQL statements with StringBuilder and set the loop to 100 million iterations.

With the loop count set to 100 million, something went wrong:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

The JVM heap size can be set with -Xms (initial size) and -Xmx (maximum size). After several tests, the following reference data were obtained:

Clearly, a MacBook Pro with 16 GB of memory is not enough to create 100 million rows of INSERT SQL statements directly in memory. So how can the requirement be met?
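A back-of-the-envelope estimate suggests why a single in-memory buffer struggles here. The sketch below assumes about 65 characters per row, derived from the 64.9 MB per 1 million rows figure reported later in this article. The class and method names are illustrative. It also prints the heap limit the current JVM is running with, which is what -Xmx controls.

```java
// Rough estimate under stated assumptions; not a measurement.
final class CapacityEstimate {
    static final long ROWS = 100_000_000L;
    static final long CHARS_PER_ROW = 65; // assumption: about 64.9 MB per 1,000,000 rows

    static long charsNeeded(long rows, long charsPerRow) {
        return rows * charsPerRow;
    }

    // A StringBuilder's length is an int, so Integer.MAX_VALUE is an upper bound.
    static boolean fitsInOneBuilder(long chars) {
        return chars <= Integer.MAX_VALUE;
    }

    // Largest number of rows of the given length that could fit in one builder.
    static long maxRows(long charsPerRow) {
        return Integer.MAX_VALUE / charsPerRow;
    }

    public static void main(String[] args) {
        long needed = charsNeeded(ROWS, CHARS_PER_ROW);
        System.out.println("characters needed:     " + needed);
        System.out.println("fits in one builder:   " + fitsInOneBuilder(needed));
        System.out.println("max rows per builder:  " + maxRows(CHARS_PER_ROW));
        // Heap limit of the running JVM, e.g. after starting it with: java -Xmx8g CapacityEstimate
        long maxHeapMiB = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("max heap (MiB):        " + maxHeapMiB);
    }
}
```

Under this assumption the full output needs about 6.5 billion characters, more than a single StringBuilder can hold whatever the heap size. The buffer also grows by allocating a larger array and copying into it, so the heap can run out well before that limit is reached.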

Since the data cannot be generated in a single for loop, it can be generated in several, for example a 10-million-iteration for loop executed 10 times.

The following variants were run:

1. Execute 10 times, generating 10 million rows each time and appending them all to the same file.

2. Execute 10 times, generating 10 million rows each time and persisting each batch in a separate file.

3. Execute 100 times, generating 1 million rows each time and persisting each batch in a separate file.
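The "one file per batch" variant could be sketched as follows. This is an assumed implementation, not the author's code: the class name, the products-NNN.sql file naming, and the use of a temporary directory are all illustrative choices. Only one batch is held in memory at a time.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;

// Hypothetical sketch of batched generation, one file per batch.
final class BatchedFiles {

    // Writes totalRows statements, batchSize rows per file; returns the number of files written.
    static int writeBatches(Path dir, int totalRows, int batchSize) throws IOException {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        Files.createDirectories(dir);
        int fileCount = 0;
        for (int start = 0; start < totalRows; start += batchSize) {
            int end = Math.min(start + batchSize, totalRows);
            // A fresh StringBuilder per batch, so earlier batches can be garbage collected.
            StringBuilder batch = new StringBuilder();
            for (int i = start; i < end; i++) {
                batch.append(String.format(Locale.ROOT,
                        "INSERT INTO products (`id`, `code`) VALUES (%d, '%09d');\n", i + 1, i));
            }
            Path file = dir.resolve(String.format(Locale.ROOT, "products-%03d.sql", fileCount));
            Files.write(file, batch.toString().getBytes(StandardCharsets.UTF_8));
            fileCount++;
        }
        return fileCount;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("products-sql");
        // Small numbers for a quick run; the article's third variant is 100 batches of 1,000,000 rows.
        int files = writeBatches(dir, 10_000, 1_000);
        System.out.println(files + " files written to " + dir);
    }
}
```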

The result is as follows:

At this point the requirement has been met.

03. Extensions

1. Some of the problems above would never have come up if every generated SQL statement had been appended to a file from the start.

2. StringBuilder does perform well in terms of execution speed. However, for data on the scale of 100 million rows or more, note that StringBuilder has an internal capacity. The capacity is of type int, so there is an upper limit on how much it can hold.

3. StringBuilder is backed by an array (byte[] since Java 9, char[] before that), so array size limits also need attention when processing large amounts of data.

4. The size of each generated file can be chosen to suit the actual application. One million rows per file (about 64.9 MB) is recommended, because text files of that size open quickly, whether in Vim or in other text tools.

5. When dealing with large amounts of data, you need to consider not only readability but also efficiency and, if necessary, JVM parameter settings.
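Point 1 can be sketched by writing each statement straight to a buffered file writer, so that memory use stays flat however many rows are produced. The class and method names below are illustrative, and the example writes to a temporary file.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;

// Hypothetical sketch: stream statements to a file instead of accumulating them in memory.
final class StreamToFile {

    // Writes the given number of statements to the file and returns the file size in bytes.
    static long writeInserts(Path file, int rows) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            for (int i = 0; i < rows; i++) {
                // Only the current row exists as a String; the writer buffers and flushes it.
                out.write(String.format(Locale.ROOT,
                        "INSERT INTO products (`id`, `code`) VALUES (%d, '%09d');\n", i + 1, i));
            }
        }
        return Files.size(file);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("products", ".sql");
        long bytes = writeInserts(file, 100_000);
        System.out.println(bytes + " bytes written to " + file);
    }
}
```

If speed matters more than readability, the String.format() call could be replaced by a few out.write() calls on the individual pieces, in line with the performance comparison earlier in the article.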
