As we all know, when Java deals with a large amount of data, loading everything into memory will inevitably cause a memory overflow, yet some tasks force us to process massive data sets. The common tools for this kind of work are decomposition, compression, parallelism, and temporary files.

For example, say we want to export data from a database (it doesn't matter which one) to a file, usually Excel or a CSV text file. With Excel, using the POI or JXL APIs you often have no way to control when memory is flushed to disk, which is unpleasant, and the objects these APIs build in memory are many times the size of the raw data, so you are forced to split the workbook. Fortunately, POI has started to recognize the problem: as of 3.8.4 you can cap the number of rows kept in memory through the SXSSFWorkbook interface. Unfortunately, once you exceed that number, every additional row pushes the oldest row out to disk (with a window of 2,000 rows, writing row 2,001 flushes row 1). It writes to temporary files, so memory stays flat, but you will find the disk is flushed very frequently, which is not what we want: we want it to flush a whole range in one go, say 1 MB per flush, and sadly there is no such API, which is painful. In my tests, writing several smaller Excel files was more efficient than writing one huge file through the current row-by-row flushing API, and once a few more people access the system the disk I/O may not hold up, because I/O resources are very limited; so splitting into separate files is the better policy. When writing CSV, which is just a text file, you can control most of it yourself, and you don't really want the API a CSV library provides, since it is not very controllable; because CSV is plain text, whatever you write out as correctly formatted text is recognized as CSV. How do we write it? Read on.
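
A minimal sketch of the row-window approach, assuming POI's SXSSF classes are on the classpath; the file name, sheet name, and row count are placeholders:

```java
import java.io.FileOutputStream;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;

public class StreamingExcelExport {
    public static void main(String[] args) throws Exception {
        // Keep only 1000 rows in memory; older rows are flushed to a temporary file on disk.
        SXSSFWorkbook wb = new SXSSFWorkbook(1000);
        try (FileOutputStream out = new FileOutputStream("export.xlsx")) {
            Sheet sheet = wb.createSheet("data");
            for (int r = 0; r < 100_000; r++) {
                Row row = sheet.createRow(r);
                row.createCell(0).setCellValue("row-" + r); // placeholder cell data
            }
            wb.write(out);
        } finally {
            wb.dispose(); // delete the temporary files SXSSF created
        }
    }
}
```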

At the data-processing level, for example reading from the database and generating local files, we don't have to manage every 1 MB by hand for coding convenience; the underlying driver takes care of the splitting, and from the program's point of view the writes are continuous. Say we want to export a table of 10 million rows to a file. The naive option is pagination: Oracle needs the classic three-level nested ROWNUM query, MySQL uses LIMIT, but each page is a brand-new query and it gets slower and slower as you page deeper. What we really want is to get a single handle and stream downward, accumulating a batch of data (say 10,000 lines), writing it to the file through a buffered OutputStream, and flushing the buffer each time. The next question: if we execute one SQL statement with no WHERE condition, won't memory run out? Yes, it's worth thinking about. The JDBC API does give us something to work with. PreparedStatement statement = connection.prepareStatement(sql) gives a precompiled statement by default, and it can also be created as: PreparedStatement statement = connection.prepareStatement(sql, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);

to set the cursor type so that the driver does not cache the whole result set directly into local memory, and then statement.setFetchSize(200); to set how many rows each cursor traversal fetches. Whether the standard Java API actually takes effect depends a lot on the vendor's driver implementation; many posts online claim this setting alone works, but they are just copying each other. Oracle's JDBC driver does not cache the data into Java memory by default, so on the Oracle side Java memory is not a problem. With MySQL, by contrast, the default driver behavior pulls the whole result set into memory and the setting alone does not help: you must use driver version 5 or later and add the parameter useCursorFetch=true to the connection parameters. The cursor size can then be set by adding defaultFetchSize=1000 to the connection string, for example:

jdbc:mysql://xxx.xxx.xxx.xxx:3306/abc?zeroDateTimeBehavior=convertToNull&useCursorFetch=true&defaultFetchSize=1000
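
A minimal sketch of the cursor-based export, assuming a MySQL 5+ driver; the table name, columns, credentials, output file, and the 10,000-row flush interval are placeholders:

```java
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class CursorExport {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql://xxx.xxx.xxx.xxx:3306/abc"
                + "?zeroDateTimeBehavior=convertToNull&useCursorFetch=true&defaultFetchSize=1000";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT id, name FROM big_table",        // hypothetical table and columns
                     ResultSet.TYPE_FORWARD_ONLY,
                     ResultSet.CONCUR_READ_ONLY);
             BufferedWriter writer = new BufferedWriter(new FileWriter("big_table.csv"))) {

            ps.setFetchSize(1000);  // cursor batch size: the driver fetches 1000 rows per round trip
            int rows = 0;
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    writer.write(rs.getLong("id") + "," + rs.getString("name"));
                    writer.newLine();
                    if (++rows % 10_000 == 0) {
                        writer.flush();  // push roughly 10,000 lines to disk in one go
                    }
                }
            }
            writer.flush();
        }
    }
}
```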

I wrestled with this problem for a long time last time (MySQL data pulled into memory kept inflating the heap, and two parallel exports brought the system straight down). I read a lot of driver source code before finding the trick above, confirmed it against the MySQL documentation, and then tested it: with multiple parallel exports of more than 5 million rows each, memory no longer balloons and GC stays healthy. That problem is finally closed.

Now for something else: splitting and merging data. When many data files need to be merged, or a file is too large and has to be split, the process runs into similar problems, but we can keep it within bounds. If the data only needs to be consistent when it is finally reassembled, then during splitting and merging you should not go by logical lines, because that forces you to interpret the data itself; since you only need to split, treat it as plain binary. In that binary processing, be careful not to read the way you would read an ordinary file: the read method provides overloads that take an offset and a length, which you compute inside the loop (a sketch follows below). Writing the large output file follows the same rule as above: don't flush to disk until a decent amount has accumulated.
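
A minimal sketch of splitting a file at byte boundaries with offset/length reads, assuming the parts will simply be concatenated back in order later; the file names and the 512 MB part size are arbitrary choices:

```java
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.RandomAccessFile;

public class BinarySplitter {
    // Copy the byte range [offset, offset + length) of the source file into one part file.
    static void writePart(String source, String partName, long offset, long length) throws Exception {
        byte[] buffer = new byte[64 * 1024];  // 64 KB transfer buffer
        try (RandomAccessFile in = new RandomAccessFile(source, "r");
             BufferedOutputStream out = new BufferedOutputStream(new FileOutputStream(partName))) {
            in.seek(offset);
            long remaining = length;
            while (remaining > 0) {
                int n = in.read(buffer, 0, (int) Math.min(buffer.length, remaining));
                if (n < 0) break;             // reached end of file early
                out.write(buffer, 0, n);
                remaining -= n;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        long partSize = 512L * 1024 * 1024;                  // 512 MB per part
        long total = new java.io.File("huge.dat").length();  // "huge.dat" is a placeholder
        for (long offset = 0, i = 0; offset < total; offset += partSize, i++) {
            writePart("huge.dat", "huge.part" + i, offset, Math.min(partSize, total - offset));
        }
    }
}
```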

NIO techniques are also useful even when the amount of data per request is modest, for example when many terminals simultaneously request the same large file, such as a video download. Handled by an ordinary Java container, two problems usually appear. The first is memory: each request loads something on the order of the file's size into memory, and usually more, because the Java object wrapping adds a lot of overhead (working with raw binary buffers produces far less), and the data also goes through several memory copies on its way through the input and output streams. If you sit behind middleware such as nginx, you can hand the job off with the send_file model, but if your own program must serve the file, you need a lot of memory; a big Java heap still has GC, and a really big heap can stall to death during GC. You can also consider allocating and releasing direct memory yourself, but then the remaining physical memory has to be large enough as well. How large is large enough? That depends on the size of the file itself and how often it is accessed.
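
On the copy-count point, FileChannel.transferTo is the JDK's analogue of the send_file model: it lets the OS move the bytes without dragging them through the Java heap. A minimal sketch, assuming a receiver is already listening; the file name, host, and port are placeholders:

```java
import java.io.FileInputStream;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;

public class ZeroCopySend {
    public static void main(String[] args) throws Exception {
        try (FileChannel file = new FileInputStream("video.mp4").getChannel();
             SocketChannel socket = SocketChannel.open(new InetSocketAddress("localhost", 9000))) {
            long position = 0;
            long size = file.size();
            while (position < size) {
                // transferTo may send less than requested, so loop until the whole file is out.
                position += file.transferTo(position, size - position, socket);
            }
        }
    }
}
```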

The second is threads: even if memory is large enough, the limit becomes the thread count. The traditional IO model is one request, one thread; the thread is handed out from the pool by the main thread and then works its way through Context wrapping, Filters, interceptors, layer upon layer of business code, database access, file access, rendering and so on, and it stays pinned for the whole journey. That resource is very limited, and serving big files is an IO-intensive job in which the CPU sits mostly idle. The most direct fix is of course to add threads, and with enough memory you can afford a bigger pool, but a pool is normally bounded and oversizing it is not recommended. To improve performance under limited system resources, new IO technology appeared: NIO, and in newer versions, AIO. NIO is non-blocking IO, but the reads and writes in the middle still block (it performs the real reads and writes; it just does not sit waiting for the response in the old way), so it is not yet truly asynchronous IO. Listening for connections no longer needs a crowd of threads; a single thread handles it, connections become channels registered with a Selector instead of one blocking socket each, and connections with no data to process are not assigned a thread at all (see the selector sketch below). AIO works through a so-called callback registration, which of course requires OS support, and allocates a thread only when data actually lands. It is not yet mature, with performance at best comparable to NIO, but AIO will surely surpass NIO as the technology develops. Node.js, powered by Google's V8 engine, is a similar model. These techniques are not the focus of this article.
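
A minimal sketch of the single-threaded selector loop, reduced to an echo server just to show how one thread watches many connections; the port and the 8 KB buffer size are arbitrary:

```java
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

public class NioEchoServer {
    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(9000));
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);

        ByteBuffer buffer = ByteBuffer.allocate(8 * 1024);  // one 8 KB buffer reused for all connections
        while (true) {
            selector.select();                               // block until some channel is ready
            Iterator<SelectionKey> it = selector.selectedKeys().iterator();
            while (it.hasNext()) {
                SelectionKey key = it.next();
                it.remove();
                if (key.isAcceptable()) {
                    SocketChannel client = server.accept();  // accept without a dedicated thread
                    client.configureBlocking(false);
                    client.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) {
                    SocketChannel client = (SocketChannel) key.channel();
                    buffer.clear();
                    int n = client.read(buffer);
                    if (n < 0) { client.close(); continue; }
                    buffer.flip();
                    client.write(buffer);                    // echo back what was read
                }
            }
        }
    }
}
```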

Combining the two points above is how you handle large files plus concurrency. The crude approach is to shrink each request to a fixed size, say 8 KB (a size that testing showed to be suitable for network transfer; reading a local file does not need chunks that small). Going a step further, you can add a layer of caching: cache the chunks that multiple requests for the same file share, in memory or in a distributed cache. Do not cache the entire file; cache only the chunks touched in the last few seconds or so, or use some hot-spot algorithm (a sketch of such a chunk cache follows below). This is similar to Xunlei-style segmented, resumable downloading (though its network protocol is not quite the same): the downloaded data does not have to arrive in order as long as it can be merged at the end, and the server simply turns that around: whoever needs this piece of data right now gets it. Only with NIO can you support very large numbers of connections and real concurrency. In a local NIO socket test, 100 terminals simultaneously requested a file from a single-threaded server. With a normal web application, the first file has not finished sending before the second request either waits, times out, or has its connection refused outright; switch to NIO and all 100 requests connect, and the server needs only one thread to process the data, feeding these connections a bit at a time: read a part, pass it on, read the next part. You can work out that overall efficiency does not improve over the total long-connection transfer; what changes is that the threads and memory spent become bounded and predictable, and that is the beauty of this technique: not much clever algorithm, but you have to understand it.
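
One possible shape for the hot-chunk cache mentioned above, keyed by file path and chunk index with a small LRU bound; the chunk size, cache capacity, and class name are illustrative assumptions, not a fixed design:

```java
import java.io.RandomAccessFile;
import java.util.LinkedHashMap;
import java.util.Map;

public class ChunkCache {
    private static final int CHUNK_SIZE = 8 * 1024;  // 8 KB, the request granularity discussed above

    // A small LRU cache of the most recently served chunks, keyed by "path#chunkIndex".
    private final Map<String, byte[]> cache = new LinkedHashMap<String, byte[]>(256, 0.75f, true) {
        @Override protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
            return size() > 1024;                     // keep at most 1024 chunks (about 8 MB)
        }
    };

    public synchronized byte[] readChunk(String path, long chunkIndex) throws Exception {
        String key = path + "#" + chunkIndex;
        byte[] cached = cache.get(key);
        if (cached != null) return cached;            // hot chunk: serve it from memory
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            file.seek(chunkIndex * CHUNK_SIZE);
            byte[] buf = new byte[CHUNK_SIZE];
            int n = file.read(buf, 0, CHUNK_SIZE);
            if (n < CHUNK_SIZE) {                     // the last chunk of a file may be shorter
                byte[] shorter = new byte[Math.max(n, 0)];
                System.arraycopy(buf, 0, shorter, 0, shorter.length);
                buf = shorter;
            }
            cache.put(key, buf);
            return buf;
        }
    }
}
```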

There is a lot of similar data processing out there, and some of it lives or dies on efficiency, such as the file splitting and merging inside HBase, where getting it wrong affects the online business, which is a hard thing. Many such scenarios are worth studying, because different scenes call for different solutions, but the ideas and methods are the same: understand memory and architecture, understand what kind of scenario you are dealing with, and changing just a few details can work wonders.