How many ways does Java copy files? Which is the most efficient?

1. Use the java.io class library: build a FileInputStream to read the source file and a FileOutputStream to write the target file.

// Requires: import java.io.*;
public static void copyFileByStream(File source, File dest) throws IOException {
    try (InputStream is = new FileInputStream(source);
         OutputStream os = new FileOutputStream(dest)) {
        byte[] buffer = new byte[1024];
        int length;
        // read() returns -1 at end of stream, which ends the loop
        while ((length = is.read(buffer)) > 0) {
            os.write(buffer, 0, length);
        }
    }
}

2. Use the transferTo or transferFrom method provided by the java.nio class library.


// Requires: import java.io.*; import java.nio.channels.FileChannel;
public static void copyFileByChannel(File source, File dest) throws IOException {
    try (FileChannel sourceChannel = new FileInputStream(source).getChannel();
         FileChannel targetChannel = new FileOutputStream(dest).getChannel()) {
        for (long count = sourceChannel.size(); count > 0; ) {
            // transferTo may move fewer bytes than requested, so loop until done
            long transferred = sourceChannel.transferTo(
                    sourceChannel.position(), count, targetChannel);
            sourceChannel.position(sourceChannel.position() + transferred);
            count -= transferred;
        }
    }
}

The Java standard library itself already provides several Files.copy implementations. Copy efficiency actually depends on the operating system and its configuration. Generally speaking, the NIO transferTo/transferFrom approach is likely to be faster because it can take advantage of underlying operating system mechanisms and avoid unnecessary copying and context switching.

Copy analysis

In practice, the NIO transfer approach is not guaranteed to be the fastest, and in some cases it is not. From a technical point of view, several aspects are worth examining:

  1. What are the differences between different copy modes and the underlying mechanisms?
  2. Why might zero-copy have a performance advantage?
  3. Buffer classification and use.
  4. Impact of Direct Buffer on garbage collection

1. Copy implementation mechanism analysis

What are the essential differences between the copy methods? First, an operating system distinguishes user space from kernel space; this is a basic operating system concept. The operating system kernel and hardware drivers run in kernel space and have relatively high privileges. User space, on the other hand, is where ordinary applications and services run.

| State       | CPU                               | Programs                                    | Development                  |
|-------------|-----------------------------------|---------------------------------------------|------------------------------|
| User mode   | Access to resources is restricted | Lower reliability and security requirements | Simple to write              |
| Kernel mode | Can access any resource           | High reliability and security requirements  | Costly to write and maintain |

When we use I/O streams for reading and writing, several context switches actually take place. For a read, the operating system first reads data from disk into a kernel buffer in kernel mode, then switches to user mode and copies the data from the kernel buffer into a user buffer. Writing works the same way, with the steps reversed.

The NIO transferTo-based implementation, however, uses zero-copy technology on Linux and Unix. The data transfer does not require user-mode participation, which eliminates the overhead of context switches and unnecessary memory copies and thus improves copy performance. Note that transferTo is not limited to file copying: similar use cases, such as reading a file from disk and sending it over a socket, also benefit from the performance and scalability improvements this mechanism provides.
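To illustrate the last point, here is a minimal sketch of sending a file to an arbitrary WritableByteChannel with transferTo. The names ChannelSender and sendFile are made up for this example, and the target is assumed to be a blocking channel (for instance a SocketChannel in blocking mode). Whether the transfer is really zero-copy depends on the platform and on the type of the target channel.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, for illustration only.
final class ChannelSender {

    /** Sends the file to the target channel and returns the number of bytes sent. */
    static long sendFile(Path file, WritableByteChannel target) throws IOException {
        try (FileChannel source = FileChannel.open(file, StandardOpenOption.READ)) {
            long position = 0;
            long size = source.size();
            while (position < size) {
                // transferTo may move fewer bytes than requested, so keep going.
                long sent = source.transferTo(position, size - position, target);
                if (sent <= 0) {
                    break; // nothing more could be transferred right now
                }
                position += sent;
            }
            return position;
        }
    }
}
```

The caller can compare the returned count with the file size to detect an incomplete transfer.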

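With transferTo, the transfer proceeds as follows: data is read from disk into a kernel buffer and passed on to the target inside the kernel, without being copied into a user-space buffer.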

Java IO/NIO source code structure analysis

The Java standard library also provides Files.copy. How is it implemented? It has three overloads:

public static Path copy(Path source, Path target, CopyOption... options) throws IOException
public static long copy(InputStream in, Path target, CopyOption... options)  throws IOException
public static long copy(Path source, OutputStream out) throws IOException

As you can see, copy supports more than file-to-file operations: nothing dictates that an input or output stream must be tied to a file. The two stream-based overloads are handy utility methods.
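As a quick illustration, the three overloads can be exercised as in the following sketch. The class and method names (FilesCopyDemo, fileToFile, streamToFile, fileToBytes) are invented for the example; only the Files.copy calls come from the standard library.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical wrapper class, for illustration only.
final class FilesCopyDemo {

    /** File to file; REPLACE_EXISTING overwrites an existing target. */
    static Path fileToFile(Path source, Path target) throws IOException {
        return Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
    }

    /** Any InputStream (not necessarily backed by a file) to a file; returns bytes written. */
    static long streamToFile(InputStream in, Path target) throws IOException {
        return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
    }

    /** A file to any OutputStream; an in-memory stream is used here. */
    static byte[] fileToBytes(Path source) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Files.copy(source, out);
        return out.toByteArray();
    }
}
```

Without REPLACE_EXISTING, the overloads that write to a Path fail with FileAlreadyExistsException when the target already exists.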

The latter two overloads are implemented with InputStream.transferTo(), which internally just reads and writes the stream in user mode. For the first overload, refer to the following code:
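For comparison, here is a small sketch of InputStream.transferTo (available since Java 9) next to a hand-written loop that does the same job. The manual loop is only an approximation of a user-mode stream copy, not the JDK's actual source, and the class name StreamTransferDemo and the buffer size are arbitrary choices for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical demo class, for illustration only.
final class StreamTransferDemo {

    /** Copies with the library method; returns the number of bytes transferred. */
    static long viaTransferTo(InputStream in, OutputStream out) throws IOException {
        return in.transferTo(out);
    }

    /** A hand-written user-mode copy loop with the same observable result. */
    static long viaManualLoop(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        return total;
    }
}
```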

public static Path copy(Path source, Path target, CopyOption... options)
        throws IOException {
    FileSystemProvider provider = provider(source);
    if (provider(target) == provider) {
        // same provider
        provider.copy(source, target, options); // this is the path analyzed in this article
    } else {
        // different providers
        CopyMoveHelper.copyToForeignTarget(source, target, options);
    }
    return target;
}

In the JDK source code, the internal implementation and the public API are not always easy to connect. Some of the NIO code is even defined as templates rather than Java source files, and the actual source is generated during the build. So let us briefly walk through how this part of the JDK code is organized and how to get around these hidden obstacles.

  1. First, tracing the call directly, we find that FileSystemProvider is just an abstract class. Reading its source shows that the actual file system logic lives in JDK-internal implementations: the public API loads a set of file system implementations through the ServiceLoader mechanism and then provides the service through them.

  2. Searching the JDK source for FileSystemProvider and nio leads to sun/nio/fs. Because NIO is closely tied to the underlying operating system, each platform has its own platform-specific file system logic.

  3. Omitting some details, we can trace step by step from UnixFileSystemProvider to UnixCopyFile.transfer and find that it is a native method.

  4. Finally, we arrive at UnixCopyFile.c, whose implementation makes it clear that this is just a simple user-space copy!

So the most commonly used copy method, Files.copy, does not actually use transferTo; it is a user-mode copy implemented in native code.

In practice, some general principles apply to improving the performance of I/O operations such as copying:

  1. Use caching and similar mechanisms in the program to reasonably reduce the number of I/O operations (in network communication such as TCP transmission, the window size can be seen as a similar idea).

  2. Use mechanisms such as transferTo to reduce context switches and extra I/O operations.

  3. Minimize unnecessary conversions such as encoding and decoding or object serialization and deserialization. For example, when handling text files or network communication, if the text content is not needed along the way, consider transferring the binary data directly instead of converting it to strings.
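The first principle can be made concrete with a small sketch: the same copy loop needs fewer read calls when the buffer is larger. The class name BufferedCopy and the idea of returning the read count are inventions of this example, used only to make the effect observable; real performance gains depend on the underlying stream and operating system.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical demo class, for illustration only.
final class BufferedCopy {

    /**
     * Copies in to out using a buffer of the given size and returns
     * how many read calls delivered data.
     */
    static int copy(InputStream in, OutputStream out, int bufferSize) throws IOException {
        byte[] buffer = new byte[bufferSize];
        int reads = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            reads++;
        }
        return reads;
    }
}
```

For an 8192-byte in-memory source, a 1024-byte buffer needs eight reads while an 8192-byte buffer needs one. With a file or socket stream, each avoided read is typically an avoided system call.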

Mastering NIO Buffer

Buffer is NIO’s basic tool for manipulating data. Java provides a corresponding Buffer implementation for every primitive data type except boolean, so it is worth learning to use Buffer well. Direct Buffer deserves particular attention because of how it interacts with garbage collection.

Buffer has several basic properties:
  • Capacity: the size of the Buffer, i.e. the length of the underlying array.

  • Position: the index of the next element to be read or written.

  • Limit: the boundary of the operation. Its meaning differs between reading and writing. When reading, limit is typically set to the end of the data the buffer holds; when writing, it is set to the capacity or to a writable bound below the capacity.

  • Mark: records a previous position so that reset() can return to it. The mark is undefined until mark() is called; it is a convenience and often not needed.

The basic operations of Buffer:
  • When we create a Buffer, capacity is the size of the buffer, position is 0, and limit defaults to capacity.

  • As we write a few bytes of data, position advances, but it cannot exceed limit.

  • To read back the data we wrote earlier, we call flip, which sets limit to the previous position and position to 0.

  • To read from the beginning again, we call rewind, which leaves limit unchanged and sets position to 0 again.
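The sequence above can be traced with a short sketch. The class name BufferStateDemo and the helper state are made up for this example; the buffer calls themselves are standard java.nio methods.

```java
import java.nio.Buffer;
import java.nio.ByteBuffer;

// Hypothetical demo class, for illustration only.
final class BufferStateDemo {

    /** Formats the three basic properties of a buffer. */
    static String state(Buffer b) {
        return "position=" + b.position() + " limit=" + b.limit() + " capacity=" + b.capacity();
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(8);
        System.out.println("new:    " + state(buf)); // position=0 limit=8 capacity=8

        buf.put((byte) 10).put((byte) 20).put((byte) 30);
        System.out.println("put x3: " + state(buf)); // position=3 limit=8 capacity=8

        buf.flip(); // switch from writing to reading
        System.out.println("flip:   " + state(buf)); // position=0 limit=3 capacity=8

        byte first = buf.get(); // reads 10 and advances position
        System.out.println("get:    " + state(buf) + " value=" + first); // position=1 limit=3

        buf.rewind(); // read again from the start, limit unchanged
        System.out.println("rewind: " + state(buf)); // position=0 limit=3 capacity=8
    }
}
```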

Direct Buffer and garbage collection

  • Direct Buffer: If you look at the methods Buffer defines, you will see isDirect(), which returns whether the current Buffer is of the Direct type. This is because Java provides both on-heap buffers and off-heap (Direct) buffers, which are created with the allocate and allocateDirect methods respectively.

  • MappedByteBuffer: It maps a file directly into a memory region of a specified size. When a program accesses this region, it operates on the file data directly, eliminating the cost of transferring data from kernel space to user space. We can create a MappedByteBuffer with FileChannel.map; it is essentially a Direct Buffer as well.
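A minimal sketch of the three kinds of buffer mentioned above follows. The class name DirectBufferDemo and the method mapReadOnly are hypothetical; the example assumes a non-empty, readable file small enough to map in one piece.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo class, for illustration only.
final class DirectBufferDemo {

    /** On-heap buffer: isDirect() returns false. */
    static ByteBuffer heap(int size) {
        return ByteBuffer.allocate(size);
    }

    /** Off-heap buffer: isDirect() returns true. */
    static ByteBuffer direct(int size) {
        return ByteBuffer.allocateDirect(size);
    }

    /** Maps the whole file read-only; the mapping stays valid after the channel is closed. */
    static MappedByteBuffer mapReadOnly(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        }
    }
}
```

Reading from the mapped buffer, for example with get(0), accesses the file content without an explicit read call.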

In practice, Java tries to perform native I/O operations directly on a Direct Buffer. For many I/O-intensive operations with large amounts of data, this can bring significant performance advantages because:

  • A Direct Buffer's memory address does not change during its lifetime, so the kernel can access it safely and many I/O operations can be very efficient.

  • It avoids the extra maintenance work that on-heap object storage may require, so access efficiency may improve.

Note, however, that Direct Buffers are more expensive to create and destroy than ordinary on-heap buffers, so they are generally recommended for long-lived, large-data scenarios.

With Direct Buffers, we need to be aware of their impact on memory and JVM parameters. First, since they are not on the heap, parameters such as -Xmx do not limit the memory used by off-heap consumers such as Direct Buffers. We can set their maximum size with the following parameter: -XX:MaxDirectMemorySize=512M

From the point of view of parameter settings and memory troubleshooting, this means that when we calculate how much memory a Java process can use, we cannot consider only the needs of the heap; we must also account for off-heap factors such as Direct Buffers. When memory runs short, off-heap usage is one of the possible causes.

In addition, most garbage collection cycles do not actively reclaim Direct Buffers. Their reclamation is based on the Cleaner mechanism (an internal implementation) and PhantomReference. The buffer implementation itself is not a public type; internally, a Deallocator is responsible for the destruction logic. Destruction is often delayed until a full GC, so improper use can easily lead to OutOfMemoryError.

Advice:

  1. Explicitly call System.gc() in the application to force a collection.

  2. Alternatively, frameworks that make heavy use of Direct Buffers call the release method themselves, which is what Netty does. See PlatformDependent0 for details.

  3. Reuse Direct Buffers.
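The third piece of advice can be sketched as follows: allocate one Direct Buffer up front and reuse it for every copy instead of allocating a new one each time. The class ReusableBufferCopier and the 64 KB size are assumptions made for this example; the sketch assumes blocking channels and is not thread-safe, since the buffer is shared.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.WritableByteChannel;

// Hypothetical helper class, for illustration only. Not thread-safe.
final class ReusableBufferCopier {

    // Allocated once and reused, because Direct Buffers are costly to create and destroy.
    private final ByteBuffer buffer = ByteBuffer.allocateDirect(64 * 1024);

    /** Copies everything from in to out and returns the number of bytes written. */
    long copy(ReadableByteChannel in, WritableByteChannel out) throws IOException {
        long total = 0;
        buffer.clear(); // reset state left over from a previous copy
        while (in.read(buffer) != -1) {
            buffer.flip(); // switch to reading what was just filled
            while (buffer.hasRemaining()) {
                total += out.write(buffer);
            }
            buffer.clear(); // make the whole buffer writable again
        }
        return total;
    }
}
```

A pool of such buffers would serve concurrent callers, at the cost of more off-heap memory held for the lifetime of the pool.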

Trace and diagnose Direct Buffer memory usage

Diagnosing Direct Buffer memory can also be a headache, because the usual records, such as garbage collection logs, contain no information about Direct Buffers. Fortunately, from JDK 8 onward you can use the Native Memory Tracking (NMT) feature for diagnosis by adding the following parameter at program startup: -XX:NativeMemoryTracking={summary|detail}

Note, however, that enabling NMT typically causes a 5% to 10% performance drop in the JVM.


At run time, the following command can be used for interactive comparison:

jcmd <pid> VM.native_memory detail
# Take a baseline so that later memory allocation changes can be compared against it
jcmd <pid> VM.native_memory baseline
# Show the changes relative to the baseline
jcmd <pid> VM.native_memory detail.diff
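As a lighter-weight complement to NMT (not something the text above covers), Direct Buffer usage can also be read from inside the process through the standard BufferPoolMXBean. The sketch below assumes the JVM exposes a pool named "direct", as HotSpot does; the class name DirectPoolStats is made up for the example.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper class, for illustration only.
final class DirectPoolStats {

    /** Returns the total capacity in bytes of the "direct" buffer pool, or -1 if no such pool exists. */
    static long directCapacity() {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            // Assumption: the pool for Direct Buffers is named "direct".
            if ("direct".equals(pool.getName())) {
                return pool.getTotalCapacity();
            }
        }
        return -1;
    }
}
```

Logging this value periodically gives a rough view of Direct Buffer growth without the overhead of enabling NMT, although it does not cover other kinds of native memory.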