Hello, I’m Daming.
Personal website: www.topjava.cn/
Copy-on-write, or copy-on-write, is a concept that I learned about when I was learning about Redis persistence. I’ve seen it a long time ago (Java container concurrency), but I’ve never really looked at it in depth, so I’ll take this opportunity to explore it. So write an article about it.
COW (copy-on-write for short) is an optimization strategy in the field of computer design. Its core ideas are as follows: If multiple callers request the same resource (such as memory or data store on disk) at the same time, they will get the same pointer to the same resource, until one of the callers tries to modify the contents of the resource, the system will actually make a private copy to the caller. The original resource, as seen by other callers, remains the same. This process is transparently transparent to all other callers. The main advantage of this approach is that if the caller does not modify the resource, no private copy is created, so multiple callers can share the same resource when they are just reading (from Wikipedia).
In Linux copy – on – write
To understand Linux COW, it is necessary to know two functions: fork() and exec(), where exec() is a collection of functions including execl(), execlp(), execv(), execle(), execve(), and execvp().
fork()
What is fork()? It is the only way in the UNIX operating system to spawn a new process. It is used to create a child process that is equivalent to a copy of its parent process. They have the same physical space (memory area).
The fork() function is called once and returns two times. It is called from the parent process and returns two values. One is returned to the parent process, which returns the process ID number of the new child, and the other is returned to the child process, which returns 0. So we can basically determine whether the current process is a child or a parent based on the return value.
Since any child has only one parent, we can get the parent’s process ID by calling getppid, and the parent can have multiple children, so fork() returns the child’s process ID so that it can identify its child.
exec()
Fork () creates a child process that is a copy of the parent process. It doesn’t make sense to just fork a copy of the parent process. We want the child process to do something, something different from the parent process, and that’s where exec() comes in. Its purpose is to load a new program that overrides the image in the memory space of the current process to perform a different task.
For example, if the parent prints hello World, the child will also print Hello World. However, when the child executes exec(), it does not necessarily print Hello World, it might execute 1 + 1 = 2. The diagram below:
The following articles are recommended for fork() and exec() :
- The fork and exec functions: blog.csdn.net/bad_good_ma…
- Fork () : fork (); Instance) : blog.csdn.net/jason314/ar…
- Linux c fork () and the exec function description and usage: blog.csdn.net/nvd11/artic…
- Linux OS fork: blog.csdn.net/sinat_35925…
- Linux system in the process of programming (5) : the exec function series (execl, execlp execle, execv, execvp) using: www.cnblogs.com/mickole/p/3…
Fork produces is identical to that of the parent of the child process, the practice of using traditional, will directly to the parent process data is copied to the child, the child process to create, after the completion of data segments between parent and child process and stack is complete independence, according to our practice, the child process generally perform is different from the parent process functions, After exec(), the original data will be emptied, so that the previous copy process will become invalid, which is a very wasteful process, since the traditional copy method is invalid for a lot of time, so there is copy-on-write technology, the principle is very simple:
The children of the fork share the memory space with the parent process. If the child process does not modify the memory space, the data in the memory space will not be copied to the child process. As a result, the creation of the child process is very fast (instead of copying, the child process directly references the physical space of the parent process).
After fork, there is no waste of space for the child process to execute exec().
As follows:
Another detail is that after fork, the kernel will put the child at the front of the queue, so that the child process will execute first, so that the parent process will not execute the copy on write, and then the child process will execute the exec system call, resulting in inefficient replication.
Copy On Write technology implementation principle:
After fork(), the kernel sets all memory pages in the parent to read-only, and the address space of the child processes points to the parent. When both parent and child processes read only memory, nothing happens. When one of these processes writes to memory, the CPU hardware detects that the pages of memory are read-only and triggers a page-fault, one of the kernel’s interrupt routines. In an interrupt routine, the kernel makes a copy of the page that triggered the exception, so the parent and child processes have separate copies.
Copy of Redis – on – write
We know that Redis is single-threaded, and the data of Redis cannot be stored in memory all the time, so it must be flushed to the hard disk periodically. This process is the persistence process of Redis, so how to implement the persistence process of Redis as a single thread while responding to the client commands? The answer is to rely on COW, which is to rely on the COW implementation of the fork function of the system.
There are two types of Redis persistence: RDB snapshots and AOF logs.
An RDB snapshot is a snapshot of all the data in Redis memory at a given time. When RDB persistence is executed, the Redis process forks a child process to execute the persistence process. The process is blocked, and the parent process continues to receive commands from the client when the fork process is complete. The child process and the Redis process share the data in memory, but the child process does not modify the data in memory, but constantly reads and writes to the file, but the Redis parent process is different, it needs to respond to the client command to constantly modify the data in memory. At this time, COW mechanism of the operating system will be used to separate data segments and pages. When the Redis parent process modifies the data of a page, it will copy the data of the page and then modify the copied page. At this time, the corresponding data page of the child process has not changed. It’s still the data at the fork.
The AOF log is to write every write command received to the log file to ensure that data is not lost. The problem with this is that log files get bigger over time, so Redis provides a rewrite process (bgrewriteaof) to compress log files. The override also calls the fork() function to produce a child process to compress the file.
For more information on Redis persistence, see this article: [Redis] —- Redis persistence
Copy – on – write in Java
If you are familiar with Java concurrency, you must know that Java also has two containers that use copy-on-write mechanism. They are CopyOnWriteArrayList and CopyOnWriteArraySet, which are quite useful in our concurrent use scenarios. Now let’s use CopyOnWriteArrayList to analyze Java copy-on-write.
CopyOnWriteArrayList to implement the List interface, the bottom implementation is to use array to achieve. Internally, it holds a private array for each element.
private transient volatile Object[] array;
Copy the code
This array is not accessible directly, only getArray() and setArray() are allowed.
final Object[] getArray() {
return array;
}
final void setArray(Object[] a) {
array = a;
}
Copy the code
For example, if you want to add a new array to the array, you need to add a new array to the array. If you want to add a new array to the array, you need to add a new array to the array. For example, if you want to add an array to the array, you need to add a new array to the array.
public boolean add(E e) {
final ReentrantLock lock = this.lock;
lock.lock();
try {
// Get the old array
Object[] elements = getArray();
int len = elements.length;
// Copy the new array
Object[] newElements = Arrays.copyOf(elements, len + 1);
// Add elements to the new array
newElements[len] = e;
// Reference the old array to the new array
setArray(newElements);
return true;
} finally{ lock.unlock(); }}Copy the code
It is added with a lock. If not, multiple copies may appear when multiple threads are written.
Read operations are as follows:
public E get(int index) {
return get(getArray(), index);
}
private E get(Object[] a, int index) {
return (E) a[index];
}
Copy the code
If the read operation is not locked, dirty data may occur.
So the COW container in Java works like this:
When we modify an element in a container, we do not directly operate the container, but copy the current container to copy a new container, and then operate on the new container. After the operation is complete, reference the original container to the new easily, and read the old container directly.
It is also a lazy principle, and also a bit of a read-write separation (read and write from different containers).
These two containers are good for scenarios where you have to acquire locks and copy arrays every time you write, and performance is a big issue.
For more information On Java COW, see this article: Talking about copy-on-write Containers in Concurrent Java
The resources
- COW COW! Learn about the Copy On Write mechanism