Another autumn recruitment season has come and gone, and it beat me up pretty badly ~ that is also why there were no blog updates for the last few weeks: I secretly spent the time re-reading the textbooks (just kidding).

I have kept sharing my posts in the community for a long time. The older articles have plenty of omissions, and many experienced readers point out my mistakes in the comments, which makes me both happy and a little dismayed: happy because everyone helps me correct errors, dismayed because my modest knowledge can never be perfect. Anyway, corrections of every kind are welcome in the comments section ~

All right, back to business. While reviewing, I stumbled across material on the Java virtual machine and got very interested, so today I am pulling together the computer memory model, the Java memory model, the Java object model, the JVM memory structure and related knowledge into one article of roughly 15,000 words. Thank you for reading.

Want to unlock more new tricks? Please visit my blog https://blog.tengshe789.tech/ 😘

Computer memory

I assume everyone owns a computer, and many of you have built one yourself. These days a capable DIY machine can be assembled for about 3,000 yuan: an i5-8400 CPU for 869 yuan, DDR4 memory for 1,200 yuan, a B360 motherboard for 300 yuan, a cooler for 50 yuan, a mechanical hard disk for 200 yuan, a 350 W power supply for 300 yuan and a case for 100 yuan. Yes, 3K is all you need for a capable 6-core, 6-thread box.

What is the most important component of a PC? If you look at the price, it’s CPU and memory, so let me tell you a little bit about the relationship between CPU and memory.

CPU and memory cache

Terms related to CPU

First, some CPU terminology:

  • Socket: the slot on the motherboard that the CPU plugs into.
  • Die: the die, also called the core chip, is one of the physical building blocks of a CPU. CPUs can be single-die or multi-die; for example, the mighty AMD TR-2990WX is a 4-die CPU with 8 cores per die.
  • Core: a physical core. The term was popularized by Intel, initially to differentiate itself from rival AMD, and later became widely used and somewhat diluted.
  • Thread: the number of hardware threads. In the past a CPU core usually supported only one thread; with today's powerful hyper-threading (SMT) technology, one core can run several. Some high-end CPUs, such as IBM's POWER9, support 32 threads on 8 cores, i.e. four threads per core, and their theoretical throughput is very strong.

To sum up, the star CPU, AMD's TR-2990WX, uses a single socket with 4 dies in it, for a total of 32 physical cores and 64 threads.

CPU cache

We all know that the data the CPU processes is kept in memory. But is memory by itself enough, or would memory plus the hard disk do, without anything in between?

Of course the answer is no. The CPU is extremely fast; memory is fast, but it still cannot keep pace with the CPU, and that is why the cache appeared. Unlike main memory, which belongs to the DRAM family, the cache is built from SRAM; compared with memory it is faster, smaller, more complex in structure and more expensive.

The reasons for the difference in memory and cache performance are as follows:

  1. DRAM needs only one capacitor and one transistor to store a bit, whereas SRAM needs six transistors. Because DRAM keeps its data in a capacitor, every read and write must charge or discharge that capacitor, which makes its access latency comparatively large.
  2. Storage can be viewed as a two-dimensional array in which every cell has a row address and a column address. SRAM is tiny, so its rows and columns are short and both addresses can be transferred in one go; DRAM has to transmit the row address and the column address separately.
  3. SRAM runs at a frequency close to the CPU's, while the gap between DRAM frequency and CPU frequency is comparatively large.

Modern caches are usually integrated into the CPU. To balance performance and cost, real caches use a pyramid-shaped multi-level architecture: when the CPU wants to read a piece of data, it first looks in the level-1 cache; if it is not there it looks in the level-2 cache; if it is still not there it goes on to the level-3 cache and finally to memory.

Here is the Skylake architecture that Intel has been using recently:

As you can see, each core has its own L1 and L2 cache, and all cores share an L3 cache. When the CPU wants to access data in memory, the request passes through L1, L2, L3 and the LLC (or L4) in turn.

Cache consistency issues

In the beginning, each CPU core ran only one thread, so cache consistency was not a concern: the core's cache was accessed by a single thread, it had the cache all to itself, and there were no access conflicts.

Later, hyper-threading technology entered the picture. With a single-core, multi-threaded CPU, several threads of one process can access shared data at the same time: the CPU loads a block of memory into its cache, and different threads that touch the same physical address map to the same cache location, so the cache does not become invalid even across thread switches. And because only one thread can actually be executing at any moment, cache access conflicts still cannot occur.

As time went on, multi-core, multi-threaded CPUs appeared. Now several threads of a process access a piece of shared memory while running on different cores, so each core keeps a copy of that shared memory in its own cache. Since the cores run in parallel, multiple threads may write to their own caches at the same time, and the data in those caches may then differ.

This is what we call the cache coherence (cache consistency) problem.

The best-known solution is Intel's MESI protocol, which we look at below.

The MESI protocol

Let's start with the unit in which the cache talks to memory. Most people know that this traffic is not measured in bytes but in blocks. Why is that?

Because data access has spatial locality: programs tend to touch a large amount of nearby data in the memory space. And since each transfer is slow to set up, reading one byte takes roughly the same time as reading N consecutive bytes.

On x86 CPUs, each cache line holds 64 bytes, and each level of the cache is divided into several groups of cache lines.
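To make the effect of cache lines and spatial locality concrete, here is a small illustrative sketch (my own example, not from this article; a serious measurement would use JMH): summing a large matrix row by row walks memory sequentially and reuses each 64-byte line, while summing it column by column jumps across lines and misses far more often, so the second loop is typically several times slower.

```java
public class CacheLineDemo {
    private static final int N = 4096;
    private static final int[][] matrix = new int[N][N];

    public static void main(String[] args) {
        long t0 = System.nanoTime();
        long rowSum = 0;
        for (int i = 0; i < N; i++)        // row-major: consecutive ints share a cache line
            for (int j = 0; j < N; j++)
                rowSum += matrix[i][j];
        long t1 = System.nanoTime();

        long colSum = 0;
        for (int j = 0; j < N; j++)        // column-major: each access lands on a different line
            for (int i = 0; i < N; i++)
                colSum += matrix[i][j];
        long t2 = System.nanoTime();

        System.out.printf("row-major: %d ms, column-major: %d ms%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
    }
}
```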

See 👉 Wikipedia for how caching works

Now let's look at the MESI protocol itself, which is named after the four states a cache line can be in. Every cache line in the CPU is tagged with one of four states (encoded in two extra bits):

  • M: Modified

    The cache line is cached only in this CPU's cache and is dirty, i.e. it differs from the data in main memory. At some future point the line must be written back to main memory before other CPUs are allowed to read that memory. After the write-back, the line's state becomes Exclusive.

  • E: Exclusive

    The cache line is cached only in this CPU's cache; it is clean, i.e. identical to main memory. The state changes to Shared as soon as another CPU reads that memory, and to Modified when this CPU writes to the line.

  • S: Shared

    The line may be cached by several CPUs at once, and every copy matches main memory (it is clean). When one CPU modifies the line, the copies held by the other CPUs are invalidated.

  • I: Invalid

    The line is invalid (another CPU may have modified it).

However, these four states alone can still cause trouble. Here is a passage paraphrased from the Oracle documentation:

Simultaneously updating individual elements that sit in the same cache line from different processors invalidates the entire cache line, even though the updates are logically independent of each other. Each update of an element marks the line as invalid. Other processors accessing a different element in the same line see the line marked as invalid, and they are forced to fetch a newer copy of the line from memory or elsewhere even though the element they access was never modified. This happens because cache coherence is maintained per cache line, not per element, so interconnect traffic and overhead both increase. Also, while the cache line is being updated, access to the elements in it is blocked.

The MESI protocol guarantees cache coherence, but it cannot prevent this kind of performance loss. The situation just described is called false sharing.

The false sharing problem

False sharing is a real problem in Java. Suppose you have a Java class like this:

class MyObject {
    long a;
    long b;
    long c;
}

According to the Java specification, MyObject instances are allocated on the heap, and the three fields a, b and c sit next to each other in memory, taking 8 bytes each, 24 bytes in total. Since an x86 cache line is 64 bytes, it is entirely possible for all three variables to land in one cache line and be shared by two different CPU cores!

According to the MESI protocol, if thread 1 and thread 2, running on different physical cores, each want exclusive access to one of these variables, they will very likely keep snatching the cache line from each other, turning what should be parallel work into effectively serial work and greatly reducing the system's concurrency. This is the false sharing problem of caches.

Resolving false sharing

The fix is simple: put these variables on different cache lines. Java 8 provides a general-purpose solution, the @Contended annotation, which ensures that the annotated fields (or the whole object) do not share a cache line with anything else ~ (note that for user classes it normally only takes effect with the -XX:-RestrictContended VM flag).

@Contended
class VolatileObject {
    volatile long a = 1L;
    volatile long b = 2L;
    volatile long c = 3L;
}
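@Contended lives in sun.misc (jdk.internal.vm.annotation in newer JDKs). Before Java 8, or where the flag above is not an option, a common workaround (used by libraries such as the Disruptor and by the JDK's own LongAdder) was manual padding; a rough, hedged sketch:

```java
// Hand-padded variant (illustrative only): the p1..p7 fields are never read,
// they merely push the hot field onto its own 64-byte cache line.
class PaddedVolatileLong {
    volatile long value = 0L;
    long p1, p2, p3, p4, p5, p6, p7; // 7 * 8 B of padding after the hot field
}
```

Note that an aggressive JIT may eliminate fields it can prove are unused, which is why real implementations resort to inheritance tricks to keep the padding in place.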

Memory inconsistency problem

I mentioned above that the MESI protocol addresses cache coherence inside a multi-core CPU. Now let's talk about non-uniform memory access across CPUs.

Three CPU architectures

First, three terms to understand:

  • SMP(Symmetric Multi-Processor)

SMP, a symmetric multiprocessing system, contains many tightly coupled processors. In such a system all CPUs share all resources, such as the bus, the memory and the I/O system, and there is only one copy of the operating system or database manager. The biggest characteristic of this architecture is that everything is shared: the CPUs are all equal, with equal access to memory and peripherals, under one operating system. The operating system manages a queue, and each processor takes processes from that queue in turn. If two processors request the same resource at the same time (for example the same memory address), a hardware or software locking mechanism resolves the contention.


A symmetric multi-processor architecture means that the CPUs in a server work symmetrically, with no primary/secondary or master/slave relationship. All CPUs share the same physical memory, and every CPU takes the same time to access any memory address; for this reason SMP is also called Uniform Memory Access (UMA). An SMP server can be scaled up by adding memory, using faster CPUs, adding CPUs, expanding I/O (slots and buses) and adding external devices (usually disk storage).

The main characteristic of an SMP server is sharing: every resource in the system (CPU, memory, I/O and so on) is shared, and this is exactly what limits its scalability. Every shared component can become a bottleneck, and memory is the most constrained one. Since each CPU must go through the same memory bus to reach the same memory, memory access conflicts rise quickly as CPUs are added, CPU resources are wasted and the marginal value of each extra CPU drops sharply. Experience shows that SMP servers utilize their CPUs best with somewhere between 2 and 4 of them.


  • NUMA(Non-Uniform Memory Access)

NUMA is one of the results of the effort to scale large systems effectively despite SMP's limits. With NUMA, dozens (even hundreds) of CPUs can be combined into a single server. The structure of the CPU modules in a NUMA server is shown in the figure below:

A NUMA server is built from multiple CPU modules. Each module consists of several CPUs (four in the figure) and has its own local memory, I/O slots and so on. Because the modules (nodes) are connected through interconnects such as crossbar switches, every CPU can reach the memory of the whole system (an important difference between NUMA and MPP systems). Obviously, accessing local memory is much faster than accessing remote memory (memory belonging to other nodes), which is where the name Non-Uniform Memory Access comes from. Because of this property, applications should minimise the traffic between different CPU modules if they want to get the best out of the system.

NUMA solves the scaling problem of the original SMP design and can support hundreds of CPUs in one physical server. Typical NUMA servers include HP's Superdome, the SUN 15K and the IBM 690.

NUMA has drawbacks too. Because the latency of remote memory far exceeds that of local memory, performance does not grow linearly with the number of CPUs. When HP released the Superdome, for example, it published its relative performance against other HP UNIX servers: a 64-CPU Superdome (NUMA architecture) scored 20, while an 8-way N4000 (shared-memory SMP architecture) scored 6.3 — eight times the CPUs for only about three times the performance.

  • MPP(Massive Parallel Processing)

Unlike NUMA, MPP provides a different way to scale the system: multiple SMP servers connected through a node interconnect network, cooperating on the same task; from the user's point of view it is one server system. Its basic characteristic is that many SMP servers (each called a node) are connected by a node interconnect, and each node accesses only its own local resources (memory, storage and so on). It is a completely Share-Nothing structure and therefore scales best; in theory its size is unlimited, and current technology can connect 512 nodes and thousands of CPUs. There is no industry standard for the node interconnect — NCR's Bynet and IBM's SP Switch, for instance, use different internal mechanisms — but the interconnect is used only inside the MPP server and is transparent to users.

In an MPP system each SMP node can also run its own operating system, database and so on. Unlike NUMA, however, there is no remote memory access: a CPU in one node simply cannot address the memory of another node. Nodes exchange information through the node interconnect, a process known as Data Redistribution.

An MPP server does, however, need a sophisticated mechanism to schedule and balance the load and parallel work across nodes. Current MPP-based servers tend to hide this complexity behind system-level software such as the database: NCR's Teradata, for example, is a relational database built on MPP, and developers build applications against it in the same way regardless of how many nodes sit in the back end, without worrying about load scheduling on any particular node.

Massively Parallel Processing (MPP) systems, then, are composed of many loosely coupled processing units. Note that these are processing units, not processors: the CPUs in each unit have their own private resources such as bus, memory and disks, and each unit runs its own copy of the operating system and database manager. The defining feature of this structure is that nothing is shared.

Cache consistency under NUMA structure

The MESI protocol handles cache coherence under the traditional SMP architecture. To keep caches coherent under NUMA, Intel introduced an extension of MESI called MESIF; there is very little public material about it, so I have not been able to dig deeper — consult the Intel documentation or Wikipedia for more information.

Java memory model

Origins

Why should we care about the memory model when we write programs? As mentioned above, the cache coherence problem and the memory consistency problem are side effects of hardware advances. The simplest and most direct fix would be to disable CPU caching and let the CPU talk to main memory directly; that would indeed avoid the concurrency problems between threads, but it would also be a giant step backwards in performance.

Therefore, to ensure that concurrent programs get atomicity, visibility and ordering, an important concept is introduced: the memory model.

That is, to guarantee the correctness of shared memory (visibility, ordering, atomicity), the memory model defines the specification for read and write operations of multi-threaded programs in a shared memory system ~

JMM

The Java Memory Model (JMM) is not a concrete thing the way the JVM memory structure is; it is a mechanism and a specification that conforms to the memory-model idea. It hides the differences between kinds of hardware and operating systems so that a Java program behaves the same way with respect to memory access on every platform. As described in JSR-133, "Java Memory Model and Thread Specification", the JMM is about multithreading: it describes a set of rules that define when a write to a shared variable by one thread becomes visible to another thread.

To sum it up briefly: Java threads communicate through shared memory, and that communication raises a series of issues such as visibility, atomicity and ordering. The JMM is the model built around these properties of inter-thread communication, and it defines the constructs that map onto the Java language as the volatile and synchronized keywords, among others.

In the JMM, the shared memory that threads communicate through is called main memory, while in concurrent programming each thread also maintains its own local (working) memory — an abstraction — holding copies of data from main memory. The JMM mainly governs how data moves between local memory and main memory.

The JMM is a very important concept in Java, and it is thanks to the JMM that concurrent programming in Java avoids many problems.

The JMM application

Those of you familiar with Java multithreading know that Java provides a number of keywords and libraries for concurrent programming, such as volatile, synchronized, final and the java.util.concurrent package. These are the primitives that the Java memory model exposes to us after encapsulating the underlying implementation.

When developing multithreaded code, we can directly use keywords like synchronized to control concurrency and never need to worry about underlying compiler optimizations, cache consistency, and so on. Therefore, the Java memory model, in addition to defining a set of specifications, provides a set of primitives that encapsulate the underlying implementation for developers to use directly.

Concurrency programming addresses the issues of atomicity, order, and visibility, so let’s take a look at what is guaranteed in Java.

Atomicity

Atomicity means that an operation cannot be interrupted midway by CPU thread scheduling: it either runs to completion or does not run at all.

The JMM guarantees atomicity for reads and writes of basic data types (writing a working-memory variable back to main memory itself involves two steps, store and write), but real business scenarios often need atomicity over a much larger scope.

To provide larger-scale atomicity, the JVM offers two bytecode instructions, monitorenter and monitorexit, and the Java keyword behind them is synchronized.

Therefore, synchronized can be used in Java to ensure that the operations inside a method or code block are atomic. For a deeper dive I recommend the article "In-depth understanding of the synchronized implementation in Java concurrency".
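As a minimal sketch (class and method names are mine, for illustration only): count++ alone is a read-modify-write sequence and is not atomic, while wrapping it in synchronized makes the whole operation atomic because the JVM emits monitorenter/monitorexit around the method body.

```java
class Counter {
    private int count = 0;

    // Not atomic: two threads can interleave the read, add and write and lose updates
    void unsafeIncrement() {
        count++;
    }

    // Atomic: only one thread at a time can hold the monitor and run the body
    synchronized void safeIncrement() {
        count++;
    }

    synchronized int get() {
        return count;
    }
}
```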

Visibility

Visibility means that when multiple threads access the same variable and one thread changes the value of the variable, other threads can immediately see the changed value.

The Java memory model relies on main memory as a transfer medium by synchronizing the new value back to main memory after a variable is modified and flushing the value from main memory before the variable is read.

The volatile keyword in Java guarantees that a modified variable is synchronized to main memory immediately after the write, and that the variable is refreshed from main memory before each use. Therefore volatile can be used to guarantee the visibility of variables across threads.
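A hedged sketch of the classic visibility problem (the class and field names are made up for illustration): without the volatile modifier the worker thread may keep reading a stale copy of the flag from its working memory and spin forever; with volatile, the write in main() is flushed to main memory and the worker's next read picks it up.

```java
public class StopFlagDemo {
    // Try removing 'volatile': on some JVMs the loop below never terminates
    private static volatile boolean stopped = false;

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            while (!stopped) {
                // busy-wait; each read of a volatile field goes back to main memory
            }
            System.out.println("worker saw stopped = true");
        });
        worker.start();
        Thread.sleep(1000);
        stopped = true; // volatile write: synchronized back to main memory immediately
    }
}
```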

Besides volatile, the Java keywords synchronized and final can also provide visibility. Here are my reading notes:

Ordering

Ordering means that the program appears to execute in the order in which the code was written.

In Java, synchronized and volatile can be used to ensure order between multiple threads. Implementation methods are different:

The volatile keyword forbids instruction reordering around the variable; the synchronized keyword guarantees that only one thread enters the critical section at a time.
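A classic place where the two keywords cooperate on ordering is double-checked locking (my sketch, not from the original text): without volatile, the JIT or CPU may reorder "allocate, initialize, publish reference" so that another thread sees a non-null but half-constructed object.

```java
class Singleton {
    // volatile forbids reordering the constructor with the reference assignment
    private static volatile Singleton instance;

    private Singleton() { }

    static Singleton getInstance() {
        if (instance == null) {                  // first check without locking
            synchronized (Singleton.class) {     // only one thread may initialize
                if (instance == null) {
                    instance = new Singleton();
                }
            }
        }
        return instance;
    }
}
```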

Ok, so this is a brief introduction to the keywords that can be used to solve atomicity, visibility, and order in Java concurrent programming. As readers may have noticed, the synchronized keyword seems to be all-purpose, satisfying all three of these attributes at once, which is why so many people abuse synchronized.

Synchronized, however, is a performance deterrent, and while the compiler provides many lock optimization techniques, overuse is not recommended.

JVM

As we all know, Java code runs on a virtual machine. While the virtual machine executes a Java program, the memory it manages is divided into several data areas, each with its own purpose. Let's go through the JVM runtime memory structure.

JVM runtime memory area structure

The JVM runtime memory region structure is described in the Java Virtual Machine Specification (Java SE 8) as follows:

1. Program counter

The Program Counter Register is also known as the PC register. In assembly language the program counter is a register inside the CPU: it stores the address of the instruction currently being executed (or the address of the next instruction to execute). When the CPU needs to execute an instruction it takes that address from the program counter and fetches the instruction there; after the fetch, the counter is automatically incremented, or loaded with the target address of a branch, and so on until all instructions have been executed.

Although a program counter in the JVM is not a physical CPU register, as it is in assembly language, the function of a program counter in the JVM is logically equivalent to that of an assembly language program counter, that is, to indicate which instruction to execute.

Because multithreading in the JVM is implemented by switching threads onto slices of CPU time, a CPU core executes instructions from only one thread at any given moment. To make sure each thread can resume from the right place after a switch, every thread needs its own independent program counter that other threads cannot interfere with; otherwise the normal execution order of the program would be broken. So it is safe to say that the program counter is private to each thread.

According to the JVM specification, if a thread is executing a non-native method, the program counter stores the address of the instruction that is currently being executed. If the thread executes a native method, the value in the program counter is undefined.

Because the space the program counter occupies does not change as the program runs, the program counter is the only area for which no OutOfMemoryError can occur.

2. The Java stack

The Java Stack, also known as the Java Virtual Machine Stack and often simply called the stack, is similar to the stack segment in C. In fact, the Java stack is the in-memory model for the execution of Java methods. Why say that? Read on.

The Java stack holds stack frames, each of which corresponds to one method call. A stack frame contains the local variable table, the operand stack, a reference to the runtime constant pool of the class the current method belongs to (the runtime constant pool is covered in the method area section), the method return address, and some additional information. When a thread calls a method it creates the corresponding stack frame and pushes it; when the method finishes, the frame is popped. So the frame of the method the thread is currently executing is always at the top of its Java stack. This also explains why deep recursion so easily overflows this area, and why stack space never has to be managed by the programmer (in Java the programmer does not manage heap memory either, since garbage collection takes care of it): allocation and release of stack space are done automatically by the system, and in every programming language this part of memory is opaque to the programmer. The following diagram shows a Java stack model:

The local variable table, as the name suggests, stores the local variables of a method (including non-static variables declared in the method body and the method's parameters). For a variable of a primitive type the value itself is stored; for a reference type, a reference to the object is stored. The size of the local variable table is determined at compile time, so it does not change while the program runs.

The operand stack: if you remember the classic expression evaluation problem, evaluating expressions is one of the most typical uses of a stack. A thread executing a method is really executing statement after statement, which ultimately means doing calculations, and all the calculations in a program are carried out with the help of the operand stack.

A reference to a run-time constant pool. Because constants from a class may be needed during method execution, a reference to a run-time constant is required.

Method return address. When a method is finished, it returns to where it was called before, so a method return address must be stored in the stack frame.

Because each thread may be executing a different method, each thread has its own Java stack that does not interfere with each other.
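As mentioned above, deep recursion keeps pushing frames that are never popped and eventually exhausts the thread's stack. A tiny, illustrative demonstration (the exact depth depends on the -Xss stack size and on how large each frame is):

```java
public class StackOverflowDemo {
    private static int depth = 0;

    private static void recurse() {
        depth++;      // every call pushes a new stack frame
        recurse();    // never returns, so frames are never popped
    }

    public static void main(String[] args) {
        try {
            recurse();
        } catch (StackOverflowError e) {
            System.out.println("stack overflowed at depth " + depth);
        }
    }
}
```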

3. The native method stack

The native method stack works much like the Java stack; the only difference is that the Java stack serves the execution of Java methods while the native method stack serves the execution of native methods. The JVM specification does not mandate any particular implementation or data structure here, so virtual machines are free to implement it as they like; the HotSpot VM simply merges the native method stack with the Java stack.

4. The heap

In C, the heap is the only area of memory that programmers can manage. The malloc and free functions allow the programmer to claim and free space on the heap. So what does it look like in Java?

The Java heap stores the objects themselves as well as arrays (references to arrays, of course, live on the Java stack). Unlike in C, the programmer does not need to worry about releasing this space: Java's garbage collector handles it automatically, which also makes the heap the main area the garbage collector manages. In addition, the heap is shared by all threads, and there is only one heap per JVM.

5. The method area

The method area is another very important region of the JVM and, like the heap, it is shared by all threads. It stores the information of every loaded class (class name, method information, field information), static variables, constants, and code compiled by the JIT compiler.

In addition to the description of a Class’s fields, methods, interfaces, and so on, another piece of information in the Class file is the constant pool, which stores literal and symbolic references generated during compilation.

A very important part of the method area is the runtime constant pool, the runtime representation of each class's or interface's constant pool, created when that class or interface is loaded into the JVM. It is not only fed from the class file's constant pool: new constants can also be added at run time, the String.intern method being the classic example.
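A small illustration of intern at work (expected output shown as comments; it assumes a HotSpot JVM on JDK 7 or later, where the interned string table no longer lives in the permanent generation):

```java
public class InternDemo {
    public static void main(String[] args) {
        String a = new String("jmm"); // a fresh object on the heap
        String b = "jmm";             // a literal resolved through the constant pool

        System.out.println(a == b);           // false: two different objects
        System.out.println(a.intern() == b);  // true: intern() returns the pooled reference
    }
}
```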

The JVM specification does not require the method area to implement garbage collection. Many people call the method area the "permanent generation", because the HotSpot VM used to implement it as a permanent generation so that its garbage collector could manage this region the same way it manages the heap, without a dedicated mechanism. Since JDK 7, however, HotSpot has been moving content out of the permanent generation, for example the runtime constant pool.

Memory layout of the Java object model

Java is an object-oriented language, and the storage of Java objects in the JVM has a certain structure. This storage model of Java objects themselves is called the Java Object Model.

In HotSpot VIRTUAL machine, an OOP-Klass Model is designed. OOP (Ordinary Object Pointer) refers to a Pointer to a common Object, whereas Klass describes the specific type of Object instance.

For each Java class, when it is loaded by the JVM, the JVM creates an instanceKlass for it, stored in the method area, to represent the Java class inside the JVM. When we create an object with new in Java code, the JVM creates an instanceOopDesc object, whose in-memory layout can be divided into three parts: the object header, the instance data and the alignment padding (a small runnable sketch follows the list below).

  1. Object header: mark word (32-bit VM 4 B, 64-bit VM 8 B) + type pointer (32-bit VM 4 B, 64-bit VM 8 B) + [array length (only for array objects)]
  2. Instance data: fields are laid out in the order longs/doubles, ints, shorts/chars, bytes/booleans, oops (ordinary object pointers); fields of the same width are allocated together so they can be fetched efficiently, and fields defined by a parent class appear before fields defined by the subclass.
  3. Alignment padding: on a 64-bit VM the object size must be a multiple of 8 B, otherwise padding bytes are added.
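To see a real layout rather than guessing the padding by hand, the OpenJDK JOL tool can print it. This is a hedged sketch: it assumes the jol-core library (org.openjdk.jol:jol-core) is on the classpath, which the original article does not use, and the exact offsets depend on the JVM and its flags (e.g. compressed oops).

```java
import org.openjdk.jol.info.ClassLayout;

class Example {
    long a;
    int b;
    boolean c;
}

public class LayoutDemo {
    public static void main(String[] args) {
        // Prints header size, field offsets, alignment gaps and the total instance size
        System.out.println(ClassLayout.parseClass(Example.class).toPrintable());
    }
}
```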

JVM garbage collectors

To understand the existing collectors we first need some terminology. At its most basic, garbage collection means identifying memory that is no longer in use and making it reusable. Modern collectors do this work in several phases, which we usually describe as follows:

  • Parallel – while the JVM is running there are application threads and garbage collector threads. A parallel phase is one carried out by multiple GC threads, i.e. the GC work is split among them; it says nothing about whether the application threads have to be paused.
  • Serial – a serial phase is carried out on a single GC thread. As above, it says nothing about whether the application threads have to be paused.
  • Stop The World (STW) – in an STW phase the application threads are paused so the GC can do its work. When an application appears to freeze because of GC, it is usually a Stop The World phase.
  • Concurrent – in a concurrent phase the GC threads run at the same time as the application threads. Concurrent phases are complex, because the application may change the state the GC is working on before the phase completes, which can invalidate some of its work.
  • Incremental – an incremental phase can run for a while and then terminate early, for instance because a higher-priority GC phase has to run, while still having done useful work. This is in contrast to phases that must run to completion.

Serial collector

The most basic collector is the Serial collector, a single-threaded collector that is still the JVM's default young-generation collector in Client mode. It does have one advantage over the others: it is simple and efficient (compared with the single-threaded mode of the other collectors), because with no thread-interaction overhead it can devote itself entirely to collecting. On a user desktop, where the memory given to the JVM is not very large, pauses of a few tens to around a hundred milliseconds are perfectly acceptable as long as collections are infrequent.

ParNew collector

ParNew is the multi-threaded version of Serial and is identical to it in collection algorithm and object-allocation rules. The ParNew collector is the preferred new-generation collector for many JVMs running in Server mode, mainly because, apart from the Serial collector, it is currently the only one that can work together with the CMS collector.

Parallel Scavenge collector

The Parallel Scavenge collector is a new generation garbage collector that uses a replication algorithm as well as a Parallel multithreaded collector.

The Parallel Scavenge collector focuses on achieving a controllable throughput, where throughput = time running user code / (time running user code + garbage collection time). It might seem that the smaller the maximum GC pause, the better, but shorter pauses are bought at the cost of throughput and of new-generation space. For example, suppose collections used to happen every 10 seconds with 100 ms pauses and now happen every 5 seconds with 70 ms pauses: the pauses are shorter, but throughput drops from 10 / 10.1 ≈ 99.0% to 5 / 5.07 ≈ 98.6%.

Shorter pauses are better for applications that need to interact with the user; The high throughput can make the most efficient use of CPU time and complete computing tasks as soon as possible, which is mainly suitable for background computing.

Serial Old collector

The Serial Old collector, the old-generation counterpart of the Serial collector, is likewise single-threaded and collects using the mark-compact algorithm. It behaves just like the Serial collector.

Parallel Old collector

The Parallel Old collector is the old-generation counterpart of the Parallel Scavenge collector; it uses multiple threads and the mark-compact algorithm. It is typically used together with the Parallel Scavenge collector — "throughput first" is the hallmark of this combination — and it suits applications that care about throughput and are sensitive to CPU resources.

CMS collector

The CMS (Concurrent Mark Sweep) collector aims for the shortest possible pause time. It uses the mark-sweep algorithm and works on the old generation. It mainly runs in the following steps:

  • Initial mark
  • Concurrent mark
  • Remark
  • Concurrent sweep

The initial mark and the remark still require "Stop The World". The initial mark only marks the objects directly reachable from the GC Roots; the concurrent mark is the GC Roots tracing phase; and the remark corrects the markings of objects whose state changed while the user program kept running during the concurrent mark.

Overall, the CMS collection process runs concurrently with the user threads: the collector threads and the user threads work side by side during the most time-consuming phases, concurrent marking and concurrent sweeping. Although CMS has the advantages of concurrent collection and low pauses and is on the whole already a very good collector, it has three notable drawbacks:

  1. The CMS collector is sensitive to CPU resources. In the concurrent phases it does not pause the user threads, but it does occupy a portion of the threads (CPU resources) and therefore slows the application down.
  2. The CMS collector cannot handle floating garbage. "Floating garbage" is garbage that appears during the concurrent marking phase because the user program is still running; it shows up after marking, so CMS cannot deal with it in the current cycle and has to leave it for the next GC. Because the program must keep running during collection, CMS also has to reserve enough memory for the user threads: it cannot wait until the old generation is nearly full, as other collectors do, but must leave headroom while it collects concurrently. If the reserved space turns out to be insufficient, the JVM falls back on its contingency plan: the Serial Old collector is started temporarily to collect the old generation, and the resulting pause is long.
  3. Because CMS uses the mark-sweep algorithm, collection leaves a lot of memory fragmentation. When fragmentation becomes excessive it gets hard to allocate large objects, and that is when a Full GC is triggered.
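For reference, a hedged example of the flags typically used to turn CMS on in the JDK 8 era (CMS was deprecated in JDK 9 and removed in JDK 14, so treat these as historical):

```
-XX:+UseConcMarkSweepGC                  # CMS for the old generation (pairs with ParNew for the young)
-XX:CMSInitiatingOccupancyFraction=70    # start a concurrent cycle when the old generation is ~70% full
-XX:+UseCMSInitiatingOccupancyOnly       # always obey the threshold above instead of the JVM's heuristics
```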

G1 collector

The G1 collector has significant improvements over CMS:

· The G1 collector uses the mark-compact algorithm, so collection does not leave memory fragmentation.

· Pauses can be controlled very precisely.

G1 can deliver low-pause collection with essentially no sacrifice in throughput. It does this by avoiding whole-heap collection: G1 divides the Java heap (new generation and old generation alike) into multiple regions and maintains a priority list in the background; each time, within the allowed pause time, it collects the regions containing the most garbage first.
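G1 became the default collector in JDK 9; on JDK 8 it has to be enabled explicitly. A hedged example of typical flags:

```
-XX:+UseG1GC                # enable G1 (unnecessary on JDK 9+, where it is the default)
-XX:MaxGCPauseMillis=200    # soft pause-time goal G1 uses when deciding how many regions to collect
-XX:G1HeapRegionSize=4m     # optional: region size, a power of two between 1m and 32m
```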

ZGC collector

A new addition in Java 11 is the ZGC garbage collector, which boasts GC pauses of less than 10 ms. ZGC brings two new techniques to the HotSpot garbage collectors: colored pointers and read barriers. The following summary is based on a foreign article:

Colored pointers

Colored pointers are a technique for storing information in the pointers themselves (references, in Java terms). Because 64-bit platforms give pointers far more address space than is needed (ZGC only supports 64-bit platforms), some of the bits can be used to store state. ZGC limits the heap to at most 4 TB (42 bits of address), which leaves 22 bits available; it currently uses 4 of them: finalizable, remap, mark0 and mark1. We will explain their purpose later.
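My sketch of the 64-bit colored pointer layout described above (bit positions follow the public ZGC talks for the initial JDK 11 version and may differ in later releases):

```
 63              46 45 44 43 42 41                                0
+------------------+--+--+--+--+----------------------------------+
|      unused      | F| R|M1|M0|  object address (42 bits = 4 TB)  |
+------------------+--+--+--+--+----------------------------------+
 F = finalizable   R = remap   M1 = mark1   M0 = mark0
```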

One problem with colored pointers is that extra work is needed whenever the pointer has to be "uncolored" (the information bits must be masked off). Platforms like SPARC have hardware support for pointer masking, so it is not an issue there; for x86 the ZGC team uses a neat multi-mapping trick.

Multiple mapping

To understand how multi-mapping works, we need to briefly explain the difference between virtual and physical memory. Physical memory is the actual RAM available to the system, usually the capacity of the installed DRAM chips. Virtual memory is an abstraction that gives each application its own (usually isolated) view of physical memory. The operating system maintains the mapping between virtual and physical ranges using page tables, while the processor's memory management unit (MMU) and translation lookaside buffer (TLB) translate the addresses the application requests.

Multi-mapping means mapping several different ranges of virtual memory onto the same physical memory. Because, by design, only one of remap, mark0 and mark1 can be 1 at any point in time, three mappings are enough to achieve this. There is a nice diagram in the ZGC source code illustrating it.

Read barrier

Read barriers are snippets of code that run whenever an application thread loads a reference from the heap (i.e. accesses a non-primitive field of an object):

void printName( Person person ) {
    String name = person.name;  // This triggers the read barrier
                                // Because the reference needs to be read from the heap
                                // 
    System.out.println(name);   // There is no direct read barrier
}

In the code above, String name = person.name loads a reference from the heap into the local variable name, and that is what triggers the read barrier. The System.out.println line does not trigger a read barrier directly, because no reference is loaded from the heap (name is a local variable), but other read barriers may well be triggered inside System.out or println.

This is in contrast to the write barriers used by other GCs such as G1. The job of the read barrier is to examine the state of the reference and possibly do some work before returning it (possibly even a different reference) to the application. In ZGC it does this by testing whether certain bits of the loaded reference are set: if the test passes, nothing more happens; if it fails, some phase-specific task is carried out before the reference is handed back to the application.

Marking

Now that we know what these two techniques are, let's look at ZGC's GC cycle.

The first part of the GC cycle is the tag. Tagging involves finding and tagging all heap objects accessible to the running application, in other words, finding objects that are not garbage.

ZGC's marking is divided into three phases. The first is an STW phase in which the GC roots are marked as live. GC roots are things like local variables through which other objects on the heap can be reached; an object that cannot be reached by walking the object graph starting from the roots cannot be used by the application and is therefore considered garbage. The set of objects reachable from the roots is called the live set. The root-marking step is very short because the total number of roots is usually small.

When that phase completes, the application resumes and ZGC starts the next phase, which concurrently walks the object graph and marks every reachable object. During this phase the read barrier tests every loaded reference against a mask that tells whether it has been marked yet; unmarked references are added to a queue for marking.

After the traversal finishes there is a final, short Stop The World phase that handles some edge cases (which we will ignore for now), and the marking phase is complete.

Relocation

The next major part of the GC cycle is relocation, i.e. moving live objects in order to free parts of the heap. Why move objects instead of just filling the gaps? Some GCs do exactly that, but it has the unpleasant consequence that allocation becomes more expensive, because the allocator has to search for a free slot to put each new object in. By contrast, if large contiguous chunks of memory can be freed, allocation is as simple as bumping a pointer by the size of the new object.

ZGC divides the heap into many pages. At the start of this phase it concurrently selects the set of pages whose live objects need relocating. Once the relocation set is selected there is a Stop The World pause in which ZGC relocates the root objects in that set and remaps their references to the new locations. As with the previous Stop The World step, the pause depends only on the number of roots and on the ratio of the relocation set to the total live set, which is usually fairly small; so, unlike with many collectors, the pause time does not grow as the heap grows.

After the roots have been moved, the next phase is concurrent relocation: GC threads walk the relocation set and relocate all objects in the pages it contains. If an application thread tries to load an object before the GC has relocated it, the application thread relocates the object itself, via the read barrier (which is triggered when a reference is loaded from the heap).

This ensures that all references seen by the application are updated, and that it is impossible for the application to operate on the relocated objects at the same time.

The GC threads will eventually relocate every object in the relocation set, yet there may still be references pointing to the old locations of those objects. The GC could walk the object graph and remap all those references to the new locations, but that is an expensive step, so it is merged with the next marking phase: while traversing the object graph during the next cycle's marking, any reference found to be not yet remapped is remapped and then marked as live.

JVM memory optimization

"In-depth Understanding of the Java Virtual Machine" discusses a number of JVM optimization ideas; I will briefly go through some of them.

Java Memory jitter

Heap memory has a fixed size and can hold only so much data. When the heap fills up, garbage collection kicks in, reclaiming objects that are no longer used to free memory. The term memory jitter (memory churn) describes allocating and discarding large numbers of objects in a very short time, which keeps triggering this cycle. For concrete optimization techniques, please Google ~

JVM large page memory

What is memory paging?

The CPU accesses memory by address. A 32-bit CPU has an addressing range from 0 to 0xFFFFFFFF, i.e. 4 GB, which means it can address at most 4 GB of physical memory. In practice, however, a program may need more memory than the physical memory available, forcing it to shrink its footprint. To solve this kind of problem, modern CPUs introduced the Memory Management Unit (MMU).

The core idea of the MMU is to use virtual addresses instead of physical ones: the CPU addresses memory with virtual addresses and the MMU translates them into physical addresses. Introducing the MMU lifts the limitation of physical memory: to the program it looks as if it has the full 4 GB to itself.

Memory paging is a memory management mechanism built on top of the MMU. It splits the virtual address space into pages and the physical address space into page frames of a fixed size (4 KB), with page and page frame always the same size. In terms of data structures this guarantees efficient memory access and lets the OS support non-contiguous memory allocation. When physical memory runs short, rarely used pages can also be moved out to another storage device such as a disk, which is what virtual memory (swapping) is.

Note that virtual addresses must be mapped to physical addresses for the CPU to work, and that mapping has to be stored somewhere. In modern CPU architectures the mappings live in physical memory, in a structure called the page table. Page tables are in memory, and going to memory over the bus is of course slower than reading a register, so to optimize further, modern CPUs add a Translation Lookaside Buffer (TLB) that caches the most frequently used page table entries.

Why support large memory pages?

The TLB is finite, that much is certain. When the working set exceeds what the TLB can map, a TLB miss occurs and the CPU has to walk the page table in memory. If TLB misses happen frequently, application performance degrades rapidly.

To allow the TLB to store more page address mappings, we increased the memory page size.

Comparing 4 MB pages with 4 KB pages, the same number of TLB entries covers roughly a thousand times more memory, which is a considerable performance win.

Enable JVM large page memory

Large pages are enabled with the JVM flags -XX:+UseLargePages -XX:LargePageSizeInBytes=10m

Improve JVM memory usage with soft and weak references

Strong, soft, weak and phantom references
  1. Strong references:

As long as the reference exists, the garbage collector will never collect it

Object obj = new Object();

obj.equals(new Object());

The new Object() that obj points to is held by a strong reference and will only be reclaimed after the reference obj itself is released. This is the form we use most of the time.

  2. Soft references (can be used to implement caches):

A softly referenced object is one the program can live without; it is reclaimed before memory runs out (before an OutOfMemoryError would be thrown). It can be used like this:

Object obj = new Object();

SoftReference<Object> sf = new SoftReference<Object>(obj);

obj = null;

sf.get();// Sometimes null is returned

sf is a soft reference to obj, and the object can be retrieved through sf.get(); of course, once the object has been marked for reclamation, that method returns null. Soft references behave like a cache: while memory is sufficient, the value can be fetched through the soft reference without querying the slow real data source, which is faster; when memory runs low, the cached data is dropped automatically and has to be re-queried from the real source.

  3. Weak references (used to prevent memory leaks, for example in callbacks):

A weakly referenced object survives only until the next garbage collection. It can be used like this:

Object obj = new Object();

WeakReference<Object> wf = new WeakReference<Object>(obj);

obj = null;

wf.get();// Sometimes null is returned

wf.isEnqueued();// Returns whether the reference has been enqueued, i.e. whether the garbage collector has marked the object as garbage to be collected

Weakly referenced objects are reclaimed at the next garbage collection. For a short window the data can still be fetched through the weak reference; once that collection has run, get() returns null. Weak references are mainly used to monitor whether an object has been marked by the garbage collector as garbage to be collected, which the weak reference's isEnqueued method reports.

  4. Phantom references:

A phantom reference never lets you retrieve the object's value; it only tells you when the object has been collected. It can be used like this:

ReferenceQueue<Object> queue = new ReferenceQueue<Object>();
Object obj = new Object();
PhantomReference<Object> pf = new PhantomReference<Object>(obj, queue); // a phantom reference must be registered with a queue
obj = null;
pf.get();// Always returns null
pf.isEnqueued();// Returns whether the referent has been removed from memory

A phantom reference is collected at every garbage collection; its get method always returns null, hence the name "phantom". Phantom references are used to detect when an object has been removed from memory.

Optimization

In short, use soft references for caches that hold large numbers of objects; for details please refer to http://www.cnblogs.com/JavaArchitect/p/8685993.html. A sketch follows.
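A minimal sketch of such a soft-reference cache (the class name and structure are mine, for illustration only; a production cache would also register a ReferenceQueue to purge stale entries):

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

// Values are held only through SoftReferences, so under memory pressure the GC
// may reclaim them instead of throwing an OutOfMemoryError.
class SoftCache<K, V> {
    private final Map<K, SoftReference<V>> map = new HashMap<>();

    public synchronized void put(K key, V value) {
        map.put(key, new SoftReference<>(value));
    }

    public synchronized V get(K key) {
        SoftReference<V> ref = map.get(key);
        V value = (ref == null) ? null : ref.get();
        if (value == null) {
            map.remove(key); // the referent was collected (or never cached); caller reloads from the real source
        }
        return value;
    }
}
```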

Conclusion

This article runs to about 15,000 words in total, going from the computer's physical memory system to the Java memory model, and from the Java memory model to the JVM's memory-related topics. Please give it a thumbs up if you found it useful. I will post it first on my personal blog and later on Juejin and other platforms. Finally, thank you very much for reading ~

The resources

Various hyperlinks in the article

In-depth Understanding of the Java Virtual Machine

The Art of Concurrent Programming in Java

Architecture Decryption from Distributed to Microservices

SMP, NUMA, MPP architecture introduction

ZGC Principle (requires a proxy/VPN to access)

Stefan Karlsson and Per Liden, Jfokus talk (requires a proxy/VPN to access)

The statement

[Copyright Notice] This article is original content, released under the MIT license terms. Please comply with the corresponding obligation, i.e. the licensee must include the copyright notice in all copies. Thanks for your cooperation.

Want to unlock more new tricks? Please visit my blog https://blog.tengshe789.tech/ 😘 GitHub address: https://github.com/tengshe789/ — follows welcome, I will follow back.