The transition from native machine code to bytecode as a result of code compilation is a small step in the development of storage formats, but a giant leap for programming languages. — Understanding the Java Virtual Machine

Computers only understand 0 and 1, so the programs we write have to be translated into binary native machine code before the machine can run them. With the development of virtual machines, however, many languages, including Java, have chosen to store compiled code in a neutral format that is independent of the operating system and the machine instruction set.

Independence

We all know the classic Java slogan, “Write once, run anywhere.” To achieve this goal, the virtual machine built for each platform needs to read the same uniform data. This data does not depend on any platform, or even on the language it was compiled from: as long as the format is consistent, the virtual machine can use it correctly. This unified format is bytecode, stored in a Class file.

The Class file stores Java virtual machine instructions and a symbol table, along with several other pieces of auxiliary information. For security reasons, the Class file format imposes many mandatory syntactic and structural constraints.

The structure of the Class file

Let’s look at the structure of the Class file, the backbone of this article. Although the description is based on JDK 1.4, it covers the most important and fundamental instructions and attributes in Class files. Later versions only extend it.

Each Class file holds the definition of exactly one class or interface. On the other hand, a class or interface does not have to be defined in a file (for example, it can be generated directly through a class loader).

A Class file is a binary stream based on 8-bit bytes. The data items are packed tightly in a strict order, without any delimiters, so almost everything stored in the Class file is data the program needs to run, with no padding.

A Class file uses only two kinds of data (although in a hex editor both just look like hexadecimal digits): unsigned numbers and tables. Unsigned numbers (u1, u2, u4, and u8, for 1, 2, 4, and 8 bytes) describe numbers, index references, quantities, or UTF-8 encoded strings. Tables are composite data types built from multiple unsigned numbers or other tables, and by convention their names end in “_info”. Tables describe hierarchically structured data, and the entire Class file is essentially one table.

Adjacent items such as constant_pool_count and constant_pool can be treated as a unit: a counter followed by that many entries of the same type, with the count always stored first.

Magic number and Class file version

The first item in the Class file structure is the u4 magic. This four-byte magic number exists only to identify the file as a Class file that the virtual machine can accept. Identifying the format by a magic number is more secure than relying on the file extension, because an extension can be changed freely. Its fixed value is 0xCAFEBABE in hexadecimal.

The magic number is followed by a two-byte minor version (minor_version) and a two-byte major version (major_version), which record the Class file version the compiler produced. For example, in version 50.3, 50 is the major version and 3 is the minor version. Virtual machines are backward compatible: a VM that supports version 51 can run a version 50.3 Class file, but a VM that supports only version 50 cannot run a version 51 file.
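
As a rough illustration of the layout described above, the following sketch reads the magic number and the two version fields from the first eight bytes of a Class file. The class name ClassHeader and the hand-written sample header are assumptions made for this example; Class file data is big-endian, which is also the default byte order of ByteBuffer.

```java
import java.nio.ByteBuffer;

class ClassHeader {
    final int minor;
    final int major;

    ClassHeader(int minor, int major) {
        this.minor = minor;
        this.major = major;
    }

    // Reads u4 magic, u2 minor_version, u2 major_version from the start of the data.
    static ClassHeader read(byte[] classBytes) {
        ByteBuffer buf = ByteBuffer.wrap(classBytes);   // big-endian by default
        int magic = buf.getInt();                        // u4
        if (magic != 0xCAFEBABE) {
            throw new IllegalArgumentException("not a Class file");
        }
        int minor = buf.getShort() & 0xFFFF;             // u2, read as unsigned
        int major = buf.getShort() & 0xFFFF;             // u2, read as unsigned
        return new ClassHeader(minor, major);
    }

    public static void main(String[] args) {
        // Hypothetical header bytes for a version 50.3 file (0x32 is 50).
        byte[] header = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0x00, 0x03, 0x00, 0x32};
        ClassHeader h = read(header);
        System.out.println(h.major + "." + h.minor);     // prints 50.3
    }
}
```

Note that the minor version comes first in the file even though it is written last in “50.3”.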

Constant pool

The version numbers are followed by the constant pool: constant_pool_count, then constant_pool. The constant pool can be understood as the resource repository of the Class file. It is the data item most often referenced by other items in the Class file structure, and it is also one of the items that takes up the most space.

The two-byte constant_pool_count records how many constants follow in constant_pool. Note that this count starts at 1, not 0. For example, if constant_pool_count is 22, then constant_pool holds 21 items, with indexes 1 to 21. Index 0 is deliberately left free so that an index value of 0 can mean “no constant pool item is referenced”. This is the only count in the Class file that starts at 1. All the others start at 0.

There are two main kinds of constants stored in the constant pool: literals and symbolic references. Literals are close to constants in the Java language sense, such as text strings and the values of constants declared final. Symbolic references cover the following three kinds of constants:

  1. Fully qualified names of classes and interfaces
  2. The name and descriptor of the field
  3. The name and descriptor of the method

The Class file does not store the final memory layout of any method or field. The real memory entry address is known only at run time, when the code that uses it is executed. In JDK 1.4, the constant pool can contain 11 kinds of constant items (later versions add more).

The troublesome part is that each kind of constant has its own structure. What they all have in common is that the first byte stores a tag, which tells the virtual machine what kind of constant item follows. These structures also reveal some limits. For example, the length of a CONSTANT_Utf8_info string is stored in two bytes, so a variable or method name can be at most 65535 bytes long, which is roughly 64KB of English characters.
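
To make the tag-driven layout more concrete, here is a minimal sketch that walks a constant pool using the tag values of the JDK 1.4 era format (1 for Utf8, 3 to 6 for Integer, Float, Long, and Double, 7 to 12 for Class, String, Fieldref, Methodref, InterfaceMethodref, and NameAndType). The class name ConstantPoolWalker and the tiny hand-built pool are assumptions for illustration. The sketch decodes strings as standard UTF-8, which matches the Class file encoding for plain ASCII names but not in every case, and it does not handle tags added in later versions.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class ConstantPoolWalker {
    // buf must be positioned at constant_pool_count; returns one description per entry.
    static List<String> walk(ByteBuffer buf) {
        int count = buf.getShort() & 0xFFFF;          // u2 constant_pool_count
        List<String> entries = new ArrayList<>();
        for (int i = 1; i < count; i++) {             // valid indexes are 1 .. count - 1
            int tag = buf.get() & 0xFF;               // u1 tag identifies the structure
            switch (tag) {
                case 1: {                             // Utf8: u2 length, then the bytes
                    byte[] bytes = new byte[buf.getShort() & 0xFFFF];
                    buf.get(bytes);
                    entries.add("Utf8 " + new String(bytes, StandardCharsets.UTF_8));
                    break;
                }
                case 3: case 4:                       // Integer, Float: one u4
                    buf.getInt();
                    entries.add(tag == 3 ? "Integer" : "Float");
                    break;
                case 5: case 6:                       // Long, Double: one u8
                    buf.getLong();
                    entries.add(tag == 5 ? "Long" : "Double");
                    i++;                              // these entries occupy two indexes
                    break;
                case 7: case 8:                       // Class, String: one u2 index
                    entries.add((tag == 7 ? "Class #" : "String #") + (buf.getShort() & 0xFFFF));
                    break;
                case 9: case 10: case 11: case 12: {  // two u2 indexes each
                    int first = buf.getShort() & 0xFFFF;
                    int second = buf.getShort() & 0xFFFF;
                    entries.add("Ref #" + first + " #" + second);
                    break;
                }
                default:
                    throw new IllegalArgumentException("unsupported tag " + tag);
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        // Hypothetical pool: count = 3, entry 1 = Utf8 "Demo", entry 2 = Class #1.
        byte[] pool = {0, 3, 1, 0, 4, 'D', 'e', 'm', 'o', 7, 0, 1};
        System.out.println(walk(ByteBuffer.wrap(pool)));  // prints [Utf8 Demo, Class #1]
    }
}
```

In the sample, constant_pool_count is 3 but only two entries follow, which is the “count starts at 1” rule in action.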

Access flags

After the constant pool ends, the next two bytes represent access flags (access_flags), which are used to identify some class or interface level access information.

The flags are combined with a bitwise OR. For example, if a class has ACC_PUBLIC (0x0001) and ACC_SUPER (0x0020) set, the result is 0x0001 | 0x0020 = 0x0021, and this is the value stored in access_flags. Java also provides helper classes for decoding modifier flags, such as java.lang.reflect.Modifier.
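
The OR calculation can be sketched in a few lines. The class name AccessFlagsDemo and its helper methods are made up for this example; the two constants use the flag values quoted above, and Modifier.isPublic is the standard reflection helper.

```java
import java.lang.reflect.Modifier;

class AccessFlagsDemo {
    static final int ACC_PUBLIC = 0x0001;
    static final int ACC_SUPER = 0x0020;

    // Combines individual flags into one access_flags value with bitwise OR.
    static int combine(int... flags) {
        int result = 0;
        for (int flag : flags) {
            result |= flag;
        }
        return result;
    }

    // Tests whether a single flag is set in an access_flags value.
    static boolean has(int accessFlags, int flag) {
        return (accessFlags & flag) != 0;
    }

    public static void main(String[] args) {
        int accessFlags = combine(ACC_PUBLIC, ACC_SUPER);
        System.out.println(String.format("0x%04X", accessFlags));  // prints 0x0021
        System.out.println(has(accessFlags, ACC_SUPER));           // prints true
        System.out.println(Modifier.isPublic(accessFlags));        // prints true
    }
}
```

Because each flag occupies its own bit, flags can be tested independently with a bitwise AND.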

Class index, superclass index, and interface index collection

The class index (this_class) and the superclass index (super_class) are each a single u2 value, while the interface index collection is a set of u2 values. The Class file uses these three items to determine the inheritance relationships of the class. The super_class index is nonzero for every class except java.lang.Object, which has no superclass. The interface index collection starts with a u2 counter (interfaces_count), and if the counter is 0, no interface entries follow it.

Field table collection

Field tables describe the variables declared in an interface or class. Fields include class-level and instance-level (object-level) variables, but not local variables inside methods. The first item in the field table structure is the access flags (access_flags) of the field.

access_flags is evaluated in the same way as the class or interface access flags described earlier. The next two items are name_index and descriptor_index, which point to the simple name and the descriptor of the field, respectively.

The field table collection does not list fields inherited from superclasses or parent interfaces, but it can list fields that do not exist in the Java source code, such as the field automatically added to an inner class so that it can access its outer class. In addition, in the Java language the same class cannot have two fields with the same simple name, such as int name followed by String name. At the bytecode level, however, two fields may share a simple name as long as their descriptors are different.
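
Since the descriptor is what distinguishes such fields, it may help to see how a field descriptor maps to a Java type. The following is a minimal sketch, assuming well-formed input; the class name FieldDescriptors and the method toJavaType are hypothetical. It relies on the standard descriptor characters: B, C, D, F, I, J, S, and Z for the primitive types, L followed by an internal class name and a semicolon for object types, and one leading [ per array dimension.

```java
class FieldDescriptors {
    // Converts a field descriptor such as "[I" into a Java type name such as "int[]".
    static String toJavaType(String descriptor) {
        int dims = 0;
        while (descriptor.charAt(dims) == '[') {      // each '[' is one array dimension
            dims++;
        }
        String base;
        switch (descriptor.charAt(dims)) {
            case 'B': base = "byte"; break;
            case 'C': base = "char"; break;
            case 'D': base = "double"; break;
            case 'F': base = "float"; break;
            case 'I': base = "int"; break;
            case 'J': base = "long"; break;
            case 'S': base = "short"; break;
            case 'Z': base = "boolean"; break;
            case 'L':                                 // Lpkg/Name; becomes pkg.Name
                base = descriptor.substring(dims + 1, descriptor.length() - 1).replace('/', '.');
                break;
            default:
                throw new IllegalArgumentException("bad descriptor: " + descriptor);
        }
        StringBuilder type = new StringBuilder(base);
        for (int i = 0; i < dims; i++) {
            type.append("[]");
        }
        return type.toString();
    }

    public static void main(String[] args) {
        System.out.println(toJavaType("I"));                   // prints int
        System.out.println(toJavaType("Ljava/lang/String;"));  // prints java.lang.String
        System.out.println(toJavaType("[[I"));                 // prints int[][]
    }
}
```

With this mapping, int name and String name have the descriptors I and Ljava/lang/String;, which is why they can coexist at the bytecode level.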

Method table collection

The structure of a method table is similar to that of a field table.

As with the field table collection, the methods of a parent class do not appear in the method table collection unless the subclass overrides them. In the Java language, to overload a method you need, in addition to the same simple name as the original method, a signature that differs from the original method. A signature is the set of field symbol references in the constant pool for each parameter of the method. The return value is not part of the signature, so two methods that differ only in their return type are not a valid overload.

Attribute table collection

Class files, field tables, and method tables can each carry their own attribute table collection. The rules for attribute tables are a little looser than for the other data items in the Class file, but there are many attributes to cover. Let’s look at the more important ones.

Code attribute

The code in a Java method body is compiled into bytecode and stored in the Code attribute. Methods without a body, such as methods in interfaces and abstract methods in abstract classes, have no Code attribute.

max_locals is the storage space required by the local variable table, measured in Slots, the smallest unit of that table. Slots can be reused: once execution leaves the scope of a local variable, the Slot it occupied can be given to another local variable, which saves space.

code_length and code store the bytecode instructions generated when the Java source code is compiled. Each opcode takes up only one byte, so at most 256 opcodes can be represented. code_length is four bytes long, but the virtual machine effectively uses only two of them, so a method can hold at most 65535 bytes of bytecode. This is generally sufficient, but be careful when compiling complex JSPs: some compilers merge the JSP content and the page output into a single method, which can exceed the limit and cause compilation to fail.

It’s worth noting that when javac compiles an instance method that declares no parameters, args_size is still 1, because this is passed implicitly as the first argument. For a static method with no parameters, args_size is 0.

When using try-catch-finally, you may have noticed that an assignment in finally does not change a value that try has already returned. It can look as if the try block returns first and finally runs afterwards. Take the following code as an example.

    public int inc() {
        int x;
        try {
            x = 1;
            return x;
        } catch (Exception e) {
            x = 2;
            return x;
        } finally {
            x = 3;
        }
    }

This method never returns 3. Assuming no exception is thrown, it executes in this order: x = 1 runs first, so the local variable x equals 1. When the return statement is reached, the value of x is copied into a separate space that holds the value to be returned. We’ll call this space returnX for now. Execution then enters finally, which is still within the scope of the inc() method, and x is assigned 3. Finally the return instruction executes and hands the value saved in returnX, which is still 1, back to the caller. Once execution leaves inc(), the Slot used by x can be reused.
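
The behavior described above can be checked with a small runnable version of the method. The wrapper class FinallyDemo is an assumption added for this example, and the method is made static only so that it can be called from main without creating an object.

```java
class FinallyDemo {
    static int inc() {
        int x;
        try {
            x = 1;
            return x;      // the value 1 is saved as the return value here
        } catch (Exception e) {
            x = 2;
            return x;
        } finally {
            x = 3;         // changes the local variable, not the saved return value
        }
    }

    public static void main(String[] args) {
        System.out.println(inc());   // prints 1
    }
}
```

The finally block does run, but by then the value to return has already been copied out of x.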

Other attributes

  1. Exceptions lists the exceptions declared after the throws keyword of a method.
  2. LineNumberTable is not mandatory, but it is generated by default. Without it, line numbers cannot be shown in the stack trace when an exception is thrown.
  3. LocalVariableTable is not mandatory. It describes the relationship between the variables in the local variable table of a stack frame and the variables defined in the Java source code.
  4. SourceFile records the name of the source file that the Class file was generated from.
  5. ConstantValue tells the virtual machine to automatically assign a value to a static variable.
  6. InnerClasses records the association between an inner class and its host class.
  7. Signature records generic type information and is often relevant when writing AOP code. Because Java erases generics at compile time, this information has to be recorded so that the original generic types can still be obtained at run time.
  8. BootstrapMethods holds the bootstrap method specifiers referenced by invokedynamic instructions, and is closely related to the java.lang.invoke package.
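
As a small illustration of item 7, the sketch below uses reflection to recover a generic type argument that erasure removed from the field's raw type; the reflection API obtains this information from the recorded Signature data. The class SignatureDemo, its field names, and the helper elementTypeOf are all hypothetical names for this example.

```java
import java.lang.reflect.Field;
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.List;

class SignatureDemo {
    // Hypothetical field whose declared type uses generics.
    List<String> names;

    // Returns the first type argument of the named field, or its raw type if it has none.
    static String elementTypeOf(String fieldName) {
        try {
            Field field = SignatureDemo.class.getDeclaredField(fieldName);
            Type generic = field.getGenericType();
            if (generic instanceof ParameterizedType) {
                Type argument = ((ParameterizedType) generic).getActualTypeArguments()[0];
                return argument.getTypeName();
            }
            return field.getType().getName();
        } catch (NoSuchFieldException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) throws NoSuchFieldException {
        // The erased type is just List, but the type argument is still recoverable.
        System.out.println(SignatureDemo.class.getDeclaredField("names").getType().getName()); // prints java.util.List
        System.out.println(elementTypeOf("names"));                                            // prints java.lang.String
    }
}
```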

Bytecode instructions

There are at most 256 bytecode opcodes. An instruction is an opcode that may be followed by operands, much as the methods we write may or may not take arguments. Because the Java virtual machine is built around an operand stack rather than registers, most instructions consist of only an opcode, with no operands.

Because the number of opcodes is limited, there is not a dedicated instruction for every data type, and many types have to share instructions. For example, most operations on boolean, byte, short, and char values actually use the corresponding int instructions.
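
This sharing is visible even at the language level: arithmetic on the smaller integer types produces an int. The sketch below is only a language-level view under that assumption, not a bytecode dump, and the class name IntPromotionDemo is made up for the example.

```java
class IntPromotionDemo {
    // Returns the boxed type of a + b when both operands are bytes.
    static String byteSumType(byte a, byte b) {
        Object sum = a + b;          // the expression has type int, so it boxes to Integer
        return sum.getClass().getSimpleName();
    }

    // Same check for two char operands.
    static String charSumType(char a, char b) {
        Object sum = a + b;          // also int, not char
        return sum.getClass().getSimpleName();
    }

    public static void main(String[] args) {
        System.out.println(byteSumType((byte) 1, (byte) 2));  // prints Integer
        System.out.println(charSumType('a', 'b'));            // prints Integer
    }
}
```

This is also why assigning the sum of two bytes back to a byte variable needs an explicit cast.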

Integer operations can overflow silently: adding two large positive integers can produce a negative number, and no exception is thrown. For floating-point operations, an overflow is represented by a signed infinity, and a result that has no well-defined numeric value is represented by NaN. All arithmetic operations that use NaN as an operand return NaN.
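
These rules can be observed directly from Java code. The following is a minimal sketch; the class name OverflowDemo and its helper methods are assumptions for this example, and the helpers exist only so the arithmetic happens at run time rather than being folded by the compiler.

```java
class OverflowDemo {
    static int add(int a, int b) {
        return a + b;                // wraps around on overflow, no exception
    }

    static double multiply(double a, double b) {
        return a * b;                // overflows to a signed infinity
    }

    static double divide(double a, double b) {
        return a / b;                // 0.0 / 0.0 has no defined value and yields NaN
    }

    public static void main(String[] args) {
        System.out.println(add(Integer.MAX_VALUE, 1));        // prints -2147483648
        System.out.println(multiply(Double.MAX_VALUE, 2.0));  // prints Infinity
        System.out.println(multiply(-Double.MAX_VALUE, 2.0)); // prints -Infinity
        double nan = divide(0.0, 0.0);
        System.out.println(nan);                              // prints NaN
        System.out.println(nan + 1.0);                        // prints NaN
    }
}
```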

The Java virtual machine supports both method-level synchronization and synchronization of a sequence of instructions inside a method. Both are implemented with a monitor, and the lock acquired by synchronized is that monitor.
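
The two forms look like this in source code. In the compiled Class file, a synchronized method is marked with the ACC_SYNCHRONIZED access flag, while a synchronized block is compiled to monitorenter and monitorexit instructions. The class SyncCounter and its members are hypothetical names for this sketch.

```java
class SyncCounter {
    private int methodCount;
    private int blockCount;
    private final Object lock = new Object();

    // Method-level synchronization: the monitor of this object is held for the whole call.
    synchronized void incMethod() {
        methodCount++;
    }

    // Synchronization of a sequence of instructions inside a method.
    void incBlock() {
        synchronized (lock) {
            blockCount++;
        }
    }

    // Runs several threads against one counter and returns {methodCount, blockCount}.
    static int[] run(int threads, int perThread) {
        SyncCounter counter = new SyncCounter();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    counter.incMethod();
                    counter.incBlock();
                }
            });
            workers[t].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();       // join makes the workers' updates visible here
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new int[] {counter.methodCount, counter.blockCount};
    }

    public static void main(String[] args) {
        int[] counts = run(4, 1000);
        System.out.println(counts[0] + " " + counts[1]);   // prints 4000 4000
    }
}
```

Without synchronized, the two increments would be unprotected read-modify-write sequences and updates could be lost.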