Preface
Java programs are cross-platform: “Write Once, Run Anywhere.” Java achieves this by being half compiled and half interpreted, that is, .class files plus the JVM.
1. The source program is compiled into .class files. The .class format has strict rules about how information is stored in the file and how its contents are to be read; it can be thought of as “intermediate code”. 2. Each platform implements its own JVM, according to its own characteristics, to interpret (translate) .class files into instructions that actually execute natively.
This is how Java achieves its cross-platform nature: portability is based on the .class format and implemented by the JVM.
The purpose of this article is to understand the .class file, that is, to see what the program code you write looks like to the JVM. Once you understand .class files, it becomes easier to understand the JVM, bytecode instrumentation, and so on.
Basic knowledge
Bytecode
Bytecode is a binary format made up of a sequence of opcode/data pairs that encode an executable program. It is a form of intermediate code. A byte occupies eight bits, that is, it is a binary number of eight digits.
The .class file discussed in this article is a bytecode file. Because each byte holds 8 bits, it is written in hexadecimal, which is easy to read. The value of a byte ranges from 00 to FF (0 to 255).
Unsigned basic types
Unsigned numbers can be used to describe numbers, index references, quantities, or UTF-8 encoded strings. u1, u2, u4, and u8 represent unsigned numbers of 1 byte, 2 bytes, 4 bytes, and 8 bytes respectively.
Literals
A literal is a representation of a fixed value. It has no meaning by itself and needs a context to give it one. For example, 007 means nothing on its own, but when it is used to refer to James Bond, you know that 007 stands for a very capable secret agent. In a program, int x = 10 and String s = "10" give the literal 10 different meanings.
Fully qualified name
The fully qualified name of a class is the full name of the class with every . replaced by /; for example, java.lang.String becomes java/lang/String.
Descriptors
Descriptors describe the data type of a field, and the parameter list and return value of a method. Each character corresponds to a different data type
Descriptor character | Meaning |
---|---|
B | byte |
C | char |
D | double |
F | float |
I | int |
J | long |
S | short |
Z | boolean |
V | void |
L | Object type; for example, String is written as Ljava/lang/String; |
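As a small illustration of how these characters combine, the JDK itself can print descriptors through java.lang.invoke.MethodType. This is only a sketch: the class name DescriptorDemo and its two helper methods are made up for this example.

```java
import java.lang.invoke.MethodType;

class DescriptorDemo {
    // Converts a dotted class name to the internal form used in .class files.
    static String internalName(String className) {
        return className.replace('.', '/');
    }

    // Builds a method descriptor: parameter types in parentheses, then the return type.
    static String methodDescriptor(Class<?> returnType, Class<?>... parameterTypes) {
        return MethodType.methodType(returnType, parameterTypes).toMethodDescriptorString();
    }

    public static void main(String[] args) {
        System.out.println(internalName("java.lang.String"));  // java/lang/String
        System.out.println(methodDescriptor(int.class));       // ()I
        System.out.println(methodDescriptor(void.class));      // ()V
        // A method taking (long, boolean) and returning String:
        System.out.println(methodDescriptor(String.class, long.class, boolean.class)); // (JZ)Ljava/lang/String;
    }
}
```

A method descriptor lists the parameter descriptors inside parentheses and puts the return type after them, so a no-argument method returning int is ()I.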
The Class file
A Java file contains all the information about a class. Here is a Java class:
    import java.io.Serializable;

    public class TestClass implements Serializable {
        private int m = 123;
        private static int x = 10;
        private static final int y = 20;

        public int increace() {
            return m + 1;
        }

        public void m() throws Exception {
            // The logic is not written
        }

        public static String hello() {
            return "hello word";
        }
    }
This Java file contains the following information:
- The class is named TestClass, is externally accessible, and implements the Serializable interface
- It has class variables x and y, and a member variable m
- It has an externally accessible class method hello() and externally accessible member methods increace() and m()
Note: unless otherwise stated, the .class files in this article are compiled from Java source files.
All of this information is compiled into the .class file. With the command
javac fileName.java
the Java file is compiled into the corresponding .class file. A .class file is a bytecode file and can be read with a suitable editor. The editor used in this article is 010 Editor, available for Windows and Mac; download it yourself.
A .class file expresses its information as bytes. The data items are packed tightly with no delimiters, so almost everything stored in the file is data the program needs in order to run. Parsing the bytes therefore requires a set of rules, and those rules must be followed strictly.
The .class file format stores data in a pseudo-structure similar to a C struct. You can think of a .class file as a collection of tables that can be indexed to find the corresponding data. In other words, the relative position of a piece of data determines the meaning it is given.
The .class file format is as follows
Type | Name | Count | Meaning |
---|---|---|---|
u4 | magic | 1 | Magic number, used to determine whether the virtual machine can accept the file |
u2 | minor_version | 1 | Minor version number |
u2 | major_version | 1 | Major version number |
u2 | constant_pool_count | 1 | Constant pool count |
cp_info | constant_pool | constant_pool_count-1 | Constant pool contents |
u2 | access_flags | 1 | Access flags |
u2 | this_class | 1 | Index of this class |
u2 | super_class | 1 | Index of the parent class |
u2 | interfaces_count | 1 | Number of interfaces |
u2 | interfaces | interfaces_count | Interface table collection |
u2 | fields_count | 1 | Number of fields |
field_info | fields | fields_count | Field table collection |
u2 | methods_count | 1 | Number of methods |
method_info | methods | methods_count | Method table collection |
u2 | attributes_count | 1 | Number of attributes |
attribute_info | attributes | attributes_count | Attribute table collection |
Some items have a fixed length; others vary with the situation, but there is always a count or length field that tells you how long they are. Each item corresponds to a region of the example .class file, and all that is left is to parse the class information layer by layer.
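The first four rows of the table can be read with a few lines of Java. The sketch below is illustrative only: the class name ClassHeaderDemo is made up, and the sample bytes are hand-written for the example (major version 52 is the value a Java 8 compiler writes), not copied from the article's file.

```java
import java.nio.ByteBuffer;

class ClassHeaderDemo {
    // Reads magic, minor_version, major_version and constant_pool_count.
    // ByteBuffer is big-endian by default, which matches the .class byte order.
    static int[] readHeader(byte[] classBytes) {
        ByteBuffer buf = ByteBuffer.wrap(classBytes);
        int magic = buf.getInt();                        // u4 magic
        int minor = buf.getShort() & 0xFFFF;             // u2 minor_version
        int major = buf.getShort() & 0xFFFF;             // u2 major_version
        int constantPoolCount = buf.getShort() & 0xFFFF; // u2 constant_pool_count
        return new int[] {magic, minor, major, constantPoolCount};
    }

    public static void main(String[] args) {
        // Hypothetical sample: CAFEBABE, minor 0, major 52, constant_pool_count 0x0023.
        byte[] sample = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE,
                         0x00, 0x00, 0x00, 0x34, 0x00, 0x23};
        int[] h = readHeader(sample);
        System.out.println("magic=" + Integer.toHexString(h[0])
                + " minor=" + h[1] + " major=" + h[2] + " constant_pool_count=" + h[3]);
    }
}
```

Masking each getShort() result with 0xFFFF is what turns Java's signed short into the unsigned u2 the format expects.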
Constant pool
The constant pool holds two main categories of constants: literals and symbolic references. Symbolic references include:
- Fully qualified names of classes and interfaces
- The name and descriptor of the field
- The name and descriptor of the method
Unlike C and C++, Java code is compiled without a “link” step; linking is done dynamically when the JVM loads the .class file. The .class file does not hold the final memory layout of any method or field, so a symbolic reference cannot be used by the JVM as it is: it first has to be converted at run time into an actual memory entry address. When the JVM runs, it takes the symbolic reference from the constant pool and resolves it to a specific memory address, and this information is stored in the JVM's method area.
The length of the constant pool varies, so the number of constants is given first, at 0x0008 to 0x0009; the length of the pool body can then be calculated from the type of each constant in the pool.
This is rather tedious, however. Each constant type has its own table, and you have to look up the structure of the specific table to read the information. The first item of every constant, a u1 tag, identifies which table structure the constant uses, as follows
Type | Tag | Description |
---|---|---|
CONSTANT_Utf8_info | 1 | UTF-8 encoded string |
CONSTANT_Integer_info | 3 | Integer literal |
CONSTANT_Float_info | 4 | Floating-point literal |
CONSTANT_Long_info | 5 | Long integer literal |
CONSTANT_Double_info | 6 | Double-precision floating-point literal |
CONSTANT_Class_info | 7 | Symbolic reference to a class or interface |
CONSTANT_String_info | 8 | String literal |
CONSTANT_Fieldref_info | 9 | Symbolic reference to a field |
CONSTANT_Methodref_info | 10 | Symbolic reference to a method in a class |
CONSTANT_InterfaceMethodref_info | 11 | Symbolic reference to a method in an interface |
CONSTANT_NameAndType_info | 12 | Partial symbolic reference to a field or method |
CONSTANT_MethodHandle_info | 15 | Method handle |
CONSTANT_MethodType_info | 16 | Method type |
CONSTANT_InvokeDynamic_info | 18 | Dynamic method call site |
This article does not list the individual table structures unless necessary; for details, refer to a general reference table of the .class file structures
Here is an example to get started.
The first constant's tag is at offset 0x000A. Its value is 0A, 10 in decimal, so the constant is of type CONSTANT_Methodref_info. That table consists of a u1, a u2, and a u2, 5 bytes in total. If the constant were of type CONSTANT_Utf8_info, it would also have a length field giving the number of bytes the literal occupies, and those bytes would have to be added as well. The second constant's tag is therefore at offset 0x000F; its value is 09, 9 in decimal, so its type is CONSTANT_Fieldref_info. And so on…
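The stepping described above can be sketched in Java. This is a minimal sketch, not a full parser: the class name ConstantPoolWalker is made up, the entry sizes per tag are taken from the JVM specification rather than from this article, and the index values inside the sample bytes are invented for illustration.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

class ConstantPoolWalker {
    // Returns the tag of every entry; buf must be positioned at constant_pool_count.
    static List<Integer> readTags(ByteBuffer buf) {
        int count = buf.getShort() & 0xFFFF;           // u2 constant_pool_count
        List<Integer> tags = new ArrayList<>();
        for (int index = 1; index < count; index++) {  // entries are numbered 1 .. count-1
            int tag = buf.get() & 0xFF;                // u1 tag
            tags.add(tag);
            switch (tag) {
                case 1:                                // Utf8: u2 length + bytes[length]
                    int length = buf.getShort() & 0xFFFF;
                    skip(buf, length);
                    break;
                case 3: case 4:                        // Integer, Float: u4
                    skip(buf, 4);
                    break;
                case 5: case 6:                        // Long, Double: u8, occupying two indexes
                    skip(buf, 8);
                    index++;
                    break;
                case 7: case 8: case 16:               // Class, String, MethodType: u2
                    skip(buf, 2);
                    break;
                case 9: case 10: case 11: case 12: case 18: // the *ref types, NameAndType, InvokeDynamic: u2 + u2
                    skip(buf, 4);
                    break;
                case 15:                               // MethodHandle: u1 + u2
                    skip(buf, 3);
                    break;
                default:
                    throw new IllegalArgumentException("Unknown constant tag " + tag);
            }
        }
        return tags;
    }

    private static void skip(ByteBuffer buf, int n) {
        buf.position(buf.position() + n);
    }

    public static void main(String[] args) {
        // Hypothetical pool with count 4: a Methodref, a Fieldref and the Utf8 string "m".
        byte[] pool = {0x00, 0x04,
                       0x0A, 0x00, 0x06, 0x00, 0x15,
                       0x09, 0x00, 0x05, 0x00, 0x16,
                       0x01, 0x00, 0x01, 0x6D};
        System.out.println(readTags(ByteBuffer.wrap(pool))); // [10, 9, 1]
    }
}
```

Only the tag and the length matter for finding the next entry, which is why the walker can skip the bodies without interpreting them.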
Finding each constant step by step like this is also tedious. Fortunately, Java ships with a tool, javap, that can analyze the bytecode of a .class file. Running the command
javap -verbose fileName
gives the following information (only the constant pool part is shown):
The constant pool count, stored at 0x0008 to 0x0009, is 0x0023, which is 35 in decimal. Index 0 is reserved, so the constant pool indexes run from 1 to 34. Of the two views above, the former numbers its entries from 0 and the latter from 1.
If you do not know the approach, analyzing the constant pool is really confusing. Personally, I think of the constant pool information as “building blocks”.
In this example, the constant types involved in the constant pool are:
- CONSTANT_Methodref_info
- CONSTANT_Fieldref_info
- CONSTANT_String_info
- CONSTANT_Class_info
- CONSTANT_Integer_info
- CONSTANT_NameAndType_info
- CONSTANT_Utf8_info
Setting the specific table structures aside for a moment, the relationship between the table types above is as follows:
This is just the composition of the constant types involved in the current example. Any constant, broken down repeatedly, either ends up pointing to constants of the basic type CONSTANT_Utf8_info or is itself a basic type such as CONSTANT_Integer_info. The basic type CONSTANT_Utf8_info does not mean much on its own; the other types are the contexts that give a CONSTANT_Utf8_info its meaning.
CONSTANT_Utf8_info can be considered the most basic type
    CONSTANT_Utf8_info {
        u1 tag;            // constant type
        u2 length;         // length in bytes
        u1 bytes[length];  // UTF-8 encoded bytes
    }
When a constant of type CONSTANT_Utf8_info is encountered, its bytes are decoded as (modified) UTF-8 to obtain the literal
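As a hedged sketch of that decoding step: java.io.DataInputStream.readUTF reads a two-byte length followed by that many bytes of modified UTF-8, which happens to be the layout that follows the tag. The class name Utf8ConstantDemo and the sample bytes are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class Utf8ConstantDemo {
    // Decodes one CONSTANT_Utf8_info entry, tag byte included.
    static String decode(byte[] entry) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(entry))) {
            int tag = in.readUnsignedByte();  // u1 tag, must be 1 for CONSTANT_Utf8_info
            if (tag != 1) {
                throw new IllegalArgumentException("Not a CONSTANT_Utf8_info, tag = " + tag);
            }
            // readUTF consumes the u2 length and then the encoded bytes.
            return in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hand-written entry: tag 1, length 9, then the bytes of "TestClass".
        byte[] entry = {0x01, 0x00, 0x09, 'T', 'e', 's', 't', 'C', 'l', 'a', 's', 's'};
        System.out.println(decode(entry)); // TestClass
    }
}
```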
Class-level information
The class is declared as
public class TestClass implements Serializable
It contains the following information:
- The class itself: TestClass
- Access flag: public
- Implement the Serializable interface
- The parent class is Object
According to the .class file format table, the data that immediately follows the constant pool is the class-level data
From 0x0143 to 0x014C:
- access_flags (u2): the hex value is 0x0021
- this_class (u2): decimal value 5, pointing to the 5th constant in the constant pool, of type CONSTANT_Class_info, class TestClass
- super_class (u2): decimal value 6, pointing to the 6th constant, of type CONSTANT_Class_info, class java/lang/Object
- interfaces_count (u2): the number of interfaces implemented, here 1
- interfaces[0] (u2): points to the 7th constant in the constant pool, of type CONSTANT_Class_info, interface name java/io/Serializable
The CONSTANT_Class_info constant table is structured as follows
    CONSTANT_Class_info {
        u1 tag;         // constant type
        u2 name_index;  // index of a CONSTANT_Utf8_info constant in the constant pool: the fully qualified name of the class or interface
    }
This matches the earlier statement that the constant pool is made of building blocks; the constant types that follow work the same way.
Access flags are represented by flag bits, and the meanings of each flag are shown in a table
Flag name | Flag value | Meaning |
---|---|---|
ACC_PUBLIC | 0x0001 | Whether the type is public |
ACC_FINAL | 0x0010 | Whether the type is declared final; only classes can set this |
ACC_SUPER | 0x0020 | Whether to use the new semantics of the invokespecial bytecode instruction. The semantics changed in JDK 1.0.2, and this flag distinguishes which semantics apply |
ACC_INTERFACE | 0x0200 | Marks this as an interface |
ACC_ABSTRACT | 0x0400 | Whether the type is abstract |
ACC_SYNTHETIC | 0x1000 | Marks this class as not generated from user code |
ACC_ANNOTATION | 0x2000 | Marks this as an annotation |
ACC_ENUM | 0x4000 | Marks this as an enumeration |
In the current example, 0x0001 | 0x0020 = 0x0021
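Going the other way, from a flag value back to the flag names, is a matter of testing each bit. The following is a small sketch using the values from the table above; the class name ClassAccessFlags is made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

class ClassAccessFlags {
    private static final int[] VALUES = {0x0001, 0x0010, 0x0020, 0x0200, 0x0400, 0x1000, 0x2000, 0x4000};
    private static final String[] NAMES = {"ACC_PUBLIC", "ACC_FINAL", "ACC_SUPER", "ACC_INTERFACE",
            "ACC_ABSTRACT", "ACC_SYNTHETIC", "ACC_ANNOTATION", "ACC_ENUM"};

    // Returns the names of all flags whose bit is set in accessFlags.
    static List<String> decode(int accessFlags) {
        List<String> set = new ArrayList<>();
        for (int i = 0; i < VALUES.length; i++) {
            if ((accessFlags & VALUES[i]) != 0) {
                set.add(NAMES[i]);
            }
        }
        return set;
    }

    public static void main(String[] args) {
        System.out.println(decode(0x0021)); // [ACC_PUBLIC, ACC_SUPER]
    }
}
```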
Attributes (attribute tables)
Attribute tables are special: the .class file itself, field tables, method tables, and so on can each carry their own set of attribute tables to describe scenario-specific information, so this table is described here first.
The characteristics of attribute tables are:
- The rules are more relaxed: there is no strict requirement on order, length, or content
- Any compiler can write custom attribute information into an attribute table, as long as its name does not clash with an existing attribute; the JVM ignores attributes it does not recognize.
The attribute table structure is
    attribute_info {
        u2 attribute_name_index;    // index of a CONSTANT_Utf8_info constant in the constant pool: the attribute name
        u4 attribute_length;        // length in bytes of info
        u1 info[attribute_length];  // content; its structure depends on the specific attribute
    }
So one attribute table occupies u2 + u4 + attribute_length bytes.
Java predefines many attribute tables. Only those involved in this article are described below; consult the others when you actually need them
Attribute name | Used in | Meaning |
---|---|---|
Code | Method table | Bytecode instructions compiled from Java code |
ConstantValue | Field table | Constant value defined with the final keyword |
Exceptions | Method table | Exceptions thrown by the method |
LineNumberTable | Code attribute | Mapping between Java source line numbers and bytecode instructions |
SourceFile | Class file | Name of the source file |
The attribute tables carried by the field table and the method table are covered later. For now, look at SourceFile, an attribute table carried by the .class file itself.
Range: 0x025B to 0x0262, a total of u2 + u4 + attribute_length = 8 bytes. The SourceFile attribute structure is as follows
    SourceFile_attribute {
        u2 attribute_name_index;  // index of a CONSTANT_Utf8_info constant in the constant pool: the attribute name
        u4 attribute_length;      // attribute length
        u2 sourcefile_index;      // index of a CONSTANT_Utf8_info constant in the constant pool: the source file name
    }
So from the SourceFile attribute table, the source file name is TestClass.java
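Because every attribute starts with the same u2 + u4 header, a reader can step over an attribute without understanding its body. The sketch below shows this under stated assumptions: the class name AttributeReader is made up, and the two constant pool indexes in the sample bytes (0x0019 and 0x001A) are hypothetical, since the article does not list them.

```java
import java.nio.ByteBuffer;

class AttributeReader {
    // Reads one attribute header, skips the body, and returns {nameIndex, length}.
    static int[] readAndSkip(ByteBuffer buf) {
        int nameIndex = buf.getShort() & 0xFFFF; // u2 attribute_name_index
        int length = buf.getInt();               // u4 attribute_length (assumed to fit in a Java int)
        buf.position(buf.position() + length);   // step over info[attribute_length]
        return new int[] {nameIndex, length};
    }

    public static void main(String[] args) {
        // Hypothetical SourceFile attribute: name index, length 2, then sourcefile_index.
        byte[] attribute = {0x00, 0x19, 0x00, 0x00, 0x00, 0x02, 0x00, 0x1A};
        ByteBuffer buf = ByteBuffer.wrap(attribute);
        int[] header = readAndSkip(buf);
        System.out.println("name=#" + header[0] + " length=" + header[1]
                + " total bytes=" + buf.position());
    }
}
```

The final position, 2 + 4 + attribute_length, is the same 8 bytes computed in the text.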
Field table
According to the .class file format table, the interface table is followed by the field count and the field table collection
The range is 0x014D to 0x016E, where 0x014D to 0x014E holds the field count. The value 0x0003 means there are 3 fields. The field table structure is as follows
    field_info {
        u2 access_flags;      // access flags
        u2 name_index;        // index of a CONSTANT_Utf8_info constant in the constant pool: the field name
        u2 descriptor_index;  // index of a CONSTANT_Utf8_info constant in the constant pool: the field descriptor
        u2 attributes_count;  // number of attribute tables
        attribute_info attributes[attributes_count];  // attribute tables
    }
Like the .class file itself, a field table can carry its own attribute tables to handle special scenarios. The attribute tables are optional: when attributes_count is 0, no attribute_info follows. Fields also have access flags that constrain them further. The field name is given by name_index, which points to a constant in the constant pool, and the field type is given by a descriptor, for example I for int (see the basic knowledge section if you have forgotten).
The current example defines the following fields:
private int m = 123;
private static int x = 10;
private static final int y = 20;
Member variables m and class variables x, y are defined. Let’s take y for example
The positions are 0x015F to 0x016E, where:
- Access flag: 0x001A
- Field name index: 0x000B, decimal value 11, pointing to the 11th constant in the constant pool, y
- The descriptor index is 0x0009, pointing to the ninth constant in the constant pool, which is I
- The attribute table count is 0x0001, that is, 1
- The attribute table occupies u2 + u4 + 2 = 8 bytes in total, that is, 0x0167 to 0x016E
The meanings of field access flag bits are shown in the table:
Flag name | Flag value | Meaning |
---|---|---|
ACC_PUBLIC | 0x0001 | Whether the field is public |
ACC_PRIVATE | 0x0002 | Whether the field is private |
ACC_PROTECTED | 0x0004 | Whether the field is protected |
ACC_STATIC | 0x0008 | Whether the field is static |
ACC_FINAL | 0x0010 | Whether the field is final |
ACC_VOLATILE | 0x0040 | Whether the field is volatile |
ACC_TRANSIENT | 0x0080 | Whether the field is transient |
ACC_SYNTHETIC | 0x1000 | Whether the field is generated automatically by the compiler |
ACC_ENUM | 0x4000 | Whether the field is an enum constant |
The current field is private static final, that is, 0x0002 | 0x0008 | 0x0010 = 0x001A. Putting the information above together gives the declaration private static final int y; its value comes from the attribute table.
ConstantValue is the attribute table used to assign values to variables that are static and final and of a basic data type. In addition to the standard attribute table fields, ConstantValue has a constantvalue_index of type u2, which points to the constant in the constant pool used to initialize the field. The value at 0x016D to 0x016E is 0x000D, and the constant at index 13 of the constant pool is the integer 20.
In the example, the member variable m and the class variable x are assigned in the instance constructor <init> and the class constructor <clinit> respectively, as explained below.
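The 16 bytes of the field table for y can be read back with a short sketch. Most of the byte values come from the walkthrough above; the attribute_name_index 0x000C is an assumption, since the article does not state it. The class name FieldInfoDemo is made up, and treating any 2-byte attribute body as a ConstantValue is a shortcut; a real parser would look up the attribute name in the constant pool.

```java
import java.nio.ByteBuffer;

class FieldInfoDemo {
    // Summarizes one field_info structure as text.
    static String describe(byte[] fieldInfo) {
        ByteBuffer buf = ByteBuffer.wrap(fieldInfo);      // big-endian by default
        int accessFlags = buf.getShort() & 0xFFFF;        // u2 access_flags
        int nameIndex = buf.getShort() & 0xFFFF;          // u2 name_index
        int descriptorIndex = buf.getShort() & 0xFFFF;    // u2 descriptor_index
        int attributesCount = buf.getShort() & 0xFFFF;    // u2 attributes_count
        StringBuilder sb = new StringBuilder("flags=0x" + Integer.toHexString(accessFlags)
                + " name=#" + nameIndex + " descriptor=#" + descriptorIndex);
        for (int i = 0; i < attributesCount; i++) {
            int attrNameIndex = buf.getShort() & 0xFFFF;  // u2 attribute_name_index
            int attrLength = buf.getInt();                // u4 attribute_length
            if (attrLength == 2) {
                // Shortcut for this example: a 2-byte body is read as constantvalue_index.
                sb.append(" attribute=#").append(attrNameIndex)
                  .append(" constantvalue_index=#").append(buf.getShort() & 0xFFFF);
            } else {
                buf.position(buf.position() + attrLength); // skip anything else
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] y = {0x00, 0x1A, 0x00, 0x0B, 0x00, 0x09, 0x00, 0x01,
                    0x00, 0x0C, 0x00, 0x00, 0x00, 0x02, 0x00, 0x0D};
        System.out.println(describe(y));
    }
}
```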
Method table
Next to the field table is the number of methods and the collection of method tables
The method data spans 0x016F to 0x0258, and the method count is 0x0005, that is, 5. Besides the user-defined increace(), m(), and hello(), there are the instance constructor <init>()V and the class constructor <clinit>().
Method table structure is as follows:
    method_info {
        u2 access_flags;      // access flags
        u2 name_index;        // index of a CONSTANT_Utf8_info constant in the constant pool: the method name
        u2 descriptor_index;  // index of a CONSTANT_Utf8_info constant in the constant pool: the method descriptor (parameters and return value)
        u2 attributes_count;  // number of attribute tables
        attribute_info attributes[attributes_count];  // attribute tables
    }
Methods can also carry attribute tables to describe scenario-specific information. From the structure above and the actual bytes, you can work out what a method contains. The meanings of the access flags are shown in the table:
Flag name | Flag value | Meaning |
---|---|---|
ACC_PUBLIC | 0x0001 | Whether the method is public |
ACC_PRIVATE | 0x0002 | Whether the method is private |
ACC_PROTECTED | 0x0004 | Whether the method is protected |
ACC_STATIC | 0x0008 | Whether the method is static |
ACC_FINAL | 0x0010 | Whether the method is final |
ACC_SYNCHRONIZED | 0x0020 | Whether the method is synchronized |
ACC_BRIDGE | 0x0040 | Whether the method is a compiler-generated bridge method |
ACC_VARARGS | 0x0080 | Whether the method accepts a variable number of arguments |
ACC_NATIVE | 0x0100 | Whether the method is native |
ACC_ABSTRACT | 0x0400 | Whether the method is abstract |
ACC_STRICTFP | 0x0800 | Whether the method is strictfp |
ACC_SYNTHETIC | 0x1000 | Whether the method is generated automatically by the compiler |
Take the instance constructor <init>() as an example:
Its information is as follows:
- access_flags: 0x0001, public
- name_index: 0x000E, that is, 14; constant 14 in the constant pool is <init>
- descriptor_index: 0x000F, that is, 15; constant 15 in the constant pool is ()V
- attributes_count: 0x0001, that is, 1; there is one attribute table
The method's signature can be deduced from this information. The instance constructor is expressed as public <init>()V: access flag, method name, then the descriptor with its parameters and return value.
For a method, what matters more is how to express the work it does. In essence, all the code in a method body performs operations, so once the body is compiled into bytecode instructions, running the method means executing those instructions in order. A few key pieces of information are recorded alongside them: the operand stack depth, the number of local variables, and the length of the bytecode.
After attributes_count, 0x0179 to 0x01A5 are the contents of the attribute table. Following the attribute table format described above, the bytes at 0x0179 to 0x017A are 0x0010, 16 in decimal, and constant 16 in the constant pool shows that this attribute table is of type Code.
The structure of the Code type attribute table is as follows:
    Code_attribute {
        u2 attribute_name_index;    // index of a CONSTANT_Utf8_info constant in the constant pool: the attribute name
        u4 attribute_length;        // attribute length
        u2 max_stack;               // maximum operand stack depth
        u2 max_locals;              // number of local variable slots
        u4 code_length;             // length of the bytecode instructions
        u1 code[code_length];       // bytecode instructions
        u2 exception_table_length;  // number of exception table entries
        exception_info exception_table[exception_table_length];  // exception table
        u2 attributes_count;        // number of attribute tables
        attribute_info attributes[attributes_count];  // attribute tables
    }
Method information can be obtained not only by reading the bytecode file directly, but also with
javap -verbose className
which gives the same information. The two are shown side by side:
The red circle is the set of bytecode instructions, the yellow circle is each individual bytecode instruction, the green circle is the basic information of the Code attribute table, and the blue circle is the instance constructor information as parsed by the javap tool.
First, the maximum stack depth is 2: at no point during the method's execution does the operand stack exceed this depth. Next, the number of local variables is 1. This is the storage needed for the local variable table, measured in slots; a slot is the smallest unit in which the JVM allocates memory for local variables. Also in the blue circle is args_size, the number of arguments the method accepts, which is 1, namely this. Finally come the bytecode instructions that the method body was compiled into.
Bytecode instructions are beyond the scope of this article, but they are worth a brief look.
A bytecode instruction consists of a one-byte opcode that identifies a particular operation, followed by zero or more operands that the operation needs.
The instance constructor in this example is compiled to 2A B7 00 01 2A 10 7B B5 00 02 B1, which appears in the blue circle above as:
    0: aload_0              // push this
    1: invokespecial #1     // call the constructor of the parent class
    4: aload_0              // push this
    5: bipush 123           // push 123
    7: putfield #2          // access field m and store 123 into it
    10: return              // return
The bytes for putfield #2 are B5 00 02. B5 is the opcode that identifies the operation, and 00 02 is the operand the operation needs: the value 0x0002 is an index into the constant pool, where the entry is of type CONSTANT_Fieldref_info. The structure of CONSTANT_Fieldref_info is
    CONSTANT_Fieldref_info {
        u1 tag;
        u2 class_index;          // index of a CONSTANT_Class_info constant: the class that owns the field
        u2 name_and_type_index;  // index of a CONSTANT_NameAndType_info constant: the field name and type
    }

    CONSTANT_Class_info {
        u1 tag;
        u2 name_index;           // index of a CONSTANT_Utf8_info constant: the class name
    }

    CONSTANT_NameAndType_info {
        u1 tag;
        u2 name_index;           // index of a CONSTANT_Utf8_info constant: the name
        u2 descriptor_index;     // index of a CONSTANT_Utf8_info constant: the type descriptor
    }
Here the operand refers to the second constant in the constant pool.
B5 00 02 assigns the 123 on the stack to TestClass.m (private int m = 123).
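The eleven bytes of the constructor can be decoded mechanically. The following is a minimal sketch that knows only the five opcodes occurring in this method; the class name TinyDisassembler is made up, and a real disassembler needs the full instruction set from the JVM specification.

```java
import java.util.ArrayList;
import java.util.List;

class TinyDisassembler {
    // Turns the code bytes into "offset: mnemonic operand" lines.
    static List<String> disassemble(byte[] code) {
        List<String> lines = new ArrayList<>();
        int pc = 0;
        while (pc < code.length) {
            int opcode = code[pc] & 0xFF;
            switch (opcode) {
                case 0x2A: // aload_0: no operand
                    lines.add(pc + ": aload_0");
                    pc += 1;
                    break;
                case 0x10: // bipush: one signed byte operand
                    lines.add(pc + ": bipush " + code[pc + 1]);
                    pc += 2;
                    break;
                case 0xB7: // invokespecial: u2 constant pool index
                    lines.add(pc + ": invokespecial #" + u2(code, pc + 1));
                    pc += 3;
                    break;
                case 0xB5: // putfield: u2 constant pool index
                    lines.add(pc + ": putfield #" + u2(code, pc + 1));
                    pc += 3;
                    break;
                case 0xB1: // return: no operand
                    lines.add(pc + ": return");
                    pc += 1;
                    break;
                default:
                    throw new IllegalArgumentException("Opcode not handled in this sketch: " + opcode);
            }
        }
        return lines;
    }

    // Combines two bytes into an unsigned big-endian 16-bit value.
    private static int u2(byte[] b, int i) {
        return ((b[i] & 0xFF) << 8) | (b[i + 1] & 0xFF);
    }

    public static void main(String[] args) {
        byte[] code = {0x2A, (byte) 0xB7, 0x00, 0x01, 0x2A, 0x10, 0x7B,
                       (byte) 0xB5, 0x00, 0x02, (byte) 0xB1};
        disassemble(code).forEach(System.out::println);
    }
}
```

The offsets printed (0, 1, 4, 5, 7, 10) follow directly from each instruction's length: one opcode byte plus the size of its operands.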
The assignment of x is done in the class constructor <clinit>, and the increace() and m() methods can be analyzed in the same way, so they are not covered further here.
Conclusion
At this point, you know what a .class file looks like. With the .class file format table you can parse out the basic information in the file. A .class file can be regarded as a collection of tables; by following the rules each table lays down, step by step, you can find the corresponding information. How a .class file presents information is easy to understand; what is hard is having the patience to untangle the tedious index relationships, especially in the constant pool and attribute table sections. The attribute tables also leave plenty of room for more content depending on the scenario.
This article only covers the basic rules for parsing .class files. Further parsing rules can be studied when you are interested or need them; the method is the same.