Preface

Java programs are cross-platform: “Write Once, Run Anywhere.” Java achieves this with a half-compiled, half-interpreted approach, that is, .class files plus the JVM.

  1. The source program is compiled into .class files. A .class file can be understood as “intermediate code”, and there are strict rules for how information is extracted from it and how its contents are interpreted.
  2. Each platform, according to its own characteristics, implements its own JVM, which understands the program's contents and interprets (translates) the .class file into instructions the local machine can actually execute.

This is how Java achieves its cross-platform nature: it is based on .class files and implemented by the JVM.

The purpose of this article is to understand the .class file and see what the code we write looks like to the JVM. Understanding .class files also helps with understanding the JVM, bytecode instrumentation, and so on.

Basic knowledge

Bytecode

Bytecode is a binary file made up of a sequence of opcode/data pairs that contains an executable program; it is a kind of intermediate code. A byte occupies 8 bits, that is, 8 binary digits.

The .class files discussed in this article are bytecode files. Since each byte holds 8 bits, it is written as two hexadecimal digits for readability, with values from 00 to FF (0 to 255).

Unsigned numbers

Unsigned numbers are the basic data type. They can describe numbers, index references, quantities, or UTF-8 encoded strings. u1, u2, u4, and u8 denote unsigned numbers of 1, 2, 4, and 8 bytes respectively.

Literal

A literal is a notation for a fixed value. On its own it carries no meaning; a context gives it one. For example, 007 by itself means nothing, but used to refer to James Bond it stands for a very capable secret agent. In a program, int x = 10 and String s = "10" give the literal 10 different meanings.

Fully qualified name

The fully qualified name of a class is its full name with every . replaced by /. For example, java.lang.String becomes java/lang/String.

Descriptor

Descriptors describe the data type of a field, and the parameter list and return value of a method. Each identifier character corresponds to a data type:

Identifier  Meaning
B           byte
C           char
D           double
F           float
I           int
J           long
S           short
Z           boolean
V           void
L           Object type; for example, String is written as Ljava/lang/String;
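As a quick check of the two notations above, the following minimal sketch converts a dotted class name to its internal form and lets the JDK build method descriptors through java.lang.invoke.MethodType. The class name DescriptorDemo and its helper methods are made up for illustration.

```java
import java.lang.invoke.MethodType;

class DescriptorDemo {
    // Converts a dotted class name to the internal form used in class files.
    static String internalName(String dotted) {
        return dotted.replace('.', '/');
    }

    // Builds a method descriptor string from a return type and parameter types.
    static String methodDescriptor(Class<?> returnType, Class<?>... params) {
        return MethodType.methodType(returnType, params).toMethodDescriptorString();
    }

    public static void main(String[] args) {
        System.out.println(internalName("java.lang.String"));                       // java/lang/String
        System.out.println(methodDescriptor(int.class));                            // ()I
        System.out.println(methodDescriptor(String.class));                         // ()Ljava/lang/String;
        System.out.println(methodDescriptor(void.class, long.class, String.class)); // (JLjava/lang/String;)V
    }
}
```

A method descriptor lists the parameter descriptors inside parentheses, followed by the return descriptor.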

The Class file

A Java source file contains all the information about a class. Here is a Java class:

    import java.io.Serializable;

    public class TestClass implements Serializable {
        private int m = 123;
        private static int x = 10;
        private static final int y = 20;

        public int increace() {
            return m + 1;
        }

        public void m() throws Exception {
            // The logic is not written
        }

        public static String hello() {
            return "hello word";
        }
    }

This Java file contains the following information:

  1. The class is named TestClass, is externally accessible, and implements the Serializable interface
  2. It has class variables x and y, and a member variable m
  3. It has an externally accessible class method hello() and externally accessible member methods increace() and m()

Note: unless otherwise stated, the .class files in this article are compiled from Java files.

All of this information is compiled into the .class file. The command

javac fileName.java

compiles the Java file into the corresponding .class file. .class files are bytecode files and can be read with a suitable hex editor. The editor used in this article is “010 Editor”, available for Windows and Mac; download it yourself.

A .class file expresses its information as bytes. The data items are packed tightly with no delimiters between them, so almost everything stored in the file is data the program needs in order to run. Parsing the bytecode therefore requires a set of rules that must be followed strictly.

The .class file format stores data in a pseudo-structure similar to a C struct. You can think of a .class file as a collection of tables in which the corresponding data is found by index. In other words, the relative position of a piece of data determines its meaning.

The .class file format is as follows

Type            Name                 Count                  Meaning
u4              magic                1                      Magic number, used to determine whether the virtual machine can accept the file
u2              minor_version        1                      Minor version number
u2              major_version        1                      Major version number
u2              constant_pool_count  1                      Constant pool count
cp_info         constant_pool        constant_pool_count-1  Constant pool contents
u2              access_flags         1                      Access flags
u2              this_class           1                      Class index
u2              super_class          1                      Parent class index
u2              interfaces_count     1                      Number of interfaces
u2              interfaces           interfaces_count       Interface table collection
u2              fields_count         1                      Number of fields
field_info      fields               fields_count           Field table collection
u2              methods_count        1                      Number of methods
method_info     methods              methods_count          Method table collection
u2              attributes_count     1                      Number of attributes
attribute_info  attributes           attributes_count       Attribute table collection
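To make the first rows of the table concrete, here is a minimal sketch that reads the four fixed-size fields at the start of a class file from a byte array. The class name ClassHeaderDemo is made up, and the sample bytes are an assumption: the constant pool count 0x0023 is the value discussed in this article, while the version numbers (minor 0, major 52) are only illustrative.

```java
import java.nio.ByteBuffer;

class ClassHeaderDemo {
    // Reads magic (u4), minor_version (u2), major_version (u2) and
    // constant_pool_count (u2) from the start of a class file.
    static long[] readHeader(byte[] classBytes) {
        ByteBuffer buf = ByteBuffer.wrap(classBytes); // big-endian by default, like class files
        long magic = buf.getInt() & 0xFFFFFFFFL;      // keep the u4 unsigned
        int minor = buf.getShort() & 0xFFFF;
        int major = buf.getShort() & 0xFFFF;
        int constantPoolCount = buf.getShort() & 0xFFFF;
        return new long[] {magic, minor, major, constantPoolCount};
    }

    public static void main(String[] args) {
        // Illustrative first 10 bytes of a class file (version numbers assumed).
        byte[] sample = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE,
                         0x00, 0x00, 0x00, 0x34, 0x00, 0x23};
        long[] h = readHeader(sample);
        System.out.printf("magic=0x%08X minor=%d major=%d constant_pool_count=%d%n",
                h[0], h[1], h[2], h[3]);
    }
}
```

Because the fields have fixed sizes and no delimiters, reading them in order is all the parsing this part needs.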

Some items have a fixed length and others vary, but in every case some constraint tells you how long they are. Each item can be located in the example .class file, so all that remains is to parse the class information layer by layer.

Constant pool

The constant pool holds two main categories of constants: literals and symbolic references. Symbolic references include:

  • Fully qualified names of classes and interfaces
  • The name and descriptor of the field
  • The name and descriptor of the method

Unlike C and C++, Java has no “link” step at compile time; linking happens dynamically when the JVM loads the class file. A .class file therefore stores no final memory layout for methods or fields, and a symbolic reference cannot be used by the JVM until it has been resolved at run time into an actual memory entry address. At run time the JVM takes symbolic references from the constant pool, resolves them into concrete memory addresses, and stores that information in the JVM's method area.

The space occupied by the constant pool is not fixed, so the number of constants is given first, at 0x0008 to 0x0009. The length of each entry is then worked out from its specific constant type.

This is rather tedious: each constant type has its own table, and you must consult the structure of that table to read the entry. The first byte of each constant, a u1 tag, identifies its table structure, as follows

Type                              Tag  Description
CONSTANT_Utf8_info                1    UTF-8 encoded string
CONSTANT_Integer_info             3    Integer literal
CONSTANT_Float_info               4    Floating-point literal
CONSTANT_Long_info                5    Long integer literal
CONSTANT_Double_info              6    Double-precision floating-point literal
CONSTANT_Class_info               7    Symbolic reference to a class or interface
CONSTANT_String_info              8    String literal
CONSTANT_Fieldref_info            9    Symbolic reference to a field
CONSTANT_Methodref_info           10   Symbolic reference to a method in a class
CONSTANT_InterfaceMethodref_info  11   Symbolic reference to a method in an interface
CONSTANT_NameAndType_info         12   Partial symbolic reference to a field or method
CONSTANT_MethodHandle_info        15   Method handle
CONSTANT_MethodType_info          16   Method type
CONSTANT_InvokeDynamic_info       18   Dynamic method call site

This article does not list the table structure of every constant type unless necessary; see a general reference of class file structures for details.

Here is a starting example.

The first constant's type tag is at 0x000A and has the value 0x0A, 10 in decimal, so it is a CONSTANT_Methodref_info. The fields of that table are u1, u2, and u2, so the entry occupies 5 bytes. For a CONSTANT_Utf8_info, the table also has a length field giving the number of bytes the literal occupies, which must be added in. The second constant's tag is at 0x000F and has the value 0x09, 9 in decimal, so it is a CONSTANT_Fieldref_info. And so on.
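The walk described above can be sketched as follows: read the tag, look up the size of that constant type, and jump to the next entry. This is an illustrative sketch rather than a full parser; ConstantPoolWalker and the sample bytes (including the index values inside the entries) are made up, and only the entry sizes matter here.

```java
import java.nio.ByteBuffer;

class ConstantPoolWalker {
    // Size in bytes (tag included) of the constant at the buffer's current position.
    static int entrySize(ByteBuffer buf) {
        int pos = buf.position();
        int tag = buf.get(pos) & 0xFF;
        switch (tag) {
            case 1:                      // Utf8: u1 tag + u2 length + bytes[length]
                return 3 + (buf.getShort(pos + 1) & 0xFFFF);
            case 7: case 8: case 16:     // Class, String, MethodType: u1 + u2
                return 3;
            case 15:                     // MethodHandle: u1 + u1 + u2
                return 4;
            case 3: case 4:              // Integer, Float: u1 + u4
            case 9: case 10: case 11:    // Fieldref, Methodref, InterfaceMethodref: u1 + u2 + u2
            case 12: case 18:            // NameAndType, InvokeDynamic: u1 + u2 + u2
                return 5;
            case 5: case 6:              // Long, Double: u1 + u8
                return 9;
            default:
                throw new IllegalArgumentException("unknown constant tag " + tag);
        }
    }

    // Walks the pool starting at `offset` and returns the offset just past it.
    static int endOfPool(byte[] classBytes, int offset, int constantPoolCount) {
        ByteBuffer buf = ByteBuffer.wrap(classBytes);
        buf.position(offset);
        for (int index = 1; index < constantPoolCount; index++) {
            int tag = buf.get(buf.position()) & 0xFF;
            buf.position(buf.position() + entrySize(buf));
            if (tag == 5 || tag == 6) {
                index++;                 // long and double occupy two pool indices
            }
        }
        return buf.position();
    }

    public static void main(String[] args) {
        byte[] sample = {
            (byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0, 0, 0, 0x34, 0, 4, // header, count = 4
            0x0A, 0, 5, 0, 0x12,   // #1 Methodref, tag at 0x000A, 5 bytes
            0x09, 0, 4, 0, 0x13,   // #2 Fieldref, tag at 0x000F, 5 bytes
            0x01, 0, 1, 0x6D       // #3 Utf8 "m", 3 + 1 bytes
        };
        System.out.println(endOfPool(sample, 10, 4)); // 24
    }
}
```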

Finding each constant step by step like this is tedious. Fortunately, the JDK ships with a tool, javap, that analyzes the bytecode of a .class file. With the command

javap -verbose fileName

you get the following information (only the constant pool part is shown):

The constant pool count at 0x0008 to 0x0009 is 0x0023, which is 35 in decimal. Index 0 is reserved, so the constant pool entries are indexed from 1 to 34: the count is counted from 0, while the indices in the javap output start at 1.

Without a method, analyzing the constant pool is quite confusing. Personally, I think of the constant pool entries as “building blocks”.

In this example, the constant types involved in the constant pool are:

  • CONSTANT_Methodref_info
  • CONSTANT_Fieldref_info
  • CONSTANT_String_info
  • CONSTANT_Class_info
  • CONSTANT_Integer_info
  • CONSTANT_NameAndType_info
  • CONSTANT_Utf8_info

Setting the specific table structures aside for a moment, the relationships between the table types above are as follows:

This is only the composition of the constant types involved in the current example. A constant of any type, when broken down repeatedly, either ends up pointing to a constant of the basic type CONSTANT_Utf8_info or is itself a basic type such as CONSTANT_Integer_info. A CONSTANT_Utf8_info has little meaning on its own; the other types are the contexts that give it meaning.

CONSTANT_Utf8_info can be considered the most basic type:

    CONSTANT_Utf8_info {
        u1 tag;             // constant type
        u2 length;          // length in bytes
        u1 bytes[length];   // UTF-8 encoded bytes
    }

When you encounter a constant of type CONSTANT_Utf8_info, decode its bytes as (modified) UTF-8 to obtain the literal.
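Decoding such an entry can be sketched with java.io.DataInputStream, whose readUTF() method reads a two-byte length followed by that many bytes of modified UTF-8, which is the same layout as the length and bytes fields above. The class name Utf8ConstantDemo and the sample entry are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class Utf8ConstantDemo {
    // Decodes one CONSTANT_Utf8_info entry: u1 tag, u2 length, bytes[length].
    static String decodeUtf8Constant(byte[] entry) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(entry))) {
            int tag = in.readUnsignedByte();
            if (tag != 1) {
                throw new IllegalArgumentException("not a CONSTANT_Utf8_info, tag=" + tag);
            }
            // readUTF consumes the u2 length and then the modified UTF-8 bytes.
            return in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // tag 0x01, length 0x0004, then the bytes of "Code"
        byte[] entry = {0x01, 0x00, 0x04, 0x43, 0x6F, 0x64, 0x65};
        System.out.println(decodeUtf8Constant(entry)); // Code
    }
}
```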

Class-level information

The class is defined as

public class TestClass implements Serializable

It contains the following information:

  • The class itself: TestClass
  • Access flag: public
  • Implement the Serializable interface
  • The parent class is Object

According to the .class file format table, the data immediately following the constant pool is the class-level data.

From 0x0143 to 0x014C:

  • access_flags (u2): the hex value is 0x0021
  • this_class (u2): decimal value 5, pointing to the 5th constant in the constant pool, of type CONSTANT_Class_info, with class name TestClass
  • super_class (u2): decimal value 6, pointing to the 6th constant, of type CONSTANT_Class_info, with class name java/lang/Object
  • interfaces_count (u2): the number of implemented interfaces, here 1
  • interfaces[0] (u2): points to the 7th constant in the constant pool, of type CONSTANT_Class_info, with interface name java/io/Serializable
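The ten bytes in this range can be read in order with a few lines of Java. The sketch below rebuilds the bytes from the values listed above (0x0021, 5, 6, one interface, 7) rather than reading the real file; ClassLevelDemo is a made-up name.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

class ClassLevelDemo {
    // Returns {access_flags, this_class, super_class, interfaces[0], interfaces[1], ...}.
    static int[] parse(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        int accessFlags = buf.getShort() & 0xFFFF;
        int thisClass = buf.getShort() & 0xFFFF;
        int superClass = buf.getShort() & 0xFFFF;
        int interfacesCount = buf.getShort() & 0xFFFF;
        int[] result = new int[3 + interfacesCount];
        result[0] = accessFlags;
        result[1] = thisClass;
        result[2] = superClass;
        for (int i = 0; i < interfacesCount; i++) {
            result[3 + i] = buf.getShort() & 0xFFFF; // each entry is a constant pool index
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] sample = {0x00, 0x21, 0x00, 0x05, 0x00, 0x06, 0x00, 0x01, 0x00, 0x07};
        System.out.println(Arrays.toString(parse(sample))); // [33, 5, 6, 7]
    }
}
```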

The CONSTANT_Class_info constant table is structured as follows

    CONSTANT_Class_info {
        u1 tag;          // constant type
        // index of a CONSTANT_Utf8_info constant in the constant pool,
        // which holds the fully qualified name of the class or interface
        u2 name_index;
    }

This matches the earlier point that constant pool entries are building blocks; the other constant types work the same way.

Access flags are represented as flag bits; the meaning of each flag is shown in the table

Flag name       Value   Meaning
ACC_PUBLIC      0x0001  Whether the type is public
ACC_FINAL       0x0010  Whether it is declared final; only classes can set it
ACC_SUPER       0x0020  Whether to use the new semantics of the invokespecial bytecode instruction; the semantics changed in JDK 1.0.2, and this flag distinguishes which to use
ACC_INTERFACE   0x0200  Marks this as an interface
ACC_ABSTRACT    0x0400  Whether the type is abstract
ACC_SYNTHETIC   0x1000  Marks this class as not generated from user code
ACC_ANNOTATION  0x2000  Marks this as an annotation
ACC_ENUM        0x4000  Marks this as an enumeration

In the current example the value is 0x0001 | 0x0020 = 0x0021, that is, ACC_PUBLIC and ACC_SUPER.
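Going the other way, from a flag value back to the flag names, is a matter of testing each bit. A minimal sketch using the class-level flag table above (ClassAccessFlags is a made-up helper, not a JDK class):

```java
import java.util.ArrayList;
import java.util.List;

class ClassAccessFlags {
    private static final int[] MASKS = {
        0x0001, 0x0010, 0x0020, 0x0200, 0x0400, 0x1000, 0x2000, 0x4000
    };
    private static final String[] NAMES = {
        "ACC_PUBLIC", "ACC_FINAL", "ACC_SUPER", "ACC_INTERFACE",
        "ACC_ABSTRACT", "ACC_SYNTHETIC", "ACC_ANNOTATION", "ACC_ENUM"
    };

    // Returns the names of all flags whose bit is set in accessFlags.
    static List<String> decode(int accessFlags) {
        List<String> set = new ArrayList<>();
        for (int i = 0; i < MASKS.length; i++) {
            if ((accessFlags & MASKS[i]) != 0) {
                set.add(NAMES[i]);
            }
        }
        return set;
    }

    public static void main(String[] args) {
        System.out.println(decode(0x0021)); // [ACC_PUBLIC, ACC_SUPER]
    }
}
```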

Attribute table

Attribute tables are somewhat special. The .class file itself, field tables, method tables, and so on can each carry their own set of attribute tables to describe scenario-specific information, so the attribute table is explained here in advance.

The characteristics of attribute tables are:

  1. The rules are looser: no strict order, length, or content is required
  2. Any compiler may write its own custom attribute information into an attribute table, as long as the name does not clash with an existing attribute; the JVM ignores attributes it does not recognize.

The attribute table structure is

    attribute_info {
        u2 attribute_name_index;     // points to a CONSTANT_Utf8_info constant in the constant pool, giving the attribute name
        u4 attribute_length;         // length in bytes of the info part
        u1 info[attribute_length];   // contents, whose structure depends on the specific attribute
    }

So an attribute table occupies u2 + u4 + attribute_length bytes, that is, 6 + attribute_length.

Java predefines many attribute tables. The ones involved in this article are described below; consult the others when you actually need them.

Attribute name   Used in         Meaning
Code             Method table    Bytecode instructions compiled from Java code
ConstantValue    Field table     Constant value defined by the final keyword
Exceptions       Method table    Exceptions thrown by the method
LineNumberTable  Code attribute  Mapping between Java source line numbers and bytecode instructions
SourceFile       Class file      Records the source file name
The attribute tables carried by the field table and the method table are not covered yet. At this point only SourceFile is involved, an attribute table carried by the .class file itself.

Range: 0x025A to 0x0262. The total size is u2 + u4 + attribute_length = 8 bytes. The SourceFile attribute structure is as follows

    SourceFile_attribute {
        u2 attribute_name_index;   // points to a CONSTANT_Utf8_info constant in the constant pool, giving the attribute name
        u4 attribute_length;       // length of the data that follows
        u2 sourcefile_index;       // points to a CONSTANT_Utf8_info constant in the constant pool, giving the source file name
    }

So, from the SourceFile attribute table, the source file name is TestClass.java.
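Because every attribute starts with the same u2 + u4 header, a reader can measure or skip any attribute without understanding its contents. The sketch below shows this for a SourceFile-shaped attribute; AttributeDemo is a made-up name, and the two constant pool indices in the sample (0x0021 and 0x0022) are hypothetical, since the article does not state them.

```java
import java.nio.ByteBuffer;

class AttributeDemo {
    // Reads one attribute_info at the buffer's position, skips its info part, and
    // returns {attribute_name_index, attribute_length, total bytes occupied}.
    static int[] readAttributeHeader(ByteBuffer buf) {
        int nameIndex = buf.getShort() & 0xFFFF;   // u2
        int length = buf.getInt();                 // u4, assumed to fit in a signed int
        buf.position(buf.position() + length);     // skip info[attribute_length]
        return new int[] {nameIndex, length, 2 + 4 + length};
    }

    public static void main(String[] args) {
        // name index (hypothetical), attribute_length = 2, sourcefile_index (hypothetical)
        byte[] sourceFileAttribute = {0x00, 0x21, 0x00, 0x00, 0x00, 0x02, 0x00, 0x22};
        int[] a = readAttributeHeader(ByteBuffer.wrap(sourceFileAttribute));
        System.out.println("length=" + a[1] + " total=" + a[2]); // length=2 total=8
    }
}
```

This is also why the JVM can ignore attributes it does not recognize: attribute_length alone is enough to step over them.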

Field table

According to the .class file format table, the interface table is followed by the field count and the field table collection.

The range is 0x014D to 0x016E, where 0x014D to 0x014E holds the field count; its value 0x0003 means there are 3 fields. The field table structure is as follows

    field_info {
        u2 access_flags;       // access flags
        u2 name_index;         // points to a CONSTANT_Utf8_info constant in the constant pool, giving the field name
        u2 descriptor_index;   // points to a CONSTANT_Utf8_info constant in the constant pool, giving the field descriptor
        u2 attributes_count;   // number of attribute tables
        attribute_info attributes[attributes_count];   // attribute tables
    }

Like the .class file, a field table can carry its own attribute tables to handle special scenarios. attribute_info is optional: when attributes_count is 0, no attribute_info follows. Fields also have access flags that further constrain them. The field name is given by name_index, which points to a constant in the constant pool, and the field type is given by a descriptor, such as I for int (see the basics above if you have forgotten).

The current example defines the following fields:

    private int m = 123;
    private static int x = 10;
    private static final int y = 20;

The member variable m and the class variables x and y are defined. Take y as an example.

The position is 0x015F to 0x016E, where:

  • Access flags: 0x001A
  • Field name index: 0x000B, decimal 11, pointing to the 11th constant in the constant pool, which is y
  • Descriptor index: 0x0009, pointing to the 9th constant in the constant pool, which is I
  • Attribute table count: 0x0001, that is, 1
  • The attribute table occupies u2 + u4 + 2 = 8 bytes in total, from 0x0167 to 0x016E

The meanings of field access flag bits are shown in the table:

Flag name      Value   Meaning
ACC_PUBLIC     0x0001  Whether it is public
ACC_PRIVATE    0x0002  Whether it is private
ACC_PROTECTED  0x0004  Whether it is protected
ACC_STATIC     0x0008  Whether it is static
ACC_FINAL      0x0010  Whether it is final
ACC_VOLATILE   0x0040  Whether it is volatile
ACC_TRANSIENT  0x0080  Whether it is transient
ACC_SYNTHETIC  0x1000  Whether it is generated automatically by the compiler
ACC_ENUM       0x4000  Whether the field is an enum

The current field is private static final, that is, 0x0002 | 0x0008 | 0x0010 = 0x001A. Putting the information above together gives private static final int y.
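For the flags used here, the bit values in the class file are the same as the constants in java.lang.reflect.Modifier, so the JDK can render the value directly. A small sketch (FieldFlagsDemo is a made-up name):

```java
import java.lang.reflect.Modifier;

class FieldFlagsDemo {
    // Renders access flag bits as Java modifier keywords.
    static String describe(int accessFlags) {
        return Modifier.toString(accessFlags);
    }

    public static void main(String[] args) {
        int flags = 0x0002 | 0x0008 | 0x0010;                // ACC_PRIVATE | ACC_STATIC | ACC_FINAL
        System.out.println(String.format("0x%04X", flags));  // 0x001A
        System.out.println(describe(flags) + " int y");      // private static final int y
    }
}
```

Modifier does not cover every class-file flag (ACC_SYNTHETIC and ACC_ENUM, for example, have no public constant there), so this shortcut is only a convenience for the common modifiers.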

ConstantValue is the attribute table used to assign values to variables of a basic data type that are modified by both final and static. Besides the regular attribute fields, the ConstantValue attribute table has a u2 constantvalue_index that points to the constant in the constant pool used as the initial value. The value at 0x016D to 0x016E is 0x000D, which points to index 13 of the constant pool, the integer 20.

In the example, the member variable m and the class variable x are assigned in the instance constructor and the class constructor respectively, as explained below.

Method table

After the field table come the method count and the method table collection.

The range is 0x016F to 0x0258, and the method count is 0x0005. In addition to the methods defined in the example, increace(), m(), and hello(), there are the instance constructor <init>()V and the class constructor <clinit>().

The method table structure is as follows:

    method_info {
        u2 access_flags;       // access flags
        u2 name_index;         // method name index, pointing to a CONSTANT_Utf8_info constant in the constant pool
        u2 descriptor_index;   // method descriptor index, pointing to a CONSTANT_Utf8_info constant in the constant pool
        u2 attributes_count;   // number of attribute tables
        attribute_info attributes[attributes_count];   // attribute table contents
    }

Methods can also carry attribute tables to describe scenario-specific information. From the structure above and the actual bytes, you can work out what a method contains. The meanings of the access flags are shown in the table:

Flag name         Value   Meaning
ACC_PUBLIC        0x0001  Whether it is public
ACC_PRIVATE       0x0002  Whether it is private
ACC_PROTECTED     0x0004  Whether it is protected
ACC_STATIC        0x0008  Whether it is static
ACC_FINAL         0x0010  Whether it is final
ACC_SYNCHRONIZED  0x0020  Whether it is synchronized
ACC_BRIDGE        0x0040  Whether it is a bridge method generated by the compiler
ACC_VARARGS       0x0080  Whether it accepts a variable number of arguments
ACC_NATIVE        0x0100  Whether it is native
ACC_ABSTRACT      0x0400  Whether it is abstract
ACC_STRICTFP      0x0800  Whether it is strictfp
ACC_SYNTHETIC     0x1000  Whether it is generated automatically by the compiler

Take the instance constructor <init>() as an example:

The information is as follows:

  • access_flags: 0x0001, public
  • name_index: 0x000E, which is 14 and corresponds to <init> in the constant pool
  • descriptor_index: 0x000F, which is 15 and corresponds to ()V in the constant pool
  • attributes_count: 0x0001, which is 1, so there is 1 attribute table

From this the method's signature can be derived: the instance constructor is public <init>()V, that is, access flag, method name, and descriptor (parameters and return value) in turn.

For a method, what matters more is how the functionality it provides is expressed. In essence, all the code in a method body performs computation, so once the body is compiled into bytecode instructions, the method runs by executing those instructions in order. A few more key pieces of information are recorded alongside them: the stack depth, the number of local variables, and the length of the bytecode.

After attributes_count, 0x0179 to 0x01A5 holds the attribute table contents. Following the attribute table format described above, the value at 0x0179 to 0x017A is 0x0010, 16 in decimal, and the constant pool shows that this attribute table is of type Code.

The structure of the Code attribute table is as follows:

    Code_attribute {
        u2 attribute_name_index;     // points to a CONSTANT_Utf8_info constant in the constant pool, giving the attribute name
        u4 attribute_length;         // attribute length
        u2 max_stack;                // maximum operand stack depth
        u2 max_locals;               // size of the local variable table
        u4 code_length;              // length of the bytecode instructions
        u1 code[code_length];        // bytecode instructions
        u2 exception_table_length;   // number of exception table entries
        exception_info exception_table[exception_table_length];   // exception table
        u2 attributes_count;         // number of attribute tables
        attribute_info attributes[attributes_count];               // attribute tables
    }
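A minimal sketch of reading the leading fields of this structure is shown below. The sample bytes are an assumption built for illustration: they use the constructor's name index 0x0010, max_stack 2, max_locals 1 and the 11 code bytes discussed in this article, but with an empty exception table and no nested attributes, so attribute_length (23) is not the value found in the real file. CodeAttributeDemo is a made-up name.

```java
import java.nio.ByteBuffer;

class CodeAttributeDemo {
    final int maxStack;
    final int maxLocals;
    final byte[] code;

    CodeAttributeDemo(int maxStack, int maxLocals, byte[] code) {
        this.maxStack = maxStack;
        this.maxLocals = maxLocals;
        this.code = code;
    }

    // Parses the fields up to and including code[code_length]; the exception
    // table and nested attributes that follow are not read in this sketch.
    static CodeAttributeDemo parse(byte[] attribute) {
        ByteBuffer buf = ByteBuffer.wrap(attribute);
        buf.getShort();                               // attribute_name_index
        buf.getInt();                                 // attribute_length
        int maxStack = buf.getShort() & 0xFFFF;
        int maxLocals = buf.getShort() & 0xFFFF;
        byte[] code = new byte[buf.getInt()];         // code_length
        buf.get(code);
        return new CodeAttributeDemo(maxStack, maxLocals, code);
    }

    public static void main(String[] args) {
        byte[] sample = {
            0x00, 0x10,                               // attribute_name_index = 16 ("Code")
            0x00, 0x00, 0x00, 0x17,                   // attribute_length = 23 (illustrative)
            0x00, 0x02,                               // max_stack = 2
            0x00, 0x01,                               // max_locals = 1
            0x00, 0x00, 0x00, 0x0B,                   // code_length = 11
            0x2A, (byte) 0xB7, 0x00, 0x01, 0x2A, 0x10, 0x7B, (byte) 0xB5, 0x00, 0x02, (byte) 0xB1,
            0x00, 0x00,                               // exception_table_length = 0
            0x00, 0x00                                // attributes_count = 0
        };
        CodeAttributeDemo c = parse(sample);
        System.out.println("max_stack=" + c.maxStack + " max_locals=" + c.maxLocals
                + " code_length=" + c.code.length);
    }
}
```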

Method information can be obtained not only directly by reading bytecode files, but also by

javap -verbose className

Putting the two together:

The red circle is the set of bytecode instructions, the yellow circles are individual bytecode instructions, the green circle is the basic information of the Code attribute table, and the blue circle is the instance constructor information parsed by the javap tool.

First, the maximum stack depth is 2; at no point during the method's execution does the operand stack exceed this depth. Next, the number of local variables is 1, which is the storage space the local variable table needs, measured in slots; a slot is the smallest unit the JVM uses when allocating memory for local variables. The blue circle also shows args_size, the number of arguments the method accepts, which is 1: the implicit this. Finally, the method's code block is translated into bytecode instructions.

Bytecode instructions are beyond the scope of this article, but a brief look is worthwhile.

A bytecode instruction consists of a one-byte opcode that represents a particular operation, followed by zero or more operands that the operation requires.

The instance constructor in this example is compiled to 2A B7 00 01 2A 10 7B B5 00 02 B1, which corresponds to the blue circle above:

    0: aload_0             // push this
    1: invokespecial #1    // call the method referenced by constant #1
    4: aload_0             // push this
    5: bipush 123          // push 123
    7: putfield #2         // access field m and store 123 into it
    10: return             // return
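The mapping from the byte sequence 2A B7 00 01 2A 10 7B B5 00 02 B1 to this listing can be reproduced with a very small disassembler. This sketch knows only the five opcodes that appear in this constructor and rejects anything else; MiniDisassembler is a made-up name, not a JDK tool.

```java
import java.util.ArrayList;
import java.util.List;

class MiniDisassembler {
    // Reads a big-endian u2 operand starting at index `at`.
    static int u2(byte[] b, int at) {
        return ((b[at] & 0xFF) << 8) | (b[at + 1] & 0xFF);
    }

    // Turns code bytes into "offset: mnemonic operand" lines.
    static List<String> disassemble(byte[] code) {
        List<String> out = new ArrayList<>();
        int pc = 0;
        while (pc < code.length) {
            int opcode = code[pc] & 0xFF;
            switch (opcode) {
                case 0x2A: out.add(pc + ": aload_0"); pc += 1; break;
                case 0xB1: out.add(pc + ": return"); pc += 1; break;
                case 0x10: out.add(pc + ": bipush " + code[pc + 1]); pc += 2; break; // signed byte operand
                case 0xB7: out.add(pc + ": invokespecial #" + u2(code, pc + 1)); pc += 3; break;
                case 0xB5: out.add(pc + ": putfield #" + u2(code, pc + 1)); pc += 3; break;
                default:
                    throw new IllegalArgumentException(
                            String.format("opcode 0x%02X is not handled by this sketch", opcode));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] code = {0x2A, (byte) 0xB7, 0x00, 0x01, 0x2A, 0x10, 0x7B,
                       (byte) 0xB5, 0x00, 0x02, (byte) 0xB1};
        disassemble(code).forEach(System.out::println);
    }
}
```

The offsets jump from 1 to 4 and from 7 to 10 because invokespecial and putfield each carry a two-byte constant pool index as their operand.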

The bytecode of putfield #2 is B5 00 02, where B5 is the opcode and 00 02 is the operand the operation needs. Its value 0x0002 is a constant pool index that refers to a constant of type CONSTANT_Fieldref_info. The CONSTANT_Fieldref_info structure is

    CONSTANT_Fieldref_info {
        u1 tag;
        u2 class_index;           // points to a CONSTANT_Class_info constant: the class that owns the field
        u2 name_and_type_index;   // points to a CONSTANT_NameAndType_info constant: the field name and type
    }

    CONSTANT_Class_info {
        u1 tag;
        u2 name_index;            // points to a CONSTANT_Utf8_info constant: the class name
    }

    CONSTANT_NameAndType_info {
        u1 tag;
        u2 name_index;            // points to a CONSTANT_Utf8_info constant: the name
        u2 descriptor_index;      // points to a CONSTANT_Utf8_info constant: the type
    }

Here the operand refers to the 2nd constant in the constant pool.

B5 00 02 assigns the 123 on the stack to TestClass.m, which is how private int m = 123 is carried out.

The assignment of x is done in the class constructor, and the increace() and m() methods can be analyzed in the same way, so they are not covered further here.

Conclusion

By now you know what a .class file looks like. Using the .class file format table you can parse the basic information in the file. A .class file can be regarded as a collection of tables; by following the rules each table defines, step by step, you can find the corresponding information. How a .class file presents information is easy to understand; the hard part is having the patience to untangle the tedious index relationships, especially in the constant pool and the attribute tables. The attribute tables leave plenty of room to carry additional content for different scenarios.

This article covers only the basic rules for parsing .class files. For further parsing rules, look them up when you are interested or need them; the method stays the same.

References

“Understanding the Java Virtual Machine in Depth” — chapter 6

In-depth understanding of JVM bytecode execution engines

What is the meaning of class files in Java?

Why is the Java language cross-platform