Java’s best-known slogan is “compile once, run everywhere.” Here “compile” means that the compiler turns Java source code into Java bytecode files (.class files; this article uses the two terms interchangeably), and “run” means that the Java virtual machine (JVM) executes those bytecode files. Java is cross-platform because each platform has its own JVM implementation: any standard-conforming bytecode file can be executed by any JVM on any platform. This article uses a simple example to analyze the structure of bytecode and deepen our understanding of how Java programs run.

1. Prepare the .class file

The first step is to prepare a bytecode file. TestByteCode.java:

package com.sinosun.test;

public class TestByteCode{
	private int a = 1;
	public String b = "2";
	
	protected void method1(){}
	
	public int method2(){
		return this.a;
	}
	
	private String method3(){
		return this.b;
	}
}

Compile the code above with the javac command to get the corresponding TestByteCode.class file. That completes the first step.

2. Manually parse .class files

After the previous step we have TestByteCode.class, the bytecode file we need. Let’s look at its contents first. (Note that IDEA automatically decompiles a .class file when you open it, so use the HexView plugin for IDEA to view the file, or open it directly in Sublime Text.) You can see that the bytecode file consists of a large number of hexadecimal bytes. The red box in the image below marks the actual contents of the .class file:
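If neither editor is at hand, the same hex view can be produced with a few lines of Java. This is only a minimal sketch: the class name HexDump is made up for this example, and the path of the .class file is assumed to be passed as the first command-line argument.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

final class HexDump {
    /** Formats bytes as rows of 16 two-digit hex values, each row prefixed by its offset. */
    static String dump(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < data.length; i++) {
            if (i % 16 == 0) {
                if (i > 0) {
                    sb.append('\n');
                }
                sb.append(String.format("%08x:", i));
            }
            sb.append(String.format(" %02x", data[i] & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Assumption: args[0] is the path to your own .class file.
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(dump(bytes));
    }
}
```

Run against any class file, the first row should start with ca fe ba be, which is the magic number discussed in the next section.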

To understand a class file, you must first know how it is structured. According to the JVM specification, a class file consists of ten parts: magic number, version, constant_pool, access_flags, this_class, super_class, interfaces, fields, methods, and attributes. Bytecode contains two kinds of data: unsigned numbers and tables. The unsigned numbers are u1, u2, u4, and u8, which are 1, 2, 4, and 8 bytes long respectively. A table is a composite structure built from unsigned numbers.

According to the specification, a bytecode file has the following fixed format:

As the table above shows, bytecode uses a fixed file structure and fixed data types to delimit its content. The structure is very compact: there is no redundant information, not even a separator.

3. Magic number and version number

According to the structure table, the first four bytes of a .class file store its magic number. The magic number is a fixed value, 0xCAFEBABE, which is how the JVM recognizes .class files. File types are usually distinguished by their extensions, but an extension can be changed at will, so the virtual machine checks these four bytes before loading a class file and refuses to load it unless they are 0xCAFEBABE.

DZone explains why the magic number is 0xcafebabe.
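The check itself is easy to reproduce. Below is a small sketch of it, not the JVM's actual loading code; the class name MagicCheck and the method hasClassMagic are invented for illustration. It relies on the fact that ByteBuffer reads big-endian by default, which is also the byte order of class files.

```java
import java.nio.ByteBuffer;

final class MagicCheck {
    static final int MAGIC = 0xCAFEBABE;

    /** Returns true if the data starts with the four bytes CA FE BA BE. */
    static boolean hasClassMagic(byte[] data) {
        // ByteBuffer is big-endian by default, matching the class file byte order.
        return data.length >= 4 && ByteBuffer.wrap(data).getInt() == MAGIC;
    }
}
```

Renaming a zip archive to Something.class, for example, would not get past this test, because its first four bytes are different.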

The version number follows the magic number and consists of two 2-byte fields: the minor version, then the major version, of the current .class file. The table below shows how version numbers correspond to JDK versions. The version number of a compiled .class file depends on the -target parameter used at compile time.

| Compiler version | -target parameter | Hexadecimal | Decimal (major version) |
|---|---|---|---|
| JDK 1.6.0_01 | none (default -target 1.6) | 00 00 00 32 | 50 |
| JDK 1.6.0_01 | -target 1.5 | 00 00 00 31 | 49 |
| JDK 1.6.0_01 | -source 1.4 -target 1.4 | 00 00 00 30 | 48 |
| JDK 1.7.0 | none (default -target 1.7) | 00 00 00 33 | 51 |
| JDK 1.7.0 | -target 1.6 | 00 00 00 32 | 50 |
| JDK 1.7.0 | -source 1.4 -target 1.4 | 00 00 00 30 | 48 |
| JDK 1.8.0 | none | 00 00 00 34 | 52 |

In the .class file from Section 2, the four bytes after the magic number are 0x0000 0034 (minor version 0, major version 52), which corresponds to JDK 1.8.0, the JDK version used at compile time.
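The two version fields can be read the same way as the magic number. The sketch below assumes the input starts at the beginning of the class file; ClassVersion and describe are names invented for this example. The "major minus 44" mapping is only a rule of thumb that matches the table above (48 is 1.4, 50 is 6, 52 is 8), not something the class file itself states.

```java
import java.nio.ByteBuffer;

final class ClassVersion {
    /** Reads minor_version and major_version, which follow the 4-byte magic number. */
    static String describe(byte[] header) {
        ByteBuffer buf = ByteBuffer.wrap(header);   // big-endian by default
        buf.getInt();                               // skip the magic number
        int minor = buf.getShort() & 0xffff;        // u2 minor_version comes first
        int major = buf.getShort() & 0xffff;        // u2 major_version comes second
        // Rule of thumb (an assumption of this sketch): Java release = major - 44.
        return "major=" + major + " minor=" + minor + " (Java " + (major - 44) + ")";
    }
}
```

Feeding it the first eight bytes of our file, ca fe ba be 00 00 00 34, gives major 52 and minor 0.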

4. Constant pool

The constant pool is one of the key parts of parsing a .class file. It starts with the number of entries in the pool. In the file from Section 2, the value of constant_pool_count is 0x001c, or 28 in decimal. According to the JVM specification, constant_pool_count equals the number of entries in constant_pool plus 1, so the pool holds 27 constants.

According to the JVM specification, constants in a constant pool have the following general format:

cp_info {
	u1 tag;
	u1 info[];
}

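The JVM specification for Java SE 8 defines 14 types of constants (there were 11 before Java 7 added CONSTANT_MethodHandle, CONSTANT_MethodType, and CONSTANT_InvokeDynamic). Their tags and contents are shown in the following table: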

Let’s take an example to see how to analyze constants. In the figure below, the red line is part of the constant pool.

The first tag is 0x0a. Looking it up in the table above shows that this constant is a CONSTANT_Methodref_info, a reference to a method. The two 2-byte values following the tag are indexes into the constant pool: the first points to a CONSTANT_Class_info constant and the second to a CONSTANT_NameAndType_info constant. The complete data for this constant is 0a 0006 0016, so the two indexes refer to the 6th and 22nd constants in the pool.

0a 0006 0016 Methodref class#6 nameAndType#22

Since the 6th and 22nd constants have not been parsed yet, their indexes serve as placeholders for now.

Similarly, other constants can be analyzed, and the complete constant pool obtained by analysis is as follows:

| Index | Hexadecimal | Meaning | Constant value |
|---|---|---|---|
| 1 | 0a 0006 0016 | Methodref #6 #22 | java/lang/Object."<init>":()V |
| 2 | 09 0005 0017 | Fieldref #5 #23 | com/sinosun/test/TestByteCode.a:I |
| 3 | 08 0018 | String #24 | 2 |
| 4 | 09 0005 0019 | Fieldref #5 #25 | com/sinosun/test/TestByteCode.b:Ljava/lang/String; |
| 5 | 07 001a | Class #26 | com/sinosun/test/TestByteCode |
| 6 | 07 001b | Class #27 | java/lang/Object |
| 7 | 01 0001 61 | Utf8 | a |
| 8 | 01 0001 49 | Utf8 | I |
| 9 | 01 0001 62 | Utf8 | b |
| 10 | 01 0012 4c6a6176612f6c616e672f537472696e673b | Utf8 | Ljava/lang/String; |
| 11 | 01 0006 3c 69 6e 69 74 3e | Utf8 | <init> |
| 12 | 01 0003 28 29 56 | Utf8 | ()V |
| 13 | 01 0004 43 6f 64 65 | Utf8 | Code |
| 14 | 01 000f 4c696e654e756d6265725461626c65 | Utf8 | LineNumberTable |
| 15 | 01 0007 6d 65 74 68 6f 64 31 | Utf8 | method1 |
| 16 | 01 0007 6d 65 74 68 6f 64 32 | Utf8 | method2 |
| 17 | 01 0003 28 29 49 | Utf8 | ()I |
| 18 | 01 0007 6d 65 74 68 6f 64 33 | Utf8 | method3 |
| 19 | 01 0014 28294c6a6176612f6c616e672f537472696e673b | Utf8 | ()Ljava/lang/String; |
| 20 | 01 000a 53 6f 75 72 63 65 46 69 6c 65 | Utf8 | SourceFile |
| 21 | 01 0011 5465737442797465436f64652e6a617661 | Utf8 | TestByteCode.java |
| 22 | 0c 000b 000c | NameAndType #11 #12 | "<init>":()V |
| 23 | 0c 0007 0008 | NameAndType #7 #8 | a:I |
| 24 | 01 0001 32 | Utf8 | 2 |
| 25 | 0c 0009 000a | NameAndType #9 #10 | b:Ljava/lang/String; |
| 26 | 01 001d 636f6d2f73696e6f73756e2f746573742f5465737442797465436f6465 | Utf8 | com/sinosun/test/TestByteCode |
| 27 | 01 0010 6a6176612f6c616e672f4f626a656374 | Utf8 | java/lang/Object |

The table above shows all the constants parsed out of the constant pool. Usage of these constants will be explained later.
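The manual walk through the pool boils down to "read a tag, then read as many bytes as that tag requires". Here is a rough sketch of that loop, covering the tags defined up to Java SE 8. ConstantPoolReader is a made-up name, the input is assumed to start at constant_pool_count, and indexes are printed as they are rather than resolved to their targets.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class ConstantPoolReader {
    private static int u2(ByteBuffer buf) {
        return buf.getShort() & 0xffff;
    }

    /** Parses a constant pool; data must start at the constant_pool_count field. */
    static List<String> parse(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data);      // big-endian, like class files
        int count = u2(buf);                         // entries are numbered 1..count-1
        List<String> entries = new ArrayList<>();
        for (int i = 1; i < count; i++) {
            int tag = buf.get() & 0xff;
            String text;
            switch (tag) {
                case 1: {                            // Utf8: u2 length, then the bytes
                    byte[] raw = new byte[u2(buf)];
                    buf.get(raw);
                    // Class files use "modified UTF-8"; plain UTF-8 is enough for ASCII names.
                    text = "Utf8 " + new String(raw, StandardCharsets.UTF_8);
                    break;
                }
                case 3:  text = "Integer " + buf.getInt(); break;
                case 4:  text = "Float " + buf.getFloat(); break;
                case 5:  text = "Long " + buf.getLong(); break;
                case 6:  text = "Double " + buf.getDouble(); break;
                case 7:  text = "Class #" + u2(buf); break;
                case 8:  text = "String #" + u2(buf); break;
                case 9:  text = "Fieldref #" + u2(buf) + " #" + u2(buf); break;
                case 10: text = "Methodref #" + u2(buf) + " #" + u2(buf); break;
                case 11: text = "InterfaceMethodref #" + u2(buf) + " #" + u2(buf); break;
                case 12: text = "NameAndType #" + u2(buf) + " #" + u2(buf); break;
                case 15: text = "MethodHandle kind=" + (buf.get() & 0xff) + " #" + u2(buf); break;
                case 16: text = "MethodType #" + u2(buf); break;
                case 18: text = "InvokeDynamic #" + u2(buf) + " #" + u2(buf); break;
                default: throw new IllegalArgumentException("unknown tag " + tag);
            }
            entries.add("#" + i + " " + text);
            if (tag == 5 || tag == 6) {
                i++;                                 // Long and Double occupy two pool slots
            }
        }
        return entries;
    }
}
```

Given the bytes 0a 0006 0016 from the example above, the loop reports a Methodref whose two indexes are 6 and 22, exactly as in the hand analysis.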

5. Access flags

access_flags identifies the access permissions and properties of the current .class file. As the table below shows, the flags record whether the file defines a class or an interface, whether it is public, whether it is abstract, and, for a class, whether it is declared final.

| Flag name | Value | Remarks |
|---|---|---|
| ACC_PUBLIC | 0x0001 | public |
| ACC_PRIVATE | 0x0002 | private |
| ACC_PROTECTED | 0x0004 | protected |
| ACC_STATIC | 0x0008 | static |
| ACC_FINAL | 0x0010 | final |
| ACC_SUPER | 0x0020 | Kept for compatibility with early compilers. Newer compilers always set this flag so that the invokespecial instruction treats superclass methods specially. |
| ACC_INTERFACE | 0x0200 | Interface. ACC_ABSTRACT must also be set; ACC_FINAL, ACC_SUPER, and ACC_ENUM must not be set. |
| ACC_ABSTRACT | 0x0400 | Abstract class that cannot be instantiated. Cannot be combined with ACC_FINAL. |
| ACC_SYNTHETIC | 0x1000 | Synthetic: produced by the compiler and not present in the source code. |
| ACC_ANNOTATION | 0x2000 | Annotation type. ACC_INTERFACE and ACC_ABSTRACT must also be set. |
| ACC_ENUM | 0x4000 | Enumerated type |

The value of access_flags in this article’s bytecode file is 0x0021. This value does not appear in the table above because access_flags is the bitwise OR of a set of flag bits: 0x0021 = 0x0020 | 0x0001, so the class is public and has ACC_SUPER set.
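Splitting the value back into flags is a matter of testing each bit. The following sketch uses only the class-level flags from the table; the class name ClassAccessFlags and its decode method are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;

final class ClassAccessFlags {
    // Class-level flags from the table above, in ascending bit order.
    private static final int[] MASKS = {
        0x0001, 0x0010, 0x0020, 0x0200, 0x0400, 0x1000, 0x2000, 0x4000
    };
    private static final String[] NAMES = {
        "ACC_PUBLIC", "ACC_FINAL", "ACC_SUPER", "ACC_INTERFACE",
        "ACC_ABSTRACT", "ACC_SYNTHETIC", "ACC_ANNOTATION", "ACC_ENUM"
    };

    /** Returns the names of all flag bits that are set in the given value. */
    static List<String> decode(int flags) {
        List<String> set = new ArrayList<>();
        for (int i = 0; i < MASKS.length; i++) {
            if ((flags & MASKS[i]) != 0) {
                set.add(NAMES[i]);
            }
        }
        return set;
    }
}
```

For 0x0021 this yields ACC_PUBLIC and ACC_SUPER, the same two flags found by hand above.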

Access flags appear again in the field and method structures below.

6. Class index, superclass index, and interface index

The class index this_class holds the constant-pool index of the current class’s fully qualified name. Its value is 0x0005, which points to the fifth constant in the pool; looking it up in the table gives com/sinosun/test/TestByteCode.

The superclass index super_class holds the constant-pool index of the fully qualified name of the current class’s superclass. Its value is 0x0006, which points to the sixth constant in the pool, java/lang/Object.

The interfaces section holds the list of interfaces implemented by the current class: a count followed by an array of constant-pool indexes of the interfaces’ fully qualified names. The sample code in this article implements no interfaces, so the count is 0.
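These fields sit back to back right after the constant pool, so they can be read in one pass. A minimal sketch, assuming the input starts at access_flags (ClassHeaderIndexes is a hypothetical name):

```java
import java.nio.ByteBuffer;

final class ClassHeaderIndexes {
    /** data must start at access_flags, i.e. right after the constant pool. */
    static String describe(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data);          // big-endian by default
        int accessFlags = buf.getShort() & 0xffff;
        int thisClass = buf.getShort() & 0xffff;
        int superClass = buf.getShort() & 0xffff;
        int interfacesCount = buf.getShort() & 0xffff;
        StringBuilder sb = new StringBuilder(String.format(
                "access_flags=0x%04x this_class=#%d super_class=#%d interfaces=[",
                accessFlags, thisClass, superClass));
        for (int i = 0; i < interfacesCount; i++) {
            if (i > 0) {
                sb.append(", ");
            }
            sb.append('#').append(buf.getShort() & 0xffff);  // index of a Class constant
        }
        return sb.append(']').toString();
    }
}
```

For our file the eight bytes are 0021 0005 0006 0000, so the interface list comes out empty.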

7. Fields

Next comes the fields section. Its first two bytes are fields_count, with the value 0x0002, so the class has 2 fields. Each field is described by a field_info structure:

field_info {
    u2 access_flags;
    u2 name_index;
    u2 descriptor_index;
    u2 attributes_count;
    attribute_info attributes[attributes_count];
}

The first field is 0002 0007 0008 0000. The access flags 0x0002 mean that the field is private; the name index points to the 7th constant in the pool, a; the descriptor index points to the 8th constant, I; and the number of attributes is 0. So the field is private I a, where I denotes int.

Parsing the second field in the same way gives public Ljava/lang/String; b, where Ljava/lang/String; denotes String.
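The eight bytes of a field without attributes can be decoded with a few reads. This is a sketch only: FieldInfoReader is an invented name, it looks at the visibility bits alone, and it prints the constant-pool indexes instead of resolving them.

```java
import java.nio.ByteBuffer;

final class FieldInfoReader {
    /** data must start at the access_flags of one field_info structure. */
    static String describe(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data);          // big-endian by default
        int access = buf.getShort() & 0xffff;
        int nameIndex = buf.getShort() & 0xffff;
        int descriptorIndex = buf.getShort() & 0xffff;
        int attributesCount = buf.getShort() & 0xffff;
        String visibility;
        if ((access & 0x0001) != 0) {
            visibility = "public";
        } else if ((access & 0x0002) != 0) {
            visibility = "private";
        } else if ((access & 0x0004) != 0) {
            visibility = "protected";
        } else {
            visibility = "package-private";
        }
        return visibility + " name=#" + nameIndex
                + " descriptor=#" + descriptorIndex
                + " attributes=" + attributesCount;
    }
}
```

With 0002 0007 0008 0000 as input it reports a private field whose name is constant #7 and whose descriptor is constant #8.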

The following table is a simple illustration of how field descriptors correspond to source code:

| Descriptor | Source code |
|---|---|
| Ljava/lang/String; | String |
| I | int |
| [Ljava/lang/Object; | Object[] |
| [Z | boolean[] |
| [[Lcom/sinosun/generics/FileInfo; | com.sinosun.generics.FileInfo[][] |
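The mapping in the table follows a simple rule: each leading [ adds one array dimension, a single letter stands for a primitive type, and L...; wraps a class name written with slashes. A small sketch of that rule follows (Descriptors and toSourceType are names made up here). Note that it always prints the fully qualified class name, so String comes out as java.lang.String.

```java
final class Descriptors {
    /** Converts a field descriptor such as "[Z" into source-style notation such as "boolean[]". */
    static String toSourceType(String descriptor) {
        int dims = 0;
        while (dims < descriptor.length() && descriptor.charAt(dims) == '[') {
            dims++;                                   // one '[' per array dimension
        }
        String element = descriptor.substring(dims);
        String base;
        if (element.startsWith("L") && element.endsWith(";")) {
            // Object type: strip the 'L' and ';' and turn slashes into dots.
            base = element.substring(1, element.length() - 1).replace('/', '.');
        } else {
            switch (element) {
                case "B": base = "byte"; break;
                case "C": base = "char"; break;
                case "D": base = "double"; break;
                case "F": base = "float"; break;
                case "I": base = "int"; break;
                case "J": base = "long"; break;
                case "S": base = "short"; break;
                case "Z": base = "boolean"; break;
                default:
                    throw new IllegalArgumentException("bad descriptor: " + descriptor);
            }
        }
        StringBuilder sb = new StringBuilder(base);
        for (int i = 0; i < dims; i++) {
            sb.append("[]");
        }
        return sb.toString();
    }
}
```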

8. Methods

The methods section starts with methods_count, whose value is 0x0004: four methods in total.

Wait! TestByteCode.java has only three methods. Why does the .class file say four?

This is because an <init> method is automatically generated at compile time as the default constructor for the class.

Next we analyze each method. As usual, start with the definition of the method format:

method_info {
    u2 access_flags;
    u2 name_index;
    u2 descriptor_index;
    u2 attributes_count;
    attribute_info attributes[attributes_count];
}

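The first 8 bytes of the first method are 0001 000b 000c 0001. The access flags are 0x0001 (public), the name index points to <init>, the descriptor index points to ()V, and the attribute count is 1, so the method is public <init> ()V with one attribute attached. Attributes attached to a method have the following format: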

attribute_info {
     u2 attribute_name_index;
     u4 attribute_length;
     u1 info[attribute_length];
}

Moving on, the next two bytes are 000d; looking this up in the constant pool shows that the attribute’s name is Code. The Code attribute is a variable-length attribute in the attribute table of a method_info structure. It contains the JVM instructions and auxiliary information for a method, including instance initialization methods and class or interface initialization methods. If a method is declared native or abstract, the attribute table of its method_info structure must not contain a Code attribute; otherwise it must contain exactly one.

The format of the Code attribute is defined as follows:

Code_attribute {
     u2 attribute_name_index;
     u4 attribute_length;
     u2 max_stack;
     u2 max_locals;
     u4 code_length;
     u1 code[code_length];
     u2 exception_table_length;
     { 
          u2 start_pc;
          u2 end_pc;
          u2 handler_pc;
          u2 catch_type;
     } exception_table[exception_table_length];
     u2 attributes_count;
     attribute_info attributes[attributes_count];
}

Now match the byte sequence 000d 00000030 0002 0001 against this structure. The attribute is Code, and its length is 0x00000030, that is, 48 bytes; this length does not include the attribute_name_index and attribute_length fields themselves. max_stack is the maximum depth the operand stack can reach while the method runs, here 2. max_locals is the number of local variables used during execution of the method, including those used to pass parameters to it, here 1.

Next come the bytecode instructions, the logical core of a method: these JVM instructions are the method’s actual implementation. code_length gives the length of the code. Here its value is 16, so the following 16 bytes are the instructions: 2a b7 0001 2a 04 b5 0002 2a 12 03 b5 0004 b1.
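The fixed fields at the front of the Code attribute can be pulled out the same way as the earlier structures. This is a sketch under the assumption that the input starts at attribute_length, right after the two-byte name index; CodeAttributeHeader is a hypothetical name, and the exception table and nested attributes are left unread.

```java
import java.nio.ByteBuffer;

final class CodeAttributeHeader {
    final int attributeLength;
    final int maxStack;
    final int maxLocals;
    final byte[] code;

    private CodeAttributeHeader(int attributeLength, int maxStack, int maxLocals, byte[] code) {
        this.attributeLength = attributeLength;
        this.maxStack = maxStack;
        this.maxLocals = maxLocals;
        this.code = code;
    }

    /** data must start at attribute_length of a Code attribute. */
    static CodeAttributeHeader read(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data);      // big-endian by default
        int attributeLength = buf.getInt();          // u4
        int maxStack = buf.getShort() & 0xffff;      // u2
        int maxLocals = buf.getShort() & 0xffff;     // u2
        byte[] code = new byte[buf.getInt()];        // u4 code_length
        buf.get(code);                               // the instruction bytes themselves
        return new CodeAttributeHeader(attributeLength, maxStack, maxLocals, code);
    }
}
```

As a cross-check on the 48-byte length of the constructor's Code attribute: 2 (max_stack) + 2 (max_locals) + 4 (code_length) + 16 (code) + 2 (exception_table_length) + 2 (attributes_count) + 20 (the nested LineNumberTable attribute, including its own 6-byte header) = 48.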

To make them easier to understand, translate these instructions into their corresponding mnemonics:

| Bytecode | Mnemonic | Meaning |
|---|---|---|
| 0x2a | aload_0 | Pushes the first reference-type local variable onto the top of the stack |
| 0xb7 | invokespecial | Invokes superclass constructors, instance initialization methods, and private methods |
| 0x04 | iconst_1 | Pushes the int constant 1 onto the top of the stack |
| 0xb5 | putfield | Assigns a value to an instance field of the specified class |
| 0x12 | ldc | Pushes an int, float, or String constant from the constant pool onto the top of the stack |
| 0xb1 | return | Returns void from the current method |

Using the table, the instructions decode as follows:

2a         aload_0
b7 0001    invokespecial #1    // Method java/lang/Object."<init>":()V
2a         aload_0
04         iconst_1
b5 0002    putfield #2         // Field a:I
2a         aload_0
12 03      ldc #3              // String 2
b5 0004    putfield #4         // Field b:Ljava/lang/String;
b1         return

As you can see, the initializer pushes this and calls the superclass constructor, then pushes this again together with each initial value to assign the fields a and b, and finally returns.
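The translation from bytes to mnemonics can also be automated. The sketch below is a toy disassembler that knows only the nine opcodes appearing in this article, not a general one; MiniDisassembler is a name invented here. For each instruction it prints the byte offset, the mnemonic, and the operand (if any) as a constant-pool index.

```java
import java.util.ArrayList;
import java.util.List;

final class MiniDisassembler {
    /** Mnemonics for the nine opcodes that occur in this article. */
    private static String mnemonic(int opcode) {
        switch (opcode) {
            case 0x04: return "iconst_1";
            case 0x12: return "ldc";
            case 0x2a: return "aload_0";
            case 0xac: return "ireturn";
            case 0xb0: return "areturn";
            case 0xb1: return "return";
            case 0xb4: return "getfield";
            case 0xb5: return "putfield";
            case 0xb7: return "invokespecial";
            default:
                throw new IllegalArgumentException(
                        String.format("unsupported opcode 0x%02x", opcode));
        }
    }

    /** Number of operand bytes that follow the opcode. */
    private static int operandBytes(int opcode) {
        switch (opcode) {
            case 0x12: return 1;                         // ldc: u1 constant-pool index
            case 0xb4: case 0xb5: case 0xb7: return 2;   // u2 constant-pool index
            default: return 0;
        }
    }

    static List<String> disassemble(byte[] code) {
        List<String> lines = new ArrayList<>();
        int pc = 0;
        while (pc < code.length) {
            int opcode = code[pc] & 0xff;
            int operands = operandBytes(opcode);
            String line = pc + ": " + mnemonic(opcode);
            if (operands == 1) {
                line += " #" + (code[pc + 1] & 0xff);
            } else if (operands == 2) {
                line += " #" + (((code[pc + 1] & 0xff) << 8) | (code[pc + 2] & 0xff));
            }
            lines.add(line);
            pc += 1 + operands;
        }
        return lines;
    }
}
```

Fed the 16 bytes of the constructor, it lists the same nine instructions as the hand decoding, at offsets 0, 1, 4, 5, 6, 9, 10, 12, and 15.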

After the instructions comes the method’s exception table. This method has no exception handlers, so the table length is 0000. The 0001 that follows means that one attribute follows. According to the attribute format above, the attribute’s name index is 0x000e, and the constant pool tells us that the attribute is LineNumberTable.

Here is the structure of the LineNumberTable attribute:

LineNumberTable_attribute {
    u2 attribute_name_index;
    u4 attribute_length;
    u2 line_number_table_length;
    { 
    	u2 start_pc;
    	u2 line_number;
    } line_number_table[line_number_table_length];
}

Matching the bytes 0000000e 0003 0000 0003 0004 0004 0009 0005 against this structure shows that the table has three entries. In each entry, the first number is the byte offset in the instruction code and the second is the line number in the source code.
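A short sketch of reading these entries, assuming the input starts at attribute_length (LineNumberTableReader is a made-up name). It also checks that the declared length agrees with the entry count, since each entry takes 4 bytes and the count itself takes 2.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

final class LineNumberTableReader {
    /** data must start at attribute_length of a LineNumberTable attribute. */
    static List<String> parse(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data);          // big-endian by default
        int attributeLength = buf.getInt();              // u4
        int entries = buf.getShort() & 0xffff;           // u2 line_number_table_length
        if (attributeLength != 2 + 4 * entries) {
            throw new IllegalArgumentException("length does not match entry count");
        }
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < entries; i++) {
            int startPc = buf.getShort() & 0xffff;       // offset into the code array
            int lineNumber = buf.getShort() & 0xffff;    // line in the source file
            lines.add("line " + lineNumber + ": " + startPc);
        }
        return lines;
    }
}
```

For the constructor this gives line 3 at offset 0, line 4 at offset 4, and line 5 at offset 9: the class declaration and the two field initializers in TestByteCode.java.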

The remaining methods can be analyzed in the same way.

The second method starts with 0004 000f 000c 0001, so it is protected method1 ()V with one attribute attached. The attribute begins 000d 00000019: it is again Code, 25 bytes long.

max_stack is 0 and max_locals is 1, so there is one local variable: every instance method has an implicit first parameter, this, that refers to the object it is invoked on. The method body contains a single one-byte instruction, return, because the method is empty. 0000 0001 means that there are no exception handlers and that one attribute is attached. 000e 00000006 0001 0000 0007 is that attribute, a LineNumberTable, and its content says that the instruction at offset 0 corresponds to line 7 of the source.

The last two methods use three new bytecode instructions:

| Bytecode | Mnemonic | Meaning |
|---|---|---|
| 0xb4 | getfield | Gets the value of an instance field of the specified class and pushes it onto the top of the stack |
| 0xac | ireturn | Returns an int from the current method |
| 0xb0 | areturn | Returns an object reference from the current method |

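The third method starts with 0001 0010 0011 0001 000d 0000 001d, so it is public method2 ()I with a Code attribute 29 bytes long. The contents of the Code attribute are 0001 0001 00000005 2a b4 0002 ac (max_stack 1, max_locals 1, and 5 bytes of code), followed by the exception table and the LineNumberTable as before.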

The fourth method works the same way, so only the breakdown is given here:

0002 0012 0013 0001 000d 0000 001d    private method3 ()Ljava/lang/String; with a Code attribute
0001 0001 00000005                    max_stack 1, max_locals 1, code_length 5
2a b4 0004 b0                         gets field b and returns it
0000                                  exception table length 0
0001 000e 00000006 0001 0000 000e     one LineNumberTable attribute: line 14: 0

With that, we have parsed the methods of the class out of the bytecode. Bytecode instructions are at the heart of method implementation, and each instruction stands for the same operation in every JVM, which is why bytecode files can run across platforms. How each instruction is actually implemented, however, differs from platform to platform, and that is what really carries a Java program across platforms.

9. Attributes

The last part is the attributes of the class itself. attributes_count is 0x0001, and the attribute is parsed according to attribute_info.

attribute_info {
     u2 attribute_name_index;
     u4 attribute_length;
     u1 info[attribute_length];
}

The first two bytes are attribute_name_index, which is 0x0014, the 20th constant in the constant pool. Looking it up gives SourceFile, so this is the SourceFile attribute. It is an optional fixed-length attribute in the attribute table of the class file and is structured as follows:

SourceFile_attribute {
     u2 attribute_name_index;
     u4 attribute_length;
     u2 sourcefile_index;
}

Its sourcefile_index points to the constant TestByteCode.java (#21 in the constant pool), so the class was compiled from TestByteCode.java.

10. Postscript

This concludes the analysis, and you should now have a basic understanding of the structure of bytecode.

However, Java already ships a command-line tool that does all of this. Go to the folder containing the .class file, open a command line, and type the following command:

javap -verbose XXX.class

The result is as follows:

javap -verbose TestByteCode.class
Classfile /E:/blog/Java bytecode.class
  Last modified 2018-9-6; size 494 bytes
  MD5 checksum 180292e6f6e8e9e48807195b235fa8ef
  Compiled from "TestByteCode.java"
public class com.sinosun.test.TestByteCode
  minor version: 0
  major version: 52
  flags: ACC_PUBLIC, ACC_SUPER
Constant pool:
   #1 = Methodref #6.#22 // java/lang/Object."<init>":()V
   #2 = Fieldref #5.#23 // com/sinosun/test/TestByteCode.a:I
   #3 = String #24 // 2
   #4 = Fieldref #5.#25 // com/sinosun/test/TestByteCode.b:Ljava/lang/String;
   #5 = Class #26 // com/sinosun/test/TestByteCode
   #6 = Class #27 // java/lang/Object
   #7 = Utf8 a
   #8 = Utf8 I
   #9 = Utf8 b
  #10 = Utf8 Ljava/lang/String;
  #11 = Utf8 <init>
  #12 = Utf8 ()V
  #13 = Utf8 Code
  #14 = Utf8 LineNumberTable
  #15 = Utf8 method1
  #16 = Utf8 method2
  #17 = Utf8 ()I
  #18 = Utf8 method3
  #19 = Utf8 ()Ljava/lang/String;
  #20 = Utf8 SourceFile
  #21 = Utf8 TestByteCode.java
  #22 = NameAndType #11:#12 // "<init>":()V
  #23 = NameAndType #7:#8 // a:I
  #24 = Utf8 2
  #25 = NameAndType #9:#10 // b:Ljava/lang/String;
  #26 = Utf8 com/sinosun/test/TestByteCode
  #27 = Utf8 java/lang/Object
{
  public java.lang.String b;
    descriptor: Ljava/lang/String;
    flags: ACC_PUBLIC

  public com.sinosun.test.TestByteCode();
    descriptor: ()V
    flags: ACC_PUBLIC
    Code:
      stack=2, locals=1, args_size=1
         0: aload_0
         1: invokespecial #1 // Method java/lang/Object."<init>":()V
         4: aload_0
         5: iconst_1
         6: putfield      #2 // Field a:I
         9: aload_0
        10: ldc           #3 // String 2
        12: putfield      #4 // Field b:Ljava/lang/String;
        15: return
      LineNumberTable:
        line 3: 0
        line 4: 4
        line 5: 9

  protected void method1();
    descriptor: ()V
    flags: ACC_PROTECTED
    Code:
      stack=0, locals=1, args_size=1
         0: return
      LineNumberTable:
        line 7: 0

  public int method2();
    descriptor: ()I
    flags: ACC_PUBLIC
    Code:
      stack=1, locals=1, args_size=1
         0: aload_0
         1: getfield      #2 // Field a:I
         4: ireturn
      LineNumberTable:
        line 10: 0
}
SourceFile: "TestByteCode.java"

That’s basically what we got from the previous analysis.

Of course, I am not sharing this process to turn myself or the reader into a decompiler who can read bytecode at a glance. People will never do this better than tools, but understanding it helps us build better tools. CGlib is one example: by adding operations before a class is loaded, or by generating bytecode directly, it implements dynamic proxies that are faster than the JDK’s reflection-based dynamic proxies.

I have always thought that you should make good use of tools while staying curious about the details behind them. If this article has made you a little more familiar with bytecode, it has served its purpose.
