As Android developers, we can dash off a Java file in no time after years of practice. But I believe many of us understand Java the way we might admire someone in full makeup: we only see the polished surface. Sometimes we need to see what is underneath the makeup. In short, a deeper understanding helps us do a lot of interesting things.

You have probably heard of the ASM framework. It makes it easy to modify class bytecode files and insert our own code into the bytecode. This is how operations such as codeless event tracking on Android, string encryption, and method timing statistics are implemented.

So this article focuses on taking the makeup off Java and looking at the class file as the virtual machine actually sees it. The article is long, but if you read it slowly and patiently you will find that it is not difficult, and there is a lot to gain from it.

Taking the makeup off a Java file

Java bytecode files, which end in .class, are generated when the Java compiler compiles the .java files we normally write. We can do this with the command:

// Compile the Java file into a class file
javac xxx.java

After compiling on the command line, we end up with a binary file made up of a stream of 8-bit bytes. The data items are laid out in strict order, with no gaps between adjacent items. This keeps class files compact and small, which makes them quicker to load into the JVM and to transfer over the network.

Let’s write a Java source file called Math.java and take a first look at the resulting class file.

//Math.java
package com.getui.test;

public class Math {
    private int a = 1;
    private int b = 2;
    public int add() {
        return a + b;
    }
}

Execute command:

javac Math.java

After compiling, we get a Math.class file, which we open using 010Editor.

We can see the class bytecode in the image above, which is what Java looks like with its makeup off. Isn’t it beautiful?

010Editor also parses the class file into its individual data items according to the class file format.

Take a holistic look at the class file structure

A class file contains the following data items:

Name                                   Type             Description
magic                                  u4               Magic number, always 0xCAFEBABE
minor_version                          u2               Minor version number
major_version                          u2               Major version number
constant_pool_count                    u2               Constant pool size
constant_pool[constant_pool_count-1]   cp_info          Constant pool (constant table)
access_flags                           u2               Access flags
this_class                             u2               Class index
super_class                            u2               Index of the parent class
interfaces_count                       u2               Interface counter
interfaces                             u2               Interface index set
fields_count                           u2               Number of fields
fields                                 field_info       Field collection (field table)
methods_count                          u2               Method counter
methods                                method_info      Method collection (method table)
attributes_count                       u2               Attribute counter
attributes                             attribute_info   Attribute collection

The table above shows the structure of a class file. u1, u2, u4, and u8 are unsigned numbers occupying 1, 2, 4, and 8 bytes respectively. cp_info, field_info, method_info, and attribute_info are the constant table, field table, method table, and attribute table. Each table has its own structure, which will be explained later.
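To make the fixed-width types concrete, here is a minimal Java sketch that reads the first four items of the table from a byte array. The ten sample bytes are the values this article walks through in the following sections for Math.class. The class name ClassHeaderSketch and the helper hex are made up for illustration and are not part of any library.

```java
import java.nio.ByteBuffer;

class ClassHeaderSketch {
    // First 10 bytes of Math.class as discussed in this article:
    // magic (u4), minor_version (u2), major_version (u2), constant_pool_count (u2).
    static final byte[] SAMPLE = hex("CA FE BA BE 00 00 00 34 00 16");

    // Hypothetical helper: turns a string like "CA FE BA BE" into a byte array.
    static byte[] hex(String s) {
        String[] parts = s.trim().split("\\s+");
        byte[] out = new byte[parts.length];
        for (int i = 0; i < parts.length; i++) {
            out[i] = (byte) Integer.parseInt(parts[i], 16);
        }
        return out;
    }

    // Returns {magic, minor_version, major_version, constant_pool_count}.
    // long is used so that the unsigned u4 magic number is not shown as negative.
    static long[] parseHeader(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);   // big-endian by default, like class files
        long magic = buf.getInt() & 0xFFFFFFFFL;   // u4
        int minor = buf.getShort() & 0xFFFF;       // u2
        int major = buf.getShort() & 0xFFFF;       // u2
        int cpCount = buf.getShort() & 0xFFFF;     // u2
        return new long[] {magic, minor, major, cpCount};
    }

    public static void main(String[] args) {
        long[] h = parseHeader(SAMPLE);
        System.out.printf("magic=0x%X minor=%d major=%d constant_pool_count=%d%n",
                h[0], h[1], h[2], h[3]);
    }
}
```

The masks (& 0xFFFF and & 0xFFFFFFFFL) are needed because Java has no unsigned short or int types, while every u2 and u4 in a class file is unsigned.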

With the overall structure in place, let’s go through the class bytecode structure from top to bottom.

Take a partial look at the class file structure

Magic number

The magic number is of type u4, so it occupies 4 bytes of the class file. A magic number is a marker used to identify the file type, and the magic number of class files is always 0xCAFEBABE. Why 0xCAFEBABE? Take a look at the picture below.

Version number (minor_version+major_version)

minor_version is of type u2 and takes up 2 bytes of the class file. Its value here is 0x00 00, so the minor version is 0.

major_version is also of type u2 and occupies 2 bytes of the class file. Its value here is 0x00 34, which is 52 in decimal. JDK 1.2 corresponds to 46 and each later major release adds one, so 52 represents JDK 1.8.

Based on the above analysis, the JDK version used to compile this class is 1.8.
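The version arithmetic above can be sketched as a one-line mapping. This assumes the rule stated above (46 corresponds to JDK 1.2 and each later release adds one); javaRelease is a hypothetical helper name, not a JDK API.

```java
class MajorVersionSketch {
    // Maps a class-file major version to a Java release number.
    // Assumption: the offset described in the text, i.e. 46 -> 1.2, 52 -> 1.8 (Java 8).
    static int javaRelease(int majorVersion) {
        return majorVersion - 44;
    }

    public static void main(String[] args) {
        int major = 0x0034; // value read from Math.class
        System.out.println("major " + major + " -> Java " + javaRelease(major));
    }
}
```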

We can verify this from the command line:

java -version
java version "1.8.0_112"
Java(TM) SE Runtime Environment (build 1.8.0_112-b16)
Java HotSpot(TM) 64-Bit Server VM (build 25.112-b16, mixed mode)

Constant pool size (constant_pool_count)

The constant pool size is also of type u2, occupying 2 bytes of the class file. Its value is 0x00 16, which is 22 in decimal, meaning the constant pool holds 22-1=21 entries. The constant pool is like a repository for the class bytecode: it stores information about the class such as class names, field names, method names, constant values, and strings. It is explained in detail in the cp_info section below.

We can view the contents of the constant pool more easily from the command line:

javap -verbose ./Math.class

The command prints a lot of information about the Math.class bytecode, but for now we will focus on the Constant pool part:

Constant pool:
   #1 = Methodref #5.#17 // java/lang/Object."<init>":()V
   #2 = Fieldref #4.#18 // com/getui/test/Math.a:I
   #3 = Fieldref #4.#19 // com/getui/test/Math.b:I
   #4 = Class #20 // com/getui/test/Math
   #5 = Class #21 // java/lang/Object
   #6 = Utf8 a
   #7 = Utf8 I
   #8 = Utf8 b
   #9 = Utf8 <init>
  #10 = Utf8 ()V
  #11 = Utf8 Code
  #12 = Utf8 LineNumberTable
  #13 = Utf8 add
  #14 = Utf8 ()I
  #15 = Utf8 SourceFile
  #16 = Utf8 Math.java
  #17 = NameAndType #9:#10 // "<init>":()V
  #18 = NameAndType #6:#7 // a:I
  #19 = NameAndType #8:#7 // b:I
  #20 = Utf8 com/getui/test/Math
  #21 = Utf8 java/lang/Object

The constant pool output may look confusing at first. I’ll leave the details out here and cover them later when I introduce cp_info. For now, a general idea is enough.

One question is still open: why is 1 subtracted from the constant pool size? 0x16 is 22 in decimal, so why does the constant pool hold 22-1=21 entries? When we write code, array indexes start at 0, but as the command-line output above shows, the constant pool starts at index 1, which leaves index 0 unused. Index 0 has a special purpose: when another data item refers to constant 0, it means that the data item does not reference any constant.

cp_info (constant table)

cp_info mainly stores literals and symbolic references.

It mainly includes the following 14 types:

Type                               Tag   Description
CONSTANT_Utf8_info                 1     UTF-8 encoded string
CONSTANT_Integer_info              3     Integer literal
CONSTANT_Float_info                4     Floating-point literal
CONSTANT_Long_info                 5     Long integer literal
CONSTANT_Double_info               6     Double-precision floating-point literal
CONSTANT_Class_info                7     Symbolic reference to a class or interface
CONSTANT_String_info               8     String literal
CONSTANT_Fieldref_info             9     Symbolic reference to a field
CONSTANT_Methodref_info            10    Symbolic reference to a method of a class
CONSTANT_InterfaceMethodref_info   11    Symbolic reference to a method of an interface
CONSTANT_NameAndType_info          12    Partial symbolic reference to a field or method (name and descriptor)
CONSTANT_MethodHandle_info         15    Method handle
CONSTANT_MethodType_info           16    Method type
CONSTANT_InvokeDynamic_info        18    Dynamic method call site

The structure of each type is different. You can check the following table:

Let’s start parsing the constant part of the bytecode:

CONSTANT_Methodref_info{
	u1 tag;
	u2 class_index;
	u2 name_and_type_index;
}

When we start parsing the first constant, we see that its tag is 0x0A, which is 10 in decimal, so the constant is of type CONSTANT_Methodref_info. According to the structure above, the tag of a CONSTANT_Methodref_info is followed by two index values. The first, class_index, points to a CONSTANT_Class_info entry. It takes two bytes and its value is 0x00 05, which is 5 in decimal. So let’s look at the fifth constant.
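As a rough illustration of this step, the following Java sketch decodes the five bytes of the first constant. The bytes 0A 00 05 00 11 combine the tag and class index discussed here with the second index (0x00 11) that the article returns to further down. MethodrefSketch and parseMethodref are illustrative names only.

```java
import java.nio.ByteBuffer;

class MethodrefSketch {
    static final int CONSTANT_METHODREF = 10; // tag value from the table above

    // Parses one CONSTANT_Methodref_info entry.
    // Returns {tag, class_index, name_and_type_index}.
    static int[] parseMethodref(byte[] entry) {
        ByteBuffer buf = ByteBuffer.wrap(entry);         // big-endian, like class files
        int tag = buf.get() & 0xFF;                      // u1 tag
        if (tag != CONSTANT_METHODREF) {
            throw new IllegalArgumentException("not a Methodref, tag=" + tag);
        }
        int classIndex = buf.getShort() & 0xFFFF;        // u2 class_index
        int nameAndTypeIndex = buf.getShort() & 0xFFFF;  // u2 name_and_type_index
        return new int[] {tag, classIndex, nameAndTypeIndex};
    }

    public static void main(String[] args) {
        byte[] first = {0x0A, 0x00, 0x05, 0x00, 0x11};
        int[] r = parseMethodref(first);
        System.out.println("tag=" + r[0] + " class=#" + r[1] + " nameAndType=#" + r[2]);
    }
}
```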

CONSTANT_Class_info{
	u1 tag;
	u2 name_index;
}

The fifth constant has a tag of 0x07, which is 7 in decimal, so it is of type CONSTANT_Class_info. According to the structure above, the tag is followed by a two-byte name_index that points to the constant holding the fully qualified name. Its value is 0x00 15, which is 21 in decimal. Let’s move on and see what constant number 21 holds.

CONSTANT_Utf8_info{
	u1 tag;
	u2 length;
	u1 bytes[length];
}

The 21st constant has a tag of 0x01, which is 1 in decimal, so it is of type CONSTANT_Utf8_info. The second part of CONSTANT_Utf8_info takes 2 bytes and has the value 0x00 10, which is 16 in decimal, meaning that a UTF-8 encoded string of 16 bytes follows. The third part is those 16 bytes, 0x6A 61 76 61 2F 6C 61 6E 67 2F 4F 62 6A 65 63 74, which spell the string java/lang/Object. java/lang/Object is a fully qualified name: the class name together with its package name, with the parts separated by /.
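The decoding of constant 21 can be reproduced with a few lines of Java. This is only a sketch: Utf8EntrySketch, hex, and parseUtf8 are made-up names, and it assumes the string contains plain ASCII, where the class file's modified UTF-8 encoding is identical to standard UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class Utf8EntrySketch {
    // Hypothetical helper: turns a string like "01 00 10" into a byte array.
    static byte[] hex(String s) {
        String[] parts = s.trim().split("\\s+");
        byte[] out = new byte[parts.length];
        for (int i = 0; i < parts.length; i++) {
            out[i] = (byte) Integer.parseInt(parts[i], 16);
        }
        return out;
    }

    // Parses one CONSTANT_Utf8_info entry: u1 tag, u2 length, then length bytes.
    static String parseUtf8(byte[] entry) {
        ByteBuffer buf = ByteBuffer.wrap(entry);
        int tag = buf.get() & 0xFF;
        if (tag != 1) {
            throw new IllegalArgumentException("not a Utf8 entry, tag=" + tag);
        }
        int length = buf.getShort() & 0xFFFF;
        byte[] bytes = new byte[length];
        buf.get(bytes);
        // Assumption: ASCII-only content, so standard UTF-8 decoding is sufficient.
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] entry = hex("01 00 10 6A 61 76 61 2F 6C 61 6E 67 2F 4F 62 6A 65 63 74");
        System.out.println(parseUtf8(entry));
    }
}
```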

Good, we are halfway there. Remember that when we analyzed the first constant, we only got as far as its second part? The third part takes two bytes and is the index of a CONSTANT_NameAndType_info entry holding the name and type descriptor. Its value is 0x00 11 (if you have forgotten, look back at the image of the first constant), which is 17 in decimal, so let’s look at the 17th constant.

Its first byte, the tag, is 0x0C, which is 12 in decimal, so the type is CONSTANT_NameAndType_info. The second part takes 2 bytes and is the index of the constant holding the method or field name. Its value is 0x00 09, which is 9 in decimal. Let’s go straight to the ninth constant.

It is again of type CONSTANT_Utf8_info, so I will not go into its structure again.

Parsing shows that it is a UTF-8 encoded string of length 6 with the value <init>.

Next comes the third part of CONSTANT_NameAndType_info. It takes two bytes and is the index of the constant holding the field or method descriptor. Its value is 0x00 0A, which is 10 in decimal, so let’s look at constant number 10.

It is also of type CONSTANT_Utf8_info: a UTF-8 encoded string of length 3 with the value ()V.

Well, that looks like a strange value. What is ()V?

This is where descriptors come in. Descriptors describe the data type of a field, or the parameter list (number, types, and order) and return value of a method. According to the descriptor rules, the basic data types (byte, char, double, float, int, long, short, boolean) and the void type, which represents no return value, are each represented by a single uppercase character, while object types are represented by the character L followed by the fully qualified name of the class and a semicolon, as shown in the following table:

Identifier   Meaning
B            Basic data type byte
C            Basic data type char
D            Basic data type double
F            Basic data type float
I            Basic data type int
J            Basic data type long
S            Basic data type short
Z            Basic data type boolean
V            Special type void
L            Object type, e.g. Ljava/lang/Object;

For arrays, each dimension is denoted by a [ in front of the type. For example, an int[] array is written as [I, and a two-dimensional array java.lang.Object[][] is written as [[Ljava/lang/Object;.

In ()V, the () represents the parameter list of the method and V represents the return type void. We know that when we define a class in Java without a constructor, the compiler automatically generates a no-argument constructor for us. Since it takes no arguments and returns nothing, its descriptor is ()V.

For example, public void add(int a, int b) has the descriptor (II)V, and public String getContent(int type) has the descriptor (I)Ljava/lang/String;.
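One way to check descriptors without decoding bytes by hand is to let the JDK build them. The sketch below uses java.lang.invoke.MethodType, which can render a method signature as a descriptor string; DescriptorSketch and the descriptor helper are illustrative names, not part of the article's code.

```java
import java.lang.invoke.MethodType;

class DescriptorSketch {
    // Builds the JVM method descriptor for a return type and parameter types.
    static String descriptor(Class<?> returnType, Class<?>... params) {
        return MethodType.methodType(returnType, params).toMethodDescriptorString();
    }

    public static void main(String[] args) {
        // No-argument constructor or any void method without parameters
        System.out.println(descriptor(void.class));
        // public void add(int a, int b)
        System.out.println(descriptor(void.class, int.class, int.class));
        // public String getContent(int type)
        System.out.println(descriptor(String.class, int.class));
        // Array parameters: int[] and Object[][]
        System.out.println(descriptor(void.class, int[].class, Object[][].class));
    }
}
```

Note that object types end with a semicolon in the descriptor, which is easy to miss when writing descriptors by hand.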

All right, after this long walkthrough we have in fact covered just one constant in the constant pool.

// We have just covered constant #1
Constant pool:
   #1 = Methodref #5.#17 // java/lang/Object."<init>":()V
   #2 = Fieldref #4.#18 // com/getui/test/Math.a:I
   #3 = Fieldref #4.#19 // com/getui/test/Math.b:I
   #4 = Class #20 // com/getui/test/Math
   #5 = Class #21 // java/lang/Object
   #6 = Utf8 a
   #7 = Utf8 I
   #8 = Utf8 b
   #9 = Utf8 <init>
  #10 = Utf8 ()V
  #11 = Utf8 Code
  #12 = Utf8 LineNumberTable
  #13 = Utf8 add
  #14 = Utf8 ()I
  #15 = Utf8 SourceFile
  #16 = Utf8 Math.java
  #17 = NameAndType #9:#10 // "<init>":()V
  #18 = NameAndType #6:#7 // a:I
  #19 = NameAndType #8:#7 // b:I
  #20 = Utf8 com/getui/test/Math
  #21 = Utf8 java/lang/Object

We have said a lot just to explain one constant. I trust you can work out the remaining constants by analogy. In practice we rarely need to decode bytecode byte by byte like this; I took you through it only to let you get a feel for bytecode and the structure of the constants. More often, we simply run the command mentioned earlier and read its output line by line.

javap -verbose ./Math.class

Access flags

The access flags take up two bytes and describe the access information of a class or interface. The flags are as follows:

Flag name        Hex value   Binary value      Meaning
ACC_PUBLIC       0x0001      1                 Whether the type is public
ACC_FINAL        0x0010      10000             Whether the type is declared final; only classes can set this
ACC_SUPER        0x0020      100000            Whether the new semantics of the invokespecial bytecode instruction are used. This flag is set for classes compiled after JDK 1.0.2
ACC_INTERFACE    0x0200      1000000000        Marks an interface
ACC_ABSTRACT     0x0400      10000000000       Whether the type is abstract. Set for interfaces and abstract classes, not set for other types
ACC_SYNTHETIC    0x1000      1000000000000     Marks a class that is not generated from user code
ACC_ANNOTATION   0x2000      10000000000000    Marks an annotation
ACC_ENUM         0x4000      100000000000000   Marks an enumeration

We know that our class is declared public. Its access flags have the hexadecimal value 0x00 21, which according to the table above is ACC_SUPER + ACC_PUBLIC (0x0020 + 0x0001).

But what if we only want to know whether a single flag is set, for example whether the class is marked ACC_PUBLIC? The binary column shows that each flag has a 1 in exactly one bit position, so we can test for a flag by taking the bitwise AND of the access flags and that flag’s value. For example:

The binary of 0x21 is 100001 and the binary of ACC_PUBLIC is 1. 100001 & 1 is 1, which is non-zero, so we can tell that this class has the ACC_PUBLIC access flag.
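The bitwise test described above looks like this in Java. This is a small sketch; the constants repeat values from the table, and AccessFlagsSketch and hasFlag are hypothetical names.

```java
class AccessFlagsSketch {
    // Flag values taken from the class access flag table above
    static final int ACC_PUBLIC = 0x0001;
    static final int ACC_FINAL = 0x0010;
    static final int ACC_SUPER = 0x0020;

    // A flag is set when the bitwise AND with its value is non-zero.
    static boolean hasFlag(int accessFlags, int flag) {
        return (accessFlags & flag) != 0;
    }

    public static void main(String[] args) {
        int mathFlags = 0x0021; // access_flags of Math.class
        System.out.println("public: " + hasFlag(mathFlags, ACC_PUBLIC));
        System.out.println("super:  " + hasFlag(mathFlags, ACC_SUPER));
        System.out.println("final:  " + hasFlag(mathFlags, ACC_FINAL));
    }
}
```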

Class index

The class index takes 2 bytes and points to the CONSTANT_Class_info constant of this class. Its value is 0x00 04, which is 4 in decimal, so we look at the fourth constant.

#4 = Class #20 // com/getui/test/Math

We can see that the class index leads to the fully qualified name of the class.

Index of the parent class

The superclass index takes two bytes and points to the CONSTANT_Class_info constant of the parent class. Its value is 0x00 05, which is 5 in decimal, so we look at the fifth constant.

#5 = Class #21 // java/lang/Object

Our Math class doesn’t explicitly extend any class, so its parent defaults to the Object class.

Interface counter

The interface counter records how many interfaces the class implements. Since our Math class does not implement any interface, its value is 0x00 00, which is 0 in decimal.

Interface index set

The interface index set holds the indexes of all implemented interfaces. Each index takes 2 bytes and points to the interface’s constant in the constant pool.

Because Math.java does not implement any interface, this part is empty. You can write a class that implements several interfaces to verify this. It’s very simple.

Number of fields

The number of fields takes 2 bytes and indicates how many fields follow. Fields describe the variables declared in a class or interface. They include class-level (static) variables and instance variables, but not local variables declared inside methods.

We see that the number of fields has the value 0x00 02, which is 2 in decimal, so two fields follow.

field_info (field table)

field_info{
	u2 access_flags;      // Access flags
	u2 name_index;        // Field name index
	u2 descriptor_index;  // Descriptor index
	u2 attributes_count;  // Attribute counter
	attribute_info attributes[attributes_count];  // Attribute set
}

Our class has two fields, private int a = 1; and private int b = 2;. We only analyze field a here.

Flag name       Hex value   Binary value      Meaning
ACC_PUBLIC      0x0001      1                 Whether the field is public
ACC_PRIVATE     0x0002      10                Whether the field is private
ACC_PROTECTED   0x0004      100               Whether the field is protected
ACC_STATIC      0x0008      1000              Whether the field is static
ACC_FINAL       0x0010      10000             Whether the field is final
ACC_VOLATILE    0x0040      1000000           Whether the field is volatile
ACC_TRANSIENT   0x0080      10000000          Whether the field is transient
ACC_SYNTHETIC   0x1000      1000000000000     Whether the field is generated automatically by the compiler
ACC_ENUM        0x4000      100000000000000   Whether the field is an enum constant

The first part, access_flags, takes 2 bytes and has the value 0x00 02, which is 2 in decimal and 10 in binary. From the table above we can see that the access flag of this field is ACC_PRIVATE, i.e. private.

The second part, name_index, takes two bytes and has the value 0x00 06, which is 6 in decimal. We look up the sixth constant directly in the constant pool

 #6 = Utf8 a

We can see that name_index refers to the variable name a.

The third part, descriptor_index, takes 2 bytes and has the value 0x00 07, which is 7 in decimal. We look up the seventh constant directly in the constant pool

   #7 = Utf8 I

We can see that descriptor_index refers to the type of the variable a, and I represents the int type.

The fourth part, attributes_count, takes two bytes and has the value 0x00 00, which is 0 in decimal, so the field private int a = 1; has no attribute set.

If the value of the fourth part is not 0, an attributes collection follows, which you can explore on your own.
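The four parts of field a can be decoded together as in the sketch below. The eight input bytes are the values read off above (00 02, 00 06, 00 07, 00 00). FieldInfoSketch and describeField are illustrative names. java.lang.reflect.Modifier is used only to print the flag names; it understands the common modifiers (public, private, static, final, volatile, transient) but does not print ACC_SYNTHETIC or ACC_ENUM.

```java
import java.lang.reflect.Modifier;
import java.nio.ByteBuffer;

class FieldInfoSketch {
    // Decodes the fixed part of a field_info structure (four u2 values).
    static String describeField(byte[] fieldInfo) {
        ByteBuffer buf = ByteBuffer.wrap(fieldInfo);       // big-endian, like class files
        int accessFlags = buf.getShort() & 0xFFFF;         // u2 access_flags
        int nameIndex = buf.getShort() & 0xFFFF;           // u2 name_index
        int descriptorIndex = buf.getShort() & 0xFFFF;     // u2 descriptor_index
        int attributesCount = buf.getShort() & 0xFFFF;     // u2 attributes_count
        return Modifier.toString(accessFlags)
                + " name=#" + nameIndex
                + " descriptor=#" + descriptorIndex
                + " attributes=" + attributesCount;
    }

    public static void main(String[] args) {
        byte[] fieldA = {0x00, 0x02, 0x00, 0x06, 0x00, 0x07, 0x00, 0x00};
        System.out.println(describeField(fieldA));
    }
}
```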

Method counter

The method counter is two bytes and indicates how many methods follow. Here our method counter has a value of 0x00 02, converted to decimal to 2.

Some of you might ask: didn’t we define only one method, add, so why are there two? Remember that Java adds a no-argument constructor by default when we define a class without implementing any constructor ourselves. So the Math class has both the no-argument constructor and the add method, and the method counter has a value of 2.

method_info (method table)

method_info{
	u2 access_flags;      // Method access flags
	u2 name_index;        // Method name index
	u2 descriptor_index;  // Method descriptor index
	u2 attributes_count;  // Attribute counter
	struct attribute_info{
		u2 attribute_name_index;    // Index of the attribute name
		u4 attribute_length;        // Length of the attribute
		u1 info[attribute_length];  // Attribute content
	}
}

Flag name          Hex value   Binary value    Meaning
ACC_PUBLIC         0x0001      1               Whether the method is public
ACC_PRIVATE        0x0002      10              Whether the method is private
ACC_PROTECTED      0x0004      100             Whether the method is protected
ACC_STATIC         0x0008      1000            Whether the method is static
ACC_FINAL          0x0010      10000           Whether the method is final
ACC_SYNCHRONIZED   0x0020      100000          Whether the method is synchronized
ACC_BRIDGE         0x0040      1000000         Whether the method is a bridge method generated by the compiler
ACC_VARARGS        0x0080      10000000        Whether the method accepts a variable number of arguments
ACC_NATIVE         0x0100      100000000       Whether the method is native
ACC_ABSTRACT       0x0400      10000000000     Whether the method is abstract
ACC_STRICTFP       0x0800      100000000000    Whether the method is strictfp
ACC_SYNTHETIC      0x1000      1000000000000   Whether the method is generated automatically by the compiler

The first part, access_flags, takes 2 bytes. For the constructor it marks the method as ACC_PUBLIC. The second part, name_index, takes 2 bytes and points to the method name in the constant pool, which for a constructor is <init>.

The third part, descriptor_index, takes 2 bytes and has the value 0x00 0A, which is 10 in decimal. Let’s look at the 10th constant in the constant pool

 #10 = Utf8 ()V

descriptor_index points to the descriptor of this method. As resolved earlier, ()V means that the method takes no arguments and returns void.

The fourth part, attributes_count, takes two bytes. It is the attribute counter, recording how many attributes the method has. Its value is 0x00 01, which is 1 in decimal, so the method has one attribute. Let’s move on.

The method has only one attribute, so exactly one attribute_info follows. The first part of the attribute structure, attribute_name_index, takes two bytes and has the value 0x00 0B, which is 11 in decimal. Let’s look it up in the constant pool

  #11 = Utf8 Code

The attribute is named Code, which means it follows the layout of the Code attribute table:

struct attribute_info{
	u2 attribute_name_index;    // Index of the attribute name
	u4 attribute_length;        // Length of the attribute
	u2 max_stack;               // Maximum operand stack depth
	u2 max_locals;              // Space required by the local variable table
	u4 code_length;             // Length of the bytecode instructions
	u1 code[code_length];       // The bytecode instructions
	u2 exception_table_length;  // Number of exception table entries
	exception_info exception_table[exception_table_length];  // Exception table
	u2 attributes_count;        // Attribute counter
	attribute_info attributes[attributes_count];  // Attribute list
}

We extract the hexadecimal bytes of the Code attribute for easy viewing (the bytes shown here are those of the add method):

00 02 00 01 00 00 00 0A 2A B4 00 02 2A B4 00 03 60 AC 00 00 00 01 00 0C 00 00 00 06 00 01 00 00 00 07

Let’s go through them using the structure above:

attribute_name_index has the value 0x00 0B, which we have already analyzed.

attribute_length takes four bytes and has the value 0x00 00 00 22, which is 34 in decimal, meaning that the next 34 bytes are the content of the Code attribute.

max_stack takes two bytes and has the value 0x00 02, meaning that the maximum depth of the operand stack is 2. For more information about the operand stack, readers can look it up online.

max_locals takes 2 bytes and has the value 0x00 01, meaning that the local variable table needs 1 slot.

code_length takes 4 bytes and has the value 0x00 00 00 0A, which is 10 in decimal, meaning that the next 10 bytes are bytecode instructions.

code takes 10 bytes, 0x2A B4 00 02 2A B4 00 03 60 AC. These are not converted to decimal. Instead, we look each byte up in an opcode table (such as the one in the blog referenced here) and translate it into the corresponding instruction. The translation is as follows:

2A    -> aload_0
B4    -> getfield
00 02 -> operand of getfield: constant #2, Field a:I
2A    -> aload_0
B4    -> getfield
00 03 -> operand of getfield: constant #3, Field b:I
60    -> iadd
AC    -> ireturn

The meaning of each instruction can be looked up in the blog mentioned above. I will also write a separate post on the topic.
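To show how such a lookup works mechanically, here is a deliberately tiny disassembler sketch in Java. It only knows the four opcodes that occur in this method (aload_0, getfield, iadd, ireturn) and rejects everything else, so it is an illustration rather than a general tool. TinyDisassemblerSketch and disassemble are made-up names.

```java
class TinyDisassemblerSketch {
    // Translates a code array into "offset: mnemonic" lines.
    // Assumption: only the opcodes used by Math.add() are supported.
    static String disassemble(byte[] code) {
        StringBuilder sb = new StringBuilder();
        int pc = 0;
        while (pc < code.length) {
            int opcode = code[pc] & 0xFF;
            switch (opcode) {
                case 0x2A: // aload_0: push local variable 0 (this)
                    sb.append(pc).append(": aload_0\n");
                    pc += 1;
                    break;
                case 0xB4: { // getfield: followed by a u2 constant pool index
                    int index = ((code[pc + 1] & 0xFF) << 8) | (code[pc + 2] & 0xFF);
                    sb.append(pc).append(": getfield #").append(index).append('\n');
                    pc += 3;
                    break;
                }
                case 0x60: // iadd: add the two ints on top of the stack
                    sb.append(pc).append(": iadd\n");
                    pc += 1;
                    break;
                case 0xAC: // ireturn: return the int on top of the stack
                    sb.append(pc).append(": ireturn\n");
                    pc += 1;
                    break;
                default:
                    throw new IllegalArgumentException(
                            "unsupported opcode 0x" + Integer.toHexString(opcode));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] code = {0x2A, (byte) 0xB4, 0x00, 0x02, 0x2A, (byte) 0xB4, 0x00, 0x03,
                       0x60, (byte) 0xAC};
        System.out.print(disassemble(code));
    }
}
```

The offsets it prints (0, 1, 4, 5, 8, 9) jump by three after each getfield because the two operand bytes belong to that instruction.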

In fact, the javap command shows us the same result directly:

javap -verbose ./Math.class
  public int add();
    descriptor: ()I
    flags: ACC_PUBLIC
    Code:
      stack=2, locals=1, args_size=1
         0: aload_0
         1: getfield      #2 // Field a:I
         4: aload_0
         5: getfield      #3 // Field b:I
         8: iadd
         9: ireturn
      LineNumberTable:
        line 7: 0

Let’s continue the analysis.

exception_table_length takes 2 bytes and has the value 0x00 00, which is 0 in decimal. The exception table stores the information needed to handle exceptions. Each exception_table entry consists of start_pc, end_pc, handler_pc, and catch_type. start_pc and end_pc mean that exceptions thrown by instructions in the code array from start_pc to end_pc (including start_pc but not end_pc) are handled by this entry. handler_pc is the start of the code that handles the exception. catch_type is the exception type to be handled and points to an exception class in the constant pool. When catch_type is 0, all exceptions are handled, which can be used to implement finally.

Since the value is 0 here, I won’t go into it further, but you can explore it on your own.

attributes_count takes two bytes and has the value 0x00 01, which is 1 in decimal, meaning that the Code attribute has one attribute of its own.

attribute_name_index takes two bytes and has the value 0x00 0C, which is 12 in decimal. It points to item 12 in the constant pool, which is LineNumberTable, so this attribute is a LineNumberTable. Its structure is:

LineNumberTable_attribute {
	u2 attribute_name_index;
	u4 attribute_length;
	u2 line_number_table_length;
	struct line_number_table{
		u2 start_pc;
		u2 line_number;
	}
}

attribute_length takes four bytes and has the value 0x00 00 00 06, meaning that the following six bytes are the content of the attribute.

line_number_table_length takes two bytes and has the value 0x00 01, which is 1 in decimal, meaning that the LineNumberTable has one entry.

start_pc takes two bytes and has the value 0x00 00, which is 0 in decimal. It is the offset of the bytecode instruction.

line_number takes two bytes and has the value 0x00 07, which is 7 in decimal, representing line 7 of the Java source code.
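Putting the last few steps together, the sketch below walks the 34 bytes of the Code attribute body listed earlier and reports the values we just read by hand. It is a simplified reading under stated assumptions: it skips the instructions themselves, and it recognises the nested LineNumberTable only by its constant pool index 12, which is specific to this Math.class. CodeAttributeSketch, hex, and summarize are illustrative names.

```java
import java.nio.ByteBuffer;

class CodeAttributeSketch {
    // Assumption: in this particular constant pool, #12 is the Utf8 "LineNumberTable".
    static final int LINE_NUMBER_TABLE_NAME_INDEX = 12;

    // The 34 bytes that follow attribute_name_index and attribute_length.
    static final byte[] BODY = hex("00 02 00 01 00 00 00 0A 2A B4 00 02 2A B4 00 03 60 AC "
            + "00 00 00 01 00 0C 00 00 00 06 00 01 00 00 00 07");

    // Hypothetical helper: turns a string like "00 02" into a byte array.
    static byte[] hex(String s) {
        String[] parts = s.trim().split("\\s+");
        byte[] out = new byte[parts.length];
        for (int i = 0; i < parts.length; i++) {
            out[i] = (byte) Integer.parseInt(parts[i], 16);
        }
        return out;
    }

    static String summarize(byte[] body) {
        ByteBuffer buf = ByteBuffer.wrap(body);               // big-endian, like class files
        int maxStack = buf.getShort() & 0xFFFF;               // u2 max_stack
        int maxLocals = buf.getShort() & 0xFFFF;              // u2 max_locals
        int codeLength = buf.getInt();                        // u4 code_length
        buf.position(buf.position() + codeLength);            // skip the instructions
        int exceptionTableLength = buf.getShort() & 0xFFFF;   // u2 exception_table_length
        buf.position(buf.position() + exceptionTableLength * 8); // each entry is four u2 values
        int attributesCount = buf.getShort() & 0xFFFF;        // u2 attributes_count

        StringBuilder sb = new StringBuilder();
        sb.append("stack=").append(maxStack)
          .append(" locals=").append(maxLocals)
          .append(" code_length=").append(codeLength)
          .append(" exceptions=").append(exceptionTableLength)
          .append(" attributes=").append(attributesCount);

        for (int i = 0; i < attributesCount; i++) {
            int nameIndex = buf.getShort() & 0xFFFF;          // u2 attribute_name_index
            int length = buf.getInt();                        // u4 attribute_length
            if (nameIndex == LINE_NUMBER_TABLE_NAME_INDEX) {
                int entries = buf.getShort() & 0xFFFF;        // u2 line_number_table_length
                for (int e = 0; e < entries; e++) {
                    int startPc = buf.getShort() & 0xFFFF;    // u2 start_pc
                    int line = buf.getShort() & 0xFFFF;       // u2 line_number
                    sb.append(" line ").append(line).append(": ").append(startPc);
                }
            } else {
                buf.position(buf.position() + length);        // skip unknown attributes
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(summarize(BODY));
    }
}
```

The "line 7: 0" part of the result corresponds to the LineNumberTable entry in the javap output shown earlier.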

Additional attributes

Number of attributes (attributes_count)

attributes_count takes two bytes and has the value 0x00 01, which is 1 in decimal, meaning that the class has one attribute.

Attribute structure (attribute_info attributes)

SourceFile_attribute {
     u2 attribute_name_index;
     u4 attribute_length;
     u2 sourcefile_index;
}

attribute_name_index takes 2 bytes and has the value 0x00 0F, which is 15 in decimal. It points to the 15th item in the constant pool, which is SourceFile, so this is a SourceFile attribute.

attribute_length takes 4 bytes and has the value 0x00 00 00 02, which is 2 in decimal, meaning that the content of the attribute is the following 2 bytes.

sourcefile_index takes 2 bytes and has the value 0x00 10, which is 16 in decimal. It points to the 16th item in the constant pool, which is Math.java, the name of the source file this class file was compiled from.

Conclusion


The purpose of this article is to give you a deeper impression and understanding of bytecode, so that we can use bytecode instrumentation frameworks like ASM more confidently and skillfully in the future.

Once proficient with the ASM bytecode instrumentation framework, we can build many interesting things with Gradle plugins and annotations, such as event-tracking statistics, Java-layer string encryption, and a Butterknife-like framework.

Finally, here is 010Editor for Mac

Link: pan.baidu.com/s/1vTxPTSfJ… Extraction code: PA8D