1, the background

Recently, I referenced an open source library from IBM because I needed to do string encoding parsing of text. But it’s just too big, 9M! To be fair, it would be an exaggeration to use 9M of code to parse this function in text encoding mode. So, I guess the library contains some class files that I don’t need. Therefore, I consider removing the class files I don’t need from the JAR. It is not easy to try manual deletion because there are too many files and reference relationships are not easy to determine. So, I decided to use Python scripts to slim down the JAR packages.

2. Constant pool parsing

Anyone who knows the JVM knows the structure of a class file. If we can parse out which classes the class refers to, we can add the referenced classes to the queue and iterate until we find all referenced classes, and then repackage those referenced classes. In this way, we extract only the classes we need to use, thus achieving the COMPRESSION of the JAR package. This is essentially a process of class parsing and breadth-first search.

Here, let’s look at a few solutions for resolving class constant pools.

2.1 Solution 1: Decompile class files using JavAP

To solve this problem, my first thought was to use Javap decompilation to get the classes referenced in the class file. When I started, I used the following command,

javap -c -p xxx.class
Copy the code

The result of this parsing is the following,

private static java.util.ArrayList<com.ibm.icu.text.CharsetRecognizer> createRecognizers(); Code: 0: new #14 // class java/util/ArrayList 3: dup 4: invokespecial #15 // Method java/util/ArrayList."<init>":()V 7: astore_0 8: aload_0 9: new #39 // class com/ibm/icu/text/CharsetRecog_UTF8 12: dup 13: invokespecial #40 // Method com/ibm/icu/text/CharsetRecog_UTF8."<init>":()V 16: invokevirtual #23 // Method java/util/ArrayList.add:(Ljava/lang/Object;) Z 19: pop 20: aload_0 21: new #41 // class com/ibm/icu/text/CharsetRecog_Unicode$CharsetRecog_UTF_16_BE 24: dup 25: invokespecial #42 // Method com/ibm/icu/text/CharsetRecog_Unicode$CharsetRecog_UTF_16_BE."<init>":()VCopy the code

So, we can get the class referenced by the class file by matching the // Method keyword. One problem with this approach, however, is that you can’t get it if a class doesn’t refer to it in a method but extends it. However, there is a workaround to inheritance. For inheritance, the compiler generates the constructor for the subclass by default and calls the parent class’s constructor in the constructor, so we can get the referenced class in the constructor. However, there is nothing you can do about interfaces. So, there’s a problem with this.

2.2 Solution 2: Usejavap -verboseInstruction output constant pool

The constant pool resolved in this way is as follows,

Constant pool: #1 = Methodref #102.#196 // java/lang/Object."<init>":()V #2 = Fieldref #101.#197 // com/ibm/icu/text/CharsetDetector.fInputBytes:[B #3 = Fieldref #101.#198 // com/ibm/icu/text/CharsetDetector.fByteStats:[S #4 = Fieldref #101.#199 // com/ibm/icu/text/CharsetDetector.fC1Bytes:Z #5  = Fieldref #101.#200 // com/ibm/icu/text/CharsetDetector.fStripTags:Z #6 = Fieldref #101.#201 // com/ibm/icu/text/CharsetDetector.fDeclaredEncoding:Ljava/lang/String;  #7 = Fieldref #101.#202 // com/ibm/icu/text/CharsetDetector.fRawInput:[B #8 = Fieldref #101.#203 // com/ibm/icu/text/CharsetDetector.fRawLength:I #9 = Fieldref #101.#204 // com/ibm/icu/text/CharsetDetector.fInputStream:Ljava/io/InputStream;  #10 = Methodref #205.#206 // java/io/InputStream.mark:(I)V #11 = Methodref #205.#207 // java/io/InputStream.read:([BII)I #12 = Methodref #205.#208 // java/io/InputStream.reset:()V #13 = Methodref #101.#209 //  com/ibm/icu/text/CharsetDetector.detectAll:()[Lcom/ibm/icu/text/CharsetMatch;  #14 = Class #210 // java/util/ArrayList #15 = Methodref #14.#196 // java/util/ArrayList."<init>":()V #16 = Methodref #101.#211 // com/ibm/icu/text/CharsetDetector.MungeInput:()V #17 = Fieldref #101.#212 // com/ibm/icu/text/CharsetDetector.fCSRecognizers:Ljava/util/ArrayList;  #18 = Methodref #14.#213 // java/util/ArrayList.iterator:()Ljava/util/Iterator;  #19 = InterfaceMethodref #214.#215 // java/util/Iterator.hasNext:()Z #20 = InterfaceMethodref #214.#216 // java/util/Iterator.next:()Ljava/lang/Object;Copy the code

So, we can get the referenced class by string matching after parsing out the constant pool.

This is possible, but it is less efficient than the following because the decompilation process generates a lot of information that we don’t need, and the string matching process is also less efficient.

So, in the end, I used the following solution.

2.3 Solution 3: Manually Resolve the Constant Pool

Here’s a review of the structure of the class file. Here is a simple diagram of class opened with HEX,

The first four bytes (CA FE BA BE) are magic numbers. The next two bytes (00 00) are the lowest JDK version supported by class, usually between 40 and 60. The next two bytes (00 31) are the highest supported JDK versions; The next two bytes (00 1F) are the size of the constant pool, where the value is 31. Note that the constant pool is numbered from 1 (see the javap-verbose output above). This is important because constants in the constant pool reference other constants by this number.

For constants in a constant pool, they are defined in part as follows

Here u1, U2, U4, and U8 mean one, two, four, and eight bytes respectively. Each constant is marked with a one-byte tag. The length of each constant is fixed except for constants with tag 1. For example, a tag of 3 takes up 4 bytes in addition to the tag; A tag of 5 takes up 8 bytes in addition to the tag. For constants with a tag of 1, it represents a string type and is usually used to define class and method names and so on. The two bytes following tag 1 represent the length of the constant in bytes, for example, if its value is 5, then the next five bytes are the contents of the string.

Now that we know the rules for constant pools, we can use Python scripts to parse constant pools,

class PoolParser:
    def __init__(self) :
        self.constants = {
            3: 4.4: 4.5: 8.6: 8.7: 2.8: 2.9: 4.10: 4.11: 4.12: 4.15: 3.16: 2.17: 4.18: 4.19: 2.20: 2
        }

    def parse(self, path: str) - >List[str] :
        '''Parse constant pool of class file with path.'''
        try:
            with open(path, 'rb') as fp:
                data = fp.read()
            constant_count = data[8] * 16 * 16 + data[9]
            cur_constant_index = 1
            index = 10
            sequences = {}
            classes_nums = []
            while cur_constant_index < constant_count:
                constant_code = data[index]
                if constant_code == 1:
                    char_length = data[index+1] *16*16+data[index+2]
                    char_start = index+3
                    char_end = char_start+char_length
                    # print(str(cur_constant_index) + '/' + str(constant_count) + ': ' + str(data[char_start:char_end], 'utf-8'))
                    sequences[cur_constant_index] = str(data[char_start:char_end], 'utf-8')
                    index = index + 2 + char_length + 1
                    cur_constant_index = cur_constant_index + 1
                elif constant_code == 5 or constant_code == 6:
                    # Special guideline for double and long type, see https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-4.html#jvms-4.4.5
                    index = index + self.constants[constant_code] + 1
                    cur_constant_index = cur_constant_index + 2
                else:
                    # Get number of class in the constant pool.
                    if constant_code == 7:
                        classes_nums.append(data[index+1] *16*16+data[index+2])
                    index = index + self.constants[constant_code] + 1
                    cur_constant_index = cur_constant_index + 1
            return [sequences[num] for num in classes_nums]
        except BaseException as e:
            logging.error("Error while parsing constant pool:\n%s" % traceback.format_exc())
            print("Error while parsing constant pool for [%s]" % path)
            return []
Copy the code

The logic here is to resolve the length of the constant pool and iterate, incrementing by one at a time, until all constants have been resolved. Here are a few details to note:

  • For constants with tags 5 and 6 (which are essentially constants defined by long and double), they take two numbers. This definition feels weird. If you don’t read the official Oracle documentation, you won’t be able to find this in most books. That is, if the size portion of a class file’s constant pool has a value of 30 and the constant pool contains a double, it actually has 29 constants instead of 30, because double occupies two numbers, even though the layout on the class file is contiguous.

  • A constant with tag 7 is used to determine the class in the constant pool. The two bytes following this constant tag are the contents of the class. So, this is a way to find exactly the class in the constant pool. (Class is placed in a constant with a tag of 1, which is actually treated as a string of type UFT8)

  • Finally, Oracle for JVM define detailed documentation: docs.oracle.com/javase/spec… Those who are interested can find it on their own. It is more comprehensive than any book.

3. Overall script implementation

3.1 JAR Package Parsing

The JAR uses the ZIP file format. For Android application scenarios, since we don’t use jars as executables ourselves, the format requirements for jars are much lower. Basically use Python’s zip library to decompress, and then use the JAR directive to package. The logic of the Zip unpack here is,

def _unzip_jar(self) :
    '''Unzip the jar.'''
    print("Unzip jar ...")
    logging.info("Unzip jar ...")
    self.unzip_to = str(int(time.time()))
    try:
        with open(self.jar_path, 'rb') as fp:
            data = fp.read()
        self.unzip_to = hashlib.md5(data).hexdigest()
    except BaseException as e:
        logging.error("Error while reading the jar md5:\n%s" % traceback.format_exc())
        print("Error while reading the jar md5 [%s]" % self.jar_path)
    self.unzip_to = 'space/%s/unzip' % self.unzip_to
    if not os.path.exists(self.unzip_to):
        try:
            zipFile = zipfile.ZipFile(self.jar_path)
            zipFile.extractall(self.unzip_to)
            zipFile.close()
        except BaseException as e:
            logging.error("Error while unzip the jar:\n%s" % traceback.format_exc())
            print("Error while unzip the jar [%s]" % self.jar_path)
    else:
        logging.info("Unziped resources exists, ignore!")
        print("Unziped resources exists, ignore!")
Copy the code

The logic is simple, that is, first read the MD5 of the JAR file to determine whether the decompression file of the specified JAR exists. If so, use it directly; otherwise, decompress the file again.

3.2 Breadth-first search

Here I’m using the breadth-first algorithm based on queues.

def _search_by_read_constants(self) :
    '''Search by parsing constants.'''
    print("Searching classes by parsing constant pool ...")
    logging.info("Searching classes by parsing constant pool ...")
    start = int(time.time())
    self.classes = [os.path.join(self.unzip_to, cls.replace('. '.'/')) + '.class' for cls in self.classes]
    visits = []
    visits.extend(self.classes)
    count = 0
    parser = PoolParser()
    while len(visits) > 0:
        count = count + 1
        cur_path = visits.pop()
        exists = os.path.exists(cur_path)
        logging.info("Searching under [%d][%s][%s]" % (count, str(exists), self._simplify_class_path(cur_path)))
        print("Searching under [%d][%s][%s]" % (count, str(exists), self._simplify_class_path(cur_path)), end='\r')
        # Ignore if the class file not exists.
        if not exists:
            self.classes.remove(cur_path)
            continue
        # Parse classes from class constant pool.
        classes = parser.parse(cur_path)
        for cls in classes:
            self._handle_classes(cls, cur_path, visits)
    logging.info("Searching classes by parsing constant pool done in [%d]" % (int(time.time())-start))
    print("Searching classes by parsing constant pool done in [%d]" % (int(time.time())-start))

def _handle_classes(self, cls: str, cur_path: str, visits: List[str]) :
    '''Handle classes.'''
    pathes = [cls]
    # Handle if the class is an inner class.
    if cls.rfind('$') > 0:
        pos = cls.rfind('$')
        while pos > 0:
            cls = cls[0:pos]
            pathes.append(cls)
            pos = cls.rfind('$')
    for path in pathes:
        cls_path = os.path.join(self.unzip_to, path) + '.class'
        # Only search one class a time.
        if cls_path not in self.classes:
            self.classes.append(cls_path)
            visits.append(cls_path)
            logging.info("Found [%s] under [%s]" % (path, self._simplify_class_path(cur_path)))
Copy the code

The execution class passed in when the script is launched is used as a starting point for traversal to determine whether the file exists locally. Because there are some classes referenced in the standard library that are not in our JAR package, this operation is to exclude such files. Then, we use the class constant pool parsing logic we introduced earlier to parse the other classes to which this reference is referred from the class and queue them for further traversal. Until the entire queue is empty, the program is finished.

One detail to note here is the handling of inner classes. For things like AAA$BBB$CCC, there is a requirement that the outer classes must be entered into the package at the time of packaging, otherwise the inner classes will not be entered into the final JAR file. Here I’ve written a program to parse the inner class and its parent, the first few lines of the _HANDle_classes () method.

3.3 the jar packaging

For JAR packaging, it is best to use the official JAR directive. When packaging, one way is to concatenate all of our classes into a single instruction. However, because sometimes too many classes are included, this approach can result in very long packaged instructions. So, later, I switched to a different plan. Put all the class files you use into a directory, and then package them against that directory. That is,

def _pack_classes(self) :
    '''Packing casses.'''
    print("Packing classes ...")
    logging.info("Packing classes ...")
    if not os.path.exists(self.output):
        os.mkdir(self.output
    # Copy all filted classes to a directory.
    for cls in self.classes:
        dst = cls.replace(self.unzip_to, self.copy_to)
        par = os.path.dirname(dst)
        if not os.path.exists(par):
            os.makedirs(par)
        shutil.copyfile(cls, dst)
    # Get into the copied directory and use jar to package the directory.
    pack_command = 'cd %s && jar cvf %s/final.jar %s' % (self.copy_to, self.output, '. ')
    logging.info("PACK COMMAND: %s" % pack_command)
    ret = os.popen(pack_command).read().strip()
Copy the code

4, effects,

Since this is a slimming of the JAR package itself, the actual implementation will depend on the jar package. For example, the IBM JAR package I cited above was slimmed down from 9M to 1M.

5, restrictions,

  • The current solution is not the ultimate optimization: the solution that appears in a class file will be added to the final JAR, and the methods we need may not use those classes. So, if you do a method-level scan, you can eliminate more unnecessary class files. However, if you do a method-level scan, you also need to address the possibility that the crash stack may not match the actual code.
  • In addition, there is nothing you can do about the methods referenced in SO.

conclusion

Above is one of the Android package volume optimization tips. I’ve seen teams do dex level optimizations before, and if we follow the above thinking and optimize the entire Android package volume to exclude class files that are not needed, the results will be better than the results of dex level optimizations. However, with Android, things can be more complicated, such as making calls in SO and encryption and reflection.

Open source: github.com/Shouheng88/…