Steamed rice · 2016/01/06 9:11

0x00 Preface


Josh Grunzweig has published a series of tutorials on the Palo Alto Networks blog about analyzing malware with IDAPython. I found the content very good, so I translated it into Chinese to share with everyone. Original addresses:

Part1: researchcenter.paloaltonetworks.com/2015/12/usi…

Part2: researchcenter.paloaltonetworks.com/2015/12/usi…

0x01 Background


As a malware reverse engineer, I use IDA Pro every day. This is not surprising, as IDA Pro is arguably the industry standard (though alternatives such as Radare2 and Hopper are also gaining popularity). One of the most powerful features of IDA is its support for Python scripting, known as IDAPython. IDAPython gives scripts access to a large number of IDA APIs, and of course scripts can also use everything the Python language itself provides.

Unfortunately, very little documentation is available for IDAPython. The main resources are:

  • The IDA Pro Book by Chris Eagle
  • The Beginner’s Guide to IDAPython by Alex Hanel
  • IDAPython Wiki by Magic Lantern

0x02 Use IDAPython to solve string encryption problem


To give analysts more tutorial material, I am going to write a series of articles built around worked examples. In the first part of this series, I’ll show how to write a script that deobfuscates the many encrypted strings in a malware sample.

While reverse engineering a malware sample, I came across a function like this:

Figure 1. String decryption function

Based on past experience, I suspected that this function was used for decryption. The large number of references to it supported that suspicion.

Figure 2. Numerous references to the suspicious function

In Figure 2, we can see 116 references to this function. Each time the function is called, a piece of data is supplied to the function as an argument through the ESI register.

Figure 3. Instance of the suspicious function (405BF0) being called

At this point, I’m fairly sure this is the function the malware uses to decrypt strings at run time. In this situation, we generally have the following options:

  1. I can manually decrypt and rename these strings.
  2. I can debug the sample dynamically and rename the strings as I encounter them.
  3. I can write a script that decrypts and renames these strings.

I would choose the first or second method if the malware decrypted only a few strings. However, as noted above, this function is called 116 times, so it makes more sense to solve the problem with an IDAPython script.

The first step in defeating the string obfuscation is to understand the decryption function and reimplement it. Fortunately, the decryption function is very simple: it takes the first byte of the data as the key and XORs every remaining byte with it.

E4 91 96 88 89 8B 8A CA 80 88 88

In the example above, E4 is the key used to XOR the remaining data, and the result is “urlmon.dll”. In Python, we can rewrite this decryption function as:

    def decrypt(data):
        length = len(data)
        c = 1
        o = ""
        while c < length:
            o += chr(ord(data[0]) ^ ord(data[c]))
            c += 1
        return o

As you can see, our test script gets what we expect:

    >>> from binascii import *
    >>> d = unhexlify("E4 91 96 88 89 8B 8A CA 80 88 88".replace(" ",''))
    >>> decrypt(d)
    'urlmon.dll'
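The snippets in this article are Python 2, where indexing a string yields one-character strings. As a minimal sketch, assuming Python 3 (where indexing `bytes` yields integers), the same single-byte XOR scheme might look like this. The name `xor_decrypt` is introduced here for illustration and does not come from the original article.

```python
def xor_decrypt(data: bytes) -> str:
    """Decrypt a blob whose first byte is the XOR key for the remaining bytes."""
    if not data:
        return ""
    key = data[0]
    # XOR every byte after the key byte with the key.
    return bytes(b ^ key for b in data[1:]).decode("latin-1")


sample = bytes.fromhex("E4 91 96 88 89 8B 8A CA 80 88 88")
print(xor_decrypt(sample))  # urlmon.dll
```

Decoding as latin-1 is a choice made for this sketch so that any byte value maps to a character; the malware's strings here are plain ASCII.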

The next step is to find the code that references the decryption function and extract the data passed to it as an argument. Getting the references to a function is very simple: the XrefsTo() API function does exactly this. In this script, I simply hardcode the address of the decryption function. As a test, I printed the referencing addresses in hexadecimal:

    for addr in XrefsTo(0x00405BF0, flags=0):
        print hex(addr.frm)

Result:

    0x401009L
    0x40101eL
    0x401037L
    0x401046L
    0x401059L
    0x40106cL
    0x40107fL
    <truncated>

Getting the argument at each cross-reference and extracting the raw data takes a little more work, but it is not difficult. The first thing we want is the offset in the “mov esi, offset unk_??” instruction that passes the argument to the decryption function. To find it, we walk backwards from the instruction that calls the decryption function until we reach that mov. Once it is found, we can use the GetOperandValue() function to get the value of the offset. The code looks like this:

    def find_function_arg(addr):
        while True:
            addr = idc.PrevHead(addr)
            if GetMnem(addr) == "mov" and "esi" in GetOpnd(addr, 0):
                print "We found it at 0x%x" % GetOperandValue(addr, 1)
                break

Example Results:

    Python>find_function_arg(0x00401009)
    We found it at 0x418be0
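The loop above keeps walking backwards until it finds a match, so it would not terminate cleanly if no `mov esi, ...` preceded the call. The following is a minimal sketch of a bounded version of the same backward search, written against a plain Python list instead of the IDA API so that it runs anywhere. `Insn`, `find_esi_argument` and `max_steps` are hypothetical names, and apart from 0x401009, 0x418be0 and 0x405BF0 the listing is made up.

```python
from typing import List, NamedTuple, Optional


class Insn(NamedTuple):
    """Hypothetical stand-in for one disassembled instruction."""
    addr: int
    mnem: str
    op0: str
    op1_value: int


def find_esi_argument(listing: List[Insn], call_index: int,
                      max_steps: int = 10) -> Optional[int]:
    """Walk backwards from the call and return the value moved into ESI.

    Returns None if no `mov esi, ...` is found within max_steps
    instructions, instead of searching indefinitely.
    """
    lowest = max(call_index - max_steps, 0)
    for i in range(call_index - 1, lowest - 1, -1):
        insn = listing[i]
        if insn.mnem == "mov" and "esi" in insn.op0:
            return insn.op1_value
    return None


# Made-up listing for illustration only.
listing = [
    Insn(0x401000, "push", "ebp", 0),
    Insn(0x401004, "mov", "esi", 0x418BE0),
    Insn(0x401009, "call", "sub_405BF0", 0x405BF0),
]
print(hex(find_esi_argument(listing, call_index=2)))  # 0x418be0
```

Returning `None` lets the caller skip call sites that do not follow the expected pattern rather than failing on them.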

Now we just need to extract the string at that offset. Normally we would use the GetString() API function, but in this case the strings are raw binary data, so that API may not be appropriate. The solution is to write our own function that reads the data byte by byte until it hits a null terminator. The code is as follows:

    def get_string(addr):
        out = ""
        while True:
            if Byte(addr) != 0:
                out += chr(Byte(addr))
            else:
                break
            addr += 1
        return out
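Outside IDA, the same "read until the null terminator" idea can be tried on an ordinary byte buffer. This is a minimal sketch assuming Python 3; `read_c_string` and its `max_len` cap are hypothetical additions, and the buffer is made up around the encrypted bytes shown earlier.

```python
def read_c_string(buf: bytes, offset: int, max_len: int = 4096) -> bytes:
    """Return the bytes from offset up to (not including) the first null byte."""
    end = buf.find(b"\x00", offset, offset + max_len)
    if end == -1:
        # No terminator in range: return what is available, capped at max_len.
        end = min(len(buf), offset + max_len)
    return buf[offset:end]


# Two filler bytes, the encrypted "urlmon.dll" blob, a terminator, one more byte.
image = b"\x90\x90" + bytes.fromhex("E4 91 96 88 89 8B 8A CA 80 88 88") + b"\x00\xcc"
encrypted = read_c_string(image, 2)
print(encrypted.hex())  # e4919688898b8aca808888
```

Any reader that stops at the first zero byte will cut a string short if an encrypted byte happens to be 0x00, which occurs whenever a plaintext byte equals the key.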

Finally, we put all the code together:

    def find_function_arg(addr):
        while True:
            addr = idc.PrevHead(addr)
            if GetMnem(addr) == "mov" and "esi" in GetOpnd(addr, 0):
                return GetOperandValue(addr, 1)
        return ""

    def get_string(addr):
        out = ""
        while True:
            if Byte(addr) != 0:
                out += chr(Byte(addr))
            else:
                break
            addr += 1
        return out

    def decrypt(data):
        length = len(data)
        c = 1
        o = ""
        while c < length:
            o += chr(ord(data[0]) ^ ord(data[c]))
            c += 1
        return o

    print "[*] Attempting to decrypt strings in malware"
    for x in XrefsTo(0x00405BF0, flags=0):
        ref = find_function_arg(x.frm)
        string = get_string(ref)
        dec = decrypt(string)
        print "Ref Addr: 0x%x | Decrypted: %s" % (x.frm, dec)

Results:

    [*] Attempting to decrypt strings in malware
    Ref Addr: 0x401009 | Decrypted: urlmon.dll
    Ref Addr: 0x40101e | Decrypted: URLDownloadToFileA
    Ref Addr: 0x401037 | Decrypted: wininet.dll
    Ref Addr: 0x401046 | Decrypted: InternetOpenA
    Ref Addr: 0x401059 | Decrypted: InternetOpenUrlA
    Ref Addr: 0x40106c | Decrypted: InternetReadFile
    <truncated>

We can now see all the decrypted strings. It would be even better to attach each decrypted string as a comment, both at the address that references the string and at the encrypted data itself. To do this, we use the MakeComm() API function. Adding the following two lines to the loop adds the necessary comments:

    MakeComm(x.frm, dec)
    MakeComm(ref, dec)

By adding this step, we can see the cross-referenced data very clearly. As shown below, we can easily tell which strings are referenced:

Figure 4. String cross-reference interface after running the script

In addition, we can also see these decrypted strings as comments in disassembly code:

Figure 5. Disassembly code after running the script

0x03 Use IDAPython to solve hash obfuscation for function/library calls


When reverse engineering, we often see shellcode and malware use hashing to obfuscate the functions or libraries they load. For example, reverse engineers frequently run into obfuscated function names in shellcode. Overall, the process is fairly straightforward. The code first locates kernel32.dll at run time. It then uses the loaded image to identify and store the LoadLibraryA function, which is used to load further libraries and functions. This technique usually relies on some kind of hash algorithm to identify functions. The most commonly used one is CRC32, although other algorithms, such as ROR13, are also very common.

For example, while reversing a piece of malware, I saw code like this:

Figure 6. Malware uses the CRC32 hash algorithm to load functions dynamically

The constant 0xEDB88320 is the polynomial commonly used by the CRC32 algorithm, so its presence tells us that this sample uses CRC32 hashing.
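To see why that constant is such a strong indicator, here is a minimal sketch of a bit-by-bit CRC32 in Python 3 built around 0xEDB88320 and checked against the standard library. The function name `crc32_bitwise` is introduced for illustration; the malware's own loop is not reproduced here.

```python
import binascii


def crc32_bitwise(data: bytes) -> int:
    """Bit-by-bit CRC32 using the reflected polynomial 0xEDB88320."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift right; XOR in the polynomial whenever a 1 bit falls out.
            if crc & 1:
                crc = (crc >> 1) ^ 0xEDB88320
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF


name = b"LoadLibraryA"
print(crc32_bitwise(name) == (binascii.crc32(name) & 0xFFFFFFFF))  # True
```

A shift-and-conditional-XOR loop with this constant in the disassembly is the typical shape to look for.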

Figure 7. Confirming the CRC32 algorithm

Figure 7 confirms that the algorithm is CRC32. Now that both the algorithm and the function have been identified, we can determine how many times the function is called by counting its cross-references (press x in IDA). The function turns out to be called 190 times. Obviously, resolving and renaming these hashes by hand is not what we want, so we use IDAPython to do it for us.

The first step doesn’t actually require IDAPython, but it does use Python. To work out which hash corresponds to which function, we need to generate a list of hashes for common Windows functions. To do this, we take a list of common Windows libraries and iterate over the functions each library exports. The code is as follows:

    import pefile

    def get_functions(dll_path):
        pe = pefile.PE(dll_path)
        if ((not hasattr(pe, 'DIRECTORY_ENTRY_EXPORT')) or (pe.DIRECTORY_ENTRY_EXPORT is None)):
            print "[*] No exports for %s" % dll_path
            return []
        else:
            expname = []
            for exp in pe.DIRECTORY_ENTRY_EXPORT.symbols:
                # Skip exports that have no name (ordinal-only exports)
                if exp.name:
                    expname.append(exp.name)
            return expname

We can then get a list of function names and compute their CRC32 hashes. The code is as follows:

    import binascii

    def calc_crc32(string):
        # Mask to 32 bits so the result is always an unsigned value
        return int(binascii.crc32(string) & 0xFFFFFFFF)

Finally, we write the results to a JSON file named “output.json”. This JSON file contains a very large dictionary in the following format:

    HASH => NAME
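A minimal sketch of how such a HASH => NAME dictionary could be built and serialized, assuming Python 3 and a short hard-coded name list in place of the exports read from real DLLs. `build_hash_table` and the sample names are illustrative and are not taken from the linked script.

```python
import binascii
import json


def build_hash_table(names):
    """Map the CRC32 of each export name to the name itself."""
    return {binascii.crc32(n.encode("ascii")) & 0xFFFFFFFF: n for n in names}


# Hypothetical short export list; the real script reads names from DLLs.
table = build_hash_table(["LoadLibraryA", "GetProcAddress", "InternetOpenA"])

# Serialize the dictionary; the article writes this to output.json.
serialized = json.dumps(table)

for crc, name in table.items():
    print("0x%08x => %s" % (crc, name))
```

The sketch keeps the JSON in a string; writing it to disk only needs `json.dump(table, fh)` on an open file handle.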

The full version of the code is available here:

Github.com/pan-unit42/…

Once this file has been generated, we can return to IDA and continue writing our IDAPython script. The first thing the script does is read the JSON data file “output.json” that we created earlier. Unfortunately, JSON objects do not support integers as keys, so after loading the data we need to convert each key from a string to an integer ourselves. The code is as follows:

    # Build a new dict instead of modifying json_data while iterating over it
    json_data = dict((int(k), v) for k, v in json_data.iteritems())
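The key conversion is easy to try outside IDA. This is a minimal sketch assuming Python 3; the hash values in `raw` are made up for the example.

```python
import json

raw = '{"1234": "LoadLibraryA", "5678": "GetProcAddress"}'  # made-up hashes
loaded = json.loads(raw)

# Build a new dict rather than popping and re-inserting keys while iterating.
json_data = {int(k): v for k, v in loaded.items()}
print(json_data[1234])  # LoadLibraryA
```

Building a fresh dictionary avoids changing the set of keys during iteration, which is not safe in either Python 2 or Python 3.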

When the data is loaded, we will create an enumeration object that holds the mapping between the hash value and the function name. To learn more about enumerated objects, I recommend reading this tutorial:

www.cprogramming.com/tutorial/en…

With an enumeration, we can look up the name that corresponds to an integer, such as the function name that corresponds to a CRC32 hash. To create a new enumeration in IDA, we use the AddEnum() function. To make the script more robust, we first call GetEnum() to check whether the enumeration already exists.

    enumeration = GetEnum("crc32_functions")
    if enumeration == 0xFFFFFFFF:
        enumeration = AddEnum(0, "crc32_functions", idaapi.hexflag())
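As a rough analogy only, and not the IDA API, Python's own `enum` module shows the same value-to-name lookup that an IDA enumeration provides. This sketch assumes Python 3; `Crc32Functions` and the two sample names are illustrative.

```python
import binascii
from enum import IntEnum

names = ["LoadLibraryA", "GetProcAddress"]  # hypothetical sample names
members = {n: binascii.crc32(n.encode("ascii")) & 0xFFFFFFFF for n in names}

# Functional API: build an enumeration from a name -> value mapping.
Crc32Functions = IntEnum("Crc32Functions", members)

value = members["GetProcAddress"]
print(Crc32Functions(value).name)  # GetProcAddress
```

In IDA the enumeration additionally lets the disassembly display the member name in place of the raw constant.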

We will add members to this enumeration later. The next step is to work out which function each hash value passed to the load function corresponds to. This part looks a lot like Part 1. By looking at the structure of the function, we can see that the CRC32 hash is the second argument to the load function.

Figure 8. The arguments passed to load_function()

Again, we walk backwards through the preceding instructions to find the second argument to the function. When we find it, we look the value up in the JSON data from output.json to make sure a function name corresponds to that hash. The code is as follows:

    for x in XrefsTo(load_function_address, flags=0):
        current_address = x.frm
        addr_minus_20 = current_address-20
        push_count = 0
        while current_address >= addr_minus_20:
            current_address = PrevHead(current_address)
            if GetMnem(current_address) == "push":
                push_count += 1
                data = GetOperandValue(current_address, 0)
                if push_count == 2:
                    if data in json_data:
                        name = json_data[data]
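When arguments are pushed right to left, walking backwards from the call meets the push for the first argument first and the push for the second argument second, which is why the loop above waits for `push_count == 2`. Here is a minimal sketch of that counting logic on a made-up instruction list, assuming Python 3. `nth_push_before`, the `window` limit, the hash value and the function name are all hypothetical; only 0x405680 comes from the article.

```python
from typing import List, Optional, Tuple

# (mnemonic, immediate operand value) pairs; a made-up stand-in for disassembly.
Listing = List[Tuple[str, int]]


def nth_push_before(listing: Listing, call_index: int, n: int,
                    window: int = 8) -> Optional[int]:
    """Return the operand of the n-th push found walking back from the call."""
    seen = 0
    lowest = max(call_index - window, 0)
    for i in range(call_index - 1, lowest - 1, -1):
        mnem, value = listing[i]
        if mnem == "push":
            seen += 1
            if seen == n:
                return value
    return None


# Hypothetical call site: the hash value here is invented for illustration.
listing = [("push", 0xDEADBEEF), ("push", 0x1), ("call", 0x405680)]
known = {0xDEADBEEF: "ExampleFunctionA"}  # hypothetical hash -> name entry
arg2 = nth_push_before(listing, call_index=2, n=2)
print(known.get(arg2))  # ExampleFunctionA
```

The window here is counted in instructions, whereas the IDAPython loop bounds the search by a byte distance of 20 from the call.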

At this point, we use the AddConstEx() function to add the CRC32 hash and function name to the enumerated object we created earlier.

    AddConstEx(enumeration, str(name), int(data), -1)

Once this data has been added to the enumeration, we can convert each CRC32 hash value to the corresponding enumeration name. Of the following two functions, the first looks up the enumeration data for a given integer, and the second converts the operand at a given address to the corresponding enumeration member.

    def get_enum(constant):
        all_enums = GetEnumQty()
        for i in range(0, all_enums):
            enum_id = GetnEnum(i)
            enum_constant = GetFirstConst(enum_id, -1)
            name = GetConstName(GetConstEx(enum_id, enum_constant, 0, -1))
            if int(enum_constant) == constant:
                return [name, enum_id]
            while True:
                enum_constant = GetNextConst(enum_id, enum_constant, -1)
                name = GetConstName(GetConstEx(enum_id, enum_constant, 0, -1))
                if enum_constant == 0xFFFFFFFF:
                    break
                if int(enum_constant) == constant:
                    return [name, enum_id]
        return None

    def convert_offset_to_enum(addr):
        constant = GetOperandValue(addr, 0)
        enum_data = get_enum(constant)
        if enum_data:
            name, enum_id = enum_data
            OpEnumEx(addr, 0, enum_id, 0)
            return True
        else:
            return False

Once the hash has been converted to its enumeration name, we turn to the DWORD that holds the address of the loaded function and give it a meaningful name.

Figure 9. After the function is loaded, the program stores its address in a DWORD

To do this, we need not only to walk backwards through the preceding instructions but also to search forwards for the instruction that stores EAX into a DWORD. When we find that instruction, we can rename the DWORD to the correct function name. To prevent name collisions, we prefix the function name with “d_”.

    address = current_address
    while address <= address_plus_30:
        address = NextHead(address)
        if GetMnem(address) == "mov":
            if 'dword' in GetOpnd(address, 0) and 'eax' in GetOpnd(address, 1):
                operand_value = GetOperandValue(address, 0)
                MakeName(operand_value, str("d_"+name))

When all this is done, we’ll find that the assembly code, which used to be hard to read, becomes easy to understand. As shown in the figure:

Figure 10. Changes after running the script

Now, when we look at the list of DWORDs, we can see the actual function names, which makes static analysis of the sample much easier.

The complete code is as follows:

    import json

    def get_enum(constant):
        all_enums = GetEnumQty()
        for i in range(0, all_enums):
            enum_id = GetnEnum(i)
            enum_constant = GetFirstConst(enum_id, -1)
            name = GetConstName(GetConstEx(enum_id, enum_constant, 0, -1))
            if int(enum_constant) == constant:
                return [name, enum_id]
            while True:
                enum_constant = GetNextConst(enum_id, enum_constant, -1)
                name = GetConstName(GetConstEx(enum_id, enum_constant, 0, -1))
                if enum_constant == 0xFFFFFFFF:
                    break
                if int(enum_constant) == constant:
                    return [name, enum_id]
        return None

    def convert_offset_to_enum(addr):
        constant = GetOperandValue(addr, 0)
        enum_data = get_enum(constant)
        if enum_data:
            name, enum_id = enum_data
            OpEnumEx(addr, 0, enum_id, 0)
            return True
        else:
            return False

    def enum_for_xrefs(load_function_address, json_data, enumeration):
        for x in XrefsTo(load_function_address, flags=0):
            current_address = x.frm
            addr_minus_20 = current_address-20
            push_count = 0
            # Walk backwards from the call looking for the second push (the hash)
            while current_address >= addr_minus_20:
                current_address = PrevHead(current_address)
                if GetMnem(current_address) == "push":
                    push_count += 1
                    data = GetOperandValue(current_address, 0)
                    if push_count == 2:
                        if data in json_data:
                            name = json_data[data]
                            AddConstEx(enumeration, str(name), int(data), -1)
                            if convert_offset_to_enum(current_address):
                                print "[+] Converted 0x%x to %s enumeration" % (current_address, name)
                                # Walk forwards to the mov that stores EAX into a DWORD
                                address_plus_30 = current_address+30
                                address = current_address
                                while address <= address_plus_30:
                                    address = NextHead(address)
                                    if GetMnem(address) == "mov":
                                        if 'dword' in GetOpnd(address, 0) and 'eax' in GetOpnd(address, 1):
                                            operand_value = GetOperandValue(address, 0)
                                            MakeName(operand_value, str("d_"+name))

    fh = open("output.json", 'rb')
    d = fh.read()
    json_data = json.loads(d)
    fh.close()

    # JSON objects don't allow using integers as dict keys. Little workaround for
    # this issue: build a new dict instead of modifying json_data while iterating.
    json_data = dict((int(k), v) for k, v in json_data.iteritems())

    conversion_function = 0x00405680
    enumeration = GetEnum("crc32_functions")
    if enumeration == 0xFFFFFFFF:
        enumeration = AddEnum(0, "crc32_functions", idaapi.hexflag())

    enum_for_xrefs(conversion_function, json_data, enumeration)

0x04 Summary


In the previous section, we used IDAPython together with enumerations to solve a hash obfuscation problem. Enumerations are very helpful when analyzing this kind of problem and can save us a lot of time. They can also easily be exported from or loaded into an IDA project, which is very helpful when reverse engineering samples in bulk.