Detect Android malicious code with machine learning

Runner, 2016/03/07

0x00 Preface

Some time ago I came across an interesting article in the WooYun knowledge base about classifying malicious code with machine learning. That article describes the method used by the winning team in a malicious code classification contest on Kaggle and demonstrates the application and potential of machine learning in the security field. However, the contest dealt only with the classification of malicious code and did not go on to detect it; in addition, its samples were limited to the PE format on Windows, with no coverage of mobile applications. Inspired by this, we tried to use machine learning to detect malicious code on the Android platform and obtained reasonable detection results.

0x01 Background

Android malicious code detection methods

At present, malware detection methods fall mainly into signature-based detection and behavior-based detection. Signature-based detection checks whether a file contains a signature (such as a distinctive code fragment or string) of known malware. It is fast and accurate with a low false positive rate, but it cannot detect unknown malicious code. Behavior-based detection matches the observed behavior of a program against known malicious behavior patterns to judge whether the target file is malicious. It can detect unknown variants of malicious code, but it suffers from a higher false positive rate.
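As a rough illustration of the signature-based idea, the sketch below checks a byte string against a small table of patterns. The signature names and byte patterns are invented for this example and do not come from any real signature database.

```python
# Hypothetical signature table: name -> byte pattern (made up for illustration).
SIGNATURES = {
    "Example.FamilyA": b"\xde\xad\xbe\xef",
    "Example.FamilyB": b"c2.example.invalid",
}


def scan_bytes(data):
    """Return the sorted names of all signatures whose pattern occurs in data."""
    return sorted(name for name, pattern in SIGNATURES.items() if pattern in data)
```

A file with no known pattern yields an empty list, which is exactly why this approach cannot flag previously unseen malicious code.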

Behavior-based analysis is divided into dynamic analysis and static analysis. Dynamic analysis runs the program in a sandbox or emulator and analyzes its behavior through monitoring or interception, but it is costly in resources and time. Static analysis extracts the features of the program through reverse engineering and analyzes its instruction sequences. This article uses static analysis to detect malicious code.

Weka and machine learning classification algorithms

Weka (Waikato Environment for Knowledge Analysis) is a free, non-commercial, open-source machine learning and data mining package written in Java. Weka stores data in ARFF (Attribute-Relation File Format) files, which are ASCII text files. In this article, the feature data are written to ARFF files, and Weka's built-in classification algorithms are used to train and test the model.
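To make the ARFF layout concrete, here is a minimal sketch that renders numeric count features and a nominal class label as ARFF text. The relation name, attribute names and class values are assumptions chosen for illustration; the article does not show its actual ARFF files.

```python
def to_arff(relation, attribute_names, rows, class_values=("benign", "malware")):
    """Render rows of (counts, label) as ARFF text.

    Each attribute is declared NUMERIC; the final attribute is a nominal class.
    """
    lines = ["@RELATION " + relation, ""]
    for name in attribute_names:
        lines.append("@ATTRIBUTE " + name + " NUMERIC")
    lines.append("@ATTRIBUTE class {" + ",".join(class_values) + "}")
    lines += ["", "@DATA"]
    for counts, label in rows:
        lines.append(",".join(str(c) for c in counts) + "," + label)
    return "\n".join(lines) + "\n"


# Hypothetical usage with two 3-gram features and two samples.
ARFF_TEXT = to_arff(
    "opcode_ngrams",
    ["MRV", "VVI"],
    [([3, 0], "malware"), ([1, 2], "benign")],
)
```

The header declares one attribute per feature, and each line after `@DATA` is one sample with its label in the last column.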

Machine learning is divided into supervised learning and unsupervised learning. In supervised learning, a learning algorithm builds a model from a training set, and a test set is then used to evaluate the accuracy and performance of the model. Classification is a supervised learning task, so a model must be built first. Common classification algorithms include random forest and the support vector machine (SVM).

Basic format of APK

The following description of the APK (Android application package) format is taken from Wikipedia.

The APK file format is a ZIP-based format, constructed in a similar way to JAR files. Its Internet media type is application/vnd.android.package-archive.

An APK file usually contains the following files:

classes.dex: Dalvik bytecode, which can be executed by the Dalvik virtual machine.
AndroidManifest.xml: the Android manifest file, which describes the application's name, version number, required permissions, registered services, and other linked applications. This file is in XML format.
META-INF folder: contains three files
- MANIFEST.MF: manifest information
- CERT.RSA: stores the certificate and authorization information of the application
- CERT.SF: stores the list of resources with their SHA-1 digests
res: resource folder required by the APK
assets: directory of raw resource files that do not need to be compiled
resources.arsc: compiled binary resource file
lib: library file directory

Of these files, the one that matters most here is classes.dex, into which the Android application's executable code is compiled and packaged.
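Because an APK is a ZIP archive, its entries can be listed with any ZIP library. The sketch below uses Python's standard `zipfile` module; the helper names and the tiny in-memory demo archive (with placeholder contents, not a real APK) are assumptions made for illustration.

```python
import io
import zipfile


def list_apk_entries(apk):
    """Return the entry names in an APK, given a path or a file-like object."""
    with zipfile.ZipFile(apk) as archive:
        return archive.namelist()


def has_dex(apk):
    """True if the archive contains a top-level classes.dex entry."""
    return "classes.dex" in list_apk_entries(apk)


def build_demo_archive():
    """Build a tiny in-memory ZIP that mimics an APK layout (placeholder data)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("AndroidManifest.xml", b"placeholder")
        archive.writestr("classes.dex", b"placeholder")
        archive.writestr("META-INF/MANIFEST.MF", b"placeholder")
    buffer.seek(0)
    return buffer
```

Calling `list_apk_entries` on a real APK path would show the files described above; the demo archive only exists so the sketch can be tried without a sample.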

Dalvik virtual machine and disassembly

Unlike the Java virtual machine (JVM), Android's virtual machine is called the Dalvik virtual machine (DVM). The JVM runs Java bytecode, whereas the DVM runs Dalvik bytecode. The JVM is based on a stack architecture, whereas the DVM is based on a register architecture.

The DVM has its own executable file format, DEX, and its own instruction set. smali and baksmali are, respectively, the assembler and the disassembler for the DEX format; disassembling a DEX file produces smali files. Smali code has a specific format and syntax, and the smali language is a human-readable representation of Dalvik VM bytecode.

Apktool wraps and extends the smali tools: besides assembling and disassembling DEX files, it can also decode and rebuild the resource files in an APK that have been compiled to binary form. Instead of using the smali and baksmali tools directly, this article uses apktool to disassemble APK files.

java -jar apktool.jar d D:\drebin\The_Drebin_Dataset\set\apk\DroidKungFu\xyz.apk

When the command succeeds, the out directory contains the following top-level structure:

AndroidManifest.xml: configuration file
apktool.yml: file generated during decoding, used by apktool
assets/: directory of asset files that do not need decompilation
lib/: directory of library files that do not need decompilation
res/: directory of decoded resource files
smali/: directory of smali source files generated by disassembly

The structure of the smali directory corresponds to that of the original Java source (src) directory.
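Since the smali files are spread over a directory tree that mirrors the package structure, later processing needs to gather them. The following is a minimal sketch of that step; the function name is hypothetical, and the article's own tooling for this was written in C++ and is not shown.

```python
import os


def collect_smali(root_dir):
    """Concatenate the text of every .smali file found under root_dir."""
    parts = []
    for dirpath, dirnames, filenames in os.walk(root_dir):
        dirnames.sort()  # visit subdirectories in a deterministic order
        for name in sorted(filenames):
            if name.endswith(".smali"):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="replace") as handle:
                    parts.append(handle.read())
    return "\n".join(parts)
```

Sorting the directory and file names only makes the output reproducible; it does not affect which files are included.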

0x02 Feature Engineering

Classification and description of Dalvik instructions

Smali is a representation of DVM bytecode; although it is not an official standard language, all of its statements follow a consistent syntax. For details on Dalvik opcodes, see the article at pallergabor.uw.hu/androidblog…, which lists the meaning, usage and examples of each Dalvik opcode.

Since there are more than 200 Dalvik instructions, they need to be classified and simplified: irrelevant instructions are removed, leaving only seven core instruction classes, M, R, G, I, T, P and V, and for each instruction only the opcode field is kept while the operands are dropped. The seven classes M, R, G, I, T, P and V stand for move, return, jump, conditional test, get data, put data and method invocation, respectively. Each retained instruction is thus replaced by the letter of its class. See the following figure for details:
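Independently of the figure, the simplification step can be sketched as follows. The prefix rules used here (for example, treating every opcode that starts with `move` as class M, or `aget`/`iget`/`sget` as class T) are an assumption based on the seven class names; the article's exact opcode table is not reproduced in the text.

```python
def classify_opcode(opcode):
    """Map a Dalvik opcode mnemonic to M, R, G, I, T, P or V, or None if dropped."""
    if opcode.startswith("move"):
        return "M"
    if opcode.startswith("return"):
        return "R"
    if opcode.startswith("goto"):
        return "G"
    if opcode.startswith("if-"):
        return "I"
    if opcode[:4] in ("aget", "iget", "sget"):
        return "T"
    if opcode[:4] in ("aput", "iput", "sput"):
        return "P"
    if opcode.startswith("invoke"):
        return "V"
    return None


def simplify(smali_text):
    """Reduce smali source to a string of class letters, one per kept instruction."""
    letters = []
    for line in smali_text.splitlines():
        tokens = line.split()
        if not tokens or tokens[0].startswith((".", "#", ":")):
            continue  # skip blank lines, directives, comments and labels
        letter = classify_opcode(tokens[0])
        if letter:
            letters.append(letter)
    return "".join(letters)


# Small hand-written smali fragment used only to exercise the sketch.
SAMPLE = """
.method public foo()V
    .locals 2
    const/4 v0, 0x0
    iget-object v1, p0, Lcom/a;->b:Ljava/lang/String;
    invoke-virtual {v1}, Ljava/lang/String;->length()I
    move-result v0
    if-eqz v0, :cond_0
    goto :goto_0
    :cond_0
    return-void
.end method
"""
```

In this sketch, `const/4` is dropped because it belongs to none of the seven classes, and only the first token of each instruction line is inspected, so operands never reach the output.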

OpCode N-gram

N-gram is a concept from the field of natural language processing, but it is also often used in the analysis of malicious code. An opcode n-gram is an n-gram feature extracted from the opcode fields of the instruction sequence, where n can be 2, 3, 4, and so on. The opcode n-grams of a smali assembly file are shown below:
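As a text counterpart to the figure, here is a minimal sketch of n-gram extraction over a sequence of instruction class letters. It assumes the sequence is a plain string with one letter per instruction; the function names are hypothetical.

```python
from collections import Counter


def opcode_ngrams(sequence, n):
    """Return every contiguous n-gram of the sequence, in order of appearance."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def ngram_counts(sequence, n=3):
    """Count how often each n-gram occurs in the sequence."""
    return Counter(opcode_ngrams(sequence, n))
```

For example, with n = 3 the sequence `TVMIGR` yields the four 3-grams `TVM`, `VMI`, `MIG` and `IGR`; a sequence shorter than n yields none.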

0x03 System Design and Implementation

The system is divided into two parts: building the malicious code detection model and testing it.

The malicious code detection model is built as follows:

Several programs were written in C++ to process the data during model building:

Total.exe: merges all smali files in the project directory produced by disassembling a single APK into one file
Simplication.exe: extracts the instructions, then classifies and describes them
Ngramgen.exe: generates the n-gram sequences for a specified n
Arff.exe: counts the occurrences of each feature and generates an ARFF file suitable for Weka
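The counting step at the end of this pipeline can be sketched as below: each sample's n-grams are counted, a shared vocabulary is built, and every sample becomes one row of counts. This is an illustrative Python stand-in under the assumption that each sample is already a string of class letters; it is not the article's C++ code.

```python
from collections import Counter


def build_feature_table(samples, n=3):
    """Turn (sequence, label) pairs into a vocabulary and rows of n-gram counts.

    Returns (vocabulary, rows), where each row is (counts, label) and counts
    is aligned with the sorted vocabulary.
    """
    counted = []
    for sequence, label in samples:
        grams = Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))
        counted.append((grams, label))
    vocabulary = sorted(set().union(*(grams.keys() for grams, _ in counted)))
    rows = [([grams[g] for g in vocabulary], label) for grams, label in counted]
    return vocabulary, rows
```

An n-gram that never occurs in a sample gets a count of 0 in that sample's row, so all rows have the same length and can be written out as one ARFF data section.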

The malicious code detection model is tested as follows:

The machine learning tool Weka was used to test the model and measure its accuracy. A model with high accuracy can then be used to predict whether unknown Android code is malicious.

0x04 Experimental evaluation

Experimental data source

The experimental data consist of malicious code samples and normal code samples. The normal samples were downloaded from the Android Market; the malicious samples come from the Drebin project, which collected 5,560 APK samples from 178 families between August 2010 and October 2012. The distribution of sample counts across the 178 malicious code families is shown in the figure below:

Experimental results

This experiment used 540 malicious samples and 560 benign samples, for a total of 1100 samples in 2 classes. The classifier is a random forest with 150 decision trees, the n-gram length n is 3, and ten-fold cross-validation was performed.
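To show what ten-fold cross-validation does with these 1100 samples, the sketch below partitions sample indices into ten folds while keeping the class ratio in each fold. It covers only the splitting, not the random forest itself, and it assigns samples round-robin without shuffling, which is a simplification and an assumption, not a description of Weka's internals.

```python
def stratified_folds(labels, k=10):
    """Assign each sample index to one of k folds, class by class, round-robin."""
    folds = [[] for _ in range(k)]
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def cross_validation_splits(labels, k=10):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    folds = stratified_folds(labels, k)
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

With 540 malicious and 560 benign samples, each fold holds 54 malicious and 56 benign samples; the model is trained on nine folds and tested on the remaining one, ten times over, so every sample is used for testing exactly once.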

The accuracy is shown in the figure below. 1045 samples were correctly classified and 55 were misclassified (1100 - 1045). For the malware class, TPR (true positive rate) = 0.981, FPR (false positive rate) = 0.08, and precision = 0.922.
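These figures can be cross-checked with a small calculation. The confusion-matrix counts below are not reported in the article; they are inferred here from the 540 malicious and 560 benign samples, the 1045 correct classifications (which leaves 1100 - 1045 = 55 errors), and the rounded rates, so treat them as an assumption.

```python
def binary_metrics(tp, fn, fp, tn):
    """Compute TPR, FPR, precision and accuracy from confusion-matrix counts."""
    return {
        "tpr": tp / (tp + fn),
        "fpr": fp / (fp + tn),
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }


# Inferred counts (assumption): 530 of 540 malicious samples detected,
# 45 of 560 benign samples wrongly flagged.
METRICS = binary_metrics(tp=530, fn=10, fp=45, tn=515)
```

With these counts, the true positive rate, false positive rate and precision round to the values quoted in the text, and the correct classifications add up to 530 + 515 = 1045.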

The ROC curve is shown below. The receiver operating characteristic (ROC) curve and the area under it (AUC) are often used to evaluate how good a binary classifier is; for details on AUC, see en.wikipedia.org/wiki/Area_u…
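For readers unfamiliar with AUC, one way to compute it is as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counting half. The sketch below implements that pairwise definition on made-up scores; it is not the computation Weka performs, and the article's actual scores are not available.

```python
def auc(scores, labels):
    """AUC via pairwise comparison of positive (1) and negative (0) scores."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(positives) * len(negatives))
```

A classifier that ranks every malicious sample above every benign one scores 1.0, while one that cannot separate the classes at all scores about 0.5. The pairwise loop is quadratic, which is fine for a sketch but slow for large datasets.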

0x05 Summary and Outlook

In general, the experimental results show that the method detects malicious code with high accuracy.
Combining it with other features should further improve accuracy.
Besides the random forest algorithm, other classification algorithms, such as support vector machines, are also worth trying.

0x06 References

Using machine learning to classify malicious code
Kaggle’s Malware classification
Weka software download
Description of Dalvik Opcodes
Random forest algorithm
Apktool tools
Smali study notes
Drebin project introduction and download
