Bytedance Terminal Technology — Tan Zijing

In the CI/CD stage, business security compliance detection analyzes and checks newly added code, gates code merges, traces problems back to their source, and gates the release of build artifacts, so that privacy and compliance issues are not brought online where they would create security compliance risks.

Background

1. Business background

With the rapid development of Internet technology, laws and regulations on user privacy data, at home and abroad, are becoming increasingly comprehensive, regulatory enforcement is becoming routine, and users are growing more aware of the security of their private data. As a result, privacy data security compliance problems that surface after a mobile application is launched pose a risk that is increasingly hard to control.

The following is a brief list of domestic and foreign privacy compliance laws, regulations and notices in recent years:

2021-03-21 | Ministry of Industry and Information Technology | Notification on APP Infringement of Users’ Rights and Interests (The 3rd Batch in 2021, the 12th Batch in Total)

2020-07-24 | Ministry of Industry and Information Technology | Notice on Carrying out Special Rectification Action for APP Infringement on Users’ Rights

2019-12-30 | Ministry of Industry and Information Technology | Method for Identifying Illegal Collection and Use of Personal Information by Apps

2019-02-27 | United States | Data Privacy Act

2018-05-25 | European Union | General Data Protection Regulation (GDPR)

2. Technical background

As shown in the figure below, once business security detection is incorporated into the quality inspection system, checkpoints are established mainly in the CI (Continuous Integration) and CD (Continuous Delivery) phases. When a business security compliance risk is found, the code merge or application release is blocked so that the risk does not reach production.

Current situation and difficulties

The status quo

First, for the core requirements (sensitive APIs, sensitive permissions, and sensitive strings), we implemented detection on Android compilation intermediate products based on Gradle Transform and ASM, namely CI intermediate product detection.

CI intermediate product detection

The general scanning process is shown in the figure below:

When feature1 has completed development and testing and is ready to be merged into the Develop branch:

First, the feature1 branch and the Develop branch are built separately.

Then, each build analyzes and checks its compilation intermediate products and records every method that matches a rule as an issue. When the builds finish, the “initial full problem” set and the “current full problem” set in the diagram are obtained respectively.

Then, the initial full problems and the current full problems are diffed to obtain the problems newly introduced between the creation of the feature1 branch and its merge.

Finally, any MR with incremental problems is gated: the branch code cannot be merged into the main branch until the developer (RD) fixes all incremental problems or reports them and has them approved.
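The diff-and-gate step above can be sketched as a set difference. This is only a minimal illustration: the `Issue` fields, the rule name, and the method names are hypothetical, and the real tool may key issues differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    """One rule match found in a build's compilation intermediate products."""
    rule_id: str  # hypothetical rule identifier
    method: str   # method that matched the rule


def incremental_issues(initial: set[Issue], current: set[Issue]) -> set[Issue]:
    """Return issues in the current full set that are absent from the initial one."""
    return current - initial


def should_block_merge(new_issues: set[Issue], approved: set[Issue]) -> bool:
    """Block the MR while any incremental issue is neither fixed nor approved."""
    return bool(new_issues - approved)


# Hypothetical data: one issue existed before the branch, one was added by it.
initial_full = {Issue("sensitive_api", "com.example.Base.readImei")}
current_full = initial_full | {Issue("sensitive_api", "com.example.Feature1.readLocation")}
new_issues = incremental_issues(initial_full, current_full)
```

Keying issues by rule and method rather than by line number keeps the diff stable when unrelated code shifts lines between the two builds.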

CD product detection

Scanner is the product detection tool for the Android CD stage. Built on command-line tools such as AAPT, apktool, keytool, and strings, it implements security checks of APK/AAR/AAB/SO and other Android-related binary products. The general working process is shown in the figure below:

For bytecode, the most important part of Android binary artifacts, Scanner works on the smali files decompiled by apktool. After decompilation, the smali files are scanned concurrently, line by line, for security compliance issues.

  • Schematic diagram of smali files

  • Sequence diagram of decompiling DEX to generate smali files
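The concurrent line-by-line scan described above might look roughly like the sketch below. It is not Scanner's actual code: the rule set, the function names, and the use of a thread pool are assumptions; only the smali `invoke-*` instruction shape is standard.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical rule set: sensitive APIs in smali's "Lclass;->method" form.
SENSITIVE_APIS = {
    "Landroid/telephony/TelephonyManager;->getDeviceId",
}

# A smali invoke instruction names its callee as "Lpkg/Class;->name(args)ret".
INVOKE_RE = re.compile(r"^\s*invoke-\S+\s+\{[^}]*\},\s*(L[^;]+;->[^(]+)\(")


def scan_smali(name: str, text: str) -> list[tuple[str, int, str]]:
    """Return (file, line number, callee) for each direct sensitive call."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        match = INVOKE_RE.match(line)
        if match and match.group(1) in SENSITIVE_APIS:
            hits.append((name, lineno, match.group(1)))
    return hits


def scan_all(files: dict[str, str]) -> list[tuple[str, int, str]]:
    """Scan several smali files concurrently and merge the hits."""
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda item: scan_smali(*item), files.items())
    return sorted(hit for hits in results for hit in hits)
```

A scan of this kind reports only the instruction that invokes the sensitive API directly; it knows nothing about who calls the enclosing method.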

Difficulties

1. CI business security detection cannot cover the source code

How can the license information contained in open source code be checked? Clearly, CI intermediate product detection cannot meet this requirement; the check can only be carried out on source code.

2. Locating problems found by CI business security detection is costly

The source code behind Android compilation intermediate products goes through a series of optimizations during the build, such as desugaring, so original information such as syntactic sugar and line numbers in the source files cannot be restored. This increases the workload of locating and troubleshooting detected problems.

3. CD product detection based on sensitive call points carries a risk of missed detections

Smali-based scans in the CD phase tend to find only the direct call sites of sensitive APIs, not the full call chains, which makes auditing problems challenging.

As shown in the figure below, the same sensitive API may have both compliant and non-compliant call scenarios. When a non-compliant call occurs in a new indirect call scenario, and the call point was evaluated as “no rectification needed” in an earlier version, the problem may be missed. In addition, a call point carries limited information for troubleshooting, so extra effort is often needed to search and analyze, which hinders locating and solving the problem.

Improvements and Benefits

CI intermediate product detection cannot cover source-code-level checks, and the problems it finds are costly to locate. To solve both issues, we implemented CI incremental source code detection. The general idea is to use Git diff to obtain all the new source code corresponding to each MR (including the source code of changed components), and then run the various security checks and code gating on this incremental source code.

CI incremental source code detection

Obtaining the incremental source code in an MR involves three subprocesses: obtaining the source code change information, diffing the source code, and precisely obtaining the incremental source code of components. Each is described below with reference to the flow chart:

1. Subprocess for obtaining source code change information

The source code change information consists of the source repository and commit information of the main/sub-repositories, and the source repository and commit information of the changed components.

Take the Android project for example,

First, when a developer submits an MR, a check is triggered, and the source code change information of the main/sub-repositories (repository address, base commit, review commit) can be obtained directly.

Then, based on the change information of the main repository, the project source of the main repository is downloaded at both commits.

Then, a gradle command (a pod command on iOS) is used to obtain the component dependency tree at each of the two commits. Parsing and diffing the two dependency trees yields the component changes between the two commits (new components and updated components).

Finally, using the Maven coordinates and version numbers in the component change information, the component management module returns the original Git repository of each component and the commits corresponding to the two version numbers, that is, the source repository and commit information of the changed components.

Component changes include added, updated, and deleted components. Incremental source code detection only needs to consider added and updated components.
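The dependency diff can be illustrated with a small sketch. It assumes each dependency tree has already been flattened into a mapping from Maven coordinate to version; parsing the actual gradle or pod output is out of scope here, and the coordinates are hypothetical.

```python
def diff_components(base: dict[str, str], review: dict[str, str]):
    """Compare two flattened dependency maps (coordinate -> version).

    Returns (added, updated, removed); only the first two matter for
    incremental source code detection.
    """
    added = {c: v for c, v in review.items() if c not in base}
    updated = {
        c: (base[c], v)  # (old version, new version)
        for c, v in review.items()
        if c in base and base[c] != v
    }
    removed = {c: v for c, v in base.items() if c not in review}
    return added, updated, removed


# Hypothetical coordinates at the base commit and the review commit.
base_deps = {"com.example:pay": "1.0.0", "com.example:share": "2.1.0"}
review_deps = {"com.example:pay": "1.1.0", "com.example:login": "0.3.0"}
added, updated, removed = diff_components(base_deps, review_deps)
```

The old and new versions of each updated component are kept together because both are needed to look up the two commits in the component management module.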

The component management module is responsible for releasing and upgrading components, and records the correspondence between the original Git repository, version number, and commit of each component. Components without such a record are third-party components; they have no source code and do not need to be considered.

2. Source code diff subprocess (key)

Based on the source code change information obtained above, the source code diff subprocess is shown in the figure below:

  • Main steps for main/sub-repositories and updated components

First, download the project source at the commit that the MR is preparing to merge into the main branch (the review commit in the diagram).

Then, use the git diff command to obtain the code changes between the base commit and the review commit.

Next, iterate over the code changes (for an updated component, keep only the changes under the component directory) and collect the files with new or updated source code together with their changed line numbers (the lines beginning with “+” in the diff result). All changed source files and their changed line numbers are recorded and compressed into a ZIP package, the incremental source package.

Finally, the incremental source package is uploaded to the server for use by various downstream inspection services.
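Collecting the “+” lines and their line numbers can be sketched as follows. This is a simplified parser for the unified format that `git diff` prints by default; the function name is hypothetical, and it ignores quoted paths, renames, and binary files.

```python
import re

# Hunk header, e.g. "@@ -10,3 +12,4 @@"; group 1 is the first new-file line.
HUNK_RE = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@")


def added_lines(diff_text: str) -> dict[str, list[int]]:
    """Map each changed file to the new-file line numbers of its '+' lines."""
    result: dict[str, list[int]] = {}
    path = None
    new_lineno = 0
    in_hunk = False
    for line in diff_text.splitlines():
        if line.startswith("diff --git "):
            in_hunk = False  # a new file section begins
        elif not in_hunk and line.startswith("+++ "):
            target = line[4:]
            # "+++ /dev/null" marks a deleted file: nothing was added
            path = target[2:] if target.startswith("b/") else None
        elif line.startswith("@@"):
            match = HUNK_RE.match(line)
            if match:
                new_lineno = int(match.group(1))
                in_hunk = True
        elif in_hunk and path is not None:
            if line.startswith("+"):
                result.setdefault(path, []).append(new_lineno)
                new_lineno += 1
            elif line.startswith(" "):
                new_lineno += 1  # context line exists in the new file too
            # '-' lines and "\ No newline" markers do not advance the counter
    return result
```

Recording line numbers from the new side of the diff lets downstream checks report a problem at the exact line of the review commit.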

  • New components

The difference between a new component and an updated component is that a new component requires all of the source code it contains: every line in every source file of the component needs to be checked. The other steps are exactly the same as for an updated component.

3. Subprocess for precisely obtaining component incremental source code

A single library project (Git project) may contain a large number of components, even a mix of Android and iOS components. The code changes obtained from the source repository where a changed component resides may therefore include the contents of other components. To avoid false positives, we need the precise path of the component within its source repository.

Take the Android project for example,

First, obtain from the component management module the module name of the component in its source repository, and inject a custom Gradle task into that repository.

Then, execute the custom Gradle task to get the module names and corresponding source paths of all components in the source repository.

Then, match the module name of the changed component to obtain its source path.

Finally, when the source code diff subprocess traverses the code changes, it uses the component path to keep only the incremental source information of the changed component and puts that source code into the incremental source package. This is how the incremental source code of a component is obtained precisely.
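The path filtering in the last step could be sketched like this. The module names and paths are hypothetical, and the mapping from module name to source path is assumed to come from the custom Gradle task described above.

```python
from pathlib import PurePosixPath


def under_component(file_path: str, component_dir: str) -> bool:
    """True if file_path lies inside component_dir (not merely shares a prefix)."""
    return PurePosixPath(component_dir) in PurePosixPath(file_path).parents


def filter_component_changes(
    changes: dict[str, list[int]],
    module_paths: dict[str, str],
    module_name: str,
) -> dict[str, list[int]]:
    """Keep only the changed files that belong to the given module."""
    component_dir = module_paths[module_name]
    return {
        path: lines
        for path, lines in changes.items()
        if under_component(path, component_dir)
    }


# Hypothetical repository hosting two components with similar directory names.
module_paths = {"pay": "libs/pay", "payment-ui": "libs/payment-ui"}
changes = {
    "libs/pay/src/main/java/Pay.java": [10, 11],
    "libs/payment-ui/src/main/java/PayView.java": [3],
}
pay_changes = filter_component_changes(changes, module_paths, "pay")
```

Comparing path components rather than string prefixes keeps `libs/payment-ui` from being mistaken for part of `libs/pay`.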

4. Complete process

Benefits

1. Source code detection covers both Android and iOS

The original detection based on compilation intermediate products supports only Android projects; iOS would need a separate detection scheme. CI incremental source code detection covers the source code of the Android and iOS main/sub-repositories as well as the incremental source code of all new and changed components involved in the build, which meets the requirements of license compliance detection and security compliance detection for open source projects.

2. Detected problems are located precisely and aggregated automatically

Because detection works on source code, each detected problem is automatically and precisely located by its code repository, component directory, source file, and line number. Problems can also be aggregated by repository or component and automatically dispatched to the respective owners for follow-up, which greatly improves the efficiency of resolving problems and reduces the cost of locating them.
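As a rough illustration of aggregation by component, located problems can be grouped by owner before they are dispatched. The issue fields and the owner mapping below are hypothetical.

```python
from collections import defaultdict


def aggregate_by_owner(issues: list[dict], owners: dict[str, str]) -> dict[str, list[dict]]:
    """Group located issues by the owner of the component they belong to."""
    grouped: dict[str, list[dict]] = defaultdict(list)
    for issue in issues:
        # Components with no registered owner fall into a shared bucket.
        owner = owners.get(issue["component"], "unassigned")
        grouped[owner].append(issue)
    return dict(grouped)


# Hypothetical issues, each already located down to file and line.
issues = [
    {"component": "pay", "file": "libs/pay/Pay.java", "line": 10},
    {"component": "share", "file": "libs/share/Share.java", "line": 7},
    {"component": "pay", "file": "libs/pay/Order.java", "line": 42},
]
by_owner = aggregate_by_owner(issues, {"pay": "alice"})
```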

CD product detection

Android Product Detection

Scanner performs a large number of IO operations while scanning (decompiling to smali files and then scanning them one by one). Even with concurrent scanning, it is slow: scanning the same 86 MB package several times took 175.28 s on average.

  • Smali scan process

There are two time-consuming steps in the smali scan process. One is decompiling the DEX/APK to generate smali files, and the other is running security compliance checks over the generated smali files in batches.

Idea: if we could run the security compliance checks on sensitive information directly in memory, without generating smali files, could the scanning time be greatly reduced?

  • DEX scan process

In our experiment, the method call graph was extracted directly from DEX, and the same 86 MB package was scanned multiple times with the same rules. The average scanning time dropped from 175.28 s to 32.72 s, roughly 5.4 times faster, which greatly improved the efficiency of package detection and scanning.

BDAnalysis engine

To address the “risk of missed detections based on sensitive call points” mentioned earlier: if we know the function call relationships, we can relate each call point to the upper-level business code, extend detection from a single point to a chain, and fundamentally solve the problem of missing indirect calls.

  • Call chain

As the name implies, a call chain is the path from the call point back to “main”.

Android applications usually have no main entry. A fake main entry, usually called DummyMain, has to be simulated from Android-specific entry points: the lifecycle functions of the four major components and of Application, and XML-bound functions such as onClick handlers and data binding.

  • Advantages of the call chain

    • It associates call points with upper-level business code, making problems quicker to locate and solve;

    • It covers all call scenarios based on the call graph, which helps with sorting out SDK dependencies and API calls.
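The idea of a call chain can be sketched as a path search over a call graph. This is only an illustration and not the BDAnalysis implementation: the graph is a plain caller-to-callees mapping, the method names are hypothetical, and the depth-first enumeration shown here would need pruning on a real application.

```python
def call_chains(graph: dict[str, list[str]], entry: str, target: str) -> list[list[str]]:
    """Enumerate every acyclic call path from entry to target, depth first."""
    chains: list[list[str]] = []

    def visit(node: str, path: list[str]) -> None:
        if node == target:
            chains.append(path)
            return
        for callee in graph.get(node, []):
            if callee not in path:  # skip recursion cycles
                visit(callee, path + [callee])

    visit(entry, [entry])
    return chains


# Hypothetical call graph rooted at a simulated DummyMain entry.
graph = {
    "DummyMain": ["MainActivity.onCreate", "SettingsActivity.onClick"],
    "MainActivity.onCreate": ["DeviceUtil.getId"],
    "SettingsActivity.onClick": ["Analytics.report"],
    "Analytics.report": ["DeviceUtil.getId"],
    "DeviceUtil.getId": ["TelephonyManager.getDeviceId"],
}
chains = call_chains(graph, "DummyMain", "TelephonyManager.getDeviceId")
```

Here a point-based scan would report only `DeviceUtil.getId`, while the chains show that the sensitive API is reached from two different business entry points, each of which can be reviewed separately.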

Benefits

1. Call chains are generated based on BDAnalysis

Call chain generation is of great value when investigating sensitive problems. With the generated call chain, the business side can find the actual call path behind a problem and avoid missed detections caused by indirect calls.

2. Call chain technology is applied to API call sorting, SDK dependency sorting, and other scenarios

Beyond helping to locate and solve problems quickly and to sort out API call scenarios, call chains also let us sort out SDK dependencies. SDK dependency sorting means scanning out all public, unobfuscated APIs in an SDK and then scanning out all call chains of these APIs in the APK. With it, we can easily work out through which business scenarios and interfaces the APK depends on the SDK. This makes sensitive API calls easy to identify and also helps decouple SDKs and modules.

Summary and Prospect

Summary

This article starts with the business and technical background of business security compliance detection, describes the current situation and difficulties in the CI/CD phases, and then introduces the improvements we made in each phase, namely CI incremental source code detection and the BDAnalysis engine, together with their benefits.

CI incremental source code detection covers the source code of the Android/iOS main/sub-repositories and their dependent components, locates and aggregates detected problems automatically and precisely, and greatly improves the efficiency of resolving them. CD product detection uses the BDAnalysis engine to generate call chains, which closes the gap that could lead to missed detections. We also replaced the smali scan with a DEX scan, reducing the average CD product detection time from 175 s to 32 s. Together these changes greatly improve the accuracy and speed of the detection tools.

CI/CD business security compliance detection still has shortcomings, such as the construction of effect indicators and CI/CD data integration. We will address them step by step and continue to safeguard the security compliance of the Android applications of our major businesses.

Outlook

CI business security detection

  • Tool positioning. Mainly promote CI incremental source code detection and make full use of its accurate traceability; use CI intermediate product detection to cross-check results and avoid missed issues.

  • Indicator construction. On top of the existing technical indicators, improve the effect indicators by building issue “detection rate”, “consumption rate”, “false positive rate”, “satisfaction rate”, and similar indicators.

  • Data integration. Connect call chain data across CI and CD, and upgrade the scheme for linking issue IDs.

CD business security detection

  • Tool positioning. Gradually retire the Scanner tool and build a more powerful new detection tool, BDInspect.

  • Indicator construction. Build technical indicators and an automatic alarm mechanism for the new and old detection tools and the BDAnalysis engine.

  • Engine construction. Build out the BDAnalysis engine so that call chains and assignment chains can be applied to more business scenarios.

About the ByteDance Terminal Technology team

ByteDance Client Infrastructure is a global R&D team (with teams in Beijing, Shanghai, Hangzhou, Shenzhen, Guangzhou, Singapore, and Mountain View, USA) responsible for building the entire big front-end infrastructure of ByteDance and for improving performance, stability, and engineering efficiency across the company’s product line. The supported products include but are not limited to Douyin, Toutiao, Xigua Video, Feishu, and Guagualong, and the team has done in-depth work on mobile, Web, desktop, and other platforms.
