By Ali Yun-Qin Qi

This article is a series of articles, mainly introduces many interesting tasks involved in the field of Code Intelligence, specifically from the introduction of these tasks, history and status quo and other dimensions, hoping to let you have a deep understanding of Code Intelligence (there are series of articles at the end of the article).

The main character of this article is the technology of code cloning detection, that is, to determine whether two pieces of code are similar, that is, to determine whether they are “copied”. Here can not help but someone to ask, programmer thing that can call copy? That’s learning, borrowing. The rationality of code cloning (commonly known as Ctrl C, Crtl V) will be discussed later, but let’s first focus on the subject itself, how to determine the similarity between two pieces of code.


Clone Detection

Clone detection is also called duplicate code, similar code, this proposition is easy to understand, is to see whether two copies of code is the same. For programmers this thing is very easy, is not a copy, look at it. But machines are different, and we need to tell machines what to do. The earliest code cloning detection started in the 1990s, so far has more than 20 years of research history, but also produced a lot of excellent algorithms and research, can be said to have developed into a very mature topic.

General classification of code clones

The classification of code cloning is aimed at the design of detection schemes for different cloning methods. At present, there are four general classifications of code cloning, and the difficulty of detection is gradually increasing:

  1. The code is identical except for Spaces and comments. So I just copied it and I deleted the space comment, and everything else is the same
  2. All the same code except for variable names, type names, function names, and so on. This is an improvement over the first one, you know how to change the variable name or something, at least it’s not immediately obvious, right
  3. Some statements were added, deleted, and converted, such as adding an unimportant line of code, changing the order of if, and changing if to switch or something, but it was still basically the same
  4. Same function, different way of writing it. Might not technically be cloning anymore. Code refactoring?

That doesn’t sound like much, so let’s just go to the code. For example, I was stumped by a recent business that needed to implement a method that generates a 1 to N continuous array. But don’t worry, turn on the all-powerful Baidu (Google) and you’ll find everything. Stackoverflow has a similar problem. Copy the answer to this question and delete the space.

var foo = [];
for (var i = 1; i <= 10; i++) {
   foo.push(i);
}
Copy the code

But think about it. There’s a bunch of code on the web that needs to be changed, so it looks something like this:

const array = [];
for (var i = 1; i <= 10; i++) {
   array.push(i);
}
Copy the code

After thinking it over, I think it is not enough to change a variable or what can be seen at a glance. Let me change it again:

Array.from(Array(10)).map((item, index) = > index + 1)
Copy the code

Can not help but stroke must smile, wonderful wonderful… After a few days, it seemed that it could be further optimized, so there was:

Array.from(Array(10).keys()).map(item= > item + 1)
/ / further
[...Array(10).keys()].map(item= > item + 1)
// Take it one step further
const [, ...result] = Array(11).keys();
Copy the code

Detection method of code cloning

Back to clone detection, there are also different detection methods for the four methods, which can be roughly divided into the following categories.

  • Text similarity-based detection method: this method is the most common and easiest detection method to execute. It is only applicable to code clones with small text differences, that is, the first and second cloning methods mentioned above, with relatively high detection accuracy. Once the text is too different, the effect drops dramatically
  • ** Token-based detection method: ** This method uses the parser to divide the source code into sequences of symbols, which are then organized into symbolic statements, and finally the statements composed of these symbols are compared. This method can easily detect the behavior of adding and deleting statements, but is not sensitive to transposing orders.
  • ** Grammar-based detection method: ** This method will simultaneously convert two pieces of code into abstract syntax tree (AST), and then compare the subtrees through tree matching algorithm. If they are the same, it will be considered as a code clone. This approach is also insensitive to code order and does not recognize differences in identifiers or text, but can detect any other subtle changes.
  • ** Semantically based detection methods: ** The most common semantically based methods are graph-based detection methods. This method first generates data flow graph and control flow graph according to the code, which can reflect the change of data and logic at the same time, and then transforms the problem into the problem of detecting similarity graph. This approach relies on graph generation, which can often lead to incorrect results if graphs are generated differently by different languages or programs.

To sum up, it can be seen that it is difficult to accurately identify the existence of code cloning by using one method, and relatively accurate results can be obtained by combining multiple detection methods.

There are also a number of popular clone detection tools and websites.

  • NICad, which supports most cloning methods for types 1, 2, and 3, claims to support any language and officially provides C, Java, C #, Python, PHP, Ruby, ATL, and WSDL plugins
  • CCCD, which uses Concolic analysis to detect code clones, is software-based and therefore works well for types 3 and 4

Copy Or learn

When it comes to code cloning, the issue is unavoidable. So is cloning copying or borrowing? Let’s take a look at the definitions:

  • Plagiarism (English: plagiarism), also known as plagiarism, according to the Ministry of Education mandarin dictionary definition, in order to copy other people’s works as their own, for the original work without or basically without modification of the transcription, this is a kind of infringement. —- Wikipedia
  • Reference: to other people or things as a mirror, compared with their own experience or lessons, in order to learn from each other. — Baidu Baike

We can see from two definitions, one is plagiarism, the other is learning from each other’s strengths. Back to the code, strict plagiarism is difficult to define, for example, the previous Google Oracle code infringement case took 10 years, no matter what the final result, at least it shows that code plagiarism is difficult to identify legally. Let’s go back to the example mentioned above. When people encounter problems in business that they don’t know how to write or understand, their first reaction is to go to Google to check whether there is a similar implementation and then learn from it. I think there is no problem with the operation itself, but the difference is knowing and knowing why. If you can understand how it works, learn how to use it, and even think of a better way to write it, then there is no “copying”, it becomes “your” code.

Code recommendations series

  • A brief description of defect detection (Defect detection for Code Intelligence)

  • Let me tell you about smart code recommendations



    Tao department front – F-X-team opened a weibo! (Visible after microblog recording)
    In addition to the article there is more team content to unlock 🔓