Translation

Original text: research.google/pubs/pub483…

Ali Mesbah, Andrew Rice, Emily Johnston, Nick Glorioso, and Edward Aftandilian

Published in: Proceedings of the 27th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE ’19)

Abstract

Developers often spend a significant amount of time fixing compilation errors. By observing how developers work, we found that the process of fixing these errors follows specific patterns and is largely mechanical. We propose a novel approach that automatically learns these patterns with deep neural networks and suggests fixes for the most time-consuming classes of compilation failures.

We describe how we collect, across Google, failed builds together with the diagnostics they produce and the code changes that turned them into successful builds. From each such change we compute an abstract syntax tree (AST) diff and express it in a DSL called Delta, which describes the code changes needed to make the build succeed. The compiler diagnostic (as the source) and the Delta change that resolves it (as the target) are then fed to a neural machine translation network for training.

For the two most common and time-consuming Java compilation errors (missing symbols and method signature mismatches), our DeepDelta system fixed 19,314 of 38,788 compilation errors (50% accuracy). Correct fixes were in the top three recommended fixes 86% of the time.

Keywords

Compiler errors, program fixes, neural machine translation

1 Background

One advantage of compiled programming languages is that programming mistakes surface at compile time rather than at run time. A failed build triggers an edit-compile cycle: the developer repeatedly tries to fix the reported diagnostics, rebuilds, and repeats the process until the build succeeds.

A previous large-scale study reported that professional developers build code an average of 7 to 10 times a day. Studies have found that build-time compile errors are common and costly, requiring a great deal of developer time and effort to resolve. Build-time compilation errors occur when developers compile code through build management systems such as Bazel, Gradle, or Maven.

Our goal is to help developers fix build errors automatically. Previous research in automated program repair has focused on finding patches for test failures using fixed templates and search-based techniques. In this article, we present a novel approach, called DeepDelta, that automatically fixes build-time compilation errors.

Our view is that there are some common patterns in how developers modify code to address compiler errors. Such patterns can be learned automatically by extracting the abstract syntax tree (AST) changes between the failed and resolved snapshots of code and presenting them as abstract features to the deep neural network. This paper makes the following contributions:

  • We conducted a large-scale study of compilation errors and their fixes to identify the most common and most costly error types in practice. Our data set comes from code changes made by developers across thousands of Java projects at Google, comprising 300 million lines of code. Our study shows that 51% of compiler diagnostics relate to the compiler's inability to resolve a specific symbol, which is also the most expensive category of compilation error to fix.
  • We model the automated repair process as a neural machine translation (NMT) problem, where the source contains features of the failure and the target is a set of AST changes that resolves it, expressed in a DSL called Delta.
  • We introduce an instance of this approach, called DeepDelta, which automatically generates source and target features from historical developer data and learns, from real practice, the repair patterns for two of the most common and costly Java compilation errors.
  • We demonstrate the effectiveness of the technique through an extensive empirical evaluation on 38,788 unseen compilation errors in Google code. The evaluation shows that DeepDelta generates the correct fix for 47% to 50% of the errors, and that the correct fix appears among the top three suggestions 85% to 87% of the time.

2 Collect compilation errors

The first step in learning the fix pattern is to get developer data about compilation errors. We collect this data during Google’s daily development cycle and also use this data set to represent the real-world prevalence and cost of various compile errors.

2.1 Data Collection

Every build a developer starts is automatically logged at Google, with detailed information about each build, including all compiler diagnostics (i.e., detailed error messages) and a snapshot of the code being built.

In this study, we collected two months of build logs, from January 23, 2018 to March 23, 2018. The collected logs are then parsed and analyzed to understand the most common build errors in practice. Although our build diagnostic framework is language-neutral, in this article we will focus only on build errors related to Java projects.

2.2 Diagnosis Types

We group compiler error messages by diagnostic type. A diagnostic type represents errors with the same cause. The compiler inserts concrete names into an error message template; for example, javac uses the template "{0} is abstract; cannot be instantiated" for an attempt to instantiate an abstract class, and refers to it by the key abstract.cant.be.instantiated. We built a parser that maps each concrete error message in a build log back to the message template that generated it (see Table 2).
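
The template-to-message mapping can be illustrated with a small sketch. The following is a minimal, hypothetical parser (not Google's actual implementation): it turns a javac-style message template into a regular expression by replacing each {n} placeholder with a wildcard, then matches concrete error messages against the registered templates. The two template keys shown are the ones mentioned above and in Table 2; everything else is illustrative.

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Pattern;

// Maps concrete compiler error messages back to the template key (diagnostic type)
// that produced them. Illustrative sketch only.
final class DiagnosticTypeResolver {
  private final Map<String, Pattern> patternsByKey = new LinkedHashMap<>();

  // Registers a javac-style message template, e.g. "{0} is abstract; cannot be instantiated".
  void register(String diagnosticKey, String template) {
    // Quote the literal parts of the template and turn each {n} placeholder into a wildcard.
    String regex = Pattern.quote(template).replaceAll("\\{\\d+\\}", "\\\\E.+?\\\\Q");
    patternsByKey.put(diagnosticKey, Pattern.compile(regex));
  }

  Optional<String> resolve(String errorMessage) {
    return patternsByKey.entrySet().stream()
        .filter(e -> e.getValue().matcher(errorMessage).matches())
        .map(Map.Entry::getKey)
        .findFirst();
  }

  public static void main(String[] args) {
    DiagnosticTypeResolver resolver = new DiagnosticTypeResolver();
    resolver.register("compiler.err.abstract.cant.be.instantiated",
        "{0} is abstract; cannot be instantiated");
    resolver.register("compiler.err.cant.resolve", "cannot find symbol");
    // Prints: Optional[compiler.err.abstract.cant.be.instantiated]
    System.out.println(resolver.resolve("Shape is abstract; cannot be instantiated"));
  }
}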

2.3 From diagnosis to solution

A failed build can contain many diagnostics. Some of them are new, and some may already have appeared in earlier builds. We therefore begin by grouping the sequence of builds that contain a specific diagnostic into a repair session.

Definition 1 (Resolution Session, RS). A resolution session (repair session) RS(d) for a build diagnostic d is a sequence of two or more consecutive builds b_i, ..., b_j such that d is first introduced in build b_i, d is no longer present in build b_j, and the time between any two consecutive builds in the sequence does not exceed a specified threshold T.

The purpose of a repair session is to capture the period a developer spends resolving a diagnostic. We therefore set the time window T to one hour. This window represents a "task switch window": if the developer has not run a build within T, they are most likely busy with something else, such as being at lunch or away from their desk.

We quantify developer cost as the time it takes to resolve a particular diagnostic. For a diagnostic d and its repair session RS(d) = b_i, ..., b_j, let |D(b_k)| be the number of active diagnostics produced by build b_k, let start(b_k) be the time at which build b_k starts, and end(b_k) the time at which it finishes. We then arrive at the notion of active resolution cost:

Definition 2 (Active Resolution Cost, ARC). The active resolution cost of a diagnostic d is the developer time spent resolving d, excluding the build time itself, with each interval divided by the number of active diagnostics of the intermediate builds of its repair session:

ARC(d) = Σ_{k=i}^{j-1} (start(b_{k+1}) − end(b_k)) / |D(b_k)|
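
To make the definition concrete, here is a minimal sketch of the ARC computation. The Build record, its field names, and the timestamps in main are hypothetical illustrations, not the logging schema described in the paper.

import java.util.List;

// Minimal model of one build in a repair session: when it started and ended, and how
// many diagnostics were still active in it. Field names are illustrative only.
record Build(long startSeconds, long endSeconds, int activeDiagnostics) {}

final class ActiveResolutionCost {

  // ARC: the developer time between consecutive builds of the session (build time excluded),
  // with each gap divided by the number of active diagnostics of the earlier build.
  static double arc(List<Build> session) {
    double cost = 0.0;
    for (int k = 0; k + 1 < session.size(); k++) {
      Build current = session.get(k);
      Build next = session.get(k + 1);
      cost += (double) (next.startSeconds() - current.endSeconds()) / current.activeDiagnostics();
    }
    return cost;
  }

  public static void main(String[] args) {
    // Hypothetical session of three builds; the diagnostic is no longer present in the last one.
    List<Build> session = List.of(
        new Build(0, 60, 2),      // build 1: 2 active diagnostics
        new Build(300, 360, 1),   // build 2: 240 s of editing since build 1 ended
        new Build(600, 660, 0));  // build 3: resolved
    System.out.println(arc(session));  // (300 - 60) / 2 + (600 - 360) / 1 = 360.0 seconds
  }
}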

Figure 1 depicts how we construct repair sessions. As shown in Figure 1a, D1 and D2 first appear in build 1 and D3 first appears in build 2. We consider a diagnostic resolved once it no longer appears in a build (see Definition 1); since D3 no longer appears in build 4, its repair session spans builds 2 to 4, as shown in Figure 1b.


Figure 1. Build a repair session

2.4 Data set and findings

Table 1 summarizes our data set. We processed 4.8 million builds, of which about 1 million, roughly 20%, were failures.

Table 1 Data set


Recall that a build failure can involve multiple compilation errors, known as diagnostics. These build failures contain 3.3 million diagnostics in total, from which we derive 1.9 million repair sessions. For the remaining 1.5 million diagnostics we found no repair session, either because the developer abandoned the change or because the interval between builds exceeded one hour.

Table 2 Top 10 diagnostic types by active resolution cost

Table 2 lists the top 10 most common and time-consuming errors in our data set. The table shows the type of diagnosis, the cost of active fixes (average, minimum, and maximum), the number of subsequent builds to solve the diagnosis problem (average, minimum, and maximum), the ratio and number of each diagnosis type, the cost of active fixes relative to the total, and a literal description of the type of diagnosis.

There are 1,853,417 compilation diagnostics that were subsequently fixed within a repair session. The table shows that 51% of them (949,325) relate to the compiler's inability to resolve a symbol, i.e., the cant.resolve diagnostic type with the error message "cannot find symbol". Compared with the results of a 2014 study, the missing-symbol errors developers run into appear to be increasing. The second diagnostic type in the table is the cant.apply.symbol error, accounting for 8% of the total; cant.apply.symbol occurs when the compiler cannot find a method declaration applicable to the given types.

For each diagnostic type we also computed the relative cost of its build errors: the number of diagnostic instances multiplied by the average active resolution cost. Over the two months of data the total cost is 57,215,441 seconds; in other words, within two months developers spent roughly 21 months of time fixing build errors. cant.resolve remains the most time-consuming diagnostic type, accounting for 47% of the total active resolution cost (about 10 months). cant.apply.symbol accounts for 11% of the total (about 2 months). These two errors alone account for 58% of the total time spent, roughly a year of developer work in our data set.

3 Running Example

Given how common and costly cant.resolve and cant.apply.symbol are in practice, we focus on generating repair suggestions for these two types of build errors. However, our method is general enough that it can be applied to any diagnostic type with modest modifications. This article uses cant.resolve as the running example. The compiler must be able to resolve an identifier before it can be used; an inconsistency between the definition of an identifier and its usage, including cases where the identifier cannot be found at all, is the root cause of this build error. This can happen, for example, when a dependency (say, on another library) is missing, an import is missing, or a symbol name is misspelled.

Take the following code snippet for example:

import java.util.List;

class Service {
  List<String> names() {
    return ImmutableList.of("pub", "sub");
  }
}

When this code is built, the compiler will report the following error:

Service.java:4: error: cannot find symbol
  symbol: variable ImmutableList

The developer forgot to import the package for ImmutableList (from the Google Guava library), so the compiler could not parse the symbol. The developer resolves this error by finding the correct package for the symbol and adding the appropriate import statement:

import java.util.List;
+++ import com.google.common.collect.ImmutableList;

class Service {
  List<String> names() {
    return ImmutableList.of("pub", "sub");
  }
}

Because ImmutableList is defined in a different project, a dependency also has to be declared in the build system. With Bazel as the build system, the fix might be to remove the existing dependency on the Guava base package and add the collect package instead:

java_library(
    name = "Service",
    srcs = ["Service.java"],
    deps = [
---     "//java/com/google/common/base",
+++     "//java/com/google/common/collect",
    ],
    ...
)

4 Find the fix changes

After the diagnostics are collected and the repair sessions are built, we feed the repair sessions into the resolution-detection phase. The goal of this step is to systematically examine how developers changed their source code to resolve build errors.

4.1 Retrieving code snapshots

At Google, developers’ code changes are automatically saved to the cloud by the IDE as temporary snapshots. This means that a complete record of code changes is maintained using retrievable snapshot identifiers. This allows us to go back and retrieve a snapshot of the code at a point in time for a particular build.

For each build repair session, we first extract the snapshot ids of the first and last build. The former corresponds to the code snapshot that led to the diagnosis, and the latter to the code snapshot that resolved the diagnosis.

We then query the snapshot cloud server with the snapshot ID to get the specific code that matches that point in time.

4.2 AST comparison

At this point we have two code snapshots for each repair session: one in the failing state and one in the resolved state. To see how developers modified the code to resolve the build error, we compare the code in the failing state with the code in the resolved state.

The conventional way to detect source code changes is a Unix line diff, which only captures changes at the granularity of line-level additions and deletions in text. This makes the diff dependent on the formatting of the source code. Although line diffs are popular for manual inspection, for example in code review, it is hard to automatically infer syntactic code changes from textual line differences.

To analyze syntax-level code changes, we use a tree approach for difference comparison. We parse the error snapshot and repair snapshot for each repair session to generate the corresponding abstract syntax tree (AST). We created parsers for the Java and Bazel build languages in this project, although support for other languages can easily be added.

The ASTs of the failing snapshot and the resolved snapshot are then fed into a tree-differencing algorithm.

Definition 3 (AST Diff). Given two ASTs, a source T_s and a target T_t, the AST diff is the set of node change operations that transforms T_s into T_t.

The AST differencing algorithm first compares node labels to map each node of the source AST T_s to a node of the target AST T_t. Where mismatched nodes are detected, it computes a short edit script that converts T_s into T_t. Finding the shortest edit script is an NP-hard problem, so a number of heuristic techniques are used to compute the conversion. The final output of the tree diff is a set of change operations representing moves, updates, deletions, or insertions of nodes between the source and target ASTs:

  • Move: a node (and its subtree) in T_s is moved to a different position
  • Update: the value of a node in T_s is changed in T_t
  • Delete: a node (and its subtree) in T_s is deleted
  • Insert: a new node that does not exist in T_s is inserted

Figure 2 shows the ASTs for our running example (see Section 3): Figure 2a shows the initial failing state, and Figure 2b shows the AST after the developer added the import statement.

Figure 2 Java AST changes

Node changes detected by the AST diff algorithm are shown in gray in Figure 2. For the running example, the AST diff detects that an IMPORT node is inserted under the root COMPILATION_UNIT node, with the fully qualified package name of ImmutableList inserted as a child of the new IMPORT node. For the build file, the detected change is an update to the dependencies (deps): the Guava base package is updated to the collect package.
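
The detected operations can be written down as data. The following sketch models the four operation kinds listed above and the two changes detected in the running example; the type and field names are illustrative, not the representation used inside DeepDelta.

import java.util.List;

// Illustrative model of AST change operations (see the four operation kinds above).
sealed interface AstChange permits Insert, Update, Delete, Move {}

record Insert(String parentNode, String newNode) implements AstChange {}
record Update(String node, String oldValue, String newValue) implements AstChange {}
record Delete(String node) implements AstChange {}
record Move(String node, String newParent) implements AstChange {}

class RunningExampleDiff {
  public static void main(String[] args) {
    // Java file: a new IMPORT node under COMPILATION_UNIT, with the fully qualified
    // name of ImmutableList as its child.
    List<AstChange> javaChanges = List.of(
        new Insert("COMPILATION_UNIT", "IMPORT"),
        new Insert("IMPORT", "com.google.common.collect.ImmutableList"));

    // Bazel build file: the deps entry is updated from the base package to collect.
    List<AstChange> buildChanges = List.of(
        new Update("DEPS",
            "//java/com/google/common/base",
            "//java/com/google/common/collect"));

    javaChanges.forEach(System.out::println);
    buildChanges.forEach(System.out::println);
  }
}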

4.3 Repairing Changes

We believe that there are some recurring patterns in developers’ practice of addressing build errors. We can automatically infer these patterns from the fix changes and use them to aid development.

Definition 4 (Resolution Change, RC). A resolution change (repair change) is the AST diff between the failing snapshot and the resolved snapshot of a repair session.

5 Fix build errors

Recent studies have shown that models originally built for analyzing natural language, such as n-gram models, can also reason effectively about source code. This is known as the naturalness hypothesis of software: large corpora of code are statistically similar to natural language text, since code is also written by and for humans. Following this hypothesis, we believe that probabilistic machine learning models from natural language processing can also be used to aid programming.

Specifically, the idea is to treat the task of suggesting fixes as a neural machine translation (NMT) problem. In our case the goal is not to translate one natural language into another, but to "translate" the source features of a given build diagnostic into the repair-change features that resolve it.

5.1 Feature Extraction

We generate features from each resolution change, that is, from the build diagnostic and the AST edits that resolve it. We use the generated features as input to the neural network to learn the transformation patterns between diagnostics and repairs.

For each resolution change in a repair session, we generate a source feature and a target feature. A source-target pair captures information about the build failure and the AST changes that resolved it.

Source features. The source features include the build diagnostic type and its textual description. These can be included in the source features of any kind of diagnostic, independently of the diagnostic-specific details.

To give the machine learning algorithm more context, we can optionally add diagnostic-specific information to the source feature. For example, for cant.resolve we parse the failing code snapshot into an AST, locate the missing symbol in the AST using the information provided by the compiler, such as its label and position, and then traverse the tree to obtain its AST path.



Figure 3 AST path of ImmutableList


Definition 5 (AST Path, AP). The AST path of a missing symbol s is the sequence of nodes from the root of the failing code's AST to the parent node of s.

Figure 3 highlights the AST path of the missing symbol in our running example. Our evaluation shows that the AST path improves the accuracy of the results by 15% to 20%. The AST path gives the deep neural network contextual information about the missing symbol, such as whether it is a local variable inside a method or a class-level variable. In addition to the symbol's AST path, we add its node type, label, and child nodes (for example, type parameters of method calls) to the source features. In the running example, the source features include (a sketch assembling them into a token sequence follows the list):

  • Diagnostic type: compiler.err.cant.resolve
  • Diagnostic text: cannot find symbol
  • AST path: COMPILATION_UNIT CLASS METHOD RETURN METHOD_INVOCATION of
  • Symbol type: IDENTIFIER
  • Symbol label: ImmutableList
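
The following minimal sketch assembles these pieces into a flat token sequence of the kind a sequence-to-sequence model consumes. The exact encoding and token order used by DeepDelta are not shown in this article, so the format below is purely illustrative.

import java.util.ArrayList;
import java.util.List;

// Assembles the source feature of the running example as one flat token sequence.
class SourceFeatureBuilder {
  public static void main(String[] args) {
    List<String> tokens = new ArrayList<>();
    tokens.add("compiler.err.cant.resolve");                  // diagnostic type
    tokens.addAll(List.of("cannot", "find", "symbol"));       // diagnostic text
    tokens.addAll(List.of("COMPILATION_UNIT", "CLASS", "METHOD",
        "RETURN", "METHOD_INVOCATION"));                      // AST path of the missing symbol
    tokens.add("IDENTIFIER");                                 // symbol node type
    tokens.add("ImmutableList");                              // symbol label
    System.out.println(String.join(" ", tokens));
  }
}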

For cant.apply.symbol errors, we extend the source features with three kinds of data: expected, found, and reason. The expected data is the type the compiler expects; the found data is the type actually found in the code; and the reason data is the textual description of why the compiler cannot apply the symbol.

Target features. The target features contain information about the resolution change, that is, the AST changes made to fix the code that failed to compile.

To capture the resolution changes as features, we defined a domain-specific language (DSL) called Delta. Delta's syntax is defined by an EBNF grammar specification, as shown in Listing 1.


Listing 1 Delta syntax

Listing 2 Delta Java example

Listing 3 Delta build file example

Feature generation. After feature extraction, we analyze the source and target features separately and build, for each side, a vocabulary of the |V| most frequent tokens. For example, because COMPILATION_UNIT is the root node of every AST path, it occurs frequently in the source features, so this token appears in the source vocabulary. Likewise, the token BUILDFILE appears in many target features and therefore appears in the target vocabulary. These high-frequency tokens are used for embedding during training and inference: tokens in the vocabulary are mapped to real-valued vectors for training, and during inference the predicted vector representations are mapped back to tokens in the vocabulary. Finally, the feature data set is randomly split 70%/15%/15% into a training set, a validation set used for online evaluation during training, and a test set used for offline evaluation of the model on unseen features.
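
The vocabulary construction and the 70/15/15 split can be sketched as follows. This is an illustrative outline under the 30,000-token vocabulary limit mentioned in Section 6.1, not the actual feature-generation code.

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of feature generation: a vocabulary of the |V| most frequent tokens,
// plus a random 70/15/15 split into training, validation, and test sets.
class FeatureGeneration {

  static List<String> buildVocabulary(List<List<String>> features, int vocabSize) {
    Map<String, Long> counts = new HashMap<>();
    for (List<String> feature : features) {
      for (String token : feature) {
        counts.merge(token, 1L, Long::sum);
      }
    }
    return counts.entrySet().stream()
        .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
        .limit(vocabSize)
        .map(Map.Entry::getKey)
        .toList();
  }

  // Returns [trainingSet, validationSet, testSet] after shuffling the examples.
  static List<List<List<String>>> split(List<List<String>> examples) {
    List<List<String>> shuffled = new ArrayList<>(examples);
    Collections.shuffle(shuffled);
    int trainEnd = (int) (shuffled.size() * 0.70);
    int validEnd = (int) (shuffled.size() * 0.85);
    return List.of(
        shuffled.subList(0, trainEnd),
        shuffled.subList(trainEnd, validEnd),
        shuffled.subList(validEnd, shuffled.size()));
  }
}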

5.2 Learning repair change patterns

Deep NMT models have recently achieved good results in natural language translation. An NMT model, also known as a sequence-to-sequence (seq2seq) model, uses an encoder and a decoder to convert a source feature x into a target feature y by learning to model the conditional probability p(y | x). The encoder computes a representation of each source feature x without making any prediction; the decoder then predicts the next token of the sequence conditioned on that source representation, producing the translation y.
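
As a reminder of the standard seq2seq formulation (this is the usual NMT factorization, not something specific to DeepDelta), the conditional probability decomposes over the target tokens:

p(y | x) = Π_{t=1}^{|y|} p(y_t | y_1, ..., y_{t-1}, x)

where the encoder produces the representation of x and the decoder predicts each token y_t given the previously generated tokens and that representation.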

Our deep neural network (DNN) is built on TensorFlow and is a deep LSTM recurrent neural network (RNN) with 1024 units, consisting of 4 encoder layers and 4 decoder layers. For the encoder we chose the Google Neural Machine Translation (GNMT) encoder, which consists of 1 bidirectional layer and 3 unidirectional layers.

We use stochastic gradient descent (SGD) as the iterative optimization method, with a learning rate of 1.0. To reduce overfitting we set the dropout value to 0.2; by randomly ignoring a subset of neurons, dropout prevents co-adaptation on the training set. LSTMs handle short to medium-length input sequences well, but not long ones. An attention mechanism addresses this limitation by widening the range of the network's attention. Since the feature sequences of repair changes can be long, we use standard Bahdanau attention.

5.3 Inferring repair suggestions


The entire process, from generating repair sessions, resolution changes, and features to training the model, runs as a sequential pipeline, so our whole learning process is automated and repeatable.



After training, the model is uploaded to a server that can be queried for repair predictions. The model can generate a large number of translations for any input. In our NMT setup, translations are searched with beam search, which trades off translation time against accuracy. The input of the model is a source feature x representing a compilation failure, and the output is a set of n suggested sequences {y_1, ..., y_n}; each y_i represents a different fix suggestion for x, consisting of a sequence of repair changes.

6 Evaluation

We conducted an empirical evaluation of DeepDelta on the two most common and time-consuming compilation errors, cant.resolve and cant.apply.symbol.

6.1 Data Generation

Section 2.4 describes the data set we use for training and evaluation, which consists of build data collected at Google over a two-month period. We first compute the AST diff between the failing and resolved code of every green resolution session (a session ending in a successful build). We then limit the number of AST changes to between zero and six, because large change sets usually involve refactoring, and earlier research has shown that fixes generally consist of fewer than six changes. In total 110,219 such resolution sessions were computed, covering 37,867 Java files and 7,011 build files. Because the code base being processed is very large, we limit the source and target vocabulary size |V| to 30,000, so the feature-generation step produces a source vocabulary and a target vocabulary, each containing up to 30,000 unique tokens. cant.resolve and cant.apply.symbol yielded 265,456 and 25,201 source/target feature pairs, respectively. As shown in Table 3, these feature pairs were randomly shuffled and divided into three parts: a training set (for training), a validation set (for online evaluation during training), and a test set (for offline evaluation of the model). Each data point consists of a source/target feature pair.

Table 3 Generated features

6.2 Training

We extracted features and trained separate models for cant.resolve and cant.apply.symbol in order to compare DeepDelta's performance across diagnostic types. On top of the DNN setup described in Section 5.2, we configure the network as follows: the maximum sequence length of both source and target is set to 100; the batch size is 128; the number of training steps is 100,000; and the inference algorithm is configured to generate 10 suggestions at a time (see Section 5.3).

We feed the source/target features and vocabularies of the training set into the neural network to train the model. The model first embeds the source and target tokens, so the source and target vocabularies are provided to ensure every token is unique. We also feed the validation set into the network for validation during training. The model was trained on an internal Google cloud node with a GPU.

6.3 Evaluation Methods

We evaluated on the test set, which likewise consists of source/target feature pairs; this is data the model has never seen. We evaluate each diagnostic type separately: for each data point in the test set, we take its source feature, send it to the inference server, and obtain repair suggestions. The target features of the test set, that is, the fixes the developers actually made, serve as the ground truth against which the 10 fix suggestions generated by DeepDelta are evaluated.

The following metrics were used to evaluate the effectiveness of the fix suggestions.

Perplexity: Perplexity reflects how well the model predicts samples; a low perplexity (in the single digits) indicates that the model is good at predicting a given sequence.

BLEU: BLEU is a common metric for evaluating the quality of machine-translated sentences, and in practice it correlates well with human judgment. BLEU uses n-gram matching to score how well the output matches the reference, based on the words and their ordering. BLEU scores range from 0 to 100; in natural language translation, a score of 25 to 40 is considered high.
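
For reference, perplexity is the standard exponentiated average negative log-likelihood of the target sequence (a general definition, not specific to this paper):

perplexity = exp( −(1/N) Σ_{t=1}^{N} log p(y_t | y_1, ..., y_{t-1}, x) )

so a model that assigns high probability to every next token of the reference fix has a perplexity close to 1.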

Syntactic validation: To verify syntactic correctness, we generated a lexer and parser from the Delta grammar using ANTLR4. We feed each fix suggestion into the generated Delta lexer/parser to check whether the suggested repair changes produced by the model conform to the expected repair-change syntax. The output is a Boolean value: valid or invalid.
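
As a rough illustration, assuming ANTLR4's default naming for a grammar called Delta (generated DeltaLexer and DeltaParser classes) and a hypothetical start rule named delta, the validity check could look like this:

import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;

// Checks whether a generated fix suggestion conforms to the Delta grammar.
// DeltaLexer/DeltaParser would be generated by ANTLR4; "delta" is a placeholder start rule.
class SyntacticValidator {

  static boolean isValid(String suggestion) {
    DeltaLexer lexer = new DeltaLexer(CharStreams.fromString(suggestion));
    DeltaParser parser = new DeltaParser(new CommonTokenStream(lexer));
    parser.removeErrorListeners();  // silence console output; only the error count matters here
    parser.delta();                 // parse starting from the (hypothetical) start rule
    return parser.getNumberOfSyntaxErrors() == 0;
  }
}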

Correctness of fixes: A fix is considered correct if at least one of the 10 suggested fixes is valid and matches the developer's actual fix exactly (the target-feature ground truth from the test set). When comparing fix suggestions with the ground truth, we use exact string equality.

Ranking of correct fixes: The ranking of correct fix proposals in the overall list of proposed fixes indicates the model’s ability to generate correct fixes. The higher the ranking, the earlier the fix can be applied. When DeepDelta generates the correct fix proposal, we mark its location in a list of 10 fix proposals.

6.4 Results

Table 4 shows the results for the two diagnostic types we evaluated. The table reports, for each diagnostic type, the perplexity and BLEU score, the number of failed compilations in the test set, the number of suggestions generated (10 for each failed compilation), the average percentage of valid suggestions, and the percentage of correct fixes appearing among the top three suggestions.

Table 4 results

Perplexity and BLEU scores: As mentioned earlier, we aim for a low perplexity (single digits) and a high BLEU score (25 to 40). Table 4 shows that our models achieved low perplexities of 1.8 and 8.5 for cant.resolve and cant.apply.symbol, respectively. BLEU scores were also high: 42 for cant.resolve and 43 for cant.apply.symbol.

Validation: Figure 4 shows, as a histogram, the distribution of valid suggestions among the 10 suggestions generated for each failure. Our validation against the Delta grammar shows that 71% of the suggested fixes for cant.resolve errors are valid. For cant.apply.symbol errors the percentage of valid suggestions is higher, at 98%, because the code changes are syntactically simpler (such as changes to method parameters). As the figure shows, most of the generated sequences are valid. We discuss the main causes of invalid suggestions in Section 7.

Figure 4. Distribution of valid suggestions among 10 fix suggestions for a single build error

Accuracy: Table 4 shows that for 18,051 (50%) of the 36,101 cant.resolve errors in our test set, DeepDelta produced a correct fix, meaning that one of the 10 suggestions exactly matched the developer's manual fix. cant.apply.symbol achieved a similar accuracy: 1,263 (47%) out of 2,687.

Ranking: For the cases where a correct fix suggestion is produced, we evaluate its rank in the list of suggestions. Figure 5 shows the distribution of the position of the correct suggestion among the 10 fix suggestions. The data show that most correct suggestions are ranked first. For cant.resolve, 85% of the correct fixes appear in the top three; for cant.apply.symbol, 87% do.

Figure 5 Location distribution of correct repair

7 Analysis and Risk

Correctness: DeepDelta suggested correct fixes for about half of the build failures in our tests. Correct fixes are not a simple binary output; they consist of non-trivial sequences of AST code changes. A 50% correct rate is therefore quite good in practice, compared, for example, with the 27% reported for fixing syntax errors [18]. This relatively high accuracy can likely be attributed to the following characteristics of the data set and method:

  • There are some general patterns in the way developers resolve build failures that DeepDelta can extract from fixing changes.
  • Delta, our target language, is highly structured so that models learn patterns in a systematic way.

Invalid sequences: We investigated the main causes of invalid recommendations, which can be summed up as:

  1. Unknown tokens: an unknown token appears in the suggested target sequence. The model predicts a vector embedding, which the decoder then tries to map back to a token. Sometimes the embedding vector is too far from any valid token in the vocabulary to match one of the high-frequency tokens. This can happen when the repair pattern for a failure does not appear among the patterns the model has learned, or when the vocabulary does not contain the token because it is not one of the high-frequency tokens.
  2. Incomplete sequences: the generated sequence is missing part of the Delta syntax. This happens especially with long sequences (more than 90 tokens), where tokens at the end of the sequence are dropped.

Generalization: These results and techniques may not generalize to other development environments, for example ones where IDEs are heavily used, and different environments may have different common diagnostic types. Although our data all come from a single company and almost all of the projects live in Google's large shared source repository, our data set does:

  • Cover more than 10,000 Java projects with 300 million lines of code. Many of these are open-source projects, although we only use data from development carried out at Google.
  • Include code changes made by thousands of professional developers with different backgrounds around the world, using roughly 5 different editors/IDEs.
  • Contain 110,219 build compilation errors and their fixes, across 37,867 Java files and 7,011 build files, collected over a two-month period.

So it is reasonable to believe that our repair technique generalizes to other, similar development environments, as long as there are patterns in the way developers address the diagnostic types. Even if the specific models we learned and the fixes they suggest cannot be reused directly in other code bases and build systems, the technique itself is general and can be applied to them. For example, suppose we want to add an external project dependency to a Bazel build file, an operation that is also common in other build systems. As long as parsers are provided for the other build-configuration formats (for example, Maven POM files or build.gradle), our system can learn to add dependencies in the appropriate way.

Most common fixes: We can use the generated features to study how developers resolve compilation errors. Since we care about change patterns rather than specific identifiers, we abstract identifier names (e.g. 'ImmutableList') by replacing them with a constant string (i.e. '{X}') in the target features. These abstracted features are vectorized using a word2vec model with n-gram segmentation, and the feature vectors are then clustered with the k-means algorithm, with k = 20. Table 5 shows the common cant.resolve patterns found in the six largest clusters. For each cluster, the table gives a pattern description, the repair change pattern, and concrete examples.

Replicability: Our data set contains ASTs of proprietary code, which cannot be distributed. The data set is unique in that it contains snapshots of code as it is being written, not just committed code, and this granularity is key to making the whole approach work. We are not currently aware of any open-source data set that provides error/fix details at this level. Our results can be replicated wherever similarly fine-grained records of developer actions can be collected, using the Delta language specification and the open-source TensorFlow NMT library.

Diagnostic types: Our study used only Java projects for diagnostics. But the technique is universal; the difference is just the AST. The only language-specific part is the parser used to build the AST. As long as we implement the parser, we can support other languages.

In this paper we focus on only two error types (cant.resolve and cant.apply.symbol), for two reasons. First, these two errors account for the vast majority of Java compilation errors encountered by Google developers: 59% of the occurrences and 58% of the resolution cost (Table 2). Our data show that, whatever the difficulty of the fix, developers spend a significant amount of time fixing these errors manually; an automated fix frees developers to focus on other issues. Second, adding new error types requires collecting the relevant developer data and retraining the model, which is time-consuming.

Our results show that DeepDelta works well on two error types with different repair patterns. We expect it to apply to other build error types as well, and we intend to extend support to them. At present we are not aware of any other tool for fixing Java compilation errors to compare against; the most similar study fixes syntax errors (a single error type) in C [18] and has a much lower success rate.

Parsable AST: Our study is based on AST diffing, so the failing code must be parsable into an AST. For missing-symbol errors the code can always be parsed into a well-formed AST, but that is not necessarily true for other diagnostic types. A parser can be designed to recover from syntax errors instead of failing fast; to handle incomplete ASTs, switching to such a parser may be necessary.

8 Future Work

Converting a Delta program into a code modification: A Delta program is essentially the difference between two versions of the same program, and DeepDelta predicts that difference from the diagnostic input. Given a predicted Delta program, we still need to apply it to an unseen program, and in particular to decide where to make the changes. Here we outline some ideas for solving this problem, using several common Delta change patterns as examples.

Consider the common cant.resolve repair change patterns shown in Table 5. For "Add Missing import statement" and "Delete import" we need no location information, because imports belong at a fixed place in a Java source file. For "Change method call" and "Change Missing symbol", the diagnostic text tells us where the missing symbol is, so we can modify the code there. For "Add Missing Build dependencies" and "Change build dependencies", we know which build rule produced the error, so we can edit its dependency list.

In general, we can usually determine a reasonable location for the change either by construction (there is only one syntactically valid place for it) or from the location information in the diagnostic text.
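
For instance, an "Add Missing import statement" change needs no extra location information. The following sketch (purely illustrative, not DeepDelta's actual change applier) inserts the import right after the package declaration and any existing imports:

import java.util.ArrayList;
import java.util.List;

// Applies an "add missing import" change: the position is determined by construction,
// after the package declaration and existing imports.
class ImportInserter {

  static String addImport(String source, String fullyQualifiedName) {
    List<String> lines = new ArrayList<>(List.of(source.split("\n", -1)));
    int insertAt = 0;
    for (int i = 0; i < lines.size(); i++) {
      String line = lines.get(i).strip();
      if (line.startsWith("package ") || line.startsWith("import ")) {
        insertAt = i + 1;  // remember the line after the last package/import seen so far
      }
    }
    lines.add(insertAt, "import " + fullyQualifiedName + ";");
    return String.join("\n", lines);
  }

  public static void main(String[] args) {
    String before = "import java.util.List;\n"
        + "class Service {\n"
        + "  List<String> names() {\n"
        + "    return ImmutableList.of(\"pub\", \"sub\");\n"
        + "  }\n"
        + "}\n";
    System.out.println(addImport(before, "com.google.common.collect.ImmutableList"));
  }
}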

Deploying a code-fix tool: We plan to integrate DeepDelta into our own developer environment. To do this, we will validate fix suggestions before applying them to user code: we plan to build an application that tentatively applies each suggested fix to the code snapshot and runs the build, discards the fixes that do not compile, and offers users only the fixes that pass. There are also interesting interaction-design questions in making such a tool easy to use.

9 Related Studies

Extracting change patterns: Extracting change patterns from code has been studied before. However, most existing techniques require predefined rules or human intervention to extract the patterns. Fluri and Gall [14] defined 41 basic Java change types and used hierarchical clustering to study complex changes. Pan et al. [37] created a database of basic changes using Unix diffs and manually specified Datalog rules for more complex change types.

Livshits and Zimmermann [26] mined association rules from the code histories of two Java projects to discover API usage patterns, which they then used to detect violations. Kim et al. [25] manually examined hand-written patches and extracted six common Java repair patterns, which they then used for automated program repair.

Hanam et al. [19] proposed a semi-automated method for discovering bug-fix patterns. They extract the AST changes of language constructs that resolve runtime errors into feature vectors, cluster and rank them into candidate repair patterns, and then manually inspect the clusters to derive the bug-fix patterns. Our approach requires no human intervention and can learn previously unknown repair change patterns.

Synthesizing transformations from examples: Another related line of research is code-transformation techniques such as LASE [33], Genesis [27], NoFAQ [8], and REFAZER [41], which extract and synthesize syntactic transformations from examples. These techniques also have potential applications to program repair. Unlike our work, they operate on a single example or a small number of examples; how well they extract patterns from very large numbers of examples is not yet known, nor is it clear how they would be applied in a repair setting.

Automatic program repair: Our research falls in the field of automated program repair, that is, fixing errors through automated techniques. Program repair has been applied to areas as diverse as data structures [11,12], user interfaces [49], and source code in various programming languages such as C [16,24,29], Java [10,25], JavaScript [36], and PHP [42].

Patch-search techniques have several shortcomings in practice. First, they usually need predefined error-pattern templates and cannot learn new patterns. Second, patch generation requires searching a huge space of candidate programs, which can be costly because thousands of patches may have to be generated to fix a bug. Finally, studies have shown that they produce many false positives, often making the failing test pass rather than fixing the actual error.

Machine learning for program repair: Allamanis et al. [2] give a comprehensive overview of recent advances in analyzing source code with machine learning. Wang et al. [47] proposed a defect prediction technique that feeds abstract semantic features of code to a neural network for classification. Raychev et al. [40] used neural networks to complete missing API calls in code. Seidel et al. [43] localized type errors using a supervised classification approach.

Gupta et al. [18] proposed a seq2seq machine learning method for repairing syntax errors in C. The main difference from our work is that they feed the entire source code of the faulty and fixed versions into the network and achieve a 27% fix rate. We focus only on features of the failure and the AST changes that resolve it, which lets us achieve a higher accuracy (50%). They target syntax errors in C, while we target build-time compilation errors in Java. In addition, their evaluation uses small student programs, whereas ours is based on a large volume of real developer data.

Our paper is the first to address compilation errors by learning AST change patterns. Prior studies of build errors [20,31] differ from ours: they focus only on build files, whereas we handle both Java files and build files; they use predefined fix templates, whereas we learn repair patterns; and our evaluation is much larger, with [31] and [20] using 37 and 175 failures respectively, whereas we evaluated DeepDelta on 38,788 failures.

10 Summary

In this article, we examined build errors and how developers manually change code to resolve them, and we showed that such code changes follow patterns. We proposed a general technique that extracts AST diffs between the failing and fixed versions of the code to learn code-change patterns, and we framed automated program repair as a machine translation problem.

Our DeepDelta technique uses a deep neural network to suggest repair AST changes from build-diagnostic input. Our evaluation on a large volume of real developer data from Google shows that DeepDelta generated correct fix suggestions for the two compilation diagnostic types 50% of the time, with the correct fix among the top three suggestions 85% to 87% of the time.

Our system currently supports repairing the two most time-consuming diagnostic types, cant.resolve and cant.apply.symbol. We plan to extend it to support other diagnostic types.




References

Note: only the references cited in the related-work discussion are retained here; the rest are omitted.

[2] Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu, and Charles Sutton. 2017. A Survey of Machine Learning for Big Code and Naturalness. arXiv preprint arXiv:1709.06182 (2017).
[10] Favio DeMarco, Jifeng Xuan, Daniel Le Berre, and Martin Monperrus. 2014. Automatic Repair of Buggy if Conditions and Missing Preconditions with SMT. In Proceedings of the 6th International Workshop on Constraints in Software Testing, Verification, and Analysis.
[11] Brian Demsky and Martin Rinard. 2005. Data Structure Repair Using Goal-directed Reasoning. In Proceedings of the International Conference on Software Engineering (ICSE). 176–185.
[12] Bassem Elkarablieh, Ivan Garcia, Yuk Lai Suen, and Sarfraz Khurshid. 2007. Assertion-based Repair of Complex Data Structures. In Proceedings of the 22nd IEEE/ACM International Conference on Automated Software Engineering (ASE). ACM.
[14] Beat Fluri and Harald C. Gall. 2006. Classifying Change Types for Qualifying Change Couplings. In IEEE International Conference on Program Comprehension (ICPC 2006). IEEE, 35–45.
[16] C. Le Goues, T. Nguyen, S. Forrest, and W. Weimer. 2012. GenProg: A Generic Method for Automatic Software Repair. IEEE Transactions on Software Engineering 38, 1 (Jan 2012), 54–72. doi.org/10.1109/TSE…
[18] Rahul Gupta, Soham Pal, Aditya Kanade, and Shirish Shevade. 2017. DeepFix: Fixing Common C Language Errors by Deep Learning. In Proceedings of the Conference on Artificial Intelligence (AAAI). 1345–1351.
[19] Quinn Hanam, Fernando S. de M. Brito, and Ali Mesbah. 2016. Discovering Bug Patterns in JavaScript. In Proceedings of the 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE). ACM.
[20] Foyzul Hassan and Xiaoyin Wang. 2018. HireBuild: An Automatic Approach to History-driven Repair of Build Scripts. In 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE).
[24] Y. Ke, K. T. Stolee, C. Le Goues, and Y. Brun. 2015. Repairing Programs with Semantic Code Search. In Proceedings of the International Conference on Automated Software Engineering (ASE). IEEE Computer Society, 295–306. doi.org/10.1109/ASE…
[25] Dongsun Kim, Jaechang Nam, Jaewoo Song, and Sunghun Kim. 2013. Automatic Patch Generation Learned from Human-written Patches. In Proceedings of the International Conference on Software Engineering (ICSE). 802–811.
[26] Benjamin Livshits and Thomas Zimmermann. 2005. DynaMine: Finding Common Error Patterns by Mining Software Revision Histories. In ACM SIGSOFT Software Engineering Notes. ACM, 296–305.
[27] Fan Long, Peter Amidon, and Martin Rinard. 2017. Automatic Inference of Code Transforms for Patch Generation. In Proceedings of the 11th Joint Meeting on Foundations of Software Engineering. ACM.
[29] Fan Long and Martin Rinard. 2016. An Analysis of the Search Spaces for Generate and Validate Patch Generation Systems. In Proceedings of the International Conference on Software Engineering (ICSE). ACM, New York, NY, USA, 702–713. doi.org/10.1145/288…
[31] Christian Macho, Shane McIntosh, and Martin Pinzger. 2018. Automatically Repairing Dependency-related Build Breakage. In 2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 106–117.
[33] Na Meng, Miryung Kim, and Kathryn S. McKinley. 2013. LASE: Locating and Applying Systematic Edits by Learning from Examples. In Proceedings of the International Conference on Software Engineering (ICSE). IEEE Press.
[36] Frolin Ocariza, Karthik Pattabiraman, and Ali Mesbah. 2014. Vejovis: Suggesting Fixes for JavaScript Faults. In Proceedings of the International Conference on Software Engineering (ICSE). ACM, 837–847.
[37] Kai Pan, Sunghun Kim, and E. James Whitehead. 2009. Toward an Understanding of Bug Fix Patterns. Empirical Software Engineering 14, 3 (2009).
[40] Veselin Raychev, Martin Vechev, and Eran Yahav. 2014. Code Completion with Statistical Language Models. In ACM SIGPLAN Notices, Vol. 49. ACM.
[41] Reudismam Rolim, Gustavo Soares, Loris D'Antoni, Oleksandr Polozov, Sumit Gulwani, Rohit Gheyi, Ryo Suzuki, and Bjorn Hartmann. 2017. Learning Syntactic Program Transformations from Examples. In Proceedings of the 39th International Conference on Software Engineering. IEEE Press, 404–415.
[42] Hesam Samimi, Max Schafer, Shay Artzi, Todd Millstein, Frank Tip, and Laurie Hendren. 2012. Automated Repair of HTML Generation Errors in PHP Applications Using String Constraint Solving. In Proceedings of the International Conference on Software Engineering (ICSE). IEEE, 277–287.
[47] Song Wang, Taiyue Liu, and Lin Tan. 2016. Automatically Learning Semantic Features for Defect Prediction. In Proceedings of the International Conference on Software Engineering (ICSE). ACM, 297–308.