By Xu Lun, F(X)Team, Ali Tao Department

In today’s natural language processing revolution built on large pre-trained models, code intelligence is developing just as rapidly. So what does code intelligence actually do? Many people have sci-fi notions about it, such as programmers losing their jobs. In reality, much of the work is not mysterious at all; it is quite basic. What problems, then, are we trying to solve with code intelligence?

  • Determine whether two pieces of code implement similar functionality
  • Search for the code that most closely resembles a given snippet
  • Check code for bugs
  • Automatically fix bugs in code
  • Automatically write a comment for a piece of code
  • Recommend the most relevant code snippet for a text description
  • Generate code from text

Does that sound even more magical? How could such hard problems be solved? To be honest, each of these sub-problems is hard even for humans to learn. But just as humans learn step by step, machines are improving step by step. What we need is not an almighty machine god; ordinary robots will do. They have serious limitations, but they can take a great deal of work off our hands.

And as we will see at the end, the way to handle all of these complicated problems is very simple: no matter how many forms the tasks take, one model can handle them all.


Let’s take a look at some of these problems in detail.

Problem: Clone Detection

The story of code intelligence starts with clone detection, which means looking for code that is similar in form or in function. Don’t underestimate code duplication: it can significantly reduce the effectiveness of training code intelligence models. There is duplication within the training set, duplication within the test set, and duplication across the two. A detailed analysis can be found in the paper The Adverse Effects of Code Duplication in Machine Learning Models of Code.


Predict whether two pieces of code are similar

The following examples come from the BigCloneBench dataset. The paper is at arxiv.org/pdf/2002.08…

Here are a few examples of what counts as similar:

Code 1:

    private StringBuffer encoder(String arg) {
        if (arg == null) {
            arg = "";
        }
        MessageDigest md5 = null;
        try {
            md5 = MessageDigest.getInstance("MD5");
            md5.update(arg.getBytes(SysConstant.charset));
        } catch (Exception e) {
            e.printStackTrace();
        }
        return toHex(md5.digest());
    }

Code 2:

    public String kodetu(String testusoila) {
        MessageDigest md = null;
        try {
            md = MessageDigest.getInstance("SHA");
            md.update(testusoila.getBytes("UTF-8"));
        } catch (NoSuchAlgorithmException e) {
            new MezuLeiho("Ez da zifraketa algoritmoa aurkitu", "Ados", "Zifraketa Arazoa", JOptionPane.ERROR_MESSAGE);
            e.printStackTrace();
        } catch (UnsupportedEncodingException e) {
            new MezuLeiho("Errorea kodetzerakoan", "Ados", "Kodeketa Errorea", JOptionPane.ERROR_MESSAGE);
            e.printStackTrace();
        }
        byte raw[] = md.digest();
        String hash = (new BASE64Encoder()).encode(raw);
        return hash;
    }

The strings in code 2 are written in Basque. The two functions use different digest algorithms, and their null handling and exception handling differ as well, but they are considered very similar: for clone detection they count as the same or highly similar.

Let’s look at another pair:

Code 1:

    public static void test(String args[]) {
        int trace;
        int bytes_read = 0;
        int last_contentLenght = 0;
        try {
            BufferedReader reader;
            URL url;
            url = new URL(args[0]);
            URLConnection istream = url.openConnection();
            last_contentLenght = istream.getContentLength();
            reader = new BufferedReader(new InputStreamReader(istream.getInputStream()));
            System.out.println(url.toString());
            String line;
            trace = t2pNewTrace();
            while ((line = reader.readLine()) != null) {
                bytes_read = bytes_read + line.length() + 1;
                t2pProcessLine(trace, line);
            }
            t2pHandleEventPairs(trace);
            t2pSort(trace, 0);
            t2pExportTrace(trace, new String("pngtest2.png"), 1000, 700, (float) 0, (float) 33);
            t2pExportTrace(trace, new String("pngtest3.png"), 1000, 700, (float) 2.3, (float) 2.44);
            System.out.println("Press any key to contiune read from stream !!!");
            System.out.println(t2pGetProcessName(trace, 0));
            System.in.read();
            istream = url.openConnection();
            if (last_contentLenght != istream.getContentLength()) {
                istream = url.openConnection();
                istream.setRequestProperty("Range", "bytes=" + Integer.toString(bytes_read) + "-");
                System.out.println(Integer.toString(istream.getContentLength()));
                reader = new BufferedReader(new InputStreamReader(istream.getInputStream()));
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                    t2pProcessLine(trace, line);
                }
            } else System.out.println("File not changed !");
            t2pDeleteTrace(trace);
        } catch (MalformedURLException e) {
            System.out.println("MalformedURLException !!!");
        } catch (IOException e) {
            System.out.println("File not found " + args[0]);
        };
    }

Code 2:

    private static String loadUrlToString(String a_url) throws IOException {
        URL l_url1 = new URL(a_url);
        BufferedReader br = new BufferedReader(new InputStreamReader(l_url1.openStream()));
        String l_content = "";
        String l_ligne = null;
        l_content = br.readLine();
        while ((l_ligne = br.readLine()) != null) {
            l_content += AA.SL + l_ligne;
        }
        return l_content;
    }

No minority language is involved this time, but the two snippets obviously differ hugely in length. Still, they are considered similar.

Let’s look at a pair of dissimilar ones:

Code 1:

    private void setNodekeyInJsonResponse(String service) throws Exception {
        String filename = this.baseDirectory + service + ".json";
        Scanner s = new Scanner(new File(filename));
        PrintWriter fw = new PrintWriter(new File(filename + ".new"));
        while (s.hasNextLine()) {
            fw.println(s.nextLine().replaceAll("NODEKEY", this.key));
        }
        s.close();
        fw.close();
        (new File(filename + ".new")).renameTo(new File(filename));
    }

Code 2:

    public void transform(String style, String spec, OutputStream out) throws IOException {
        URL url = new URL(rootURL, spec);
        InputStream in = new PatchXMLSymbolsStream(new StripDoctypeStream(url.openStream()));
        transform(style, in, out);
        in.close();
    }

I won’t explain the ones that aren’t similar.

The BigCloneBench dataset is manually labeled: it provides pairs of code snippets along with whether they are similar.

The dataset is split into train.txt, valid.txt and test.txt, all of which share the same format:

idx1	idx2	0/1

Here idx1 and idx2 are the index values of two code snippets in data.jsonl, and the last column is the manual label for whether they are similar. The code itself is stored in data.jsonl in the following format:

{"func":"Code"."idx":"Independence idx value"}
Copy the code

For example, the first two lines of the train.txt training set look like this:

13988825	8660836	0
80378	18548122	1

The corresponding structure of 13988825 in data.jsonl looks like this:

{"func": " private void setNodekeyInJsonResponse(String service) throws Exception {\n String filename = this.baseDirectory + service + \".json\"; \n Scanner s = new Scanner(new File(filename)); \n PrintWriter fw = new PrintWriter(new File(filename + \".new\")); \n while (s.hasNextLine()) {\n fw.println(s.nextLine().replaceAll(\"NODEKEY\", this.key)); \n }\n s.close(); \n fw.close(); \n (new File(filename + \".new\")).renameTo(new File(filename)); \n }\n"."idx": "13988825"}
Copy the code

8660836 corresponds to:

{"func": " public void transform(String style, String spec, OutputStream out) throws IOException {\n URL url = new URL(rootURL, spec); \n InputStream in = new PatchXMLSymbolsStream(new StripDoctypeStream(url.openStream())); \n transform(style, in, out); \n in.close(); \n }\n"."idx": "8660836"}
Copy the code

And the label says they are not similar. As you can see, this is the third example pair we showed above.
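To make the layout concrete, here is a minimal sketch of how data.jsonl and train.txt fit together, assuming the file names described above:

    import json

    # Minimal sketch: join the labeled pairs in train.txt with their code
    # in data.jsonl (file names as described above).

    def load_code_index(path="data.jsonl"):
        index = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                index[record["idx"]] = record["func"]
        return index

    def load_pairs(path="train.txt"):
        with open(path, encoding="utf-8") as f:
            return [(idx1, idx2, int(label))
                    for idx1, idx2, label in (line.split() for line in f)]

    code = load_code_index()
    for idx1, idx2, label in load_pairs()[:2]:
        print("clone" if label == 1 else "not a clone")
        print(code[idx1][:80])
        print(code[idx2][:80])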

Search for the code most semantically similar to the current code

For this task, we use the POJ-104 dataset from Ge Li’s team at Peking University.

This dataset has to be downloaded separately from drive.google.com/uc?id=0B2i-…

Each code snippet is identified by an index, and the code field contains the complete code. Let’s look at an example:

{
    "label": "1",
    "index": "0",
    "code": "int f(int a,int x) { int count=1,i; for(i=x; i<a; i++) if(a%i==0) count+=f(a/i,i); if(i==a) return count; else return 0; } void main() { int n,a; scanf(\"%d\",&n); for(; n>0; n--) { scanf(\"%d\",&a); if(a==1||a==2) printf(\"1\\n\"); else printf(\"%d\\n\",f(a,2)); } }"
}

The goal of this task, then, is to find the most similar snippets for a given piece of code. Taking top 2 as an example, the output looks like this:

{"index": "0", "answers": ["3", "2"]} {"index": "1", "answers": ["0", "4"]} {"index": "2", "answers": ["0", "1"]} {"index": "4", "answers": ["1", "5"]} {"index": "3", "answers": ["4", "2"]} {"index": "5", "answers": [" 4 ", "3"]}Copy the code

That is, the code segments most similar to index 0 are those at index 3 and index 2.

Index 3 looks like this:

void qut(int a, int b);
int num = 0;

int main()
{
    int i, n, g[1000];
    cin >> n;
    for (i = 0; i < n; i++)
        cin >> g[i];
    for (i = 0; i < n; i++) {
        qut(g[i], 1);
        cout << num << endl;
        num = 0;
    }
    return 0;
}

void qut(int a, int b)
{
    int i;
    if (a >= b)
    {
        num++;
        if (b == 1)
            b++;
        for (i = b; i <= a; i++) {
            if (a % i == 0)
            {
                qut(a / i, i);
            }
        }
    }
}
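How might the answers above be produced? A minimal sketch: assume every snippet already has an embedding vector (the random matrix below stands in for the output of a real pre-trained encoder), then rank by cosine similarity:

    import json

    import numpy as np

    # Minimal sketch: rank code snippets by cosine similarity of embeddings.
    # np.random.rand stands in for real encoder outputs.

    def top_k_similar(embeddings, k=2):
        normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        sims = normed @ normed.T
        np.fill_diagonal(sims, -np.inf)  # never report the query itself
        results = []
        for i, row in enumerate(sims):
            best = np.argsort(-row)[:k]
            results.append({"index": str(i), "answers": [str(j) for j in best]})
        return results

    embeddings = np.random.rand(6, 768)  # placeholder for model embeddings
    for answer in top_k_similar(embeddings):
        print(json.dumps(answer))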

Problem: Defect detection

The defect detection dataset is simple and crude: each sample is a piece of code labeled with whether it contains a vulnerability.

Let’s look at an example with a vulnerability:

{
    "project": "FFmpeg",
    "commit_id": "aba232cfa9b193604ed98f3fa505378d006b1b3b",
    "target": 1,
    "func": "static int r3d_read_rdvo(AVFormatContext *s, Atom *atom) { R3DContext *r3d = s->priv_data; AVStream *st = s->streams[0]; int i; r3d->video_offsets_count = (atom->size - 8) / 4; r3d->video_offsets = av_malloc(atom->size); if (!r3d->video_offsets) return AVERROR(ENOMEM); for (i = 0; i < r3d->video_offsets_count; i++) { r3d->video_offsets[i] = avio_rb32(s->pb); if (!r3d->video_offsets[i]) { r3d->video_offsets_count = i; break; } av_dlog(s, \"video offset %d: %#x\\n\", i, r3d->video_offsets[i]); } if (st->r_frame_rate.num) st->duration = av_rescale_q(r3d->video_offsets_count, (AVRational){st->r_frame_rate.den, st->r_frame_rate.num}, st->time_base); av_dlog(s, \"duration %\"PRId64\"\\n\", st->duration); return 0; }",
    "idx": 5
}

That’s all there is to it. As for which line the problem is on, the dataset doesn’t say.

Of course, most of the dataset is defect-free code, like this first record:

{"project": "FFmpeg"."commit_id": "973b1a6b9070e2bf17d17568cbaf4043ce931f51"."target": 0."func": "static av_cold int vdadec_init(AVCodecContext *avctx)\n\n{\n\n VDADecoderContext *ctx = avctx->priv_data; \n\n struct vda_context *vda_ctx = &ctx->vda_ctx; \n\n OSStatus status; \n\n int ret; \n\n\n\n ctx->h264_initialized = 0; \n\n\n\n /* init pix_fmts of codec */\n\n if (! ff_h264_vda_decoder.pix_fmts) {\n\n if (kCFCoreFoundationVersionNumber < kCFCoreFoundationVersionNumber10_7)\n\n ff_h264_vda_decoder.pix_fmts = vda_pixfmts_prior_10_7; \n\n else\n\n ff_h264_vda_decoder.pix_fmts = vda_pixfmts; \n\n }\n\n\n\n /* init vda */\n\n memset(vda_ctx, 0, sizeof(struct vda_context)); \n\n vda_ctx->width = avctx->width; \n\n vda_ctx->height = avctx->height; \n\n vda_ctx->format = 'avc1'; \n\n vda_ctx->use_sync_decoding = 1; \n\n vda_ctx->use_ref_buffer = 1; \n\n ctx->pix_fmt = avctx->get_format(avctx, avctx->codec->pix_fmts); \n\n switch (ctx->pix_fmt) {\n\n case AV_PIX_FMT_UYVY422:\n\n vda_ctx->cv_pix_fmt_type = '2vuy'; \n\n break; \n\n case AV_PIX_FMT_YUYV422:\n\n vda_ctx->cv_pix_fmt_type = 'yuvs'; \n\n break; \n\n case AV_PIX_FMT_NV12:\n\n vda_ctx->cv_pix_fmt_type = '420v'; \n\n break; \n\n case AV_PIX_FMT_YUV420P:\n\n vda_ctx->cv_pix_fmt_type = 'y420'; \n\n break; \n\n default:\n\n av_log(avctx, AV_LOG_ERROR, \"Unsupported pixel format: %d\\n\", avctx->pix_fmt); \n\n goto failed; \n\n }\n\n status = ff_vda_create_decoder(vda_ctx,\n\n avctx->extradata, avctx->extradata_size); \n\n if (status ! = kVDADecoderNoErr) {\n\n av_log(avctx, AV_LOG_ERROR,\n\n \"Failed to init VDA decoder: %d.\\n\", status); \n\n goto failed; \n\n }\n\n avctx->hwaccel_context = vda_ctx; \n\n\n\n /* changes callback functions */\n\n avctx->get_format = get_format; \n\n avctx->get_buffer2 = get_buffer2; \n\n#if FF_API_GET_BUFFER\n\n // force the old get_buffer to be empty\n\n avctx->get_buffer = NULL; \n\n#endif\n\n\n\n /* init H.264 decoder */\n\n ret = ff_h264_decoder.init(avctx); \n\n if (ret < 0) {\n\n av_log(avctx, AV_LOG_ERROR, \"Failed to open H.264 decoder.\\n\"); \n\n goto failed; \n\n }\n\n ctx->h264_initialized = 1; \n\n\n\n return 0; \n\n\n\nfailed:\n\n vdadec_close(avctx); \n\n return -1; \n\n}\n"."idx": 0}
Copy the code

The result is 0 or 1 for each index:

0	0
1	1
2	1
3	0
4	0
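Since every index gets a plain 0 or 1, scoring a model is just accuracy over these pairs. A minimal sketch, with file names as assumptions (the competition ships its own evaluator script):

    # Minimal sketch: accuracy for defect detection.
    # Each file holds "idx<TAB>label" lines; file names are assumptions.

    def read_labels(path):
        with open(path, encoding="utf-8") as f:
            return {idx: label for idx, label in (line.split() for line in f)}

    answers = read_labels("answers.txt")
    predictions = read_labels("predictions.txt")
    correct = sum(predictions.get(idx) == label for idx, label in answers.items())
    print(f"accuracy: {correct / len(answers):.4f}")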

Problem: Automatic code repair

With the ability to identify code bugs, the next step is to learn how to fix code automatically.

The data format for automatic code repair is also very simple: one side is the buggy code, the other is the fixed code.

Let’s look at an example:

The buggy code looks like this:

public java.lang.String METHOD_1 ( ) { return new TYPE_1 ( STRING_1 ) . format ( VAR_1 [ ( ( VAR_1 . length ) - 1 ) ] . getTime ( ) ) ; }

The fix looks like this:

public java.lang.String METHOD_1 ( ) { return new TYPE_1 ( STRING_1 ) . format ( VAR_1 [ ( ( type ) - 1 ) ] . getTime ( ) ) ; }

This is genuinely hard for an algorithm. It isn’t easy for humans either.
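To see how small yet precise the required change is, here is a minimal sketch that diffs the two token sequences above with Python’s standard difflib:

    import difflib

    # Minimal sketch: token-level diff of the buggy vs. fixed code above.
    buggy = ("public java.lang.String METHOD_1 ( ) { return new TYPE_1 ( STRING_1 ) "
             ". format ( VAR_1 [ ( ( VAR_1 . length ) - 1 ) ] . getTime ( ) ) ; }").split()
    fixed = ("public java.lang.String METHOD_1 ( ) { return new TYPE_1 ( STRING_1 ) "
             ". format ( VAR_1 [ ( ( type ) - 1 ) ] . getTime ( ) ) ; }").split()

    matcher = difflib.SequenceMatcher(None, buggy, fixed)
    for op, b0, b1, f0, f1 in matcher.get_opcodes():
        if op != "equal":
            print(op, buggy[b0:b1], "->", fixed[f0:f1])
    # prints: replace ['VAR_1', '.', 'length'] -> ['type']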

Problem: Code translation

Code translation means, for example, translating between C# and Java. As long as we have a parallel corpus of code written in both C# and Java, the model can learn to translate in both directions.

Let’s look at an example pair. The C# code:

public virtual ListSpeechSynthesisTasksResponse ListSpeechSynthesisTasks(ListSpeechSynthesisTasksRequest request){
  var options = new InvokeOptions();
  options.RequestMarshaller = ListSpeechSynthesisTasksRequestMarshaller.Instance;
  options.ResponseUnmarshaller = ListSpeechSynthesisTasksResponseUnmarshaller.Instance;
  return Invoke<ListSpeechSynthesisTasksResponse>(request, options);
}

The corresponding Java:

public ListSpeechSynthesisTasksResult listSpeechSynthesisTasks(ListSpeechSynthesisTasksRequest request) {
  request = beforeClientExecution(request);
  return executeListSpeechSynthesisTasks(request);
}
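A minimal sketch of reading such a parallel corpus, assuming (as is typical for translation datasets) one file per language, where line i of each file holds the same function:

    # Minimal sketch: read a line-aligned C#/Java parallel corpus.
    # The file names are assumptions about the dataset layout.

    def load_parallel(cs_path="train.txt.cs", java_path="train.txt.java"):
        with open(cs_path, encoding="utf-8") as f_cs, \
             open(java_path, encoding="utf-8") as f_java:
            return [(cs.rstrip("\n"), java.rstrip("\n"))
                    for cs, java in zip(f_cs, f_java)]

    pairs = load_parallel()
    cs_code, java_code = pairs[0]
    print(cs_code)    # a C# function
    print(java_code)  # its Java counterpart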


Problem: Writing comments to code

The training material contains code together with comments, and the goal of this task is to write comments for new code. The metric measures the quality of the generated comments as natural language.

For this we use the CodeSearchNet dataset.

The format of each record in this dataset is as follows:

  • repo: the repository
  • path: the file path
  • func_name: the function or method name
  • original_string: the raw, unprocessed source string
  • language: the programming language
  • code/function: the code itself
  • code_tokens/function_tokens: the tokenized code
  • docstring: the comment string
  • docstring_tokens: the tokenized docstring
  • url: the URL of the original code, used as a unique identifier of the record
  • idx: the unique identifier of the code snippet

Let’s look at an example:

{"repo": "ciena-blueplanet/bunsen-core"."path": "src/reducer.js"."func_name": ""."original_string": "function (state, action) {\n return _.defaults({\n isValidating: action.isValidating,\n lastAction: IS_VALIDA TING\n }, state)\n }"."language": "javascript"."code": "function (state, action) {\n return _.defaults({ \n isValidating: action.isValidating,\n lastAction: IS_VALIDATING\n }, state)\n }"."code_tokens":
["function"."("."state".","."action".")"."{"."return"."_"."."."defaults"."("."{"."isValidating".":"
, "action"."."."isValidating".","."lastAction".":"."IS_VALIDATING"."}".","."state".")"."}"]."docstrin
g": "Update is validating result\n@param {State} state - state to update\n@param {Action} action - action\n@retur ns {State}  - updated state"."docstring_tokens": ["Update"."is"."validating"."result"]."sha": "993c67e314e2b7
5003a1ff4c2f0cb667715562b2"."url": "https://github.com/ciena-blueplanet/bunsen-core/blob/993c67e314e2b75003a1ff4
c2f0cb667715562b2/src/reducer.js#L394-L399"."partition": "train"}
Copy the code
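A minimal sketch of iterating these records as (code, comment) training pairs; the gzipped-JSONL shard name is an assumption about how the dataset is shipped:

    import gzip
    import json

    # Minimal sketch: yield (code_tokens, docstring_tokens) pairs from a
    # CodeSearchNet shard; the file name is an assumption.

    def iter_pairs(path="javascript_train_0.jsonl.gz"):
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                yield record["code_tokens"], record["docstring_tokens"]

    for code_tokens, doc_tokens in iter_pairs():
        print(" ".join(code_tokens)[:60], "->", " ".join(doc_tokens))
        break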

To score the generated natural language, we adopt the method from ORANGE: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation.
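The core idea of ORANGE is smoothed sentence-level BLEU. The competition ships its own evaluator, but NLTK’s implementation illustrates the idea; a minimal sketch:

    from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

    # Minimal sketch: smoothed BLEU between a generated comment and the
    # reference docstring tokens from the example above.
    reference = ["Update", "is", "validating", "result"]
    candidate = ["Update", "the", "validation", "result"]

    score = sentence_bleu([reference], candidate,
                          smoothing_function=SmoothingFunction().method2)
    print(f"smoothed BLEU: {score:.3f}")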

Problem: Matching the most appropriate code snippet for natural language text

We still use the CodeSearchNet data set from the previous section.

This search results in something like the following:

{" url ":" url0 ", "answers:",11,12,13,14 [10]} {" url ":" url1 ", "answers:",12,11,13,14 [10]} {" url ":" url2 ", "answers" : ,11,12,10,14 [13]} {" url ":" url3 ", "answers:",14,12,13,11 [10]} {" url ":" url4 ", "answers:",11,12,13,14 [10]}Copy the code

With a UI on top, it looks like a code search engine.


Problem: Generate code from natural language

This is the ultimate task: generating a piece of code from a text description.

The format is very simple, just a piece of code and a piece of text.

Let’s look at an example of a training sample:

{"code": "void function ( Binder arg0 ) { EventBus loc0 = new EventBus ( ) ; AmbariEventPublisher loc1 = new AmbariEventPublisher ( ) ; repla
ceEventBus ( AmbariEventPublisher . class , loc1 , loc0 ) ; arg0 . bind ( AmbariEventPublisher . class ) . toInstance ( loc1 ) ; }", "nl": "force the eventb us from ambarievent publisher to be serialand synchronous . concode_field_sep PlaceHolder placeHolder concode_field_sep void registerAlertListeners concode_elem_sep EventBus synchronizeAlertEventPublisher concode_elem_sep void replaceEventBus concode_elem_sep void registerAmbariListeners"}
Copy the code

The nl part is a bit messy; that can’t be helped. To scale up the dataset, there simply weren’t enough hands to label everything precisely.

Let’s look at one more:

{"code": "byte [ ] function ( Class < ? > arg0 , Configuration arg1 ) { return AuthenticationTokenSerializer . serialize ( org . apache . acc
umulo . core . client . mapreduce . lib . impl . ConfiguratorBase . getAuthenticationToken ( arg0 , arg1 ) ) ; }", "nl": "do n't use this . n
o , really , do n't use this . you already have an authenticationtoken with org.apache.accumulo.core.client.mapreduce.lib.impl.configuratorba
se #getauthenticationtoken class , configuration . you do n't need to construct it yourself . gets the password from the configuration . warn
ing : the password is stored in the configuration and shared with all mapreduce tasks ; it is base64 encoded to provide a charset safe conver
sion to a string , and is not intended to be secure . concode_field_sep PlaceHolder placeHolder concode_field_sep String getPrincipal concode
_elem_sep void setLogLevel concode_elem_sep Level getLogLevel concode_elem_sep Boolean isConnectorInfoSet concode_elem_sep String getTokenCla
ss concode_elem_sep void setZooKeeperInstance concode_elem_sep void setMockInstance concode_elem_sep Instance getInstance concode_elem_sep St
ring enumToConfKey concode_elem_sep void setConnectorInfo"}
Copy the code

The quality isn’t much better, is it? Well, this is what the CONCODE dataset looks like.
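The concode_* tokens are just separators: the description comes first, then the class’s fields and methods as context the model may draw on. A minimal sketch of pulling a record apart (the parsing is my reading of the format, not an official tool):

    # Minimal sketch: split a CONCODE "nl" field into description and context.

    def parse_nl(nl):
        parts = nl.split(" concode_field_sep ")
        description = parts[0]
        context = [p.split(" concode_elem_sep ") for p in parts[1:]]
        return description, context

    nl = ("force the eventbus from ambarievent publisher to be serialand synchronous . "
          "concode_field_sep PlaceHolder placeHolder "
          "concode_field_sep void registerAlertListeners "
          "concode_elem_sep EventBus synchronizeAlertEventPublisher "
          "concode_elem_sep void replaceEventBus "
          "concode_elem_sep void registerAmbariListeners")

    description, context = parse_nl(nl)
    print(description)  # the natural-language request
    print(context[1])   # member signatures available as context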

Solution: Multi-task learning based on a large-scale pre-trained model

402 years ago, facing a multi-pronged Ming offensive, Nurhaci won the Battle of Sarhu with the strategy “no matter how many routes you come by, I go by only one.” We borrow the same wisdom from the ancients: however much the datasets vary, we use only one tool, a large-scale pre-trained model.

Here is a brief history of pre-trained models:


Taking Microsoft’s CodeBERT family of models, which we showed at the beginning, as an example, we can handle the most complex task above, code generation, with a single command:

python -m torch.distributed.launch --nproc_per_node=$PER_NODE_GPU run.py \
        --data_dir=$DATADIR \
        --langs=$LANG \
        --output_dir=$OUTPUTDIR \
        --pretrain_dir=$PRETRAINDIR \
        --log_file=$LOGFILE \
        --model_type=gpt2 \
        --block_size=512 \
        --do_train \
        --node_index 0 \
        --gpu_per_node $PER_NODE_GPU \
        --learning_rate=5e-5 \
        --weight_decay=0.01 \
        --evaluate_during_training \
        --per_gpu_train_batch_size=6 \
        --per_gpu_eval_batch_size=12 \
        --gradient_accumulation_steps=2 \
        --num_train_epochs=30 \
        --logging_steps=100 \
        --save_steps=5000 \
        --overwrite_output_dir \
        --seed=42

With two NVIDIA P100 GPUs, training finishes in about 22 hours.

Inference is likewise a single command:

python -u run.py \
        --data_dir=$DATADIR \
        --langs=$LANG \
        --output_dir=$OUTPUTDIR \
        --pretrain_dir=$PRETRAINDIR \
        --log_file=$LOGFILE \
        --model_type=gpt2 \
        --block_size=512 \
        --do_infer \
        --logging_steps=100 \
        --seed=42

With a single P100 card, inference finishes in about 40 minutes.

With the above foundation, we can enter the competition. The datasets described above are all tasks on the benchmark:


The data sets mentioned above can be found at github.com/microsoft/C…

Welcome to the world of code intelligence!

Appendix: A quick start guide

As the poet Lu You (Fangweng) put it: “What comes from paper always feels shallow; to truly understand a thing, you must do it yourself.” So let’s get hands-on and run training and inference for a code intelligence model.

  • Step 1: Install the Transformers library, since CodeBERT is built on top of it:
pip install transformers --user
  • Step 2: Install PyTorch or TensorFlow as the Transformers backend. As of July 5, 2021, PyTorch 1.5.0 or above is required; if your GPU driver supports it, just install the latest:
pip install torch torchvision torchtext torchaudio --user
  • Step 3: Download Microsoft’s datasets:
git clone https://github.com/microsoft/CodeXGLUE
  • Step 4: Let’s start with BigCloneBench.

Go to Code-Code/Clone-detection-BigCloneBench/code and run:

python run.py \
        --output_dir=./saved_models \
        --model_type=roberta \
        --config_name=microsoft/codebert-base \
        --model_name_or_path=microsoft/codebert-base \
        --tokenizer_name=roberta-base \
        --do_train \
        --train_data_file=../dataset/train.txt \
        --eval_data_file=../dataset/valid.txt \
        --test_data_file=../dataset/test.txt \
        --epoch 2 \
        --block_size 400 \
        --train_batch_size 16 \
        --eval_batch_size 32 \
        --learning_rate 5e-5 \
        --max_grad_norm 1.0 \
        --evaluate_during_training \
        --seed 123456 2>&1 | tee train.log

Then the training is up and running:

07/05/2021 16:29:24 - INFO - __main__ -   ***** Running training *****
07/05/2021 16:29:24 - INFO - __main__ -     Num examples = 90102
07/05/2021 16:29:24 - INFO - __main__ -     Num Epochs = 2
07/05/2021 16:29:24 - INFO - __main__ -     Instantaneous batch size per GPU = 8
07/05/2021 16:29:24 - INFO - __main__ -     Total train batch size (w. parallel, distributed & accumulation) = 16
07/05/2021 16:29:24 - INFO - __main__ -     Gradient Accumulation steps = 1
07/05/2021 16:29:24 - INFO - __main__ -     Total optimization steps = 11264

Training takes about 40 minutes on two V100 cards. Each training round is followed by validation, and the best checkpoint so far is saved for inference:

07/05/2021 17:10:04 - INFO - __main__ -   ***** Running evaluation  ***** 40950/41541 [00:10<00:00, 2785.61it/s]
07/05/2021 17:10:04 - INFO - __main__ -     Num examples = 41541
07/05/2021 17:10:04 - INFO - __main__ -     Batch size = 32
07/05/2021 17:16:05 - INFO - __main__ -   ***** Eval results  *****
07/05/2021 17:16:05 - INFO - __main__ -     eval_f1 = 0.9531
07/05/2021 17:16:05 - INFO - __main__ -     eval_precision = 0.9579
07/05/2021 17:16:05 - INFO - __main__ -     eval_recall = 0.9484
07/05/2021 17:16:05 - INFO - __main__ -     eval_threshold = 0.97
07/05/2021 17:16:06 - INFO - __main__ -     ********************
07/05/2021 17:16:06 - INFO - __main__ -     Best f1:0.9531
07/05/2021 17:16:06 - INFO - __main__ -     ********************
07/05/2021 17:16:08 - INFO - __main__ -   Saving model checkpoint to ./saved_models/checkpoint-best-f1/model.bin

After two rounds of training, the eval F1 in the second round rose above 0.97:

07/05/2021 17:56:43 - INFO - __main__ -   ***** Running evaluation  ***** 40950/41541 [00:12<00:00, 3535.62it/s]
07/05/2021 17:56:43 - INFO - __main__ -     Num examples = 41541
07/05/2021 17:56:43 - INFO - __main__ -     Batch size = 32
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
07/05/2021 18:02:44 - INFO - __main__ -   ***** Eval results  *****
07/05/2021 18:02:44 - INFO - __main__ -     eval_f1 = 0.9701
07/05/2021 18:02:44 - INFO - __main__ -     eval_precision = 0.9772
07/05/2021 18:02:44 - INFO - __main__ -     eval_recall = 0.9633
07/05/2021 18:02:44 - INFO - __main__ -     eval_threshold = 0.97
07/05/2021 18:02:45 - INFO - __main__ -     ********************
07/05/2021 18:02:45 - INFO - __main__ -     Best f1:0.9701
07/05/2021 18:02:45 - INFO - __main__ -     ********************
07/05/2021 18:02:47 - INFO - __main__ -   Saving model checkpoint to ./saved_models/checkpoint-best-f1/model.bin

Now let’s run inference with the model we just trained:

python run.py \
        --output_dir=./saved_models \
        --model_type=roberta \
        --config_name=microsoft/codebert-base \
        --model_name_or_path=microsoft/codebert-base \
        --tokenizer_name=roberta-base \
        --do_eval \
        --do_test \
        --train_data_file=../dataset/train.txt \
        --eval_data_file=../dataset/valid.txt \
        --test_data_file=../dataset/test.txt \
        --epoch 2 \
        --block_size 400 \
        --train_batch_size 16 \
        --eval_batch_size 32 \
        --learning_rate 5e-5 \
        --max_grad_norm 1.0 \
        --evaluate_during_training \
        --seed 123456 2>&1 | tee test.log

Finally, run evaluator.py to see the test results:

python ../evaluator/evaluator.py -a ../dataset/test.txt -p saved_models/predictions.txt

The output is as follows:

{'Recall': 0.9677421599288263, 'Prediction': 0.9557057904236594, 'F1': 0.9616080550111168}

Precision 0.956, recall 0.968: not bad!

Compare this to CodeXGLUE’s leaderboards:


Pretty much the same as CodeBERT's results on the leaderboard.
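If you want to poke at the model behind these numbers, the same CodeBERT checkpoint can be loaded directly through Transformers. A minimal sanity-check sketch, not part of the official run.py pipeline:

    import torch
    from transformers import RobertaModel, RobertaTokenizer

    # Minimal sketch: load CodeBERT from the Hugging Face hub and embed
    # one snippet, just to confirm the environment works.
    tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
    model = RobertaModel.from_pretrained("microsoft/codebert-base")

    inputs = tokenizer("def max(a, b): return a if a > b else b",
                       return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    print(outputs.last_hidden_state.shape)  # torch.Size([1, seq_len, 768])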



The Tao Department front-end F(x) Team is now on Weibo!
Beyond these articles, there is more team content to unlock 🔓