In May this year, Facebook AI Research (FAIR) published fairseq, a new convolutional neural network approach to machine translation that they report is 9 times faster than recurrent neural networks while achieving the highest accuracy among existing models at the time. They also released the source code and trained systems for the fairseq sequence modeling toolkit on GitHub, so that other researchers can build their own models for translation, text summarization, and other tasks.

For details, see: 9 Times Faster! Fairseq, Facebook's Open-Source Machine Translation Project.

Facebook's AI research team has now open-sourced a PyTorch version of fairseq on GitHub.

Introduction

Fairseq is a sequence-to-sequence learning toolkit published by Facebook AI Research. The PyTorch version was originally written (in no particular order) by Sergey Edunov, Myle Ott, and Sam Gross. The toolkit implements the fully convolutional model described in Convolutional Sequence to Sequence Learning (https://arxiv.org/abs/1705.03122), supports multi-GPU training on a single machine, and provides fast beam search generation on both CPU and GPU. The release includes pre-trained models for English-to-French and English-to-German translation.

Citation

If you use the fairseq code in your paper, you can cite it like this:

@inproceedings{gehring2017convs2s,
  author    = {Gehring, Jonas and Auli, Michael and Grangier, David and Yarats, Denis and Dauphin, Yann N},
  title     = "{Convolutional Sequence to Sequence Learning}",
  booktitle = {Proc. of ICML},
  year      = 2017,
}

Requirements and Installation

  • macOS or Linux

  • To train new models, you also need an NVIDIA GPU and NCCL (https://github.com/NVIDIA/nccl)

  • Python 3.6

  • PyTorch (pytorch.org/)

Currently fairseq-py requires PyTorch built from its GitHub repository. There are several ways to install it; we recommend using Miniconda3 and following the steps below.

1. Install Miniconda3 (https://conda.io/miniconda.html) and activate the Python 3 environment.

2. Install PyTorch:

conda install gcc numpy cudnn nccl
conda install magma-cuda80 -c soumith
pip install cmake
pip install cffi

git clone https://github.com/pytorch/pytorch.git
cd pytorch
git reset --hard a03e5cb40938b6b3f3e6dbddf9cff8afdff72d1b
git submodule update --init
pip install -r requirements.txt

NO_DISTRIBUTED=1 python setup.py install

3. Clone the fairseq-py repository from GitHub and run the following commands inside it to install fairseq-py:

pip install -r requirements.txt
python setup.py build
python setup.py develop

Quick start

The following commands are provided:

  • python preprocess.py: Data preprocessing: builds vocabularies and binarizes the training data

  • python train.py: Trains a new model on one or more GPUs

  • python generate.py: Translates preprocessed data with a trained model

  • python generate.py -i: Translates raw text with a trained model

  • python score.py: Computes the BLEU score of generated translations against reference translations

Evaluating pre-trained models:

First, download a pre-trained model along with its vocabularies:

$ curl https://s3.amazonaws.com/fairseq-py/models/wmt14.en-fr.fconv-py.tar.bz2 | tar xvjf -

This model uses a Byte Pair Encoding (BPE) vocabulary (https://arxiv.org/abs/1508.07909), so the same encoding must be applied to the source text before translation. This can be done with the apply_bpe.py script using the wmt14.en-fr.fconv-cuda/bpecodes file. @@ is used as a continuation marker; the original text can be restored with sed s/@@ //g or by passing the --remove-bpe flag to generate.py. Before BPE is applied, the input text needs to be tokenized with tokenizer.perl from mosesdecoder.
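As a rough illustration of what removing the continuation marker does, here is a minimal Python sketch that mirrors the sed command above. The function name remove_bpe is hypothetical and is not part of fairseq-py; the sketch only assumes that sub-word units are joined by the literal string "@@ ".

```python
def remove_bpe(line: str, marker: str = "@@ ") -> str:
    """Join BPE sub-word units by deleting the continuation marker.

    Hypothetical helper; same effect as: sed 's/@@ //g'
    """
    return line.replace(marker, "")


encoded = "Why is it rare to discover new marine mam@@ mal species ?"
print(remove_bpe(encoded))
# Why is it rare to discover new marine mammal species ?
```

The text is still tokenized after this step (note the space before the question mark); undoing tokenization is a separate step.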

Here is an example of using python generate.py -i to generate translations interactively with a beam size of 5:

$ MODEL_DIR=wmt14.en-fr.fconv-py
$ python generate.py -i \
 --path $MODEL_DIR/model.pt $MODEL_DIR \
 --beam 5
| [en] dictionary: 44206 types
| [fr] dictionary: 44463 types
| model fconv_wmt_en_fr
| loaded checkpoint /private/home/edunov/wmt14.en-fr.fconv-py/model.pt (epoch 37)
> Why is it rare to discover new marine mam@@ mal species ?
S       Why is it rare to discover new marine mam@@ mal species ?
O       Why is it rare to discover new marine mam@@ mal species ?
H       -0.08662842959165573    Pourquoi est-il rare de découvrir de nouvelles espèces de mammifères marins ?
A       0 1 3 3 5 6 6 10 8 8 8 11 12

Training a new model

Data preprocessing

The fairseq-py toolkit includes an example preprocessing script for the IWSLT 2014 German-English corpus. First, preprocess and binarize the data:

$ cd data/
$ bash prepare-iwslt14.sh
$ cd ..
$ TEXT=data/iwslt14.tokenized.de-en
$ python preprocess.py --source-lang de --target-lang en \
 --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
 --thresholdtgt 3 --thresholdsrc 3 --destdir data-bin/iwslt14.tokenized.de-en

This writes binarized data to data-bin/iwslt14.tokenized.de-en, which can then be used to train the model.
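To make the vocabulary-building step more concrete, here is a minimal Python sketch of frequency-thresholded vocabulary construction. It is not the code in preprocess.py: the helper names, the "<unk>" symbol, and the reading of the threshold as "tokens seen fewer than this many times become unknown" are all assumptions made for illustration.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical name for the unknown-word symbol


def build_vocab(sentences, threshold):
    """Keep tokens that occur at least `threshold` times (assumed semantics)."""
    counts = Counter(tok for s in sentences for tok in s.split())
    # Most frequent tokens first; ties broken alphabetically for a stable index
    kept = sorted((t for t, c in counts.items() if c >= threshold),
                  key=lambda t: (-counts[t], t))
    return {tok: idx for idx, tok in enumerate([UNK] + kept)}


def encode(sentence, vocab):
    """Map tokens to integer ids, falling back to the unknown symbol."""
    return [vocab.get(tok, vocab[UNK]) for tok in sentence.split()]


corpus = ["danke .", "danke schön .", "vielen dank .", "danke ."]
vocab = build_vocab(corpus, threshold=3)
print(vocab)                              # {'<unk>': 0, '.': 1, 'danke': 2}
print(encode("vielen danke .", vocab))    # [0, 2, 1]
```

Binarization then amounts to storing such integer id sequences in a compact on-disk format instead of text.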

Training

Use python train.py to train a new model. Here are some example settings that work well for the IWSLT 2014 dataset:

$ mkdir -p checkpoints/fconv
$ CUDA_VISIBLE_DEVICES=0 python train.py data-bin/iwslt14.tokenized.de-en \
 --lr 0.25 --clip-norm 0.1 --dropout 0.2 --max-tokens 4000 \
 --arch fconv_iwslt_de_en --save-dir checkpoints/fconv

By default, python train.py uses all available GPUs on your machine. Use the CUDA_VISIBLE_DEVICES environment variable to select specific GPUs or to change the number of GPUs used.

Note that the batch size is specified as the maximum number of tokens per batch (--max-tokens). You may need to choose a smaller value depending on the GPU memory available on your system.
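The idea of sizing batches by token count rather than by sentence count can be sketched as follows. This is an illustrative simplification, not fairseq-py's actual batching code: the function name batch_by_tokens is hypothetical, and the sketch assumes the cost of a batch is the number of sentences times the length of its longest sentence (the padded size).

```python
def batch_by_tokens(lengths, max_tokens):
    """Group sentence indices so each padded batch stays within max_tokens."""
    batches, current, longest = [], [], 0
    # Visiting sentences from shortest to longest keeps padding small
    for idx in sorted(range(len(lengths)), key=lambda i: lengths[i]):
        new_longest = max(longest, lengths[idx])
        # Padded size = number of sentences * length of the longest sentence
        if current and (len(current) + 1) * new_longest > max_tokens:
            batches.append(current)
            current, new_longest = [], lengths[idx]
        current.append(idx)
        longest = new_longest
    if current:
        batches.append(current)
    return batches


lengths = [5, 12, 7, 30, 8, 3]
for batch in batch_by_tokens(lengths, max_tokens=32):
    print(batch, [lengths[i] for i in batch])
```

With a token budget, batches of short sentences contain many examples and batches of long sentences contain few, so memory use per batch stays roughly constant. In this sketch a sentence longer than the budget still gets a batch of its own.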

Generating translations

Once the model is trained, you can generate translations using python generate.py (for binarized data) or python generate.py -i (for raw text).

$ python generate.py data-bin/iwslt14.tokenized.de-en \
 --path checkpoints/fconv/checkpoint_best.pt \
 --batch-size 128 --beam 5
| [de] dictionary: 35475 types
| [en] dictionary: 24739 types
| data-bin/iwslt14.tokenized.de-en test 6750 examples
| model fconv
| loaded checkpoint trainings/fconv/checkpoint_best.pt
S-721   danke .
T-721   thank you .
…

To generate translations using only a CPU, add the --cpu flag. BPE continuation markers can be removed from the output with the --remove-bpe flag.

Pre-trained models

The following fully convolutional sequence-to-sequence models are currently available:

  • wmt14.en-fr.fconv-py.tar.bz2 (https://s3.amazonaws.com/fairseq-py/models/wmt14.en-fr.fconv-py.tar.bz2): Model for WMT14 English-French translation, including vocabularies

  • wmt14.en-de.fconv-py.tar.bz2 (https://s3.amazonaws.com/fairseq-py/models/wmt14.en-de.fconv-py.tar.bz2): Model for WMT14 English-German translation, including vocabularies

For the models above, the following preprocessed and binarized test sets are available:

  • wmt14.en-fr.newstest2014.tar.bz2 (https://s3.amazonaws.com/fairseq-py/data/wmt14.en-fr.newstest2014.tar.bz2): newstest2014 test set for WMT14 English-French translation

  • wmt14.en-fr.ntst1213.tar.bz2 (https://s3.amazonaws.com/fairseq-py/data/wmt14.en-fr.ntst1213.tar.bz2): newstest2012 and newstest2013 test sets for WMT14 English-French translation

  • wmt14.en-de.newstest2014.tar.bz2 (https://s3.amazonaws.com/fairseq-py/data/wmt14.en-de.newstest2014.tar.bz2): newstest2014 test set for WMT14 English-German translation

The following is an example of generating translations for a test set in batch mode on a GTX-1080ti:

$ curl https://s3.amazonaws.com/fairseq-py/models/wmt14.en-fr.fconv-py.tar.bz2 | tar xvjf - -C data-bin
$ curl https://s3.amazonaws.com/fairseq-py/data/wmt14.en-fr.newstest2014.tar.bz2 | tar xvjf - -C data-bin
$ python generate.py data-bin/wmt14.en-fr.newstest2014 \
 --path data-bin/wmt14.en-fr.fconv-py/model.pt \
 --beam 5 --batch-size 128 --remove-bpe | tee /tmp/gen.out
…
| Translated 3003 sentences (95451 tokens) in 81.3s (1174.33 tokens/s)
| Generate test with beam=5: BLEU4 = 40.23, 67.5/46.4/33.8/25.0 (BP=0.997, ratio=1.003, syslen=80963, reflen=81194)

# Scoring with score.py:
$ grep ^H /tmp/gen.out | cut -f3- > /tmp/gen.out.sys
$ grep ^T /tmp/gen.out | cut -f2- > /tmp/gen.out.ref
$ python score.py --sys /tmp/gen.out.sys --ref /tmp/gen.out.ref
BLEU4 = 40.23, 67.5/46.4/33.8/25.0 (BP=0.997, ratio=1.003, syslen=80963, reflen=81194)
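The numbers in the BLEU line fit together under the standard BLEU definition (brevity penalty times the geometric mean of the n-gram precisions). The short Python check below recomputes the score from the reported values only; it is a sanity check on the arithmetic, not the code in score.py, and it assumes the four slash-separated figures are the 1-gram to 4-gram precisions in percent.

```python
import math

# Values copied from the BLEU line in the log above
precisions = [67.5, 46.4, 33.8, 25.0]   # assumed: 1- to 4-gram precision, in percent
syslen, reflen = 80963, 81194

# Brevity penalty: below 1 when the system output is shorter than the reference
bp = min(1.0, math.exp(1 - reflen / syslen))

# Geometric mean of the n-gram precisions
geo_mean = math.exp(sum(math.log(p / 100) for p in precisions) / len(precisions))

bleu4 = 100 * bp * geo_mean
print(f"BP={bp:.3f}  reflen/syslen={reflen / syslen:.3f}  BLEU4={bleu4:.2f}")
```

The result lands within a few hundredths of the reported 40.23; the small gap comes from using precisions that were already rounded to one decimal place.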

Via: GitHub (github.com/facebookres…

Compiled and edited by Lei Feng Network (public account: Lei Feng Network) AI Technology Review.
