Recently (yesterday, to be exact), a Python library was released that uses deep learning to translate text. It is very easy to call, and because it is built on Facebook AI's multilingual translation model, it supports 50 languages. This article walks through how to use it.

Note: no deep-learning knowledge is required to use it, but basic Python knowledge is.


Installation

Installing it is as simple as executing this line:

```shell
pip install dl-translate
```

However, it is recommended to install it in a fresh environment: the library is built on the latest version of PyTorch and has not been tested on other versions, and a separate environment avoids disturbing your system setup.

```shell
conda create -n torch1.8 python=3.8
conda activate torch1.8
pip install dl-translate
```

Usage

According to the official guide, translation takes only four lines of code:

```python
import dl_translate as dlt

mt = dlt.TranslationModel()  # define the model

text_hi = "संयुक्त राष्ट्र के प्रमुख का कहना है कि सीरिया में कोई सैन्य समाधान नहीं है"
mt.translate(text_hi, source=dlt.lang.HINDI, target=dlt.lang.ENGLISH)
```

Note that the model has to be downloaded on first use, which may be slow: because it supports translation across 50 languages, it is quite large, at 2.3 GB. If needed, I can upload it to a Baidu Netdisk share for you.

View the languages supported by the model:
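A minimal sketch, assuming the `available_languages()` and `available_codes()` helpers shown in the dl-translate README (this triggers the model download, so it is slow on first run):

```python
import dl_translate as dlt

mt = dlt.TranslationModel()

print(mt.available_languages())  # human-readable names, e.g. "Hindi"
print(mt.available_codes())      # the corresponding language codes
```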

For a long passage, translation may be very slow, so it is recommended to split it into sentences first and then translate sentence by sentence. For English, you can use the NLTK package to split the text and then join the per-sentence translations:

```python
import nltk

nltk.download("punkt")

text = "Mr. Smith went to his favorite cafe. There, he met his friend Dr. Doe."
sents = nltk.tokenize.sent_tokenize(text, "english")  # don't use dlt.lang.ENGLISH
" ".join(mt.translate(sents, source=dlt.lang.ENGLISH, target=dlt.lang.FRENCH))
```

Batch translation

During translation we can take advantage of the GPU's parallelism by adjusting batch_size to translate several sentences at once; the premise, of course, is that your GPU memory can hold that many sentences.

```python
mt = dlt.TranslationModel()
mt.translate(text, source, target, batch_size=32, verbose=True)
```

The input text can be either a single string or a list of strings, and the result comes back in the corresponding form.
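Under the hood, batching just means slicing the sentence list into groups of at most batch_size before feeding them to the model. A pure-Python sketch of that grouping (the `chunked` helper is my own illustration, not part of dl-translate):

```python
def chunked(sentences, batch_size):
    """Yield consecutive batches of at most batch_size sentences."""
    for i in range(0, len(sentences), batch_size):
        yield sentences[i:i + batch_size]

sents = ["sentence %d" % i for i in range(70)]
batches = list(chunked(sents, 32))
print([len(b) for b in batches])  # 70 sentences -> batches of 32, 32, 6
```

Larger batches use the GPU more efficiently, but each batch must fit in GPU memory, which is why batch_size is exposed as a knob.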

Performance test

Since the model is too large to fit on my GPU (a 2080 Ti), the tests below run on the CPU; they should be much faster if your GPU can hold the model. In my test, I translated the following sentence 100 times and measured the time:

Many technical approaches have been proposed for ensuring that decisions made by AI systems are fair, but few of these methods have been applied in real-world settings.

For comparison, Google Translate renders it as:

Many technical approaches have been proposed to ensure that AI systems make fair decisions, but few of these approaches have been applied in real-world environments.

Test code:

```python
import dl_translate as dlt
import time
from tqdm import tqdm

time_s = time.time()
mt = dlt.TranslationModel(model_options=dict(cache_dir="./"))  # slow when you load it for the first time
time_e = time.time()
print("Loading model takes {:.2f} seconds".format(time_e - time_s))

text_english = "Many technical approaches have been proposed for ensuring that decisions made by AI systems are fair, but few of these methods have been applied in real-world settings."
text_chinese = mt.translate(text_english, source=dlt.lang.ENGLISH, target=dlt.lang.CHINESE)
print(text_chinese)

time_s = time.time()
texts = [text_english for i in range(100)]
for t in tqdm(texts):
    mt.translate(t, source=dlt.lang.ENGLISH, target=dlt.lang.CHINESE)
time_e = time.time()
print("It takes {:.2f} seconds to translate 100 sentences, with an average of {:.2f} seconds each.".format(time_e - time_s, (time_e - time_s) / 100))
```
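The timing loop above can be factored into a small reusable helper (`time_calls` is my own name, not part of any library):

```python
import time

def time_calls(fn, n=100):
    """Run fn() n times and return (total_seconds, average_seconds)."""
    start = time.time()
    for _ in range(n):
        fn()
    total = time.time() - start
    return total, total / n

# With the model above, usage would look like:
# total, avg = time_calls(lambda: mt.translate(
#     text_english, source=dlt.lang.ENGLISH, target=dlt.lang.CHINESE))
```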

Test results:

As you can see, loading the model takes nearly a minute and a half, and translating a 27-word English sentence takes about 4 seconds, which depends largely on sentence length. While the output is less natural than Google Translate's, it is good enough for a translator that works offline.

Reference links:

  • dl-translate: github.com/xhlulu/dl-t…
  • Usage guide: xinghanlu.com/dl-translat…
  • mBART-50 model: huggingface.co/facebook/MB…
  • Paper: Multilingual Translation with Extensible Multilingual Pretraining and Finetuning

One last thing: if you found this article helpful, please like and comment, thank you!

My WeChat official account: Algorithm Guy Chris, welcome to follow!