Overview: Are Chinese NLP and English NLP really studied along two separate lines? Does the long history of the Chinese language make NLP exploration harder? In this article, we review some recent projects that have brought breakthroughs to the Chinese NLP field.

Keywords: NLP, Chinese pre-trained models, test benchmarks

It is often said that if you have studied NLP (Natural Language Processing), you will know how difficult Chinese NLP is.

Although both fall under NLP, English and Chinese differ greatly in how they are analyzed and processed because of their different linguistic conventions, and they pose different difficulties and challenges.

Some methods in Chinese NLP

Moreover, most of today's popular models are developed for English, and many tasks (such as word segmentation) are particularly difficult because of the unique characteristics of the Chinese language, which has slowed progress in Chinese NLP.

But that may soon change: since last year, a number of excellent open-source projects have given the Chinese NLP space a boost.

Model: Chinese pre-trained ALBERT

BERT (Bidirectional Encoder Representations from Transformers) is the language model from Google that quickly topped the major NLP leaderboards after its release.

However, one disadvantage of BERT is its size: BERT-large has around 340 million parameters, which makes it very expensive to train. In 2019, Google AI introduced the lightweight ALBERT (A Lite BERT), which has roughly 18 times fewer parameters than the corresponding BERT model yet delivers superior performance.

Performance comparison at the launch of ALBERT

Although ALBERT addresses the pre-trained model's high training cost and huge parameter count, it still targets only English, which leaves engineers focused on Chinese development feeling a little helpless.

To make the model available in Chinese and benefit more developers, data engineer Xu Liang's team open-sourced the first Chinese pre-trained ALBERT model in October 2019.

The project has received more than 2,200 stars on GitHub.

The project address

Github.com/brightmart/…

The Chinese pre-trained ALBERT model (called albert_zh) is trained on a large Chinese corpus drawn from encyclopedias, news, and online communities: about 30 GB of text containing over 10 billion Chinese characters.

For comparison, albert_zh was pre-trained with a sequence length of 512 and a batch size of 4,096, generating 350 million training instances, whereas roberta_zh, another strong pre-trained model, generated 250 million training instances with a sequence length of 256.

Since albert_zh's pre-training produces more training data and uses longer sequences, it is expected to outperform roberta_zh and to handle longer text better.

Albert_zh performance comparison with other models

In addition, the albert_zh project released a series of ALBERT models of different sizes, from the tiny edition to the xlarge edition, which greatly contributed to ALBERT's popularity in the Chinese NLP field.
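For developers who want to try one of these checkpoints, here is a minimal sketch of loading a Chinese ALBERT model with the Hugging Face transformers library; the checkpoint name is only an example of a converted albert_zh-style model, not part of the official release.

```python
# A minimal sketch of loading a Chinese ALBERT checkpoint with Hugging Face
# transformers. The checkpoint name below is illustrative; substitute the
# albert_zh variant (tiny ... xlarge) you actually use.
import torch
from transformers import BertTokenizerFast, AlbertModel

MODEL_NAME = "voidful/albert_chinese_tiny"  # assumption: an albert_zh-style checkpoint

# Chinese ALBERT checkpoints are commonly released with a BERT-style WordPiece
# vocab, so a BERT tokenizer is used instead of ALBERT's SentencePiece tokenizer.
tokenizer = BertTokenizerFast.from_pretrained(MODEL_NAME)
model = AlbertModel.from_pretrained(MODEL_NAME)
model.eval()

inputs = tokenizer("自然语言处理很有趣。", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Contextual token embeddings: (batch, seq_len, hidden_size)
print(outputs.last_hidden_state.shape)
```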

It is worth noting that in January 2020, Google AI released ALBERT v2 and subsequently launched an official Google Chinese version of ALBERT.

Benchmark: ChineseGLUE, a GLUE for Chinese NLP

Once you have models, how do you judge them? This requires a good benchmark, and last year ChineseGLUE, a benchmark for Chinese NLP, was open-sourced.

ChineseGLUE is modeled on GLUE, the industry's well-known benchmark: a collection of nine English language-understanding tasks aimed at promoting research into general and robust natural language understanding systems.

Previously there was no Chinese counterpart to GLUE, so some pre-trained models could not be publicly evaluated across different tasks, leading to a disconnect between the development and application of Chinese NLP and even a lag in how the technology was applied.

In response, Dr. Lan Zhenzhong, the first author of ALBERT, Xu Liang, the developer of albert_zh, and more than 20 other engineers jointly launched a benchmark for Chinese NLP: ChineseGLUE.

The project address

Github.com/chineseGLUE…

With the emergence of ChineseGLUE, Chinese is now included as a criterion when evaluating new models, and a complete evaluation system has taken shape for testing Chinese pre-trained models.

This powerful testing benchmark covers the following aspects:

1) A benchmark of Chinese tasks, composed of sentence- and sentence-pair-level language understanding tasks covering multiple levels of difficulty (a minimal example of loading one of these tasks follows this list).

2) Performance leaderboards, updated periodically, to provide a basis for model selection.

3) A collection of baseline models, including starter code and pre-trained models for the ChineseGLUE tasks; these baselines are available for TensorFlow, PyTorch, Keras, and other frameworks.

4) Large raw corpora for pre-training and language-modeling research, roughly 10 GB as of 2019, with plans to expand to a sufficiently large corpus (e.g., 100 GB) by the end of 2020.
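As a quick way to get started with the task data, the sketch below assumes the Hugging Face `datasets` library, which mirrors the CLUE benchmark; the `tnews` config (short-text news classification) is used purely as an example task.

```python
# A minimal sketch of pulling one ChineseGLUE/CLUE task for experimentation,
# assuming the Hugging Face `datasets` library hosts a mirror of the benchmark.
from datasets import load_dataset

# "tnews" is only an example; other CLUE task names can be substituted.
dataset = load_dataset("clue", "tnews")

print(dataset)               # available splits: train / validation / test
print(dataset["train"][0])   # inspect one example (sentence text and label id)
```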

The evaluation site, launched in October 2019

Just as GLUE witnessed the emergence of BERT, the launch and continuous improvement of ChineseGLUE is expected to witness the birth of more powerful Chinese NLP models.

In late 2019, the project moved to a more comprehensive, better-supported successor: CLUEbenchmark/CLUE.

The project address

Github.com/CLUEbenchma…

Data: the most complete dataset collection and largest corpus to date

With pre-trained models and a test benchmark in place, another important link is data: datasets, corpora, and other data resources.

This led to a more comprehensive organization, CLUE (Chinese Language Understanding Evaluation), an open-source organization that provides evaluation benchmarks for Chinese language understanding, focusing on tasks and datasets, benchmarks, Chinese pre-trained models, corpora, and leaderboard publishing.

Some time ago, CLUE released CLUEDatasetSearch, the largest and most complete collection of Chinese NLP datasets to date, comprising 142 datasets in 10 categories.

The final web interface

The project address

Github.com/CLUEbenchma…

Its contents cover NER, QA, sentiment analysis, text classification, text matching, text summarization, machine translation, knowledge graphs, corpora, and reading comprehension.

Simply type a keyword or a field of interest into the website's search box to find the corresponding resources. For each dataset, the page provides its name, update time, provider, description, keywords, category, and paper link.

More recently, CLUE open-sourced a 100 GB Chinese corpus as well as a set of high-quality Chinese pre-trained models, and published a paper on arXiv.

Arxiv.org/abs/2003.01…

On the corpus side, CLUE has open-sourced CLUECorpus2020, a 100 GB large-scale pre-training corpus for Chinese.

The data come from cleaning the Chinese portion of the Common Crawl dataset.

The corpus can be used directly for pre-training, language modeling, or language generation tasks, or to build small vocabularies tailored to Chinese NLP tasks.
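Because the corpus is plain text, preparing it for masked-language-model pre-training mostly means tokenizing and masking. The sketch below is a simplified illustration using Hugging Face utilities and the public `bert-base-chinese` tokenizer; it is not the CLUE team's actual pipeline.

```python
# Simplified illustration of turning raw Chinese corpus lines into masked-LM
# training examples, using Hugging Face utilities (not CLUE's own pipeline).
from transformers import BertTokenizerFast, DataCollatorForLanguageModeling

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

# In practice these lines would be streamed from the 100 GB corpus files.
corpus_lines = ["今天天气不错。", "中文预训练语料可以直接用于语言建模。"]

encoded = [tokenizer(line, truncation=True, max_length=128) for line in corpus_lines]
batch = collator(encoded)

# `input_ids` now contain [MASK] tokens; `labels` hold the original ids at the
# masked positions and -100 elsewhere (ignored by the loss).
print(batch["input_ids"].shape, batch["labels"].shape)
```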

Performance comparison on BERT-base using a small dataset

The project address

Github.com/CLUEbenchma…

On the model side, there is CLUEPretrainedModels: a collection of high-quality Chinese pre-trained models, including the most advanced large model, the fastest small model, and a model specialized for similarity tasks.

Performance comparison of the large model (row 3)

Among them, the large model matches the best existing Chinese NLP models and wins on some tasks; the small model runs about 8 times faster than BERT-base; and the semantic similarity model, designed for semantic similarity and sentence-pair problems, is very likely to outperform a directly pre-trained model on such tasks.
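As a rough illustration of how a similarity-oriented model is used, the sketch below encodes two sentences with a generic Chinese BERT-style encoder and compares their mean-pooled representations by cosine similarity; the checkpoint name and pooling choice are assumptions, not CLUE's exact recipe.

```python
# Rough illustration of scoring sentence-pair similarity with a Chinese
# BERT-style encoder. The checkpoint name and mean-pooling are assumptions,
# not the exact recipe behind CLUE's similarity model.
import torch
from transformers import BertTokenizerFast, BertModel

MODEL_NAME = "bert-base-chinese"  # placeholder for a similarity-tuned checkpoint
tokenizer = BertTokenizerFast.from_pretrained(MODEL_NAME)
model = BertModel.from_pretrained(MODEL_NAME)
model.eval()

def embed(sentence: str) -> torch.Tensor:
    """Mean-pool the contextual token embeddings of one sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, hidden)
    return hidden.mean(dim=1).squeeze(0)

a = embed("今天天气很好")
b = embed("今天的天气不错")
score = torch.nn.functional.cosine_similarity(a, b, dim=0)
print(f"similarity: {score.item():.3f}")
```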

The project address

Github.com/CLUEbenchma…

To some extent, the release of these resources is like fuel for development, and with sufficient resources the Chinese NLP field may really take off.

 

They Make Chinese NLP Easy

From the perspective of language, Chinese and English are the two languages with the largest number of speakers and the greatest influence in the world. However, due to different language characteristics, research in the field of NLP also faces different problems.

 

Although Chinese NLP is indeed difficult and lags behind English, which machines understand better, it is precisely because the engineers mentioned in this article are willing to keep exploring and sharing their achievements to advance Chinese NLP that these technologies can keep iterating and improving.

**Several contributors to CLUE**

Thanks to them for their efforts and contributions to so many quality projects! We also hope that more people will join in and jointly promote the development of Chinese NLP.
