This article was originally published by easyAI – AI Knowledge Base for Product Managers.

Tokenization is the English term for word segmentation in NLP.

Word segmentation is a basic task in NLP: it breaks sentences and paragraphs down into word-level units for subsequent analysis.

This article explains why we segment words, three differences between Chinese and English word segmentation, three difficulties of Chinese word segmentation, and three typical segmentation methods. Finally, it lists common Chinese and English word segmentation tools.

What is word segmentation?

Word segmentation is an important step in NLP.

Word segmentation decomposes long text such as sentences, paragraphs, and articles into a data structure whose basic unit is the word, which makes subsequent processing and analysis easier.

Why segment words?

  1. Turn complex problems into mathematical problems

As discussed in our article on machine learning, machine learning can solve many complex problems because it turns them into mathematical problems.

NLP follows the same idea: text is “unstructured data”, and we first need to transform it into “structured data” that can be treated mathematically. Word segmentation is the first step of that transformation.
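
As a minimal sketch of this idea, the snippet below tokenizes an English sentence, builds a vocabulary, and turns the text into a bag-of-words count vector that a mathematical model can work with (the sentence and vocabulary here are purely illustrative):

```python
# Minimal sketch: unstructured text -> tokens -> vocabulary -> count vector.
from collections import Counter

text = "the cat sat on the mat"
tokens = text.split()  # word segmentation (trivial for English: split on spaces)

# Map each distinct word to an index: the "structured" vocabulary.
vocab = {word: idx for idx, word in enumerate(sorted(set(tokens)))}
counts = Counter(tokens)

# Bag-of-words vector: one dimension per vocabulary entry.
vector = [counts[word] for word in vocab]
print(vocab)   # {'cat': 0, 'mat': 1, 'on': 2, 'sat': 3, 'the': 4}
print(vector)  # [1, 1, 1, 1, 2]
```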

  2. Words are a good granularity

A word is the smallest unit of complete meaning.

The granularity of a single character is too small to express a complete meaning. For example, the character 鼠 (“mouse”) can refer to the animal (老鼠) or to a computer mouse (鼠标).

Sentence granularity, on the other hand, is too large: a sentence carries a lot of information and is hard to reuse. For example: “one of the important reasons traditional methods fall behind is that they are weak at modelling long-distance dependencies.”

  3. In the era of deep learning, some tasks can do without word segmentation

In the era of deep learning, with the explosive growth of data volume and computing power, many traditional methods have been subverted.

For example, the paper “Is Word Segmentation Necessary for Deep Learning of Chinese Representations?” finds that character-based deep learning models can perform as well as, or better than, word-based models on many Chinese NLP tasks.

However, for some specific tasks, word segmentation is still necessary, such as keyword extraction and named entity recognition.

Three typical differences between Chinese and English word segmentation

Difference 1: Chinese is more difficult due to the different ways of word segmentation

English has spaces as natural separators between words, but Chinese does not, so deciding where to split is itself a hard problem. In addition, a Chinese word can have many meanings, which easily leads to ambiguity. These difficulties are explained in detail in the sections below.
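
For instance, an English sentence can be split on whitespace directly, while Chinese needs a dedicated segmenter. A minimal sketch, assuming the open-source jieba library is installed (`pip install jieba`); the exact Chinese split depends on jieba’s dictionary:

```python
# English: whitespace already separates words.
print("the cat sat on the mat".split())

# Chinese: no spaces, so a segmenter such as jieba is needed.
import jieba
print(jieba.lcut("我爱自然语言处理"))  # e.g. ['我', '爱', '自然语言', '处理']
```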

Difference 2: English words have many forms

English words have rich morphological variation. To cope with these variations, English NLP has some processing steps that Chinese does not need, namely lemmatization and stemming.

Lemmatization: does, done, doing, and did are all reduced to do.

Stemming: cities, children, and teeth are converted to city, child, and tooth.
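
A minimal sketch of both steps using NLTK (assuming `pip install nltk` and that the WordNet data has been downloaded); note that stemming only strips suffixes by rule, so its output may not be a dictionary word:

```python
import nltk
from nltk.stem import WordNetLemmatizer, PorterStemmer

nltk.download("wordnet", quiet=True)  # lexical data used by the lemmatizer

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

# Lemmatization maps inflected forms back to the dictionary form.
print(lemmatizer.lemmatize("children"))        # -> child
print(lemmatizer.lemmatize("doing", pos="v"))  # -> do

# Stemming chops suffixes by rule; the result may not be a real word.
print(stemmer.stem("cities"))                  # -> citi
```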

Difference 3: Granularity should be considered in Chinese word segmentation

For example, “中国科学技术大学” (University of Science and Technology of China) can be segmented in several ways:

  • 中国科学技术大学 (as a single word)
  • 中国 \ 科学技术 \ 大学 (China \ science and technology \ university)
  • 中国 \ 科学 \ 技术 \ 大学 (China \ science \ technology \ university)

The larger the granularity, the more precise the meaning, but the lower the recall. Different scenarios in Chinese therefore call for different granularities, a problem that does not exist in English.
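
As an illustration of granularity, the jieba segmenter offers several cutting modes. A minimal sketch, assuming jieba is installed; the exact output depends on its dictionary and version:

```python
import jieba

text = "中国科学技术大学"                 # University of Science and Technology of China
print(jieba.lcut(text))                   # precise mode: coarser, whole-word granularity
print(jieba.lcut(text, cut_all=True))     # full mode: every dictionary word it can find
print(jieba.lcut_for_search(text))        # search-engine mode: coarse words plus finer pieces
```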

Three difficult points in Chinese word segmentation

Difficulty 1: There is no unified standard

At present, there is no unified standard or accepted norm for Chinese word segmentation. Different companies and organizations have different methods and rules.

Difficulty 2: How to segment ambiguous words

For example, the sentence “乒乓球拍卖完了” has two possible segmentations with two different meanings:

  • 乒乓球 \ 拍卖 \ 完了 (the ping-pong balls have been auctioned off)
  • 乒乓 \ 球拍 \ 卖完了 (the ping-pong paddles are sold out)

Difficulty 3: The recognition of new words

In the age of information explosion, new words emerge every day, and recognizing them quickly is a major difficulty. For example, when the internet slang “蓝瘦香菇” (“blue thin mushroom”, a pun on “feeling sad, want to cry”) suddenly went viral, segmenters needed to pick it up as a new word quickly.

Three typical word segmentation methods

Word segmentation methods can be roughly divided into three categories:

  1. Dictionary-based matching
  2. Statistics-based
  3. Deep-learning-based

Word segmentation based on dictionary matching

Advantages: fast speed, low cost

Disadvantages: poor adaptability; results vary greatly across domains

The basic idea is dictionary matching: the Chinese text to be segmented is split and adjusted according to certain rules and then matched against the entries of a dictionary. If a match succeeds, the text is segmented according to the dictionary entry; if it fails, the split is adjusted or re-chosen, and the process repeats until the text is consumed. Representative methods include forward maximum matching, backward maximum matching, and bidirectional matching.
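
A minimal sketch of forward maximum matching, with a toy dictionary chosen only for illustration (real systems use dictionaries with hundreds of thousands of entries):

```python
def forward_max_match(text, dictionary, max_len=5):
    """Greedily match the longest dictionary word starting at each position."""
    words = []
    i = 0
    while i < len(text):
        matched = None
        # Try the longest window first, then shrink it.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                matched = candidate
                break
        if matched is None:        # unknown character: emit it as a single token
            matched = text[i]
        words.append(matched)
        i += len(matched)
    return words

toy_dict = {"乒乓球", "乒乓", "球拍", "拍卖", "卖完", "完了"}
print(forward_max_match("乒乓球拍卖完了", toy_dict))
# Longest-first matching yields ['乒乓球', '拍卖', '完了'] with this dictionary.
```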

Word segmentation method based on statistics

Advantages: Strong adaptability

Disadvantages: Higher cost, slower speed

Commonly used algorithms of this kind include HMM, CRF, SVM, and deep-learning methods; for example, the Stanford and HanLP segmenters are based on CRF. Taking CRF as an example, the basic idea is to train a labelling model over individual Chinese characters that considers not only the frequency of words but also their context, so it has good learning ability and handles ambiguous words and unknown words well.
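
A minimal sketch of the character-level labelling scheme such models are trained on (BMES tags: B = begin of a word, M = middle, E = end, S = single-character word); this is a sketch of the idea, not the actual Stanford or HanLP implementation:

```python
def words_to_bmes(words):
    """Convert a gold segmentation into per-character BMES tags."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags

def bmes_to_words(chars, tags):
    """Recover words from characters plus predicted BMES tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:                      # flush a trailing unfinished word
        words.append(current)
    return words

gold = ["乒乓球", "拍卖", "完了"]
tags = words_to_bmes(gold)                     # ['B', 'M', 'E', 'B', 'E', 'B', 'E']
print(bmes_to_words("乒乓球拍卖完了", tags))    # ['乒乓球', '拍卖', '完了']
```

A CRF (or HMM) learns to predict these tags from the characters and their context; the segmentation is then read off the predicted tag sequence.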

Word segmentation based on deep learning

Advantages: High accuracy, strong adaptability

Disadvantages: high cost, slow speed

For example, researchers have used a bidirectional LSTM with a CRF layer to implement a segmenter. It is essentially sequence labelling, so the approach is general-purpose and can also be used for named entity recognition; the reported character-level accuracy of such a segmenter can be as high as 97.5%.

In practice, common segmenters combine a machine-learning algorithm with a dictionary, which improves segmentation accuracy on the one hand and domain adaptability on the other.
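
As a small example of this combination, jieba’s user-dictionary API lets you add new words on top of its statistical model (the word below is just the new-word example from the previous section):

```python
import jieba

text = "蓝瘦香菇"                 # a piece of internet slang / new word
print(jieba.lcut(text))           # without a dictionary entry, it may be split into pieces

jieba.add_word("蓝瘦香菇")        # register the word in the user dictionary
print(jieba.lcut(text))           # now kept as a single token: ['蓝瘦香菇']
```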

Chinese word segmentation tools

The following rankings are based on the number of stars on GitHub:

  1. HanLP
  2. Stanford Word Segmenter
  3. Ansj
  4. LTP (Harbin Institute of Technology)
  5. KCWS
  6. jieba
  7. IK
  8. THULAC (Tsinghua University)
  9. ICTCLAS

English word segmentation tools

  1. Keras
  2. spaCy
  3. Gensim
  4. NLTK
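
As a quick sketch, English tokenization with NLTK (one of the tools above) looks like this, assuming `pip install nltk` and the punkt tokenizer data:

```python
import nltk
nltk.download("punkt", quiet=True)   # tokenizer models used by word_tokenize

from nltk.tokenize import word_tokenize
print(word_tokenize("Word segmentation is the first step of NLP."))
# ['Word', 'segmentation', 'is', 'the', 'first', 'step', 'of', 'NLP', '.']
```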

Conclusion

Word segmentation decomposes long text such as sentences, paragraphs, and articles into a data structure whose basic unit is the word, which makes subsequent processing and analysis easier.

Reasons for word segmentation:

  1. Turn complex problems into mathematical problems
  2. Words are a good granularity
  3. In the era of deep learning, some tasks can do without word segmentation

Three typical differences between Chinese and English word segmentation:

  1. The way words are separated differs, and Chinese is harder
  2. English words have many inflected forms, which require lemmatization and stemming
  3. Chinese word segmentation needs to consider granularity

Three difficult points in Chinese word segmentation:

  1. There is no unified standard
  2. How to segment ambiguous words
  3. Recognition of new words

Three typical word segmentation methods:

  1. Dictionary-based matching
  2. Statistics-based
  3. Deep-learning-based