The principle of word segmentation

Chinese word segmentation: a sequence of Chinese characters is segmented to obtain individual words. On the surface this sounds simple, but the quality of the segmentation has a great impact on information retrieval and on experimental results, and a variety of algorithms are involved in doing it well.

Chinese word segmentation is very different from English tokenization. In English, words are separated by spaces, while in Chinese the character is the basic writing unit and there are no explicit marks between words, so the text has to be segmented explicitly. According to their characteristics, word segmentation algorithms can be divided into four categories:

  • Rule-based word segmentation method
  • Word segmentation method based on statistics
  • Semantic-based word segmentation method
  • Word segmentation method based on understanding

Let’s summarize each of these methods.

Rule-based word segmentation method

This method, also known as mechanical word segmentation or dictionary-based word segmentation, matches the string to be analyzed against the entries of a “sufficiently large” machine dictionary according to certain strategies; if a string is found in the dictionary, the match succeeds. The method has three elements: the segmentation dictionary, the text scanning order and the matching principle. The scanning order can be forward, reverse or bidirectional. The matching principles mainly include maximum matching, minimum matching, word-by-word matching and best matching.

  • Maximum matching method (MM). The basic idea: assuming the longest entry in the segmentation dictionary contains i Chinese characters, take the first i characters of the current string as the matching field and look it up in the dictionary. If such a word exists in the dictionary, the match succeeds and the matching field is segmented out as a word; if not, the match fails, the last Chinese character is removed from the matching field, the remaining characters become the new matching field, and so on until a match succeeds. Statistical results show that the error rate of this method is 1/169. (A minimal sketch of MM and RMM is given at the end of this subsection.)
  • Reverse maximum matching method (RMM). The segmentation process is the same as in the MM method, except that it starts from the end of the sentence (or document) and, each time a match fails, removes the first Chinese character of the matching field instead of the last. Statistical results show that the error rate of this method is 1/245.
  • Word-by-word traversal method. The words of the dictionary are searched through the whole material, in descending order of word length, until all the words have been cut out. No matter how large the dictionary is or how small the material being processed is, the whole dictionary has to be matched.
  • Segmentation mark method. Segmentation marks can be natural or unnatural. Natural segmentation marks are non-character symbols such as punctuation; unnatural marks are affixes and strings that do not form words on their own (including single-syllable units, multi-syllable units, onomatopoeia and so on). To apply this method, first collect a large number of segmentation marks, use them to split the sentence into short fields, and then use MM, RMM or other methods for finer processing. This is not a true word segmentation method but a pre-processing step for automatic segmentation; it costs extra time to scan for segmentation marks and extra storage space to hold the unnatural ones.
  • Best matching method (OM). This method comes in forward and reverse variants. Its starting point is to order the entries in the dictionary by word frequency so as to shorten dictionary lookup time, reduce the time complexity of segmentation and speed it up. In essence this is not a segmentation method in the pure sense but a way of organizing the segmentation dictionary: in an OM dictionary each entry is preceded by a data item of a specified length, so the space complexity increases, the accuracy of word extraction is unaffected, and the time complexity of segmentation decreases.

The advantages of this class of methods are that they are simple and easy to implement. The disadvantages are numerous: matching is slow; intersection and combination ambiguities remain; there is no standard definition of a word and no unified standard word set, so different dictionaries produce different ambiguities; and the methods lack any self-learning ability.
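
To make the maximum matching idea concrete, here is a minimal sketch of forward (MM) and reverse (RMM) maximum matching. It is a toy illustration rather than the code of any particular tool: the dictionary, the max_len limit and the example sentence are made-up assumptions.

```python
def forward_max_match(text, dictionary, max_len=5):
    """Forward maximum matching (MM): greedily take the longest dictionary
    word starting at the current position, falling back to one character."""
    words, i = [], 0
    while i < len(text):
        for j in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + j]
            if j == 1 or candidate in dictionary:
                words.append(candidate)
                i += j
                break
    return words


def reverse_max_match(text, dictionary, max_len=5):
    """Reverse maximum matching (RMM): the same idea, scanning from the end
    and dropping the first character of the field when a match fails."""
    words, i = [], len(text)
    while i > 0:
        for j in range(min(max_len, i), 0, -1):
            candidate = text[i - j:i]
            if j == 1 or candidate in dictionary:
                words.append(candidate)
                i -= j
                break
    return list(reversed(words))


if __name__ == '__main__':
    dictionary = {'研究', '研究生', '生命', '的', '起源'}
    sentence = '研究生命的起源'
    print(forward_max_match(sentence, dictionary))   # ['研究生', '命', '的', '起源']
    print(reverse_max_match(sentence, dictionary))   # ['研究', '生命', '的', '起源']
```

The classic example above also shows why the two scanning directions can disagree, and why ambiguity resolution is needed on top of pure dictionary matching.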

Word segmentation method based on statistics

The main idea of this method is that words are stable combinations of characters: the more often adjacent characters co-occur in context, the more likely they are to form a word. Therefore the frequency or probability with which adjacent characters co-occur reflects how credible it is that they form a word. The co-occurrence frequencies of adjacent characters in a training corpus can be counted and their mutual information computed; mutual information reflects how tightly Chinese characters are bound to each other. When this tightness exceeds a certain threshold, the character group may be judged to constitute a word. This method is also called dictionary-free segmentation.
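
As a rough illustration of this idea (not the implementation of any particular tool), the following sketch counts adjacent-character co-occurrences in a tiny made-up corpus and computes pointwise mutual information; character pairs whose score exceeds an assumed threshold are kept as word candidates.

```python
import math
from collections import Counter


def pmi_word_candidates(corpus, threshold=2.0):
    """Toy statistical segmentation: score adjacent character pairs by
    pointwise mutual information and keep those above a threshold."""
    char_counts = Counter()
    pair_counts = Counter()
    total_chars = 0
    for sentence in corpus:
        total_chars += len(sentence)
        char_counts.update(sentence)
        pair_counts.update(sentence[i:i + 2] for i in range(len(sentence) - 1))
    total_pairs = sum(pair_counts.values())
    candidates = {}
    for pair, count in pair_counts.items():
        p_pair = count / total_pairs
        p_a = char_counts[pair[0]] / total_chars
        p_b = char_counts[pair[1]] / total_chars
        pmi = math.log(p_pair / (p_a * p_b), 2)
        if pmi >= threshold:
            candidates[pair] = round(pmi, 2)
    return candidates


corpus = ['我喜欢自然语言处理', '自然语言处理很有趣', '我喜欢处理数据']
print(pmi_word_candidates(corpus))
```

A real system would use a much larger corpus, longer n-grams and additional statistics (or a sequence model such as an HMM or CRF) rather than a single threshold.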

The main statistical models used in this approach include the N-gram model, the Hidden Markov Model (HMM), the Maximum Entropy model (ME), Conditional Random Fields (CRF) and so on.

In practical applications, this kind of segmentation algorithm is usually combined with the dictionary-based method, which combines the speed and efficiency of dictionary matching with the advantages of the dictionary-free approach: recognizing new words from context and resolving ambiguities automatically.

Semantic-based word segmentation method

Semantic word segmentation introduces semantic analysis to handle the linguistic information of natural language itself. Approaches include the extended transition network method, the knowledge-based semantic analysis method, the adjacency constraint method, the comprehensive matching method, the suffix segmentation method, the feature word library method, the matrix constraint method, the grammar analysis method and so on.

  • Extended transition network method. This method is based on the concept of the finite state machine. A finite state machine can only recognize regular languages; extending it with recursion yields the recursive transition network (RTN). In an RTN, the labels on the arcs can be not only terminal symbols (words of the language) or non-terminal symbols (part-of-speech classes) but also calls to other sub-networks that describe non-terminal symbols (such as the word-formation conditions of characters or strings). When the computer runs one sub-network it can call another sub-network, and such calls can be recursive. Using a lexical extended transition network allows the word segmentation stage and the syntactic processing stage of language understanding to interact, which helps resolve ambiguities in Chinese word segmentation.
  • Matrix constraint method. The basic idea is to first build a syntactic constraint matrix and a semantic constraint matrix, whose elements indicate whether a word with one part of speech may grammatically be adjacent to a word with another part of speech, and whether a word of one semantic class may logically be adjacent to a word of another semantic class; the machine uses these adjacency constraints to constrain the segmentation result while cutting.

Word segmentation method based on understanding

The understanding-based word segmentation method uses a computer to simulate human understanding of sentences in order to recognize words. The basic idea is to carry out syntactic and semantic analysis at the same time as segmentation, and to use the syntactic and semantic information to resolve ambiguity. It usually consists of three parts: a word segmentation subsystem, a syntactic-semantic subsystem and a general control subsystem. Under the coordination of the general control part, the word segmentation subsystem obtains syntactic and semantic information about words and sentences to judge segmentation ambiguities; that is, it simulates the process by which humans understand sentences. This method requires a large amount of linguistic knowledge and information. At present, understanding-based word segmentation methods mainly include the expert system approach and the neural network approach.

  • Expert system word segmentation. From the perspective of an expert system, segmentation knowledge (including common-sense segmentation knowledge and heuristic disambiguation knowledge, i.e. ambiguity resolution rules) is separated from the inference engine that performs segmentation, so that the knowledge base and the inference engine do not interfere with each other and the knowledge base is easy to maintain and manage. It can also detect intersection ambiguity fields and polysemous combination ambiguity fields, and has a certain self-learning ability.
  • Neural network word segmentation. This method simulates the parallel and distributed processing and numerical computation of the human brain. It stores segmentation knowledge implicitly and in distributed form inside the neural network, adjusts the internal weights through self-learning and training, and finally outputs the segmentation result automatically; typical models include LSTM, GRU and other neural networks. (A sketch of such a character-tagging network is given after this list.)
  • Integrated neural network and expert system word segmentation. This method first uses the neural network to segment. When the neural network cannot segment new words accurately, the expert system is activated to analyze and judge; it reasons over the knowledge base to obtain a preliminary analysis and triggers the learning mechanism to retrain the neural network. This method can exploit the advantages of both the neural network and the expert system and further improve segmentation accuracy.
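
To illustrate the neural network approach mentioned above, here is a minimal hypothetical sketch of a character-level BiLSTM tagger using the common BMES labelling scheme (each character is tagged as Begin, Middle or End of a word, or as a Single-character word). PyTorch, the hyperparameters and the toy vocabulary are assumptions made for demonstration; a real system is trained on a labelled corpus, often with a CRF layer on top.

```python
import torch
import torch.nn as nn

TAGS = ['B', 'M', 'E', 'S']  # Begin / Middle / End of a word, or Single-character word


class BiLSTMSegmenter(nn.Module):
    """Character-level BiLSTM tagger: each character gets one BMES label."""

    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.fc = nn.Linear(hidden_dim * 2, len(TAGS))

    def forward(self, char_ids):            # char_ids: (batch, seq_len)
        emb = self.embedding(char_ids)
        out, _ = self.lstm(emb)
        return self.fc(out)                 # (batch, seq_len, num_tags)


def tags_to_words(chars, tags):
    """Turn a BMES tag sequence back into a list of words."""
    words, current = [], ''
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ('E', 'S'):
            words.append(current)
            current = ''
    if current:
        words.append(current)
    return words


# Untrained usage example; real training minimizes cross-entropy (or CRF loss)
# against gold BMES tags from a segmented corpus.
sentence = '我喜欢自然语言处理'
vocab = {ch: i for i, ch in enumerate(sentence, start=1)}
ids = torch.tensor([[vocab[ch] for ch in sentence]])
model = BiLSTMSegmenter(vocab_size=len(vocab) + 1)
pred = model(ids).argmax(dim=-1)[0]
print(tags_to_words(sentence, [TAGS[i] for i in pred]))
```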

That’s the basic introduction to word segmentation algorithms. Next we’ll look at some of the more useful Python libraries for word segmentation and how to use them.

Word segmentation tools

Here are a few representative Python libraries that support word segmentation:

1. jieba

GitHub: github.com/fxsjy/jieba – Python library for word segmentation

Three word segmentation modes are supported:

  • Precise mode, which tries to cut the sentence as accurately as possible; suitable for text analysis.
  • Full mode, which scans out all the character sequences in a sentence that could possibly form words; it is very fast, but cannot resolve ambiguity.
  • Search engine mode, which, on the basis of precise mode, segments long words again to improve recall; suitable for search engine indexing.

In addition, jieba supports traditional Chinese word segmentation and custom dictionaries.

The algorithms used are statistics-based, mainly the following:

  • Efficient word-graph scanning based on a prefix dictionary, generating a directed acyclic graph (DAG) of all the possible word combinations formed by the Chinese characters in the sentence
  • Dynamic programming to find the maximum-probability path, i.e. the best segmentation based on word frequency (see the sketch after this list)
  • For unknown words, an HMM model of Chinese character-formation ability, decoded with the Viterbi algorithm
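
The following is a simplified sketch of the second step only, not jieba's actual implementation: given word candidates with assumed frequencies, it builds the DAG and runs right-to-left dynamic programming to pick the maximum-probability segmentation path (out-of-vocabulary characters are simply given a frequency of 1 here instead of jieba's HMM handling).

```python
import math


def max_prob_segmentation(sentence, freq, total):
    """Toy maximum-probability path over a word DAG.
    freq maps candidate words to counts; total is the corpus size."""
    n = len(sentence)
    # Build the DAG: for each start index, the end indices of all candidates.
    dag = {i: [j for j in range(i + 1, n + 1)
               if sentence[i:j] in freq or j == i + 1]
           for i in range(n)}
    # route[i] = (best log-probability of sentence[i:], best end index)
    route = {n: (0.0, 0)}
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(freq.get(sentence[i:j], 1) / total) + route[j][0], j)
            for j in dag[i])
    # Read the best path off the route table.
    words, i = [], 0
    while i < n:
        j = route[i][1]
        words.append(sentence[i:j])
        i = j
    return words


freq = {'我': 120, '去': 50, '北京': 100, '大学': 80, '北京大学': 60}
print(max_prob_segmentation('我去北京大学', freq, total=10000))  # ['我', '去', '北京大学']
```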

Precise mode segmentation

jieba.lcut() takes the same arguments as cut() but returns a list instead of a generator, and it defaults to precise mode. The code looks like this:

```python
import jieba

# A classic ambiguity-laden test sentence
string = '这个把手该换了，我不喜欢日本和服，别把手放在我的肩膀上，工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作'
result = jieba.lcut(string)
print(len(result), '/'.join(result))
```

Results:

38 the/hand/in/up /, / I/not/like / / / Japanese kimonos, / / handle/on/I / / shoulder / /, / ministry/female director/monthly/after/staff/departments / / to/mouth/account / 24 / mouth/switches/etc/technical/device / / install/work

As you can see, the segmentation result is quite good.

Full mode segmentation

To use full-mode word segmentation, add the cut_all argument and set it to True as follows:

```python
result = jieba.lcut(string, cut_all=True)
print(len(result), '/'.join(result))
```

The results are as follows:

51 The/hand/in/up / / / I/not/like / / / / don’t/handle / / Japanese kimonos on/I / / shoulder / / / / ministry/virgin female director/director/month/period/after/staff/departments / / to/mouth/oral/account / 24 / oral/exchange/switches/replacement/etc/technical / / technical/sex/device Of/installers/installers/jobs

Search engine mode segmentation

Search engine mode segmentation calls the lcut_for_search() method, as follows:

```python
result = jieba.lcut_for_search(string)
print(len(result), '/'.join(result))
```

The results are as follows:

42 the/hand/in/up /, / I/not/like / / / Japanese kimonos, / / handle/on/I / / shoulder / /, / ministry director/director/female/monthly/after/staff/departments / / to/mouth/account / 24 / mouth/exchange/replacement/switches/etc/technology/technical/device / / install/work

In addition, we can add words to a custom dictionary. If we want the Japanese kimono ("日本和服") to be treated as a whole, we can add it to the dictionary. The code is as follows:

```python
jieba.add_word('日本和服')
result = jieba.lcut(string)
print(len(result), '/'.join(result))
```

The results are as follows:

37 the/hand/in/up /, / I/not/like/Japanese kimono /, / / handle/on/I / / shoulder / /, / ministry/female director/monthly/after/staff/departments / / to/mouth/account / 24 / mouth/switches/etc/technical/device / / install/work

It can be seen that in the segmentation result the word "Japanese kimono" now appears as a whole, and the number of words is one fewer than in precise mode.

Part-of-speech tagging

In addition, jieba supports part-of-speech tagging, which outputs the part of speech of each word after segmentation, as follows:

```python
import jieba.posseg as pseg

words = pseg.lcut(string)
print(list(map(lambda x: list(x), words)))
```

Running results:

[[‘ this’, ‘r’], [‘ hand ‘, ‘v’], [‘ the ‘, ‘r’], [‘ in ‘, ‘v’], [‘ the ‘, ‘ul’], [‘, ‘and’ x ‘], [‘ me ‘, ‘r’], [‘ no ‘, ‘d’], [‘ like ‘, ‘v’], [‘ Japanese kimono ‘and’ x ‘], [‘, ‘and’ x ‘], [‘ don’t ‘, ‘r’], [‘ hand ‘, ‘v’], [‘ on ‘, ‘v’], [‘ me ‘, ‘r’], [‘ the ‘, ‘uj], [‘ shoulders’,’ n ‘], [‘ on ‘, ‘f’], [‘, ‘and’ x ‘], [‘ ministry place ‘, ‘n’], [‘ female stewards’, ‘n’], [‘ monthly ‘, ‘r’], [‘ a ‘, ‘p’], [‘ subordinates’, ‘v’], [‘ department ‘, ‘n’], [‘ is’, ‘d’], [‘ to ‘, ‘v’], [‘ mouth ‘ ‘n’], [‘ replacement ‘, ‘n’], [‘ 24 ‘, ‘m’], [‘ mouth ‘, ‘n’], [‘ switch ‘, ‘n’], [‘ and ‘, ‘u’], [‘ technical ‘, ‘n’], [‘ device ‘, ‘n’], [‘ the ‘, ‘uj], [‘ install’ ‘v’], [‘ work ‘, ‘vn’]]

The meaning of the part-of-speech tags can be looked up at gist.github.com/luw2007/601… .

2. SnowNLP

SnowNLP: Simplified Chinese Text Processing is a library for conveniently processing Chinese text. It was inspired by TextBlob: since most natural language processing libraries are aimed at English, the author wrote a library that is convenient for processing Chinese. Unlike TextBlob it does not rely on NLTK; all algorithms are implemented from scratch and trained dictionaries are included. GitHub address: github.com/isnowfy/sno… .

Word segmentation

Word segmentation is based on a character-based generative model (paper: aclweb.org/anthology//… ). Using the same example as above, the code is as follows:

```python
from snownlp import SnowNLP

string = '这个把手该换了，我不喜欢日本和服，别把手放在我的肩膀上，工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作'
s = SnowNLP(string)
result = s.words
print(len(result), '/'.join(result))
```

Running results:

40 the/handle/the / / /, / I/not/like/Japan/and / /, / don’t handle/on/I / / shoulder / /, / work/letter virgin director / / monthly/after/staff/departments / / to/mouth/account / 24 / mouth/switches/etc/technical/device / / install/work

After observation, we can find that the segmentation result is not ideal: the kimono is split apart, the ministry office is split apart, and the female officer is split apart as well.

SnowNLP also supports features such as part-of-speech tagging (HMM), sentiment analysis, pinyin conversion (Trie tree), and keyword and summary extraction (TextRank).

Let’s take a quick look at an example:

```python
print('Tags:', list(s.tags))
print('Sentiments:', s.sentiments)
print('Pinyin:', s.pinyin)
```

Running results:

123Tags: [('this'.'r'), ('handle'.'Ng'), ('the'.'r'), ('in'.'v'), ('了'.'y'), (', '.'w'), ('我'.'r'), ('no'.'d'), ('like'.'v'), ('Japan'.'ns'), ('and'.'c'), ('take'.'v'), (', '.'w'), ('Keep your hands off.'.'ad'), ('放在'.'v'), ('我'.'r'), ('the'.'u'), ('shoulder'.'n'), ('on'.'f'), (', '.'w'), ('work'.'j'), ('Trust the virgin'.'j'), ('director'.'n'), ('every month'.'r'), ('a'.'p'), ('subordinates'.'v'), ('department'.'n'), ('all'.'d'), ('to'.'v'), ('mouth'.'d'), ('replacement'.'v'), ('24'.'m'), ('mouth'.'q'), ('switch'.'n'), ('and'.'u'), ('technical'.'n'), ('device'.'n'), ('the'.'u'), ('install'.'vn'), ('work'.'vn')] Sentiments: 0.015678817603646866 Pinyin:'zhe'.'ge'.'ba'.'shou'.'gai'.'huan'.'liao'.', '.'wo'.'bu'.'xi'.'huan'.'ri'.'ben'.'he'.'fu'.', '.'bie'.'ba'.'shou'.'fang'.'zai'.'wo'.'de'.'jian'.'bang'.'shang'.', '.'gong'.'xin'.'chu'.'nv'.'gan'.'shi'.'mei'.'yue'.'jing'.'guo'.'xia'.'shu'.'ke'.'shi'.'dou'.'yao'.'qin'.'kou'.'jiao'.'dai'.'24'.'kou'.'jiao'.'huan'.'ji'.'deng'.'ji'.'shu'.'xing'.'qi'.'jian'.'de'.'an'.'zhuang'.'gong'.'zuo']Copy the code
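
SnowNLP's keyword and summary extraction (TextRank) are more meaningful on longer text. A brief sketch with a made-up paragraph might look like this; keywords() and summary() take the number of items to return:

```python
from snownlp import SnowNLP

doc = SnowNLP('自然语言处理是计算机科学与人工智能的一个重要方向。'
              '它研究能实现人与计算机之间用自然语言进行有效通信的各种理论和方法。'
              '中文分词是自然语言处理中的一项基础任务。')
print('Keywords:', doc.keywords(3))  # top-3 keywords via TextRank
print('Summary:', doc.summary(2))    # 2-sentence summary via TextRank
```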

3. THULAC

THULAC (THU Lexical Analyzer for Chinese) is a Chinese lexical analysis toolkit developed by the Natural Language Processing and Social Humanities Computing Laboratory of Tsinghua University (GitHub: github.com/thunlp/THUL… ). It provides Chinese word segmentation and part-of-speech tagging, and has the following characteristics:

  • Strong capability. It is trained on the world's largest manually segmented and POS-tagged Chinese corpus (about 58 million words), so the model has strong tagging ability.
  • High accuracy. On the standard dataset Chinese Treebank (CTB5), its word segmentation F1 reaches 97.3% and its POS tagging F1 reaches 92.9%, on par with the best methods on this dataset.
  • High speed. Simultaneous word segmentation and POS tagging runs at 300KB/s, about 150,000 words per second; word segmentation alone reaches 1.3MB/s.

Let’s use an example to see the effect of word segmentation:

```python
import thulac

string = '这个把手该换了，我不喜欢日本和服，别把手放在我的肩膀上，工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作'
t = thulac.thulac()
result = t.cut(string)
print(result)
```

Running results:

[[‘ this’, ‘r’], [‘ hand ‘, ‘n’], [‘ the ‘, ‘v’], [‘ in ‘, ‘v’], [‘ the ‘, ‘u’], [‘, ‘, ‘w’], [‘ me ‘, ‘r’], [‘ no ‘, ‘d’], [‘ like ‘, ‘v’], [‘ Japan ‘, ‘ns], [‘ kimono’, ‘n’], [‘, ‘, ‘w’], [‘ don’t handle, ‘n’], [‘ put ‘, ‘v’], [‘ in ‘, ‘p’], [‘ me ‘, ‘r’], [‘ the ‘, ‘u’], [‘ shoulders’, ‘n’], [‘ on ‘, ‘f’], [‘, ‘, ‘w’], [‘ ministry place ‘, ‘n’], [‘ female ‘and’ a ‘], [‘ director ‘, ‘n’], [‘ monthly ‘, ‘r’], [‘ a ‘, ‘p’], [‘ subordinates’, ‘v’], [‘ department ‘, ‘n’], [‘ is’, ‘d’], [‘ to ‘, ‘v’], [‘ mouth ‘, ‘d’], [‘ replacement ‘, ‘v’], [‘ 24 ‘, ‘m’], [‘ mouth ‘, ‘q’], [‘ switch ‘, ‘n’], [‘ and ‘, ‘u’], [‘ technical ‘, ‘n’], [‘ device ‘, ‘n’], [‘ the ‘, ‘u’], [‘ install ‘, ‘v’], [‘ work ‘, ‘v’]]

4. NLPIR

The NLPIR word segmentation system, formerly the ICTCLAS lexical analysis system released in 2000 (GitHub: github.com/NLPIR-team/… ), is a Chinese word segmentation system developed by Dr. Zhang Huaping of the Beijing Institute of Technology. After more than ten years of continuous improvement it offers rich functionality and strong performance. NLPIR is a suite for processing raw text collections: it provides a visual display of the middleware's processing results and can also be used as a tool for processing small-scale data. Its main functions include Chinese word segmentation, part-of-speech tagging, named entity recognition, user dictionaries, new word discovery, keyword extraction and so on. For word segmentation there is also a Python implementation, GitHub: github.com/tsroten/pyn… .

The usage method is as follows:

```python
import pynlpir

pynlpir.open()
string = '这个把手该换了，我不喜欢日本和服，别把手放在我的肩膀上，工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作'
result = pynlpir.segment(string)
print(result)
```

The running results are as follows:

1 [('this'.'pronoun'), ('the'.'preposition'), ('hand'.'noun'), ('the'.'pronoun'), ('in'.'verb'), ('了'.'modal particle'), (', '.'punctuation mark'), ('我'.'pronoun'), ('no'.'adverb'), ('like'.'verb'), ('Japan'.'noun'), ('and'.'conjunction'), ('take'.'verb'), (', '.'punctuation mark'), ('don't'.'adverb'), ('the'.'preposition'), ('hand'.'noun'), ('let'.'verb'), ('in'.'preposition'), ('我'.'pronoun'), ('the'.'particle'), ('shoulder'.'noun'), ('on'.'noun of locality'), (', '.'punctuation mark'), ('work'.'noun'), ('letter'.'noun'), ('virgin'.'noun'), ('director'.'noun'), ('every month'.'pronoun'), ('a'.'preposition'), ('subordinates'.'verb'), ('department'.'noun'), ('all'.'adverb'), ('to'.'verb'), ('mouth'.'adverb'), ('replacement'.'verb'), ('24'.'numeral'), ('mouth'.'classifier'), ('switch'.'noun'), ('and'.'particle'), ('technical'.'noun'), ('device'.'noun'), ('the'.'particle'), ('install'.'verb'), ('work'.'verb')]Copy the code

Here the handle and the kimono are also separated.

5. NLTK

NLTK, the Natural Language Toolkit, is a commonly used toolkit for NLP. GitHub: github.com/nltk/nltk.

However, NLTK does not support Chinese word segmentation, as shown in the following example:

```python
from nltk import word_tokenize

string = '这个把手该换了，我不喜欢日本和服，别把手放在我的肩膀上，工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作'
result = word_tokenize(string)
print(result)
```

Results:

1 ['This handle should be changed. I don't like Japanese kimono. Don't put your hand on my shoulder.]Copy the code

If you want Chinese word segmentation, you can use FoolNLTK instead. It is trained with a BiLSTM and provides word segmentation, part-of-speech tagging, entity recognition and other functions; it also supports custom dictionaries, lets you train your own models and can process text in batches.

The usage method is as follows:

```python
import fool

string = '这个把手该换了，我不喜欢日本和服，别把手放在我的肩膀上，工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作'
result = fool.cut(string)
print(result)
```

Running results:

1 [['this'.'handle'.'the'.'in'.'了'.', '.'我'.'no'.'like'.'Japan'.'kimono'.', '.'don't'.'the'.'hand'.'let'.'in'.'我'.'the'.'shoulder'.'on'.', '.Department of Industry and Information Technology.'woman'.'director'.'every month'.'a'.'subordinates'.'department'.'all'.'to'.'kiss'.'mouth'.'replacement'.'24'.'mouth'.'switch'.'and'.'technical'.'device'.'the'.'install'.'work']]Copy the code

You can see that this segmentation result is quite good.

In addition, part-of-speech tagging and entity recognition can be carried out:

```python
result = fool.pos_cut(string)
print(result)

_, ners = fool.analysis(string)
print(ners)
```

Running results:

12 [[('this'.'r'), ('handle'.'n'), ('the'.'r'), ('in'.'v'), ('了'.'y'), (', '.'wd'), ('我'.'r'), ('no'.'d'), ('like'.'vi'), ('Japan'.'ns'), ('kimono'.'n'), (', '.'wd'), ('don't'.'d'), ('the'.'pba'), ('hand'.'n'), ('let'.'v'), ('in'.'p'), ('我'.'r'), ('the'.'ude'), ('shoulder'.'n'), ('on'.'f'), (', '.'wd'), (Department of Industry and Information Technology.'ns'), ('woman'.'b'), ('director'.'n'), ('every month'.'r'), ('a'.'p'), ('subordinates'.'v'), ('department'.'n'), ('all'.'d'), ('to'.'v'), ('kiss'.'a'), ('mouth'.'n'), ('replacement'.'v'), ('24'.'m'), ('mouth'.'q'), ('switch'.'n'), ('and'.'udeng'), ('technical'.'n'), ('device'.'n'), ('the'.'ude'), ('install'.'n'), ('work'.'n')]] [[(12, 15,'location'.'Japan')]]Copy the code

6. LTP

Language Technology Platform (LTP) is a Chinese language processing system developed by the Research Center for Social Computing and Information Retrieval at Harbin Institute of Technology. LTP defines an XML-based representation of language processing results and, on this basis, provides a rich, efficient, bottom-up set of Chinese processing modules (covering six core Chinese processing technologies including lexical, syntactic and semantic analysis), application programming interfaces based on dynamic link libraries (DLL), visualization tools, and it can also be used in the form of web services.

LTP has a Python wrapper, available at github.com/HIT-SCIR/py… ; the models can be downloaded from ltp.ai/download.ht… .

Example code is as follows:

```python
from pyltp import Segmentor

string = '这个把手该换了，我不喜欢日本和服，别把手放在我的肩膀上，工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作'
segmentor = Segmentor()
segmentor.load('./cws.model')
result = list(segmentor.segment(string))
segmentor.release()
print(result)
```

Running results:

41 the/handle/the / / /, / I/not/like / / / Japanese kimonos, / don’t/put/hands / / I / / shoulder / /, / ministry/virgin director / / monthly/after/staff/departments / / to/mouth/account / 24 / mouth/switches/etc/technical/device / / install/work

It can be seen that the ministry office and the female officer are not segmented correctly here either.
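
Besides segmentation, pyltp also exposes LTP's other lexical modules. Here is a sketch of part-of-speech tagging on top of the segmentation result above; the pos.model path is an assumption and refers to the POS model shipped in the same LTP model package as cws.model:

```python
from pyltp import Postagger

postagger = Postagger()
postagger.load('./pos.model')        # POS model from the LTP model package
postags = postagger.postag(result)   # tag the words segmented above
print(list(zip(result, postags)))
postagger.release()
```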

These are some basic usages of the word segmentation libraries above. Among them, jieba, THULAC and FoolNLTK are recommended.

Reference source

  • m635674608.iteye.com/blog/229883…
  • blog.csdn.net/flysky1991/…

This article was first published on Cui Qingcai's personal blog Jingmi: Python3 Web Crawler Development Practice tutorial.
