This article is participating in Python Theme Month. See the link for details

What is lexical analysis?

Lexical analysis refers to using computers to analyze the lexical structure of natural language: determining word boundaries, word categories, and so on. In a practical project this means segmenting text into words and classifying each word, which covers three tasks of increasing difficulty: word segmentation, part-of-speech tagging, and entity recognition, as shown below:

```text
Word segmentation:      Baidu / is / a / high-tech / company
Part-of-speech tagging: Baidu/n  is/v  a/m  high-tech/n  company/n
Entity recognition:     Baidu/ORG  is/v  a/m  high-tech/n  company/n
```

What’s the use of lexical analysis?

  • A single character often carries too little information to express meaning on its own, and Chinese is full of ambiguity, so words are chosen as the basic unit of meaning in natural language processing
  • From a search-engine perspective, working at word granularity strikes a balance between precision and recall
  • Lexical analysis is the most fundamental task in natural language processing and underpins all higher-level tasks

Applications of lexical analysis

  • Word segmentation is used most heavily in search engines, such as Baidu
  • Direct answers to top queries: searching “Yao Ming’s wife” can directly return “Ye Li”; a knowledge graph is involved as well, but the basic underlying task is lexical analysis
  • Using an intelligent voice assistant to set the water heater to 50 degrees requires converting speech to text and then identifying the key entities and commands in that text
  • Dialogue and question answering: for human and machine to understand each other and exchange meaning, lexical analysis is again required
  • Entity extraction, such as pulling the name, mobile number, address, and other fields out of a free-form express-delivery address

Technical development of lexical analysis

1. Dictionary-based methods

  • String matching: forward/backward maximum matching and similar schemes; a representative tool is the IK segmenter. Advantages: fast and easy to understand. Disadvantages: cannot resolve ambiguity or handle out-of-vocabulary (OOV) words.
  • Statistical language model: builds a directed acyclic graph (DAG) of candidate words from the dictionary and computes the maximum-probability path; representative approaches include N-gram models and Jieba. The probabilities alleviate ambiguity, but OOV words remain a problem.
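The forward-maximum-matching idea above can be sketched in a few lines of Python. This is a toy illustration with an assumed mini-dictionary and a hypothetical `forward_max_match` helper, not the IK segmenter's actual implementation:

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedily match the longest dictionary word from the left."""
    result = []
    i = 0
    while i < len(text):
        matched = None
        # Try the longest window first, shrinking until a dictionary hit;
        # fall back to a single character when nothing matches.
        for j in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + j]
            if word in dictionary or j == 1:
                matched = word
                break
        result.append(matched)
        i += len(matched)
    return result

dictionary = {"百度", "一家", "高科技", "公司"}
print("/".join(forward_max_match("百度是一家高科技公司", dictionary)))
# → 百度/是/一家/高科技/公司
```

Note how the single character 是 ("is") is emitted on its own because it is not in the dictionary, which is exactly why this scheme breaks down on out-of-vocabulary words.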

2. Sequence-labeling methods

  • Statistical methods: HMM, CRF, etc.
  • Deep learning methods: LSTM-CRF, FLAT, etc. LSTM-CRF is currently the most widely used; the LSTM encoder can also be replaced with a pre-trained model.
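Sequence labeling recasts segmentation as per-character tagging, typically with the B/M/E/S scheme (Begin / Middle / End of a word, or Single-character word); the model (HMM, CRF, LSTM-CRF, …) predicts a tag per character, and the tags are then decoded into words. A minimal sketch of that decoding step, with a hand-written tag sequence standing in for model output (the `decode_bmes` helper is an illustrative assumption):

```python
def decode_bmes(chars, tags):
    """Turn per-character B/M/E/S tags into a word list."""
    words, buf = [], ""
    for ch, tag in zip(chars, tags):
        buf += ch
        if tag in ("E", "S"):  # a word boundary closes here
            words.append(buf)
            buf = ""
    if buf:  # tolerate a dangling B/M at the end of the sequence
        words.append(buf)
    return words

tags = ["B", "E", "S", "B", "E", "B", "M", "E", "B", "E"]
print("/".join(decode_bmes("百度是一家高科技公司", tags)))
# → 百度/是/一家/高科技/公司
```

Because the tags are predicted from context rather than looked up in a dictionary, this family of methods can segment words it has never seen, which is how it addresses the OOV problem.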

3. Performance comparison (MSR dataset)

| Algorithm | Precision | Recall | F1 |
| --- | --- | --- | --- |
| Maximum matching | 89.41 | 94.64 | 91.95 |
| Statistical language model | 92.38 | 96.70 | 94.49 |
| CRF | 96.86 | 96.64 | 96.75 |
| LSTM-CRF | 97.20 | 97.30 | 97.30 |

From these results the dictionary-matching method actually holds up reasonably well, so it remains usable in scenarios where speed matters more than top accuracy. LSTM-CRF performs best on every metric, which is why it has the highest practical value.
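For reference, the F1 column is the harmonic mean of precision (P) and recall (R), F1 = 2PR / (P + R). A quick sanity check against the CRF row of the table:

```python
def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

print(round(f1(96.86, 96.64), 2))  # → 96.75, matching the CRF row
```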

Task demonstration

At present Baidu’s natural language processing performs quite well: it just took two gold medals and one silver in competition, a strong showing, and hopefully the momentum continues. See the linked article for details.

Jieba is used to perform tasks here. Install the package first:

```shell
pip install jieba
```

1. Word segmentation task

```python
import jieba

seg_str = "我是一个好孩子"  # "I am a good kid"
print("/".join(jieba.lcut(seg_str)))
```

Print result:

```text
我/是/一个/好孩子
```

(i.e. "I / am / a / good kid")

2. Part-of-speech tagging

```python
import jieba.posseg as pseg

words = pseg.cut("我是一个好孩子")  # "I am a good kid"
print(" ".join([w.word + "/" + w.flag for w in words]))
```

Print result:

```text
我/r 是/v 一个/m 好孩子/n
```

(i.e. I/pronoun am/verb a/numeral good-kid/noun)

3. Entity recognition

Jieba itself cannot perform entity recognition; below is a simple example of what such output looks like:

```text
Input: Baidu is a high-tech company
Entity recognition: Baidu/ORG  is/v  a/m  high-tech/n  company/n
```
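To make the idea concrete, entity recognition can be approximated with a simple gazetteer lookup over already-segmented words. The `ENTITIES` table and `tag_entities` helper below are toy assumptions for illustration only, not a real NER model (a real system would use a sequence-labeling model such as LSTM-CRF, as discussed above):

```python
# Toy gazetteer: maps known entity strings to entity types (assumed data).
ENTITIES = {"百度": "ORG", "姚明": "PER", "叶莉": "PER"}

def tag_entities(words):
    # Label each segmented word with its entity type, or "O" for non-entities.
    return [(w, ENTITIES.get(w, "O")) for w in words]

print(tag_entities(["百度", "是", "一家", "高科技", "公司"]))
# → [('百度', 'ORG'), ('是', 'O'), ('一家', 'O'), ('高科技', 'O'), ('公司', 'O')]
```

A pure lookup like this inherits all the weaknesses of dictionary matching (no ambiguity resolution, no unseen entities), which is exactly why sequence-labeling models dominate entity recognition in practice.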