
Author: Jclian, who likes algorithms, loves to share, and hopes to make more like-minded friends to go further together on the path of learning Python!

This article tries out three Chinese word segmentation tools: LTP from Harbin Institute of Technology, jieba, and pkuseg from Peking University. LTP additionally needs its segmentation model file, cws.model. The following five words are added to the user dictionary:

jing, Shao an, He Fengying, F-35 fighter, Edar Alkan

The Python code for the test is as follows:

# -*- coding: utf-8 -*-
import os
import jieba
import pkuseg
from pyltp import Segmentor

# Custom words shared by all three tools
lexicon = ['jing', 'Shao an', 'He Fengying', 'F-35 fighter', 'Edar Alkan']

# Segment with LTP
def ltp_segment(sent):
    cws_model_path = os.path.join('data/cws.model')  # model path; the model file is named `cws.model`
    lexicon_path = os.path.join('data/lexicon.txt')  # user lexicon file
    segmentor = Segmentor()
    segmentor.load_with_lexicon(cws_model_path, lexicon_path)
    words = list(segmentor.segment(sent))
    segmentor.release()
    return words

# Segment with jieba
def jieba_cut(sent):
    for word in lexicon:
        jieba.add_word(word)
    return list(jieba.cut(sent))

# Segment with pkuseg
def pkuseg_cut(sent):
    seg = pkuseg.pkuseg(user_dict=lexicon)
    words = seg.cut(sent)
    return words

sent = "After Yu Ting got married, his wife He Fengying bullied Shao an's mother again and again in those years. Yu Ting, who was afraid of his wife, did not even say a word, but Shao an's mother ignored him."
#sent = 'It was previously reported that Israel became the first country in the world to fly the F-35 in combat in May last year.'
#sent = 'The boat went to Xiaobird Island by the Yangtze River on April 8.'
#sent = 'Edar Alkan was born in 1958 in Ankara, Turkey, but spent much of his school life in the United States.'

print('ltp:', ltp_segment(sent))
print('jieba:', jieba_cut(sent))
print('pkuseg:', pkuseg_cut(sent))
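
The LTP function above reads its custom words from data/lexicon.txt, but the listing never creates that file. A minimal sketch of one way to generate it from the same word list follows. The helper name `write_lexicon` is hypothetical, and the one-word-per-line UTF-8 layout is an assumption based on how the LTP documentation describes external lexicons.

```python
import os
from typing import Iterable


def write_lexicon(words: Iterable[str], path: str) -> int:
    """Write one word per line to `path` (UTF-8); return how many were written."""
    # Strip stray whitespace, skip empty entries, drop duplicates (order kept).
    cleaned = list(dict.fromkeys(w.strip() for w in words if w.strip()))
    directory = os.path.dirname(path)
    if directory:
        # Create e.g. the `data/` folder if it does not exist yet.
        os.makedirs(directory, exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        for word in cleaned:
            f.write(word + "\n")
    return len(cleaned)
```

Calling `write_lexicon(lexicon, 'data/lexicon.txt')` once before `ltp_segment` would keep the file in step with the list passed to jieba and pkuseg, so all three tools are tested against the same words.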

For the first sentence, the output is as follows:

After Yu Ting got married, his wife He Fengying bullied Shao an’s mother again and again in those years. Yu Ting, who was afraid of his wife, did not even say a word, but Shao an’s mother did not care about him.

ltp: ['although', 'Yu Ting', 'home', 'then', ',', 'he', 'wife', 'He Fengying', 'the', 'in', 'the', 'less', 'mother', 'bullying', 'on', 'a', 'back', 'and', 'a', 'back', ',', 'fear', 'wife', 'the', 'Yu Ting', 'the', 'a', 'sound', 'and', 'no', 'dare', 'word', ',', 'but', 'Shao an', 'his mother', 'no', 'about', 'he', '.']

jieba: ['although', 'Yu Ting', 'home', 'then', ',', 'he', 'wife', 'He Fengying', 'the', 'in', 'the', 'less', 'mother', 'bullying', 'on', 'time', 'and', 'time', ',', 'afraid of his wife', 'the', 'Yu Ting', 'the', '1', 'and', 'not', 'word', ',', 'but Shao an', 'his mother', 'no', 'about', 'he', '.']

pkuseg: ['although', 'Yu Ting', 'home', 'then', ',', 'he', 'wife', 'He Fengying', 'the', 'in', 'the', 'less', 'mother', 'bullying', 'on', 'a', 'back', 'and', 'a', 'back', ',', 'fear', 'wife', 'the', 'Yu Ting', 'the', 'a', 'sound', 'and', 'no', 'dare', 'word', ',', 'but', 'Shao an', 'his mother', 'no', 'about', 'he', '.']

For the second sentence, the output is as follows:

It was previously reported that Israel became the first country in the world to fly the F-35 in combat in May last year.

ltp: ['according to', 'previously', 'reported', ',', 'Israel', 'to', 'last', 'May', 'become', 'world', 'on', '1', 'a', 'in', 'real', 'in', 'use', 'F-35', 'fighter', 'the', 'national', '.']

jieba: ['Accordingly', 'before', 'report', ',', 'Israel', 'to', 'last', '5', 'month', 'become', 'world', 'on', 'the first', 'in', 'real', 'in', 'use', 'F', '-', '35', 'the fighter', 'the', 'national', '.']

pkuseg: ['according to', 'previously', 'reported', ',', 'Israel', 'to', 'last', 'May', 'become', 'world', 'on', '1', 'a', 'in', 'real', 'in', 'use', 'F-35 fighter', 'the', 'national', '.']

For the third sentence, the output is as follows:

The boat went to Xiaobird Island by the Yangtze River on April 8.

ltp: ['Small boat', 'April', '8th', 'Via Changjiang', 'to', 'Little Bird Island', '.']

jieba: ['boat', '4', 'month', '8', 'the nikkei', 'the Yangtze river', 'to', 'small', 'bird island', '.']

pkuseg: ['boat', 'April', '8', 'jing', 'the Yangtze river', 'to', 'bird', 'island', '.']

For the fourth sentence, the output is as follows:

Edar Alkan was born in 1958 in Ankara, Turkey, but spent much of his school life in the United States.

ltp: ['1958', ',', 'Edar Alkan', 'born', 'in', 'Turkey', 'capital', 'Ankara', ',', 'but', 'he', 'the', 'to study', 'career', 'how', 'in', 'us', 'spend', '.']

jieba: ['1958', 'in', ',', 'Mr', 'total', ',', ',', 'he', 'born', 'in', 'Turkey', 'capital', 'Ankara', ',', 'but', 'he', 'the', 'to study', 'career', 'how', 'in', 'us', 'spend', '.']

pkuseg: ['1958', ',', 'Edar Alkan', 'born', 'in', 'Turkey', 'capital', 'Ankara', ',', 'but', 'he', 'the', 'to study', 'career', 'how', 'in', 'us', 'spend', '.']
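
Comparing long token lists by eye is error-prone. As a rough aid, and not part of the original test script, a small helper could report for each tool which custom words came out as a single token. The name `lexicon_hits` is hypothetical, and the token lists in the demo are made-up English stand-ins rather than real segmenter output.

```python
from typing import Dict, Iterable


def lexicon_hits(sent: str, tokens: Iterable[str], lexicon: Iterable[str]) -> Dict[str, bool]:
    """Map each lexicon word found in `sent` to True if it is a whole token."""
    token_set = set(tokens)
    # Only words that actually occur in the sentence are reported.
    return {word: word in token_set for word in lexicon if word in sent}


# Illustrative stand-in data (hypothetical, not real segmenter output).
lexicon = ["He Fengying", "F-35 fighter"]
sent = "Israel used the F-35 fighter in combat"
outputs = {
    "kept_whole": ["Israel", "used", "the", "F-35 fighter", "in", "combat"],
    "split_up": ["Israel", "used", "the", "F", "-", "35", "fighter", "in", "combat"],
}
for name, tokens in outputs.items():
    print(name, lexicon_hits(sent, tokens, lexicon))
# kept_whole {'F-35 fighter': True}
# split_up {'F-35 fighter': False}
```

The substring check `word in sent` is deliberately simple. For a very short custom word it may also match inside a longer word, so the result should be read as a hint rather than a verdict.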

Here is a brief summary of the tests above:

  1. User dictionaries: LTP and pkuseg handle them well, while jieba does not, mainly because some of the words in the custom dictionary contain punctuation marks. For a solution to this problem, see blog.csdn.net/weixin_4247…
  2. Judging from the second and third sentences, pkuseg segments best: ‘jing’ should be split off as a single word, yet LTP and jieba fail to do so even with the custom dictionary added. The same goes for “F-35 fighter”.
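
One generic workaround for custom words that a segmenter breaks apart is to cut the sentence at those words first and only hand the remaining pieces to the segmenter. The sketch below illustrates that idea. It is not necessarily the solution described in the linked post, the name `cut_with_protected_words` is hypothetical, and `str.split` stands in for a real segmenter such as `jieba.cut` so that the example runs without extra packages.

```python
import re
from typing import Callable, Iterable, List


def cut_with_protected_words(
    sent: str,
    lexicon: Iterable[str],
    segment: Callable[[str], Iterable[str]],
) -> List[str]:
    """Segment `sent`, keeping every lexicon word as one token."""
    # Longest words first, so a longer entry wins over one of its prefixes.
    words = sorted((w for w in lexicon if w), key=len, reverse=True)
    if not words:
        return list(segment(sent))
    # The capturing group makes re.split keep the matched lexicon words.
    pattern = re.compile("(" + "|".join(re.escape(w) for w in words) + ")")
    protected = set(words)
    tokens: List[str] = []
    for piece in pattern.split(sent):
        if not piece:
            continue  # re.split yields empty strings at the edges
        if piece in protected:
            tokens.append(piece)
        else:
            tokens.extend(segment(piece))
    return tokens


# Stand-in demo: whitespace splitting plays the role of the segmenter.
demo = cut_with_protected_words(
    "Israel used the F-35 fighter in combat", ["F-35 fighter"], str.split
)
print(demo)
# ['Israel', 'used', 'the', 'F-35 fighter', 'in', 'combat']
```

With jieba installed, passing `jieba.cut` as `segment` would apply the same idea to real text. The trade-off is that a protected word is split out wherever it appears, even inside a longer word, so this suits distinctive entries such as names better than very short ones.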

Overall, all three tools segment well and the differences between them are small, but where a custom dictionary is involved, pkuseg is without doubt the more stable. The author will give it more consideration when choosing a segmenter in the future.
