SnowNLP, a Python library for Simplified Chinese text processing, is introduced here. In addition to word segmentation, it provides conversion to pinyin (maximum matching over a Trie), Traditional-to-Simplified conversion (also maximum matching over a Trie), and other features. It is simple to use and quite powerful.

Install

$ pip install snownlp

Usage

SnowNLP is a Python class library for working with Chinese text conveniently, inspired by TextBlob. Since most natural language processing libraries target English, SnowNLP was written to make Chinese text just as easy to handle. It does not use NLTK; all algorithms are implemented from scratch and ship with pre-trained dictionaries. Note that the library works on Unicode, so decode your input to Unicode yourself.
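For example, if the input arrives as UTF-8 bytes, decode it before constructing a SnowNLP object. A minimal sketch (the file name and encoding are assumptions for illustration):

# Decode byte input to Unicode before handing it to SnowNLP.
# 'review.txt' and UTF-8 are illustrative assumptions.
from snownlp import SnowNLP

with open('review.txt', 'rb') as f:
    raw = f.read()              # raw bytes as read from disk
text = raw.decode('utf-8')      # decode to a Unicode string
s = SnowNLP(text)

With Unicode text in hand, the main features look like this: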

from snownlp import SnowNLP

s = SnowNLP(u'这个东西真心很赞')   # "This thing is really great."

s.words         # [u'这个', u'东西', u'真心',
                #  u'很', u'赞']

s.tags          # [(u'这个', u'r'), (u'东西', u'n'),
                #  (u'真心', u'd'), (u'很', u'd'),
                #  (u'赞', u'Vg')]

s.sentiments    # probability that the text is positive

s.pinyin        # [u'zhe', u'ge', u'dong', u'xi',
                #  u'zhen', u'xin', u'hen', u'zan']

s = SnowNLP("Traditional Characters" The term "traditional Chinese" is also common in Taiwan. ')

s.han           The term for "traditional Chinese"
                # is also common in Taiwan. '

# A short Chinese passage about natural language processing:
text = u'''
自然语言处理是计算机科学领域与人工智能领域中的一个重要方向。
它研究能实现人与计算机之间用自然语言进行有效通信的各种理论和方法。
自然语言处理是一门融语言学、计算机科学、数学于一体的科学。
因此，这一领域的研究将涉及自然语言，即人们日常使用的语言，
所以它与语言学的研究有着密切的联系，但又有重要的区别。
自然语言处理并不是一般地研究自然语言，
而在于研制能有效地实现自然语言通信的计算机系统，
特别是其中的软件系统。因而它是计算机科学的一部分。
'''

s = SnowNLP(text)

s.keywords(3)   # [u'语言', u'自然', u'计算机']

s.summary(3)    # [u'因而它是计算机科学的一部分',
                #  u'自然语言处理是一门融语言学、计算机科学、
                #    数学于一体的科学',
                #  u'自然语言处理是计算机科学领域与人工智能
                #    领域中的一个重要方向']
s.sentences     # the text split into sentences

s = SnowNLP([[u'这篇', u'文章'],   # "this article"
             [u'那篇', u'论文'],   # "that paper"
             [u'这个']])           # "this one"
s.tf
s.idf
s.sim([u'文章'])   # [0.3756070762985226, 0, 0]
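The s.sim scores above are computed with BM25 (listed under Features below). As a rough illustration of how BM25 ranks tokenised documents against a query, here is a generic, self-contained sketch; it is not SnowNLP's internal code, and its exact scores will differ from the output above.

import math

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenised document in `docs` against the tokenised `query`."""
    avgdl = sum(len(d) for d in docs) / float(len(docs))   # average document length
    n = len(docs)
    scores = []
    for doc in docs:
        score = 0.0
        for term in query:
            f = doc.count(term)                        # term frequency in this document
            df = sum(1 for d in docs if term in d)     # document frequency of the term
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
            score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores

docs = [[u'这篇', u'文章'], [u'那篇', u'论文'], [u'这个']]
print(bm25_scores([u'文章'], docs))   # the first document scores highest; the others get 0

Only the document that actually contains the query term gets a positive score, which matches the shape of the s.sim output above.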

Features

  • Chinese Word segmentation (character-based Generative Model)
  • Part-of-speech tagging (TnT 3-gram hidden Markov model)
  • Sentiment analysis (the training data is currently mostly shopping reviews, so accuracy on other domains may be poor; to be improved)
  • Text classification (Naive Bayes)
  • Convert to pinyin (maximum matching over a Trie)
  • Traditional to Simplified conversion (maximum matching over a Trie; see the sketch after this list)
  • Extracting text keywords (TextRank algorithm)
  • Extract text summary (TextRank algorithm)
  • Tf, idf
  • Sentence segmentation (splitting text into sentences)
  • Text similarity (BM25)
  • Python3 support (thanks to Erning)
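
The two conversion features rely on forward maximum matching over a trie: the dictionary is stored in a trie, and at each position of the input the longest matching dictionary entry is consumed. The following is a minimal, illustrative sketch of that idea with a toy dictionary; it is not SnowNLP's actual implementation.

def build_trie(dictionary):
    """dictionary maps a source word to its replacement (e.g. hanzi -> pinyin)."""
    root = {}
    for word, value in dictionary.items():
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node['$'] = value              # '$' marks the end of a dictionary entry
    return root

def max_match(text, trie):
    """Scan the text, always consuming the longest match found in the trie."""
    out, i = [], 0
    while i < len(text):
        node, j, best = trie, i, None
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if '$' in node:
                best = (j, node['$'])  # remember the longest match so far
        if best:
            i, value = best
            out.append(value)
        else:
            out.append(text[i])        # no dictionary entry: keep the character
            i += 1
    return out

# Toy dictionary, purely illustrative:
trie = build_trie({u'北京': 'bei jing', u'京都': 'jing du'})
print(max_match(u'北京都好', trie))    # the first two characters become 'bei jing'; the rest pass through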

Train

Training is currently supported for word segmentation, part-of-speech tagging, and sentiment analysis, and the original files used for training are all provided. For example, for word segmentation the relevant files are in the snownlp/seg directory:

from snownlp import seg
seg.train('data.txt')
seg.save('seg.marshal')
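# Part-of-speech tagging and sentiment analysis are trained the same way;
# uncomment the block you need: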
# from snownlp import tag
# tag.train('199801.txt')
# tag.save('tag.marshal')
# from snownlp import sentiment
# sentiment.train('neg.txt', 'pos.txt')
# sentiment.save('sentiment.marshal')

The trained model is saved as seg.marshal; then point data_path in snownlp/seg/__init__.py at the newly trained file.
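
For reference, the kind of line to adjust looks roughly like this (a sketch only; the exact code in snownlp/seg/__init__.py may differ between versions):

# Inside snownlp/seg/__init__.py, point data_path at your own trained model.
import os
data_path = os.path.join(os.path.dirname(os.path.abspath(__file__)),
                         'seg.marshal')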