Heart of the Machine reports

Author: Jiang Siyuan

Recently, GitHub user Wu Zheng released an open-source Chinese processing toolkit built on a bidirectional LSTM. The tool performs word segmentation, part-of-speech tagging, and named entity recognition, and can improve segmentation with a user-defined dictionary. Heart of the Machine gives a brief introduction to bidirectional LSTMs and reports the results of our test of the tool on Windows.

Project address: https://github.com/rockyzhengwu/FoolNLTK

According to the project, the features of this Chinese toolkit are as follows:

  • May not be the fastest open-source Chinese word segmenter, but it is probably the most accurate one

  • Trained on a BiLSTM model

  • Provides word segmentation, part-of-speech tagging, and entity recognition, all with relatively high accuracy

  • Supports user-defined dictionaries

Word segmentation is a fundamental technique in Chinese information processing because Chinese characters run together with no spaces between words, unlike English, where spaces naturally separate words. Although it is easy for a Chinese reader to split a string of characters into words, it is very challenging for a machine, so word segmentation has long been an important research problem in Chinese information processing.

As the project describes, the author built the whole model on a bidirectional LSTM, which may be why the author is so confident in its segmentation accuracy. For Chinese word segmentation, neural network-based methods typically adopt a "word embedding + bidirectional LSTM + CRF" architecture, letting the network learn features and minimizing the hand-crafted feature engineering of traditional CRF approaches.

The general neural architecture of a Chinese word segmentation system, in which the feature layer is a bidirectional LSTM. Source: Xinchi Chen et al. (2017).
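To make this architecture concrete, the following is a minimal sketch of such an "embedding + BiLSTM + CRF" character tagger in TensorFlow 1.x. This is our own illustration, not FoolNLTK's actual code: segmentation is cast as labeling each character with a tag such as B/M/E/S, and all sizes and variable names are assumptions.

    # Sketch of an "embedding + BiLSTM + CRF" segmenter in TensorFlow 1.x.
    # All dimensions and names are illustrative assumptions.
    import tensorflow as tf

    VOCAB_SIZE, EMB_DIM, HIDDEN, NUM_TAGS = 5000, 100, 128, 4  # 4 tags: B/M/E/S

    chars = tf.placeholder(tf.int32, [None, None])   # [batch, seq_len] character ids
    tags = tf.placeholder(tf.int32, [None, None])    # gold tag ids
    lengths = tf.placeholder(tf.int32, [None])       # true sequence lengths

    # Embedding layer: the "word vector" part of the architecture.
    emb = tf.get_variable("emb", [VOCAB_SIZE, EMB_DIM])
    x = tf.nn.embedding_lookup(emb, chars)

    # Bidirectional LSTM feature layer.
    cell_fw = tf.nn.rnn_cell.LSTMCell(HIDDEN)
    cell_bw = tf.nn.rnn_cell.LSTMCell(HIDDEN)
    (out_fw, out_bw), _ = tf.nn.bidirectional_dynamic_rnn(
        cell_fw, cell_bw, x, sequence_length=lengths, dtype=tf.float32)
    h = tf.concat([out_fw, out_bw], axis=-1)         # [batch, seq_len, 2*HIDDEN]

    # Per-character tag scores, decoded jointly by a CRF layer.
    scores = tf.layers.dense(h, NUM_TAGS)
    log_lik, trans = tf.contrib.crf.crf_log_likelihood(scores, tags, lengths)
    loss = tf.reduce_mean(-log_lik)
    train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)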

In addition to the deep method used in this toolkit, one of this year's ACL outstanding papers also describes a word segmentation approach: in "Adversarial Multi-Criteria Learning for Chinese Word Segmentation", Xinchi Chen, Zhan Shi, Xipeng Qiu, and Xuanjing Huang of Fudan University proposed a new framework that can train on corpora annotated under multiple segmentation criteria.

Since the toolkit is built mainly on a bidirectional LSTM, let us briefly explain this network before discussing the segmentation results we obtained. As the name implies, a bidirectional LSTM combines one LSTM that reads the sequence from its beginning with another LSTM that reads it from its end. Both the forward and the backward recurrent networks are composed of LSTM units. The following is the detailed structure of an LSTM unit, where z is the candidate input and z_i, z_o, and z_f are the pre-activations controlling the three gates; that is, they filter information through an activation function f. The activation function is usually chosen to be the sigmoid, because its output lies between 0 and 1 and indicates how far each of the three gates is open.

Image from the machine learning lecture notes of Li Hongyi

Given an input z, the product of the activated input g(z) and the input gate f(z_i), that is g(z)·f(z_i), represents the information retained after the input has been filtered. The forget gate controlled by z_f determines how much of the previously stored memory c to keep, and the retained memory can be written as c·f(z_f). The retained memory plus the useful part of the current input is passed on to the next step, so the updated memory is c' = g(z)·f(z_i) + c·f(z_f); c' thus carries all of the useful information kept so far. We then take the activation h(c') of this updated memory as the candidate output, where h is usually chosen to be tanh. What remains is the output gate controlled by z_o, which determines how much of the activated memory is actually emitted, so the final output of the LSTM unit can be written as a = h(c')·f(z_o).
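As an illustration only (not code from the toolkit), a single LSTM step in this notation takes a few lines of NumPy, assuming the gate pre-activations z, z_i, z_f, and z_o have already been computed from the current input and the previous hidden state:

    import numpy as np

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    def lstm_step(z, z_i, z_f, z_o, c):
        """One LSTM cell update, given precomputed gate pre-activations."""
        g = np.tanh(z)                               # filtered candidate input g(z)
        c_new = g * sigmoid(z_i) + c * sigmoid(z_f)  # c' = g(z)f(z_i) + c f(z_f)
        a = np.tanh(c_new) * sigmoid(z_o)            # output a = h(c')f(z_o)
        return a, c_new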

If we arrange a series of such LSTM units into the following form, they constitute a BiLSTM network:

Image from the machine learning lecture notes of Li Hongyi

As shown above, a forward-reading and a backward-reading LSTM network are combined to form the BiLSTM. If we train the forward and backward LSTMs simultaneously and feed the hidden states of the two recurrent networks into a shared output layer, we obtain the final output y. The advantage of a BiLSTM is its wide field of view: with a unidirectional recurrent network, at time step t+1 the model can only observe x_{t+1} and the inputs before it, but not x_{t+2} and beyond, whereas a bidirectional network observes the entire input sequence at every time step before producing its output.

So much for the BiLSTM background. Heart of the Machine successfully ran the model on Windows 10 with TensorFlow 1.4 and NumPy 1.13.3. We tested segmentation on three passages: general text, specialized text, and the same specialized text after loading a user dictionary. We used Python's built-in functions to read a TXT file and then segmented it; the segmentation returns an ordinary Python list, which makes subsequent processing easy.
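Concretely, our test loop looked like the following sketch, where the file name sample.txt is our own placeholder:

    import fool

    # Read the passage to be segmented from a UTF-8 text file.
    with open("sample.txt", encoding="utf-8") as f:
        text = f.read()

    tokens = fool.cut(text)  # returns an ordinary Python list of tokens
    print(tokens)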

For ordinary text, the tool indeed works well:

[Segmentation output: a Python list of tokens, rendered here as English glosses of the original Chinese. The general passage on artificial intelligence is segmented cleanly; terms such as 'artificial intelligence', 'machine intelligence', 'agent', and 'AGI' each come out as single tokens.]

For more specialized text, the tool tends to break terms into smaller fragments:

[Segmentation output: the Capsule passage is over-segmented; specialized terms such as 'input/output vector', 'instantiation parameters', and 'dynamic routing' are split into several smaller tokens, although 'Capsule' itself is kept intact.]

However, we can add these domain-specific terms to a user dictionary to improve the segmentation. The following is the output after loading the dictionary:

[Segmentation output: after the user dictionary is loaded, terms such as 'input/output vector', 'entity type', and 'instantiation parameters' are each returned as single tokens.]

The following shows how to install and use the open-source tool:

Installation


     
    pip install foolnltk


Usage

1. Word segmentation


     
    import fool

    text = "一个傻子在北京"  # "a fool in Beijing"
    print(fool.cut(text))
    # ['一个', '傻子', '在', '北京']  i.e. ['a', 'fool', 'in', 'Beijing']


Command-line segmentation


     
    python -m fool [filename]


2. User-defined dictionary

The dictionary format is as follows: the higher a word's weight and the longer the word, the more likely it is to appear in the output. Weight values should be greater than 1.


     
    难受香菇 10
    什么鬼 10
    分词工具 10
    北京 10
    北京天安门 10

(One entry per line, each a word followed by its weight. The entries gloss as "uncomfortable mushroom", "what the heck", "word segmentation tool", "Beijing", and "Beijing Tian'anmen".)


Loading the dictionary


     
    import fool

    path = "user_dict.txt"  # illustrative path to the dictionary file shown above
    fool.load_userdict(path)

    text = "我在北京天安门看你难受香菇"
    print(fool.cut(text))
    # ['我', '在', '北京天安门', '看', '你', '难受香菇']
    # i.e. ['I', 'at', 'Beijing Tian'anmen', 'watch', 'you', 'uncomfortable mushroom']


Deleting the dictionary

                                                                        
     
    fool.delete_userdict()


3. Part-of-speech tagging


     
    import fool

    text = "一个傻子在北京"  # "a fool in Beijing"
    print(fool.pos_cut(text))
    # [('一个', 'm'), ('傻子', 'n'), ('在', 'p'), ('北京', 'ns')]


4. Entity recognition

                                                                                
     
    import fool

    text = "一个傻子在北京"  # "a fool in Beijing"
    words, ners = fool.analysis(text)
    print(ners)
    # [(5, 8, 'location', '北京')]


The author has so far tested the toolkit only with Python 3 on Linux, but we found that it also works well on Windows.

This article is an original report by Heart of the Machine. Please contact this official account for authorization before reprinting.

✄ — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

Join Heart of the Machine (full-time reporter/intern) : [email protected]

Contribute or seek coverage: [email protected]

Advertising & Business partnerships: [email protected]