1. Chinese word segmentation

Chinese word segmentation, simply put, is dividing a sentence into words. Baidu Baike defines it as breaking a sequence of Chinese characters into individual words: the process of recombining a continuous character sequence into word sequences according to certain norms. For example:

我是中国人，我爱中国。(I am Chinese and I love China.)

After word segmentation, it might become:

我 / 是 / 中国人 / ，/ 我 / 爱 / 中国

The above is a relatively simple example. Chinese word segmentation is a basic step in processing unstructured Chinese text. English has a natural separator between words: the space. Chinese, however, has no explicit word boundaries, which makes segmentation difficult to some extent. For example:

我和特朗普通电话。(I spoke to Trump on the phone.)

This sentence can be segmented as:

我 / 和 / 特朗普 / 通 / 电话 (I / with / Trump / have / a phone call)

It can also be segmented as:

我 / 和 / 特朗 / 普通 / 电话 (where 普通, "ordinary", is wrongly taken as a word, splitting the name 特朗普)

Therefore, different word segmentation methods produce different results.

2. Word segmentation methods

Chinese word segmentation is a task in natural language processing (NLP), and its methods have gone through several stages of development:

  • Rule-based methods
  • Statistics-based methods
  • Neural-network-based methods

At present, the most popular approach is based on neural networks; a BiLSTM + CRF model can achieve good segmentation results. Jieba (Chinese for "stutter") is a popular open-source Chinese word segmenter with ports in many languages, including a Rust version, jieba-rs. Jieba supports four segmentation modes:

  • Accurate mode: cuts the sentence as precisely as possible; suitable for text analysis.
  • Full mode: scans out every word in the sentence that appears in the dictionary; very fast, but it cannot resolve ambiguity.
  • Search-engine mode: on top of accurate mode, long words are segmented again to improve recall; suitable for search-engine indexing.
  • Paddle mode: uses the PaddlePaddle deep learning framework to train a sequence-labeling (bidirectional GRU) network model for segmentation; it also supports part-of-speech tagging.

3. Installing the jieba segmenter

To install the Rust version of jieba, add the following to Cargo.toml:

[dependencies]
jieba-rs = "0.6"

This allows you to use the jieba segmenter from Rust.
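Note that the WebAssembly example in the next section also depends on wasm-bindgen, so the dependency section typically ends up with both crates (the version numbers here are illustrative):

```toml
[dependencies]
# Chinese word segmentation
jieba-rs = "0.6"
# Exposes the Rust function to the JavaScript host
wasm-bindgen = "0.2"
```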

4. Examples of stuttering participles

First, write a WebAssembly-exported function named cut: create a Jieba instance and use its cut method to segment the input, then join the tokens back into a single string so the result can be returned to JavaScript.

use wasm_bindgen::prelude::*;
use jieba_rs::Jieba;

#[wasm_bindgen]
pub fn cut(sent: String) -> String {
    let jieba = Jieba::new();
    // cut(sentence, hmm): hmm = false disables the HMM for out-of-vocabulary words
    let words = jieba.cut(&sent, false);
    // Join the tokens so the result crosses the Wasm boundary as one string
    words.join(" / ")
}

Next, run ssvmup build to compile the Rust source into WebAssembly bytecode and generate the accompanying JavaScript module for the Node.js host environment. Then import the cut function in app.js.

const { cut } = require('../pkg/ssvm_nodejs_starter_lib.js');

const http = require('http');
const url = require('url');
const hostname = '127.0.0.1';
const port = 3000;

const server = http.createServer((req, res) => {
  const queryObject = url.parse(req.url, true).query;
  res.statusCode = 200;
  // charset=utf-8 so the segmented Chinese text renders correctly
  res.setHeader('Content-Type', 'text/plain; charset=utf-8');
  res.end(cut(queryObject['sen']));
});

server.listen(port, hostname, () => {
  console.log(`Server running at http://${hostname}:${port}/`);
});

Finally, start the Node.js server with node node/app.js. Visiting http://127.0.0.1:3000/?sen=我是中国人 in a browser returns the following result: