TextAnalyzer

A machine-learning-based text analyzer.

So far, it supports hot word extraction, text classification, part-of-speech tagging, named entity recognition, Chinese word segmentation, address extraction, synonym recognition, text clustering, word2vec models, edit distance, and sentence similarity.

Features

Extracting hot words from a text:

  1. statistics by raw term frequency.
  2. statistics by the tf-idf algorithm.
  3. statistics with an additional score factor.
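The tf-idf score in step 2 weights a term's in-document frequency by how rare the term is across the corpus. A minimal sketch (illustrative only; the class and method names here are invented for the example, not the library's API):

```java
import java.util.Arrays;
import java.util.List;

public class TfIdfSketch {
    // tf-idf: term frequency in this document, damped by how many documents contain the term.
    static double tfIdf(String term, List<String> doc, List<List<String>> corpus) {
        long tf = doc.stream().filter(term::equals).count();
        long df = corpus.stream().filter(d -> d.contains(term)).count();
        double idf = Math.log((double) corpus.size() / (1 + df));
        return tf * idf;
    }

    public static void main(String[] args) {
        List<String> d1 = Arrays.asList("unemployment", "card", "unit");
        List<String> d2 = Arrays.asList("unit", "city");
        List<String> d3 = Arrays.asList("city", "card");
        List<List<String>> corpus = Arrays.asList(d1, d2, d3);
        // "unemployment" appears in only one document, so it scores higher than "unit".
        System.out.println(tfIdf("unemployment", d1, corpus));
        System.out.println(tfIdf("unit", d1, corpus));
    }
}
```

A rare term outscores a common one even at equal in-document frequency, which is why tf-idf surfaces better "hot words" than raw counts.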

Extracting addresses from a text.

Recognizing synonyms.

SVM Classifier

This analyzer can classify text with an SVM. The text is first vectorized; we train on labeled samples and then classify new text with the resulting model.

For convenience, the model, the tf-idf statistics, and the vectors are stored.

kmeans clustering && xmeans clustering

This analyzer supports clustering text with kmeans and xmeans.
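Text clustering first maps each document to a numeric vector (e.g. tf-idf weights); k-means then alternates between assigning each vector to its nearest center and moving each center to the mean of its cluster. A minimal sketch over 2-D points (illustrative only, not the library's implementation):

```java
import java.util.Arrays;

public class KMeansSketch {
    // Plain k-means: repeat {assign points to nearest center; recompute centers}.
    static int[] kmeans(double[][] pts, double[][] centers, int iters) {
        int[] labels = new int[pts.length];
        for (int it = 0; it < iters; it++) {
            // Assignment step: nearest center by squared Euclidean distance.
            for (int i = 0; i < pts.length; i++) {
                double best = Double.MAX_VALUE;
                for (int c = 0; c < centers.length; c++) {
                    double d = dist(pts[i], centers[c]);
                    if (d < best) { best = d; labels[i] = c; }
                }
            }
            // Update step: move each center to the mean of its assigned points.
            for (int c = 0; c < centers.length; c++) {
                double[] sum = new double[pts[0].length];
                int n = 0;
                for (int i = 0; i < pts.length; i++)
                    if (labels[i] == c) {
                        n++;
                        for (int d = 0; d < sum.length; d++) sum[d] += pts[i][d];
                    }
                if (n > 0)
                    for (int d = 0; d < sum.length; d++) centers[c][d] = sum[d] / n;
            }
        }
        return labels;
    }

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return s;
    }

    public static void main(String[] args) {
        double[][] pts = {{0, 0}, {0, 1}, {10, 10}, {10, 11}};
        int[] labels = kmeans(pts, new double[][]{{0, 0}, {10, 10}}, 5);
        System.out.println(Arrays.toString(labels)); // [0, 0, 1, 1]
    }
}
```

xmeans extends this by also searching for the best number of clusters instead of fixing k up front.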

vsm clustering

This analyzer supports clustering text with a vector space model (VSM).

part of speech tagging

It is implemented with an HMM model and decoded with the Viterbi algorithm.
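The decoding step is the standard Viterbi dynamic program: for each position and tag, keep the best-scoring path ending in that tag, then backtrack. A minimal sketch over a toy two-tag HMM with made-up probabilities (not the trained model):

```java
import java.util.Arrays;

public class ViterbiSketch {
    // obs: observation indices; start/trans/emit: HMM parameters (probabilities).
    static int[] viterbi(int[] obs, double[] start, double[][] trans, double[][] emit) {
        int nStates = start.length, T = obs.length;
        double[][] dp = new double[T][nStates]; // best log-score ending in state s at time t
        int[][] back = new int[T][nStates];     // predecessor state on that best path
        for (int s = 0; s < nStates; s++)
            dp[0][s] = Math.log(start[s]) + Math.log(emit[s][obs[0]]);
        for (int t = 1; t < T; t++)
            for (int s = 0; s < nStates; s++) {
                dp[t][s] = Double.NEGATIVE_INFINITY;
                for (int p = 0; p < nStates; p++) {
                    double score = dp[t - 1][p] + Math.log(trans[p][s]) + Math.log(emit[s][obs[t]]);
                    if (score > dp[t][s]) { dp[t][s] = score; back[t][s] = p; }
                }
            }
        // Backtrack from the best final state.
        int best = 0;
        for (int s = 1; s < nStates; s++) if (dp[T - 1][s] > dp[T - 1][best]) best = s;
        int[] path = new int[T];
        path[T - 1] = best;
        for (int t = T - 1; t > 0; t--) path[t - 1] = back[t][path[t]];
        return path;
    }

    public static void main(String[] args) {
        // States: 0 = NOUN, 1 = VERB; observations: 0 = "dog", 1 = "runs".
        double[] start = {0.7, 0.3};
        double[][] trans = {{0.3, 0.7}, {0.8, 0.2}};
        double[][] emit  = {{0.9, 0.1}, {0.2, 0.8}};
        System.out.println(Arrays.toString(viterbi(new int[]{0, 1}, start, trans, emit))); // [0, 1]
    }
}
```

Log-probabilities are used so long sentences do not underflow; the HMM trained by the library supplies the real start, transition, and emission tables.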

google word2vec model

This analyzer supports using a word2vec model.

chinese word segment

This analyzer supports Chinese word segmentation.

edit distance

This analyzer supports calculating edit distance at the character level or the word level.
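Character-level edit distance is the classic Levenshtein dynamic program; the word-level variant runs the same recurrence over tokens instead of characters. A minimal sketch (illustrative, not the library's implementation):

```java
public class EditDistanceSketch {
    // Levenshtein distance: minimum number of single-character inserts,
    // deletes, and substitutions turning a into b.
    static int editDistance(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) dp[i][0] = i;
        for (int j = 0; j <= b.length(); j++) dp[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++) {
                int sub = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                dp[i][j] = Math.min(dp[i - 1][j - 1] + sub,          // substitute or match
                           Math.min(dp[i - 1][j] + 1,               // delete from a
                                    dp[i][j - 1] + 1));             // insert into a
            }
        return dp[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(editDistance("what", "where")); // 3
    }
}
```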

sentence similarity

This analyzer supports calculating similarity between two sentences.
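One common way to score sentence similarity is cosine similarity over bag-of-words vectors; the library's exact measure may differ, but this sketch shows the idea:

```java
import java.util.HashMap;
import java.util.Map;

public class SentenceSimilaritySketch {
    // Cosine similarity of term-frequency vectors: 1.0 for identical bags of words.
    static double cosine(String s1, String s2) {
        Map<String, Integer> v1 = counts(s1), v2 = counts(s2);
        double dot = 0, n1 = 0, n2 = 0;
        for (Map.Entry<String, Integer> e : v1.entrySet())
            dot += e.getValue() * v2.getOrDefault(e.getKey(), 0);
        for (int c : v1.values()) n1 += c * c;
        for (int c : v2.values()) n2 += c * c;
        return dot / (Math.sqrt(n1) * Math.sqrt(n2));
    }

    // Naive whitespace/punctuation tokenizer; Chinese text would need a segmenter first.
    static Map<String, Integer> counts(String s) {
        Map<String, Integer> m = new HashMap<>();
        for (String w : s.toLowerCase().split("\\W+"))
            if (!w.isEmpty()) m.merge(w, 1, Integer::sum);
        return m;
    }

    public static void main(String[] args) {
        System.out.println(cosine("We are Chinese.", "We are Chinese."));
        System.out.println(cosine("We are Chinese.", "They are Japanese."));
    }
}
```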

How To Use

It is as simple as this:

Extracting Hot Words

  1. Index a document and get a docId.

long docId = TextIndexer.index(text);

  2. Extract by docId.

HotWordExtractor extractor = new HotWordExtractor();
List<Result> list = extractor.extract(0, 20, false);
if (list != null)
    for (Result s : list)
        System.out.println(s.getTerm() + ":" + s.getFrequency() + ":" + s.getScore());

A result contains the term, its frequency, and its score.

Unemployment card: 1:0.31436604
Registered residence: 1:0.30099702
Unit: 1:0.29152703
Withdrawal: 1:0.27927202
Claim: 1:0.27581802
Employee: 1:
Labor: 1:0.27370203
Relationship: 1:0.27080503
City: 1:0.27080503
Termination: 1:0.27080503

Extracting Address

String str = "xxxx";
AddressExtractor extractor = new AddressExtractor();
List<String> list = extractor.extract(str);

SVM Classifier

  1. Train the samples.

SVMTrainer trainer = new SVMTrainer();
trainer.train();

  2. Predict the text's class.

double[] data = trainer.getWordVector(text);
trainer.predict(data);

Kmeans Clustering && Xmeans Clustering

List<String> list = DataReader.readContent(KMeansCluster.DATA_FILE);
int[] labels = new KMeansCluster().learn(list);

VSM Clustering

List<String> list = DataReader.readContent(VSMCluster.DATA_FILE);
List<String> labels = new VSMCluster().learn(list);

Part Of Speech Tagging

HMMModel model = new HMMModel();
model.train();
ViterbiDecoder decoder = new ViterbiDecoder(model);
decoder.decode(words);

Define Your Own Named Entity

MITIE is an information extraction library from the MIT NLP group; its GitHub repository is https://github.com/mit-nlp/MITIE .

train total_word_feature_extractor

Prepare your word set; you can put it into a txt file in the 'data' directory.

Then run:

git clone https://github.com/mit-nlp/MITIE.git
cd MITIE/tools/wordrep
mkdir build
cd build
cmake ..
cmake --build . --config Release
wordrep -e data

Finally you get the total_word_feature_extractor model.

train ner_model

We can train the NER model from Java, C++, or Python; in every case it must be trained with the total_word_feature_extractor model.

In Java:

NerTrainer nerTrainer = new NerTrainer("model/mitie_model/total_word_feature_extractor.dat");

In C++:

ner_trainer trainer("model/mitie_model/total_word_feature_extractor.dat");

In Python:

trainer = ner_trainer("model/mitie_model/total_word_feature_extractor.dat")

build shared library

Run the commands below:

cd mitielib
mkdir build
cd build
cmake ..
cmake --build . --config Release --target install

Then we get the following:

-- Install configuration: "Release"
-- Installing: D:/MITIE/mitielib/java/../javamitie.dll
-- Installing: D:/MITIE/mitielib/java/../javamitie.jar
-- Up-to-date: D:/MITIE/mitielib/java/../msvcp140.dll
-- Up-to-date: D:/MITIE/mitielib/java/../vcruntime140.dll
-- Up-to-date: D:/MITIE/mitielib/java/../concrt140.dll

Word2vec

We must set the word2vec path as a system property at startup, like this: -Dword2vec.path=D:\Google_word2vec_zhwiki1710_300d.bin.

Word2Vec vec = Word2Vec.getInstance();
System.out.println("Dog | Cat: " + vec.wordSimilarity("Dog", "Cat"));

Segment

DictSegment segment = new DictSegment();
System.out.println(segment.seg("I am Chinese."));


Edit Distance

At the character level:

CharEditDistance cdd = new CharEditDistance();
cdd.getEditDistance("what", "where");
cdd.getEditDistance("We are Chinese.", "They are Japanese, Shikoku.");
cdd.getEditDistance("Is me", "I am");

At the word level:

List<EditBlock> list1 = new ArrayList<>();
list1.add(new EditBlock("Computer", ""));
list1.add(new EditBlock("How much", ""));
list1.add(new EditBlock("Money", ""));
List<EditBlock> list2 = new ArrayList<>();
list2.add(new EditBlock("Computer", ""));
list2.add(new EditBlock("How much", ""));
list2.add(new EditBlock("Money", ""));
// ed is the word-level edit distance instance
ed.getEditDistance(list1, list2);

Sentence Similarity

String s1 = "We are Chinese.";
String s2 = "They are Japanese, Shikoku.";
SentenceSimilarity ss = new SentenceSimilarity();
System.out.println(ss.getSimilarity(s1, s2));
s1 = "We are Chinese.";
s2 = "We are Chinese.";
System.out.println(ss.getSimilarity(s1, s2));
