Natural Language Processing (NLP): An introduction to using HanLP for word segmentation and part-of-speech tagging.

Introduction to HanLP

HanLP is an NLP toolkit made up of a series of models and algorithms, aimed at bringing natural language processing into production environments. It is feature-complete, efficient, clearly structured, and customizable, and it ships with an up-to-date corpus.

The deep-learning-based HanLP 2.0 is currently in alpha testing. Java users may waste some time on the official website trying to work out how to use it, because 2.0 currently seems to be available only for Python, is still a pre-release, and is now offered through commercial online API calls. Most of the documentation covers the online API for 2.0, so for now, if you are using Java, just look at the 1.x branch and use that. GitHub: github.com/hankcs/HanL… . According to the documentation, there are two ways to use HanLP: the first is to depend on it directly through Maven, and the second is to download the jar, data, and configuration files. Now let's try HanLP out.

Using Maven

1. Let’s build a simple Maven project, HanlpDemo

For convenience, HanLP provides a portable version of the package with the data built in. Add the dependency below to pom.xml and reload Maven so that it is downloaded.

<dependency>
    <groupId>com.hankcs</groupId>
    <artifactId>hanlp</artifactId>
    <version>portable-1.8.2</version>
</dependency>

2. Write a simple test method

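import java.util.List;

import com.hankcs.hanlp.HanLP;
import com.hankcs.hanlp.seg.common.Term;
import com.hankcs.hanlp.tokenizer.NLPTokenizer;
import com.hankcs.hanlp.tokenizer.StandardTokenizer;

/**
 * Segments the input with three different HanLP entry points and prints each result.
 *
 * @param inputStr the text to segment
 */
public static void HanLpSegment(String inputStr) {
    // Standard segmentation
    List<Term> termList = StandardTokenizer.segment(inputStr);
    System.out.println(termList);
    // HanLP.segment, the convenience entry point
    System.out.println(HanLP.segment(inputStr));
    // NLP segmentation (this is the call that fails with the portable package)
    System.out.println(NLPTokenizer.segment(inputStr));
}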

3. Run it and look at the results

HanLP offers many segmentation methods: standard segmentation, NLP segmentation, index segmentation, N-shortest-path segmentation, and so on. The rest are not tested here. Of the three calls above, the first two produced results, with each word tagged with its part of speech. So why did the third one fail (failed to open data/model/perceptron/large/cws.bin)?

The reason is that the Maven approach is zero-configuration and only gives you the basic functionality (everything except character-based word formation and dependency parsing). If you have custom requirements, use hanlp.properties (the portable version also supports hanlp.properties). NLPTokenizer.segment evidently relies on a model that the portable package does not include. So we turn to the second approach, which is to configure hanlp.properties. Since the project already depends on HanLP through Maven, there is no need to add the jar. Just download the data and configure hanlp.properties.

Download jar, data, and hanlp.properties

Download: data.zip

After downloading and unzipping, put the data folder under src/main/resources. It can also go somewhere else, as long as the data path is configured in hanlp.properties.

Download the jar and the configuration file: hanlp-release.zip

In hanlp.properties, change root=D:/JavaProjects/hanlp/ to root=./src/main/resources

This is the second way of setting up HanLP. Because the project already has the Maven dependency, we do not need to add the jar here. Without Maven, the jar would have to be added to the project's references.
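Since the earlier failure came from a model file that could not be opened, it can be worth checking that the configured root really contains that file before running again. Below is a minimal sketch using only the Java standard library. The class HanlpDataCheck and its methods are hypothetical helpers, not part of HanLP, and the sketch assumes that the model path from the error message (data/model/perceptron/large/cws.bin) is resolved relative to root and that the location of hanlp.properties is passed as an argument.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

// Hypothetical helper, not part of HanLP: checks where "root" points.
class HanlpDataCheck {

    // Model path taken from the error message; assumed to be relative to root.
    static final String MODEL = "data/model/perceptron/large/cws.bin";

    // Reads the "root" entry from properties-formatted text.
    static String readRoot(Reader reader) {
        Properties props = new Properties();
        try {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String root = props.getProperty("root");
        if (root == null) {
            throw new IllegalStateException("no root entry found");
        }
        return root.trim();
    }

    // Resolves the model file against the configured root.
    static Path modelPath(String root) {
        return Paths.get(root).resolve(MODEL).normalize();
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the properties file is in the working directory unless a path is given.
        Path config = Paths.get(args.length > 0 ? args[0] : "hanlp.properties");
        try (Reader reader = Files.newBufferedReader(config, StandardCharsets.UTF_8)) {
            Path model = modelPath(readRoot(reader));
            System.out.println(model + (Files.exists(model) ? " found" : " MISSING"));
        }
    }
}
```

If the file is reported as missing, the usual suspects are a root that points one directory too high or too low, or a data folder that was not fully unzipped.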

3. Run again

This time NLPTokenizer.segment works, and its result differs from standard segmentation: "July 15, 2021" is tagged as a time (/t). The documentation says that NLPTokenizer performs part-of-speech tagging and named entity recognition, which is why the date comes out as a single time entity tagged /t. So let's look at what part-of-speech tagging and named entities are. My previous article defined them as well. The definitions are not my own, and different sources word them differently while meaning much the same thing.
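The printed results pair each word with its tag in a word/tag form, as with the /t time tag above. As a small sketch of working with that format, the code below splits such strings back into a word and a tag and picks out the words carrying a given tag. The class TaggedWord is a hypothetical helper, not a HanLP class, and the sample tokens are made up for illustration rather than captured from a real run.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical holder for one "word/tag" pair; not a HanLP class.
class TaggedWord {
    final String word;
    final String tag;

    TaggedWord(String word, String tag) {
        this.word = word;
        this.tag = tag;
    }

    // Splits at the LAST slash so that words containing '/' stay intact.
    static TaggedWord parse(String token) {
        int slash = token.lastIndexOf('/');
        if (slash < 0) {
            return new TaggedWord(token, ""); // no tag present
        }
        return new TaggedWord(token.substring(0, slash), token.substring(slash + 1));
    }

    // Keeps only the words carrying the given tag, e.g. "t" for time.
    static List<String> wordsWithTag(List<String> tokens, String tag) {
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            TaggedWord tw = parse(token);
            if (tw.tag.equals(tag)) {
                result.add(tw.word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Illustrative tokens only, not real HanLP output.
        List<String> tokens = Arrays.asList("2021年7月15日/t", "天气/n", "很/d", "好/a");
        System.out.println(wordsWithTag(tokens, "t"));
    }
}
```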

What is part-of-speech tagging:

A part of speech is a basic grammatical attribute of a word, also called its word class. Part-of-speech tagging is the process of determining the grammatical category of each word in a given sentence and labeling the word with it.

What is a named entity:

Named Entity Recognition (NER), also known as proper name recognition, means recognizing entities with a specific meaning in text. Its purpose is to identify person names, place names, organization names, and other named entities in a corpus.

Named entities are the objects that named entity recognition studies. They are generally divided into three categories (entity, time, and number) and seven sub-categories (person name, place name, organization name, time, date, currency, and percentage).
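The three categories and seven sub-categories can be written down as a small data structure. The sketch below does that with two Java enums. The type names are hypothetical, and the assignment of each sub-category to a category is an assumption based on the usual grouping (names under entity, time and date under time, currency and percentage under number), since the text lists the two levels without spelling out the mapping.

```java
import java.util.ArrayList;
import java.util.List;

// The three categories named in the text.
enum EntityCategory { ENTITY, TIME, NUMBER }

// The seven sub-categories, each mapped to its (assumed) category.
enum EntityType {
    PERSON(EntityCategory.ENTITY),
    PLACE(EntityCategory.ENTITY),
    ORGANIZATION(EntityCategory.ENTITY),
    TIME(EntityCategory.TIME),
    DATE(EntityCategory.TIME),
    CURRENCY(EntityCategory.NUMBER),
    PERCENTAGE(EntityCategory.NUMBER);

    final EntityCategory category;

    EntityType(EntityCategory category) {
        this.category = category;
    }

    // Lists the sub-categories that belong to one category.
    static List<EntityType> ofCategory(EntityCategory category) {
        List<EntityType> result = new ArrayList<>();
        for (EntityType type : values()) {
            if (type.category == category) {
                result.add(type);
            }
        }
        return result;
    }
}

class NamedEntityTaxonomy {
    public static void main(String[] args) {
        for (EntityCategory category : EntityCategory.values()) {
            System.out.println(category + " -> " + EntityType.ofCategory(category));
        }
    }
}
```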

Final thoughts

With a grasp of the basic NLP concepts and an NLP tool in hand, I can count this as getting started with NLP (jiong). For now I have no practical need for it and am learning purely out of personal interest. The time would be spent one way or another anyway, so I hope to keep learning, practicing, and sharing!

References: Wikipedia, HanLP website, HanLP GitHub