Preface

This article shows how to use OpenNLP for part-of-speech tagging.

POS Tagging

Part-of-speech (POS) tagging is the process of labeling each word in a text with its part of speech, such as noun or verb. Each label is called a tag, or annotation.

Two tag sets are in common use for Chinese: the PKU tag set and the Penn tag set. Modern Chinese words fall into two broad categories covering 12 parts of speech. Content words: nouns, verbs, adjectives, numerals, quantifiers (measure words), and pronouns. Function words: adverbs, prepositions, conjunctions, auxiliary words, interjections, and onomatopoeia.

Most POS taggers are built either on a hidden Markov model (HMM) decoded with the Viterbi algorithm, or on a maximum entropy model.
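
To make the HMM + Viterbi idea concrete, here is a minimal sketch of Viterbi decoding over a toy three-tag model. It is not how OpenNLP works internally (POSTaggerME is a maximum entropy tagger); the class name ViterbiSketch, the three-word vocabulary, and every probability in it are made-up illustration values.

```java
/** Toy HMM tagger. All names and probabilities are hypothetical illustration values. */
class ViterbiSketch {
    static final String[] TAGS = {"DT", "NN", "VB"};
    static final String[] WORDS = {"the", "dog", "barks"};
    // START[s]: probability that a sentence starts in tag s
    static final double[] START = {0.8, 0.1, 0.1};
    // TRANS[p][s]: probability of tag s given previous tag p
    static final double[][] TRANS = {
        {0.05, 0.90, 0.05},
        {0.10, 0.30, 0.60},
        {0.40, 0.40, 0.20}};
    // EMIT[s][w]: probability of word w given tag s
    static final double[][] EMIT = {
        {0.90, 0.05, 0.05},
        {0.05, 0.70, 0.25},
        {0.05, 0.15, 0.80}};

    /** Returns the most likely tag index sequence for the given word indices. */
    static int[] decode(int[] obs, double[] start, double[][] trans, double[][] emit) {
        int n = obs.length;
        int k = start.length;
        if (n == 0) {
            return new int[0];
        }
        // score[t][s]: best log-probability of any path that ends in tag s at position t
        double[][] score = new double[n][k];
        // back[t][s]: previous tag on that best path
        int[][] back = new int[n][k];
        for (int s = 0; s < k; s++) {
            score[0][s] = Math.log(start[s]) + Math.log(emit[s][obs[0]]);
        }
        for (int t = 1; t < n; t++) {
            for (int s = 0; s < k; s++) {
                score[t][s] = Double.NEGATIVE_INFINITY;
                for (int p = 0; p < k; p++) {
                    double candidate = score[t - 1][p] + Math.log(trans[p][s]);
                    if (candidate > score[t][s]) {
                        score[t][s] = candidate;
                        back[t][s] = p;
                    }
                }
                score[t][s] += Math.log(emit[s][obs[t]]);
            }
        }
        // Pick the best final tag, then follow the back-pointers to the start.
        int best = 0;
        for (int s = 1; s < k; s++) {
            if (score[n - 1][s] > score[n - 1][best]) {
                best = s;
            }
        }
        int[] path = new int[n];
        path[n - 1] = best;
        for (int t = n - 1; t > 0; t--) {
            path[t - 1] = back[t][path[t]];
        }
        return path;
    }

    public static void main(String[] args) {
        int[] path = decode(new int[] {0, 1, 2}, START, TRANS, EMIT); // "the dog barks"
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < path.length; i++) {
            out.append(i == 0 ? "" : " ").append(TAGS[path[i]]);
        }
        System.out.println(out); // prints "DT NN VB"
    }
}
```

Working in log space avoids numeric underflow on long sentences, which is why the sketch adds logarithms instead of multiplying probabilities.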

OpenNLP provides the POSTaggerME class for basic POS tagging and the ChunkerME class for chunking.

POSTaggerME

    public static POSModel trainPOSModel(ModelType type) throws IOException {
        TrainingParameters params = new TrainingParameters();
        params.put(TrainingParameters.ALGORITHM_PARAM, type.toString());
        params.put(TrainingParameters.ITERATIONS_PARAM, 100);
        params.put(TrainingParameters.CUTOFF_PARAM, 5);

        return POSTaggerME.train("eng", createSampleStream(), params,
                new POSTaggerFactory());
    }

    private static ObjectStream<POSSample> createSampleStream() throws IOException {
        InputStreamFactory in = new ResourceAsStreamFactory(POSTaggerMETest.class,
                "postag/AnnotatedSentences.txt");

        return new WordTagSampleStream(new PlainTextByLineStream(in, StandardCharsets.UTF_8));
    }

    @Test
    public void testPOSTagger() throws IOException {
        POSModel posModel = trainPOSModel(ModelType.MAXENT);

        POSTagger tagger = new POSTaggerME(posModel);

        // Tag a pre-tokenized sentence; one tag is returned per token.
        String[] tags = tagger.tag(new String[] {
                "The", "driver", "got", "badly", "injured", "."});

        Assert.assertEquals(6, tags.length);
        Assert.assertEquals("DT", tags[0]);
        Assert.assertEquals("NN", tags[1]);
        Assert.assertEquals("VBD", tags[2]);
        Assert.assertEquals("RB", tags[3]);
        Assert.assertEquals("VBN", tags[4]);
        Assert.assertEquals(".", tags[5]);
    }

The test first trains a model. In the training text, each token is written as word_tag and tokens are separated by spaces:

Last_JJ September_NNP ,_, I_PRP tried_VBD to_TO find_VB out_RP the_DT address_NN of_IN an_DT old_JJ school_NN friend_NN whom_WP I_PRP had_VBD not_RB seen_VBN for_IN 15_CD years_NNS ._.
I_PRP just_RB knew_VBD his_PRP$ name_NN ,_, Alan_NNP McKennedy_NNP ,_, and_CC I_PRP 'd_MD heard_VBD the_DT rumour_NN that_IN he_PRP 'd_MD moved_VBD to_TO Scotland_NNP ,_, the_DT country_NN of_IN his_PRP$ ancestors_NNS ._.
So_IN I_PRP called_VBD Julie_NNP ,_, a_DT friend_NN who's_WDT still_RB in_IN contact_NN with_IN him_PRP ._.
She_PRP told_VBD me_PRP that_IN he_PRP lived_VBD in_IN 23213_CD Edinburgh_NNP ,_, Worcesterstreet_NNP 12_CD ._.
I_PRP wrote_VBD him_PRP a_DT letter_NN right_RB away_RB and_CC he_PRP answered_VBD soon_RB ,_, sounding_VBG very_RB happy_JJ and_CC delighted_JJ ._.
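
As an illustration of the word_tag layout, the following sketch splits one training line into parallel word and tag arrays. It only mirrors the file format shown above and is not OpenNLP's own parser; the class name WordTagLineSketch is hypothetical. Splitting at the last underscore is an assumption made here so that a word containing an underscore still parses.

```java
/** Hypothetical helper that reads one line of word_tag training text. */
class WordTagLineSketch {

    /** Returns {words, tags}; each token is split at its LAST underscore. */
    static String[][] parse(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return new String[][] {new String[0], new String[0]};
        }
        String[] tokens = trimmed.split("\\s+");
        String[] words = new String[tokens.length];
        String[] tags = new String[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            int cut = tokens[i].lastIndexOf('_');
            // Reject tokens with no underscore, an empty word, or an empty tag.
            if (cut <= 0 || cut == tokens[i].length() - 1) {
                throw new IllegalArgumentException("Not in word_tag form: " + tokens[i]);
            }
            words[i] = tokens[i].substring(0, cut);
            tags[i] = tokens[i].substring(cut + 1);
        }
        return new String[][] {words, tags};
    }

    public static void main(String[] args) {
        String[][] parsed = parse("I_PRP just_RB knew_VBD his_PRP$ name_NN ._.");
        for (int i = 0; i < parsed[0].length; i++) {
            System.out.println(parsed[0][i] + "\t" + parsed[1][i]);
        }
    }
}
```

Note that punctuation is tagged like any other token: "._." yields the word "." with the tag ".".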

The tags used in the test above:

  • DT (Determiner)
  • NN (Noun, singular or mass)
  • VBD (Verb, past tense)
  • RB (Adverb)
  • VBN (Verb, past participle)

ChunkerME

    private Chunker chunker;

    // Tokens of the test sentence
    private static String[] toks1 = { "Rockwell", "said", "the", "agreement", "calls", "for", "it", "to", "supply", "200", "additional", "so-called", "shipsets", "for", "the", "planes", "." };

    // POS tag for each token
    private static String[] tags1 = { "NNP", "VBD", "DT", "NN", "VBZ", "IN", "PRP", "TO", "VB", "CD", "JJ", "JJ", "NNS", "IN", "DT", "NNS", "." };

    // Expected chunk tag for each token
    private static String[] expect1 = { "B-NP", "B-VP", "B-NP", "I-NP", "B-VP", "B-SBAR", "B-NP", "B-VP", "I-VP", "B-NP", "I-NP", "I-NP", "I-NP", "B-PP", "B-NP", "I-NP", "O" };

    @Before
    public void startup() throws IOException {
        ResourceAsStreamFactory in = new ResourceAsStreamFactory(getClass(),
                "chunker/test.txt");

        ObjectStream<ChunkSample> sampleStream = new ChunkSampleStream(
                new PlainTextByLineStream(in, StandardCharsets.UTF_8));

        TrainingParameters params = new TrainingParameters();
        params.put(TrainingParameters.ITERATIONS_PARAM, 70);
        params.put(TrainingParameters.CUTOFF_PARAM, 1);

        ChunkerModel chunkerModel = ChunkerME.train("eng", sampleStream, params, new ChunkerFactory());

        this.chunker = new ChunkerME(chunkerModel);
    }

    @Test
    public void testChunkAsArray() throws Exception {

        String[] preds = chunker.chunk(toks1, tags1);

        Assert.assertArrayEquals(expect1, preds);
    }

This test also trains a model first. The training text has one token per line, in the form word, POS tag, chunk tag:

Rockwell NNP B-NP
International NNP I-NP
Corp. NNP I-NP
's POS B-NP
Tulsa NNP I-NP
unit NN I-NP
said VBD B-VP
it PRP B-NP
signed VBD B-VP
a DT B-NP
tentative JJ I-NP
agreement NN I-NP
extending VBG B-VP
its PRP$ B-NP
contract NN I-NP
with IN B-PP
Boeing NNP B-NP
Co. NNP I-NP
to TO B-VP
provide VB I-VP
structural JJ B-NP
parts NNS I-NP
for IN B-PP
Boeing NNP B-NP
's POS B-NP
747 CD I-NP
jetliners NNS I-NP

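Chunk tag legend:

  • B- marks the first token of a chunk
  • I- marks a token inside (continuing) a chunk
  • O marks a token outside any chunk, such as the final "." in the test above
  • E- marks the end of a chunk in tagging schemes that use it; the sample data here uses only B-, I-, and O
  • NP: noun phrase chunk
  • VP: verb phrase chunk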
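
To show how the per-token chunk tags map back to phrases, here is a minimal sketch that groups tokens by their B-/I-/O tags. The class name ChunkSpanSketch and the bracketed output format are assumptions made for illustration; OpenNLP's Chunker interface also has a chunkAsSpans method that returns spans directly.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that turns B-/I-/O chunk tags into bracketed phrases. */
class ChunkSpanSketch {

    static List<String> toPhrases(String[] tokens, String[] chunkTags) {
        if (tokens.length != chunkTags.length) {
            throw new IllegalArgumentException("tokens and chunk tags differ in length");
        }
        List<String> phrases = new ArrayList<>();
        StringBuilder current = null; // phrase being built, or null when outside a chunk
        String currentType = null;    // chunk type of that phrase, e.g. "NP"
        for (int i = 0; i < tokens.length; i++) {
            String tag = chunkTags[i];
            boolean outside = tag.equals("O");
            if (!outside && !tag.startsWith("B-") && !tag.startsWith("I-")) {
                throw new IllegalArgumentException("Unexpected chunk tag: " + tag);
            }
            // An I- tag extends the open chunk only when the types match.
            if (current != null && tag.startsWith("I-") && tag.substring(2).equals(currentType)) {
                current.append(' ').append(tokens[i]);
                continue;
            }
            // Anything else closes the open chunk, if there is one.
            if (current != null) {
                phrases.add(current.append(']').toString());
                current = null;
                currentType = null;
            }
            // B-X (or a stray I-X with no matching open chunk) opens a new chunk.
            if (!outside) {
                currentType = tag.substring(2);
                current = new StringBuilder("[").append(currentType).append(' ').append(tokens[i]);
            }
        }
        if (current != null) {
            phrases.add(current.append(']').toString());
        }
        return phrases;
    }

    public static void main(String[] args) {
        String[] tokens = {"Rockwell", "said", "the", "agreement", "calls", "."};
        String[] chunkTags = {"B-NP", "B-VP", "B-NP", "I-NP", "B-VP", "O"};
        System.out.println(toPhrases(tokens, chunkTags));
    }
}
```

Treating a stray I- tag as the start of a new chunk is a lenient design choice; a stricter version could reject such input instead.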

Summary

This article gives a first look at how to use OpenNLP for POS tagging and chunking. Model training is the important part: training on text from a specific domain can improve tagging accuracy for that domain.

References
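
One simple way to check whether domain-specific training helps is to compare predicted tags against hand-annotated gold tags. The sketch below computes plain per-token accuracy; the class name TagAccuracySketch and the example tags are hypothetical, and it is independent of OpenNLP's own evaluation tools.

```java
/** Hypothetical helper that measures per-token tagging accuracy. */
class TagAccuracySketch {

    /** Returns the fraction of positions where the predicted tag equals the gold tag. */
    static double accuracy(String[] gold, String[] predicted) {
        if (gold.length != predicted.length) {
            throw new IllegalArgumentException("gold and predicted differ in length");
        }
        if (gold.length == 0) {
            return 0.0; // nothing to score
        }
        int correct = 0;
        for (int i = 0; i < gold.length; i++) {
            if (gold[i].equals(predicted[i])) {
                correct++;
            }
        }
        return (double) correct / gold.length;
    }

    public static void main(String[] args) {
        String[] gold = {"DT", "NN", "VBD", "RB", "VBN", "."};
        String[] predicted = {"DT", "NN", "VBD", "RB", "JJ", "."}; // one made-up error
        System.out.println(accuracy(gold, predicted));
    }
}
```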

  • Principle and practice of NLP Chinese Natural Language Processing
  • Penn Part of Speech Tags