
This article shows how to use OpenNLP for part-of-speech tagging and chunking.

POS Tagging

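Part-of-speech (POS) tagging is the process of assigning each word in a text a label for its grammatical category, such as noun, verb, or adjective. Each label is called a tag, and the labeled text is said to be annotated.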

Two tag sets are commonly used for Chinese part-of-speech tagging: the PKU tag set and the Penn tag set. Modern Chinese words are traditionally divided into 12 parts of speech in two categories. Content words are nouns, verbs, adjectives, numerals, quantifiers (measure words) and pronouns; function words are adverbs, prepositions, conjunctions, auxiliary words, interjections and onomatopoeia.

Most POS taggers are statistical. Common approaches are a hidden Markov model (HMM) decoded with the Viterbi algorithm, and maximum entropy (MaxEnt) classification.
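To make the HMM + Viterbi approach concrete, here is a minimal sketch of Viterbi decoding over a toy model. The three tags, the tiny vocabulary, and every probability are invented for illustration, and the class name `ViterbiSketch` is hypothetical; this is not OpenNLP's implementation.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Toy HMM tagger decoded with Viterbi. All numbers are made up for illustration. */
final class ViterbiSketch {
    static final String[] TAGS = {"DT", "NN", "VB"};
    // START[j] = P(first tag is j)
    static final double[] START = {0.8, 0.1, 0.1};
    // TRANS[i][j] = P(tag j | previous tag i)
    static final double[][] TRANS = {
        {0.05, 0.90, 0.05},  // after DT
        {0.10, 0.30, 0.60},  // after NN
        {0.60, 0.30, 0.10},  // after VB
    };
    // EMIT.get(word)[j] = P(word | tag j)
    static final Map<String, double[]> EMIT = new HashMap<>();
    static {
        EMIT.put("the",   new double[] {0.90, 0.00, 0.00});
        EMIT.put("dog",   new double[] {0.00, 0.50, 0.10});
        EMIT.put("barks", new double[] {0.00, 0.10, 0.60});
    }
    static final double[] UNKNOWN = {0.0, 0.0, 0.0};

    // Emission probability with a small floor so that log() never sees zero.
    static double emit(String word, int tag) {
        return Math.max(EMIT.getOrDefault(word, UNKNOWN)[tag], 1e-6);
    }

    static String[] tag(String[] words) {
        int n = words.length;
        int k = TAGS.length;
        if (n == 0) {
            return new String[0];
        }
        double[][] score = new double[n][k]; // best log-probability of a path ending in tag j at position t
        int[][] back = new int[n][k];        // previous tag on that best path
        for (int j = 0; j < k; j++) {
            score[0][j] = Math.log(START[j]) + Math.log(emit(words[0], j));
        }
        for (int t = 1; t < n; t++) {
            for (int j = 0; j < k; j++) {
                double best = Double.NEGATIVE_INFINITY;
                int arg = 0;
                for (int i = 0; i < k; i++) {
                    double s = score[t - 1][i] + Math.log(TRANS[i][j]);
                    if (s > best) {
                        best = s;
                        arg = i;
                    }
                }
                score[t][j] = best + Math.log(emit(words[t], j));
                back[t][j] = arg;
            }
        }
        // Pick the best final tag, then follow the back-pointers to the start.
        int last = 0;
        for (int j = 1; j < k; j++) {
            if (score[n - 1][j] > score[n - 1][last]) {
                last = j;
            }
        }
        String[] out = new String[n];
        for (int t = n - 1; t >= 0; t--) {
            out[t] = TAGS[last];
            last = back[t][last];
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [DT, NN, VB]
        System.out.println(Arrays.toString(tag(new String[] {"the", "dog", "barks"})));
    }
}
```

Working in log space avoids numeric underflow on long sentences; a real tagger would estimate the three probability tables from a tagged corpus instead of hard-coding them.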

OpenNLP provides the POSTaggerME class for POS tagging and the ChunkerME class for chunking. The following test code trains a POS model and tags a short sentence:


    // Trains a POS model of the given type from the tagged sample sentences.
    public static POSModel trainPOSModel(ModelType type) throws IOException {
        TrainingParameters params = new TrainingParameters();
        params.put(TrainingParameters.ALGORITHM_PARAM, type.toString());
        params.put(TrainingParameters.ITERATIONS_PARAM, 100);
        params.put(TrainingParameters.CUTOFF_PARAM, 5);

        return POSTaggerME.train("eng", createSampleStream(), params,
                new POSTaggerFactory());
    }

    // Reads the word_tag training sentences from a classpath resource.
    private static ObjectStream<POSSample> createSampleStream() throws IOException {
        InputStreamFactory in = new ResourceAsStreamFactory(POSTaggerMETest.class,
                /* resource path of the training file, elided in the source */);

        return new WordTagSampleStream(new PlainTextByLineStream(in, StandardCharsets.UTF_8));
    }

    @Test
    public void testPOSTagger() throws IOException {
        POSModel posModel = trainPOSModel(ModelType.MAXENT);

        POSTagger tagger = new POSTaggerME(posModel);

        String[] tags = tagger.tag(new String[] {
                /* six input tokens, elided in the source */ });

        // One tag is returned per input token.
        Assert.assertEquals(6, tags.length);
        Assert.assertEquals("DT", tags[0]);
        Assert.assertEquals("NN", tags[1]);
        Assert.assertEquals("VBD", tags[2]);
        Assert.assertEquals("RB", tags[3]);
        Assert.assertEquals("VBN", tags[4]);
        Assert.assertEquals(".", tags[5]);
    }

The model is trained first. In the training text, each token is written as word_tag:

Last_JJ September_NNP ,_, I_PRP tried_VBD to_TO find_VB out_RP the_DT address_NN of_IN an_DT old_JJ school_NN friend_NN whom_WP I_PRP had_VBD not_RB seen_VBN for_IN 15_CD years_NNS ._.
I_PRP just_RB knew_VBD his_PRP$ name_NN ,_, Alan_NNP McKennedy_NNP ,_, and_CC I_PRP 'd_MD heard_VBD the_DT rumour_NN that_IN he_PRP 'd_MD moved_VBD to_TO Scotland_NNP ,_, the_DT country_NN of_IN his_PRP$ ancestors_NNS ._.
So_IN I_PRP called_VBD Julie_NNP ,_, a_DT friend_NN who's_WDT still_RB in_IN contact_NN with_IN him_PRP ._.
She_PRP told_VBD me_PRP that_IN he_PRP lived_VBD in_IN 23213_CD Edinburgh_NNP ,_, Worcesterstreet_NNP 12_CD ._.
I_PRP wrote_VBD him_PRP a_DT letter_NN right_RB away_RB and_CC he_PRP answered_VBD soon_RB ,_, sounding_VBG very_RB happy_JJ and_CC delighted_JJ ._.
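To show how the word_tag format maps onto parallel token and tag lists, here is a minimal sketch in plain Java. The class name `WordTagLine` is hypothetical and the code does not use OpenNLP's own reader; it only assumes that the tag follows the last underscore of each whitespace-separated item.

```java
import java.util.ArrayList;
import java.util.List;

/** Splits one line of word_tag training text into tokens and tags (illustrative only). */
final class WordTagLine {
    final List<String> tokens = new ArrayList<>();
    final List<String> tags = new ArrayList<>();

    static WordTagLine parse(String line) {
        WordTagLine result = new WordTagLine();
        for (String pair : line.trim().split("\\s+")) {
            if (pair.isEmpty()) {
                continue; // blank line
            }
            // Split at the LAST underscore so that words containing '_' still work.
            int split = pair.lastIndexOf('_');
            if (split <= 0 || split == pair.length() - 1) {
                throw new IllegalArgumentException("Expected word_tag but got: " + pair);
            }
            result.tokens.add(pair.substring(0, split));
            result.tags.add(pair.substring(split + 1));
        }
        return result;
    }

    public static void main(String[] args) {
        WordTagLine sample = parse("I_PRP just_RB knew_VBD his_PRP$ name_NN ,_, ._.");
        System.out.println(sample.tokens);
        System.out.println(sample.tags);
    }
}
```

Punctuation is tagged with itself in this data, so `,_,` yields the token `,` with the tag `,`.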


The tags checked in the test have the following meanings:

  • DT (Determiner)
  • NN (Noun, singular or mass)
  • VBD (Verb, past tense)
  • RB (Adverb)
  • VBN (Verb, past participle)
  • . (Sentence-final punctuation)


Chunking groups the tagged tokens into phrases such as noun phrases and verb phrases. The following test code trains a ChunkerME model and checks its predictions:

    private Chunker chunker;

    // Input tokens, their POS tags, and the chunk tags the model is expected to predict.
    private static String[] toks1 = { "Rockwell", "said", "the", "agreement", "calls", "for", "it", "to", "supply", "200", "additional", "so-called", "shipsets", "for", "the", "planes", "." };

    private static String[] tags1 = { "NNP", "VBD", "DT", "NN", "VBZ", "IN", "PRP", "TO", "VB", "CD", "JJ", "JJ", "NNS", "IN", "DT", "NNS", "." };

    private static String[] expect1 = { "B-NP", "B-VP", "B-NP", "I-NP", "B-VP", "B-SBAR", "B-NP", "B-VP", "I-VP", "B-NP", "I-NP", "I-NP", "I-NP", "B-PP", "B-NP", "I-NP", "O" };

    @Before
    public void startup() throws IOException {
        ResourceAsStreamFactory in = new ResourceAsStreamFactory(getClass(),
                /* resource path of the training file, elided in the source */);

        ObjectStream<ChunkSample> sampleStream = new ChunkSampleStream(
                new PlainTextByLineStream(in, StandardCharsets.UTF_8));

        TrainingParameters params = new TrainingParameters();
        params.put(TrainingParameters.ITERATIONS_PARAM, 70);
        params.put(TrainingParameters.CUTOFF_PARAM, 1);

        ChunkerModel chunkerModel = ChunkerME.train("eng", sampleStream, params, new ChunkerFactory());

        this.chunker = new ChunkerME(chunkerModel);
    }

    @Test
    public void testChunkAsArray() throws Exception {
        // The chunker takes the tokens together with their POS tags.
        String[] preds = chunker.chunk(toks1, tags1);

        Assert.assertArrayEquals(expect1, preds);
    }

The chunker model is also trained first. The training text has one token per line, in the form "word POS-tag chunk-tag":

Rockwell NNP B-NP
International NNP I-NP
Corp. NNP I-NP
Tulsa NNP I-NP
unit NN I-NP
said VBD B-VP
signed VBD B-VP
tentative JJ I-NP
agreement NN I-NP
extending VBG B-VP
its PRP$ B-NP
contract NN I-NP
with IN B-PP
Boeing NNP B-NP
to TO B-VP
provide VB I-VP
structural JJ B-NP
parts NNS I-NP
for IN B-PP
Boeing NNP B-NP
747 CD I-NP
jetliners NNS I-NP
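
Each chunk tag combines a position prefix with a phrase type:

  • B- marks the first token of a chunk
  • I- marks a token inside a chunk
  • E- marks the last token of a chunk in tagging schemes that use it; it does not appear in the sample above
  • O marks a token outside any chunk
  • NP noun phrase
  • VP verb phrase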
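
To illustrate how these tags are read, here is a minimal sketch in plain Java that groups tokens into phrases from their chunk tags. The class name `ChunkGrouper` and the "TYPE: words" output format are hypothetical; the input is a shortened example in the style of the test data above. OpenNLP's Chunker interface also offers chunkAsSpans, which returns the phrases as spans directly.

```java
import java.util.ArrayList;
import java.util.List;

/** Groups tokens into phrases based on B-/I-/O chunk tags (illustrative only). */
final class ChunkGrouper {
    static List<String> group(String[] tokens, String[] tags) {
        if (tokens.length != tags.length) {
            throw new IllegalArgumentException("tokens and tags must have the same length");
        }
        List<String> chunks = new ArrayList<>();
        StringBuilder words = null; // words of the chunk currently being built
        String type = null;         // its phrase type, or null if no chunk is open
        for (int i = 0; i < tokens.length; i++) {
            String tag = tags[i];
            boolean continues = tag.startsWith("I-") && type != null
                    && tag.substring(2).equals(type);
            if (continues) {
                words.append(' ').append(tokens[i]);
                continue;
            }
            // Any other tag closes the open chunk, if there is one.
            if (type != null) {
                chunks.add(type + ": " + words);
                type = null;
            }
            // B- opens a new chunk; a stray I- is treated leniently as an opener too.
            if (tag.startsWith("B-") || tag.startsWith("I-")) {
                type = tag.substring(2);
                words = new StringBuilder(tokens[i]);
            }
            // Tokens tagged O belong to no chunk and are skipped.
        }
        if (type != null) {
            chunks.add(type + ": " + words);
        }
        return chunks;
    }

    public static void main(String[] args) {
        String[] tokens = {"Rockwell", "said", "the", "agreement", "."};
        String[] tags = {"B-NP", "B-VP", "B-NP", "I-NP", "O"};
        // prints [NP: Rockwell, VP: said, NP: the agreement]
        System.out.println(group(tokens, tags));
    }
}
```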


This article gave a first look at POS tagging and chunking with OpenNLP. Model training is the key step: training on text from a specific domain can improve tagging accuracy in that domain.
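
One simple way to check whether domain-specific training helps is to compare token-level accuracy on held-out tagged text before and after retraining. The following is a minimal sketch in plain Java; the class and method names are hypothetical, and the two arrays stand in for the reference tags and a tagger's output.

```java
/** Token-level tagging accuracy: the share of positions where the predicted tag matches the reference. */
final class TaggingAccuracy {
    static double accuracy(String[] gold, String[] predicted) {
        if (gold.length != predicted.length) {
            throw new IllegalArgumentException("gold and predicted must have the same length");
        }
        if (gold.length == 0) {
            return 0.0; // nothing to score
        }
        int correct = 0;
        for (int i = 0; i < gold.length; i++) {
            if (gold[i].equals(predicted[i])) {
                correct++;
            }
        }
        return (double) correct / gold.length;
    }

    public static void main(String[] args) {
        String[] gold = {"DT", "NN", "VBD", "RB", "VBN", "."};
        String[] predicted = {"DT", "NN", "VBD", "RB", "VBD", "."};
        // Five of the six tags match.
        System.out.println(accuracy(gold, predicted));
    }
}
```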


