Description: Word segmentation and word cloud generation with the IK word segmenter.

This article explains how to use the IK word segmenter for word frequency statistics. The text of an article is segmented and the frequency of each word is counted, mainly to build the word cloud shown in the picture below, which surfaces the keywords of the article. The same pipeline can later be extended with part-of-speech tagging, entity recognition, and sentiment analysis of entities.

The specific modules of word frequency statistics service are as follows:

  • **Data input:** text information
  • **Data output:** word, word frequency (TF-IDF, etc.), part of speech, and other content
  • **Components used:** word segmenter, corpus, word cloud display component, etc.
  • **Function points:** whitelist, blacklist, synonyms, etc.
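
The function points listed above (whitelist, blacklist, synonyms) can be applied as a post-processing step on the word-to-frequency map. The following is only a minimal sketch: the class name `TermFilter`, the method `apply`, and the semantics chosen here (synonyms folded into a canonical word, blacklisted words dropped, a non-empty whitelist acting as an allow-list) are assumptions for illustration and are not part of IK.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: post-processes a word -> frequency map.
class TermFilter {

    static Map<String, Integer> apply(Map<String, Integer> frequencies,
                                      Set<String> whitelist,
                                      Set<String> blacklist,
                                      Map<String, String> synonyms) {
        Map<String, Integer> result = new HashMap<>();
        for (Map.Entry<String, Integer> entry : frequencies.entrySet()) {
            // Fold a synonym into its canonical word, if one is configured.
            String term = synonyms.getOrDefault(entry.getKey(), entry.getKey());
            if (blacklist.contains(term)) {
                continue; // blacklisted words never reach the word cloud
            }
            if (!whitelist.isEmpty() && !whitelist.contains(term)) {
                continue; // a non-empty whitelist keeps only listed words
            }
            // Sum the counts of all words that map to the same canonical word.
            result.merge(term, entry.getValue(), Integer::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> frequencies = new HashMap<>();
        frequencies.put("ES", 3);
        frequencies.put("Elasticsearch", 2);
        frequencies.put("the", 10);

        Map<String, String> synonyms = new HashMap<>();
        synonyms.put("ES", "Elasticsearch");

        Map<String, Integer> filtered = apply(
                frequencies,
                Collections.<String>emptySet(),
                Collections.singleton("the"),
                synonyms);
        System.out.println(filtered); // {Elasticsearch=5}
    }
}
```

Filtering after counting keeps the segmentation code unchanged; the lists can then be edited without touching the IK dictionaries.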

Existing Chinese word segmenters include IK, HanLP, Jieba, and NLPIR, each with its own characteristics. This article uses IK because ES generally uses the IK segmentation plug-in packaged by medcl and others as its Chinese segmenter. That plug-in is deeply integrated with ES, however, and segmenting plain text does not need anything from ES, so this article uses Shen Yanchao's version of IK instead.

1. Word frequency statistics code with IK

The IK code is relatively simple: split a String into words and count them. The code is as follows:

  1. Simple word frequency count:

    /**
     * Full-text word frequency statistics.
     *
     * @param content  the text content
     * @param useSmart whether to use smart segmentation
     * @return word, word frequency
     * @throws IOException
     */
    private static Map<String, Integer> countTermFrequency(String content, Boolean useSmart) throws IOException {
        // Output result map
        Map<String, Integer> frequencies = new HashMap<>();
        if (StringUtils.isBlank(content)) {
            return frequencies;
        }
        DefaultConfig conf = new DefaultConfig();
        conf.setUseSmart(useSmart);
        // Initialize IKSegmenter with the text and load the dictionary
        IKSegmenter ikSegmenter = new IKSegmenter(new StringReader(content), conf);
        Lexeme lexeme;
        while ((lexeme = ikSegmenter.next()) != null) {
            // Ignore single-character lexemes
            if (lexeme.getLexemeText().length() > 1) {
                final String term = lexeme.getLexemeText();
                frequencies.compute(term, (k, v) -> {
                    if (v == null) {
                        v = 1;
                    } else {
                        v += 1;
                    }
                    return v;
                });
            }
        }
        return frequencies;
    }
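
The counting step itself does not depend on IK. As a sketch, the same logic can be written more compactly with `Map.merge`. In the example below the token list stands in for the lexemes that `IKSegmenter.next()` would return, and `TermCounter` and `count` are hypothetical names introduced only for this illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TermCounter {

    // Counts tokens longer than one character, mirroring the length filter
    // used in countTermFrequency.
    static Map<String, Integer> count(List<String> tokens) {
        Map<String, Integer> frequencies = new HashMap<>();
        for (String token : tokens) {
            if (token.length() > 1) {
                // Insert 1 for a new token, otherwise add 1 to the old count.
                frequencies.merge(token, 1, Integer::sum);
            }
        }
        return frequencies;
    }

    public static void main(String[] args) {
        // Stand-in for the output of a real segmenter.
        List<String> tokens = Arrays.asList("word", "cloud", "word", "a");
        Map<String, Integer> frequencies = count(tokens);
        System.out.println(frequencies.get("word"));     // 2
        System.out.println(frequencies.containsKey("a")); // false
    }
}
```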

  2. Word frequency and document frequency count:

    /**
     * Word frequency and document frequency statistics for a list of texts.
     *
     * @param docs     the list of documents
     * @param useSmart whether to use smart segmentation
     * @return word -> [word frequency, document frequency]
     * @throws IOException
     */
    private static Map<String, Integer[]> countTFDF(List<String> docs, Boolean useSmart) throws IOException {
        // Output result map
        Map<String, Integer[]> frequencies = new HashMap<>();
        for (String doc : docs) {
            if (StringUtils.isBlank(doc)) {
                continue;
            }
            DefaultConfig conf = new DefaultConfig();
            conf.setUseSmart(useSmart);
            // Initialize IKSegmenter with the text and load the dictionary
            IKSegmenter ikSegmenter = new IKSegmenter(new StringReader(doc), conf);
            Lexeme lexeme;
            // Distinct words seen in the current document
            Set<String> terms = new HashSet<>();
            while ((lexeme = ikSegmenter.next()) != null) {
                if (lexeme.getLexemeText().length() > 1) {
                    final String text = lexeme.getLexemeText();
                    frequencies.compute(text, (k, v) -> {
                        if (v == null) {
                            v = new Integer[]{1, 0};
                        } else {
                            v[0] += 1;
                        }
                        return v;
                    });
                    terms.add(text);
                }
            }
            // Each distinct word adds 1 to its document frequency
            for (String term : terms) {
                frequencies.get(term)[1] += 1;
            }
        }
        return frequencies;
    }
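
The [word frequency, document frequency] pairs returned by `countTFDF` are enough to compute a TF-IDF score. The sketch below assumes one simple variant, score = tf × ln(N / df), where N is the number of non-blank documents. This formula, the class name `TfIdf`, and the method `score` are assumptions made for illustration; other TF-IDF variants (for example with smoothing) are equally valid.

```java
import java.util.HashMap;
import java.util.Map;

class TfIdf {

    // stats: word -> [word frequency, document frequency]
    // docCount: number of documents that were actually segmented
    static Map<String, Double> score(Map<String, Integer[]> stats, int docCount) {
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Integer[]> entry : stats.entrySet()) {
            int tf = entry.getValue()[0];
            int df = entry.getValue()[1];
            // Assumed formula: tf * ln(N / df). Every word in the map was seen
            // in at least one document, so df >= 1 and the division is safe.
            double idf = Math.log((double) docCount / df);
            scores.put(entry.getKey(), tf * idf);
        }
        return scores;
    }

    public static void main(String[] args) {
        Map<String, Integer[]> stats = new HashMap<>();
        stats.put("segmentation", new Integer[]{6, 4}); // in all 4 documents
        stats.put("arbitrator", new Integer[]{3, 1});   // in 1 document only
        Map<String, Double> scores = score(stats, 4);
        System.out.println(scores.get("segmentation"));   // 0.0
        System.out.println(scores.get("arbitrator") > 0); // true
    }
}
```

With this variant a word that occurs in every document scores zero, which is often what a word cloud wants: words common to all documents say little about any one of them.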

2. Obtain TopN words in the word cloud

There are several ways to rank words when selecting the TopN words for the word cloud: directly by word frequency, by document frequency, or by a score such as TF-IDF. This article selects the TopN by word frequency only. The following algorithms can be used to obtain the top N of M numbers:

  • M small, N small: quickselect
  • M large, N small: min-heap
  • M large, N large: merge sort

This article uses the heap approach, implemented with Java's PriorityQueue. Note that the comparator in the code below orders entries by descending frequency and every entry is added to the queue, so the first N removals return the N most frequent words:

    /**
     * Sort by word frequency and return the top N entries.
     */
    private static List<Map.Entry<String, Integer>> order(Map<String, Integer> data, int topN) {
        // The initial capacity must be at least 1, even for an empty map
        PriorityQueue<Map.Entry<String, Integer>> priorityQueue = new PriorityQueue<>(Math.max(1, data.size()), new Comparator<Map.Entry<String, Integer>>() {
            @Override
            public int compare(Map.Entry<String, Integer> o1, Map.Entry<String, Integer> o2) {
                // Descending order of word frequency
                return o2.getValue().compareTo(o1.getValue());
            }
        });
        for (Map.Entry<String, Integer> entry : data.entrySet()) {
            priorityQueue.add(entry);
        }
        // TODO: if (list(0).value == list(99).value) {xxx}
        List<Map.Entry<String, Integer>> list = new ArrayList<>();
        int size = priorityQueue.size() <= topN ? priorityQueue.size() : topN;
        for (int i = 0; i < size; i++) {
            list.add(priorityQueue.remove());
        }
        return list;
    }
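
The `order()` method above adds all M entries to the queue. When M is large and N is small, a bounded min-heap keeps at most N entries in memory at a time. The sketch below is one way to write that variant; `TopN` and `top` are hypothetical names, and the order of words with equal frequency is left unspecified.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class TopN {

    static List<Map.Entry<String, Integer>> top(Map<String, Integer> data, int topN) {
        List<Map.Entry<String, Integer>> result = new ArrayList<>();
        if (topN <= 0) {
            return result;
        }
        Comparator<Map.Entry<String, Integer>> byValue = Map.Entry.comparingByValue();
        // Min-heap: the smallest frequency among the current candidates is at the head.
        PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(topN, byValue);
        for (Map.Entry<String, Integer> entry : data.entrySet()) {
            if (heap.size() < topN) {
                heap.add(entry);
            } else if (entry.getValue() > heap.peek().getValue()) {
                // The new entry beats the weakest candidate, so replace it.
                heap.poll();
                heap.add(entry);
            }
        }
        result.addAll(heap);
        // Return the survivors from most frequent to least frequent.
        result.sort(byValue.reversed());
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> data = new HashMap<>();
        data.put("ik", 5);
        data.put("es", 1);
        data.put("cloud", 3);
        data.put("word", 4);
        System.out.println(top(data, 2)); // [ik=5, word=4]
    }
}
```

This runs in O(M log N) time and O(N) extra space, compared with O(M log M) time and O(M) space when every entry is queued.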

3. Analysis of IK code

The core class is IKSegmenter. Also worth noting are the dic package (dictionary-related code) and the identifyCharType() method of the CharacterUtil class. The directory structure is as follows:

The structure of the IKSegmenter class is shown below. init() is a private method, and the dictionary is loaded eagerly rather than lazily: it is loaded when the first IKSegmenter instance is constructed. The code follows the structure diagram.

    // IKSegmenter class constructor
    public IKSegmenter(Reader input, Configuration cfg) {
        this.input = input;
        this.cfg = cfg;
        this.init();
    }

    // IKSegmenter class initialization
    private void init() {
        // Initialize the dictionary singleton
        Dictionary.initial(this.cfg);
        // Initialize the segmentation context
        this.context = new AnalyzeContext(this.cfg);
        // Load the sub-segmenters
        this.segmenters = this.loadSegmenters();
        // Load the ambiguity arbitrator
        this.arbitrator = new IKArbitrator();
    }

    // Dictionary class: initialize the dictionary singleton
    public static Dictionary initial(Configuration cfg) {
        if (singleton == null) {
            synchronized (Dictionary.class) {
                if (singleton == null) {
                    singleton = new Dictionary(cfg);
                    return singleton;
                }
            }
        }
        return singleton;
    }
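
The `initial()` method above follows the double-checked locking pattern, which is why constructing many IKSegmenter instances loads the dictionary only once. The sketch below shows the pattern in isolation with a hypothetical `WordDictionary` class, which is not IK's `Dictionary`. The field is declared `volatile` here because double-checked locking needs that under the Java memory model; the excerpt above does not show how IK declares its `singleton` field.

```java
import java.util.concurrent.atomic.AtomicInteger;

class WordDictionary {

    // volatile makes the fully constructed instance visible to other threads.
    private static volatile WordDictionary singleton;

    // Counts how many times the (expensive) load actually ran.
    static final AtomicInteger LOAD_COUNT = new AtomicInteger();

    private WordDictionary() {
        // Stands in for reading the dictionary files.
        LOAD_COUNT.incrementAndGet();
    }

    static WordDictionary initial() {
        if (singleton == null) {                    // first check, no lock
            synchronized (WordDictionary.class) {
                if (singleton == null) {            // second check, under lock
                    singleton = new WordDictionary();
                }
            }
        }
        return singleton;
    }

    public static void main(String[] args) {
        WordDictionary first = initial();
        WordDictionary second = initial();
        System.out.println(first == second);  // true
        System.out.println(LOAD_COUNT.get()); // 1
    }
}
```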

The private constructor of Dictionary loads the IK main dictionary and the extension dictionaries. We can also load our own online dictionary here, so that IKAnalyzer.cfg.xml only needs to configure the dictionaries that change frequently.

    private Dictionary(Configuration cfg) {
        this.cfg = cfg;
        this.loadMainDict();       // main dictionary and extension dictionaries
        this.loadMiaozhenDict();   // custom dictionary
        this.loadStopWordDict();   // stop word dictionary
        this.loadQuantifierDict(); // quantifier dictionary
    }

When IKSegmenter's next() is called to get the next lexeme, the identifyCharType() method of the CharacterUtil class is used to identify the type of each character. Here we can also customize the character types to handle newer internet usage, such as @ and #:

    static int identifyCharType(char input) {
        if (input >= '0' && input <= '9') {
            return CHAR_ARABIC;
        } else if ((input >= 'a' && input <= 'z') || (input >= 'A' && input <= 'Z')) {
            return CHAR_ENGLISH;
        } else {
            Character.UnicodeBlock ub = Character.UnicodeBlock.of(input);
            // caster: '#' is added as a Chinese character
            if (ub == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
                    || ub == Character.UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS
                    || ub == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A
                    || input == '#') {
                // Known Chinese characters (UTF-8)
                return CHAR_CHINESE;
            } else if (ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS // full-width digits and Japanese/Korean characters
                    // Korean character sets
                    || ub == Character.UnicodeBlock.HANGUL_SYLLABLES
                    || ub == Character.UnicodeBlock.HANGUL_JAMO
                    || ub == Character.UnicodeBlock.HANGUL_COMPATIBILITY_JAMO
                    // Japanese character sets
                    || ub == Character.UnicodeBlock.HIRAGANA // hiragana
                    || ub == Character.UnicodeBlock.KATAKANA // katakana
                    || ub == Character.UnicodeBlock.KATAKANA_PHONETIC_EXTENSIONS) {
                return CHAR_OTHER_CJK;
            }
        }
        // Other characters are not processed
        return CHAR_USELESS;
    }
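
The same idea can be made configurable instead of hard-coding `input == '#'`. The following is a simplified, standalone sketch and not IK's CharacterUtil: the class `CustomCharTypes`, the `EXTRA_CHARS` set, and the constant values are assumptions chosen for the example, and only one CJK block is checked to keep it short.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class CustomCharTypes {

    // Constant values are chosen for this sketch only.
    static final int CHAR_USELESS = 0;
    static final int CHAR_ARABIC = 1;
    static final int CHAR_ENGLISH = 2;
    static final int CHAR_CHINESE = 4;

    // Hypothetical: extra symbols to treat like Chinese characters.
    static final Set<Character> EXTRA_CHARS = new HashSet<>(Arrays.asList('#', '@'));

    static int identifyCharType(char input) {
        if (input >= '0' && input <= '9') {
            return CHAR_ARABIC;
        }
        if ((input >= 'a' && input <= 'z') || (input >= 'A' && input <= 'Z')) {
            return CHAR_ENGLISH;
        }
        Character.UnicodeBlock ub = Character.UnicodeBlock.of(input);
        if (ub == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS || EXTRA_CHARS.contains(input)) {
            return CHAR_CHINESE;
        }
        // Everything else is ignored by the segmenter.
        return CHAR_USELESS;
    }

    public static void main(String[] args) {
        System.out.println(identifyCharType('#') == CHAR_CHINESE);      // true
        System.out.println(identifyCharType('\u4e2d') == CHAR_CHINESE); // true
        System.out.println(identifyCharType('!') == CHAR_USELESS);      // true
    }
}
```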

Because the IK code base is small, it is worth reading from the beginning, including the sub-segmenters that implement the ISegmenter interface.

4. Displaying the word cloud

The word cloud can be displayed with Kibana's built-in word cloud dashboard or with the popular WordCloud. For your own tests, the online tool Weiciyun (微词云) gives a quick and easy view of the result: import a two-column XLS file, then use the control panel on the left to adjust the shape and font.
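
The two columns are the word and its frequency. As a small sketch, the TopN list can be turned into tab-separated rows that paste cleanly into a spreadsheet, which can then be saved as XLS. `WordCloudRows` and `toRows` are hypothetical names; writing an XLS file directly would need a spreadsheet library and is not shown here.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class WordCloudRows {

    // Formats each (word, frequency) entry as one tab-separated row.
    static List<String> toRows(List<Map.Entry<String, Integer>> topWords) {
        List<String> rows = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : topWords) {
            rows.add(entry.getKey() + "\t" + entry.getValue());
        }
        return rows;
    }

    public static void main(String[] args) {
        // Sample data for illustration only.
        List<Map.Entry<String, Integer>> topWords = new ArrayList<>();
        topWords.add(new AbstractMap.SimpleEntry<>("segmentation", 12));
        topWords.add(new AbstractMap.SimpleEntry<>("dictionary", 7));
        for (String row : toRows(topWords)) {
            System.out.println(row);
        }
    }
}
```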

The display effect is as follows:

5. Summary

This article implements word frequency statistics with the IK word segmenter for display in a word cloud. The approach is not limited to ES: it can be used for word frequency statistics on documents from any data source. The functionality is fairly basic, though. Interested readers can go on to implement other ranking methods (TF/IDF), part-of-speech tagging, entity recognition, sentiment analysis, and so on. IK's capabilities are limited, so these need more advanced segmenters such as HanLP (which includes part-of-speech tagging) and some NLP background. The lexical analysis module of Baidu AI is also a useful reference.


This article is the original content of Aliyun and shall not be reproduced without permission.