The term “word cloud” was coined by Rich Gordon, an associate professor of journalism at Northwestern University. He has long been interested in the newest forms of online content distribution: the ones that only the Internet can deliver and that newspapers, radio, television and other media cannot. Often the newest means of communication are also the ones best suited to the medium. A “word cloud” visually highlights the keywords that occur most frequently in online text by rendering them as a “keyword cloud”.

A word cloud image filters out most of the text, so that visitors can grasp the main idea at a glance.

The original idea was to build a website that crawls the DXY (“Lilac Garden”) real-time pneumonia epidemic feed, saves it locally, and turns the detailed epidemic report into a word cloud, so that people can pick up the keywords from the word cloud instead of reading long articles. The demo site is the XXX pneumonia epidemic real-time feed.

Read the file

First, I copied the following text from the DXY (“Lilac Garden”) real-time pneumonia epidemic feed to draw it as a word cloud.

Source of infection: wild animals, possibly the Chinese horseshoe bat
Route of transmission: respiratory droplets and contact; fecal-oral transmission is possible
Susceptible population: the general population is susceptible. The elderly and people with underlying diseases become more seriously ill after infection; children and infants can also be infected
Incubation period: generally 3 to 7 days and no longer than 14 days; the virus is infectious during the incubation period

The effect is as follows:

The first step is to save the data locally and then read it. Because file objects consume operating system resources, a file must be closed after it has been read. If an IOError occurs while reading, a close() placed after the read is never reached, so wrap the read in try ... finally. The code is as follows:

fp = None
try:
    # Open the saved text as UTF-8
    fp = open("D:\\githubMe\\flask-tutorial\\doc\\coronavirus_data.txt", 'r', encoding='UTF-8')
    text = fp.read()
    print(text)
finally:
    # fp is still None if open() itself failed
    if fp:
        fp.close()

Python's with statement calls close() automatically, even if an exception is raised inside the block. The simplified code is as follows:

with open("D:\\githubMe\\flask-tutorial\\doc\\coronavirus_data.txt", 'r', encoding='UTF-8') as fp:
    text = fp.read()
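
To check that with really closes the file, here is a minimal sketch. It writes a small temporary file (a stand-in for coronavirus_data.txt; the temporary file and its content are assumptions of this example) and inspects fp.closed inside and after the block.

```python
import os
import tempfile

# Create a throwaway UTF-8 file standing in for coronavirus_data.txt
# (the file and its content are assumptions made for this demo).
with tempfile.NamedTemporaryFile('w', suffix='.txt', encoding='UTF-8', delete=False) as tmp:
    tmp.write('传染源')
    path = tmp.name

with open(path, 'r', encoding='UTF-8') as fp:
    text = fp.read()
    closed_inside = fp.closed   # the file is still open inside the block

closed_after = fp.closed        # the file is closed once the block exits

os.remove(path)
print(closed_inside, closed_after)
```

The second value is True without any explicit call to close().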

Generate the word cloud

The wordcloud package can be called in several ways to generate a word cloud. First install it:

pip install wordcloud

Following the example on its website, import the PIL image processing library to save the picture.

pip install PIL

PIL is deprecated, so install its fork, Pillow, instead.

pip install Pillow

from wordcloud import WordCloud
import PIL.Image as image

with open("D:\\githubMe\\flask-tutorial\\doc\\coronavirus_data.txt", 'r', encoding='UTF-8') as fp:
    text = fp.read()
    # Build the word cloud with the default settings
    wordcloud = WordCloud().generate(text)
    # Convert it to a PIL image and save it as a PNG file
    word_image = wordcloud.to_image()
    word_image.save('coronavirus_test_1.png', 'png')

The word cloud is generated, but the Chinese characters in the image are garbled, while English characters display normally.

After replacing all the Chinese content with English, the word cloud has no garbled characters.

WordCloud defaults to the DroidSansMono font. That font has no Chinese glyphs (and, according to the documentation, its default path may not exist outside Linux), so font_path has to be changed to point to a font that can render Chinese. I used the system's own Microsoft YaHei font, msyh.ttc. You can also use third-party fonts from the Internet.

Point font_path to the location of the font as follows:

    wordcloud = WordCloud(font_path="D:\\githubMe\\flask-tutorial\\doc\\msyh.ttc").generate(text)

The word cloud generated after execution:
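
Because the font path above is hard-coded for one machine, a small helper can pick the first font file that actually exists. This is only a sketch: first_existing_font is a hypothetical helper, and every candidate path other than the one used in this article is an assumption about where a Chinese-capable font might live on a given system.

```python
import os

def first_existing_font(candidates):
    """Return the first path in candidates that is an existing file, or None."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None

# Candidate locations are assumptions; adjust them for your machine.
CANDIDATE_FONTS = [
    "D:\\githubMe\\flask-tutorial\\doc\\msyh.ttc",              # path used in this article
    "C:\\Windows\\Fonts\\msyh.ttc",                             # Microsoft YaHei on Windows
    "/System/Library/Fonts/PingFang.ttc",                       # macOS
    "/usr/share/fonts/opentype/noto/NotoSansCJK-Regular.ttc",   # some Linux distributions
]

font_path = first_existing_font(CANDIDATE_FONTS)
if font_path is None:
    print("No Chinese-capable font found; set font_path manually.")
else:
    print("Using font:", font_path)
```

The resulting font_path can then be passed to WordCloud(font_path=font_path).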

At this point a basic Chinese word cloud has emerged, but some of its entries are whole sentences rather than words. So how do we split the text into words?

Chinese word segmentation

Here we use jieba for Chinese word segmentation. Its main features:

  1. Four word segmentation modes are supported: precise mode, which tries to cut sentences as accurately as possible and is suitable for text analysis; full mode, which scans out every possible word in the sentence and is very fast but cannot resolve ambiguity; search engine mode, which builds on precise mode by splitting long words again to improve recall and is suitable for search engine indexing; and paddle mode, which uses a sequence labeling (bidirectional GRU) network model trained with the PaddlePaddle deep learning framework to perform segmentation and also supports part-of-speech tagging.
  2. Supports Traditional Chinese segmentation
  3. Supports custom dictionaries

pip install jieba

The code is as follows:

from wordcloud import WordCloud
import PIL.Image as image
import jieba

def participle_word(text):
    # Segment the text with jieba and join the words with spaces,
    # so that WordCloud can split the result on whitespace
    text_list = jieba.cut(text)
    res = ' '.join(text_list)
    return res

with open("D:\\githubMe\\flask-tutorial\\doc\\coronavirus_data.txt", 'r', encoding='UTF-8') as fp:
    text = fp.read()
    text = participle_word(text)
    wordcloud = WordCloud(font_path="D:\\githubMe\\flask-tutorial\\doc\\msyh.ttc").generate(text)
    word_image = wordcloud.to_image()
    word_image.save('coronavirus_test_4.png', 'png')

The generated Chinese word segmentation cloud is as follows:

Change the width and height

Besides font_path, WordCloud also has width and height parameters, which default to 400 and 200 pixels respectively. You can enlarge the canvas, for example to an 800 by 800 square:

    wordcloud = WordCloud(font_path="D:\\githubMe\\flask-tutorial\\doc\\msyh.ttc", width=800, height=800).generate(text)

Change shape

In addition to the three parameters above, another commonly used one is mask. It gives a binary mask that determines where text is drawn. If mask is not None, width and height are ignored and the shape of the mask is used instead. All white areas are treated as masked out, while other areas are free to draw on.

mask only accepts an nd-array or None, so numpy is needed to convert the image.

I found a picture of a “demon” on a white background, which suits the culprit behind the outbreak.

pip install numpy

Import the numpy package, which supports multidimensional arrays and matrix operations and provides a large number of mathematical functions for working with them. Think of the “demon” image as a large matrix; WordCloud fills the non-white areas with text. One of the most important features of NumPy is its N-dimensional array object, ndarray, a collection of elements of the same type indexed from zero. Each element of an ndarray occupies a block of memory of the same size.

from wordcloud import WordCloud
import PIL.Image as image
import jieba
import numpy as np

def participle_word(text):
    # Segment the text with jieba and join the words with spaces
    text_list = jieba.cut(text)
    res = ' '.join(text_list)
    return res

with open("D:\\githubMe\\flask-tutorial\\doc\\coronavirus_data.txt", 'r', encoding='UTF-8') as fp:
    text = fp.read()
    text = participle_word(text)
    # Load the mask picture and convert it to a numpy array
    shade = np.array(image.open('D:\\githubMe\\flask-tutorial\\doc\\wordcloud2.png'))
    wordcloud = WordCloud(font_path="D:\\githubMe\\flask-tutorial\\doc\\msyh.ttc", width=800, height=800, mask=shade).generate(text)
    word_image = wordcloud.to_image()
    word_image.save('coronavirus_test_6.png', 'png')

Running the script generates the following word cloud, outlined in the shape of the “demon”:
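
A mask does not have to come from a picture. As a minimal sketch, the array below is built directly in NumPy: 255 (white) marks the area that is masked out, and 0 marks a centered circle that is free to draw on. circle_mask and its 0.45 radius factor are hypothetical choices for illustration.

```python
import numpy as np

def circle_mask(size=800):
    """Return a size x size uint8 array: 0 inside a centered circle, 255 outside."""
    y, x = np.ogrid[:size, :size]
    center = (size - 1) / 2
    radius = size * 0.45
    outside = (x - center) ** 2 + (y - center) ** 2 > radius ** 2
    # White (255) entries are masked out; 0 entries are free to draw on
    return np.where(outside, 255, 0).astype(np.uint8)

shade = circle_mask(800)
print(shade.shape, shade.dtype)
```

The resulting shade array can be passed as mask=shade in place of the array loaded from wordcloud2.png.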

Other properties

  1. prefer_horizontal

The ratio of times to try horizontal fitting as opposed to vertical. If prefer_horizontal < 1, the algorithm will try rotating the word if it doesn't fit. (There is currently no built-in way to get only vertical words.)

In practice, when this value is 0, all words in the word cloud are vertical; when it is 1 or greater, all words are horizontal. Horizontal and vertical words are mixed only when the value is greater than 0 and less than 1.

    wordcloud = WordCloud(font_path="D:\\githubMe\\flask-tutorial\\doc\\msyh.ttc", width=800, height=800, mask=shade, prefer_horizontal=1).generate(text)

    wordcloud = WordCloud(font_path="D:\\githubMe\\flask-tutorial\\doc\\msyh.ttc", width=800, height=800, mask=shade, prefer_horizontal=0).generate(text)
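
The behaviour at 0 and 1 can be illustrated with a simplified model. It assumes that each word's first orientation is chosen by comparing a uniform random number in [0, 1) with prefer_horizontal. That is an assumption about how the ratio is applied, not the library's actual code, and horizontal_share is a hypothetical helper.

```python
import random

def horizontal_share(prefer_horizontal, trials=10_000, seed=0):
    """Fraction of simulated first placement attempts that are horizontal."""
    rng = random.Random(seed)
    # rng.random() is in [0, 1): never below 0, always below 1
    horizontal = sum(rng.random() < prefer_horizontal for _ in range(trials))
    return horizontal / trials

for p in (0, 0.5, 0.9, 1):
    print(p, horizontal_share(p))
```

Under this model a value of 0 never picks horizontal and a value of 1 always does, which matches the two word clouds above; values in between give a mix in roughly that proportion.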

  2. background_color

The background color of the word cloud, which is black by default. For example, to make it white:

    wordcloud = WordCloud(font_path="D:\\githubMe\\flask-tutorial\\doc\\msyh.ttc", width=800, height=800, mask=shade, background_color='#fff').generate(text)

  3. relative_scaling

The importance of relative word frequency for font size. With relative_scaling = 0, only word ranks are considered. With relative_scaling = 1, a word that is twice as frequent is twice as large. If you want to consider word frequencies and not only their rank, a relative_scaling of around 0.5 usually looks good. If it is 'auto', it is set to 0.5 unless repeat is true, in which case it is set to 0.
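
One way to picture relative_scaling is as a blend between "keep the previous word's size" and "scale the size by the frequency ratio". The function below is a simplified sketch of that idea under my own assumptions, not the library's exact implementation; scaled_font_size and its parameters are hypothetical names.

```python
def scaled_font_size(prev_size, prev_freq, freq, relative_scaling):
    """Blend between rank-only sizing (0) and frequency-proportional sizing (1)."""
    rs = relative_scaling
    return int(round((rs * (freq / prev_freq) + (1 - rs)) * prev_size))

# The next word is half as frequent as the previous one, which was drawn at size 100
for rs in (0, 0.5, 1):
    print(rs, scaled_font_size(100, 1.0, 0.5, rs))
```

In this model a word with half the frequency keeps size 100 at relative_scaling = 0, drops to 75 at 0.5, and drops to 50 at 1, which is the "twice as frequent, twice as large" case.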

Reference

Word_cloud website
