Introduction

Word clouds are a way of presenting data

  • Before you know how to make one, it looks impressive and sophisticated
  • After making one yourself, you start noticing them everywhere

Learn how to implement word clouds in Python

Preparation

Install the packages

pip install wordcloud matplotlib jieba pillow
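Note that the Pillow library is imported as PIL but is published on PyPI under the name pillow. A quick way to confirm that everything is importable is a small check like the sketch below; the helper name missing_modules is made up for this illustration and is not part of any of these libraries.

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names used in this article; Pillow is imported as "PIL".
required = ["wordcloud", "matplotlib", "jieba", "PIL", "numpy"]
missing = missing_modules(required)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All packages are available")
```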

Prepare some text, English or Chinese

A simple example

Optional arguments of WordCloud()

  • font_path: path to the font file to use; both OTF and TTF fonts are supported
  • width: width of the word cloud; the default is 400
  • height: height of the word cloud; the default is 200
  • mask: a mask used to customize the shape of the word cloud
  • min_font_size: minimum font size; the default is 4
  • max_font_size: maximum font size; by default, the height of the word cloud is used
  • max_words: maximum number of words; the default is 200
  • stopwords: stop words to ignore; if not specified, the built-in stop word list is used
  • background_color: background color; the default is black
  • mode: the default is RGB; if mode is RGBA and background_color is set to None, the background is transparent

# -*- coding: utf-8 -*-

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Open text
text = open('constitution.txt').read()
# generate object
wc = WordCloud().generate(text)

# Display word cloud
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()

# Save to file
wc.to_file('wordcloud.png')

Because English words are separated by spaces, most English text needs no additional processing

Chinese word cloud

Chinese text generally needs word segmentation first. Let's start by looking at the result without segmentation

Taking Journey to the West as an example, the result contains all kinds of two-, three-, and four-character strings, but many of them are not real words

# -*- coding: utf-8 -*-

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Read the text (assumes the file is UTF-8 encoded)
text = open('xyj.txt', encoding='utf-8').read()
# generate object
wc = WordCloud(font_path='Hiragino.ttf', width=800, height=600, mode='RGBA', background_color=None).generate(text)

# Display word cloud
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()

# Save to file
wc.to_file('wordcloud.png')
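Why unsegmented Chinese gives poor results becomes clearer when you look at how tokens are found. The sketch below assumes the token pattern r"\w[\w']+", which the wordcloud documentation gives as the default for the regexp parameter (worth checking against your installed version). The sample sentence and its segmentation are written by hand for illustration and are not actual jieba output.

```python
import re

# Assumed default token pattern (see the `regexp` parameter in the wordcloud docs).
TOKEN_PATTERN = r"\w[\w']+"

raw = "孙悟空大闹天宫，玉帝大怒"
# Hand-segmented version of the same sentence, standing in for a segmenter's output.
segmented = "孙悟空 大闹 天宫 ， 玉帝 大怒"

raw_tokens = re.findall(TOKEN_PATTERN, raw)
segmented_tokens = re.findall(TOKEN_PATTERN, segmented)

print(raw_tokens)        # whole clauses are treated as single "words"
print(segmented_tokens)  # each space-separated word becomes its own token
```

Without spaces, only punctuation breaks the text apart, so entire clauses are counted as words. Joining the segmenter's output with spaces gives the tokenizer real word boundaries to work with.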

This time we first use jieba for Chinese word segmentation, and almost everything in the generated word cloud is now a real word

# -*- coding: utf-8 -*-

from wordcloud import WordCloud
import matplotlib.pyplot as plt
import jieba

# Read the text (assumes the file is UTF-8 encoded)
text = open('xyj.txt', encoding='utf-8').read()

# Chinese word segmentation
text = ' '.join(jieba.cut(text))
print(text[:100])

# generate object
wc = WordCloud(font_path='Hiragino.ttf', width=800, height=600, mode='RGBA', background_color=None).generate(text)

# Display word cloud
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()

# Save to file
wc.to_file('wordcloud.png')

Using a mask

The mask here works much like a layer mask in Photoshop

With a mask, the word cloud is generated in the shape defined by the mask image you provide

# -*- coding: utf-8 -*-

from wordcloud import WordCloud
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
import jieba

# Read the text (assumes the file is UTF-8 encoded)
text = open('xyj.txt', encoding='utf-8').read()

# Chinese word segmentation
text = ' '.join(jieba.cut(text))
print(text[:100])

# generate object
mask = np.array(Image.open("black_mask.png"))
wc = WordCloud(mask=mask, font_path='Hiragino.ttf', mode='RGBA', background_color=None).generate(text)

# Display word cloud
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()

# Save to file
wc.to_file('wordcloud.png')
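A mask does not have to come from an image file. According to the wordcloud documentation, pure white entries of the mask are treated as masked out, and all other entries are free to draw on. The sketch below builds a small circular mask as plain nested lists so the idea is visible without any image; circle_mask is a hypothetical helper, and the rows would still need to be converted with np.array(..., dtype=np.uint8) before being passed as mask=.

```python
def circle_mask(size):
    """Return size x size rows: 0 inside the circle (drawable), 255 outside (masked out)."""
    center = (size - 1) / 2
    radius = size / 2
    rows = []
    for y in range(size):
        row = []
        for x in range(size):
            inside = (x - center) ** 2 + (y - center) ** 2 <= radius ** 2
            row.append(0 if inside else 255)
        rows.append(row)
    return rows


rows = circle_mask(9)
for row in rows:
    # '#' marks cells where words may be drawn, '.' marks masked-out cells
    print("".join("#" if value == 0 else "." for value in row))

# To use it with WordCloud (requires numpy and wordcloud):
#   mask = np.array(circle_mask(600), dtype=np.uint8)
#   wc = WordCloud(mask=mask, font_path='Hiragino.ttf').generate(text)
```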

Color

The color of the word cloud can be extracted from the mask using ImageColorGenerator()

# -*- coding: utf-8 -*-

from wordcloud import WordCloud, ImageColorGenerator
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
import jieba

# Read the text (assumes the file is UTF-8 encoded)
text = open('xyj.txt', encoding='utf-8').read()

# Chinese word segmentation
text = ' '.join(jieba.cut(text))
print(text[:100])

# generate object
mask = np.array(Image.open("color_mask.png"))
wc = WordCloud(mask=mask, font_path='Hiragino.ttf', mode='RGBA', background_color=None).generate(text)

# Generate colors from images
image_colors = ImageColorGenerator(mask)
wc.recolor(color_func=image_colors)

# Display word cloud
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()

# Save to file
wc.to_file('wordcloud.png')
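As far as I understand it, ImageColorGenerator picks each word's color by averaging the image pixels that lie under the word's bounding box. The following pure-Python sketch imitates that idea on a tiny hand-made image; mean_color and the 2 x 2 image are illustrative assumptions, not the library's actual code, which works on NumPy arrays.

```python
def mean_color(pixels, top, left, height, width):
    """Average the (r, g, b) pixels inside a rectangle and return an 'rgb(...)' string."""
    patch = [pixel
             for row in pixels[top:top + height]
             for pixel in row[left:left + width]]
    if not patch:
        raise ValueError("the rectangle lies outside the image")
    channels = zip(*patch)  # regroup the pixels into r, g and b sequences
    r, g, b = (int(sum(values) / len(patch)) for values in channels)
    return "rgb(%d, %d, %d)" % (r, g, b)


# A 2 x 2 image: red and blue on the top row, black on the bottom row.
image = [
    [(255, 0, 0), (0, 0, 255)],
    [(0, 0, 0), (0, 0, 0)],
]
top_row_color = mean_color(image, top=0, left=0, height=1, width=2)
print(top_row_color)  # rgb(127, 0, 127)
```

A word placed over a region that mixes several colors therefore gets a blend of them, which is why mask images with large, flat color areas tend to give the cleanest results.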

You can also limit the cloud to shades of a single hue by supplying a color function

# -*- coding: utf-8 -*-

from wordcloud import WordCloud
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
import random
import jieba

# Read the text (assumes the file is UTF-8 encoded)
text = open('xyj.txt', encoding='utf-8').read()

# Chinese word segmentation
text = ' '.join(jieba.cut(text))
print(text[:100])

# color function
def random_color(word, font_size, position, orientation, font_path, random_state):
	s = 'hsl(0, %d%%, %d%%)' % (random.randint(60, 80), random.randint(60, 80))
	print(s)
	return s

# generate object
mask = np.array(Image.open("color_mask.png"))
wc = WordCloud(color_func=random_color, mask=mask, font_path='Hiragino.ttf', mode='RGBA', background_color=None).generate(text)

# Display word cloud
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()

# Save to file
wc.to_file('wordcloud.png')

For details on the HSL color model, see

www.w3.org/wiki/CSS3/C…
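To make the HSL strings more concrete, here is a variation of the color function as a sketch. It draws its random numbers from the random_state argument that wordcloud passes in (which I believe is a random.Random instance; the function falls back to a fresh generator if it is None), so colors are reproducible when a seed is set. The names single_hue_color and hsl_to_rgb and the extra hue parameter are assumptions made for this example.

```python
import colorsys
import random


def single_hue_color(word=None, font_size=None, position=None,
                     orientation=None, font_path=None, random_state=None,
                     hue=0):
    """Return an 'hsl(h, s%, l%)' string with a fixed hue and random saturation/lightness."""
    rng = random_state if random_state is not None else random.Random()
    saturation = rng.randint(60, 80)
    lightness = rng.randint(60, 80)
    return "hsl(%d, %d%%, %d%%)" % (hue, saturation, lightness)


def hsl_to_rgb(hue, saturation, lightness):
    """Convert HSL (degrees, percent, percent) to an (r, g, b) tuple of 0-255 ints."""
    # colorsys uses HLS order and expects every component in the range 0..1.
    r, g, b = colorsys.hls_to_rgb(hue / 360, lightness / 100, saturation / 100)
    return tuple(round(c * 255) for c in (r, g, b))


print(single_hue_color(random_state=random.Random(42)))
print(hsl_to_rgb(0, 100, 50))  # pure red: (255, 0, 0)
```

It can be passed as WordCloud(color_func=single_hue_color, ...) in the same way as random_color above. Hue 0 is red; the saturation and lightness ranges of 60 to 80 keep every word a soft shade of that hue.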

Fine control

If you want fine control over which words appear in the word cloud and how large each one is, try generate_from_frequencies(), which takes two parameters

  • frequencies: a dictionary mapping each word to its weight, which determines its size
  • max_font_size: maximum font size; the default is None

generate() = process_text() + generate_from_frequencies()
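In other words, generate() first turns the text into a word-to-frequency mapping and then lays the words out. The sketch below imitates the first step with the standard library so you can see what kind of dictionary generate_from_frequencies() expects. count_words is a hypothetical stand-in, not wordcloud's actual process_text(), which does more (for example, plural normalization and optional collocations).

```python
import re
from collections import Counter


def count_words(text, stopwords=frozenset()):
    """Count lower-cased word tokens in `text`, skipping anything in `stopwords`."""
    tokens = re.findall(r"\w[\w']+", text.lower())
    return Counter(token for token in tokens if token not in stopwords)


sample = "the cat sat on the mat and the cat slept"
freq = count_words(sample, stopwords={"the", "on", "and"})
print(freq.most_common(2))

# The resulting mapping can be passed on (requires wordcloud):
#   wc = WordCloud().generate_from_frequencies(dict(freq))
```

Because you build the dictionary yourself, you can drop, add, or re-weight individual words before the layout step.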

The example below uses jieba to extract keywords and their weights, then draws the word cloud from them

# -*- coding: utf-8 -*-

from wordcloud import WordCloud, ImageColorGenerator
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
import jieba.analyse

# Read the text (assumes the file is UTF-8 encoded)
text = open('xyj.txt', encoding='utf-8').read()

# Extract keywords and weights
freq = jieba.analyse.extract_tags(text, topK=200, withWeight=True)
print(freq[:20])
freq = {i[0]: i[1] for i in freq}

# generate object
mask = np.array(Image.open("color_mask.png"))
wc = WordCloud(mask=mask, font_path='Hiragino.ttf', mode='RGBA', background_color=None).generate_from_frequencies(freq)

# Generate colors from images
image_colors = ImageColorGenerator(mask)
wc.recolor(color_func=image_colors)

# Display word cloud
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()

# Save to file
wc.to_file('wordcloud.png')

References

  • WordCloud documentation: amueller.github.io/word_cloud/…
  • WordCloud official examples: amueller.github.io/word_cloud/…
  • Jieba Chinese segmentation: github.com/fxsjy/jieba

Video lecture course

Deep and interesting (1)