1. Foreword

When I finally produced this picture, it struck me that these were all the names mentioned or recorded in my diary in 2017. Of course, to avoid unnecessary trouble, I omitted the names of many relatives and friends. A lifetime is long if you call it long and short if you call it short, and a few dozen words can sum up all those people and events. People we once spent day and night with and knew inside out may have quietly drifted away; two or three old friends reminiscing about the past suddenly find that, for the life of them, they cannot recall some classmate's name. Forgetting always happens imperceptibly; sometimes we don't even know what we have forgotten.

Forgetting the people and events of childhood and youth is probably inevitable, but hasn't this past year, too, already become "a thing like a spring dream, leaving no trace"? A week of 2018 has already gone by. Many people have written their summaries and reviews of 2017; those without the habit of summarizing simply carry on into the new year. I don't usually write year-end summaries either, but perhaps it was listening to Xu Fei's song "Father Wrote Prose Poems" and tearing up at the line "these are the words in my father's diary / this is the prose poem his life leaves behind" that made me think: if, many years from now, my children want to know about my past, I may have no prose poems, but I hope this review can sum something up and leave a little behind...

In fact, I planned to study the names in my diary because I wanted to learn and practice some of the text analysis and mining methods from articles I had read. The diary is a ready-made corpus and the text I am most familiar with, and so this article came about.

2. Extracting Names

First, to obtain the names in the text, I followed the idea in the article "Google Semantic Analysis and Gephi Social Network Derived from Tian Long Ba Bu" (see Related Reading) and used the jieba Chinese word segmentation library for Python to extract the top 5000 nouns with the highest TF-IDF weights from the diary text, roughly as sketched below.
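A minimal sketch of this step, assuming the diary lives in a plain-text file (the file name diary.txt is a placeholder):

```python
# Sketch: extract high-TF-IDF nouns from the diary text with jieba.
import jieba.analyse

with open("diary.txt", encoding="utf-8") as f:  # placeholder file name
    text = f.read()

# allowPOS keeps only nouns ('n') and person names ('nr');
# topK=5000 returns the 5000 terms with the highest TF-IDF weight.
keywords = jieba.analyse.extract_tags(
    text, topK=5000, withWeight=True, allowPOS=("n", "nr")
)

for word, weight in keywords[:20]:  # peek at the strongest terms
    print(word, weight)
```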

The output shows that while names such as Zhuangzi, Jia Baoyu and Wang Xiaobo are extracted accurately, many nouns for other things still need to be removed. Since I don't know of a convenient and efficient way to extract person names automatically, this time I set a lower bound on how often a name appears in the text, manually screened the names that met the requirement, and then multiplied the TF-IDF weights by 100 or 1000 so that HTML5 Word Cloud would render a good-looking word cloud; a sketch of this screening step follows.
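A hedged sketch of the screening and re-weighting, with an illustrative name list and sample weights rather than my actual data:

```python
# Sketch: keep only hand-screened names and scale the TF-IDF weights
# so HTML5 Word Cloud sizes them nicely. All values are illustrative.
names = {"鲁迅", "叶嘉莹", "庄子", "贾宝玉", "王小波"}

# (word, tf-idf weight) pairs as returned by jieba.analyse.extract_tags
keywords = [("鲁迅", 0.021), ("哲学", 0.015), ("叶嘉莹", 0.012)]

scaled = [(w, wt * 1000) for w, wt in keywords if w in names]  # x100 or x1000
for word, weight in scaled:
    print(word, weight)
```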

The picture above is the most representative group portrait in my diary. There are masters like Lu Xun and Ye Jiaying; Zhihu users like Zhang Jiawei and Lurenjia; singers I like such as Momoe Yamaguchi and Anpu; AI heavyweights like Andrew Ng and Fei-Fei Li; and figures from popular variety shows and trending events. Each person was recorded for a different reason, and together they piece together this imprint of the year.

3. Extracting Character Relationships

Besides extracting names from the text, I also extracted the network of character relationships in the diary based on co-occurrence and visualized it with Gephi.

To quote an introduction to the basic principle of co-occurrence networks: "Co-occurrence between entities is a statistics-based way of extracting information. Closely related characters often appear together in multiple paragraphs of a text. By identifying the recognized entities (names) in the text, we can count how often different entities appear together and compute their rate of co-occurrence. When that rate exceeds a certain threshold, we consider the two entities to have some kind of connection."

For the implementation, you can refer to the code for extracting character relationships from "Train to Busan" (see Related Reading) and adapt it to your own needs; a minimal co-occurrence sketch follows.
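A minimal sketch of paragraph-level co-occurrence counting, assuming one line per paragraph; this is not the "Train to Busan" code itself, and the name list is illustrative:

```python
# Sketch: count how often each pair of names appears in the same paragraph.
from collections import Counter
from itertools import combinations

names = ["鲁迅", "叶嘉莹", "黛玉", "庄子"]  # illustrative, hand-screened list

cooccur = Counter()
with open("diary.txt", encoding="utf-8") as f:  # placeholder file name
    for paragraph in f:
        present = sorted({n for n in names if n in paragraph})
        for a, b in combinations(present, 2):  # each unordered pair co-occurs once
            cooccur[(a, b)] += 1

for (a, b), count in cooccur.most_common(10):
    print(a, b, count)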

Applying this to my own text produces the "node" and "edge" files that Gephi needs for visualization; here, too, non-name entries have to be removed. The node format is as follows:
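As a sketch of what Gephi expects, the node table is a CSV with Id and Label columns (names here are illustrative):

```
Id,Label
鲁迅,鲁迅
叶嘉莹,叶嘉莹
黛玉,黛玉
```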

The edge format is as follows:
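Correspondingly, a sketch of an edge table with Source, Target and Weight columns, where Weight can hold the co-occurrence count (values illustrative):

```
Source,Target,Weight
鲁迅,叶嘉莹,12
鲁迅,黛玉,7
```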

Import the data into Gephi:

Adjust the size and color of the nodes and run the layout algorithm:

Turning on node labels reveals more names than the word cloud did:

After a bit of blind fiddling, the focus settles on the closely connected, frequently mentioned characters; the main nodes are Lu Xun, Ye Jiaying, Daiyu and so on:

The most important threads in the whole network are shown in the figure below:

Among the more regular connections, a "He" next to "He Zhizhang" looked pretty strange. After thinking for a few seconds, I remembered that someone had once asked me about the given names and courtesy names of the ancients, which was interesting in its own right; but to this day I still don't know what name that family chose. Friends who see this are welcome to chime in and try to come up with a nice name for a boy surnamed He: what do you think?

Behind this network of characters are bits and pieces of memory from 2017: much of it not worth telling outsiders but quietly enjoyed, and many people and things I can no longer recall.

I am a man with a very poor memory; I simply cannot remember things from a few days ago, or much of what happened yesterday. I have kept a diary for two or three years now, and every time I look back at the people and events from the early days, I feel glad I was once careful and diligent. When I run into a blank period, or a gap in time, I feel quite disconsolate, as if a part of my life had gone missing, leaving only whiteness. To quote Shen Fu at the beginning of Six Records of a Floating Life: "Dongpo said, 'Things pass like a spring dream, leaving no trace.' If I did not record them with pen and ink, I would be ungrateful for the kindness of Heaven." Although I have no remarkable events worth recording, I deeply sympathize with that regret and melancholy over "things like a spring dream, leaving no trace", and this is why I have recommended diary-keeping to others; of course it varies from person to person and cannot be forced.

Although this article is only simple text mining without any in-depth research, it was still a novel exploration for me, and I also take it as a superficial review of some of the people and events of my 2017. Finally, I leave a riddle: the picture below is a clue to the name of a variety show.

4. Related Reading

1. jieba Chinese word segmentation library (GitHub)

2. Google Semantic Analysis and Gephi Social Network Derived from Tian Long Ba Bu

3. Text Co-occurrence Example

4. Extracting Character Relationships from Train to Busan Based on Co-occurrence in Python


About the author: Desert X, columnist for the Python Chinese Community, Python developer, and a recent fan of Momoe Yamaguchi.
