The complete source code can be obtained from the WeChat official account "01 binary": reply "Zhai Tianlin" in the background.

Yesterday was the Lantern Festival. In Nanjing, the Lantern Festival means the New Year is truly over and it is time to go back to work. Everyone says there has been an unusual amount of gossip this year (melons growing in bunches, Calabash Brothers style), but to me the sweetest melon of the whole Spring Festival was none other than Zhai Tianlin's "What is CNKI?".

Some time ago, Zhai Tianlin's academic misconduct and plagiarism began to surface on Weibo, implicating his supervisor, his dean, and even the entire Beijing Film Academy.

I don't usually pay much attention to the entertainment industry, so at first I didn't take it to heart, until the news broke that a large part of Zhai's thesis was plagiarized from Chen Kun's. That is when I became interested in this entertainment-industry doctor's article. Let's take a hard-core look at Zhai's paper from a coder's perspective.

Experimental environment

Before starting the analysis, let me describe the environment in which it was carried out, to avoid any surprises:

  • macOS 10.14.3
  • Python 3.6.8 (Anaconda)
  • Visual Studio Code
  • Packages used:
    • pkuseg (word segmentation)
    • matplotlib (plotting)
    • wordcloud (word cloud generation)
    • numpy (numerical computation)
    • sklearn (machine learning)

Data acquisition

To be honest, at first I thought that even if Zhai didn't know what "CNKI" was, CNKI would at least include his articles. However, after searching CNKI for a long time, I couldn't find his thesis. Fortunately, I found the article on Toutiao and saved it in data/zhai.txt. Speaking of which, I really have to thank Zhai Tianlin: thanks to him, we have all become so academic and started studying undergraduate, master's, and doctoral theses.

Data cleaning

In the previous section we saved his paper to a TXT file, so we need to load the article into memory first:

# Data acquisition (read the thesis text from a file)
def readFile(file_path):
    with open(file_path, encoding="utf-8") as f:
        content = f.read()
    return content
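As a quick sanity check, the saved file can be loaded like this (a minimal usage sketch based on the data/zhai.txt path mentioned above):

# Minimal usage sketch: load the thesis saved in the previous section
content = readFile("data/zhai.txt")
print(len(content))  # rough length of the full text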

Excluding the title at the beginning and the acknowledgments at the end, I counted 25,005 words.

Next, we clean the data. Here I use pkuseg to segment the content into words and output the result after removing stop words.

Stop words are words that carry no specific meaning in context, such as "this", "that", "you", "me", "him", auxiliary particles, punctuation marks, and so on. Since nobody searches with these meaningless stop words, I filter them out so the segmentation results are cleaner.

import pkuseg

# Data cleaning (word segmentation and removal of stop words)
def cleanWord(content):
    # Word segmentation
    seg = pkuseg.pkuseg()
    text = seg.cut(content)

    # Read the stop word list (one stop word per line)
    with open("Stopwords/Harbin Institute of Technology stopwords table.txt", encoding="utf-8") as f:
        stopwords = f.read().split("\n")

    # Drop the stop words
    new_text = []
    for w in text:
        if w not in stopwords:
            new_text.append(w)

    return new_text

Execution Result:

Let me mention two points:

  1. Why use pkuseg rather than jieba for word segmentation?

Pkuseg is a word segmentation tool launched by Peking University. The official address is github.com/lancopku/pk…

  2. Why use the Harbin Institute of Technology stop word list?

The download address of the stop list is: github.com/YueYongDev/…

Stop word list                                     Best-suited text type
Harbin Institute of Technology stop word list      Literature and periodical texts
Baidu stop word list                               News reports
Sichuan University stop word list                  Email texts

Guan Qin, Deng Sanhong, Wang Hao. A Comparative Study of Stop Word Lists in Chinese Text Clustering [J]. Journal of Data Analysis and Knowledge Discovery, 2006, 1(3).

If you are interested in reading this paper, you can get it from the WeChat official account "01 binary": reply "Stop word list comparative study" in the background.

Data statistics

Now for the statistics. There is actually not much to compute, so to keep things simple I just count the frequency of each word and output the 15 most frequent ones.

import pprint
from collections import Counter

# Data statistics (word frequency)
def statisticalData(text):
    # Count the frequency of each word
    counter = Counter(text)
    # Output the 15 words with the highest frequency
    pprint.pprint(counter.most_common(15))

Printed results:

Truly a rare "good actor": he can bring a role to life, and even with no substance inside he still used his acting skills to craft an "outstanding student" persona for himself. What a well-rounded character. Perhaps this is the art of creation!

Most of the frequent words in the article are "life", "role", "character", and "personality", exactly the spirit you would expect of a good actor. Turning these words into a word cloud might show this even better.

Generate the word cloud

For this part I use the wordcloud library. It is very easy to use and there are plenty of tutorials online. One thing worth mentioning: to prevent Chinese characters from showing up garbled, you need to set the font_path parameter. You can use a system font or find one online; here is a site I recommend for free Chinese font downloads: www.lvdoutang.com/zh/0/0/1/1….

Here is the code to generate the word cloud:

import os

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Data visualization (generate the word cloud)
def drawWordCloud(text, file_name):
    # Join the segmented words with spaces so WordCloud treats them as separate tokens
    wl_space_split = " ".join(text)

    # Set the word cloud background (mask) image
    b_mask = plt.imread('assets/img/bg.jpg')
    # Set the word cloud font (Chinese characters will be garbled without it)
    font_path = 'assets/font/FZZhuoYTJ.ttf'
    # Basic settings for the word cloud (background color, font path, mask image, word margin)
    wc = WordCloud(background_color="white", font_path=font_path, mask=b_mask, margin=5)
    # Generate the word cloud
    wc.generate(wl_space_split)
    # Display the word cloud
    plt.imshow(wc)
    plt.axis("off")
    plt.show()
    # Save the word cloud locally
    path = os.getcwd() + '/output/'
    wc.to_file(path + file_name)
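Chaining the previous steps together, the word cloud for Zhai's paper can be generated roughly like this (a sketch; the output file name is my own choice):

# Sketch: run the full pipeline on Zhai's paper (output file name is an assumption)
zhai_words = cleanWord(readFile("data/zhai.txt"))
drawWordCloud(zhai_words, "zhai_wordcloud.png")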

The Real and the Fake Li Kui (comparing the articles)

Having analyzed the fake "Li Gui", we should now invite his real counterpart "Li Kui" onto the stage. The routine is the same as before: find the data, segment the words, and count the frequencies. I won't repeat those steps here and will go straight to the word cloud; a sketch of the reused pipeline follows below.
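Here is what that reuse might look like (a sketch only; data/chen.txt is a hypothetical path where Chen Kun's paper would be saved in the same way as Zhai's):

# Sketch: the same pipeline applied to Chen Kun's paper (data/chen.txt is an assumed path)
chen_words = cleanWord(readFile("data/chen.txt"))
statisticalData(chen_words)
drawWordCloud(chen_words, "chen_wordcloud.png")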

Doesn't this look a lot like Zhai's word cloud? So just how similar are the "real and fake Li Kui"? Let's calculate the similarity between the two articles.

Comparison of similarity of articles

TF-IDF

There are many ways to compare the similarity of two articles and many classes of models to do it with, including TF-IDF, LDA, LSI, and so on. For convenience, only TF-IDF is used here.

TF-IDF is essentially term frequency (TF) with inverse document frequency (IDF) information added on top. If you are not familiar with IDF, take a look at Ruan Yifeng's explanation: www.ruanyifeng.com/blog/2013/0…, which covers TF-IDF very thoroughly.
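To make the idea concrete, here is a toy sketch of the textbook TF-IDF weighting (for illustration only; the actual comparison below relies on sklearn's TfidfVectorizer, whose default formula adds extra smoothing):

import math

# Two tiny "documents" (already segmented) just to illustrate the weighting
docs = [["生活", "角色", "人物"], ["角色", "人物", "性格"]]

def tf_idf(word, doc, all_docs):
    tf = doc.count(word) / len(doc)             # term frequency within the document
    df = sum(1 for d in all_docs if word in d)  # number of documents containing the word
    idf = math.log(len(all_docs) / df) + 1      # rarer words get a higher weight
    return tf * idf

print(tf_idf("生活", docs[0], docs))  # appears in one document -> weighted up
print(tf_idf("角色", docs[0], docs))  # appears in both documents -> weighted down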

Sklearn

Scikit-learn, often referred to as sklearn, is one of the most popular Python libraries for machine learning. Here we use its TfidfVectorizer to compute the similarity between the two articles. The code is as follows:

import numpy as np
from numpy.linalg import norm
from sklearn.feature_extraction.text import TfidfVectorizer

# Calculate text similarity
def calculateSimilarity(s1, s2):
    # Segment each text and join the words with spaces
    def add_space(s):
        return ' '.join(cleanWord(s))

    s1, s2 = add_space(s1), add_space(s2)
    # Convert the two texts into TF-IDF vectors
    cv = TfidfVectorizer(tokenizer=lambda s: s.split())
    corpus = [s1, s2]
    vectors = cv.fit_transform(corpus).toarray()
    # Cosine similarity of the two TF-IDF vectors
    return np.dot(vectors[0], vectors[1]) / (norm(vectors[0]) * norm(vectors[1]))

Besides sklearn, we could also use Gensim to call some of these models; given the length of this article, I will leave that for interested readers to research and implement on their own.
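Putting everything together, the two papers can be compared like this (a sketch; data/chen.txt is again the assumed path for Chen Kun's paper):

# Sketch: read both papers and compute their similarity (data/chen.txt is an assumed path)
zhai = readFile("data/zhai.txt")
chen = readFile("data/chen.txt")
print("The similarity of the two articles is", calculateSimilarity(zhai, chen))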

After feeding Zhai's paper and Chen Kun's paper into this function, the output is:

The similarity of the two articles is 0.7074857881770839

To be honest, this result surprised me. I knew this fake "Li Gui" looked the part, but I did not expect the similarity to be as high as 70.7%. Of course, as a junior, Zhai's case is nothing compared to Wu's. 🙈

The complete source code can be obtained from the WeChat official account "01 binary": reply "Zhai Tianlin" in the background.