The complete source code is available from the public account “01 binary”: reply “public number data analysis” in the account’s chat window.

1. Introduction

Under the influence of my classmates, I registered the public account “01 binary” in September 2018. For various reasons (laziness, really), I didn’t publish my first article there until November 11th. As of this writing, I have posted 21 articles and have only 86 followers, which got me thinking: why does my public account have so few followers?

So I asked my good friend, Brother 🐔. He told me that good writing is only half of it; you also have to catch trending topics. Then he sent me data from a well-known public account and told me to work through it.

Sure enough, the analysis showed that getting a public account read takes some tricks of the trade. Let’s look at what those tricks are from a coder’s perspective.

2. Purpose of the analysis

This project has three main goals:

(1) Analyze the content operations of a well-known public account, mainly through descriptive statistics on the number of posts, the number of likes, posting times, and similar measures;

(2) Analyze the titles to find out what kind of title is more popular;

(3) Visualize messy structured and unstructured data to show the beauty of data.

3. Experimental environment

Before starting, let me describe the environment in which the analysis was run, to help avoid unexpected errors when reproducing it:

  • macOS 10.14.3
  • Python 3.6.8 (Anaconda)
  • Visual Studio Code (development)
  • Jupyter Notebook (debugging environment)
  • The packages used are:
    • pkuseg (word segmentation)
    • pyecharts (charting)
    • numpy (numerical computation)
    • pandas (data processing)

4. Data acquisition and preview

4.1 Data Acquisition

The data set was collected by a web crawler that scraped all the articles of one public account. I won’t go into the crawling here. If you need the data set, you can download it along with the source code, or reply “public number dataset” to the public account “01 binary”.

4.2 Previewing Data

In this project, I used Pandas to read and preview data. Pandas is widely used in data science.

The data preview code is as follows:

import pandas as pd
# Read the data set
df = pd.read_excel('data/data.xlsx')
# Display the first five rows of data
df.head()

The length of the data set is 👇:

I calculated that as of January 22, 2019, the official account had published a total of 1,111 articles (a really friendly number for singles 🐶).

5. Descriptive analysis

In this part, I run descriptive statistics on the main numerical fields of the data set. This is routine data analysis, but it can already reveal some patterns. The visualization tool is pyecharts; its official website is pyecharts.org/#/.

5.1 Article Sources

It is hard for one person alone to keep a public account active every day, so many accounts reprint good articles written by others from time to time. But too many reprints dilute an account’s originality, so what proportion is appropriate? Let’s look at the number of reprinted and original articles on this well-known public account:

# Get reprinted articles
df_copy = df[df['Original link'].str.len() > 5]
print('The number of reprinted articles is:'+str(len(df_copy)))
print('Number of original articles:'+str(len(df)-len(df_copy)))

In this data set, an article is treated as a reprint if the “Original link” column contains data, and as original if it does not. The output is as follows:

The number of reprinted articles is:249
Number of original articles:862

A rough calculation puts the ratio of original to reprinted articles at about 3.5:1.
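The rule and the ratio can be checked with a few lines. This is only a minimal sketch: the `is_reprint` helper and the sample links are made up for illustration, and only the two totals (249 and 862) come from the output above.

```python
# Hypothetical helper: mirrors the rule "a link longer than 5 characters
# in the 'Original link' column means the article is a reprint".
def is_reprint(original_link):
    return isinstance(original_link, str) and len(original_link) > 5

# Made-up sample values, only to exercise the rule.
sample_links = ["https://example.com/post/1", None, "", "https://example.com/post/2"]
reprint_flags = [is_reprint(link) for link in sample_links]

# Totals reported in the article's output.
reprinted, original = 249, 862
ratio = original / reprinted
print("original : reprinted = {:.2f} : 1".format(ratio))
```

The two totals also add up to the 1,111 articles counted earlier.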

5.2 How popular is it

At the start of the article I said this data set comes from a well-known public account. So how well known is it? Let’s look at the number of posts, the total likes, and the average likes per article.

Create a matrix with 5 rows and 4 columns (representing 5 years and 4 quarters), obtain the year and quarter of each post time, and add 1 to the corresponding position:

# Get a quarterly change matrix for the number of articles
def getPostJiDu(df):
    # Generate a matrix with 5 rows and 4 columns, all 0
    list_jidu = [[0] * 4 for i in range(5)]
    for articleTime in df['Post Time']:
        # Fetch the date part of the timestamp
        date = str(articleTime).split(' ')[0]
        # Get the year
        year = getYear(date)
        # Get the quarter
        jidu = getJiDu(date)
        list_jidu[year - 2015][jidu - 1] += 1
    return list_jidu
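`getYear` and `getJiDu` are not listed in this article. One possible version is sketched below, assuming the date string has the form `YYYY-MM-DD`; the actual helpers in the source code may differ, and the timestamps used here are invented.

```python
# Assumed date format: 'YYYY-MM-DD' (for example '2018-11-11').
def getYear(date):
    """Return the year of a 'YYYY-MM-DD' string as an int."""
    return int(date.split('-')[0])

def getJiDu(date):
    """Return the quarter (1 to 4) of a 'YYYY-MM-DD' string."""
    month = int(date.split('-')[1])
    return (month - 1) // 3 + 1

# Same counting step as getPostJiDu, on a few made-up timestamps.
counts = [[0] * 4 for _ in range(5)]
for stamp in ['2015-07-01 22:00:00', '2018-11-11 23:10:00', '2019-01-22 22:30:00']:
    date = stamp.split(' ')[0]
    counts[getYear(date) - 2015][getJiDu(date) - 1] += 1
print(counts)
```

Integer division maps months 1 to 3 to quarter 1, months 4 to 6 to quarter 2, and so on, which avoids a chain of if/elif tests.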

Then use pyecharts to draw the chart:

from functools import reduce
import operator

from pyecharts import Line

# Draw a quarterly chart of the number of articles
def drawPostJiDu(df):
    # Get the quarterly matrix of article counts
    res_list = getPostJiDu(df)
    # Create the list of x-axis labels, one per quarter
    attr = []
    for year in range(len(res_list)):
        for month in range(len(res_list[year])):
            attr.append("{} Q{}".format(year + 2015, month + 1))

    # Flatten the matrix into a one-dimensional list of values
    v1 = reduce(operator.add, res_list)

    # Remove the quarters that have no data
    attr = attr[2:-3]
    v1 = v1[2:-3]

    line = Line("Trend chart of Number of posts", width=1500, height=500)
    line.add("A well-known public Account", attr, v1, is_stack=True,
             is_label_show=True, is_smooth=True, is_fill=True,
             xaxis_name='quarter', yaxis_name='Number of posts', xaxis_rotate=30)
    return line

Note that the public account has no data for the first and second quarters of 2015 or for the second, third, and fourth quarters of 2019, so those empty values are removed before plotting.
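To make the trimming step concrete, the sketch below flattens a made-up 5 x 4 matrix and drops the same five empty quarters. The numbers are placeholders, not the real counts, and `itertools.chain` is used here as an alternative to `reduce(operator.add, ...)`; it is not what the original code does.

```python
from itertools import chain

# Placeholder matrix: rows are 2015..2019, columns are quarters 1..4.
matrix = [[year * 10 + q for q in range(1, 5)] for year in range(5)]
labels = ["{} Q{}".format(2015 + year, q) for year in range(5) for q in range(1, 5)]

# Flatten row by row, then drop the first 2 and last 3 entries
# (2015 Q1-Q2 and 2019 Q2-Q4, which have no data).
values = list(chain.from_iterable(matrix))
labels, values = labels[2:-3], values[2:-3]

print(labels[0], labels[-1], len(values))
```

Slicing labels and values with the same bounds keeps the two lists aligned, which is what the chart needs.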

By analogy, we can also get the trend chart of the number of likes over time:

Here’s how the average number of likes per post has changed.

I have to say, this public account’s traffic is not to be underestimated: an average article gets 10,000+ likes (likes, not reads). No wonder advertisers are queuing up to buy its ad slots; some are said to have offered an advertising fee of as much as 300,000! (When will my articles get that much traffic 😢)

5.3 Is it clickbait

To be honest, I was shocked when I saw the numbers above. What kind of articles does it publish every day to pull in that much traffic?

My high school Chinese teachers often said that a good title adds a lot to an article. That made me wonder: is this account just clickbait, or are its titles really that catchy? Practice is the sole criterion for testing truth, so put on a straight face and let’s analyze it.

5.3.1 What kind of title is more appealing

The idea in this part is: get the article titles, then run word segmentation, stop-word filtering, word frequency counting, and word cloud generation. These steps were introduced in a previous article; if anything is unclear, see “Li Gui Meets Li Kui: I Analyzed Zhai Tianlin’s Paper”. Here is the code directly:

from collections import Counter
from functools import reduce
import operator
import pprint

from pyecharts import WordCloud

# Select the article titles and analyze their most frequent words
def chooseMostPop50Titles(df):
    texts = []
    for title in list(df['title']):
        # Skip articles that have no title
        if len(str(title)) > 3:
            if str(title) not in ['Share pictures']:
                # Segment the title into words and remove stop words
                text = cleanWord(str(title))
                texts.append(text)
    # Turn the two-dimensional list into a one-dimensional list
    title_cuts = reduce(operator.add, texts)

    # Count the frequency of each word
    counter = Counter(title_cuts)
    # Select the 50 words with the highest frequency
    counter = counter.most_common(50)
    # Output the 50 words with the highest frequency
    pprint.pprint(counter)
    name = []
    value = []
    for count in counter:
        name.append(count[0])
        value.append(count[1])
    return drawWordCloud(name, value)

# Data visualization (generating the word cloud)
def drawWordCloud(name, value):
    wordcloud = WordCloud(width=800, height=400)
    wordcloud.add("Title word cloud", name, value,
                  word_size_range=[20, 100], rotate_step=20)
    return wordcloud
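`cleanWord` comes from the earlier article and is not shown here. The sketch below illustrates the same pipeline (segmentation, stop-word filtering, frequency counting) using only the standard library. It splits on whitespace as a stand-in for pkuseg, and the function name `clean_word_sketch`, the stop-word list, and the sample titles are all invented for illustration.

```python
from collections import Counter

# Invented stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "of", "to", "is", "you"}

def clean_word_sketch(title):
    """Split a title into lowercase words and drop stop words.

    Whitespace splitting stands in for pkuseg, which the article
    uses for Chinese word segmentation.
    """
    words = title.lower().split()
    return [w for w in words if w not in STOP_WORDS]

# Invented titles, only to show the flow of data.
titles = ["The secret of love", "Love is a choice", "A choice you make"]

counter = Counter()
for title in titles:
    counter.update(clean_word_sketch(title))

top_words = counter.most_common(2)
print(top_words)
```

Calling `Counter.update` once per title counts the words without first flattening the nested list, so the `reduce(operator.add, ...)` step is not needed in this variant.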

To keep this article short, I won’t show the word frequency results; here is the word cloud directly:

From the word cloud you can roughly guess what kind of audience the public account targets. If you’re interested, share what you read from this word cloud in the comments.

5.3.2 How long should the title be

It’s good for a title to hit a pain point, but does its length matter? (This part is simple: just group the titles by length, so here is the code directly.)

from pyecharts import Bar

# Plot the relationship between the length of the title and the number of likes
def drawTitleLenAndFavourite(df):
    # Total likes for each of the 6 title-length buckets
    v1 = [0] * 6
    for i in range(len(df['title'])):
        title_len = len(str(df['title'][i]))
        if title_len >= 5 and title_len <= 8:
            v1[0] += df['thumb up'][i]
        elif title_len >= 9 and title_len <= 12:
            v1[1] += df['thumb up'][i]
        elif title_len >= 13 and title_len <= 16:
            v1[2] += df['thumb up'][i]
        elif title_len >= 17 and title_len <= 20:
            v1[3] += df['thumb up'][i]
        elif title_len >= 21 and title_len <= 24:
            v1[4] += df['thumb up'][i]
        elif title_len >= 25:
            v1[5] += df['thumb up'][i]
    attr = ['5-8', '9-12', '13-16', '17-20', '21-24', '25+']
    bar = Bar("The relationship between title length and likes.", title_pos='center')
    bar.add("", attr, v1, is_label_show=True)
    return bar

Visualization results:

Looking back, my high school Chinese teacher said a title should be neither too long nor too short, and that 13 to 16 characters is the most suitable. It turns out the teacher wasn’t fooling me.
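One caveat on this chart: drawTitleLenAndFavourite adds up likes per bucket, so a bucket that simply contains more articles can win on volume alone. A minimal sketch of the same bucketing that also reports the average per article is shown below; the `bucket_label` helper and the sample rows are assumptions made for illustration, not data from the set.

```python
from collections import defaultdict

def bucket_label(title_len):
    """Map a title length to the bucket labels used in the chart."""
    if title_len < 5:
        return None  # shorter titles are ignored, as in the original code
    for low, high in [(5, 8), (9, 12), (13, 16), (17, 20), (21, 24)]:
        if low <= title_len <= high:
            return "{}-{}".format(low, high)
    return "25+"

# Invented (title, likes) rows.
rows = [("x" * 6, 100), ("x" * 14, 300), ("x" * 15, 500), ("x" * 30, 200)]

totals = defaultdict(int)
counts = defaultdict(int)
for title, likes in rows:
    label = bucket_label(len(title))
    if label is not None:
        totals[label] += likes
        counts[label] += 1

# Average likes per article in each bucket that has at least one article.
averages = {label: totals[label] / counts[label] for label in totals}
print(averages)
```

Comparing `totals` with `averages` shows whether a bucket leads because its titles perform better or only because it holds more articles.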

5.4 When is the best time to post

People read at different times, but most people’s free time falls in roughly the same hours, so a natural question is: when is the best time to push an article? Let’s see how this public account schedules its posts.

5.4.1 Relationship between the number of posts and the time of day

A quick look at the data shows that most articles are posted at night, so I used a pie chart to see how posts are distributed across hours.

import numpy as np
from pyecharts import Pie

# Draw a pie chart of the number of posts by hour
def drawPostHour(df):
    # Get the number of posts for each hour
    list_hour = getPostHour(df)
    array = np.array(list_hour)
    # Get the 5 hours with the most posts, in chronological order
    attr = list(sorted(np.argsort(array)[-5:]))
    attr = ["{}:00".format(i) for i in attr]
    # Get the number of posts for those 5 hours
    v1 = list(array[sorted(np.argsort(array)[-5:])])
    pie = Pie("Post Time Distribution chart")
    pie.add("", attr, v1, is_label_show=True)
    return pie
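`getPostHour` is not listed in the article. A minimal sketch follows, assuming the `Post Time` values look like `YYYY-MM-DD HH:MM:SS`. A plain dict stands in for the DataFrame, the timestamps are invented, and the top-five step is redone with the standard library instead of `np.argsort`.

```python
def getPostHour(df):
    """Count posts per hour of day; returns a list of 24 counts."""
    list_hour = [0] * 24
    for articleTime in df['Post Time']:
        # Assumed format: 'YYYY-MM-DD HH:MM:SS'
        hour = int(str(articleTime).split(' ')[1].split(':')[0])
        list_hour[hour] += 1
    return list_hour

# A plain dict stands in for the DataFrame; the timestamps are invented.
sample = {'Post Time': ['2018-11-11 22:05:00', '2018-11-12 22:30:00',
                        '2018-11-13 23:01:00', '2018-11-14 08:00:00']}
list_hour = getPostHour(sample)

# Five hours with the most posts, listed in chronological order.
top_hours = sorted(sorted(range(24), key=lambda h: list_hour[h])[-5:])
print(top_hours, [list_hour[h] for h in top_hours])
```

With so few sample rows, some of the five selected hours have a count of zero; on the full data set the selection would contain only busy hours.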

In drawPostHour, getPostHour() builds a list of zeros, one entry per hour of the day, and adds one to the matching entry for each post. Here is the resulting pie chart:

You can see that most of the articles are pushed at around 10 or 11 o’clock at night, which makes sense when you think about it. Most office workers are lying in bed scrolling on their phones before going to sleep at that hour. Over time, users get into the habit of reading an article before bed every day. I have to say, that kind of user stickiness is really strong!

5.4.2 How many likes do users give

The public account pushes articles to its users, so how do the users respond? Let’s look at the bar chart of likes by posting time:

The code follows the same idea as before, so I won’t list it here. Overall, the number of likes this public account receives is quite impressive.
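Since that code is omitted, here is one way it could look, reusing the counting pattern from the pie chart. The column names `Post Time` and `thumb up` come from the earlier snippets; the function name `getLikesByHour`, the assumed timestamp format, and the sample values are hypothetical.

```python
def getLikesByHour(post_times, likes):
    """Sum likes per posting hour; returns a list of 24 totals."""
    totals = [0] * 24
    for articleTime, like in zip(post_times, likes):
        # Assumed format: 'YYYY-MM-DD HH:MM:SS'
        hour = int(str(articleTime).split(' ')[1].split(':')[0])
        totals[hour] += like
    return totals

# Invented sample values standing in for df['Post Time'] and df['thumb up'].
post_times = ['2018-11-11 22:05:00', '2018-11-12 22:30:00', '2018-11-13 23:01:00']
likes = [12000, 9000, 15000]

totals = getLikesByHour(post_times, likes)
print(totals[22], totals[23])
```

The resulting list can be passed to a bar chart in the same way the title-length totals were.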

6. Conclusion

Based on the analysis above, I think that if you want your public account to gain more readers and more reads, you can approach it from the following two angles:

  • Come up with a good title, one that fits your audience’s preferences and is not too long
  • Think about when your audience has free time, and push articles then, so that users gradually build a habit around the public account

In addition, this is a basic exploratory data analysis piece rather than a data analysis report. It aims to spark ideas and to teach how to fish rather than hand over the fish. Given the limited size of the data, drawing firm conclusions is not its purpose. The observations are scattered through each section, so if you prefer all the conclusions gathered at the end, please go easy on me.


