Get Trump’s most recent Twitter feed

Besides scraping Twitter with a crawler, a simpler and more stable option is the official API. However, Twitter has recently become quite strict, and applications for its open platform are hard to get approved. Fortunately, I already had a Twitter developer account.

  • Install the python-twitter library, a third-party client for accessing Twitter
pip install python-twitter
  • Dump Trump’s most recent Tweets
import twitter

proxies = {
    'http': ' ',
    'https': ' '
}

api = twitter.Api(consumer_key=' ',
                  consumer_secret=' ',
                  access_token_key=' ',
                  access_token_secret=' ', proxies=proxies)
                  
count = 200
max_id = None
fp = open('Trump-twitter.txt', encoding='utf-8', mode='w')
while True:
    statuses = api.GetUserTimeline(screen_name="realDonaldTrump", count=count, max_id=max_id)
    if len(statuses) < 1:
        break
    for s in statuses:
        print(s)
        fp.write(str(s) + "\n")
        max_id = s.id

    max_id = max_id - 1

fp.close()

Note that this API must be used in a network environment with access to Twitter, which is why a proxy is configured.

Currently, Twitter’s API only allows access to roughly a user’s most recent 3,200 tweets; the exact number varies slightly.

The derived data looks something like this:

Data preprocessing

  • pandas makes it very convenient to work with tables. Here the important information is processed into table fields, ready for statistical analysis
  • Focus on the following fields
Field name            Explanation
id                    Tweet ID, e.g. Twitter.com/i/web/statu…
created_at            Posting time
favorite_count        Number of likes
hashtags              Topic tags
retweet_count         Number of retweets
source                Posting source, e.g. iPhone
text                  Tweet content
media_type            Media type of the content, e.g. video
ret_user_name         Author of the retweeted original
ret_user_verified     Whether the retweeted author is verified
quoted_user_name      Author of the quoted tweet
quoted_user_verified  Whether the quoted-tweet author is verified
created_day           Posting date, for per-day statistics
created_day_hour      Posting date accurate to the hour, e.g. 2020-06-14 02
created_hour          Posting hour, for time-of-day statistics
  • Code snippet
Convert the posting time into the day / day-hour / hour formats above:
def get_created_info(row):
    created_at = row.created_at
    day, time_str = created_at.split(' ')
    hour = time_str.split(':')[0]
    day_hour = day + ' ' + hour
    return day, day_hour, hour

# Apply to produce the three new fields at once
df[['created_day', 'created_day_hour', 'created_hour']] = df.apply(get_created_info, axis=1, result_type='expand')
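The DataFrame df used above, with the remaining fields from the table, first has to be built from the dumped statuses. The original post doesn't show that step, so the following is a hypothetical sketch of how fields such as hashtags, media_type and the retweet/quote authors might be flattened from python-twitter Status objects, assuming the pages fetched in the dump loop were accumulated into a list (here given the assumed name all_statuses):

import pandas as pd

def extract_fields(s):
    # Flatten one twitter.Status into the table fields above
    ret = s.retweeted_status
    quoted = s.quoted_status
    return {
        'id': s.id,
        'created_at': s.created_at,
        'favorite_count': s.favorite_count,
        'hashtags': [h.text for h in (s.hashtags or [])],
        'retweet_count': s.retweet_count,
        'source': s.source,  # raw value is an HTML anchor; the client name still needs parsing out
        'text': s.text,
        'media_type': s.media[0].type if s.media else None,
        'ret_user_name': ret.user.screen_name if ret else None,
        'ret_user_verified': ret.user.verified if ret else None,
        'quoted_user_name': quoted.user.screen_name if quoted else None,
        'quoted_user_verified': quoted.user.verified if quoted else None,
    }

df = pd.DataFrame(extract_fields(s) for s in all_statuses)
df.to_excel('./Trump-twitter.xlsx')  # saved for the analysis section below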

Analysis of tweets

The data preprocessed in the previous step is loaded into a pandas DataFrame for all subsequent processing:

df = pd.read_excel('./Trump-twitter.xlsx')

Then we can dig into the analysis along many dimensions.

Daily post statistics

df[['id', 'created_day']].groupby(by=['created_day']).count().sort_values(by=['id'], ascending=False).head(20).plot.bar(figsize=(12, 4))

Plotting a bar chart of the 20 days with the most posts:

We can see that the most posts were made on 2020-06-05, with 153 posts. Let's take a closer look at the number of posts per hour on that day:

df1 = df[df.created_day == '2020-06-05']
df1[['id', 'created_hour']].groupby(by=['created_hour']).count().plot.bar(figsize=(12, 4))

Posting peaks at 11 and 12 o'clock on that day. Next, gather all of that day's tweets and draw a word cloud; the wordcloud library is quite handy for this, as sketched below.
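The word-cloud code isn't included in the original post; here is a minimal sketch using the wordcloud and matplotlib packages (the stop-word handling is illustrative):

import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

# Join all tweets from the busiest day into one blob of text
text = ' '.join(df1.text.astype(str))

wc = WordCloud(width=800, height=400, background_color='white', stopwords=STOPWORDS).generate(text)
plt.figure(figsize=(12, 6))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()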

Sentiment analysis of tweets

from textblob import TextBlob as tb

def get_sentiment(row):
    polarity = tb(row.text).polarity
    if polarity < 0:
        tag = 'negative'
    elif polarity < 0.3:
        tag = 'neutral'
    else:
        tag = 'positive'
    return tag, polarity

df[['text_polarity', 'polarity_prob']] = df.apply(get_sentiment, axis=1, result_type='expand')

The TextBlob library used here is a simple text-processing library that supports sentiment analysis, part-of-speech tagging, spelling correction and more; its polarity score ranges from -1 (most negative) to 1 (most positive).

As you can see, Trump's tweets are mostly neutral, though there are also quite a few negative ones; I won't expand on that here…

Time Period Statistics

df[['id', 'created_hour']].groupby(by=['created_hour']).count().plot.bar(figsize=(12, 4))

The chart shows that Trump posts around the clock, with the largest number of tweets around 12 o'clock. He really does govern by Twitter; the volume is staggering.

Analyze all the tweets during that time to create a word cloud

As the word cloud shows, "White House", "coronavirus", "Obama" and "Biden" are frequent mentions, and of course he never forgets "Great American"…

Let's also build a word cloud for the tweets that mention "FAKE NEWS":
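The filtering step isn't shown in the original; a minimal sketch, reusing the word-cloud pattern above, might be:

# Tweets that mention "FAKE NEWS", case-insensitively
fake_df = df[df.text.str.contains('fake news', case=False, na=False)]
text = ' '.join(fake_df.text.astype(str))
# ...then generate a WordCloud from `text` exactly as in the earlier snippet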

No explanation needed…

Retweet vs. original statistics

explode = (0, 0.1)
df[['id', 'tweet_status']].groupby(by=['tweet_status']).count() \
    .plot.pie(y='id', figsize=(5, 5), explode=explode, autopct='%1.1f%%', shadow=True, startangle=0, label='')

Both retweets and original tweets account for a sizable share.

Statistics of retweeted authors

rt_df = df[['id', 'ret_user_name']].groupby(by=['ret_user_name']).count().sort_values(by=['id'], ascending=False)
rt_df.head(20).plot.bar(figsize=(16, 4))

  • Mostly retweets of the White House, followed by retweets of himself

Posting source statistics

explode = (0, 0.1)
df[['id', 'source']].groupby(by=['source']).count() \
    .plot.pie(y='id', figsize=(5, 5), explode=explode, autopct='%1.1f%%', shadow=True, startangle=0, label='')

Mostly from the iPhone

Tweet topic tag (hashtag) statistics

  • It can be seen that Trump paid the most attention to the COVID-19 epidemic and the epidemic-related "Paycheck Protection Program"
  • Next comes MAGA (Make America Great Again); a sketch of the tag counting follows this list
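The counting code isn't in the original post; a minimal sketch, assuming the hashtags column holds a list of tag strings per tweet (as in the extraction sketch earlier):

# One row per tag, then count and plot the most frequent tags
tags = df['hashtags'].dropna().explode()
tags.value_counts().head(20).plot.bar(figsize=(12, 4))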

Content clustering of tweets

Trump's tweets were clustered into 10 groups. As the figure shows, the content of his tweets concentrates on a handful of themes; one possible clustering approach is sketched below.
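The clustering code isn't shown in the original. A common approach is TF-IDF features plus k-means, sketched here with scikit-learn as an assumption, not necessarily what the author actually used:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Vectorize the tweet text and cluster into 10 groups
X = TfidfVectorizer(max_features=5000, stop_words='english').fit_transform(df.text.astype(str))
km = KMeans(n_clusters=10, random_state=42)
df['cluster'] = km.fit_predict(X)
print(df['cluster'].value_counts())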

Do word2vec for tweets
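The model-training code isn't shown in the original; here is a minimal sketch using gensim (4.x API) with naive whitespace tokenization as an assumption:

from gensim.models import Word2Vec

# Naive tokenization; real preprocessing would strip URLs, mentions, punctuation, etc.
sentences = [str(t).split() for t in df.text.dropna()]
w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=5, workers=4)
model = w2v.wv  # KeyedVectors, which provides similar_by_word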

model.similar_by_word('Trump')
# output

[('coronavirus', 0.6097418069839478),
 ('great', 0.5778061151504517),
 ('realDonaldTrump', 0.554646909236908),
 ('Great', 0.5381245613098145),
 ('National', 0.49641942977905273),
 ('America', 0.47522449493408203),
 ('today', 0.4736398458480835),
 ('people', 0.469297856092453),
 ('Democrats', 0.45948123931884766),
 ('time', 0.4551768898963928)]

Likes histogram

df[df.favorite_count > 0][['id', 'favorite_count']].plot.hist(y='favorite_count', bins=50, figsize=(12, 4))

The number of likes is mainly around 5 million

Histogram of retweet counts

df[df.retweet_count > 0][['id', 'retweet_count']].plot.hist(y='retweet_count', bins=50, figsize=(12, 4))

Retweet counts are mostly around 10,000.

Conclusion

This article mainly presents objective statistics across many dimensions, without much interpretation. It is meant as a starting point: readers can use their imagination to explore more dimensions and deeper readings of the data.