A story triggered by a Weibo trending topic

First, the story begins here
Step 1: Collect pictures
Step 2: Show pictures
Step 3: Promote the link
Step 4: Statistical analysis
- 1. Data processing
- 2. Data filtering
- 3. Count the frequency of each day
- 4. Count the frequency of each constellation
- 5. Count the frequency of each month
- 6. Data Visualization (3 bar charts)
Closing thoughts

First, the story begins here

On the night of March 29th, I was squatting on the toilet, scrolling through my phone, when I suddenly spotted a trending Weibo hashtag: #The Universe on the Day You Were Born#

Everyone in the comments had the same problem: the NASA website would not load (probably because heavy traffic was causing extremely high latency). As an upstanding young socialist, how could I let that slide?

So, I decided to do something!!

Step 1: Collect pictures

A simple idea occurred to me: since you can't download the pictures from the official website, I'll collect them and pass them on to you. (Collecting data? Just write a crawler.)

So I went straight to the NASA website to analyze its requests. The result… well, I was in the same boat as everyone else and couldn't load a single image.

How could I back down at the first difficulty? I went back to the Weibo comments to look for another way and, sure enough, found that a generous user on Douban had already collected all the pictures:

In the spirit of taking what is already there, I decided that this would be my data source (Douban photo album)

A quick look showed that the album is paged through an m_start parameter, with 18 images per page (m_start = 0 for the first page, m_start = 18 for the second), so a simple loop does the job:

import os
import re
import queue
import requests
import threading
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

# Request headers copied from the browser (not used by the code below)
headers = {
    'Host': 'www.douban.com',
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36',
    'Sec-Fetch-Dest': 'document',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-User': '?1',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Cookie': 'bid=rb_kUqiDS6k; douban-fav-remind=1; _pk_ses.100001.8cb4=*; ap_v=0,6.0; __utma=30149280.1787149566.1585488263.1585488263.1585488263.1; __utmc=30149280; __utmz=30149280.1585488263.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __yadk_uid=HNoH1YVIvD2c8HrQDWHRzyLciFJl1AVD; __gads=ID=a1f73d5d4aa31261:T=1585488663:S=ALNI_MafqKPZWHx0TGWTpKEm8TTvdC-eyQ; ct=y; _pk_id.100001.8cb4=722e0554d0127ce7.1585488261.1.1585488766.1585488261.; __utmb=30149280.10.6.1585488263'
}

# Driver initialization (headless Chrome)
chrome_options = Options()
chrome_options.add_argument('--headless')
driver = webdriver.Chrome(options=chrome_options)

# Download images: each worker thread drains the shared queue
def downimg():
    while True:
        try:
            img_name, url = img_queue.get_nowait()
        except queue.Empty: # queue drained, this worker is done
            break
        res = requests.get(url)
        with open('./img/%s.webp' % img_name, 'wb') as f:
            f.write(res.content)
        print(img_name)

# Album URL template
url_o = 'https://www.douban.com/photos/album/1872547715/?m_start=%d'

# Crawl the image links
img_queue = queue.Queue()
for i in range(0, 21):
    url = url_o % (18 * i)
    driver.get(url)
    es = driver.find_elements(By.CLASS_NAME, 'photo_wrap')
    for e in es:
        img_e = e.find_element(By.TAG_NAME, 'img')
        img_url = img_e.get_attribute('src')
        img_url = img_url.replace('photo/m/public', 'photo/l/public') # switch to the large image
        text_e = e.find_element(By.CLASS_NAME, 'pl')
        img_date = text_e.text
        img_queue.put((img_date, img_url))
    print('Page %d crawled' % (i + 1))
driver.close()

# Download the images with 5 threads
os.makedirs('./img', exist_ok=True) # make sure the output folder exists
thread_list = []
N_thread = 5
for i in range(N_thread):
    thread_list.append(threading.Thread(target=downimg))
for t in thread_list:
    t.start()
for t in thread_list:
    t.join()

The code is simple: WebDriver opens each page and collects the image addresses, then Requests downloads and saves the images across multiple threads. With that, the picture collection is basically done!
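For readers who want to see the moving parts in isolation, here is a minimal sketch of the same three ideas: paging with m_start, switching a thumbnail URL to the large version, and downloading with a pool of threads. It is not the code I ran. The helper names (page_urls, to_large, download_all) are made up for illustration, the step of 18 is taken from the crawl loop, and `fetch` is a stand-in for whatever actually retrieves the bytes.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

PER_PAGE = 18  # step used by the crawl loop in this post


def page_urls(template, pages, per_page=PER_PAGE):
    """Return one album URL per page by stepping the m_start parameter."""
    return [template % (per_page * i) for i in range(pages)]


def to_large(img_url):
    """Swap the medium-size path segment for the large one."""
    return img_url.replace('photo/m/public', 'photo/l/public')


def download_all(items, fetch, out_dir, workers=5):
    """Save every (name, url) pair using a pool of worker threads.

    `fetch` is any callable that maps a URL to bytes, for example
    lambda u: requests.get(u).content
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    def save(item):
        name, url = item
        (out / f'{name}.webp').write_bytes(fetch(url))
        return name

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps the input order in its results
        return list(pool.map(save, items))
```

Passing the fetch function in as an argument keeps the sketch independent of the network, so it can be tried out with a fake fetcher before pointing it at real URLs.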

Step 2: Show pictures

Now that we have the images, how do we get them to everyone? Send them out one private message at a time? Of course I didn't do that. I decided to write a small web page that anyone could visit. Not being a professional at this, I cobbled something together, and it came out roughly like this (The universe on your birthday):

Step 3: Promote the link

As for promotion, I don't understand it and don't dare claim otherwise. Naively, I decided to post on Weibo myself (probably thinking: such a convenient tool is bound to be popular, right…?):

Reality is cruel. As you have probably guessed: nobody cared, and the post sank like a stone ~

After a few twists and turns, with the help of a popular blogger under the topic, the page finally got some traffic:

Step 4: Statistical analysis

This traffic was far below what I had expected (the topic itself, after all, had been read hundreds of millions of times), but I still decided to run some simple statistics on yesterday's visits:

1. Data processing

I exported the raw page-visit data as a CSV file from a web analytics service, did some simple processing, and adjusted the format to make it easier to read.

2. Data filtering

The table contains data for other pages as well as the NASA page, so it needs to be filtered:

import re
import urllib.parse
import pandas as pd

# Load the data
data = pd.read_csv('./analyze/20200330-20200330.csv', encoding='utf-8')

# Filter the data (keep NASA page visits that carry a valid date)
data_NASA = []
for i in range(len(data)):
    url = urllib.parse.unquote(data['URL'][i])
    pv = data['PV'][i] # page views
    uv = data['UV'][i] # unique visitors
    #if url[-1] == '日' and 'NaN' not in url: # for NASA access page
    if 'date=' in url and 'NaN' not in url:
        try:
            # dates are assumed to appear in the URL as e.g. date=3月29日 (月 = month, 日 = day)
            data_NASA.append((re.findall(r'date=(\d*?月\d*?日)', url)[0], pv, uv))
        except IndexError: # no date found in this URL
            pass
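The filtering step boils down to "decode the URL, find the date parameter, check that it really is a date". A rough standalone sketch of that idea follows. The function name extract_date and the example.com URLs are invented for illustration, and it assumes the date travels as a query parameter in the form 3月29日, which may not match every URL in the real export.

```python
import re
from urllib.parse import parse_qs, quote, urlparse

# Month and day written the Chinese way, e.g. 3月29日
DATE_RE = re.compile(r'^(\d{1,2})月(\d{1,2})日$')


def extract_date(url):
    """Return (month, day) from a `date=` query parameter, or None."""
    query = parse_qs(urlparse(url).query)  # parse_qs also percent-decodes
    for value in query.get('date', []):
        match = DATE_RE.match(value)
        if match:
            return int(match.group(1)), int(match.group(2))
    return None


# Hypothetical URLs, only to show the behaviour
good = 'https://example.com/nasa?date=' + quote('3月29日')
bad = 'https://example.com/nasa?date=' + quote('NaN月NaN日')
print(extract_date(good))  # (3, 29)
print(extract_date(bad))   # None
```

Anchoring the pattern at both ends rejects the NaN entries without needing a separate check.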

3. Count the frequency of each day

# Count the frequency of each day
PV_map= {}
UV_map = {}
PV_total = 0
UV_total = 0
for d in data_NASA:
    if d[0] not in PV_map.keys():
        PV_map[d[0]] = 0
        UV_map[d[0]] = 0
    PV_map[d[0]] +=  d[1] # PV
    UV_map[d[0]] +=  d[2] # UV
    PV_total += d[1]
    UV_total += d[2]
for k in PV_map.keys(): # Calculate frequency
    PV_map[k] = PV_map[k]/PV_total*100
    UV_map[k] = UV_map[k]/UV_total*100
PVs= sorted(PV_map.items(),key=lambda x:x[1],reverse=True) # sort
UVs= sorted(UV_map.items(),key=lambda x:x[1],reverse=True) # sort
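The same "sum per key, then divide by the grand total" pattern can be written more compactly. The sketch below is one possible way to do it, with a made-up helper name (percentage_share) and made-up sample rows; it is not a replacement for the listing above, just the idea in a smaller form.

```python
from collections import defaultdict


def percentage_share(rows, value_index):
    """Sum rows[i][value_index] per key rows[i][0] and return the
    (key, percent) pairs sorted from largest to smallest share."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[0]] += row[value_index]
    grand_total = sum(totals.values())
    if grand_total == 0:  # nothing to divide by
        return []
    shares = {key: value / grand_total * 100 for key, value in totals.items()}
    return sorted(shares.items(), key=lambda kv: kv[1], reverse=True)


# Made-up rows in the same (date, PV, UV) layout as data_NASA
rows = [('3月29日', 3, 1), ('1月1日', 1, 1), ('3月29日', 4, 2)]
print(percentage_share(rows, 1))  # shares by PV
print(percentage_share(rows, 2))  # shares by UV
```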

4. Count the frequency of each constellation

# Work out the constellation for a given month and day
def get_xingzuo(month, date):
    # dates[m-1] is the day in month m on which the next constellation starts
    dates = (21, 20, 21, 21, 22, 22, 23, 24, 24, 24, 23, 22)
    constellations = ("Capricorn", "Aquarius", "Pisces", "Aries", "Taurus", "Gemini", "Cancer", "Leo", "Virgo", "Libra", "Scorpio", "Sagittarius", "Capricorn")
    if date < dates[month-1]:
        return constellations[month-1]
    else:
        return constellations[month]

# Count the frequency of each constellation
xingzuo = ("Capricorn", "Aquarius", "Pisces", "Aries", "Taurus", "Gemini", "Cancer", "Leo", "Virgo", "Libra", "Scorpio", "Sagittarius", "Capricorn")
xingzuo_map = {}
for x in xingzuo:
    xingzuo_map[x] = 0
xingzuo_total = 0
for d in data_NASA:
    # d[0] is assumed to look like 3月29日 (月 = month, 日 = day)
    month = int(re.findall(r'(\d*?)月(\d*?)日', d[0])[0][0])
    day = int(re.findall(r'(\d*?)月(\d*?)日', d[0])[0][1])
    x = get_xingzuo(month, day)
    #xingzuo_map[x] += d[1] # PV
    xingzuo_map[x] += d[2] # UV
    xingzuo_total += d[2]
for k in xingzuo_map.keys(): # convert counts to percentages
    xingzuo_map[k] = xingzuo_map[k]/xingzuo_total*100
xingzuos = sorted(xingzuo_map.items(), key=lambda x: x[1], reverse=True) # sort
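As a cross-check on the lookup, the same table can be expressed as a sorted list of start dates and searched with bisect. This is a different technique from the one in the listing, shown only as a sketch; the names STARTS, KEYS and sign_for are invented here, and the cut-off days are copied from the dates tuple rather than from any astronomical source.

```python
from bisect import bisect_right

# (month, day) on which each sign starts, using the cut-offs from the
# `dates` tuple; anything before 21 January falls back to Capricorn.
STARTS = [
    ((1, 21), 'Aquarius'), ((2, 20), 'Pisces'), ((3, 21), 'Aries'),
    ((4, 21), 'Taurus'), ((5, 22), 'Gemini'), ((6, 22), 'Cancer'),
    ((7, 23), 'Leo'), ((8, 24), 'Virgo'), ((9, 24), 'Libra'),
    ((10, 24), 'Scorpio'), ((11, 23), 'Sagittarius'), ((12, 22), 'Capricorn'),
]
KEYS = [start for start, _ in STARTS]


def sign_for(month, day):
    """Return the sign whose start date is the latest one not after (month, day)."""
    idx = bisect_right(KEYS, (month, day))
    # idx == 0 means "before the first start date", which wraps around
    # to the last entry (Capricorn) through the index -1.
    return STARTS[idx - 1][1]


print(sign_for(3, 29))   # Aries
print(sign_for(1, 5))    # Capricorn
print(sign_for(12, 25))  # Capricorn
```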

5. Count the frequency of each month

# Count the frequency of each month
month = [str(i)+'月' for i in range(1, 13)] # '1月' ... '12月' (月 = month)
month_map = {}
for m in month:
    month_map[m] = 0
month_total = 0
for d in data_NASA:
    m = d[0].split('月')[0] + '月' # e.g. '3月29日' -> '3月'
    #month_map[m] += d[1] # PV
    month_map[m] += d[2] # UV
    month_total += d[2]
for k in month_map.keys(): # convert counts to percentages
    month_map[k] = month_map[k]/month_total*100
months = sorted(month_map.items(), key=lambda x: x[1], reverse=True) # sort

6. Data Visualization (3 bar charts)

import matplotlib.pyplot as plt

## Birthday query TOP 10, by unique visitors (UV)
date = []
uv = []
for i in UVs:
    date.append(i[0])
    uv.append(i[1])
top10_date = date[:10]
top10_date.reverse() # barh draws bottom-up, so put the largest value last
top10_uv = uv[:10]
top10_uv.reverse()
fig, ax = plt.subplots() # create the figure
b = plt.barh(top10_date, top10_uv, color='#6699CC') # gold #FFFACD, silver #C0C0C0, orange #FFA500, blue #6699CC
i = len(b)
for rect in b: # recolor the top three and draw the value labels
    if i == 3: # 3rd place
        rect.set_facecolor('#FFA500') # orange
    if i == 2: # 2nd place
        rect.set_facecolor('#C0C0C0') # silver
    if i == 1: # 1st place
        rect.set_facecolor('#FFFACD') # gold
    w = rect.get_width()
    ax.text(w, rect.get_y() + rect.get_height() / 2, ' %.2f%%' % w, ha='left', va='center')
    i -= 1
plt.xticks([]) # hide the x-axis ticks
 
 
 
## Constellation query ranking
name = []
v = []
for i in xingzuos:
    name.append(i[0])
    v.append(i[1])
name.reverse() # barh draws bottom-up, so put the largest value last
v.reverse()
fig, ax = plt.subplots() # create the figure
b = plt.barh(name, v, color='#6699CC') # gold #FFFACD, silver #C0C0C0, orange #FFA500, blue #6699CC
i = len(b)
for rect in b: # recolor the top three and draw the value labels
    if i == 3: # 3rd place
        rect.set_facecolor('#FFA500') # orange
    if i == 2: # 2nd place
        rect.set_facecolor('#C0C0C0') # silver
    if i == 1: # 1st place
        rect.set_facecolor('#FFFACD') # gold
    w = rect.get_width()
    ax.text(w, rect.get_y() + rect.get_height() / 2, ' %.2f%%' % w, ha='left', va='center')
    i -= 1
plt.xticks([]) # hide the x-axis ticks
 
## Month query ranking
name = []
v = []
for i in months:
    name.append(i[0])
    v.append(i[1])
name.reverse() # barh draws bottom-up, so put the largest value last
v.reverse()
fig, ax = plt.subplots() # create the figure
b = plt.barh(name, v, color='#6699CC') # gold #FFFACD, silver #C0C0C0, orange #FFA500, blue #6699CC
i = len(b)
for rect in b: # recolor the top three and draw the value labels
    if i == 3: # 3rd place
        rect.set_facecolor('#FFA500') # orange
    if i == 2: # 2nd place
        rect.set_facecolor('#C0C0C0') # silver
    if i == 1: # 1st place
        rect.set_facecolor('#FFFACD') # gold
    w = rect.get_width()
    ax.text(w, rect.get_y() + rect.get_height() / 2, ' %.2f%%' % w, ha='left', va='center')
    i -= 1
plt.xticks([]) # hide the x-axis ticks
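The three charts share one rule: bars are drawn bottom-up, and the top three get gold, silver and orange. That rule could be pulled into a small helper so it is written only once. The sketch below covers just the data preparation and leaves the drawing out; ranking_bars is a made-up name, and the sample values are invented.

```python
BLUE, ORANGE, SILVER, GOLD = '#6699CC', '#FFA500', '#C0C0C0', '#FFFACD'


def ranking_bars(items, top_n=None):
    """Turn (label, value) pairs sorted high-to-low into barh inputs.

    Returns labels, values and colors ordered bottom-to-top, so the
    highest value ends up as the top bar and is colored gold.
    """
    items = list(items)[:top_n]  # top_n=None keeps everything
    medals = [GOLD, SILVER, ORANGE]
    colors = [medals[i] if i < 3 else BLUE for i in range(len(items))]
    labels = [label for label, _ in items][::-1]
    values = [value for _, value in items][::-1]
    return labels, values, colors[::-1]


# Invented sample data, already sorted from high to low
labels, values, colors = ranking_bars(
    [('a', 50.0), ('b', 30.0), ('c', 15.0), ('d', 5.0)])
print(labels)  # ['d', 'c', 'b', 'a']
print(colors)  # ['#6699CC', '#FFA500', '#C0C0C0', '#FFFACD']
```

With matplotlib, the result could then be drawn in one call such as plt.barh(labels, values, color=colors), since barh accepts a list of colors, one per bar.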

The final result looks something like this:

Closing thoughts

If I can, I too hope to create a little of that so-called “ultimate romance” amid the endless clatter of the keyboard ~

Finally, here are some pictures from the NASA event that I think are especially good:
