I am participating in the Mid-Autumn Festival Creative Submission contest. For details, see: Mid-Autumn Festival Creative Submission Contest.

The Mid-Autumn Festival is almost here, and I wonder if you share my problem: every holiday, I rack my brains for ages over what kind of blessing will come across as sincere and creative, and what kind of Moments caption will sound cultured and classy.

If you search the Internet, the blessings you find mostly look like this:

The Buddha said: I can grant you a wish.

I said to the Buddha: I wish XXX health, youth, and happiness forever.

The Buddha said: Only four days.

I said: OK then, spring, summer, autumn, winter.

The Buddha said: No, only three days.

I said: OK, yesterday, today, tomorrow.

The Buddha said: No, only two days.

I said: Fine, day and night.

The Buddha said: No, only one day.

I said: OK.

The Buddha asked, puzzled: Which day?

I said: Every day.

Or maybe something like this, with a strong whiff of a bygone era:

Smooth sailing blesses you, the erhu pauses to bless you, a word or two carries our friendship, well-wishes gather from every direction, and all the colors of the rainbow belong to you, handsome. The Mid-Autumn Festival has arrived, so out goes the text message wishing customers ever-improving business, a harmonious and happy family, and good health.

So is there any Mid-Autumn Festival copy suited to young people: sincere, creative, cultured and classy, elegant, fresh, and free of clichés?

I decided to use a crawler to collect some and see how other people write theirs.


1. Determine the crawler target

I browsed the major websites for a long time and found very few sections dedicated to Mid-Autumn Festival blessing copy, and on the blessing-themed sites that do exist, the copy is still stuck in the SMS era and gives me goosebumps all over.

After looking around, I turned my attention back to Zhihu, where some of the Mid-Autumn Festival copywriting questions have plenty of high-quality answers.

In addition, I also found quite a few columns with excellent Mid-Autumn Festival wishes.

Therefore, our crawler's target is divided into two parts: one is to crawl the answers under Zhihu questions, the other is to crawl the contents of Zhihu columns.


2. Crawler

As mentioned above, our crawler has two targets: the answers under Zhihu questions, and the contents of Zhihu columns.

2.1 Crawling answers to questions

2.1.1 Web page analysis

Since the article "Python Web Crawler: Crawling 18934 Answers Under a Zhihu Question" already explains in detail how to crawl all the answer data under a Zhihu question, you can check it there if you want to know how to capture packets and find the data interface.

So I will skip the interface analysis here and go straight to the code!

2.1.2 Crawler source code

import requests
from bs4 import BeautifulSoup
import json
 
def fetchUrl(url):
    # Request the answers API and return the parsed JSON
    headers = {
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
    }

    r = requests.get(url, headers=headers)
    return r.json()
        
def parse_Json(jsonObj):
    # Walk through each answer, strip the HTML tags, and keep lines longer than 3 characters
    json_data = jsonObj['data']
    try:
        for item in json_data:
            qid = item['question']['id']
            content = item['content']
            bsObj = BeautifulSoup(content, "html.parser")
            lines = bsObj.find_all("p")
            for line in lines:
                sts = line.text.replace(" ", "")   # remove spaces
                if len(sts) > 3:
                    save_data(qid, sts + "\n")
                    print(sts)

            print("-" * 20)
    except Exception as e:
        print(e)
        
def save_data(qid, data):
    # Append each blessing line to a text file named after the question id
    filename = f'{qid}.txt'
    with open(filename, "a") as f:
        f.write(data)
        
def getMaxPage(qid):
    # paging.totals tells us the total number of answers under the question
    url = f"https://www.zhihu.com/api/v4/questions/{qid}/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cvip_info%2Cbadge%5B%2A%5D.topics%3Bdata%5B%2A%5D.settings.table_of_content.enabled&limit=5&offset=0&platform=desktop&sort_by=default"
    jsObj = fetchUrl(url)
    totals = jsObj['paging']['totals']

    print(totals)
    print('-' * 10)
    return totals


if __name__ == '__main__':
    # Set the id of the question to crawl here
    qid = 25252525
    maxPage = getMaxPage(qid)
    page = 0
    while page < maxPage:
        url = f"https://www.zhihu.com/api/v4/questions/{qid}/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cvip_info%2Cbadge%5B%2A%5D.topics%3Bdata%5B%2A%5D.settings.table_of_content.enabled&limit=5&offset={page}&platform=desktop&sort_by=default"
        print(url)
        jsonObj = fetchUrl(url)
        parse_Json(jsonObj)
        page += 5   # the API returns 5 answers per page (limit=5)
    print("Done!!")

Once you have set the ID of the question you want to crawl (the qid in the code), you can run the script.
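As a side note not covered in the original code: the question ID is simply the number in the question's URL. A tiny hypothetical helper like extract_qid below, assuming the usual https://www.zhihu.com/question/<id> URL format, can pull it out for you:

import re

def extract_qid(question_url):
    # Hypothetical helper: Zhihu question URLs look like
    # https://www.zhihu.com/question/25252525 (possibly followed by /answer/... or query parameters)
    match = re.search(r"/question/(\d+)", question_url)
    if not match:
        raise ValueError(f"Not a Zhihu question URL: {question_url}")
    return match.group(1)

print(extract_qid("https://www.zhihu.com/question/25252525"))   # -> 25252525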

2.1.3 Running results

2.2 Crawling columns

By contrast, column crawling is much easier.

2.2.1 Web page analysis

First, open the F12 developer tools and refresh the page. In the Preview tab, we can see that the text data is directly in the page's source code, with no extra API requests needed.

In other words, there is no need to capture packets; we can parse the page's HTML directly.

Switch the developer tools to the Elements tab and find the tag that contains the article body.

Click the arrow icon in the upper-left corner, then click the article text on the page; the developer tools will automatically jump to the corresponding source code.

From this we can see that the body content sits under a div tag with the class Post-RichText, and each paragraph of text is inside a p tag.

The main body content can be extracted in the following way.

from bs4 import BeautifulSoup

# html is the page source returned by requests
bsObj = BeautifulSoup(html, "html.parser")
richText = bsObj.find("div", attrs={"class": "Post-RichText"})
lines = richText.find_all("p")

2.2.2 Crawler source code

Code at last!

import requests
from bs4 import BeautifulSoup
import json
 
def fetchUrl(url):
    # Request the column page and return its HTML text
    headers = {
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
    }

    r = requests.get(url, headers=headers)
    return r.text
        
def parse_Html(html):
    # Find the article body (div.Post-RichText) and keep lines longer than 3 characters
    bsObj = BeautifulSoup(html, "html.parser")
    richText = bsObj.find("div", attrs={"class": "Post-RichText"})
    lines = richText.find_all("p")
    for line in lines:
        sts = line.text.replace(" ", "")   # remove spaces
        if len(sts) > 3:
            save_data("zhuanlan", sts + "\n")
            print(sts)

    # print(richText)

        
def save_data(qid, data):
    # Append each blessing line to a text file named after qid
    filename = f'{qid}.txt'
    with open(filename, "a") as f:
        f.write(data)

if __name__ == '__main__':
    # column ID list
    pidList = [407748025, 250353160, 259765710, 82196621, 82400827]
    for pid in pidList:
        url = f'https://zhuanlan.zhihu.com/p/{pid}'
        print(url)
        html = fetchUrl(url)
        parse_Html(html)
        print("-- -- -- -- -- -" * 20)
    
    print("Done!!")

2.2.3 Running results


3. Data analysis

I selected three questions and five columns related to Mid-Autumn Festival blessing copy, which yielded more than 1,300 lines in total.

3.1 Data preprocessing

As shown in the figure, there are many sentences unrelated to the blessings themselves, such as divider lines, opening remarks, and storytelling, so the data needs cleaning.

In short, the process is to manually delete the irrelevant sentences, then normalize the rest into a unified format with code, and finally remove duplicates, as sketched below.
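The article does not include the cleaning script itself, so here is a minimal sketch of what the automated part could look like. It assumes the crawled lines live in text files such as 25252525.txt and zhuanlan.txt (the file names produced by the crawlers above); the function name clean_blessings and the output file blessings_clean.txt are hypothetical, and the manual deletion of irrelevant sentences still happens by hand beforehand.

import glob

def clean_blessings(pattern="*.txt", output="blessings_clean.txt"):
    # Hypothetical cleaning step: unify format and deduplicate the crawled lines
    seen = set()
    cleaned = []
    for path in glob.glob(pattern):
        with open(path, encoding="utf-8") as f:
            for line in f:
                # unify format: strip whitespace and trailing decoration
                text = line.strip().strip("-—~ ")
                # drop empty lines and obvious separator lines
                if len(text) <= 3 or set(text) <= set("-—*~"):
                    continue
                # remove duplicates while preserving order
                if text not in seen:
                    seen.add(text)
                    cleaned.append(text)
    with open(output, "w", encoding="utf-8") as f:
        f.write("\n".join(cleaned))
    print(f"{len(cleaned)} blessings kept")

clean_blessings()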

After cleaning, 873 blessing messages remained.

3.2 Webpage Display

With all this data in hand, simply having it is not enough. What next? How should we present it?

I suddenly remembered an open-source project that was popular online for a while, called the Dog Lick Diary.

It does not seem to have any copyright issues, so I made some simple modifications on top of it and got my own "Mid-Autumn Festival Diary".

Demo website: api.smartcrane.tech/autumn/

If you are interested in the project's source code, or want to DIY your own version on top of it, you can contact me for the code.
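For readers who just want a feel for how such a page could serve the collected data, here is a minimal, hypothetical sketch using Flask. It is not the actual project (which is adapted from the Dog Lick Diary source); it assumes the cleaned file blessings_clean.txt from the sketch above and simply returns one random blessing per request.

import random
from flask import Flask

app = Flask(__name__)

# load the cleaned blessings once at startup (hypothetical file name)
with open("blessings_clean.txt", encoding="utf-8") as f:
    BLESSINGS = [line.strip() for line in f if line.strip()]

@app.route("/autumn")
def autumn():
    # return one random Mid-Autumn blessing per request
    return random.choice(BLESSINGS)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)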


Afterword

This time, we collected 873 sincere, creative, cultured, and classy Mid-Autumn Festival wishes from Zhihu, and then built a rather exquisite webpage to display them.

I believe friends and relatives will feel your sincerity when they receive such blessings.

But then again, however well the blessing is worded, nothing says sincerity like a red envelope.

I wish you all a happy Mid-Autumn Festival! Remember to eat mooncakes.


If anything in the article is unclear or explained incorrectly, please point it out in the comments, or scan the QR code below to add me on WeChat, so we can learn, exchange ideas, and improve together.