When reprinting, please indicate the source: blog.csdn.net/forezp/arti… This article is from Fang Zhipeng's blog.

Today, on a whim, I decided to crawl my own blog with Python, aggregate the data, and generate a high-resolution word cloud (a visual display of word frequency) to see what I've been writing about lately.

1. A few word clouds from my blog data

1.1 Crawl the aggregation of article titles

1.2 Crawl the aggregation of abstracts of articles

1.3 Crawl the title + summary of the article

I recently wrote a series of Spring Cloud tutorials, plus a few on microservice architecture, and judging from the word cloud, it's a good match. If you don't believe me, check out my blog. It's pretty accurate.

2. Technology stack

  • Development tool: PyCharm
  • Crawler libraries: BeautifulSoup (bs4), Requests, Jieba
  • Visualization tool: WordArt

3. Crawler structure design

The entire crawler architecture is very simple:

  • Crawl my blog: blog.csdn.net/forezp
  • Extract the data (titles and abstracts).
  • Segment the extracted text with the jieba library.
  • Feed the segmented data into WordArt to make the word cloud.
  • Show the word cloud to the reader.
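The steps above can be sketched end to end. In this sketch each stage is stubbed out so the data flow (fetch → parse → segment) is visible without touching the network; the function names are mine, not from the article:

```python
# Shape of the crawler pipeline, with every stage stubbed for an offline demo.
def pipeline(fetch, parse, segment):
    html = fetch()                    # steps 1-2: download the blog page(s)
    texts = parse(html)               # extract titles/abstracts from the HTML
    return segment(' '.join(texts))   # step 3: word segmentation

# Stub stages standing in for Requests, BeautifulSoup, and jieba:
fake_fetch = lambda: '<a>Spring Cloud</a><a>Docker</a>'
fake_parse = lambda html: ['Spring Cloud', 'Docker']
fake_segment = lambda text: text.split()

print(pipeline(fake_fetch, fake_parse, fake_segment))
# → ['Spring', 'Cloud', 'Docker']
```

The real implementation below plugs Requests, BeautifulSoup, and jieba into these three slots.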

4. Concrete implementation

First, crawl the data according to the blog address:

import requests

url = 'http://blog.csdn.net/forezp'

titles = set()


def download(url):
    if url is None:
        return None
    try:
        response = requests.get(url, headers={
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) '
                          'AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/53.0.2785.143 Safari/537.36',
        })
        if response.status_code == 200:
            return response.content
        return None
    except requests.RequestException:
        return None

Parsing the titles:

import re
from bs4 import BeautifulSoup


def parse_title(html):
    if html is None:
        return None
    soup = BeautifulSoup(html, "html.parser")
    # Article links on the list page all point at /forezp/article/details/...
    links = soup.find_all('a', href=re.compile(r'/forezp/article/details'))
    for link in links:
        titles.add(link.get_text())
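The parser only handles one page at a time. A small driver can walk the paginated article list and collect titles from every page; note that the `/article/list/<n>` URL pattern below is my assumption about CSDN's layout at the time, not something stated in the article:

```python
# Hypothetical driver: crawl several list pages and collect titles.
# Relies on download() and parse_title() defined above.
def build_page_url(base, page):
    # Assumed CSDN pattern: page 1 is the blog root,
    # later pages live under /article/list/<n>.
    if page == 1:
        return base
    return base + '/article/list/' + str(page)


def crawl(base_url, pages=5):
    for page in range(1, pages + 1):
        html = download(build_page_url(base_url, page))
        if html is not None:
            parse_title(html)
```

Calling `crawl(url)` then fills the shared `titles` set across pages instead of just the first one.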

Parsing the abstracts:


def parse_description(html):
    if html is None:
        return None
    soup = BeautifulSoup(html, "html.parser")
    descriptions = soup.find_all('div', attrs={'class': 'article_description'})
    for description in descriptions:
        # The abstracts go into the same shared set as the titles.
        titles.add(description.get_text())

Word segmentation uses the jieba library ("jieba" literally means "stutter" in Chinese); for how it works, see: github.com/fxsjy/jieba… .

import jieba.analyse


def jiebaSet():
    strs = ''
    if len(titles) == 0:
        return
    for item in titles:
        strs = strs + item + ' '

    tags = jieba.analyse.extract_tags(strs, topK=100, withWeight=True)
    for item in tags:
        print(item[0] + '\t' + str(int(item[1] * 1000)))

Because the data set was small, I simply printed it to the console and copied it from there; storing it in MongoDB would be better.
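As an alternative to copy-pasting from the console, the tags can be dumped to a tab-separated file and pasted into WordArt's import box in one go. The `dump_tags` helper and the `tags.txt` file name are mine; the ×1000 integer scaling mirrors the print statement above, and `tags` has the same shape as the output of `jieba.analyse.extract_tags(..., withWeight=True)` (a list of `(word, weight)` tuples):

```python
# Write (word, weight) pairs as tab-separated lines for WordArt import.
def dump_tags(tags, path='tags.txt'):
    with open(path, 'w', encoding='utf-8') as f:
        for word, weight in tags:
            # Same scaling as the console print: weights become integers.
            f.write(word + '\t' + str(int(weight * 1000)) + '\n')

# Example with precomputed tags standing in for jieba's real output:
dump_tags([('SpringCloud', 0.42), ('微服务', 0.17)])
```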

Making the word cloud: use the WordArt online tool at wordart.com.

First, import the data copied from the console:

Embarrassingly, this site's built-in fonts do not support Chinese. You need to upload a font that supports Chinese, for example one from C:/Windows/Fonts; Mac users can copy that folder from a Windows machine or download a suitable font online.

Then click Visualize to generate a high-resolution word cloud. That's the whole walkthrough; if anything needs improving, please leave a comment.

Download the source code: github.com/forezp/Zhih…

5. References

Super simple: quickly make a stylish word cloud
