Author: Ying Zhaokang

I used a crawler to grab every article on the "Tencent Cloud Technology Community" and took a look at what I got

Preface

With some free time over the weekend, I practiced my crawler skills by taking on the Tencent Cloud Technology Community. Ha, the classic Pikachu opener.

This time I used a Python crawler plus a (somewhat imperfect) word segmentation system to build a word cloud from all the articles in the Tencent Cloud Technology Community, to get a rough picture of what the community writes about 🙂

Main content

Programming approach

  1. Get the addresses of all articles
  2. Extract the content of a single article page
  3. Extract the content of every article and store the results in a MongoDB database
  4. Use a word segmentation system plus wordcloud to build the word cloud

**Note:** before storing each article address I attach a random index, and articles are later sampled at random for extraction, so the results are not biased toward any particular publication date (see the sketch below).
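
To make step 3 and the random-index note concrete, here is a minimal sketch. I'm assuming pymongo with a local MongoDB instance; the database and collection names (tencent_cloud, articles) and the sample_batch helper are my own illustration, not necessarily what the original script uses:

    import random
    from pymongo import MongoClient

    client = MongoClient('localhost', 27017)
    collection = client['tencent_cloud']['articles']

    def save_article(item):
        # item is one dict yielded by get_one_page_all(), already carrying
        # a random 'index' field in the range [0, 6500)
        collection.insert_one(dict(item))

    def sample_batch(size=100):
        # pick a random starting point and take `size` articles from there;
        # because 'index' is random, the sample is independent of publish date
        start = random.randrange(0, 6500)
        return list(collection.find({'index': {'$gte': start}})
                              .sort('index', 1)
                              .limit(size))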

Get all article information from the article list pages

Save format:

  • index: the random number index
  • title: the article title
  • address: the article URL
  • content: the article text
    def get_one_page_all(self, url):
        try:
            html = self.get_page_index(self.baseURL)
            # parse the list page with BeautifulSoup
            soup = BeautifulSoup(html, 'lxml')
            title = soup.select('.article-item > .title')
            address = soup.select('.article-item > .title > a[href]')
            for i in range(len(title)):
                # generate a random index for date-independent sampling
                random_num = random.randrange(0, 6500)
                content = self.parse_content('https://www.qcloud.com' + address[i].get('href').strip())
                yield {
                    'index': random_num,
                    'title': title[i].get_text().strip(),
                    'address': 'https://www.qcloud.com' + address[i].get('href').strip(),
                    'content': content
                }
        # skip this page if an index error is encountered
        except IndexError:
            pass
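
The method above relies on self.get_page_index(), which isn't shown in the post. A minimal sketch of what it might look like, assuming a plain requests GET with a generic User-Agent and no retries:

    import requests

    def get_page_index(self, url):
        # fetch a page and return its HTML text, or None on failure
        headers = {'User-Agent': 'Mozilla/5.0'}  # a generic UA; the real script may differ
        try:
            response = requests.get(url, headers=headers, timeout=10)
            if response.status_code == 200:
                response.encoding = 'utf-8'
                return response.text
            return None
        except requests.RequestException:
            return None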

Parse the article body
    def parse_content(self, url):
        html = self.get_page_index(url)
        soup = BeautifulSoup(html, 'lxml')
        # select the article body by its class and return the plain text
        content = soup.select('.J-article-detail')
        return content[0].get_text()


Results

Here I'll just show the final result. It is not very ideal, because the word segmentation system is not very good. Before segmenting, I used a regular expression to strip all non-Chinese characters from the content.
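
For reference, a filter along those lines might look like this (a sketch of the idea; the exact pattern in my script may differ):

    import re

    def keep_chinese(text):
        # drop everything that is not a CJK character (digits, Latin letters, punctuation)
        return ''.join(re.findall(r'[\u4e00-\u9fa5]', text))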

Since my personal computer isn't particularly powerful, I split the work into 20 batches, each built from 100 randomly selected articles.
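
Roughly, turning one batch of sampled articles into a word cloud could look like the sketch below. My own segmenter isn't shown in this post, so jieba stands in for it here, and the font path and output file name are placeholders:

    import jieba
    from wordcloud import WordCloud

    def build_wordcloud(articles, out_file='batch.png'):
        # articles: one batch sampled from MongoDB (see sample_batch above)
        text = ' '.join(keep_chinese(a['content']) for a in articles)
        words = ' '.join(jieba.cut(text))  # jieba as a stand-in for my own segmenter
        wc = WordCloud(font_path='simhei.ttf',  # a Chinese-capable font is required
                       width=800, height=600,
                       background_color='white').generate(words)
        wc.to_file(out_file)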

This is the word cloud generated from all the articles. The word segmentation and filtering are not great, so some numerals and personal names still slip through.

Conclusion

As you can see, most of the articles in the Tencent Cloud Technology Community are data-related.

Haha, not great yet; I'll improve the word filtering later.

Finally, a small plug: I hope you'll follow my public account:

Ikang_kj, hehe :)

Further reading

Using a proxy in Python 3 to automate a web crawler.

This article was published in the Tencent Cloud Technology Community with the author's authorization.

The original link: cloud.tencent.com/community/a…

A wealth of hands-on technical experience, all in the Tencent Cloud Community.