Author: Ying Zhaokang

I used a crawler to grab every article on the "Tencent Cloud Technology Community" and took a look at what I got

Preface

With some free time over the weekend, I practiced my crawler skills by taking on the Tencent Cloud Technology Community. Ha, the classic Pikachu opener.

This time I used a Python crawler plus a (somewhat imperfect) word segmentation system to build a word cloud from all the articles in the Tencent Cloud Technology Community, to get a rough picture of what the community writes about 🙂

Main content

Programming approach

  1. Get the addresses of all articles
  2. Extract the content of a single article page
  3. Extract the content of every article and store the results in a MongoDB database
  4. Use a word segmentation system plus wordcloud to build the word cloud

**Note:** before storing each article address I attach a random index, and articles are later sampled at random for extraction, so the results are not biased toward any particular publication date (see the sketch below).
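
To make step 3 and the random-index note concrete, here is a minimal sketch. I'm assuming pymongo with a local MongoDB instance; the database and collection names (tencent_cloud, articles) and the sample_batch helper are my own illustration, not necessarily what the original script uses:

    import random
    from pymongo import MongoClient

    client = MongoClient('localhost', 27017)
    collection = client['tencent_cloud']['articles']

    def save_article(item):
        # item is one dict yielded by get_one_page_all(), already carrying
        # a random 'index' field in the range [0, 6500)
        collection.insert_one(dict(item))

    def sample_batch(size=100):
        # pick a random starting point and take `size` articles from there;
        # because 'index' is random, the sample is independent of publish date
        start = random.randrange(0, 6500)
        return list(collection.find({'index': {'$gte': start}})
                              .sort('index', 1)
                              .limit(size))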

Get all article information from the article list pages

Save format:

  • index: the random number index
  • title: the article title
  • address: the article URL
  • content: the article text
    def get_one_page_all(self, url):
        try:
            html = self.get_page_index(self.baseURL)
            # parse the list page with BeautifulSoup
            soup = BeautifulSoup(html, 'lxml')
            title = soup.select('.article-item > .title')
            address = soup.select('.article-item > .title > a[href]')
            for i in range(len(title)):
                # generate a random index for date-independent sampling
                random_num = random.randrange(0, 6500)
                content = self.parse_content('https://www.qcloud.com' + address[i].get('href').strip())
                yield {
                    'index': random_num,
                    'title': title[i].get_text().strip(),
                    'address': 'https://www.qcloud.com' + address[i].get('href').strip(),
                    'content': content
                }
        # skip this page if an index error is encountered
        except IndexError:
            pass
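
The method above relies on self.get_page_index(), which isn't shown in the post. A minimal sketch of what it might look like, assuming a plain requests GET with a generic User-Agent and no retries:

    import requests

    def get_page_index(self, url):
        # fetch a page and return its HTML text, or None on failure
        headers = {'User-Agent': 'Mozilla/5.0'}  # a generic UA; the real script may differ
        try:
            response = requests.get(url, headers=headers, timeout=10)
            if response.status_code == 200:
                response.encoding = 'utf-8'
                return response.text
            return None
        except requests.RequestException:
            return None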

Parse the article body
    def parse_content(self, url):
        html = self.get_page_index(url)
        soup = BeautifulSoup(html, 'lxml')
        # select the article body by its class and return the plain text
        content = soup.select('.J-article-detail')
        return content[0].get_text()


Results

Here I'll just show the final result. It is not very ideal, because the word segmentation system is not very good. Before segmenting, I used a regular expression to strip all non-Chinese characters from the content.
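
For reference, a filter along those lines might look like this (a sketch of the idea; the exact pattern in my script may differ):

    import re

    def keep_chinese(text):
        # drop everything that is not a CJK character (digits, Latin letters, punctuation)
        return ''.join(re.findall(r'[\u4e00-\u9fa5]', text))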

Since my personal computer isn't particularly powerful, I split the work into 20 batches, each built from 100 randomly selected articles.
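
Roughly, turning one batch of sampled articles into a word cloud could look like the sketch below. My own segmenter isn't shown in this post, so jieba stands in for it here, and the font path and output file name are placeholders:

    import jieba
    from wordcloud import WordCloud

    def build_wordcloud(articles, out_file='batch.png'):
        # articles: one batch sampled from MongoDB (see sample_batch above)
        text = ' '.join(keep_chinese(a['content']) for a in articles)
        words = ' '.join(jieba.cut(text))  # jieba as a stand-in for my own segmenter
        wc = WordCloud(font_path='simhei.ttf',  # a Chinese-capable font is required
                       width=800, height=600,
                       background_color='white').generate(words)
        wc.to_file(out_file)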

This is the word cloud generated from all the articles. The word segmentation and filtering are not great, so some numerals and personal names still slip through.

Conclusion

As you can see, most of the articles in the Tencent Cloud Technology Community are data-related.

Haha, not great yet; I'll improve the word filtering later.

Finally, a small plug: I hope you'll follow my public account:

Ikang_kj, hehe :)

Further reading

Using a proxy in Python 3 to automate a web crawler.

This article was published in the Tencent Cloud Technology Community with the author's authorization.

The original link: cloud.tencent.com/community/a…

A wealth of hands-on technical experience, all in the Tencent Cloud Community.