
Efficient universal news text extractor

Research in the NLP field almost always requires crawling a huge amount of news data, and writing a separate set of extraction rules for every news site is clearly impractical, while third-party crawling services have to be paid for. What you really want is a toolkit that can efficiently extract a single news article and that works across different news portals.

GeneralNewsExtractor (GNE) is a general-purpose module for extracting the body of news pages. It takes the HTML source of a news page as input and outputs the body text, title, author, publication time, the image URLs found in the body, and the source code of the tag that contains the body. GNE works extremely well on hundreds of Chinese news websites such as Toutiao, NetEase News, GamerSky, Guancha, Ifeng.com, Tencent News, ReadHub, and Sina News, with an accuracy close to 100%.


Installation method:

pip install --upgrade gne


It’s also very simple to use:

from gne import GeneralNewsExtractor
extractor = GeneralNewsExtractor()
html = 'Site source code'
result = extractor.extract(html)
print(result)

The project is named an extractor rather than a crawler to avoid unnecessary risk, so its input is HTML source code and its output is a dictionary; fetching the HTML of the target site is left to you, using whatever method is appropriate.
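As a small illustration (a minimal sketch; the placeholder HTML string would of course be replaced by the real source of a news page, and the field names follow the key list shown in Example 1 below), the individual fields of the returned dictionary can be read directly:

from gne import GeneralNewsExtractor

extractor = GeneralNewsExtractor()
html = 'Site source code'  # replace with the real HTML of a news page
result = extractor.extract(html)
# The dictionary contains keys such as "title", "author", "publish_time", "content", "images"
print(result['title'])
print(result['publish_time'])
print(result['content'][:100])  # first 100 characters of the body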

It is super easy to use and can extract the news from practically any website!
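For the occasional page where the default extraction misses something, the GNE README also documents optional hints that extract() can take. The parameters below are taken from that README but may differ between versions, so treat this as a hedged sketch and check the documentation of your installed version:

from gne import GeneralNewsExtractor

extractor = GeneralNewsExtractor()
html = 'Site source code'  # replace with the real HTML of a news page
# Optional hints documented in the GNE README (verify they exist in your version):
# an XPath for the title, the site host used to complete relative image URLs,
# a list of XPaths for noise nodes to strip, and a flag to also return the body's HTML.
result = extractor.extract(
    html,
    title_xpath='//h5/text()',
    host='https://www.example.com',
    noise_node_list=['//div[@class="comment-list"]'],
    with_body_html=True,
)
print(result)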


The author's own experiment, taking a People's Daily Online article as Example 1:

import pandas as pd
from gne import GeneralNewsExtractor
import requests

# Get the HTML
url = "http://edu.people.com.cn/n1/2021/0504/c1006-32094364.html"
rep = requests.get(url)
source = rep.content.decode("gbk", errors='ignore')
# Create the GNE extractor
extractor = GeneralNewsExtractor()
# Pass in the HTML and get the result
result = extractor.extract(source)
# The result is returned as a dictionary with keys "title", "author", "publish_time", "content", "images"
print(result)

The extraction result is shown in the screenshot below:

In addition, the author tried GNE on a number of popular news portal sites, and extraction succeeded on essentially all of them.

Example 2:

import pandas as pd
from gne import GeneralNewsExtractor
import requests
import chardet

class NewsExtract(object):

    def __init__(self):
        pass

    def extract(self, url):
        # Request the link
        rep = requests.get(url)
        # Detect the page encoding
        encoding = chardet.detect(rep.content)["encoding"]
        # Decode the source code of the web page
        source = rep.content.decode(encoding, errors='ignore')
        # Create the GNE extractor
        extractor = GeneralNewsExtractor()
        # Pass in the HTML and get the result
        result = extractor.extract(source)
        return result


if __name__ == '__main__':
    news_website = {
        "01 CCTV News": "https://news.cctv.com/2021/10/08/ARTIdfEmRcmTVug3xbAetmHw211008.shtml",
        "02 Sina News": "https://news.sina.com.cn/c/2021-10-08/doc-iktzqtyu0280551.shtml",
        "03 People's Daily Online": "http://opinion.people.com.cn/n1/2021/1008/c1003-32246575.html",
        "04 The Paper": "https://www.thepaper.cn/newsDetail_forward_14812232",
        "05 Xinhuanet": "http://www.news.cn/politics/leaders/2021-10/07/c_1127935225.htm",
        "06 Tencent News": "https://new.qq.com/omn/TWF20211/TWF2021100800473500.html",
        "08 Baidu News": "http://baijiahao.baidu.com/s?id=1713042599973678893",
        "09 Phoenix News": "https://news.ifeng.com/c/8AAfop0xwa6",
        "10 Sohu News": "https://www.sohu.com/a/493916920_114988?spm=smpc.news-home.top-news3.4.1633693367314ApmGjAj&_f=index_chan08news_8",
    }
    # Create the extractor
    ne = NewsExtract()
    # Brute-force extraction
    for idx, url in news_website.items():
        try:
            print(f"==={idx}:{url}===")
            info_news = ne.extract(url)
            print(info_news)
            print()
        except:
            print(f"{idx} crawl failed")

News content was successfully extracted from all of the sites listed above! ~

In this way, when crawling information from multiple news portals, GNE abstracts away the step of writing extraction rules for each news page; we only need to focus on collecting more, and more complete, URLs, so that these URLs form a news URL pool.
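As a rough illustration of that idea (a minimal sketch building on Example 2; the url_pool list, the extract_news helper, and the output file news_pool.csv are only made up for illustration):

import pandas as pd
import requests
import chardet
from gne import GeneralNewsExtractor

def extract_news(url):
    # Fetch the page, detect its encoding, and run GNE on the decoded HTML (same flow as Example 2)
    rep = requests.get(url)
    encoding = chardet.detect(rep.content)["encoding"]
    source = rep.content.decode(encoding, errors='ignore')
    return GeneralNewsExtractor().extract(source)

# A hypothetical URL pool collected elsewhere (these two URLs are just ones used above)
url_pool = [
    "http://edu.people.com.cn/n1/2021/0504/c1006-32094364.html",
    "https://news.sina.com.cn/c/2021-10-08/doc-iktzqtyu0280551.shtml",
]

records = []
for url in url_pool:
    try:
        info = extract_news(url)
        info["url"] = url
        records.append(info)
    except Exception as exc:
        print(f"{url} crawl failed: {exc}")

# Collect everything into a DataFrame and save it, e.g. for later NLP processing
df = pd.DataFrame(records)
df.to_csv("news_pool.csv", index=False, encoding="utf-8-sig")
print(df[["url", "title", "publish_time"]])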

⭐⭐⭐ Looking forward to more interesting coding and in-depth exchanges ~~~