Moment For Technology

Python crawler + word frequency statistics: crawling hot news articles on QQ.com and counting word frequencies

Posted on Dec. 3, 2022, 8:56 a.m. by 崔淑華
Category: Back-end Tags: Back-end, Crawler

1. Target address

new.qq.com/ch/finance/

Let's take the finance channel as an example. Looking at the page source, we can see that the news items are arranged in an unordered list: each item is an li inside a ul. Once we have the right ul, we can walk through its li elements and parse each one. We parse the source with BeautifulSoup, and collecting every ul on the page takes a single line of code:

 uls=soup.find_all('ul')

2. Content to crawl on the home page (the content in the green box above)

1. Link address of details page (one line of code)

detail_url=l.a.attrs['href']# Details page link

2. The tag to which the news belongs (one line of code)

tags=l.find_all(attrs={'class':'tags'})  # News tags
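To see what these two lines pull out of a list item, here is a small sketch that runs on a hand-written HTML fragment. The fragment, its URLs, and the tag names are made up for illustration, and regular expressions stand in for BeautifulSoup so that the snippet needs only the standard library. The real page markup may differ.

```python
import re

# Hypothetical markup for one news item; the real page may differ.
SAMPLE_LI = (
    '<li><a href="https://example.com/news/1" target="_blank">Headline</a>'
    '<div class="tags">'
    '<a href="https://example.com/tag/a" target="_blank">stocks</a>'
    '<a href="https://example.com/tag/b" target="_blank">banks</a>'
    '</div></li>'
)

def first_link(li_html):
    """Return the href of the first <a> in the fragment, or None."""
    match = re.search(r'<a\s[^>]*href="([^"]*)"', li_html)
    return match.group(1) if match else None

def tag_names(li_html):
    """Return the link texts inside the element with class "tags"."""
    block = re.search(r'class="tags">(.*?)</div>', li_html, re.S)
    if block is None:
        return []
    return re.findall(r'target="_blank">(.*?)</a>', block.group(1))

print(first_link(SAMPLE_LI))
print(",".join(tag_names(SAMPLE_LI)))
```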

Here are the steps:

2.1 First, define a function that fetches the page source
import random

import requests

def getHTMLText(url):
    """Fetch a web page and return its HTML text ("" on failure)."""
    # Pool of browser User-Agent strings; one is picked at random per request
    user_agent = [
        "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
        "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0",
        "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; InfoPath.3; rv:11.0) like Gecko",
    ]

    headers = {'User-Agent': random.choice(user_agent)}

    try:
        # Send the randomized User-Agent with the request and avoid hanging forever
        r = requests.get(url, headers=headers, timeout=10)
        r.raise_for_status()
        r.encoding = 'utf-8'
        return r.text
    except requests.RequestException:
        return ""
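One detail worth checking in a fetch function like this is that the randomly chosen User-Agent is actually attached to the request. The sketch below builds a request object without sending it. It uses the standard library's urllib.request instead of requests so that it runs offline, and the helper name build_request and the shortened User-Agent strings are assumptions for illustration.

```python
import random
import urllib.request

# Shortened, illustrative User-Agent strings (not the full ones from the article).
USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) Safari/534.50",
    "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0",
]

def build_request(url):
    """Build (but do not send) a GET request with a random User-Agent."""
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return urllib.request.Request(url, headers=headers)

req = build_request("https://new.qq.com/ch/finance/")
# urllib stores header names capitalized, so the key is "User-agent"
print(req.get_header("User-agent") in USER_AGENTS)
```

With requests, the equivalent step is passing headers=headers to requests.get.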

Next we need all the ul lists on the page, which BeautifulSoup gives us in a single line:

uls=soup.find_all('ul')

find_all returns a list. The page contains several ul (unordered list) elements and we cannot tell in advance which one holds the news, so we have to inspect them ourselves. Here the news list is the second element, uls[1].
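Rather than counting by eye in the page source, one quick way to see which list is the news list is to count the li items under each ul. This is a minimal sketch using the standard library's html.parser on a made-up page; UlCounter and SAMPLE_HTML are illustrative names, not part of the crawler. It assumes the news list is the ul with the most items, which may not hold on every page.

```python
from html.parser import HTMLParser

class UlCounter(HTMLParser):
    """Count the <li> items inside each <ul>, in document order."""

    def __init__(self):
        super().__init__()
        self.counts = []  # counts[i] = number of <li> inside the i-th <ul>
        self._open = []   # stack of indexes of the <ul> elements currently open

    def handle_starttag(self, tag, attrs):
        if tag == "ul":
            self.counts.append(0)
            self._open.append(len(self.counts) - 1)
        elif tag == "li" and self._open:
            # Attribute the item to the innermost open list
            self.counts[self._open[-1]] += 1

    def handle_endtag(self, tag):
        if tag == "ul" and self._open:
            self._open.pop()

# Hypothetical page: a short navigation list followed by a longer news list.
SAMPLE_HTML = """
<ul><li>Home</li><li>Finance</li></ul>
<ul><li>News A</li><li>News B</li><li>News C</li></ul>
"""

counter = UlCounter()
counter.feed(SAMPLE_HTML)
best = counter.counts.index(max(counter.counts))
print(counter.counts, best)
```

With BeautifulSoup, a similar check is to compare len(u.find_all('li')) for each u in uls.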

2.2 Parse the details-page links and tags on the home page

import re

from bs4 import BeautifulSoup

def parase_index(url):
    """Parse a channel home page and write one CSV row per news item."""
    html=getHTMLText(url)
    soup = BeautifulSoup(html, "lxml")
    uls=soup.find_all('ul')

    news_type=""  # News category
    if "finance" in url:
        news_type="Finance and economics"
    elif "ent" in url:
        news_type="Entertainment"
    elif "milite" in url:
        news_type="Military"
    elif "tech" in url:
        news_type="Science and technology"
    elif "world" in url:
        news_type="International"
    print("{0} {1} {0}".format("=" * 27, news_type))

    for l in uls[1].find_all('li'):
        try:
            detail_url=l.a.attrs['href']  # Details page link
            title,content=getContent(detail_url)  # Title and body of the details page
        except Exception:
            continue

        print(title)
        tags=l.find_all(attrs={'class':'tags'})  # News tags
        # Extract the tag text; the pattern assumes each tag is an
        # <a target="_blank">text</a> link inside the tags element
        tags=re.findall('target="_blank">(.*?)</a>',str(tags[0])) if tags else []
        tags=",".join(tags)

        # writer is the module-level csv.writer created in the main program
        writer.writerow((news_type,tags,title,content))
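The if/elif chain in parase_index matches substrings anywhere in the URL, so a short key such as "ent" could also match an unrelated address. One possible alternative, sketched here, looks up the last path segment in a dictionary. news_type_for and NEWS_TYPES are hypothetical names, and the sketch assumes channel URLs keep the /ch/<name>/ shape used in this article.

```python
from urllib.parse import urlparse

# Channel name in the URL path -> category label used in the CSV
NEWS_TYPES = {
    "finance": "Finance and economics",
    "ent": "Entertainment",
    "milite": "Military",
    "tech": "Science and technology",
    "world": "International",
}

def news_type_for(url):
    """Return the category label for a channel URL, or "" if unknown."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    slug = segments[-1] if segments else ""
    return NEWS_TYPES.get(slug, "")

print(news_type_for("https://new.qq.com/ch/ent/"))
```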

3. Parsing details page

The details page is even simpler. We only need to parse out the title and the body text and save them.

def getContent(url):
    """Parse a news details page and return (title, body text)."""
    html = getHTMLText(url)
    soup = BeautifulSoup(html, "lxml")
    title=soup.h1.get_text()  # Get title
    article=soup.find_all(attrs={'class':'one-p'})  # Body paragraphs
    content=""
    for para in article:
        content+=para.get_text()
    return title,content

4. Data update

The crawler collects real-time hot news, so each run returns different content. This step merges each new crawl into the saved dataset and keeps only the rows that do not duplicate earlier data.

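import pandas as pd

def update(old,new):
    """Update the dataset: add the newly crawled rows to the old data, dropping duplicates."""
    # pd.concat is used because DataFrame.append was removed in pandas 2.0
    data=pd.concat([new,old],ignore_index=True)
    data=data.drop_duplicates()
    return data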

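To make the de-duplication step concrete, the following sketch does the same thing on plain tuples with no pandas dependency: new rows come first, old rows follow, and any row already seen is skipped. The function name merge_rows and the sample rows are made up for illustration; in the crawler itself the rows live in DataFrames and drop_duplicates does this work.

```python
def merge_rows(old_rows, new_rows):
    """Return new_rows followed by old_rows, keeping only the first copy of each row."""
    seen = set()
    merged = []
    for row in list(new_rows) + list(old_rows):
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            merged.append(key)
    return merged

old = [("Finance", "stocks", "Title A", "Body A")]
new = [("Finance", "stocks", "Title A", "Body A"),
       ("Tech", "chips", "Title B", "Body B")]
print(merge_rows(old, new))
```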

5. Word frequency statistics

I use a stop-word list found online, saved as stop_words.txt, which is the file the code loads. A quick search on the Internet will turn one up.

import jieba
import pandas as pd

def word_count(data):
    """Word frequency statistics"""
    txt=""
    for i in data:
        txt+=str(i)
    # Load stop word list
    with open("stop_words.txt",encoding="utf-8") as f:
        stopwords = {line.strip() for line in f}
    words  = jieba.lcut(txt)
    counts = {}
    for word in words:
        # Skip stop words and single-character words
        if word in stopwords or len(word) == 1:
            continue
        counts[word] = counts.get(word,0) + 1
    items = list(counts.items())
    # Sort by count, most frequent first
    items.sort(key=lambda x:x[1], reverse=True)
    return pd.DataFrame(items)
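The counting loop can also be written with collections.Counter from the standard library. The sketch below assumes the text has already been segmented (for example by jieba.lcut) and uses a tiny made-up token list and stop-word set; count_words is an illustrative name, not a function from the article.

```python
from collections import Counter

def count_words(tokens, stopwords):
    """Count tokens, skipping stop words and single-character tokens."""
    kept = [t for t in tokens if t not in stopwords and len(t) > 1]
    # most_common() returns (word, count) pairs, highest count first
    return Counter(kept).most_common()

# Tokens as a segmenter might return them (made-up sample).
tokens = ["股市", "的", "上涨", "股市", "涨", "今天", "股市"]
stopwords = {"的", "今天"}
print(count_words(tokens, stopwords))
```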

Main program and results

To crawl more than one category, add each channel's link to the list of links to crawl. The complete main program is as follows:

import csv
import os

import pandas as pd

# Links to crawl: finance, entertainment, military, technology, international
url_list=['https://new.qq.com/ch/finance/',
          'https://new.qq.com/ch/ent/',
          'https://new.qq.com/ch/milite/',
          'https://new.qq.com/ch/tech/',
          'https://new.qq.com/ch/world/'
         ]

# Name of the CSV file that holds the dataset
file_name="NewsData.csv"
file_exists=os.path.exists(file_name)
# On the first run there is no previous data, so start from an empty DataFrame
data_old=pd.read_csv(file_name,encoding='gbk') if file_exists else pd.DataFrame()
csvFile = open(file_name, 'a', newline='',encoding="gbk")
writer = csv.writer(csvFile)
if not file_exists:
    # Write the header row only when the file is first created
    writer.writerow(("News Classification","News Tag","News headlines","News content"))

for url in url_list:
    parase_index(url)

print("The crawl is done!")
csvFile.close()
print("=" * 21)
print("Start updating dataset")
data_new=pd.read_csv(file_name,encoding='gbk')
update(data_old,data_new).to_csv(file_name,index=None,encoding='gbk')
print("Update done!")
print("=" * 17)
print("Start word frequency statistics")
data=pd.read_csv(file_name,encoding="gbk")
res=word_count(data['News content'])
res.to_csv("frequence.txt",header=None,index=None)
print("That's it!")
print(res)
