This is the fourth day of my participation in the August Text Challenge.

Crawl target

Website: Zhihu hot list (https://www.zhihu.com/billboard)

Tools used

Development environment: Windows 10, Python 3.7

Development tools: PyCharm, Chrome

Toolkits: requests, lxml, re

Project idea analysis

First, install the requests library in PyCharm so that we can request the target URL. You can install it through PyCharm's package settings in the toolbar, as shown below:

The other packages can be downloaded with pip install.
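Once the packages are installed, the first step is simply requesting the hot-list page. A minimal sketch, using the same URL and User-Agent header as the full script at the end of this post:

```python
import requests

# Target page and request headers (same values as in the full script below)
url = 'https://www.zhihu.com/billboard'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) '
                  'Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
}

response = requests.get(url=url, headers=headers)
print(response.status_code)  # 200 means the page data came back
```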

Then right-click a blank area of the hot-list page and choose the option to view the page source; you can find the titles there. After obtaining the page data, extract the title data as shown below:
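A sketch of the title extraction, continuing from the `response` fetched above. It assumes the hot-list items still use the `HotList-item` and `HotList-itemBody` class names that appear in the full script; Zhihu may change its markup at any time.

```python
from lxml import etree

# Parse the HTML fetched above and select every hot-list entry
html_object = etree.HTML(response.text)
a_list = html_object.xpath('//a[@class="HotList-item"]')

for a in a_list:
    # The title is the text of the first div inside the item body
    title = a.xpath('.//div[@class="HotList-itemBody"]/div[1]/text()')[0]
    print(title)
```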

Once we have the page data, we also need XPath to extract each item's image address:
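Continuing from the `a_list` selected above, each item's cover image sits in an `img` tag inside the `HotList-itemImgContainer` div (again, class names taken from the full script):

```python
for a in a_list:
    # @src pulls the image address straight out of the img tag
    img_url = a.xpath('./div[@class="HotList-itemImgContainer"]/img/@src')[0]
    print(img_url)
```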

Next comes the detail-page address. Note that the detail address is not inside the a tag, so we need the re module to extract it from the page source:
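The detail addresses are embedded in an inline JSON blob in the page source, so a regular expression pulls them out of the raw text. The `"link":{"url": ...}` pattern below comes from the full script; the exact JSON layout is Zhihu's and may change:

```python
import re

# response.text is the raw HTML fetched earlier; the detail URLs appear as
# "link":{"url":"https:\u002F\u002F..."} inside the embedded JSON
new_url_list = re.findall('"link":{"url":"(.*?)"}', response.text)
print(new_url_list)
```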

The extracted detail URL still contains escaped characters, so it needs to be cleaned up with a replace to obtain the exact URL:
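Because the addresses come from a JSON string, the slashes are still escaped as `\u002F`, so a simple replace turns each one into a usable URL. A sketch with a made-up question id:

```python
# Example of the kind of string the regex returns (hypothetical question id)
raw_url = 'https:\\u002F\\u002Fwww.zhihu.com\\u002Fquestion\\u002F123456789'

# Swap the JSON-escaped slashes back into real ones
clean_url = raw_url.replace('\\u002F', '/')
print(clean_url)  # https://www.zhihu.com/question/123456789
```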

Full source code

```python
import re                 # regular expressions
import requests           # send network requests
from lxml import etree    # parse the HTML data

# Uniform Resource Locator of the target page
url = 'https://www.zhihu.com/billboard'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) '
                  'Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
}

# Request the page
response = requests.get(url=url, headers=headers)

# Extract the detail-page addresses with a regular expression
new_url_list = re.findall('"link":{"url":"(.*?)"}', response.text)
print(new_url_list)

# Parse the page and select every hot-list entry
html_object = etree.HTML(response.text)
a_list = html_object.xpath('//a[@class="HotList-item"]')
# print(a_list)

for a, new_url in zip(a_list, new_url_list):
    title = a.xpath('.//div[@class="HotList-itemBody"]/div[1]/text()')[0]
    url1 = new_url.replace('\\u002F', '/')   # turn the escaped slashes back into real ones
    img_url = a.xpath('./div[@class="HotList-itemImgContainer"]/img/@src')[0]
    f = open('zhihu hot list data.txt', 'a', encoding='utf-8')
    f.write('title: ' + title + '\n')
    f.write('address: ' + url1 + '\n')
    f.write('picture address: ' + img_url + '\n')
    f.write('\n')
    f.close()
```

I am White-and-White i, a programmer who loves to share knowledge ❤️

If there is anything about programming you are not familiar with, or if you want to learn Python, feel free to leave a comment on this blog.