The starting point of everything: 10 lines of code to capture beauty

A prelude to the article

Target data source analysis

The target address to be captured this time is:

Capture target: the site's pictures, with a goal of 2,000 images. Python stack: the requests library plus the re module. Additional technique: regular expressions. Target website addresses:

  * www.netbian.com/mei/index.h…
  * www.netbian.com/mei/index_…
  * www.netbian.com/mei/index_…

Conclusion: the list-page URL follows the pattern http://www.netbian.com/mei/index_{page}.htm.

Data range

  1. 164 pages in total;
  2. 20 images per page (a quick check of the URL pattern follows this list).
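
As that quick check, here is a minimal sketch (my own illustration, not from the original article): page 1 is index.htm, and later pages use the index_{page}.htm suffix.

```python
# 164 list pages as stated above; the first page has no numeric suffix
urls = ["http://www.netbian.com/mei/index.htm"] + [
    f"http://www.netbian.com/mei/index_{i}.htm" for i in range(2, 165)
]
print(len(urls))  # 164 pages x ~20 images comfortably covers the 2,000 target
```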

The code of the label where the picture is located is as follows:

```html
<li><a href="/desk/23397.htm" title="..."><img src="..."></a></li>
```

The page address is /desk/23397.htm.

The requirements, organized, are as follows:

  1. Generate all list-page URLs;
  2. Traverse the list-page URLs and extract each picture's detail-page address;
  3. Enter the detail page to obtain the larger picture;
  4. Save the picture;
  5. After collecting 2,000 pictures, start enjoying them.

Code implementation time

Install the requests module in advance with `pip install requests`. If the download fails, switch to a domestic pip mirror.

Homework: how to set a global pip source (hint: the `pip config set global.index-url <mirror-url>` command).

The code structure is as follows:

```python
def main():
    pass

def format():
    pass

def save_image():
    pass

if __name__ == '__main__':
    main()
```

Before you start, you need to know a little about the front end and regular expressions. The target data sits between the two boundary strings `<div class="list">` and `<div class="page">`, so the first thing to do is cut up the string and extract the target data part.

Get the source code for the web page through Requests as follows.

```python
import requests

# fetching function
def main():
    url = "http://www.netbian.com/mei/index.htm"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"
    }
    res = requests.get(url=url, headers=headers, timeout=5)
    res.encoding = "GBK"
    print(res.text)
```

Use the requests module's get method to retrieve the web page data, passing in the request URL, the request headers, and a timeout.

For the User-Agent field in the request headers, you can either use the value I provided or grab one through your browser's developer tools.

After the request returns a Response object, set the data encoding with res.encoding = "GBK"; the correct value can be found in the web page source.
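
If you are unsure which encoding a site uses, here are two quick checks (a sketch of my own, omitting the headers for brevity; the regex simply looks for a charset declaration in the raw bytes):

```python
import re
import requests

res = requests.get("http://www.netbian.com/mei/index.htm", timeout=5)
# 1. read the declared charset out of the raw page bytes
m = re.search(rb'charset=["\']?([\w-]+)', res.content)
print(m.group(1) if m else "no charset declared")
# 2. or let requests guess from the content
print(res.apparent_encoding)
```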

With the data source requested, parsing begins. If you use regular expressions, it is recommended to first do some simple trimming of the target data.

Trimming strings is a fairly routine operation in Python and takes only a few lines of code.
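
As a minimal illustration of the find-and-slice idea (on a made-up string):

```python
text = '<b>hello</b>'
start = text.find('<b>') + len('<b>')  # index just past the opening marker
end = text.find('</b>')                # index of the closing marker
print(text[start:end])                 # hello
```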

I use the same two strings I mentioned above.

```python
def format(text):
    div_html = '<div class="list">'
    page_html = '<div class="page">'
    start = text.find(div_html) + len(div_html)
    end = text.find(page_html)
    origin_text = text[start:end]
```

The resulting origin_text is our target text.

The target text is parsed through the re module

The target text returned above is shown below. The goal of this section is to get the image detail page address.

The technique used is the re module, which of course must be paired with regular expressions. As for regular expressions, you can pick them up bit by bit by following along with Eraser.

```python
import re

def format(text):
    div_html = '<div class="list">'
    page_html = '<div class="page">'
    start = text.find(div_html) + len(div_html)
    end = text.find(page_html)
    origin_text = text[start:end]
    pattern = re.compile('href="(.*?)"')
    hrefs = pattern.findall(origin_text)
    print(hrefs)

The re.compile method is passed a regular expression, a syntactic structure for retrieving specific content from a string. For example:

  * `.` matches any single character other than newline characters (\n, \r);
  * `*` matches the preceding subexpression zero or more times;
  * `?` when it immediately follows another qualifier (*, +, ?, {n}, {n,}, {n,m}), makes the match non-greedy, i.e. it matches as little as possible;
  * `()` extracts a group.

With this knowledge, go back to the code and look at the implementation.



Suppose there is a string `href="/desk/23478.htm"`. Using `href="(.*?)"`, you can match the `/desk/23478.htm` part out of it; the parentheses also make subsequent extraction convenient.
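
A quick check of that behavior (the sample string is the one above):

```python
import re

sample = 'href="/desk/23478.htm"'
# the non-greedy (.*?) stops at the first closing quote
print(re.search('href="(.*?)"', sample).group(1))  # /desk/23478.htm
```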

The final output is shown in the figure below.

Clean crawl results

Some link addresses are incorrect and need to be removed from the list. This step uses a list comprehension to complete the task.

```python
pattern = re.compile('href="(.*?)"')
hrefs = pattern.findall(origin_text)
hrefs = [i for i in hrefs if i.find("desk") > 0]
print(hrefs)
```
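
Why the `> 0` test works: str.find returns the index of the first occurrence, or -1 when the substring is absent, so valid detail links pass and everything else is dropped. For example:

```python
print("/desk/23397.htm".find("desk"))  # 1  -> kept
print("index_2.htm".find("desk"))      # -1 -> dropped
```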

Capture inner page data

After obtaining the detail-page addresses from the list page, you can fetch the data inside the image pages. The technique used here is consistent with the previous logic.

```python
def format(text, headers):
    div_html = '<div class="list">'
    page_html = '<div class="page">'
    start = text.find(div_html) + len(div_html)
    end = text.find(page_html)
    origin_text = text[start:end]
    pattern = re.compile('href="(.*?)"')
    hrefs = pattern.findall(origin_text)
    hrefs = [i for i in hrefs if i.find("desk") > 0]
    for href in hrefs:
        url = f"http://www.netbian.com{href}"
        res = requests.get(url=url, headers=headers, timeout=5)
        res.encoding = "GBK"
        format_detail(res.text)
        break
```

In this first round of coding, a break is added to stop the loop after one pass, and the format_detail function formats the inner-page data, again by slicing the string down to the target part.

Since each page contains only one target picture, re.search is used for the lookup, and the group method of the resulting match object is called to extract the data.
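
Mirroring the final code shown later in this article, here is a sketch of what format_detail looks like at this step (the boundary strings are taken from that final version):

```python
def format_detail(text):
    # trim down to the block that contains the large image
    div_html = '<div class="pic">'
    page_html = '<div class="picdown">'
    start = text.find(div_html) + len(div_html)
    end = text.find(page_html)
    origin_text = text[start:end]
    # one target picture per page, so re.search + group is enough
    image_src = re.compile('src="(.*?)"').search(origin_text).group(1)
    save_image(image_src)
```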

Duplicate code found, optimized later.

```python
import time

def save_image(image_src):
    res = requests.get(url=image_src, timeout=5)
    content = res.content
    # name the file with the current timestamp
    with open(f"{str(time.time())}.jpg", "wb") as f:
        f.write(content)
```

The first picture has been retrieved; it is posted to the blog for the record.

Optimize the code

Extract the code repetition logic and encapsulate it into a common function. The final code is as follows:

```python
import requests
import re
import time


def request_get(url, ret_type="text", timeout=5, encoding="GBK"):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"
    }
    res = requests.get(url=url, headers=headers, timeout=timeout)
    res.encoding = encoding
    if ret_type == "text":
        return res.text
    elif ret_type == "image":
        return res.content


# fetching function
def main():
    url = "http://www.netbian.com/mei/index.htm"
    text = request_get(url)
    format(text)


# parsing function
def format(text):
    origin_text = split_str(text, '<div class="list">', '<div class="page">')
    pattern = re.compile('href="(.*?)"')
    hrefs = pattern.findall(origin_text)
    hrefs = [i for i in hrefs if i.find("desk") > 0]
    for href in hrefs:
        url = f"http://www.netbian.com{href}"
        print(f"Crawling {url}")
        text = request_get(url)
        format_detail(text)


def split_str(text, s_html, e_html):
    # note: the start offset must use len(s_html), not len(e_html)
    start = text.find(s_html) + len(s_html)
    end = text.find(e_html)
    origin_text = text[start:end]
    return origin_text


def format_detail(text):
    origin_text = split_str(text, '<div class="pic">', '<div class="picdown">')
    pattern = re.compile('src="(.*?)"')
    image_src = pattern.search(origin_text).group(1)
    # save the image
    save_image(image_src)


def save_image(image_src):
    content = request_get(image_src, "image")
    with open(f"{str(time.time())}.jpg", "wb") as f:
        f.write(content)
    print("Image saved successfully")


if __name__ == '__main__':
    main()
```

Run the code to get the results shown below.

The target of 2000

Twenty pictures have been crawled so far; next comes the target of 2,000. At this beginning stage, we grab them in this simple way.


The main function needs to be modified in this step:

```python
def main():
    # pages 2 through 200 follow the index_{page}.htm pattern
    urls = [f"http://www.netbian.com/mei/index_{i}.htm" for i in range(2, 201)]
    url = "http://www.netbian.com/mei/index.htm"
    urls.insert(0, url)
    for url in urls:
        print("Crawling list page", url)
        text = request_get(url)
        format(text)
```
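
One caveat that is not in the original code: over a 2,000-image run, a single timeout would abort the whole crawl. A minimal hedged hardening is to wrap each fetch in a try/except:

```python
def main():
    urls = [f"http://www.netbian.com/mei/index_{i}.htm" for i in range(2, 201)]
    urls.insert(0, "http://www.netbian.com/mei/index.htm")
    for url in urls:
        try:
            text = request_get(url)
            format(text)
        except requests.RequestException as err:
            # skip list pages that time out or error; keep the crawl alive
            print("Failed, skipping", url, err)
```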

Copyright notice: this article was originally published by CSDN blogger "Dream Eraser" under the CC 4.0 BY-SA license. Please include the original source link and this notice when reposting. Original link: blog.csdn.net/hihell/arti…