This is the 8th day of my participation in the August More Text Challenge. For details, see: August More Text Challenge.

1. Enter the URL

Open https://quotes.toscrape.com/ and go to the homepage of the website. Observing the structure of the page, we find that its content is laid out very clearly.

It is mainly divided into three fields: the quote text, the author, and the tags, and these three fields are what we extract this time.


2. Identify requirements and analyze web page structure

Open the developer tools and click the Network tab to capture and analyze the network traffic. The website is requested with a plain GET and no parameters, so we can simulate the request with the get() method of the requests library. A headers dictionary is also needed to imitate a real browser, so that the web server does not detect the request as a crawler.
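As a minimal sketch of this request step (the user-agent string here is only an example of standard browser identification, not one taken from the original post):

```python
import requests

url = "https://quotes.toscrape.com/"

# headers that imitate a normal browser visit so the server does not
# treat the request as a bare crawler (example user-agent string)
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36"
}

# plain GET request with no query parameters
res = requests.get(url, headers=headers)
print(res.status_code)   # 200 means the request succeeded
html_text = res.text     # raw HTML used in the parsing step below
```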

You can also click the small arrow at the top left of the developer tools; it helps you quickly locate where a given piece of page data sits in the Elements tab.

3. Parse the structure of the web page and extract data.

After the request succeeds, we can start extracting the data. I use XPath for parsing, so the page is first parsed into an XPath-queryable tree; the leftmost arrow in the developer tools helps us quickly locate the data in the Elements tab. Since the quotes are laid out as a list on the page, we can locate the whole list first and then, with the HTML parser in lxml, grab one field at a time and save it into a list, which makes the next step of data cleaning easier.
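A minimal sketch of this parsing step; it selects each quote block directly, a slight variation on the full source at the end, with XPath expressions based on the same page structure:

```python
import requests
from lxml import etree

# fetch the raw HTML (same request as in the previous step)
html_text = requests.get("https://quotes.toscrape.com/").text

# parse the raw HTML into an XPath-queryable tree
html = etree.HTML(html_text)

# locate every quote block first, then grab one field at a time
for quote in html.xpath('//div[@class="col-md-8"]/div[@class="quote"]'):
    text = quote.xpath('./span[1]/text()')                             # the quote itself
    author = quote.xpath('./span[2]/small/text()')                     # its author
    tags = quote.xpath('./div[@class="tags"]/a[@class="tag"]/text()')  # its tags
    print(text, author, tags)
```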

4. Save the data to a CSV file.
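The original has no separate snippet for this step beyond the full source below, so here is a minimal sketch using the csv module from the standard library; rows is a hypothetical placeholder for records already extracted in the previous step:

```python
import csv

# rows is a hypothetical placeholder for the extracted records,
# e.g. one [quote, author, tags] entry per quote
rows = [["example quote", "example author", "tag1, tag2"]]

with open("quotes.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["quote", "author", "tags"])  # header row
    writer.writerows(rows)
```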


5. Source code

```python
import requests
from lxml import etree
import csv

url = "https://quotes.toscrape.com/"
# browser headers so the server does not flag the request as a crawler
# (the user-agent string in the original post was truncated; a standard one is used here)
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36'
}

# request the page and parse the HTML into an XPath-queryable tree
res = requests.get(url, headers=headers).text
html = etree.HTML(res)

# locate the column that holds the list of quotes
queto_list = html.xpath('//div[@class="col-md-8"]')
lists = []
for queto in queto_list:
    # the quote texts
    title = queto.xpath('./div[@class="quote"]/span[1]/text()')
    # the quote authors
    author = queto.xpath('./div[@class="quote"]/span[2]/small/text()')
    # the quote tags
    tags = queto.xpath('./div[@class="quote"]/div[@class="tags"]/a[@class="tag"]/text()')
    lists.append(title)
    lists.append(author)
    lists.append(tags)

# the output filename in the original post was garbled; adjust the path as needed
with open("./quotes.csv", 'w', encoding='utf-8', newline='\n') as f:
    writer = csv.writer(f)
    for i in lists:
        writer.writerow(i)
```

I am White-and-White i, a programmer who loves to share knowledge ❤️

If you have never touched programming and, after reading this blog, find something you don't understand or want to learn Python, feel free to leave a comment or message me directly. [Thank you very much for your likes, favorites, follows, and comments; your support is greatly appreciated.]