The text and pictures in this article come from the internet and are for learning and exchange only, with no commercial purpose. Copyright belongs to the original author; if you have any concerns, please contact us.

The following article comes from Tencent Cloud. Author: Natural and unrestrained Kun


1. Requirement: Use scrapy.Selector or BeautifulSoup to implement the following (30 points)

(1) Read the content of the given dangdang.html page (5 points).
(2) Get the name, price, author, publisher, and picture URL of all books on the page (20 points).
(3) Save the information to a file (Excel, CSV, JSON, or TXT format) (5 points).
Download link for the dangdang.html file: pan.baidu.com/s/1awbG5zqO… Password: 3urs

1.1 BeautifulSoup solution
from bs4 import BeautifulSoup as bs
import pandas as pd

# Return the stripped text of the nth tag matching a CSS selector, or '' if absent
def cssFind(book, cssSelector, nth=1):
    if len(book.select(cssSelector)) >= nth:
        return book.select(cssSelector)[nth-1].text.strip()
    else:
        return ''

if __name__ == "__main__":
    with open("dangdang.html",encoding='gbk') as file:
        html = file.read()
    soup = bs(html,'lxml')
    book_list = soup.select("div ul.bigimg li")
    result_list = []
    for book in book_list:
        item = {}
        item['name'] = book.select("a.pic")[0]['title']
        item['now_price'] = cssFind(book, "span.search_now_price")
        item['pre_price'] = cssFind(book, "span.search_pre_price")
        item['author'] = book.select("p.search_book_author a")[0]['title']
        item['publisher'] = book.select("p.search_book_author span a")[-1].text
        item['detailUrl'] = book.select("p.name a")[0]['href']
        item['imageUrl'] = book.select("a.pic img")[0]['src']
        # Lazy-loaded images keep a placeholder in src; the real URL sits in data-original
        if item['imageUrl'] == "images/model/guan/url_none.png":
            item['imageUrl'] = book.select("a.pic img")[0]['data-original']
        result_list.append(item)

    df = pd.DataFrame(result_list,columns=result_list[0].keys())
    df.to_excel("Dangdang book information.xlsx")
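The solution above only covers the Excel output; the requirement also allows CSV, JSON, or TXT. A minimal sketch of the other formats, reusing the DataFrame df built above (the file names are just placeholders):

# Assuming df is the DataFrame built above; each accepted format is one call
df.to_csv("Dangdang book information.csv", index=False, encoding='utf-8-sig')
df.to_json("Dangdang book information.json", orient='records', force_ascii=False)
# pandas has no to_txt; to_csv with a tab separator yields a plain-text table
df.to_csv("Dangdang book information.txt", sep='\t', index=False, encoding='utf-8')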

1.2 scrapy.Selector solution

from scrapy.selector import Selector
import pandas as pd

if __name__ == "__main__":
    with open("dangdang.html",encoding='gbk') as file:
        response = Selector(text=file.read())
    book_list = response.xpath("//ul[@class='bigimg']/li")
    result_list = []
    for book in book_list:
        item = {}
        item['name'] = book.xpath("a[@class='pic']/@title").extract_first()
        item['now_price'] = book.xpath(".//span[@class='search_now_price']/text()").extract_first()
        item['pre_price'] = book.xpath(".//span[@class='search_pre_price']/text()").extract_first()
        item['author'] = book.xpath("p[@class='search_book_author']//a/@title").extract_first()
        # The publisher is the last link on the author line
        item['publisher'] = book.xpath("p[@class='search_book_author']//a/@title").extract()[-1]
        item['detailUrl'] = book.xpath(".//p[@class='name']/a/@href").extract_first()
        item['imageUrl'] = book.xpath("a[@class='pic']/img/@src").extract_first()
        # Lazy-loaded images keep a placeholder in src; the real URL sits in data-original
        if item['imageUrl'] == "images/model/guan/url_none.png":
            item['imageUrl'] = book.xpath("a[@class='pic']/img/@data-original").extract_first()
        result_list.append(item)

    df = pd.DataFrame(result_list,columns=result_list[0].keys())
    df.to_excel("Dangdang book information.xlsx")
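A side note on the Selector API: extract_first() and extract() as used above still work, but newer Scrapy versions offer the shorter get()/getall() aliases, which would read, for example:

# get()/getall() are the modern aliases for extract_first()/extract()
item['name'] = book.xpath("a[@class='pic']/@title").get()
authors = book.xpath("p[@class='search_book_author']//a/@title").getall()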

2. Requirement: Scrape the product information on the "super full reduction" (spend-and-save) promotion page of the Tmall Three Squirrels flagship store (55 points)

Website address: sanzhisongshu.tmall.com/p/rd523844…. The scoring criteria are as follows:
1. Create a function that fetches all the content of the page, with correct code (5 points).
2. …
3. Obtain the product name, price, and product-picture URL of each product on the page (20 points).
4. …
5. Write the results obtained in step 3 into the database (10 points).
6. Code style, with comments (5 points).

import requests
from bs4 import BeautifulSoup as bs
import urllib.request
import os
import pymysql

# Get a BeautifulSoup instance for a URL
def getSoup(url, encoding="gbk", **params):
    response = requests.get(url, **params)
    response.encoding = encoding
    soup = bs(response.text, 'lxml')
    return soup

# Download a single image function
def downloadImage(imgUrl, imgName):
    imgDir = "photo"
    if not os.path.isdir(imgDir):
        os.mkdir(imgDir)
    imgPath = "%s/%s" % (imgDir, imgName)
    urllib.request.urlretrieve(imgUrl, imgPath)

# Download all images on the page
def downloadAllImages(soup):
    image_list = soup.select("img")
    count = 0
    for image in image_list:
        try:
            srcStr = image['data-ks-lazyload']
            imgFormat = srcStr[-3:]
            if imgFormat == 'gif':
                continue
            count += 1
            imgName = "%d.%s" % (count, imgFormat)
            imgUrl = "http:" + srcStr
            downloadImage(imgUrl, imgName)
        except Exception as e:
            print(str(e))

# Return the stripped text of the nth tag matching a CSS selector, or '' if absent
def cssFind(movie, cssSelector, nth=1):
    if len(movie.select(cssSelector)) >= nth:
        return movie.select(cssSelector)[nth-1].text.strip()
    else:
        return ''

# Get a database connection
def getConn(database="product"):
    args = dict(
        host = 'localhost',
        user = 'root',
        passwd = '... your password',
        charset = 'utf8',
        db = database
    )
    return pymysql.connect(**args)

if __name__ == "__main__":
    soup = getSoup("https://sanzhisongshu.tmall.com/p/rd523844.htm" \
                   "?spm=a1z10.1-b-s.5001-14855767631.8.19ad32fdW6UhfO&scene=taobao_shop")
    # Download all images
    downloadAllImages(soup)
    # Get a database connection
    conn = getConn()
    cursor = conn.cursor()
    # Create the table productinfo in the database
    sql_list = []
    sql_list.append("drop table if exists productinfo")
    sql_list.append("create table productinfo(name varchar(200)," \
                    "price varchar(20),imageUrl varchar(500))")
    for sql in sql_list:
        cursor.execute(sql)
        conn.commit()
    # Get the product information and insert it into the database
    item_list = soup.select("div.item4line1 dl.item")
    for item in item_list:
        name = cssFind(item,"dd.detail a")
        price = cssFind(item,"dd.detail span.c-price")
        imageUrl = item.select("dt img")[0]['data-ks-lazyload']
        insert_sql = 'insert into productinfo values("%s","%s","%s")' % (name, price, imageUrl)
        cursor.execute(insert_sql)
        conn.commit()
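One caveat about the insert above: building SQL with % string formatting breaks on values that contain quotes and leaves the code open to SQL injection. A safer sketch of the same insert, using pymysql's built-in parameter binding:

# Parameterized insert: pymysql escapes the values itself
insert_sql = "insert into productinfo (name, price, imageUrl) values (%s, %s, %s)"
cursor.execute(insert_sql, (name, price, imageUrl))
conn.commit()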

3. Describe the schematic diagram of Scrapy's operation as accurately as you understand it (15 points)

[Figure: Scrapy framework schematic]

In practice, the code files of a Scrapy project are generally written in the following order: 1. write the items.py file; 2. write the spider files; 3. write the pipelines.py file; 4. configure the settings.py file.

The operation flow in the schematic is:
1. The Spiders send Requests to the Scheduler.
2. The Scheduler sends the Requests for the pages to be downloaded to the Downloader.
3. The Downloader fetches each web page and returns the corresponding Response to the Spiders.
4. The Spiders parse the Response to form Items.
5. The Items are sent to the Pipeline, which processes the data accordingly and persists it.
6. The Middlewares come in three kinds: the scheduling middleware (Scheduler Middlewares), the crawler middleware (Spider Middlewares), and the download middleware (Downloader Middlewares). When writing a scrapy-redis distributed crawler, Redis takes the role of the scheduling middleware. To disguise the crawler, the User-Agent and proxy IP are set in the middlewares: the Spider Middlewares and, correspondingly, the Downloader Middlewares; a sketch follows below.
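As an illustration of that last point, here is a minimal sketch of a custom downloader middleware that sets the User-Agent and a proxy IP. The class name, the agent strings, and the proxy address are made up for the example; the middleware is enabled through DOWNLOADER_MIDDLEWARES in settings.py:

# middlewares.py -- a minimal disguise middleware (illustrative sketch)
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]

class DisguiseMiddleware:
    def process_request(self, request, spider):
        # Rotate the User-Agent header on every outgoing request
        request.headers['User-Agent'] = random.choice(USER_AGENTS)
        # Route the request through a proxy (placeholder address)
        request.meta['proxy'] = "http://127.0.0.1:8888"
        return None  # let Scrapy continue normal downloading

# settings.py -- register the middleware:
# DOWNLOADER_MIDDLEWARES = {"myproject.middlewares.DisguiseMiddleware": 543}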