Learn Python

Life is short, learn Python!

If a site implements anti-crawling measures such as user-agent checks, cookie validation, or IP blocking, we can integrate Selenium into Scrapy to bypass them.

The zhaopin.com site is not a dynamic web page, but it does require a simulated login, so we integrate Selenium with Scrapy to fetch the data.

I. Requirement analysis

Open the target website and search for Web front-end development engineers.

This is the home page. Since my current location is Wuhan, the site automatically sets the city to Wuhan. After clicking search:

The login step is the obstacle we need Selenium to get through.

After manual login, the following screen is displayed:

Our goal is nine fields for every job posting:

  1. name: job title
  2. salary: pay
  3. adress: region (the misspelling is kept because the code uses it as a field name)
  4. experience: required work experience
  5. eduBack: education background
  6. company: company name
  7. companyType: company type
  8. scale: company scale
  9. info: job introduction

II. Scrapy project file configuration

Define the items: items.py

import scrapy

class ZlzpItem(scrapy.Item):
    name = scrapy.Field()
    salary = scrapy.Field()
    company = scrapy.Field()
    adress = scrapy.Field()
    experience = scrapy.Field()
    eduBack = scrapy.Field()
    companyType = scrapy.Field()
    scale = scrapy.Field()
    info = scrapy.Field()
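
A quick aside (not in the original files): a Scrapy Item behaves like a dict, which is exactly what the CSV pipeline later relies on when it calls dict(item). The field values here are made-up examples:

item = ZlzpItem(name='Web front-end engineer', salary='10-15K')  # kwargs fill declared fields
item['company'] = 'Example Co.'  # dict-style assignment also works
print(dict(item))  # {'name': ..., 'salary': ..., 'company': ...}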

Scrapy crawler definition: zl.py

# firstPageUrl is the URL of the first results page; base_url is the same URL
# with the page number left as a placeholder. Both are defined once here and
# are not repeated in the snippets below.
firstPageUrl = 'https://sou.zhaopin.com/?jl=736&kw=web%E5%89%8D%E7%AB%AF%E5%B7%A5%E7%A8%8B%E5%B8%88&p=1'
base_url = 'https://sou.zhaopin.com/?jl=736&kw=web%E5%89%8D%E7%AB%AF%E5%B7%A5%E7%A8%8B%E5%B8%88&p={}'
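
For illustration only, this is how base_url expands into per-page URLs via the p= query parameter:

for page in (1, 2, 3):
    print(base_url.format(page))
# https://sou.zhaopin.com/?jl=736&kw=web%E5%89%8D%E7%AB%AF%E5%B7%A5%E7%A8%8B%E5%B8%88&p=1
# ...&p=2, ...&p=3, and so on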

Here is the source code for zl.py:

1. Initialization Settings:

# -*- coding: utf-8 -*-
import scrapy
from zlzp.items import ZlzpItem

count = 1   # global counter combined with base_url to build the next page's URL

class ZlSpider(scrapy.Spider):
    name = 'zl'
    allowed_domains = ['zhaopin.com']
    start_urls = [firstPageUrl]

2. The parse function:

    def parse(self, response):
        global count
        count += 1  # each parsed page bumps count; count + base_url gives the next page's URL

        jobList = response.xpath('//div[@class="positionlist"]/div/a')

        for job in jobList:
            name = job.xpath("./div[1]/div[1]/span[1]/text()").extract_first()
            # ... salary, company, adress, experience, eduBack, companyType and
            # scale are extracted the same way (XPaths omitted in the original)
            info = job.xpath("./div[3]/div[1]//text()").extract_first()
            item = ZlzpItem(name=name, salary=salary, company=company, adress=adress,
                            experience=experience, eduBack=eduBack,
                            companyType=companyType, scale=scale, info=info)
            yield item
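
The XPaths for the middle fields are omitted in the original. As a purely hypothetical sketch of what the omitted lines would look like (the actual paths must be verified in the browser's developer tools, since the page layout may differ):

            # Hypothetical selectors -- verify against the live page.
            salary = job.xpath("./div[1]/div[1]/p/text()").extract_first(default='')
            eduBack = job.xpath("./div[1]/div[2]/span[3]/text()").extract_first(default='')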

3. Paging:

        next_url = base_url.format(count)
        if count == 34:
            # Stop condition: the results run out at page 34, so once count
            # reaches 34 we stop scheduling new requests.
            return None
        if next_url:
            yield scrapy.Request(next_url, callback=self.parse)

Define the downloader middleware: middlewares.py

import time

from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

class ZlzpDownloaderMiddleware:
    def __init__(self):
        self.driver = webdriver.Chrome()

    def process_request(self, request, spider):
        self.driver.get(request.url)
        time.sleep(3)  # give the page a moment to load
        # Explicit wait: we log in manually by scanning the QR code; block here
        # until the browser is back on the requested URL (i.e. firstPageUrl).
        WebDriverWait(self.driver, 1000).until(
            EC.url_contains(request.url)
        )
        time.sleep(6)  # the page needs a few seconds to render after login
        # Hand the rendered page back to the spider (zl.py) as the response.
        return HtmlResponse(url=self.driver.current_url, body=self.driver.page_source,
                            encoding="utf-8", request=request)
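
One practical gap in the original: the Chrome window is never closed. A minimal sketch, using Scrapy's standard signals API, that quits the driver when the spider finishes (additions to the middleware class above):

from scrapy import signals

class ZlzpDownloaderMiddleware:
    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        # run spider_closed when the crawl ends
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        return middleware

    def spider_closed(self, spider):
        self.driver.quit()  # close the browser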

Description:

  1. The core of integrating Selenium into Scrapy is to intercept the request in the downloader middleware and return an already-rendered response object, which becomes the response argument of the parse function in the spider file (here, zl.py). Without Selenium, the plain response cannot get past the site's anti-crawling measures.
  2. There is only a small amount of Selenium code in the process_request method here, because few dynamic operations are needed.
  3. Key point: the response object that is returned:

We cannot return None here. If process_request returned None, the request would continue through the middleware chain, the downloader would fetch the page, and that response would be sent to the spider (zl.py). But we already loaded the page with self.driver.get() above, so there is no need to download it again; we simply build a response object from the rendered page and return it to the spider directly.
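
For reference, this matches Scrapy's documented contract for process_request, sketched schematically below:

def process_request(self, request, spider):
    # return None     -> processing continues; the downloader fetches the page itself
    # return Response -> downloading is skipped; this response goes straight to the
    #                    spider's callback (our Selenium-rendered page)
    # return Request  -> the current request is replaced and rescheduled
    ...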

Define the item pipeline: pipelines.py

from itemadapter import ItemAdapter
import csv

class ZlzpPipeline:
    def __init__(self):
        self.f = open('zlJob.csv', 'w', encoding='utf-8', newline='')
        self.file_name = ['name', 'salary', 'company', 'adress', 'experience',
                          'eduBack', 'companyType', 'scale', 'info']
        self.writer = csv.DictWriter(self.f, fieldnames=self.file_name)
        self.writer.writeheader()

    def process_item(self, item, spider):
        self.writer.writerow(dict(item))  # write the values passed in by the spider
        return item  # return the item after writing

    def close_spider(self, spider):
        self.f.close()
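
As a sanity check, the first line the pipeline writes is the CSV header built by writeheader() from file_name:

name,salary,company,adress,experience,eduBack,companyType,scale,info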

settings.py configuration

BOT_NAME = 'zlzp'

SPIDER_MODULES = ['zlzp.spiders']
NEWSPIDER_MODULE = 'zlzp.spiders'

LOG_LEVEL = 'WARNING'
...
ROBOTSTXT_OBEY = False
...
DEFAULT_REQUEST_HEADERS = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
...
DOWNLOADER_MIDDLEWARES = {
    'zlzp.middlewares.ZlzpDownloaderMiddleware': 543,
}
...
ITEM_PIPELINES = {
    'zlzp.pipelines.ZlzpPipeline': 300,
}
...

The ... lines stand for commented-out or default settings, omitted here.

III. Running the program

Type at the command line (the name passed to scrapy crawl must match the spider's name attribute, 'zl'):

scrapy crawl zl

Pic1: the program stops at page 34, corresponding to count == 34.

Pic2: the resulting CSV file.

IV. Simple data analysis

View the data

import pandas as pd

df = pd.read_csv('./zlJob.csv')
df.head()
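
The label/count pairs fed to the charts below (typesX and number here, and later educationTypes and companyTypes) are not derived in the original. One plausible way to obtain them, assuming the column names the pipeline wrote, is pandas' value_counts:

# Assumed derivation of the chart inputs from the crawled CSV.
salary_counts = df['salary'].value_counts()
typesX = salary_counts.index.tolist()   # category labels
number = salary_counts.values.tolist()  # counts per category
# The same pattern on df['eduBack'] / df['companyType'] yields
# educationTypes / companyTypes for the later charts.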

Salary pie chart presentation

from pyecharts import options as opts
from pyecharts.charts import Pie

c = (
    Pie(init_opts=opts.InitOpts(bg_color="white"))
    .add("", [list(z) for z in zip(typesX, number)])  # zip labels and counts into [label, count] pairs
    .set_global_opts(title_opts=opts.TitleOpts(title="Type"))  # title
    .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}"))  # data label settings
)

c.render_notebook()

Experience requirements bar chart display

from pyecharts.charts import Bar

bar = Bar()
bar.add_xaxis(['3-5 years', '1-3 years', 'No requirement', '5-10 years',
               'No experience', 'Under 1 year', 'Over 10 years'])
bar.add_yaxis('Experience requirements', [462, 329, 83, 78, 19, 15, 4])
bar.render()

Education requirements pie chart

c = (
    Pie(init_opts=opts.InitOpts(bg_color="white"))
    .add("", [list(z) for z in zip(educationTypes, number)])  # [label, count] pairs
    .set_global_opts(title_opts=opts.TitleOpts(title="Type"))  # title
    .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}"))  # data label settings
)

c.render_notebook()

Most postings require a bachelor's degree, or a college degree and above.

Company type pie chart

c = (
    Pie(init_opts=opts.InitOpts(bg_color="white"))
    .add("", [list(z) for z in zip(companyTypes, number)])  # [label, count] pairs
    .set_global_opts(title_opts=opts.TitleOpts(title="Type"))  # title
    .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}"))  # data label settings
)

c.render_notebook()

You can see that most of the companies are private or listed.

V. Summary

Page turning: since Selenium is only used here to open pages and fetch their content, paging is normally handled in the spider file. When the href attribute of the next-page a tag does not hold the actual URL of the next page, we have to maintain a dynamic global variable (count here) and build the next URL ourselves from base_url.

The response object is the core of the Selenium-in-Scrapy integration: the downloader middleware intercepts the request and returns a rendered response object to the spider.

If you are interested in the code in this article, you can scan the QR code below 👇 and reply "zhaopin" in the background to get the code files.