The first step is to collect the data. Among the many recruitment websites, we pick one (51job) and restrict the analysis to a single city (Guangzhou). Using the Scrapy crawler framework, we scrape the job postings and export them to a CSV file.

1. Analyze the page data structure of the recruitment website

1.1 Job list analysis

Searching the site for the data analysis position and selecting the Guangzhou region leads to a paginated job list. The list shows only 5 columns of data, which is not enough for the analysis dimensions that follow.

1.2 Job details analysis

The job details page carries much richer information, including the required education, work experience, skill requirements, the type of hiring company, its number of employees, the number of openings, the salary, and other dimensions.

1.3 Data acquisition approach

(1) Crawl the job list pages to collect the URL of each job's details page. (2) Request each details page through its URL and extract the detailed information.

2. Crawler code implementation

2.1 Project Creation

Execute **"scrapy startproject job_spider"** in the CMD console to generate the standard Scrapy project skeleton, then add the itemload.py and main.py files and a data directory to it. Create a new Python project in Eclipse and copy the corresponding files in.
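For reference, the layout after this step might look roughly as follows (the files generated by scrapy startproject depend on the Scrapy version; itemload.py, main.py and the data directory are added by hand, and the spider file name is only an example):

```
job_spider/
├── scrapy.cfg
├── main.py                  # debug entry point, added manually
├── data/                    # exported CSV files, added manually
└── job_spider/
    ├── __init__.py
    ├── items.py             # JobItem field definitions
    ├── itemload.py          # JobItemLoad loader, added manually
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── job.py           # JobSpider with name = "51job" (example file name)
```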

The main.py file is created so the project can be debugged in Eclipse, and the following code is added to it. This code is the entry point of the whole crawler project: it runs the spider and exports the results to a CSV file, where %(name)s is a placeholder for the spider name and %(time)s is the export timestamp.

from scrapy import cmdline

if __name__ == '__main__':
    cmdline.execute(argv=['scrapy', 'crawl', '51job', '-o', 'data/export_%(name)s_%(time)s.csv'])
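If you prefer not to go through main.py, the same export can be triggered straight from the console with the equivalent command `scrapy crawl 51job -o data/export_%(name)s_%(time)s.csv` (run from the project root; depending on the shell, the % characters may need escaping).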

2.2 Reading the job list data

 1 class JobSpider(scrapy.Spider):
 2    name = "51job"
 3    start_urls = [
 4        "https://search.51job.com/list/030200,000000,0000,00,9,99,%25E6%2595%25B0%25E6%258D%25AE%25E5%2588%2586%25E6%259E%2590,2,1.html?lang=c&stype=&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&providesalary=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare="
 5    ]
 6    def parse(self, response):
 7        job_list=response.xpath('//*[@id="resultList"]/div')
 8        for info in job_list:
 9            name=info.xpath('p/span/a/text()').extract_first()
10            url=info.xpath('p/span/a/@href').extract_first()
11            if name is not None:
12                yield scrapy.Request(url,callback=self.parse_detail)
13
14        next_url=response.css('#resultList > div.dw_page > div > div > div > ul > li:last-child>a::attr(href)').extract_first()
15        if next_url:
16            yield scrapy.Request(next_url,callback=self.parse)

Line 3: the URL of the Guangzhou data-analysis job list, used as the start address of the crawler.
Line 6: the parse method is invoked by the framework after a list page has been downloaded.
Line 7: select the blocks on the page that hold the job information.
Lines 8-10: loop over the job blocks and extract each job's title and details-page URL.
Line 12: request the details page and hand the response to parse_detail.
Line 14: get the link to the next page of the job list.
Line 16: continue parsing the next page of the job list.

2.3 Reading the job details page

 1    def parse_detail(self, response):
 2        job_title=response.xpath('/html/body/div[3]/div[2]/div[2]/div/div[1]/h1/text()').extract_first()
 3        work_info=response.xpath('/html/body/div[3]/div[2]/div[2]/div/div[1]/p[2]/text()').extract()
 4        salary=response.xpath('/html/body/div[3]/div[2]/div[2]/div/div[1]/strong/text()').extract_first()
 5        company_name=response.xpath('/html/body/div[3]/div[2]/div[4]/div[1]/div[1]/a/p/text()').extract_first()
 6        company_type=response.xpath('/html/body/div[3]/div[2]/div[4]/div[1]/div[2]/p[1]/text()').extract_first()
 7        company_num=response.xpath('/html/body/div[3]/div[2]/div[4]/div[1]/div[2]/p[2]/text()').extract_first()
 8        business=response.xpath('/html/body/div[3]/div[2]/div[4]/div[1]/div[2]/p[3]/text()').extract_first()
 9        job_desc=response.xpath('/html/body/div[3]/div[2]/div[3]/div[1]/div//text()').extract()
10        l=JobItemLoad(JobItem())
11        if len(work_info)>=5:
12            l.add_value('work_place', work_info[0])
13            l.add_value('work_experience', work_info[1])
14            l.add_value('education', work_info[2])
15            l.add_value('headcount', work_info[3])
16            l.add_value('publish_date', work_info[4])
17        elif len(work_info)==4:
18            l.add_value('work_place', work_info[0])
19            l.add_value('work_experience', work_info[1])
20            l.add_value('headcount', work_info[2])
21            l.add_value('publish_date', work_info[3])
22        l.add_value('job_title', job_title)
23        l.add_value('salary', salary)
24        l.add_value('company_name', company_name)
25        l.add_value('company_type', company_type)
26        l.add_value('company_num', company_num)
27        l.add_value('business', business)
28        l.add_value('job_desc', "".join(job_desc))
29        l.add_value("url", response.url)
30        yield l.load_item()

Lines 2-9: extract the raw values from the details page.
Line 10: create the ItemLoader object.
Lines 11-29: assign the extracted values to the item fields (the branch handles detail pages that expose only four pieces of basic information instead of five).
Line 30: return the populated item.

2.4 Cleaning the crawled data

The items.py file is used to clean the data. Because the crawled fields contain special characters, newlines, spaces and similar noise, the cleaning is done in this file.

1def normalize_space(value):
2    move = dict.fromkeys((ord(c) for c in u"\xa0\n\t\r"))
3    output = value.translate(move)
4    return output
5class JobItem(scrapy.Item):
6    job_title = scrapy.Field(input_processor=MapCompose(normalize_space))# Job title
7    work_place=scrapy.Field(input_processor=MapCompose(normalize_space))# Workplace

Line 2: define the characters that should be stripped out.
Line 6: declare the field and attach the processing method to it; MapCompose allows several processing functions to be chained.
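The snippet above is only an excerpt. For it to run, items.py also needs its imports, and the remaining fields referenced by the loader in parse_detail have to be declared too. A minimal sketch of what the full file might look like (field names are taken from the add_value calls above; the MapCompose import path assumes a recent Scrapy release):

```python
import scrapy
from itemloaders.processors import MapCompose  # older Scrapy: from scrapy.loader.processors import MapCompose


def normalize_space(value):
    # strip non-breaking spaces, newlines, tabs and carriage returns
    move = dict.fromkeys(ord(c) for c in u"\xa0\n\t\r")
    return value.translate(move)


class JobItem(scrapy.Item):
    job_title = scrapy.Field(input_processor=MapCompose(normalize_space))        # Job title
    work_place = scrapy.Field(input_processor=MapCompose(normalize_space))       # Workplace
    work_experience = scrapy.Field(input_processor=MapCompose(normalize_space))  # Work experience
    education = scrapy.Field(input_processor=MapCompose(normalize_space))        # Education
    headcount = scrapy.Field(input_processor=MapCompose(normalize_space))        # Number of openings
    publish_date = scrapy.Field(input_processor=MapCompose(normalize_space))     # Publish date
    salary = scrapy.Field(input_processor=MapCompose(normalize_space))           # Salary
    company_name = scrapy.Field(input_processor=MapCompose(normalize_space))     # Company name
    company_type = scrapy.Field(input_processor=MapCompose(normalize_space))     # Company type
    company_num = scrapy.Field(input_processor=MapCompose(normalize_space))      # Company size
    business = scrapy.Field(input_processor=MapCompose(normalize_space))         # Industry
    job_desc = scrapy.Field(input_processor=MapCompose(normalize_space))         # Job description
    url = scrapy.Field()                                                         # Source URL
```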

3. Pitfalls encountered

3.1 The exported file contains garbled characters

You need to add the following configuration in settings.py

FEED_EXPORT_ENCODING = "gb18030"
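The gb18030 encoding is presumably chosen so that the exported CSV opens cleanly in Excel on a Chinese-locale Windows system, which does not detect plain UTF-8 by default. If a Unicode file is preferred, BOM-prefixed UTF-8 should also keep Excel happy:

```python
# alternative: UTF-8 with a BOM, which Excel recognizes
FEED_EXPORT_ENCODING = "utf-8-sig"
```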

3.2 Field processing does not take effect

You need to add the following class attribute to the ItemLoader subclass in the itemload.py file

default_output_processor = TakeFirst()
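itemload.py itself is not shown here; a minimal sketch of what it might contain, assuming the JobItemLoad class name used in parse_detail, is:

```python
from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst  # older Scrapy: from scrapy.loader.processors import TakeFirst


class JobItemLoad(ItemLoader):
    # without this, every field is exported as a one-element list instead of a plain value
    default_output_processor = TakeFirst()
```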

Follow the public account and reply “51Job” to get the project code