Yesterday a friend told me he was looking for a new job, so I decided to help him check the hiring situation on Boss (Boss Zhipin). Last month I wrote an open-source crawler framework, Kcrawler, and recently added a Boss class that can quickly query job openings by keyword across different positions and industries. With a ready-made library, helping a friend is easy.

1. Install

Kcrawler is open source, so developers can clone the source if they wish. However, to save time, you can simply pip install it and import it into your project.

pip install kcrawler

2. Use

Two classes are provided:

  • Crawler: a generic crawler for any website
  • Boss: dedicated to Boss Zhipin (the focus of this article)

2.1 Crawler base class

Kcrawler provides a base class, Crawler, which encapsulates the basic functionality of a general-purpose crawler. You instantiate a crawler object for a site by passing in a configuration dictionary and then call the object's crawl method. HTML, JSON, and image crawling are supported. The following is an example configuration for Boss:

from kcrawler.core.Crawler import Crawler

def queryjobpage_url(x):
    # Build the paged search URL from (industry, city, page) ids.
    p = list()
    if x[0]:
        p.append('i' + str(x[0]))
    if x[1]:
        p.append('c' + str(x[1]))
    if x[2]:
        p.append('p' + str(x[2]))
    return 'https://www.zhipin.com/{}/'.format('-'.join(p))

config = {
    'targets': {
        'city': {
            'url': 'https://www.zhipin.com/wapi/zpCommon/data/city.json', 'method': 'get', 'type': 'json'
        },
        'position': {
            'url': 'https://www.zhipin.com/wapi/zpCommon/data/position.json', 'method': 'get', 'type': 'json'
        },
        'industry': {
            'url': 'https://www.zhipin.com/wapi/zpCommon/data/oldindustry.json', 'method': 'get', 'type': 'json'
        },
        'conditions': {
            'url': 'https://www.zhipin.com/wapi/zpgeek/recommend/conditions.json', 'method': 'get', 'type': 'json'
        },
        'job': {
            'url': 'https://www.zhipin.com/wapi/zpgeek/recommend/job/list.json', 'method': 'get', 'type': 'json'
        },
        'queryjob': {
            'url': 'https://www.zhipin.com/job_detail/', 'method': 'get', 'type': 'html'
        },
        'queryjobpage': {
            'url': queryjobpage_url,
            'method': 'get', 'type': 'html'
        },
        'jobcard': {
            'url': 'https://www.zhipin.com/wapi/zpgeek/view/job/card.json', 'method': 'get', 'type': 'json'
        }
    }
}

crawler = Crawler(config)
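To make the role of the functional URL concrete, here is the same builder on its own with a quick check of the URL it produces (the id values are only illustrative):

```python
def queryjobpage_url(x):
    # x is an (industry, city, page) tuple; falsy entries are skipped.
    p = []
    if x[0]:
        p.append('i' + str(x[0]))
    if x[1]:
        p.append('c' + str(x[1]))
    if x[2]:
        p.append('p' + str(x[2]))
    return 'https://www.zhipin.com/{}/'.format('-'.join(p))

# e.g. industry 100001, city 101280600 (Shenzhen), page 2:
print(queryjobpage_url((100001, 101280600, 2)))
# https://www.zhipin.com/i100001-c101280600-p2/
```

Because missing parts are simply skipped, the same function also handles queries that omit the industry or page.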

The creation of the crawler object is now complete. Readers with crawler experience may have noticed that no headers were provided. There are in fact two ways to pass headers. One is to add a headers key directly to the config dictionary, for example:

Copy the Request Headers in their entirety from the browser and assign them to a headers variable as a triple-quoted string:

headers = '''
:authority: www.zhipin.com
:method: GET
:path: /
...
cookie: xxx
...
user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/ Safari/537.36
'''

config = {'headers': headers, 'targets': {...}}

The second (recommended) method is to create a new file named headers in the current directory and overwrite it with the entire Request Headers string copied from the browser in the same way, directly, without wrapping it in a string variable. When the config dictionary does not provide a headers field, the Crawler automatically reads the headers string from the headers file.
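For illustration, here is a minimal sketch of how such a raw "name: value" header block can be parsed into a dict. This is an assumption about the idea, not Kcrawler's actual implementation:

```python
def parse_headers(raw):
    # Split a browser-copied "name: value" block into a dict,
    # skipping blank lines and HTTP/2 pseudo-headers like :authority.
    headers = {}
    for line in raw.strip().splitlines():
        line = line.strip()
        if not line or line.startswith(':'):
            continue
        name, _, value = line.partition(':')
        headers[name.strip()] = value.strip()
    return headers

raw = '''
:method: GET
cookie: xxx
user-agent: Mozilla/5.0
'''
print(parse_headers(raw))
# {'cookie': 'xxx', 'user-agent': 'Mozilla/5.0'}
```

Note that pseudo-headers such as :method are dropped, since the requests layer supplies those itself.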

Call crawler.crawl(target) to crawl the data, where target is one of the keys defined under targets in the config dictionary:

data = crawler.crawl('job')

2.2 Boss class

The data returned by Crawler is the site's raw data; although it has already been converted into a dictionary or list, you still need to extract the fields of interest yourself. Kcrawler therefore provides a Boss class that you can use directly.

Copy the Request Headers from your browser and save them to a headers file in the current directory, as described above. You can then happily use the Boss class: instantiate an object and crawl the data of interest.

from kcrawler import Boss

boss = Boss()

If the following output is displayed, the content of the headers file was read successfully:

read heades from file.
parse headers

2.2.1 Recruitment city

cities = boss.city()

Returns:

[{'id': 101010000,
  'name': 'Beijing',
  'pinyin': None,
  'firstChar': 'b',
  'rank': 1,
  'children': [{'id': 101010100,
    'name': 'Beijing',
    'pinyin': 'beijing',
    'firstChar': 'b',
    'rank': 1}]},
 {'id': 101020000,
  'name': 'Shanghai',
  ...}]
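Since the query methods later in this article take city ids, a small (hypothetical, not part of Kcrawler) helper can look an id up by name in this nested structure:

```python
def find_city_id(cities, name):
    # Depth-first search through the nested province/city tree.
    for c in cities:
        if c['name'] == name:
            return c['id']
        found = find_city_id(c.get('children', []), name)
        if found is not None:
            return found
    return None

# Sample data mirroring the structure shown above:
sample = [{'id': 101010000, 'name': 'Beijing',
           'children': [{'id': 101010100, 'name': 'Beijing'}]}]
print(find_city_id(sample, 'Beijing'))
# 101010000
```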

2.2.2 Popular Cities

hotcities = boss.hotcity()

Returns:

[{'id': 100010000, 'name': 'the country'},
 {'id': 101010100, 'name': 'Beijing'},
 {'id': 101020100, 'name': 'Shanghai'},
 {'id': 101280100, 'name': 'guangzhou'},
 {'id': 101280600, 'name': 'shenzhen'},
 {'id': 101210100, 'name': 'hangzhou'},
...]

2.2.3 City of the current logged-in user

user_city = boss.userCity()

Returns:

Hey hey, I won't tell you...

2.2.4 Industry

industries = boss.industry()

Returns:

[{'id': 100001, 'name': 'E-commerce'},
 {'id': 100002, 'name': 'games'},
 {'id': 100003, 'name': 'the media'},
 {'id': 100004, 'name': 'Advertising'},
 {'id': 100005, 'name': 'Data Services'},
 {'id': 100006, 'name': 'Health care'},
 {'id': 100007, 'name': 'Life Service'},
 {'id': 100008, 'name': 'O2O'},
 {'id': 100009, 'name': 'travel'},
 {'id': 100010, 'name': 'Category information'},
 {'id': 100011, 'name': 'Music/Video/Reading'},
 {'id': 100012, 'name': 'Online Education'},
 {'id': 100013, 'name': 'Social Networking'},
...]

2.2.5 Expected positions of the current logged-in user (up to three)

expects = boss.expect()

Returns:

[{'id': xxx, 'positionName': 'xxx'},
 {'id': xxx, 'positionName': 'xxx'},
 {'id': xxx, 'positionName': 'xxx'}]

2.2.6 Recommended job list for the current logged-in user

The boss.job(i, page) method takes two parameters. The first, i, is an index into the list returned by boss.expect(); for example, 0 means recommended jobs for the first expected position. The second, page, specifies the data page; for example, 1 is the first page.

jobs = boss.job(0, 1)

Returns:

[{'id': 'be8bfdcdf7e99df90XV-0t-8FVo~',
  'jobName': 'Deep Learning Platform Manager/Intermediate Technology',
  'salaryDesc': '30-50K',
  'jobLabels': ['Shenzhen Futian District Shopping Park', '5-10 years', 'bachelor'],
  'brandName': 'China Peace',
  'brandIndustry': 'Internet Finance',
  'lid': '411f6b88-8a83-437a-aa5f-5de0fc4da2b7.190-GroupC,194-GroupB.1'},
 {'id': 'f649c225a1b9038f0nZ609W5E1o~',
  'jobName': 'Recommendation System Evaluation Engineer',
  'salaryDesc': '20-35K',
  'jobLabels': ['Shenzhen Nanshan Science and Technology Park', '1-3 years', 'bachelor'],
  'brandName': 'company',
  'brandIndustry': 'Internet'},
 {'id': '94cfec046b98b9671H150tS8E1M~',
  'jobName': 'User Data Mining Engineer',
  'salaryDesc': '20-40K · 15',
  'jobLabels': ['Nanshan Center, Nanshan District, Shenzhen', 'No experience required', 'bachelor'],
  'brandName': "Today's Headlines",
  'brandIndustry': 'Mobile Internet',
  'lid': '411f6b88-8a83-437a-aa5f-5de0fc4da2b7.190-GroupC,194-GroupB.4'},
 ...]
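As a quick illustration of working with these results, here is a small helper (hypothetical, not part of Kcrawler) that reduces each job dict to a one-line summary, assuming the field names shown in the sample above:

```python
def summarize_jobs(jobs):
    # Keep just the company, title and salary from each job dict.
    return ['{} | {} | {}'.format(j['brandName'], j['jobName'], j['salaryDesc'])
            for j in jobs]

sample = [{'jobName': 'User Data Mining Engineer', 'salaryDesc': '20-40K',
           'brandName': "Today's Headlines"}]
print(summarize_jobs(sample))
# ["Today's Headlines | User Data Mining Engineer | 20-40K"]
```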

2.2.7 Job search by keyword

The city, industry, and position parameters are the id fields from the data crawled above. Two different methods are used here because the URL of the first page of a Boss job search differs in structure from that of the second and subsequent pages: the former is fixed, while the latter varies. This is why one of the URLs passed into the configuration dictionary above is a function.

Therefore, use the queryjob method if you only want the first page of results; to see the second page and beyond, use the queryjobpage method. queryjobpage can in fact crawl the first page too, but using the two methods as described is less likely to trigger anti-crawler measures, since it matches what a user normally does when browsing the site.

tencent_jobs1 = boss.queryjob(query='company', city=101280600, industry=None, position=101301)
tencent_jobs2 = boss.queryjobpage(query='company', city=101280600, industry=None, position=101301, page=2)
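Putting the two methods together, here is a hedged sketch of fetching the first n pages in one call. The helper is hypothetical; it only assumes the queryjob/queryjobpage signatures shown above and that both return lists:

```python
def query_all_pages(boss, query, city, position, industry=None, pages=3):
    # Page 1 uses queryjob; pages 2+ use queryjobpage, mirroring
    # how a user would actually browse the search results.
    results = boss.queryjob(query=query, city=city,
                            industry=industry, position=position)
    for page in range(2, pages + 1):
        results += boss.queryjobpage(query=query, city=city, industry=industry,
                                     position=position, page=page)
    return results
```

For example, query_all_pages(boss, 'company', 101280600, 101301) would return the first three pages of results as a single list.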

Returns:

[{'jobName': 'Tianyan Lab - Senior Researcher in Big Data and Artificial Intelligence',
  'salaryDesc': '30-60K',
  'jobLabels': ['Shenzhen Nanshan Science and Technology Park', '3-5 years', 'master'],
  'brandName': 'company',
  'brandIndustry': 'Internet',
  'id': 'ce6f957a5e4cb3100nZ43dm6ElU~',
  'lid': '22Uetj173Sz.search.1'},
 {'jobName': 'QQ Information Stream Recommendation Algorithm Engineer/Researcher',
  'salaryDesc': '20-40K · 14',
  'jobLabels': ['Shenzhen Nanshan Science and Technology Park', 'No experience required', 'bachelor'],
  'brandName': 'company',
  'brandIndustry': 'Internet',
  'id': 'a280e0b17a2aeded03Fz3dq_GFQ~',
  'lid': '22Uetj173Sz.search.2'},
 ...]

GitHub

Github.com/kenblikylee…



This article was first published on a WeChat official account.