I suspect many of you have submitted a resume on Lagou at some point. Today I wanted to find out the salary levels, hiring requirements, benefits and company locations of Python development jobs in Beijing. The analysis has to be based on an existing data sample, so this article uses a crawler plus some data analysis to show the current state of Python development in Beijing. I hope it helps with your career planning!

The crawler

The first step of any crawler is, of course, to analyse the requests and the page source. We can't find the job postings in the HTML source of the page, but among the requests we can see a POST request like this one.

From the picture we can see:

URL: www.lagou.com/jobs/positi…

Request method: POST

result: the published job postings on the page

totalCount: the total number of postings

In practice it turns out that, besides requiring the right headers, Lagou also limits how often a single IP can hit the interface. At first you get an 'access too frequently' message, and continued access gets the IP blacklisted, though it is removed from the blacklist again after a while.

Against this, we can simply throttle the request rate, at the cost of crawler efficiency.

Alternatively, we can crawl through proxy IPs. Free proxy addresses can be found online, but most of them are unreliable, and the paid ones are not particularly cost-effective. Which approach to choose depends on how much data you need and how fast you need it; a small sketch combining both ideas follows.
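For reference, here is a minimal sketch of combining the two ideas: pause between requests and, optionally, route each request through a randomly chosen proxy. The fetch_page helper is made up for illustration, and the proxy addresses are just the placeholders used later in this article, not a tested setup.

import random
import time

import requests

# A tiny proxy pool; the addresses are placeholders, not guaranteed to work
PROXY_POOL = [
    {'http': '140.143.96.216:80', 'https': '140.143.96.216:80'},
    {'http': '119.27.177.169:80', 'https': '119.27.177.169:80'},
]

def fetch_page(url, headers, data, use_proxy=False, delay=15):
    # Throttle: wait before every request so the IP stays under the frequency limit
    time.sleep(delay)
    proxies = random.choice(PROXY_POOL) if use_proxy else None
    return requests.post(url, headers=headers, data=data, proxies=proxies, timeout=10)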

Approach

From the request analysis we know that each page returns 15 postings, and totalCount tells us the total number of postings for that keyword.

Dividing and rounding up gives the total number of pages (a quick example of the page math follows). The resulting data is then saved to a CSV file, which gives us a data source for the analysis!
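As a quick example of the page math (the totalCount value here is made up):

import math

per_page = 15            # Lagou returns 15 postings per page
total_count = 453        # assumed example value of totalCount
page_total = math.ceil(total_count / per_page)
print(page_total)        # 31 -> request pages 1..31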

The Form Data of the POST request carries three parameters (a sample payload and response shape are sketched after the list):

first: whether this is the first page (doesn't seem to matter)

pn: the page number

kd: the search keyword
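To make that concrete, here is a rough sketch of the form data and of the part of the JSON response we care about; the field values are invented for illustration, and the real response carries many more fields.

# Form data sent with the POST request
form_data = {
    'first': 'true',    # whether this is the first page (doesn't seem to matter)
    'pn': 1,            # page number
    'kd': 'python'      # search keyword
}

# Rough shape of the JSON response (values invented for illustration)
sample_response = {
    'content': {
        'positionResult': {
            'totalCount': 453,      # total number of postings
            'result': [             # the 15 postings on this page
                {
                    'companyFullName': 'Some Company Ltd.',
                    'district': 'Haidian',
                    'salary': '15k-25k',
                    'workYear': '3-5 years',
                    'education': 'Bachelor',
                    'positionAdvantage': 'Flexible hours',
                },
            ],
        }
    }
}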

Enough talk, show the code.

Get the request result
import math
import random
import time

import pandas as pd
import requests

# kind: search keyword
# page: page number, defaults to 1
def get_json(kind, page=1):
    # POST request parameters
    param = {
        'first': 'true',
        'pn': page,
        'kd': kind
    }
    header = {
        'Host': 'www.lagou.com',
        'Referer': 'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
    }
    # Proxy pool
    proxies = [
        {'http': '140.143.96.216:80', 'https': '140.143.96.216:80'},
        {'http': '119.27.177.169:80', 'https': '119.27.177.169:80'},
        {'http': '221.7.255.168:8080', 'https': '221.7.255.168:8080'}]
    # The requested URL
    url = 'https://www.lagou.com/jobs/positionAjax.json?px=default&city=%E5%8C%97%E4%BA%AC&needAddtionalResult=false'
    # To go through a proxy instead, pick one at random:
    # response = requests.post(url, headers=header, data=param, proxies=random.choice(proxies))
    response = requests.post(url, headers=header, data=param)
    response.encoding = 'utf-8'
    if response.status_code == 200:
        response = response.json()
        # positionResult contains the total count and the postings on this page
        # (company name, address, salary, benefits, etc.)
        return response['content']['positionResult']
    return None

Then we just need to call get_json for every page we turn, and iterate over the result to pull out the recruitment information we want.

if __name__ == '__main__':
    # Search keyword
    kind = 'python'
    # Request once first to obtain the total number of postings
    position_result = get_json(kind=kind)
    # Total number of postings
    total = position_result['totalCount']
    print('{} development positions, {} postings in total.....'.format(kind, total))
    # 15 postings per page; round up to get the total number of pages
    page_total = math.ceil(total / 15)

    # All query results
    search_job_result = []
    # for i in range(1, page_total + 1):
    # Crawl only the first 100 pages of data for efficiency
    for i in range(1, 100):
        position_result = get_json(kind=kind, page=i)
        # Pause after each request to avoid being blocked by the server
        time.sleep(15)
        # Job postings on the current page
        page_python_job = []
        for j in position_result['result']:
            python_job = []
            # Company full name
            python_job.append(j['companyFullName'])
            # Company abbreviation
            python_job.append(j['companyShortName'])
            # Company size
            python_job.append(j['companySize'])
            # Funding stage
            python_job.append(j['financeStage'])
            # District
            python_job.append(j['district'])
            # Job title
            python_job.append(j['positionName'])
            # Required years of experience
            python_job.append(j['workYear'])
            # Required education
            python_job.append(j['education'])
            # Salary range
            python_job.append(j['salary'])
            # Benefits
            python_job.append(j['positionAdvantage'])

            page_python_job.append(python_job)

        # Add to the overall list
        search_job_result += page_python_job
        print('Page {} completed, total number of postings so far: {}'.format(i, len(search_job_result)))
        # Pause again after each page to avoid being blocked by the server
        time.sleep(15)

OK! We have the data, and the last step is to save it.

    # Convert all the data into a DataFrame and write it out
    df = pd.DataFrame(data=search_job_result,
                      columns=['Full name', 'Company abbreviation', 'Size of company', 'Funding phase', 'regional', 'Job Title', 'Work Experience', 'Educational Requirements', 'wages', 'Position Benefits'])
    df.to_csv('lagou.csv', index=False, encoding='utf-8_sig')

Run the main method directly and the result is a lagou.csv file full of postings.

Data analysis

With the CSV file in hand, we need to clean the data to make the statistics easier.

For example: drop internship postings; treat 'no requirement' or 'fresh graduate' work experience as 0 years; reduce each salary range to a single approximate value (for instance '15k-25k' becomes 15 + (25 - 15) / 4 = 17.5k, the lower quarter of the range); and treat 'no requirement' education as junior college.

import pandas as pd
import matplotlib.pyplot as plt

# Load the data
df = pd.read_csv('lagou.csv', encoding='utf-8')
# Data cleaning: drop internship postings
df.drop(df[df['Job Title'].str.contains('internship')].index, inplace=True)
# print(df.describe())
# The CSV stores everything as strings, so first extract the numbers with a
# regular expression, then take an approximate value for each interval
pattern = r'\d+'
df['work_year'] = df['Work Experience'].str.findall(pattern)
# Years of experience after processing
avg_work_year = []
for i in df['work_year']:
    # If experience is 'unlimited' or 'fresh graduate', nothing matches: treat as 0 years
    if len(i) == 0:
        avg_work_year.append(0)
    # If a single value matches, use it directly
    elif len(i) == 1:
        avg_work_year.append(int(''.join(i)))
    # If an interval matches, take its average
    else:
        num_list = [int(j) for j in i]
        avg_year = sum(num_list) / 2
        avg_work_year.append(avg_year)
df['Work Experience'] = avg_work_year

# Turn the salary string into a list of numbers and take the lower quarter of the range
df['salary'] = df['wages'].str.findall(pattern)
# Estimated monthly salary
avg_salary = []
for k in df['salary']:
    int_list = [int(n) for n in k]
    avg_wage = int_list[0] + (int_list[1] - int_list[0]) / 4
    avg_salary.append(avg_wage)
df['Monthly salary'] = avg_salary

# Treat positions with no education requirement as requiring junior college at minimum
df['Educational Requirements'] = df['Educational Requirements'].replace('不限', 'college')
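Before plotting, it is worth a quick sanity check that the cleaning produced sensible numbers; a minimal sketch:

# Peek at the cleaned columns
print(df[['Work Experience', 'Monthly salary', 'Educational Requirements']].head())
# Summary statistics of the estimated monthly salary (thousand yuan)
print(df['Monthly salary'].describe())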

After this simple cleaning, let's start on the statistics.

Draw a salary histogram

# Draw the salary frequency histogram and save it
plt.hist(df['Monthly salary'])
plt.xlabel('Salary (thousand yuan)')
plt.ylabel('Frequency')
plt.title('Wage histogram')
plt.savefig('wages.jpg')
plt.show()

Conclusion: The salary of Python developers in Beijing is mostly between 15 and 25K

Pie chart of company distribution

# Draw the pie chart and save it
count = df['regional'].value_counts()
plt.pie(count, labels=count.keys(), labeldistance=1.4, autopct='%2.1f%%')
plt.axis('equal')  # Make the pie chart perfectly round
plt.legend(loc='upper left', bbox_to_anchor=(0.1, 1))
plt.savefig('pie_chart.jpg')
plt.show()

Conclusion: Haidian District has the most Python companies, followed by Chaoyang District. If you are going to work in Beijing, you now have a rough idea of where to rent.

Bar chart of education requirements

# Example counts: {'Bachelor': 1304, 'Associate': 94, 'Master': 57, 'Doctor': 1}
edu_counts = {}
for i in df['Educational Requirements']:
    if i not in edu_counts:
        edu_counts[i] = 1
    else:
        edu_counts[i] += 1
index = list(edu_counts.keys())
print(index)
num = []
for i in index:
    num.append(edu_counts[i])
print(num)
plt.bar(index, num, width=0.5)
plt.show()

Conclusion: most companies ask for a bachelor's degree when hiring Python developers. But a degree is just a stepping stone; if you keep improving your skills, it matters less and less.

Word cloud of benefits

import jieba
from wordcloud import WordCloud

# Draw a word cloud summarising the strings in the benefits column
text = ''
for line in df['Position Benefits']:
    text += line
# Use the jieba module to split the text into words
cut_text = ' '.join(jieba.cut(text))
# color_mask = imread('cloud.jpg')  # set a background image
cloud = WordCloud(
    background_color='white',
    # A font must be specified to render Chinese correctly
    font_path='yahei.ttf',
    # mask=color_mask,
    max_words=1000,
    max_font_size=100
    ).generate(cut_text)

# Save the word cloud image
cloud.to_file('word_cloud.jpg')
plt.imshow(cloud)
plt.axis('off')
plt.show()

Conclusion: flexible working hours are a benefit at most companies, and a few also offer six insurances and one housing fund. Team atmosphere and flat management come up often as well.

That concludes this analysis. If you need to, you can run the same thing for other positions or regions. I hope it helps you work out your own development and career plan.

Follow the official account "Programmers Grow Together" (ID: FinishBug). We have prepared plenty of gifts for new followers, covering 21 technical directions including but not limited to Python, Java, Linux, big data, artificial intelligence and front-end; reply in the background to get the "gift package".