Summary

This article first scrapes Python job postings from Lagou (lagou.com) and then visualizes the data with Python. It mainly involves web crawling and data visualization.

The crawler

Python is used to scrape the data from Lagou with the simple, easy-to-use Requests module. The main point to note is that Lagou's pages are loaded dynamically, so you need the browser's F12 developer tools to inspect the network traffic. Doing so shows that the job listings are fetched with a POST request, so the crawler has to submit form data. The submitted data is as follows:


The real request URL is: www.lagou.com/jobs/positi…

As the figure above shows, kd is the query keyword and pn is the page number; incrementing pn pages through the results.
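
As a minimal sketch of that form data: the kd and pn fields come from the description above and the first field from the crawler code in this article, while the helper name build_form is hypothetical.

```python
def build_form(keyword, page):
    """Return the POST form fields for one page of search results."""
    return {
        "first": "false",   # same value the crawler code in this article sends
        "kd": keyword,      # query keyword
        "pn": str(page),    # page number, sent as a string
    }


# Form data for three consecutive pages of a "Python" search.
forms = [build_form("Python", page) for page in range(1, 4)]
```

Each dictionary can be passed as the data argument of requests.post.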

Code implementation

import random
import re
import time

import pandas as pd  # DataFrame conversion and CSV output
import requests  # HTTP requests

# URL that the search page POSTs to
url = 'https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false&isSchoolJob=0'

# Request headers copied from the browser, to get past the anti-crawler checks.
# The Cookie value comes from the author's browser session, so substitute your own.
header = {
    'Host': 'www.lagou.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36',
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Accept-Language': 'zh-CN,en-US;q=0.7,en;q=0.3',
    'Accept-Encoding': 'gzip, deflate, br',
    'Referer': 'https://www.lagou.com/jobs/list_Python?labelWords=&fromSearch=true&suginput=',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'X-Requested-With': 'XMLHttpRequest',
    'X-Anit-Forge-Token': 'None',
    'X-Anit-Forge-Code': '0',
    'Content-Length': '26',
    'Cookie': (
        'user_trace_token=20171103191801-9206e24f-9ca2-40ab-95a3-23947c0b972a; '
        '_ga=GA1.2.545192972.1509707889; '
        'LGUID=20171103191805-a9838dac-c088-11e7-9704-5254005c3644; '
        'JSESSIONID=ABAAABAACDBABJB2EE720304E451B2CEFA1723CE83F19CC; _gat=1; '
        'LGSID=20171228225143-9edb51dd-ebde-11e7-b670-525400f775ce; PRE_UTM=; '
        'PRE_HOST=www.baidu.com; '
        'PRE_SITE=https%3A%2F%2Fwww.baidu.com%2Flink%3Furl%3Dkkjpgbhanny1nukalpx2odfuxv9itif3kbawm2-fDNu'
        '%26ck%3D3065.1.126.376.140.374.139.129%26shh%3Dwww.baidu.com%26sht%3Dmonline_3_dg'
        '%26wd%3D%26eqid%3Db0ec59d100013c7f000000055a4504f6; '
        'PRE_LAND=https%3A%2F%2Fwww.lagou.com%2F; '
        'LGRID=20171228225224-b6cc7abd-ebde-11e7-9f67-5254005c3644; '
        'index_location_city=%E5%85%A8%E5%9B%BD; TG-TRACK-CODE=index_search; '
        'SEARCH_ID=3ec21cea985a4a5fa2ab279d868560c8'
    ),
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
}

# One capture group per field that is kept from each job posting
pattern = re.compile(
    r'\{"companyId":.*?,"positionName":"(.*?)","workYear":"(.*?)",'
    r'"education":"(.*?)","jobNature":"(.*?)","financeStage":"(.*?)",'
    r'"companyLogo":".*?","industryField":".*?","city":"(.*?)",'
    r'"salary":"(.*?)","positionId":.*?,"positionAdvantage":"(.*?)",'
    r'"companyShortName":"(.*?)","district"'
)

for n in range(30):
    # Form data to submit
    form = {'first': 'false', 'kd': 'Python', 'pn': str(n)}
    # Wait a few seconds between requests
    time.sleep(random.randint(2, 5))
    # Submit the form
    html = requests.post(url, data=form, headers=header)
    # Extract the fields
    data = pattern.findall(html.text)
    # Convert to a DataFrame
    data = pd.DataFrame(data)
    # Append to a local CSV file
    data.to_csv(r'D:\Windows 7 Documents\Desktop\My\LaGouDataMatlab.csv', header=False, index=False, mode='a+')

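Since the endpoint name ends in .json, the response can presumably be parsed as JSON instead of being matched with a regular expression. The sketch below assumes only that each posting is a JSON object carrying the field names used in the regex. The function name find_positions and the sample payload are hypothetical, and the real nesting of the response may differ.

```python
import json


def find_positions(node):
    """Recursively collect every dict that has a "positionName" key."""
    found = []
    if isinstance(node, dict):
        if "positionName" in node:
            found.append(node)
        else:
            for value in node.values():
                found.extend(find_positions(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_positions(item))
    return found


# Hypothetical sample payload, for illustration only.
sample = json.loads(
    '{"content": {"result": [{"companyId": 1, "positionName": "Python Engineer", '
    '"workYear": "1-3", "education": "Bachelor", "city": "Beijing", "salary": "10k-20k"}]}}'
)

rows = [
    (p["positionName"], p["workYear"], p["education"], p["city"], p["salary"])
    for p in find_positions(sample)
]
```

Walking the structure this way avoids depending on the exact order of the fields, which the regex approach relies on.
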
Note: don't crawl too fast unless you have other ways around the anti-crawler measures, such as rotating IP addresses. No login is required. I use the time module in the code to limit the crawl speed.
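
The throttling idea can be isolated in a small helper, sketched here with the same 2 to 5 second range as the crawler code. The name polite_sleep and the injectable sleep parameter are hypothetical additions that make the helper easy to try without actually waiting.

```python
import random
import time


def polite_sleep(min_seconds=2, max_seconds=5, sleep=time.sleep):
    """Pause for a random whole number of seconds and return the delay used."""
    delay = random.randint(min_seconds, max_seconds)
    sleep(delay)
    return delay


# Dry run with a stand-in sleep function so the example finishes instantly.
delays = [polite_sleep(sleep=lambda seconds: None) for _ in range(3)]
```

In the crawler loop, calling polite_sleep() with its defaults before each request gives the same behavior as the time.sleep(random.randint(2, 5)) line.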

Data visualization

The downloaded data looks like this:


Note that I added the header row myself.

Import the module and configure the drawing style

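import pandas as pd  # DataFrame operations
import numpy as np
import matplotlib.pyplot as plt  # plotting
import jieba  # Chinese word segmentation
from wordcloud import WordCloud  # word cloud visualization
import matplotlib as mpl  # font configuration
from pyecharts import Geo  # geographic charts (pyecharts 0.x import path)

# Use a font that can render Chinese labels. The font name did not survive in
# this copy of the article; "Microsoft YaHei" is an assumption.
mpl.rcParams["font.sans-serif"] = ["Microsoft YaHei"]

# Configure the drawing style
plt.rcParams["axes.labelsize"] = 7.
plt.rcParams["ytick.labelsize"] = 14.
plt.rcParams["legend.fontsize"] = 12.
plt.rcParams["figure.figsize"] = [15., 15.]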

Note: the other modules install without trouble, but I recommend installing wordcloud manually. Installing it with pip fails with a message that Microsoft Visual C++ 14.0 or something similar is missing. If you download the .whl file and install that instead, the installation goes smoothly.

Data preview

data = pd.read_csv('D:\\Windows 7 Documents\\Desktop\\My\\LaGouDataPython.csv', encoding='gbk')

Note: the path passed to read_csv must not contain Chinese characters.

data.tail()

Education requirements

data['degree required'].value_counts().plot(kind='barh', rot=0)
plt.show()

Work experience

data['experience'].value_counts().plot(kind='bar', rot=0, color='b')
plt.show()

Popular Python job titles

final = ''
stopwords = ['PYTHON', 'python', 'Python', 'engineer', '(', ')', '/']  # stop words
for n in range(data.shape[0]):
    # Segment each job title into words
    seg_list = list(jieba.cut(data['job title'][n]))
    for seg in seg_list:
        if seg not in stopwords:
            final = final + seg + ' '
# final now holds the remaining words, separated by spaces
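
The filtering step can also be written without repeated string concatenation. This is a minimal sketch: the function name join_keywords is hypothetical, the token lists stand in for the output of jieba.cut, and dropping whitespace-only tokens is an extra step not present in the code above.

```python
STOPWORDS = {"PYTHON", "python", "Python", "engineer", "(", ")", "/"}


def join_keywords(token_lists, stopwords=STOPWORDS):
    """Drop stop words and blank tokens, then join the rest with spaces."""
    kept = [
        token
        for tokens in token_lists
        for token in tokens
        if token.strip() and token not in stopwords
    ]
    return " ".join(kept)


# Hypothetical, already-segmented job titles standing in for jieba.cut output.
titles = [["Python", "backend", "engineer"], ["data", "analysis", "(", "Python", ")"]]
text = join_keywords(titles)
```

The resulting string is the kind of space-separated text that a word cloud generator such as WordCloud takes as input.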

Work location

data['work'].value_counts().plot(kind='pie', autopct='%1.2f%%', explode=np.linspace(0, 1.5, 25))
plt.show()
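
Matplotlib's pie chart expects explode to have one entry per wedge, so the hard-coded 25 only works when the data contains exactly 25 distinct cities. A minimal sketch of computing the offsets from the number of categories instead; the helper name explode_offsets is hypothetical.

```python
def explode_offsets(n_slices, max_offset=1.5):
    """Return n_slices evenly spaced offsets from 0 to max_offset."""
    if n_slices < 2:
        return [0.0] * n_slices
    step = max_offset / (n_slices - 1)
    return [i * step for i in range(n_slices)]


offsets = explode_offsets(4)
```

With the counts stored in a variable, for example counts = data['work'].value_counts(), the call becomes counts.plot(kind='pie', explode=explode_offsets(len(counts))).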

Work location map

import re  # needed for re.split

# Extract (city, lower salary bound) pairs; the part before the first k or K is the lower bound
data2 = list(map(lambda x: (data['work'][x], int(re.split('k|K', data['wages'][x])[0]) * 1000), range(len(data))))
# Convert to a DataFrame
data3 = pd.DataFrame(data2)
# Convert to the format required by Geo: (city, mean salary) pairs
mean_salary = data3.groupby(0)[1].mean()
data4 = list(zip(mean_salary.index, mean_salary.values))
# Show the geographic distribution.
# The chart title and subtitle were lost in this copy; the strings below are placeholders.
geo = Geo("Python salaries by city", "", title_color="#fff", title_pos="left", width=1200, height=600, background_color='#404a59')
attr, value = geo.cast(data4)
geo.add("", attr, value, type="heatmap", is_visualmap=True, visual_range=[0, 300], visual_text_color='#fff')
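
The salary handling above can be illustrated without pandas. This sketch assumes salary strings of the form "10k-20k", which is an assumption; the article only shows splitting on k or K. The function names and sample rows are hypothetical.

```python
import re
from collections import defaultdict


def lower_bound(salary):
    """Return the lower bound times 1000, e.g. "10k-20k" -> 10000."""
    return int(re.split("k|K", salary)[0]) * 1000


def mean_by_city(rows):
    """Average the lower salary bound for each city."""
    by_city = defaultdict(list)
    for city, salary in rows:
        by_city[city].append(lower_bound(salary))
    return {city: sum(values) / len(values) for city, values in by_city.items()}


# Hypothetical rows of (city, salary string), for illustration only.
rows = [("Beijing", "10k-20k"), ("Beijing", "20K-30K"), ("Shanghai", "15k-25k")]
averages = mean_by_city(rows)
```

Because only the lower bound of each range is used, the map shows the distribution of minimum advertised salaries rather than typical pay.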

About the author:

  • Name: Mai Yantao (Former Family name Bai)
  • Online name: Excavator Little Prince
  • Personal website: Excavator Little Prince
  • Discussion group: 581465069
  • QQ email: [email protected]
  • Crawler, data analysis, machine learning enthusiast

Please credit the source when reprinting: zhuanlan.zhihu.com/p/34200452