This is the 14th day of my participation in the August More Text Challenge. For details, see: August More Text Challenge

The crawler-related information in this article is for demonstration only, and sensitive data has been specially processed. By the time this article was published, the website's data format and request method had already been updated, so nothing here will affect the site.

If you are reading this, you have gone from “knowing nothing” to “rookie crawler”! In the previous articles we started with the basics and went step by step through how to use Python to crawl the information we want, with HTML, Linux, Ajax, and more mixed in along the way. Starting with this chapter, we will build our final project: the grade SMS notification service.

Attention! Academic administration systems differ from school to school, so this course can only pick the most widely used one, and it may not match your school’s. The techniques are the same, though. Trust me, by the end of these two chapters you’ll be glad if your school’s educational administration system isn’t the one used in this tutorial. In fact, a certain vendor’s anti-crawler measures and code formatting made me so miserable during development that I thought about giving up. Fortunately, after a lot of persistence, these two articles are finally in front of you as planned. Without further ado, let’s get right to it!

Login system

First, we need to analyze the parameters the login system expects. As shown in the figure below, logging in requires the following inputs: user name, password, verification code, and the “student” role selection. But is that really all? Obviously not! As we mentioned in the previous chapter, in addition to the visible fields above, we also need to pass a hidden field, __VIEWSTATE. So what is __VIEWSTATE?

View state is a mechanism in ASP.NET. To preserve page state across round trips between the client and the server, .NET adds a hidden field, __VIEWSTATE, to the page. The framework manages this state automatically, which saves the programmer the effort.

So where exactly do we find the __VIEWSTATE parameter? Very simple: in the source code of the page!

Let’s break it down:

import requests
from bs4 import BeautifulSoup

url = "target_url"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
# __VIEWSTATE is stored in a hidden <input> element on the page
_VIEWSTATE = soup.find('input', attrs={'name': '__VIEWSTATE'})['value']
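If the hidden field is missing, `soup.find(...)` returns `None` and the indexing above fails. As a minimal sketch of a more defensive lookup, the snippet below collects every hidden input into a dict using only the standard library, so it can be tried without a network request. `HiddenFieldParser`, `extract_hidden_fields`, and the sample HTML (including its `login.aspx` action and the field value) are made up for illustration and are not taken from the real site.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect the name/value pair of every <input type="hidden"> on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        is_hidden = (attr.get("type") or "").lower() == "hidden"
        if is_hidden and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def extract_hidden_fields(html):
    """Return a dict of hidden form fields; empty if the page has none."""
    parser = HiddenFieldParser()
    parser.feed(html)
    return parser.fields


# Hypothetical sample page, for illustration only
SAMPLE_HTML = """
<form method="post" action="login.aspx">
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTIzNDU2Nzg5Ozs+" />
  <input type="text" name="TextBox1" />
</form>
"""

fields = extract_hidden_fields(SAMPLE_HTML)
# .get() returns None instead of raising when the field is absent
print(fields.get("__VIEWSTATE"))
```

Returning a dict also makes it easy to pick up other hidden fields from the same page in one pass.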

Using Fiddler4 from the previous section, we can determine that the following parameters need to be passed to log in to the system:

  • __VIEWSTATE: the hidden ASP.NET field that carries the page state
  • TextBox1: the student ID
  • TextBox2: the password
  • TextBox3: the verification code
  • RadioButtonList1: the “student” role marker
  • Button1: an empty value
  • lbLanguage: an empty value
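The parameter list above maps directly onto a form payload. The following is only a sketch: `build_login_payload` is a hypothetical helper name, the sample values are invented, and it assumes the role field expects the Chinese word for “student” (学生) encoded as GB2312.

```python
def build_login_payload(view_state, username, password, code):
    """Assemble the login form fields listed above into one dict."""
    return {
        "__VIEWSTATE": view_state,
        "TextBox1": username,
        "TextBox2": password,
        "TextBox3": code,
        # Assumption: the role marker is the GB2312-encoded word for "student"
        "RadioButtonList1": "学生".encode("gb2312", "replace"),
        "Button1": "",
        "lbLanguage": "",
    }


# Invented sample values, for illustration only
payload = build_login_payload("sample-viewstate", "20160001", "secret", "ab12")
print(sorted(payload))
```

A dict like this can be passed as the `data` argument of a `requests` POST call.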

In fact, this educational administration system also has a login page without a verification code, and I used that page throughout debugging. However, to keep the code general, this course uses the login page with a verification code by default.

As mentioned above, the verification code has to be entered manually. So during development we first download the verification code image from the page to the local machine and then type it in by hand to ensure accuracy. Note: a machine-learning-based verification code recognition model will be provided later to remove the manual step. So, the parameters required for login are clear. Let’s start coding!

import os

import requests
from bs4 import BeautifulSoup

username = 'Your username'
password = 'Your password'
url = 'targeturl'
HEA = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/'
                     '537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36'
       }

# Use one session so the verification code and the login share the same cookies
s = requests.Session()

response = s.get(url, headers=HEA)
if response.status_code == requests.codes.ok:
    soup = BeautifulSoup(response.text, 'lxml')
else:
    # Retry against the fallback login URL
    url = 'targeturl'
    response = s.get(url, headers=HEA)
    soup = BeautifulSoup(response.text, 'lxml')
viewState = soup.find('input', attrs={'name': '__VIEWSTATE'})['value']
# Role field: the Chinese word for "student" (学生), GB2312-encoded
RadioButtonList1 = u"学生".encode('gb2312', 'replace')

# Download the verification code image to the current directory
imgUrl = "targeturl"
imgresponse = s.get(imgUrl, stream=True)
image = imgresponse.content
code_path = os.path.join(os.getcwd(), "code.jpg")
print("Save the verification code to:" + code_path + "\n")
try:
    # The with block closes the file automatically
    with open(code_path, "wb") as jpg:
        jpg.write(image)
except IOError:
    print("IO Error\n")

# Enter the verification code manually
code = input("The verification code is:")
data = {
    '__VIEWSTATE': viewState,
    'TextBox1': username,
    'TextBox2': password,
    'TextBox3': code,
    'RadioButtonList1': RadioButtonList1,
    'Button1': '',
    'lbLanguage': ''
}
web_data = s.post(url=url, data=data, headers=HEA)
print(web_data.text)

Run the code, enter the verification code from the image we just downloaded, and press Enter to check whether we have logged in to the system:
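Reading the printed HTML by eye works, but the check can also be roughly automated. The sketch below rests on an assumption that has not been verified against the real system: a failed login returns the login form again, so the password box `TextBox2` is still in the response. `looks_logged_in` is a hypothetical helper name, and both sample strings are invented.

```python
import re

# Matches name="TextBox2", name='TextBox2' or name=TextBox2, ignoring case
LOGIN_FIELD = re.compile(r'name\s*=\s*["\']?TextBox2\b', re.IGNORECASE)


def looks_logged_in(html):
    """Heuristic only: True when the login form's password box is gone."""
    return LOGIN_FIELD.search(html) is None


# Invented sample responses, for illustration only
failed_page = '<form><input name="TextBox2" type="password" /></form>'
success_page = '<div>Welcome</div>'

print(looks_logged_in(failed_page))
print(looks_logged_in(success_page))
```

With the login code above, this would be called as `looks_logged_in(web_data.text)`. A real deployment may signal failure differently, so treat the result as a hint and not as proof.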

Bingo! We have successfully logged in to the educational administration system!

Analyzing the grade page

Let’s analyze the URL of the grade page:

In the grade page URL, we need to pay attention to the following parameters:

  • xh: the student’s user name
  • xm: the URL-encoded student name

xm is the string obtained by URL-encoding the student’s name. You can check the result with a URL-encoding tool. For example, for a boy named “Xiao Qiang” (小强), the encoded string is %e5%b0%8f%e5%bc%ba. With the parameters explained, let’s select the grades for the second semester of 2016-2017 and use Fiddler4 to capture the data sent to the server.
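Before moving on, the xm encoding described above can also be produced in Python with the standard library's `urllib.parse`, with no online tool needed. This is a small sketch: `build_grade_url` is a hypothetical helper, and the host, student ID, and `gnmkdm` value are the masked placeholders from this article, not real data.

```python
from urllib.parse import quote, unquote, urlencode

name = "小强"

# quote() percent-encodes the UTF-8 bytes and emits upper-case hex digits
encoded = quote(name)
print(encoded.lower())  # %e5%b0%8f%e5%bc%ba

# unquote() reverses the encoding, whatever the case of the hex digits
print(unquote("%e5%b0%8f%e5%bc%ba"))  # 小强


def build_grade_url(base, xh, xm, gnmkdm):
    """Build the grade page URL; urlencode() percent-encodes each value."""
    query = urlencode({"xh": xh, "xm": xm, "gnmkdm": gnmkdm})
    return f"{base}?{query}"


url = build_grade_url(
    "http://xxxx.xxxx.edu.cn/xscj_gc.aspx", "xxxxxxxx", name, "N121605"
)
print(url)
```

Percent-encoded hex digits are case-insensitive, so the upper-case output here is equivalent to the lower-case form shown in the text.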

Again, we need to analyze the transmission parameters:

  • __VIEWSTATE: the hidden ASP.NET field that carries the page state
  • __VIEWSTATEGENERATOR: the default value works
  • ddlXN: the selected school year
  • ddlXQ: the selected term
  • Button1: an empty value

__VIEWSTATE can again be read from the page source, and ddlXN and ddlXQ are the school year and term we choose. Because of this vendor’s validation method, we also have to simulate the browser headers and add the Referer parameter, which is the URL of the grade page we are on. Let’s start writing code:
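# s is the logged-in requests.Session from the login step
chengji_page = 'http://xxxx.xxxx.edu.cn/xscj_gc.aspx?xh=xxxxxxxx&xm=%e5%b0%8f%e5%bc%ba&gnmkdm=N121605'
header = {
    'Referer': 'http://xxxx.xxxx.edu.cn/xscj_gc.aspx?xh=xxxxxxxx&xm=%e5%b0%8f%e5%bc%ba&gnmkdm=N121605',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36'
}


def get_view(response):
    # Assumed helper: extract __VIEWSTATE from a response, as in the login step
    soup = BeautifulSoup(response.text, 'lxml')
    return soup.find('input', attrs={'name': '__VIEWSTATE'})['value']


response = s.get(chengji_page, headers=header)
__VIEWSTATE = get_view(response)
data1 = {
    "__VIEWSTATE": __VIEWSTATE,
    "ddlXN": "2016-2017",
    "ddlXQ": "2",
    "Button1": ""
}
# Send the same headers on the POST so the Referer check passes
chengji = s.post(chengji_page, data=data1, headers=header)
print(chengji.text)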

Let’s run this code:

Everything looks normal! We got the grades we asked for! Next, we need to parse the page, save the grades to the database, and connect everything to the interface we built earlier. There are many steps, but they all use what we have already learned, so let’s finish the grade SMS notification service in the next chapter.

Conclusion

  1. This chapter completed the login and grade retrieval steps of the grade SMS notification service. These two steps are the most important part of our development, so if anything is unclear, be sure to raise it in the discussion area!
  2. In fact, this vendor’s educational administration system code is chaotic, possibly on purpose. So I hope you will study the page analysis in the next chapter carefully.