The account hasn't been updated for almost a month for personal reasons. First of all, thank you all for not unfollowing — I'm very grateful. Over the next two months I'll be putting out high-quality articles for you, so that you can learn something and I can improve as well. All right, without further ado, let's get to the text!

This article covers how to handle complex Ajax requests. The previous article covered a simpler case, where just 10 lines of code were enough to capture the data — quite powerful. If you're interested, see How to crawl Ajax dynamic websites.

The tool we need this time is Charles, for packet capture. You can search for and download it yourself; this article won't say much about the tool itself. The reason for using it is its powerful search feature: I can find the web request I want with one click.

This time the target website is http://drugs.dxy.cn/

The requirement is to get detailed information about all the drugs, which looks easy at first glance — until you click in.

As shown above, to get the details you need to simulate clicking the triangle button, which sends an Ajax request to fetch the details.

And when you click it open, you need to be logged in to see all the information, so there's an extra step: we have to simulate a successful login before we can make the Ajax request for the details; otherwise it's useless.
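The key point here is that the login and the later Ajax requests must share one session, so the login cookies carry over. The article's code uses `requests.Session` for this; the same cookie-persistence idea can be sketched with only the standard library:

```python
import urllib.request
from http.cookiejar import CookieJar

# One cookie jar shared by every request: cookies set by the login
# response are automatically sent with later Ajax requests.
jar = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

# opener.open(login_url, data=form_bytes)  # login: server sets session cookies
# opener.open(ajax_url, data=dwr_bytes)    # same cookies ride along here

print(len(jar))  # no requests made yet, so the jar is empty: 0
```

`requests.Session` wraps exactly this behavior, which is why the code below always calls `self.session.post(...)` rather than making one-off requests.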

1. Simulating login

Login happens on this page: https://auth.dxy.cn/accounts/login?service=http%3A%2F%2Fdrugs.dxy.cn%2Flogin.do%3Fdone%3Dhttp%253A%252F%252Fdrugs.dxy.cn%252Fdrug%252F54565%252Fdetail.htm

You can see that a captcha is required, but that's not a big problem and can be handled. At this point open the developer tools (press F12), then click the login button, and you'll see the picture below.

Scroll down to Form Data and you'll see the following data.

After many tests: username and password are the login account and password, validateCode is the captcha, and nlt is an encrypted parameter generated by JS; everything else stays unchanged. Since the nlt parameter is generated by JS, this is where the Charles tool comes in.

After a successful login, this request is easy to find in Charles, so let's see where the nlt parameter comes from.

Copy the nlt value and press Ctrl+F in Charles to bring up the search page.

Check those two boxes, paste in the nlt value, and click Find, and you'll see the following — this is where the nlt parameter is generated. Click in and you'll see the following.

The nlt value is present in the HTML itself, so there is no need to parse any JS, which makes things relatively easy. Now look at the request URL.

The request URL is the same as the login URL, which means the nlt parameter is provided directly in the login page; we just need a regular expression to extract it. Now let's look at the request that generates the captcha.
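Extracting a hidden form field like this is a one-liner with `re`. A minimal sketch against a made-up HTML snippet (the field name `nlt` is from the real page; the value here is invented):

```python
import re

# Invented fragment in the shape of the login page's hidden nlt field
html = '<input type="hidden" name="nlt" value="LT-123-abcDEF" />'

# Same idea as __get_nlt below: grab whatever is in the value attribute
nlt = re.findall('nlt" value="([^"]+?)"', html)[0]
print(nlt)  # LT-123-abcDEF
```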

To see where the captcha comes from: just send a GET request to that URL.

Analysis done — on to the code.

2. Use Python to simulate login

def __get_nlt(self):
        """Get the nlt parameter"""
        url = "https://auth.dxy.cn/accounts/login?service=http%3A%2F%2Fdrugs.dxy.cn%2Flogin.do%3Fdone%3Dhttp%253A%252F%252Fdrugs.dxy.cn%252Fdrug%252F89790%252Fdetail.htm&qr=false&method=1"
        response = self.session.post(url)
        nlt = re.findall('nlt" value="([^"]+?)"', response.text)[0]  # match the nlt parameter
        return nlt



    def __login(self):
        """Log in to the website"""
        # The form that needs to be submitted for login
        print('Logging in')
        data = {'username': '13060629578',  # your account
                'password': 'asd12678',  # your password
                'loginType': '1',
                'validateCode': self.__get_chapter(),  # captcha
                'keepOnlineType': '2',
                'trys': '0',
                'nlt': self.__get_nlt(),  # get the nlt parameter
                '_eventId': 'submit'}
        # login url
        url = 'https://auth.dxy.cn/accounts/login?service=http%3A%2F%2Fdrugs.dxy.cn%2Flogin.do%3Fdone%3Dhttp%253A%252F%252Fdrugs.dxy.cn%252Fdrug%252F89790%252Fdetail.htm&qr=false&method=1'
        response = self.session.post(url, headers=self.headers, data=data)  # request login
        if 'dxy_zbadwxq6' in response.text:  # fill in your own username here to verify that login succeeded
            print('Login successful')
        else:
            print('Login failed, trying again')
            self.__login()



    def __get_chapter(self):
        """Obtain the captcha"""
        try:
            url = 'https://auth.dxy.cn/jcaptcha'
            response = self.session.get(url, headers=self.headers)
            # Save the captcha image
            with open('code.jpg', 'wb') as f:
                f.write(response.content)
            im = Image.open('code.jpg')
            im.show()
            valide_code = input('Enter the captcha: ')
            return valide_code
        except Exception as e:
            # Fetching the captcha failed, get it again
            print(e)
            return self.__get_chapter()

This is the important part of the code; it's commented, so I won't go over it line by line.

3. Analyzing the Ajax requests

After logging in successfully, click into any page and click the triangle button to see the details.

We keep using Charles to capture packets. First clear the packets Charles has already captured, then click the triangle button on the page to trigger the request for the details.

As you can see from the request above, the data is all Unicode-escaped, so we need to decode it. In fact, you can just copy it and print it in a command-line window to see that this is the detailed data we want.
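Decoding those `\uXXXX` escapes is straightforward in Python. A minimal sketch using a made-up fragment of response text (the real response is a DWR payload; the string here is invented):

```python
# A made-up fragment in the style of the escaped response body
raw = r'"\u836f\u54c1\u8be6\u7ec6\u4fe1\u606f"'

# Encode to bytes, then decode the \uXXXX escapes back into characters
decoded = raw.encode('utf-8').decode('unicode_escape')
print(decoded)  # "药品详细信息"
```

This is the same `encode('utf-8').decode('unicode_escape')` trick used in `__get_detail` below.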

Let's look at the request and what else it requires.

As you can see, it's a POST request and the success status code is 200. There are a lot of parameters; after many tests I found that the five marked with arrows change. The first is the drug's id. The second can be found through packet capture (like the nlt parameter above, it is generated by JS) — note that if you want Charles to catch the JS loading, you must clear the browser cache first, otherwise the JS won't be re-fetched and you won't capture it. The third is again the drug's id, the fourth is loaded from the drug's page, and the last one, batchId, starts at 2 and is incremented by 1 each time a piece of detail content is fetched.
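Two of those changing parameters can be produced locally, as the listings below do: `scriptSessionId` is a fixed captured id plus a random numeric suffix, and `batchId` is a simple counter starting at 2. A minimal sketch (the fixed id string is the one from the article's own listing):

```python
from random import random
from itertools import count

# Fixed part captured from the page, as in the article's listing
session_key = "20A548B2C7B5F05093DFD2C71F112EEE"
scriptSessionId = session_key + str(int(random() * 1000))

# batchId starts at 2 and goes up by 1 per detail request
batch_ids = count(2)
print(next(batch_ids), next(batch_ids), next(batch_ids))  # 2 3 4
```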

All right — now that we've analyzed everything we need, all that's left is the implementation.

4. Making the Ajax requests in code

This is the code that gets the content of the drug page.

def __get_content(self, item, href):
        """Get the information that needs to be extracted"""
        param0 = int(re.findall('\d+', item)[0])
        href_id = re.findall('\d+', href)[0]
        html = self.__get_html(item)  # get the HTML of the drug page
        name = re.findall('fl commend">(.*?)<', html, re.S)[0].replace('\n', ' ').replace('\t', ' ')
        batchId = 2  # initialize the id required by the form; it starts at 2
        id = "20A548B2C7B5F05093DFD2C71F112EEE"  # fixed data needed to build the scriptSessionId
        scriptSessionId = id + str(int(random() * 1000))  # form data required by the detail request
        soup = BeautifulSoup(html, 'lxml')  # parse with BS4
        content_dd = soup.find_all('div', id='container')[0].find_all('dl')[0].find_all('dd', style=False)  # all the data on the page
        content_dt = soup.find_all('div', id='container')[0].find_all('dl')[0].find_all('dt')  # the field names, e.g. indications
        keys = re.findall('>(.*?):<', str(content_dt), re.S)  # data cleaning
        values = []  # stores all the cleaned data
        for i in content_dd:  # get all the data and clean it
            if '...' in i.get_text():  # the data is incomplete and needs a click
                param = re.search('id="([0-9]+?)_([\d]+?)"', str(i))
                # get the relevant form data
                param1 = param.group(2)
                # get the details
                data_content = detail = self.__get_detail(scriptSessionId, param0, param1, batchId)
                if 'img' in detail:  # check whether there is an image link
                    data_content = ' '
                    for x in re.split('<.*?>', detail):  # strip the tags
                        data_content += x
                    # find the image links
                    src = re.findall('src="(.*?)"', detail, re.S)
                    for s in src:
                        data_content += s + ' '
                data_content = self.dr.sub(' ', data_content).replace('\n', ' ').replace('\t', ' ')
                values.append(data_content)
                batchId += 1
            else:
                if 'Trade Name' in i.get_text():
                    con = str(i)
                else:
                    con = self.dr.sub(' ', i.get_text().strip().replace('\n', ' ').replace('\t', ' '))
                values.append(con)

This is the method that makes the Ajax request for the page.


def __get_detail(self, scriptSessionId, param0, param1, batch_id):
        """Get the data that you need to click to see, i.e. simulate the click"""
        data = {'callCount': 1,
                'page': '/drug/%s/detail.htm' % param0,  # this parameter is the drug id
                'httpSessionId': '',
                'scriptSessionId': scriptSessionId,  # the parameter built above
                'c0-scriptName': 'DrugUtils',
                'c0-methodName': 'showDetail',
                'c0-id': '0',
                'c0-param0': 'number:%s' % param0,  # this parameter is the drug id
                'c0-param1': 'number:%s' % param1,  # this parameter is the id taken from the page
                'batchId': batch_id
                }
        # request headers
        headers = {
                    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36',
                    'content-type': 'text/plain'}
        # send the request
        r = self.session.post('http://drugs.dxy.cn/dwr/call/plaincall/DrugUtils.showDetail.dwr', headers=headers, data=data)
        # decode the returned text and find the corresponding data
        detail = re.findall('"(.*?)"', r.text.encode('utf-8').decode('unicode_escape'), re.S)[0].strip()
        return detail

Well, that's it! What matters is not the code above but the way of thinking: as long as you can follow the ideas, every other Ajax request works the same way, so there is nothing difficult about this kind of crawler. The main hurdle in analyzing Ajax requests is encrypted parameters, which can require parsing obfuscated JS — that is the hard part of crawling, and the main job is finding ways to work around it.

Finally

If you've read this far, you're a true fan — thank you for the support! If you found this article useful, a like, a comment, or a share would be the biggest encouragement for me.

Recommended articles

Use Python to crawl NetEase Cloud Music and store the data in MySQL

How to crawl Ajax dynamic websites


Learn Python every day

Code is not just bugs — it is also beauty and fun