1. Target URL and page analysis

Searching for "skincare set" on the official Vipshop website (vip.com) returns the following page:

![](https://p26-tt.byteimg.com/large/pgc-image/e036fac7f34b4e2d8e1e29ee4137ab47)

As you scroll down, the page automatically loads more products once you reach the bottom. This is Ajax interaction, which suggests that the product information is served by a JSON interface. At the bottom of the page you can then find the page-turning buttons, as shown below:

![](https://p26-tt.byteimg.com/large/pgc-image/290dcd7aaf064256b41b9029fd2f7528)

2. A first look at the crawler

To capture the real product data behind the page, right-click the page and choose Inspect to open the developer tools, switch to the Network tab, and refresh the page. The requests made by the page are then listed. Filter them to find the specific requests that contain product information. On inspection, most of them turn out to be callback requests (their URLs carry a callback parameter), as shown below:

![](https://p6-tt-ipv6.byteimg.com/large/pgc-image/e61c52d299a64700a373fc7765f8623a)

Of the seven files, only four are useful. The second one, the rank file, contains the IDs of all the products on the current page:

![](https://p9-tt-ipv6.byteimg.com/large/pgc-image/1cd4e08305ba4d439f55fee24f67c90f)

The remaining three v2 files split the 120 products among them, as follows (the product indexes in each file start from 0):

![](https://p1-tt-ipv6.byteimg.com/large/pgc-image/8b01fcf1a58a44499f4365e32b62d5ad)
![](https://p1-tt-ipv6.byteimg.com/large/pgc-image/353383d2747e40ffb03e68fc73cc817a)
![](https://p1-tt-ipv6.byteimg.com/large/pgc-image/90faee0be45a482a9643931f90cd40fc)
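The way the 120 product IDs are divided among the three v2 requests can be sketched as follows. This is only an illustration: the helper name `split_into_batches` is hypothetical, the IDs are placeholders, and the batch sizes (20, 50, 50) are read off the three request URLs shown later in this article rather than from any documented rule.

```python
from typing import List, Sequence


def split_into_batches(product_ids: Sequence[str], sizes: Sequence[int]) -> List[List[str]]:
    """Split product_ids into consecutive batches with the given sizes."""
    if sum(sizes) != len(product_ids):
        raise ValueError("batch sizes must add up to the number of IDs")
    batches, start = [], 0
    for size in sizes:
        batches.append(list(product_ids[start:start + size]))
        start += size
    return batches


# Placeholder IDs "0".."119" stand in for the real product IDs of one page.
ids = [str(i) for i in range(120)]
batches = split_into_batches(ids, sizes=(20, 50, 50))
print([len(batch) for batch in batches])  # [20, 50, 50]
```

Each batch would then correspond to the productIds value of one v2 request.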

This gives us the real data interface behind the 120 products on the search page. Next, try requesting one of these URLs to see what comes back, and then work out the pattern to see whether all the data on the page can be crawled in one go.

Set the request headers with the User-Agent, Cookie, and Referer, copy the URL of the data interface into a variable, and send the request. The code below requests the data of the first 20 products as an example:

![](https://p6-tt-ipv6.byteimg.com/large/pgc-image/6c18376e0a444b2782dca00088b3411f)

To get the cookie, clear the callback filter and select the first file returned, the suggest file, as follows:

![](https://p6-tt-ipv6.byteimg.com/large/pgc-image/e7f07de5f7cc4faf8ac6d84af3cda07a)

Note: set the request headers according to what your own browser shows.

    import requests

    # Headers copied from the browser; replace them with your own values
    headers = {
        'Cookie': 'vip_city_code=104101115; vip_wh=VIP_HZ; vip_ipver=31; user_class=a; mars_sid=ff7be68ad4dc97e589a1673f7154c9f9; VipUINFO=luc%3Aa%7Csuc%3Aa%7Cbct%3Ac_new%7Chct%3Ac_new%7Cbdts%3A0%7Cbcts%3A0%7Ckfts%3A0%7Cc10%3A0%7Crcabt%3A0%7Cp2%3A0%7Cp3%3A1%7Cp4%3A0%7Cp5%3A0%7Cul%3A3105; mars_pid=0; visit_id=98C7BA95D1CA0C0E518537BD0B4ABEA0; vip_tracker_source_from=; pg_session_no=5; mars_cid=1600153235012_7a06e53de69c79c1bad28061c13e9375',
        'Referer': 'https://category.vip.com/suggest.php?keyword=%E6%8A%A4%E8%82%A4&ff=235|12|1|1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
    }

    # URL of the v2 interface carrying the IDs of the first 20 products
    url = 'https://mapi.vip.com/vips-mobile/rest/shopping/pc/product/module/list/v2?callback=getMerchandiseDroplets3&app_name=shop_pc&app_version=4.0&warehouse=VIP_HZ&fdc_area_id=104101115&client=pc&mobile_platform=1&province_id=104101&api_key=70f71280d5d547b2a7bb370a529aeea1&user_id=&mars_cid=1600153235012_7a06e53de69c79c1bad28061c13e9375&wap_consumer=a&productIds=6918324165453150280%2C6918256118899745105%2C6918357885382468749%2C6918449056102396358%2C6918702822359352066%2C6918479374036836673%2C6918814278458725896%2C6918585149106754305%2C6918783763771922139%2C6917924417817122013%2C6918747787667990790%2C6918945825686792797%2C6918676686121468885%2C6918690813799719966%2C6917924776628925583%2C6918808484587649747%2C6918524324182323338%2C6917924083191145365%2C6917924119199990923%2C6917924081998898069%2C&scene=search&standby_id=nature&extParams=%7B%22stdSizeVids%22%3A%22%22%2C%22preheatTipsVer%22%3A%223%22%2C%22couponVer%22%3A%22v2%22%2C%22exclusivePrice%22%3A%221%22%2C%22iconSpec%22%3A%222x%22%7D&context=&_=1600158865440'

    html = requests.get(url, headers=headers)
    print(html.text)

The output is as follows (it matches what the interface returns in the browser):

![](https://p1-tt-ipv6.byteimg.com/large/pgc-image/0b56c14ad85748768317da5c91912d9b)

Next, compare the actual request URLs of the three v2 files to find the pattern:
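Since the request headers have to match what your own browser sends, one convenient approach is to paste the raw header lines from the Network panel and turn them into a dictionary. The sketch below assumes the pasted text uses the usual "Name: value" layout; the helper name `parse_raw_headers` and the shortened sample values are hypothetical.

```python
def parse_raw_headers(raw: str) -> dict:
    """Turn 'Name: value' lines copied from the browser into a headers dict."""
    headers = {}
    for line in raw.strip().splitlines():
        if ":" not in line:
            continue  # skip blank or malformed lines
        # Split on the first colon only, so URLs in the value stay intact.
        name, value = line.split(":", 1)
        headers[name.strip()] = value.strip()
    return headers


# Shortened sample values; paste the header lines from your own browser instead.
raw = """
Cookie: mars_pid=0; vip_wh=VIP_HZ
Referer: https://category.vip.com/suggest.php?keyword=%E6%8A%A4%E8%82%A4
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64)
"""
headers = parse_raw_headers(raw)
print(sorted(headers))  # ['Cookie', 'Referer', 'User-Agent']
```

The resulting dictionary can be passed to `requests.get` through its `headers` argument.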

    'https://mapi.vip.com/vips-mobile/rest/shopping/pc/product/module/list/v2?callback=getMerchandiseDroplets3&app_name=shop_pc&app_version=4.0&warehouse=VIP_HZ&fdc_area_id=104101115&client=pc&mobile_platform=1&province_id=104101&api_key=70f71280d5d547b2a7bb370a529aeea1&user_id=&mars_cid=1600153235012_7a06e53de69c79c1bad28061c13e9375&wap_consumer=a&productIds=6918324165453150280%2C6918256118899745105%2C6918357885382468749%2C6918449056102396358%2C6918702822359352066%2C6918479374036836673%2C6918814278458725896%2C6918585149106754305%2C6918783763771922139%2C6917924417817122013%2C6918747787667990790%2C6918945825686792797%2C6918676686121468885%2C6918690813799719966%2C6917924776628925583%2C6918808484587649747%2C6918524324182323338%2C6917924083191145365%2C6917924119199990923%2C6917924081998898069%2C&scene=search&standby_id=nature&extParams=%7B%22stdSizeVids%22%3A%22%22%2C%22preheatTipsVer%22%3A%223%22%2C%22couponVer%22%3A%22v2%22%2C%22exclusivePrice%22%3A%221%22%2C%22iconSpec%22%3A%222x%22%7D&context=&_=1600158865440'

    'https://mapi.vip.com/vips-mobile/rest/shopping/pc/product/module/list/v2?callback=getMerchandiseDroplets1&app_name=shop_pc&app_version=4.0&warehouse=VIP_HZ&fdc_area_id=104101115&client=pc&mobile_platform=1&province_id=104101&api_key=70f71280d5d547b2a7bb370a529aeea1&user_id=&mars_cid=1600153235012_7a06e53de69c79c1bad28061c13e9375&wap_consumer=a&productIds=6918241720044454476%2C6917919624790589569%2C6917935170607219714%2C6918794091804350029%2C6918825617469761228%2C6918821681541400066%2C6918343188631192386%2C6918909902880919752%2C6918944714357405314%2C6918598446593061836%2C6917992439761061707%2C6918565057324098974%2C6918647344809112386%2C6918787811445699149%2C6918729979027610590%2C6918770949378056781%2C6918331290238460382%2C6918782319292540574%2C6918398146810241165%2C6918659293579989333%2C6917923814107067291%2C6918162041180009111%2C6918398146827042957%2C6917992175963801365%2C6918885216264034310%2C6918787811496047181%2C6918273588862755984%2C6917924752735125662%2C6918466082515404493%2C6918934739456193886%2C6917924837261255565%2C6918935779609622221%2C6917920117494382747%2C6917987978233958977%2C6917923641027928222%2C6918229910205674453%2C6917970328155673856%2C6918470882161509397%2C6918659293832008021%2C6918750646128649741%2C6917923139576259723%2C6918387987850605333%2C6917924445491982494%2C6918790938962557837%2C6918383695533143067%2C6918872378378761054%2C6918640250037793602%2C6918750646128641549%2C6917937020463562910%2C6917920520629265102%2C&scene=search&standby_id=nature&extParams=%7B%22stdSizeVids%22%3A%22%22%2C%22preheatTipsVer%22%3A%223%22%2C%22couponVer%22%3A%22v2%22%2C%22exclusivePrice%22%3A%221%22%2C%22iconSpec%22%3A%222x%22%7D&context=&_=1600158865436'

    'https://mapi.vip.com/vips-mobile/rest/shopping/pc/product/module/list/v2?callback=getMerchandiseDroplets2&app_name=shop_pc&app_version=4.0&warehouse=VIP_HZ&fdc_area_id=104101115&client=pc&mobile_platform=1&province_id=104101&api_key=70f71280d5d547b2a7bb370a529aeea1&user_id=&mars_cid=1600153235012_7a06e53de69c79c1bad28061c13e9375&wap_consumer=a&productIds=6918690813782926366%2C6918447252612175371%2C6918159188446941835%2C6918205147496443989%2C6918006775182997019%2C6918710130501497419%2C6917951703208964235%2C6918936224464094528%2C6918394023211385035%2C6918872268898919262%2C6918397905200202715%2C6918798460682221086%2C6918800888595138517%2C6917919413703328321%2C1369067222846365%2C6917924520139822219%2C6918904223283803413%2C6918507022166130843%2C6918479374087209281%2C6917924176900793243%2C6918750646145443341%2C6918449056102412742%2C6918901362318117467%2C6918570897095177292%2C6917924520223884427%2C6918757924517328902%2C6918398146827051149%2C6918789686747831253%2C6918476662192264973%2C6917919300445017109%2C6917919922739126933%2C6917920155539928286%2C6918662208810186512%2C6917923139508970635%2C6918859281628675166%2C6918750645658871309%2C6918820034693202694%2C6918689681141637573%2C6917919916536480340%2C6918719763326603415%2C6918659293579997525%2C6917920335390225555%2C6918589584225669211%2C6918386595131470421%2C6918640034622429077%2C6917923665227256725%2C6918331290238476766%2C6917924054840074398%2C6917924438479938177%2C6917920679932125915%2C&scene=search&standby_id=nature&extParams=%7B%22stdSizeVids%22%3A%22%22%2C%22preheatTipsVer%22%3A%223%22%2C%22couponVer%22%3A%22v2%22%2C%22exclusivePrice%22%3A%221%22%2C%22iconSpec%22%3A%222x%22%7D&context=&_=1600158865437'

Comparing the three URLs shows that the essential difference is the productIds parameter in the middle. So once the IDs of all the products are known, the information for every product can be requested. This is the pattern behind the URLs.

![](https://p26-tt.byteimg.com/large/pgc-image/c7b2c73029c24d9b92ec3ed69cc0245b)
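The comparison can also be done programmatically rather than by eye. The sketch below parses the query strings with the standard library and reports which parameters differ. The three URLs are shortened, hypothetical stand-ins for the real ones, and `differing_params` is an illustrative helper name.

```python
from urllib.parse import parse_qs, urlsplit


def differing_params(urls):
    """Return the names of query parameters whose values differ between the URLs."""
    parsed = [parse_qs(urlsplit(u).query, keep_blank_values=True) for u in urls]
    names = set().union(*parsed)
    return sorted(
        name for name in names
        if len({tuple(p.get(name, [])) for p in parsed}) > 1
    )


# Shortened stand-ins for the three v2 request URLs.
base = "https://mapi.vip.com/vips-mobile/rest/shopping/pc/product/module/list/v2"
urls = [
    base + "?callback=getMerchandiseDroplets3&client=pc&productIds=1%2C2%2C&scene=search&_=1600158865440",
    base + "?callback=getMerchandiseDroplets1&client=pc&productIds=3%2C4%2C&scene=search&_=1600158865436",
    base + "?callback=getMerchandiseDroplets2&client=pc&productIds=5%2C6%2C&scene=search&_=1600158865437",
]
print(differing_params(urls))  # ['_', 'callback', 'productIds']
```

The callback name and the trailing `_` value also vary between the requests, but productIds is the parameter that selects which products are returned.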

All the product IDs are stored in the second file, the rank file. So you first need to request that URL to get the product IDs, and then assemble the v2 URL from them to get the product details.
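Assembling a detail URL from a list of product IDs might look like the sketch below. It assumes the IDs are joined with commas and followed by a trailing comma, as in the captured URLs. `build_detail_url` is a hypothetical helper, and `base_params` holds only a few of the query parameters; the rest would be copied from your own captured request.

```python
from urllib.parse import urlencode

DETAIL_ENDPOINT = "https://mapi.vip.com/vips-mobile/rest/shopping/pc/product/module/list/v2"


def build_detail_url(product_ids, base_params):
    """Build a v2 detail URL whose productIds parameter carries the given IDs."""
    params = dict(base_params)
    # The captured URLs list the IDs comma-separated, with a trailing comma.
    params["productIds"] = ",".join(product_ids) + ","
    # urlencode percent-encodes each comma as %2C, matching the captured URLs.
    return DETAIL_ENDPOINT + "?" + urlencode(params)


# Only a subset of the real parameters, for illustration.
base_params = {"callback": "getMerchandiseDroplets3", "client": "pc", "scene": "search"}
url = build_detail_url(["6918324165453150280", "6918256118899745105"], base_params)
print(url)
```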

3. Crawler practice

3.1 Crawling the product ID information

To support page turning, find the parameter that controls the page number. On the first page, which holds 120 products, the pageOffset parameter is 0:

![](https://p1-tt-ipv6.byteimg.com/large/pgc-image/562e8ba9ffac4f8a81a9e00ef3c11c72)

On the second page pageOffset is 120, on the third page it is 240, and so on: pageOffset increases by 120 with each page, while the other parameters stay almost unchanged.

![](https://p6-tt-ipv6.byteimg.com/large/pgc-image/acc7b8595b8c444c8a042d1fcfcd0538)
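The relation between page number and pageOffset described above is simply offset = (page - 1) * 120. A minimal sketch, with the hypothetical helper name `page_offsets`:

```python
PAGE_SIZE = 120  # products per page, as observed on the search page


def page_offsets(pages, first_page=1):
    """Return the pageOffset values for `pages` consecutive pages."""
    return [(page - 1) * PAGE_SIZE for page in range(first_page, first_page + pages)]


print(page_offsets(3))                # [0, 120, 240]
print(page_offsets(1, first_page=2))  # [120]
```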

3.2 Constructing the URL for the product ID data

The request code is therefore as follows:

    import requests
    import json

    headers = {
        'Cookie': 'vip_province_name=%E6%B2%B3%E5%8D%97%E7%9C%81; vip_city_name=%E4%BF%A1%E9%98%B3%E5%B8%82; vip_city_code=104101115; vip_wh=VIP_HZ; vip_ipver=31; user_class=a; mars_sid=ff7be68ad4dc97e589a1673f7154c9f9; VipUINFO=luc%3Aa%7Csuc%3Aa%7Cbct%3Ac_new%7Chct%3Ac_new%7Cbdts%3A0%7Cbcts%3A0%7Ckfts%3A0%7Cc10%3A0%7Crcabt%3A0%7Cp2%3A0%7Cp3%3A1%7Cp4%3A0%7Cp5%3A0%7Cul%3A3105; mars_pid=0; visit_id=98C7BA95D1CA0C0E518537BD0B4ABEA0; vip_tracker_source_from=; pg_session_no=5; mars_cid=1600153235012_7a06e53de69c79c1bad28061c13e9375',
        'Referer': 'https://category.vip.com/suggest.php?keyword=%E6%8A%A4%E8%82%A4&ff=235|12|1|1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
    }

    n = 1  # number of pages to crawl; this can be replaced with an input statement
    # This range starts from page 2; set the first argument to 0 to start from page 1
    for num in range(120, (n + 1) * 120, 120):
        url = f'https://mapi.vip.com/vips-mobile/rest/shopping/pc/search/product/rank?callback=getMerchandiseIds&app_name=shop_pc&app_version=4.0&warehouse=VIP_HZ&fdc_area_id=104101115&client=pc&mobile_platform=1&province_id=104101&api_key=70f71280d5d547b2a7bb370a529aeea1&user_id=&mars_cid=1600153235012_7a06e53de69c79c1bad28061c13e9375&wap_consumer=a&standby_id=nature&keyword=%E6%8A%A4%E8%82%A4%E5%A5%97%E8%A3%85&lv3CatIds=&lv2CatIds=&lv1CatIds=&brandStoreSns=&props=&priceMin=&priceMax=&vipService=&sort=0&pageOffset={num}&channelId=1&gPlatform=PC&batchSize=120&_=1600158865435'
        html = requests.get(url, headers=headers)
        print(html.text)

The output is as follows (the product ID information is retrieved successfully):

![](https://p9-tt-ipv6.byteimg.com/large/pgc-image/ee5e65ec4e0241d7b8a49b44ba4e50b9)

3.3 Converting the product ID data and verifying the count

Next, parse the JSON data, that is, convert the raw response text into a Python object. The code is as follows:

    # Strip the callback wrapper and keep only the JSON body
    start = html.text.index('{')
    end = html.text.index('})') + 1
    json_data = json.loads(html.text[start:end])
    print(json_data)

The output is as follows (it contains the ID information of the products we want):

![](https://p6-tt-ipv6.byteimg.com/large/pgc-image/62daacfbe5724692881c15af7fc0675a)
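The conversion step can be wrapped in a small helper. The variant below is a sketch rather than the article's exact method: it cuts at the first "(" and the last ")" instead of searching for the first "})", which could in principle also appear inside a string value. The sample response is made up; only the data/products/pid nesting follows the article.

```python
import json


def parse_jsonp(text):
    """Strip a callback wrapper such as name({...}) and parse the JSON inside."""
    start = text.index("(") + 1  # just after the first opening parenthesis
    end = text.rindex(")")       # the last closing parenthesis
    return json.loads(text[start:end])


# Made-up response text in the same shape as the rank interface output.
sample = 'getMerchandiseIds({"code": 1, "data": {"products": [{"pid": "1"}, {"pid": "2"}]}})'
data = parse_jsonp(sample)
print(data["data"]["products"])  # [{'pid': '1'}, {'pid': '2'}]
```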

Verify that the number of products, that is, the number of product IDs (the pid field here), equals 120. The code is as follows:

    print(json_data['data']['products'])
    print('')
    print(len(json_data['data']['products']))

The output is as follows:

![](https://p26-tt.byteimg.com/large/pgc-image/cb5557680a6545fbb26d04f016e6b2da)
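The count check can be folded into the extraction itself, so that a short page is noticed immediately. This is a sketch with a hypothetical helper `extract_pids` and fabricated placeholder data shaped like the parsed rank response:

```python
def extract_pids(json_data, expected=None):
    """Collect the pid of every product and optionally verify how many there are."""
    products = json_data["data"]["products"]
    pids = [item["pid"] for item in products]
    if expected is not None and len(pids) != expected:
        raise ValueError(f"expected {expected} product IDs, got {len(pids)}")
    return pids


# Placeholder data with the same nesting as the parsed rank response.
fake_response = {"data": {"products": [{"pid": str(i)} for i in range(120)]}}
pids = extract_pids(fake_response, expected=120)
print(len(pids))  # 120
```

Note that the last page of a search may legitimately hold fewer than 120 products, so the check is best left optional.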

3.4 Obtaining detailed product information

Now loop over the result again to get the ID of each product. Note how product_url is constructed: the product IDs in the middle of the URL are removed and filled in with the format method instead.

    product_ids = json_data['data']['products']
    for product_id in product_ids:
        print('Product ID:', product_id['pid'])
        # The {} placeholder is filled with the ID of the current product
        product_url = 'https://mapi.vip.com/vips-mobile/rest/shopping/pc/product/module/list/v2?callback=getMerchandiseDroplets3&app_name=shop_pc&app_version=4.0&warehouse=VIP_HZ&fdc_area_id=104101115&client=pc&mobile_platform=1&province_id=104101&api_key=70f71280d5d547b2a7bb370a529aeea1&user_id=&mars_cid=1600153235012_7a06e53de69c79c1bad28061c13e9375&wap_consumer=a&productIds={}%2C&scene=search&standby_id=nature&extParams=%7B%22stdSizeVids%22%3A%22%22%2C%22preheatTipsVer%22%3A%223%22%2C%22couponVer%22%3A%22v2%22%2C%22exclusivePrice%22%3A%221%22%2C%22iconSpec%22%3A%222x%22%7D&context=&_=1600164018137'.format(product_id['pid'])
        product_html = requests.get(product_url, headers=headers)
        print(product_html.text)

The output is as follows (only part of it is shown):

![](https://p6-tt-ipv6.byteimg.com/large/pgc-image/afdad04bcd43461c82312c2991b125aa)

As with the product ID data earlier, the detailed data also has to be converted before fields such as the product title, brand, and price can be extracted:

    # Strip the callback wrapper, then read the first product in the response
    product_start = product_html.text.index('{')
    product_end = product_html.text.index('})') + 1
    product_json_data = json.loads(product_html.text[product_start:product_end])
    product_info_data = product_json_data['data']['products'][0]
    # print(product_info_data)
    product_title = product_info_data['title']
    product_brand = product_info_data['brandShowName']
    product_price = product_info_data['price']['salePrice']
    print('Title: {}, brand: {}, sale price: {}'.format(product_title, product_brand, product_price))

The output is as follows (the information is retrieved correctly; the title, brand, and sale price are used as examples here, and other, more detailed fields can be extracted in the same way):

![](https://p26-tt.byteimg.com/large/pgc-image/fac942c3306543d189ba63c970c1a26a)
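If some products lack a field, direct indexing raises a KeyError and stops the crawl. A more tolerant extraction could look like the sketch below. The field names title, brandShowName, and price.salePrice come from the code above; the helper name `extract_product_fields` and the sample values are made up.

```python
def extract_product_fields(product):
    """Pick the title, brand, and sale price out of one product entry."""
    price = product.get("price") or {}  # tolerate a missing or empty price block
    return {
        "title": product.get("title"),
        "brand": product.get("brandShowName"),
        "sale_price": price.get("salePrice"),
    }


# Made-up entry shaped like one item of data.products in the v2 response.
sample_product = {
    "title": "Example skincare set",
    "brandShowName": "ExampleBrand",
    "price": {"salePrice": "199"},
}
fields = extract_product_fields(sample_product)
print(fields)
```

Missing fields come back as None instead of interrupting the loop.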

The last step is to write the data locally:

    # Append one line per product to a local text file
    with open('vip.txt', 'a+', encoding='utf-8') as f:
        f.write('Title: {}, brand: {}, sale price: {}\n'.format(product_title, product_brand, product_price))

The output is as follows (the data is crawled and saved locally):

![](https://p9-tt-ipv6.byteimg.com/large/pgc-image/71e217a81ec741928337a277fe4039a5)

4. Complete code

The whole process can be wrapped in a function, and the data can also be stored locally as CSV or XLSX. Only storage as a TXT file is shown here.

    import requests
    import json

    headers = {
        'Cookie': 'vip_province_name=%E6%B2%B3%E5%8D%97%E7%9C%81; vip_city_name=%E4%BF%A1%E9%98%B3%E5%B8%82; vip_city_code=104101115; vip_wh=VIP_HZ; vip_ipver=31; user_class=a; mars_sid=ff7be68ad4dc97e589a1673f7154c9f9; VipUINFO=luc%3Aa%7Csuc%3Aa%7Cbct%3Ac_new%7Chct%3Ac_new%7Cbdts%3A0%7Cbcts%3A0%7Ckfts%3A0%7Cc10%3A0%7Crcabt%3A0%7Cp2%3A0%7Cp3%3A1%7Cp4%3A0%7Cp5%3A0%7Cul%3A3105; mars_pid=0; visit_id=98C7BA95D1CA0C0E518537BD0B4ABEA0; vip_tracker_source_from=; pg_session_no=5; mars_cid=1600153235012_7a06e53de69c79c1bad28061c13e9375',
        'Referer': 'https://category.vip.com/suggest.php?keyword=%E6%8A%A4%E8%82%A4&ff=235|12|1|1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
    }

    n = 1  # number of pages to crawl; set n = 4 to crawl four pages
    for num in range(0, n * 120, 120):
        # Step 1: request the rank interface to get the product IDs of one page
        url = f'https://mapi.vip.com/vips-mobile/rest/shopping/pc/search/product/rank?callback=getMerchandiseIds&app_name=shop_pc&app_version=4.0&warehouse=VIP_HZ&fdc_area_id=104101115&client=pc&mobile_platform=1&province_id=104101&api_key=70f71280d5d547b2a7bb370a529aeea1&user_id=&mars_cid=1600153235012_7a06e53de69c79c1bad28061c13e9375&wap_consumer=a&standby_id=nature&keyword=%E6%8A%A4%E8%82%A4%E5%A5%97%E8%A3%85&lv3CatIds=&lv2CatIds=&lv1CatIds=&brandStoreSns=&props=&priceMin=&priceMax=&vipService=&sort=0&pageOffset={num}&channelId=1&gPlatform=PC&batchSize=120&_=1600158865435'
        html = requests.get(url, headers=headers)
        # print(html.text)
        start = html.text.index('{')
        end = html.text.index('})') + 1
        json_data = json.loads(html.text[start:end])
        product_ids = json_data['data']['products']
        for product_id in product_ids:
            print('Product ID:', product_id['pid'])
            # Step 2: request the v2 interface for the details of this product
            product_url = 'https://mapi.vip.com/vips-mobile/rest/shopping/pc/product/module/list/v2?callback=getMerchandiseDroplets3&app_name=shop_pc&app_version=4.0&warehouse=VIP_HZ&fdc_area_id=104101115&client=pc&mobile_platform=1&province_id=104101&api_key=70f71280d5d547b2a7bb370a529aeea1&user_id=&mars_cid=1600153235012_7a06e53de69c79c1bad28061c13e9375&wap_consumer=a&productIds={}%2C&scene=search&standby_id=nature&extParams=%7B%22stdSizeVids%22%3A%22%22%2C%22preheatTipsVer%22%3A%223%22%2C%22couponVer%22%3A%22v2%22%2C%22exclusivePrice%22%3A%221%22%2C%22iconSpec%22%3A%222x%22%7D&context=&_=1600164018137'.format(product_id['pid'])
            product_html = requests.get(product_url, headers=headers)
            product_start = product_html.text.index('{')
            product_end = product_html.text.index('})') + 1
            product_json_data = json.loads(product_html.text[product_start:product_end])
            product_info_data = product_json_data['data']['products'][0]
            # print(product_info_data)
            product_title = product_info_data['title']
            product_brand = product_info_data['brandShowName']
            product_price = product_info_data['price']['salePrice']
            print('Title: {}, brand: {}, sale price: {}'.format(product_title, product_brand, product_price))
            # Step 3: append the record to a local text file
            with open('vip.txt', 'a+', encoding='utf-8') as f:
                f.write('Title: {}, brand: {}, sale price: {}\n'.format(product_title, product_brand, product_price))

With n = 4, running the code again gives the output below. To check the amount of data, open the TXT file in Sublime Text: the number of lines is exactly the total number of products on 4 pages. This completes the crawl of Vipshop product information.

![](https://p26-tt.byteimg.com/large/pgc-image/aa53eba0bac840b090a5aafcf34d88d8)
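The CSV option mentioned at the start of this section could be sketched with the standard csv module as follows. The column names and sample rows are made up for illustration, and the rows are written to an in-memory buffer here so the example runs without touching the disk.

```python
import csv
import io

FIELDS = ["title", "brand", "sale_price"]  # illustrative column names


def rows_to_csv(rows, fileobj):
    """Write product rows (dicts keyed by FIELDS) as CSV, header first."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Made-up rows standing in for crawled products.
rows = [
    {"title": "Example set A", "brand": "BrandA", "sale_price": "199"},
    {"title": "Example set B", "brand": "BrandB", "sale_price": "259"},
]
buffer = io.StringIO()
rows_to_csv(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file, pass a file object opened with `open('vip.csv', 'w', newline='', encoding='utf-8')` instead of the buffer; the file name is only an example.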

This article is a repost; copyright belongs to the original author. If it infringes your rights, please contact the editor to have it removed.

Original address: blog.csdn.net/lys_828/art…
