Disclaimer: The techniques and implementation steps recorded in this article are intended solely for learning about crawler technology. We accept no responsibility for anything done, or any consequences arising, from any act or omission by any person based on all or part of this article's contents.

Crawler requirement: search JD Mall (Jingdong) by keyword and crawl each product's name, price, and accumulated review count.

Tools: Chrome, PyCharm

Python libraries: selenium, pyquery

01 Website structure analysis

Open the JD home page and search for "mobile phone" (手机):

Click through to a product details page; everything we need (name, price, review count) can be found there.
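The search URL itself is easy to reproduce: it just carries the keyword as a UTF-8 URL-encoded query parameter. A minimal sketch of building it (the pvid parameter seen in the full URL later appears to be a tracking value and is left out here):

from urllib.parse import quote

keyword = '手机'  # "mobile phone"
search_url = 'https://search.jd.com/Search?keyword={kw}&enc=utf-8&wq={kw}'.format(kw=quote(keyword))
print(search_url)
# https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8&wq=%E6%89%8B%E6%9C%BA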

02 Create Selenium crawler

Open PyCharm, create selenium_Jd.py, and write the following code:

import time

from pyquery import PyQuery
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Hide the "Chrome is being controlled by automated software" infobar
options = Options()
options.add_experimental_option("excludeSwitches", ['enable-automation'])

# Use a regular browser User-Agent
user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
options.add_argument('--user-agent=%s' % user_agent)

browser = webdriver.Chrome('C:\\chromedriver.exe', options=options)

# Remove navigator.webdriver so the site cannot easily detect Selenium
browser.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
    "source": """
        Object.defineProperty(navigator, 'webdriver', { get: () => undefined })
    """
})

jd_url = 'https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8&wq=%E6%89%8B%E6%9C%BA&pvid=eb248c3d491144a99d93c073ef'


def start_page(page):
    current_page = 0
    while current_page < page:
        if current_page == 0:
            browser.get(jd_url)
        else:
            # an explicit wait, created but never applied (see the sketch below)
            wait = WebDriverWait(browser, 30)
        browser.maximize_window()
        current_page = current_page + 1


start_page(1)
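The WebDriverWait created inside start_page is never actually applied. A minimal sketch of an explicit wait that blocks until the product list has rendered, using the .gl-item class that the next section parses:

def wait_for_results(timeout=30):
    # Block until at least one product entry exists in the DOM
    WebDriverWait(browser, timeout).until(
        EC.presence_of_element_located((By.CLASS_NAME, 'gl-item'))
    )

Calling wait_for_results() right after browser.get(jd_url) is more reliable than a fixed sleep.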

Run the code; it successfully opens the JD product search page.

03 Crawl the product details page URLs

Analyze the page to locate the URLs of the product details pages:

Write code to crawl the product details page URL:

def get_detail_url(html_source):
    # Parse the search results page and collect each item's detail page URL
    dom = PyQuery(html_source)
    items = dom('.m-list .ml-wrap .gl-item').items()
    detail_urls = []
    for item in items:
        detail_urls.append(item.find('.p-name.p-name-type-2 a').attr('href'))
    return detail_urls
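One caveat before passing these URLs to browser.get(): links on JD search pages are often protocol-relative (they start with //item.jd.com/...), so it may be worth normalizing them first. A small sketch under that assumption:

def normalize_url(href):
    # Prefix protocol-relative links ("//item.jd.com/...") with https:
    if href and href.startswith('//'):
        return 'https:' + href
    return href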

Call the detail URL parsing method; it runs successfully!

04 Page turning

Products further down the page are lazy-loaded and only render as the page scrolls, so we need to drive the scroll bar down slowly, then crawl the detail page URLs, and then handle paging:

# Slowly drag the scroll bar to the bottom so lazy-loaded items render
window_height = browser.execute_script('return document.body.scrollHeight')
current_height = 0
for i in range(current_height, window_height, 100):
    browser.execute_script('window.scrollTo(0, {})'.format(i))
    time.sleep(0.5)

# Store the detail page URLs found on the current page
detail_url_list.extend(get_detail_url(browser.page_source))

Next, handle pagination by jumping to the next page:

For simplicity, trigger the jump with a simulated click:

# Find the "next page" button and click it via JavaScript
element = browser.find_element_by_class_name('pn-next')
browser.execute_script("arguments[0].click();", element)
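Clicking through execute_script avoids the ElementClickInterceptedException that a plain element.click() can raise when the button is covered or scrolled out of view. Also note that find_element_by_class_name is deprecated in Selenium 4 and removed in recent 4.x releases; with the By class already imported, the equivalent call is:

element = browser.find_element(By.CLASS_NAME, 'pn-next')
browser.execute_script("arguments[0].click();", element)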

Run the code; the page turns successfully!
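Putting this section together: a sketch of how start_page can fold the slow scroll, URL collection, and next-page click into one loop. It reuses the names defined earlier (browser, jd_url, get_detail_url, detail_url_list); the sleep durations are arbitrary choices, not values from the article.

detail_url_list = []


def start_page(page):
    current_page = 0
    while current_page < page:
        if current_page == 0:
            browser.get(jd_url)
            browser.maximize_window()

        # Slow scroll so lazy-loaded items render before parsing
        window_height = browser.execute_script('return document.body.scrollHeight')
        for i in range(0, window_height, 100):
            browser.execute_script('window.scrollTo(0, {})'.format(i))
            time.sleep(0.5)

        # Collect this page's detail URLs
        detail_url_list.extend(get_detail_url(browser.page_source))

        # Click "next page" unless this was the last requested page
        if current_page < page - 1:
            element = browser.find_element_by_class_name('pn-next')
            browser.execute_script("arguments[0].click();", element)
            time.sleep(2)

        current_page = current_page + 1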

05 Process the product details pages

After collecting all the detail page URLs, open each one and crawl the details:

def get_detail_info(html_source):
    # Parse the product name, price and accumulated review count
    dom = PyQuery(html_source)
    detail_dom = dom('.w .itemInfo-wrap')
    detail_info = {
        'name': detail_dom.find('.sku-name').text(),
        'price': detail_dom.find('.summary.summary-first').find('.summary-price-wrap').find('.dd').find('.p-price').text(),
        'comment_count': detail_dom.find('.summary.summary-first').find('.summary-price-wrap').find('.comment-count').text(),
    }
    print(detail_info)
    return detail_info
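Note that price comes back as display text; on JD detail pages it typically renders as something like ￥5999.00 (an assumption about the page's format, not something the article states). A small helper for turning it into a number:

def parse_price(price_text):
    # e.g. "￥5999.00" -> 5999.0; assumes a "￥" prefix in the rendered text
    cleaned = price_text.replace('￥', '').strip()
    try:
        return float(cleaned)
    except ValueError:
        return None  # price missing or in an unexpected format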

Iterate over the list of detail page URLs collected earlier and crawl the product details one by one:

def start_detail_page(detail_url_list):
    detail_info_list = []
    for detail_url in detail_url_list:
        browser.get(detail_url)
        time.sleep(1)  # give the page a moment to render
        detail_info_list.append(get_detail_info(browser.page_source))
    print(detail_info_list)
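With both stages in place, a minimal entry point tying them together (the page count of 3 is an arbitrary example):

start_page(3)                       # stage 1: collect detail URLs from 3 result pages
start_detail_page(detail_url_list)  # stage 2: open each detail page and parse it
browser.quit()                      # close the browser when finished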

Run the code and the result is as follows:

Crawling JD product information succeeded!