
1. A brief introduction

Today, let's use "skiing" as the search keyword to demonstrate how to collect Dianping shop information with a Python crawler.

The search results can be fetched page by page with requests.get(), and the returned page data can then be parsed to extract the shop information we need.

However, during crawling we find that fields such as the number of reviews, per-capita spending, and the shop address are displayed as □ on the page, and appear as codes like &#xf622; in the fetched data. This is font-based anti-crawling, so let's break it down step by step.
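In other words, the page source carries private-use Unicode entities instead of the real characters, and decoding comes down to string substitution once the font's code-to-character mapping is known. A toy sketch using a mapping we derive later in section 2 (where &#xf8a1; turns out to be the digit 4):

raw = '<svgmtsi class="shopNum">&#xf8a1;</svgmtsi>'
# Once we know the font draws uniF8A1 as the glyph '4', decoding is a simple replace
print(raw.replace('&#xf8a1;', '4'))  # <svgmtsi class="shopNum">4</svgmtsi>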

The following are the data fields we need to collect:

| Field | Description | How obtained | Font class |
| --- | --- | --- | --- |
| shop_id | Shop ID | Parsed directly | - |
| shop_name | Shop name | Parsed directly | - |
| shop_star | Shop star rating | Parsed directly | - |
| shop_address | Shop address | Parsed directly | - |
| shop_review | Number of reviews | Font anti-crawl | shopNum |
| shop_price | Per-capita spending | Font anti-crawl | shopNum |
| shop_tag_site | Shop district | Font anti-crawl | tagName |
| shop_tag_type | Shop category | Font anti-crawl | tagName |

2. Font anti-crawl processing

Open Dianping and search for skiing. On the search results page, press F12 to enter developer mode and select the review count. You can see that its class is shopNum and its content displays as □; the font-family in the styles on the right is PingFangSC-Regular-shopNum. Clicking the .css link on the right actually reveals the font file link. However, since the other font-obfuscated fields may use different font files, we use another method to acquire them all in one pass (see the next section for details).

2.1. Get the font file link

In the head section of the page you can find the text-obfuscation CSS, and the corresponding CSS address contains all the font file links that will be used later. Using requests.get(), we can retrieve every font name together with its font file download link.

– Define a function get_html() to fetch page data

# Fetch page data
def get_html(url, headers):
    try:
        rep = requests.get(url, headers=headers)
    except Exception as e:
        print(e)
        return ''
    # Strip all whitespace so the regexes below can match the HTML as one string
    html = re.sub(r'\s', '', rep.text)

    return html

– Get web page data

import re
import requests

# Cookie: just copy the Cookie from your browser
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36",
    "Cookie": "Your browser Cookie",
}
# Search keyword
key = 'skiing'
# Base URL
url = f'https://www.dianping.com/search/keyword/2/0_{key}'
# Fetch page data
html = get_html(url, headers)

– Get the font file links

# Get the font file links from the text-obfuscation CSS in the page head
# (the marker comment in the page source is Chinese: <!-- 文字混淆css -->)
text_css = re.findall(r'<!--文字混淆css--><linkrel="stylesheet"type="text\/css"href="(.*?)">', html)[0]
# 'http://s3plus.meituan.net/v1/mss_0a06a471f9514fc79c981b5466f56b91/svgtextcss/29de4c2bc5d95d1e147c3c25a5f4aad8.css'
# Build the full CSS link
css_url = 'http:' + text_css
# Fetch the page data of the font-file CSS
font_html = get_html(css_url, headers)
# Use a regular expression to get the list of font definitions
font_list = re.findall(r'@font-face{(.*?)}', font_html)

# Get each font's name and its file link
font_dics = {}
for font in font_list:
    # Regular expression to get the font file name
    font_name = re.findall(r'font-family:"PingFangSC-Regular-(.*?)"', font)[0]
    # Regular expression to get the corresponding font file link
    font_dics[font_name] = 'http:' + re.findall(r',url\("(.*?)"\);', font)[0]

– Download font files to the local PC

# Since we only use shopNum, tagName and address, we only download these three fonts
font_use_list = ['shopNum', 'tagName', 'address']
for key in font_use_list:
    woff = requests.get(font_dics[key], headers=headers).content
    with open(f'{key}.woff', 'wb') as f:
        f.write(woff)
  • The font files are saved locally. Installing FontCreator lets you open a font file and view its glyphs (reply "FontCreator" to the public account to get the installer download link).

2.2. Create mappings between three types of fonts and actual characters

Let's first look at the HTML of the review count in the fetched page data:

<b>
    <svgmtsi class="shopNum">&#xf8a1;</svgmtsi>
    <svgmtsi class="shopNum">&#xee4c;</svgmtsi>
    <svgmtsi class="shopNum">&#xe103;</svgmtsi>
    <svgmtsi class="shopNum">&#xe62a;</svgmtsi>
</b> reviews

The page displays this as 4576 reviews, so the correspondences are 4 = &#xf8a1;, 5 = &#xee4c;, 7 = &#xe103;, 6 = &#xe62a;.

We use FontCreator to open the shopNum font file as follows:

Comparing the two, we find that 4 corresponds to uniF8A1 in shopNum, 5 corresponds to uniEE4C, and so on. The rule is now clear: a code such as &#xf8a1; in the fetched data corresponds to the glyph uniF8A1 in the font file, and the actual digit or character it represents is whatever character that glyph draws (here, 4).

Here we introduce fontTools, a third-party Python font-processing library, to build the mappings for the three fonts:

from fontTools.ttLib import TTFont

# Build the code list for each of the three fonts
real_list = {}
for key in font_use_list:
    # Open the local font file
    font_data = TTFont(f'{key}.woff')
    # font_data.saveXML('shopNum.xml')
    # Get all glyph names, dropping the first two non-useful placeholder glyphs
    uni_list = font_data.getGlyphOrder()[2:]
    # A glyph named uniF8A1 appears as "&#xf8a1;" in the page data, so rewrite the
    # glyph names into page-style codes (the trailing ';' left behind by the later
    # replace is filtered out when we keep only digits or Chinese characters)
    real_list[key] = ['&#x' + uni[3:] for uni in uni_list]
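As a quick sanity check (our own addition, not part of the original code), the first codes in real_list['shopNum'] should line up with the glyph order seen in FontCreator:

# If everything lines up, the code at position 0 is the one whose glyph draws '1'
print(real_list['shopNum'][:3])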

Opening the three font files shows that their glyph order and glyph content are identical, so a single string of real characters works for all of them. It is copied out as follows:

# The real-character string: the first 10 characters are the digits 1234567890,
# and the remaining ~590 are Chinese characters in the same order as the font glyphs.
# Copy them out of the font file yourself; the full string is truncated here.
words = '1234567890店中美家馆小车大市公酒行国品发电金心业商司超生装园场食有新限天面工服海华水房饰城乐汽香部利子老艺花专东...'

For the digit class (only 10 characters, occupying the first 10 positions of both the font's code list and the words string): when we encounter an anti-crawl code such as &#xf8a1; (glyph uniF8A1), we find its position in shopNum's code list and substitute the character at the same position in the words string.

# s is a raw field string containing anti-crawl codes
for i in range(10):
    s = s.replace(real_list['shopNum'][i], words[i])

For the Chinese-character class (at most len(real_list['tagName']) characters), the substitution logic is the same as for digits:

for i in range(len(real_list['tagName'])):
    s = s.replace(real_list['tagName'][i], words[i])
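Putting the two cases together, a small helper keeps the replacement logic in one place (a sketch; decode is our own name, not from the original code):

# Replace every anti-crawl code of the given font with its real character
def decode(s, font_key, real_list, words):
    for code, word in zip(real_list[font_key], words):
        s = s.replace(code, word)
    return s

# e.g. shop_review = decode(shop_review, 'shopNum', real_list, words)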

3. Single-page store information analysis

With the font anti-crawl handling from part 2, combined with the fields that can be parsed directly, we can now parse and collect the complete shop information. Here we use re regular expressions for parsing; interested readers can also use tool libraries such as XPath or BeautifulSoup.

We create a function get_items(html, real_list, words) that extracts all the shop data from a single page:

# Get all shop information on a single page
def get_items(html, real_list, words):
    # Get the HTML of the whole shop list
    shop_list = re.findall(r'<divclass="shop-listJ_shop-listshop-all-list"id="shop-all-list">(.*)<\/div>', html)[0]
    # Split it into the HTML of each individual shop
    shops = re.findall(r'<liclass="">(.*?)<\/li>', shop_list)

    items = []
    for shop in shops:
        # Parse a single shop's information
        # shop = shops[0]
        item = {}
        # Shop ID (unique, used for deduplication in the data-cleaning phase)
        item['shop_id'] = re.findall(r'<divclass="txt"><divclass="tit">.*data-shopid="(.*?)"', shop)[0]
        # Shop name
        item['shop_name'] = re.findall(r'<divclass="txt"><divclass="tit">.*<h4>(.*)<\/h4>', shop)[0]
        # Shop star rating; it is stored as a two-digit number, so divide by 10.0 to get a float
        item['shop_star'] = re.findall(r'<divclass="nebula_star"><divclass="star_icon"><spanclass="starstar_(\d+)star_sml"><\/span>', shop)[0]
        item['shop_star'] = int(item['shop_star']) / 10.0

        # The element with class="operate J_operate Hide" carries data-address as plain text,
        # so no font anti-crawl handling is needed; grab it directly with a regex
        # Shop address
        item['shop_address'] = re.findall(r'<divclass="operateJ_operateHide">.*?data-address="(.*?)"', shop)[0]

        shop_name = item['shop_name']
        # shopNum is used for the review count and the per-capita price
        try:
            # the literal 条评价 means "reviews"
            shop_review = re.findall(r'<b>(.*?)<\/b>条评价', shop)[0]
        except:
            print(f'{shop_name} has no review data')
            shop_review = ''

        try:
            # the literal 人均 means "per capita"
            shop_price = re.findall(r'人均¥(.*?)<\/b>', shop)[0]
        except:
            print(f'{shop_name} has no per-capita price data')
            shop_price = ''

        for i in range(10):
            shop_review = shop_review.replace(real_list['shopNum'][i], words[i])
            shop_price = shop_price.replace(real_list['shopNum'][i], words[i])
        # For the review count and per-capita price, keep only the digits and join them
        item['shop_review'] = ''.join(re.findall(r'\d', shop_review))
        item['shop_price'] = ''.join(re.findall(r'\d', shop_price))

        # tagName is used for the shop district and shop category
        # Shop district: the tag following the shop_tag_region_click anchor
        shop_tag_site = re.findall(r'data-click-name="shop_tag_region_click".*?<spanclass="tag">(.*?)<\/span>', shop)[0]
        # Shop category: the first tag inside the tag-addr div
        shop_tag_type = re.findall(r'<divclass="tag-addr">.*?<spanclass="tag">(.*?)<\/span>', shop)[0]
        for i in range(len(real_list['tagName'])):
            shop_tag_site = shop_tag_site.replace(real_list['tagName'][i], words[i])
            shop_tag_type = shop_tag_type.replace(real_list['tagName'][i], words[i])
        # Keep only the Chinese characters [\u4e00-\u9fa5] and join them
        item['shop_tag_site'] = ''.join(re.findall(r'[\u4e00-\u9fa5]', shop_tag_site))
        item['shop_tag_type'] = ''.join(re.findall(r'[\u4e00-\u9fa5]', shop_tag_type))
        items.append(item)

    return items
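A quick usage sketch (html here is the first-page data fetched in section 2.1):

items = get_items(html, real_list, words)
print(len(items))              # number of shops parsed from this page
print(items[0]['shop_name'])   # spot-check one field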

Taking skiing as an example, here is the shop information collected from the first page of results:

4. Data acquisition of all pages

In most cases the search results span multiple pages, so besides collecting single-page data we need the data of all pages. The usual approach is to first obtain the number of pages, then loop through them page by page to crawl everything.

4.1. Get the number of pages

For single-page results there is no total page count. For multi-page results, drag to the bottom of the page, inspect the last-page control to find the HTML node holding the value, and then extract it with a regular expression.

# Get the number of pages
def get_pages(html):
    try:
        page_html = re.findall(r'<divclass="page">(.*?)<\/div>', html)[0]
        # Convert to int so it can be used as a loop bound
        pages = int(re.findall(r'<ahref=.*>(\d+)<\/a>', page_html)[0])
    except:
        pages = 1

    return pages

4.2. Collect all data

When we parse the first page we can obtain the number of pages, download the anti-crawl fonts, build the real_list font mapping and the real-character string words, and collect the list of all shop data on that page. We can then iterate from the second page to the last, appending each page's data to the first list.

import pandas as pd

# Number of pages, and the list of shop data from the first page
pages = get_pages(html)
shop_data = get_items(html, real_list, words)
# From page 2 to the last page
for page in range(2, pages + 1):
    aim_url = f'{url}/p{page}'
    html = get_html(aim_url, headers)
    items = get_items(html, real_list, words)
    shop_data.extend(items)
    print(f'Page {page} crawled')
# Convert to a DataFrame
df = pd.DataFrame(shop_data)

The resulting DataFrame with all the collected data looks like this:
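If you want to persist the results, a minimal sketch (the filename is arbitrary; shop_id is used for deduplication, as noted in section 3):

# Deduplicate by shop_id, then save; utf-8-sig keeps the Chinese readable in Excel
df = df.drop_duplicates(subset='shop_id')
df.to_csv('dianping_ski_shops.csv', index=False, encoding='utf-8-sig')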

5. Summary

Faced with Dianping's font-based anti-crawling, we first obtain the font files and parse out the mapping between the character codes and the real characters, and then replace the codes in the page data with those real characters.

In practice, however, crawling Dianping can run into more complicated situations, such as being redirected to a verification page or having the account or IP rate-limited. Such cases can be handled by setting cookies and adding IP proxies.
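For example, a minimal sketch of routing requests through a proxy (the proxy address is a placeholder you would replace with a real one):

# requests supports per-scheme proxies; the address below is a placeholder
proxies = {
    'http': 'http://ip:port',
    'https': 'http://ip:port',
}
rep = requests.get(url, headers=headers, proxies=proxies, timeout=10)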

Reply 0104 to the public account to get the complete code ~

All the above content is for technical exchange only, please do not use it for any illegal business activities!