Crawling and analyzing Taobao product information with Python

  • Background
  • 1. Simulated login
  • 2. Crawling product information
    • 1. Define parameters
    • 2. Analyze the page and define the regular expressions
    • 3. Crawl the data
  • 3. Simple data analysis
    • 1. Import libraries
    • 2. Displaying Chinese
    • 3. Read the data
    • 4. Analyze the price distribution
    • 5. Analyze the sales location distribution
    • 6. Word cloud analysis
  • Final words

This article is for learning and exchange only; do not use it for illegal purposes!

Background

A classmate asked me: “XXX, is there any way to collect Taobao product information? I want to do some statistics.” Since I had nothing better to do, I started thinking about it…



As the saying goes, knowledge comes from practice

While I’m at it, let me recommend this “2020 Latest Enterprise Python Project Practice” video tutorial (click here to get it); I hope we can all make progress together.

1. Simulated login

I charged into Taobao, ready to fire off a search:



I typed the keyword “graphics card” into the search bar and gently tapped the Enter key.

In a happy mood, I waited for a page full of product information to come back, only to be hit with a 302 redirect that dropped me, quite unexpectedly, on the login page.



That’s basically what’s going on…

Some digging confirmed it: as Taobao’s anti-crawling measures have tightened, many of you will have noticed that the search function now requires the user to log in!

Simulating the login directly would mean analyzing Taobao’s various login requests and generating the corresponding parameters, which is fairly difficult. So I decided to take another route: Selenium plus a QR code:

# The imports this snippet needs
import base64
import threading
import time

from PIL import Image
from selenium import webdriver

# Open an image
def Openimg(img_location):
    img = Image.open(img_location)
    img.show()

# Log in and obtain cookies
def Login():
    driver = webdriver.PhantomJS()  # headless browser
    driver.get('https://login.taobao.com/member/login.jhtml')
    try:
        # Switch to the QR-code login tab if it is not already active
        driver.find_element_by_xpath('//*[@id="login"]/div[1]/i').click()
    except:
        pass
    time.sleep(3)
    # Get the canvas QR code by executing JS
    JS = 'return document.getElementsByTagName("canvas")[0].toDataURL("image/png");'
    im_info = driver.execute_script(JS)  # Execute JS to get the image data URL
    im_base64 = im_info.split(',')[1]  # Get the base64-encoded image data
    im_bytes = base64.b64decode(im_base64)  # Decode to bytes
    time.sleep(2)
    with open('./login.png', 'wb') as f:
        f.write(im_bytes)
    # Show the QR code in a separate thread so the cookie poll below can run
    t = threading.Thread(target=Openimg, args=('./login.png',))
    t.start()
    print("Logging in... Please scan the QR code! \n")
    while True:
        c = driver.get_cookies()
        if len(c) > 20:  # Enough cookies have appeared: login succeeded
            cookies = {}
            for i in range(len(c)):
                cookies[c[i]['name']] = c[i]['value']
            driver.close()
            print("Logged in successfully! \n")
            return cookies
        time.sleep(1)

Open the Taobao login page via WebDriver and save the QR code locally, then open it for the user to scan (the relevant elements are easy to locate with the browser’s F12 element inspector). Once the scan succeeds, convert the cookies in the WebDriver session to a dict and return it. (This is used for the subsequent requests-based crawling.)
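(A small aside from me: the list-to-dict cookie conversion inside Login() can equally be written as a one-line dict comprehension.)

# Equivalent one-liner for the conversion loop in Login()
cookies = {c['name']: c['value'] for c in driver.get_cookies()}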

2. Crawling product information

Once I have the cookies, I can crawl the product information. (Here I come, data!)

1. Define parameters

Define the request address, request headers, and so on:

# Define parameters
headers = {'Host': 's.taobao.com',
           'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
           'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
           'Accept-Encoding': 'gzip, deflate, br',
           'Connection': 'keep-alive'}
list_url = 'http://s.taobao.com/search?q=%(key)s&ie=utf8&s=%(page)d'
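A quick illustration of my own of how this template paginates: the s parameter is an item offset, and each result page holds 44 items, which is exactly why the crawl loop later computes page = i*44:

# Page i starts at item offset i * 44 (illustrative keyword 'gpu')
for i in range(3):
    print(list_url % {'key': 'gpu', 'page': i * 44})
# http://s.taobao.com/search?q=gpu&ie=utf8&s=0
# http://s.taobao.com/search?q=gpu&ie=utf8&s=44
# http://s.taobao.com/search?q=gpu&ie=utf8&s=88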

2. Analyze the page and define the regular expressions

Once the HTML page has been fetched, the data we want has to be extracted from it, and I chose regular expressions for that. Looking at the page source:



I only marked two values above, but the others are similar, which gives the following regular expressions:

# Regular expressions
p_title = '"raw_title":"(.*?)"'        # title
p_location = '"item_loc":"(.*?)"'      # seller location
p_sale = '"view_sales":"(.*?)人付款"'   # sales (the suffix means "people paid")
p_comment = '"comment_count":"(.*?)"'  # number of comments
p_price = '"view_price":"(.*?)"'       # selling price
p_nid = '"nid":"(.*?)"'                # unique product ID
p_img = '"pic_url":"(.*?)"'            # image URL

(P.S. If you are sharp-eyed, you will have noticed that the product information is stored in the g_page_config variable, so you could also extract that variable (a dictionary) and read the data from it.)
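For completeness, a rough sketch of that g_page_config route, where html is the fetched page source; the regex and the mods -> itemlist -> data -> auctions key path reflect the page structure as it was at the time, so treat both as assumptions:

import json
import re

m = re.search(r'g_page_config = (.*?);\n', html)
if m:
    page_config = json.loads(m.group(1))
    # Assumed key path into the embedded product list
    items = page_config['mods']['itemlist']['data']['auctions']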

3. Crawl the data

Everything is ready; all that’s missing is the east wind. And here comes the east wind:

# The imports this snippet needs
import re
import time

import pandas as pd
import requests

# Crawl the data
key = input('Please enter a keyword: ')  # Product keyword
N = 20  # Number of pages to crawl
data = []
cookies = Login()
for i in range(N):
    try:
        page = i * 44  # Each result page holds 44 items
        url = list_url % {'key': key, 'page': page}
        res = requests.get(url, headers=headers, cookies=cookies)
        html = res.text
        title = re.findall(p_title, html)
        location = re.findall(p_location, html)
        sale = re.findall(p_sale, html)
        comment = re.findall(p_comment, html)
        price = re.findall(p_price, html)
        nid = re.findall(p_nid, html)
        img = re.findall(p_img, html)
        for j in range(len(title)):
            data.append([title[j], location[j], sale[j], comment[j], price[j], nid[j], img[j]])
        print('-------Page %s complete! --------\n\n' % (i + 1))
        time.sleep(3)
    except:
        pass
data = pd.DataFrame(data, columns=['title', 'location', 'sale', 'comment', 'price', 'nid', 'img'])
data.to_csv('%s.csv' % key, encoding='utf-8', index=False)

The above code crawls 20 pages of product information and saves them to a local CSV file, which looks like this:
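To sanity-check the output, here is a quick preview snippet of my own (the filename is whatever keyword you entered at the prompt):

import pandas as pd

print(pd.read_csv('graphics card.csv', encoding='utf-8').head())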

3. Simple data analysis

Now that we have the data, wouldn’t it be a waste to just let it sit there? As a good socialist youth, how could I do such a thing? So let’s take a quick look at it. (Of course, with this little data, it’s just for fun.)

1. Import libraries

# import related libraries
import jieba
import operator
import pandas as pd
from wordcloud import WordCloud
from matplotlib import pyplot as plt

The corresponding libraries can all basically be installed via pip (see the single command after this list):

  • jieba
  • pandas
  • wordcloud
  • matplotlib
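For instance, assuming a standard Python environment, one command installs all four:

pip install jieba pandas wordcloud matplotlib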

2. Displaying Chinese

# matplotlib Chinese font settings
plt.rcParams['font.family'] = ['sans-serif']
plt.rcParams['font.sans-serif'] = ['SimHei']

Without these settings, you may run into garbled Chinese characters and other annoyances.

3. Read the data

# Read the data
key = 'graphics card'
data = pd.read_csv('%s.csv' % key, encoding='utf-8', engine='python')

4. Analyze the price distribution

# Price distribution
plt.figure(figsize=(16, 9))
plt.hist(data['price'], bins=20, alpha=0.6)
plt.title('Price frequency distribution histogram')
plt.xlabel('price')
plt.ylabel('frequency')
plt.savefig('Price distribution.png')
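One caveat of my own, not from the original: depending on how the CSV round-trips, the price column may be read back as strings, so coercing it to numeric first keeps the histogram happy:

# My own addition: make sure prices are numeric before plotting
data['price'] = pd.to_numeric(data['price'], errors='coerce')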

Price frequency distribution histogram:

5. Analyze the sales location distribution

# Distribution of sales locations
group_data = list(data.groupby('location'))
loc_num = {}
for i in range(len(group_data)):
    loc_num[group_data[i][0]] = len(group_data[i][1])
plt.figure(figsize=(19, 9))
plt.title('Place of sale')
plt.scatter(list(loc_num.keys())[:20], list(loc_num.values())[:20], color='r')
plt.plot(list(loc_num.keys())[:20], list(loc_num.values())[:20])
plt.savefig('Place.png')
sorted_loc_num = sorted(loc_num.items(), key=operator.itemgetter(1), reverse=True)  # Sort by count
loc_num_10 = sorted_loc_num[:10]  # Take the top 10
loc_10 = []
num_10 = []
for i in range(10):
    loc_10.append(loc_num_10[i][0])
    num_10.append(loc_num_10[i][1])
plt.figure(figsize=(16, 9))
plt.title('TOP10 places of sale')
plt.bar(loc_10, num_10, facecolor='lightskyblue', edgecolor='white')
plt.savefig('Sales place Top10.png')
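As a side note, pandas can produce the same per-location counts in a single call; a minimal equivalent sketch of mine:

# value_counts returns counts already sorted in descending order
loc_counts = data['location'].value_counts()
top10 = loc_counts[:10]
plt.figure(figsize=(16, 9))
plt.title('TOP10 places of sale')
plt.bar(top10.index, top10.values, facecolor='lightskyblue', edgecolor='white')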

Distribution of sales locations:



TOP10 places of sale:

6. Word cloud analysis

# Make the word cloud
content = ''
for i in range(len(data)):
    content += data['title'][i]
wl = jieba.cut(content, cut_all=True)  # Full-mode segmentation with jieba
wl_space_split = ' '.join(wl)
wc = WordCloud('simhei.ttf',
               background_color='white',  # Background color
               width=1000,
               height=600).generate(wl_space_split)
wc.to_file('%s.png' % key)
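If you would rather see the cloud without opening the saved file, matplotlib can render it directly (a small extra of mine):

plt.figure(figsize=(10, 6))
plt.imshow(wc, interpolation='bilinear')  # WordCloud objects render as images
plt.axis('off')
plt.show()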

The word cloud for Taobao “graphics card” products:

Final words

Finally, what more can I say? Let me once again recommend this “2020 Latest Enterprise Python Project Practice” video tutorial (click here to get it); I hope we can all make progress together. Thank you very much for your patience!