I haven’t used QQ since graduation. What Qzone records are not my finest years, but they are still memories. Recently I wanted to put what I’ve learned into practice, so I used Python to crawl all the photos in my Qzone albums and back them up.

Analyzing Qzone

Log in to Qzone

The first step of any crawl is analyzing the site, and here that means figuring out how to log in to Qzone. The initial idea was to use the Requests library to build the login request and simulate a login, but that idea was soon abandoned.

Following the listener bound to the login button, the button’s click event can be traced as follows:

Password encryption is unavoidable, and this pile of code is genuinely painful to reverse-engineer; patient warriors are welcome to give it a try!

After ruling out that login method, simulating a user login with Selenium is the time- and effort-saving choice. We only need Selenium to complete the login and hand over the Cookies and the g_tk parameter described below, after which it can be shut down, so it is not too inefficient.

Analyzing the album list

After login, the page jumps to https://user.qzone.qq.com/{QQ_NUMBER}. If you hover over the navigation bar, you’ll see that every navigation link is javascript:; 😳. The whole thing is a black box.

Of course, this is not too hard to handle: use the browser’s debugging tools to capture the requests generated by the click, then filter out the right one. With so many network packets, how do you filter? Guessing that the album API must return some kind of list, filter on list and rule out candidates one by one until the request is located. The packets below were filtered with fcg_list. The album list is returned in JSONP format, which can be read as JSON with a little manipulation (more on that later).

Two important sets of information can be obtained from Headers and Response, respectively:

  1. request – the request information needed for the album list, including the request URL and parameters
  2. response – the packet containing information on all albums, which supplies the parameters for the requests that fetch the photos in each album

First look at the request package:

# url
https://h5.qzone.qq.com/proxy/domain/photo.qzone.qq.com/fcgi-bin/fcg_list_album_v3

# args
g_tk: 477819917
callback: shine0_Callback
t: 691481346
hostUin: 123456789
uin: 123456789
appid: 4
inCharset: utf-8
outCharset: utf-8
source: qzone
plat: qzone
format: jsonp
notice: 0
filter: 1
handset: 4
pageNumModeSort: 40
pageNumModeClass: 15
needUserInfo: 1
idcNum: 4
callbackFun: shine0
_ : 1551788226819

Here hostUin and uin are the QQ number, and g_tk is required and changes every time you log in again (how to obtain it is explained later). The other parameters are mostly optional; after some trial and error I settled on the following request parameters:

query = {
    'g_tk': self.g_tk,
    'hostUin': self.username,
    'uin': self.username,
    'appid': 4,
    'inCharset': 'utf-8',
    'outCharset': 'utf-8',
    'source': 'qzone',
    'plat': 'qzone',
    'format': 'jsonp'
}

Let’s look at the cross-domain response package in JSONP format:

shine0_Callback({ "code":0, "subcode":0, "message":"", "default":0, "data": { "albumListModeSort" : [ { "allowAccess" : 1, "anonymity" : 0, "bitmap" : "10000000", "classid" : 106, "comment" : 11, "createtime" : 1402661881, "desc" : "", "handset" : 0, "id" : "V13LmPKk0JLNRY", "lastuploadtime" : 1402662103, "modifytime" : 1408271987, "name" : "Graduation season ", "order" : 0, "pre" : "http:\/\/b171.photo.store.qq.com\/psb?\/V13LmPKk0JLNRY\/eSAslg*mYWaytEtLysg*Q*5Km91gIWfGuwSk58K2rQY! \/a\/dIY29GUbJgAA", "priv" : 1, "pypriv" : 1, "total" : 4, "viewtype" : 0 },Copy the code

The shine0_Callback wrapper name is determined by the callbackFun parameter of the request; without it, the response defaults to _Callback, which doesn’t really matter. All album information is stored in albumListModeSort as JSON; only one album is shown in the snippet above.

In each album entry, name is the album name, id is the unique identifier used to request the photos in that album, and pre is just a thumbnail preview link that we don’t need.

Analyzing individual albums

Similar to obtaining the album list, open an album and filter the packets with cgi_list to find the album’s photo information.

In the same way, the captured packet gives us both the photo-list request and its response. First look at the request:

# url
https://h5.qzone.qq.com/proxy/domain/photo.qzone.qq.com/fcgi-bin/cgi_list_photo

# args
g_tk: 477819917
callback: shine0_Callback
t: 952444063
mode: 0
idcNum: 4
hostUin: 123456789
topicId: V13LmPKk0JLNRY
noTopic: 0
uin: 123456789
pageStart: 0
pageNum: 30
skipCmtCount: 0
singleurl: 1
batchId: 
notice: 0
appid: 4
inCharset: utf-8
outCharset: utf-8
source: qzone
plat: qzone
outstyle: json
format: jsonp
json_esc: 1
question: 
answer: 
callbackFun: shine0
_ : 1551790719497

There are several key parameters:

  1. g_tk – the same as in the album list request
  2. topicId – equal to the album’s id from the album list
  3. pageStart – the index of the first photo requested
  4. pageNum – the number of photos requested this time

To get all the photos at once, you can set pageStart to 0 and pageNum to the maximum number of photos in all albums.

You can also simplify the parameters: just add topicId, pageStart and pageNum on top of the album list request parameters.
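If you would rather not fetch everything in one request, paging with pageStart and pageNum is straightforward. Below is a minimal sketch, assuming a hypothetical fetch_photo_page helper that performs the request and JSONP parsing described later in this article.

def fetch_all_photos(fetch_photo_page, album_id, page_size=30):
    """Page through an album until a page comes back shorter than page_size."""
    photos, page_start = [], 0
    while True:
        # fetch_photo_page is assumed to return the parsed JSON response
        data = fetch_photo_page(topicId=album_id,
                                pageStart=page_start,
                                pageNum=page_size)
        photos.extend(data['data'].get('photoList') or [])
        if data['data']['totalInPage'] < page_size:
            break
        page_start += page_size
    return photos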

Here is the list of returned photos:

shine0_Callback({ "code":0, "subcode":0, "message":"", "default":0, "data": { "limit" : 0, "photoList" : [ { "batchId" : "1402662093402000", "browser" : 0, "cameratype" : " ", "cp_flag" : false, "cp_x" : 455, "cp_y" : 388, "desc" : "", "exif" : { "exposureCompensation" : "", "exposureMode" : "", "exposureProgram" : "", "exposureTime" : "", "flash" : "", "fnumber" : "", "focalLength" : "", "iso" : "", "lensModel" : "", "make" : "", "meteringMode" : "", "model" : "", "originalTime" : "" }, "forum" : 0, "frameno" : 0, "height" : 621, "id" : 0, "is_video" : false, "is_weixin_mode" : 0, "ismultiup" : 0, "lloc" : "NDN0sggyKs3smlOg6eYghjb0ZRsmAAA!", "modifytime" : 1402661792, "name" : "QQ photo 20140612104616", "origin" : 0, "origin_upload" : 0, "origin_URL" : ", "owner" : "123456789", "ownername" : "123456789", "photocubage" : 91602, "phototype" : 1, "picmark_flag" : 0, "picrefer" : 1, "platformId" : 0, "platformSubId" : 0, "poiName" : "", "pre" : "http:\/\/b171.photo.store.qq.com\/psb?\/V13LmPKk0JLNRY\/eSAslg*mYWaytEtLysg*Q*5Km91gIWfSk58K2rQY! \/a\/dIY29GUbJgAA&bo=pANtAgAAAAABCeY!", "raw" : "http:\/\/r.photo.store.qq.com\/psb?\/V13LmPKk0JLNRY\/eSAslg*mYWaytEtLysg*Q*5Km91gIWfSk58K2rQY! \/r\/dIY29GUbJgAA", "raw_upload" : 1, "rawshoottime" : 0, "shoottime" : 0, "shorturl" : "", "sloc" : "NDN0sggyKs3smlOg6eYghjb0ZRsmAAA!", "tag" : "", "uploadtime" : "2014-06-13 20:21:33", "url" : "http:\/\/b171.photo.store.qq.com\/psb?\/V13LmPKk0JLNRY\/eSAslg*mYWaytEtLysg*Q*5Km91gIWfSk58K2rQY! \/b\/dIY29GUbJgAA&bo=pANtAgAAAAABCeY!", "width" : 932, "yurl" : 0 }, // ... ]  "t" : "952444063", "topic" : { "bitmap" : "10000000", "browser" : 0, "classid" : 106, "comment" : 1, "cover_id" : "NDN0sggyKs3smlOg6eYghjb0ZRsmAAA!" , "createtime" : 1402661881, "desc" : "", "handset" : 0, "id" : "V13LmPKk0JLNRY", "is_share_album" : 0, "laSTUploadTime" : 1402662103, "modiFYTime" : 1408271987, "Name" :" Graduation season ", "ownerName" : "707922098", "ownerUin" : "707922098", "pre" : "http:\/\/b171.photo.store.qq.com\/psb?\/V13LmPKk0JLNRY\/eSAslg*mYWaytEtLysg*Q*5Km91gIWfGuwSk58K2rQY! \/a\/dIY29GUbJgAA", "priv" : 1, "pypriv" : 1, "share_album_owner" : 0, "total" : 4, "url" : "http:\/\/b171.photo.store.qq.com\/psb?\/V13LmPKk0JLNRY\/eSAslg*mYWaytEtLysg*Q*5Km91gIWfGuwSk58K2rQY! \/b\/dIY29GUbJgAA", "viewtype" : 0 }, "totalInAlbum" : 4, "totalInPage" : 4 }Copy the code

The returned photo information is stored in photoList. Again, only one photo is shown above; the tail of the response carries some basic information about the current album. totalInAlbum and totalInPage hold the total number of photos in the album and the number returned this time. The download link we need is the url field!

OK, now that all request and response data have been analyzed, it is time for coding.

Determine the crawl scheme

  1. Create a qqzone class to hold the user information
  2. Use Selenium to simulate the login
  3. Obtain the Cookies and g_tk
  4. Use requests to get the album list information
  5. Walk through the albums, get each photo list and download the photos (a rough end-to-end sketch follows)
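For orientation, here is a rough sketch of how these steps could be tied together; it is only an outline, and the individual methods are built step by step in the sections below.

# Hypothetical driver (method names match the ones defined later):
if __name__ == '__main__':
    user = {'username': '123456789', 'password': '******'}
    qz = qqzone(user)
    qz._login_and_get_args()    # Selenium login, collect Cookies and g_tk
    qz._init_session()          # requests.Session with Cookies and headers
    for name, album_id in qz._get_ablum_list().items():
        qz._get_photo(name, album_id)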

Create qqzone

class qqzone(object):
    """QQ Space album crawler ""
    def __init__(self, user):
        self.username = user['username']
        self.password = user['password']

Simulating the login

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import WebDriverException

#...

def _login_and_get_args(self):
    """ Log in to QQ and get Cookies and G_TK """
    opt = webdriver.ChromeOptions()
    opt.set_headless()

    driver = webdriver.Chrome(chrome_options=opt)
    driver.get('https://i.qq.com/')
    # time.sleep(2)

    logging.info('User {} login... '.format(self.username))
    driver.switch_to.frame('login_frame')
    driver.find_element_by_id('switcher_plogin').click()
    driver.find_element_by_id('u').clear()
    driver.find_element_by_id('u').send_keys(self.username)
    driver.find_element_by_id('p').clear()
    driver.find_element_by_id('p').send_keys(self.password)
    driver.find_element_by_id('login_button').click()

    time.sleep(1)
    driver.get('https://user.qzone.qq.com/{}'.format(self.username))

Note here:

  1. Using selenium requires installing the corresponding webdriver
  2. The driver location can be specified through webdriver.Chrome(); otherwise it is looked up on the PATH environment variable (see the sketch below)
  3. If your computer is slow to open a browser, you may need to sleep a few seconds after driver.get
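As a sketch of point 2, with the Selenium 3 style API used in this article the driver location can be passed via executable_path (the path below is a placeholder):

opt = webdriver.ChromeOptions()
opt.set_headless()
# executable_path is only needed when chromedriver is not on the PATH
driver = webdriver.Chrome(executable_path='/path/to/chromedriver',
                          chrome_options=opt)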

Getting the Cookies

Selenium makes getting the Cookies very convenient:

self.cookies = driver.get_cookies()

Getting g_tk

Getting g_tk was initially the biggest difficulty of this crawler, because the value is never written directly into the page; it is only produced by various function calls. A global search showed that many places fetch it.

In the end one of them was picked, and g_tk was successfully obtained through Selenium’s ability to execute scripts:

self.g_tk = driver.execute_script('return QZONE.FP.getACSRFToken()')
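For reference, many other Qzone crawlers compute the token themselves from the p_skey (or skey) cookie rather than calling the page function. A sketch of that commonly cited hash is below, under the assumption that it still matches what QZONE.FP.getACSRFToken returns:

def calc_g_tk(p_skey):
    """Hash the p_skey cookie value into a g_tk token (commonly cited algorithm)."""
    h = 5381
    for ch in p_skey:
        h += (h << 5) + ord(ch)
    return h & 0x7fffffff

# p_skey would come from the Cookies collected above, e.g.
# p_skey = next(c['value'] for c in self.cookies if c['name'] == 'p_skey')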

At this point, Selenium is done, and the rest will be done through Requests.

Initializing the requests Session

The next step is to build the requests and retrieve the data. For convenience, data is requested through a Session, with the cookies and headers configured once so they don’t have to be set on every request.

def _init_session(self):
    self.session = requests.Session()
    for cookie in self.cookies:
        self.session.cookies.set(cookie['name'], cookie['value'])
    self.session.headers = {
        'Referer': 'https://qzs.qq.com/qzone/photo/v7/page/photo.html?init=photo.v7/module/albumList/index&navBar=1',
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36'
    }

Requesting album information

To obtain the album information, first assemble the request parameters, then fetch the data with session.get, read the JSONP payload as JSON via a regular expression, and finally pull out the name and id we need.

def _get_ablum_list(self):
    """Get the list information of all albums"""
    album_url = '{}{}'.format(
        'https://h5.qzone.qq.com/proxy/domain/photo.qzone.qq.com/fcgi-bin/fcg_list_album_v3?',
        self._get_query_for_request())

    logging.info('Getting album list id...')
    resp = self.session.get(album_url)
    data = self._load_callback_data(resp)

    album_list = {}
    for item in data['data']['albumListModeSort']:
        album_list[item['name']] = item['id']

    return album_list
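The result is a plain dict mapping album names to ids; built from the sample response shown earlier it would look roughly like this:

album_list = {'Graduation season': 'V13LmPKk0JLNRY'}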

The parameter combinations come from the _get_query_for_request function.

def _get_query_for_request(self, topicId=None, pageStart=0, pageNum=100):
    """Combine the request parameters into a query string.

    Args:
        topicId: id of the album whose photo list is requested
        pageStart: index of the first photo requested
        pageNum: number of photos requested at a time
    Returns:
        A string that combines all the request parameters.
    """
    query = {
        'g_tk': self.g_tk,
        'hostUin': self.username,
        'uin': self.username,
        'appid': 4,
        'inCharset': 'utf-8',
        'outCharset': 'utf-8',
        'source': 'qzone',
        'plat': 'qzone',
        'format': 'jsonp'
    }
    if topicId:
        query['topicId'] = topicId
        query['pageStart'] = pageStart
        query['pageNum'] = pageNum
    return '&'.join('{}={}'.format(key, val) for key, val in query.items())
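As a design note, the same query string could be produced with the standard library, which also takes care of any characters that need escaping:

from urllib.parse import urlencode

# equivalent to the '&'.join(...) above, given the query dict built in the method
query_string = urlencode(query)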

The JSONP parsing function is shown below; its body is just a regular-expression match, nothing fancy.

def _load_callback_data(self, resp):
    """Parse the returned JSONP data into JSON format"""
    # requires `import re` and `from json import loads`
    try:
        resp.encoding = 'utf-8'
        # grab the JSON object wrapped inside shine0_Callback( ... )
        data = loads(re.search(r'.*?\(({.*})\).*', resp.text, re.S)[1])
        return data
    except ValueError:
        logging.error('Invalid input')
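To make the regular expression concrete, here is a tiny illustration of what it extracts from a JSONP wrapper (the sample string is made up):

import re
from json import loads

text = 'shine0_Callback({"code": 0, "data": {"albumListModeSort": []}});'
data = loads(re.search(r'.*?\(({.*})\).*', text, re.S)[1])
print(data['code'])  # prints 0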

Parsing and downloading the photos

After getting the album list, request each album’s photo list and then download the photos one by one.

def _get_photo(self, album_name, album_id):
    """Get the photo list of a single album and download all the photos in it."""
    photo_list_url = '{}{}'.format(
        'https://h5.qzone.qq.com/proxy/domain/photo.qzone.qq.com/fcgi-bin/cgi_list_photo?',
        self._get_query_for_request(topicId=album_id))

    logging.info('Getting photo list for album {}...'.format(album_name))
    resp = self.session.get(photo_list_url)
    data = self._load_callback_data(resp)
    if data['data']['totalInPage'] == 0:
        return None

    file_dir = self.get_path(album_name)
    for item in data['data']['photoList']:
        path = '{}/{}.jpg'.format(file_dir, item['name'])
        logging.info('Downloading {}-{}'.format(album_name, item['name']))
        self._download_image(item['url'], path)
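_get_photo relies on a get_path helper that isn’t shown in the article; a minimal sketch, assuming photos are saved under a local photo/<username>/<album> directory, could look like this:

import os

def get_path(self, album_name):
    """Return (and create if needed) the download directory for an album."""
    file_dir = os.path.join('photo', self.username, album_name)
    os.makedirs(file_dir, exist_ok=True)
    return file_dir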

Images are also downloaded via requests, so remember to set a timeout.

def _download_image(self, url, path):
    """Download a single photo"""
    try:
        resp = self.session.get(url, timeout=15)
        if resp.status_code == 200:
            with open(path, 'wb') as f:
                f.write(resp.content)
    except requests.exceptions.Timeout:
        logging.warning('get {} timeout'.format(url))
    except requests.exceptions.ConnectionError as e:
        logging.error(str(e))

Crawl test

  • The crawl process

  • Crawl results

Final notes

  1. If the format request parameter is changed from jsonp to json, the JSON data can be obtained directly
  2. This crawler uses neither multiprocessing nor multithreading, so it is not fast; that is worth optimizing (a possible threading sketch follows)
  3. Our crawler has earned a round of applause; that’s exactly what was expected of it
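As a rough sketch of point 2, the downloads could be parallelized with a thread pool; this assumes the _download_image method shown earlier and an iterable of (url, path) pairs collected from the photo lists:

from concurrent.futures import ThreadPoolExecutor

def download_concurrently(qz, tasks, workers=8):
    """qz: a logged-in qqzone instance; tasks: iterable of (url, path) pairs."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for url, path in tasks:
            pool.submit(qz._download_image, url, path)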

This article was originally published at www.litreily.top