Configure the packet capture tool

1. Fiddler (Windows)

  • Installation: Fiddler
  • Configuration:
# 1. Allow remote connections
Tools > Fiddler Options > Connections > Allow remote computers to connect
# 2. Support HTTPS
Tools > Fiddler Options > HTTPS > Decrypt HTTPS traffic
# 3. Restart
Restart Fiddler
# 4. Mobile configuration
In the phone's Wi-Fi settings, set a proxy: the server address is the computer's IP address XXX.XXX.XXX.xxx, and the port is 8888
# 5. Download the root certificate
Open Safari, visit XXX.XXX.XXX.xxx:8888, then download and install the root certificate
# 6. Fully trust the certificate
Settings > General > About > Certificate Trust Settings > Enable full trust for root certificates

2. Charles (macOS)

  • Installation: Charles Web Debugging Proxy
  • Configuration:
# 1. Mobile configuration
In the phone's Wi-Fi settings, set a proxy: the server address is the computer's IP address XXX.XXX.XXX.xxx, and the port is 8888
# 2. Install the root certificate
Help -> SSL Proxying -> Install Charles Root Certificate on a Mobile Device, then download the root certificate in Safari on the phone
# 3. Fully trust the certificate
Settings > General > About > Certificate Trust Settings > Enable full trust for root certificates
# 4. Enable SSL proxying
Proxy -> SSL Proxying Settings... and check Enable SSL Proxying
# 5. Add the domain name
Host: https://mp.weixin.qq.com  Port: 443

3. Scrapy (Python package)

  • From the official website:
An open source and collaborative framework for extracting the data you need from websites.
In a fast, simple, yet extensible way. 
  • Use:
#1. Create the project WCSpider
scrapy startproject WCSpider

#2. Switch to the project directory
cd WCSpider

#3. Create crawler wechat
scrapy genspider wechat mp.qq.com

#4. Set the User-Agent
Uncomment the USER_AGENT line in settings.py

#5. Add the item fields
name = scrapy.Field()
age = scrapy.Field()
...

#6. Parse the response
Set start_urls in wechat.py, parse the response, and return items:
res = response.xpath(...)
item['name'] = res['_name']
item['age'] = res['_age']
...
yield item

#7. Set up the pipeline file
Initialize resources (a database, a file, ...) in open_spider, e.g. self.f = open("file.json", "w")
Release them in close_spider, e.g. self.f.close()
Handle each item in process_item, e.g. self.f.write(json.dumps(dict(item)))
Enable the pipeline in settings.py
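The pipeline step above can be sketched as a minimal class following Scrapy's pipeline conventions; the class name and output file name here are illustrative:

```python
import json

class WechatPipeline:
    """Minimal pipeline: open a file, write each item as a JSON line, close on exit."""

    def open_spider(self, spider):
        # Initialize resources here (a file, a database connection, ...)
        self.f = open("file.json", "w")

    def process_item(self, item, spider):
        # Called once for each item the spider yields
        self.f.write(json.dumps(dict(item)) + "\n")
        return item

    def close_spider(self, spider):
        # Release resources
        self.f.close()
```

Enable it in settings.py, for example: ITEM_PIPELINES = {"WCSpider.pipelines.WechatPipeline": 300}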

Capturing an official account's message history

This walkthrough uses Charles as the packet capture tool.

1. Get the home page link

The page source contains var msgList = '{…}'. Because this is not standard HTML, XPath does not work here, so try a regular expression to extract the content quickly.

rex = "msgList = '({.*?})'"
pattern = re.compile(pattern=rex, flags=re.S)
match = pattern.search(response.text)
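Run against a snippet shaped like the captured page (the sample below is illustrative, not the real WeChat response), the pattern extracts the embedded JSON:

```python
import html
import json
import re

# Illustrative stand-in for the captured history page source
page = """<script type="text/javascript">
var msgList = '{"list": [{"comm_msg_info": {"id": 1, "content": "Tom &amp; Jerry"}}]}';
</script>"""

rex = "msgList = '({.*?})'"
pattern = re.compile(pattern=rex, flags=re.S)
match = pattern.search(page)

# Undo the HTML entity escaping, then decode the JSON
data = json.loads(html.unescape(match.group(1)))
print(data["list"][0]["comm_msg_info"]["content"])  # Tom & Jerry
```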

In this example, all the content of the official account is sent in text format, so the comm_msg_info content is retrieved directly. In practice, parse according to the actual situation.

# Inside the spider's parse method (requires import html, json at the top of wechat.py)
if match:
    data = match.group(1)
    data = html.unescape(data)
    data = json.loads(data)
    articles = data.get("list")

    for article in articles:
        info = article["comm_msg_info"]
        content = html.unescape(info["content"])
        # Pass the retrieved data to the pipeline
        item = WechatItem()
        item["content"] = content
        item["we_id"] = info["id"]
        item["datetime"] = info["datetime"]
        item["we_type"] = info["type"]
        yield item

2. Obtain the drop-down refresh link

Scroll to the bottom of the history list; more content is loaded automatically, and you can capture those requests with the packet capture tool.

As shown in the figure, the content is returned in JSON format: can_msg_continue indicates whether there is more content, and general_msg_list is what gets parsed. Comparing several history loads shows that offset controls where loading starts. In this example, the Python third-party library requests is used to load the data.

import json
import requests

url_offset = 0  # where the next page of history starts

def getMore():
    global url_offset
    # Build the request headers captured by Charles
    header = sj_utils.header2dict(base_header)
    response = requests.get(getUrl(), headers=header, verify=False)

    js = json.loads(response.text)
    if js["errmsg"] == "ok":
        # general_msg_list is a JSON string that needs to be parsed first
        alist = json.loads(js["general_msg_list"])["list"]
        for item in alist:
            # Handle each entry as needed
            print(item["comm_msg_info"]["content"])

    if js["can_msg_continue"]:
        url_offset += 10  # advance the offset
        getMore()
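The recursion above can also be written as a loop. Below is a sketch with a hypothetical fetch_page function standing in for the real requests call; the page size of 10 matches the offsets observed above:

```python
import json

def fetch_page(offset):
    # Hypothetical stand-in for the real HTTP request: returns the decoded
    # JSON of one history page of 10 entries; reports no more after 3 pages.
    entries = [{"comm_msg_info": {"content": f"article {offset + i}"}} for i in range(10)]
    return {
        "errmsg": "ok",
        "general_msg_list": json.dumps({"list": entries}),
        "can_msg_continue": 1 if offset < 20 else 0,
    }

def crawl_history():
    contents, offset = [], 0
    while True:
        js = fetch_page(offset)
        if js["errmsg"] != "ok":
            break
        # general_msg_list is itself a JSON string
        for entry in json.loads(js["general_msg_list"])["list"]:
            contents.append(entry["comm_msg_info"]["content"])
        if not js["can_msg_continue"]:
            break
        offset += 10  # each page holds 10 entries
    return contents
```

The loop avoids hitting Python's recursion limit on accounts with long histories.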

3. Process the data

The use of Scrapy was introduced at the beginning; its pipeline files make it easy to process what the crawler collects.

# Save to a JSON file
self.f.write(json.dumps(dict(item)))

Reference

  • Implement crawler of wechat public number based on Python