A few days ago a friend wanted to scrape Ziroom rental prices near Wangjing, ran into a small problem, and asked me to help analyze it.

1 Analysis

I thought to myself: I've done this kind of thing before, how hard could it be? So I opened a random listing page.

Uh... the price isn't text at all; it's rendered as an image. I had assumed there would be a separate Ajax request to fetch the price information.

Although the price is made up of four <i> tags, they all share the same background image. The width: 20px; height: 30px; is fixed; only the background-position offset into the image differs from digit to digit.
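To make this concrete, here is a minimal sketch of pulling the sprite URL and offset out of one <i> tag's style attribute. The style string below is illustrative (pieced together from how the page behaves), not copied verbatim:

# -*- coding: UTF-8 -*-
import re

# illustrative style attribute of one price <i> tag (not a real value from the page)
style = ('width: 20px; height: 30px; '
         'background-image: url(//example.com/price/1b68fa980af5e85b0f545fccfe2f8af1.png); '
         'background-position: -62.48px')

img_re = re.compile(r'url\((.*?)\)')
pos_re = re.compile(r'background-position:\s*(-?[\d.]+)px')

print(img_re.findall(style)[0])  # sprite URL -> tells us which digit image is in use
print(pos_re.findall(style)[0])  # offset -> tells us which cell of the sprite is shown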

But that doesn't stop me. Let me sort out the approach first:

  • Request the web page
  • Extract the image and the price offset information
  • Crop the image for recognition
  • Get the price data

I happened to be studying CNN-based image recognition recently; for digits this regular, a bit of training should easily push the recognition rate to 100%.

2 Hands-on

No sooner said than done: first find an entry point, then grab a batch of pages.

2.1 Get the original page

Search directly by subway: pick Wangjing East on Line 15, fetch the room listing, and then handle pagination.

Sample code:

# -*- coding: UTF-8 -*-
import os
import time
import random
import requests
from lxml.etree import HTML

__author__ = 'lpe234'

index_url = 'https://www.ziroom.com/z/s100006-t201081/?isOpen=0'

visited_index_urls = set()


def get_pages(start_url: str):
    """
    Crawl a listing page, save every detail page it links to, then follow the pagination links.
    :param start_url:
    :return:
    """
    # deduplicate: skip listing pages we have already visited
    if start_url in visited_index_urls:
        return
    visited_index_urls.add(start_url)
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'
    }
    resp = requests.get(start_url, headers=headers)
    root = HTML(resp.content.decode('utf-8'))
    # links to the detail page of each room
    hrefs = root.xpath('//div[@class="Z_list-box"]/div/div[@class="pic-box"]/a/@href')
    for href in hrefs:
        if not href.startswith('http'):
            href = 'http:' + href.strip()
        print(href)
        parse_detail(href)
    # pagination links, crawled recursively
    pages = root.xpath('//div[@class="Z_pages"]/a/@href')
    for page in pages:
        if not page.startswith('http'):
            page = 'http:' + page
        get_pages(page)


def parse_detail(detail_url: str):
    """
    Fetch one detail page and save its HTML under pages/.
    """
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'
    }
    filename = 'pages/' + detail_url.split('/')[-1]
    if os.path.exists(filename):
        return
    time.sleep(random.randint(1, 5))
    resp = requests.get(detail_url, headers=headers)
    resp_content = resp.content.decode('utf-8')
    with open(filename, 'wb+') as page:
        page.write(resp_content.encode())


if __name__ == '__main__':
    get_pages(start_url=index_url)

A quick crawl of the nearby listings yielded about 600 in total.

2.2 Analyze web pages to get pictures

The logic is simple: walk through all the pages saved earlier, parse out the price image and its offsets, and save the image.

Sample code:

# -*- coding: UTF-8 -*-
import os
import re
from urllib.request import urlretrieve
from lxml.etree import HTML

__author__ = 'lpe234'

poss = list()


def walk_pages():
    """
    Walk every saved page under pages/ and parse it.
    :return:
    """
    for dirpath, dirnames, filenames in os.walk('pages'):
        for page in filenames:
            page = os.path.join('pages', page)
            print(page)
            parse_page(page)


def parse_page(page_path: str):
    """
    Extract the price sprite URL and the per-digit offsets from one saved page.
    """
    with open(page_path, 'rb') as page:
        page_content = ''.join([_.decode('utf-8') for _ in page.readlines()])
    root = HTML(page_content)
    styles = root.xpath('//div[@class="Z_price"]/i/@style')
    pos_re = re.compile(r'background-position:(.*?)px;')
    img_re = re.compile(r'url\((.*?)\);')
    for style in styles:
        style = style.strip()
        print(style)
        pos = pos_re.findall(style)[0]
        img = img_re.findall(style)[0]
        if img.endswith('red.png'):
            continue
        if not img.startswith('http'):
            img = 'http:' + img
        print(f'pos: {pos}, img: {img}')
        save_img(img)
        poss.append(pos)


def save_img(img_url: str):
    img_name = img_url.split('/')[-1]
    img_path = os.path.join('imgs', img_name)
    if os.path.exists(img_path):
        return
    urlretrieve(img_url, img_path)


if __name__ == '__main__':
    walk_pages()
    print(sorted([float(_) for _ in poss]))
    print(sorted(set([float(_) for _ in poss])))

Finally, we have all the price-related image data.

There are 21 pictures in total, among which 20 orange pictures are for ordinary prices and 1 red picture is for special prices.

It looks like we don’t need image recognition anymore. We can just map by image name and offset.
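To build that mapping, each sprite only needs to be looked at once and its digit order written down. One quick way to eyeball them is to crop every sprite into its ten cells. A minimal sketch with Pillow, assuming the cells sit 31.24px apart horizontally (matching the offsets collected above) and that the saved PNGs are at the same scale as the CSS sizes; if they are served at 2x, the crop boxes would need scaling:

# -*- coding: UTF-8 -*-
import os

from PIL import Image

# offset of each digit cell inside the sprite, matching the page's background-position steps
OFFSETS = [0.0, 31.24, 62.48, 93.72, 124.96, 156.2, 187.44, 218.68, 249.92, 281.16]
CELL_W, CELL_H = 20, 30  # visible window of a single <i> tag


def split_sprite(img_path: str, out_dir: str = 'cells'):
    """Crop one price sprite into its 10 digit cells so the digit order can be read off by eye."""
    os.makedirs(out_dir, exist_ok=True)
    sprite = Image.open(img_path)
    name = os.path.splitext(os.path.basename(img_path))[0]
    for idx, offset in enumerate(OFFSETS):
        left = int(round(offset))
        cell = sprite.crop((left, 0, left + CELL_W, CELL_H))
        cell.save(os.path.join(out_dir, f'{name}_{idx}.png'))


if __name__ == '__main__':
    for fname in os.listdir('imgs'):
        split_sprite(os.path.join('imgs', fname))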

2.3 Price analysis

I started to write the recognition code, but it didn't feel right. What kind of recognition is this? It's just a mapping from image name plus offset to digit.

Sample code:

# -*- coding: UTF-8 -*-
import re

from lxml.etree import HTML
import requests

__author__ = 'lpe234'

# digit order of each price sprite, keyed by image file name
PRICE_IMG = {
    '1b68fa980af5e85b0f545fccfe2f8af1.png': [8, 9, 1, 6, 7, 0, 2, 4, 5, 3],
    '4eb5ebda7cc7c3214aebde816b10d204.png': [9, 5, 7, 0, 8, 6, 3, 1, 2, 4],
    '5c6750e29a7aae17288dcadadb5e33b1.png': [4, 5, 9, 3, 1, 6, 2, 8, 7, 0],
    '6f8787069ac0a69b36c8cf13aacb016b.png': [6, 1, 9, 7, 4, 5, 0, 8, 3, 2],
    '7ce54f64c5c0a425872683e3d1df36f4.png': [5, 1, 3, 7, 6, 8, 9, 4, 0, 2],
    '8e7a6d05db4a1eb58ff3c26619f40041.png': [3, 8, 7, 1, 2, 9, 0, 6, 4, 5],
    '73ac03bb4d5857539790bde4d9301946.png': [7, 1, 9, 0, 8, 6, 4, 5, 2, 3],
    '234a22e00c646d0a2c20eccde1bbb779.png': [1, 2, 0, 5, 8, 3, 7, 6, 4, 9],
    '486ff52ed774dbecf6f24855851e3704.png': [4, 7, 8, 0, 1, 6, 9, 2, 5, 3],
    '19003aac664523e53cc502b54a50d2b6.png': [4, 9, 2, 8, 7, 3, 0, 6, 5, 1],
    '93959ce492a74b6617ba8d4e5e195a1d.png': [5, 4, 3, 0, 8, 7, 9, 6, 2, 1],
    '7995074a73302d345088229b960929e9.png': [0, 7, 4, 2, 1, 3, 8, 6, 5, 9],
    '939205287b8e01882b89273e789a77c5.png': [8, 0, 1, 5, 7, 3, 9, 6, 2, 4],
    '477571844175c1058ece4cee45f5c4b3.png': [2, 1, 5, 8, 0, 9, 7, 4, 3, 6],
    'a822d494f1e8421a2fb2ec5e6450a650.png': [3, 1, 6, 5, 8, 4, 9, 7, 2, 0],
    'a68621a4bca79938c464d8d728644642.png': [7, 0, 3, 4, 6, 1, 5, 9, 8, 2],
    'b2451cc91e265db2a572ae750e8c15bd.png': [9, 1, 6, 2, 8, 5, 3, 4, 7, 0],
    'bdf89da0338b19fbf594c599b177721c.png': [3, 1, 6, 4, 7, 9, 5, 2, 8, 0],
    'de345d4e39fa7325898a8fd858addbb8.png': [7, 2, 6, 3, 8, 4, 0, 1, 9, 5],
    'eb0d3275f3c698d1ac304af838d8bbf0.png': [3, 6, 5, 0, 4, 8, 9, 2, 1, 7],
    'img_pricenumber_detail_red.png': [6, 1, 9, 7, 4, 5, 0, 8, 3, 2],
}

# background-position offsets (px) of the ten digit cells
POS_IDX = [0.0, -31.24, -62.48, -93.72, -124.96, -156.2, -187.44, -218.68, -249.92, -281.16]


def parse_price(img: str, pos_list: list):
    price_list = PRICE_IMG.get(img)
    if not price_list:
        raise Exception('img not found. %s' % img)
    step = 1
    price = 0
    # offsets arrive most-significant digit first, so walk them in reverse
    _pos_list = reversed(pos_list)
    for pos in _pos_list:
        price += price_list[POS_IDX.index(float(pos))] * step
        step *= 10
    return price


def parse_page(content: str):
    root = HTML(content)
    styles = root.xpath('//div[@class="Z_price"]/i/@style')
    pos_re = re.compile(r'background-position:(.*?)px;')
    pos_img = re.findall(r'price/(.*?)\);', styles[0])[0]
    poss = list()
    for style in styles:
        style = style.strip()
        pos = pos_re.findall(style)[0]
        poss.append(pos)
    print(pos_img)
    print(poss)
    return parse_price(pos_img, poss)


def request_page(url: str):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'
    }
    resp = requests.get(url, headers=headers)
    resp_content = resp.content.decode('utf-8')
    return resp_content
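A quick usage example, gluing request_page and parse_page together on a single detail page (the URL below is only a placeholder):

if __name__ == '__main__':
    # placeholder; substitute the URL of a real listing page
    detail_url = 'https://www.ziroom.com/x/000000.html'
    content = request_page(detail_url)
    price = parse_page(content)  # prints the sprite name and the offsets, then returns the price
    print(price)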

For convenience, I wrapped this up as a web service. Test endpoint: https://lemon.lpe234.xyz/common/ziru/
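The service code itself isn't shown here; a minimal sketch of how such an endpoint could be wired up with Flask, where the route handler, the url query parameter, and the price_parser module name are my own assumptions rather than the live interface's actual contract:

# -*- coding: UTF-8 -*-
from flask import Flask, jsonify, request

from price_parser import parse_page, request_page  # hypothetical module holding the code above

app = Flask(__name__)


@app.route('/common/ziru/')
def ziru_price():
    # hypothetical contract: ?url=<detail page url>
    detail_url = request.args.get('url', '')
    if not detail_url:
        return jsonify({'error': 'missing url'}), 400
    content = request_page(detail_url)
    return jsonify({'url': detail_url, 'price': parse_page(content)})


if __name__ == '__main__':
    app.run(port=8000)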

3 Summary

I had hoped to show off my newly learned skill in front of my friend, but it came to nothing. Then I got to thinking: since Ziroom plays this trick so fluently, why don't they use a few more images and raise the difficulty?

But if I really want to show off in front of my friend, I'll have to bring out the CNN after all.
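For completeness, a minimal sketch of what that CNN route might look like, assuming the digit cells have already been cropped to 30x20 grayscale images and labeled by hand; the prepared .npy files are my own assumption:

# -*- coding: UTF-8 -*-
import numpy as np
import tensorflow as tf

# x: (n, 30, 20, 1) grayscale digit cells scaled to [0, 1]; y: (n,) integer labels 0-9
x_train = np.load('cells_x.npy')  # hypothetical prepared arrays
y_train = np.load('cells_y.npy')

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation='relu', input_shape=(30, 20, 1)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation='relu'),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=10, validation_split=0.1)

# on digits this regular, accuracy should saturate very quickly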