“This is the 12th day of my participation in the First Challenge 2022. For details: First Challenge 2022.”

Preface

Today I'll show you how to use Python to batch-scrape new house data for Huizhou.

Let's get started.

Development tools

Python version: 3.6.4

Related modules:

requests module;

fake_useragent module;

lxml module;

And some modules that come with Python.

Environment setup

Install Python and add it to the environment variables, then use pip to install the required modules.

Approach analysis

Open the new house listing page and click **"next page"**; the URL becomes:

http://www.fz0752.com/project/list.shtml?state=&key=&qy=&area=&danjia=&func=&fea=&type=&kp=&mj=&sort=&pageNO=2

Clearly this is a static page. The paging parameter is **"pageNO"** and the area parameter is **"qy"**; the other parameters are also easy to understand, and clicking the corresponding filters on the page shows how the URL changes.

By traversing the areas and page numbers, we extract the URL of each house on the new house list, then traverse these URLs again to capture the details of each house.
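Before writing the full crawler, a quick request is an easy way to sanity-check the URL pattern. The snippet below is only a minimal sketch: the area ID 46 (Huicheng District, per the mapping used later in this article) and the User-Agent string are placeholders.

import requests

# Minimal sanity check of the list-page URL pattern (qy = area ID, pageNO = page number).
url = 'http://www.fz0752.com/project/list.shtml?state=&key=&qy=46&area=&danjia=&func=&fea=&type=&kp=&mj=&sort=&pageNO=2'
resp = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=5)
print(resp.status_code)    # 200 means the listing page came back
print(len(resp.text))      # a non-trivial length suggests real listing HTML was returned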

Detail page analysis

Select a new house URL and click on it. The link is as follows:

http://newhouse.fz0752.com/fontHtml/html/project/00020170060.html

That is, the ID of this new house is **"00020170060"**. Clicking "details" changes the link to:

http://newhouse.fz0752.com/project/detail.shtml?num=20170060

That is, the details ID of this new house is **"20170060"**, so we can reasonably guess that the details ID is a slice of the new house ID. Clicking through a few more new houses easily verifies this rule.
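As a minimal illustration of that rule (assuming the details ID really is just the new house ID with its leading zeros removed), the following sketch derives one from the other:

# Assumption based on the observation above: details ID = new house ID without leading zeros.
list_id = '00020170060'                      # from .../fontHtml/html/project/00020170060.html
detail_id = list_id.lstrip('0')              # -> '20170060'
detail_url = f'http://newhouse.fz0752.com/project/detail.shtml?num={detail_id}'
print(detail_url)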

Anti-crawling analysis

Frequent requests to the same webpage from a single IP address may get that IP blocked. Here, fake_useragent is used to randomly generate User-Agent request headers for each visit, reducing the risk of being blocked.
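As a small sketch of the idea, fake_useragent can hand out a different User-Agent string on every call:

from fake_useragent import UserAgent

ua = UserAgent()                          # builds a pool of real browser User-Agent strings
for _ in range(3):
    headers = {'User-Agent': ua.random}   # pick a random User-Agent each time
    print(headers['User-Agent'])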

Code implementation

Import the crawler libraries and define a main function. Build a list of area IDs (each area has a different ID), then use requests to traverse the URLs spliced from the area parameter and the page number parameter. The page number upper limit is set to 50 here. When the house URL extracted from a page has length 0 (that is, there is no more new house data), the loop breaks and moves on to the next area, until all data has been captured and the program stops.

# -*- coding: utf-8 -*-
# @Time   : 2020/12/21 9:29 PM
# @Author : J
# @File   : newhouse.py

import csv
import time
import random
import requests
import traceback
from lxml import etree
from fake_useragent import UserAgent

def main():
    # 46: Huicheng District  47: Zhongkai District  171: Huiyang District  172: Daya Bay
    # 173: Boluo County  174: Huidong County  175: Longmen County
    qy_list = [46, 47, 171, 172, 173, 174, 175]
    for qy in qy_list:  # traverse the areas
        for page in range(1, 50):  # traverse the page numbers
            url = f'http://www.fz0752.com/project/list.shtml?state=&key=&qy={qy}&area=&danjia=&func=&fea=&type=&kp=&mj=&sort=&pageNO={page}'
            response = requests.request("GET", url, headers=headers, timeout=5)
            print(response.status_code)
            if response.status_code == 200:
                re = response.content.decode('utf-8')
                print("Extracting area " + str(qy) + ", page " + str(page))
                # time.sleep(random.uniform(1, 2))
                print("-" * 80)
                # print(re)
                parse = etree.HTML(re)
                get_href(parse, qy)
                num = ' '.join(parse.xpath('//*[@id="parent-content"]/div/div[6]/div/div[1]/div[2]/div[1]/div[2]/div[1]/div[1]/a/@href'))
                print(len(num))
                if len(num) == 0:
                    break

if __name__ == '__main__':
    ua = UserAgent(verify_ssl=False)
    headers = {"User-Agent": ua.random}
    time.sleep(random.uniform(1, 2))
    main()

Send a request to get the new house list page and parse out all the new house URLs, replacing each new house ID with its details ID. While running the program, a few new house URLs were found to have a different format, so a judgment is added here. After this adjustment, we can obtain the complete details ID and splice together the corresponding URL.

def get_href(parse, qy):
    items = parse.xpath('//*[@id="parent-content"]/div/div[6]/div/div[1]/div[2]/div')
    try:
        for item in items:
            href = ' '.join(item.xpath('./div[2]/div[1]/div[1]/a/@href')).strip()
            print("Initial href:", href)
            # print(len(href))
            if len(href) > 25:
                href1 = 'http://newhouse.fz0752.com/project/detail.shtml?num=' + href[52:].replace(".html", "")
            else:
                href1 = 'http://newhouse.fz0752.com/project/detail.shtml?num=' + href[15:]
            print("Href:", href1)
            try:
                get_detail(href1, qy)
            except:
                pass
    except Exception:
        print(traceback.print_exc())

Results:

Once the details URL is obtained, define a function to request the detail page data, passing the qy parameter along, and finally save everything to CSV.

def get_detail(href1, qy):
    time.sleep(random.uniform(1, 2))
    response = requests.get(href1, headers=headers, timeout=5)
    if response.status_code == 200:
        source = response.text
        html = etree.HTML(source)

Now parse each field on the detail page. XPath is used for data parsing here. Since there are many fields to parse, as many as 41, only part of the field parsing code is given below for space reasons; the parsing of the other fields is basically the same.

# Project status
try:
    xmzt = html.xpath('//*[@id="parent-content"]/div/div[3]/div[3]/div[1]/div[1]/text()')[0].strip()
except:
    xmzt = None
# Project name
try:
    name = html.xpath('//*[@id="parent-content"]/div/div[3]/div[3]/div[1]/h1/text()')[0].strip()
except:
    name = None
# Project introduction
ps = html.xpath('//*[@id="parent-content"]/div/div[3]/div[5]/div[2]/div')
for p in ps:
    try:
        xmjj = p.xpath('./p[1]/text()')[0].strip()
    except:
        xmjj = None
infos = html.xpath('//*[@id="parent-content"]/div/div[3]/div[5]/div[1]/div/table/tbody')
for info in infos:
    # Administrative area
    try:
        xzqy = info.xpath('./tr[1]/td[1]/text()')[0].strip()
    except:
        xzqy = None
    # Property type
    try:
        wylx = info.xpath('./tr[2]/td[1]/text()')[0].strip()
    except:
        wylx = None
    # Selling price
    try:
        xsjg = info.xpath('./tr[3]/td[1]/text()')[0].strip()
    except:
        xsjg = None
    # ...... (the remaining fields are parsed the same way)
    data = {
        'xmzt': xmzt,
        'name': name,
        'xzqy': xzqy,
        # ...... (the remaining fields)
        'qy': qy
    }
    print(data)

After parsing the data, place it in a dictionary and print the following:

Then append and save to CSV:

try:
    with open('hz_newhouse.csv', 'a', encoding='utf_8_sig', newline='') as fp:
        # "..." stands for the remaining field names, omitted here for space
        fieldnames = ['xmzt', 'name', 'xzqy', ..., 'qy']
        writer = csv.DictWriter(fp, fieldnames=fieldnames)
        writer.writerow(data)
except Exception:
    print(traceback.print_exc())

Of course, we can also read the CSV file back and write it to Excel (since no header row was written above, the column names are supplied via the names parameter):

import pandas as pd

# "..." again stands for the remaining column names
df = pd.read_csv("newhouse.csv", names=['name', 'xzqy', 'wylx', ..., 'state'])
df = df.drop_duplicates()
df.to_excel("newhouse.xlsx", index=False)