Introducing the problem

The previous section gave a brief introduction to single-page crawling: the request module uses urllib and the parse module uses BeautifulSoup. It fetches a single Lianjia page, extracts the page elements we want, and prints them to the console. When we open a page in a browser, the URL comes from one of two sources: a known address, such as Google's home page, or an entry obtained from a previous page, such as the href attribute value of an <a> tag. Crawler automation obtains the same data the browser can, by simulating the browser's behavior: collect more entry URLs, then process the pages we actually need.
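
As an illustration of the second source, the sketch below pulls candidate entry URLs out of the href attributes of <a> tags on an already-downloaded page. The filter is a placeholder; adjust it to match the real page structure.

from bs4 import BeautifulSoup

def collect_entries(plain_text):
    # Parse the HTML and collect every href found in an <a> tag.
    soup = BeautifulSoup(plain_text, 'html.parser')
    entries = []
    for a in soup.find_all('a'):
        href = a.get('href')
        # Keep only absolute links; a real crawler would filter by site/path.
        if href and href.startswith('http'):
            entries.append(href)
    return entries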

Here are a few questions, and better solutions are welcome.

The simple single-page crawler from article 01 crawls a single page of community (xiaoqu) information; using its example code as a reference, you can build a small case around for-sale listings and transaction records. Is there a better way to request and process web pages? (A sketch follows these questions.)

The simple single-page crawler from article 01 only prints to the console. How should the subsequent storage module be organized, and how can it be combined with multi-threaded or multi-process crawling? (A thread-safe storage sketch follows the batch-crawl code below.)
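
On the first question, one widely used alternative (my suggestion, not something this series prescribes) is the requests library, which handles headers, timeouts and status checking more conveniently than urllib2. The HDS header pool is assumed to be the same one used elsewhere in the series.

import random
import requests
from bs4 import BeautifulSoup

def fetch_page(url, HDS):
    # Fetch a page with a randomly chosen header set and return parsed soup,
    # or None if the request fails.
    try:
        resp = requests.get(url, headers=random.choice(HDS), timeout=10)
        resp.raise_for_status()
        resp.encoding = 'utf-8'  # Lianjia pages are UTF-8
        return BeautifulSoup(resp.text, 'html.parser')
    except requests.RequestException as e:
        print(e)
        return None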

Introducing the approach

As mentioned earlier, whether for community (xiaoqu) information, for-sale listings, or transaction records, the results run to more than 100 pages, so we take an indirect route: first crawl the community information, then loop over each community to fetch its for-sale listings and transaction records. While crawling the community information we also loop over the administrative districts (counties), simulate the browser clicking through the pagination to gather more entry URLs, and then fetch and parse each page. The storage module is still kept minimal: output goes to the console.

Code example: getting more entry URLs

# -*- coding: utf-8 -*-
import random
import threading
import urllib2

from bs4 import BeautifulSoup

# HDS (the pool of request-header dicts) and xiaoqu_crawl (the single-page
# crawler) are defined in the previous article of this series.

def xiaoqu_batch_crawl(db_xiaoqu=None, region=u'天河'):
    try:
        # Tianhe, UTF-8 percent-encoded: %E5%A4%A9%E6%B2%B3
        url_region = u'http://gz.lianjia.com/xiaoqu/rs' + region + '/'
        req = urllib2.Request(url_region,
                              headers=HDS[random.randint(0, len(HDS) - 1)])
        source_code = urllib2.urlopen(req, timeout=10).read()
        plain_text = source_code.decode('utf-8')
        soup = BeautifulSoup(plain_text, 'html.parser')
    except (urllib2.HTTPError, urllib2.URLError), e:
        print e
        return
    except Exception, e:
        print e
        return
    # The pager div's page-data attribute holds a small dict literal with the
    # total page count; exec rebinds page_data to that dict.
    page_data = 'page_data = ' + soup.find('div', attrs={
        'class': 'page-box house-lst-page-box'}).get('page-data')
    exec (page_data)
    total_pages = page_data['totalPage']
    threads = []
    # One thread per result page, each running the single-page crawler.
    for page in range(total_pages):
        url_page = u'http://gz.lianjia.com/xiaoqu/pg%drs%s' % (page + 1, region)
        t = threading.Thread(target=xiaoqu_crawl, args=(db_xiaoqu, url_page))
        threads.append(t)
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print u'Finished crawling all community (xiaoqu) information in %s' % region


if __name__ == '__main__':
    xiaoqu_batch_crawl(db_xiaoqu=None, region=u'%E5%A4%A9%E6%B2%B3')
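
On the storage question, a minimal sketch, assuming SQLite as the store, is to let the worker threads share a single writer guarded by a lock. The table name and columns below are placeholders, not the series' actual schema.

import sqlite3
import threading

write_lock = threading.Lock()

def save_rows(db_path, rows):
    # Called from worker threads; the lock serializes writes so each call can
    # safely open its own short-lived SQLite connection.
    with write_lock:
        conn = sqlite3.connect(db_path)
        conn.executemany(
            'INSERT INTO xiaoqu (name, price, deal_count) VALUES (?, ?, ?)',  # placeholder schema
            rows)
        conn.commit()
        conn.close()

Each crawler thread would then call save_rows() with the rows it parsed instead of printing them; a multi-process variant would replace the lock with a queue feeding a single writer process.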

Problem to be solved

Error: 'ascii' codec can't encode characters in position 14-15: ordinal not in range(128). The quick fix that turns up in searches is:

import sys

reload(sys)                      # re-expose sys.setdefaultencoding (Python 2 hides it after startup)
sys.setdefaultencoding("utf-8")  # make implicit str/unicode conversions use UTF-8

But the question is: why reload(sys)?

Googling around says: reload(sys) is evil.
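
A cleaner alternative (my suggestion, not part of the original series) is to avoid the implicit conversion altogether: keep the region name as Unicode and percent-encode its UTF-8 bytes explicitly before building the URL, which produces exactly the %E5%A4%A9%E6%B2%B3 form passed in the __main__ call above.

import urllib

def build_region_url(region):
    # Percent-encode the UTF-8 bytes so the URL is pure ASCII;
    # u'天河' becomes '%E5%A4%A9%E6%B2%B3'.
    quoted = urllib.quote(region.encode('utf-8'))
    return 'http://gz.lianjia.com/xiaoqu/rs' + quoted + '/'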


You might also want to read:

Hadoop/CDH

Hadoop in Practice (1): Building a Hadoop 2.x pseudo-distributed environment on Aliyun

Hadoop in Practice (2): Deploying Hadoop in fully distributed mode on virtual machines

Hadoop in Practice (3): Building fully distributed CDH on virtual machines

Hadoop in Practice (4): Hadoop cluster management and resource allocation

Hadoop in Practice (5): Hadoop operations and maintenance experience

Hadoop in Practice (6): Building an Eclipse development environment for Apache Hadoop

Hadoop in Practice (7): Installing and configuring Hue on Apache Hadoop

Hadoop in Practice (8): Adding the Hive service and Hive basics on CDH

Hadoop in Practice (9): Hive and UDF development

Hadoop in Practice (10): Encapsulating a Sqoop import and extraction framework


The WeChat official account “Data Analysis” shares notes on the self-cultivation of data scientists. Now that we have met, let's grow together.

For reprints, please credit the WeChat official account “Data Analysis”.


Reader Telegram group:

https://t.me/sspadluo