I. Foreword

1. Introduction

The main reasons for using Python for crawlers are: the language is simple and easy to use, and it has many convenient crawler libraries, such as urllib. Besides crawling data, Python can also process images, process data, export Excel tables, and so on.



2. Install Python

A Python interpreter (Python 2.x) is generally installed by default on Apple systems. The code in this article is Python 2.7. If you need Python 3.x, or want to run on Windows, you'll need to install Python yourself.

II. The crawler

This article implements, as an example, a crawler that scrapes the pictures from a web page.

Preliminary knowledge:

- Python basics
- Usage of the urllib library in Python
- Regular expressions
- Usage of the re library in Python

1. Python basics

2. # is the comment symbol.

3. Python does not use curly braces {}; code blocks are delimited by indentation.

```python
# e.g.
for url in urls:
    print url
```


4. Function-call syntax is similar to JavaScript's:

```python
# e.g.
func(parameter)
```

5. Run Python

Enter:

```shell
python **.py
```


If using Python 3.x, type:

```shell
python3 **.py
```


The urllib library in Python

urlopen and read

urlopen: opens the web page at a URL. read: reads the content of that page.

```python
# e.g.
import urllib

url = 'http://www.thejoyrun.com'
page = urllib.urlopen(url)
html = page.read()
print html
```


This gets the source code of the web page. The core of this article's crawler is extracting the image links from that source code with regular expressions.
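The snippet above is Python 2. On Python 3, urlopen moved to urllib.request, and read() returns bytes that must be decoded explicitly; a minimal sketch, assuming the page is UTF-8 encoded:

```python
import urllib.request

def get_html(url):
    """Fetch a page and return its source as text (assumes UTF-8 encoding)."""
    with urllib.request.urlopen(url) as page:
        return page.read().decode('utf-8')

# html = get_html('http://www.thejoyrun.com')  # requires network access
```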

urlretrieve

urlretrieve: downloads the file at a URL and saves it locally.

```python
# e.g.
import datetime
import urllib

# x is a numeric counter used in the saved filename (see the complete code below)
urllib.urlretrieve('http://img.mm522.net/flashAll/20120502/1335945502hrbQTb.jpg',
                   '%s %s.jpg' % (datetime.datetime.now(), x))
```
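On Python 3, the same call lives at urllib.request.urlretrieve. A minimal sketch, assuming the helper name download and a timestamp-based default filename (both are illustrative, and the image URL from the article may no longer be reachable):

```python
import datetime
import urllib.request

def download(url, dest=None):
    """Save the resource at url to dest; returns the local filename."""
    if dest is None:
        # timestamped name, mirroring the article's '%s %s.jpg' pattern
        dest = '%s.jpg' % datetime.datetime.now().strftime('%Y%m%d%H%M%S')
    filename, headers = urllib.request.urlretrieve(url, dest)
    return filename

# download('http://img.mm522.net/flashAll/20120502/1335945502hrbQTb.jpg')
```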

Python regular expressions

Use \d to match a digit, . to match any character, and \s to match a whitespace character; * means any number of repetitions (including zero), and + means at least one.
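These metacharacters can be tried out directly with the re library (Python 3 print syntax here; outputs shown in comments):

```python
import re

print(re.match(r'\d+', '2017abc').group())  # '2017' -- \d+ grabs the leading digits
print(bool(re.match(r'a.c', 'abc')))        # True   -- . matches any character
print(re.split(r'\s+', 'a  b c'))           # ['a', 'b', 'c'] -- \s matches whitespace
print(bool(re.match(r'ab*c$', 'ac')))       # True   -- * allows zero repetitions
print(bool(re.match(r'ab+c$', 'ac')))       # False  -- + requires at least one
```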

Use of the re library in Python

split

Splitting a string with a regular expression yields a list (a mutable array).

```python
# e.g.
import re

testStr = 'http://www.thejoyrun.com'
print re.split(r'\.', testStr)
# ['http://www', 'thejoyrun', 'com']
```

match

Matches a string against a regular expression, returning a Match object if the match succeeds and None otherwise.

```python
# e.g.
import re

testStr = 'http://www.thejoyrun.com'
if re.match(r'http.*com', testStr):
    print 'ok'
else:
    print 'failed'
```

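One detail worth knowing before the complete code: re.match only anchors at the start of the string, while re.search scans the whole string; a Match object's group() returns the matched text. A quick Python 3 illustration (the sample line here is made up):

```python
import re

line = 'see http://www.thejoyrun.com for details'

print(re.match(r'http.*com', line))  # None -- the line doesn't start with 'http'
m = re.search(r'http\S+com', line)   # search scans the whole string
print(m.group())                     # 'http://www.thejoyrun.com'
```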

The complete code

```python
#coding=utf-8
import urllib
import re
import datetime

def getHtml(url):
    page = urllib.urlopen(url)
    html = page.read()
    return html

def getImg(html):
    # splitReg = r'[\s\"\,\, \']+'
    splitReg = r'[\s\"]+'
    tempList = re.split(splitReg, html)  # get a list (array)
    imgUrls = []  # empty list
    x = 0
    for s in tempList:
        matchReg = r'http:.*\.jpg'
        if re.match(matchReg, s):
            print '%s--' % x + s
            imgUrls.append(s)
            x = x + 1
            urllib.urlretrieve(s, '%s %s.jpg' % (datetime.datetime.now(), x))
        matchReg1 = r'http:.*\.png'
        if re.match(matchReg1, s):
            print '%s--' % x + s
            imgUrls.append(s)
            x = x + 1
            urllib.urlretrieve(s, '%s %s.png' % (datetime.datetime.now().date(), x))
    return imgUrls

html = getHtml("url")  # replace "url" with the page to crawl
print html
getImg(html)
```

The URL the author opened was a Bing image-search page: http://cn.bing.com/images/search?q=…

Here are screenshots of the opened web page:

![screenshot.png](http://upload-images.jianshu.io/upload_images/1819750-5e75cfdb6fd6d640.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

![Download to local screenshot.png](http://upload-images.jianshu.io/upload_images/1819750-0b046c6fdca2c29c.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

We can use the urllib library combined with re to write some entry-level crawlers. If you need a more powerful, more fully featured crawler, you need a crawler framework such as Scrapy: a fast, high-level screen-scraping and web-crawling framework developed in Python, used to crawl web sites and extract structured data from pages. It can be used in a range of applications, including data mining, information processing, and storing historical data.
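The complete script above is Python 2 and downloads every matching image as it goes. As a rough Python 3 sketch, the extraction step can be separated from the download so it is testable on a plain string (the function name get_img_urls and the sample HTML are illustrative, not from the original):

```python
import re

def get_img_urls(html):
    """Split the page source on whitespace/quotes and keep .jpg/.png links."""
    img_urls = []
    for token in re.split(r'[\s"]+', html):
        if re.match(r'http:.*\.(jpg|png)$', token):
            img_urls.append(token)
    return img_urls

sample = '<img src="http://example.com/a.jpg"> <img src="http://example.com/b.png">'
print(get_img_urls(sample))  # ['http://example.com/a.jpg', 'http://example.com/b.png']
```

Each returned URL could then be passed to urllib.request.urlretrieve to save it, as in the Python 2 script.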

Author: alanwang. Original link: www.jianshu.com/p/e0eb4635a… Copyright belongs to the author. For commercial reprint, please contact the author for authorization; for non-commercial reprint, please indicate the source.