Define the framework

Main functions: 1. Build the initial URL 2. Get the data 3. Save the data

The overall flow:

from bs4 import BeautifulSoup           # parse web pages
import re                               # regular-expression extraction
import urllib.request, urllib.error     # fetch web pages, handle errors
import xlwt                             # Excel table processing


def main():
    baseurl = ""                        # the initial URL
    datalist = getData(baseurl)         # get the information before saving
    savepath = ".xls"                   # where to save the result
    saveData(datalist, savepath)        # save the information

def getData(baseurl):                   # crawl and parse the pages
    datalist = []
    return datalist

def saveData(datalist, savepath):       # save the information
    pass

def askURL(url):                        # fetch a single page
    html = ""
    return html


if __name__ == "__main__":              # entry point, equivalent to main() in C
    main()

Extraction of web pages

The modularized fetch with urllib.request is as follows:

def askURL(url):
    # disguised request header
    head = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3861.400 QQBrowser/10.7.4313.400"
    }
    request = urllib.request.Request(url, headers=head)
    html = ""    # Python infers the string type automatically when you use it
    try:
        response = urllib.request.urlopen(request)
        html = response.read().decode("utf-8")
        #print(html)
    except urllib.error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
    return html

head is where we disguise the request we send to the page. Note that it is a dictionary of key-value pairs, and it is usually enough to disguise User-Agent.

request = urllib.request.Request(url,headers = head)

This builds the Request to send; you can read the source of urllib.request.Request for details, which we won't expand on here.
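As an aside, a minimal sketch of inspecting the request object we just built; note that urllib stores header keys capitalized, hence 'User-agent':

print(request.full_url)                  # the target URL
print(request.get_header('User-agent'))  # the disguised header we attached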

 response = urllib.request.urlopen(request)
 html = response.read().decode("utf-8")

response is the returned information; urlopen opens the page so its HTML can be read. At this point print(html) can print out the information, and you can see the tree-structured HTML of the page.
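As a quick sanity check, here is a minimal sketch of calling askURL; the URL is an assumption based on the Douban Top250 example used later in this article:

html = askURL("https://movie.douban.com/top250?start=0")   # assumed start page
print(html[:300])   # print the first 300 characters to confirm the fetch worked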

Parse web pages

# crawl the pages
def getData(baseurl):
    datalist = []
    for i in range(0, 10):                  # call the page-fetching function 10 times
        url = baseurl + str(i * 25)
        html = askURL(url)                  # save the information obtained from the page

        # parse the page
        soup = BeautifulSoup(html, "html.parser")
        for item in soup.find_all('div', class_="item"):
            data = []                       # save all information about one movie
            item = str(item)
            #print(item)

            link = re.findall(findLink, item)[0]
            data.append(link)               # add the detail-page link
            imgSrc = re.findall(findImgSrc, item)[0]
            data.append(imgSrc)             # add the poster image

            titles = re.findall(findTitle, item)
            if len(titles) == 2:
                ctitle = titles[0]          # Chinese title
                data.append(ctitle)
                otitle = titles[1].replace("/", "")   # foreign title
                data.append(otitle)
            else:
                data.append(titles[0])
                data.append(' ')            # no foreign title, leave it blank

            rating = re.findall(findRating, item)[0]
            data.append(rating)
            judgeNum = re.findall(findJudge, item)[0]
            data.append(judgeNum)

            inq = re.findall(findInq, item)
            if len(inq) != 0:
                inq = inq[0].replace(",", "")
                data.append(inq)
            else:
                data.append("")

            bd = re.findall(findBd, item)[0]
            bd = re.sub(r'<br(\s+)?/>(\s+)?', "", bd)
            bd = re.sub('/', "", bd)
            data.append(bd.strip())

            datalist.append(data)           # store one processed movie
    #print(datalist)
    return datalist
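Putting it together, main() would call getData with the start-of-list URL. The baseurl below is an assumption consistent with the str(i*25) paging in the loop:

baseurl = "https://movie.douban.com/top250?start="   # assumed; i*25 is appended per page
datalist = getData(baseurl)
print(len(datalist))   # 250 if all ten pages of 25 movies parsed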

I use BeautifulSoup here; other parsers behave a little differently, so just pick whichever is convenient.

soup = BeautifulSoup(html, "html.parser")

This parses the fetched page with html.parser. Meanwhile, we can print out page information, such as:

print(type(soup.head))
print(soup.title)
print(soup.title.string)   # .string prints only the text

.title, .a, and .head are tags in the HTML page. You can also use

print(soup.a.attrs)

to get attrs, the tag's attributes as a dictionary. You can use print(type(...)) to find the exact type of each part.
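For instance, a minimal sketch on a hand-written snippet (the HTML here is made up purely for illustration):

from bs4 import BeautifulSoup

doc = '<html><head><title>Demo</title></head>' \
      '<body><a href="https://example.com" id="top">link</a></body></html>'
soup = BeautifulSoup(doc, "html.parser")
print(soup.a.attrs)             # {'href': 'https://example.com', 'id': 'top'} -- a dict
print(type(soup.a))             # <class 'bs4.element.Tag'>
print(type(soup.title.string))  # <class 'bs4.element.NavigableString'>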

Searching the document

1. find_all()

(1) String filtering: searches for content that exactly matches the string.
list = soup.find_all("a")
print(list)
(2) Regular-expression search: matches using search().
import re
list = soup.find_all(re.compile("a"))

2. The kwargs (keyword) parameter

list = soup.find_all(id="head")

3. The text parameter

list = soup.find_all(text="hao123")

That is, searching by the text content of the tags.
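A combined sketch of the three search styles on one made-up snippet (the HTML and the "hao123" text are invented to mirror the examples above):

from bs4 import BeautifulSoup
import re

doc = '<div id="head"><a href="https://www.hao123.com">hao123</a><span>other</span></div>'
soup = BeautifulSoup(doc, "html.parser")

print(soup.find_all("a"))              # string filter: only <a> tags
print(soup.find_all(re.compile("a")))  # regex: any tag whose name contains "a" (<a> and <span>)
print(soup.find_all(id="head"))        # kwargs: tags with the attribute id="head"
print(soup.find_all(text="hao123"))    # text: strings equal to "hao123"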

Regular extraction

I'll write a separate blog post on regular expressions, so I won't expand on them here; just remember how to use them. Each web page needs different things extracted. Taking Douban Top250 as the example, we can see from the page HTML that the information we want is inside the item div.

soup = BeautifulSoup(html, "html.parser")
for item in soup.find_all('div', class_="item"):
    data = []           # save all information about one movie
    item = str(item)    # force conversion to string

All the information we need is now in item. Inside this for loop, we use regular expressions on item to extract what we want and put it into data. Say we're looking for a movie's link:

link = re.findall(findLink, item)[0]

[0] is required to fetch the first link. Now let’s define findLink as a global variable, i.e.

findLink = re.compile(r'<a href="(.*?)">')   # create the regular-expression rule
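getData above also references findImgSrc, findTitle, findRating, findJudge, findInq, and findBd. Only findLink appears in this article, so the rest below are one plausible set of rules for the Douban Top250 item HTML, written as assumptions to be checked against the actual page:

findLink = re.compile(r'<a href="(.*?)">')                  # detail-page link (from the article)
findImgSrc = re.compile(r'<img.*src="(.*?)"', re.S)         # poster image URL (assumed)
findTitle = re.compile(r'<span class="title">(.*)</span>')  # movie title (assumed)
findRating = re.compile(r'<span class="rating_num" property="v:average">(.*)</span>')  # rating (assumed)
findJudge = re.compile(r'<span>(\d*)人评价</span>')          # number of raters (assumed)
findInq = re.compile(r'<span class="inq">(.*)</span>')      # one-line summary (assumed)
findBd = re.compile(r'<p class="">(.*?)</p>', re.S)         # director/year details (assumed)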

These regular rules extract the information from the HTML of the web page, and then we use

data.append(link)

to add the information to data. At the end of the loop,

datalist.append(data)   # store one processed movie's information
return datalist

This is the general extraction process; different information requires attention to different details, and saving the information needs further explanation. That's all for today, and we'll continue next time.