Python Crawler Note 1 (from MOOC)

Tip: This article records code that I typed out myself while studying a course on China University MOOC. It is purely a record. If you happen to be taking the same course, the code is easy to copy and run on your own computer.

Course: Beijing Institute of Technology, Songtian, Python Crawler and Information Extraction


Tip: typing out a lot of code yourself helps you learn the logic of the language!


Preface

General code framework:

import requests


def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        # Raise requests.HTTPError if the status code signals an error (4xx or 5xx)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return "An exception occurred"


if __name__ == "__main__":
    url = "http://www.baidu.com"
    print(getHTMLText(url))

The examples are all from this week's lessons.


Tip: The code and its running results follow.

Part One: Code given in teacher Songtian's courseware

1. Crawling a JD.com (Jingdong) product page

The code is as follows:

import requests
url = "https://item.jd.com/2967929.html"
try:
    r = requests.get(url)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    # Print only the first 1000 characters of the page
    print(r.text[:1000])
except requests.RequestException:
    print("Crawl failed")

Running results: Process finished with exit code 0

2. Crawling an Amazon product page

The code is as follows:

import requests
url = "https://www.amazon.cn/gp/product/B01M8L5Z3Y"
try:
    # Send a browser-style User-Agent header with the request
    kv = {'user-agent': 'Mozilla/5.0'}
    r = requests.get(url, headers=kv)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[1000:2000])
except requests.RequestException:
    print("Crawl failed")

Running results:

  ue_sid = (document.cookie.match(/session-id=([0-9-]+)/) || [])[1],
        ue_sn = "opfcaptcha.amazon.cn",
        ue_id = 'FNY2VQ38P3R6JETHXGX2';
}
</script>
</head>
<body>
<!-- To discuss automated access to Amazon data please contact [email protected]. For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.com.cn/index.html/ref=rm_c_sv, or our Product Advertising API at https://associates.amazon.cn/gp/advertising/api/detail/main.html/ref=rm_c_ac for advertising use cases. -->
<!-- Correios.DoNotSend -->
<div class="a-container a-padding-double-large" style="min-width:350px; padding:44px 0 !important">
<div class="a-row a-spacing-double-large" style="width: 350px; margin: 0 auto">
<div class="a-row a-spacing-medium a-text-center"><i class="a-icon a-logo">
<div class="a-box a-alert a-alert-info a-spacing-base">
<div class="a-box-inner">

Process finished with exit code 0
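The `headers=kv` argument replaces the User-Agent that requests would otherwise send. A minimal sketch of the difference, which only prepares the two requests through a `requests.Session` and never sends them; the variable names `default_request` and `custom_request` are my own, not from the course:

```python
import requests

url = "https://www.amazon.cn/gp/product/B01M8L5Z3Y"
kv = {"user-agent": "Mozilla/5.0"}

session = requests.Session()

# Request as it would be prepared when no headers argument is given
default_request = session.prepare_request(requests.Request("GET", url))
print(default_request.headers["User-Agent"])

# Request as it would be prepared when kv is passed as headers
custom_request = session.prepare_request(requests.Request("GET", url, headers=kv))
print(custom_request.headers["User-Agent"])
```

The first line printed starts with `python-requests/` followed by the installed version; the second is the value taken from `kv`.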

3. Submitting a search keyword to Baidu / 360

Baidu code is as follows:

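import requests
keyword = "Python"
try:
    # Baidu expects the search keyword in the 'wd' query parameter
    kv = {'wd': keyword}
    r = requests.get("http://www.baidu.com/s", params=kv)
    # Show the URL that was actually requested
    print(r.request.url)
    r.raise_for_status()
    print(len(r.text))
except requests.RequestException:
    print("Crawl failed")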

Running results:

http://www.baidu.com/s?wd=Python
660082
Process finished with exit code 0

360 code is as follows:

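import requests
keyword = "Python"
try:
    # 360 expects the search keyword in the 'q' query parameter
    kv = {'q': keyword}
    r = requests.get("http://www.so.com/s", params=kv)
    # Show the URL that was actually requested
    print(r.request.url)
    r.raise_for_status()
    print(len(r.text))
except requests.RequestException:
    print("Crawl failed")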

Running results:

https://www.so.com/s?q=Python
327996
Process finished with exit code 0
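The `params` argument in both snippets is what turns the dictionary into the query string seen in the printed URL. A minimal sketch of that step using only the standard library; `build_search_url` is a hypothetical helper of mine, and requests performs its own encoding internally rather than calling anything by that name:

```python
from urllib.parse import urlencode


def build_search_url(base_url, params):
    """Append URL-encoded query parameters to base_url (illustrative helper)."""
    return base_url + "?" + urlencode(params)


# 'wd' is the keyword parameter used for Baidu and 'q' the one used for 360
print(build_search_url("http://www.baidu.com/s", {"wd": "Python"}))
print(build_search_url("https://www.so.com/s", {"q": "Python"}))
```

Keywords containing spaces or non-ASCII characters are percent-encoded by `urlencode`, which is why it is safer than concatenating the keyword into the URL by hand.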

4. Crawling and saving an image from the web

The code is as follows:

import requests
import os

url = "https://imgsa.baidu.com/forum/w%3D580/sign=dc59751a6181800a6ee58906813433d6/5c40b4003af33a87e4518c8fcb5c10385243b5e4.jpg"
# Folder to save into; change it to a folder on your own computer
root = "C://Users// Saved Pictures//"
# Use the last segment of the URL as the file name
path = root + url.split('/')[-1]
try:
    if not os.path.exists(root):
        os.mkdir(root)
    if not os.path.exists(path):
        r = requests.get(url)
        r.raise_for_status()
        # r.content holds the raw bytes of the response, so write in binary mode
        with open(path, 'wb') as f:
            f.write(r.content)
        print("File saved successfully")
    else:
        print("File already exists")
except (requests.RequestException, OSError):
    print("Crawl failed")

Running results:

File saved successfully
Process finished with exit code 0
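The local file name in the code above is simply the last segment of the image URL. A minimal sketch of that step on its own, using only the standard library; `local_path_for`, the example URL, and the folder name are hypothetical and chosen only for illustration. Unlike `url.split('/')[-1]`, this version drops any query string before taking the last segment:

```python
import os
from urllib.parse import urlsplit


def local_path_for(url, root):
    """Return root joined with the last path segment of url (illustrative helper)."""
    filename = urlsplit(url).path.split("/")[-1]
    return os.path.join(root, filename)


# Hypothetical URL and folder, used only for illustration
print(local_path_for("https://example.com/images/photo.jpg?size=large", "pictures"))
```

`os.path.join` inserts the correct separator for the operating system, so `root` does not need a trailing slash.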

5. Automatically looking up an IP address

The code is as follows:

import requests
url = "http://m.ip138.com/ip.asp?ip="
try:
    r = requests.get(url + '202.204.80.112')
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    # Print only the last 500 characters of the page
    print(r.text[-500:])
except requests.RequestException:
    print("Crawl failed")

Running results:

Crawl failed
Process finished with exit code 0

Part Two: Problems I ran into and what I tried

1. Automatic IP address lookup

I don't think the problem lies in the code itself. The program runs to completion (exit code 0), but it prints "Crawl failed", which means an exception was raised inside the try block. My guess is that the response status was not 200. The except clause also catches other failures, such as a connection error, so the message alone cannot confirm this.
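One way to find out which kind of failure occurred is to catch the exception object and inspect it instead of printing a fixed message. A minimal sketch, assuming the exception classes that requests exposes at the top level (`HTTPError`, `Timeout`, `ConnectionError`, `RequestException`); the helper names `describe_failure` and `fetch_tail` are my own and not from the course:

```python
import requests


def describe_failure(exc):
    """Turn a requests exception into a short, readable message."""
    if isinstance(exc, requests.HTTPError) and exc.response is not None:
        return "HTTP error, status code {}".format(exc.response.status_code)
    if isinstance(exc, requests.Timeout):
        return "request timed out"
    if isinstance(exc, requests.ConnectionError):
        return "could not connect to the server"
    return "other request error: {}".format(type(exc).__name__)


def fetch_tail(url):
    """Return the last 500 characters of the page, or a failure description."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text[-500:]
    except requests.RequestException as exc:
        return "Crawl failed: " + describe_failure(exc)
```

Calling `print(fetch_tail("http://m.ip138.com/ip.asp?ip=202.204.80.112"))` would then show whether the failure was a status code or a connection problem.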

2. Crawling some other web pages

I also found that some web pages return results different from the ones shown above. Most of those pages require logging in to an account, so that may be the reason (just a guess). If you know a solution, I would appreciate your advice.

A small exercise

This is also a question from the review section of the course.

Measure the time taken to access a web page 100 times

Most solutions look alike, and mine is also copied and modified, but ==I found a very interesting phenomenon==. First the code, which is as follows:

import requests
import time


def getHtmlText(url):
    try:
        # Fetch the URL; give up after 30 seconds
        r = requests.get(url, timeout=30)
        # raise_for_status() checks r.status_code internally and raises
        # requests.HTTPError for an error status, so no extra if statement
        # is needed and try-except can handle the failure
        r.raise_for_status()
        # Use the encoding guessed from the response content
        r.encoding = r.apparent_encoding
        # Return the HTML of the page at the URL as a string
        return r.text
    except requests.RequestException:
        return 'Abnormal operation'


if __name__ == "__main__":
    url = 'https://www.tudou.com'  # Any URL will do; I crawled Tudou
    totaltime = 0
    for i in range(100):
        starttime = time.perf_counter()
        getHtmlText(url)
        endtime = time.perf_counter()
        print('Crawl {0} took {1:.4f} seconds;'.format(i + 1, endtime - starttime))
        totaltime = totaltime + endtime - starttime
    print('Total time {:.4f} seconds'.format(totaltime))

First run result:

C:\Users\ash 2\AppData\Local\Programs\Python\Python39\python.exe C:/Users/grey/PycharmProjects/spiders/spider1.py
Crawl 1 took 0.8227 seconds;
Crawl 2 took 0.7693 seconds;
Crawl 3 took 0.7562 seconds;
Crawl 4 took 0.7901 seconds;
Crawl 5 took 0.1493 seconds;
Crawl 6 took 0.1525 seconds;
Crawl 7 took 0.1509 seconds;
Crawl 8 took 0.1475 seconds;
Crawl 9 took 0.1459 seconds;
Crawl 10 took 0.1505 seconds;
Crawl 11 took 0.1519 seconds;
Crawl 12 took 0.1520 seconds;
Crawl 13 took 0.1479 seconds;
Crawl 14 took 0.1407 seconds;
Crawl 15 took 0.1476 seconds;
Crawl 16 took 0.1528 seconds;
Crawl 17 took 0.1507 seconds;
Crawl 18 took 0.1508 seconds;
Crawl 19 took 0.1563 seconds;
Crawl 20 took 0.1515 seconds;
Crawl 21 took 0.1475 seconds;
Crawl 22 took 0.1574 seconds;
Crawl 23 took 0.1467 seconds;
Crawl 24 took 0.1551 seconds;
Crawl 25 took 0.1580 seconds;
Crawl 26 took 0.1489 seconds;
Crawl 27 took 0.1469 seconds;
Crawl 28 took 0.1578 seconds;
Crawl 29 took 0.1576 seconds;
Crawl 30 took 0.1541 seconds;
Crawl 31 took 0.1482 seconds;
Crawl 32 took 0.1489 seconds;
Crawl 33 took 0.1493 seconds;
Crawl 34 took 0.1560 seconds;
Crawl 35 took 0.1531 seconds;
Crawl 36 took 0.1519 seconds;
Crawl 37 took 0.1480 seconds;
Crawl 38 took 0.1476 seconds;
Crawl 39 took 0.1481 seconds;
Crawl 40 took 0.1491 seconds;
Crawl 41 took 0.1460 seconds;
Crawl 42 took 0.1420 seconds;
Crawl 43 took 0.1724 seconds;
Crawl 44 took 0.1520 seconds;
Crawl 45 took 0.1509 seconds;
Crawl 46 took 0.1536 seconds;
Crawl 47 took 0.1484 seconds;
Crawl 48 took 0.1499 seconds;
Crawl 49 took 0.1478 seconds;
Crawl 50 took 0.1471 seconds;
Crawl 51 took 0.1593 seconds;
Crawl 52 took 0.1560 seconds;
Crawl 53 took 0.1606 seconds;
Crawl 54 took 0.1516 seconds;
Crawl 55 took 0.1518 seconds;
Crawl 56 took 0.1562 seconds;
Crawl 57 took 0.1541 seconds;
Crawl 58 took 0.1452 seconds;
Crawl 59 took 0.1510 seconds;
Crawl 60 took 0.1504 seconds;
Crawl 61 took 0.1475 seconds;
Crawl 62 took 0.1588 seconds;
Crawl 63 took 0.1615 seconds;
Crawl 64 took 0.1512 seconds;
Crawl 65 took 0.1497 seconds;
Crawl 66 took 0.1524 seconds;
Crawl 67 took 0.1565 seconds;
Crawl 68 took 0.1565 seconds;
Crawl 69 took 0.1765 seconds;
Crawl 70 took 0.1601 seconds;
Crawl 71 took 0.1574 seconds;
Crawl 72 took 0.1463 seconds;
Crawl 73 took 0.1488 seconds;
Crawl 74 took 0.1771 seconds;
Crawl 75 took 0.1589 seconds;
Crawl 76 took 0.1582 seconds;
Crawl 77 took 0.1474 seconds;
Crawl 78 took 0.1692 seconds;
Crawl 79 took 0.1542 seconds;
Crawl 80 took 0.1560 seconds;
Crawl 81 took 0.1439 seconds;
Crawl 82 took 0.1464 seconds;
Crawl 83 took 0.1505 seconds;
Crawl 84 took 0.1574 seconds;
Crawl 85 took 0.1706 seconds;
Crawl 86 took 0.1520 seconds;
Crawl 87 took 0.1603 seconds;
Crawl 88 took 0.1629 seconds;
Crawl 89 took 0.1483 seconds;
Crawl 90 took 0.1504 seconds;
Crawl 91 took 0.1560 seconds;
Crawl 92 took 0.1702 seconds;
Crawl 93 took 0.1525 seconds;
Crawl 94 took 0.1501 seconds;
Crawl 95 took 0.1587 seconds;
Crawl 96 took 0.1555 seconds;
Crawl 97 took 0.1535 seconds;
Crawl 98 took 0.1521 seconds;
Crawl 99 took 0.1463 seconds;
Crawl 100 took 0.1486 seconds;
Total time 17.8437 seconds
Process finished with exit code 0

Second run result:

C:\Users\ash 2\AppData\Local\Programs\Python\Python39\python.exe C:/Users/grey/PycharmProjects/spiders/spider1.py
Crawl 1 took 0.2139 seconds;
Crawl 2 took 0.1623 seconds;
Crawl 3 took 0.1626 seconds;
Crawl 4 took 0.1517 seconds;
Crawl 5 took 0.1464 seconds;
Crawl 6 took 0.1650 seconds;
Crawl 7 took 0.1583 seconds;
Crawl 8 took 0.1636 seconds;
Crawl 9 took 0.1567 seconds;
Crawl 10 took 0.1541 seconds;
Crawl 11 took 0.1458 seconds;
Crawl 12 took 0.1575 seconds;
Crawl 13 took 0.1507 seconds;
Crawl 14 took 0.1615 seconds;
Crawl 15 took 0.1579 seconds;
Crawl 16 took 0.1538 seconds;
Crawl 17 took 0.1548 seconds;
Crawl 18 took 0.1672 seconds;
Crawl 19 took 0.1584 seconds;
Crawl 20 took 0.1739 seconds;
Crawl 21 took 0.1481 seconds;
Crawl 22 took 0.1510 seconds;
Crawl 23 took 0.1552 seconds;
Crawl 24 took 0.1521 seconds;
Crawl 25 took 0.1567 seconds;
Crawl 26 took 0.1539 seconds;
Crawl 27 took 0.1452 seconds;
Crawl 28 took 0.1547 seconds;
Crawl 29 took 0.1510 seconds;
Crawl 30 took 0.1476 seconds;
Crawl 31 took 0.1540 seconds;
Crawl 32 took 0.1586 seconds;
Crawl 33 took 0.1588 seconds;
Crawl 34 took 0.1574 seconds;
Crawl 35 took 0.1663 seconds;
Crawl 36 took 0.1593 seconds;
Crawl 37 took 0.1474 seconds;
Crawl 38 took 0.1612 seconds;
Crawl 39 took 0.1568 seconds;
Crawl 40 took 0.1677 seconds;
Crawl 41 took 0.1660 seconds;
Crawl 42 took 0.1542 seconds;
Crawl 43 took 0.1844 seconds;
Crawl 44 took 0.1568 seconds;
Crawl 45 took 0.1601 seconds;
Crawl 46 took 0.1524 seconds;
Crawl 47 took 0.1578 seconds;
Crawl 48 took 0.1521 seconds;
Crawl 49 took 0.1598 seconds;
Crawl 50 took 0.1508 seconds;
Crawl 51 took 0.1464 seconds;
Crawl 52 took 0.1452 seconds;
Crawl 53 took 0.1617 seconds;
Crawl 54 took 0.1652 seconds;
Crawl 55 took 0.1500 seconds;
Crawl 56 took 0.1532 seconds;
Crawl 57 took 0.1473 seconds;
Crawl 58 took 0.1525 seconds;
Crawl 59 took 0.1594 seconds;
Crawl 60 took 0.1496 seconds;
Crawl 61 took 0.1482 seconds;
Crawl 62 took 0.1484 seconds;
Crawl 63 took 0.3039 seconds;
Crawl 64 took 0.1562 seconds;
Crawl 65 took 0.1579 seconds;
Crawl 66 took 0.1717 seconds;
Crawl 67 took 0.1652 seconds;
Crawl 68 took 0.1505 seconds;
Crawl 69 took 0.1652 seconds;
Crawl 70 took 0.1548 seconds;
Crawl 71 took 0.1624 seconds;
Crawl 72 took 0.1704 seconds;
Crawl 73 took 0.1552 seconds;
Crawl 74 took 0.1550 seconds;
Crawl 75 took 0.1539 seconds;
Crawl 76 took 0.1476 seconds;
Crawl 77 took 0.1586 seconds;
Crawl 78 took 0.1500 seconds;
Crawl 79 took 0.1553 seconds;
Crawl 80 took 0.1504 seconds;
Crawl 81 took 0.1666 seconds;
Crawl 82 took 0.1464 seconds;
Crawl 83 took 0.1562 seconds;
Crawl 84 took 0.1534 seconds;
Crawl 85 took 0.1571 seconds;
Crawl 86 took 0.1542 seconds;
Crawl 87 took 0.1549 seconds;
Crawl 88 took 0.1472 seconds;
Crawl 89 took 0.1523 seconds;
Crawl 90 took 0.1807 seconds;
Crawl 91 took 0.1606 seconds;
Crawl 92 took 0.1585 seconds;
Crawl 93 took 0.1551 seconds;
Crawl 94 took 0.1577 seconds;
Crawl 95 took 0.1603 seconds;
Crawl 96 took 0.1542 seconds;
Crawl 97 took 0.1575 seconds;
Crawl 98 took 0.1590 seconds;
Crawl 99 took 0.1623 seconds;
Crawl 100 took 0.1639 seconds;
Total time 15.8824 seconds
Process finished with exit code 0

You may not have noticed what I found interesting: in the first run, the first four crawls each took roughly five times as long as the later ones.

Crawl 1 took 0.8227 seconds;
Crawl 2 took 0.7693 seconds;
Crawl 3 took 0.7562 seconds;
Crawl 4 took 0.7901 seconds;
Crawl 5 took 0.1493 seconds;

But that did not happen in the second run.

Crawl 1 took 0.2139 seconds;
Crawl 2 took 0.1623 seconds;
Crawl 3 took 0.1626 seconds;
Crawl 4 took 0.1517 seconds;
Crawl 5 took 0.1464 seconds;

So I was curious. I could not find an answer on Baidu (==though I only searched casually==). If anyone knows the reason, I would be glad to hear it.
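To compare the early crawls with the later ones more systematically, the timings can be collected in a list instead of only being printed. A minimal sketch using only the standard library; `time_calls` and `mean` are helper names of my own, and the sketch only measures the difference without explaining its cause:

```python
import time


def time_calls(func, repeats):
    """Call func() repeats times and return the list of elapsed seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# First five timings of each run, copied from the results above
first_run = [0.8227, 0.7693, 0.7562, 0.7901, 0.1493]
second_run = [0.2139, 0.1623, 0.1626, 0.1517, 0.1464]

print("Run 1, mean of crawls 1-4: {:.4f} s".format(mean(first_run[:4])))
print("Run 2, mean of crawls 1-4: {:.4f} s".format(mean(second_run[:4])))
```

With the crawler from the exercise, `time_calls(lambda: getHtmlText(url), 100)` would return all 100 timings, which can then be sliced and averaged in the same way.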