Python crawler practical tips, combining Python with Excel to get douban movies

The text and pictures in this article come from the network, only for learning, exchange, do not have any commercial purposes, copyright belongs to the original author, if you have any questions, please contact us to deal with

The following article is from the author of Tencent Cloud: Python learning tutorial

(Want to learn Python? Python Learning exchange group: 1039649593, to meet your needs, materials have been uploaded to the group file stream, you can download! There is also a huge amount of new 2020Python learning material.) # This article mainly introduces the Python crawler to obtain douban movie and write it into Excel. The introduction of example code in this article is very detailed, which has certain reference learning value for everyone’s study or work. If you need it, you can refer to it

Douban top250 is divided into 10 pages. The url on the first page is movie.douban.com/top250, but in fact… The following parameter 0 indicates the number from which to start, such as 0 indicates from the first (Shawshank Redemption) to the twenty-fifth (unreachable), movie.douban.com/top250?star…

So you can use a for loop with a range of 25 steps

The copy code is as follows:

for i in range(0, 250, 25): print(i)

After analyzing the composition of the page, request. Get () returns nothing and prints the response code

url = 'https://movie.douban.com/top250?start=0'res = request.get(url=url)print(res.status_code)
Copy the code

Discovery returns response code 418

Add a header parameter to get

You can use your own browser’s user-agent, for example

Headers = {‘ user-agent ‘: ‘Mozilla/5.0 (Windows NT 10.0; Win64; X64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36’}

Or use fake_agent(PIP install fake_agent) to generate a random agent for yourself to add to the header dictionary

from fake_useragent import UserAgentheaders = {'User-Agent': UserAgent().random}
Copy the code

Then you can get the source code for the page.

The source code of the page is then parsed using lxml.etree, or xpath. Use the browser plug-in xpath Finder to quickly locate elements

import requestsimport lxml.etree as etreeurl = 'https://movie.douban.com/top250?start=0'headers = {'User-Agent': 'the Mozilla / 5.0 (Windows NT 10.0; Win64; X64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}res = requests.get(url=url, headers=headers)print(res.text) html = etree.HTML(res.text)name = html.xpath("/html/body/div[3]/div[1]/div/div[1]/ol/li[1]/div/div[2]/div[1]/a/span[1]")print(name)
Copy the code

But if you go straight like this, you get something like this

[<Element span at 0x20b2f0cc488>] There’s a good article about what this thing is:www.jb51.net/article/132…

So I’m going to go straight to the solution here and use /text() when parsing with xpath

name = html.xpath("/html/body/div[3]/div[1]/div/div[1]/ol/li[1]/div/div[2]/div[1]/a/span[1]/text()")
Copy the code

After that, use the xpath Finder plug-in to retrieve all of the movie’s data step by step

And then you write that in the function, and then you loop around it like you did in the beginning, and you’re good

# -*- coding: utf-8 -*- import requestsimport lxml.etree as etree def get_source(page):url = 'https://movie.douban.com/top250?start= {}'. The format (page) headers = {' the user-agent ':' Mozilla / 5.0 (Windows NT 10.0; Win64; X64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}print(url)res = requests. Get (url=url, headers=headers)print(res.status_code)html = etree.HTML(res.text)for i in range(1, 26):name = html.xpath("/html/body/div[3]/div[1]/div/div[1]/ol/li[{}]/div/div[2]/div[1]/a/span[1]/text()".format(i))info = html.xpath("/html/body/div[3]/div[1]/div/div[1]/ol/li[{}]/div/div[2]/div[2]/p[1]/text()".format(i))score = html.xpath("/html/body/div[3]/div[1]/div/div[1]/ol/li[{}]/div/div[2]/div[2]/div/span[2]/text()".format(i))slogan = html.xpath("/html/body/div[3]/div[1]/div/div[1]/ol/li[{}]/div/div[2]/div[2]/p[2]/span/text()".format(i))print(name[0])pr int(info[0].replace(' ', ''))print(info[1].replace(' ', ''))print(score[0])print(slogan[0]) n = 1for i in range(0, 250, 25) : print (' the first page % d % n) n + = 1 get_source (I) print (" = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ')
Copy the code

During positioning, it was found that four movies did not have a slogan, resulting in an empty list of information obtained, which also led to an error in list.append(). So I’ve added a couple of errors, maybe a little silly solution, if there’s a better solution, I’m all ears

The code can be seen at the endEXCEL Save section

I’m using XLWT here

book = xlwt.Workbook()

sheet = book.add_sheet(u’sheetname’, cell_overwrite_ok=True)

Create a sheet form.

The data is stored in a large list, with lists nested within lists

The data is then imported into the Excel form through a loop

r = 1for i in LIST: Print (x)sheet. Write (r, c, x)c += 1r += 1
Copy the code

I’ll save it at the end

book.save(r’douban.xls’)

Note that the file suffix must be XLS, XLSX will cause the file cannot be opened

And then you’re done

Open the file, manually join the ranking, and other part of the information (these can also be completed in the program, I too trouble, did not write, direct manual to the fast)The ✓ in front is my own, to record those who have seen, those who have not seen

That’s why I wrote it in the first place

The complete code is below for reference only

# -*- coding: utf-8 -*- import requestsimport lxml.etree as etreeimport xlwt def get_source(page):List = []url = 'https://movie.douban.com/top250?start= {}'. The format (page) headers = {' the user-agent ':' Mozilla / 5.0 (Windows NT 10.0; Win64; X64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}print(url)res = requests. Get (url=url, headers=headers)print(res.status_code)html = etree.HTML(res.text)for i in range(1, 26):list = []name = html.xpath("/html/body/div[3]/div[1]/div/div[1]/ol/li[{}]/div/div[2]/div[1]/a/span[1]/text()".format(i))info = html.xpath("/html/body/div[3]/div[1]/div/div[1]/ol/li[{}]/div/div[2]/div[2]/p[1]/text()".format(i))score = html.xpath("/html/body/div[3]/div[1]/div/div[1]/ol/li[{}]/div/div[2]/div[2]/div/span[2]/text()".format(i))slogan = html.xpath("/html/body/div[3]/div[1]/div/div[1]/ol/li[{}]/div/div[2]/div[2]/p[2]/span/text()".format(i))try:list.append( name[0])except:list.append('----')try:list.append(info[0].replace(' ', '').replace('\n', ''))except:list.append('----')try:list.append(info[1].replace(' ', '').replace('\n', ''))except:list.append('----')try:list.append(score[0])except:list.append('----')try:list.append(slogan[0])except:list.a ppend('----') List.append(list) return List n = 1LIST = []for i in range(0, 250, Format (n))n += 1List = get_source(I) list.append (LIST) def excel_write(LIST):book = xlwt.workbook ()sheet = book.add_sheet(u'sheetname', cell_overwrite_ok=True)r = 1for I in LIST: Print (x)sheet.write(r, c, x)c += 1r += 1 book.save(r'douban1.xls')
Copy the code

The above is all the content of this article, I hope to help you learn, but also hope that you support us more.

Python crawler practical tips, combining Python with Excel to get douban movies

Related Posts

R language spline curve, decision tree, Adaboost, gradient lifting (GBM) algorithm for regression, classification and dynamic visualization

Principle and practice of Flink Sql Gateway

Download Swift Package Dependencies through Xcode