The text and pictures in this article come from the network, only for learning, exchange, do not have any commercial purposes, copyright belongs to the original author, if you have any questions, please contact us to deal with
The following article is from the author of Tencent Cloud: Python learning tutorial
(Want to learn Python? Python Learning exchange group: 1039649593, to meet your needs, materials have been uploaded to the group file stream, you can download! There is also a huge amount of new 2020Python learning material.) # This article mainly introduces the Python crawler to obtain douban movie and write it into Excel. The introduction of example code in this article is very detailed, which has certain reference learning value for everyone’s study or work. If you need it, you can refer to it
Douban top250 is divided into 10 pages. The url on the first page is movie.douban.com/top250, but in fact… The following parameter 0 indicates the number from which to start, such as 0 indicates from the first (Shawshank Redemption) to the twenty-fifth (unreachable), movie.douban.com/top250?star…
So you can use a for loop with a range of 25 steps
The copy code is as follows:
for i in range(0, 250, 25): print(i)
After analyzing the composition of the page, request. Get () returns nothing and prints the response code
url = 'https://movie.douban.com/top250?start=0'res = request.get(url=url)print(res.status_code)
Copy the code
Discovery returns response code 418
Add a header parameter to get
You can use your own browser’s user-agent, for example
Headers = {‘ user-agent ‘: ‘Mozilla/5.0 (Windows NT 10.0; Win64; X64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36’}
Or use fake_agent(PIP install fake_agent) to generate a random agent for yourself to add to the header dictionary
from fake_useragent import UserAgentheaders = {'User-Agent': UserAgent().random}
Copy the code
Then you can get the source code for the page.
The source code of the page is then parsed using lxml.etree, or xpath. Use the browser plug-in xpath Finder to quickly locate elements
import requestsimport lxml.etree as etreeurl = 'https://movie.douban.com/top250?start=0'headers = {'User-Agent': 'the Mozilla / 5.0 (Windows NT 10.0; Win64; X64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}res = requests.get(url=url, headers=headers)print(res.text) html = etree.HTML(res.text)name = html.xpath("/html/body/div[3]/div[1]/div/div[1]/ol/li[1]/div/div[2]/div[1]/a/span[1]")print(name)
Copy the code
But if you go straight like this, you get something like this
[<Element span at 0x20b2f0cc488>] There’s a good article about what this thing is:www.jb51.net/article/132…
So I’m going to go straight to the solution here and use /text() when parsing with xpath
name = html.xpath("/html/body/div[3]/div[1]/div/div[1]/ol/li[1]/div/div[2]/div[1]/a/span[1]/text()")
Copy the code
After that, use the xpath Finder plug-in to retrieve all of the movie’s data step by step
And then you write that in the function, and then you loop around it like you did in the beginning, and you’re good
# -*- coding: utf-8 -*- import requestsimport lxml.etree as etree def get_source(page):url = 'https://movie.douban.com/top250?start= {}'. The format (page) headers = {' the user-agent ':' Mozilla / 5.0 (Windows NT 10.0; Win64; X64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}print(url)res = requests. Get (url=url, headers=headers)print(res.status_code)html = etree.HTML(res.text)for i in range(1, 26):name = html.xpath("/html/body/div[3]/div[1]/div/div[1]/ol/li[{}]/div/div[2]/div[1]/a/span[1]/text()".format(i))info = html.xpath("/html/body/div[3]/div[1]/div/div[1]/ol/li[{}]/div/div[2]/div[2]/p[1]/text()".format(i))score = html.xpath("/html/body/div[3]/div[1]/div/div[1]/ol/li[{}]/div/div[2]/div[2]/div/span[2]/text()".format(i))slogan = html.xpath("/html/body/div[3]/div[1]/div/div[1]/ol/li[{}]/div/div[2]/div[2]/p[2]/span/text()".format(i))print(name[0])pr int(info[0].replace(' ', ''))print(info[1].replace(' ', ''))print(score[0])print(slogan[0]) n = 1for i in range(0, 250, 25) : print (' the first page % d % n) n + = 1 get_source (I) print (" = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ')
Copy the code
During positioning, it was found that four movies did not have a slogan, resulting in an empty list of information obtained, which also led to an error in list.append(). So I’ve added a couple of errors, maybe a little silly solution, if there’s a better solution, I’m all ears
The code can be seen at the endEXCEL Save section
I’m using XLWT here
book = xlwt.Workbook()
sheet = book.add_sheet(u’sheetname’, cell_overwrite_ok=True)
Create a sheet form.
The data is stored in a large list, with lists nested within lists
The data is then imported into the Excel form through a loop
r = 1for i in LIST: Print (x)sheet. Write (r, c, x)c += 1r += 1
Copy the code
I’ll save it at the end
book.save(r’douban.xls’)
Note that the file suffix must be XLS, XLSX will cause the file cannot be opened
And then you’re done
Open the file, manually join the ranking, and other part of the information (these can also be completed in the program, I too trouble, did not write, direct manual to the fast)The ✓ in front is my own, to record those who have seen, those who have not seen
That’s why I wrote it in the first place
The complete code is below for reference only
# -*- coding: utf-8 -*- import requestsimport lxml.etree as etreeimport xlwt def get_source(page):List = []url = 'https://movie.douban.com/top250?start= {}'. The format (page) headers = {' the user-agent ':' Mozilla / 5.0 (Windows NT 10.0; Win64; X64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}print(url)res = requests. Get (url=url, headers=headers)print(res.status_code)html = etree.HTML(res.text)for i in range(1, 26):list = []name = html.xpath("/html/body/div[3]/div[1]/div/div[1]/ol/li[{}]/div/div[2]/div[1]/a/span[1]/text()".format(i))info = html.xpath("/html/body/div[3]/div[1]/div/div[1]/ol/li[{}]/div/div[2]/div[2]/p[1]/text()".format(i))score = html.xpath("/html/body/div[3]/div[1]/div/div[1]/ol/li[{}]/div/div[2]/div[2]/div/span[2]/text()".format(i))slogan = html.xpath("/html/body/div[3]/div[1]/div/div[1]/ol/li[{}]/div/div[2]/div[2]/p[2]/span/text()".format(i))try:list.append( name[0])except:list.append('----')try:list.append(info[0].replace(' ', '').replace('\n', ''))except:list.append('----')try:list.append(info[1].replace(' ', '').replace('\n', ''))except:list.append('----')try:list.append(score[0])except:list.append('----')try:list.append(slogan[0])except:list.a ppend('----') List.append(list) return List n = 1LIST = []for i in range(0, 250, Format (n))n += 1List = get_source(I) list.append (LIST) def excel_write(LIST):book = xlwt.workbook ()sheet = book.add_sheet(u'sheetname', cell_overwrite_ok=True)r = 1for I in LIST: Print (x)sheet.write(r, c, x)c += 1r += 1 book.save(r'douban1.xls')
Copy the code
The above is all the content of this article, I hope to help you learn, but also hope that you support us more.