The text and images in this article come from the internet and are shared for learning and exchange only, with no commercial purpose. Copyright belongs to the original author; if you have any questions, please contact us.

Author: Big Smart Yitai Source: CSDN

Link to this article: blog.csdn.net/weixin\_464…

Background: while working through the hands-on exercises in the Beijing Institute of Technology online course “Python Web Crawler and Information Acquisition”, I found that the code demonstrated in the video no longer runs to completion. After some exploration of my own, I recorded my findings below. The code from the video:

import requests
from bs4 import BeautifulSoup
import bs4

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""

def fillUnivList(ulist, html):
    soup = BeautifulSoup(html, "html.parser")
    for tr in soup.find('tbody').children:
        if isinstance(tr, bs4.element.Tag):
            tds = tr('td')
            ulist.append([tds[0].string, tds[1].string, tds[2].string])

def printUnivList(ulist, num):
    print("{:^10}\t{:^6}\t{:^10}".format("rank", "school name", "total"))
    for i in range(num):
        u = ulist[i]
        print("{:^10}\t{:^6}\t{:^10}".format(u[0], u[1], u[2]))

def main():
    uinfo = []
    url = 'http://www.zuihaodaxue.cn/zuihaodaxuepaiming2016.html'
    html = getHTMLText(url)
    fillUnivList(uinfo, html)
    printUnivList(uinfo, 20)

main()

First, the site has changed: the original URL now redirects to www.shanghairanking.cn/rankings/bc… In addition, if you copy and run the code above, you may run into a variable-name problem: uinfo and ulist refer to the same list and should be unified into a single name.

Once those problems are fixed, running the code shows that the statement that extracts the key information no longer works, so nothing is captured.

Looking at the page source again, you can see that all of the school information still sits under tbody.
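As a small illustration of why the code filters with isinstance(tr, bs4.element.Tag) when walking tbody, here is a minimal sketch. The HTML string is invented for the example and is not the real page; it only assumes that rows are separated by line breaks, as in most page sources.

```python
import bs4

# Hypothetical markup, invented for illustration; the real page may differ.
HTML = "<table><tbody>\n<tr><td>1</td></tr>\n<tr><td>2</td></tr>\n</tbody></table>"

soup = bs4.BeautifulSoup(HTML, "html.parser")

# .children yields every direct child of tbody, including the
# whitespace text nodes that sit between the <tr> elements.
children = list(soup.find("tbody").children)

# Keeping only Tag instances leaves the actual table rows.
rows = [child for child in children if isinstance(child, bs4.element.Tag)]

print(len(children), len(rows))
```

Without the filter, calling tr('td') on one of the whitespace text nodes would fail, because a text node is a string and not a tag.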

Pull out one of these rows and take a look:

tr('td')[1].string cannot retrieve the school name, so the name has to be read from the a tag inside tr('td')[1] instead.

The ranking and the total score are not available through .string either, so the solution I came up with is to take the first child node of each cell and convert it to a string manually.
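A likely explanation, offered here as an assumption and not something verified against the live page, is that these cells contain more than one child node. Beautiful Soup's .string returns None when a tag has several children, which would match the behaviour described above. The sketch below uses an invented row (the cell contents and the extra span elements are hypothetical) to show the difference between .string, .a.string and str(.contents[0]).strip().

```python
from bs4 import BeautifulSoup

# Hypothetical row, invented for illustration; the real markup may differ.
ROW = (
    "<table><tbody><tr>"
    "<td>\n  1\n  <span></span></td>"
    '<td><a href="#">Example University</a><span>tag</span></td>'
    "<td>Beijing</td>"
    "<td>Comprehensive</td>"
    "<td>\n  852.5\n  <span></span></td>"
    "</tr></tbody></table>"
)

soup = BeautifulSoup(ROW, "html.parser")
tds = soup.find("tbody").find("tr")("td")

# .string is None for any cell that has more than one child node.
string_results = [td.string for td in tds]

# Workarounds matching the corrected line in the article.
rank = str(tds[0].contents[0]).strip()   # first child is the text node
name = tds[1].a.string                   # the <a> tag has a single text child
score = str(tds[4].contents[0]).strip()

print(string_results)
print([rank, name, score])
```

Cells with a single text child (the third and fourth here) still work with .string, which is why the original code only breaks on some columns.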

To sum up, I changed the key line that extracts the information to:

ulist.append([str(tr('td')[0].contents[0]).strip(), tds[1].a.string, str(tr('td')[4].contents[0]).strip()])

The complete code is as follows:

import requests
import bs4


def getHTMLText(url):
    # Fetch the page and return its text, or an empty string on any failure.
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""


def fillUnivLIst(ulist, html):
    soup = bs4.BeautifulSoup(html, "html.parser")
    for tr in soup.find('tbody').children:
        # Skip the text nodes between rows; only Tag objects are table rows.
        if isinstance(tr, bs4.element.Tag):
            tds = tr('td')
            ulist.append([str(tr('td')[0].contents[0]).strip(), tds[1].a.string, str(tr('td')[4].contents[0]).strip()])
    return ulist


def printUnivList(ulist, num):
    # tplt = "{0:^10}\t{1:{3}^6}\t{2:^10}"
    print("{:^10}\t{:^6}\t{:^10}".format("rank", "school name", "total"))
    # print(tplt.format("rank", "school name", "total", chr=(12288)))
    for i in range(num):
        u = ulist[i]
        print("{:^10}\t{:^6}\t{:^10}".format(u[0], u[1], u[2]))
        # print(tplt.format(u[0], u[1], u[2], chr=(12288)))


def main():
    ulist = []
    url = 'https://www.shanghairanking.cn/rankings/bcur/2020'
    html = getHTMLText(url)
    ulist = fillUnivLIst(ulist, html)
    printUnivList(ulist, 20)


main()

The running results are as follows:

Finally, as for improving the column alignment, I tried the solution given in the video, but it raised an error. I have not solved this yet and will come back to it when I have time.
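One possible cause, judging only from the commented-out lines in the complete code, is that the fill character was passed as the keyword argument chr=(12288) while the template refers to it by position as {3}. The sketch below assumes that reading: it passes chr(12288) (the full-width space, U+3000) as a fourth positional argument, and also shows that the keyword form raises IndexError. The sample row and the width of 10 are made up for the example.

```python
# Full-width space, used as the fill character so that columns of
# Chinese text are padded with characters of the same width.
FULLWIDTH_SPACE = chr(12288)

# Field 3 is the fill character for the school-name column.
tplt = "{0:^10}\t{1:{3}^10}\t{2:^10}"


def format_row(rank, name, score):
    # The fill character must be the fourth positional argument,
    # because the template refers to it as {3}.
    return tplt.format(rank, name, score, FULLWIDTH_SPACE)


# Made-up sample values, for illustration only.
header = format_row("排名", "学校名称", "总分")
line = format_row("1", "甲大学", "90.0")
print(header)
print(line)

# Passing the fill as a keyword leaves positional index 3 empty,
# so str.format raises IndexError.
try:
    tplt.format("1", "甲大学", "90.0", chr=(12288))
    keyword_call_failed = False
except IndexError:
    keyword_call_failed = True
print(keyword_call_failed)
```

Whether the columns line up exactly still depends on the terminal font, so this is a starting point and not a guaranteed fix.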


