
1. bs4 parsing

```python
import requests
from bs4 import BeautifulSoup
import datetime

if __name__ == '__main__':
    url = 'https://www.bilibili.com/v/popular/rank/all'
    headers = {
        # set your own browser request header here
    }
    page_text = requests.get(url=url, headers=headers).text
    soup = BeautifulSoup(page_text, 'lxml')
    li_list = soup.select('.rank-list > li')
    with open('bZhanRank_bs4.txt', 'w', encoding='utf-8') as fp:
        fp.write('Crawl time: ' + str(datetime.datetime.now()) + '\n\n')
        for li in li_list:
            # parse the video rank
            li_rank = li.find('div', class_='num').string
            li_rank = 'Video rank: ' + li_rank + ', '
            # parse the video title
            li_title = li.find('div', class_='info').a.string.strip()
            li_title = 'Video title: ' + li_title + ', '
            # parse the video view count
            li_viewCount = li.select('.detail > span')[0].text.strip()
            li_viewCount = 'View count: ' + li_viewCount + ', '
            # parse the danmu (bullet-comment) count
            li_danmuCount = li.select('.detail > span')[1].text.strip()
            li_danmuCount = 'Danmu count: ' + li_danmuCount + ', '
            # parse the uploader's nickname
            li_upName = li.find('span', class_='data-box up-name').text.strip()
            li_upName = 'Uploader: ' + li_upName + ', '
            # parse the comprehensive score
            li_zongheScore = li.find('div', class_='pts').div.string
            li_zongheScore = 'Score: ' + li_zongheScore
            fp.write(li_rank + li_title + li_viewCount + li_danmuCount
                     + li_upName + li_zongheScore + '\n')
```

The crawl result is as follows:

2. XPath parsing


```python
import requests
from lxml import etree
import datetime

if __name__ == "__main__":
    # set the request header
    headers = {
        # set your own browser request header here
    }
    # set the url
    url = 'https://www.bilibili.com/v/popular/rank/all'
    # crawl the ranking page source
    page_text = requests.get(url=url, headers=headers).content.decode('utf-8')
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//ul[@class="rank-list"]/li')
    with open('./bZhanRank.txt', 'w', encoding='utf-8') as fp:
        fp.write('Crawl time: ' + str(datetime.datetime.now()) + '\n\n')
        for li in li_list:
            # read the video rank
            li_rank = li.xpath('.//div[@class="num"]/text()')
            # [0] takes the string out of the returned list by index
            li_rank = 'Video rank: ' + li_rank[0] + '\n'
            # read the video title
            li_title = li.xpath('.//a/text()')
            li_title = 'Video title: ' + li_title[0] + '\n'
            # read the view count
            li_viewCount = li.xpath('.//div[@class="detail"]/span[1]/text()')
            # .strip() removes the extra whitespace around the string
            li_viewCount = 'View count: ' + li_viewCount[0].strip() + '\n'
            # read the danmu (bullet-comment) count
            li_barrageCount = li.xpath('.//div[@class="detail"]/span[2]/text()')
            li_barrageCount = 'Danmu count: ' + li_barrageCount[0].strip() + '\n'
            # read the uploader's nickname
            li_upName = li.xpath('.//span[@class="data-box up-name"]//text()')
            li_upName = 'Uploader: ' + li_upName[0].strip() + '\n'
            # read the comprehensive score
            li_score = li.xpath('.//div[@class="pts"]/div/text()')
            li_score = 'Score: ' + li_score[0] + '\n\n'
            # store to file
            fp.write(li_rank + li_title + li_viewCount + li_barrageCount
                     + li_upName + li_score)
            print(li_rank + ' crawled successfully!')
```

The crawl result is as follows:

3. XPath parsing (display images after binarization)


```python
# ---------- third-party library imports ----------
import requests           # crawl the page source
from lxml import etree    # parse the page source
import datetime           # add the crawl time to the data
from PIL import Image     # open and re-save images
import cv2                # binarize images
from io import BytesIO    # turn the response bytes into an image stream
import re                 # run regexes over the page source


# ---------- functions ----------
def dJpg(url, title):
    """
    Convert a webp cover image to jpg format.
    :param url: image url
    :param title: image file name (without extension)
    :return: None (saves the image file)
    """
    headers = {
        # set your own browser request header here
    }
    resp = requests.get(url, headers=headers)
    byte_stream = BytesIO(resp.content)
    im = Image.open(byte_stream)
    if im.mode == "RGBA":
        im.load()
        background = Image.new("RGB", im.size, (255, 255, 255))
        background.paste(im, mask=im.split()[3])
    im.save(title + '.jpg', 'JPEG')


def handle_image(img_path):
    """
    Binarize an image.
    :param img_path: image path
    :return: the binarized image
    """
    img = cv2.imread(img_path)
    gray = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)
    # pixels above 127 become 255 (white), the rest become 0 (black)
    ret, binary = cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY)
    return binary


# ---------- main entrance ----------
if __name__ == "__main__":
    # ----- variables -----
    list_rank = []     # stores the video titles
    list_pic_url = []  # stores the image urls
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36 '
                      'SLBrowser/7.0.0.2261 SLBChan/10'
    }
    url = 'https://www.bilibili.com/v/popular/rank/all'
    # crawl the ranking page source
    page_text = requests.get(url=url, headers=headers).content.decode('utf-8')
    tree = etree.HTML(page_text)
    # locate the li tags of all videos
    li_list = tree.xpath('//ul[@class="rank-list"]/li')

    # ----- data parsing (images) -----
    # the image urls cannot be located by tag, so regex the source instead
    others_ex = r'"others".*?"dar"(.*?)\]'
    list_others = re.findall(others_ex, page_text, re.S)
    # strip the "others" recommendation blocks from the source in a loop
    for l in list_others:
        page_text = page_text.replace(l, '')
    pic_ex = r'"copyright":.*?,"pic":"(.*?)","title":".*?"'
    list_pic = re.findall(pic_ex, page_text, re.S)
    index = list_pic[0].rfind('u002F')
    for i in list_pic:
        pic_url = 'http://i1.hdslb.com/bfs/archive/' + i[index + 5:] + '@228w_140h_1c.webp'
        list_pic_url.append(pic_url)

    # ----- data storage -----
    with open('./bZhanRank2.txt', 'w', encoding='utf-8') as fp:
        fp.write('Crawl time: ' + str(datetime.datetime.now()) + '\n')
        fp.write('Author: ...\n')
        fp.write('*' * 10 + ' the ranking list follows ' + '*' * 10 + '\n\n')
        # loop by index so the text entries stay aligned with the image urls
        for i in range(len(li_list)):
            # read the video rank
            li_rank = li_list[i].xpath('.//div[@class="num"]/text()')
            pic_title = li_rank[0]  # use the (non-Chinese) rank as the image name
            # [0] takes the string out of the returned list by index
            li_rank = 'Video rank: ' + li_rank[0] + '\n'
            # read the video title
            li_title = li_list[i].xpath('.//a/text()')
            li_title = 'Video title: ' + li_title[0] + '\n'
            # read the view count
            li_viewCount = li_list[i].xpath('.//div[@class="detail"]/span[1]/text()')
            # .strip() removes the extra whitespace around the string
            li_viewCount = 'View count: ' + li_viewCount[0].strip() + '\n'
            # read the danmu (bullet-comment) count
            li_barrageCount = li_list[i].xpath('.//div[@class="detail"]/span[2]/text()')
            li_barrageCount = 'Danmu count: ' + li_barrageCount[0].strip() + '\n'
            # read the uploader's nickname
            li_upName = li_list[i].xpath('.//span[@class="data-box up-name"]//text()')
            li_upName = 'Uploader: ' + li_upName[0].strip() + '\n'
            # read the comprehensive score
            li_score = li_list[i].xpath('.//div[@class="pts"]/div/text()')
            li_score = 'Score: ' + li_score[0] + '\n\n'
            # store the text information
            fp.write(li_rank + li_title + li_viewCount + li_barrageCount
                     + li_upName + li_score)
            # download the cover image and binarize it
            dJpg(list_pic_url[i], pic_title)
            img = handle_image(pic_title + '.jpg')
            # shrink the image so it fits beside the text
            img = cv2.resize(img, (120, 40))
            height, width = img.shape
            for row in range(0, height):
                for col in range(0, width):
                    if img[row][col] == 0:
                        # black pixel -> write '1'
                        fp.write('1')
                    else:
                        # white pixel -> write a space
                        fp.write(' ')
                fp.write('*\n')
            fp.write('\n\n\n')
            print(li_rank + ' crawled successfully!')
```

Before Notepad can display the results well, adjust Notepad's formatting settings as follows for a better visual effect:

The crawl results are as follows. For the image display: the cover image is downloaded from the page in webp format, converted to jpg, and then binarized (pixel values greater than 127 are set to 255, i.e. white, and the rest to 0, i.e. black). Every pixel is then traversed: a "1" is written for each black pixel (value 0) and a space for each white pixel (value 255), producing a simple character-based rendering of the image.
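The pixel-to-character mapping described above can be sketched without any real image. A minimal example: the 3×5 matrix below is made up and simply stands in for a binarized cover image (0 = black, 255 = white), and the trailing `*` marks each row end as in the crawler.

```python
# Made-up stand-in for a binarized cover image: 0 = black, 255 = white
binary = [
    [0,   0,   255, 0,   0],
    [255, 0,   0,   0,   255],
    [0,   255, 255, 255, 0],
]

lines = []
for row in binary:
    # black pixels become '1', white pixels become spaces, '*' ends the row
    lines.append(''.join('1' if px == 0 else ' ' for px in row) + '*')
art = '\n'.join(lines)
print(art)
```

The same loop in the crawler writes these characters to the text file instead of printing them.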

The image above and the images below were not crawled at the same time.

To balance the text display against the image display, the image size is set fairly small, so the rendered pictures are not sharp. If the text layout is not a concern, you can make the rendering clear by using a larger image size and reducing the font size in Notepad (to avoid line wrapping), as shown below.

4. Analysis process


(1) Get the URL — obtain the URL of the Bilibili ("station B") video ranking page.

(2) Get the request header — right-click and choose Inspect to open the developer tools, click Network, select any packet, and copy the request header.
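In code, step (2) amounts to pasting the copied values into a Python dict. A minimal sketch, where both header values are placeholders rather than required values — copy the real ones from your own browser's Network panel:

```python
# Hypothetical request-header dict; copy the real values from the
# Network panel of your own browser's developer tools.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0',
    'Referer': 'https://www.bilibili.com/',
}

# requests.get(url, headers=headers) would then send these headers
print(sorted(headers))
```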

(3) Web page analysis — click the element-picker tool in the upper-left corner of the developer tools and select a video on the page; each video turns out to be stored in its own li tag.

(4) Web page analysis — select a video title on the page; the title is stored as the text content of an a tag. The remaining video information is located the same way.
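The li structure found in steps (3) and (4) can be illustrated on a toy, well-formed fragment. The class names mirror the real page but the video data is invented, and the stdlib ElementTree stands in for lxml here:

```python
import xml.etree.ElementTree as ET

# Toy fragment mimicking the ranking page: one li per video (made-up data)
fragment = (
    '<ul class="rank-list">'
    '<li><div class="num">1</div><a>Video A</a></li>'
    '<li><div class="num">2</div><a>Video B</a></li>'
    '</ul>'
)

root = ET.fromstring(fragment)
results = []
for li in root.findall('li'):
    rank = li.find('div').text   # the rank sits in the div's text
    title = li.find('a').text    # the title sits in the a tag's text
    results.append((rank, title))

print(results)  # one (rank, title) pair per li tag
```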

(5) Web page analysis — the view count is stored under a span tag and is surrounded by whitespace, so the code uses the strip() method to remove it.
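The whitespace problem from step (5) looks like this in isolation; the sample string only imitates what the span's text typically returns (the number is made up):

```python
# Sample value imitating the raw view-count text scraped from the span
raw_view_count = '\n        3,896,000\n      '

cleaned = raw_view_count.strip()  # strip surrounding spaces and newlines
line = 'View count: ' + cleaned
print(line)
```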

(6) Debugging — while debugging the code, the list of image urls to crawl came back empty.

(7) Error checking — re-inspected the location of the image-url tag and confirmed the position was correct.

(8) Troubleshooting — the crawled information was empty, possibly because the page loads it asynchronously with JavaScript to reduce the initial load. Clicking XHR in the developer tools to look for a packet carrying the image urls turned up nothing.

(9) Troubleshooting — right-click and choose View page source, then search for an image url in the source: all the image urls are stored near the end of the page source, so they can be parsed out with a regular expression.

(10) Troubleshooting — during regex parsing, an "others" list also matches. It is a recommendation list attached under some videos and has to be removed first, otherwise it interferes with the regex extraction of the cover urls.
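Steps (9) and (10) can be sketched on a made-up slice of page source: the "others" recommendation block carries its own nested "pic" field, so it is stripped first, and only then are the cover urls extracted. The sample string and url values are invented; the findall pattern follows the one used in the code above:

```python
import re

# Invented slice of page source; the "others" block's nested "pic"
# would pollute the extraction if left in place.
page_text = (
    '{"copyright":1,"pic":"//i1.example/aaa","title":"Video A",'
    '"others":[{"pic":"//i1.example/zzz","dar":"16:9"}],'
    '"copyright":1,"pic":"//i1.example/bbb","title":"Video B"}'
)

# Step (10): remove the "others" blocks first
page_text = re.sub(r'"others":\[.*?\],', '', page_text)

# Step (9): then pull out the cover-image urls
pic_ex = r'"copyright":.*?,"pic":"(.*?)","title":".*?"'
list_pic = re.findall(pic_ex, page_text, re.S)
print(list_pic)
```

Without the `re.sub` step, the "pic" inside the "others" block would be captured as if it belonged to a ranked video.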

That’s the end of this article about the Python crawler crawling the bilibili-top videos list

