1

Last week a friend working on a project needed the comic images from Tencent's anime and comics site. Saving each picture by hand with "Save as" wasted too much time, so he tried fetching them with Python. Unfortunately, the comic section of the site loads its DOM elements asynchronously, and the images themselves are lazy-loaded, so the usual crawling approach does not work. So we analyzed the page and in the end successfully obtained the image links, after which you can do whatever you want with them ~

2

First, we need to choose a comic. Taking "The Strongest Soldier King" as an example, the URL of its first chapter is http://ac.qq.com/ComicView/index/id/631111/cid/4. Let's first take a look at the HTML the page renders:

As the figure above shows, the main body of the comic is rendered from a JS template, so we have to wait until the page has finished loading before reading its HTML elements:

from selenium import webdriver
import time

if __name__ == '__main__':
    driver = webdriver.PhantomJS()
    driver.get("http://ac.qq.com/ComicView/index/id/631111/cid/4")
    time.sleep(3)
    out_html = driver.find_element_by_xpath("//*").get_attribute("outerHTML")

In the code above, we open the page with webdriver and then sleep for 3 seconds (that is, wait 3 seconds before executing the code that follows). By then the JS has rendered the complete comic DOM, and we can read the fully rendered HTML directly.
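A fixed 3-second sleep can be too short on a slow connection and needlessly long on a fast one. As a rough sketch of the alternative, the helper below polls a condition until it becomes true or a timeout expires. The name `wait_until` and its parameters are hypothetical, not part of Selenium, and the `rendered` function is only a stand-in for a real "has the comic list appeared yet?" check.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.25):
    """Poll `predicate` until it returns a truthy value or `timeout` seconds pass.

    Returns the truthy value, or raises TimeoutError if the deadline is reached.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(interval)


if __name__ == "__main__":
    # Stand-in condition (an assumption for the demo): true on the third poll.
    calls = {"n": 0}

    def rendered():
        calls["n"] += 1
        return calls["n"] >= 3

    wait_until(rendered, timeout=2.0, interval=0.01)
    print("polls needed:", calls["n"])
```

With a real driver, the predicate would be a small function that checks whether the comic container is present in the page. Selenium also ships its own `WebDriverWait` class for this kind of polling, which is probably the better choice in practice.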

However, there is a problem: all images in the comics have the same URL:

A simple test then showed that the chapter images are lazy-loaded: the real image URL is loaded only once the image has scrolled to within a fixed number of pixels of the top of the viewport. So we need to execute a piece of JS that simulates the browser scrolling, and the scrolling must not be too fast, otherwise the lazy loading is not triggered. The modified code is as follows:

driver = webdriver.PhantomJS()
driver.get("http://ac.qq.com/ComicView/index/id/631111/cid/4")
time.sleep(3)
driver.execute_script("""
    var temp_index = 100;
    setInterval(function(){
        window.scrollTo(0, temp_index += 100);
    }, 30);
""")
time.sleep(5)
out_html = driver.find_element_by_xpath("//*").get_attribute("outerHTML")
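To get a feel for whether the 5-second sleep is long enough, the scroll arithmetic can be worked out roughly. The sketch below assumes the timer fires exactly every 30 ms, which a real browser does not guarantee. The function names and the 30000 px page height are made up for illustration; only the 100 px start, 100 px step, and 30 ms interval come from the script above.

```python
import math


def pixels_covered(sleep_seconds, start_px=100, step_px=100, interval_ms=30):
    """Estimate the scroll offset reached after `sleep_seconds`.

    Assumes the setInterval callback fires exactly on schedule.
    """
    ticks = int(sleep_seconds * 1000 // interval_ms)
    return start_px + ticks * step_px


def scroll_seconds(page_height_px, start_px=100, step_px=100, interval_ms=30):
    """Estimate how many seconds the scroller needs to reach `page_height_px`."""
    remaining = max(0, page_height_px - start_px)
    ticks = math.ceil(remaining / step_px)
    return ticks * interval_ms / 1000.0


if __name__ == "__main__":
    # 5000 ms // 30 ms = 166 ticks, so 100 + 166 * 100 = 16700 px.
    print(pixels_covered(5))
    # Hypothetical 30000 px chapter: 299 ticks * 30 ms.
    print(scroll_seconds(30000))
```

Under these assumptions, the 5-second sleep covers about 16700 px, so a noticeably longer chapter would need a longer sleep (or a check that the page bottom has been reached) before the HTML is read.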

At this point we have the complete HTML of the comic chapter, including the real image URLs, and can do whatever we want with it. Here I use BeautifulSoup to read the attribute values, so the final script is:

from selenium import webdriver
import time
from bs4 import BeautifulSoup

if __name__ == '__main__':
    driver = webdriver.PhantomJS()
    driver.get("http://ac.qq.com/ComicView/index/id/631111/cid/4")
    # Wait for the JS template to render the comic list
    time.sleep(3)
    # Scroll down 100 px every 30 ms so that lazy loading is triggered
    driver.execute_script("""
        var temp_index = 100;
        setInterval(function(){
            window.scrollTo(0, temp_index += 100);
        }, 30);
    """)
    time.sleep(5)
    out_html = driver.find_element_by_xpath("//*").get_attribute("outerHTML")

    # Naming the parser explicitly avoids BeautifulSoup's "no parser specified" warning
    soup = BeautifulSoup(out_html, 'html.parser')
    comic_contain = soup.find('ul', id='comicContain')
    # Only images whose real URL has been loaded carry the 'loaded' class
    img_list = comic_contain.find_all('img', class_='loaded')
    for i in img_list:
        print(i['src'])

Running the script prints the real URL of each loaded image, one per line.
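For readers who would rather not install BeautifulSoup, the same filtering step can be sketched with Python's standard `html.parser` module. This is a swapped-in technique, not the method used in the final script. The sample markup and the example.com URLs are invented for the demo; only the `comicContain` id and the `loaded` class come from the script above.

```python
from html.parser import HTMLParser


class LoadedImageCollector(HTMLParser):
    """Collect the src of every <img class="loaded"> inside <ul id="comicContain">."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # <ul> nesting depth inside the target list (0 = outside it)
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "ul":
            if self.depth > 0 or attrs.get("id") == "comicContain":
                self.depth += 1
        elif tag == "img" and self.depth > 0:
            classes = (attrs.get("class") or "").split()
            if "loaded" in classes and attrs.get("src"):
                self.urls.append(attrs["src"])

    def handle_endtag(self, tag):
        if tag == "ul" and self.depth > 0:
            self.depth -= 1


def extract_loaded_images(html):
    """Return the src values of loaded images in the comic container."""
    parser = LoadedImageCollector()
    parser.feed(html)
    parser.close()
    return parser.urls


# Invented sample markup: two loaded pages, one placeholder, one image outside the list.
SAMPLE_HTML = """
<ul id="comicContain">
  <li><img class="loaded" src="https://example.com/page-1.jpg"></li>
  <li><img class="" src="https://example.com/placeholder.gif"></li>
  <li><img class="loaded" src="https://example.com/page-2.jpg"></li>
</ul>
<img class="loaded" src="https://example.com/outside.jpg">
"""

if __name__ == "__main__":
    for url in extract_loaded_images(SAMPLE_HTML):
        print(url)
```

The placeholder image and the image outside the list are both skipped, which mirrors what `find('ul', id='comicContain')` followed by `find_all('img', class_='loaded')` does in the BeautifulSoup version.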

3

Overall, writing the code went smoothly and the analysis was fairly clear-cut. More and more websites now use asynchronous loading, and it feels like a new wave of technology is on the way ~