
The following article is from Tencent Cloud, by Zhe Luo Bunao

I have always had the habit of watching American TV series. On the one hand it exercises my English listening; on the other, it kills time. It used to be possible to watch online on video websites, but since the restrictions imposed by the State Administration of Press, Publication, Radio, Film and Television (SARFT), imported American and British series no longer seem to be updated in sync as before. As a homebody, though, I wasn't willing to give up that easily, so a casual web search turned up a site for downloading American TV series with Thunder (Xunlei), [Everyday American Drama], where all kinds of resources can be downloaded freely. Lately I've been obsessed with BBC HD documentaries; the beauty of nature in them is simply stunning.



Although I had found the resource site and could download from it, I had to open the browser, type in the URL, find the series, and then click the link to download, every single time. Over time the process felt tedious, and sometimes the site wouldn't even load, which made it worse. Since I had been learning Python web crawling anyway, on a whim I wrote a crawler today to scrape all the download links on the site and save them to a text file, so that I can simply open the file, copy a link into Thunder, and download.



I took the site's URL and used Requests to fetch pages and crawl the download links, trying to cover the whole site starting from the home page. But there were many duplicate links, and the site's URLs were not as regular as I had assumed; after a long time I still couldn't write the kind of fan-out, whole-site crawler I wanted. Perhaps I'm just not skilled enough yet; I'll keep at it…
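For reference, here is a minimal sketch of the kind of whole-site crawler described above, with the duplicate-link problem handled by a visited set. The start URL and the crude href regex are my own assumptions, not from the original post:

```python
import re
from collections import deque
from urllib.parse import urljoin

import requests

# Crude href extractor; a real crawler would use an HTML parser instead.
HREF_RE = re.compile(r'href=["\'](.*?)["\']')

def crawl_site(start_url, limit=500):
    """Breadth-first crawl from the home page, skipping duplicates via a visited set."""
    seen = {start_url}
    queue = deque([start_url])
    while queue and len(seen) < limit:
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue  # dead link or timeout: move on
        for href in HREF_RE.findall(resp.text):
            full = urljoin(url, href)
            # Stay on the same site and never queue a URL twice
            if full.startswith(start_url) and full not in seen:
                seen.add(full)
                queue.append(full)
    return seen
```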

Later I noticed that the download links for each show all live inside article pages, and that each article URL carries a numeric ID. Drawing on my earlier crawler experience, the solution was to generate the URLs automatically: the part of the URL apart from the number never changes, and each show's ID is unique. So I estimated roughly how many articles there were, then used the range function to generate the numbers and construct the URLs directly.

Invalid IDs aren't a problem: the status_code attribute of a Requests response tells us the HTTP status of each request, so we simply skip any URL that returns 404.

Here is the code for the above steps. The original post embedded it as an image, so what follows is a minimal reconstruction; the base URL pattern and the ID range are placeholders, not the real ones:
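```python
import requests

# Hypothetical URL pattern; the real article URLs are not shown in the post.
BASE_URL = "http://www.example-meiju.com/archives/{}.html"

def collect_article_urls(start=1, end=30000):
    """Build candidate URLs from sequential IDs and keep only those that exist."""
    valid = []
    for article_id in range(start, end):
        url = BASE_URL.format(article_id)
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue  # network error: skip this ID
        if resp.status_code == 404:
            continue  # no article with this ID
        valid.append(url)
    return valid
```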



The rest went smoothly. I found a similar crawler someone had written before, though it only scraped a single article, so I borrowed its regular expression. BeautifulSoup didn't give results as good as a plain regex here, so I gave up on it; there's always more to learn. Even so, the outcome isn't ideal: roughly half of the links aren't extracted correctly, so the extraction still needs optimizing.
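The borrowed regex isn't reproduced in the post. As a sketch of the idea: Thunder accepts ed2k://, magnet:, and thunder:// URIs, so a pattern along these lines can pull candidate links straight out of the raw HTML (the exact pattern is my assumption):

```python
import re

# Assumed pattern: match the URI schemes Thunder can download
# (ed2k://, magnet:, thunder://) up to the next quote, bracket, or whitespace.
LINK_RE = re.compile(r'(?:ed2k://\|file\||magnet:\?|thunder://)[^\s"\'<>]+')

def extract_links(html):
    """Return every download-style URI found in the raw page source."""
    return LINK_RE.findall(html)
```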





Here is the complete version of the code. It also uses multi-threading, though I'm not sure it helped much; Python's GIL limits CPU-bound threading, although a crawl like this mostly waits on the network, where the GIL matters less. There seem to be more than 20,000 shows, and I expected the crawl to take forever, but after discarding bad URLs and pages where nothing matched, the whole crawl finished in under 20 minutes. I had wanted to use Redis to run the crawl across two Linux machines, but after some fiddling it didn't seem necessary, so I'll leave that for when I need more data.
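The complete code was likewise posted as an image. Here is a sketch that ties the steps together under the same assumptions as above (hypothetical URL pattern and link regex), using a thread pool since the work is network-bound:

```python
import re
from concurrent.futures import ThreadPoolExecutor

import requests

BASE_URL = "http://www.example-meiju.com/archives/{}.html"  # placeholder pattern
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.S)
LINK_RE = re.compile(r'(?:ed2k://\|file\||magnet:\?|thunder://)[^\s"\'<>]+')

def sanitize(name):
    # File names must not contain \ / : * ? " < > | (see the note below)
    return re.sub(r'[\\/:*?"<>|]', '_', name).strip()

def crawl_one(article_id):
    """Fetch one article page and save its download links to a text file."""
    try:
        resp = requests.get(BASE_URL.format(article_id), timeout=10)
    except requests.RequestException:
        return
    if resp.status_code != 200:
        return  # 404 or other error: this ID has no article
    links = LINK_RE.findall(resp.text)
    if not links:
        return  # the regex matched nothing on this page
    match = TITLE_RE.search(resp.text)
    title = sanitize(match.group(1)) if match else str(article_id)
    with open(title + ".txt", "w", encoding="utf-8") as f:
        f.write("\n".join(links))

if __name__ == "__main__":
    # Threads spend most of their time waiting on the network,
    # so the GIL is not the bottleneck here.
    with ThreadPoolExecutor(max_workers=20) as pool:
        pool.map(crawl_one, range(1, 30000))
```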

One genuinely maddening problem along the way was the output file name, and I have to complain about it: a file name may contain spaces, but not slashes, backslashes, or various other special characters. That was the whole problem, and it cost me an entire morning. At first I thought I was scraping bad data; only much later did I discover that some show titles contain slashes, which broke the file writes. It really tripped me up.
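The sanitize helper in the sketch above handles this. As a self-contained illustration (the example title is mine, not from the post):

```python
import re

def sanitize_filename(name):
    """Replace the characters Windows forbids in file names: \\ / : * ? " < > |"""
    return re.sub(r'[\\/:*?"<>|]', '_', name)

print(sanitize_filename("Face/Off"))  # -> Face_Off
```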