Before we crawl anything, we need a simple analysis of the structure of the site. The crawler works by following URLs, so let's first analyze the site's URLs.

```
# Japanese Who Don't Ride Horses
http://bc.ghuws.men/thread0806.php?fid=2&search=&page=2
# Japanese on horseback
http://bc.ghuws.men/thread0806.php?fid=15&search=&page=2
# English teacher section
http://bc.ghuws.men/thread0806.php?fid=4&search=&page=2
```

These are the URLs of a few different sections. Notice that only the number in fid=XX differs between them, and the page parameter should be the page number. By varying these two parameters we can keep crawling through the pages of the different sections.
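
To make this concrete, here is a minimal sketch (not the author's exact code) of how the list-page URLs could be generated from those two parameters; the section ids and page range below are assumptions for illustration.

```python
# Sketch: build the list-page URLs from the fid (section) and page parameters.
# SECTION_FIDS and MAX_PAGE are illustrative values, not taken from the article.
BASE_URL = 'http://bc.ghuws.men/thread0806.php?fid={fid}&search=&page={page}'
SECTION_FIDS = [2, 15, 4]   # the three sections shown above
MAX_PAGE = 5                # how many pages per section to crawl

def build_start_urls():
    """Yield one list-page URL per (section, page) combination."""
    for fid in SECTION_FIDS:
        for page in range(1, MAX_PAGE + 1):
            yield BASE_URL.format(fid=fid, page=page)

if __name__ == '__main__':
    for url in build_start_urls():
        print(url)
```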

After fetching a list page, we need to find the posts on it, which takes a bit of HTML knowledge. Since these list pages all share a similar structure, let's pick one at random and see what the HTML for a post looks like.

1<td class="tal" style="padding-left:8px" id=""> 
2 <h3><a href="htm_data/15/1805/3140529.html" target="_blank" id=""> MP4/1.53 G SDMU - 742 wakayama か ら to た リ ア ル マ ゾ woman [vip1136] < / a > < / h3 > 3 < / td >Copy the code

As you can see, the key information is here in the <a> tag. The href="htm_data/15/1805/3140529.html" should be the relative URL of each post: the front part is the path under the domain, and the last number should be the post's ID, with each post having a unique ID. The text wrapped inside the tag is the post's title. So we piece the href together with the domain name to get the address of each post, and then yield it in the Scrapy spider.

```
http://bc.ghuws.men/htm_data/15/1805/3140529.html
```
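
Here is a rough sketch of what this first crawling layer could look like as a Scrapy spider, assuming the td.tal / h3 / a structure shown above; the spider name, callback names, and meta field are illustrative, not the author's actual code.

```python
import scrapy
from bs4 import BeautifulSoup

class PostListSpider(scrapy.Spider):
    """First-layer sketch: find every post link on a section list page."""
    name = 'post_list_sketch'
    start_urls = ['http://bc.ghuws.men/thread0806.php?fid=15&search=&page=2']

    def parse(self, response):
        soup = BeautifulSoup(response.text, 'html.parser')
        # Each post link sits inside <td class="tal"><h3><a href="htm_data/...">
        for a in soup.select('td.tal h3 a'):
            href = a.get('href', '')
            if 'htm_data' not in href:
                continue
            title = a.get_text(strip=True)
            post_url = response.urljoin(href)   # piece the full post address together
            # Hand the post page over to the second-layer callback
            yield scrapy.Request(post_url, callback=self.parse_post,
                                 meta={'title': title})

    def parse_post(self, response):
        # Second layer: parse images and the torrent page (sketched further below)
        pass
```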

Our goal is to get three things from each post: the title, the preview images, and the torrent (seed) file.

Now that we have the URL of the post, let’s move on to the second layer: the post.

In the HTML of each post, we need to find two things: the preview images and the download address of the torrent. After a quick right-click and Inspect, the HTML for these two things looks like this:

```html
<!-- ... -->
<img src='https://imagizer.imageshack.com/v2/XXX/XXX/XXX.jpg' onclick="XXXXXX" style='cursor:pointer'>
<br>
<!-- ... -->
<a target="_blank" onmouseover="this.style.background='#DEF5CD';" onmouseout="this.style.background='none';" style="cursor:pointer; color:#008000;" href="http://www.viidii.info/?http://www______rmdown______com/link______php?hash=XXXXXX&z">http://www.rmdown.com/link.php?hash=XXXXXX</a>
```

Here you can find the image tag <img> and the torrent-download tag <a>. There is a catch, though: if you search a post's HTML for <img> tags, you will find a lot of images. I only look for non-GIF images here, and I keep just the first three as image_urls to store. As for the torrent's download page, it is the text wrapped inside the <a> tag.
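
A possible way to implement that second-layer parsing with BeautifulSoup is sketched below; the helper name and return shape are made up for illustration.

```python
from bs4 import BeautifulSoup

def parse_post_page(html):
    """Second-layer sketch: pull the preview images and the torrent page link
    out of a post's HTML."""
    soup = BeautifulSoup(html, 'html.parser')

    # Keep only non-GIF images, and only the first three of them.
    image_urls = []
    for img in soup.find_all('img'):
        src = img.get('src', '')
        if src and not src.lower().endswith('.gif'):
            image_urls.append(src)
        if len(image_urls) == 3:
            break

    # The torrent download page is the text wrapped in the <a> tag
    # that points at rmdown.com.
    seed_page_url = None
    for a in soup.find_all('a'):
        text = a.get_text(strip=True)
        if 'rmdown.com' in text:
            seed_page_url = text
            break

    return image_urls, seed_page_url
```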

At this layer, we have the post's title, its images, and the URL of the torrent download page. The next step is to crawl that download page, which brings us to the third layer of the crawler.

On the third-layer page there is only a download button, so all we need to work out is the URL of each torrent's actual download address, which looks like this:

```
http://www.rmdown.com/download.php?reff=495894&ref=182a4da555f6935e11ff2ba0300733c9769a104d51c
```

This URL is built from two parameters, reff and ref. If we search the page's HTML, we find both of them in hidden <INPUT> tags:

1<INPUT TYPE="hidden" NAME="reff" value="495894">
2<INPUT TYPE="hidden" name="ref" value="182a4da555f6935e11ff2ba0300733c9769a104d51c">
Copy the code

These are the only two tags on the page named reff and ref, so they are easy to find. We grab their values and splice them together into the download address of the torrent file.
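
A small sketch of that splicing step, assuming the download URL follows the pattern shown above:

```python
from bs4 import BeautifulSoup

def build_torrent_download_url(seed_page_html):
    """Third-layer sketch: read the two hidden inputs and splice them
    into the torrent download address."""
    soup = BeautifulSoup(seed_page_html, 'html.parser')
    reff = soup.find('input', attrs={'name': 'reff'})['value']
    ref = soup.find('input', attrs={'name': 'ref'})['value']
    return 'http://www.rmdown.com/download.php?reff={}&ref={}'.format(reff, ref)
```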

Once we have the download address, all that's left is to make a network request in Python and save the returned data as a .torrent file. There was one hiccup here: I first tried urllib3, but was frustrated to find that it would hang during network requests. A quick Google showed the problem is fairly common, and someone suggested switching to requests, "the greatest library for humans". Sure enough, requests swept my troubles away, and I've been a fan of it ever since. It's just that good.
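
The download-and-save step could look something like this with requests; the cookies/headers arguments and the timeout are illustrative choices, not necessarily what the author used.

```python
import requests

def save_torrent(download_url, file_path, cookies=None, headers=None):
    """Fetch the torrent with requests and write the raw bytes to disk.
    The timeout keeps the request from hanging forever."""
    resp = requests.get(download_url, cookies=cookies, headers=headers, timeout=30)
    resp.raise_for_status()
    with open(file_path, 'wb') as f:
        f.write(resp.content)

# Illustrative usage:
# save_torrent('http://www.rmdown.com/download.php?reff=...&ref=...',
#              'some_post_title.torrent')
```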

Once a post has been fully processed, the spider yields a Scrapy Item, and the pipeline stores that Item in MongoDB. The crawler then moves on to the next post or the next page, until the whole crawl is complete.
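
A pipeline that does this could be sketched roughly as follows; the database name, the collection naming scheme, and the fid field on the Item are assumptions for illustration.

```python
import pymongo

class MongoPipeline(object):
    """Sketch of a Scrapy pipeline that stores each Item in MongoDB,
    one collection per section."""

    def open_spider(self, spider):
        self.client = pymongo.MongoClient('localhost', 27017)
        self.db = self.client['forum_crawl']   # database name is a placeholder

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Pick the collection (table) by section, as described above.
        collection = self.db['section_{}'.format(item.get('fid', 'unknown'))]
        collection.insert_one(dict(item))
        return item
```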

The process really is that simple. But talk is cheap, especially for us programmers: it's not enough to theorize on paper, you have to actually run the thing. So let's look at how it runs; the results are quite striking:

Running effect

The TXT file recording the results

Locally saved torrent files (113 MB)

Data in MongoDB

The main technical points

  • Python 3.6
  • Scrapy
  • To avoid having the Scrapy requests blocked by the site, cookies and a User-Agent are added to every request (see the settings sketch after this list).
  • BeautifulSoup4 is used for HTML parsing, because I found Scrapy's built-in selectors not as convenient as BS4.
  • DOWNLOAD_DELAY in settings.py needs to be set, to throttle the crawl.
  • When storing into MongoDB, items go into different collections (tables) according to their section.
  • A local TXT file is written in CSV format so that crawl results are recorded in real time, instead of waiting for the crawler to finish before writing a CSV file. This also makes it easy to feed the data to other programs.
  • The torrent file name needs to be sanitized, because post titles can contain characters that are illegal in file paths (see the helper sketch after this list).
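
To make the last few points concrete, here are two small sketches in one block: a possible settings.py fragment for the delay / cookie / User-Agent points, and a filename-sanitizing helper. All values are placeholders, not the author's.

```python
# --- settings.py sketch (values are placeholders) ---
DOWNLOAD_DELAY = 2            # throttle requests so the crawl is less aggressive
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'   # look like a browser
COOKIES_ENABLED = False       # let the hand-written Cookie header go through
DEFAULT_REQUEST_HEADERS = {
    'Cookie': 'XXXX=XXXX',    # whatever cookie the site expects
}

# --- filename-sanitizing helper sketch ---
import re

def sanitize_filename(title, max_length=100):
    """Replace characters that are illegal in file paths and trim the length,
    so a post title can safely be used as a torrent file name."""
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', '_', title)
    return cleaned[:max_length].strip() or 'untitled'
```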