This example shows how to crawl a movie website for its latest movie list and for the download addresses on each movie's details page.

WebMagic is an open-source Java vertical crawler framework. Its goal is to simplify crawler development so that developers can focus on their own logic.

WebMagic features:

  • Fully modular design with strong extensibility.
  • A simple core that still covers the whole crawling process; flexible and powerful, it is also good material for learning how crawlers work.
  • A rich API for extracting page content.
  • No configuration files; a crawler can be written with just a POJO plus annotations.
  • Multithreading support.
  • Distributed crawling support.
  • Support for crawling pages rendered dynamically by JavaScript.
  • No framework dependencies, so it can be embedded flexibly into any project.

Example

This example crawls the movie site https://www.dytt8.net/html/gndy/dyzz/list_23_1.html to capture the latest movie titles from the list page and the download links from the details pages.

Configuring Maven dependencies

The slf4j-log4j12 binding is excluded from the WebMagic dependencies because it conflicts with Spring Boot's logging.

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>2.1.9.RELEASE</version>
        <relativePath/> <!-- lookup parent from repository -->
    </parent>
    <groupId>com.easy</groupId>
    <artifactId>webmagic</artifactId>
    <version>0.0.1</version>
    <name>webmagic</name>
    <description>Demo project for Spring Boot</description>

    <properties>
        <java.version>1.8</java.version>
        <encoding>UTF-8</encoding>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
    </properties>

    <dependencies>
        <dependency>
            <groupId>us.codecraft</groupId>
            <artifactId>webmagic-core</artifactId>
            <version>0.7.3</version>
            <exclusions>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>slf4j-log4j12</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>us.codecraft</groupId>
            <artifactId>webmagic-extension</artifactId>
            <version>0.7.3</version>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <scope>compile</scope>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>

Create the list-page and details-page parsing classes

A PageProcessor is responsible for parsing pages, extracting useful information, and discovering new links. WebMagic uses Jsoup for HTML parsing and builds Xsoup, an XPath extraction tool, on top of it.

ListPageProcesser.java extracts the movie title list:

package com.easy.webmagic.controller;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;

public class ListPageProcesser implements PageProcessor {
    private Site site = Site.me().setDomain("127.0.0.1");

    @Override
    public void process(Page page) {
        page.putField("title", page.getHtml().xpath("//a[@class='ulink']").all().toString());
    }

    @Override
    public Site getSite() {
        return site;
    }
}

DetailPageProcesser.java extracts the movie download address from the details page:

package com.easy.webmagic.controller;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;

public class DetailPageProcesser implements PageProcessor {
    private Site site = Site.me().setDomain("127.0.0.1");

    @Override
    public void process(Page page) {
        page.putField("download", page.getHtml().xpath("//*[@id=\"Zoom\"]/span/table/tbody/tr/td/a").toString());
    }

    @Override
    public Site getSite() {
        return site;
    }
}

Process the captured results with a Pipeline

A Pipeline handles the extracted results, whether that means further computation, persisting them to a file, saving them to a database, and so on. WebMagic provides two result handlers by default: output to the console and save to a file.

A Pipeline defines how results are saved; if you want to save them to a particular database, you need to write a corresponding Pipeline. For any given class of requirements you only need to write one Pipeline.

Here we do no further processing and simply print the captured results to the console.

MyPipeline.java

package com.easy.webmagic.controller;

import lombok.extern.slf4j.Slf4j;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

import java.util.Map;

@Slf4j
public class MyPipeline implements Pipeline {

    @Override
    public void process(ResultItems resultItems, Task task) {
        log.info("get page: " + resultItems.getRequest().getUrl());
        for (Map.Entry<String, Object> entry : resultItems.getAll().entrySet()) {
            log.info(entry.getKey() + ":\t" + entry.getValue());
        }
    }
}

The crawl entry point

Main.java

package com.easy.webmagic.controller;

import us.codecraft.webmagic.Spider;

public class Main {
    public static void main(String[] args) {
        // Fetch the movie titles and page links from the list page
        Spider.create(new ListPageProcesser())
                .addUrl("https://www.dytt8.net/html/gndy/dyzz/list_23_1.html")
                .addPipeline(new MyPipeline())
                .thread(1)
                .run();

        // Fetch the movie download address from the specified details page
        Spider.create(new DetailPageProcesser())
                .addUrl("https://www.dytt8.net/html/gndy/dyzz/20191204/59453.html")
                .addPipeline(new MyPipeline())
                .thread(1)
                .run();
    }
}

Run the example

Run Main.java and watch the console output.

Movie title list from the first page:

14:06:28.704 [pool-1-thread-1] INFO com.easy.webmagic.controller.MyPipeline - get page: https://www.dytt8.net/html/gndy/dyzz/list_23_1.html
14:06:28.704 [pool-1-thread-1] INFO com.easy.webmagic.controller.MyPipeline - title: [<a href="/html/gndy/dyzz/20191204/59453.html" class="ulink">2019 drama "The Chinese Captain" HD Mandarin with Chinese and English subtitles</a>, <a href="/html/gndy/dyzz/20191201/59437.html" class="ulink">2019 animated comedy "The Snowman" BD English with Chinese and English subtitles</a>, <a href="/html/gndy/dyzz/20191201/59435.html" class="ulink">2019 comedy "Where'd You Go, Bernadette" BD English with Chinese and English subtitles</a>, ..., <a href="/html/gndy/dyzz/20191113/59386.html" class="ulink">2019 crime action "Silent Witness" BD Mandarin and Cantonese with Chinese subtitles</a>]

Movie download address from the details page:

14:06:34.365 [pool-2-thread-1] INFO com.easy.webmagic.controller.MyPipeline - get page: https://www.dytt8.net/html/gndy/dyzz/20191204/59453.html
14:06:34.365 [pool-2-thread-1] INFO com.easy.webmagic.controller.MyPipeline - download: <a href="ftp://ygdy8:[email protected]:4233/Sunshine Movie www.ygdy8.com.The Chinese Captain.HD.1080p.Mandarin with Chinese and English subtitles.mkv">ftp://ygdy8:[email protected]:4233/Sunshine Movie www.ygdy8.com.The Chinese Captain.HD.1080p.Mandarin with Chinese and English subtitles.mkv</a>

This shows that the data was captured successfully; from here you can process it however you like.

Advanced crawler usage

Extract elements with Selectable

The chained Selectable extraction API is a core feature of WebMagic. Using the Selectable interface, you can chain page-element extraction steps together without worrying about the details of extraction.
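
For example, here is a minimal sketch of a processor that chains css(), xpath(), links() and all() on the Selectable returned by page.getHtml(); the class name and the CSS selector for the containing div are invented for illustration and are not part of the example above:

package com.easy.webmagic.controller;

import java.util.List;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;

public class ChainedExtractProcesser implements PageProcessor {
    private final Site site = Site.me().setDomain("127.0.0.1");

    @Override
    public void process(Page page) {
        List<String> links = page.getHtml()    // Selectable for the whole page
                .css("div.content")            // narrow down with a CSS selector (placeholder)
                .xpath("//a[@class='ulink']")  // refine with an XPath expression
                .links()                       // take the href attributes of the matched anchors
                .all();                        // collect every match as a String
        page.putField("links", links);
    }

    @Override
    public Site getSite() {
        return site;
    }
}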

Configuration, startup, and termination of crawlers

A Spider is the entry point through which a crawler starts. To start a crawl, create a Spider from a PageProcessor and call run(). The Spider's other components (Downloader, Scheduler, Pipeline) can all be set with setter methods.
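
As a sketch, the list-page crawl above could be assembled explicitly like this; the QueueScheduler and the thread count of five are illustrative choices rather than part of the original example:

import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.downloader.HttpClientDownloader;
import us.codecraft.webmagic.scheduler.QueueScheduler;

// Assemble the Spider explicitly: processor, start URL, downloader, scheduler, pipeline, threads.
Spider spider = Spider.create(new ListPageProcesser())
        .addUrl("https://www.dytt8.net/html/gndy/dyzz/list_23_1.html")
        .setDownloader(new HttpClientDownloader())  // the default HTTP downloader
        .setScheduler(new QueueScheduler())         // in-memory URL queue (also the default)
        .addPipeline(new MyPipeline())
        .thread(5);                                 // five worker threads

spider.run();      // run synchronously until the crawl finishes
// spider.start(); // or start asynchronously and stop it later with spider.stop()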

Jsoup and Xsoup

WebMagic's extraction relies mainly on Jsoup and on Xsoup, an XPath tool that the WebMagic author built on top of Jsoup.
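
Both tools are used through the same Selectable API. As a brief sketch inside a process(Page page) method, the anchors from the list page could be read either with an XPath expression (evaluated by Xsoup) or with a CSS selector plus attribute name (evaluated via Jsoup):

// XPath, evaluated by Xsoup
String titleByXpath = page.getHtml().xpath("//a[@class='ulink']/text()").get();
// CSS selector with an attribute name, evaluated via Jsoup
String titleByCss = page.getHtml().css("a.ulink", "text").get();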

Crawler monitoring

With this feature you can see how a crawler is doing: how many pages have been downloaded, how many are left, how many threads are running, and so on. It is implemented via JMX, so you can inspect local or remote crawlers with jconsole or any other JMX tool.
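
A sketch of registering a spider with the JMX monitor; SpiderMonitor comes from the webmagic-extension module, and register() declares a JMException that the caller must handle:

import javax.management.JMException;

import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.monitor.SpiderMonitor;

Spider listSpider = Spider.create(new ListPageProcesser())
        .addUrl("https://www.dytt8.net/html/gndy/dyzz/list_23_1.html")
        .addPipeline(new MyPipeline());

try {
    // Register before starting, then inspect the spider with jconsole or any other JMX client.
    SpiderMonitor.instance().register(listSpider);
} catch (JMException e) {
    throw new IllegalStateException("Failed to register spider with JMX", e);
}

listSpider.start();  // run asynchronously so the JVM stays alive for monitoring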

Configure a proxy

ProxyProvider has a default implementation, SimpleProxyProvider. It is a simple round-robin ProxyProvider with no failure checking. You can configure any number of candidate proxies, and one is chosen in turn for each request. It is suitable for scenarios where you run your own relatively stable proxies.
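
A sketch of wiring SimpleProxyProvider into HttpClientDownloader; the proxy hosts, ports, and credentials below are placeholders:

import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.downloader.HttpClientDownloader;
import us.codecraft.webmagic.proxy.Proxy;
import us.codecraft.webmagic.proxy.SimpleProxyProvider;

HttpClientDownloader downloader = new HttpClientDownloader();
// Round-robin over the listed proxies; no failure checking is performed.
downloader.setProxyProvider(SimpleProxyProvider.from(
        new Proxy("proxy-host-1", 8080),
        new Proxy("proxy-host-2", 8080, "username", "password")));

Spider.create(new ListPageProcesser())
        .setDownloader(downloader)
        .addUrl("https://www.dytt8.net/html/gndy/dyzz/list_23_1.html")
        .addPipeline(new MyPipeline())
        .run();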

Process non-GET HTTP requests

This is done by setting the method and request body on the Request object. For example:

Request request = new Request("http://xxx/path");
request.setMethod(HttpConstant.Method.POST);
request.setRequestBody(HttpRequestBody.json("{'id':1}","utf-8"));
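
The request can then be queued on a Spider just like a plain start URL; a brief sketch reusing the classes from this article (the POST target is still the placeholder URL above):

Spider.create(new DetailPageProcesser())
        .addRequest(request)            // enqueue the POST request built above
        .addPipeline(new MyPipeline())
        .run();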

Write crawlers using annotations

WebMagic supports writing a crawler in its own annotation style; to use it, include the webmagic-extension package.

In annotation mode, a crawler can be written with very little code by annotating a simple POJO. For simple crawlers this style is easy to understand and easy to maintain.
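
As a sketch only: the @TargetUrl pattern and the @ExtractBy XPath expressions below are assumptions for the dytt8 details page, not tested selectors.

package com.easy.webmagic.controller;

import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.model.OOSpider;
import us.codecraft.webmagic.model.annotation.ExtractBy;
import us.codecraft.webmagic.model.annotation.TargetUrl;
import us.codecraft.webmagic.pipeline.ConsolePageModelPipeline;

// POJO + annotations: @TargetUrl restricts which pages the model applies to,
// @ExtractBy tells WebMagic how to fill each field from the page.
@TargetUrl("https://www.dytt8.net/html/gndy/dyzz/\\d+/\\d+\\.html")
public class MovieDetail {

    @ExtractBy("//div[@class='title_all']//font/text()")  // assumed XPath for the movie title
    private String title;

    @ExtractBy("//*[@id='Zoom']//a/@href")                 // assumed XPath for the download link
    private String downloadUrl;

    public static void main(String[] args) {
        OOSpider.create(Site.me().setDomain("www.dytt8.net"),
                        new ConsolePageModelPipeline(), MovieDetail.class)
                .addUrl("https://www.dytt8.net/html/gndy/dyzz/20191204/59453.html")
                .thread(1)
                .run();
    }
}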

Resources

  • WebMagic crawler getting-started example source code
  • WebMagic on GitHub

Spring Boot and Cloud learning projects