Author: jessehua source: www.jianshu.com/p/cfead4b3e…

WebMagic is an open source Java crawler framework.

How to use the WebMagic framework itself is not the focus of this article; for details, see the official documentation at webmagic.io/docs/.

This article integrates Spring Boot, WebMagic, and MyBatis: WebMagic crawls the data, and MyBatis persists the crawled data to a MySQL database.

The source code provided with this article can be used as scaffolding for a Java crawler project.

1. Add Maven dependencies

<?xml version="1.0" encoding="utf-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>hyzx</groupId>
    <artifactId>qbasic-crawler</artifactId>
    <version>1.0.0</version>

    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>1.5.21.RELEASE</version>
        <relativePath/> <!-- lookup parent from repository -->
    </parent>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.test.skip>true</maven.test.skip>
        <java.version>1.8</java.version>
        <maven.compiler.plugin.version>3.8.1</maven.compiler.plugin.version>
        <maven.resources.plugin.version>3.1.0</maven.resources.plugin.version>
        <mysql.connector.version>5.1.47</mysql.connector.version>
        <druid.spring.boot.starter.version>1.1.17</druid.spring.boot.starter.version>
        <mybatis.spring.boot.starter.version>1.3.4</mybatis.spring.boot.starter.version>
        <fastjson.version>1.2.58</fastjson.version>
        <commons.lang3.version>3.9</commons.lang3.version>
        <joda.time.version>2.10.2</joda.time.version>
        <webmagic.core.version>0.7.3</webmagic.core.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-devtools</artifactId>
            <scope>runtime</scope>
            <optional>true</optional>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-configuration-processor</artifactId>
            <optional>true</optional>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>${mysql.connector.version}</version>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>druid-spring-boot-starter</artifactId>
            <version>${druid.spring.boot.starter.version}</version>
        </dependency>
        <dependency>
            <groupId>org.mybatis.spring.boot</groupId>
            <artifactId>mybatis-spring-boot-starter</artifactId>
            <version>${mybatis.spring.boot.starter.version}</version>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>${fastjson.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
            <version>${commons.lang3.version}</version>
        </dependency>
        <dependency>
            <groupId>joda-time</groupId>
            <artifactId>joda-time</artifactId>
            <version>${joda.time.version}</version>
        </dependency>
        <dependency>
            <groupId>us.codecraft</groupId>
            <artifactId>webmagic-core</artifactId>
            <version>${webmagic.core.version}</version>
            <exclusions>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>slf4j-log4j12</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>${maven.compiler.plugin.version}</version>
                <configuration>
                    <source>${java.version}</source>
                    <target>${java.version}</target>
                    <encoding>${project.build.sourceEncoding}</encoding>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-resources-plugin</artifactId>
                <version>${maven.resources.plugin.version}</version>
                <configuration>
                    <encoding>${project.build.sourceEncoding}</encoding>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
                <configuration>
                    <fork>true</fork>
                    <addResources>true</addResources>
                </configuration>
                <executions>
                    <execution>
                        <goals>
                            <goal>repackage</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

    <repositories>
        <repository>
            <id>public</id>
            <name>aliyun nexus</name>
            <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
            <releases>
                <enabled>true</enabled>
            </releases>
        </repository>
    </repositories>

    <pluginRepositories>
        <pluginRepository>
            <id>public</id>
            <name>aliyun nexus</name>
            <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
            <releases>
                <enabled>true</enabled>
            </releases>
            <snapshots>
                <enabled>false</enabled>
            </snapshots>
        </pluginRepository>
    </pluginRepositories>
</project>

2. Project configuration file application.properties

Configure the MySQL data source, the Druid database connection pool, and the location of the MyBatis mapper files.

# MySQL data source configuration
spring.datasource.name=mysql
spring.datasource.type=com.alibaba.druid.pool.DruidDataSource
spring.datasource.driver-class-name=com.mysql.jdbc.Driver
spring.datasource.url=jdbc:mysql://192.168.0.63:3306/GJHZJL?useUnicode=true&characterEncoding=utf8&useSSL=false&allowMultiQueries=true
spring.datasource.username=root
spring.datasource.password=root

# Druid database connection pool configuration
spring.datasource.druid.initial-size=5
spring.datasource.druid.min-idle=5
spring.datasource.druid.max-active=10
spring.datasource.druid.max-wait=60000
spring.datasource.druid.validation-query=SELECT 1 FROM DUAL
spring.datasource.druid.test-on-borrow=false
spring.datasource.druid.test-on-return=false
spring.datasource.druid.test-while-idle=true
spring.datasource.druid.time-between-eviction-runs-millis=60000
spring.datasource.druid.min-evictable-idle-time-millis=300000
spring.datasource.druid.max-evictable-idle-time-millis=600000

# MyBatis configuration
mybatis.mapperLocations=classpath:mapper/**/*.xml
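The pool settings above depend on one another (for example, the minimum idle count should not exceed the maximum active count). As a minimal sketch, the relationships can be checked with plain java.util.Properties before the application starts. The class PoolConfigCheck and its methods are hypothetical helpers written for this illustration and are not part of Druid or Spring Boot; the sketch also assumes every key it reads is present.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper for illustration only; not part of Druid or Spring Boot.
class PoolConfigCheck {
    static final String PREFIX = "spring.datasource.druid.";

    // Parses properties-file text into a Properties object.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return props;
    }

    // Reads a numeric pool setting; assumes the key is present.
    static long value(Properties props, String key) {
        return Long.parseLong(props.getProperty(PREFIX + key).trim());
    }

    // Returns a human-readable list of inconsistencies (empty when none are found).
    static List<String> problems(Properties props) {
        List<String> problems = new ArrayList<>();
        if (value(props, "initial-size") > value(props, "max-active")) {
            problems.add("initial-size is larger than max-active");
        }
        if (value(props, "min-idle") > value(props, "max-active")) {
            problems.add("min-idle is larger than max-active");
        }
        if (value(props, "min-evictable-idle-time-millis")
                > value(props, "max-evictable-idle-time-millis")) {
            problems.add("min-evictable-idle-time-millis exceeds max-evictable-idle-time-millis");
        }
        return problems;
    }

    public static void main(String[] args) {
        // Values copied from the application.properties in this article.
        Properties props = parse(
                "spring.datasource.druid.initial-size=5\n"
                + "spring.datasource.druid.min-idle=5\n"
                + "spring.datasource.druid.max-active=10\n"
                + "spring.datasource.druid.min-evictable-idle-time-millis=300000\n"
                + "spring.datasource.druid.max-evictable-idle-time-millis=600000\n");
        System.out.println("Problems found: " + problems(props));
    }
}
```

With the values used in this article, the list of problems comes back empty.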

3. Database table structure

CREATE TABLE `cms_content` (
  `contentId` varchar(40) NOT NULL COMMENT 'content ID',
  `title` varchar(150) NOT NULL COMMENT 'title',
  `content` longtext COMMENT 'article content',
  `releaseDate` datetime NOT NULL COMMENT 'release date',
  PRIMARY KEY (`contentId`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='CMS content table';

4. Entity class

import java.util.Date;

public class CmsContentPO {
    private String contentId;

    private String title;

    private String content;

    private Date releaseDate;

    public String getContentId() {
        return contentId;
    }

    public void setContentId(String contentId) {
        this.contentId = contentId;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getContent() {
        return content;
    }

    public void setContent(String content) {
        this.content = content;
    }

    public Date getReleaseDate() {
        return releaseDate;
    }

    public void setReleaseDate(Date releaseDate) {
        this.releaseDate = releaseDate;
    }
}

5. Mapper interface

public interface CrawlerMapper {
    int addCmsContent(CmsContentPO record);
}

6. CrawlerMapper.xml file

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE mapper PUBLIC "-//mybatis.org//DTD Mapper 3.0//EN" "http://mybatis.org/dtd/mybatis-3-mapper.dtd">
<mapper namespace="com.hyzx.qbasic.dao.CrawlerMapper">

    <insert id="addCmsContent" parameterType="com.hyzx.qbasic.model.CmsContentPO">
        insert into cms_content (contentId, title, releaseDate, content)
        values (#{contentId,jdbcType=VARCHAR},
                #{title,jdbcType=VARCHAR},
                #{releaseDate,jdbcType=TIMESTAMP},
                #{content,jdbcType=LONGVARCHAR})
    </insert>
</mapper>

7. XXX page processor class XXXPageProcessor

It parses the HTML pages crawled from XXX.

import org.springframework.stereotype.Component;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;

@Component
public class XXXPageProcessor implements PageProcessor {

    private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    @Override
    public void process(Page page) {
        // Queue every answer link found on the current page for later crawling
        page.addTargetRequests(page.getHtml().links()
                .regex("https://www\\.xxx\\.com/question/\\d+/answer/\\d+.*").all());

        // Extract the question title and the answer text
        page.putField("title", page.getHtml()
                .xpath("//h1[@class='QuestionHeader-title']/text()").toString());
        page.putField("answer", page.getHtml()
                .xpath("//div[@class='QuestionAnswer-content']/tidyText()").toString());

        if (page.getResultItems().get("title") == null) {
            // No title means this is not an answer page; skip it so the pipeline ignores it
            page.setSkip(true);
        }
    }

    @Override
    public Site getSite() {
        return site;
    }
}
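To see which links the pattern in XXXPageProcessor is meant to pick up, the same expression can be tried with plain java.util.regex. This is a minimal sketch and does not go through WebMagic's selector classes; AnswerLinkFilter and the sample URLs are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of WebMagic: filters links with plain java.util.regex.
class AnswerLinkFilter {
    // Same expression as in XXXPageProcessor; the dots are escaped so they match literally.
    static final Pattern ANSWER_URL =
            Pattern.compile("https://www\\.xxx\\.com/question/\\d+/answer/\\d+.*");

    // Keeps only the links that match the answer-page pattern in full.
    static List<String> answerLinks(List<String> links) {
        List<String> result = new ArrayList<>();
        for (String link : links) {
            if (ANSWER_URL.matcher(link).matches()) {
                result.add(link);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample URLs invented for this sketch.
        List<String> links = Arrays.asList(
                "https://www.xxx.com/question/123/answer/456",
                "https://www.xxx.com/question/123/answer/456?utm=feed",
                "https://www.xxx.com/question/123",
                "https://www.xxx.com/explore");
        System.out.println(answerLinks(links));
    }
}
```

Only the first two sample URLs pass the filter: the trailing `.*` tolerates a query string, while a question page without an answer id and the explore page are rejected.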

8. XXX data processing class XXXPipeline

It stores the data parsed from the XXX HTML pages in the MySQL database.

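import java.util.Date;
import java.util.UUID;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

import com.hyzx.qbasic.dao.CrawlerMapper;
import com.hyzx.qbasic.model.CmsContentPO;

import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

@Component
public class XXXPipeline implements Pipeline {

    private static final Logger LOGGER = LoggerFactory.getLogger(XXXPipeline.class);

    @Autowired
    private CrawlerMapper crawlerMapper;

    @Override
    public void process(ResultItems resultItems, Task task) {
        String title = resultItems.get("title");
        String answer = resultItems.get("answer");

        // Map the extracted fields onto a new cms_content row
        CmsContentPO contentPO = new CmsContentPO();
        contentPO.setContentId(UUID.randomUUID().toString());
        contentPO.setTitle(title);
        contentPO.setReleaseDate(new Date());
        contentPO.setContent(answer);

        try {
            // The insert returns the number of affected rows
            boolean success = crawlerMapper.addCmsContent(contentPO) > 0;
            if (success) {
                LOGGER.info("Saved article successfully: {}", title);
            } else {
                LOGGER.warn("No row was inserted for article: {}", title);
            }
        } catch (Exception ex) {
            LOGGER.error("Failed to save article", ex);
        }
    }
}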
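The cms_content table declares contentId as varchar(40) and title as varchar(150) NOT NULL, so it can help to check the extracted values before handing them to the mapper. The following is a minimal sketch of such a guard using only the JDK; ContentRecordSketch is a hypothetical helper, and the choice to truncate an overlong title (rather than reject it) is an assumption made for this illustration, not something the original project does.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical helper for illustration; the original pipeline has no such guard.
class ContentRecordSketch {
    // Matches the `title` varchar(150) column of cms_content.
    static final int MAX_TITLE_LENGTH = 150;

    // Builds the column values for one row, or returns null when there is no usable title.
    static Map<String, String> toRecord(String title, String answer) {
        if (title == null || title.trim().isEmpty()) {
            return null; // title is NOT NULL in the table, so there is nothing to store
        }
        String trimmed = title.trim();
        if (trimmed.length() > MAX_TITLE_LENGTH) {
            // Assumption for this sketch: truncate instead of rejecting the row
            trimmed = trimmed.substring(0, MAX_TITLE_LENGTH);
        }
        Map<String, String> record = new HashMap<>();
        // UUID.toString() is 36 characters long, which fits the varchar(40) contentId column
        record.put("contentId", UUID.randomUUID().toString());
        record.put("title", trimmed);
        record.put("content", answer);
        return record;
    }

    public static void main(String[] args) {
        Map<String, String> record = toRecord("  A sample question title  ", "A sample answer");
        System.out.println(record.get("title") + " / id length " + record.get("contentId").length());
        System.out.println("record for a missing title: " + toRecord(null, "orphan answer"));
    }
}
```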

9. Crawler task XXXTask

Start the crawler every ten minutes.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

import us.codecraft.webmagic.Spider;

@Component
public class XXXTask {

    private static final Logger LOGGER = LoggerFactory.getLogger(XXXTask.class);

    @Autowired
    private XXXPipeline xxxPipeline;

    @Autowired
    private XXXPageProcessor xxxPageProcessor;

    private ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

    public void crawl() {
        // Scheduled task: crawl once every 10 minutes
        timer.scheduleWithFixedDelay(() -> {
            Thread.currentThread().setName("xxxCrawlerThread");

            try {
                Spider.create(xxxPageProcessor)
                        // Start crawling from https://www.xxx.com/explore
                        .addUrl("https://www.xxx.com/explore")
                        // Store the crawled data in the database
                        .addPipeline(xxxPipeline)
                        // Crawl with 2 threads
                        .thread(2)
                        // Start the crawler asynchronously
                        .start();
            } catch (Exception ex) {
                LOGGER.error("Scheduled crawl thread failed", ex);
            }
        }, 0, 10, TimeUnit.MINUTES);
    }
}
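The try/catch inside the scheduled lambda matters: ScheduledExecutorService suppresses all later executions of a periodic task once one run throws. The sketch below shows the same pattern with only the JDK, using millisecond delays instead of ten minutes and a simulated failure instead of a real crawl; FixedDelayDemo and its method names are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class; the job body stands in for the real crawl.
class FixedDelayDemo {

    // Schedules a job whose first run fails and reports how many of the
    // expected runs finished within five seconds.
    static int completedRuns(int expectedRuns, long delayMillis) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger attempts = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(expectedRuns);

        timer.scheduleWithFixedDelay(() -> {
            try {
                if (attempts.incrementAndGet() == 1) {
                    throw new IllegalStateException("simulated crawl failure");
                }
            } catch (Exception ex) {
                // Catching here keeps the schedule alive; an exception that escaped
                // the task would cancel every later run.
            } finally {
                done.countDown();
            }
        }, 0, delayMillis, TimeUnit.MILLISECONDS);

        try {
            done.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
        } finally {
            timer.shutdownNow();
        }
        return expectedRuns - (int) done.getCount();
    }

    public static void main(String[] args) {
        System.out.println("Runs completed despite the first failure: " + completedRuns(3, 20));
    }
}
```

Because the delay is measured from the end of one run to the start of the next, a slow run pushes the following run back rather than overlapping with it.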

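10. Spring Boot application startup class

import org.mybatis.spring.annotation.MapperScan;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.CommandLineRunner;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
@MapperScan(basePackages = "com.hyzx.qbasic.dao")
public class Application implements CommandLineRunner {

    @Autowired
    private XXXTask xxxTask;

    public static void main(String[] args) {
        SpringApplication.run(Application.class, args);
    }

    @Override
    public void run(String... args) throws Exception {
        // Start crawling once the application context is ready
        xxxTask.crawl();
    }
}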
