Preface

For business reasons, our company needs to fetch the historical articles of the WeChat official accounts supplied by customers and update them every day. Checking more than 300 official accounts by hand each day is clearly not feasible, so the problem was handed to the IT team. I had previously built a WeChat crawler based on Sogou and have been doing Java web development since then, and this project rekindled my love of crawlers. It was my first time building a crawler on a Spring Cloud architecture; it took a little over 20 days to finish. In a series of articles I will share what I learned from the project and publish the source code for your corrections.

I. A brief introduction to the system

The system is written in Java. After a simple configuration step in which you enter an official account's name or WeChat ID, it captures that account's articles on a schedule or on demand, including each article's read count, like count, and "Wow" (在看) count.

II. System architecture

Technology stack

Spring Cloud, SpringBoot, MyBatis Plus, Nacos, RocketMQ, Nginx

Storage

MySQL, MongoDB, Redis, Solr

Cache

Redis

Proxy

Fiddler

III. Advantages and disadvantages of the system

System advantages

1. Once an official account is configured, its articles are captured automatically using Fiddler's JS injection feature together with WebSocket.
2. The system has a distributed, highly available architecture.
3. RocketMQ message queues decouple the components, which mitigates collection failures caused by network congestion. If consumption fails three times, the failure is logged to MySQL so that no article is lost.
4. Any number of WeChat IDs can be added to improve collection efficiency and to cope with anti-crawling restrictions.
5. Redis caches each WeChat ID's collection records for the past 24 hours to keep the account from being banned.
6. Nacos serves as the configuration center, so the collection frequency can be adjusted in real time through hot configuration updates.
7. Collected data is stored in a Solr cluster to speed up retrieval.
8. The raw records returned by packet capture are archived in MongoDB, which makes error logs easy to inspect.
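Advantages 4 and 5 together amount to rotating between WeChat IDs while keeping each one under a 24-hour limit. The following is only a rough sketch of that idea: it uses an in-memory map as a stand-in for Redis (where key expiry would normally do the cleanup), and the class name, the per-account limit, and the injectable clock are all assumptions made for illustration, not the project's actual code.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Supplier;

/** In-memory stand-in for a Redis-backed collection record (hypothetical class). */
class AccountQuotaTracker {
    private final int maxPerWindow;        // assumed per-account limit, not from the article
    private final Duration window;         // 24 hours in the article
    private final Supplier<Instant> clock; // injectable so the window can be tested
    private final Map<String, Deque<Instant>> history = new HashMap<>();

    AccountQuotaTracker(int maxPerWindow, Duration window, Supplier<Instant> clock) {
        this.maxPerWindow = maxPerWindow;
        this.window = window;
        this.clock = clock;
    }

    /** Picks the first WeChat ID that still has quota and records one collection against it. */
    synchronized Optional<String> acquire(List<String> wechatIds) {
        Instant now = clock.get();
        for (String id : wechatIds) {
            Deque<Instant> used = history.computeIfAbsent(id, k -> new ArrayDeque<>());
            // Drop records that have aged out of the window, as a Redis TTL would.
            while (!used.isEmpty() && !used.peekFirst().plus(window).isAfter(now)) {
                used.pollFirst();
            }
            if (used.size() < maxPerWindow) {
                used.addLast(now);
                return Optional.of(id);
            }
        }
        // Every account is at its limit; the caller should wait or fall back.
        return Optional.empty();
    }
}
```

With a limit of 2 and the IDs `wx_a` and `wx_b`, four calls to `acquire` use up both accounts, the fifth returns an empty `Optional`, and a call made 24 hours later succeeds again with `wx_a`.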

System disadvantages:

1. Collection relies on real devices and real accounts. Collecting a large number of official accounts requires multiple WeChat IDs (if an account hits its limit for the day, the WeChat Official Accounts Platform interface can be crawled instead).
2. Articles are not captured the moment an account publishes. Collection times are set by the system, so messages arrive with some lag (if there are few official accounts and enough WeChat IDs, the lag can be reduced by raising the collection frequency).

IV. Introduction to the modules

Because a management system and API call features will be added later, some functionality has been encapsulated in advance.

common-ws-starter

Common module: holds shared code such as utility classes and entity classes.

redis-ws-starter

Redis module: a wrapper around spring-boot-starter-data-redis that exposes the encapsulated Redis utility class and Redisson utility class.

rocketmq-ws-starter

RocketMQ module: a wrapper around rocketmq-spring-boot-starter that provides consumer retry and failure logging.
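The "retry, then log the failure" behaviour can be pictured independently of the broker. Below is a minimal plain-Java sketch under stated assumptions: `RetryingConsumer` and `FailureRecord` are hypothetical names, the in-memory list stands in for the MySQL failure table, and nothing here uses the RocketMQ API, whose own redelivery mechanism the real module presumably builds on.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Sketch of "try up to N times, then record the failure" (hypothetical class). */
class RetryingConsumer<T> {
    /** One row of the failure log; the article stores these in MySQL. */
    static final class FailureRecord {
        final String payload;
        final int attempts;
        final String error;

        FailureRecord(String payload, int attempts, String error) {
            this.payload = payload;
            this.attempts = attempts;
            this.error = error;
        }
    }

    private final int maxAttempts;
    private final Consumer<T> handler;
    private final List<FailureRecord> failureLog = new ArrayList<>();

    RetryingConsumer(int maxAttempts, Consumer<T> handler) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        this.maxAttempts = maxAttempts;
        this.handler = handler;
    }

    /** Returns true if the handler succeeded within maxAttempts, false if the failure was logged. */
    boolean consume(T message) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                handler.accept(message);
                return true;
            } catch (RuntimeException e) {
                last = e; // remember the error and try again
            }
        }
        // All attempts failed: keep a record so the article can be re-collected later.
        failureLog.add(new FailureRecord(String.valueOf(message), maxAttempts,
                String.valueOf(last.getMessage())));
        return false;
    }

    List<FailureRecord> failures() {
        return failureLog;
    }
}
```

Constructed with `maxAttempts` set to 3, a handler that fails twice and then succeeds leaves the failure log empty, while a handler that always throws produces exactly one record with `attempts` equal to 3.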

db-ws-starter

MySQL data source module: encapsulates the MySQL data source, supports multiple data sources, and switches between them dynamically by means of custom annotations.
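Annotation-driven data source switching usually comes down to a thread-local key that is set before an annotated method runs and restored afterwards. The sketch below shows only that mechanism with a JDK dynamic proxy. The annotation `@UseDataSource`, the holder `DataSourceContext`, and the `ArticleMapper` interface are invented for illustration; in a Spring application the key would typically be read by an `AbstractRoutingDataSource` subclass, and whether this module does so is an assumption.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;

/** Hypothetical annotation naming the data source a method should use. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface UseDataSource {
    String value();
}

/** Holds the data source key for the current thread. */
final class DataSourceContext {
    static final String DEFAULT = "master";
    private static final ThreadLocal<String> KEY = ThreadLocal.withInitial(() -> DEFAULT);

    private DataSourceContext() {}

    static String current() { return KEY.get(); }
    static void set(String key) { KEY.set(key); }
}

final class DataSourceSwitching {
    private DataSourceSwitching() {}

    /** Wraps target in a proxy that switches the key around annotated interface methods. */
    @SuppressWarnings("unchecked")
    static <T> T wrap(Class<T> iface, T target) {
        return (T) Proxy.newProxyInstance(iface.getClassLoader(), new Class<?>[] {iface},
            (proxy, method, args) -> {
                UseDataSource ann = method.getAnnotation(UseDataSource.class);
                String previous = DataSourceContext.current();
                if (ann != null) {
                    DataSourceContext.set(ann.value());
                }
                try {
                    return method.invoke(target, args);
                } catch (InvocationTargetException e) {
                    throw e.getCause(); // rethrow the target's own exception
                } finally {
                    DataSourceContext.set(previous); // always restore the caller's key
                }
            });
    }
}

/** Example mapper: one method pinned to an "archive" source, one on the default. */
interface ArticleMapper {
    @UseDataSource("archive")
    String findArchived();

    String findLatest();
}
```

Restoring the previous key in `finally` matters because worker threads are reused: without it, a key left behind by one request would leak into the next call on the same thread.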

sql-wx-spider

MySQL database module: provides all MySQL database operations.

pc-wx-spider

PC collection module: contains the functions for collecting an official account's historical messages through the PC client.

java-wx-spider

Java extraction module: contains the functions that extract article content in Java.

mobile-wx-spider

Emulator collection module: contains the functions for collecting a message's interaction counts (reads, likes, and so on) through an emulator or a mobile phone.

V. General flow chart

VI. Run screenshots

PC and mobile

The console

End of the run

Conclusion

I have tested the project myself and it is now running in production. The problem of converting Sogou temporary links into permanent links was also solved during development. I hope it helps others who are struggling with similar requirements. Working in Java these days is like rowing upstream: if you do not advance, you fall behind, and you never know when you will be out-competed. I hope everyone finds a secret manual of their own. If you have read this far, please show some support. The Java backend source code is attached: WS-Spider