Preface

The previous two articles introduced the system flow and the encapsulation of the functional modules. This article covers the collection of the article list. It focuses on ideas rather than techniques and source code: code that implements the logic all looks much the same, but a reliable approach is one in ten thousand.

Preparatory work

1. Collection sources

Research turned up several ways to access a public account's articles:

  • Sogou search: returns only the first 10 articles;

  • The public account history message page: each WeChat account is limited in how many times it can visit within 24 hours (currently observed at 200+ visits; this may vary between WeChat accounts);

  • The article lookup interface that the WeChat public platform offers for citing other public accounts' articles when composing a rich-media post: frequent queries get rate-limited.

After analyzing the requirements, we chose the second approach; using several WeChat accounts keeps each one under the limit. The history message page does not open in an ordinary browser: opening it directly in Chrome without parameters fails. Analyzing the request with Fiddler shows that the real request link carries many parameters, and pasting the captured link into Chrome opens the page normally. The page shows the first page of data by default and loads the second page asynchronously when you scroll down. Analysis of the paging query interface shows that its request parameters are the same as those on the page, apart from the page number, so once we assemble the paging link ourselves we can query any page directly.

The final design is therefore as follows. The WeChat browser jumps to the history message page, Fiddler intercepts the request and submits the complete link to our own server, and the server extracts the parameters from the link. Our own database configures, for each public account, how many pages to fetch or how far back in time to collect. The extracted parameters are combined with this custom configuration to assemble the paging query links, the requests are sent through HttpClient, and the responses are parsed and saved.
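The parameter reuse described above can be sketched in a few lines. This is only an illustration under assumptions, not the project's code: the endpoint passed as paging_endpoint, the parameter name "page", and the zero-based page index are hypothetical placeholders, and the real values have to be read from the traffic captured in Fiddler.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def extract_params(captured_url: str) -> dict:
    """Return the query parameters of the link captured by the proxy."""
    query = urlsplit(captured_url).query
    # keep_blank_values=True so that empty parameters are forwarded unchanged
    return dict(parse_qsl(query, keep_blank_values=True))


def build_page_urls(captured_url: str, paging_endpoint: str, pages: int,
                    page_param: str = "page") -> list:
    """Build one paging-query URL per page, reusing the captured parameters.

    paging_endpoint and page_param are hypothetical placeholders.
    """
    params = extract_params(captured_url)
    scheme, netloc, path, _, _ = urlsplit(paging_endpoint)
    urls = []
    for page in range(pages):  # assumes page numbering starts at 0
        page_params = dict(params)
        page_params[page_param] = str(page)
        urls.append(urlunsplit((scheme, netloc, path, urlencode(page_params), "")))
    return urls
```

The number of pages would come from the per-account configuration in the database; each resulting URL is then requested and parsed independently.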

2. Page hopping

The previous section covered the data source and how to request it, but clicking by hand every time would defeat the purpose of automation; the ultimate goal is an automatic jump. There are two general approaches:

  • Simulating clicks with an automation script
  • JS injection through a proxy tool

JS injection uses fewer resources, whereas simulated clicks come with many restrictions (for example, the page being clicked must stay in the foreground), so we chose JS injection. There are many proxy tools to choose from; since the requirements are not complex, we went with the lightweight Fiddler. Readers who have never used it can look it up online; it is easy to install, and remember to install its certificate. Fiddler intercepts the request for the history message page and, before the page is returned to the browser, appends a short piece of JS to the response that makes the page jump to a page on our own server after 10 seconds. This step could in fact have the injected JS ask our server directly for the next public account's history message page and jump straight there. However, to control the frequency flexibly and to watch progress in real time, we set up a middle page. Once the history message page is done, it jumps to the middle page; after a certain interval the middle page fetches a new task from the backend and jumps to that task's history message page, and so the cycle continues. Because the system is deployed as a cluster, the page is served by Nginx, which also takes care of cross-domain issues.
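To make the injection step concrete, here is a small sketch of the transformation applied to the response body. In the real setup this runs inside Fiddler's response hook; the Python function below only illustrates the string manipulation, and the function name and the example target URL are assumptions.

```python
import json
import re


def inject_redirect(html: str, target_url: str, delay_seconds: int = 10) -> str:
    """Insert a delayed-redirect script just before the last </body> tag."""
    script = (
        "<script>"
        # json.dumps yields a quoted, escaped JavaScript string literal
        f"setTimeout(function () {{ window.location.href = {json.dumps(target_url)}; }}, "
        f"{delay_seconds * 1000});"
        "</script>"
    )
    last = None
    for last in re.finditer(r"</body\s*>", html, flags=re.IGNORECASE):
        pass  # keep only the final match
    if last is None:
        # No closing body tag: append the script at the end of the document
        return html + script
    return html[:last.start()] + script + html[last.start():]
```

Placing the script at the end of the body means the original page content is untouched, which matters because the request link itself is what the backend needs.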

3. Task queues

Fetching and executing tasks have been mentioned above, but it may not be clear what data structure a task is. It is simply a queue whose entries are one parameter of a public account, biz. With this parameter the public account's history message page link can be assembled: mp.weixin.qq.com/s?__biz=xxx… Because the system is deployed as a distributed cluster, we use a Redis queue. Each request pops one biz, and Redis's single-threaded command execution ensures that no task is handed out twice.
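A minimal sketch of the queue operations follows. The two functions only rely on the list commands RPUSH and LPOP, which a redis-py client exposes as rpush and lpop; the key name and the in-memory stand-in class are assumptions added so the example runs without a Redis server.

```python
from collections import deque
from typing import Optional

TASK_QUEUE_KEY = "crawler:biz:queue"  # hypothetical key name


class InMemoryQueueClient:
    """Tiny stand-in exposing the two list commands used below."""

    def __init__(self):
        self._lists = {}

    def rpush(self, key, *values):
        queue = self._lists.setdefault(key, deque())
        queue.extend(values)
        return len(queue)

    def lpop(self, key):
        queue = self._lists.get(key)
        return queue.popleft() if queue else None


def enqueue_biz(client, *biz_values) -> int:
    """Append biz values to the tail of the task queue."""
    return client.rpush(TASK_QUEUE_KEY, *biz_values)


def next_biz(client) -> Optional[str]:
    """Pop one biz from the head of the queue, or None when it is empty."""
    return client.lpop(TASK_QUEUE_KEY)
```

With a real redis-py client the same two functions should work unchanged, but note that it returns bytes unless the client is created with decode_responses=True. Because each LPOP is atomic, two nodes popping at the same time never receive the same biz.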

4. Task monitoring

When the task queue is empty, the cycle stops, but the next time tasks are enqueued the front-end page has to start fetching tasks again. The most convenient solution is clearly a long connection; periodic polling would also work, but it costs more and reacts more slowly, and recent versions of Nginx support WebSocket well. When the tasks run out, the page establishes a long connection with a randomly chosen backend node. When tasks enter the queue again, a broadcast message is sent through RocketMQ to notify every node in the cluster. On receiving the broadcast, each node notifies all the pages that hold a long connection with it to fetch a task, and the cycle resumes.
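The fan-out can be simulated in memory to show why a broadcast is needed: a page is attached to only one node, so every node must hear about new tasks. This sketch uses neither the RocketMQ nor the WebSocket API; Node, Broadcaster, and connect_to_random_node are hypothetical names, and callbacks stand in for open long connections.

```python
import random


class Node:
    """A backend node that tracks the pages connected to it."""

    def __init__(self, name):
        self.name = name
        self.pages = []  # callbacks standing in for open long connections

    def connect(self, on_message):
        self.pages.append(on_message)

    def on_broadcast(self, message):
        # Forward the broadcast to every page connected to this node
        for notify in self.pages:
            notify(message)


class Broadcaster:
    """Stand-in for the message broker: delivers to every node."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def publish(self, message):
        for node in self.nodes:
            node.on_broadcast(message)


def connect_to_random_node(nodes, on_message, rng=random):
    """Attach a page to one randomly chosen node, as the middle page does."""
    node = rng.choice(nodes)
    node.connect(on_message)
    return node
```

If the message were delivered to a single node instead, only the pages that happened to be attached to that node would wake up.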

The core process

Without further ado, here is the diagram first:

1: Multiple WeChat accounts can be logged in on multiple hosts and visit the middle page at the same time;

1.1: The page requests a task from the backend; if there is a task it runs the loop, and if not it establishes a long connection;

2: The page establishes a long connection with a randomly chosen backend node;

3: A task is enqueued and RocketMQ broadcasts to every node in the cluster;

4: After receiving the broadcast message, the node sends a message to the page over the long connection;

5: After receiving the message, the page requests a task from the backend, the same as in 1.1;

6: The page visits the history message page. There is a pitfall here: a direct location jump fails or arrives with incomplete parameters. Instead, put the link into a hyperlink on the page, make sure the hyperlink's target is not self, and then use JS to simulate a click on the hyperlink to jump;

7: Fiddler sends the link, with its parameters, to the backend;

7.1: Before the page is returned to the browser, Fiddler injects the auto-jump JS, so the page jumps to the middle page after the set delay;

7.2: The backend takes the parameters and, according to the custom configuration for that public account, assembles the paging query links, requests them through HttpClient, parses the returned JSON, and stores the data;

8: The page jumps back to 1.1, and the cycle repeats.

Additional features

Account Access Control

Tencent limits how many times each account can call the history message page's paging query interface within 24 hours. To avoid account bans, we record the time of every access for each mobile phone number. The storage structure is a Redis queue keyed by a fixed prefix plus the mobile phone number; each time a page is fetched, the current timestamp is pushed onto the queue for the current mobile phone number.

  • In step 1 of the core process, add the mobile phone number as a parameter when visiting the middle page. After reading it, the page first looks up the queue for the current key in Redis and checks whether the timestamp at the head of the queue is more than 24 hours old. Once the entries older than 24 hours are out of the window, or if the queue is empty, it checks the queue length. If the length exceeds the limit, the page tries again 2 minutes later; otherwise it moves on to the next step. The limit threshold depends on the actual situation.
  • In step 6 of the core process, add the mobile phone number parameter to the link being jumped to.
  • In step 7 of the core process, the mobile phone number parameter is passed to the backend. In 7.2, after obtaining it, the backend appends the current time to the tail of that mobile phone number's queue for every page it fetches.
  • Step 7.1 also receives the mobile phone number parameter and adds it when jumping back to the middle page.

In this way, the page can display the current mobile phone number and the number of times the current account has accessed the interface within the last 24 hours.
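The 24-hour check amounts to a sliding window over a queue of timestamps. The sketch below keeps the queues in memory as a stand-in for the Redis lists; the class name, the "access:" key prefix, and the choice to treat reaching the limit as blocked are assumptions.

```python
import time
from collections import deque

WINDOW_SECONDS = 24 * 60 * 60


class AccessWindow:
    """Sliding-window access counter keyed by mobile phone number."""

    def __init__(self, limit, window_seconds=WINDOW_SECONDS):
        self.limit = limit
        self.window_seconds = window_seconds
        self._queues = {}

    def _queue(self, phone):
        # "access:" is a hypothetical key prefix
        return self._queues.setdefault("access:" + phone, deque())

    def record(self, phone, now=None):
        """Append the access time to the tail of the phone's queue."""
        self._queue(phone).append(time.time() if now is None else now)

    def count(self, phone, now=None):
        """Drop entries older than the window, then return the queue length."""
        now = time.time() if now is None else now
        queue = self._queue(phone)
        while queue and now - queue[0] > self.window_seconds:
            queue.popleft()
        return len(queue)

    def allowed(self, phone, now=None):
        """True while the account is still under its limit."""
        return self.count(phone, now) < self.limit
```

Because timestamps are appended in order, only the head of the queue ever needs to be inspected, which is why a plain queue is enough here.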

Access frequency control

To avoid being blocked, we add sleep-based control over the access frequency, although the frequency may need to be raised when there are many tasks and plenty of accounts. In step 5 of the core process, before handing out a task, the backend picks a random sleep time; its minimum and maximum are set in the Nacos configuration. The page waits for that time and then continues with the rest of the process, so Nacos's hot configuration updates let us change the access frequency without restarting the project.
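A rough sketch of the random sleep selection: the bounds are read each time a task is handed out, so a configuration change takes effect on the next request. The dictionary below merely stands in for values that would be refreshed from Nacos; its keys and the function names are hypothetical.

```python
import random

# Stand-in for hot-reloadable configuration; key names are hypothetical
sleep_config = {"min_seconds": 3.0, "max_seconds": 8.0}


def current_bounds():
    """Read the bounds at call time so that updates are picked up immediately."""
    return sleep_config["min_seconds"], sleep_config["max_seconds"]


def pick_sleep_seconds(get_bounds=current_bounds, rng=random):
    """Return a random sleep time between the configured minimum and maximum."""
    minimum, maximum = get_bounds()
    if minimum > maximum:
        # Tolerate a misconfigured pair instead of failing the request
        minimum, maximum = maximum, minimum
    return rng.uniform(minimum, maximum)
```

The backend would return this value along with the task, and the page would wait that long before jumping to the history message page.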

Conclusion

At this point the basic information about each article has been collected, but not yet the body text or the interaction data. The first major part of the project is done; the remaining body and interaction logic is comparatively less complex. Later articles will cover how the body and interaction data are collected, along with some big pitfalls. Original writing takes effort; if you found this helpful, a like, bookmark, or follow is appreciated.

This series of articles is purely for technical sharing and not for any commercial use. If you reprint it, please credit the source.