Data acquisition is a key part of a public opinion monitoring system. Although its core technology is built on crawler frameworks, scraping vast amounts of data from the Internet is not something one or two crawlers can handle, especially when a large number of websites is being captured: every day many sites change their status and layout, and the crawlers must be able to respond and be maintained quickly.

Once a distributed crawler runs at scale, many problems appear, each of them a technical challenge with its own barrier to entry, for example:

1. The target site detects that you are a crawler and blocks your IP. Did it detect you through your User-Agent, your behavior pattern, or something else? How do you get around it?

2. How do you identify the dirty data returned to you?

3. How do you design scheduling rules so that you do not crawl the target site to death?

4. You are required to crawl 10000W (100 million) records in a day with limited bandwidth. How do you improve efficiency with a distributed design?

5. After the data is crawled back, do you need to clean it? Will the other party's dirty data contaminate your original data?

6. Part of the other party's data has not been updated. Do you need to download the unchanged data again? How do you know? How do you optimize your rules?

7. There is too much data for one database. Should it be split?

8. The other party's data is rendered with JavaScript. How do you capture it? Do you bring in PhantomJS?

9. The data returned by the other party is encrypted. How do you decrypt it?

10. The other party uses a CAPTCHA. How do you get past it?

11. The other party only has an APP. How do you obtain its data interface?

12. After the data is crawled back, how do you present it? How do you visualize it? How do you use it and extract its value?

13. And so on…

When collecting Internet data at scale, it is necessary to build a complete data acquisition system; otherwise project development and data collection will be inefficient, and many unexpected problems will arise.

Open source public opinion system

Project address: gitee.com/stonedtx/yu…

Online experience system

  • Environment: open-yuqing.stonedt.com/
  • User name: 13900000000
  • Password: stonedt

Open source technology stack

  • Development platform: Java EE & SpringBoot
  • Crawler framework: spider-flow & WebMagic & HttpClient
  • APP crawler: Xposed framework
  • URL repository: Redis
  • Web application server: Nginx & Tomcat
  • Data processing and storage task dispatch: Kafka & Zookeeper
  • Fetch task dispatch: RabbitMQ
  • Configuration management: MySQL
  • Front-end display: Bootstrap & Vue

The overall architecture

(This is the earliest system architecture diagram)

Data processing process

(This is the earliest system design)

Source management

"Source" is short for source of information. For each source we need to manage attributes such as collection type, content, platform, and region. We have developed three generations of the source management platform.

First-generation product form

Second-generation product form

Third-generation product form

Site portrait

Using simulated-browser request technology, the system implements depth-first and breadth-first crawling of an entire site in three steps: 1) whole-site scanning, 2) data storage, 3) feature analysis.

  • SiteMeta identifies the structure of the entire site, parses and stores it, and creates a "mini-archive" for every site that is crawled.

  • SiteIndex pre-stores all pages identified during the scan and extracts a variety of feature values for analysis and calculation; from the site directory to the site columns, every target page to be crawled is tagged with different feature parameters.

  • SiteFeatures finally distills the overall analysis results into a capture portrait and feature set for the site, so that the machine knows which capture strategy to automatically match against the site's features. Based on such a design, large-scale data acquisition can run unattended, which is the kind of data coverage that large search engines such as Baidu and Google achieve.

    A robot first "pre-reads" the entire site, acting as the spearhead. Once the crawling situation is clear, the machine quickly knows which acquisition strategy to apply. Of the many websites that need to be collected, only a very small fraction requires human intervention, and most require no crawler code at all: fully automated, low-code, large-scale data acquisition.
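Below is a minimal Java sketch of the site-portrait idea. The class and member names (SitePortrait, CaptureStrategy, the feature keys) are illustrative assumptions, not the project's actual API; it only shows how feature values gathered during a whole-site scan might be turned into an automatic strategy choice.

```java
// Minimal sketch: turn scanned site features into a capture-strategy choice.
// All names and thresholds are illustrative assumptions.
import java.util.HashMap;
import java.util.Map;

public class SitePortrait {

    // Strategies a portrait can map to; the real system likely has many more.
    enum CaptureStrategy { PLAIN_HTTP, BROWSER_RENDER, MANUAL_CONFIG }

    private final String siteUrl;
    private final Map<String, Object> features = new HashMap<>();

    public SitePortrait(String siteUrl) {
        this.siteUrl = siteUrl;
    }

    // Steps 1-2: record feature values gathered during the whole-site scan,
    // e.g. JS-rendering ratio or how many known templates matched.
    public void putFeature(String name, Object value) {
        features.put(name, value);
    }

    // Step 3: derive a capture strategy from the stored features so that
    // most sites can be crawled without human intervention.
    public CaptureStrategy chooseStrategy() {
        double jsRatio = (double) features.getOrDefault("jsRenderRatio", 0.0);
        int templateMatches = (int) features.getOrDefault("templateMatches", 0);
        if (templateMatches == 0) {
            return CaptureStrategy.MANUAL_CONFIG;   // nothing matched: needs a human
        }
        return jsRatio > 0.5 ? CaptureStrategy.BROWSER_RENDER
                             : CaptureStrategy.PLAIN_HTTP;
    }

    public static void main(String[] args) {
        SitePortrait portrait = new SitePortrait("https://example.com");
        portrait.putFeature("jsRenderRatio", 0.7);
        portrait.putFeature("templateMatches", 3);
        System.out.println(portrait.chooseStrategy()); // BROWSER_RENDER
    }
}
```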

Data capture

  • Automatic capture: with the site-portrait attributes, the system knows which collection strategy to match, so most sites can be captured and their data identified automatically, without manual intervention.
  • Manual configuration: some websites are difficult to crawl. Using visual selection technology, the labels of the whole site are presented to the development engineers, who can then configure the crawl quickly. Whenever we collect a website, a variety of "probes" report on its structure, advertising slots, key content, navigation bar, pagination, lists, site features, data volume, crawl difficulty, update frequency, and so on.

 

  • Collection templates

    In order to simplify manual operation and improve efficiency, we also provide crawler templates. Their value is that when users encounter a site with a complicated configuration, they do not have to start from scratch; they only need to find a similar template in the template library, as shown in the figure:
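A minimal sketch of what such a reusable template might look like, assuming a simple field-to-selector mapping; the class name, fields, and selectors are illustrative, not the platform's actual template format.

```java
// Minimal sketch: a reusable bundle of extraction rules that can be cloned
// and tweaked for a similar site instead of configuring from scratch.
import java.util.LinkedHashMap;
import java.util.Map;

public class CrawlTemplate {

    private final String name;
    // field name -> CSS/XPath selector used by the crawler engine
    private final Map<String, String> selectors = new LinkedHashMap<>();

    public CrawlTemplate(String name) {
        this.name = name;
    }

    public CrawlTemplate withSelector(String field, String selector) {
        selectors.put(field, selector);
        return this;
    }

    // Reuse an existing template as the starting point for a similar site.
    public CrawlTemplate copyFor(String newName) {
        CrawlTemplate copy = new CrawlTemplate(newName);
        copy.selectors.putAll(this.selectors);
        return copy;
    }

    @Override
    public String toString() {
        return name + " " + selectors;
    }

    public static void main(String[] args) {
        CrawlTemplate newsTemplate = new CrawlTemplate("generic-news")
                .withSelector("title", "h1.article-title")
                .withSelector("body", "div.article-content")
                .withSelector("publishTime", "span.pub-time");

        // A similar site only needs to override the selectors that differ.
        CrawlTemplate siteA = newsTemplate.copyFor("site-a-news")
                .withSelector("publishTime", "em.date");

        System.out.println(siteA);
    }
}
```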

Data staging

  • Temporary storage: if collected data were written straight into the system's main big-data store, a large batch of dirty data would waste time and effort. Instead, all data is first rehearsed in a staging area and then stored again; after storage completes, a program checks and monitors the result to avoid missing fields and incorrect writes.
  • Warning: if any storage error is found during staging, we promptly notify the R&D engineer by email, telling him the error content and asking him to correct it.
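As a rough illustration, assuming a record is just a field map, a staging check might validate required fields before promotion and collect the errors that would go into the warning email; the field list and names here are assumptions, not the system's actual schema.

```java
// Minimal sketch: validate a staged record before promoting it to the main
// store, collecting problems that would trigger the e-mail warning.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class StagingValidator {

    private static final List<String> REQUIRED_FIELDS =
            List.of("url", "title", "content", "publishTime", "siteId");

    // Returns the list of problems found; an empty list means the record
    // can be promoted from staging to the main data store.
    public List<String> validate(Map<String, String> record) {
        List<String> errors = new ArrayList<>();
        for (String field : REQUIRED_FIELDS) {
            String value = record.get(field);
            if (value == null || value.isBlank()) {
                errors.add("missing or empty field: " + field);
            }
        }
        return errors;
    }

    public static void main(String[] args) {
        StagingValidator validator = new StagingValidator();
        Map<String, String> dirty = Map.of(
                "url", "https://example.com/post/1",
                "title", "Sample post",
                "content", "");   // dirty: empty body

        List<String> errors = validator.validate(dirty);
        if (!errors.isEmpty()) {
            // In the real system this would be sent to the engineer by mail.
            System.out.println("record rejected: " + errors);
        }
    }
}
```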

Low code development

  • Configuration

    At present the crawler factory is already a low-code development platform. More precisely, we do not develop on it; we configure crawlers on it for data collection and capture, as shown in the figure (a sketch of what such a configuration record might hold follows after this list):
  • Maintenance

    Low-code development also makes crawler maintenance much easier. We only need to modify a crawler's configuration in the Web management interface, and we can inspect a specific crawler's error log and debug it online. Without this, when a site's crawl breaks we would not know on which server the crawler failed, and once the number of site crawlers grows, the maintenance cost becomes very high.
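For illustration only, a configuration record edited through such a Web UI and stored in MySQL might carry fields like the following; every field name here is an assumption rather than the platform's actual schema.

```java
// Minimal sketch of a crawler configuration record managed in MySQL and
// edited through the Web UI rather than in code. Field names are assumptions.
public class CrawlerConfig {

    private long id;
    private String siteName;        // human-readable source name
    private String entryUrl;        // where the crawl starts
    private String listSelector;    // selector for list-page links
    private String detailSelector;  // selector for the detail-page body
    private int intervalMinutes;    // how often the site is re-crawled
    private boolean browserRender;  // true if JS rendering is required
    private boolean enabled;        // toggled from the management UI

    // getters and setters omitted for brevity
}
```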

Distributed acquisition

  • Controller (master)

    The crawler factory has a Web control and management console on which developers add the collection task plans and the rules and strategies for data acquisition and capture. The controller only issues capture instructions for acquisition tasks and performs no capture work itself.

  • Dispatcher (dispatch)

    Fetch tasks are delivered to any host as RabbitMQ messages containing the capture policy instructions and targets; the dispatcher is the component that sends these instructions and strategies on the master's behalf (a minimal dispatch sketch follows after this list).

  • Executor (downloader)

    The executor can be deployed on any machine in the world that can reach the Internet. As long as the machine has Internet access and accepts the collection tasks issued by the dispatcher, it can collect data and send the results back to the central data warehouse.
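A minimal sketch of the dispatch side using the RabbitMQ Java client: publish one fetch task, as a JSON string, to a queue that any downloader node could consume. The queue name, message shape, and localhost broker are assumptions for the demo, not the project's actual contract.

```java
// Minimal sketch: the master/dispatcher publishes a fetch task to RabbitMQ;
// any executor node that consumes the queue can carry it out.
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import java.nio.charset.StandardCharsets;

public class FetchTaskDispatcher {

    private static final String TASK_QUEUE = "crawler.fetch.tasks";

    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");   // assumption: local broker for the demo

        try (Connection connection = factory.newConnection();
             Channel channel = connection.createChannel()) {

            // Durable queue so tasks survive a broker restart.
            channel.queueDeclare(TASK_QUEUE, true, false, false, null);

            // A fetch instruction: target URL plus the strategy chosen from
            // the site portrait. The downloader only executes, never decides.
            String task = "{\"taskId\":\"20240101-0001\"," +
                          "\"url\":\"https://example.com/news/list\"," +
                          "\"strategy\":\"PLAIN_HTTP\"}";

            channel.basicPublish("", TASK_QUEUE, null,
                    task.getBytes(StandardCharsets.UTF_8));
            System.out.println("dispatched: " + task);
        }
    }
}
```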

 

Crawler management

  • Crawler state

    The crawlers are distributed across many servers, so it is very painful not to know which crawler on which server has a problem, or even that a server has gone down under a surging crawl load. We therefore need to monitor every server and every crawler on it: that each crawler is running normally, and that each crawler server is running normally.

  • Acquisition state

    The captured sites change frequently, so we need to know whether the data from each target site is being collected normally. By attaching a collection task number to each crawler and displaying it on the Web interface, we can see the effect of data collection at a glance. Mail alerts and daily mail statistics are used to monitor the collection status in near real time.
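One simple way to picture the crawler-state monitoring is a heartbeat registry: each node reports in periodically and the manager flags nodes that have gone quiet. This is only a sketch under that assumption; the node ids, threshold, and alerting are illustrative.

```java
// Minimal sketch: track per-node heartbeats and report nodes that have not
// checked in recently. Thresholds and names are illustrative assumptions.
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CrawlerMonitor {

    private static final Duration STALE_AFTER = Duration.ofMinutes(5);

    // crawler node id -> time of its last heartbeat
    private final Map<String, Instant> lastSeen = new ConcurrentHashMap<>();

    // Called by each crawler (e.g. over HTTP or a message queue) while it runs.
    public void heartbeat(String nodeId) {
        lastSeen.put(nodeId, Instant.now());
    }

    // Called periodically on the management side to find dead or stuck nodes.
    public void reportStaleNodes() {
        Instant cutoff = Instant.now().minus(STALE_AFTER);
        lastSeen.forEach((nodeId, seenAt) -> {
            if (seenAt.isBefore(cutoff)) {
                // In the real system this would raise a mail/Web alert.
                System.out.println("no heartbeat from " + nodeId + " since " + seenAt);
            }
        });
    }

    public static void main(String[] args) {
        CrawlerMonitor monitor = new CrawlerMonitor();
        monitor.heartbeat("spider-node-01");
        monitor.reportStaleNodes();
    }
}
```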

 

Collection and classification

  • Site acquisition

    Two modes are generally used: a direct HTTP request that returns the HTML code, and browser simulation, which restores the JS-rendered result to HTML so the HTML tags and URL paths can be located for capture (a minimal fetch sketch for the first mode follows after this list).

  • Official account collection

    At present there are basically two ways: through Sogou WeChat search, or through the official account management console, but both get blocked far too often. After various attempts, we settled on RPA-style simulation of manual operation plus proxy IPs to capture official account data. This approach also requires a large pool of WeChat official accounts, because the data is collected account by account: without an account there is no capture target.

  • App collection

    Previously, we set up Wi-Fi sharing on a development machine so that a mobile APP connected through it would expose the data it transmitted. Nowadays the cost of APP data acquisition keeps rising, and almost no mainstream APP goes without encryption, so data acquisition through the Xposed framework is the most stable collection scheme.
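A minimal sketch of the first mode (a plain HTTP request returning raw HTML). It uses the JDK's built-in java.net.http.HttpClient as a stand-in; the project's stack lists Apache HttpClient and WebMagic, and the URL and header values here are illustrative.

```java
// Minimal sketch: fetch raw HTML with a direct HTTP request. JS-rendered
// sites would return an incomplete body and go through browser simulation.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class PlainHtmlFetcher {

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(10))
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://example.com/"))
                // A browser-like User-Agent, as the crawling-strategy section explains.
                .header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
                .GET()
                .build();

        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        System.out.println("status: " + response.statusCode());
        System.out.println("html length: " + response.body().length());
    }
}
```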

Crawling strategy

  • Mock request headers

    A dedicated data table stores and updates mock request headers for various browsers, combining parameters such as Host, Connection, Accept, User-Agent, Referer, Accept-Encoding, and Accept-Language.

  • Proxy IP pool

    Proxy IPs are never enough: stable IP pools are expensive, and for the data sources that need to be collected, proxy IP resources are always scarce. In other words, you have to make full use of the proxy IP resources you have. Two main functions are implemented: 1) checking proxy IP validity in real time and discarding invalid IPs to improve request efficiency; 2) remembering that when IP_1 is blocked while crawling site A, it does not follow that IP_1 will immediately be blocked on site B or site N, so every proxy IP is used to the fullest.
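A minimal sketch of the second point: track proxy bans per target site instead of discarding a proxy globally, so a block on one site does not waste the proxy elsewhere. The class name, proxy addresses, and site keys are illustrative, and the real-time validity check is omitted.

```java
// Minimal sketch: per-site ban tracking so a proxy blocked by site A can
// still serve site B. All names and values are illustrative.
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

public class ProxyPool {

    // proxy "host:port" -> set of site keys where it has been blocked
    private final Map<String, Set<String>> blockedOn = new HashMap<>();
    private final List<String> proxies;

    public ProxyPool(List<String> proxies) {
        this.proxies = proxies;
    }

    // Record that a proxy was banned by one site only.
    public void markBlocked(String proxy, String site) {
        blockedOn.computeIfAbsent(proxy, p -> new HashSet<>()).add(site);
    }

    // Pick any proxy that is not known to be blocked for this site.
    public Optional<String> pickFor(String site) {
        return proxies.stream()
                .filter(p -> !blockedOn.getOrDefault(p, Set.of()).contains(site))
                .findFirst();
    }

    public static void main(String[] args) {
        ProxyPool pool = new ProxyPool(List.of("10.0.0.1:8080", "10.0.0.2:8080"));
        pool.markBlocked("10.0.0.1:8080", "site-a.com");

        System.out.println(pool.pickFor("site-a.com")); // Optional[10.0.0.2:8080]
        System.out.println(pool.pickFor("site-b.com")); // Optional[10.0.0.1:8080]
    }
}
```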

Collection logs

  • Log collection

    The system uses an independent, powerful server dedicated to log processing. It collects error logs from every part of the crawler fleet across the various telecom machine rooms. Each application writes messages to the message queue through logback's Kafka appender, and the messages are then written in batches to a dedicated Elasticsearch log-analysis server.

  • Tracking ID

    To troubleshoot problems more effectively, the system assigns each job a unique log tag from the moment the fetch request starts until the data is stored, so that no matter what goes wrong, we can trace which step was executed last and which procedures ran (a minimal sketch of this idea follows after this list).

  • Log analysis

    Through log analysis we can see which kinds of data currently have collection problems, where the large-scale failures are concentrated on a given day or period, which websites are affected, and whether those websites are ones we particularly care about, analyzing the problem from the surface down to individual points.
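One common way to realize such a tracking ID is SLF4J's MDC, with logback printing %X{traceId} in its log pattern; the sketch below assumes that approach and a UUID id format, which may differ from what the project actually does.

```java
// Minimal sketch: tag every log line of one job with a unique id via MDC so
// the whole path from fetch to storage can be traced in log analysis.
import java.util.UUID;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

public class TracedCrawlJob {

    private static final Logger log = LoggerFactory.getLogger(TracedCrawlJob.class);

    public static void run(String url) {
        MDC.put("traceId", UUID.randomUUID().toString());
        try {
            log.info("fetch started: {}", url);
            // ... download, parse, stage, store ...
            log.info("stored successfully");
        } catch (Exception e) {
            log.error("job failed", e);   // same traceId, easy to trace back
        } finally {
            MDC.remove("traceId");        // don't leak the id to the next job
        }
    }

    public static void main(String[] args) {
        run("https://example.com/news/1");
    }
}
```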

Data parsing

  • Automatic parsing

    Automatic parsing is mainly used for news, bidding, and recruitment pages, and the system adopts a text-density algorithm: although these three types of pages look roughly similar, across hundreds or thousands of sites the differences add up, and configuring or coding each one by hand would be a disaster (a simplified sketch of the text-density idea follows after this list).

  • Manual parsing

    Only when the machine cannot recognize a page automatically do we fall back to manual configuration, filling the fields on the web page one by one into the low-code crawler development platform.
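A deliberately simplified sketch of the text-density idea: strip tags line by line and keep the region where visible text is densest relative to markup. Real extractors, including presumably the one used here, score windows of lines and handle far more cases; everything below is illustrative.

```java
// Minimal sketch: pick the line with the highest plain-text density as a
// crude "main content" guess. Real text-density extraction is more elaborate.
import java.util.Arrays;

public class TextDensityExtractor {

    // Density of one line: characters of visible text after removing tags.
    private static int textLength(String htmlLine) {
        return htmlLine.replaceAll("<[^>]*>", "").trim().length();
    }

    // Return the densest line's text; a real extractor scores windows of
    // lines and merges adjacent dense regions.
    public static String extract(String html) {
        return Arrays.stream(html.split("\n"))
                .max((a, b) -> Integer.compare(textLength(a), textLength(b)))
                .map(line -> line.replaceAll("<[^>]*>", "").trim())
                .orElse("");
    }

    public static void main(String[] args) {
        String html = "<div class=\"nav\"><a href=\"/\">Home</a></div>\n"
                + "<div class=\"article\">This is the long body text of the news "
                + "article, much denser in plain text than the navigation.</div>\n"
                + "<div class=\"footer\"><a href=\"/about\">About</a></div>";
        System.out.println(extract(html));
    }
}
```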