Cetty

A lightweight crawler framework based on event dispatch.

Features

  • Crawler framework built on a fully customizable event-handling mechanism.
  • Modular design with strong extensibility.
  • Synchronous and asynchronous fetching based on HttpClient.
  • Multithreading support.
  • Powerful web page parsing via the Jsoup framework.

Quick start

Using Maven

<dependency>
  <groupId>com.jibug.cetty</groupId>
  <artifactId>cetty-core</artifactId>
  <version>0.1.5</version>
</dependency>
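
If you build with Gradle rather than Maven, the equivalent dependency is a one-liner (a sketch, assuming the artifact resolves from the same repository as the Maven coordinates above):

implementation 'com.jibug.cetty:cetty-core:0.1.5'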

Help

1. Detailed documentation: cetty.jibug.com/
2. QQ group
3. Issues

Now let's write our first demo:

/**
 * Grab the article list titles from the Tianya BBS:
 * http://bbs.tianya.cn/list-333-1.shtml
 *
 * @author heyingcai
 */
public class Tianya extends ProcessHandlerAdapter {

    @Override
    public void process(HandlerContext ctx, Page page) {
        // Get the Document
        Document document = page.getDocument();
        // Parse the DOM
        Elements itemElements = document.
                select("div#bbsdoc>div#bd>div#main>div.mt5>table>tbody").
                get(2).
                select("tr");
        List<String> titles = Lists.newArrayList();
        for (Element item : itemElements) {
            String title = item.select("td.td-title").text();
            titles.add(title);
        }

        // Get the Result object and pass our parsed Result to the next handler
        Result result = page.getResult();
        result.addResults(titles);
        
        // Pass this handler's result downstream through the fireXXX methods.
        // In this demo the result goes to the ConsoleReduceHandler, which prints it to the console.
        ctx.fireReduce(page);
    }

    public static void main(String[] args) {
        // Start the bootstrap class
        Bootstrap.
                me().
                // Use synchronous fetching
                isAsync(false).
                // Start a thread
                setThreadNum(1).
                // Grab the entry URL
                startUrl("http://bbs.tianya.cn/list-333-1.shtml").       
                // Common request information
                setPayload(Payload.custom()).        
                // Add custom handlers
                addHandler(new Tianya()).        
                // Add the default result handler, which prints results to the console
                addHandler(new ConsoleReduceHandler()).
                start();
    }
}
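
The same pipeline can host other handlers. Below is a minimal sketch of a second process handler that collects article links instead of titles; it reuses only the API already shown in the demo (ProcessHandlerAdapter, HandlerContext, Page, Result, fireReduce), while the CSS selector "td.td-title a" is an assumption about the page layout rather than part of the framework:

public class TianyaLinks extends ProcessHandlerAdapter {

    @Override
    public void process(HandlerContext ctx, Page page) {
        // Jsoup Document for the fetched page
        Document document = page.getDocument();
        List<String> links = Lists.newArrayList();
        // Assume each title cell wraps its article link in an <a> tag
        for (Element a : document.select("td.td-title a")) {
            // absUrl resolves the href against the page's base URL
            links.add(a.absUrl("href"));
        }
        // Hand the extracted links to the next handler in the pipeline
        page.getResult().addResults(links);
        ctx.fireReduce(page);
    }
}

To try it, swap addHandler(new Tianya()) for addHandler(new TianyaLinks()) in the Bootstrap chain above.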

Version history

version    instructions
0.1.0      Support for basic crawler functionality
0.1.5      1. Support XPath; 2. Fix failure when adding cookies; 3

TODO

  • Support for annotations
  • Proxy pool support
  • Use a Berkeley DB in-memory store as the URL manager, providing mass URL storage and more efficient access
  • Support for hot updates
  • Support for crawler governance