Cetty

A lightweight crawler framework based on event dispatch.

Features

  • Crawler framework built on a fully customizable event-handling mechanism.
  • Modular design with strong extensibility.
  • Synchronous and asynchronous fetching based on HttpClient.
  • Multithreading support.
  • Powerful web page parsing via the Jsoup framework.

Quick start

Using Maven

<dependency>
  <groupId>com.jibug.cetty</groupId>
  <artifactId>cetty-core</artifactId>
  <version>0.1.5</version>
</dependency>
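
If you build with Gradle rather than Maven, the equivalent dependency is a one-liner (a sketch, assuming the artifact resolves from the same repository as the Maven coordinates above):

implementation 'com.jibug.cetty:cetty-core:0.1.5'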

Help

1. Detailed documentation: cetty.jibug.com/
2. QQ group
3. Issues

Now let's write our first demo:

/**
 * Grab the article list titles from the Tianya BBS:
 * http://bbs.tianya.cn/list-333-1.shtml
 *
 * @author heyingcai
 */
public class Tianya extends ProcessHandlerAdapter {

    @Override
    public void process(HandlerContext ctx, Page page) {
        // Get the Document
        Document document = page.getDocument();
        // Parse the DOM
        Elements itemElements = document.
                select("div#bbsdoc>div#bd>div#main>div.mt5>table>tbody").
                get(2).
                select("tr");
        List<String> titles = Lists.newArrayList();
        for (Element item : itemElements) {
            String title = item.select("td.td-title").text();
            titles.add(title);
        }

        // Get the Result object and pass our parsed Result to the next handler
        Result result = page.getResult();
        result.addResults(titles);
        
        // Pass this handler's result downstream through the fireXXX methods.
        // In this demo the result goes to the ConsoleReduceHandler, which prints it to the console.
        ctx.fireReduce(page);
    }

    public static void main(String[] args) {
        // Start the bootstrap class
        Bootstrap.
                me().
                // Use synchronous fetching
                isAsync(false).
                // Start a thread
                setThreadNum(1).
                // Grab the entry URL
                startUrl("http://bbs.tianya.cn/list-333-1.shtml").       
                // Common request information
                setPayload(Payload.custom()).        
                // Add custom handlers
                addHandler(new Tianya()).        
                // Add the default result handler, which prints results to the console
                addHandler(new ConsoleReduceHandler()).
                start();
    }
}
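
The same pipeline can host other handlers. Below is a minimal sketch of a second process handler that collects article links instead of titles; it reuses only the API already shown in the demo (ProcessHandlerAdapter, HandlerContext, Page, Result, fireReduce), while the CSS selector "td.td-title a" is an assumption about the page layout rather than part of the framework:

public class TianyaLinks extends ProcessHandlerAdapter {

    @Override
    public void process(HandlerContext ctx, Page page) {
        // Jsoup Document for the fetched page
        Document document = page.getDocument();
        List<String> links = Lists.newArrayList();
        // Assume each title cell wraps its article link in an <a> tag
        for (Element a : document.select("td.td-title a")) {
            // absUrl resolves the href against the page's base URL
            links.add(a.absUrl("href"));
        }
        // Hand the extracted links to the next handler in the pipeline
        page.getResult().addResults(links);
        ctx.fireReduce(page);
    }
}

To try it, swap addHandler(new Tianya()) for addHandler(new TianyaLinks()) in the Bootstrap chain above.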

Version history

version    instructions
0.1.0      Support for basic crawler functionality
0.1.5      1. Support XPath; 2. Fix failure when adding cookies; 3

TODO

  • Support for annotations
  • Proxy pool support
  • Use a Berkeley DB in-memory store as the URL manager, providing mass URL storage and more efficient access
  • Support for hot updates
  • Support for crawler governance