Introduction to cockroach

There is no particular reason behind the name cockroach. It is long and hard to remember, and spelling it kept slowing me down while coding.

This project is another pit I have dug for myself. I have dug quite a few, each one deeper than the last.

It is a small, flexible, robust crawler framework; let's call it a framework.

It is simple enough that you can write a crawler in just a few lines of code.

Environment

  • Java 8 (the project uses some of the new Java 8 features)
  • Maven

Here is a point-by-point introduction:

Small

Create a new Maven project and add the dependency to your pom.xml:

<dependency>
    <groupId>com.github.zhangyingwei</groupId>
    <artifactId>cockroach</artifactId>
    <version>1.0.5-Alpha</version>
</dependency>

In case I ever forget to update the documentation: always use the latest released version.

Create a new test class App.java in your project and add a main method.

Example

public static void main(String[] args) {
    CockroachConfig config = new CockroachConfig()
            .setAppName("I am a cockroach")
            .setThread(2);
    CockroachContext context = new CockroachContext(config);
    TaskQueue queue = TaskQueue.of();
    context.start(queue);

    new Thread(() -> {
        int i = 0;
        while (true) {
            i++;
            try {
                Thread.sleep(1000);
                String url = "http://www.xicidaili.com/wt/" + i;
                System.out.println(url);
                queue.push(new Task(url));
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
            if (i > 1000) {
                break;
            }
        }
    }).start();
}

Flexible

So where does the flexibility come in?

  • You can customize HTTP clients (optional, okhttp3 is used by default)
  • You can customize the processing of results (optional, print processor is used by default)

Customize HTTP clients

First let’s try a custom client

public class SelfHttpClient implements HttpClient {
    public HttpClient setProxy(HttpProxy proxy) {
        // set the proxy
    }

    public TaskResponse doGet(Task task) throws Exception {
        // implement the GET request
    }

    public HttpClient proxy() {
        // apply the proxy to the HTTP client
    }

    public TaskResponse doPost(Task task) throws Exception {
        // implement the POST request
    }

    public HttpClient setCookie(String cookie) {
        // set the cookie
    }

    public HttpClient setHttpHeader(Map<String, String> httpHeader) {
        // set the headers
    }
}

Apply the custom HTTP client to the crawler

CockroachConfig config = new CockroachConfig()
        .setAppName("I am a cockroach")
        .setThread(2) // number of crawler threads
        .setHttpClient(SelfHttpClient.class);

Custom result handling classes

Let's define a result handling class:

public class SelfStore implements IStore {
    @Override
    public void store(TaskResponse response) {
        System.out.println(response.getContent());
    }
}

Here we simply print the result; in practice you could save it to a database or a file instead. It is worth noting that if the result is HTML text, the response also provides a select("css selector") method for processing the resulting document.
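For example, a store that pulls data out of the page with select() and appends it to a file might look roughly like this. This is only a sketch: the FileStore name and results.txt path are made up, and I am assuming the selected elements expose a Jsoup-style text() method (the deep-crawl example further down uses element.attr(), so this seems plausible).

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.stream.Collectors;

public class FileStore implements IStore {
    @Override
    public void store(TaskResponse response) throws IOException {
        // extract the text of every <h1> on the page
        List<String> titles = response.select("h1").stream()
                .map(element -> element.text())
                .collect(Collectors.toList());
        // append the extracted lines to a local file (path is made up)
        Files.write(Paths.get("results.txt"), titles,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }
}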

Apply the custom store to the crawler

CockroachConfig config = new CockroachConfig()
        .setAppName("I am a cockroach")
        .setThread(2) // number of crawler threads
        .setHttpClient(SelfHttpClient.class)
        .setStore(SelfStore.class);

Custom error handling classes

When an HTTP request for a page fails, the task is handed to the error handling class. If no custom error handling class is configured, the system falls back to DefaultTaskErrorHandler, which simply logs the error message. Its implementation is as follows.

public class DefaultTaskErrorHandler implements ITaskErrorHandler {
    private Logger logger = Logger.getLogger(DefaultTaskErrorHandler.class);

    @Override
    public void error(Task task, String message) {
        logger.info("task error: " + message);
    }
}

If you need custom error handling, copy the class above as a starting point: implement the ITaskErrorHandler interface and put your own logic in the error method.
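The SelfTaskErrorHandler referenced in the configuration below could be as simple as this (a minimal sketch, assuming the same log4j-style Logger the default handler uses; I am not assuming anything about Task beyond its toString()):

import org.apache.log4j.Logger;

public class SelfTaskErrorHandler implements ITaskErrorHandler {
    private Logger logger = Logger.getLogger(SelfTaskErrorHandler.class);

    @Override
    public void error(Task task, String message) {
        // log the failing task and the reason; retry logic could go here
        logger.warn("task failed: " + task + ", reason: " + message);
    }
}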

Once we have defined the custom error handling class, we apply it to the crawler.

CockroachConfig config = new CockroachConfig()
        .setAppName("I am a cockroach")
        .setThread(2) // number of crawler threads
        .setHttpClient(SelfHttpClient.class)
        .setStore(SelfStore.class)
        .setTaskErrorHandler(SelfTaskErrorHandler.class);

Robust

As for robustness, it mainly shows in the following areas:

Dealing with IP Blocking

Here we use dynamic proxies to solve this problem.

Use of dynamic proxies

CockroachConfig config = new CockroachConfig()
        .setAppName("I am a cockroach")
        .setThread(2) // number of crawler threads
        .setHttpClient(SelfHttpClient.class)
        .setProxys("100.100.100.100:8888,101.101.101.101:8888");

As shown above, we can configure several proxy IP addresses, which together form a proxy pool. Before each request, the crawler randomly selects one address from the pool to use as its proxy.

Addressing User-Agent issues in HTTP requests

A User-Agent pool is built into the program, and each request randomly takes a User-Agent from it. There are currently 17 User-Agents bundled into the program; I may open this part up to configuration so it can be customized (is that even worthwhile?).
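Internally, the selection is presumably nothing more than picking a random entry from a fixed list, along these lines (my own sketch of the idea, not the framework's actual code):

import java.util.Arrays;
import java.util.List;
import java.util.Random;

public class UserAgentPool {
    // a couple of sample entries; the real pool bundles 17
    private static final List<String> USER_AGENTS = Arrays.asList(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36"
    );
    private static final Random RANDOM = new Random();

    public static String random() {
        // pick a random User-Agent for each request
        return USER_AGENTS.get(RANDOM.nextInt(USER_AGENTS.size()));
    }
}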

Exception handling in the program

Honestly, exception handling is not my strong suit, but I have done my best to keep exceptions within a controllable range. A number of custom exceptions are defined in the program; I won't go into detail, since there isn't much worth saying about them here.

About deep crawling

There is no built-in implementation of deep crawling, because I don't find it useful in most cases. That said, there is still room for it: we can extract the links from a page and push them back onto the queue ourselves, achieving the effect of deep crawling.

public class DemoStore implements IStore {
    private String id = NameUtils.name(DemoStore.class);

    public DemoStore() throws IOException {}

    @Override
    public void store(TaskResponse response) throws IOException {
        List<String> urls = response.select("a").stream()
                .map(element -> element.attr("href"))
                .collect(Collectors.toList());
        try {
            response.getQueue().push(urls);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Annotation support

I recently found some spare time to add annotation support. So what does a crawler look like with annotations?

@EnableAutoConfiguration
@AppName("hello spider")
@Store(PrintStore.class)
@AutoClose(true)
@ThreadConfig(num = 1)
@CookieConfig("asdfasdfasdfasdfasfasdfa")
@HttpHeaderConfig({
        "key1=value1",
        "key2=value2"
})
@ProxyConfig("1.1.1.1,2.2.2.2")
public class CockroachApplicationTest {
    public static void main(String[] args) throws Exception {
        TaskQueue queue = TaskQueue.of();
        queue.push(new Task("http://blog.zhangyingwei.com"));
        CockroachApplication.run(CockroachApplicationTest.class, queue);
    }
}

Above is a demo of all the annotations. Setting the demo aside, if you really just wanted a minimal crawler, how would you write it?

@EnableAutoConfiguration
public class CockroachApplicationTest {
    public static void main(String[] args) throws Exception {
        TaskQueue queue = TaskQueue.of();
        queue.push(new Task("http://blog.zhangyingwei.com"));
        CockroachApplication.run(CockroachApplicationTest.class,queue);
    }
}

Yes, it's that simple. This crawler fetches the page http://blog.zhangyingwei.com and prints the result. When processing results, the program falls back to the default PrintStore class, which prints everything it receives.

Dynamic Header support

A crawler I worked on recently ran into problems crawling Lagou. You need to be logged in to crawl it, which by itself could be solved by configuring a cookie, but Lagou's cookies contain an anti-crawler check: a timestamp inside the cookie has to change dynamically. That is where this feature comes in.

This function is used as follows:

Cookie generator

@CookieConfig(cookieGenerator = CookieGeneratorTest.class)

/**
 * Created by zhangyw on 2017/12/19.
 */
public class CookieGeneratorTest implements StringGenerator {
    @Override
    public String get(Task task) {
        String cookie = "v=" + UUID.randomUUID().toString();
        System.out.println(cookie);
        return cookie;
    }
}

Before each HTTP request is made, the program calls the generator's get method to obtain the cookie for that request and attaches it to the HTTP request header.

Header generator

Since the headers the program needs are map data, the header generator looks like this:

@HttpHeaderConfig(headerGenerator = HeaderGeneratorTest.class)

/**
 * Created by zhangyw on 2017/12/19.
 */
public class HeaderGeneratorTest implements MapGenerator {
    private Map headers = new HashMap();

    @Override
    public Map get(Task task) {
        return headers;
    }
}

Those are all the generators so far. Note that the task object is passed into each generator, so the crawler can use different cookies and headers when dealing with different addresses.

Anyway, here’s an example:

/**
 * Created by zhangyw on 2017/12/19.
 */
public class HeaderGeneratorTest implements MapGenerator {
    private Map headers = new HashMap();

    @Override
    public Map get(Task task) {
        if ("jobs.lagou".equals(task.getGroup())) {
            headers.put("key", "value");
            return headers;
        } else {
            return null;
        }
    }
}
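One thing the examples never show is how a task gets its group. Assuming Task accepts a group label next to the URL (a hypothetical constructor, not confirmed by anything above), pushing a grouped task might look like this:

// Hypothetical: assumes a Task(url, group) constructor, which the
// examples above do not confirm.
TaskQueue queue = TaskQueue.of();
queue.push(new Task("https://www.lagou.com/jobs/12345.html", "jobs.lagou"));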

OK, that's about it.

I have something to say about distribution

These days every web crawler has to be at least a little distributed, and that bothers me.

In fact, I have read the source of several so-called distributed crawlers, and their "distributed" isn't even pseudo-distributed!! Using Redis as message middleware makes it distributed? That's not distributed at all. I was going to use Redis as message middleware myself and pass the thing off as distributed, but halfway through writing it I suddenly felt a little sick about it, so I deleted the code. The program stays clean, and my conscience stays clean too.

But the distributed pit must be dug!!

So, my distributed version will include:

  • Distributed message middleware (perhaps using Redis, or implementing one myself; for my own peace of mind, most likely implementing one myself)
  • Distributed task scheduling
  • Distributed fault tolerance mechanism
  • Distributed transaction
  • Condition monitoring

So, is the pit getting bigger? Damn, I'm a little scared!! As for when the pit gets filled, or whether it can be filled at all, that depends on my mood…

In fact, I’m not in the mood to fill this distributed pit yet…

PS

Yesterday afternoon I opened dozens of threads to crawl Zhihu, and the company's network admin reported a suspected DoS attack, which scared me into moving the crawler to a cloud server.

Contact

License

Licensed under the Apache 2.0 license.