This article was first published on my blog. If you find it useful, please like and bookmark it so more friends can see it.

Recently, I have been finding fewer and fewer interesting questions on Zhihu, so I planned to aggregate technical questions from other platforms, such as SegmentFault and Stack Overflow.

To do this, we need a crawler. I took some time to check out Colly, a crawler framework written in Go.

Summary

Colly is a well-known crawler framework implemented in Go, and Go's advantages in high-concurrency and distributed scenarios are exactly what crawler technology needs. Its main characteristics are that it is lightweight, fast, and elegantly designed, and that its distributed support is simple and easy to scale.

How to learn

The best-known crawler framework is probably Python's Scrapy, which is the first framework many people encounter, and I am no exception. It is well documented and has extensive extensions. When people design a crawler framework, they often refer to its design, and I have seen a few articles about Scrapy-like implementations in Go.

By comparison, Colly's learning materials are pitifully sparse. When I first looked at it, I couldn't help trying to apply my Scrapy experience, but it doesn't work that way.

At this point, I naturally thought of looking for some articles to read, but it turned out there were really only a few articles related to Colly. Basically, everything I could find was official material, and even that seemed far from complete. No way around it, just chew through it slowly! Official learning materials usually come from three places: the documentation, the examples, and the source code.

Today, let's start from the official documentation. On to the main text.

The official documentation

The official documentation focuses on how to use the framework; friends with crawler experience can skim through it quickly. I took some time to reorganize the official documentation according to my own line of thinking.

There isn't much content: installation, quick start, configuration, debugging, distributed crawling, storage, using multiple collectors, configuration optimization, and extensions.

Each document is so small that it barely needs scrolling.

How to install

Colly is as easy to install as any other Go library. As follows:

go get -u github.com/gocolly/colly

Just one command. So easy!

Quick start

Let's take a quick look at Colly through a Hello World example. The steps are as follows:

Step one, import Colly.

import "github.com/gocolly/colly"

Second, create the collector.

c := colly.NewCollector()

Third, register event listeners; event handling is performed through callbacks.

// Find and visit all links
c.OnHTML("a[href]".func(e *colly.HTMLElement) {
	link := e.Attr("href")
	// Print link
	fmt.Printf("Link found: %q -> %s\n", e.Text, link)
	// Visit link found on page
	// Only those links are visited which are in AllowedDomains
	c.Visit(e.Request.AbsoluteURL(link))
})

c.OnRequest(func(r *colly.Request) {
	fmt.Println("Visiting", r.URL)
})

By the way, colly supports the following types of events:

  • OnRequest, called before a request is executed
  • OnResponse, called after a response is received
  • OnHTML, called when an HTML element matching the registered selector is found
  • OnXML, called when an XML or HTML element matching the registered selector is found
  • OnHTMLDetach, deregisters an OnHTML callback; takes the selector string
  • OnXMLDetach, deregisters an OnXML callback; takes the selector string
  • OnScraped, called after the page has been scraped, once all other work is done
  • OnError, called if an error occurs during the request
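
As a small illustration (my own sketch, not part of the quick-start example), registering the remaining callbacks looks just like the OnHTML and OnRequest calls above:

c.OnResponse(func(r *colly.Response) {
	fmt.Println("Got a response from", r.Request.URL)
})

c.OnError(func(r *colly.Response, err error) {
	fmt.Println("Request failed:", err)
})

c.OnScraped(func(r *colly.Response) {
	fmt.Println("Finished", r.Request.URL)
})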

Finally, c.Visit() actually starts the crawl.

c.Visit("http://go-colly.org/")

The complete code for this example is provided in the basic example under the _examples directory of Colly's source code.

How to configure

Colly is a flexible framework that provides a number of configurable options for developers. By default, each option is given a sensible default value.

Here is a collector created with the defaults.

c := colly.NewCollector()

The collector can be configured at creation time, for example by setting the User-Agent and allowing URLs to be revisited. The code is as follows:

c2 := colly.NewCollector(
	colly.UserAgent("xy"),
	colly.AllowURLRevisit(),
)

We can also create the collector first and change its configuration afterwards.

c2 := colly.NewCollector()
c2.UserAgent = "xy"
c2.AllowURLRevisit = true

The collector's configuration can be changed at any stage of the crawl. A classic example: simple anti-crawling measures can be dealt with by randomly changing the User-Agent.

const letterBytes = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

func RandomString() string {
	b := make([]byte, rand.Intn(10)+10)
	for i := range b {
		b[i] = letterBytes[rand.Intn(len(letterBytes))]
	}
	return string(b)
}

c := colly.NewCollector()

c.OnRequest(func(r *colly.Request) {
	r.Headers.Set("User-Agent", RandomString())
})

As mentioned earlier, the collector picks sensible defaults for us, but they can also be changed through environment variables. This way, we don't have to recompile every time a setting changes. Environment variable configuration takes effect when the collector is initialized and can be overridden after the collector has started.

The following configuration items are supported:

  • ALLOWED_DOMAINS (string slice): allowed domain names, e.g. []string{"segmentfault.com", "zhihu.com"}
  • CACHE_DIR (string): cache directory
  • DETECT_CHARSET (y/n): whether to detect the response encoding
  • DISABLE_COOKIES (y/n): disable cookies
  • DISALLOWED_DOMAINS (string slice): disallowed domain names, same type as ALLOWED_DOMAINS
  • IGNORE_ROBOTSTXT (y/n): whether to ignore the robots.txt protocol
  • MAX_BODY_SIZE (int): maximum response body size
  • MAX_DEPTH (int, 0 means no limit): maximum crawl depth
  • PARSE_HTTP_ERROR_RESPONSE (y/n): whether to parse HTTP error responses
  • USER_AGENT (string): the User-Agent header

These options are fairly self-explanatory.

Let's also look at the HTTP configuration, which covers common settings such as proxies and various timeouts.

c := colly.NewCollector()
c.WithTransport(&http.Transport{
	Proxy: http.ProxyFromEnvironment,
	DialContext: (&net.Dialer{
		Timeout:   30 * time.Second,          // Dial timeout
		KeepAlive: 30 * time.Second,          // Keep-alive interval
		DualStack: true,
	}).DialContext,
	MaxIdleConns:          100,               // Maximum number of idle connections
	IdleConnTimeout:       90 * time.Second,  // Idle connection timeout
	TLSHandshakeTimeout:   10 * time.Second,  // TLS handshake timeout
	ExpectContinueTimeout: 1 * time.Second,
})

Debugging

Scrapy provides a handy shell for debugging. Unfortunately, Colly has nothing comparable. The Debugger here is mainly used to collect information at runtime.

The Debugger is an interface; we only need to implement its two methods to collect runtime information.

type Debugger interface {
    // Init initializes the backend
    Init() error
    // Event receives a new collector event.
    Event(e *Event)
}

There is a typical implementation in the source code, LogDebugger; we only need to supply the corresponding io.Writer. How do we use it?

Here is an example:

package main

import (
	"log"
	"os"

	"github.com/gocolly/colly"
	"github.com/gocolly/colly/debug"
)

func main() {
	writer, err := os.OpenFile("collector.log", os.O_RDWR|os.O_CREATE, 0666)
	if err != nil {
		panic(err)
	}

	c := colly.NewCollector(colly.Debugger(&debug.LogDebugger{Output: writer}), colly.MaxDepth(2))
	c.OnHTML("a[href]".func(e *colly.HTMLElement) {
		if err := e.Request.Visit(e.Attr("href")); err ! =nil {
			log.Printf("visit err: %v", err)
		}
	})

	if err := c.Visit("http://go-colly.org/"); err != nil {
		panic(err)
	}
}

When the run is complete, open collector.log to view the output.

Distributed

A distributed crawler can be considered at several levels: the proxy level, the execution level, and the storage level.

The proxy level

By setting up a proxy pool, we can spread download tasks across different proxy IPs, which helps improve page download speed and reduces the chance of an IP being banned for crawling too fast.

Setting proxy IPs in Colly looks like this:

package main

import (
	"github.com/gocolly/colly"
	"github.com/gocolly/colly/proxy"
)

func main() {
	c := colly.NewCollector()

	if p, err := proxy.RoundRobinProxySwitcher(
		"Socks5: / / 127.0.0.1:1337"."Socks5: / / 127.0.0.1:1338"."http://127.0.0.1:8080",); err ==nil {
		c.SetProxyFunc(p)
	}
	// ...
}

proxy.RoundRobinProxySwitcher is Colly's built-in function for switching proxies in round-robin fashion. Of course, we can also implement the switching logic entirely ourselves.

For example, a random proxy switcher would look like this:

var proxies []*url.URL = []*url.URL{
	&url.URL{Host: "127.0.0.1:8080"},
	&url.URL{Host: "127.0.0.1:8081"},
}

func randomProxySwitcher(_ *http.Request) (*url.URL, error) {
	return proxies[rand.Intn(len(proxies))], nil
}

// ...
c.SetProxyFunc(randomProxySwitcher)

However, note that the crawler is still centralized at this point: tasks are executed on only one node.

The execution level

This level achieves true distribution by assigning tasks to different nodes for execution.

To implement distributed execution, we first face a problem: how do we assign tasks to different nodes, and how do those nodes cooperate?

First, we choose an appropriate communication scheme. Common options are HTTP, a stateless text protocol, and TCP, a connection-oriented protocol. There are also various RPC protocols to choose from, such as JSON-RPC, Facebook's Thrift, and Google's gRPC.

The documentation provides sample code for an HTTP service that is responsible for receiving requests and performing tasks. As follows:

package main

import (
	"encoding/json"
	"log"
	"net/http"

	"github.com/gocolly/colly"
)

type pageInfo struct {
	StatusCode int
	Links      map[string]int
}

func handler(w http.ResponseWriter, r *http.Request) {
	URL := r.URL.Query().Get("url")
	if URL == "" {
		log.Println("missing URL argument")
		return
	}
	log.Println("visiting", URL)

	c := colly.NewCollector()

	p := &pageInfo{Links: make(map[string]int)}

	// count links
	c.OnHTML("a[href]".func(e *colly.HTMLElement) {
		link := e.Request.AbsoluteURL(e.Attr("href"))
		iflink ! ="" {
			p.Links[link]++
		}
	})

	// extract status code
	c.OnResponse(func(r *colly.Response) {
		log.Println("response received", r.StatusCode)
		p.StatusCode = r.StatusCode
	})
	c.OnError(func(r *colly.Response, err error) {
		log.Println("error:", r.StatusCode, err)
		p.StatusCode = r.StatusCode
	})

	c.Visit(URL)

	// dump results
	b, err := json.Marshal(p)
	if err != nil {
		log.Println("failed to serialize response:", err)
		return
	}
	w.Header().Add("Content-Type"."application/json")
	w.Write(b)
}

func main() {
	// example usage: curl -s 'http://127.0.0.1:7171/?url=http://go-colly.org/'
	addr := ":7171"

	http.HandleFunc("/", handler)

	log.Println("listening on", addr)
	log.Fatal(http.ListenAndServe(addr, nil))
}

No scheduler code is provided, but implementing one is not complicated. When a task is complete, the service returns the discovered links to the scheduler, which sends new tasks to the worker nodes for further execution.

If the node that executes a task needs to be chosen based on load, the service also has to expose a monitoring API so the scheduler can obtain each node's performance data and make its decision.
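
The docs do not include scheduler code, so here is a minimal sketch of my own, assuming worker nodes run the HTTP service above at hypothetical addresses; it simply dispatches queued URLs round-robin.

package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"net/url"
)

// Hypothetical worker addresses running the HTTP service shown above.
var workers = []string{"http://worker1:7171", "http://worker2:7171"}

// dispatch sends one crawl task to a worker node and returns its JSON result.
func dispatch(worker, taskURL string) ([]byte, error) {
	resp, err := http.Get(fmt.Sprintf("%s/?url=%s", worker, url.QueryEscape(taskURL)))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	return io.ReadAll(resp.Body)
}

func main() {
	queue := []string{"http://go-colly.org/"}
	for i := 0; len(queue) > 0; i++ {
		task := queue[0]
		queue = queue[1:]
		body, err := dispatch(workers[i%len(workers)], task)
		if err != nil {
			log.Println("dispatch error:", err)
			continue
		}
		// A real scheduler would parse the returned links here and
		// append unseen ones to the queue.
		log.Println("worker result:", string(body))
	}
}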

The storage level

We have achieved distribution by assigning tasks to different nodes. However, some data, such as cookies and the record of visited URLs, needs to be shared between nodes. By default, this data is kept in memory, so each collector has its own copy.

We can share data between nodes by saving it to storage such as Redis or MongoDB. Colly supports switching to any storage backend, as long as it implements the methods of the colly/storage.Storage interface.
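
For reference, the Storage interface is small; as far as I can tell from the colly/storage source, it looks roughly like this:

package storage

import "net/url"

// Storage is the interface a backend must implement to be usable by Colly
// (reproduced roughly from colly/storage).
type Storage interface {
	// Init initializes the storage backend
	Init() error
	// Visited stores the ID of a request that has been visited
	Visited(requestID uint64) error
	// IsVisited reports whether a request ID has been visited before
	IsVisited(requestID uint64) (bool, error)
	// Cookies returns the stored cookies for the given host
	Cookies(u *url.URL) string
	// SetCookies stores cookies for the given host
	SetCookies(u *url.URL, cookies string)
}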

In fact, Colly already ships with several storage implementations. We'll look at them in the next section.

Storage

As mentioned earlier, let's take a look at the storage backends Colly already supports.

InMemoryStorage, i.e. memory, is Colly's default storage; it can be replaced via collector.SetStorage().

RedisStorage: perhaps because Redis is more commonly used in distributed scenarios, an official usage example is provided.

Other options include Sqlite3Storage and MongoStorage.
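
As an illustration, switching a collector to Redis-backed storage might look roughly like this; this is a sketch based on the gocolly/redisstorage README, and the connection details are placeholders.

package main

import (
	"github.com/gocolly/colly"
	"github.com/gocolly/redisstorage"
)

func main() {
	c := colly.NewCollector()

	// Redis-backed storage; address, DB and prefix are placeholder values.
	storage := &redisstorage.Storage{
		Address:  "127.0.0.1:6379",
		Password: "",
		DB:       0,
		Prefix:   "colly",
	}

	// Replace the default in-memory storage.
	if err := c.SetStorage(storage); err != nil {
		panic(err)
	}

	c.Visit("http://go-colly.org/")
}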

Multiple collectors

The crawlers we demonstrated so far are all fairly simple, with similar processing logic. For a complex crawler, we can create different collectors to handle different tasks.

What does this mean? Let me give an example.

If you have written crawlers for a while, you must have run into the problem of crawling parent and child pages: the processing logic of the parent page usually differs from that of the child page, and data usually needs to be shared between them. If you have used Scrapy, you know that it handles different pages by binding callbacks to requests, and shares data by attaching it to the request, passing the parent page's data down to the child page.

After some research, I found that Colly does not support Scrapy's approach. So what do we do? That is exactly the problem we are trying to solve here.

For different page-processing logic, we can create multiple collectors, each responsible for handling the logic of one kind of page.

c := colly.NewCollector(
	colly.UserAgent("myUserAgent"),
	colly.AllowedDomains("foo.com", "bar.com"),
)

// Custom User-Agent and allowed domains are cloned to c2
c2 := c.Clone()

Typically, the collectors for parent and child pages share similar configuration. In the example above, the child page's collector c2 is created with Clone(), which copies the parent collector's configuration.

Data can be passed between parent and child pages, that is, between different collectors, through the Context. Note that this Context is Colly's own data-sharing structure, not the Go standard library's context.

c.OnResponse(func(r *colly.Response) {
	r.Ctx.Put("Custom-header", r.Headers.Get("Custom-Header"))
	c2.Request("GET"."https://foo.com/".nil, r.Ctx, nil)})Copy the code

This way, in the child page we can read the data passed by the parent through r.Ctx. For this scenario, take a look at the official coursera_courses example.
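
For completeness, here is a small sketch of my own (not from the docs) of how the child collector could read that value:

c2.OnResponse(func(r *colly.Response) {
	// Read the value the parent collector stored in the shared context.
	fmt.Println("Custom-header:", r.Ctx.Get("Custom-header"))
})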

Configuration optimization

Colly's default configuration is optimized for crawling a small number of sites. If you are crawling a large number of sites, some tuning is needed.

Persistent storage

By default, Colly keeps cookies and visited URLs in memory; we want to switch to persistent storage instead. As mentioned earlier, Colly already provides implementations of some common persistent storage backends.

Enable asynchrony to speed up task execution

By default, Colly blocks while waiting for each request to complete, which leads to a growing queue of tasks waiting to be executed. We can avoid this by setting the collector's Async option to true for asynchronous processing. If you do this, remember to call c.Wait(), otherwise the program will exit immediately.
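
A minimal sketch of what that looks like (my own example, using the Async option and Wait):

c := colly.NewCollector(
	colly.Async(true), // handle requests asynchronously
)

c.OnHTML("a[href]", func(e *colly.HTMLElement) {
	e.Request.Visit(e.Attr("href"))
})

c.Visit("http://go-colly.org/")
// Without Wait the program would exit before the asynchronous requests finish.
c.Wait()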

Disable or restrict KeepAlive connections

Colly enables keep-alive by default to increase crawl speed. However, keep-alive connections hold open file descriptors, and for long-running tasks the process can easily hit the maximum descriptor limit.

Example code for disabling HTTP keep-alive is shown below.

c := colly.NewCollector()
c.WithTransport(&http.Transport{
	DisableKeepAlives: true,
})
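
If you would rather limit keep-alive connections than disable them entirely, the same WithTransport call can set the standard http.Transport limits instead; a sketch:

c := colly.NewCollector()
c.WithTransport(&http.Transport{
	MaxIdleConns:        100,              // cap total idle (keep-alive) connections
	MaxIdleConnsPerHost: 4,                // cap idle connections per host
	IdleConnTimeout:     30 * time.Second, // close idle connections after 30s
})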

Extensions

Colly ships with some extensions, mainly common crawler-related features such as referer, random_user_agent, and url_length_filter. The source lives under colly/extensions/.

An example of how to use them is as follows:

package main

import (
    "log"

    "github.com/gocolly/colly"
    "github.com/gocolly/colly/extensions"
)

func main() {
    c := colly.NewCollector()
    visited := false

    extensions.RandomUserAgent(c)
    extensions.Referer(c)

    c.OnResponse(func(r *colly.Response) {
        log.Println(string(r.Body))
        if !visited {
            visited = true
            r.Request.Visit("/get?q=2")
        }
    })

    c.Visit("http://httpbin.org/get")}Copy the code

Simply pass the collector into the extension function. It's that easy.

So, can we implement an extension ourselves?

With Scrapy, implementing an extension requires a fair amount of prior knowledge and a careful read of its documentation. Colly's documentation, however, says nothing about it. What to do? It looks like we can only read the source code.

Let's open the source of the Referer extension:

package extensions

import (
	"github.com/gocolly/colly"
)

// Referer sets valid Referer HTTP header to requests.
// Warning: this extension works only if you use Request.Visit
// from callbacks instead of Collector.Visit.
func Referer(c *colly.Collector) {
	c.OnResponse(func(r *colly.Response) {
		r.Ctx.Put("_referer", r.Request.URL.String())
	})
	c.OnRequest(func(r *colly.Request) {
		if ref := r.Ctx.Get("_referer"); ref != "" {
			r.Headers.Set("Referer", ref)
		}
	})
}

An extension is implemented by registering some event callbacks on the collector. The source is so simple that we can write our own extension without any documentation at all. Colly's elegant design and Go's concise syntax are the main reasons for this brevity; the idea is similar to Scrapy's, extending behavior through callbacks on requests and responses.
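
To make this concrete, here is a hypothetical extension of my own in the same style; the name and header are made up and not part of colly/extensions.

package extensions

import (
	"github.com/gocolly/colly"
)

// CrawlerTag is a hypothetical example extension (not part of colly/extensions).
// It sets a fixed X-Crawler header on every outgoing request.
func CrawlerTag(c *colly.Collector) {
	c.OnRequest(func(r *colly.Request) {
		r.Headers.Set("X-Crawler", "my-colly-bot")
	})
}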

Conclusion

After reading Colly's official documentation, you'll find that although it is fairly rudimentary, it covers pretty much everything that should be covered. Where some content is missing, I have tried to supplement it in this article. The Go Elastic package I used earlier was also poorly documented, but a quick read of the source made it immediately clear how to use it.

Maybe this is the simplicity of Go.

Finally, if you run into problems using Colly, the official examples are the best reference, so take the time to read them.