Colly is the best-known web crawler framework for Golang. Its API is clear, it is highly configurable and extensible, and it supports distributed crawling and a variety of storage backends (such as memory, Redis, and MongoDB). This article records my impressions and understanding from learning to use it.

Install it first:

❯ go get -u github.com/gocolly/colly/...

This go get differs from the usual package installation: the trailing ellipsis tells go get to also fetch the package's subpackages and their dependencies.
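
For instance, the colly repository contains subpackages such as debug, queue and extensions, so the single wildcard command above roughly stands in for fetching each of them individually:

❯ go get -u github.com/gocolly/colly
❯ go get -u github.com/gocolly/colly/queue
❯ go get -u github.com/gocolly/colly/extensions
...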

Let’s start with the simplest example

Colly’s documentation is detailed and complete, and the _examples directory in the project contains many crawler examples, which makes it very easy to get started. Let’s begin with my example:

package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	c := colly.NewCollector(
		colly.UserAgent("Mozilla / 5.0 (compatible; Googlebot / 2.1; +http://www.google.com/bot.html)"),
	)

	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL)
	})

	c.OnError(func(_ *colly.Response, err error) {
		fmt.Println("Something went wrong:", err)
	})

	c.OnResponse(func(r *colly.Response) {
		fmt.Println("Visited", r.Request.URL)
	})

	c.OnHTML(".paginator a".func(e *colly.HTMLElement) {
		e.Request.Visit(e.Attr("href"))
	})

	c.OnScraped(func(r *colly.Response) {
		fmt.Println("Finished", r.Request.URL)
	})

	c.Visit("https://movie.douban.com/top250?start=0&filter=")}Copy the code

This program finds all the pagination links on the Top250 page: as the callback passed to the OnHTML method describes, it takes the href value of every a tag under the element with the paginator class.

Run it:

❯ go run colly/doubanCrawler1.go
Visiting https://movie.douban.com/top250?start=0&filter=
Visited https://movie.douban.com/top250?start=0&filter=
Visiting https://movie.douban.com/top250?start=25&filter=
Visited https://movie.douban.com/top250?start=25&filter=
...
Finished https://movie.douban.com/top250?start=25&filter=
Finished https://movie.douban.com/top250?start=0&filter=

The primary entity in Colly is the Collector object (created with colly.NewCollector), which manages network communication and executes the registered callbacks while a job is running. NewCollector accepts a variety of settings at initialization time, such as the UserAgent value in this example; see the official website for the rest. A few common ones are sketched below.
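
A minimal sketch of some frequently used options (these functional options exist in Colly; the concrete values here are just placeholders):

	c := colly.NewCollector(
		colly.UserAgent("my-crawler/1.0"),        // custom User-Agent header
		colly.AllowedDomains("movie.douban.com"), // only visit these domains
		colly.MaxDepth(2),                        // limit the recursion depth of visited links
		colly.Async(true),                        // enable asynchronous requests
	)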

The Collector object can register callbacks of several kinds; listed in the order in which they are invoked, they are:

  1. OnRequest. Called before a request is made.
  2. OnError. Called if an error occurs during the request.
  3. OnResponse. Called after a response is received.
  4. OnHTML. Called right after OnResponse if the received content is HTML.
  5. OnXML. Called if the received content is XML. You rarely need it when writing an HTML crawler, so I didn’t use it above; a short sketch follows this list.
  6. OnScraped. Called after the OnXML/OnHTML callbacks have completed. The official site only says “Called after OnXML callbacks”, but in practice it fires after OnHTML as well, which is worth knowing.
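
For completeness, here is a minimal sketch of an OnXML callback (my own illustration, not part of the example above). Unlike OnHTML, its first argument is an XPath expression rather than a CSS selector, and in Colly it also fires for HTML responses:

	c.OnXML("//a", func(e *colly.XMLElement) {
		fmt.Println("link:", e.Attr("href"))
	})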

Grab the item ID and title

Let’s look at the HTML code for each entry on the Top250 page:

<ol class="grid_view"> <li> <div class="item"> <div class="info"> <div class="hd"> <a Href = "https://movie.douban.com/subject/1292052/" class = "" > < span class =" title "> shawshank redemption < / span > < span class =" title "> & have spent /&nbsp; The Shawshank Redemption</span> <span class="other">&nbsp; /&nbsp; Black fly (Hong Kong)/exciting 1995 (Taiwan) < / span > < / a > < span class = "playable" > [can play] < / span > < / div > < / div > < / div > < / li >... </ol>Copy the code

Take a look at how this program works:

package main

import (
	"log"
	"strings"

	"github.com/gocolly/colly"
)

func main() {
	c := colly.NewCollector(
		colly.Async(true),
		colly.UserAgent("Mozilla / 5.0 (compatible; Googlebot / 2.1; +http://www.google.com/bot.html)"),
	)

	c.Limit(&colly.LimitRule{DomainGlob:  "*.douban.*", Parallelism: 5})

	c.OnRequest(func(r *colly.Request) {
		log.Println("Visiting", r.URL)
	})

	c.OnError(func(_ *colly.Response, err error) {
		log.Println("Something went wrong:", err)
	})

	c.OnHTML(".hd".func(e *colly.HTMLElement) {
		log.Println(strings.Split(e.ChildAttr("a"."href"), "/") [4],
			strings.TrimSpace(e.DOM.Find("span.title").Eq(0).Text()))
    })

	c.OnHTML(".paginator a".func(e *colly.HTMLElement) {
		e.Request.Visit(e.Attr("href"))
	})

	c.Visit("https://movie.douban.com/top250?start=0&filter=")
	c.Wait()
}

If you ran the previous example, you will have noticed that fetching is synchronous and slow. This time colly.Async(true) is passed to colly.NewCollector to turn on asynchronous requests. In addition, the Limit method sets the rule &colly.LimitRule{DomainGlob: "*.douban.*", Parallelism: 5}, whose glob restricts fetching to douban addresses (the domain suffix and subdomain are left unconstrained); a LimitRule can also match domains with a regular expression instead, as the official documents describe.

The Limit method also caps concurrency at 5. Why control concurrency? Because the crawling bottleneck often comes from the target site’s rate limits: exceed a certain request frequency within a period of time and you are easily blocked, so the frequency has to be controlled. Besides, to avoid putting extra pressure and resource consumption on the target site, you should throttle your crawler anyway.
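
A LimitRule can throttle in time as well as in parallelism. A minimal sketch (Delay and RandomDelay are real LimitRule fields; the values here are arbitrary, and "time" must be added to the imports):

	c.Limit(&colly.LimitRule{
		DomainGlob:  "*.douban.*",
		Parallelism: 5,               // at most 5 concurrent requests to matching domains
		Delay:       1 * time.Second, // fixed pause before each new request
		RandomDelay: 2 * time.Second, // plus up to 2s of extra random delay
	})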

There is no OnResponse callback in this example, mainly because no logic needs it here. The Wait method appears because with Async(true) we have to wait for the goroutines to finish. There are now two OnHTML callbacks: one decides which pages to visit, and the other fetches the item information, namely this part:

c.OnHTML(".hd".func(e *colly.HTMLElement) {
    log.Println(strings.Split(e.ChildAttr("a"."href"), "/") [4],
        strings.TrimSpace(e.DOM.Find("span.title").Eq(0).Text()))
})
Copy the code

Colly uses goquery for HTML parsing, so you can simply follow goquery syntax. The ChildAttr method returns the value of an attribute of a child element, and there is also a ChildText method (not shown above) for getting a child element’s text content. However, each entry here has two span tags with the title class, and ChildText would return the text of both tags joined together; Colly does not provide a ChildTexts method (although ChildAttrs exists), so I dropped down to the underlying goquery selection instead: strings.TrimSpace(e.DOM.Find("span.title").Eq(0).Text()).
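
To make the difference concrete, a small sketch of both approaches inside the same .hd callback (my own illustration):

	c.OnHTML(".hd", func(e *colly.HTMLElement) {
		// ChildText joins the text of every matching child,
		// so this returns both title spans glued together:
		both := e.ChildText("span.title")
		// e.DOM exposes the underlying goquery selection,
		// which lets us take just the first title span:
		first := strings.TrimSpace(e.DOM.Find("span.title").Eq(0).Text())
		log.Println(both, "->", first)
	})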

Use XPath in Colly

If you don’t like the goquery syntax, you can switch to an XPath-based HTML parsing library such as htmlquery. Here’s my example:

import "github.com/antchfx/htmlquery"

c.OnResponse(func(r *colly.Response) {
    doc, err := htmlquery.Parse(strings.NewReader(string(r.Body)))
    if err != nil {
        log.Fatal(err)
    }
    nodes := htmlquery.Find(doc, `//ol[@class="grid_view"]/li//div[@class="hd"]`)
    for _, node := range nodes {
        url := htmlquery.FindOne(node, "./a/@href")
        title := htmlquery.FindOne(node, `.//span[@class="title"]/text()`)
        log.Println(strings.Split(htmlquery.InnerText(url), "/")[4],
            htmlquery.InnerText(title))
    }
})

This time I get the item ID and title in the OnResponse callback instead. htmlquery.Parse accepts any object that implements the io.Reader interface, hence strings.NewReader(string(r.Body)). The rest of the code is the same as in my earlier post “Using Golang to write a crawler (5): using XPath” and could be copied over directly.
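
If htmlquery isn’t installed yet, it can be fetched the same way as colly:

❯ go get -u github.com/antchfx/htmlquery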

Afterword

I liked Colly as soon as I tried it. How about you?

The code address

Address of this article: strconv.com/posts/use-c…

The full code can be found at this address.