0x0 Tips before reading

Prerequisites: basic Golang syntax, a basic knowledge of HTML, CSS, and JS, and at least a passing acquaintance with regular expressions and Golang's HTTP package.

The purpose of this article: to record the development of a minimalist crawler script from scratch. It is for learning purposes only and must not be used to cause harm to any site.

0x1 First introduction to a crawler

From Wikipedia:

A web crawler (spider) is a kind of web robot used to browse the World Wide Web automatically. Its purpose is generally to compile a web index. For example, web search engines use crawler software to update their own web content or their indexes of other sites' content. Web crawlers can save the pages they visit so that search engines can later generate indexes for users to search.

Crawlers consume the target system's resources when they visit web sites, and many systems do not welcome crawlers by default. Therefore, a crawler that visits a large number of pages needs to consider planning, load, and "courtesy". A public site that does not want to be crawled can make this known to crawler owners through mechanisms such as the robots.txt file, which can ask a bot to index only a portion of the site, or nothing at all.
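For example, a robots.txt file at the site root like the following (a made-up sketch, not any real site's file) tells every bot to stay out of /admin/ while leaving the rest of the site crawlable:

User-agent: *
Disallow: /admin/
Allow: /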

My current understanding: a crawler can access web pages, but the content it gets back is not necessarily the same as what the browser renders (the browser also loads additional resources). Most web sites have basic anti-crawling measures, but these can often be bypassed by setting request headers.
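A minimal sketch of that idea (the URL and User-Agent string here are placeholders; the full script in 0x3 does the same thing):

package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	req, err := http.NewRequest("GET", "https://example.com", nil)
	if err != nil {
		log.Fatalln(err)
	}
	// Without a browser-like User-Agent, many sites answer a bare GET
	// with a 403 or a stub page instead of the real content.
	req.Header.Set("User-Agent", "Mozilla/5.0 (placeholder browser UA)")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatalln(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}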

Today’s takeaway: walked through the overall development steps of a minimalist crawler.

TODO:

  • Learn about Golang’s HTTP package
  • Hand-write a general crawler framework and modularize the project
  • Dig deeper into crawlers…

0x2 Minimalist crawler development steps

  1. Create a client
  2. Create an HTTP request
  3. Add headers to the request
  4. The client sends the request
  5. The client receives a response
  6. Decode and analyze resp.Body; if it is not UTF-8, convert it to UTF-8
  7. Parse the retrieved content: extract the required information with regular expressions and format the output
  8. Output the results

0x3 Code + comment analysis

As follows:

package main

import (
	"bufio"
	"fmt"
	"golang.org/x/net/html/charset"
	"golang.org/x/text/encoding"
	"golang.org/x/text/transform"
	"io/ioutil"
	"log"
	"net/http"
	"regexp"
)

func main() {
	url := "https://www.bilibili.com/v/popular/rank/all"
	// Create a client
	client := &http.Client{}
	// Create a request
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		log.Fatalln(err)
	}
	// Set the header so the request looks like it comes from a browser
	req.Header.Set("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36")

	// The client sends the request and receives a response
	resp, err := client.Do(req)
	if err != nil {
		log.Fatalln(err)
	}
	defer resp.Body.Close()

	// If the request failed, print the status code and stop
	if resp.StatusCode != http.StatusOK {
		fmt.Println("Error: status code", resp.StatusCode)
		return
	}

	// Wrap the body in a single bufio.Reader so the bytes peeked during
	// encoding detection are not lost before decoding
	bodyReader := bufio.NewReader(resp.Body)

	// Detect the encoding of the retrieved content
	e := determineEncoding(bodyReader)
	// Convert the content to UTF-8
	utf8Reader := transform.NewReader(bodyReader, e.NewDecoder())

	// Read all the information on the page
	all, err := ioutil.ReadAll(utf8Reader)
	if err != nil {
		panic(err)
	}
	// Print the raw page
	fmt.Printf("%s", all)
	// Parse the retrieved content and print the results
	printTitle(all)
}

func determineEncoding(r *bufio.Reader) encoding.Encoding {
	// Peek at the first 1024 bytes without consuming them
	bytes, err := r.Peek(1024)
	if err != nil {
		panic(err)
	}
	e, _, _ := charset.DetermineEncoding(bytes, "")
	return e
}

func printTitle(contents []byte) {
	// Regular expression for the title links: + means one or more,
	// [^<] matches any character except <, parentheses capture groups
	re := regexp.MustCompile(`<a href="//(www.bilibili.com/video/[0-9a-zA-Z]+)" target="_blank" class="title">([^<]+)</a>`)
	// -1 means find all matches
	matches := re.FindAllSubmatch(contents, -1)
	// Print each match: m[1] is the URL capture, m[2] is the title
	for _, m := range matches {
		fmt.Printf("Title: %s, URL: %s\n", m[2], m[1])
	}
	fmt.Printf("matches: %d\n", len(matches))
}
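For reference, the regular expression in printTitle is written to match anchor tags shaped like the line below (a hypothetical example; the video id and title are made up), with m[1] capturing the URL and m[2] capturing the title:

<a href="//www.bilibili.com/video/BV1xx411c7mD" target="_blank" class="title">Some video title</a>

Note that the golang.org/x packages are external dependencies; depending on your Go setup, you may need to fetch them first (e.g. go get golang.org/x/net/html/charset and go get golang.org/x/text) before the script will build.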

0x4 Epilogue

I had no inspiration and little patience today, so I spent two and a half hours writing this. I’ll flesh it out later when I have the energy.