I am participating in the Mid-Autumn Festival Creative Submission contest; see the Mid-Autumn Festival Creative Submission Contest page for details.

This article appears in my column: Let’s Golang

Recently, while learning Go, I found that the language seems well suited to writing crawlers. Since it is the Mid-Autumn Festival, I want to crawl some pictures of mooncakes for fun. You could also crawl a few pictures to send to your mother-in-law.

Tips: This is a teaching blog about writing a crawler in Go, so it won’t dwell too long on the finer points of Go crawler development. Don’t worry about not understanding it: I will introduce all the knowledge used in this article. If you are already an expert, you can stop here, or leave a thumbs-up for this rookie.

1. Get the page image links

Here’s how to get the image links from a page. The principle is simple: we use the GetHtml function we wrote to fetch the page source code, extract the image links with a regular expression, and save the links in a string slice.

Here is the GetHtml function:

func GetHtml(url string) string {
    resp, _ := http.Get(url) // error ignored for brevity
    defer resp.Body.Close()  // always close the response body

    bytes, _ := ioutil.ReadAll(resp.Body)
    html := string(bytes)
    return html
}

The important thing to note about this function is that resp.Body must be closed; defer makes the close run automatically when the function returns.

Here is the GetPageImgurls function:

func GetPageImgurls(url string) []string {
    html := GetHtml(url)

    re := regexp.MustCompile(ReImg)
    rets := re.FindAllStringSubmatch(html, -1)

    imgUrls := make([]string, 0)
    for _, ret := range rets {
        // ret[1] is the capture group holding the relative image path
        imgUrl := "https://www.yuebing.com/" + ret[1]
        imgUrls = append(imgUrls, imgUrl)
    }
    return imgUrls
}

Because the crawled paths are relative, we prepend the protocol and domain name to form absolute URLs and store them in the string slice, which makes downloading the pictures later easier.
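By the way, the ReImg constant used above is never shown in the article. A hypothetical definition, modeled on the ReImgName pattern that appears later, might be:

ReImg = `<a.+?path="(.+?)">`

The real pattern depends on the page’s HTML, so treat this only as a placeholder.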

2. Implement the synchronous download function

Next we implement the synchronous download function: we save the picture to the hard disk, using a timestamp as the file name.

The DownloadImg function is shown below:

func DownloadImg(url string) {
    resp, _ := http.Get(url)
    defer resp.Body.Close()

    // Name the file with the current nanosecond timestamp
    filename := `E:\code\src\day4\crawl_image\img\` + strconv.Itoa(int(time.Now().UnixNano())) + ".jpg"

    imgBytes, _ := ioutil.ReadAll(resp.Body)
    err := ioutil.WriteFile(filename, imgBytes, 0644)
    if err == nil {
        fmt.Println(filename + " download successful!")
    } else {
        fmt.Println(filename + " download failed!")
    }
}

In ioutil.WriteFile(filename, imgBytes, 0644), imgBytes is the image’s byte stream and 0644 is an octal permission value. Each digit is a sum of r, w, and x, which are worth 4, 2, and 1 respectively, so 0644 means the file’s owner can read and write, users in the same group can read, and other users can read.

       owner   group   other
0644 = rw-     r--     r--
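If you want to sanity-check a permission value, os.FileMode renders the bits in the familiar ls -l style; a quick sketch:

package main

import (
    "fmt"
    "os"
)

func main() {
    // os.FileMode's String method prints the permission bits
    fmt.Println(os.FileMode(0644)) // -rw-r--r--
    fmt.Println(os.FileMode(0755)) // -rwxr-xr-x
}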

In addition, strconv.Itoa(int(time.Now().UnixNano())) needs the int cast because Itoa converts an int to a string, while the timestamp from UnixNano is an int64.
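As a side note, strconv.FormatInt can format the int64 directly, which avoids the cast; a small sketch:

package main

import (
    "fmt"
    "strconv"
    "time"
)

func main() {
    ts := time.Now().UnixNano()       // int64 nanosecond timestamp
    name := strconv.FormatInt(ts, 10) // formats the int64 directly, no cast
    fmt.Println(name + ".jpg")
}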

3. Implement the asynchronous download function

Some people say that asynchronous downloading is easy to implement in Go: one line of code does it, hehe. Yeah, let’s see how that works.

func DownloadImgAsync(url string) {
    go DownloadImg(url)
}

But this way, you open up as many goroutines as there are pictures to download.

What should we do?

chSem = make(chan int, 5)

Create a channel with a capacity of 5, so that images are downloaded with a concurrency limit of 5.
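For reference, the snippets assume two package-level variables that the article never declares explicitly; a plausible declaration, assuming the sync package is imported:

var (
    chSem      = make(chan int, 5) // buffered channel used as a counting semaphore
    downloadWG sync.WaitGroup      // lets main wait for all downloads to finish
)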

func DownloadImgAsync(url string) {
    downloadWG.Add(1)
    go func() {
        chSem <- 1 // acquire a slot; blocks when 5 downloads are already running
        DownloadImg(url)
        <-chSem // release the slot
        downloadWG.Done()
    }()
    downloadWG.Wait()
}

Each download writes a value into the channel before it starts and reads one back when it finishes, which guarantees that at most 5 photos are being downloaded at the same time.
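Here is a minimal standalone sketch (not from the original program) showing why a buffered channel works as a counting semaphore: a send blocks once the buffer is full.

package main

import "fmt"

func main() {
    sem := make(chan int, 2) // capacity 2 for a quick demo
    sem <- 1                 // acquire slot 1
    sem <- 1                 // acquire slot 2
    // sem <- 1              // would block forever here: the buffer is full
    <-sem                    // release one slot...
    sem <- 1                 // ...so this send succeeds immediately
    fmt.Println("a buffered channel behaves like a counting semaphore")
}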

Now, do you have any idea what will go wrong when this runs?

Yes: we save files under timestamp names, and with asynchronous downloads multiple files may end up with the same timestamp, so we have to generate random file names instead.

4. Generate random file names

Above we talked about generating random file names, so let’s write it.

The first step is to generate a random number; I append it to the end of the timestamp to avoid duplicate file names.

Let’s first show the code for generating random numbers:

func GetRandomInt(start, end int) int {
    randomMT.Lock()
    // Block for one nanosecond so consecutive calls get different seeds
    <-time.After(1 * time.Nanosecond)
    r := rand.New(rand.NewSource(time.Now().UnixNano()))
    ret := start + r.Intn(end-start)
    randomMT.Unlock()
    return ret
}

It locks a mutex, blocks for a nanosecond so that consecutive calls get different seeds, creates a generator seeded with the current time, computes a random number in the given range, unlocks the mutex, and returns the int.
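Re-seeding a new generator on every call (plus the nanosecond sleep) is a workaround. A sketch of a simpler alternative, seeding one package-level generator once and reusing it under the same mutex (assuming the usual math/rand, sync, and time imports):

var (
    randomMT sync.Mutex
    rng      = rand.New(rand.NewSource(time.Now().UnixNano())) // seeded once at startup
)

func GetRandomInt(start, end int) int {
    randomMT.Lock()
    defer randomMT.Unlock()
    return start + rng.Intn(end-start)
}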

The next function to generate a random file name is simpler:

func GetRandomName() string {
    timestamp := strconv.Itoa(int(time.Now().UnixNano()))
    randomNum := strconv.Itoa(GetRandomInt(100, 10000))
    return timestamp + "-" + randomNum
}

Generate a timestamp and a random number, then concatenate them.
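A quick usage sketch (the example value is only illustrative):

filename := GetRandomName() + ".jpg"
// e.g. "1632648442123456789-4217.jpg": timestamp, a dash, then the random number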

5. Use the Title attribute as the file name

We use a regular expression to get the image link and the image’s name from its Title attribute. At first I thought I would use one regular expression to crawl the link and another to crawl the name, but I can’t rule out that some picture has no Title attribute at all, so I chose to crawl all the information in one pass, Title attribute or not.

Let’s start with the first piece of code:

func GetPageImginfos(url string) []map[string]string {
    html := GetHtml(url)

    re := regexp.MustCompile(ReImgName)
    rets := re.FindAllStringSubmatch(html, -1)

    imgInfos := make([]map[string]string, 0)
    for _, ret := range rets {
        imgInfo := make(map[string]string)
        imgUrl := "https://www.yuebing.com/" + ret[1]
        // The image links all have the same length, so slice the URL out
        imgInfo["url"] = imgUrl[0:78]
        imgInfo["filename"] = GetImgNameTag(ret[1])
        imgInfos = append(imgInfos, imgInfo)
    }
    return imgInfos
}

This code uses the regular expression

ReImgName = `<a.+?path="(.+?)">`

to crawl the substring containing both the image link and the Title attribute, and saves the URL and filename into a map. Since the image links are all the same length, slicing the URL out of the string is easy (hence the imgUrl[0:78] above), but the Title is not so easy: its length is not fixed. So what to do?

Here’s how to get the value from the Title tag:

func GetImgNameTag(imgTag string) string {
    re := regexp.MustCompile(ReTitle)
    rets := re.FindAllStringSubmatch(imgTag, -1)
    if len(rets) > 0 {
        return rets[0][1]
    } else {
        return GetRandomName()
    }
}

We use a regular expression again to extract the value inside the Title attribute; if there is none, we fall back to a random file name.

The regular expression content is as follows:

ReTitle = `title="(.+)`

The crawler is basically done. Hurry up and send some mooncake pictures to your mother-in-law!

And then I found a big problem.

I discovered that this asynchronous download can only download one page’s images asynchronously; it cannot download images from multiple pages at the same time. So the program needs modifying.

Let’s add a wg *sync.WaitGroup parameter to the asynchronous download function:

func DownloadImgAsync(url, filename string, wg *sync.WaitGroup) {
    wg.Add(1)
    go func() {
        chSem <- 1 // acquire a semaphore slot
        DownloadImg(url, filename)
        <-chSem // release the slot
        wg.Done() // match the wg.Add(1) above
    }()
}

We no longer wait here (the downloadWG.Wait() inside the old version blocked after every single dispatch); instead, main waits after all downloads have been dispatched. Note that DownloadImg now takes the target filename as a second parameter rather than generating it internally.
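The article doesn’t show the updated DownloadImg; a minimal two-argument version consistent with the calls above might look like this (the img directory is the path assumed earlier):

func DownloadImg(url, filename string) {
    resp, _ := http.Get(url)
    defer resp.Body.Close()

    imgBytes, _ := ioutil.ReadAll(resp.Body)
    path := `E:\code\src\day4\crawl_image\img\` + filename + ".jpg"
    if err := ioutil.WriteFile(path, imgBytes, 0644); err == nil {
        fmt.Println(path + " download successful!")
    } else {
        fmt.Println(path + " download failed!")
    }
}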

So let me show you the main function here.

func main() {
    for i := 1; i <= 15; i++ {
        j := strconv.Itoa(i)
        url := "https://www.yuebing.com/category-0-b0-min0-max0-attr0-" + j + "-sort_order-ASC.html"
        imginfos := GetPageImginfos(url)
        for _, imgInfoMap := range imginfos {
            DownloadImgAsync(imgInfoMap["url"], imgInfoMap["filename"], &downloadWG)
            time.Sleep(500 * time.Millisecond)
        }
    }
    downloadWG.Wait() // wait for every dispatched download to finish
}
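Note the time.Sleep(500 * time.Millisecond) between dispatches: together with the capacity-5 chSem channel, it throttles the crawler so that at most five downloads run at once and new ones are dispatched at most twice per second. Be gentle with the target site.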

This is obviously going to be much faster.

That covers the relevant knowledge used in this article.

6. Finally

I have uploaded the code to GitHub; help yourself.

Github.com/ReganYue/Cr…