This is my first article on Juejin, so I'm a little excited, haha!

This time we'll use Golang to scrape GamerSky, the famous (read: trashy) game media site.

The main third-party package used is goquery, which parses HTML. It doesn't matter if you haven't used goquery before; it's pretty simple.

The second part is using Golang to insert the data into MySQL.
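Both packages can be installed with go get (these are the same import paths used in the code below):

go get github.com/PuerkitoBio/goquery
go get github.com/go-sql-driver/mysql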

First, the page is requested using the net/http package.

func main() {
	url := "https://www.gamersky.com/news/"
	
	resp, err := http.Get(url)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != 200 {
		log.Fatalf("status code error: %d %s", resp.StatusCode, resp.Status)
	}
}

Errors when requesting the page are not tolerated here, so log.Fatal exits immediately when one occurs.

Then use goquery.NewDocumentFromReader to load the response body into a parsed HTML document.

    // NewDocumentFromReader returns a Document from an io.Reader.
    html, err := goquery.NewDocumentFromReader(resp.Body)
    if err != nil {
        log.Fatal(err)
    }

Next we can use goquery to parse the HTML page.

First, we get all the news links on this page.

Each news link appears in an a tag with class="tt", so we use goquery to select every a tag whose class is tt and read its href attribute; that gives us all the news links on the page.

func getNewsList(html *goquery.Document, newsList []string) []string {
	html.Find("a[class=tt]").Each(func(i int, selection *goquery.Selection) {
		url, _ := selection.Attr("href")
		newsList = append(newsList, url)
	})
	return newsList
}

So we get all the news links from the news front page and collect them in the newsList slice.

Next, let's crawl these news links to fetch each individual story.

We use goroutines to request the news links concurrently and parse the results.

	var newsList []string
	newsList = getNewsList(html, newsList)

	var wg sync.WaitGroup
	for i := 0; i < len(newsList); i++ {
		wg.Add(1)
		go getNews(newsList[i], &wg)
	}
	wg.Wait()

First we initialize a sync.WaitGroup to track the goroutines and make sure the program waits until all of them complete.

Iterating over newsList, which holds all the news links, we spawn one goroutine per link to handle the rest of the processing.

wg.Wait() blocks the program until all tasks in the WaitGroup are complete.

Next we start parsing each news page to get the data we want.

First we define the News struct.

type News struct {
	Title   string
	Media   string
	Url     string
	PubTime string
	Content string
}

As in the first step, we need to request the news link.

func getNews(url string, wg *sync.WaitGroup) {
	resp, err := http.Get(url)
	if err != nil {
		log.Println(err)
		wg.Done()
		return
	}

	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		log.Printf("Error: status code %d", resp.StatusCode)
		wg.Done()
		return
	}

	html, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		log.Println(err)
		wg.Done()
		return
	}
	news := News{}

After these steps, the HTML we successfully requested has been turned into a document that goquery can parse.

The title is in the h1 under the div with class="Mid2L_tit".

html.Find("div[class=Mid2L_tit]>h1").Each(func(i int, selection *goquery.Selection) {
	news.Title = selection.Text()
    })
	
if news.Title == "" {
	wg.Done()
	return
    }

A few special news columns use a page format different from ordinary news pages and are hard to handle for now, so we return when no Title can be parsed.

We can see that the time is in the div with class="detail", but its text cannot be saved to the database as-is. Here I use a regular expression to extract all the date and time digits and assemble them into a format the database accepts.

	var tmpTime string
	html.Find("div[class=detail]").Each(func(i int, selection *goquery.Selection) {
		tmpTime = selection.Text()
	})
	reg := regexp.MustCompile(`\d+`)
	timeString := reg.FindAllString(tmpTime, -1)
	news.PubTime = fmt.Sprintf("%s-%s-%s %s:%s:%s", timeString[0], timeString[1], timeString[2], timeString[3], timeString[4], timeString[5])

If there is a better way, please be sure to teach me!
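One possible improvement (just a sketch of the same digit-group idea with validation bolted on) is to guard against missing matches, which would otherwise panic with an index out of range, and let time.Parse check the assembled string:

import (
	"fmt"
	"regexp"
	"time"
)

// parsePubTime reuses the digit-group idea above, but refuses to build a
// timestamp from fewer than six groups and lets time.Parse validate the result.
func parsePubTime(detail string) (string, error) {
	parts := regexp.MustCompile(`\d+`).FindAllString(detail, -1)
	if len(parts) < 6 {
		return "", fmt.Errorf("unexpected time format: %q", detail)
	}
	s := fmt.Sprintf("%s-%s-%s %s:%s:%s",
		parts[0], parts[1], parts[2], parts[3], parts[4], parts[5])
	// The layout "2006-1-2 15:4:5" accepts both padded and unpadded components.
	t, err := time.Parse("2006-1-2 15:4:5", s)
	if err != nil {
		return "", err
	}
	return t.Format("2006-01-02 15:04:05"), nil
}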

The next step is to parse the news body.

The news body is in the p tags under the div with class="Mid2L_con".

html.Find("div[class=Mid2L_con]>p").Each(func(i int, selection *goquery.Selection) {
    news.Content = news.Content + selection.Text()
})

Now that we have all the data we need, the next step is to store it in MySQL.

Start by creating a table called gamesky.

create table gamesky
(
    id          int auto_increment
        primary key,
    title       varchar(256)                        not null,
    media       varchar(16)                         not null,
    url         varchar(256)                        not null,
    content     varchar(4096)                       null,
    pub_time    timestamp default CURRENT_TIMESTAMP not null on update CURRENT_TIMESTAMP,
    create_time timestamp default CURRENT_TIMESTAMP not null
);

Next we establish a MySQL connection.

package mysql

import (
    "database/sql"
    "fmt"
    "os"

    _ "github.com/go-sql-driver/mysql"
)

var db *sql.DB

func init() {
    db, _ = sql.Open("mysql", "root:root@tcp(127.0.0.1:3306)/game_news?charset=utf8")
    db.SetMaxOpenConns(1000)
    err := db.Ping()
    if err != nil {
        fmt.Println("Failed to connect to mysql, err:" + err.Error())
        os.Exit(1)
    }
}

func DBCon() *sql.DB {
    return db
}

The next step is to use the MySQL connection we set up to save the data we scraped.

    db := mysql.DBCon()

    stmt, err := db.Prepare(
        "insert into gamesky (`title`, `url`, `media`, `content`, `pub_time`) values (?, ?, ?, ?, ?)")
    if err != nil {
        log.Println(err)
        wg.Done()
        return
    }
    defer stmt.Close()

    rs, err := stmt.Exec(news.Title, news.Url, news.Media, news.Content, news.PubTime)
    if err != nil {
        log.Println(err)
        wg.Done()
        return
    }
    if id, _ := rs.LastInsertId(); id > 0 {
        log.Println("Insert successful")
    }
    wg.Done()

rs.LastInsertId() fetches the ID of the row that was just inserted; if the insert succeeded, the new record's ID is returned.

The news body sometimes exceeds the column length defined in MySQL. You can either enlarge the column or save only part of the body.
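If you choose to save only part of the body, a minimal sketch (truncating by runes so a multi-byte character is never cut in half, sized for the varchar(4096) column above):

// truncateRunes trims s to at most max runes; going through []rune
// ensures a multi-byte UTF-8 character is never split.
func truncateRunes(s string, max int) string {
	r := []rune(s)
	if len(r) > max {
		return string(r[:max])
	}
	return s
}

// before inserting:
// news.Content = truncateRunes(news.Content, 4096)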

After an error occurs in a goroutine, or after the database write finishes, remember to call wg.Done() to decrement the WaitGroup counter by 1.
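A tidier pattern (just a suggestion; the code in this article calls wg.Done() manually on every path) is to defer the call once at the top of the goroutine, so no early return can forget it:

func getNews(url string, wg *sync.WaitGroup) {
	// Done runs on every return path, error or not.
	defer wg.Done()
	// ... request, parse, and insert as above ...
}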

With this, our crawler can fetch news concurrently and save it to the database.

It turns out we crawl too fast and trigger GamerSky's anti-crawler protection, so the request frequency has to be lowered; that, however, throws away Golang's concurrency advantage. If you want to fetch data concurrently and still avoid the anti-crawler, you need a good proxy pool, which I haven't set up here.
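For reference, a minimal sketch of routing requests through a proxy (the address below is a placeholder; a real proxy pool would hand out a different live proxy per request):

import (
	"net/http"
	"net/url"
	"time"
)

// newProxyClient returns an http.Client that sends its requests through
// the given proxy instead of connecting directly.
func newProxyClient(proxyAddr string) (*http.Client, error) {
	proxyURL, err := url.Parse(proxyAddr)
	if err != nil {
		return nil, err
	}
	return &http.Client{
		Transport: &http.Transport{Proxy: http.ProxyURL(proxyURL)},
		Timeout:   10 * time.Second,
	}, nil
}

// usage (placeholder address):
// client, _ := newProxyClient("http://127.0.0.1:8080")
// resp, err := client.Get(url)

The complete code is below.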

package main

import (
	"fmt"
        "game_news/mysql"
	"log"
	"net/http"
	"regexp"
	"sync"

	"github.com/PuerkitoBio/goquery"
)

type News struct {
	Title   string
	Media   string
	Url     string
	PubTime string
	Content string
}

func main() {
	url := "https://www.gamersky.com/news/"
	resp, err := http.Get(url)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != 200 {
		log.Fatalf("status code error: %d %s", resp.StatusCode, resp.Status)
	}

	html, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		log.Fatal(err)
	}

	var newsList []string
	newsList = getNewsList(html, newsList)

	var wg sync.WaitGroup
	for i := 0; i < len(newsList); i++ {
		wg.Add(1)
		go getNews(newsList[i], &wg)
	}
	wg.Wait()
}

func getNewsList(html *goquery.Document, newsList []string) []string {
	// '//a[@class="tt"]/@href'
	html.Find("a[class=tt]").Each(func(i int, selection *goquery.Selection) {
		url, _ := selection.Attr("href")
		newsList = append(newsList, url)
	})
	return newsList
}

func getNews(url string, wg *sync.WaitGroup) {
	resp, err := http.Get(url)
	if err != nil {
		log.Println(err)
		wg.Done()
		return
	}

	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		log.Printf("Error: status code %d", resp.StatusCode)
		wg.Done()
		return
	}

	html, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		log.Println(err)
		wg.Done()
		return
	}
	news := News{}

	news.Url = url
	news.Media = "GameSky"
	html.Find("div[class=Mid2L_tit]>h1").Each(func(i int, selection *goquery.Selection) {
		news.Title = selection.Text()
	})

	if news.Title == "" {
		wg.Done()
		return
	}

	html.Find("div[class=Mid2L_con]>p").Each(func(i int, selection *goquery.Selection) {
		news.Content = news.Content + selection.Text()
	})

	var tmpTime string
	html.Find("div[class=detail]").Each(func(i int, selection *goquery.Selection) {
		tmpTime = selection.Text()
	})
	reg := regexp.MustCompile(`\d+`)
	timeString := reg.FindAllString(tmpTime, -1)
	news.PubTime = fmt.Sprintf("%s-%s-%s %s:%s:%s", timeString[0], timeString[1], timeString[2], timeString[3], timeString[4], timeString[5])

	db := mysql.DBCon()

	stmt, err := db.Prepare(
		"insert into gamesky (`title`, `url`, `media`, `content`, `pub_time`) values (?, ?, ?, ?, ?)")
	if err != nil {
		log.Println(err)
		wg.Done()
		return
	}
	defer stmt.Close()

	rs, err := stmt.Exec(news.Title, news.Url, news.Media, news.Content, news.PubTime)
	if err != nil {
		log.Println(err)
		wg.Done()
		return
	}
	if id, _ := rs.LastInsertId(); id > 0 {
		log.Println("Insert successful")
	}
	wg.Done()
}


That's it for this article. If you have any questions about anything above, please leave a comment. Thank you very much!