Background
In the previous article, we introduced BubbleTea, a neat TUI application framework, and ended with a program that pulls the GitHub Trending repositories and displays them in the console. Since GitHub doesn’t provide an official Trending API, we implemented one ourselves with goquery. Due to space constraints, the previous article did not describe how. In this article, I’ve cleaned up the code and published it as a separate repository.
Observation
First, let’s look at the structure of the GitHub Trending page:
In the upper left corner, you can switch between Repositories and Developers. On the right, you can choose the Spoken Language (the natural language, e.g. Chinese or English), the Language (the programming language, e.g. Go or C++) and the Date Range (three options: Today, This week, This month).
Each repository entry then shows the following information:
① Repository author and name
② Repository description
③ The main programming language (set when the repository is created), if any
④ Number of stars
⑤ Number of forks
⑥ List of contributors
⑦ Number of new stars within the selected date range (Today, This week, This month)
The developer page is similar, but carries much less information:
① Author information
② The developer’s most popular repository
Notice that after switching to the developers page, the URL becomes github.com/trending/developers. When we set the spoken language to Chinese, the programming language to Go and the date range to Today, the URL becomes https://github.com/trending/go?since=daily&spoken_language_code=zh. Each selection is expressed in the URL: the programming language as a path segment, the others as key-value pairs in the query string.
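To make the URL structure concrete, here is a minimal sketch that takes the repository URL above apart with the standard library net/url. The helper name describeTrendingURL is hypothetical and not part of ghtrending, and it assumes a repository URL (for the developers page the path segment would be "developers").

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// describeTrendingURL is a hypothetical helper: it splits a Trending
// repository URL into the programming language (path segment) and the
// selections carried in the query string.
func describeTrendingURL(raw string) (lang, since, spoken string, err error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", "", "", err
	}
	// The path looks like /trending or /trending/<language>.
	lang = strings.Trim(strings.TrimPrefix(u.Path, "/trending"), "/")
	q := u.Query()
	return lang, q.Get("since"), q.Get("spoken_language_code"), nil
}

func main() {
	lang, since, spoken, err := describeTrendingURL(
		"https://github.com/trending/go?since=daily&spoken_language_code=zh")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(lang, since, spoken) // prints: go daily zh
}
```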
Preparation
Create the repository ghtrending on GitHub, clone it locally, and run go mod init to initialize the module:
$ go mod init github.com/darjun/ghtrending
Then run go get to download the goquery library:
$ go get github.com/PuerkitoBio/goquery
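Define two structs based on the repository and developer information:
type Repository struct {
	Author  string   // repository owner
	Name    string   // repository name
	Link    string   // absolute URL of the repository
	Desc    string   // repository description
	Lang    string   // main programming language
	Stars   int      // total number of stars
	Forks   int      // total number of forks
	Add     int      // stars gained in the selected date range
	BuiltBy []string // avatar URLs of the contributors
}

type Developer struct {
	Name        string
	Username    string
	PopularRepo string // most popular repository
	Desc        string
}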
Crawling
To extract the information with goquery, we first need to know the structure of the page. Press F12 to open the Chrome Developer Tools and select the Elements tab to see the page structure:
To quickly locate any content on the page, use the inspect button in the upper left corner of the tools panel. We click on a single repository entry:
The Elements panel on the right shows that each repository entry is an article element:
We can use the standard library net/http to fetch the content of the whole page:
resp, err := http.Get("https://github.com/trending")
Then create the goquery document from the response body:
doc, err := goquery.NewDocumentFromReader(resp.Body)
With the document object, we can call its Find() method and pass in a selector. Here I use .Box .Box-row: .Box is the class of the div wrapping the whole list, and .Box-row is the class of each repository entry. Combining the two makes the selection more precise. Find() returns a *goquery.Selection, whose Each() method we can call to process each entry. Each() takes a function of type func(int, *goquery.Selection); the second argument is the goquery selection for a single repository entry:
doc.Find(".Box .Box-row").Each(func(i int, s *goquery.Selection) {
	// parse one repository entry from s
})
Now let’s look at how to extract each part. Hovering over elements in the Elements panel highlights the part of the page each one corresponds to:
We find the structure that holds the repository name and author:
It is wrapped in the a element under the h1 element of the article. The author name is in a span element inside the a element, the repository name is the text directly under the a element, and the link to the repository is the a element’s href attribute. Let’s extract them:
titleSel := s.Find("h1 a")
repo.Author = strings.Trim(titleSel.Find("span").Text(), "/\n ")
repo.Name = strings.TrimSpace(titleSel.Contents().Last().Text())
relativeLink, _ := titleSel.Attr("href")
if len(relativeLink) > 0 {
	repo.Link = "https://github.com" + relativeLink
}
The repository description is in the p element within the article element:
repo.Desc = strings.TrimSpace(s.Find("p").Text())
The programming language, star count, fork count, contributors (BuiltBy) and new-star count are all in the last div of the article element. The programming language, BuiltBy and the new-star count are in span elements, while the star and fork counts are in a elements. If the programming language is not set, there is one span element fewer:
var langIdx, addIdx, builtByIdx int
spanSel := s.Find("div>span")
if spanSel.Size() == 2 {
	// language does not exist
	langIdx = -1
	addIdx = 1
} else {
	builtByIdx = 1
	addIdx = 2
}

// language
if langIdx >= 0 {
	repo.Lang = strings.TrimSpace(spanSel.Eq(langIdx).Text())
} else {
	repo.Lang = "unknown"
}

// add: the text starts with the number of new stars
addParts := strings.SplitN(strings.TrimSpace(spanSel.Eq(addIdx).Text()), " ", 2)
repo.Add, _ = strconv.Atoi(addParts[0])

// builtby: collect the avatar URL of each contributor
spanSel.Eq(builtByIdx).Find("a>img").Each(func(i int, img *goquery.Selection) {
	src, _ := img.Attr("src")
	repo.BuiltBy = append(repo.BuiltBy, src)
})
Then there are stars and forks:
aSel := s.Find("div>a")

// stars: second-to-last a element, with thousands separators removed
starStr := strings.TrimSpace(aSel.Eq(-2).Text())
star, _ := strconv.Atoi(strings.Replace(starStr, ",", "", -1))
repo.Stars = star

// forks: last a element
forkStr := strings.TrimSpace(aSel.Eq(-1).Text())
fork, _ := strconv.Atoi(strings.Replace(forkStr, ",", "", -1))
repo.Forks = fork
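The snippets above discard the error returned by strconv.Atoi, so any text that is not a plain integer silently becomes 0; Atoi also rejects text that still contains a thousands separator. A minimal sketch of helpers that strip separators and report failures follows. The names parseCount and parseAdded are hypothetical, and the sample text "1,024 stars today" is an assumed format.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCount converts text such as " 12,345 " to the integer 12345.
func parseCount(s string) (int, error) {
	s = strings.ReplaceAll(strings.TrimSpace(s), ",", "")
	return strconv.Atoi(s)
}

// parseAdded extracts the leading number from text like "1,024 stars today".
func parseAdded(s string) (int, error) {
	fields := strings.Fields(s)
	if len(fields) == 0 {
		return 0, fmt.Errorf("empty text")
	}
	return parseCount(fields[0])
}

func main() {
	stars, err := parseCount(" 12,345 ")
	fmt.Println(stars, err) // prints: 12345 <nil>

	added, err := parseAdded("1,024 stars today")
	fmt.Println(added, err) // prints: 1024 <nil>

	// Malformed input is reported instead of being turned into 0.
	if _, err := parseCount("n/a"); err != nil {
		fmt.Println("parse failed:", err)
	}
}
```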
The developers page is handled the same way, so I won’t go into it here. One thing to be careful about with goquery: because the hierarchy of a web page is complex, the selectors should constrain elements and classes as much as possible to make sure we match exactly the structure we want. In addition, the text extracted from the page contains a lot of whitespace, which needs to be removed with strings.TrimSpace().
Interface design
With the basic work done, let’s look at how to design the interface. I want to provide a type and a method to create an object of that type, then call the FetchRepos() and FetchDevelopers() methods of the object to get the repository and developer list. But I don’t want the user to know the details of this type. So I define an interface:
type Fetcher interface {
	FetchRepos() ([]*Repository, error)
	FetchDevelopers() ([]*Developer, error)
}
We define a type to implement this interface:
type trending struct{}

func New() Fetcher {
	return &trending{}
}

func (t trending) FetchRepos() ([]*Repository, error) {
	// ... crawl logic for repositories
	return nil, nil
}

func (t trending) FetchDevelopers() ([]*Developer, error) {
	// ... crawl logic for developers
	return nil, nil
}
The crawl logic described above goes into the FetchRepos() and FetchDevelopers() methods.
Then, we can use it elsewhere:
import "github.com/darjun/ghtrending"
t := ghtrending.New()
repos, err := t.FetchRepos()
developers, err := t.FetchDevelopers()
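One practical benefit of hiding the type behind an interface is that callers can be exercised with a stand-in. The sketch below is illustrative only: the types are trimmed-down copies, and stubFetcher and countByLang are hypothetical names that do not exist in ghtrending.

```go
package main

import "fmt"

// Trimmed-down copies of the types from the article.
type Repository struct {
	Name string
	Lang string
}

type Developer struct{ Name string }

type Fetcher interface {
	FetchRepos() ([]*Repository, error)
	FetchDevelopers() ([]*Developer, error)
}

// stubFetcher is a hypothetical in-memory implementation for tests.
type stubFetcher struct{ repos []*Repository }

func (s stubFetcher) FetchRepos() ([]*Repository, error)     { return s.repos, nil }
func (s stubFetcher) FetchDevelopers() ([]*Developer, error) { return nil, nil }

// countByLang depends only on the interface, not on how data is fetched.
func countByLang(f Fetcher) (map[string]int, error) {
	repos, err := f.FetchRepos()
	if err != nil {
		return nil, err
	}
	counts := make(map[string]int)
	for _, r := range repos {
		counts[r.Lang]++
	}
	return counts, nil
}

func main() {
	f := stubFetcher{repos: []*Repository{
		{Name: "a", Lang: "Go"},
		{Name: "b", Lang: "Go"},
		{Name: "c", Lang: "Rust"},
	}}
	counts, err := countByLang(f)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(counts["Go"], counts["Rust"]) // prints: 2 1
}
```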
Options
As mentioned earlier, GitHub Trending supports selecting the spoken language, the programming language and the date range. We want to expose these settings as options, using the functional options pattern commonly used in Go. First define the options struct:
type options struct {
	GitHubURL  string
	SpokenLang string
	Language   string // programming language
	DateRange  string
}

type option func(*options)
Then define three DateRange options:
func WithDaily() option {
	return func(opt *options) {
		opt.DateRange = "daily"
	}
}

func WithWeekly() option {
	return func(opt *options) {
		opt.DateRange = "weekly"
	}
}

func WithMonthly() option {
	return func(opt *options) {
		opt.DateRange = "monthly"
	}
}
Other date ranges may appear in the future, so we also leave a more general option:
func WithDateRange(dr string) option {
	return func(opt *options) {
		opt.DateRange = dr
	}
}
Programming language options:
func WithLanguage(lang string) option {
	return func(opt *options) {
		opt.Language = lang
	}
}
Spoken-language options, with the language name and its code handled separately; for example, the code for Chinese is zh:
func WithSpokenLanguageCode(code string) option {
	return func(opt *options) {
		opt.SpokenLang = code
	}
}

func WithSpokenLanguageFull(lang string) option {
	return func(opt *options) {
		opt.SpokenLang = spokenLangCode[lang]
	}
}
spokenLangCode maps the spoken languages supported by GitHub to their codes. I crawled it from the GitHub Trending page. It looks roughly like this:
var (
	spokenLangCode map[string]string
)

func init() {
	spokenLangCode = map[string]string{
		"abkhazian": "ab",
		"afar":      "aa",
		"afrikaans": "af",
		"akan":      "ak",
		"albanian":  "sq",
		// ...
	}
}
Finally, I’d like the GitHub URL to be configurable as well:
func WithURL(url string) option {
	return func(opt *options) {
		opt.GitHubURL = url
	}
}
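WithSpokenLanguageFull above indexes spokenLangCode directly, so a name that is missing from the map, or spelled with different capitalization, yields an empty code. A minimal sketch of a more forgiving lookup follows; lookupSpokenCode is a hypothetical helper, and the map holds only the entries shown in the article.

```go
package main

import (
	"fmt"
	"strings"
)

// Only the entries quoted in the article; the real table is much longer.
var spokenLangCode = map[string]string{
	"abkhazian": "ab",
	"afar":      "aa",
	"afrikaans": "af",
	"akan":      "ak",
	"albanian":  "sq",
}

// lookupSpokenCode normalizes the name before the lookup and reports
// whether the language is known, instead of returning "" silently.
func lookupSpokenCode(name string) (string, bool) {
	code, ok := spokenLangCode[strings.ToLower(strings.TrimSpace(name))]
	return code, ok
}

func main() {
	fmt.Println(lookupSpokenCode(" Afrikaans ")) // prints: af true
	if _, ok := lookupSpokenCode("klingon"); !ok {
		fmt.Println("unknown spoken language")
	}
}
```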
We add an options field to the trending struct, and then modify New() to accept options as variadic parameters. This way we only need to set what we care about, and the other options keep their default values, such as GitHubURL:
type trending struct {
	opts options
}

func loadOptions(opts ...option) options {
	o := options{
		GitHubURL: "http://github.com",
	}
	for _, option := range opts {
		option(&o)
	}
	return o
}

func New(opts ...option) Fetcher {
	return &trending{
		opts: loadOptions(opts...),
	}
}
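The pattern can be tried in isolation. The following is a trimmed-down, runnable sketch of the same idea (only three of the fields, with the same default URL) showing that unset fields keep their defaults and that options are applied in order, so a later option overrides an earlier one.

```go
package main

import "fmt"

type options struct {
	GitHubURL string
	Language  string
	DateRange string
}

type option func(*options)

func WithLanguage(lang string) option {
	return func(o *options) { o.Language = lang }
}

func WithDateRange(dr string) option {
	return func(o *options) { o.DateRange = dr }
}

func WithURL(url string) option {
	return func(o *options) { o.GitHubURL = url }
}

// loadOptions starts from the defaults and applies each option in order.
func loadOptions(opts ...option) options {
	o := options{GitHubURL: "http://github.com"}
	for _, apply := range opts {
		apply(&o)
	}
	return o
}

func main() {
	// No options: only the default URL is set.
	fmt.Printf("%+v\n", loadOptions())

	// The second WithDateRange wins because it is applied last.
	o := loadOptions(WithLanguage("go"), WithDateRange("daily"), WithDateRange("weekly"))
	fmt.Printf("%+v\n", o)
}
```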
Finally, build the URL from the options in the FetchRepos() and FetchDevelopers() methods:
fmt.Sprintf("%s/trending/%s?spoken_language_code=%s&since=%s", t.opts.GitHubURL, t.opts.Language, t.opts.SpokenLang, t.opts.DateRange)
fmt.Sprintf("%s/trending/developers?language=%s&since=%s", t.opts.GitHubURL, t.opts.Language, t.opts.DateRange)
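The Sprintf version always emits every parameter, so with no options set the repository URL ends in /trending/?spoken_language_code=&since=. As a hedged alternative, the sketch below uses net/url to leave out empty values. buildReposURL is a hypothetical helper and not how ghtrending builds its URLs.

```go
package main

import (
	"fmt"
	"net/url"
)

// buildReposURL assembles the Trending repositories URL and skips any
// selection that is empty.
func buildReposURL(base, lang, spoken, since string) string {
	u := base + "/trending"
	if lang != "" {
		u += "/" + url.PathEscape(lang)
	}

	q := url.Values{}
	if spoken != "" {
		q.Set("spoken_language_code", spoken)
	}
	if since != "" {
		q.Set("since", since)
	}
	if len(q) > 0 {
		// Encode escapes the values and sorts the keys alphabetically.
		u += "?" + q.Encode()
	}
	return u
}

func main() {
	fmt.Println(buildReposURL("https://github.com", "go", "zh", "daily"))
	// prints: https://github.com/trending/go?since=daily&spoken_language_code=zh

	fmt.Println(buildReposURL("https://github.com", "", "", ""))
	// prints: https://github.com/trending
}
```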
With the options in place, if we want the weekly Trending list for Go, we can do this:
t := ghtrending.New(ghtrending.WithWeekly(), ghtrending.WithLanguage("Go"))
repos, _ := t.FetchRepos()
Convenience functions
In addition, we provide functions that fetch the repository and developer lists directly, without creating a trending object first (for the lazy):
func TrendingRepositories(opts ...option) ([]*Repository, error) {
	return New(opts...).FetchRepos()
}

func TrendingDevelopers(opts ...option) ([]*Developer, error) {
	return New(opts...).FetchDevelopers()
}
Usage
Create a new directory and initialize Go Modules:
$ mkdir -p demo/ghtrending && cd demo/ghtrending
$ go mod init github/darjun/demo/ghtrending
Download the package:
$ go get github.com/darjun/ghtrending
Write the code:
package main

import (
	"fmt"
	"log"

	"github.com/darjun/ghtrending"
)

func main() {
	t := ghtrending.New()

	repos, err := t.FetchRepos()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%d repos\n", len(repos))
	fmt.Printf("first repo:%#v\n", repos[0])

	developers, err := t.FetchDevelopers()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%d developers\n", len(developers))
	fmt.Printf("first developer:%#v\n", developers[0])
}
Result of running it:
Documentation
Finally, we add documentation:
With that, a small open-source library is done.
Conclusion
This article showed how to crawl a web page with goquery, with an emphasis on the interface design of ghtrending. When writing a library, provide an easy-to-use, minimal interface, so that users do not need to know the implementation details in order to use it. The functional options in ghtrending are an example: pass an option when you need it, and omit it when you don’t.
Obtaining the Trending list by crawling the web page is fragile. For example, when the structure of the GitHub page changes, the code has to be adapted. In the absence of an official API, that’s the only way to do it for now.
If you find a fun and useful Go library, please submit an issue on the Go daily library GitHub repository 😄
References
- ghtrending on GitHub: github.com/darjun/ghtr…
- Go daily library, goquery: darjun.github.io/2020/10/11/…
- Go daily library on GitHub: github.com/darjun/go-d…
About me
My blog: darjun.github.io
You are welcome to follow my WeChat official account [GoUpUp], so we can learn and improve together ~