Background

In the previous article we introduced BubbleTea, a cool TUI application framework, and ended by building a program that pulls the GitHub Trending repositories and displays them in the console. Since GitHub doesn’t provide an official Trending API, we implemented one ourselves with goquery. Due to space constraints, the previous article did not describe how that works. For this article I’ve cleaned up the code and published it as a separate repository.

Observation

First, let’s observe the structure of GitHub Trending:

In the upper left corner you can switch between Repositories and Developers. On the right you can choose the Spoken Language (the natural language, such as Chinese or English), the Language (the programming language, such as Go or C++) and the Date Range, which supports three values: Today, This week and This month.

Then here is the information shown for each repository:

① Repository author and name

② Repository description

③ The main programming language used (set when the repository is created); it may be absent

④ Number of stars

⑤ Number of forks

⑥ List of contributors

⑦ Number of new stars in the selected time range (Today, This week, This month)

The developer page is similar, but with much less information:

① Author information

② Information about the developer’s most popular repository

Notice that after switching to the developers page, the URL becomes github.com/trending/developers. When we choose Chinese as the spoken language, Go as the programming language and Today as the time range, the URL becomes https://github.com/trending/go?since=daily&spoken_language_code=zh. The programming language appears as a path segment, and the other selections are expressed by adding the corresponding key-value pairs to the query string.
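The URL scheme just described can be sketched with the standard library’s net/url package. This is only an illustration under stated assumptions: trendingURL and its parameters are hypothetical names, not part of ghtrending, and empty selections are simply left out of the URL.

```go
package main

import (
	"fmt"
	"net/url"
)

// trendingURL builds a GitHub Trending URL from the three selections.
// The function name and parameters are illustrative only.
func trendingURL(base, lang, spoken, since string) string {
	u := base + "/trending"
	if lang != "" {
		// The programming language is a path segment, not a query parameter.
		u += "/" + url.PathEscape(lang)
	}

	q := url.Values{}
	if since != "" {
		q.Set("since", since)
	}
	if spoken != "" {
		q.Set("spoken_language_code", spoken)
	}
	// Encode sorts the keys, so the output order is deterministic.
	if enc := q.Encode(); enc != "" {
		u += "?" + enc
	}
	return u
}

func main() {
	fmt.Println(trendingURL("https://github.com", "go", "zh", "daily"))
	fmt.Println(trendingURL("https://github.com", "", "", ""))
}
```

The first call reproduces the URL shown above; the second yields the plain Trending page because every selection is empty.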

Preparation

Create a repository named ghtrending on GitHub, clone it locally, and run go mod init to initialize the module:

$ go mod init github.com/darjun/ghtrending

Then run go get to download the goquery library:

$ go get github.com/PuerkitoBio/goquery

Define two structs based on the repository and developer information:

type Repository struct {
  Author  string
  Name    string
  Link    string
  Desc    string
  Lang    string
  Stars   int
  Forks   int
  Add     int
  BuiltBy []string
}

type Developer struct {
  Name        string
  Username    string
  PopularRepo string
  Desc        string
}

Crawling

To extract the information with goquery, we first need to know the structure of the page. Press F12 to open Chrome Developer Tools and select the Elements tab to see the structure of the page:

To quickly locate the markup for any content on the page, use the button in the upper left corner. We click on a single repository entry:

The Elements window on the right shows that each repository entry corresponds to an article element:

You can use the standard library net/http to fetch the content of the entire page:

resp, err := http.Get("https://github.com/trending")

Then create the goquery document from the resp object:

doc, err := goquery.NewDocumentFromReader(resp.Body)

With the document object, we can call its Find() method and pass in a selector; here I use .Box .Box-row. .Box is the class of the div that wraps the entire list, and .Box-row is the class of each repository entry, so this selector is more precise. Find() returns a *goquery.Selection object, whose Each() method we can call to parse each entry. Each() takes a function of type func(int, *goquery.Selection), and the second argument is the goquery representation of one repository entry:

doc.Find(".Box .Box-row").Each(func(i int, s *goquery.Selection) {
  // parse one repository entry here
})

Now let’s look at how to extract the parts. Moving around in the Elements window gives you an intuitive view of what part of the page each element corresponds to:

We find the structure that corresponds to the repository name and the author:

It is wrapped in the a element under the article’s h1 element. The author name is in the span element inside it, the repository name is the text directly under the a element, and the URL of the repository is the href attribute of the a element. Let’s get them:

titleSel := s.Find("h1 a")
repo.Author = strings.Trim(titleSel.Find("span").Text(), "/\n ")
repo.Name = strings.TrimSpace(titleSel.Contents().Last().Text())
relativeLink, _ := titleSel.Attr("href")
if len(relativeLink) > 0 {
  repo.Link = "https://github.com" + relativeLink
}

The repository description is in the p element within the article element:

repo.Desc = strings.TrimSpace(s.Find("p").Text())

The programming language, number of stars, number of forks, contributors (BuiltBy) and number of new stars are all in the last div of the article element. The programming language, BuiltBy and the number of new stars are in span elements, while the number of stars and the number of forks are in a elements. If the programming language is not set, there is one fewer span element:

var langIdx, addIdx, builtByIdx int
spanSel := s.Find("div>span")
if spanSel.Size() == 2 {
  // language not exist
  langIdx = -1
  addIdx = 1
} else {
  builtByIdx = 1
  addIdx = 2
}

// language
if langIdx >= 0 {
  repo.Lang = strings.TrimSpace(spanSel.Eq(langIdx).Text())
} else {
  repo.Lang = "unknown"
}

// add
addParts := strings.SplitN(strings.TrimSpace(spanSel.Eq(addIdx).Text()), " ", 2)
repo.Add, _ = strconv.Atoi(addParts[0])

// builtby
spanSel.Eq(builtByIdx).Find("a>img").Each(func(i int, img *goquery.Selection) {
  src, _ := img.Attr("src")
  repo.BuiltBy = append(repo.BuiltBy, src)
})

Then there are stars and forks:

aSel := s.Find("div>a")
starStr := strings.TrimSpace(aSel.Eq(-2).Text())
star, _ := strconv.Atoi(strings.Replace(starStr, ",", "", -1))
repo.Stars = star
forkStr := strings.TrimSpace(aSel.Eq(-1).Text())
fork, _ := strconv.Atoi(strings.Replace(forkStr, ",", "", -1))
repo.Forks = fork
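The snippets above discard the error from strconv.Atoi, so a count that fails to parse silently becomes 0; Atoi also rejects text that still contains a thousands separator, such as 1,024. One possible way to centralize the conversion is sketched below. parseCount is a hypothetical helper, and the sample strings are made up for illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCount converts text such as " 12,345\n" or "1,024 stars today"
// into an int. It reads the first whitespace-separated field and strips
// thousands separators before parsing. Hypothetical helper.
func parseCount(text string) (int, error) {
	fields := strings.Fields(text)
	if len(fields) == 0 {
		return 0, fmt.Errorf("no number in %q", text)
	}
	return strconv.Atoi(strings.ReplaceAll(fields[0], ",", ""))
}

func main() {
	for _, s := range []string{" 12,345\n", "1,024 stars today", "n/a"} {
		n, err := parseCount(s)
		fmt.Printf("%q -> %d (error: %v)\n", s, n, err != nil)
	}
}
```

Returning the error lets the caller decide whether a missing count should abort the crawl or just be logged.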

The developers page is handled in the same way, so I won’t go into it here. One thing to be careful about with goquery: because the hierarchy of a web page is complex, the selector should constrain as many elements and classes as possible to make sure we find exactly the structure we want. In addition, the text extracted from the page contains a lot of whitespace, which needs to be removed with strings.TrimSpace().

Interface design

With the basic work done, let’s look at how to design the interface. I want to provide a type and a method to create an object of that type, then call the FetchRepos() and FetchDevelopers() methods of the object to get the repository and developer list. But I don’t want the user to know the details of this type. So I define an interface:

type Fetcher interface {
  FetchRepos() ([]*Repository, error)
  FetchDevelopers() ([]*Developer, error)
}

We define a type to implement this interface:

type trending struct{}

func New() Fetcher {
  return &trending{}
}

// The method bodies are omitted here.
func (t trending) FetchRepos() ([]*Repository, error) { /* ... */ }

func (t trending) FetchDevelopers() ([]*Developer, error) { /* ... */ }

The crawl logic we introduced above is in the FetchRepos() and FetchDevelopers() methods.

Then, we can use it elsewhere:

import "github.com/darjun/ghtrending"

t := ghtrending.New()
repos, err := t.FetchRepos()

developers, err := t.FetchDevelopers()

Options

As mentioned earlier, GitHub Trending supports selecting the spoken language, the programming language and the time range. We want to expose these settings as options, using the functional options pattern that is common in Go. First define the options struct:

type options struct {
  GitHubURL  string
  SpokenLang string
  Language   string // programming language
  DateRange  string
}

type option func(*options)

Then define three DateRange options:

func WithDaily() option {
  return func(opt *options) {
    opt.DateRange = "daily"
  }
}

func WithWeekly() option {
  return func(opt *options) {
    opt.DateRange = "weekly"
  }
}

func WithMonthly() option {
  return func(opt *options) {
    opt.DateRange = "monthly"
  }
}

There may be other time ranges in the future, so we also provide a more general option:

func WithDateRange(dr string) option {
  return func(opt *options) {
    opt.DateRange = dr
  }
}

Programming language options:

func WithLanguage(lang string) option {
  return func(opt *options) {
    opt.Language = lang
  }
}

Spoken language options, one taking the code and one taking the full language name; for example, the code for Chinese is zh:

func WithSpokenLanguageCode(code string) option {
  return func(opt *options) {
    opt.SpokenLang = code
  }
}

func WithSpokenLanguageFull(lang string) option {
  return func(opt *options) {
    opt.SpokenLang = spokenLangCode[lang]
  }
}
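Because WithSpokenLanguageFull indexes spokenLangCode directly, a name that is missing from the map, or spelled with different capitalization, yields an empty code without any warning. A stricter lookup could be sketched as follows. lookupSpokenCode is a hypothetical helper, and the map holds only the five entries quoted in this article, assuming the keys are lowercase names.

```go
package main

import (
	"fmt"
	"strings"
)

// A five-entry excerpt of the name-to-code table, for illustration only.
var spokenLangCode = map[string]string{
	"abkhazian": "ab",
	"afar":      "aa",
	"afrikaans": "af",
	"akan":      "ak",
	"albanian":  "sq",
}

// lookupSpokenCode normalizes the name before the lookup and reports
// whether the language is known. Hypothetical helper.
func lookupSpokenCode(name string) (string, bool) {
	code, ok := spokenLangCode[strings.ToLower(strings.TrimSpace(name))]
	return code, ok
}

func main() {
	for _, name := range []string{"Albanian", " afar ", "klingon"} {
		code, ok := lookupSpokenCode(name)
		fmt.Printf("%q -> %q (known: %v)\n", name, code, ok)
	}
}
```

With the boolean result, an option constructor could leave SpokenLang untouched, or report a problem, when the name is unknown.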

spokenLangCode maps the names of the spoken languages GitHub supports to their codes; I crawled it from the GitHub Trending page. It looks something like this:

var (
  spokenLangCode map[string]string
)

func init() {
  spokenLangCode = map[string]string{
    "abkhazian": "ab",
    "afar":      "aa",
    "afrikaans": "af",
    "akan":      "ak",
    "albanian":  "sq",
    // ...
  }
}

Finally, I’d like the GitHub URL to be set as well:

func WithURL(url string) option {
  return func(opt *options) {
    opt.GitHubURL = url
  }
}

We add an options field to the trending struct and modify New() to accept options as variadic parameters. This way we only need to set what we care about, and the other options keep their default values, such as GitHubURL:

type trending struct {
  opts options
}

func loadOptions(opts ...option) options {
  o := options{
    GitHubURL: "http://github.com",
  }
  for _, option := range opts {
    option(&o)
  }

  return o
}

func New(opts ...option) Fetcher {
  return &trending{
    opts: loadOptions(opts...),
  }
}
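To see how the defaults and the supplied options interact, here is a self-contained sketch built from trimmed copies of the article’s types (only three fields are kept). It is meant as an illustration of the pattern rather than the library’s exact code.

```go
package main

import "fmt"

// Trimmed copy of the article's options struct.
type options struct {
	GitHubURL string
	Language  string
	DateRange string
}

type option func(*options)

func WithLanguage(lang string) option {
	return func(opt *options) { opt.Language = lang }
}

func WithDateRange(dr string) option {
	return func(opt *options) { opt.DateRange = dr }
}

func WithURL(url string) option {
	return func(opt *options) { opt.GitHubURL = url }
}

// loadOptions starts from the defaults and applies each option in order,
// so a later option overrides an earlier one that sets the same field.
func loadOptions(opts ...option) options {
	o := options{GitHubURL: "http://github.com"}
	for _, apply := range opts {
		apply(&o)
	}
	return o
}

func main() {
	fmt.Printf("%+v\n", loadOptions())
	fmt.Printf("%+v\n", loadOptions(WithLanguage("go"), WithDateRange("weekly")))
	fmt.Printf("%+v\n", loadOptions(WithDateRange("daily"), WithDateRange("monthly")))
}
```

Fields without a default, such as DateRange here, stay empty unless an option sets them, so the code that builds the URL has to cope with empty values.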

Finally, concatenate the URL according to the options in the FetchRepos() and FetchDevelopers() methods:

fmt.Sprintf("%s/trending/%s?spoken_language_code=%s&since=%s", t.opts.GitHubURL, t.opts.Language, t.opts.SpokenLang, t.opts.DateRange)

fmt.Sprintf("%s/trending/developers?language=%s&since=%s", t.opts.GitHubURL, t.opts.Language, t.opts.DateRange)

After adding the option, if we want to get the Go Trending list for a week, we can do this:

t := ghtrending.New(ghtrending.WithWeekly(), ghtrending.WithLanguage("Go"))
repos, _ := t.FetchRepos()

A simple method

In addition, we provide two functions that fetch the repository and developer lists directly, without creating a trending object first (for lazy people only):

func TrendingRepositories(opts ...option) ([]*Repository, error) {
  return New(opts...).FetchRepos()
}

func TrendingDevelopers(opts ...option) ([]*Developer, error) {
  return New(opts...).FetchDevelopers()
}

Usage

Create a new directory and initialize Go Modules:

$ mkdir -p demo/ghtrending && cd demo/ghtrending
$ go mod init github/darjun/demo/ghtrending

Download package:

Write code:

package main

import (
  "fmt"
  "log"

  "github.com/darjun/ghtrending"
)

func main() {
  t := ghtrending.New()

  repos, err := t.FetchRepos()
  if err != nil {
    log.Fatal(err)
  }

  fmt.Printf("%d repos\n", len(repos))
  fmt.Printf("first repo:%#v\n", repos[0])

  developers, err := t.FetchDevelopers()
  if err != nil {
    log.Fatal(err)
  }

  fmt.Printf("%d developers\n", len(developers))
  fmt.Printf("first developer:%#v\n", developers[0])
}

Output:

Documentation

Finally, we add documentation:

A small open source library is done.

Conclusion

This article described how to crawl a web page with goquery, with an emphasis on the interface design of ghtrending. When writing a library, provide an easy-to-use, minimal interface, so that users do not need to know the implementation details to use it. ghtrending’s functional options are an example of this: pass an option only when you need it, and omit it otherwise.

Obtaining the Trending list by crawling web pages is fragile. For example, if the structure of the GitHub page changes at some point, the code has to be adapted. In the absence of an official API, though, this is the only way for now.

If you find a fun and useful Go library, please go to GitHub and submit an issue 😄

References

  1. ghtrending GitHub: github.com/darjun/ghtr…
  2. Go Daily Library, goquery: darjun.github.io/2020/10/11/…
  3. GitHub: github.com/darjun/go-d…

About me

My blog: darjun.github.io

You are welcome to follow my WeChat official account [GoUpUp]. Let’s learn and improve together ~