Background
In the previous article, "Go Daily Library: bubbletea", we introduced bubbletea, a cool framework for TUI programs, and finished by implementing a program that pulls the GitHub Trending repositories and displays them in the console. Since GitHub does not provide an official Trending API, we implemented one ourselves with goquery.
For reasons of length, the previous article did not explain how it was implemented. In this article, I have tidied up the code and released it as a separate library.
Observe first
First, let's observe the structure of GitHub Trending:
In the upper left corner, you can switch between Repositories and Developers. On the right, you can select the spoken language (Spoken Language: Chinese, English, and so on), the programming language (Language: Go, C++, and so on), and the time range (Date Range, with three choices: Today, This week, This month).
Below that is the information for each repository:
① Repository author and name
② Repository description
③ The main programming language used, which may be absent
④ Number of stars
⑤ Number of forks
⑥ List of contributors
⑦ Number of stars added in the selected time range (Today, This week, This month)
The developer page is similar, but carries much less information:
① Author information
② The author's most popular repository
Notice that after switching to the developer page, the URL becomes github.com/trending/developers. In addition, when we select Chinese as the spoken language, Go as the programming language, and Today as the time range, the URL becomes https://github.com/trending/go?since=daily&spoken_language_code=zh. The spoken language and time range are expressed as key-value pairs in the query string, while the programming language appears as a path segment.
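To make the URL structure concrete, here is a small sketch that uses the standard library's net/url package to pull the selections back out of a Trending URL. The function name parseTrendingURL is made up for this illustration and is not part of ghtrending.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// parseTrendingURL is a hypothetical helper (not part of ghtrending). It
// splits a Trending URL into the programming language taken from the path
// and the key-value pairs taken from the query string.
func parseTrendingURL(raw string) (language string, query url.Values, err error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", nil, err
	}
	// The path looks like /trending or /trending/<language>.
	language = strings.Trim(strings.TrimPrefix(u.Path, "/trending"), "/")
	return language, u.Query(), nil
}

func main() {
	lang, q, err := parseTrendingURL("https://github.com/trending/go?since=daily&spoken_language_code=zh")
	if err != nil {
		panic(err)
	}
	fmt.Println(lang)                          // go
	fmt.Println(q.Get("since"))                // daily
	fmt.Println(q.Get("spoken_language_code")) // zh
}
```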
Preparation
Create a repository named ghtrending on GitHub, clone it locally, and initialize it with go mod init:
$ go mod init github.com/darjun/ghtrending
Then run go get to download the goquery library:
$ go get github.com/PuerkitoBio/goquery
Define two structs based on the repository and developer information:
type Repository struct {
	Author  string
	Name    string
	Link    string
	Desc    string
	Lang    string
	Stars   int
	Forks   int
	Add     int
	BuiltBy []string
}

type Developer struct {
	Name        string
	Username    string
	PopularRepo string
	Desc        string
}
Start crawling
To extract the information with goquery, we first need to know the structure of the web page. Press F12 to open the Chrome developer tools and select the Elements tab to see it:
Use the button in the upper left corner of the developer tools to quickly inspect the structure of any content on the page. Click on a single repository entry:
The Elements window on the right shows that each repository entry corresponds to an article element:
You can use the standard library net/http to fetch the content of the whole page:
resp, err := http.Get("https://github.com/trending")
Then create a goquery document from resp.Body:
doc, err := goquery.NewDocumentFromReader(resp.Body)
With the document object, we can call its Find() method and pass in a selector. Here I chose .Box .Box-row: .Box is the class of the div that wraps the whole list, and .Box-row is the class of each repository entry. Combining the two makes the selection more precise. Find() returns a *goquery.Selection, and we can call its Each() method to process every entry. Each() takes a function of type func(int, *goquery.Selection); the second parameter is the goquery selection for a single repository entry:
doc.Find(".Box .Box-row").Each(func(i int, s *goquery.Selection) {
})
Next, let's look at how to extract each part. Hovering over elements in the Elements window highlights the part of the page each one corresponds to:
We find the structure that holds the repository name and author:
It is wrapped in an a element under the h1 element of the article. The author name is in a span element, the repository name is a text node directly under a, and the repository link is the href attribute of a. Let's extract them:
titleSel := s.Find("h1 a")
repo.Author = strings.Trim(titleSel.Find("span").Text(), "/\n ")
repo.Name = strings.TrimSpace(titleSel.Contents().Last().Text())
relativeLink, _ := titleSel.Attr("href")
if len(relativeLink) > 0 {
	repo.Link = "https://github.com" + relativeLink
}
The repository description is in the p element inside the article element:
repo.Desc = strings.TrimSpace(s.Find("p").Text())
The programming language, star count, fork count, contributors (BuiltBy), and newly added stars are all in the last div of the article element. The programming language, BuiltBy, and newly added stars are in span elements, while the star and fork counts are in a elements. If the programming language is not set, there is one fewer span element:
var langIdx, addIdx, builtByIdx int
spanSel := s.Find("div>span")
if spanSel.Size() == 2 {
	// no language span: only BuiltBy and added stars are present
	langIdx = -1
	addIdx = 1
} else {
	builtByIdx = 1
	addIdx = 2
}

// language
if langIdx >= 0 {
	repo.Lang = strings.TrimSpace(spanSel.Eq(langIdx).Text())
} else {
	repo.Lang = "unknown"
}

// added stars: take the leading number and strip thousands separators
addParts := strings.SplitN(strings.TrimSpace(spanSel.Eq(addIdx).Text()), " ", 2)
repo.Add, _ = strconv.Atoi(strings.Replace(addParts[0], ",", "", -1))

// builtby: collect the avatar URLs of the contributors
spanSel.Eq(builtByIdx).Find("a>img").Each(func(i int, img *goquery.Selection) {
	src, _ := img.Attr("src")
	repo.BuiltBy = append(repo.BuiltBy, src)
})
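The index bookkeeping is easy to get wrong, so it can help to isolate it. The following sketch restates the branch above as a small pure function. The name spanIndexes is hypothetical, and the logic assumes the page layout described in this article, where an entry has either two or three matching span elements.

```go
package main

import "fmt"

// spanIndexes is a hypothetical helper that mirrors the branch above. Given
// the number of matched span elements, it returns the positions of the
// language, BuiltBy, and added-stars spans. lang is -1 when the entry has
// no language span.
func spanIndexes(n int) (lang, builtBy, add int) {
	if n == 2 {
		// Only BuiltBy and added stars are present.
		return -1, 0, 1
	}
	// Language, BuiltBy, added stars.
	return 0, 1, 2
}

func main() {
	fmt.Println(spanIndexes(3)) // 0 1 2
	fmt.Println(spanIndexes(2)) // -1 0 1
}
```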
Then the number of stars and the number of forks:
aSel := s.Find("div>a")
starStr := strings.TrimSpace(aSel.Eq(-2).Text())
star, _ := strconv.Atoi(strings.Replace(starStr, ",", "", -1))
repo.Stars = star
forkStr := strings.TrimSpace(aSel.Eq(-1).Text())
fork, _ := strconv.Atoi(strings.Replace(forkStr, ",", "", -1))
repo.Forks = fork
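The star, fork, and added-star counts all need the same cleanup: trim whitespace, drop thousands separators, convert to an integer. A minimal sketch of a shared helper is shown below. The name parseCount and the sample strings are assumptions for illustration, not text copied from the page, and unlike the snippets above it returns the conversion error instead of discarding it.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCount is a hypothetical helper. It takes text such as " 12,345\n" or
// "1,024 stars today", keeps the first whitespace-separated field, removes
// thousands separators, and converts the rest to an int.
func parseCount(text string) (int, error) {
	fields := strings.Fields(text)
	if len(fields) == 0 {
		return 0, fmt.Errorf("no number in %q", text)
	}
	return strconv.Atoi(strings.ReplaceAll(fields[0], ",", ""))
}

func main() {
	for _, s := range []string{" 12,345\n", "1,024 stars today", ""} {
		n, err := parseCount(s)
		fmt.Println(n, err)
	}
}
```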
The developer page is handled in a similar way, so I won't go into detail here. One thing to watch when using goquery: because the page hierarchy is fairly complex, selectors should be constrained with element names and classes as much as possible to make sure we match exactly the structure we want. In addition, the text extracted from the page contains a lot of whitespace, which needs to be removed with strings.TrimSpace().
Interface design
After the basic work is done, let's look at how to design the interface. I want to provide a type and a function that creates a value of that type; callers then invoke its FetchRepos() and FetchDevelopers() methods to get the repository and developer lists. But I don't want users to know the details of this type, so I defined an interface:
type Fetcher interface {
	FetchRepos() ([]*Repository, error)
	FetchDevelopers() ([]*Developer, error)
}
We define a type to implement this interface:
type trending struct{}

func New() Fetcher {
	return &trending{}
}

func (t trending) FetchRepos() ([]*Repository, error) {
	// crawling logic omitted
}

func (t trending) FetchDevelopers() ([]*Developer, error) {
	// crawling logic omitted
}
The crawling logic introduced above goes into the FetchRepos() and FetchDevelopers() methods.
Then we can use it elsewhere:
import "github.com/darjun/ghtrending"
t := ghtrending.New()
repos, err := t.FetchRepos()
developers, err := t.FetchDevelopers()
Options
As mentioned earlier, GitHub Trending supports selecting the spoken language, programming language, and time range. We expose these settings through the functional options pattern that is common in Go. First define the options struct:
type options struct {
	GitHubURL  string
	SpokenLang string
	Language   string // programming language
	DateRange  string
}

type option func(*options)
Then define three DateRange options:
func WithDaily() option {
	return func(opt *options) {
		opt.DateRange = "daily"
	}
}

func WithWeekly() option {
	return func(opt *options) {
		opt.DateRange = "weekly"
	}
}

func WithMonthly() option {
	return func(opt *options) {
		opt.DateRange = "monthly"
	}
}
Other time ranges may be added in the future, so we also leave a more general option:
func WithDateRange(dr string) option {
	return func(opt *options) {
		opt.DateRange = dr
	}
}
Programming language options:
func WithLanguage(lang string) option {
	return func(opt *options) {
		opt.Language = lang
	}
}
For the spoken language, the language name and its code are handled by separate options. For example, the code for Chinese is zh:
func WithSpokenLanguageCode(code string) option {
	return func(opt *options) {
		opt.SpokenLang = code
	}
}

func WithSpokenLanguageFull(lang string) option {
	return func(opt *options) {
		opt.SpokenLang = spokenLangCode[lang]
	}
}
spokenLangCode maps the language names supported by GitHub to their codes. I crawled it from the GitHub Trending page. It looks like this:
var (
	spokenLangCode map[string]string
)

func init() {
	spokenLangCode = map[string]string{
		"abkhazian": "ab",
		"afar":      "aa",
		"afrikaans": "af",
		"akan":      "ak",
		"albanian":  "sq",
		// ...
	}
}
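Note that WithSpokenLanguageFull indexes the map directly, so an unknown or differently capitalized name silently produces an empty code. One possible way to make that visible is sketched below. The helper lookupSpokenLang is hypothetical, and the map here is a shortened copy used only for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// A shortened copy of the table, for illustration only.
var spokenLangCode = map[string]string{
	"afrikaans": "af",
	"albanian":  "sq",
}

// lookupSpokenLang is a hypothetical helper. It normalizes the name before
// the lookup and reports whether the language is known, so callers can tell
// a missing entry apart from a valid code.
func lookupSpokenLang(name string) (code string, ok bool) {
	code, ok = spokenLangCode[strings.ToLower(strings.TrimSpace(name))]
	return code, ok
}

func main() {
	code, ok := lookupSpokenLang(" Albanian ")
	fmt.Printf("%q %v\n", code, ok) // "sq" true

	code, ok = lookupSpokenLang("klingon")
	fmt.Printf("%q %v\n", code, ok) // "" false
}
```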
Finally, I want the GitHub URL to be configurable as well:
func WithURL(url string) option {
	return func(opt *options) {
		opt.GitHubURL = url
	}
}
We add an options field to the trending struct and modify New() to accept variadic options. This way, callers set only what they need, and the remaining options fall back to default values, such as GitHubURL:
type trending struct {
	opts options
}

func loadOptions(opts ...option) options {
	o := options{
		GitHubURL: "https://github.com",
	}
	for _, option := range opts {
		option(&o)
	}
	return o
}

func New(opts ...option) Fetcher {
	return &trending{
		opts: loadOptions(opts...),
	}
}
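To see the pattern in action without any crawling code, here is a self-contained sketch using a trimmed-down copy of the options type. It shows the default value, a couple of overrides, and the fact that when the same option is passed twice the later one wins. The localhost URLs are placeholders chosen for the example.

```go
package main

import "fmt"

// A trimmed-down copy of the options type, for illustration only.
type options struct {
	GitHubURL string
	Language  string
	DateRange string
}

type option func(*options)

func WithWeekly() option {
	return func(o *options) { o.DateRange = "weekly" }
}

func WithLanguage(lang string) option {
	return func(o *options) { o.Language = lang }
}

func WithURL(url string) option {
	return func(o *options) { o.GitHubURL = url }
}

// loadOptions starts from the defaults and applies each option in order.
func loadOptions(opts ...option) options {
	o := options{GitHubURL: "https://github.com"}
	for _, apply := range opts {
		apply(&o)
	}
	return o
}

func main() {
	// No options: only the default URL is set.
	fmt.Printf("%+v\n", loadOptions())

	// Set only what is needed; the URL keeps its default.
	fmt.Printf("%+v\n", loadOptions(WithWeekly(), WithLanguage("go")))

	// Options are applied in order, so the last WithURL wins.
	fmt.Printf("%+v\n", loadOptions(WithURL("http://localhost:8080"), WithURL("http://localhost:9090")))
}
```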
Finally, in the FetchRepos() and FetchDevelopers() methods, the URL is assembled from the options:
fmt.Sprintf("%s/trending/%s?spoken_language_code=%s&since=%s", t.opts.GitHubURL, t.opts.Language, t.opts.SpokenLang, t.opts.DateRange)
fmt.Sprintf("%s/trending/developers?language=%s&since=%s", t.opts.GitHubURL, t.opts.Language, t.opts.DateRange)
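With plain format strings, options that were never set still end up in the URL as empty parameters, for example spoken_language_code= with no value. One possible alternative, sketched here with net/url, adds a path segment or parameter only when the corresponding option is set. The function name buildReposURL is hypothetical; this is not how ghtrending builds its URLs.

```go
package main

import (
	"fmt"
	"net/url"
)

// buildReposURL is a hypothetical helper. It assembles the repositories
// Trending URL and skips every option that is empty.
func buildReposURL(base, language, spoken, since string) string {
	u := base + "/trending"
	if language != "" {
		u += "/" + url.PathEscape(language)
	}

	q := url.Values{}
	if spoken != "" {
		q.Set("spoken_language_code", spoken)
	}
	if since != "" {
		q.Set("since", since)
	}
	if len(q) > 0 {
		// Encode escapes the values and sorts the parameters by key.
		u += "?" + q.Encode()
	}
	return u
}

func main() {
	fmt.Println(buildReposURL("https://github.com", "", "", ""))
	// https://github.com/trending
	fmt.Println(buildReposURL("https://github.com", "go", "zh", "daily"))
	// https://github.com/trending/go?since=daily&spoken_language_code=zh
}
```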
After adding the options, if we want the weekly Trending list for Go, we can do this:
t := ghtrending.New(ghtrending.WithWeekly(), ghtrending.WithLanguage("Go"))
repos, _ := t.FetchRepos()
Simple way
In addition, for lazy people, we also provide functions that return the repository and developer lists directly, without creating a trending object:
func TrendingRepositories(opts ...option) ([]*Repository, error) {
	return New(opts...).FetchRepos()
}

func TrendingDevelopers(opts ...option) ([]*Developer, error) {
	return New(opts...).FetchDevelopers()
}
Effect
Create a new directory and initialize Go Modules:
$ mkdir -p demo/ghtrending && cd demo/ghtrending
$ go mod init github/darjun/demo/ghtrending
Download the package:
$ go get github.com/darjun/ghtrending
Write code:
package main

import (
	"fmt"
	"log"

	"github.com/darjun/ghtrending"
)

func main() {
	t := ghtrending.New()

	repos, err := t.FetchRepos()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%d repos\n", len(repos))
	fmt.Printf("first repo:%#v\n", repos[0])

	developers, err := t.FetchDevelopers()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%d developers\n", len(developers))
	fmt.Printf("first developer:%#v\n", developers[0])
}
Running result:
Documentation
Finally, we add some documentation:
A small open source library is complete.
Summary
This article described how to crawl web pages with goquery, with a focus on the interface design of ghtrending. A library should expose an easy-to-use, minimal interface, so that users can use it without understanding its implementation details. ghtrending also serves as an example of functional options: an option is passed only when it is needed and omitted otherwise.
Getting the Trending list by crawling the web page is fragile. For example, when the structure of the GitHub page changes, the code has to be adapted. As long as no official API is provided, this is the only way for now.
If you find a fun and useful Go library, you are welcome to open an issue on the Go Daily Library GitHub repository 😄
References
- ghtrending on GitHub: https://github.com/darjun/ghtrending
- Go Daily Library: goquery: https://darjun.github.io/2020/10/11/godailylib/goquery
- Go Daily Library on GitHub: https://github.com/darjun/go-daily-lib
About me
My blog: https://darjun.github.io
You are welcome to follow my WeChat official account [GoUpUp]. Let's learn and make progress together~