
Introduction

colly is a powerful crawler framework written in Go. It provides a concise API, performs well, handles cookies and sessions automatically, and offers a flexible extension mechanism.

This article first introduces the basic concepts of colly, and then demonstrates its usage and features with a few examples: pulling GitHub Trending, pulling the Baidu fiction hot list, and downloading pictures from the Unsplash website.

Quick start

The code in this article uses Go Modules.

Create a directory and initialize:

$ mkdir colly && cd colly
$ go mod init github.com/darjun/go-daily-lib/colly

Install the colly library:

$ go get -u github.com/gocolly/colly/v2

Usage:

package main

import (
  "fmt"

  "github.com/gocolly/colly/v2"
)

func main() {
  c := colly.NewCollector(
    colly.AllowedDomains("www.baidu.com"),
  )

  c.OnHTML("a[href]", func(e *colly.HTMLElement) {
    link := e.Attr("href")
    fmt.Printf("Link found: %q -> %s\n", e.Text, link)
    c.Visit(e.Request.AbsoluteURL(link))
  })

  c.OnRequest(func(r *colly.Request) {
    fmt.Println("Visiting", r.URL.String())
  })

  c.OnResponse(func(r *colly.Response) {
    fmt.Printf("Response %s: %d bytes\n", r.Request.URL, len(r.Body))
  })

  c.OnError(func(r *colly.Response, err error) {
    fmt.Printf("Error %s: %v\n", r.Request.URL, err)
  })

  c.Visit("http://www.baidu.com/")
}

Using colly is relatively simple:

First, call colly.NewCollector() to create a crawler object of type *colly.Collector. Since every web page contains many links to other pages, a crawl with no restrictions might never stop. So above we pass the option colly.AllowedDomains("www.baidu.com") to restrict the crawl to pages under the domain www.baidu.com.
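Besides restricting domains, the crawl depth can also be capped. The colly.MaxDepth option (it appears again in the GitHub Trending example below) limits how deep the collector recurses; note that depth is only tracked when follow-up pages are visited through e.Request.Visit(), since c.Visit() always starts a new request at depth 1. A minimal sketch under these assumptions:

// A hedged sketch: MaxDepth(2) allows the start page (depth 1) and the pages
// it links to (depth 2), but links found on those pages are not followed.
c := colly.NewCollector(
  colly.AllowedDomains("www.baidu.com"),
  colly.MaxDepth(2),
)

c.OnHTML("a[href]", func(e *colly.HTMLElement) {
  // Request.Visit carries the current depth forward and resolves relative URLs.
  e.Request.Visit(e.Attr("href"))
})

c.Visit("http://www.baidu.com/")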

Then we call the c.OnHTML() method to register an HTML callback, which is executed for every a element that has an href attribute. Here we continue to visit the URL pointed to by href; that is, after parsing the crawled page we go on to visit the links it contains to other pages.

Call the c.OnRequest() method to register a request callback, which is executed every time a request is sent. Here it simply prints the request URL.

Call the c.OnResponse() method to register a response callback, which is executed every time a response is received. Here it simply prints the URL and the response size.

Call the c.OnError() method to register an error callback, which is executed when an error occurs while executing a request. Here it simply prints the URL and the error.

Finally, we call c.Visit() to start visiting the first page.

Run:

$ go run main.go
Visiting http://www.baidu.com/
Response http://www.baidu.com/: 303317 bytes
Link found: "百度首页" -> /
Link found: "设置" -> javascript:;
Link found: "登录" -> https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F&sms=5
Link found: "新闻" -> http://news.baidu.com
Link found: "hao123" -> https://www.hao123.com
Link found: "地图" -> http://map.baidu.com
Link found: "直播" -> https://live.baidu.com/
Link found: "视频" -> https://haokan.baidu.com/?sfrom=baidu-top
Link found: "贴吧" -> http://tieba.baidu.com
...

After colly crawls a page, it parses the page with goquery, finds the elements matching the selectors of the registered HTML callbacks, wraps each goquery.Selection in a colly.HTMLElement, and executes the callback.

colly.HTMLElement is really just a thin wrapper around goquery.Selection:

type HTMLElement struct {
  Name string
  Text string
  Request *Request
  Response *Response
  DOM *goquery.Selection
  Index int
}

And provides a simple and easy-to-use method:

  • Attr(k string): returns the attribute k of the current element. In the example above, we use e.Attr("href") to obtain the href attribute;
  • ChildAttr(goquerySelector, attrName string): returns the attrName attribute of the first child element matched by goquerySelector;
  • ChildAttrs(goquerySelector, attrName string): returns the attrName attributes of all child elements matched by goquerySelector, as a []string;
  • ChildText(goquerySelector string): concatenates and returns the text content of the child elements matched by goquerySelector;
  • ChildTexts(goquerySelector string): returns the text of each child element matched by goquerySelector, as a []string;
  • ForEach(goquerySelector string, callback func(int, *HTMLElement)): executes callback for each child element matched by goquerySelector;
  • Unmarshal(v interface{}): unmarshals the HTMLElement into a struct instance by tagging struct fields with selectors in goquerySelector format.

These methods are used frequently. Below we use some examples to introduce the features and usage of colly.
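As a quick, hypothetical illustration before the larger examples, this is roughly how these helpers are called inside an OnHTML callback (the HTML structure and selectors here are invented purely for demonstration):

c.OnHTML("div.item", func(e *colly.HTMLElement) {
  // Selectors like "a.title" and "span.tag" are made up for this sketch.
  title := e.ChildText("a.title")        // concatenated text of matched children
  link := e.ChildAttr("a.title", "href") // href of the first matched child
  tags := e.ChildTexts("span.tag")       // text of every matched child, as []string
  imgs := e.ChildAttrs("img", "src")     // src of every matched child, as []string

  e.ForEach("li", func(i int, li *colly.HTMLElement) {
    fmt.Println(i, li.Text) // callback runs once per matched child
  })

  fmt.Println(title, link, tags, imgs)
})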

GitHub Trending

I previously wrote an article implementing a GitHub Trending API in Go (see the references); writing it with colly is even more convenient:

package main

import (
  "fmt"
  "strconv"
  "strings"

  "github.com/gocolly/colly/v2"
)

type Repository struct {
  Author  string
  Name    string
  Link    string
  Desc    string
  Lang    string
  Stars   int
  Forks   int
  Add     int
  BuiltBy []string
}

func main() {
  c := colly.NewCollector(
    colly.MaxDepth(1),
  )


  repos := make([]*Repository, 0, 15)
  c.OnHTML(".Box .Box-row", func(e *colly.HTMLElement) {
    repo := &Repository{}

    // author & repository name
    authorRepoName := e.ChildText("h1.h3 > a")
    parts := strings.Split(authorRepoName, "/")
    repo.Author = strings.TrimSpace(parts[0])
    repo.Name = strings.TrimSpace(parts[1])

    // link
    repo.Link = e.Request.AbsoluteURL(e.ChildAttr("h1.h3 >a", "href"))

    // description
    repo.Desc = e.ChildText("p.pr-4")

    // language
    repo.Lang = strings.TrimSpace(e.ChildText("div.mt-2 > span.mr-3 > span[itemprop]"))

    // star & fork
    starForkStr := e.ChildText("div.mt-2 > a.mr-3")
    starForkStr = strings.Replace(strings.TrimSpace(starForkStr), ",", "", -1)
    parts = strings.Split(starForkStr, "\n")
    repo.Stars, _ = strconv.Atoi(strings.TrimSpace(parts[0]))
    repo.Forks, _ = strconv.Atoi(strings.TrimSpace(parts[len(parts)-1]))

    // add
    addStr := e.ChildText("div.mt-2 > span.float-sm-right")
    parts = strings.Split(addStr, " ")
    repo.Add, _ = strconv.Atoi(parts[0])

    // built by
    e.ForEach("div.mt-2 > span.mr-3 img[src]", func(index int, img *colly.HTMLElement) {
      repo.BuiltBy = append(repo.BuiltBy, img.Attr("src"))
    })

    repos = append(repos, repo)
  })

  c.Visit("https://github.com/trending")
  
  fmt.Printf("%d repositories\n", len(repos))
  fmt.Println("first repository:")
  for _, repo := range repos {
      fmt.Println("Author:", repo.Author)
      fmt.Println("Name:", repo.Name)
      break
  }
}

We use ChildText to get the author, repository name, language, star and fork counts, and the number of stars added today, and ChildAttr to get the repository link. That link is a relative path, so we convert it to an absolute path with e.Request.AbsoluteURL().

Run:

$ go run main.go
25 repositories
first repository:
Author: Shopify
Name: dawn

Baidu Fiction Hot List

The web page structure is as follows:

The structure of each part is as follows:

From this we define the structure:

type Hot struct {
  Rank   string `selector:"a > div.index_1Ew5p"`
  Name   string `selector:"div.content_1YWBm > a.title_dIF3B"`
  Author string `selector:"div.content_1YWBm > div.intro_1l0wp:nth-child(2)"`
  Type   string `selector:"div.content_1YWBm > div.intro_1l0wp:nth-child(3)"`
  Desc   string `selector:"div.desc_3CTjT"`
}

The struct tags use CSS selector syntax. They are added so that we can call the HTMLElement.Unmarshal() method directly to fill a Hot object.

Then create the Collector object:

c := colly.NewCollector()
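The callback below appends each parsed entry to a hots slice, whose declaration is not shown in these fragments; assume something like:

// Assumed declaration (not shown in the original fragments): the slice the
// OnHTML callback below appends to.
hots := make([]*Hot, 0, 30)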

Register callback:

c.OnHTML("div.category-wrap_iQLoo", func(e *colly.HTMLElement) {
  hot := &Hot{}

  err := e.Unmarshal(hot)
  if err != nil {
    fmt.Println("error:", err)
    return
  }

  hots = append(hots, hot)
})

c.OnRequest(func(r *colly.Request) {
  fmt.Println("Requesting:", r.URL)
})

c.OnResponse(func(r *colly.Response) {
  fmt.Println("Response:", len(r.Body))
})

The OnHTML callback is executed for each entry; it calls Unmarshal to populate a Hot object.

The OnRequest/OnResponse callbacks simply output debugging information.

Then, call c.Visit() to access the website:

err := c.Visit("https://top.baidu.com/board?tab=novel")
if err != nil {
  fmt.Println("Visit error:", err)
  return
}

Finally, add some debugging prints:

fmt.Printf("%d hots\n", len(hots))
for _, hot := range hots {
  fmt.Println("first hot:")
  fmt.Println("Rank:", hot.Rank)
  fmt.Println("Name:", hot.Name)
  fmt.Println("Author:", hot.Author)
  fmt.Println("Type:", hot.Type)
  fmt.Println("Desc:", hot.Desc)
  break
}

Run output:

Requesting: https://top.baidu.com/board?tab=novel
Response: 118083
30 hots
first hot:
Rank: 1
Name: 逆天邪神
Author: 作者:火星引力
Type: 类型:玄幻
Desc: 掌天毒之珠,承邪神之血,修逆天之力,一代邪神,君临天下!  查看更多>

Unsplash

I write articles for my WeChat official account, and the background images mostly come from the website unsplash, which offers a large number of rich, free pictures. One problem with the site is that access is relatively slow. Since I am learning to crawl anyway, I might as well use a program to download the pictures automatically.

The unsplash homepage is shown in the figure below:

The web page structure is as follows:

However, the pictures displayed on the homepage are smaller; clicking the link of a picture opens its full-size page:

The web page structure is as follows:

Because of the three-layer page structure (the img element is only reached on the last layer), using a single colly.Collector object means the OnHTML callbacks have to be set up very carefully, which places a considerable mental burden on the coding. colly supports multiple Collectors, so we use that approach instead:

func main() {
  c1 := colly.NewCollector()
  c2 := c1.Clone()
  c3 := c1.Clone()

  c1.OnHTML("figure[itemProp] a[itemProp]", func(e *colly.HTMLElement) {
    href := e.Attr("href")
    if href == "" {
      return
    }

    c2.Visit(e.Request.AbsoluteURL(href))
  })

  c2.OnHTML("div._1g5Lu > img[src]", func(e *colly.HTMLElement) {
    src := e.Attr("src")
    if src == "" {
      return
    }

    c3.Visit(src)
  })

  c1.OnRequest(func(r *colly.Request) {
    fmt.Println("Visiting", r.URL)
  })

  c1.OnError(func(r *colly.Response, err error) {
    fmt.Println("Visiting", r.Request.URL, "failed:", err)
  })
}

We use three Collector objects: the first collects the picture links on the homepage, the second visits those links, and the third downloads the pictures. Above we also registered request and error callbacks on the first Collector.

The third Collector downloads the actual image content and saves it locally:

func main() {
  // ... omitted
  var count uint32
  c3.OnResponse(func(r *colly.Response) {
    fileName := fmt.Sprintf("images/img%d.jpg", atomic.AddUint32(&count, 1))
    err := r.Save(fileName)
    if err != nil {
      fmt.Printf("saving %s failed:%v\n", fileName, err)
    } else {
      fmt.Printf("saving %s success\n", fileName)
    }
  })

  c3.OnRequest(func(r *colly.Request) {
    fmt.Println("visiting", r.URL)
  })
}

The above uses atomic.AddUint32() to generate a serial number for the picture.
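The fragments above leave out the entry point: the first Collector still has to visit the homepage to kick everything off. A minimal sketch of that missing glue (assuming the entry URL is the unsplash homepage and that the images/ directory already exists for r.Save):

// Assumed glue code, not shown in the fragments above.
c1.Visit("https://unsplash.com")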

Run the program and crawl the results:

Asynchronous

By default, colly crawls web pages synchronously, that is, one after another, and that is how the unsplash program above works. This takes a long time. colly provides an asynchronous crawling feature; we only need to pass the option colly.Async(true) when constructing the Collector object to enable it:

c1 := colly.NewCollector(
  colly.Async(true),
)

However, because crawling is now asynchronous, the program must wait at the end for the Collectors to finish processing, otherwise main returns early and the program exits:

c1.Wait()
c2.Wait()
c3.Wait()

Run again, much faster 😀.

Second version

Scrolling down the unsplash page, we find that the later pictures are loaded asynchronously. Scroll the page and inspect the requests on the Network tab of the Chrome developer tools:

The request path is /photos, with per_page and page parameters, and the response is a JSON array. So there is another way:

Define a structure for each item, keeping only the necessary fields:

type Item struct {
  Id     string
  Width  int
  Height int
  Links  Links
}

type Links struct {
  Download string
}

Then parse the JSON in the OnResponse callback, and for each entry call the Visit() method of the Collector d that is responsible for downloading images, passing it the Download link:

c.OnResponse(func(r *colly.Response) {
  var items []*Item
  json.Unmarshal(r.Body, &items)
  for _, item := range items {
    d.Visit(item.Links.Download)
  }
})

For the initial access, we pull 3 pages with 12 items per page (the same per_page value the web page itself uses):

for page := 1; page <= 3; page++ {
  c.Visit(fmt.Sprintf("https://unsplash.com/napi/photos?page=%d&per_page=12", page))
}
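Putting the fragments together, a hedged sketch of the whole second version might look like the following. The collector names c and d, the asynchronous mode, and the error handling are my assumptions; the /napi/photos path, the Item/Links structs, and the images/ naming come from the snippets above (and the images/ directory is assumed to exist):

package main

import (
  "encoding/json"
  "fmt"
  "sync/atomic"

  "github.com/gocolly/colly/v2"
)

type Links struct {
  Download string
}

type Item struct {
  Id     string
  Width  int
  Height int
  Links  Links
}

func main() {
  c := colly.NewCollector(colly.Async(true)) // fetches the JSON listings
  d := c.Clone()                             // downloads the images

  var count uint32
  d.OnResponse(func(r *colly.Response) {
    fileName := fmt.Sprintf("images/img%d.jpg", atomic.AddUint32(&count, 1))
    if err := r.Save(fileName); err != nil {
      fmt.Printf("saving %s failed: %v\n", fileName, err)
    }
  })

  c.OnResponse(func(r *colly.Response) {
    var items []*Item
    if err := json.Unmarshal(r.Body, &items); err != nil {
      fmt.Println("json error:", err)
      return
    }
    for _, item := range items {
      d.Visit(item.Links.Download)
    }
  })

  for page := 1; page <= 3; page++ {
    c.Visit(fmt.Sprintf("https://unsplash.com/napi/photos?page=%d&per_page=12", page))
  }

  c.Wait()
  d.Wait()
}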

Run it and check the downloaded pictures:

Speed limit

Sometimes there are too many concurrent requests and the website restricts access. In that case you need to use a LimitRule. Simply put, a LimitRule limits access speed and concurrency:

type LimitRule struct {
  DomainRegexp string
  DomainGlob   string
  Delay        time.Duration
  RandomDelay  time.Duration
  Parallelism  int
}

The commonly used fields are Delay/RandomDelay/Parallelism, which represent the fixed delay between requests, the random delay between requests, and the concurrency. In addition, you must specify which domain names the restriction applies to, via DomainRegexp or DomainGlob; if neither field is set, the Limit() method returns an error. In the example above:

err := c.Limit(&colly.LimitRule{
  DomainRegexp: `unsplash\.com`,
  RandomDelay:  500 * time.Millisecond,
  Parallelism:  12,
})
if err != nil {
  log.Fatal(err)
}

Here we set a rule for unsplash.com: a random delay of at most 500ms between requests, and at most 12 concurrent requests.

Set timeout

Sometimes the network is slow. The http.Client used by colly has a default timeout mechanism, which we can override with the Collector's WithTransport() method:

c.WithTransport(&http.Transport{
  Proxy: http.ProxyFromEnvironment,
  DialContext: (&net.Dialer{
    Timeout:   30 * time.Second,
    KeepAlive: 30 * time.Second,
  }).DialContext,
  MaxIdleConns:          100,
  IdleConnTimeout:       90 * time.Second,
  TLSHandshakeTimeout:   10 * time.Second,
  ExpectContinueTimeout: 1 * time.Second,
})
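To my knowledge (please verify against your colly version), the Collector also exposes a SetRequestTimeout() method that simply sets the overall timeout on the underlying http.Client, which is less verbose when the transport details don't matter:

// Assumed simpler alternative to replacing the whole transport.
c.SetRequestTimeout(30 * time.Second)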

Extensions

colly provides some extension features in the sub-package extensions. The most commonly used one is the random User-Agent. Websites usually check the User-Agent header to decide whether a request was sent by a browser, so crawlers generally set this header to masquerade as one. It is also quite simple to use:

import "github.com/gocolly/colly/v2/extensions"

func main() {
  c := colly.NewCollector()
  extensions.RandomUserAgent(c)
}

The implementation of the random User-Agent is also very simple: on each request it picks one of the predefined User-Agent generators at random and sets its result on the header:

func RandomUserAgent(c *colly.Collector) {
  c.OnRequest(func(r *colly.Request) {
    r.Headers.Set("User-Agent", uaGens[rand.Intn(len(uaGens))]())
  })
}

It is not difficult to implement our own extension. For example, if we need to set a specific header on every request, the extension can be written like this:

func MyHeader(c *colly.Collector) {
  c.OnRequest(func(r *colly.Request) {
    r.Headers.Set("My-Header", "dj")
  })
}

Then pass the Collector object to the MyHeader() function:

MyHeader(c)

Summary

colly is the most popular crawler framework in the Go ecosystem and supports a rich set of features. This article introduced some common features, supplemented with examples. Due to space limitations, some advanced features such as queues and storage were not covered; readers interested in crawlers can explore them further.

If you find a fun or useful Go language library, you are welcome to submit an issue on the Go Daily Library GitHub 😄

References

  1. Go Daily Library GitHub: https://github.com/darjun/go-daily-lib
  2. Go Daily Library: goquery: https://darjun.github.io/2020/10/11/godailylib/goquery/
  3. Implement a GitHub Trending API with Go: https://darjun.github.io/2021/06/16/github-trending-api/
  4. colly GitHub: https://github.com/gocolly/colly

About me

My blog: https://darjun.github.io

Welcome to follow my WeChat public account [GoUpUp], learn together and make progress together~

