Introduction
colly is a powerful crawler framework written in Go. It provides a concise API, performs well, handles cookies and sessions automatically, and offers a flexible extension mechanism.
This article first introduces the basic concepts of colly, then illustrates its usage and characteristics with a few examples: pulling GitHub Trending, pulling the Baidu novel hot list, and downloading pictures from the Unsplash website.
Quick start
The code in this article uses Go Modules.
Create a directory and initialize:
$ mkdir colly && cd colly
$ go mod init github.com/darjun/go-daily-lib/colly
Install the colly library:
$ go get -u github.com/gocolly/colly/v2
Use it:
package main
import (
"fmt"
"github.com/gocolly/colly/v2"
)
func main() {
c := colly.NewCollector(
colly.AllowedDomains("www.baidu.com" ),
)
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
link := e.Attr("href")
fmt.Printf("Link found: %q -> %s\n", e.Text, link)
c.Visit(e.Request.AbsoluteURL(link))
})
c.OnRequest(func(r *colly.Request) {
fmt.Println("Visiting", r.URL.String())
})
c.OnResponse(func(r *colly.Response) {
fmt.Printf("Response %s: %d bytes\n", r.Request.URL, len(r.Body))
})
c.OnError(func(r *colly.Response, err error) {
fmt.Printf("Error %s: %v\n", r.Request.URL, err)
})
c.Visit("http://www.baidu.com/")
}
Using colly is relatively simple:
First, call colly.NewCollector() to create a crawler object of type *colly.Collector.
Because every webpage links to many other webpages, a crawl with no restrictions may never stop. So above we pass the option colly.AllowedDomains("www.baidu.com") to restrict crawling to pages under the domain www.baidu.com.
Then we call the c.OnHTML method to register an HTML callback: for every a element that has an href attribute, the callback function is executed. Here we continue to visit the URL pointed to by href. In other words, we parse the crawled webpage and then visit the links it contains to other pages.
Call the c.OnRequest() method to register a request callback, executed every time a request is sent. Here we simply print the request URL.
Call the c.OnResponse() method to register a response callback, executed every time a response is received. Here we simply print the URL and the response size.
Call the c.OnError() method to register an error callback, executed when a request fails. Here we simply print the URL and the error.
Finally, we call c.Visit() to start visiting the first page.
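As a side note, besides restricting domains with AllowedDomains, the crawl can also be bounded by link depth with the colly.MaxDepth option (which also appears in the GitHub Trending example later). A minimal sketch combining the two:
c := colly.NewCollector(
	colly.AllowedDomains("www.baidu.com"),
	colly.MaxDepth(2), // visit the start page plus at most one level of links
)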
Run:
$ go run main.go
Visiting http://www.baidu.com/
Response http://www.baidu.com/: 303317 bytes
Link found: "百度首页" -> /
Link found: "设置" -> javascript:;
Link found: "登录" -> https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F&sms=5
Link found: "新闻" -> http://news.baidu.com
Link found: "hao123" -> https://www.hao123.com
Link found: "地图" -> http://map.baidu.com
Link found: "直播" -> https://live.baidu.com/
Link found: "视频" -> https://haokan.baidu.com/?sfrom=baidu-top
Link found: "贴吧" -> http://tieba.baidu.com
...
After colly crawls a page, it uses goquery to parse it. It then finds the elements matching the selectors of the registered HTML callbacks, wraps each goquery.Selection into a colly.HTMLElement, and executes the callback.
colly.HTMLElement is actually a thin wrapper around goquery.Selection:
type HTMLElement struct {
Name string
Text string
Request *Request
Response *Response
DOM *goquery.Selection
Index int
}
It also provides simple, easy-to-use methods:
- Attr(k string): returns the attribute k of the current element. In the example above we used e.Attr("href") to obtain the href attribute;
- ChildAttr(goquerySelector, attrName string): returns the attrName attribute of the first child element matched by goquerySelector;
- ChildAttrs(goquerySelector, attrName string): returns the attrName attributes of all child elements matched by goquerySelector, as a []string;
- ChildText(goquerySelector string): concatenates and returns the text content of the child elements matched by goquerySelector;
- ChildTexts(goquerySelector string): returns the text content of the child elements matched by goquerySelector, as a []string;
- ForEach(goquerySelector string, callback func(int, *HTMLElement)): executes callback on each child element matched by goquerySelector;
- Unmarshal(v interface{}): unmarshals an HTMLElement object into a struct instance by tagging the struct fields with selectors in goquerySelector format.
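As a quick illustration of a few of these methods, here is a minimal, self-contained sketch; the ul.items/li/span.tag selectors and the URL are made-up placeholders, not taken from any real page:
package main

import (
	"fmt"

	"github.com/gocolly/colly/v2"
)

func main() {
	c := colly.NewCollector()
	// For every matched list item, read child attributes and text with the helpers above.
	c.OnHTML("ul.items > li", func(e *colly.HTMLElement) {
		fmt.Println("link: ", e.ChildAttr("a", "href")) // href of the first <a> child
		fmt.Println("title:", e.ChildText("a"))         // concatenated text of the <a> children
		// Iterate over every <span class="tag"> child.
		e.ForEach("span.tag", func(i int, tag *colly.HTMLElement) {
			fmt.Printf("tag %d: %s\n", i, tag.Text)
		})
	})
	c.Visit("https://example.com/") // placeholder URL
}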
These methods are used frequently. Below we use some examples to introduce the features and usage of colly.
GitHub Trending
I previously wrote an API that pulls GitHub Trending; implementing it with colly is even more convenient:
type Repository struct {
Author string
Name string
Link string
Desc string
Lang string
Stars int
Forks int
Add int
BuiltBy []string
}
func main() {
c := colly.NewCollector(
colly.MaxDepth(1),
)
repos := make([]*Repository, 0, 15)
c.OnHTML(".Box .Box-row", func (e *colly.HTMLElement) {
repo := &Repository{}
// author & repository name
authorRepoName := e.ChildText("h1.h3 > a")
parts := strings.Split(authorRepoName, "/")
repo.Author = strings.TrimSpace(parts[0])
repo.Name = strings.TrimSpace(parts[1])
// link
repo.Link = e.Request.AbsoluteURL(e.ChildAttr("h1.h3 >a", "href"))
// description
repo.Desc = e.ChildText("p.pr-4")
// language
repo.Lang = strings.TrimSpace(e.ChildText("div.mt-2 > span.mr-3 > span[itemprop]"))
// star & fork
starForkStr := e.ChildText("div.mt-2 > a.mr-3")
starForkStr = strings.Replace(strings.TrimSpace(starForkStr), ",", "", -1)
parts = strings.Split(starForkStr, "\n")
repo.Stars, _ = strconv.Atoi(strings.TrimSpace(parts[0]))
repo.Forks, _ = strconv.Atoi(strings.TrimSpace(parts[len(parts)-1]))
// add
addStr := e.ChildText("div.mt-2 > span.float-sm-right")
parts = strings.Split(addStr, " ")
repo.Add, _ = strconv.Atoi(parts[0])
// built by
e.ForEach("div.mt-2 > span.mr-3 img[src]", func (index int, img *colly.HTMLElement) {
repo.BuiltBy = append(repo.BuiltBy, img.Attr("src"))
})
repos = append(repos, repo)
})
c.Visit("https://github.com/trending")
fmt.Printf("%d repositories\n", len(repos))
fmt.Println("first repository:")
for _, repo := range repos {
fmt.Println("Author:", repo.Author)
fmt.Println("Name:", repo.Name)
break
}
}
We use ChildText to get information such as the author, repository name, language, number of stars and forks, and today's additions, and ChildAttr to get the repository link. The link is a relative path, so we convert it to an absolute path with e.Request.AbsoluteURL().
Run:
$ go run main.go
25 repositories
first repository:
Author: Shopify
Name: dawn
Baidu Fiction Hot List
The web page structure is as follows:
The structure of each part is as follows:
- Each hot-list entry is in its own div.category-wrap_iQLoo;
- the div.index_1Ew5p under the a element is the rank;
- the content is in div.content_1YWBm;
- the a.title_dIF3B inside the content is the title;
- there are two div.intro_1l0wp elements inside the content: the former is the author and the latter is the type;
- the div.desc_3CTjT inside the content is the description.
From this we define the structure:
type Hot struct {
Rank string `selector:"a > div.index_1Ew5p"`
Name string `selector:"div.content_1YWBm > a.title_dIF3B"`
Author string `selector:"div.content_1YWBm > div.intro_1l0wp:nth-child(2)"`
Type string `selector:"div.content_1YWBm > div.intro_1l0wp:nth-child(3)"`
Desc string `selector:"div.desc_3CTjT"`
}
The selector tags use CSS selector syntax. They are added so that we can directly call the HTMLElement.Unmarshal() method to fill a Hot object.
Then create the Collector
object:
c := colly.NewCollector()
Declare a slice to collect the results and register the callback:
hots := make([]*Hot, 0, 30)
c.OnHTML("div.category-wrap_iQLoo", func(e *colly.HTMLElement) {
hot := &Hot{}
err := e.Unmarshal(hot)
if err != nil {
fmt.Println("error:", err)
return
}
hots = append(hots, hot)
})
c.OnRequest(func(r *colly.Request) {
fmt.Println("Requesting:", r.URL)
})
c.OnResponse(func(r *colly.Response) {
fmt.Println("Response:", len(r.Body))
})
The OnHTML callback calls Unmarshal for each entry and collects the resulting Hot objects.
OnRequest/OnResponse simply output debugging information.
Then, call c.Visit() to access the website:
err := c.Visit("https://top.baidu.com/board?tab=novel")
if err != nil {
fmt.Println("Visit error:", err)
return
}
Finally, add some debugging prints:
fmt.Printf("%d hots\n", len(hots))
for _, hot := range hots {
fmt.Println("first hot:")
fmt.Println("Rank:", hot.Rank)
fmt.Println("Name:", hot.Name)
fmt.Println("Author:", hot.Author)
fmt.Println("Type:", hot.Type)
fmt.Println("Desc:", hot.Desc)
break
}
Run output:
Requesting: https://top.baidu.com/board?tab=novel
Response: 118083
30 hots
first hot:
Rank: 1
Name: 逆天邪神
Author: 作者:火星引力
Type: 类型:玄幻
Desc: 掌天毒之珠,承邪神之血,修逆天之力,一代邪神,君临天下! 查看更多>
Unsplash
When I write articles for my WeChat public account, the background pictures are mostly taken from the website unsplash, which provides a large number of rich, free pictures. One problem with this website is that access is relatively slow. Since I am learning to crawl, I might as well use a program to download the pictures automatically.
The unsplash homepage is shown in the figure below:
The web page structure is as follows:
But the pictures displayed on the homepage are smaller, so we click the link of a picture:
The web page structure is as follows:
Because of the three-layer page structure (the img element is only reached at the end), using a single colly.Collector object means the OnHTML callbacks must be set up very carefully, which puts a fairly heavy mental burden on the coding. colly supports multiple Collectors, so we code it this way instead:
func main() {
c1 := colly.NewCollector()
c2 := c1.Clone()
c3 := c1.Clone()
c1.OnHTML("figure[itemProp] a[itemProp]", func(e *colly.HTMLElement) {
href := e.Attr("href")
if href == "" {
return
}
c2.Visit(e.Request.AbsoluteURL(href))
})
c2.OnHTML("div._1g5Lu > img[src]", func(e *colly.HTMLElement) {
src := e.Attr("src")
if src == "" {
return
}
c3.Visit(src)
})
c1.OnRequest(func(r *colly.Request) {
fmt.Println("Visiting", r.URL)
})
c1.OnError(func(r *colly.Response, err error) {
fmt.Println("Visiting", r.Request.URL, "failed:", err)
})
}
We use three Collector objects: the first Collector collects the picture-page links on the homepage, the second Collector visits those picture pages, and the third Collector downloads the pictures. Above we also registered request and error callbacks for the first Collector.
After the third Collector downloads the actual image content, we save it locally:
func main() {
// ... omitted
var count uint32
c3.OnResponse(func(r *colly.Response) {
fileName := fmt.Sprintf("images/img%d.jpg", atomic.AddUint32(&count, 1))
err := r.Save(fileName)
if err != nil {
fmt.Printf("saving %s failed:%v\n", fileName, err)
} else {
fmt.Printf("saving %s success\n", fileName)
}
})
c3.OnRequest(func(r *colly.Request) {
fmt.Println("visiting", r.URL)
})
}
The above uses atomic.AddUint32()
to generate a serial number for the picture.
Run the program and crawl the results:
Asynchronous
By default, colly crawls web pages synchronously, one after another, as in the unsplash program above. This takes a long time. colly provides an asynchronous crawling feature: we only need to pass the option colly.Async(true) when creating the Collector object to enable asynchronous mode:
c1 := colly.NewCollector(
colly.Async(true),
)
However, because crawling is now asynchronous, the program must wait at the end for the Collectors to finish processing; otherwise main returns early and the program exits:
c1.Wait()
c2.Wait()
c3.Wait()
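Putting the pieces together, the asynchronous variant of the three-Collector program looks roughly like this. This is only a sketch: to my understanding, Clone() copies the Collector's configuration (including the Async option) but not its callbacks, which still have to be registered on each clone as shown earlier.
c1 := colly.NewCollector(
	colly.Async(true),
)
c2 := c1.Clone() // also asynchronous, via the cloned configuration
c3 := c1.Clone()

// ... register the OnHTML/OnRequest/OnError callbacks as shown above ...

c1.Visit("https://unsplash.com") // start from the unsplash homepage
c1.Wait()
c2.Wait()
c3.Wait()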
Run again, much faster 😀.
Second version
Scrolling down the unsplash webpage, we find that the later pictures are loaded asynchronously. Scroll the page and inspect the requests in the Network tab of the Chrome browser:
The request path is /photos, it takes per_page and page parameters, and it returns a JSON array. So there is another way:
Define a structure for each item, keeping only the fields we need:
type Item struct {
Id string
Width int
Height int
Links Links
}
type Links struct {
Download string
}
Then parse the JSON in the OnResponse callback, and for each item call the Visit() method of the Collector responsible for downloading images (d below) with the Download link:
c.OnResponse(func(r *colly.Response) {
var items []*Item
json.Unmarshal(r.Body, &items)
for _, item := range items {
d.Visit(item.Links.Download)
}
})
For the initial requests, we pull 3 pages with 12 items per page (the same as the browser's page requests):
for page := 1; page <= 3; page++ {
c.Visit(fmt.Sprintf("https://unsplash.com/napi/photos?page=%d&per_page=12", page))
}
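For completeness, here is a minimal sketch of how this second version might be wired together end to end. The names c and d follow the snippets above; the async option and the image-saving callback are carried over from the earlier sections, and the images/ directory is assumed to exist:
package main

import (
	"encoding/json"
	"fmt"
	"sync/atomic"

	"github.com/gocolly/colly/v2"
)

type Item struct {
	Id     string
	Width  int
	Height int
	Links  Links
}

type Links struct {
	Download string
}

func main() {
	c := colly.NewCollector(colly.Async(true)) // fetches the JSON photo lists
	d := c.Clone()                             // downloads the actual images

	c.OnResponse(func(r *colly.Response) {
		var items []*Item
		if err := json.Unmarshal(r.Body, &items); err != nil {
			fmt.Println("decode failed:", err)
			return
		}
		for _, item := range items {
			d.Visit(item.Links.Download)
		}
	})

	var count uint32
	d.OnResponse(func(r *colly.Response) {
		// Assumes the images/ directory already exists.
		fileName := fmt.Sprintf("images/img%d.jpg", atomic.AddUint32(&count, 1))
		if err := r.Save(fileName); err != nil {
			fmt.Printf("saving %s failed: %v\n", fileName, err)
		}
	})

	for page := 1; page <= 3; page++ {
		c.Visit(fmt.Sprintf("https://unsplash.com/napi/photos?page=%d&per_page=12", page))
	}

	c.Wait()
	d.Wait()
}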
Run to view the downloaded picture:
Rate limiting
Sometimes there are too many concurrent requests and the website restricts access. In that case we need to use a LimitRule. Simply put, a LimitRule limits the access rate and concurrency:
type LimitRule struct {
DomainRegexp string
DomainGlob string
Delay time.Duration
RandomDelay time.Duration
Parallelism int
}
The commonly used fields are Delay/RandomDelay/Parallelism, which represent the delay between requests, an additional random delay, and the concurrency. In addition, we must specify which domain names the restrictions apply to, via DomainRegexp or DomainGlob; if neither field is set, the Limit() method returns an error. Used in the example above:
err := c.Limit(&colly.LimitRule{
DomainRegexp: `unsplash\.com`,
RandomDelay: 500 * time.Millisecond,
Parallelism: 12,
})
if err != nil {
log.Fatal(err)
}
For unsplash.com, we set a random delay of at most 500ms between requests and allow up to 12 concurrent requests.
Set timeout
Sometimes the network is slow. The http.Client used inside colly has a default timeout mechanism, which we can override via the colly.WithTransport() option:
c.WithTransport(&http.Transport{
Proxy: http.ProxyFromEnvironment,
DialContext: (&net.Dialer{
Timeout: 30 * time.Second,
KeepAlive: 30 * time.Second,
}).DialContext,
MaxIdleConns: 100,
IdleConnTimeout: 90 * time.Second,
TLSHandshakeTimeout: 10 * time.Second,
ExpectContinueTimeout: 1 * time.Second,
})
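Note that the transport above mainly tunes connection-level timeouts (dialing, TLS handshake, idle connections). If all that is needed is a single overall per-request deadline, the Collector also exposes a SetRequestTimeout method (in the colly versions I am aware of), which sets the underlying client's timeout. A one-line sketch, reusing c and the time package from the surrounding examples:
c.SetRequestTimeout(30 * time.Second) // overall deadline for each request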
Extensions
colly provides some extension features in the sub-package extensions. The most commonly used one is the random User-Agent. Websites usually use the User-Agent header to identify whether a request was sent by a browser, so crawlers generally set this header to pretend to be a browser. It is also quite simple to use:
import "github.com/gocolly/colly/v2/extensions"
func main() {
c := colly.NewCollector()
extensions.RandomUserAgent(c)
}
The implementation of the random User-Agent is also very simple: it randomly picks one of the predefined User-Agent generators and sets the result on the header:
func RandomUserAgent(c *colly.Collector) {
c.OnRequest(func(r *colly.Request) {
r.Headers.Set("User-Agent", uaGens[rand.Intn(len(uaGens))]())
})
}
It is not difficult to implement our own extension. For example, if we need to set a specific header on every request, the extension can be written like this:
func MyHeader(c *colly.Collector) {
c.OnRequest(func(r *colly.Request) {
r.Headers.Set("My-Header", "dj")
})
}
Call the MyHeader() function, passing in the Collector object:
MyHeader(c)
Summary
colly is the most popular crawler framework in the Go language and supports a rich set of features. This article introduced some common features, supplemented with examples. Due to space limitations, some advanced features, such as queues and storage, were not covered. Readers interested in crawling can dig deeper.
If you find a fun or useful Go library, feel free to submit an issue on the Go Daily Library GitHub 😄
References
- Go Daily Library GitHub: https://github.com/darjun/go-daily-lib
- Go Daily Library: goquery: https://darjun.github.io/2020/10/11/godailylib/goquery/
- Implement a GitHub Trending API with Go: https://darjun.github.io/2021/06/16/github-trending-api/
- colly GitHub: https://github.com/gocolly/colly
About me
My blog: https://darjun.github.io
Welcome to follow my WeChat public account [GoUpUp]; let's learn together and make progress together~