[TOC]
Let me share a wave of GO crawlers.

First, a quick review of the last article, where we talked about using GOLANG to send mail:

- What email is
- What the common mail protocols are
- How to use GOLANG to send email
- How to send plain-text, HTML, and attachment content
- How to CC and BCC when sending mail
- How to improve the performance of sending mail

If you want to see how to send mail with GOLANG, check out the article "How to use GOLANG to send mail".

You may also remember that we shared a simple article about crawling dynamic web data with Golang: "Golang+chromedp+goquery Simple crawling of dynamic data | Go theme month". Friends who are interested can study the chromedp framework in more detail.

Today, let's share how to crawl static web data with GO.
## What are static web pages and dynamic web pages?
What is static web page data?
- A static page contains no program code, only HTML (hypertext markup language); the file suffix is usually `.html`, `.htm`, `.xml`, etc.
- Another feature of static pages is that anyone can open them directly, and the content is the same for everyone at any time: the `html` is fixed, so the rendered result is fixed.
So, by the way, what is a dynamic web page?

A dynamic web page is a web programming technique. Besides HTML markup, a dynamic page file also contains program code that implements specific functionality. This code mainly lets the browser and the server interact: the server can dynamically generate the page content according to the client's request, which is very flexible.

In other words, even though the code of a dynamic page does not change, what it displays can change over time, across environments, or as the database changes.
## Crawling static web data with GO

Let's crawl some static web data. For example, we will crawl the account and password information published on this page:

http://www.ucbug.com/jiaocheng/63149.html?_t=1582307696
The steps to crawl this website are:

- Decide which website we need to crawl
- Fetch the data with an HTTP `GET`
- Convert the byte array to a string
- Use regular expressions to match the content we expect (this is very important; in fact, when crawling static pages, most of the time goes into processing and filtering the data)
- Filter the data, remove duplicates, and so on (this step varies by person and by target website, depending on your needs)
Let's write a DEMO to crawl the account and password information from the page above. We are doing this only for learning, so please don't use it for anything bad.
```go
package main

import (
	"io/ioutil"
	"log"
	"net/http"
	"regexp"
)

const (
	// Regular expression that matches the XL (Xunlei) account and password
	reAccount = `(账号|迅雷账号)(;|:)[0-9:]+(| )密码:[0-9a-zA-Z]+`
)

// Get the accounts and passwords from the website
func GetAccountAndPwd(url string) {
	// Fetch the page data
	resp, err := http.Get(url)
	if err != nil {
		log.Fatal("http.Get error : ", err)
	}
	defer resp.Body.Close()

	// Read the response body as bytes
	dataBytes, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		log.Fatal("ioutil.ReadAll error : ", err)
	}

	// Convert the byte slice to a string
	str := string(dataBytes)

	// Filter out the XL accounts and passwords
	re := regexp.MustCompile(reAccount)

	// How many matches to return; -1 means all of them
	results := re.FindAllStringSubmatch(str, -1)

	// Print the results
	for _, result := range results {
		log.Println(result[0])
	}
}

func main() {
	// Simple log settings
	log.SetFlags(log.Lshortfile | log.LstdFlags)

	// Pass in the website address and start crawling
	GetAccountAndPwd("http://www.ucbug.com/jiaocheng/63149.html?_t=1582307696")
}
```
The results of running the above code are as follows:
```
2021/06/xx xx:05:25 main.go:46: 账号:357451317 密码:110120a
2021/06/xx xx:05:25 main.go:46: 账号:907812219 密码:810303
2021/06/xx xx:05:25 main.go:46: 账号:797169897 密码:zxcvbnm132
2021/06/xx xx:05:25 main.go:46: 迅雷账号:792253782:1密码:283999
2021/06/xx xx:05:25 main.go:46: 迅雷账号:147643189:2密码:344867
2021/06/xx xx:05:25 main.go:46: 迅雷账号:147643189:1密码:267297
```
As you can see, both the entries starting with 账号 (account) and the ones starting with 迅雷账号 (Thunder account) have been crawled. Crawling static web content is really not difficult; the time is basically spent on regular-expression matching and data processing.
Mapping the crawling steps above to the code, we have:

- Visit the website: `http.Get(url)`
- Read the data content: `ioutil.ReadAll`
- Convert the data to a string: `string(dataBytes)`
- Set the regular-expression matching rule: `regexp.MustCompile(reAccount)`
- Filter the data, optionally limiting the number of matches: `re.FindAllStringSubmatch(str, -1)`
Of course, real work is certainly not that simple. The data crawled from a website is rarely uniform in format, special characters are common, some content has no regular pattern at all, and sometimes the data is even generated dynamically, so a plain Get cannot fetch it.

Still, none of these problems is unsolvable. For each problem you design a matching solution and the corresponding data processing; I'm sure friends who run into them will be able to work them out. When facing a problem, we need the determination to solve it. As a tiny example of the filtering step, see the sketch below.
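Here is a minimal sketch of the "remove duplicates" part of the filtering step mentioned above. It is my own illustration rather than code from the demos: it simply drops repeated matches with a map before they are printed or downloaded.

```go
package main

import "fmt"

// dedup removes duplicate strings while preserving the original order,
// e.g. the first column of the results returned by re.FindAllStringSubmatch.
func dedup(matches []string) []string {
	seen := make(map[string]bool)
	result := make([]string, 0, len(matches))
	for _, m := range matches {
		if !seen[m] {
			seen[m] = true
			result = append(result, m)
		}
	}
	return result
}

func main() {
	// Hypothetical matched links, with one duplicate.
	matches := []string{"https://example.com/a.jpg", "https://example.com/a.jpg", "https://example.com/b.jpg"}
	fmt.Println(dedup(matches)) // [https://example.com/a.jpg https://example.com/b.jpg]
}
```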
## Scraping pictures
After the example above, let's try to crawl the image data on a web page. For example, search for Shiba Inu on a certain search engine (Baidu), take the URL from the browser's address bar, and crawl that page's data:

https://image.baidu.com/search/index?tn=baiduimage&ps=1&ct=201326592&lm=-1&cl=2&nc=1&ie=utf-8&word=%E6%9F%B4%E7%8A%AC

Since there are many pictures on the page, we set it to match only 2 of them.
Let's look at the DEMO:

- The code that GETs the URL data and converts it to a string has been extracted and encapsulated into a small function `getStr`
- `GetPic` uses a regular expression to do the matching; the number of matches can be set, and here we set it to 2
```go
package main

import (
	"io/ioutil"
	"log"
	"net/http"
	"regexp"
)

const (
	// Regular expression that matches the XL account and password
	reAccount = `(账号|迅雷账号)(;|:)[0-9:]+(| )密码:[0-9a-zA-Z]+`
	// Regular expression that matches picture links
	rePic = `https?://[^"]+?(\.((jpg)|(png)|(jpeg)|(gif)|(bmp)))`
)

// Fetch the page data and convert it to a string
func getStr(url string) string {
	resp, err := http.Get(url)
	if err != nil {
		log.Fatal("http.Get error : ", err)
	}
	defer resp.Body.Close()

	// Read the response body as bytes
	dataBytes, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		log.Fatal("ioutil.ReadAll error : ", err)
	}

	// Convert the byte slice to a string
	str := string(dataBytes)
	return str
}

// Get the accounts and passwords from the website
func GetAccountAndPwd(url string, n int) {
	str := getStr(url)
	// Filter out the XL accounts and passwords
	re := regexp.MustCompile(reAccount)
	// How many matches to return; -1 means all of them
	results := re.FindAllStringSubmatch(str, n)
	// Print the results
	for _, result := range results {
		log.Println(result[0])
	}
}

// Get the picture links from the website
func GetPic(url string, n int) {
	str := getStr(url)
	// Filter out the picture links
	re := regexp.MustCompile(rePic)
	// How many matches to return; -1 means all of them
	results := re.FindAllStringSubmatch(str, n)
	// Print the results
	for _, result := range results {
		log.Println(result[0])
	}
}

func main() {
	// Simple log settings
	log.SetFlags(log.Lshortfile | log.LstdFlags)
	//GetAccountAndPwd("http://www.ucbug.com/jiaocheng/63149.html?_t=1582307696", -1)
	GetPic("https://image.baidu.com/search/index?tn=baiduimage&ps=1&ct=201326592&lm=-1&cl=2&nc=1&ie=utf-8&word=%E6%9F%B4%E7%8A%AC", 2)
}
```
Run the above code, and the results are as follows (without deduplication):
```
2021/06/xx xx:06:39 main.go:63: https://ss1.bdstatic.com/70cFuXSh_Q1YnxGkpoWK1HF6hhy/it/u=4246005838,1103140037&fm=26&gp=0.jpg
2021/06/xx xx:06:39 main.go:63: https://ss1.bdstatic.com/70cFuXSh_Q1YnxGkpoWK1HF6hhy/it/u=4246005838,1103140037&fm=26&gp=0.jpg
```
Sure enough, this is what we want. But only printing out the crawled image links certainly can't satisfy real crawler needs; we still have to download the images so we can actually use them. That is what we really want.

For the convenience of the demonstration, let's add a small file-download function to the code above and download just the first picture.
```go
package main

import (
	"fmt"
	"io/ioutil"
	"log"
	"net/http"
	"regexp"
	"strings"
	"time"
)

const (
	// Regular expression that matches picture links
	rePic = `https?://[^"]+?(\.((jpg)|(png)|(jpeg)|(gif)|(bmp)))`
)

// Fetch the page data and convert it to a string
func getStr(url string) string {
	resp, err := http.Get(url)
	if err != nil {
		log.Fatal("http.Get error : ", err)
	}
	defer resp.Body.Close()

	// Read the response body as bytes
	dataBytes, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		log.Fatal("ioutil.ReadAll error : ", err)
	}

	// Convert the byte slice to a string
	str := string(dataBytes)
	return str
}

// Get the picture data
func GetPic(url string, n int) {
	str := getStr(url)
	// Filter out the picture links
	re := regexp.MustCompile(rePic)
	// How many matches to return; -1 means all of them
	results := re.FindAllStringSubmatch(str, n)
	for _, result := range results {
		// Work out the file name for this picture
		fileName := GetFilename(result[0])
		// Download the picture
		DownloadPic(result[0], fileName)
	}
}

// Work out the name of the file
func GetFilename(url string) (filename string) {
	// Find the index of the last '='
	lastIndex := strings.LastIndex(url, "=")
	// The substring after it is the original file name
	filename = url[lastIndex+1:]
	// Prefix the original name with a timestamp to build a new name
	prefix := fmt.Sprintf("%d", time.Now().Unix())
	filename = prefix + "_" + filename
	return filename
}

// Download one picture to the given file name
func DownloadPic(url string, filename string) {
	resp, err := http.Get(url)
	if err != nil {
		log.Fatal("http.Get error : ", err)
	}
	defer resp.Body.Close()

	bytes, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		log.Fatal("ioutil.ReadAll error : ", err)
	}

	// Path where the file is stored
	filename = "./" + filename
	// Write the file and set its permissions
	err = ioutil.WriteFile(filename, bytes, 0666)
	if err != nil {
		log.Fatal("write failed !!", err)
	} else {
		log.Println("ioutil.WriteFile successfully , filename = ", filename)
	}
}

func main() {
	// Simple log settings
	log.SetFlags(log.Lshortfile | log.LstdFlags)
	GetPic("https://image.baidu.com/search/index?tn=baiduimage&ps=1&ct=201326592&lm=-1&cl=2&nc=1&ie=utf-8&word=%E6%9F%B4%E7%8A%AC", 1)
}
```
In the code above we added two functions to help us rename and download the picture:

- `GetFilename` works out the file name and renames it with a timestamp prefix (a small usage sketch follows this list)
- `DownloadPic` downloads the specific picture into the current directory
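To make the renaming concrete, here is a tiny standalone illustration of what that logic produces for one of the Baidu image URLs matched earlier; it is my own toy example, not part of the original program:

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

func main() {
	// One of the image URLs matched earlier.
	url := "https://ss1.bdstatic.com/70cFuXSh_Q1YnxGkpoWK1HF6hhy/it/u=4246005838,1103140037&fm=26&gp=0.jpg"
	// Everything after the last '=' is "0.jpg" ...
	base := url[strings.LastIndex(url, "=")+1:]
	// ... and the Unix-timestamp prefix yields a name such as "1624377003_0.jpg".
	fmt.Printf("%d_%s\n", time.Now().Unix(), base)
}
```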
Run the above code and you can see the following output:

```
2021/06/xx xx:50:04 main.go:91: ioutil.WriteFile successfully , filename = ./1624377003_0.jpg
```

And in the current directory a picture has been downloaded successfully, named 1624377003_0.jpg.
Some big brothers will say: downloading pictures with a single goroutine is too slow for me; can we download faster, with multiple goroutines downloading together?
## Crawl our little Shiba Inu concurrently
Do you still remember GO channels and the sync package? This little feature is a good chance to practice them. It is relatively simple, so let me just describe the general idea; if you are interested, go ahead and implement it yourself.
- Read the data from the URL above and convert it to a string
- Use a regular expression to match the whole series of image links
- Put each image link into a buffered channel; for now let's set the buffer to 100
- Start another 3 goroutines to read from the channel concurrently and download the images locally; for renaming the files, refer to the code above (one possible sketch follows this list)
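Here is one possible sketch of that idea, in case it helps you get started. It is only my own outline under the assumptions above (a buffer of 100, 3 worker goroutines), reusing the same picture regular expression and the same naming trick as the demo, not the article's final implementation:

```go
package main

import (
	"fmt"
	"io/ioutil"
	"log"
	"net/http"
	"regexp"
	"strings"
	"sync"
	"time"
)

// Same picture regular expression as in the demo above.
const rePic = `https?://[^"]+?(\.((jpg)|(png)|(jpeg)|(gif)|(bmp)))`

// downloadPic fetches one picture and writes it into the current directory,
// using the same "timestamp + substring after the last '='" naming trick.
func downloadPic(url string) {
	resp, err := http.Get(url)
	if err != nil {
		log.Println("http.Get error : ", err)
		return
	}
	defer resp.Body.Close()

	data, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		log.Println("ioutil.ReadAll error : ", err)
		return
	}

	base := url
	if i := strings.LastIndex(url, "="); i >= 0 {
		base = url[i+1:]
	}
	// Use nanoseconds so names do not collide when several goroutines download in the same second.
	name := fmt.Sprintf("./%d_%s", time.Now().UnixNano(), base)
	if err := ioutil.WriteFile(name, data, 0666); err != nil {
		log.Println("write failed !! ", err)
		return
	}
	log.Println("downloaded , filename = ", name)
}

func main() {
	log.SetFlags(log.Lshortfile | log.LstdFlags)
	pageURL := "https://image.baidu.com/search/index?tn=baiduimage&ps=1&ct=201326592&lm=-1&cl=2&nc=1&ie=utf-8&word=%E6%9F%B4%E7%8A%AC"

	// Step 1 and 2: read the page and match the picture links.
	resp, err := http.Get(pageURL)
	if err != nil {
		log.Fatal("http.Get error : ", err)
	}
	body, err := ioutil.ReadAll(resp.Body)
	resp.Body.Close()
	if err != nil {
		log.Fatal("ioutil.ReadAll error : ", err)
	}
	links := regexp.MustCompile(rePic).FindAllString(string(body), -1)

	// Step 3: put every link into a buffered channel (buffer of 100).
	picChan := make(chan string, 100)

	// Step 4: start 3 goroutines that read links from the channel and download them.
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for link := range picChan {
				downloadPic(link)
			}
		}()
	}

	// Feed the matched links into the channel, then close it so the workers can exit.
	for _, link := range links {
		picChan <- link
	}
	close(picChan)
	wg.Wait()
}
```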
How about it, friends? If you are interested, give it a try. And if you have ideas about crawling dynamic data, feel free to discuss them with me so we can make progress together.
## Summary
- A brief introduction to static and dynamic web pages
- Crawling simple data from static web pages with GO
- Crawling pictures on a web page with GO
- Crawling resources on a web page concurrently
Welcome to like, follow, and favorite. Friends, your support and encouragement are my motivation to keep sharing and keep improving the quality.

Well, that's all for this time. Next we will share the application of gjson in GO.

Technology is open, and our mindset should be even more open. Embrace change, face the sun, and strive to move forward.

I am Nezha; welcome to like, follow, and favorite. See you next time~