[TOC]
Let me share a wave of GO crawlers.

First, a quick review of the last article, where we talked about using GOLANG to send mail:

- What email is
- What the common mail protocols are
- How to use GOLANG to send email
- How to send plain-text, HTML, and attachment content
- How to CC and BCC when sending mail
- How to improve the performance of sending mail

If you want to see how to send mail with GOLANG, check out the article "How to use GOLANG to send mail".

You may also remember that we shared a simple article about crawling dynamic web data with Golang: "Golang+chromedp+goquery Simple crawling of dynamic data | Go theme month". Friends who are interested can study the chromedp framework in more detail.

Today, let's share how to crawl static web data with GO.
## What are static web pages and dynamic web pages?
What is static web page data?
- A static page contains no program code, only HTML (hypertext markup language); the file suffix is usually `.html`, `.htm`, `.xml`, etc.
- Another feature of static pages is that anyone can open them directly, and the content is the same for everyone at any time: the `html` is fixed, so the rendered result is fixed.
So, by the way, what is a dynamic web page?

A dynamic web page is a web programming technique. Besides HTML markup, a dynamic page file also contains program code that implements specific functionality. This code mainly lets the browser and the server interact: the server can dynamically generate the page content according to the client's request, which is very flexible.

In other words, even though the code of a dynamic page does not change, what it displays can change over time, across environments, or as the database changes.
## Crawling static web data with GO

Let's crawl some static web data. For example, we will crawl the account and password information published on this page:

http://www.ucbug.com/jiaocheng/63149.html?_t=1582307696
The steps to crawl this website are:

- Decide which website we need to crawl
- Fetch the data with an HTTP `GET`
- Convert the byte array to a string
- Use regular expressions to match the content we expect (this is very important; in fact, when crawling static pages, most of the time goes into processing and filtering the data)
- Filter the data, remove duplicates, and so on (this step varies by person and by target website, depending on your needs)
Let's write a DEMO to crawl the account and password information from the page above. We are doing this only for learning, so please don't use it for anything bad.
```go
package main

import (
	"io/ioutil"
	"log"
	"net/http"
	"regexp"
)

const (
	// Regular expression that matches the XL (Xunlei) account and password
	reAccount = `(账号|迅雷账号)(;|:)[0-9:]+(| )密码:[0-9a-zA-Z]+`
)

// Get the accounts and passwords from the website
func GetAccountAndPwd(url string) {
	// Fetch the page data
	resp, err := http.Get(url)
	if err != nil {
		log.Fatal("http.Get error : ", err)
	}
	defer resp.Body.Close()

	// Read the response body as bytes
	dataBytes, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		log.Fatal("ioutil.ReadAll error : ", err)
	}

	// Convert the byte slice to a string
	str := string(dataBytes)

	// Filter out the XL accounts and passwords
	re := regexp.MustCompile(reAccount)

	// How many matches to return; -1 means all of them
	results := re.FindAllStringSubmatch(str, -1)

	// Print the results
	for _, result := range results {
		log.Println(result[0])
	}
}

func main() {
	// Simple log settings
	log.SetFlags(log.Lshortfile | log.LstdFlags)

	// Pass in the website address and start crawling
	GetAccountAndPwd("http://www.ucbug.com/jiaocheng/63149.html?_t=1582307696")
}
```
The results of running the above code are as follows:
```
2021/06/xx xx:05:25 main.go:46: 账号:357451317 密码:110120a
2021/06/xx xx:05:25 main.go:46: 账号:907812219 密码:810303
2021/06/xx xx:05:25 main.go:46: 账号:797169897 密码:zxcvbnm132
2021/06/xx xx:05:25 main.go:46: 迅雷账号:792253782:1密码:283999
2021/06/xx xx:05:25 main.go:46: 迅雷账号:147643189:2密码:344867
2021/06/xx xx:05:25 main.go:46: 迅雷账号:147643189:1密码:267297
```
As you can see, both the entries starting with 账号 (account) and the ones starting with 迅雷账号 (Thunder account) have been crawled. Crawling static web content is really not difficult; the time is basically spent on regular-expression matching and data processing.
Mapping the crawling steps above to the code, we have:

- Visit the website: `http.Get(url)`
- Read the data content: `ioutil.ReadAll`
- Convert the data to a string: `string(dataBytes)`
- Set the regular-expression matching rule: `regexp.MustCompile(reAccount)`
- Filter the data, optionally limiting the number of matches: `re.FindAllStringSubmatch(str, -1)`
Of course, real work is certainly not that simple. The data crawled from a website is rarely uniform in format, special characters are common, some content has no regular pattern at all, and sometimes the data is even generated dynamically, so a plain Get cannot fetch it.

Still, none of these problems is unsolvable. For each problem you design a matching solution and the corresponding data processing; I'm sure friends who run into them will be able to work them out. When facing a problem, we need the determination to solve it. As a tiny example of the filtering step, see the sketch below.
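Here is a minimal sketch of the "remove duplicates" part of the filtering step mentioned above. It is my own illustration rather than code from the demos: it simply drops repeated matches with a map before they are printed or downloaded.

```go
package main

import "fmt"

// dedup removes duplicate strings while preserving the original order,
// e.g. the first column of the results returned by re.FindAllStringSubmatch.
func dedup(matches []string) []string {
	seen := make(map[string]bool)
	result := make([]string, 0, len(matches))
	for _, m := range matches {
		if !seen[m] {
			seen[m] = true
			result = append(result, m)
		}
	}
	return result
}

func main() {
	// Hypothetical matched links, with one duplicate.
	matches := []string{"https://example.com/a.jpg", "https://example.com/a.jpg", "https://example.com/b.jpg"}
	fmt.Println(dedup(matches)) // [https://example.com/a.jpg https://example.com/b.jpg]
}
```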
## Scraping pictures
After the example above, let's try to crawl the image data on a web page. For example, search for Shiba Inu on a certain search engine (Baidu), take the URL from the browser's address bar, and crawl that page's data:

https://image.baidu.com/search/index?tn=baiduimage&ps=1&ct=201326592&lm=-1&cl=2&nc=1&ie=utf-8&word=%E6%9F%B4%E7%8A%AC

Since there are many pictures on the page, we set it to match only 2 of them.
Let's look at the DEMO:

- The code that GETs the URL data and converts it to a string has been extracted and encapsulated into a small function `getStr`
- `GetPic` uses a regular expression to do the matching; the number of matches can be set, and here we set it to 2
```go
package main

import (
	"io/ioutil"
	"log"
	"net/http"
	"regexp"
)

const (
	// Regular expression that matches the XL account and password
	reAccount = `(账号|迅雷账号)(;|:)[0-9:]+(| )密码:[0-9a-zA-Z]+`
	// Regular expression that matches picture links
	rePic = `https?://[^"]+?(\.((jpg)|(png)|(jpeg)|(gif)|(bmp)))`
)

// Fetch the page data and convert it to a string
func getStr(url string) string {
	resp, err := http.Get(url)
	if err != nil {
		log.Fatal("http.Get error : ", err)
	}
	defer resp.Body.Close()

	// Read the response body as bytes
	dataBytes, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		log.Fatal("ioutil.ReadAll error : ", err)
	}

	// Convert the byte slice to a string
	str := string(dataBytes)
	return str
}

// Get the accounts and passwords from the website
func GetAccountAndPwd(url string, n int) {
	str := getStr(url)
	// Filter out the XL accounts and passwords
	re := regexp.MustCompile(reAccount)
	// How many matches to return; -1 means all of them
	results := re.FindAllStringSubmatch(str, n)
	// Print the results
	for _, result := range results {
		log.Println(result[0])
	}
}

// Get the picture links from the website
func GetPic(url string, n int) {
	str := getStr(url)
	// Filter out the picture links
	re := regexp.MustCompile(rePic)
	// How many matches to return; -1 means all of them
	results := re.FindAllStringSubmatch(str, n)
	// Print the results
	for _, result := range results {
		log.Println(result[0])
	}
}

func main() {
	// Simple log settings
	log.SetFlags(log.Lshortfile | log.LstdFlags)
	//GetAccountAndPwd("http://www.ucbug.com/jiaocheng/63149.html?_t=1582307696", -1)
	GetPic("https://image.baidu.com/search/index?tn=baiduimage&ps=1&ct=201326592&lm=-1&cl=2&nc=1&ie=utf-8&word=%E6%9F%B4%E7%8A%AC", 2)
}
```
Run the above code, and the results are as follows (without deduplication):
```
2021/06/xx xx:06:39 main.go:63: https://ss1.bdstatic.com/70cFuXSh_Q1YnxGkpoWK1HF6hhy/it/u=4246005838,1103140037&fm=26&gp=0.jpg
2021/06/xx xx:06:39 main.go:63: https://ss1.bdstatic.com/70cFuXSh_Q1YnxGkpoWK1HF6hhy/it/u=4246005838,1103140037&fm=26&gp=0.jpg
```
Sure enough, this is what we want. But only printing out the crawled image links certainly can't satisfy real crawler needs; we still have to download the images so we can actually use them. That is what we really want.

For the convenience of the demonstration, let's add a small file-download function to the code above and download just the first picture.
```go
package main

import (
	"fmt"
	"io/ioutil"
	"log"
	"net/http"
	"regexp"
	"strings"
	"time"
)

const (
	// Regular expression that matches picture links
	rePic = `https?://[^"]+?(\.((jpg)|(png)|(jpeg)|(gif)|(bmp)))`
)

// Fetch the page data and convert it to a string
func getStr(url string) string {
	resp, err := http.Get(url)
	if err != nil {
		log.Fatal("http.Get error : ", err)
	}
	defer resp.Body.Close()

	// Read the response body as bytes
	dataBytes, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		log.Fatal("ioutil.ReadAll error : ", err)
	}

	// Convert the byte slice to a string
	str := string(dataBytes)
	return str
}

// Get the picture data
func GetPic(url string, n int) {
	str := getStr(url)
	// Filter out the picture links
	re := regexp.MustCompile(rePic)
	// How many matches to return; -1 means all of them
	results := re.FindAllStringSubmatch(str, n)
	for _, result := range results {
		// Work out the file name for this picture
		fileName := GetFilename(result[0])
		// Download the picture
		DownloadPic(result[0], fileName)
	}
}

// Work out the name of the file
func GetFilename(url string) (filename string) {
	// Find the index of the last '='
	lastIndex := strings.LastIndex(url, "=")
	// The substring after it is the original file name
	filename = url[lastIndex+1:]
	// Prefix the original name with a timestamp to build a new name
	prefix := fmt.Sprintf("%d", time.Now().Unix())
	filename = prefix + "_" + filename
	return filename
}

// Download one picture to the given file name
func DownloadPic(url string, filename string) {
	resp, err := http.Get(url)
	if err != nil {
		log.Fatal("http.Get error : ", err)
	}
	defer resp.Body.Close()

	bytes, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		log.Fatal("ioutil.ReadAll error : ", err)
	}

	// Path where the file is stored
	filename = "./" + filename
	// Write the file and set its permissions
	err = ioutil.WriteFile(filename, bytes, 0666)
	if err != nil {
		log.Fatal("write failed !!", err)
	} else {
		log.Println("ioutil.WriteFile successfully , filename = ", filename)
	}
}

func main() {
	// Simple log settings
	log.SetFlags(log.Lshortfile | log.LstdFlags)
	GetPic("https://image.baidu.com/search/index?tn=baiduimage&ps=1&ct=201326592&lm=-1&cl=2&nc=1&ie=utf-8&word=%E6%9F%B4%E7%8A%AC", 1)
}
```
In the code above we added two functions to help us rename and download the picture:

- `GetFilename` works out the file name and renames it with a timestamp prefix (a small usage sketch follows this list)
- `DownloadPic` downloads the specific picture into the current directory
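To make the renaming concrete, here is a tiny standalone illustration of what that logic produces for one of the Baidu image URLs matched earlier; it is my own toy example, not part of the original program:

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

func main() {
	// One of the image URLs matched earlier.
	url := "https://ss1.bdstatic.com/70cFuXSh_Q1YnxGkpoWK1HF6hhy/it/u=4246005838,1103140037&fm=26&gp=0.jpg"
	// Everything after the last '=' is "0.jpg" ...
	base := url[strings.LastIndex(url, "=")+1:]
	// ... and the Unix-timestamp prefix yields a name such as "1624377003_0.jpg".
	fmt.Printf("%d_%s\n", time.Now().Unix(), base)
}
```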
Run the above code and you can see the following output:

```
2021/06/xx xx:50:04 main.go:91: ioutil.WriteFile successfully , filename = ./1624377003_0.jpg
```

And in the current directory a picture has been downloaded successfully, named 1624377003_0.jpg.
Some big brothers will say: downloading pictures with a single goroutine is too slow for me; can we download faster, with multiple goroutines downloading together?
## Crawl our little Shiba Inu concurrently
Do you still remember GO channels and the sync package? This little feature is a good chance to practice them. It is relatively simple, so let me just describe the general idea; if you are interested, go ahead and implement it yourself.
- Read the data from the URL above and convert it to a string
- Use a regular expression to match the whole series of image links
- Put each image link into a buffered channel; for now let's set the buffer to 100
- Start another 3 goroutines to read from the channel concurrently and download the images locally; for renaming the files, refer to the code above (one possible sketch follows this list)
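Here is one possible sketch of that idea, in case it helps you get started. It is only my own outline under the assumptions above (a buffer of 100, 3 worker goroutines), reusing the same picture regular expression and the same naming trick as the demo, not the article's final implementation:

```go
package main

import (
	"fmt"
	"io/ioutil"
	"log"
	"net/http"
	"regexp"
	"strings"
	"sync"
	"time"
)

// Same picture regular expression as in the demo above.
const rePic = `https?://[^"]+?(\.((jpg)|(png)|(jpeg)|(gif)|(bmp)))`

// downloadPic fetches one picture and writes it into the current directory,
// using the same "timestamp + substring after the last '='" naming trick.
func downloadPic(url string) {
	resp, err := http.Get(url)
	if err != nil {
		log.Println("http.Get error : ", err)
		return
	}
	defer resp.Body.Close()

	data, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		log.Println("ioutil.ReadAll error : ", err)
		return
	}

	base := url
	if i := strings.LastIndex(url, "="); i >= 0 {
		base = url[i+1:]
	}
	// Use nanoseconds so names do not collide when several goroutines download in the same second.
	name := fmt.Sprintf("./%d_%s", time.Now().UnixNano(), base)
	if err := ioutil.WriteFile(name, data, 0666); err != nil {
		log.Println("write failed !! ", err)
		return
	}
	log.Println("downloaded , filename = ", name)
}

func main() {
	log.SetFlags(log.Lshortfile | log.LstdFlags)
	pageURL := "https://image.baidu.com/search/index?tn=baiduimage&ps=1&ct=201326592&lm=-1&cl=2&nc=1&ie=utf-8&word=%E6%9F%B4%E7%8A%AC"

	// Step 1 and 2: read the page and match the picture links.
	resp, err := http.Get(pageURL)
	if err != nil {
		log.Fatal("http.Get error : ", err)
	}
	body, err := ioutil.ReadAll(resp.Body)
	resp.Body.Close()
	if err != nil {
		log.Fatal("ioutil.ReadAll error : ", err)
	}
	links := regexp.MustCompile(rePic).FindAllString(string(body), -1)

	// Step 3: put every link into a buffered channel (buffer of 100).
	picChan := make(chan string, 100)

	// Step 4: start 3 goroutines that read links from the channel and download them.
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for link := range picChan {
				downloadPic(link)
			}
		}()
	}

	// Feed the matched links into the channel, then close it so the workers can exit.
	for _, link := range links {
		picChan <- link
	}
	close(picChan)
	wg.Wait()
}
```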
How about it, friends? If you are interested, give it a try. And if you have ideas about crawling dynamic data, feel free to discuss them with me so we can make progress together.
## Summary
- A brief introduction to static and dynamic web pages
- Crawling simple data from static web pages with GO
- Crawling pictures on a web page with GO
- Crawling resources on a web page concurrently
Welcome to like, follow, and favorite. Friends, your support and encouragement are my motivation to keep sharing and keep improving the quality.

Well, that's all for this time. Next we will share the application of gjson in GO.

Technology is open, and our mindset should be even more open. Embrace change, face the sun, and strive to move forward.

I am Nezha; welcome to like, follow, and favorite. See you next time~