1. Approach: I wrote a simple crawler for Amazon US. It scrapes the product links on a search results page and uses a product's ASIN to work out which position the product occupies, since the product links on an Amazon results page encode the item's position and each page holds 48 products. I first loop over the product list on the first page; if the ASIN isn't found, I recursively move on to the second page and keep checking. It works, but it runs rather slowly, and after I changed some code I could no longer fetch the page at all and got the robot-check page instead. I'm new to Python; how can I speed up matching the ASIN against the links on the Amazon pages?
2. Code (environment: Python 3.6.1):
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
This is the Amazon web crawler script
Crawling the Amazon page, according to the commodity ASIN code query the goods
in the first few pages.
Through the 48 links in the page to determine whether the ASIN code in the 48 links
"""
import requests
from bs4 import BeautifulSoup
# Search keyword and ASIN
__keyword__ = 'Earrings'
__code__ = 'B06Y5HW7T3'
__totalPage__ = 5
print('\n')
print('-------------------------- Search started --------------------------')
def page_request(_url):
"""
这是页面请求函数封装,抓取页面链接的内容
* param 请求网站的链接
* returns 返回抓取到网页结构
"""
headers = {'Host': 'www.amazon.com',
'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:52.0) Gecko/20100101 Firefox/52.0',
'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language':'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
'Accept-Encoding':'gzip, deflate, br',
'Connection':'keep-alive'
}
session = requests.session()
get_html = session.get(_url, headers=headers)
analysis_html = BeautifulSoup(get_html.text, 'lxml')
return analysis_html
def filter_content(_con, _page):
"""
先对HTML内容筛选出48个链接和下一页的链接,先循环遍历49个链接,是否有相符合ASIN码,
若无则递归翻页查找,有则跳出循环输出第几个
* @param 请求网站链接
* @param 页数
"""
_current = _con.find_all('a', class_='a-link-normal s-access-detail-page s-overflow-ellipsis s-color-twister-title-link a-text-normal')
_next = 'https://www.amazon.com' + _con.find('div', id='pagn').find('a', id='pagnNextLink').get('href')
for i in range(0, 48):
start = _current[i].get('href').find('_1_')
end = _current[i].get('href').find('?s')
if __code__ in _current[i].get('href'):
_rank = _current[i].get('href')[start:end]
print('排名在: ' + _rank[_rank.find('1_') + 2:_rank.find('/')])
return
if _page == 0:
print('前5页找不到该商品~')
return
else:
filter_content(page_request(_next), _page-1)
if __name__ == '__main__':
    URL = 'https://www.amazon.com/s/ref=nb_sb_noss?url=node%3D7454880011&field-keywords=' + __keyword__
    filter_content(page_request(URL), int(__totalPage__))
I see you are already using lxml.
For your selectors you should use CSS selectors; CSS selectors are more efficient than BeautifulSoup's built-in matching or regular expressions.
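For example, here is a minimal sketch of that suggestion, reusing the class names and next-page id from the code above (treat the exact selectors as assumptions, since Amazon's markup changes often):

def find_rank(soup, asin):
    # One CSS query for the product links instead of find_all() with a long class string
    links = soup.select('a.a-link-normal.s-access-detail-page.s-overflow-ellipsis.s-color-twister-title-link.a-text-normal')
    for position, link in enumerate(links, start=1):
        if asin in link.get('href', ''):
            return position  # position on this page, counted directly instead of parsed out of the href
    return None  # not on this page

def next_page_url(soup):
    # CSS id selectors replace the nested find(id=...) calls
    nxt = soup.select_one('div#pagn a#pagnNextLink')
    if nxt is None:
        return None
    return 'https://www.amazon.com' + nxt.get('href')

# Usage with the functions from the question, e.g.:
# rank = find_rank(page_request(URL), __code__)

filter_content could then call find_rank() on each parsed page and follow next_page_url() until it returns None or the page budget runs out.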