I wrote a simple Amazon crawler; how can I fix slow string matching?

Question

1981816

Posted on
2017-04-24

Updated on
2017-04-25

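1. Approach: I wrote a simple crawler for Amazon US that collects the product links on a search results page and uses a product's ASIN to work out its position on the page. Each product link on an Amazon results page encodes the product's position, and one page holds 48 products. I first loop over the product list on the first page; if the ASIN is not found, I recurse to the second page and keep checking. The script runs rather slowly. After I changed some of the code, it stopped getting the results page and got the robot check page instead. I have only just started learning Python. How can I speed up matching the ASIN against the Amazon links?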

2. Code (environment: Python 3.6.1):

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    """
    
    Amazon search-rank crawler.
    Fetches Amazon search result pages for a keyword and reports the position
    of the product with the given ASIN among the product links on each page
    (up to 48 links per page).
    
    """
    import requests
    from bs4 import BeautifulSoup
    
    # Search keyword, target ASIN, and the number of pages to search
    __keyword__ = 'Earrings'
    __code__ = 'B06Y5HW7T3'
    __totalPage__ = 5
    
    print('\n')
    print('--------------------------Search started--------------------------')
    
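    # Reuse one session so repeated requests can share a keep-alive connection
    # instead of opening a new one for every page.
    _SESSION = requests.Session()
    
    def page_request(_url):
        """
            Fetch a page and parse its HTML.
            * param _url: URL of the page to request
            * returns: the parsed page as a BeautifulSoup object
        """
        headers = {'Host': 'www.amazon.com',
                   'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:52.0) Gecko/20100101 Firefox/52.0',
                   'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
                   'Accept-Language':'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
                   # 'br' is left out: requests only decodes Brotli responses
                   # when a Brotli package is installed.
                   'Accept-Encoding':'gzip, deflate',
                   'Connection':'keep-alive'
                  }
        # requests has no default timeout, so set one to avoid hanging forever.
        get_html = _SESSION.get(_url, headers=headers, timeout=10)
        analysis_html = BeautifulSoup(get_html.text, 'lxml')
        return analysis_html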
    
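    def filter_content(_con, _page):
        """
            Look for the ASIN among the product links of one result page.
            Print the position encoded in the matching link; if no link matches,
            follow the "next page" link until the page budget is used up.
            * @param _con: parsed result page (BeautifulSoup object)
            * @param _page: number of pages left to check, including this one
        """
        _current = _con.find_all('a', class_='a-link-normal s-access-detail-page s-overflow-ellipsis s-color-twister-title-link a-text-normal')
        # Iterate over the links actually found: a page may hold fewer than 48
        # (the robot check page holds none), which made range(0, 48) fail.
        for link in _current:
            href = link.get('href') or ''
            if __code__ in href:
                # The position follows '_1_' in the link and ends at the next '/'.
                start = href.find('_1_') + 3
                print('Rank: ' + href[start:href.find('/', start)])
                return
        # Look up the next-page link only when it is needed; it can be missing.
        _next = _con.find('a', id='pagnNextLink')
        if _page <= 1 or _next is None:
            print('Product not found in the pages searched (limit: %d)' % __totalPage__)
            return
        filter_content(page_request('https://www.amazon.com' + _next.get('href')), _page - 1)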
    
    if __name__ == '__main__':
        URL = 'https://www.amazon.com/s/ref=nb_sb_noss?url=node%3D7454880011&field-keywords=' + __keyword__
        filter_content(page_request(URL), int(__totalPage__))
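The matching step itself can be isolated into a small function that works on a link string alone, which makes it easy to check without requesting anything from Amazon. This is only a minimal sketch: it assumes the link shape described above (the position follows `_1_` in the link), and the function name `rank_in_link`, the pattern, and the sample URL are made up for illustration.

```python
import re

# Assumed link shape: ".../dp/<ASIN>/ref=sr_1_<position>/...?s=..."
RANK_PATTERN = re.compile(r'_1_(\d+)')

def rank_in_link(href, asin):
    """Return the position encoded in href if it contains asin, else None."""
    if asin not in href:
        return None
    match = RANK_PATTERN.search(href)
    return int(match.group(1)) if match else None

# Made-up sample link, not taken from a real Amazon page.
sample = 'https://www.amazon.com/Example-Studs/dp/B06Y5HW7T3/ref=sr_1_17/000-0000000?s=apparel'
print(rank_in_link(sample, 'B06Y5HW7T3'))
print(rank_in_link(sample, 'B000000000'))
```

With the sample link, the first call yields 17 and the second yields None because the ASIN does not appear in the link.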

python-crawler python3.x

1 answer

xiong1000

1095711

Posted on
2017-04-26

✓ Accepted

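I see you are already using lxml.
For your selectors, you should use CSS selectors; a CSS selector is more efficient than BeautifulSoup's built-in lookups or regular expressions.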
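
As a rough illustration of that suggestion, here is a minimal sketch using BeautifulSoup's `select` and `select_one`, which accept CSS selectors. The HTML snippet, the ASIN `B000000001`, and the helper name `find_links` are made up for the example. It uses `html.parser` so it runs without extra packages; `'lxml'` can be passed instead. Whether this is measurably faster than `find_all` on real pages is worth timing, since the page requests may well take most of the run time.

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics the result links described in the question.
SAMPLE_HTML = """
<div id="results">
  <a class="a-link-normal s-access-detail-page a-text-normal"
     href="/Example-Hoops/dp/B000000001/ref=sr_1_1/000-0000000?s=apparel">Hoops</a>
  <a class="a-link-normal s-access-detail-page a-text-normal"
     href="/Example-Studs/dp/B06Y5HW7T3/ref=sr_1_2/000-0000000?s=apparel">Studs</a>
  <a class="a-link-normal" href="/gp/help">Help</a>
</div>
<div id="pagn"><a id="pagnNextLink" href="/s?page=2">Next</a></div>
"""

def find_links(html, asin):
    """Return (hrefs containing asin, next-page href or None) via CSS selectors."""
    soup = BeautifulSoup(html, 'html.parser')
    # One class is enough to pick out result links; the [href*="..."] filter
    # keeps only the links whose href contains the ASIN.
    selector = 'a.s-access-detail-page[href*="{}"]'.format(asin)
    matches = [a['href'] for a in soup.select(selector)]
    next_link = soup.select_one('#pagn a#pagnNextLink')
    next_href = next_link['href'] if next_link is not None else None
    return matches, next_href

matches, next_href = find_links(SAMPLE_HTML, 'B06Y5HW7T3')
print(matches)
print(next_href)
```

Letting the selector filter on the ASIN means the Python loop only sees links that already match, and the next-page link is looked up with a single `select_one` call that returns None when the link is absent.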
