Why a Scrapy crawl fails, and how to tell whether the spider has been banned

I'm new to Scrapy and wrote a spider to crawl the content of http://codeforces.com/problemset. The code is below:

from scrapy.spiders import CrawlSpider, Rule  # scrapy.spider is the old, deprecated module path
from scrapy.loader.processors import MapCompose, Join
from scrapy.loader import ItemLoader
from scrapy.linkextractors import LinkExtractor

class Codeforce_Problem_Spider(CrawlSpider):
    name = "Codeforce_Problem_Spider"
    allowed_domains = ["web"]

    start_urls = ("http://codeforces.com/problemset",)

    rules = (
        Rule(
            LinkExtractor(allow=r'/problemset/page/[2-9]', restrict_xpaths='//*[@id="pageContent"]//a[@class="arrow"]'),
        ),
        Rule(
            LinkExtractor(allow=r'/problemset/problem/\d{3}/[A-H]', restrict_xpaths='//*[@id="pageContent"]'),
            callback='parse_item'
        )
    )


    def parse_item(self, response):

        self.logger.info('hi %s' % response.url)
        l = ItemLoader(item=ProblemCodeforceItem(), response=response)

        l.add_xpath('title', '//*[@id="pageContent"]//*[@class="title"][1]/text()')
        l.add_xpath('contestId','//*[@id="sidebar"]/div[1]/table/tbody/tr[1]/th/a/text()')

        return l.load_item()

Result:
[image]

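I tested the regular expressions in the shell and they match without problems. Removing the XPath argument made no difference either, so I can't tell what is going wrong.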
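One thing that may be worth checking is `allowed_domains`: the spider sets it to `["web"]`, while the start URL is on `codeforces.com`. Scrapy's offsite filtering drops requests extracted from pages when their host is not covered by `allowed_domains`, which could make the rules look as if they match nothing. The sketch below is a simplified, standard-library illustration of that kind of host check, not Scrapy's actual implementation; `is_onsite` is a hypothetical helper name.

```python
from urllib.parse import urlparse


def is_onsite(url, allowed_domains):
    """Return True if the URL's host equals, or is a subdomain of, an allowed domain.

    Hypothetical helper: a simplified stand-in for an offsite check.
    """
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain.lower() or host.endswith("." + domain.lower())
        for domain in allowed_domains
    )


url = "http://codeforces.com/problemset/problem/750/A"
print(is_onsite(url, ["web"]))             # False
print(is_onsite(url, ["codeforces.com"]))  # True
```

If this is the cause, the crawl log would typically show the start page being fetched and the followed links being filtered as offsite, and setting `allowed_domains = ["codeforces.com"]` would be the thing to try.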
I have also seen mentions of crawlers getting banned. How can I tell whether my spider has been banned?
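There is no single definitive signal for a ban, but the HTTP status code and the response body are the usual places to look. A rough sketch, assuming a ban shows up as a 403, 429, or 503 status or as a page that mentions a CAPTCHA or blocked access; the status set, the marker strings, and the helper name `looks_banned` are all assumptions and would need tuning for a given site:

```python
# Assumed signals only; real sites differ.
BAN_STATUSES = {403, 429, 503}
BAN_MARKERS = ("captcha", "access denied", "too many requests")


def looks_banned(status, body):
    """Heuristic: True if the status code or page text suggests blocking."""
    if status in BAN_STATUSES:
        return True
    text = body.lower()
    return any(marker in text for marker in BAN_MARKERS)


print(looks_banned(403, ""))                           # True
print(looks_banned(200, "<html>Problemset</html>"))    # False
print(looks_banned(200, "Please solve the CAPTCHA"))   # True
```

In Scrapy, non-2xx responses are by default not passed to spider callbacks, so the status codes are easiest to read from the crawl log (lines such as `Crawled (403)`) and from the end-of-run stats (keys like `downloader/response_status_count/403`). A run that only shows 200 responses with normal page content is probably not being banned.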
