Here is the output when I run it. It looks like a 404, and the response is ignored:
2017-10-20 12:47:25 [scrapy.core.engine] INFO: Spider opened
2017-10-20 12:47:25 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-10-20 12:47:25 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-10-20 12:47:25 [scrapy.core.engine] DEBUG: Crawled (404) <GET 中国科技论文在线> (referer: Sina Visitor System)
2017-10-20 12:47:25 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 中国科技论文在线>: HTTP status code is not handled or not allowed
2017-10-20 12:47:25 [scrapy.core.engine] INFO: Closing spider (finished)
The code is as follows:
# -*- coding: utf-8 -*-
from scrapy.spiders import Spider


class wenxianspider(Spider):
    name = 'wenxian1'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36',
    }
    start_urls = ['http://www.paper.edu.cn/journal/MUjGEF3QMRAVeQIeQeQ.shtml']

    def parse(self, response):
        books = response.xpath('//p[@class="journalBox-content"]/text()').extract()
        # booksgbk = books.encode("utf-8")
        for book in books:
            # When printing, 'ignore' skips any characters that cannot be encoded
            print book.strip().encode("utf-8", 'ignore')