Running Scrapy from cmd to scrape a website's text content returns nothing. Could someone help me figure out why?

I'm new to this, so please bear with me.

The run output is below. It looks like the request gets a 404 and the response is then ignored ("Ignoring response"):

2017-10-20 12:47:25 [scrapy.core.engine] INFO: Spider opened

2017-10-20 12:47:25 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

2017-10-20 12:47:25 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023

2017-10-20 12:47:25 [scrapy.core.engine] DEBUG: Crawled (404) <GET 中国科技论文在线> (referer: Sina Visitor System)

2017-10-20 12:47:25 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 中国科技论文在线>: HTTP status code is not handled or not allowed

2017-10-20 12:47:25 [scrapy.core.engine] INFO: Closing spider (finished)

The code is as follows:

# -*- coding: utf-8 -*-

from scrapy.spiders import Spider


class wenxianspider(Spider):
    name = 'wenxian1'
    # Note: Scrapy does not apply a class attribute named `headers` to
    # requests on its own, so this User-Agent is not actually sent.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36',
    }
    start_urls = ['http://www.paper.edu.cn/journal/MUjGEF3QMRAVeQIeQeQ.shtml']

    def parse(self, response):
        # Extract the text of every <p class="journalBox-content"> element.
        books = response.xpath('//p[@class="journalBox-content"]/text()').extract()
        # booksgbk = books.encode("utf-8")

        for book in books:
            # When printing, 'ignore' skips any characters that cannot be encoded.
            print(book.strip().encode("utf-8", 'ignore'))
1 answer
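The content you want is loaded asynchronously via AJAX, so it is not in the page your spider requests. You should fetch it by sending a POST request to http://www.paper.edu.cn/qk/academic/journal/MUjGEF3QMRAVeQIeQeQ instead.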