The code in the spider looks like this:
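from scrapy.selector import Selector

# Inside the spider's parse callback, where `response` is the downloaded page
sel = Selector(response)
sites = sel.xpath('//div[@id="frag_1"]//h1/text()').extract()
print(sites)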
The page being scraped is:
http://www.sciencedirect.com/science/article/pii/S0927775706008156
The log is:
In it, scraping the journal title "Colloids and Surfaces A: Physicochemical and Engineering Aspects" works fine.
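This happens because the h1 is loaded dynamically through an AJAX request, so the spider cannot scrape it directly from the article page. The request that fills it in is: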
http://www.sciencedirect.com/science/frag/S0927775706008156/9899dec61b0879aa5f954b8f9a594a1026dc4a44c39be2bfb11514cf3c674a7da011cbb661dcd93bb55a33b09281b4674824d1034e2212ddcc8cae201b02f70b59a18bc5a83ecc3c9566807dbeae7cdf8700bb8bcbad524a15358461a7fd35cb5fa09cbf177f5301f11f3df889c3d73963d149466f992436e04b500146f101a5cce87cafe0f82cc9b574e25dcb65131f83e429518a406c95/frag_1
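One possible way around this is to request the fragment URL itself and parse the HTML it returns. The sketch below is only an illustration under assumptions: it uses the Python 3 standard library instead of Scrapy so that it runs on its own, the names `H1TextParser`, `extract_h1_texts` and `fetch_fragment` are hypothetical, and it assumes the fragment response is plain HTML containing the h1. How the long token in the fragment URL is generated is not covered here; the URL is taken as given.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class H1TextParser(HTMLParser):
    """Collects the text content of every <h1> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._depth = 0      # greater than 0 while inside an <h1>
        self._buffer = []    # text pieces of the <h1> being read
        self.headings = []   # finished heading strings

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "h1" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.headings.append("".join(self._buffer).strip())
                self._buffer = []

    def handle_data(self, data):
        # Unlike the XPath //h1/text(), this also keeps text of nested tags.
        if self._depth > 0:
            self._buffer.append(data)


def extract_h1_texts(html):
    """Return the text of all <h1> elements found in an HTML string."""
    parser = H1TextParser()
    parser.feed(html)
    parser.close()
    return parser.headings


def fetch_fragment(url, timeout=30):
    """Download a fragment URL; assumes the body is UTF-8 encoded HTML."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

With the fragment URL above stored in a variable such as `frag_url`, `extract_h1_texts(fetch_fragment(frag_url))` would return the headings, provided the server answers a plain GET. Inside a Scrapy spider the equivalent step would be to yield a second `scrapy.Request` for the fragment URL and run the XPath in its callback.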