What this article covers
- Scraping the Douban Movie Top250 pages; the fields are: ranking, title, director, one-line description (sometimes empty), rating, number of ratings, release date, country, and genre
- Storing the scraped data
Scrapy introduction
Create the project
scrapy startproject dbmovie
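If the command succeeds, it generates a project skeleton roughly like the following (layout from a stock Scrapy install; minor differences between versions are possible):
dbmovie/
    scrapy.cfg
    dbmovie/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py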
Create the spider
cd dbmovie
scrapy genspider dbmovie_spider movie.douban.com/top250
Note that the spider name must not be the same as the project name.
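For orientation, genspider emits a skeleton along these lines (the exact boilerplate varies by Scrapy version, and since the argument above contains a path you may need to trim allowed_domains to the bare domain by hand); the parse method written later goes into this class:
import scrapy


class DbmovieSpiderSpider(scrapy.Spider):
    name = 'dbmovie_spider'
    allowed_domains = ['movie.douban.com']
    start_urls = ['https://movie.douban.com/top250/']

    def parse(self, response):
        pass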
Configuring against anti-scraping measures
- Open settings.py and set ROBOTSTXT_OBEY to False:
ROBOTSTXT_OBEY = False
- Set the default request headers, including a browser User-Agent:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'Accept-Encoding': 'gzip, deflate, br',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Host': 'movie.douban.com',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
}
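Beyond the headers, slowing the crawl down also reduces the chance of being blocked. This is a precaution rather than a documented Douban threshold, but Scrapy's built-in DOWNLOAD_DELAY setting is the usual knob:
# settings.py: seconds to wait between requests to the same site (the value is a guess)
DOWNLOAD_DELAY = 2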
Run the spider
scrapy crawl dbmovie_spider
Define the item
Based on the earlier analysis, we need to capture nine fields in total. Define the item in items.py:
import scrapy


class DoubanItem(scrapy.Item):
    # Ranking
    ranking = scrapy.Field()
    # Title
    title = scrapy.Field()
    # Director
    director = scrapy.Field()
    # One-line description (sometimes empty)
    movie_desc = scrapy.Field()
    # Rating
    rating_num = scrapy.Field()
    # Number of ratings
    people_count = scrapy.Field()
    # Release date
    online_date = scrapy.Field()
    # Country of production
    country = scrapy.Field()
    # Genre
    category = scrapy.Field()
Extract the fields
This step requires some XPath knowledge; to save effort, I grabbed the expressions directly in Chrome.
Reference: how to get an XPath in Chrome using the developer tools (right-click an element and choose Copy → Copy XPath).
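If you prefer to verify the expressions by hand, Scrapy's interactive shell works well; run it from inside the project directory so the headers configured above are applied. The output shown is what I would expect for the first entry, assuming the request is not blocked:
scrapy shell "https://movie.douban.com/top250"
>>> movie = response.xpath('//div[@class="item"]')[0]
>>> movie.xpath('div[@class="pic"]/em/text()').extract()[0]
'1'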
# In dbmovie_spider.py; the file also needs `from dbmovie.items import DoubanItem` at the top.
def parse(self, response):
    movies = response.xpath('//div[@class="item"]')
    for movie in movies:
        # Create a fresh item per movie; yielding one shared, mutated instance
        # can corrupt data once items sit in the pipeline
        item = DoubanItem()
        # Ranking
        item['ranking'] = movie.xpath('div[@class="pic"]/em/text()').extract()[0]
        # Title: an entry carries several title spans; take the first (the Chinese title)
        item['title'] = movie.xpath('div[@class="info"]/div[1]/a/span/text()').extract()[0]
        # Director
        info_director = movie.xpath('div[2]/div[2]/p[1]/text()[1]').extract()[0].replace("\n", "").replace(" ", "").split('\xa0')[0]
        item['director'] = info_director
        # Release date
        online_date = movie.xpath('div[2]/div[2]/p[1]/text()[2]').extract()[0].replace("\n", "").replace('\xa0', '').split("/")[0].replace(" ", "")
        # Country of production
        country = movie.xpath('div[2]/div[2]/p[1]/text()[2]').extract()[0].replace("\n", "").split("/")[1].replace('\xa0', '')
        # Genre
        category = movie.xpath('div[2]/div[2]/p[1]/text()[2]').extract()[0].replace("\n", "").split("/")[2].replace('\xa0', '').replace(" ", "")
        item['online_date'] = online_date
        item['country'] = country
        item['category'] = category
        # One-line description; some movies have none, and skipping this check
        # would raise an IndexError for them
        movie_desc = movie.xpath('div[@class="info"]/div[@class="bd"]/p[@class="quote"]/span/text()').extract()
        item['movie_desc'] = movie_desc[0] if movie_desc else ' '
        item['rating_num'] = movie.xpath('div[@class="info"]/div[@class="bd"]/div[@class="star"]/span[@class="rating_num"]/text()').extract()[0]
        item['people_count'] = movie.xpath('div[@class="info"]/div[@class="bd"]/div[@class="star"]/span[4]/text()').extract()[0]
        yield item
    # Follow the link to the next page
    next_url = response.xpath('//span[@class="next"]/a/@href').extract()
    if next_url:
        next_url = 'https://movie.douban.com/top250' + next_url[0]
        yield scrapy.Request(next_url, callback=self.parse, dont_filter=True)
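Before wiring up MySQL, you can sanity-check the spider with Scrapy's built-in feed exports, which pick the output format from the file extension:
scrapy crawl dbmovie_spider -o top250.csv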
Store the data in MySQL
Watch out for MySQL error 1064: it is triggered when a table column name is a MySQL reserved word.
Reference: a beginner's Scrapy tutorial on writing to a database.
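The pipeline below inserts into a db_info table that has to exist beforehand. Here is a minimal one-off sketch for creating it; the column names mirror the INSERT statement, but the types and lengths are my assumptions:
import pymysql

# One-off script: create the dbmovie database and the db_info table
conn = pymysql.connect(host='localhost', user='root', passwd='pwd', charset='utf8')
with conn.cursor() as cursor:
    cursor.execute("CREATE DATABASE IF NOT EXISTS dbmovie DEFAULT CHARACTER SET utf8")
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS dbmovie.db_info (
            id INT AUTO_INCREMENT PRIMARY KEY,
            ranking VARCHAR(16),
            title VARCHAR(255),
            director VARCHAR(255),
            movie_desc VARCHAR(512),
            rating_num VARCHAR(16),
            people_count VARCHAR(64),
            online_date VARCHAR(64),
            country VARCHAR(128),
            category VARCHAR(128)
        )
    """)
conn.commit()
conn.close()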
import pymysql


def dbHandle():
    conn = pymysql.connect(
        host='localhost',
        user='root',
        passwd='pwd',
        db='dbmovie',
        charset='utf8',
        use_unicode=True  # decode results to str; False here would hand back bytes
    )
    return conn


class DoubanPipeline(object):
    def process_item(self, item, spider):
        # Opening a connection per item is simple but slow; acceptable for 250 rows
        dbObject = dbHandle()
        cursor = dbObject.cursor()
        sql = ("insert into db_info(ranking,title,director,movie_desc,rating_num,"
               "people_count,online_date,country,category) "
               "values(%s,%s,%s,%s,%s,%s,%s,%s,%s)")
        try:
            cursor.execute(sql, (item['ranking'], item['title'], item['director'],
                                 item['movie_desc'], item['rating_num'],
                                 item['people_count'], item['online_date'],
                                 item['country'], item['category']))
            dbObject.commit()
        except Exception as e:
            print(e)
            dbObject.rollback()
        finally:
            cursor.close()
            dbObject.close()
        return item
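Finally, the pipeline only runs if it is enabled in settings.py; the dotted path below assumes the default module layout of this project:
ITEM_PIPELINES = {
    'dbmovie.pipelines.DoubanPipeline': 300,
}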