How do I stop the pipeline chain in Scrapy when an item is a duplicate?

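I have two pipelines enabled: one inserts data into MongoDB and checks for duplicates, the other downloads files.
If an item is a duplicate, the file should not be downloaded. If it is not a duplicate, it should be inserted into the database and the file should be downloaded.
So there is one download pipeline and one database-insert pipeline.
The database pipeline first checks whether the record already exists. If it does, the later pipelines should not run for that item. If it does not, the record is inserted and the download pipeline runs.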

from scrapy.pipelines.files import FilesPipeline
from scrapy import Request
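import pymongo


class XiaoMiQuanPipeLines(object):
    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection settings from the crawler; the scrapy.conf
        # module is not available in current Scrapy releases.
        settings = crawler.settings
        return cls(
            host=settings["MONGODB_HOST"],
            port=settings["MONGODB_PORT"],
            dbname=settings["MONGODB_DBNAME"],
            sheetname=settings["MONGODB_SHEETNAME"],
        )

    def __init__(self, host, port, dbname, sheetname):
        client = pymongo.MongoClient(host=host, port=port)

        mydb = client[dbname]

        # Collection holding the url/name pairs that were already stored.
        self.post = mydb[sheetname]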

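    def process_item(self, item, spider):
        url = item['file_url']
        name = item['name']

        # find_one returns the stored document, or None if this url/name
        # pair has not been seen yet.
        result = self.post.find_one({"url": url, "name": name})
        if result:
            # Duplicate: the later pipelines should not run for this item.
            pass
        else:
            self.post.insert_one({"url": url, "name": name})
            return item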


class DownLoadPipelines(FilesPipeline):

    def file_path(self, request, response=None, info=None):
        return request.meta.get('filename', '')

    def get_media_requests(self, item, info):
        file_url = item['file_url']
        meta = {'filename': item['name']}
        yield Request(url=file_url, meta=meta)

1 answer

Raise DropItem. The example from the official documentation:

from scrapy.exceptions import DropItem

class PricePipeline(object):

    vat_factor = 1.15

    def process_item(self, item, spider):
        if item['price']:
            if item['price_excludes_vat']:
                item['price'] = item['price'] * self.vat_factor
            return item
        else:
            raise DropItem("Missing price in %s" % item)
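
Applied to the duplicate check in the question, the same idea might look roughly like the sketch below. It is only an illustration under a few assumptions: `InMemoryCollection` is a hypothetical stand-in for the MongoDB collection so the example runs without a database, `DedupPipeline` and `run_demo` are made-up names, and `myproject.pipelines` is a placeholder module path. The point it shows is that raising DropItem in the first pipeline means later pipelines never receive the item, while returning the item hands it on to the next one.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Fallback so the sketch also runs where Scrapy is not installed.
    class DropItem(Exception):
        pass


class InMemoryCollection:
    """Hypothetical stand-in for a pymongo collection (find_one / insert_one only)."""

    def __init__(self):
        self.docs = []

    def find_one(self, query):
        # Return the first stored document matching every key in the query.
        for doc in self.docs:
            if all(doc.get(k) == v for k, v in query.items()):
                return doc
        return None

    def insert_one(self, doc):
        self.docs.append(dict(doc))


class DedupPipeline:
    def __init__(self, collection):
        self.collection = collection

    def process_item(self, item, spider):
        key = {"url": item["file_url"], "name": item["name"]}
        if self.collection.find_one(key):
            # Duplicate: dropping the item stops all later pipelines for it.
            raise DropItem("Duplicate item: %s" % item["name"])
        self.collection.insert_one(key)
        # New item: returning it passes it to the next pipeline (the download one).
        return item


# settings.py fragment (assumed module path): the lower number runs first,
# so the duplicate check must come before the download pipeline.
ITEM_PIPELINES = {
    "myproject.pipelines.DedupPipeline": 100,
    "myproject.pipelines.DownLoadPipelines": 200,
}


def run_demo():
    pipeline = DedupPipeline(InMemoryCollection())
    item = {"file_url": "http://example.com/a.pdf", "name": "a.pdf"}
    first = pipeline.process_item(item, spider=None)
    try:
        pipeline.process_item(item, spider=None)
        dropped = False
    except DropItem:
        dropped = True
    return first, dropped
```

Calling `run_demo()` processes the same item twice: the first call returns the item, the second raises DropItem. In a real project the collection would be the pymongo collection from the question, and the order values in ITEM_PIPELINES are what make the duplicate check run before the download.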