究竟怎么给phantomjs设置代理?

今天在抓一个站点的时候用到phantomjs组件,抓取都很成功,但是有如下提示:

[W 170405 02:04:09 base_handler:334] phantomjs does not support specify proxy from script, use phantomjs args instead

我尝试用配置文件设置了全局代理,提示是没有了,可是测试的结果返回的都是我本地的IP而不是代理的IP。

查文档看到不是很理解,文档里只有Addition args pass to phantomjs command line.这么一句,可是究竟应该怎么用?我是用all启动的,如果用pyspider phantomjs启动,应该如何传入配置?

阅读 4.1k
评论 更新于 2017-04-06
    2 个回答

    纠结一晚上。。。。在 https://github.com/binux/pysp... 找到答案。

    pyspider phantomjs -- --proxy=ip:port
    
    pyspider --phantomjs-proxy 127.0.0.25555 all
    评论 赞赏 2017-04-06
      mimvp
      • 187
      #!/usr/bin/env python
      # -*- encoding: utf-8 -*-
      # Project: mimvp_proxy_pyspider
      #
      # Python2 支持 http、https
      #
      # 米扑代理示例:
      # http://proxy.mimvp.com/demo2.php
      # 
      # 米扑代理购买:
      # http://proxy.mimvp.com
      # 
      # mimvp.com
      # 2017-07-22
      
      
      ############  方式1:pyspider crawl_config  ############
      
      from pyspider.libs.base_handler import *
      
      class Handler(BaseHandler):
          crawl_config = {
              'proxy' : 'http://188.226.141.217:8080',     # http
              'proxy' : 'https://182.253.32.65:3128'      # https
          }
      
          @every(minutes=24 * 60)
          def on_start(self):
              self.crawl('http://proxy.mimvp.com/exist.php', callback=self.index_page)
      
          @config(age=10 * 24 * 60 * 60)
          def index_page(self, response):
              for each in response.doc('a[href^="http"]').items():
                  self.crawl(each.attr.href, callback=self.detail_page)
      
          @config(priority=2)
          def detail_page(self, response):
              return {
                  "url": response.url,
                  "title": response.doc('title').text(),
              }
              
              
              
      ############  方式2:pyspider --phantomjs-proxy 启动  ############
      
      # $ pyspider --help
      # Usage: pyspider [OPTIONS] COMMAND [ARGS]...
      # 
      #   A powerful spider system in python.
      # 
      # Options:
      #   -c, --config FILENAME           a json file with default values for
      #                                   subcommands. {"webui": {"port":5001}}
      #   --logging-config TEXT           logging config file for built-in python
      #                                   logging module  [default: /Library/Framework
      #                                   s/Python.framework/Versions/2.7/lib/python2.
      #                                   7/site-packages/pyspider/logging.conf]
      #   --debug                         debug mode
      #   --queue-maxsize INTEGER         maxsize of queue
      #   --taskdb TEXT                   database url for taskdb, default: sqlite
      #   --projectdb TEXT                database url for projectdb, default: sqlite
      #   --resultdb TEXT                 database url for resultdb, default: sqlite
      #   --message-queue TEXT            connection url to message queue, default:
      #                                   builtin multiprocessing.Queue
      #   --amqp-url TEXT                 [deprecated] amqp url for rabbitmq. please
      #                                   use --message-queue instead.
      #   --beanstalk TEXT                [deprecated] beanstalk config for beanstalk
      #                                   queue. please use --message-queue instead.
      #   --phantomjs-proxy TEXT          phantomjs proxy ip:port
      #   --data-path TEXT                data dir path
      #   --add-sys-path / --not-add-sys-path
      #                                   add current working directory to python lib
      #                                   search path
      #   --version                       Show the version and exit.
      #   --help                          Show this message and exit.
         
      pyspider --phantomjs-proxy "188.226.141.217:8080" all

      该答案已被忽略,原因:垃圾推广信息

      评论 赞赏 2017-07-28
        撰写回答

        登录后参与交流、获取后续更新提醒