
Today, while crawling a site, I used the phantomjs component. The crawls themselves all succeed, but I get the following warning:

[W 170405 02:04:09 base_handler:334] phantomjs does not support specify proxy from script, use phantomjs args instead

I tried setting a global proxy in the config file. The warning went away, but the test results all come back with my local IP rather than the proxy's IP.

I checked the documentation but didn't really understand it; all it says is "Addition args pass to phantomjs command line." How exactly is this supposed to be used? I start pyspider with all. If I start it with pyspider phantomjs instead, how do I pass the configuration in?

Asked 2017-04-05
2 answers

Accepted

Struggled with this all night.... Found the answer at https://github.com/binux/pysp...

pyspider phantomjs -- --proxy=ip:port

pyspider --phantomjs-proxy 127.0.0.1:25555 all
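
As I understand it, everything after the standalone -- in the first command is forwarded to the phantomjs binary itself, so --proxy=ip:port there is the actual outbound proxy. The second command tells the fetcher where that already-running phantomjs service listens (127.0.0.1:25555 is pyspider's default phantomjs port), which also stops all from launching its own phantomjs. A rough sketch of running the two together, with ip:port standing in for your proxy server:

# terminal 1: start the phantomjs component, passing the outbound proxy on its command line
pyspider phantomjs -- --proxy=ip:port

# terminal 2: start the remaining components, pointing the fetcher at that phantomjs instance
pyspider --phantomjs-proxy 127.0.0.1:25555 all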
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Project: mimvp_proxy_pyspider
#
# Python2, supports http and https
#
# mimvp proxy demo:
# http://proxy.mimvp.com/demo2.php
# 
# mimvp proxy purchase:
# http://proxy.mimvp.com
# 
# mimvp.com
# 2017-07-22


############  Method 1: pyspider crawl_config  ############

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    crawl_config = {
        'proxy': 'http://188.226.141.217:8080',      # http proxy
        # 'proxy': 'https://182.253.32.65:3128',     # https proxy (use one or the other; a dict silently keeps only the last duplicate 'proxy' key)
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://proxy.mimvp.com/exist.php', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    @config(priority=2)
    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }
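
    # Note (not part of the original answer): per the warning quoted in the
    # question, phantomjs fetches ignore the 'proxy' option, so the proxy in
    # crawl_config above only takes effect for plain (non-js) requests.
    # A minimal sketch of overriding the proxy for a single request instead of
    # globally, assuming a plain fetch; the method name and address below are
    # placeholders:
    def on_start_with_per_request_proxy(self):
        self.crawl('http://proxy.mimvp.com/exist.php',
                   proxy='188.226.141.217:8080',    # overrides crawl_config for this request only
                   callback=self.index_page)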
        
        
        
############  Method 2: start pyspider with --phantomjs-proxy  ############

# $ pyspider --help
# Usage: pyspider [OPTIONS] COMMAND [ARGS]...
# 
#   A powerful spider system in python.
# 
# Options:
#   -c, --config FILENAME           a json file with default values for
#                                   subcommands. {"webui": {"port":5001}}
#   --logging-config TEXT           logging config file for built-in python
#                                   logging module  [default: /Library/Framework
#                                   s/Python.framework/Versions/2.7/lib/python2.
#                                   7/site-packages/pyspider/logging.conf]
#   --debug                         debug mode
#   --queue-maxsize INTEGER         maxsize of queue
#   --taskdb TEXT                   database url for taskdb, default: sqlite
#   --projectdb TEXT                database url for projectdb, default: sqlite
#   --resultdb TEXT                 database url for resultdb, default: sqlite
#   --message-queue TEXT            connection url to message queue, default:
#                                   builtin multiprocessing.Queue
#   --amqp-url TEXT                 [deprecated] amqp url for rabbitmq. please
#                                   use --message-queue instead.
#   --beanstalk TEXT                [deprecated] beanstalk config for beanstalk
#                                   queue. please use --message-queue instead.
#   --phantomjs-proxy TEXT          phantomjs proxy ip:port
#   --data-path TEXT                data dir path
#   --add-sys-path / --not-add-sys-path
#                                   add current working directory to python lib
#                                   search path
#   --version                       Show the version and exit.
#   --help                          Show this message and exit.
   
pyspider --phantomjs-proxy "188.226.141.217:8080" all
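
# Note (not part of the original answer): as I understand it, a request is only
# fetched through phantomjs when it is crawled with fetch_type='js'; plain
# requests keep using the 'proxy' option from Method 1. A minimal sketch inside
# a handler method, assuming phantomjs was started with a proxy as in the
# accepted answer above:
#
#     self.crawl('http://proxy.mimvp.com/exist.php',
#                fetch_type='js',        # route this request through phantomjs
#                callback=self.index_page)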

This answer has been hidden. Reason: promotional spam.
