
I'm using pyspider to download PDF files. The spider code is as follows:
@config(priority=2)
def detail_page(self, response):
    author = []
    for each in response.doc('h2 a').items():
        author.append(each.text())  # Python lists use append, not push
    
    file = self.down_file(response.doc('.download-links a[href^="http"]').attr.href)
    return {
        "author": ",".join(author),
        "title": file,
    }

def down_file(self, file_url):
    # requires: import urllib2 (Python 2)
    file_name = file_url.split('/')[-1]
    u = urllib2.urlopen(file_url)
    f = open(file_name, 'wb')
    meta = u.info()
    file_size = int(meta.getheaders("Content-Length")[0])

    file_size_dl = 0
    block_sz = 8192
    while True:
        buffer = u.read(block_sz)
        if not buffer:
            break

        file_size_dl += len(buffer)  # bytes downloaded so far (for progress)
        f.write(buffer)
    f.close()
    return file_name

The crawl times out with TimeoutError: process timeout. How should I handle this?

Asked on 2017-04-07
3 answers

Don't call urllib2.urlopen inside the script; it blocks the script's execution.
If the file is under 10 MB, just fetch it with self.crawl. If it's larger, export the link to a separate system and download it there.
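A sketch of the self.crawl approach, assuming pyspider's usual handler API (`self.crawl` with a callback, `response.content` holding the fetched bytes); `save_pdf` and the callback name below are illustrative helpers, not pyspider APIs:

```python
import os

def save_pdf(content, file_name, save_dir="."):
    # Write bytes that pyspider already fetched into save_dir and return the
    # path. The fetch itself is left to self.crawl, which does not block:
    #
    #   def detail_page(self, response):
    #       url = response.doc('.download-links a[href^="http"]').attr.href
    #       self.crawl(url, callback=self.save_pdf_page)
    #
    #   def save_pdf_page(self, response):
    #       save_pdf(response.content, response.url.split('/')[-1])
    path = os.path.join(save_dir, file_name)
    with open(path, "wb") as f:
        f.write(content)
    return path
```

This keeps the slow network transfer inside pyspider's fetcher instead of the processor, which is what the 30-second process limit applies to.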


For large files, could the download be done with multiple threads?
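One way to sketch that, outside pyspider, is HTTP Range requests plus a thread pool (Python 3 standard library; this assumes the server honors the Range header and that the total size is known, e.g. from Content-Length):

```python
import concurrent.futures
import urllib.request

def split_ranges(total_size, n_parts):
    """Split [0, total_size) into n_parts (start, end) byte ranges, end inclusive."""
    part = total_size // n_parts
    ranges = []
    for i in range(n_parts):
        start = i * part
        end = total_size - 1 if i == n_parts - 1 else start + part - 1
        ranges.append((start, end))
    return ranges

def fetch_range(url, start, end):
    # Request only the given byte range of the file.
    req = urllib.request.Request(url, headers={"Range": "bytes=%d-%d" % (start, end)})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

def download(url, total_size, n_parts=4):
    # Fetch the parts concurrently, then join them in order.
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_parts) as pool:
        parts = pool.map(lambda r: fetch_range(url, *r), split_ranges(total_size, n_parts))
    return b"".join(parts)
```

Servers that ignore Range return the whole file with status 200 instead of 206, so a robust version should check the status code before joining parts.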

    PROCESS_TIME_LIMIT = 30
    EXCEPTION_LIMIT = 3

    RESULT_LOGS_LIMIT = 1000
    RESULT_RESULT_LIMIT = 10

    def __init__(self, projectdb, inqueue, status_queue, newtask_queue, result_queue,
                 enable_stdout_capture=True,
                 enable_projects_import=True,
                 process_time_limit=PROCESS_TIME_LIMIT):
        self.inqueue = inqueue
        self.status_queue = status_queue
        self.newtask_queue = newtask_queue
        self.result_queue = result_queue
        self.projectdb = projectdb
        self.enable_stdout_capture = enable_stdout_capture

        self._quit = False
        self._exceptions = 10
        self.project_manager = ProjectManager(projectdb, dict(
            result_queue=self.result_queue,
            enable_stdout_capture=self.enable_stdout_capture,
            process_time_limit=process_time_limit,
        ))

        if enable_projects_import:
            self.enable_projects_import()

The source hard-codes a 30-second process time limit (PROCESS_TIME_LIMIT = 30) and exposes no entry point for changing it.

So either modify the source, or change your own code.
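For the "change your own code" route, note that the pasted __init__ already accepts process_time_limit as a keyword argument defaulting to the class constant, so a custom launcher that constructs the processor itself could pass a larger value. A minimal reproduction of that pattern (a dummy class mirroring the snippet above, not pyspider itself):

```python
class Processor(object):
    # Mirrors the pattern in the pasted source: a class constant used as a
    # constructor default, overridable per instance without editing the source.
    PROCESS_TIME_LIMIT = 30

    def __init__(self, process_time_limit=PROCESS_TIME_LIMIT):
        self.process_time_limit = process_time_limit

default_proc = Processor()                        # keeps the 30 s default
patched_proc = Processor(process_time_limit=300)  # a custom launcher could pass this
```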
