What is the general approach for making a crawler fetch all of a site's data?

673454146

Posted on 2017-02-03

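Say a site has "next page" links: how do I crawl every one of those pages? Should I use recursion, and won't the recursion depth be limited? I'm a beginner and would appreciate some pointers.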

python web-crawler node.js linux

Views: 5.2k

6 answers

eric

Posted on 2017-02-03

Recursion, a message queue, and a store for the pages you have already crawled (Redis or a database).
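A minimal sketch of how those three pieces can fit together, using only the Python standard library. The assumptions are mine: a `deque` stands in for the message queue, an SQLite table stands in for Redis or a database, and a loop takes the place of recursion so depth limits never come up. `SeenStore`, `crawl`, and the `get_links` callable are hypothetical names for illustration.

```python
import sqlite3
from collections import deque


class SeenStore:
    """Persistent record of crawled URLs (stand-in for Redis or a database)."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute("CREATE TABLE IF NOT EXISTS seen (url TEXT PRIMARY KEY)")

    def add(self, url):
        """Record url; return True if it was new, False if already stored."""
        cur = self.conn.execute(
            "INSERT OR IGNORE INTO seen (url) VALUES (?)", (url,)
        )
        self.conn.commit()
        return cur.rowcount == 1


def crawl(start_url, get_links, store):
    """Visit every URL reachable from start_url once, without recursion.

    get_links is a hypothetical callable: URL -> iterable of linked URLs.
    """
    queue = deque([start_url])  # in-process stand-in for a message queue
    store.add(start_url)
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if store.add(link):  # only enqueue URLs not seen before
                queue.append(link)
    return visited
```

Passing a file path to `SeenStore` instead of `":memory:"` would let an interrupted crawl skip pages it has already recorded when it is restarted.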

morriaty_the_murderer

Posted on 2017-02-04

Updated on 2017-02-04

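If by "all data" you mean everything under one small domain, and you don't want to dig into the underlying principles, just go learn Scrapy.

If by "all data" you mean the entire web, and you want to understand principles such as whether to crawl breadth-first or depth-first, then first of all you need 10,000+ servers.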

haofly

Posted on 2017-02-05

If it's all on the same site, just crawl it recursively. How could you fail to finish crawling a single site?
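On the depth worry raised in the question: CPython's default recursion limit is 1000 (see `sys.getrecursionlimit()`), so a crawler that recurses once per "next page" can raise `RecursionError` on a long listing. Here is a rough sketch of the same traversal written as a loop. `crawl_pages` and the `fetch_page` callable are hypothetical; `fetch_page` is assumed to return the items on a page plus the next page's URL, or `None` on the last page.

```python
def crawl_pages(first_url, fetch_page, max_pages=10000):
    """Follow "next page" links with a loop instead of recursion.

    fetch_page is a hypothetical callable: URL -> (items, next_url or None).
    """
    items = []
    seen = set()  # guards against a "next" link that loops back
    url = first_url
    while url is not None and url not in seen and len(seen) < max_pages:
        seen.add(url)
        page_items, url = fetch_page(url)
        items.extend(page_items)
    return items
```

The `max_pages` cap is just a safety net for sites whose pagination never ends; pick whatever bound suits the site.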

Xavier

Posted on 2017-02-15

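If the site's structure is simple and repetitive, first work out the pattern in the page-number URLs, read the total page count from the first page, and then construct the URLs of the remaining pages yourself.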
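A small sketch of that idea. Everything site-specific here is an assumption: the `Page 1 of N` text that `parse_total_pages` looks for and the `{page}` URL template are made-up examples, so both need adjusting to whatever the real site shows.

```python
import re


def parse_total_pages(first_page_html, pattern=r"Page 1 of (\d+)"):
    """Read the total page count from the first page.

    The default pattern is a hypothetical example of pager text.
    """
    match = re.search(pattern, first_page_html)
    if match is None:
        raise ValueError("total page count not found; adjust the pattern")
    return int(match.group(1))


def build_page_urls(template, total_pages):
    """Fill the page number into a URL template for pages 1..total_pages."""
    return [template.format(page=n) for n in range(1, total_pages + 1)]
```

Once the list of URLs exists, a plain `for` loop over it is enough, and recursion never enters the picture.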

F_意志力

Posted on 2017-02-16

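First, the rough approach: if the page links are simple and follow a pattern, such as www.xxx.com/post/1.html, you can crawl them with recursion or a loop.

If the page links are not known in advance, parse the links out of the tags in each page you fetch and keep crawling from there. Along the way you need to store every link you have already crawled and, before crawling a new link, check whether it was crawled before. This can also be done with recursion.

The flow is: crawl a URL -> parse new URLs out of the fetched content -> crawl those URLs -> ... -> leave the recursion once you have crawled a set number of pages or no new links have turned up for a long time.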
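The flow above can be sketched roughly as follows, using only the standard library. The names `LinkParser`, `extract_links`, and `crawl_site` are hypothetical, `fetch` is an assumed callable that returns a page's HTML for a URL, and restricting the crawl to the starting host is my own simplification. The stop condition shown is the page-count one; a time-based cut-off is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, base_url):
    """Return absolute, fragment-free, same-host links found in html."""
    parser = LinkParser()
    parser.feed(html)
    host = urlparse(base_url).netloc
    links = []
    for href in parser.hrefs:
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if urlparse(absolute).netloc == host and absolute not in links:
            links.append(absolute)
    return links


def crawl_site(start_url, fetch, max_pages=100):
    """Crawl -> parse new URLs -> crawl, stopping after max_pages pages.

    fetch is a hypothetical callable: URL -> HTML text.
    """
    seen = {start_url}       # every link already crawled or queued
    pending = [start_url]    # links still waiting to be crawled
    pages = {}
    while pending and len(pages) < max_pages:
        url = pending.pop()
        html = fetch(url)
        pages[url] = html
        for link in extract_links(html, url):
            if link not in seen:
                seen.add(link)
                pending.append(link)
    return pages
```

For a real run, `fetch` could be something like `lambda u: urllib.request.urlopen(u).read().decode("utf-8", "replace")`, ideally with a delay between requests so the target site is not overloaded.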

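Finally, the Python world has a very capable crawler framework, Scrapy, which packages up nearly all the common crawling patterns; a little study is enough to get going with it.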

lpgad

Posted on 2017-02-16

Updated on 2017-02-16


import java.io.File;
import java.io.IOException;
import java.net.URL;

import org.apache.commons.io.FileUtils;

public class SpiderDemo {
    private static final String BASE_URL = "http://www.zhongguoxinyongheimingdan.com";

    public static void main(String[] args) throws IOException {
        // Download the index page once and cache it locally.
        File indexFile = new File("F://a.txt");
        if (!indexFile.exists()) {
            FileUtils.copyURLToFile(new URL(BASE_URL), indexFile);
        }
        String indexHtml = FileUtils.readFileToString(indexFile, "UTF-8");

        // Crude link extraction: split on href= instead of parsing the HTML.
        // The loop bounds and substring offsets below are tied to this site's markup.
        String[] links = indexHtml.split("href=");
        for (int i = 3; i < links.length - 1; i++) {
            URL pageUrl = new URL(BASE_URL + links[i].substring(1, 27));
            File pageDir = new File("F://abc//" + links[i].substring(2, 22));
            if (pageDir.exists()) {
                continue; // this page was already crawled on an earlier run
            }
            pageDir.mkdirs();

            // Save the detail page, then read it back to find its images.
            File pageFile = new File(pageDir, links[i].substring(1, 22) + ".txt");
            FileUtils.copyURLToFile(pageUrl, pageFile);
            String pageHtml = FileUtils.readFileToString(pageFile, "UTF-8");

            // Each chunk after a src attribute starts with an image URL.
            String[] images = pageHtml.split("\" src=\"");
            for (int j = 1; j < images.length - 2; j++) {
                URL imageUrl = new URL(images[j].substring(0, 81));
                File imageFile = new File(pageDir, images[j].substring(44, 76) + ".jpg");
                FileUtils.copyURLToFile(imageUrl, imageFile);
            }
        }
    }
}

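This simple code saves every photo on the China credit blacklist site to local disk. The site itself is simple! That said, the site crashed on the spot while I was running it, which left me speechless.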
