Crawling websites with Node

Posted on
2020-03-24

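I have a few hundred websites and need the name of each one. I'm using the node-crawler plugin. The sites don't all use the same character encoding: some are utf-8, some are gb2312, and others use yet other charsets. How can I detect the encoding automatically?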

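var defaultOptions = {
    autoWindowClose: true,
    forceUTF8: true,
    gzip: true,
    incomingEncoding: null,
    jQuery: true, // whether to inject cheerio into res; the docs explain this in detail
    maxConnections: 10, // limits concurrency; only takes effect when rateLimit == 0
    method: 'GET',
    priority: 5, // priority of the request in the queue; simulates user behavior
    priorityRange: 10,
    rateLimit: 0, // minimum interval between requests
    referer: false,
    retries: 3, // number of retries; a failed request is retried 3 times
    retryTimeout: 10000, // interval between retries
    timeout: 15000, // the request fails if there is no response within 15 s
    skipDuplicates: false, // URL de-duplication; using seenreq separately, outside the framework, is recommended
    rotateUA: false, // rotate among several UAs supplied as an array
    homogeneous: false
};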

var Crawler = require('crawler');

var c = new Crawler({
    forceUTF8: true,
    maxConnections: 10,
    // This will be called for each crawled page
    callback: function (error, res, done) {
        if (error) {
            console.log(error);
        } else {
            var $ = res.$;
            console.log(res);
            // $ is Cheerio by default,
            // a lean implementation of core jQuery designed specifically for the server
            // console.log($("title").text());
        }
        done();
    }
});
// arr is the list of site URLs (not shown in the question)
c.queue(arr[0]);

nodejs-crawler node.js javascript front-end

1 answer

蛋先生DX

Posted on
2020-03-25

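Before crawling, first make an asynchronous request for the site's home page, then read the encoding from the response headers or from the response body.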
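A minimal sketch of that idea, assuming the home page has already been fetched as a raw Buffer along with its Content-Type header. The helper name `detectCharset`, the 4096-byte scan window, and the `utf-8` fallback are all assumptions made for illustration; none of them come from node-crawler.

```javascript
// Hypothetical helper, not part of node-crawler.
// contentType: value of the Content-Type response header (may be undefined)
// body: raw, undecoded response body as a Buffer
function detectCharset(contentType, body) {
    // 1. Prefer the charset parameter of the Content-Type header,
    //    e.g. "text/html; charset=gb2312".
    var fromHeader = /charset\s*=\s*["']?([\w-]+)/i.exec(contentType || '');
    if (fromHeader) {
        return fromHeader[1].toLowerCase();
    }

    // 2. Otherwise look near the top of the document for
    //    <meta charset="..."> or
    //    <meta http-equiv="Content-Type" content="text/html; charset=...">.
    //    latin1 maps every byte to one character, so the ASCII markup
    //    stays readable whatever the real encoding is.
    var head = Buffer.from(body).subarray(0, 4096).toString('latin1');
    var fromMeta = /<meta[^>]*?charset\s*=\s*["']?([\w-]+)/i.exec(head);
    if (fromMeta) {
        return fromMeta[1].toLowerCase();
    }

    // 3. Assumed fallback when the page declares nothing.
    return 'utf-8';
}

console.log(detectCharset('text/html; charset=GB2312', Buffer.alloc(0))); // prints "gb2312"
```

Once the charset name is known, the raw body can be decoded with a conversion library, for example iconv-lite's `iconv.decode(buffer, charset)`, before the title is extracted. Pages that declare no charset, or declare the wrong one, would still need a statistical detector, which this sketch does not cover.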
