Scraping avatars by email: how to write a simple image-scraping script

I have a few hundred thousand email addresses on hand. The user system was never built with avatars, and now I want to use these emails to get avatars for at least some of the users. The first option is Gravatar, which can be used directly, except that the service gets blocked by the GFW from time to time, so pulling the images back to our own servers is more reliable. The second route is QQ Mail: analyzing the data showed that more than half of these users signed up with a QQ email address, so I needed a way to get their avatars without going through OAuth.

Approach and technology choices

As a Pythoner you have plenty of crawler frameworks to choose from, e.g. Scrapy and pyspider (yes, the latter has Chinese docs, a web UI, and built-in scheduling).

A crawler framework does a lot for you: the basics such as parse callbacks, and, more importantly, depth-first or breadth-first crawling for things like following next-page links. The better ones also give you simple ways to do user-agent spoofing, proxy rotation, authentication, scheduling, and so on.

But scraping avatars by email is just a matter of building the URL and fetching it back. There is no need for such heavy machinery, so requests is enough.
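
The core of the job really is this small (a minimal sketch; the URL and filename here are placeholders):

import requests

# fetch one image and write it to disk -- that is the whole job
response = requests.get('http://example.com/avatar.jpg', timeout=10)
if response.status_code == 200:
    with open('avatar.jpg', 'wb') as avatar_file:
        avatar_file.write(response.content)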

What the script needs to do

Download the images, but not the default ones. For example, when a QQ user has no avatar, QQ serves a default image in several sizes; I do not want those. No avatar means no avatar.

Resume from where it left off if the scraping process dies (I do not want to start over each time; there are hundreds of thousands of emails).

Support multiple processes to speed up the crawl.

Now for the implementation.

The first step is getting the URL. If you do not mind that Gravatar may be blocked and that the QQ link may change (it is not a documented endpoint, after all), this part alone is already enough.

Getting the URL from an email address

Gravatar

Gravatar docs

Gravatar Python implementation

Bring your own ladder (proxy) if you need one.

There is not much to say about Gravatar: the URL is just the MD5 hash of the (lowercased) email address.

Parameters worth noting: s is the size; Gravatar handles this well and serves essentially any size you ask for.
d controls the default image; when you do not want a default avatar, pass 404 and Gravatar will answer with an HTTP 404 response. See the docs for the other parameters.
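
Concretely, the check looks roughly like this (a minimal sketch with a placeholder address; the full get_gravatar_url helper is in the complete script below):

import hashlib
import requests

# gravatar URL: md5 of the lowercased email, d=404 to suppress defaults
email = 'someone@example.com'  # placeholder address
url = 'http://secure.gravatar.com/avatar/{}?s=100&d=404'.format(
    hashlib.md5(email.lower()).hexdigest())
response = requests.get(url, timeout=10)
print response.status_code  # 404 means the user has no real avatar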

QQ

http://q4.qlogo.cn/g?b=qq&nk=491794128&s=1

The QQ link is fairly easy to get hold of (do not ask me how I found it; I forget).

nk is the QQ number; a QQ email address works too.

s is the image size. Probing around, I found all of these sizes: 1 2 3 4 5 40 41 100 140 160 240 640.
Sizes 1 through 5 always exist; s=2 corresponds to 40×40 and s=4 to 100×100. Note, though, that not everyone has a 100-pixel image (an avatar uploaded ten years ago and never changed since: such users really exist, I know some personally...).
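
The size list above came from probing the endpoint with something like this (a sketch, reusing the sample nk from the link above):

import requests

# probe which sizes the qlogo endpoint serves for one account
for size in [1, 2, 3, 4, 5, 40, 41, 100, 140, 160, 240, 640]:
    url = 'http://q4.qlogo.cn/g?b=qq&nk=491794128&s={}'.format(size)
    response = requests.get(url, timeout=10)
    print size, response.status_code, len(response.content)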

This post explains how to fetch a QQ nickname and avatar from a QQ number without an appid.
It does the anti-hotlinking scraping with PHP curl; sadly it is PHP, so I ported it to Python
(Python version). The final implementation does not use it (QQ has a directly accessible link, oh yeah), but you never know when it might come in handy.

(Five sample avatars, sizes s=1 through s=5, were embedded here along with fallback links; they may not display on GitHub, OSC, or SF.)

Problems I ran into, before the code

Like every crawler, you need to spoof the User-Agent, or you may get banned. While crawling QQ I noticed that after a while every avatar came back with a size of 0; something had clearly gone wrong.

You may notice in the code that I name each downloaded image <email>.jpg; that is because I wanted to write a small page to browse the results.

On Gravatar coverage: the ratio kept dropping, from 1 user in 40, to 1 in 60; by the time I had crawled 60k emails it settled at roughly 1 in 100.

On skipping default images: Gravatar is easy, just check for the 404. QQ is messier: first download the handful of default images and MD5 them, then compare the MD5 of every downloaded avatar against that list; a match means it is a default image, so skip it.
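
This is presumably how a list like QQ_MD5_ESCAPE in the script below gets built (a sketch; the nk value is a hypothetical account that has never set an avatar):

import hashlib
import requests

# download the default images once and record their md5 digests;
# nk=10000 is a hypothetical account with no avatar set
default_md5s = []
for size in [2, 4]:
    url = 'http://q4.qlogo.cn/g?b=qq&nk=10000&s={}'.format(size)
    response = requests.get(url, timeout=10)
    default_md5s.append(hashlib.md5(response.content).hexdigest())
print default_md5s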

On resuming: the log will save you; use it well. Every processed email gets one log line that starts with its index (e.g. 1523 SUCCESS foo@qq.com scraped success), so on restart the script only has to scan the log for the highest index and continue from the next line.

On multiprocessing: this is the easiest part. Remember the idea from algorithms class: break the big task into small ones. Split the full email list into several parts, add a little support in the script, and you can run several processes at once; see the sketch below.
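
A round-robin split like this works (a sketch; it assumes a source file email_list.json holding one email per line):

# split one big list into N part files, round-robin
N = 4
parts = [open('email_list_{}.json'.format(i), 'w') for i in range(N)]
with open('email_list.json') as source:
    for linenum, email in enumerate(source):
        parts[linenum % N].write(email)
for part_file in parts:
    part_file.close()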

The code

The latest code is here.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests
import hashlib
import urllib
import sys
import os
import random
from functools import partial


AGENTS = [
    "Avant Browser/1.2.789rel1 (http://www.avantbrowser.com)",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5",
    "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.310.0 Safari/532.9",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.514.0 Safari/534.7",
    "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/9.0.601.0 Safari/534.14",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/10.0.601.0 Safari/534.14",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.27 (KHTML, like Gecko) Chrome/12.0.712.0 Safari/534.27",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.24 Safari/535.1",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.120 Safari/535.2",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7",
    "Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10",
    "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11 (.NET CLR 3.5.30729)",
    "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 GTB5",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; tr; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 ( .NET CLR 3.5.30729; .NET4.0E)",
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0a2) Gecko/20110622 Firefox/6.0a2",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b4pre) Gecko/20100815 Minefield/4.0b4pre",
]


QQ_MD5_ESCAPE = ['11567101378fc08988b38b8f0acb1f74', '9d11f9fcc1888a4be8d610f8f4bba224']


LOG_FILES = 'scrapy_{}.log'
EMAIL_LIST = 'email_list_{}.json'
AVATAR_PATH = 'avatar/{}{}'


LOG_LEVEL_EXISTS = 'EXISTS'
LOG_LEVEL_NOTSET_OR_ERROR = 'NOTSET_OR_ERROR'
LOG_LEVEL_TYPE_ERROR = 'TYPE_ERROR'
LOG_LEVEL_ERROR = 'ERROR'
LOG_LEVEL_FAIL = 'FAIL'
LOG_LEVEL_SUCCESS = 'SUCCESS'
LOG_LEVEL_IGNORE = 'IGNORE'


def get_gravatar_url(email, default_avatar=None, use_404=False, size=100):
    data = {}
    if default_avatar and default_avatar.startswith('http'):
        data['d'] = default_avatar
    if use_404:
        data['d'] = '404'
    data['s'] = str(size)
    gravatar_url = "http://secure.gravatar.com/avatar/" + hashlib.md5(email.lower()).hexdigest() + "?"
    gravatar_url += urllib.urlencode(data)
    return gravatar_url


def get_random_headers():
    agent = random.choice(AGENTS)
    headers = {'User-Agent': agent}
    return headers


def check_logfile(part):
    # find the largest index already written to the log so we can resume;
    # -1 means no progress yet, so the next line to process is 0
    last_scrapy_line = -1
    if os.path.exists(LOG_FILES.format(part)):
        with open(LOG_FILES.format(part)) as log_read:
            for line in log_read:
                last_scrapy_line = max(last_scrapy_line, int(line.split()[0]))
    print last_scrapy_line
    return last_scrapy_line + 1


def get_log_message(log_format='{index} {level} {email} {msg}', index=None, level=None, email=None, msg=None):
    return log_format.format(index=index, level=level, email=email, msg=msg)


SUCCESS_LOG = partial(get_log_message, level=LOG_LEVEL_SUCCESS, msg='scraped success')
EXIST_LOG = partial(get_log_message, level=LOG_LEVEL_EXISTS, msg='scraped already')
FAIL_LOG = partial(get_log_message, level=LOG_LEVEL_FAIL, msg='scraped failed')
NOT_QQ_LOG = partial(get_log_message, level=LOG_LEVEL_TYPE_ERROR, msg='not qq email')
IGNORE_LOG = partial(get_log_message, level=LOG_LEVEL_IGNORE, msg='ignore email')
EMPTY_SIZE_LOG = partial(get_log_message, level=LOG_LEVEL_ERROR, msg='empty avatar')
UNEXCEPT_ERROR_LOG = partial(get_log_message, level=LOG_LEVEL_ERROR, msg='unexpected error')


def write_log(log, msg):
    log.write(msg)
    log.write('\n')
    log.flush()


def save_avatar_file(filename, content):
    with open(filename, 'wb') as avatar_file:
        avatar_file.write(content)


def scrapy_context(part, suffix='.jpg', rescrapy=False, hook=None):
    last_scrapy_line = check_logfile(part)
    index = last_scrapy_line
    with open(LOG_FILES.format(part), 'a') as log:
        with open(EMAIL_LIST.format(part)) as list_file:
            for linenum, email in enumerate(list_file):
                if linenum < last_scrapy_line:
                    continue
                email = email.strip()
                if not rescrapy:
                    if os.path.exists(AVATAR_PATH.format(email, suffix)):
                        print EXIST_LOG(index=index, email=email)
                        index += 1
                        continue
                if not hook:
                    raise NotImplementedError()
                try:
                    hook(part, suffix=suffix, rescrapy=rescrapy, log=log, index=index, email=email)
                except Exception:
                    print UNEXCEPT_ERROR_LOG(index=index, email=email)
                    write_log(log, UNEXCEPT_ERROR_LOG(index=index, email=email))
                    raise  # re-raise with the original traceback
                index += 1


def scrapy_qq_hook(part, suffix='.jpg', rescrapy=False, log=None, index=None, email=None):
    if 'qq.com' not in email.lower():
        print NOT_QQ_LOG(index=index, email=email)
        write_log(log, NOT_QQ_LOG(index=index, email=email))
        return

    url = 'http://q4.qlogo.cn/g?b=qq&nk={}&s=4'.format(email)
    response = requests.get(url, timeout=10, headers=get_random_headers())
    if response.status_code == 200:
        # if the user has no big (100x100) avatar, fall back to the small one
        if hashlib.md5(response.content).hexdigest() in QQ_MD5_ESCAPE:
            url = 'http://q4.qlogo.cn/g?b=qq&nk={}&s=2'.format(email)
            response = requests.get(url, timeout=10, headers=get_random_headers())
            if response.status_code == 200:
                if not len(response.content):
                    # zero-length body: probably throttled; do not save an empty file
                    print EMPTY_SIZE_LOG(index=index, email=email)
                    write_log(log, EMPTY_SIZE_LOG(index=index, email=email))
                    return
                if hashlib.md5(response.content).hexdigest() in QQ_MD5_ESCAPE:
                    # the small size is a default image too: the user has no avatar
                    print FAIL_LOG(index=index, email=email)
                    write_log(log, FAIL_LOG(index=index, email=email))
                    return
    # checked again because the branch above may have re-fetched the response
    if response.status_code == 200:
        save_avatar_file(AVATAR_PATH.format(email, suffix), response.content)
        print SUCCESS_LOG(index=index, email=email)
        write_log(log, SUCCESS_LOG(index=index, email=email))
    else:
        print FAIL_LOG(index=index, email=email)
        write_log(log, FAIL_LOG(index=index, email=email))


def scrapy_gravatar_hook(part, suffix='.jpg', rescrapy=False, ignore_email_suffix=None, log=None, index=None, email=None):
    if ignore_email_suffix and ignore_email_suffix in email.lower():
        print IGNORE_LOG(index=index, email=email)
        write_log(log, IGNORE_LOG(index=index, email=email))
        return

    response = requests.get(get_gravatar_url(email, use_404=True), timeout=10, headers=get_random_headers())
    if response.status_code == 200:
        save_avatar_file(AVATAR_PATH.format(email, suffix), response.content)
        print SUCCESS_LOG(index=index, email=email)
        write_log(log, SUCCESS_LOG(index=index, email=email))
    else:
        print FAIL_LOG(index=index, email=email)
        write_log(log, FAIL_LOG(index=index, email=email))
        return


scrapy_gravatar = partial(scrapy_context, hook=scrapy_gravatar_hook)
scrapy_qq = partial(scrapy_context, hook=scrapy_qq_hook)


FUNC_MAPPER = {
    'qq': scrapy_qq,
    'gravatar': scrapy_gravatar,
}

if __name__ == '__main__':
    scrapy_type = sys.argv[1]
    part = sys.argv[2]
    if scrapy_type not in FUNC_MAPPER:
        print 'type should be one of [qq | gravatar]'
        exit(1)
    FUNC_MAPPER[scrapy_type](part)

Basic usage

pip install requests

  1. Put scrapy_avatar.py in a folder of your choice, e.g. /opt/projects/scripts

  2. mkdir /opt/projects/scripts/avatar

  3. Put your email list into email_list_0.json

  4. python scrapy_avatar.py gravatar 0, or python scrapy_avatar.py qq 0

Notes

When the email list is large, split it into several lists so you can use more processes,
e.g. email_list_0.json and email_list_1.json (see the splitting sketch earlier).
Then run python scrapy_avatar.py gravatar 0 and python scrapy_avatar.py gravatar 1 to crawl with two processes.

For other features, read the code and adapt the two hook functions.

Gripes

  1. Since this is a simple script, I could not be bothered to pull in click for argument parsing; the only dependency is requests, and I skipped proper argument validation.

  2. The for loop in scrapy_context originally used a contextmanager with yield, but I hit a strange RuntimeError: generator didn't stop, so I grudgingly replaced the yield with the hook approach.

  3. QQ avatars have some quirks: not everyone has a 100-pixel image, but everyone has a 40-pixel one, so the script tries the big image first and does one extra check on the QQ side.

  4. The context and hook options are not wired into the command line; adjust the code yourself if you need them.

Opening the result in Chrome is quite a sight.

Appendix: a simple way to view images on a Linux server with Flask + nginx

Django is too heavy for this; Flask + nginx is enough, since there are no other requirements.

pip install flask

Drop app.py into the folder where the images were scraped, point the avatar root in the nginx config below at that directory, drop the config into /etc/nginx/sites-enabled, reload nginx, and do not forget to add localtest to your hosts file.

Flask code, app.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from flask import Flask, send_from_directory
import os


app = Flask(__name__)
app.debug = True

@app.route("/")
def hello():
    # build one big page of <img> tags, one per scraped avatar
    avatars = sorted(os.listdir('avatar'))
    html = '\n'.join("<img src='/avatar/{}' />".format(avatar) for avatar in avatars)
    return html

@app.route("/avatar/<path:filename>")
def avatar(filename):
    # fallback route so the page also works without nginx in front
    return send_from_directory('avatar', filename)

if __name__ == "__main__":
    app.run(host='0.0.0.0', port=11111)

nginx

upstream localtest-backend {
    server 127.0.0.1:11111 fail_timeout=0;
}

server {

  listen 80;
  server_name localtest.com;

  location ~ /avatar/(?P<file>.*) {
    root /opt/projects/scripts/new;
    try_files /avatar/$file =404;
    expires 30d;
    gzip on;
    gzip_types text/plain application/x-javascript text/css application/javascript;
    gzip_comp_level 3;
  }

  location / {
        proxy_pass http://localtest-backend;
  }
}
