[Python Web Scraping by Example] Part 2: Getting Free IP Proxies

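Sites frequently check the IP address of a crawler, so it is worth having a reliable source of IP proxies. After some tinkering, I finally built my first proxy library; every function in it returns a list. (Note: these free proxies are updated in real time every day, and in my tests more than 60% of them were usable.) In addition, to keep the proxy library running stably over long periods, this article wraps the get request of the requests library once more.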

Tools used

1. Python 3.6
2. The requests library
3. A free proxy website

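1. Getting one free proxy via the API

The free proxy site offers two APIs: one returns a single free proxy, and the other returns one page of free proxies (at most 15 per page).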

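import requests

def GetFreeProxy(retries=5):
    # Fetch a single free proxy; returns a one-element list such as ['ip:port']
    url = 'https://www.freeip.top/api/proxy_ip'
    try:
        res = requests.get(url=url, timeout=20)
        # Parse the response body as JSON
        result = res.json()
        return [result['data']['ip'] + ':' + str(result['data']['port'])]
    except Exception:
        if retries <= 0:
            # Give up with an empty list so the return type stays a list
            print('Failed to get a proxy IP.')
            return []
        print('Failed to get a proxy IP! Retrying...')
        # Retry on error and return the result of the retry
        return GetFreeProxy(retries - 1)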
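As a hedged sketch of how one of the returned 'ip:port' strings might be used, the helper below turns it into the `proxies` mapping that requests accepts. The name `to_proxies` and the sample address are assumptions made for illustration; they are not part of the original library.

```python
def to_proxies(address, scheme='http'):
    """Build a requests-style proxies dict from an 'ip:port' string.

    `to_proxies` is a hypothetical helper, not part of the original code.
    """
    proxy_url = scheme + '://' + address
    # requests selects a proxy by the scheme of the target URL,
    # so the same proxy is registered for both http and https targets.
    return {'http': proxy_url, 'https': proxy_url}


if __name__ == '__main__':
    proxies = to_proxies('10.0.0.1:8080')  # made-up address
    print(proxies)
    # Typical use (needs a live proxy, so it is left commented out):
    # requests.get('https://example.com', proxies=proxies, timeout=10)
```

The dict can then be passed as the `proxies` argument of `requests.get` or `session.get`.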

2. Getting a page of free proxies via the API

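from random import choice
from urllib import parse

import requests

def GetFreeProxyListAPI(page=1, country='', isp='', order_by='validated_at', order_rule='DESC'):
    # Fetch one page of free proxies through the API
    # Returns a list of 'ip:port' strings
    # Parameter     Type    Required  Description     Example
    # page          int     N         Page number     1
    # country       string  N         Country         中国,美国 (China, USA)
    # isp           string  N         ISP             电信,阿里云 (China Telecom, Alibaba Cloud)
    # order_by      string  N         Sort field      speed: response speed, validated_at: last validation time, created_at: time alive
    # order_rule    string  N         Sort direction  DESC: descending, ASC: ascending
    data = {
        'page': str(page),
        'country': country,
        'isp': isp,
        'order_by': order_by,
        'order_rule': order_rule
    }
    url = 'https://www.freeip.top/api/proxy_ips' + '?' + parse.urlencode(data)
    headers = {
        # user_agent_list is assumed to be a list of User-Agent strings defined elsewhere in the module
        'User-Agent': str(choice(user_agent_list))
    }
    session = requests.session()
    # GET is the retrying wrapper defined in section 5
    res = GET(session, url=url, headers=headers)
    # Parse the response
    result = res.json()
    # Iterate over the rows actually returned instead of counting up to 'to',
    # which avoids an IndexError if 'to' exceeds the number of rows on this page
    return [row['ip'] + ':' + str(row['port']) for row in result['data']['data']]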
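To make the request and response handling easier to follow, here is a minimal offline sketch of the two steps: building the query string and pulling 'ip:port' strings out of the decoded JSON. The helper names `build_url` and `extract_proxies` and the sample payload are assumptions; the payload is only shaped after the fields the function above reads (`data.to`, `data.data[i].ip`, `data.data[i].port`).

```python
from urllib import parse


def build_url(base, **params):
    """Append URL-encoded query parameters to a base URL (hypothetical helper)."""
    return base + '?' + parse.urlencode(params)


def extract_proxies(result):
    """Collect 'ip:port' strings from a decoded API response (hypothetical helper)."""
    # Iterate over the rows that are present rather than trusting a counter field.
    return [row['ip'] + ':' + str(row['port']) for row in result['data']['data']]


# Made-up payload, shaped after the fields the function above reads.
sample = {
    'data': {
        'to': 2,
        'data': [
            {'ip': '10.0.0.1', 'port': '8080'},
            {'ip': '10.0.0.2', 'port': '3128'},
        ],
    }
}

if __name__ == '__main__':
    print(build_url('https://www.freeip.top/api/proxy_ips', page=1, order_rule='DESC'))
    print(extract_proxies(sample))
```

Looping over the rows themselves keeps the extraction independent of whatever the pagination counters contain.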

3. Getting one free proxy from the web page

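from random import choice

import requests
from lxml import etree

# Note: this function has the same name as the API version in section 1;
# rename one of them if both live in the same module.
def GetFreeProxy():
    # Scrape the first proxy listed on the site
    # Returns a one-element list such as ['ip:port']
    headers = {
        # user_agent_list is assumed to be a list of User-Agent strings defined elsewhere in the module
        'User-Agent': str(choice(user_agent_list))
    }
    session = requests.session()
    # Choose the proxy website
    homeurl = 'https://www.freeip.top/'
    url = 'https://www.freeip.top/?page=1'
    # Visit the home page first with the same session
    GET(session=session, url=homeurl, headers=headers)
    res = GET(session=session, url=url, headers=headers)
    html = etree.HTML(res.text)
    # Select the IP and port cells of the first row
    IP_list_1 = html.xpath('//tr[1]/td[1]')
    IP_list_2 = html.xpath('//tr[1]/td[2]')
    if not IP_list_1 or not IP_list_2:
        # No table row found; keep the return type a list
        return []
    return [IP_list_1[0].text + ':' + IP_list_2[0].text]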
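The row and cell selection can be tried offline. The sketch below is an assumption-laden stand-in: it uses the standard library's `xml.etree.ElementTree` instead of lxml, runs on a made-up and well-formed table, and relies on ElementTree's limited XPath subset. Real pages are usually not well-formed XML, which is why the article's code uses `lxml.etree.HTML`.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed table standing in for the proxy listing page.
SAMPLE = """
<table>
  <thead><tr><th>IP</th><th>Port</th></tr></thead>
  <tbody>
    <tr><td>10.0.0.1</td><td>8080</td></tr>
    <tr><td>10.0.0.2</td><td>3128</td></tr>
  </tbody>
</table>
"""


def rows_to_proxies(markup):
    """Pair the first and second cell of every table row as 'ip:port'."""
    root = ET.fromstring(markup.strip())
    ips = root.findall('.//tr/td[1]')    # first <td> of each row
    ports = root.findall('.//tr/td[2]')  # second <td> of each row
    return [ip.text + ':' + port.text for ip, port in zip(ips, ports)]


if __name__ == '__main__':
    print(rows_to_proxies(SAMPLE))
```

The header row contributes nothing because it contains `<th>` cells rather than `<td>` cells.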

4. Getting a page of free proxies from the web page (recommended)

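from os import remove
from os.path import exists as _exists  # _exists is assumed to be os.path.exists
from random import choice

import requests
from lxml import etree

def GetFreeProxyList(GetType=1, protocol='https'):
    # More than 50% of these proxies were usable; recommended
    # Returns a list of 'ip:port' strings
    headers = {
        # user_agent_list is assumed to be a list of User-Agent strings defined elsewhere in the module
        'User-Agent': str(choice(user_agent_list)),
    }
    session = requests.session()
    # Choose the proxy website
    if GetType == 1:
        homeurl = 'https://www.freeip.top/'
        url = 'https://www.freeip.top/?page=1&protocol=' + protocol
        GET(session=session, url=homeurl, headers=headers)
    elif GetType == 2:
        homeurl = 'https://www.kuaidaili.com/'
        url = 'https://www.kuaidaili.com/free/inha/'
        GET(session=session, url=homeurl, headers=headers)
    else:
        print('Other methods are not supported yet!')
        return []
    # Delete IP.txt if it exists (this function itself never writes the file)
    if _exists('IP.txt'):
        remove('IP.txt')
    # Get the IP list
    num = 1
    IP_list = []
    while True:
        res = GET(session=session, url=url, headers=headers)
        html = etree.HTML(res.content.decode('utf-8'))
        # Select the IP and port cells
        IP_list_1 = html.xpath('//tr/td[1]')
        IP_list_2 = html.xpath('//tr/td[2]')
        # Stop at the first page that has no rows
        if not IP_list_1:
            break
        IP_list.extend(ip.text + ':' + port.text for ip, port in zip(IP_list_1, IP_list_2))
        if GetType != 1:
            # The URL for this site never changes, so fetch only the first page
            # instead of requesting the same page forever
            break
        # Move on to the next page before building the next URL,
        # so that page 1 is not fetched twice
        num = num + 1
        url = 'https://www.freeip.top/?page=' + str(num) + '&protocol=' + protocol
    return IP_list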
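Text scraped from table cells often carries stray whitespace, and pages can overlap, so a clean-up pass over the collected list may be useful. The following is a minimal sketch under those assumptions; `clean_proxy_list` is a hypothetical helper that is not part of the original library, and the sample entries are made up.

```python
import ipaddress


def clean_proxy_list(entries):
    """Strip, validate and de-duplicate 'ip:port' strings, keeping first-seen order."""
    seen = set()
    cleaned = []
    for entry in entries:
        # Split on the last colon so the port is always the final component.
        host, sep, port = entry.strip().rpartition(':')
        host, port = host.strip(), port.strip()
        if not sep or not port.isdecimal() or not 0 < int(port) < 65536:
            continue
        try:
            ipaddress.ip_address(host)  # raises ValueError for non-IP text
        except ValueError:
            continue
        address = host + ':' + port
        if address not in seen:
            seen.add(address)
            cleaned.append(address)
    return cleaned


if __name__ == '__main__':
    raw = [' 10.0.0.1:8080\n', '10.0.0.1:8080', 'IP:PORT', '10.0.0.2:99999', '10.0.0.3 : 3128']
    print(clean_proxy_list(raw))
```

This only checks that each entry is well formed; whether a proxy actually works still has to be tested by sending a request through it.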

5. The re-wrapped GET request

from time import sleep

# Free proxy sites are unstable, and fetching many proxies often triggers 503 errors, so failed requests are retried
def GET(session, url, headers, timeout=15, num=0):
    try:
        # verify=False skips TLS certificate verification
        response = session.get(url=url, headers=headers, timeout=timeout, verify=False)
        if response.status_code == 200:
            return response
        print('Server error, retry number %i...' % (num + 1))
    except Exception:
        print('Connection error, retry number %i...' % (num + 1))
    sleep(0.8)
    # Retry with the same timeout and an incremented retry counter
    return GET(session=session, url=url, headers=headers, timeout=timeout, num=num + 1)
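The wrapper above keeps retrying until a request succeeds. Where an upper bound is preferable, a loop with a retry limit is one option. The sketch below is an assumption, not the author's method: `get_with_retries`, `max_retries`, `delay` and the `FakeSession` test double are all hypothetical names, and the function returns None once the limit is reached.

```python
from time import sleep


def get_with_retries(session, url, headers, timeout=15, max_retries=3, delay=0.8):
    """Return the first 200 response, or None once max_retries attempts have failed."""
    for attempt in range(1, max_retries + 1):
        try:
            response = session.get(url=url, headers=headers, timeout=timeout)
            if response.status_code == 200:
                return response
            print('Server returned %i on attempt %i' % (response.status_code, attempt))
        except Exception as exc:
            print('Connection error on attempt %i: %s' % (attempt, exc))
        if attempt < max_retries:
            sleep(delay)  # pause before the next attempt
    return None


class FakeSession:
    """Stand-in session that fails twice and then succeeds, for demonstration only."""

    def __init__(self):
        self.calls = 0

    def get(self, url, headers, timeout):
        self.calls += 1
        if self.calls < 3:
            raise ConnectionError('simulated failure')
        return type('Response', (), {'status_code': 200})()


if __name__ == '__main__':
    session = FakeSession()
    response = get_with_retries(session, 'https://example.com', {}, delay=0)
    print(response.status_code, session.calls)
```

Callers of such a bounded version have to handle the None case themselves, for example by skipping the page or switching to another proxy site.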

6. Results

Getting one free proxy:

Getting a page of free proxies:

==WeChat official account:==

小术快跑

