
What is a tunnel proxy? Let's take a look at the screenshot below:

A tunnel proxy is a proxy service that automatically rotates the proxy IP for you. In your code, you only need to configure a single entry proxy address and then send requests as usual; the target server sees each request coming from a different proxy IP.

On one proxy vendor's website, a tunnel proxy that supports 50 concurrent requests per second costs 4,000 yuan per month:

With a conventional proxy service, you first call an API to get a batch of proxy IPs and then pick one of them to send each request through. That costs only a little over 600 yuan a month:

So, if we can build a tunnel proxy ourselves, we will save a lot of money!

How a tunnel proxy works, and how it differs from a conventional proxy, can be explained clearly with the following two pictures:

Conventional proxy service

Tunnel proxy

To build such a tunnel proxy ourselves, we need two steps:

  1. Build a proxy pool
  2. Implement automatic proxy forwarding

Build a proxy pool

Suppose the cheap proxy API you bought from a proxy vendor is http://xxx.com/ips. If you open it directly in a browser, the page looks like the figure below:

Now, all you need to do is write a program that periodically requests this URL, pulls the latest available IP addresses, and stores them in Redis.

Here we use a Redis Hash, where each field name is IP:port and the value holds some information about that IP.

Your program needs to make sure that every proxy address currently in Redis is usable. Here is a sample program:

"""
ProxyManager.py
~~~~~~~~~~~~~~~~~~~~~
A simple proxy pool manager: reads all the latest
proxies directly from a URL and writes them to Redis.
"""
import yaml
import time
import json
import redis
import datetime
import requests


class ProxyManager:
    def __init__(self):
        self.config = self.read_config()
        self.redis_config = self.config['redis']
        self.client = redis.Redis(host=self.redis_config['host'],
                                  password=self.redis_config['password'],
                                  port=self.redis_config['port'])
        self.instance_dict = {}

    def read_config(self):
        with open('config.yaml') as f:
            config = yaml.safe_load(f.read())
            return config

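    def read_ip(self):
        resp = requests.get(self.config['proxy']).text
        # A '{' presumably means the provider returned a JSON (error) body
        # instead of the plain-text list of ip:port lines
        if '{' in resp:
            return []
        proxy_list = resp.split()
        return proxy_list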

    def delete_ip(self, live_ips, pool_ips):
        ip_to_removed = set(pool_ips) - set(live_ips)
        if ip_to_removed:
            print('ip to be removed:', ip_to_removed)
            self.client.hdel(self.redis_config['key'], *list(ip_to_removed))

    def add_new_ips(self, live_ips, pool_ips):
        ip_to_add = set(live_ips) - set(pool_ips)
        if ip_to_add:
            print('ip to add:', ip_to_add)
            ips = {}
            for ip in ip_to_add:
                ips[ip] = json.dumps({'private_ip': ip,
                                      'ts': datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')})
            self.client.hset(self.redis_config['key'], mapping=ips)

    def run(self):
        while True:
            live_ips = self.read_ip()
            # Only the field names (ip:port) are needed, so HKEYS is enough
            pool_ips = [x.decode() for x in self.client.hkeys(self.redis_config['key'])]
            self.delete_ip(live_ips, pool_ips)
            self.add_new_ips(live_ips, pool_ips)
            time.sleep(40)


if __name__ == '__main__':
    manager = ProxyManager()
    manager.run()

I put the Redis configuration and the proxy provider's URL into a YAML configuration file so that my real values are not exposed. The format of the configuration file is shown in the figure below:
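Judging from the keys that ProxyManager reads, the loaded configuration needs a redis section with host, port, password and key, plus a top-level proxy entry holding the provider URL. Below is a minimal sketch that checks a loaded config for those keys. The helper validate_config and the sample values are hypothetical and not part of the original program.

```python
# Keys inferred from ProxyManager: config['redis'][...] and config['proxy'].
REQUIRED_REDIS_KEYS = ("host", "port", "password", "key")


def validate_config(config):
    """Return a list of missing keys; an empty list means the config is usable."""
    missing = []
    if "proxy" not in config:
        missing.append("proxy")
    redis_section = config.get("redis") or {}
    for name in REQUIRED_REDIS_KEYS:
        if name not in redis_section:
            missing.append("redis." + name)
    return missing


# Placeholder values only; this is the shape of the dict yaml.safe_load() returns.
sample_config = {
    "redis": {"host": "127.0.0.1", "port": 6379, "password": "secret", "key": "proxy:pool"},
    "proxy": "http://xxx.com/ips",
}
print(validate_config(sample_config))
```

Calling such a check once at startup would surface a misspelled key immediately instead of as a KeyError in the middle of the refresh loop.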

Since the IPs from my proxy provider are valid for 1 to 5 minutes, I refresh the pool every 40 seconds to be safe. The refresh is incremental: the freshly pulled IPs are compared with the IPs already in Redis, every IP that was not in the latest pull is removed from Redis, and the newly appeared IPs are added.
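The incremental replacement boils down to two set differences. Here is a small standalone sketch of that step, without Redis. The function name plan_pool_update and the addresses are made up for illustration.

```python
def plan_pool_update(live_ips, pool_ips):
    """Return (to_remove, to_add) so that the pool ends up equal to live_ips."""
    live, pool = set(live_ips), set(pool_ips)
    to_remove = pool - live   # in Redis, but not returned by the latest pull
    to_add = live - pool      # returned by the latest pull, but not yet in Redis
    return to_remove, to_add


# Made-up addresses for illustration.
pool_ips = ["1.1.1.1:8000", "2.2.2.2:8000", "3.3.3.3:8000"]
live_ips = ["2.2.2.2:8000", "3.3.3.3:8000", "4.4.4.4:8000"]
to_remove, to_add = plan_pool_update(live_ips, pool_ips)
print(sorted(to_remove), sorted(to_add))
```

IPs that appear in both lists are left untouched, so their stored timestamp in Redis is not overwritten on every refresh.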

In practice, you can also add proxy verification logic: check each proxy pulled from the URL for validity and discard the invalid ones immediately.
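One possible shape for that verification step is sketched below, using only the standard library. This is an assumption about how you might do it, not the author's code: check_proxy, filter_live_proxies, the test URL and the timeout are all illustrative choices.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def check_proxy(proxy, test_url="http://httpbin.org/ip", timeout=5):
    """Return True if an HTTP request through `proxy` ('ip:port') succeeds."""
    handler = urllib.request.ProxyHandler({"http": "http://" + proxy})
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(test_url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        # Any connection error, timeout or bad response counts as a dead proxy.
        return False


def filter_live_proxies(proxies, check=check_proxy, max_workers=20):
    """Check all proxies concurrently and keep only those that pass, in order."""
    proxies = list(proxies)
    if not proxies:
        return []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        results = list(executor.map(check, proxies))
    return [proxy for proxy, ok in zip(proxies, results) if ok]
```

In the sample program, this could be applied to the result of read_ip() before the comparison with Redis, so that dead proxies never enter the pool. The checks run in a thread pool because each one is dominated by network wait time.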

Implement automatic forwarding

To implement automatic forwarding, we can use OpenResty, a high-performance web platform based on Nginx and Lua. With it, we can implement logic in Lua, such as reading data from Redis and forwarding incoming requests to an upstream proxy server.

So we use OpenResty to build a forwarding service and use the IP address of the server it runs on as our entry IP address. When sending requests with Requests or any other HTTP client, you only need to set this entry address as the proxy. Each request from the client first arrives at OpenResty, which randomly picks a proxy IP from Redis as the upstream proxy and forwards the request to it. This achieves the tunnel proxy effect.
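The selection step itself is simple: take the list of ip:port field names from the Redis Hash, pick one at random, and split it at the colon. The Python sketch below only illustrates that logic; the real forwarding is done in Lua inside OpenResty. The names split_proxy and pick_upstream and the addresses are hypothetical.

```python
import random


def split_proxy(field):
    """Split an 'ip:port' hash field into (ip, port) at the first colon."""
    ip, _, port = field.partition(":")
    return ip, int(port)


def pick_upstream(pool_fields, rng=random):
    """Pick one upstream proxy at random from the pool's field names."""
    pool_fields = list(pool_fields)
    if not pool_fields:
        raise ValueError("proxy pool is empty")
    return split_proxy(rng.choice(pool_fields))


# Made-up addresses standing in for the result of HKEYS on the pool key.
fields = ["1.1.1.1:8000", "2.2.2.2:8001", "3.3.3.3:8002"]
print(pick_upstream(fields))
```

Because a fresh pick happens for every incoming connection, consecutive requests from the same client usually leave through different upstream proxies.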

Lua is a fairly old language, and its syntax differs quite a bit from Python's. But don't worry, I have already written the configuration file; you can take it and adapt it.

The configuration file is shown below:

worker_processes  16;        # number of nginx workers
error_log /usr/local/openresty/nginx/logs/error.log;   # path of the error log file
events {
    worker_connections 1024;
}


stream {
    ## TCP proxy log format
    log_format tcp_proxy '$remote_addr [$time_local] '
                         '$protocol $status $bytes_sent $bytes_received '
                         '$session_time "$upstream_addr" '
                         '"$upstream_bytes_sent" "$upstream_bytes_received" "$upstream_connect_time"';
    ## TCP proxy log configuration
    access_log /usr/local/openresty/nginx/logs/access.log tcp_proxy;
    open_log_file_cache off;

    ## TCP proxy configuration
    upstream backend{
        server 127.0.0.2:1101;# placeholder; any value works, because the Lua code below overrides it
        balancer_by_lua_block {
            -- initialize the balancer
            local balancer = require "ngx.balancer"
            local host = ""
            local port = 0
            host = ngx.ctx.proxy_host
            port = ngx.ctx.proxy_port
            -- set the balancer peer
            local ok, err = balancer.set_current_peer(host, port)
            if not ok then
                ngx.log(ngx.ERR, "failed to set the peer: ", err)
            end
        }
    }


    server {
        preread_by_lua_block{

            local redis = require("resty.redis")
            -- create the Redis client instance
            local redis_instance = redis:new()
            -- set the timeout (milliseconds)
            redis_instance:set_timeout(3000)
            -- connect; configure the Redis IP address, port, password and key here
            local rhost = "123.45.67.89"
            local rport = 6739
            local rpass = "abcdefg"
            local rkey = "proxy:pool"
            local ok, err = redis_instance:connect(rhost, rport)
            if not ok then
                ngx.log(ngx.ERR, "failed to connect to redis: ", err)
                return
            end

            -- if your Redis has no password, remove the line below
            local res, err = redis_instance:auth(rpass)
            local res, err = redis_instance:hkeys(rkey)
            -- stop here if the query failed or the pool is empty
            if not res or #res == 0 then
                ngx.log(ngx.ERR, "no proxy available in redis: ", err)
                return redis_instance:close()
            end
            math.randomseed(tostring(ngx.now()):reverse():sub(1, 6))
            local proxy = res[math.random(#res)]
            local colon_index = string.find(proxy, ":")
            local proxy_ip = string.sub(proxy, 1, colon_index - 1)
            local proxy_port = string.sub(proxy, colon_index + 1)
            ngx.log(ngx.ERR,"redis data = ", proxy_ip, ":", proxy_port);
            ngx.ctx.proxy_host = proxy_ip
            ngx.ctx.proxy_port = proxy_port
            redis_instance:close()
        }
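        # The port below is the local port, i.e. the fixed port hard-coded in the crawler
       listen 0.0.0.0:9976; # listen on the local address and port; when using keepalived, use the keepalived VIP
       proxy_connect_timeout 3s;
       proxy_timeout 10s;
       proxy_pass backend; # forward to the upstream defined above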
    }
}

The places that need to be modified are marked with comments in the configuration file. Specifically:

  • The Redis address, port, password and key. If your Redis has no password, delete the line that sends the password

  • The port of the entry proxy

Once these are configured, we can start it with Docker. The Dockerfile is extremely simple:

FROM openresty/openresty:centos

COPY nginx_redis.conf /usr/local/openresty/nginx/conf/nginx.conf

Then, execute the command to build and run:

docker build --network host -t tunnel_proxy:0.01 .
docker run --name tunnel_proxy --network host -it tunnel_proxy:0.01

After running, the Docker command line will appear to hang. This is normal: it only prints output once a request comes in.

Now, you can quickly write a piece of code to verify it with Requests:

import requests
import time

proxies = {'http': 'http://13.88.220.207:9976'}
for _ in range(10):
    resp = requests.get('http://httpbin.org/ip', proxies=proxies).text
    print(resp)
    time.sleep(1)

The result is shown in the figure below.

This shows that the tunnel proxy has been set up successfully. I have been running it stably for half a year now without a single problem, so you can use it with confidence.
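To judge the result at a glance, you can count how many distinct exit IPs the responses contain. The sketch below assumes the response bodies have the JSON shape that httpbin.org/ip returns (a single origin field). The helper distinct_origins and the sample bodies are made up for illustration.

```python
import json


def distinct_origins(bodies):
    """Return the set of 'origin' values found in httpbin.org/ip response bodies."""
    origins = set()
    for body in bodies:
        origins.add(json.loads(body)["origin"])
    return origins


# Made-up response bodies in the shape httpbin.org/ip returns.
bodies = ['{"origin": "1.1.1.1"}', '{"origin": "2.2.2.2"}', '{"origin": "1.1.1.1"}']
print(len(distinct_origins(bodies)), "distinct exit IPs out of", len(bodies), "requests")
```

If the count stays at 1 across many requests, the requests are probably not going through the tunnel, or the pool currently holds only one proxy.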

Finally, this article was inspired by @萌木盖's article on Jianshu about building a forward proxy with OpenResty, and improves on it. Special thanks to the original author.


If the method in this article helps you save a lot of money on tunnel proxies, please consider spending a small part of it on joining my Knowledge Planet (知识星球) and becoming a Code·Pro member.

By joining the planet, in addition to all the benefits you have in the WeChat group, you will also get:

  1. Priority answering of questions and one-on-one Q&A
  2. Career planning consulting
  3. Interview skills and experience
  4. Exclusive live sharing on a regular basis
  5. Regular lucky draw
  6. ……

青南

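Microsoft Most Valuable Professional (MVP). Author of the books 《Python 爬虫开发,从入门到实战》 and 《左手 MongoDB,右手 Redis——从入门到商业实战》. Independently develops and maintains the open-source project GNE (nearly 2,000 stars).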