Python Scrapy Tunnel Proxy: Four IP Control Scenarios in a Crawler Framework

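How to integrate a Scrapy Spider with the 亿牛云 crawler proxy: forcing an IP switch, holding an IP with Keep-Alive, pinning an IP with Proxy-Tunnel, and the boundary case where Scrapy cannot support Proxy-Tunnel over HTTPS.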

亿牛云 technical team · May 21, 2026 · 2 min read

Integrating Scrapy with the crawler proxy

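Scrapy is an efficient crawling framework in which proxies are configured through a Downloader Middleware. Because Scrapy pools and reuses connections, some scenarios are limited, HTTPS Proxy-Tunnel in particular.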

Environment configuration

export PROXY_HOST=t.16yun.cn
export PROXY_PORT=31111
export PROXY_USERNAME=your-username
export PROXY_PASSWORD=your-password
export TARGET_URL=https://httpbin.org/ip
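
The variables above could be read on the Python side roughly as follows. This is a minimal sketch: `ProxyConfig` and `load_proxy_config` are hypothetical names introduced for illustration, not part of Scrapy or of any 亿牛云 SDK, and the host and port defaults simply mirror the exported values.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ProxyConfig:
    """Hypothetical container for the PROXY_* environment variables."""
    host: str
    port: int
    username: str
    password: str

    @property
    def url(self) -> str:
        # Proxy endpoint without credentials; auth is sent as a header.
        return f"http://{self.host}:{self.port}"


def load_proxy_config(env=os.environ) -> ProxyConfig:
    """Read the PROXY_* variables and fail early if credentials are missing."""
    missing = [k for k in ("PROXY_USERNAME", "PROXY_PASSWORD") if not env.get(k)]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return ProxyConfig(
        host=env.get("PROXY_HOST", "t.16yun.cn"),
        port=int(env.get("PROXY_PORT", "31111")),
        username=env["PROXY_USERNAME"],
        password=env["PROXY_PASSWORD"],
    )
```

Failing early on missing credentials is a design choice of this sketch; it avoids sending requests with placeholder values.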

Basic Spider

import scrapy

class HttpbinSpider(scrapy.Spider):
    name = "httpbin"

    def start_requests(self):
        yield scrapy.Request("https://httpbin.org/ip", callback=self.parse)

    def parse(self, response):
        self.logger.info("status=%s body=%s", response.status, response.text[:200])

Middleware configuration

Enable the proxy middleware in settings.py:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    "tutorial.middlewares.TunnelProxyMiddleware": 543,
}

# Proxy settings
PROXY_HOST = "t.16yun.cn"
PROXY_PORT = "31111"
PROXY_USERNAME = "your-username"
PROXY_PASSWORD = "your-password"
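
Because the PROXY_* values above are ordinary Scrapy settings, a middleware could pick them up through `from_crawler` rather than from the environment. The following is a sketch under that assumption; `SettingsProxyMiddleware` is a hypothetical name, and the class needs no Scrapy import because Scrapy only calls its methods.

```python
import base64


class SettingsProxyMiddleware:
    """Hypothetical variant that reads PROXY_* from Scrapy settings."""

    def __init__(self, host, port, username, password):
        self.proxy_url = f"http://{host}:{port}"
        token = base64.b64encode(f"{username}:{password}".encode()).decode()
        self.auth_header = f"Basic {token}"

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook with the running crawler; settings.get()
        # returns the default when a key is absent.
        settings = crawler.settings
        return cls(
            host=settings.get("PROXY_HOST", "t.16yun.cn"),
            port=settings.get("PROXY_PORT", "31111"),
            username=settings.get("PROXY_USERNAME", ""),
            password=settings.get("PROXY_PASSWORD", ""),
        )

    def process_request(self, request, spider):
        # Route the request through the proxy and attach Basic credentials.
        request.meta["proxy"] = self.proxy_url
        request.headers["Proxy-Authorization"] = self.auth_header
```

Computing the header once in `__init__` avoids re-encoding the credentials on every request.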

Middleware implementation

# middlewares.py
import os
import base64

class TunnelProxyMiddleware:
    def process_request(self, request, spider):
        host = os.getenv("PROXY_HOST", "t.16yun.cn")
        port = os.getenv("PROXY_PORT", "31111")
        user = os.getenv("PROXY_USERNAME", "user")
        pwd = os.getenv("PROXY_PASSWORD", "password")

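        # Scenario A: force an IP switch (new connection for every request).
        # dont_merge_cookies only isolates cookies; asking for the connection to be
        # closed is what keeps it out of the pool (assumes the proxy assigns a new
        # exit IP to each new connection).
        if request.meta.get("force_new_connection"):
            request.meta["dont_merge_cookies"] = True
            request.headers["Connection"] = "close"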

        # Scenario C: Proxy-Tunnel (requests with the same value share one exit IP)
        tunnel = request.meta.get("proxy_tunnel")
        if tunnel:
            request.headers["Proxy-Tunnel"] = tunnel

        # Authentication (HTTP Basic credentials for the proxy)
        auth = base64.b64encode(f"{user}:{pwd}".encode()).decode()
        request.headers["Proxy-Authorization"] = f"Basic {auth}"

        # Route the request through the proxy
        request.meta["proxy"] = f"http://{host}:{port}"
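
As an alternative to building the Proxy-Authorization header by hand, Scrapy's built-in HttpProxyMiddleware also accepts credentials embedded in the proxy URL and derives the header from them. The helper below is a sketch of how such a URL might be assembled; `proxy_url_with_credentials` is a hypothetical name, not a Scrapy API.

```python
from urllib.parse import quote


def proxy_url_with_credentials(host, port, username, password):
    """Build http://user:password@host:port with the credentials percent-encoded."""
    # Encoding protects characters such as '@' or ':' that would otherwise
    # break URL parsing.
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    return f"http://{user}:{pwd}@{host}:{port}"
```

The result would be assigned to `request.meta["proxy"]` in place of the separate header and proxy lines.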

Demo Spider for the four scenarios

import scrapy
import os
import random

class ScenarioDemoSpider(scrapy.Spider):
    name = "scenario_demo"
    custom_settings = {
        "CONCURRENT_REQUESTS": 1,
        "DOWNLOAD_DELAY": 1,
    }

    def start_requests(self):
        target = os.getenv("TARGET_URL", "https://httpbin.org/ip")
        tunnel = os.getenv("PROXY_TUNNEL", "")

        # Scenario A: force an IP switch (3 requests)
        for i in range(3):
            yield scrapy.Request(
                url=target,
                callback=self.parse_result,
                meta={"scenario": "A - forced switch", "num": i + 1, "force_new_connection": True},
                dont_filter=True,
            )

        # Scenario B: hold the IP with Keep-Alive (3 requests)
        for i in range(3):
            yield scrapy.Request(
                url=target,
                callback=self.parse_result,
                meta={"scenario": "B - Keep-Alive", "num": i + 1},
                dont_filter=True,
            )

        # Scenario C-HTTP: pin the IP with Proxy-Tunnel (plain HTTP target)
        tunnel_val = tunnel or str(random.randint(1, 10000))
        for i in range(3):
            yield scrapy.Request(
                url="http://httpbin.org/ip",
                callback=self.parse_result,
                meta={"scenario": "C-HTTP Proxy-Tunnel", "num": i + 1, "proxy_tunnel": tunnel_val},
                dont_filter=True,
            )

    def parse_result(self, response):
        # TextResponse.json() parses the body as JSON (available since Scrapy 2.2).
        data = response.json()
        self.logger.info(
            "[%s] request %d: IP = %s",
            response.meta["scenario"],
            response.meta["num"],
            data.get("origin", "N/A"),
        )
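
To check the logged results against expectations, the exit IPs can be grouped by scenario. If the proxy behaves as described, scenario A should show several distinct IPs, while B and C-HTTP should each show one. The helper below is a small sketch; `summarize_ips` is a hypothetical name, and the addresses in the usage example are made-up values from the documentation range 203.0.113.0/24.

```python
from collections import defaultdict


def summarize_ips(observations):
    """Group observed exit IPs by scenario label.

    observations: iterable of (scenario, ip) pairs, e.g. collected in parse_result.
    Returns {scenario: {"ips": [...], "distinct": n}}.
    """
    grouped = defaultdict(list)
    for scenario, ip in observations:
        grouped[scenario].append(ip)
    return {
        name: {"ips": ips, "distinct": len(set(ips))}
        for name, ips in grouped.items()
    }


if __name__ == "__main__":
    # Made-up sample data, not real proxy output.
    sample = [
        ("A - forced switch", "203.0.113.1"),
        ("A - forced switch", "203.0.113.2"),
        ("B - Keep-Alive", "203.0.113.9"),
        ("B - Keep-Alive", "203.0.113.9"),
    ]
    for name, info in summarize_ips(sample).items():
        print(name, info["distinct"], info["ips"])
```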

Comparison and limitations of the four scenarios

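Scenario | Scrapy support | Notes
--- | --- | ---
A: forced switch | Yes | dont_merge_cookies plus middleware control over connection reuse
B: Keep-Alive | Yes | Scrapy's default behavior; requests to the same site reuse the connection
C-HTTP: Proxy-Tunnel | Yes | The middleware adds the Proxy-Tunnel request header
C-HTTPS: Proxy-Tunnel | ❌ Not supported | Scrapy's HTTP/1.1 connector cannot inject custom headers during the CONNECT phase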

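Limitation of HTTPS Proxy-Tunnel: Scrapy uses HTTP11DownloadHandler by default, and its Twisted-based tunneling does not expose a way to inject custom headers during the CONNECT phase. For an HTTPS target the Proxy-Tunnel header therefore never reaches the proxy. To pin an IP over HTTPS, consider switching to requests (with a custom Adapter), httpx, or aiohttp (with proxy_headers).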
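
To make the CONNECT-phase requirement concrete, here is a dependency-free sketch. It uses the standard library's `http.client` instead of the libraries named above, purely to avoid third-party imports. `connect_headers` and `fetch_https_with_tunnel` are hypothetical helper names, and the assumption that the proxy reads Proxy-Tunnel from the CONNECT request follows from the limitation described above.

```python
import base64
import http.client


def connect_headers(username, password, tunnel_id):
    """Headers that must reach the proxy in the CONNECT request itself."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return {
        "Proxy-Authorization": f"Basic {token}",
        "Proxy-Tunnel": str(tunnel_id),
    }


def fetch_https_with_tunnel(proxy_host, proxy_port, target_host, path, headers):
    """GET https://target_host/path via the proxy, sending `headers` on CONNECT."""
    conn = http.client.HTTPSConnection(proxy_host, int(proxy_port), timeout=15)
    try:
        # set_tunnel() attaches these headers to the CONNECT request only;
        # they are not forwarded to the target site.
        conn.set_tunnel(target_host, 443, headers=headers)
        conn.request("GET", path)
        response = conn.getresponse()
        return response.status, response.read().decode("utf-8", "replace")
    finally:
        conn.close()


# Example call (performs real network I/O, so it is left commented out):
# headers = connect_headers("your-username", "your-password", 42)
# print(fetch_https_with_tunnel("t.16yun.cn", 31111, "httpbin.org", "/ip", headers))
```

Reusing the same `tunnel_id` across calls is what would keep the exit IP fixed under this assumption.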

HTTPS Proxy-Tunnel support across languages

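Framework | HTTPS Proxy-Tunnel | Implementation
--- | --- | ---
Python requests | Yes | Custom HTTPAdapter.proxy_headers()
Python httpx | Yes | httpx.Proxy(headers=...)
Python aiohttp | Yes | proxy_headers parameter
Python Scrapy | No | Cannot inject headers during the CONNECT phase
Node.js axios | Yes | https-proxy-agent options
Go net/http | Yes | Transport.ProxyConnectHeader
Java OkHttp | Yes | Authenticator + interceptor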
