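Python Scrapy tunnel proxy: four IP-control scenarios in a crawler framework
Integrate the 亿牛云 crawler proxy into a Scrapy spider to force an IP switch, hold an IP over Keep-Alive, and pin an IP with Proxy-Tunnel, and see where the boundary lies: Scrapy cannot send Proxy-Tunnel for HTTPS targets.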
亿牛云 Technical Team · May 21, 2026 · 2 min read
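Integrating the crawler proxy with Scrapy
Scrapy is a high-throughput crawling framework, and proxies are configured through a Downloader Middleware. Because of how Scrapy pools and reuses connections, some scenarios are restricted, most notably Proxy-Tunnel over HTTPS.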
Environment setup

    export PROXY_HOST=t.16yun.cn
    export PROXY_PORT=31111
    export PROXY_USERNAME=your-username
    export PROXY_PASSWORD=your-password
    export TARGET_URL=https://httpbin.org/ip

Basic spider

    import scrapy


    class HttpbinSpider(scrapy.Spider):
        name = "httpbin"

        def start_requests(self):
            yield scrapy.Request("https://httpbin.org/ip", callback=self.parse)

        def parse(self, response):
            # Log the status code and the first 200 characters of the body
            self.logger.info("status=%s body=%s", response.status, response.text[:200])

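Middleware configuration
Enable the proxy middleware in settings.py:

    # settings.py
    DOWNLOADER_MIDDLEWARES = {
        "tutorial.middlewares.TunnelProxyMiddleware": 543,
    }

    # Proxy configuration. Note that TunnelProxyMiddleware, as written below,
    # reads the PROXY_* environment variables rather than these settings.
    PROXY_HOST = "t.16yun.cn"
    PROXY_PORT = "31111"
    PROXY_USERNAME = "your-username"
    PROXY_PASSWORD = "your-password"
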
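Middleware implementation

    # middlewares.py
    import base64
    import os


    class TunnelProxyMiddleware:
        """Route every request through the tunnel proxy and apply per-request IP control."""

        def process_request(self, request, spider):
            host = os.getenv("PROXY_HOST", "t.16yun.cn")
            port = os.getenv("PROXY_PORT", "31111")
            user = os.getenv("PROXY_USERNAME", "user")
            pwd = os.getenv("PROXY_PASSWORD", "password")

            # Scenario A: force an IP switch (a new connection for every request)
            if request.meta.get("force_new_connection"):
                # dont_merge_cookies only stops cookies from carrying over between
                # requests; on its own it does not affect connection reuse.
                request.meta["dont_merge_cookies"] = True
                # Assumption: asking for the connection to be closed after the
                # response keeps it from being reused, so the next request
                # has to reconnect to the proxy.
                request.headers["Connection"] = "close"

            # Scenario C: Proxy-Tunnel (the spider reuses one value to keep one IP)
            tunnel = request.meta.get("proxy_tunnel")
            if tunnel:
                request.headers["Proxy-Tunnel"] = tunnel

            # Authentication (HTTP Basic)
            auth = base64.b64encode(f"{user}:{pwd}".encode()).decode()
            request.headers["Proxy-Authorization"] = f"Basic {auth}"

            # Set the proxy
            request.meta["proxy"] = f"http://{host}:{port}"
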
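
The settings.py fragment earlier defines PROXY_HOST and friends, while the middleware above takes its values from environment variables. As a minimal sketch, assuming you would rather keep the proxy configuration in settings.py, a variant could pick the values up through Scrapy's `from_crawler` hook. The class name `SettingsTunnelProxyMiddleware` is hypothetical, and the sketch leaves out scenario A to stay short.

```python
import base64


class SettingsTunnelProxyMiddleware:
    """Hypothetical variant of TunnelProxyMiddleware configured from settings.py."""

    def __init__(self, host, port, user, pwd):
        # Build the proxy URL and the Basic auth header once, not per request.
        self.proxy_url = f"http://{host}:{port}"
        token = base64.b64encode(f"{user}:{pwd}".encode()).decode()
        self.auth_header = f"Basic {token}"

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls from_crawler() when it builds the middleware;
        # crawler.settings exposes the values defined in settings.py.
        settings = crawler.settings
        return cls(
            host=settings.get("PROXY_HOST", "t.16yun.cn"),
            port=settings.get("PROXY_PORT", "31111"),
            user=settings.get("PROXY_USERNAME", "user"),
            pwd=settings.get("PROXY_PASSWORD", "password"),
        )

    def process_request(self, request, spider=None):
        # Scenario C: forward the tunnel value chosen by the spider, if any.
        tunnel = request.meta.get("proxy_tunnel")
        if tunnel:
            request.headers["Proxy-Tunnel"] = str(tunnel)
        request.headers["Proxy-Authorization"] = self.auth_header
        request.meta["proxy"] = self.proxy_url
```

To try it, DOWNLOADER_MIDDLEWARES would have to point at this class instead of TunnelProxyMiddleware. Reading from settings keeps all proxy values in one file and lets a spider override them through custom_settings.
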
Demo spider for the four scenarios

    import json
    import os
    import random

    import scrapy


    class ScenarioDemoSpider(scrapy.Spider):
        name = "scenario_demo"
        custom_settings = {
            "CONCURRENT_REQUESTS": 1,
            "DOWNLOAD_DELAY": 1,
        }

        def start_requests(self):
            target = os.getenv("TARGET_URL", "https://httpbin.org/ip")
            tunnel = os.getenv("PROXY_TUNNEL", "")

            # Scenario A: force an IP switch (3 requests)
            for i in range(3):
                yield scrapy.Request(
                    url=target,
                    callback=self.parse_result,
                    meta={"scenario": "A - forced switch", "num": i + 1, "force_new_connection": True},
                    dont_filter=True,
                )

            # Scenario B: hold the IP over Keep-Alive (3 requests)
            for i in range(3):
                yield scrapy.Request(
                    url=target,
                    callback=self.parse_result,
                    meta={"scenario": "B - Keep-Alive", "num": i + 1},
                    dont_filter=True,
                )

            # Scenario C-HTTP: pin the IP with Proxy-Tunnel (plain-HTTP target);
            # falls back to a random tunnel value when PROXY_TUNNEL is unset
            tunnel_val = tunnel or str(random.randint(1, 10000))
            for i in range(3):
                yield scrapy.Request(
                    url="http://httpbin.org/ip",
                    callback=self.parse_result,
                    meta={"scenario": "C-HTTP Proxy-Tunnel", "num": i + 1, "proxy_tunnel": tunnel_val},
                    dont_filter=True,
                )

        def parse_result(self, response):
            # httpbin.org/ip returns JSON with the caller's IP under "origin"
            data = json.loads(response.text)
            self.logger.info(
                "[%s] request %d: IP = %s",
                response.meta["scenario"],
                response.meta["num"],
                data.get("origin", "N/A"),
            )

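
Going by the scenario names, scenario A should log a different IP on each request, while B and C-HTTP should repeat the same one. A small helper makes that easier to check than scanning the log by eye. The sketch below is only an illustration: `summarize_ips` is a hypothetical name, and the sample addresses are made-up documentation addresses, not real proxy output.

```python
from collections import defaultdict


def summarize_ips(observations):
    """Group (scenario, ip) pairs and count the distinct IPs seen per scenario."""
    seen = defaultdict(list)
    for scenario, ip in observations:
        seen[scenario].append(ip)
    return {
        scenario: {"requests": len(ips), "distinct_ips": len(set(ips))}
        for scenario, ips in seen.items()
    }


if __name__ == "__main__":
    # Made-up sample data using documentation-only addresses (RFC 5737).
    sample = [
        ("A", "192.0.2.1"), ("A", "192.0.2.2"), ("A", "192.0.2.3"),
        ("B", "198.51.100.7"), ("B", "198.51.100.7"), ("B", "198.51.100.7"),
    ]
    for scenario, stats in summarize_ips(sample).items():
        print(scenario, stats)
```

In the demo spider, one way to feed it would be to collect `(response.meta["scenario"], data.get("origin"))` pairs in `parse_result` and summarize them when the spider closes. A scenario is behaving as intended when `distinct_ips` equals `requests` for A and equals 1 for B and C-HTTP.
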
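Scenario comparison and limitations
| Scenario | Scrapy support | Notes |
|---|---|---|
| A: forced switch | ✅ | dont_merge_cookies plus middleware-level control over connection reuse |
| B: Keep-Alive | ✅ | Scrapy's default behavior: requests to the same site reuse the connection |
| C-HTTP: Proxy-Tunnel | ✅ | The middleware adds the Proxy-Tunnel request header |
| C-HTTPS: Proxy-Tunnel | ❌ Not supported | Scrapy's HTTP/1.1 connector cannot inject custom headers at the CONNECT stage |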
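The HTTPS Proxy-Tunnel limitation: Scrapy uses HTTP11DownloadHandler by default, and its Twisted-based tunneling code offers no way to add custom headers to the CONNECT request. For an HTTPS target, everything sent after CONNECT is encrypted inside the tunnel, so a Proxy-Tunnel header set on the request itself never reaches the proxy. To pin an IP over HTTPS, switch to requests (with a custom HTTPAdapter), httpx (httpx.Proxy(headers=...)), or aiohttp (the proxy_headers argument).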
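
To make "injecting a header at the CONNECT stage" concrete, here is a minimal sketch that uses only the standard library's `http.client`. It is a stand-in chosen for illustration, not one of the clients recommended above. `HTTPSConnection.set_tunnel()` accepts a `headers` mapping that is sent with the CONNECT request. The helper names `connect_headers` and `fetch_via_tunnel` are hypothetical, and the sketch assumes the proxy reads Proxy-Tunnel and Proxy-Authorization from CONNECT.

```python
import base64
import http.client


def connect_headers(user, pwd, tunnel):
    """Build the headers that go to the proxy in the CONNECT request itself."""
    token = base64.b64encode(f"{user}:{pwd}".encode()).decode()
    return {
        "Proxy-Authorization": f"Basic {token}",
        "Proxy-Tunnel": str(tunnel),
    }


def fetch_via_tunnel(proxy_host, proxy_port, target_host, path, headers):
    """GET https://<target_host><path> through the proxy, sending headers on CONNECT."""
    # The connection is opened to the proxy, not to the target.
    conn = http.client.HTTPSConnection(proxy_host, proxy_port, timeout=15)
    # set_tunnel() makes http.client issue "CONNECT target_host:443" first;
    # the headers mapping is sent with that CONNECT request.
    conn.set_tunnel(target_host, 443, headers=headers)
    try:
        conn.request("GET", path)
        return conn.getresponse().read().decode()
    finally:
        conn.close()
```

With the environment variables from the setup section, a call might look like `fetch_via_tunnel("t.16yun.cn", 31111, "httpbin.org", "/ip", connect_headers(user, pwd, 12345))`. If the proxy honors the header as assumed, repeating the call with the same tunnel value should report the same IP.
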
HTTPS Proxy-Tunnel support across languages
| Library / framework | HTTPS Proxy-Tunnel | How |
|---|---|---|
| Python requests | ✅ | Custom HTTPAdapter.proxy_headers() |
| Python httpx | ✅ | httpx.Proxy(headers=...) |
| Python aiohttp | ✅ | proxy_headers argument |
| Python Scrapy | ❌ | Cannot inject headers at the CONNECT stage |
| Node.js axios | ✅ | https-proxy-agent options |
| Go net/http | ✅ | Transport.ProxyConnectHeader |
| Java OkHttp | ✅ | Authenticator + interceptor |