Python Scrapy Tunnel Proxy: Four IP Control Scenarios

Scrapy + Proxy Middleware

Scrapy integrates with proxy services through a downloader middleware. However, the way it manages proxy connections rules out some scenarios, most notably Proxy-Tunnel over HTTPS.

Environment Setup

export PROXY_HOST=t.16yun.cn
export PROXY_PORT=31111
export PROXY_USERNAME=your-username
export PROXY_PASSWORD=your-password
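
The variables above can also be read and validated in one place before a crawl starts. The following is a minimal sketch, not part of the original project: `ProxyConfig` and `load_proxy_config` are hypothetical names, and failing fast on a missing variable is a design choice assumed here rather than something the proxy service requires.

```python
import os
from typing import Mapping, NamedTuple


class ProxyConfig(NamedTuple):
    """Hypothetical container for the four proxy settings."""
    host: str
    port: int
    username: str
    password: str


def load_proxy_config(env: Mapping[str, str] = os.environ) -> ProxyConfig:
    """Read the PROXY_* variables and raise if any of them is missing or empty."""
    required = ("PROXY_HOST", "PROXY_PORT", "PROXY_USERNAME", "PROXY_PASSWORD")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("missing proxy settings: " + ", ".join(missing))
    return ProxyConfig(
        host=env["PROXY_HOST"],
        port=int(env["PROXY_PORT"]),
        username=env["PROXY_USERNAME"],
        password=env["PROXY_PASSWORD"],
    )
```

Raising early tends to be easier to debug than falling back to placeholder credentials, which would only surface later as authentication errors from the proxy.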

Middleware

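# middlewares.py
import base64
import os


class TunnelProxyMiddleware:
    """Send every request through the tunnel proxy and attach its control headers."""

    def process_request(self, request, spider):
        # Proxy endpoint and credentials come from the environment.
        host = os.getenv("PROXY_HOST", "t.16yun.cn")
        port = os.getenv("PROXY_PORT", "31111")
        user = os.getenv("PROXY_USERNAME", "user")
        pwd = os.getenv("PROXY_PASSWORD", "password")

        # Scenario C: pass the spider's tunnel value to the proxy via Proxy-Tunnel.
        tunnel = request.meta.get("proxy_tunnel")
        if tunnel:
            request.headers["Proxy-Tunnel"] = str(tunnel)

        # Scenario A: honor the spider's force_new flag.
        # Assumption: closing the connection after each request makes the proxy
        # assign a new exit IP; confirm this against your provider's documentation.
        if request.meta.get("force_new"):
            request.headers["Connection"] = "close"

        # HTTP Basic credentials for the proxy itself.
        auth = base64.b64encode(f"{user}:{pwd}".encode()).decode()
        request.headers["Proxy-Authorization"] = f"Basic {auth}"
        request.meta["proxy"] = f"http://{host}:{port}"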
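
The middleware only takes effect once it is registered in the project settings. A minimal sketch of that registration follows; the package name `myproject` is a placeholder, and the priority 543 is simply a commonly used value, not one taken from the original project.

```python
# settings.py (sketch; "myproject" is a placeholder package name)
DOWNLOADER_MIDDLEWARES = {
    # Lower numbers run process_request() earlier. 543 places this middleware
    # ahead of Scrapy's built-in HttpProxyMiddleware, which sits at 750.
    "myproject.middlewares.TunnelProxyMiddleware": 543,
}
```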

Scenario Demo Spider

import json
import os
import random

import scrapy


class ScenarioSpider(scrapy.Spider):
    name = "scenario_demo"
    # One request at a time, roughly one second apart, so results are easy to compare.
    custom_settings = {"CONCURRENT_REQUESTS": 1, "DOWNLOAD_DELAY": 1}

    def start_requests(self):
        target = os.getenv("TARGET_URL", "https://httpbin.org/ip")
        tunnel = os.getenv("PROXY_TUNNEL", "")

        # Scenario A: mark each request with force_new.
        for i in range(3):
            yield scrapy.Request(target, callback=self.parse_result,
                meta={"scene": "A - Force New", "n": i+1, "force_new": True},
                dont_filter=True)
        # Scenario B: no extra flags, so Scrapy's default keep-alive behavior applies.
        for i in range(3):
            yield scrapy.Request(target, callback=self.parse_result,
                meta={"scene": "B - Keep-Alive", "n": i+1}, dont_filter=True)

        # Scenario C (plain HTTP only): all three requests share one Proxy-Tunnel value.
        tv = tunnel or str(random.randint(1, 10000))
        for i in range(3):
            yield scrapy.Request("http://httpbin.org/ip", callback=self.parse_result,
                meta={"scene": "C-HTTP Tunnel", "n": i+1, "proxy_tunnel": tv},
                dont_filter=True)

    def parse_result(self, response):
        # httpbin.org/ip answers with JSON of the form {"origin": "<client IP>"}.
        d = json.loads(response.text)
        self.logger.info("【%s】#%d: IP=%s", response.meta["scene"], response.meta["n"], d.get("origin", ""))
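
To compare scenarios, it can help to count how many distinct exit IPs each one produced. The helper below is a hypothetical sketch, not part of the spider: `distinct_ips_by_scene` is an assumed name, and it expects `(scene, ip)` pairs collected from the log lines above.

```python
from collections import defaultdict


def distinct_ips_by_scene(observations):
    """Map each scene name to the number of distinct IPs observed for it."""
    seen = defaultdict(set)
    for scene, ip in observations:
        seen[scene].add(ip)
    return {scene: len(ips) for scene, ips in seen.items()}
```

As a rough guide, a scenario that rotates the IP on every request should yield a count equal to its number of requests, while a scenario that pins the IP should yield a count of 1. What the proxy actually does depends on the provider and plan.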

Limitations

Scenario	Scrapy Support	Notes
A: Force new	Yes	Middleware controls connection pool
B: Keep-Alive	Yes	Default Scrapy behavior
C-HTTP: Proxy-Tunnel	Yes	Add header in middleware
C-HTTPS: Proxy-Tunnel	No	Scrapy's Twisted-based HTTP/1.1 download handler builds the CONNECT request itself and does not forward custom headers such as Proxy-Tunnel in it

For HTTPS Proxy-Tunnel, use requests (custom HTTPAdapter), httpx (httpx.Proxy(headers=...)), or aiohttp (proxy_headers) instead.
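
To make concrete what "injecting CONNECT headers" means, here is a minimal sketch using Python's standard-library `http.client` in place of the libraries listed above. It is an illustration under assumptions, not the provider's recommended client: the function names are hypothetical, and the proxy is assumed to accept HTTP Basic credentials and a `Proxy-Tunnel` header on the CONNECT request.

```python
import base64
import http.client


def connect_headers(user: str, pwd: str, tunnel) -> dict:
    """Headers meant for the proxy, sent with the CONNECT request itself."""
    token = base64.b64encode(f"{user}:{pwd}".encode()).decode()
    return {"Proxy-Authorization": f"Basic {token}", "Proxy-Tunnel": str(tunnel)}


def tunnelled_connection(proxy_host, proxy_port, target_host, headers):
    """Prepare (but do not yet open) an HTTPS connection routed through the proxy."""
    conn = http.client.HTTPSConnection(proxy_host, int(proxy_port), timeout=15)
    # set_tunnel() makes http.client send "CONNECT target_host:443" to the proxy
    # and include `headers` in that CONNECT request, before TLS starts.
    conn.set_tunnel(target_host, 443, headers=headers)
    return conn


def fetch_ip(conn) -> str:
    """Request /ip through the tunnel; the network is first touched here."""
    conn.request("GET", "/ip")
    return conn.getresponse().read().decode()
```

A call such as `fetch_ip(tunnelled_connection("t.16yun.cn", 31111, "httpbin.org", connect_headers(user, pwd, 42)))` would then send the tunnel value to the proxy rather than to the target site, which is the step Scrapy's default handler does not expose.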

HTTPS Proxy-Tunnel Support by Framework

Framework	HTTPS Tunnel	Method
Python requests	Yes	Custom `HTTPAdapter.proxy_headers()`
Python httpx	Yes	`httpx.Proxy(headers=...)`
Python aiohttp	Yes	`proxy_headers` parameter
Python Scrapy	No	Can't inject CONNECT headers
Node.js axios	Yes	`https-proxy-agent` options
Go net/http	Yes	`Transport.ProxyConnectHeader`
