Issue Playbook: Distorted Proxy Health Scores in Proxy Routing

Context and Problem Definition

This article targets one concrete production failure: distorted proxy health scores that cause wrong routing decisions. It is not broad guidance; it is a specific fix path for Scrapy deployments, with clear acceptance criteria and rollback boundaries.

Typical symptom: high-priority jobs are routed to high-latency or high-ban nodes, which degrades the stable lane and puts the SLA at risk. The root cause is usually an inconsistent routing policy, not proxy quality alone.

Issue signals used for this exact problem:

scrapy/scrapy#747: Support for socks5 proxy (comments: 54)
scrapy/scrapy#7060: Fix flaky test_download_with_proxy_https_timeout() (comments: 25)
scrapy-plugins/scrapy-splash#99: Proxy connection is being refused (comments: 15)

Insight Framework

Static-threshold routing cannot represent time-varying proxy quality.
Scoring model must combine latency, error rate, block signal, and freshness.
Health score must be tied to traffic priority to avoid premium-node misuse.
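
The freshness requirement can be illustrated with a small sketch. It assumes each proxy records how long ago its score was last updated. The name `effective_score`, the 60-second freshness window, and the halving rule are illustrative assumptions, not part of the scoring model described here.

```python
def effective_score(score: float, age_s: float, fresh_window_s: float = 60.0) -> float:
    """Discount a health score that has not been refreshed recently.

    Hypothetical helper: inside the freshness window the score is trusted as-is;
    beyond it, the score halves for every additional window of staleness.
    """
    if age_s <= fresh_window_s:
        return score
    overdue_windows = (age_s - fresh_window_s) / fresh_window_s
    return score * 0.5 ** overdue_windows


print(effective_score(90.0, age_s=30.0))   # fresh sample: score is returned unchanged
print(effective_score(90.0, age_s=120.0))  # one full window overdue: score is halved
```

With a discount like this, a node that scored well an hour ago cannot outrank a node that was probed a few seconds ago.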

Method Path

Build EWMA health scores with minute-level decay.
Split proxy pool into stable lane and exploration lane.
Allow high-priority traffic only on stable lane.
Run 60-second probe jobs to refresh scores and trigger route correction.
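
One way to read "minute-level decay" is to treat the EWMA alpha as the weight a new sample receives after exactly one minute, and to scale that weight when probes arrive earlier or later. The sketch below follows that interpretation; the function name `ewma_update` and the compounding rule are assumptions made for illustration.

```python
def ewma_update(prev_score: float, sample: float, elapsed_s: float,
                alpha_per_minute: float = 0.25) -> float:
    """Blend a new sample into the previous score, weighting by elapsed time.

    Assumption: alpha_per_minute is the weight of a sample that arrives 60 s
    after the previous one. For other gaps, the per-minute retention
    (1 - alpha) is compounded over the elapsed minutes.
    """
    minutes = max(elapsed_s, 0.0) / 60.0
    retention = (1.0 - alpha_per_minute) ** minutes
    return retention * prev_score + (1.0 - retention) * sample


# A 60 s gap reproduces the plain EWMA step: 0.75 * 80 + 0.25 * 40
print(ewma_update(prev_score=80.0, sample=40.0, elapsed_s=60.0))
```

A longer gap gives the new sample more weight, so a proxy that was not probed for several minutes does not keep an outdated score.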

Architecture and Data Flow

Ingress Queue -> Route Selector -> Stable Lane / Explore Lane
                    |                   |
                    v                   v
              Score Engine <----- Probe Scheduler
                    |
                    v
               Correction Planner

Operational constraints:

Score refresh interval must stay within 60 seconds.
High-priority jobs cannot enter nodes below threshold.
Exploration traffic ratio must be hard-limited to protect stable lane.
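
The hard limit on exploration traffic can be enforced with a counter rather than a random draw, so the cap holds at every point in time and not only on average. The class below is a minimal sketch; `ExploreBudget` and its method names are hypothetical.

```python
class ExploreBudget:
    """Counter-based cap on the share of requests sent to the exploration lane."""

    def __init__(self, max_ratio: float = 0.12):
        self.max_ratio = max_ratio
        self.total = 0      # all requests seen so far
        self.explored = 0   # requests sent to the exploration lane

    def choose_lane(self, priority: str) -> str:
        self.total += 1
        # High-priority traffic never explores; other traffic explores only
        # while doing so keeps explored / total at or below the cap.
        if priority != "high" and self.explored + 1 <= self.max_ratio * self.total:
            self.explored += 1
            return "explore"
        return "stable"


budget = ExploreBudget(max_ratio=0.12)
lanes = [budget.choose_lane("low") for _ in range(100)]
print(lanes.count("explore"), "of", budget.total, "requests explored")
```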

Configuration Matrix

Config	Recommended Value	Why	Bad Pattern
`HEALTH_EWMA_ALPHA`	0.25	balance fresh samples and history	only use last request
`LATENCY_WEIGHT`	0.35	include latency in core score	rank by success rate only
`ERROR_WEIGHT`	0.40	penalize failure aggressively	equal weights for all signals
`BAN_WEIGHT`	0.25	capture anti-bot blocks	ignore 403/429 signals
`STABLE_LANE_THRESHOLD`	78	protect high-priority quality	same threshold for all traffic
`EXPLORE_TRAFFIC_RATIO`	0.12	discover new nodes continuously	no cap on exploration
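
The matrix can be loaded into a validated object so that a bad value fails at startup instead of skewing scores silently. This is a sketch only: the class name `RoutingConfig`, the `from_mapping` helper, and the rule that the three signal weights sum to 1.0 are assumptions (the recommended values happen to satisfy that rule).

```python
from dataclasses import dataclass, fields
from typing import Mapping


@dataclass(frozen=True)
class RoutingConfig:
    health_ewma_alpha: float = 0.25
    latency_weight: float = 0.35
    error_weight: float = 0.40
    ban_weight: float = 0.25
    stable_lane_threshold: float = 78.0
    explore_traffic_ratio: float = 0.12

    def __post_init__(self) -> None:
        weight_sum = self.latency_weight + self.error_weight + self.ban_weight
        if abs(weight_sum - 1.0) > 1e-9:  # assumed invariant, see lead-in
            raise ValueError(f"signal weights must sum to 1.0, got {weight_sum}")
        if not 0.0 < self.health_ewma_alpha <= 1.0:
            raise ValueError("HEALTH_EWMA_ALPHA must be in (0, 1]")
        if not 0.0 <= self.stable_lane_threshold <= 100.0:
            raise ValueError("STABLE_LANE_THRESHOLD must be in [0, 100]")
        if not 0.0 <= self.explore_traffic_ratio < 1.0:
            raise ValueError("EXPLORE_TRAFFIC_RATIO must be in [0, 1)")

    @classmethod
    def from_mapping(cls, settings: Mapping[str, float]) -> "RoutingConfig":
        """Read upper-case keys such as HEALTH_EWMA_ALPHA; fall back to defaults."""
        values = {
            f.name: float(settings.get(f.name.upper(), f.default))
            for f in fields(cls)
        }
        return cls(**values)


config = RoutingConfig.from_mapping({"STABLE_LANE_THRESHOLD": 80})
print(config)
```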

Key Code Snippets

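# routing/health_score.py
HEALTH_EWMA_ALPHA = 0.25  # weight of the newest sample, as in the configuration matrix


def compute_health_score(latency_ms, error_rate, ban_rate, prev_score):
    """Return the EWMA-smoothed health score on a 0-100 scale."""
    # error_rate and ban_rate are assumed to be fractions in [0, 1]; latency is in ms.
    instant = 100 - (latency_ms * 0.03) - (error_rate * 40) - (ban_rate * 50)
    instant = max(0, min(100, instant))  # clamp before smoothing
    return round(HEALTH_EWMA_ALPHA * instant + (1 - HEALTH_EWMA_ALPHA) * prev_score, 2)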

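# routing/selector.py
STABLE_LANE_THRESHOLD = 78  # minimum score for high-priority traffic


def select_proxy(candidates, priority: str):
    """Return the highest-scoring eligible proxy, or None if none qualifies."""
    if priority == "high":
        lane = [p for p in candidates if p.score >= STABLE_LANE_THRESHOLD]
    else:
        lane = candidates
    # max() picks the best proxy without sorting the caller's list in place.
    return max(lane, key=lambda p: p.score, default=None)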

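# routing/probe_scheduler.py
from routing.health_score import compute_health_score


async def probe_cycle(pool):
    """Probe every proxy once and fold the result into its health score."""
    for proxy in pool:
        # run_probe is not shown here; it is expected to return an object
        # with latency_ms, error_rate, and ban_rate attributes.
        metrics = await run_probe(proxy)
        proxy.score = compute_health_score(
            latency_ms=metrics.latency_ms,
            error_rate=metrics.error_rate,
            ban_rate=metrics.ban_rate,
            prev_score=proxy.score,
        )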
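
Probing proxies one at a time may not finish inside the 60-second refresh window once the pool grows. A bounded-concurrency variant is sketched below, using only the standard `asyncio` library. The function name `probe_all`, the concurrency limit, and the per-probe timeout are illustrative assumptions; `probe` stands in for whatever coroutine performs the actual check.

```python
import asyncio


async def probe_all(pool, probe, max_concurrency: int = 20, timeout_s: float = 10.0):
    """Run probe(proxy) for every proxy with a concurrency cap and a timeout.

    Returns a list of (proxy, result) pairs in pool order. When a probe fails
    or times out, result is the exception instance instead of the metrics.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def probe_one(proxy):
        async with semaphore:
            try:
                return proxy, await asyncio.wait_for(probe(proxy), timeout_s)
            except Exception as exc:  # includes the timeout raised by wait_for
                return proxy, exc

    return await asyncio.gather(*(probe_one(p) for p in pool))
```

The caller can then update scores for successful probes and treat exceptions as failed samples, which keeps one slow proxy from stalling the whole cycle.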

Failure Cases and Troubleshooting

Failure scenario: after a routing upgrade, a static success-rate ranking incorrectly promoted short-lived, low-latency nodes.

Troubleshooting sequence:

Validate that score inputs include ban_rate and data freshness.
Ensure high-priority jobs are restricted to the stable lane.
Check whether exploration traffic exceeds its ratio cap and spills into the stable lane.
Compare wrong_route_ratio and SLA pass rate before and after the correction.
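
The last step needs a concrete definition of wrong_route_ratio. The sketch below assumes one: the share of high-priority requests that were routed to a proxy scoring below the stable-lane threshold at routing time. The record fields `priority` and `proxy_score` are hypothetical names for whatever the routing log actually stores.

```python
def wrong_route_ratio(records, threshold: float = 78.0) -> float:
    """Share of high-priority requests routed to a below-threshold proxy."""
    high = [r for r in records if r["priority"] == "high"]
    if not high:
        return 0.0
    wrong = sum(1 for r in high if r["proxy_score"] < threshold)
    return wrong / len(high)


routing_log = [
    {"priority": "high", "proxy_score": 91.0},
    {"priority": "high", "proxy_score": 64.5},  # below threshold: wrong route
    {"priority": "high", "proxy_score": 80.2},
    {"priority": "high", "proxy_score": 78.0},
    {"priority": "low", "proxy_score": 40.0},   # ignored: not high priority
]
print(wrong_route_ratio(routing_log))
```

Running the same function over logs from before and after the correction gives a like-for-like comparison.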

Performance Metrics and Load Testing

Load tests should cover baseline, peak, and anti-bot escalation profiles.

Acceptance thresholds:

wrong_route_ratio <= 2%
high_priority_success_rate >= 95%
latency_p95 <= 1.9s
proxy_switch_jitter reduced by >= 30%
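
These thresholds can be turned into an automated gate for the regression load test. The sketch below assumes metrics are collected into plain dictionaries for a baseline run and a candidate run; the function name `check_acceptance` and the key names (including `latency_p95_s` in seconds) are illustrative.

```python
def check_acceptance(before: dict, after: dict) -> list:
    """Return a list of failed acceptance checks; an empty list means pass."""
    failures = []
    if after["wrong_route_ratio"] > 0.02:
        failures.append("wrong_route_ratio above 2%")
    if after["high_priority_success_rate"] < 0.95:
        failures.append("high_priority_success_rate below 95%")
    if after["latency_p95_s"] > 1.9:
        failures.append("latency_p95 above 1.9s")
    # Jitter is judged relative to the baseline run, not as an absolute value.
    baseline = before["proxy_switch_jitter"]
    reduction = (baseline - after["proxy_switch_jitter"]) / baseline if baseline > 0 else 0.0
    if reduction < 0.30:
        failures.append("proxy_switch_jitter reduced by less than 30%")
    return failures


baseline_run = {"proxy_switch_jitter": 10.0}
candidate_run = {
    "wrong_route_ratio": 0.015,
    "high_priority_success_rate": 0.97,
    "latency_p95_s": 1.7,
    "proxy_switch_jitter": 6.0,
}
print(check_acceptance(baseline_run, candidate_run))
```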

Vendor Comparison and 16Yun Positioning

Only issue-relevant capabilities are kept here:

API Proxy: Whitelist Management, RESTful API, Multi-Billing Models
Crawler Tunnel Proxy: Cross-IDC Architecture, Millisecond Detection, Auto IP Switching
Dedicated Proxy: Exclusive Dedicated IP, High Security Isolation, Low-Latency Response

This issue needs proxy orchestration that is observable and correctable; a mix of 16Yun tunnel and dedicated proxies is the better fit for layered routing.

Rollout Checklist

single policy entry implemented in middleware
key configs split by priority and environment
regression load test passed all thresholds
rollback can be executed within 10 minutes
alerts set for 429/403/latency thresholds
change audit log recorded for this rollout