Issue Playbook: scrapy-splash recursive crawl using CrawlS in 429

Context and Problem Definition

This article targets one concrete production failure: 429 storms that cause retry amplification and cost spikes. It is not general guidance but a concrete fix path for Scrapy, with explicit acceptance criteria and rollback boundaries.

Typical symptom: repeated 429 bursts fill the retry queues, starving normal traffic and raising the cost per valid record. The root cause is usually an inconsistent retry policy (429 handled like any other error), not just proxy quality.

Issue signals used for this exact problem:

scrapy-plugins/scrapy-splash#92: scrapy-splash recursive crawl using CrawlSpider not working (comments: 36)
scrapy/scrapy#7060: Fix flaky test_download_with_proxy_https_timeout() (comments: 25)
scrapy/scrapyd#543: New jobs stuck in pending state (comments: 23)

Insight Framework

A 429 is not a network failure; it is a server-side throttling signal.
Unbudgeted retries amplify short throttling into systemic congestion.
Control retry volume first, then consider proxy scaling.
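
To make the amplification claim concrete, here is a rough back-of-the-envelope model. It assumes every attempt is throttled independently with the same probability and that each 429 is retried immediately up to a fixed limit; both assumptions and the function name `expected_attempts` are illustrative, not taken from any crawler.

```python
def expected_attempts(p_throttled: float, max_retries: int) -> float:
    """Expected number of requests sent per original request.

    Attempt k (0-based) is only sent if the k previous attempts were all
    throttled, so the expectation is the geometric sum
    1 + p + p^2 + ... + p^max_retries.
    """
    return sum(p_throttled ** k for k in range(max_retries + 1))


if __name__ == "__main__":
    for p in (0.1, 0.5, 0.9):
        attempts = expected_attempts(p, max_retries=5)
        print(f"p={p:.1f}: {attempts:.2f} attempts per original request")
```

Under this model a quiet period (p = 0.1) costs about 11% extra traffic, while a storm (p = 0.9) with five retries sends roughly 4.7 requests per original request, which is why the retry volume is capped before anything else.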

Method Path

Set a per-domain retry budget per minute, enforced with a token bucket or a simpler fixed-window counter.
Use exponential backoff with jitter to prevent synchronized retries.
Create dedicated circuit breaker for 429, separate from 5xx policy.
Include wasted-retry cost in rollout acceptance.
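
The first step can be sketched as a token bucket. This is a minimal sketch, not the article's own implementation: the class name `TokenBucket`, the `capacity` parameter, and the injectable `clock` are assumptions made for illustration and testing.

```python
import time


class TokenBucket:
    """Retry budget that refills continuously instead of resetting once a minute."""

    def __init__(self, rate_per_minute: float, capacity: float, clock=time.monotonic):
        self.rate_per_second = rate_per_minute / 60.0
        self.capacity = capacity      # maximum burst of retries
        self.tokens = capacity        # start with a full bucket
        self.clock = clock            # injectable so tests can control time
        self.last_refill = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Spend `cost` tokens if available; return False when the budget is empty."""
        now = self.clock()
        elapsed = now - self.last_refill
        self.last_refill = now
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_second)
        if self.tokens < cost:
            return False
        self.tokens -= cost
        return True


if __name__ == "__main__":
    bucket = TokenBucket(rate_per_minute=80, capacity=10)
    granted = sum(bucket.allow() for _ in range(25))
    print(f"{granted} of 25 immediate retries granted")
```

Compared with a counter that resets every 60 seconds, the bucket avoids the burst of retries that a fresh window releases all at once; `capacity` bounds the burst and `rate_per_minute` bounds the sustained rate.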

Architecture and Data Flow

Scheduler -> Retry Budget Gate -> Downloader
         -> Response Classifier -> Backoff Queue
                        |                 |
                        v                 v
                  429 Circuit Breaker   Normal Queue

Operational constraints:

Do not requeue 429 responses immediately into the same queue.
When domain retry budget is exhausted, drop low-priority jobs first.
Backoff must include jitter to avoid synchronized storms.
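
The 429 circuit breaker in the diagram is not shown in the snippets, so the following is one possible sketch of it. Everything except the 0.22 threshold is an assumption: the class name `Throttle429Breaker`, the sliding window of the last 100 responses, the minimum sample count, and the 30-second cooldown.

```python
import time
from collections import deque


class Throttle429Breaker:
    """Opens when the share of 429 responses in a sliding window crosses a threshold."""

    def __init__(self, open_threshold=0.22, window_size=100, min_samples=20,
                 cooldown_seconds=30.0, clock=time.monotonic):
        self.open_threshold = open_threshold
        self.min_samples = min_samples
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.outcomes = deque(maxlen=window_size)  # True means the response was a 429
        self.opened_at = None                      # None means the breaker is closed

    def record(self, status: int) -> None:
        """Record one response status and open the breaker if the ratio is too high."""
        self.outcomes.append(status == 429)
        if len(self.outcomes) < self.min_samples:
            return  # too few samples to judge
        ratio = sum(self.outcomes) / len(self.outcomes)
        if ratio >= self.open_threshold and self.opened_at is None:
            self.opened_at = self.clock()

    def allow_request(self) -> bool:
        """Return False while the breaker is open and still cooling down."""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at < self.cooldown_seconds:
            return False
        # Cooldown elapsed: close the breaker and start from a clean window.
        self.opened_at = None
        self.outcomes.clear()
        return True
```

Only 429 responses count toward the ratio, which keeps this breaker separate from any 5xx policy. Clearing the window on close means the breaker needs `min_samples` fresh responses before it can reopen, which dampens open/close oscillation at the cost of reacting a little later.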

Configuration Matrix

Config	Recommended Value	Why	Bad Pattern
`RETRY_BUDGET_PER_MIN`	80	cap retries per minute	unbounded retries
`BACKOFF_BASE_SECONDS`	1.8	expand retry interval gradually	fixed 1-second retries
`BACKOFF_CAP_SECONDS`	45	prevent unbounded wait	no backoff cap
`BACKOFF_JITTER_RATIO`	0.35	desynchronize retry spikes	no jitter
`CB_OPEN_THRESHOLD_429`	0.22	fast load shedding on high 429 ratio	wait until queue is full
`LOW_PRIORITY_DROP`	true	protect core traffic	same priority for all traffic
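
One way to keep these values in a single place is a small validated policy object. This is a sketch under assumptions: the `ThrottlePolicy` name, the field names, and the validation rules are illustrative; only the default values come from the matrix above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ThrottlePolicy:
    retry_budget_per_min: int = 80
    backoff_base_seconds: float = 1.8
    backoff_cap_seconds: float = 45.0
    backoff_jitter_ratio: float = 0.35
    cb_open_threshold_429: float = 0.22
    low_priority_drop: bool = True

    def __post_init__(self):
        # Reject the "bad patterns" from the matrix at load time.
        if self.retry_budget_per_min <= 0:
            raise ValueError("retry_budget_per_min must be positive")
        if self.backoff_base_seconds <= 1.0:
            raise ValueError("backoff_base_seconds must exceed 1 so delays grow")
        if self.backoff_cap_seconds <= 0:
            raise ValueError("backoff_cap_seconds must be positive")
        if not 0.0 <= self.backoff_jitter_ratio < 1.0:
            raise ValueError("backoff_jitter_ratio must be in [0, 1)")
        if not 0.0 < self.cb_open_threshold_429 <= 1.0:
            raise ValueError("cb_open_threshold_429 must be in (0, 1]")


DEFAULT_POLICY = ThrottlePolicy()
# Hypothetical per-environment override: a tighter budget for staging.
STAGING_POLICY = replace(DEFAULT_POLICY, retry_budget_per_min=20)
```

Because the dataclass is frozen, an override always goes through `replace`, which reruns the validation and leaves the default untouched.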

Key Code Snippets

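# reliability/retry_budget.py
import time

class RetryBudget:
    """Fixed-window counter: at most `limit_per_minute` retries per 60-second window."""

    def __init__(self, limit_per_minute: int):
        self.limit = limit_per_minute
        # Monotonic clock: unaffected by wall-clock adjustments such as NTP steps.
        self.window_start = time.monotonic()
        self.used = 0

    def allow(self) -> bool:
        """Consume one retry from the budget; return False once the window is spent."""
        now = time.monotonic()
        if now - self.window_start >= 60:
            # Start a new window and reset the counter.
            self.window_start = now
            self.used = 0
        if self.used >= self.limit:
            return False
        self.used += 1
        return True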

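# reliability/backoff.py
import random

def backoff_seconds(retry_count: int, base: float = 1.8, cap: float = 45.0,
                    jitter_ratio: float = 0.35) -> float:
    """Exponential backoff with symmetric jitter, never below 0.5 seconds."""
    try:
        raw = min(cap, base ** retry_count)
    except OverflowError:
        # base ** retry_count overflows a float for very large retry counts.
        raw = cap
    # Jitter is applied after the cap, so the result can exceed cap by jitter_ratio.
    jitter = raw * random.uniform(-jitter_ratio, jitter_ratio)
    return max(0.5, raw + jitter)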

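# middleware/throttle_guard.py
from urllib.parse import urlparse

from scrapy.exceptions import IgnoreRequest

from reliability.backoff import backoff_seconds

class ThrottleGuardMiddleware:
    def process_response(self, request, response, spider):
        # Only 429 responses are handled here; everything else passes through.
        if response.status != 429:
            return response

        # netloc is the host plus optional port, the same value as url.split("/")[2].
        domain = urlparse(request.url).netloc
        # spider.retry_budget: mapping of domain -> RetryBudget, set up by the spider.
        if not spider.retry_budget[domain].allow():
            raise IgnoreRequest(f"retry budget exhausted: {domain}")

        delay = backoff_seconds(request.meta.get("retry_times", 0) + 1)
        # spider.backoff_queue is project code that re-schedules the request after `delay`.
        spider.backoff_queue.push(request, delay=delay)
        raise IgnoreRequest("moved to backoff queue")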
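
For the guard to be the only component that reacts to 429, Scrapy's built-in RetryMiddleware should not retry that status as well. A possible settings fragment follows; the `myproject...` import path and the priority 560 are assumptions for illustration, while `RETRY_HTTP_CODES` and `DOWNLOADER_MIDDLEWARES` are standard Scrapy settings.

```python
# settings.py (fragment)

# Scrapy's default RETRY_HTTP_CODES includes 429. Listing the codes explicitly
# without it leaves 429 handling to the guard middleware alone.
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408]

DOWNLOADER_MIDDLEWARES = {
    # Hypothetical import path; adjust to the real project package.
    # process_response runs in decreasing priority order, so 560 sees the
    # response before the built-in RetryMiddleware (550 by default).
    "myproject.middleware.throttle_guard.ThrottleGuardMiddleware": 560,
}
```

One consequence to check: `retry_times` in `request.meta` is normally incremented by RetryMiddleware, so whatever re-schedules requests from the backoff queue would need to increment it itself for the backoff delay to grow.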

Failure Cases and Troubleshooting

Failure scenario: a price-page crawler treated 429 responses as generic network errors, which produced 4x retry amplification and collapsed the queue.

Troubleshooting sequence:

Inspect retry_count versus valid-output ratio per minute window.
Ensure 429 requests are moved to backoff_queue, not main queue.
Verify that low-priority jobs are dropped when the retry budget is exhausted.
Review breaker window for rebound oscillation.
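
The first check in this sequence can be sketched as a per-minute aggregation over crawl events. The input format is an assumption: a list of `(unix_timestamp, kind)` tuples where `kind` is `"retry"` or `"valid"`; real logs would need to be mapped into that shape first.

```python
from collections import defaultdict


def per_minute_retry_report(events):
    """Group events into 60-second windows and compare retries with valid outputs.

    Returns a list of (window_start, retries, valid, retries_per_valid) tuples,
    sorted by window start. Events of any other kind are ignored.
    """
    buckets = defaultdict(lambda: {"retry": 0, "valid": 0})
    for ts, kind in events:
        if kind not in ("retry", "valid"):
            continue
        window_start = int(ts // 60) * 60
        buckets[window_start][kind] += 1

    report = []
    for window_start in sorted(buckets):
        counts = buckets[window_start]
        # A window with retries but no valid output is reported as infinity.
        if counts["valid"]:
            ratio = counts["retry"] / counts["valid"]
        else:
            ratio = float("inf")
        report.append((window_start, counts["retry"], counts["valid"], ratio))
    return report


if __name__ == "__main__":
    sample = [(0, "valid"), (10, "retry"), (59.9, "retry"), (60, "valid"), (75, "retry")]
    for row in per_minute_retry_report(sample):
        print(row)
```

Windows where the ratio climbs while valid output stays flat are the ones to correlate with breaker state and backoff queue depth.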

Performance Metrics and Load Testing

Load tests should cover baseline, peak, and anti-bot escalation profiles.

Acceptance thresholds:

429_ratio <= 6%
wasted_retry_ratio <= 12%
queue_wait_p95 <= 4s
cost_per_1k_valid improves by >= 18%
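
These thresholds can be turned into an automated gate after each load test. The sketch below assumes ratios are expressed as fractions (0.06 for 6%) and that a baseline cost from before the change is available; the metric key names and the function name `check_acceptance` are illustrative.

```python
def check_acceptance(metrics: dict, baseline_cost_per_1k_valid: float) -> list:
    """Return a list of human-readable failures; an empty list means the gate passes."""
    failures = []
    if metrics["ratio_429"] > 0.06:
        failures.append("429_ratio above 6%")
    if metrics["wasted_retry_ratio"] > 0.12:
        failures.append("wasted_retry_ratio above 12%")
    if metrics["queue_wait_p95_seconds"] > 4.0:
        failures.append("queue_wait_p95 above 4s")

    # Improvement is measured relative to the pre-change baseline cost.
    saved = baseline_cost_per_1k_valid - metrics["cost_per_1k_valid"]
    improvement = saved / baseline_cost_per_1k_valid
    if improvement < 0.18:
        failures.append("cost_per_1k_valid improved by less than 18%")
    return failures


if __name__ == "__main__":
    run = {
        "ratio_429": 0.04,
        "wasted_retry_ratio": 0.10,
        "queue_wait_p95_seconds": 3.2,
        "cost_per_1k_valid": 8.0,
    }
    print(check_acceptance(run, baseline_cost_per_1k_valid=10.0) or "all thresholds met")
```

Returning every failed threshold, rather than stopping at the first, makes the rollout decision and any rollback note easier to write.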

Vendor Comparison and 16Yun Positioning

Only issue-relevant capabilities are kept here:

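API Proxy: whitelist management, RESTful API, multiple billing models
Dedicated Proxy: dedicated exclusive IPs, strong security isolation, low-latency responses
Scheduled Rotation Proxy: scheduled IP switching, fixed-window sessions, support for high-concurrency tasks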

For this issue, stability during throttling periods matters most; the 16Yun API proxy combined with the scheduled rotation proxy is positioned as the better fit for budgeted retries.

Rollout Checklist

single policy entry implemented in middleware
key configs split by priority and environment
regression load test passed all thresholds
rollback can be executed within 10 minutes
alerts set for 429/403/latency thresholds
change audit log recorded for this rollout