Issue Playbook: scrapy-splash recursive crawl using CrawlS in 429
Focused on 429-control, retry-budget, and backoff-strategy with production-grade Scrapy containment and rollback steps.
Context and Problem Definition
This article targets one concrete production failure: 429 storms causing retry amplification and cost spikes. It is not broad guidance; it is an executable Scrapy fix path with clear acceptance and rollback boundaries.
Typical symptom: Repeated 429 bursts fill retry queues, starving normal traffic and raising cost per valid record. The root cause is usually policy-level inconsistency, not just proxy quality.
Issue signals used for this exact problem:
- scrapy-plugins/scrapy-splash#92: scrapy-splash recursive crawl using CrawlSpider not working (comments: 36)
- scrapy/scrapy#7060: Fix flaky test_download_with_proxy_https_timeout() (comments: 25)
- scrapy/scrapyd#543: New jobs stuck in pending state (comments: 23)
Insight Framework
- 429 is not network failure; it is server-side throttling signal.
- Unbudgeted retries amplify short throttling into systemic congestion.
- Control retry volume first, then consider proxy scaling.
Method Path
- Set per-domain retry budget with token bucket per minute.
- Use exponential backoff with jitter to prevent synchronized retries.
- Create dedicated circuit breaker for 429, separate from 5xx policy.
- Include wasted-retry cost in rollout acceptance.
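The first step names a token bucket, while the `RetryBudget` class under Key Code Snippets resets a counter on a fixed 60-second window. As a minimal sketch of the token-bucket variant, assuming a continuous refill and an injectable clock (the class name `TokenBucket` and the `capacity`, `cost`, and `clock` parameters are illustrative, not part of the original code):

```python
import time
from typing import Callable, Optional


class TokenBucket:
    """Per-domain retry budget that refills continuously up to `capacity`."""

    def __init__(self, per_minute: float, capacity: Optional[float] = None,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = per_minute / 60.0  # tokens added per second
        self.capacity = capacity if capacity is not None else per_minute
        self.tokens = self.capacity    # start with a full bucket
        self.clock = clock             # injectable so tests can control time
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill for the elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens < cost:
            return False
        self.tokens -= cost
        return True
```

Unlike a fixed window, this shape does not release a full minute of retries at each window boundary, which may matter when many requests are waiting on the same domain.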
Architecture and Data Flow
Scheduler -> Retry Budget Gate -> Downloader
          -> Response Classifier -> Backoff Queue
                      |                   |
                      v                   v
             429 Circuit Breaker     Normal Queue
Operational constraints:
- Do not requeue 429 responses immediately into the same queue.
- When domain retry budget is exhausted, drop low-priority jobs first.
- Backoff must include jitter to avoid synchronized storms.
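One way to read these constraints is as a small decision function. The sketch below is an assumption-laden illustration: the `Action` names, the `DEFER` outcome for core traffic, and the `low_priority_below` cutoff are hypothetical. It relies only on the Scrapy convention that `Request.priority` defaults to 0 and that higher values are scheduled earlier.

```python
from enum import Enum


class Action(Enum):
    BACKOFF = "backoff"  # budget available: schedule a delayed retry
    DEFER = "defer"      # budget exhausted, core traffic: hold for a later window
    DROP = "drop"        # budget exhausted, low priority: discard


def on_429(priority: int, budget_allows: bool, low_priority_below: int = 0) -> Action:
    """Decide what to do with a request that received a 429."""
    if budget_allows:
        return Action.BACKOFF
    # Budget exhausted: shed low-priority work first, keep core traffic parked.
    if priority < low_priority_below:
        return Action.DROP
    return Action.DEFER
```

For example, `on_429(priority=-10, budget_allows=False)` returns `Action.DROP`, while the same call with `priority=0` returns `Action.DEFER`.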
Configuration Matrix
| Config | Recommended Value | Why | Bad Pattern |
|---|---|---|---|
| RETRY_BUDGET_PER_MIN | 80 | cap retries per minute | unbounded retries |
| BACKOFF_BASE_SECONDS | 1.8 | expand retry interval gradually | fixed 1-second retries |
| BACKOFF_CAP_SECONDS | 45 | prevent unbounded wait | no backoff cap |
| BACKOFF_JITTER_RATIO | 0.35 | desynchronize retry spikes | no jitter |
| CB_OPEN_THRESHOLD_429 | 0.22 | fast load shedding on high 429 ratio | wait until queue is full |
| LOW_PRIORITY_DROP | true | protect core traffic | same priority for all traffic |
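The matrix sets `CB_OPEN_THRESHOLD_429` to 0.22, but the article does not show a breaker. A minimal sketch of a 429-only breaker might look like the following. Only the 0.22 threshold comes from the matrix; the class name `Breaker429` and the `window`, `min_samples`, and `cooldown` values are assumptions to be tuned per domain.

```python
import time
from collections import deque
from typing import Callable, Optional


class Breaker429:
    """Opens when the share of 429s in the recent window reaches `threshold`."""

    def __init__(self, threshold: float = 0.22, window: int = 50,
                 min_samples: int = 20, cooldown: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.threshold = threshold
        self.samples = deque(maxlen=window)  # True = 429, False = any other status
        self.min_samples = min_samples       # avoid tripping on a handful of responses
        self.cooldown = cooldown             # seconds to stay open (assumed value)
        self.clock = clock
        self.opened_at: Optional[float] = None  # None means the breaker is closed

    def record(self, status: int) -> None:
        self.samples.append(status == 429)
        if self.opened_at is None and len(self.samples) >= self.min_samples:
            if sum(self.samples) / len(self.samples) >= self.threshold:
                self.opened_at = self.clock()

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown:
            # Cooldown elapsed: close and start again from an empty window.
            self.opened_at = None
            self.samples.clear()
            return True
        return False
```

Clearing the window on close keeps the sketch short, but it can let the breaker reopen quickly if the target is still throttling. A half-open state that admits a few probe requests is a common refinement.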
Key Code Snippets
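# reliability/retry_budget.py
import time


class RetryBudget:
    """Fixed-window retry counter: at most `limit_per_minute` retries per 60 s."""

    def __init__(self, limit_per_minute: int):
        self.limit = limit_per_minute
        # monotonic() is unaffected by wall-clock adjustments.
        self.window_start = time.monotonic()
        self.used = 0

    def allow(self) -> bool:
        now = time.monotonic()
        # Start a fresh window once 60 seconds have elapsed.
        if now - self.window_start >= 60:
            self.window_start = now
            self.used = 0
        if self.used >= self.limit:
            return False
        self.used += 1
        return True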
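# reliability/backoff.py
import random


def backoff_seconds(retry_count: int, base: float = 1.8, cap: float = 45,
                    jitter_ratio: float = 0.35) -> float:
    # Exponential growth, capped before jitter is applied.
    raw = min(cap, base ** retry_count)
    # Symmetric jitter; the result can exceed `cap` by up to jitter_ratio.
    jitter = raw * random.uniform(-jitter_ratio, jitter_ratio)
    return max(0.5, raw + jitter)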
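# middleware/throttle_guard.py
from urllib.parse import urlparse

from scrapy.exceptions import IgnoreRequest

from reliability.backoff import backoff_seconds


class ThrottleGuardMiddleware:
    # Register with an order above RetryMiddleware (550) so this sees 429s first.
    def process_response(self, request, response, spider):
        if response.status != 429:
            return response
        domain = urlparse(request.url).netloc
        # spider.retry_budget maps domain -> RetryBudget and is set up by the spider.
        if not spider.retry_budget[domain].allow():
            raise IgnoreRequest(f"retry budget exhausted: {domain}")
        # Track retries here; RetryMiddleware never sees this response.
        retries = request.meta.get("retry_times", 0) + 1
        request.meta["retry_times"] = retries
        spider.backoff_queue.push(request, delay=backoff_seconds(retries))
        raise IgnoreRequest("moved to backoff queue")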
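The middleware calls `spider.backoff_queue.push(request, delay=delay)`, but the queue itself is not defined in the snippets. As a minimal sketch, assuming an in-process heap keyed by release time (the class name `BackoffQueue` and the `pop_ready` method are hypothetical, and the code that drains the queue back into the crawl is deliberately left out):

```python
import heapq
import itertools
import time
from typing import Any, Callable, List


class BackoffQueue:
    """Holds requests until their backoff delay has elapsed."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._heap: list = []
        self._seq = itertools.count()  # tie-breaker so requests are never compared
        self._clock = clock

    def push(self, request: Any, delay: float) -> None:
        release_at = self._clock() + delay
        heapq.heappush(self._heap, (release_at, next(self._seq), request))

    def pop_ready(self) -> List[Any]:
        """Return every request whose delay has elapsed, earliest first."""
        now = self._clock()
        ready = []
        while self._heap and self._heap[0][0] <= now:
            ready.append(heapq.heappop(self._heap)[2])
        return ready

    def __len__(self) -> int:
        return len(self._heap)
```

A periodic task would call `pop_ready()` and hand the returned requests back to the scheduler. How that is wired into a given Scrapy project depends on the deployment and is not covered here.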
Failure Cases and Troubleshooting
Failure scenario: A price-page crawler treated 429 responses as generic network errors, which produced 4x retry amplification and collapsed the queue.
Troubleshooting sequence:
- Inspect retry_count versus valid-output ratio per minute window.
- Ensure 429 requests are moved to backoff_queue, not main queue.
- Verify low-priority drops when retry budget is exhausted.
- Review breaker window for rebound oscillation.
Performance Metrics and Load Testing
Load tests should cover baseline, peak, and anti-bot escalation profiles.
Acceptance thresholds:
- 429_ratio <= 6%
- wasted_retry_ratio <= 12%
- queue_wait_p95 <= 4s
- cost_per_1k_valid improves by >= 18%
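These thresholds can be checked mechanically after each load-test run. The sketch below assumes a flat metrics dictionary with hypothetical key names. It also assumes that `wasted_retry_ratio` means wasted retries divided by total retries and that the cost improvement is measured against a baseline run, neither of which the article defines.

```python
from typing import Dict, List


def check_acceptance(run: Dict[str, float], baseline_cost_per_1k: float) -> List[str]:
    """Return the names of the thresholds that the run failed (empty list = pass)."""
    failures = []
    if run["responses_429"] / run["responses_total"] > 0.06:
        failures.append("429_ratio")
    # Skip the ratio when no retries happened, to avoid dividing by zero.
    if run["retries_total"] and run["retries_wasted"] / run["retries_total"] > 0.12:
        failures.append("wasted_retry_ratio")
    if run["queue_wait_p95_s"] > 4.0:
        failures.append("queue_wait_p95")
    improvement = (baseline_cost_per_1k - run["cost_per_1k_valid"]) / baseline_cost_per_1k
    if improvement < 0.18:
        failures.append("cost_per_1k_valid")
    return failures
```

A rollout gate could then refuse to proceed whenever the returned list is non-empty and log the failed threshold names for the change audit.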
Vendor Comparison and 16Yun Positioning
Only issue-relevant capabilities are kept here:
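- API Proxy: whitelist management, RESTful API, multiple billing models
- Dedicated Proxy: dedicated exclusive IPs, strong security isolation, low-latency responses
- Scheduled Rotation Proxy: timed rotation, fixed-window sessions, support for high-concurrency jobs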
For this issue, stability during throttling periods matters most; the 16Yun API proxy combined with the scheduled rotation proxy is the better fit for budgeted retries.
Rollout Checklist
- single policy entry implemented in middleware
- key configs split by priority and environment
- regression load test passed all thresholds
- rollback can be executed within 10 minutes
- alerts set for 429/403/latency thresholds
- change audit log recorded for this rollout