Issue Playbook: New jobs stuck in pending state in 429 Control and

Covers 429 control, retry budgets, and backoff strategy, with production-grade Scrapy containment and rollback steps.

16Yun Engineering Team | Mar 9, 2026 | 3 min read

Context and Problem Definition

This article targets one concrete production failure: 429 storms causing retry amplification and cost spikes. It is not broad guidance; it is an executable Scrapy fix path with clear acceptance and rollback boundaries.

Typical symptom: repeated 429 bursts fill the retry queues, starving normal traffic and raising the cost per valid record. The root cause is usually an inconsistent retry policy, for example handling 429 the same way as a network error, rather than proxy quality alone.

Issue signals used for this exact problem:

  • scrapy-plugins/scrapy-splash#92: scrapy-splash recursive crawl using CrawlSpider not working (comments: 36)
  • scrapy/scrapy#7060: Fix flaky test_download_with_proxy_https_timeout() (comments: 25)
  • scrapy/scrapyd#543: New jobs stuck in pending state (comments: 23)

Insight Framework

  • A 429 is not a network failure; it is a server-side throttling signal.
  • Unbudgeted retries amplify short throttling into systemic congestion.
  • Control retry volume first, then consider proxy scaling.
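
As a rough illustration of the amplification point, the sketch below assumes every attempt is throttled independently with the same probability, which is a simplification because real throttling comes in bursts. The function name `expected_attempts` and its parameters are hypothetical, not part of Scrapy.

```python
def expected_attempts(p_429: float, max_retries: int) -> float:
    """Expected attempts per request when each attempt is throttled with probability p_429.

    Attempt k is made only if the previous k attempts were all throttled, so the
    expectation is the geometric sum 1 + p + p^2 + ... + p^max_retries.
    """
    return sum(p_429 ** k for k in range(max_retries + 1))


if __name__ == "__main__":
    # Compare how the multiplier grows as more retries are allowed.
    for retries in (0, 2, 5, 10):
        print(retries, round(expected_attempts(0.8, retries), 2))
```

Under these assumptions, an 80% throttle rate with 5 allowed retries multiplies request volume by roughly 3.7, which is why the retry volume is capped before proxy capacity is scaled.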

Method Path

  1. Set a per-domain retry budget, enforced with a per-minute token bucket.
  2. Use exponential backoff with jitter to prevent synchronized retries.
  3. Create a dedicated circuit breaker for 429, separate from the 5xx policy.
  4. Include wasted-retry cost in the rollout acceptance criteria.
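
Step 1 names a token bucket, while the RetryBudget class in Key Code Snippets counts retries in a fixed 60-second window. For comparison, here is a minimal token-bucket sketch that refills continuously. The class name `TokenBucket` and the injectable `clock` parameter are assumptions made for illustration and testing.

```python
import time


class TokenBucket:
    """Retry budget that refills continuously instead of resetting once a minute."""

    def __init__(self, limit_per_minute: int, clock=time.monotonic):
        self.capacity = float(limit_per_minute)
        self.refill_per_second = limit_per_minute / 60.0
        self.tokens = self.capacity  # start with a full bucket
        self.clock = clock
        self.last_refill = clock()

    def allow(self) -> bool:
        """Spend one token if available; return False when the bucket is empty."""
        now = self.clock()
        elapsed = now - self.last_refill
        self.last_refill = now
        # Add the tokens earned since the last call, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        if self.tokens < 1.0:
            return False
        self.tokens -= 1.0
        return True
```

A fixed window can release a full budget at each boundary, so two bursts may land back to back; the continuous refill spreads the same budget more evenly. One bucket per domain (for example in a dict keyed by hostname) keeps a throttled site from consuming another site's budget.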

Architecture and Data Flow

Scheduler -> Retry Budget Gate -> Downloader
         -> Response Classifier -> Backoff Queue
                        |                 |
                        v                 v
                  429 Circuit Breaker   Normal Queue

Operational constraints:

  • Do not requeue 429 responses immediately into the same queue.
  • When domain retry budget is exhausted, drop low-priority jobs first.
  • Backoff must include jitter to avoid synchronized storms.
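
The first constraint implies a separate holding area that releases a request only once its delay has elapsed. The following is a minimal sketch of such a queue using a heap; the class name `BackoffQueue`, the `pop_ready` method, and the `clock` parameter are hypothetical, and only `push(item, delay=...)` mirrors the call used in the middleware snippet later in this article.

```python
import heapq
import itertools
import time


class BackoffQueue:
    """Holds throttled requests until their backoff delay has passed."""

    def __init__(self, clock=time.monotonic):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker so items are never compared
        self._clock = clock

    def push(self, item, delay: float) -> None:
        """Schedule `item` to become ready `delay` seconds from now."""
        ready_at = self._clock() + delay
        heapq.heappush(self._heap, (ready_at, next(self._counter), item))

    def pop_ready(self) -> list:
        """Remove and return every item whose delay has elapsed, earliest first."""
        now = self._clock()
        ready = []
        while self._heap and self._heap[0][0] <= now:
            ready.append(heapq.heappop(self._heap)[2])
        return ready

    def __len__(self) -> int:
        return len(self._heap)
```

A periodic task could call `pop_ready()` and hand the released requests back to the scheduler; how that hand-off is wired into Scrapy is left open here.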

Configuration Matrix

Config                 | Recommended Value | Why                                  | Bad Pattern
RETRY_BUDGET_PER_MIN   | 80                | cap retries per minute               | unbounded retries
BACKOFF_BASE_SECONDS   | 1.8               | expand retry interval gradually      | fixed 1-second retries
BACKOFF_CAP_SECONDS    | 45                | prevent unbounded wait               | no backoff cap
BACKOFF_JITTER_RATIO   | 0.35              | desynchronize retry spikes           | no jitter
CB_OPEN_THRESHOLD_429  | 0.22              | fast load shedding on high 429 ratio | wait until queue is full
LOW_PRIORITY_DROP      | true              | protect core traffic                 | same priority for all traffic
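
One way these values might appear in a Scrapy settings.py is sketched below. The six keys from the matrix are custom names that project code must read itself; Scrapy does not define them. The middleware import path and the priority 560 are assumptions based on the file name used in the snippets. RETRY_HTTP_CODES and DOWNLOADER_MIDDLEWARES are standard Scrapy settings.

```python
# settings.py (sketch)

# Custom keys from the matrix above; Scrapy itself does not interpret them.
RETRY_BUDGET_PER_MIN = 80
BACKOFF_BASE_SECONDS = 1.8
BACKOFF_CAP_SECONDS = 45
BACKOFF_JITTER_RATIO = 0.35
CB_OPEN_THRESHOLD_429 = 0.22
LOW_PRIORITY_DROP = True

# Scrapy's default retry list includes 429. Leaving it out here keeps the
# built-in RetryMiddleware from requeueing throttled requests immediately.
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408]

# Assumed import path and priority; adjust both to the actual project layout.
DOWNLOADER_MIDDLEWARES = {
    "middleware.throttle_guard.ThrottleGuardMiddleware": 560,
}
```

Removing 429 from RETRY_HTTP_CODES is the part that matters most: without it, two retry paths would compete for the same response.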

Key Code Snippets

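# reliability/retry_budget.py
import time


class RetryBudget:
    """Fixed-window retry counter: at most `limit_per_minute` retries per 60 s window."""

    def __init__(self, limit_per_minute: int):
        self.limit = limit_per_minute
        # Monotonic clock: wall-clock adjustments cannot stretch or shrink the window.
        self.window_start = time.monotonic()
        self.used = 0

    def allow(self) -> bool:
        """Consume one retry from the budget; return False once the window is spent."""
        now = time.monotonic()
        if now - self.window_start >= 60:
            # Start a new window and reset the counter.
            self.window_start = now
            self.used = 0
        if self.used >= self.limit:
            return False
        self.used += 1
        return True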
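
# reliability/backoff.py
import random


def backoff_seconds(retry_count: int, base: float = 1.8, cap: float = 45,
                    jitter_ratio: float = 0.35) -> float:
    """Exponential backoff with symmetric jitter; never shorter than 0.5 s."""
    # Grow the delay exponentially, then cap it before jitter is applied.
    raw = min(cap, base ** retry_count)
    # Jitter is proportional to the delay, so the result can exceed `cap` by up to jitter_ratio.
    jitter = raw * random.uniform(-jitter_ratio, jitter_ratio)
    return max(0.5, raw + jitter)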
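
# middleware/throttle_guard.py
from urllib.parse import urlparse

from scrapy.exceptions import IgnoreRequest

from reliability.backoff import backoff_seconds  # module path taken from the file comment above


class ThrottleGuardMiddleware:
    def process_response(self, request, response, spider):
        if response.status != 429:
            return response

        # hostname omits the port; spider.retry_budget is assumed to map domain -> RetryBudget.
        domain = urlparse(request.url).hostname
        if not spider.retry_budget[domain].allow():
            raise IgnoreRequest(f"retry budget exhausted: {domain}")

        # Store the attempt count on a copy so the next 429 backs off longer.
        retry_times = request.meta.get("retry_times", 0) + 1
        retry = request.replace(dont_filter=True)
        retry.meta["retry_times"] = retry_times
        spider.backoff_queue.push(retry, delay=backoff_seconds(retry_times))
        raise IgnoreRequest("moved to backoff queue")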
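
The Method Path calls for a dedicated 429 circuit breaker, and the matrix sets CB_OPEN_THRESHOLD_429 to 0.22, but no breaker code is shown. The following is a minimal sliding-window sketch under stated assumptions: the class name `ThrottleBreaker` and the `window_seconds`, `min_samples`, and `cooldown_seconds` parameters (with their defaults) are hypothetical and are not values from this article.

```python
import time
from collections import deque


class ThrottleBreaker:
    """Opens when the share of 429 responses in a sliding window crosses a threshold."""

    def __init__(self, open_threshold=0.22, window_seconds=60.0, min_samples=20,
                 cooldown_seconds=30.0, clock=time.monotonic):
        self.open_threshold = open_threshold
        self.window_seconds = window_seconds
        self.min_samples = min_samples
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._events = deque()  # (timestamp, was_429) pairs, oldest first
        self._opened_at = None

    def record(self, status: int) -> None:
        """Record one response status and open the breaker if the ratio is too high."""
        now = self._clock()
        self._events.append((now, status == 429))
        # Drop samples that have aged out of the window.
        while self._events and now - self._events[0][0] > self.window_seconds:
            self._events.popleft()
        if len(self._events) >= self.min_samples and self.ratio() >= self.open_threshold:
            self._opened_at = now

    def ratio(self) -> float:
        """Fraction of responses in the current window that were 429."""
        if not self._events:
            return 0.0
        return sum(1 for _, hit in self._events if hit) / len(self._events)

    def is_open(self) -> bool:
        """True while requests to the domain should be shed."""
        if self._opened_at is None:
            return False
        if self._clock() - self._opened_at >= self.cooldown_seconds:
            # Cooldown over: close and judge the domain on fresh traffic only.
            self._opened_at = None
            self._events.clear()
            return False
        return True
```

The minimum sample count keeps a handful of early 429s from tripping the breaker, and clearing the window on close is one possible way to reduce rebound oscillation. Both are design choices to validate under load rather than settled values.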

Failure Cases and Troubleshooting

Failure scenario: a price-page crawler treated 429 responses as generic network errors, which caused 4x retry amplification and queue collapse.

Troubleshooting sequence:

  1. Inspect the ratio of retry_count to valid output for each one-minute window.
  2. Ensure 429 requests are moved to backoff_queue, not the main queue.
  3. Verify that low-priority jobs are dropped when the retry budget is exhausted.
  4. Review the breaker window for rebound oscillation, where the breaker reopens shortly after it closes.
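
For the first troubleshooting step, a small helper along these lines can turn raw events into a per-minute ratio. The event format, a `(unix_timestamp, kind)` tuple with kind equal to "retry" or "valid", is an assumption; adapt it to whatever the crawler's logs or stats actually emit.

```python
from collections import defaultdict


def retry_ratio_by_minute(events):
    """Return {minute_index: retries / valid_outputs} for each one-minute window.

    `events` is an iterable of (unix_timestamp, kind) tuples, where kind is
    "retry" or "valid". A window with retries but no valid output maps to inf.
    """
    buckets = defaultdict(lambda: {"retry": 0, "valid": 0})
    for ts, kind in events:
        buckets[int(ts // 60)][kind] += 1

    report = {}
    for minute, counts in sorted(buckets.items()):
        valid = counts["valid"]
        report[minute] = counts["retry"] / valid if valid else float("inf")
    return report
```

A ratio that climbs across consecutive windows while valid output stays flat is the signature of retry amplification described in the failure scenario.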

Performance Metrics and Load Testing

Load tests should cover baseline, peak, and anti-bot escalation profiles.

Acceptance thresholds:

  • 429_ratio <= 6%
  • wasted_retry_ratio <= 12%
  • queue_wait_p95 <= 4s
  • cost_per_1k_valid improves by >= 18%
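
The thresholds above can be checked mechanically at the end of a load-test run. This sketch assumes the metrics arrive as a dict of fractions and seconds; the key names `queue_wait_p95_seconds` and `cost_per_1k_valid_improvement` and the function `failed_thresholds` are hypothetical.

```python
# Each entry is (direction, limit): "max" means the value must not exceed the
# limit, "min" means the value must reach at least the limit.
THRESHOLDS = {
    "429_ratio": ("max", 0.06),
    "wasted_retry_ratio": ("max", 0.12),
    "queue_wait_p95_seconds": ("max", 4.0),
    "cost_per_1k_valid_improvement": ("min", 0.18),
}


def failed_thresholds(metrics: dict) -> list:
    """Return the names of all acceptance thresholds that `metrics` violates."""
    failures = []
    for name, (direction, limit) in THRESHOLDS.items():
        value = metrics[name]
        passed = value <= limit if direction == "max" else value >= limit
        if not passed:
            failures.append(name)
    return failures
```

An empty list means the run passes; any returned name points to the metric that should block the rollout.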

Vendor Comparison and 16Yun Positioning

Only issue-relevant capabilities are kept here:

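  • API Proxy: whitelist management, RESTful API, multiple billing models
  • Dedicated Proxy: dedicated exclusive IPs, high security isolation, low-latency responses
  • Scheduled Rotation Proxy: scheduled IP switching, fixed-window sessions, support for high-concurrency tasks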

For this issue, stability during throttling periods matters most; the 16Yun API Proxy combined with the Scheduled Rotation Proxy is the better fit for budgeted retries.

Rollout Checklist

  • single policy entry implemented in middleware
  • key configs split by priority and environment
  • regression load test passed all thresholds
  • rollback can be executed within 10 minutes
  • alerts set for 429/403/latency thresholds
  • change audit log recorded for this rollout
