Chrome Headless at Scale (Part 2): CDP Connection Pool Exhaustion and Tab Accumulation

120-second timeouts still fail. WebSocket backlog grows. TCP connections get exhausted. Before CPU gives out, the network connection infrastructure goes first.

16Yun Engineering Team | Apr 3, 2026 | 3 min read

A Typical Failure Pattern

You open the first page. It loads. Then you open a second tab in the same browser instance. It times out. You try a third tab. Also times out. You increase the timeout from 30 seconds to 120 seconds. Try again. Still times out.

This isn't slow page loads. This is a saturated connection pool.

A Steel GitHub issue documented exactly this pattern. The user's configuration:

  • Steel SDK version 0.15.0
  • Playwright 1.57.0
  • Timeout: 1,600,000ms (over 26 minutes)
  • Proxy: disabled

Result: first tab loaded successfully. Second and third tabs almost always timed out.

Three Independent Root Causes

1. CDP WebSocket Multiplexing Limits

Each CDP connection uses a single WebSocket. When you create a new tab via browser.newPage(), CDP internally negotiates a new session ID, but that session still travels over the same WebSocket as every other tab. Playwright and Puppeteer serialize and deserialize frames on that one connection in order, which creates head-of-line (HoL) blocking.

If tab A is loading a large resource (a 10MB image), CDP's Page.frameStoppedLoading event can't be sent until the resource finishes transferring. Meanwhile, tabs B and C queue up waiting on the same WebSocket.

This is a protocol-level constraint. No amount of configuration tuning fixes it.
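The queuing effect can be illustrated with a toy model. The sketch below is not CDP itself: it assumes a single ordered channel with a fixed, made-up throughput, and the session labels and frame sizes are hypothetical. It only shows why a small frame for tab B finishes late when it sits behind a large frame for tab A.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    session: str     # hypothetical label for one tab's session
    size_bytes: int  # payload size of the frame


def delivery_times(frames, bytes_per_second):
    """Return {session: completion time in seconds} for frames that are
    sent strictly in order over one shared connection.
    If a session has several frames, the time of its last frame is kept."""
    clock = 0.0
    done = {}
    for frame in frames:
        # Each frame must wait for every frame queued ahead of it.
        clock += frame.size_bytes / bytes_per_second
        done[frame.session] = clock
    return done


# Tab A's 10 MB frame is queued first; tabs B and C only send 2 KB each.
shared = delivery_times(
    [Frame("tab-A", 10_000_000), Frame("tab-B", 2_000), Frame("tab-C", 2_000)],
    bytes_per_second=1_000_000,  # assumed throughput, for illustration only
)

# The same 2 KB frame on a connection of its own.
alone = delivery_times([Frame("tab-B", 2_000)], bytes_per_second=1_000_000)

print(f"tab-B on shared channel: {shared['tab-B']:.3f}s")
print(f"tab-B on its own channel: {alone['tab-B']:.3f}s")
```

In this model tab B's frame needs only milliseconds of transfer time, yet it completes after tab A's entire payload. Raising a timeout does not change the ordering.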

2. Chrome's Internal Request Scheduling

Even without CDP blocking, Chrome limits concurrent requests per browser instance, because every tab shares the browser process's network stack. Over HTTP/1.1 the default limit is 6 concurrent connections per host; HTTP/2 instead multiplexes many streams over a single connection per origin. When the target page itself pulls in hundreds of sub-resources, those slots fill up fast.
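A rough back-of-the-envelope calculation shows how quickly the HTTP/1.1 slots run out. The sketch below assumes the worst case, where every sub-resource comes from the same host and all tabs draw on the same 6-connection limit; the function name and the resource counts are illustrative, not measurements.

```python
import math

CONNECTIONS_PER_HOST = 6  # HTTP/1.1 per-host figure quoted in the text


def min_request_rounds(resources_per_tab, tabs, limit=CONNECTIONS_PER_HOST):
    """Lower bound on sequential 'rounds' of requests when every
    sub-resource targets one host and all tabs share the same limit."""
    total = resources_per_tab * tabs
    return math.ceil(total / limit)


print(min_request_rounds(200, tabs=1))  # one tab with 200 sub-resources
print(min_request_rounds(200, tabs=3))  # three such tabs in one browser
```

Under these assumptions a single tab needs at least 34 rounds and three tabs need at least 100, so each extra tab lengthens the queue for all the others.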

3. Tab Non-Release Causes Cumulative Leakage

This is the most insidious problem. Every newPage() call creates a new tab, but close() doesn't always release everything:

Operation sequence:
  page = await browser.newPage()    # Create tab A
  await page.goto(url1)             # Navigate
  data = await page.content()       # Extract data
  page = await browser.newPage()    # Create tab B (tab A not explicitly closed)
  await page.goto(url2)
  data = await page.content()
 
Tab A's Renderer process may still be running.
Repeat 100 times → 100 unclosed tabs consuming 300-500 processes.

Many script authors don't call page.close() explicitly, assuming the next newPage() overwrites the reference. The language runtime's garbage collector can reclaim the dropped reference. It cannot reclaim the OS processes behind the tab.
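One way to make the close step hard to forget is to wrap tab creation in an async context manager. This is a minimal sketch, assuming a browser object that exposes the same newPage(), goto(), content() and close() coroutines used in the sequence above; open_page and fetch_html are hypothetical helper names, not part of any library.

```python
from contextlib import asynccontextmanager


@asynccontextmanager
async def open_page(browser):
    """Yield a new tab and close it on exit, even if the body raises."""
    page = await browser.newPage()
    try:
        yield page
    finally:
        # Explicit close: dropping the reference alone does not end the tab.
        await page.close()


async def fetch_html(browser, url):
    """Open a tab, load the URL, return the HTML, and always close the tab."""
    async with open_page(browser) as page:
        await page.goto(url)
        return await page.content()
```

With this shape, every tab opened through fetch_html() is closed whether the navigation succeeds, times out, or raises.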

The Real Lesson from Steel Issue #247

Key data points:

  • Even with proxy disabled, concurrent tabs timed out
  • Timeout tuning is irrelevant — the user tried from 30s to 1,600,000ms
  • The issue is reproducible — not random, strongly correlated with concurrency

This isn't about target server response speed. It's about how the browser internally handles concurrent tabs.

Solutions

Approach 1: One Browser Instance Per Task

Don't open multiple tabs in one browser. Use independent instances:

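# Assumes pyppeteer, whose launch()/newPage() calls match the ones used here.
from pyppeteer import launch

# Wrong: one browser, many tabs, none of them closed
async def scrape_shared(urls):
    browser = await launch()
    for url in urls:
        page = await browser.newPage()
        await page.goto(url)

# Right: one browser per task, always closed
async def scrape_isolated(urls):
    for url in urls:
        browser = await launch()
        try:
            page = await browser.newPage()
            await page.goto(url)
        finally:
            await browser.close()  # runs even if goto() raises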

This adds startup latency (~2s per task) but eliminates connection pool contention and tab leaks.
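Running isolated browsers strictly one after another pays the startup cost serially. A possible middle ground is to keep one browser per task but cap how many run at once. The sketch below is an assumption-laden illustration: run_isolated and max_browsers are hypothetical names, and launch_browser is any zero-argument coroutine function that returns a browser with newPage() and close(), such as a launcher from the library you already use.

```python
import asyncio


async def run_isolated(urls, launch_browser, max_browsers=4):
    """Process each URL in its own browser, with at most
    max_browsers browsers alive at the same time."""
    gate = asyncio.Semaphore(max_browsers)

    async def one(url):
        async with gate:
            browser = await launch_browser()
            try:
                page = await browser.newPage()
                await page.goto(url)
                return await page.content()
            finally:
                # Closing the whole browser releases its tabs and processes.
                await browser.close()

    # return_exceptions=True keeps one failed URL from cancelling the rest;
    # failures come back as exception objects in the result list.
    return await asyncio.gather(*(one(u) for u in urls), return_exceptions=True)
```

The right value for max_browsers depends on available memory and CPU per instance, so treat the default of 4 as a placeholder.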

Approach 2: Limit Tabs Per Browser

If you must share a browser instance (e.g., to keep login state):

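import asyncio
from pyppeteer import launch  # assumed library; matches launch()/newPage() here

MAX_TABS_PER_BROWSER = 3

async def process_with_tab_limit(urls):
    browser = await launch()
    semaphore = asyncio.Semaphore(MAX_TABS_PER_BROWSER)

    async def process_one(url):
        async with semaphore:  # at most MAX_TABS_PER_BROWSER tabs open at once
            page = await browser.newPage()
            try:
                await page.goto(url)
                return await page.content()
            finally:
                await page.close()  # release the tab even on failure

    try:
        return await asyncio.gather(*(process_one(u) for u in urls))
    finally:
        await browser.close()  # close the browser even if a task raised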

Approach 3: Explicit WebSocket Management

For high concurrency, consider lower-level CDP connection management instead of Playwright/Puppeteer wrappers. agent-browser uses independent IPC pipes per command, avoiding shared WebSocket HoL blocking. Lightpanda's lighter architecture also helps — each instance consumes fewer resources, allowing more instances per server.
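To give a sense of what managing CDP at a lower level involves, the sketch below builds and routes raw CDP JSON messages without opening any connection. It assumes CDP's flat session mode, in which each message for a tab carries a top-level sessionId field; make_command, route_by_session and the "S1"/"S2" session IDs are hypothetical names chosen for illustration.

```python
import json
from collections import defaultdict
from itertools import count

_ids = count(1)  # CDP commands carry a caller-chosen numeric id


def make_command(method, params=None, session_id=None):
    """Serialize one CDP command; session_id targets a specific tab."""
    msg = {"id": next(_ids), "method": method, "params": params or {}}
    if session_id is not None:
        msg["sessionId"] = session_id
    return json.dumps(msg)


def route_by_session(raw_messages):
    """Group incoming messages by sessionId so each tab gets its own inbox.
    Messages without a sessionId are filed under 'browser'."""
    inbox = defaultdict(list)
    for raw in raw_messages:
        msg = json.loads(raw)
        inbox[msg.get("sessionId", "browser")].append(msg)
    return dict(inbox)


print(make_command("Page.navigate", {"url": "https://example.com"}, session_id="S1"))
```

Owning this layer is what lets a client decide how many sessions share one connection, rather than inheriting whatever the wrapper library does.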

Approach 4: Monitor Connection Pool State

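# Count established TCP connections that involve the default CDP port
# (the -l flag would list only listening sockets, not live connections)
ss -tn state established | grep -c ':9222'

# Count all ESTABLISHED connections (ss prints this state as ESTAB)
ss -tan | grep -c ESTAB

# Count TIME_WAIT sockets (ss prints this state as TIME-WAIT, with a hyphen)
ss -tan | grep -c TIME-WAIT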

A TIME_WAIT count that keeps rising and never declines suggests connection pool leakage.
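To track these counts over time rather than eyeballing them, the ss output can be tallied per TCP state. This is a minimal sketch, assuming the standard `ss -tan` layout with a header row and the state in the first column; count_states and snapshot are hypothetical helper names.

```python
import subprocess
from collections import Counter


def count_states(ss_output):
    """Tally TCP states from `ss -tan` text output.
    The first row is the column header and is skipped."""
    lines = ss_output.strip().splitlines()[1:]
    return Counter(line.split()[0] for line in lines if line.strip())


def snapshot():
    """Run `ss -tan` on this machine and return the per-state counts."""
    out = subprocess.run(
        ["ss", "-tan"], capture_output=True, text=True, check=True
    ).stdout
    return count_states(out)
```

Calling snapshot() on a schedule and logging the ESTAB and TIME-WAIT counts gives a trend line, which is more useful than a single reading.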

Summary

Connection pool exhaustion and tab accumulation are the second most common scaling failure in browser automation. The root causes:

  1. CDP WebSocket design constraint — multiplexing HoL blocking on shared connections
  2. Chrome internal scheduling — per-browser-instance concurrency limits
  3. Tab leaks from missing explicit close() — GC ≠ process cleanup

The shared solution: one browser instance, one task, then close. It sounds wasteful. At scale, it's the least error-prone pattern.

The next article analyzes the third scaling bottleneck: why Kubernetes native HPA doesn't work well for browser instances.
