Chrome Headless at Scale (Part 1): Orphan Processes and a 24GB Memory Leak

SIGTERM won't kill them. Orphaned processes keep consuming CPU. Memory usage balloons to 24GB. Every browser automation team hits this wall eventually.

16Yun Engineering Team | Apr 2, 2026 | 4 min read

The Problem: You Closed the Browser, But It's Still Running

I first ran into this issue while helping a customer diagnose why their server kept getting slower. They were running a Steel self-hosted instance for a few days of extraction tasks. Available memory on the server dropped from 16GB to under 2GB. CPU load tripled. Restarting the Steel container didn't help.

ps aux | grep chrome revealed the answer: 47 Chrome processes on the system, over 30 of which had already lost their parent process.

These are orphan processes. The browser instance had ended, but Chrome's child processes — renderers, GPU process, network process — were never cleaned up. They kept consuming CPU and memory in the background.

This isn't unique to Steel. Chrome's multi-process architecture means every tab, every extension, every renderer is a separate OS process. When the controlling tool manages the browser via CDP, any gap in lifecycle management creates orphans.
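A quick way to confirm this diagnosis is to list Chrome processes whose parent is PID 1, since the kernel reparents an orphan to init (or to the nearest subreaper, such as a systemd user manager). The following is a minimal sketch, assuming a Linux host with procps `ps`; the names `find_orphans` and `snapshot` and the `reaper_pids` parameter are illustrative and not part of any tool mentioned in this article.

```python
import subprocess

def find_orphans(ps_output: str, name: str = "chrome", reaper_pids=(1,)):
    """Return (pid, state) for processes matching `name` whose parent is a reaper."""
    orphans = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue
        pid, ppid, stat, comm = fields
        if name in comm and int(ppid) in reaper_pids:
            orphans.append((int(pid), stat))
    return orphans

def snapshot() -> str:
    # Columns: pid, parent pid, state code, executable name
    return subprocess.run(
        ["ps", "-eo", "pid,ppid,stat,comm"],
        capture_output=True, text=True, check=True,
    ).stdout
```

Calling `find_orphans(snapshot())` returns the suspects. Inside a container where the launcher itself is PID 1, a healthy browser process also has parent 1, so treat the result as a list of candidates rather than a verdict.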

Chrome's Process Model: Why One Tab = Five Processes

To understand why orphans are inevitable at scale, you need to understand Chrome's process architecture.

One browser tab

    ├── Browser process (main)
    ├── Renderer process (one per tab)
    ├── GPU process (hardware acceleration)
    ├── Network process
    └── Utility process (decoding, plugins)

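A browser instance with a single tab runs 3-5 OS-level processes. The browser, GPU, and network processes are shared by every tab in that instance, so 100 tabs in one browser means roughly 100 renderers plus a handful of shared processes, while 100 separate single-tab instances (a common automation setup) means 300-500 processes. This isn't threading: Chrome uses process-level isolation, giving each process its own virtual address space.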

When CDP sends Browser.close():

  1. Browser process receives the signal
  2. Browser notifies children (Renderer, GPU, Network) to exit
  3. Normally, children receive SIGTERM, clean up, and exit
  4. In abnormal cases, children hang, and the Browser process exits without them

Step 4 creates orphan processes.
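One way to narrow the gap in step 4 is to launch the browser in its own process group and escalate from SIGTERM to SIGKILL against the whole group, not just the main process. The sketch below uses only the Python standard library on a POSIX host. The function names and the 5-second grace period are illustrative choices, and it assumes Chrome's children stay in the launcher's process group.

```python
import os
import signal
import subprocess

def launch_in_own_group(argv):
    # start_new_session=True calls setsid() in the child, so the child leads
    # a new process group whose ID equals its PID; its children inherit it.
    return subprocess.Popen(argv, start_new_session=True)

def stop_process_group(proc, grace_s: float = 5.0) -> str:
    """SIGTERM the whole group, then SIGKILL whatever is left after grace_s."""
    pgid = proc.pid
    try:
        os.killpg(pgid, signal.SIGTERM)
    except ProcessLookupError:
        proc.wait()
        return "already-exited"
    try:
        proc.wait(timeout=grace_s)
        outcome = "terminated"
    except subprocess.TimeoutExpired:
        outcome = "killed"
    # Sweep stragglers: children that ignored SIGTERM are still in the group.
    try:
        os.killpg(pgid, signal.SIGKILL)
    except ProcessLookupError:
        pass  # the group is already empty
    proc.wait()  # reap the leader so it does not linger as a zombie
    return outcome
```

This does not help with a process stuck in uninterruptible sleep, which only dies once the blocking I/O completes, but it covers children that simply ignore or miss the polite shutdown.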

Why Orphans Are So Hard to Kill

Killing an orphan is harder than you'd expect:

# First try: SIGTERM
kill <pid>
# Process still there
 
# Second try: SIGKILL
kill -9 <pid>
# Process still there, now showing as defunct
 
# Third try: check state (a STAT value starting with Z means zombie)
ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/ && $4 ~ /chrome/'
# Lots of Z (zombie) state Chrome processes

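A zombie can't be killed because it is already dead: it holds no memory or CPU, only a process-table entry that stays until its parent calls wait() to read the exit status. For Chrome orphans, the more common case is that the process is still alive but stuck in uninterruptible sleep (D state), waiting on I/O or a lock. The kernel does not deliver signals, SIGKILL included, until the process leaves that state.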
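The zombie case is easy to reproduce in isolation. The sketch below assumes a Linux host with `/proc` mounted; `proc_state` and `zombie_demo` are illustrative names. It starts a child that exits immediately, observes its state before anyone calls wait(), sends SIGKILL, and then reaps it.

```python
import os
import signal
import subprocess
import sys
import time

def proc_state(pid: int):
    """Return the one-letter state from /proc/<pid>/stat, or None if gone."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            data = f.read()
    except (FileNotFoundError, ProcessLookupError):
        return None
    # Format is "pid (comm) state ..."; comm may itself contain ')' or spaces.
    return data.rsplit(")", 1)[1].split()[0]

def zombie_demo():
    child = subprocess.Popen([sys.executable, "-c", "pass"])
    deadline = time.monotonic() + 5
    # The child exits right away, but nothing has called wait() on it yet.
    while proc_state(child.pid) != "Z" and time.monotonic() < deadline:
        time.sleep(0.05)
    before = proc_state(child.pid)
    os.kill(child.pid, signal.SIGKILL)  # accepted, but nothing is left to kill
    after_kill = proc_state(child.pid)
    child.wait()                        # the parent reaps the exit status
    after_wait = proc_state(child.pid)
    return before, after_kill, after_wait
```

On a typical Linux host, `zombie_demo()` should report the Z state both before and after the SIGKILL, and no process at all once wait() has run. That is why the fix for zombies is a parent (or an init process) that reaps, not a stronger signal.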

A Playwright MCP issue documented the numbers: over a 21-hour 28-minute session, the Node process reached 24,646,376 kB of virtual memory (~24.6 GB) with 598 MB RSS. When the kernel's OOM killer terminated the Node process, Chrome's child processes were left behind:

May 26 23:30:36 host kernel: oom-kill: constraint=CONSTRAINT_NONE
  task=node, pid=832511, uid=1000
May 26 23:30:36 host kernel: Out of memory: Killed process 832511 (node)
  total-vm:24646376kB, anon-rss:598188kB

The cascade effect is worse: after killing the target process, if memory is still insufficient, the OOM killer continues. In the same issue, systemd-resolved and dbus were killed as collateral, taking down the entire network stack — requiring a full reboot.
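The kernel log excerpt above carries two different numbers that are easy to conflate: total-vm is reserved address space, while anon-rss is anonymous memory actually resident in RAM. Below is a small sketch for pulling both out of a log, assuming the line format shown in the excerpt. The regex and the name `parse_oom_kill` are illustrative and may need adjusting for other kernel versions.

```python
import re

# Matches "Killed process <pid> (<name>) total-vm:<n>kB, anon-rss:<n>kB",
# tolerating a line break between the process name and the sizes.
OOM_LINE = re.compile(
    r"Killed process (\d+) \((\S+)\)\s+total-vm:(\d+)kB, anon-rss:(\d+)kB"
)

def parse_oom_kill(text: str):
    """Extract pid, name, virtual size and anonymous RSS (both in kB)."""
    m = OOM_LINE.search(text)
    if m is None:
        return None
    pid, name, total_vm_kb, anon_rss_kb = m.groups()
    return {
        "pid": int(pid),
        "name": name,
        "total_vm_kb": int(total_vm_kb),
        "anon_rss_kb": int(anon_rss_kb),
    }

def resident_share(record) -> float:
    """Fraction of the virtual size that was resident as anonymous memory."""
    return record["anon_rss_kb"] / record["total_vm_kb"]
```

For the excerpt above, the resident share works out to roughly 2.4% of the virtual size, so tracking RSS alongside virtual size gives a clearer picture of real memory pressure.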

Three Scenarios That Breed Orphans

Scenario 1: CDP connection drops abruptly. The agent is mid-task when network jitter kills the CDP WebSocket. The controlling tool detects the timeout and exits without sending Browser.close(). Chrome keeps running.

Scenario 2: Hard timeout interrupts. A 30-second timeout is set. The page loads at second 31, but the tool already discarded the Browser reference at second 30. Chrome continues running without a CDP connection, waiting for instructions that never come.

Scenario 3: Container force-destroyed. Common in Docker/K8s: kubectl delete pod sends SIGTERM. Chrome doesn't exit within the termination grace period (30 seconds by default in Kubernetes; docker stop waits 10). The kubelet then sends SIGKILL. If Chrome's GPU or Network process can't act on SIGKILL because it is stuck in D state, it persists as an orphan.
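Scenarios 1 and 2 share a root cause: the code path that hits the timeout never reaches the close call. A minimal sketch of the usual remedy, a `finally` block that runs on success, timeout, and cancellation alike, is shown below. `FakeBrowser` is a stand-in written for this example, not a real automation API; a real handle from your automation library would take its place.

```python
import asyncio

class FakeBrowser:
    """Stand-in for a real browser handle; records whether close() ran."""
    def __init__(self):
        self.closed = False

    async def goto(self, url: str, load_time_s: float):
        await asyncio.sleep(load_time_s)  # simulate a slow page load

    async def close(self):
        self.closed = True

async def run_task(browser, url: str, load_time_s: float, timeout_s: float) -> str:
    try:
        await asyncio.wait_for(browser.goto(url, load_time_s), timeout=timeout_s)
        return "ok"
    except asyncio.TimeoutError:
        return "timeout"
    finally:
        # Runs whether the task succeeded, timed out, or was cancelled,
        # so the browser reference is never dropped without a close.
        await browser.close()
```

If the CDP connection is already dead, a real close call can itself fail or hang, so in practice this pattern is worth pairing with an OS-level kill of the browser's processes as a fallback.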

Mitigation Approaches

Approach 1: cgroup resource limits

From production experience — limit user process memory at the systemd level so OOM kills only the target process, not system services:

# /etc/systemd/system/user-1000.slice.d/override.conf
[Slice]
MemoryHigh=1500M
MemoryMax=2G

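This confines OOM kills from a Chrome memory leak to that user's processes, sparing systemd-resolved and dbus. MemoryHigh and MemoryMax require cgroup v2; run systemctl daemon-reload after adding the drop-in.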
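The same limits can also be applied per browser launch instead of per user, by wrapping the command in a transient systemd scope. The helper below only builds the argument list. It assumes `systemd-run` is available, cgroup v2 is in use, and the memory controller is delegated to the user manager. The function name and the default values (copied from the slice config above) are illustrative.

```python
def scoped_command(argv, memory_max="2G", memory_high="1500M", user=True):
    """Wrap argv so it runs in a transient systemd scope with memory limits."""
    cmd = ["systemd-run", "--scope"]
    if user:
        cmd.append("--user")  # use the per-user service manager
    cmd += ["-p", f"MemoryHigh={memory_high}", "-p", f"MemoryMax={memory_max}"]
    # "--" ends systemd-run's own options; everything after is the command.
    return cmd + ["--"] + list(argv)
```

Passing the result to `subprocess.Popen` gives each browser its own cgroup, so one leaking instance hits its own ceiling without taking sibling instances down with it.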

Approach 2: Post-task process cleanup

Explicitly clean up Chrome processes after automation tasks:

# Clean up Chrome processes after task completion
pkill -f "chrome.*--headless"
pkill -f "chrome.*--remote-debugging"

Or, more precisely, scoped to a single user:

# PAM session-close hook: clean up on SSH disconnect
pkill -u <user> -f playwright
pkill -u <user> -f "chrome.*--headless"
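A pattern-based pkill also hits browsers that are still mid-task. One hedge is to target only processes older than any legitimate task could be. The sketch below parses the output of `ps -eo pid,etimes,args` (elapsed seconds per process, as provided by procps-ng). The function name, the `needle` default, and the age threshold are assumptions chosen for illustration.

```python
def select_stale(ps_output: str, max_age_s: int, needle: str = "--headless"):
    """Return PIDs of Chrome processes matching `needle` older than max_age_s."""
    stale = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split(None, 2)
        if len(fields) < 3:
            continue
        pid, etimes, args = fields
        if "chrome" in args and needle in args and int(etimes) > max_age_s:
            stale.append(int(pid))
    return stale
```

The returned PIDs can then be passed to `os.kill(pid, signal.SIGKILL)`. Choosing `max_age_s` comfortably above the longest expected task keeps active sessions out of the blast radius.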

Approach 3: Periodic browser recycling

Don't expect a browser instance to run for hours without leaking. The approach validated in the Playwright MCP issue is to recycle on a schedule:

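# Recycle the browser every N operations or after RECYCLE_INTERVAL seconds.
# Assumes `import time` and that last_recycle_at was set at launch.
elapsed = time.monotonic() - last_recycle_at
if operation_count >= RECYCLE_AFTER_N_OPS or elapsed > RECYCLE_INTERVAL:
    await browser.close()
    browser = await launch_browser()
    operation_count = 0
    last_recycle_at = time.monotonic()  # reset the timer as well as the counter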

For AI agents, recycling is cheap — the agent will re-navigate to the target page on the next operation naturally.
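The recycling check can be wrapped in a small helper so that callers never hold a stale browser reference. This is a sketch under stated assumptions: `RecyclingBrowser` is a hypothetical class, `launch` is any async callable returning an object with an async `close()`, and the default limits are placeholders, not recommendations from the issue.

```python
import asyncio
import time

class RecyclingBrowser:
    """Replaces the wrapped browser after max_ops operations or max_age_s seconds."""

    def __init__(self, launch, max_ops: int = 200, max_age_s: float = 1800.0):
        self._launch = launch      # async callable returning a browser-like object
        self._max_ops = max_ops
        self._max_age_s = max_age_s
        self._browser = None
        self._ops = 0
        self._born = 0.0
        self.recycles = 0

    async def get(self):
        """Return a browser, recycling first if either limit has been reached."""
        now = time.monotonic()
        expired = self._browser is not None and (
            self._ops >= self._max_ops or now - self._born > self._max_age_s
        )
        if expired:
            await self._browser.close()
            self._browser = None
            self.recycles += 1
        if self._browser is None:
            self._browser = await self._launch()
            self._ops = 0
            self._born = time.monotonic()
        self._ops += 1  # each get() counts as one operation
        return self._browser
```

Callers ask for `await pool.get()` before each operation instead of caching the handle, which keeps the recycle decision in one place.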

Approach 4: Monitoring and alerting

The simplest and most overlooked step. Track process counts:

# Watch the Chrome process count interactively, refreshed every minute
watch -n 60 'pgrep -c chrome'
# Alert on threshold (suitable for cron)
count=$(pgrep -c chrome)
if [ "$count" -gt 50 ]; then
  echo "WARNING: $count Chrome processes running"
fi
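The same check can be written as a script that returns a shell-style exit code, which makes it easy to hook into cron or a health probe. The sketch below assumes Linux and that the browser's process name is exactly `chrome`; builds that report `chromium` or `headless_shell` would need a different `name`. The function names and the threshold are illustrative.

```python
import os
import sys

def count_by_name(name: str = "chrome", proc_root: str = "/proc") -> int:
    """Count processes whose /proc/<pid>/comm equals `name`."""
    count = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                if f.read().strip() == name:
                    count += 1
        except OSError:
            continue  # the process exited between listdir() and open()
    return count

def check(threshold: int = 50, **kwargs) -> int:
    """Return 0 when healthy, 1 when the count exceeds the threshold."""
    n = count_by_name(**kwargs)
    if n > threshold:
        print(f"WARNING: {n} Chrome processes running", file=sys.stderr)
        return 1
    return 0
```

Ending a script with `sys.exit(check())` lets a scheduler or alerting wrapper react to the non-zero exit status.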

Three Highest-ROI Actions

If you're going to production today, do these three first:

  1. systemd cgroup memory limits (5 minutes to configure, prevents a single process from killing the server)
  2. Post-task pkill cleanup (3 lines of script, prevents orphan accumulation)
  3. Chrome process count monitoring (1 crontab line, at least you'll know when something's wrong)

Handle survival first. Optimization comes later.

The next article covers the second scaling bottleneck: connection pool exhaustion and tab accumulation.
