Chrome Headless at Scale (Part 1): Orphan Processes and a 24GB Memory Leak
SIGTERM won't kill them. Orphaned processes consume CPU. Memory usage reaches 24GB. Every browser automation team hits this wall eventually.
The Problem: You Closed the Browser, But It's Still Running
I first ran into this issue while helping a customer diagnose why their server kept getting slower. They were running a Steel self-hosted instance for a few days of extraction tasks. Available memory on the server dropped from 16GB to under 2GB. CPU load tripled. Restarting the Steel container didn't help.
ps aux | grep chrome revealed the answer: 47 Chrome processes on the system, over 30 of which had already lost their parent process.
These are orphan processes. The browser instance had ended, but Chrome's child processes — renderers, GPU process, network process — were never cleaned up. They kept consuming CPU and memory in the background.
This isn't unique to Steel. Chrome's multi-process architecture means every tab, every extension, every renderer is a separate OS process. When the controlling tool manages the browser via CDP, any gap in lifecycle management creates orphans.
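As a rough sketch of how such orphans can be spotted programmatically, the snippet below parses the output of `ps -eo pid,ppid,stat,args` and flags Chrome helper processes whose parent is PID 1. The function names and the "PPID of 1" heuristic are assumptions for illustration: on hosts with a subreaper (some container runtimes, systemd user sessions) orphans may be adopted by a different process, so `init_pids` would need adjusting.

```python
import subprocess


def find_orphaned_chrome(ps_output, init_pids=(1,)):
    """Return (pid, args) for Chrome helper processes whose parent is init.

    Expects text in the format produced by `ps -eo pid,ppid,stat,args`.
    """
    orphans = []
    for line in ps_output.splitlines():
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue
        pid, ppid, _stat, args = parts
        if not (pid.isdigit() and ppid.isdigit()):
            continue  # header row or malformed line
        # Helper processes carry a --type= flag; the main browser does not.
        is_chrome_child = "chrome" in args and "--type=" in args
        if is_chrome_child and int(ppid) in init_pids:
            orphans.append((int(pid), args))
    return orphans


def list_orphans():
    """Run ps locally and report orphaned Chrome helpers."""
    out = subprocess.run(
        ["ps", "-eo", "pid,ppid,stat,args"],
        capture_output=True, text=True, check=True,
    ).stdout
    return find_orphaned_chrome(out)
```

A helper process that still has its browser as parent is not reported, so a healthy instance produces an empty list.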
Chrome's Process Model: Why One Tab = Five Processes
To understand why orphans are inevitable at scale, you need to understand Chrome's process architecture.
One browser tab
│
├── Browser process (main)
├── Renderer process (one per tab)
├── GPU process (hardware acceleration)
├── Network process
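└── Utility process (decoding, plugins)
A browser instance with a single tab runs 3-5 OS-level processes. The browser, GPU, and network processes are shared within one instance, so the count grows fastest when every task launches its own browser: 100 single-tab instances means 300-500 processes. This isn't threading. Chrome uses process-level isolation, giving each process its own virtual address space.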
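To see this breakdown on a live system, one option is to group Chrome command lines by their `--type=` flag. The sketch below assumes the usual Chromium convention that helper processes are started with flags such as `--type=renderer`, `--type=gpu-process`, or `--type=utility`, while the main browser process has no `--type` flag. The function name is illustrative.

```python
import re
from collections import Counter

TYPE_FLAG = re.compile(r"--type=([\w-]+)")


def count_chrome_roles(cmdlines):
    """Count Chrome processes by role, given an iterable of command lines."""
    roles = Counter()
    for cmd in cmdlines:
        match = TYPE_FLAG.search(cmd)
        # No --type flag means this is the main browser process.
        roles[match.group(1) if match else "browser"] += 1
    return roles


if __name__ == "__main__":
    example = [
        "chrome --headless --remote-debugging-port=9222",
        "chrome --type=gpu-process --headless",
        "chrome --type=utility --utility-sub-type=network.mojom.NetworkService",
        "chrome --type=renderer --headless",
    ]
    for role, n in sorted(count_chrome_roles(example).items()):
        print(role, n)
```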
When CDP sends Browser.close():
- Browser process receives the signal
- Browser notifies children (Renderer, GPU, Network) to exit
- Normally, children receive SIGTERM, clean up, and exit
- In abnormal cases, children hang, and the Browser process exits without them
The last case is what creates orphan processes.
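One way to narrow this gap is to stop relying on the browser process to clean up after itself and signal the whole process group instead. The sketch below is a minimal illustration, assuming the launcher starts Chrome itself and that Chrome's helpers stay in the launcher's process group; `launch_in_own_group` and `stop_process_group` are hypothetical helper names, and the code is POSIX-only.

```python
import os
import signal
import subprocess


def launch_in_own_group(cmd):
    """Start cmd as the leader of a new session and process group."""
    return subprocess.Popen(cmd, start_new_session=True)


def stop_process_group(proc, grace_seconds=5.0):
    """SIGTERM the whole group, then SIGKILL whatever is left."""
    pgid = proc.pid  # with start_new_session=True the leader's pid is the pgid
    try:
        os.killpg(pgid, signal.SIGTERM)
    except ProcessLookupError:
        return proc.poll()  # group already gone
    try:
        proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        pass
    # Helpers may outlive the leader, so signal the group again regardless.
    try:
        os.killpg(pgid, signal.SIGKILL)
    except ProcessLookupError:
        pass
    return proc.wait()
```

Signalling the group rather than a single PID means a hung renderer is still reached after the browser process has exited. It does not help with processes stuck in uninterruptible sleep, which ignore SIGKILL until their wait finishes.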
Why Orphans Are So Hard to Kill
Killing an orphan is harder than you'd expect:
# First try: SIGTERM
kill <pid>
# Process still there
# Second try: SIGKILL
kill -9 <pid>
# Process still there, now showing as defunct
# Third try: check state
ps -eo pid,ppid,stat,args | grep '[c]hrome' | awk '$3 ~ /^Z/'
# Lots of Z (zombie) state Chrome processes
A zombie can't be killed because it has already exited: its entry stays in the process table until its parent calls wait() to read the exit status. For Chrome orphans, the more common case is that the process really is still alive, because it is in uninterruptible sleep (D state), waiting on I/O or a lock, and the signal is not acted on until that wait ends.
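To tell the two cases apart for a specific PID, the process state can be read from `/proc/<pid>/stat` on Linux. The following is a small sketch; the function names and hint strings are illustrative, while the field order (pid, comm in parentheses, state, ppid) follows the proc filesystem format.

```python
def parse_proc_stat(stat_text):
    """Extract (pid, comm, state, ppid) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last closing parenthesis.
    left, _, right = stat_text.rpartition(")")
    pid_str, _, comm = left.partition("(")
    fields = right.split()
    return int(pid_str), comm, fields[0], int(fields[1])


STATE_HINTS = {
    "Z": "zombie: already dead, waiting for its parent to call wait()",
    "D": "uninterruptible sleep: signals are deferred until the wait ends",
}


def explain_unkillable(pid):
    """Return a hint on why `kill -9 <pid>` appears to do nothing, or None."""
    with open(f"/proc/{pid}/stat") as fh:
        _, comm, state, ppid = parse_proc_stat(fh.read())
    hint = STATE_HINTS.get(state)
    return f"{comm} (pid {pid}, ppid {ppid}): {hint}" if hint else None
```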
A Playwright MCP issue documented the numbers: a session running for 21 hours 28 minutes reached 24,646,376 kB of virtual memory (~24.6 GB) with 598 MB resident (RSS). When the kernel OOM killer terminated the Node process, Chrome's child processes were left behind:
May 26 23:30:36 host kernel: oom-kill: constraint=CONSTRAINT_NONE
task=node, pid=832511, uid=1000
May 26 23:30:36 host kernel: Out of memory: Killed process 832511 (node)
total-vm:24646376kB, anon-rss:598188kB
The cascade effect is worse: after killing the target process, if memory is still insufficient, the OOM killer continues. In the same issue, systemd-resolved and dbus were killed as collateral, taking down the entire network stack and requiring a full reboot.
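The figures quoted from that log can be reproduced with a few lines of parsing. This is only a sketch: the regular expression targets the `total-vm` and `anon-rss` fields as they appear in the excerpt above, and the helper names are made up for illustration. Dividing kB by 10^6 gives the ~24.6 GB figure used in this article; because the kernel's kB are 1024-byte units, the same value is roughly 23.5 GiB.

```python
import re

OOM_FIELDS = re.compile(r"(total-vm|anon-rss):(\d+)kB")


def parse_oom_sizes(log_text):
    """Return {'total-vm': kB, 'anon-rss': kB} from an OOM-killer log excerpt."""
    return {name: int(kb) for name, kb in OOM_FIELDS.findall(log_text)}


def kb_to_gb(kb):
    """Approximate gigabytes, treating kB as 1000-based (the article's rounding)."""
    return kb / 1e6


def kb_to_gib(kb):
    """Gibibytes, treating kB as 1024-byte units (the kernel's convention)."""
    return kb / 2**20


if __name__ == "__main__":
    line = "total-vm:24646376kB, anon-rss:598188kB"
    sizes = parse_oom_sizes(line)
    print(f"virtual: {kb_to_gb(sizes['total-vm']):.1f} GB "
          f"({kb_to_gib(sizes['total-vm']):.1f} GiB)")
    print(f"resident: {sizes['anon-rss'] / 1e3:.0f} MB")
```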
Three Scenarios That Breed Orphans
Scenario 1: CDP connection drops abruptly
The agent is mid-task when network jitter drops the CDP WebSocket. The controlling tool detects the timeout and exits without sending Browser.close(). Chrome keeps running.
Scenario 2: Hard timeout interrupts
A 30-second timeout is set. The page loads at second 31, but the tool already discarded the Browser reference at second 30. Chrome continues running without a CDP connection, waiting for instructions that never come.
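The first two scenarios share a root cause: the code path that gives up on the task is not the code path that closes the browser. A minimal sketch of tying the two together is shown below. `launch` and `task` are hypothetical caller-supplied coroutine functions, and the browser object is only assumed to expose an async `close()` method, as most automation clients do.

```python
import asyncio


async def run_with_cleanup(launch, task, timeout_seconds):
    """Launch a browser, run task(browser) under a timeout, and always close it."""
    browser = await launch()
    try:
        return await asyncio.wait_for(task(browser), timeout=timeout_seconds)
    finally:
        # Runs on success, timeout, and cancellation, so the reference is
        # never dropped while Chrome is still alive.
        await browser.close()
```

If the CDP connection itself is dead, `close()` may fail or hang, so in practice this would likely need a fallback that signals the Chrome process directly.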
Scenario 3: Container force-destroyed
Common in Docker/K8s: kubectl delete pod sends SIGTERM. If Chrome doesn't exit within the termination grace period (30 seconds by default in Kubernetes; docker stop defaults to 10 seconds), the kubelet sends SIGKILL. If Chrome's GPU or Network process is stuck in D state and doesn't respond to SIGKILL, it persists as an orphan.
Mitigation Approaches
Approach 1: cgroup resource limits
From production experience — limit user process memory at the systemd level so OOM kills only the target process, not system services:
# /etc/systemd/system/user-1000.slice.d/override.conf
[Slice]
MemoryHigh=1500M
MemoryMax=2G
This ensures Chrome memory leaks only kill the user's processes, not systemd-resolved or dbus.
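To check how close the slice is to its limit, the cgroup v2 accounting files can be read directly. This is a sketch under assumptions: it presumes a cgroup v2 host where systemd places the slice at `/sys/fs/cgroup/user.slice/user-<uid>.slice`, and the function names are illustrative.

```python
from pathlib import Path


def parse_limit(text):
    """cgroup v2 limit files hold either a byte count or the word 'max'."""
    text = text.strip()
    return None if text == "max" else int(text)


def memory_used_fraction(current_text, max_text):
    """Return usage as a fraction of the hard limit, or None if unlimited."""
    limit = parse_limit(max_text)
    if limit is None:
        return None
    return int(current_text.strip()) / limit


def slice_memory_usage(uid=1000, root="/sys/fs/cgroup"):
    """Read usage for user-<uid>.slice (the path layout is an assumption)."""
    base = Path(root) / "user.slice" / f"user-{uid}.slice"
    return memory_used_fraction(
        (base / "memory.current").read_text(),
        (base / "memory.max").read_text(),
    )
```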
Approach 2: Post-task process cleanup
Explicitly clean up Chrome processes after automation tasks:
# Clean up Chrome processes after task completion
pkill -f "chrome.*--headless"
pkill -f "chrome.*--remote-debugging"
Or more precisely, by user UID:
# PAM session-close hook: clean up on SSH disconnect
pkill -u <user> -f playwright
pkill -u <user> -f "chrome.*--headless"
Approach 3: Periodic browser recycling
Don't expect a browser instance to run for hours without leaking. The approach validated in the Playwright MCP issue — recycle on a schedule:
# Recycle browser every N operations or M minutes
if operation_count >= RECYCLE_AFTER_N_OPS or time_since_last_recycle > RECYCLE_INTERVAL:
    await browser.close()
    browser = await launch_browser()
    operation_count = 0
For AI agents, recycling is cheap: the agent will naturally re-navigate to the target page on its next operation.
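The recycling rule can be wrapped in a small helper so callers never hold a stale browser reference. The class below is a sketch, not the approach from the referenced issue: `BrowserRecycler`, its default thresholds, and the `launch` factory are all hypothetical, and the browser is only assumed to have an async `close()` method.

```python
import time


class BrowserRecycler:
    """Hands out a browser and replaces it after N operations or M seconds."""

    def __init__(self, launch, max_ops=100, max_age_seconds=1800,
                 clock=time.monotonic):
        self._launch = launch  # hypothetical async factory returning a browser
        self._max_ops = max_ops
        self._max_age = max_age_seconds
        self._clock = clock
        self._browser = None
        self._ops = 0
        self._started = 0.0

    async def get(self):
        """Return a live browser, recycling first if a limit was reached."""
        expired = self._browser is not None and (
            self._ops >= self._max_ops
            or self._clock() - self._started > self._max_age
        )
        if expired:
            await self._browser.close()
            self._browser = None
        if self._browser is None:
            self._browser = await self._launch()
            self._ops = 0
            self._started = self._clock()
        self._ops += 1
        return self._browser
```

Calling `await recycler.get()` before each operation keeps the counting in one place; the injectable `clock` exists only to make the age limit easy to test.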
Approach 4: Monitoring and alerting
The simplest and most overlooked step. Track process counts:
# Check Chrome process count every minute
watch -n 60 'pgrep -c chrome || echo "no chrome"'
# Alert on threshold
count=$(pgrep -c chrome)
if [ "$count" -gt 50 ]; then
    echo "WARNING: $count Chrome processes running"
fi
Three Highest-ROI Actions
If you're going to production today, do these three first:
- systemd cgroup memory limits (5 minutes to configure, prevents single process from killing the server)
- Post-task pkill cleanup (3 lines of script, prevents orphan accumulation)
- Chrome process count monitoring (1 crontab line, at least you'll know when something's wrong)
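For teams that prefer to run the process-count check from an existing Python service rather than cron, a rough equivalent is sketched below. It is Linux-only, reads `/proc/<pid>/comm`, and matches on the substring "chrome", which is an assumption: some builds report names such as `chromium` or `headless_shell`. The threshold of 50 mirrors the shell example earlier in this article.

```python
from pathlib import Path


def count_processes(name_fragment="chrome", proc_root="/proc"):
    """Count processes whose /proc/<pid>/comm contains name_fragment."""
    count = 0
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue
        try:
            comm = (entry / "comm").read_text().strip()
        except OSError:
            continue  # process exited between listing and reading
        if name_fragment in comm:
            count += 1
    return count


def check_threshold(count, limit=50):
    """Return a warning string when count exceeds limit, else None."""
    if count > limit:
        return f"WARNING: {count} Chrome processes running"
    return None
```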
Handle survival first. Optimization comes later.
The next article covers the second scaling bottleneck: connection pool exhaustion and tab accumulation.