Trafilatura + CloakBrowser: The Complete Full-Text Scraping Pipeline
Combine CloakBrowser's fingerprint-masked rendering with Trafilatura's extraction for sites behind Cloudflare, SPAs, and login walls.
The Problem
Trafilatura processes static HTML only. But many target sites fall into these categories:
- SPAs (Vue/React): Content is rendered dynamically by JavaScript, so the HTML source is just an empty shell
- Cloudflare-protected sites: Direct requests return Turnstile challenges, not real content
- Login-required sites: Unauthenticated requests return redirects or empty pages
Result: Trafilatura returns empty or incomplete extraction.
Solution: CloakBrowser renders the page and returns the full HTML, then Trafilatura extracts from it. CloakBrowser solves "getting the complete page," and Trafilatura solves "extracting clean content from it."
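The choice between the two paths can be sketched as a fallback: try the cheap static extraction first, and render only when the result comes back empty or suspiciously short. Everything named here is an illustrative assumption rather than part of either library: the `extract_with_fallback` helper, the 200-character threshold, and the three injected callables. In practice `fetch_static` might wrap `trafilatura.fetch_url`, `extract` might wrap `trafilatura.extract`, and `render` might wrap a CloakBrowser page load.

```python
from typing import Callable, Optional


def is_usable(text: Optional[str], min_chars: int = 200) -> bool:
    """Treat empty or very short extractions as failures (threshold is an assumption)."""
    return bool(text) and len(text.strip()) >= min_chars


def extract_with_fallback(
    url: str,
    fetch_static: Callable[[str], Optional[str]],
    render: Callable[[str], Optional[str]],
    extract: Callable[[str], Optional[str]],
    min_chars: int = 200,
) -> tuple[Optional[str], str]:
    """Return (text, path), where path is 'static', 'rendered', or 'failed'."""
    # First attempt: plain HTTP fetch, no browser involved.
    html = fetch_static(url)
    text = extract(html) if html else None
    if is_usable(text, min_chars):
        return text, "static"

    # Second attempt: full browser render, then the same extraction step.
    html = render(url)
    text = extract(html) if html else None
    if is_usable(text, min_chars):
        return text, "rendered"
    return None, "failed"
```

Passing the three steps in as callables keeps the decision logic testable without a browser or network access.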
Pipeline Architecture
┌──────────────────┐     ┌──────────────────┐     ┌──────────────────┐
│   16Yun Proxy    │     │   CloakBrowser   │     │   Trafilatura    │
│  Residential IP  │ ──→ │  Fingerprint +   │ ──→ │    Content +     │
│  geoip matching  │     │    JS Render     │     │ Metadata Extract │
└──────────────────┘     └──────────────────┘     └────────┬─────────┘
                                                           │
                                                    ┌──────┴──────┐
                                                    │   Storage   │
                                                    │   File/DB   │
                                                    └─────────────┘
Basic Implementation
import trafilatura
from cloakbrowser import launch


def render_and_extract(url, proxy=None):
    """Render page with CloakBrowser, then extract with Trafilatura"""
    browser = launch(
        proxy=proxy,
        headless=False,
        geoip=True,
        humanize=True,
    )
    page = browser.new_page()
    try:
        page.goto(url, wait_until="networkidle")
        page.wait_for_timeout(3000)
        html_content = page.content()
        result = trafilatura.extract(
            html_content,
            output_format="markdown",
            with_metadata=True,
            include_tables=True,
        )
        return result
    finally:
        browser.close()


result = render_and_extract(
    "https://example.com/article",
    proxy="http://user:pass@proxy.16yun.cn:8888",
)
print(result)
Real-World Case: Cloudflare-Protected Blog
import json
import os

import trafilatura
from cloakbrowser import launch_persistent_context


def scrape_cloudflare_protected_site(max_articles=10):
    """Scrape a Cloudflare-protected tech blog"""
    os.makedirs("articles", exist_ok=True)  # output directory must exist before writing
    ctx = launch_persistent_context(
        "./profiles/tech-blog",
        headless=False,
        proxy="http://user:pass@proxy.16yun.cn:8888",
        geoip=True,
        humanize=True,
    )
    try:
        page = ctx.new_page()
        # Visit the homepage first; Turnstile resolves automatically
        page.goto("https://example.com", wait_until="networkidle")
        page.wait_for_timeout(5000)
        # Get article links, dropping duplicates while keeping page order
        article_links = page.eval_on_selector_all(
            "a[href*='/blog/']",
            "elements => elements.map(el => el.href)",
        )
        article_links = list(dict.fromkeys(article_links))
        print(f"Found {len(article_links)} articles")
        # Extract each article
        batch = article_links[:max_articles]
        for i, article_url in enumerate(batch, start=1):
            print(f"[{i}/{len(batch)}] {article_url}")
            page.goto(article_url, wait_until="networkidle")
            page.wait_for_timeout(2000)
            html_content = page.content()
            result = trafilatura.extract(
                html_content,
                output_format="json",
                with_metadata=True,
                include_tables=True,
            )
            if result:
                data = json.loads(result)
                data["url"] = article_url
                with open(f"articles/article-{i}.json", "w", encoding="utf-8") as f:
                    json.dump(data, f, ensure_ascii=False, indent=2)
                print(f"  ✅ Saved: {data.get('title', 'untitled')}")
    finally:
        ctx.close()  # always release the browser profile, even on errors


scrape_cloudflare_protected_site()
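Because the scraper relies on the challenge clearing by itself, it can help to confirm that the HTML is real content before handing it to Trafilatura. The following is a minimal sketch of such a check. The marker strings are heuristics taken from what Cloudflare interstitial pages have commonly contained; they are assumptions that may change or produce false results. The helper names `looks_like_challenge` and `wait_until_cleared` are hypothetical.

```python
import time
from typing import Callable, Optional

# Heuristic markers (assumptions): the interstitial's usual title and a
# JavaScript variable that challenge pages have commonly defined.
CHALLENGE_MARKERS = (
    "<title>just a moment...</title>",
    "_cf_chl_opt",
)


def looks_like_challenge(html: str) -> bool:
    """Return True if the HTML resembles a challenge page rather than content."""
    lowered = html.lower()
    return any(marker in lowered for marker in CHALLENGE_MARKERS)


def wait_until_cleared(
    get_html: Callable[[], str],
    attempts: int = 5,
    delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll get_html() until the page stops looking like a challenge.

    Returns the cleared HTML, or None if every attempt still looked blocked.
    """
    for attempt in range(attempts):
        html = get_html()
        if not looks_like_challenge(html):
            return html
        if attempt < attempts - 1:
            sleep(delay_s)  # give the challenge time to resolve
    return None
```

With a live page, `wait_until_cleared(page.content)` would replace the single `page.content()` call, and a `None` result signals that the article should be skipped or retried instead of extracted.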
Pipeline Component
Package the render+extract pipeline as a reusable component:
import trafilatura
import json
from cloakbrowser import launch_persistent_context


class RenderingExtractor:
    """Render + Extract pipeline"""

    def __init__(self, proxy, profile_dir="./profiles"):
        self.proxy = proxy
        self.profile_dir = profile_dir

    def extract(self, url, profile_name="default"):
        ctx = launch_persistent_context(
            f"{self.profile_dir}/{profile_name}",
            headless=False,
            proxy=self.proxy,
            geoip=True,
            humanize=True,
        )
        page = ctx.new_page()
        try:
            page.goto(url, wait_until="networkidle")
            page.wait_for_timeout(3000)
            html_content = page.content()
            result = trafilatura.extract(
                html_content,
                output_format="json",
                with_metadata=True,
                include_tables=True,
            )
            return json.loads(result) if result else None
        finally:
            ctx.close()


extractor = RenderingExtractor(
    proxy="http://user:pass@proxy.16yun.cn:8888"
)
result = extractor.extract("https://example.com/article", "site-profile")
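The architecture ends in a File/DB storage step, which the component above leaves to the caller. One possible shape for the DB side is sketched below with SQLite from the standard library. The table layout and the helper names `open_store` and `save_article` are assumptions for illustration only; `data` stands for the dictionary returned by `extract`.

```python
import json
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a small article store; the schema is illustrative."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS articles (
            url        TEXT PRIMARY KEY,
            title      TEXT,
            payload    TEXT NOT NULL,
            fetched_at TEXT DEFAULT CURRENT_TIMESTAMP
        )
        """
    )
    return conn


def save_article(conn: sqlite3.Connection, url: str, data: dict) -> None:
    """Insert the extracted record, replacing any earlier row for the same URL."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO articles (url, title, payload) VALUES (?, ?, ?)",
            (url, data.get("title"), json.dumps(data, ensure_ascii=False)),
        )
```

Keying rows on the URL makes repeated runs idempotent: scraping the same article again updates the row instead of duplicating it.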
Role of Each Component
| Component | Role | 16Yun Product |
|---|---|---|
| Proxy | Residential exit IP, avoid CDN blocking | Crawler / API / Dedicated Proxy |
| CloakBrowser | Fingerprint masking + JS rendering + anti-bot bypass | geoip=True for timezone matching |
| Trafilatura | Extract clean text + metadata from full HTML | Not directly related |
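The examples in this article hard-code the proxy URL with its credentials. A small sketch of building that string from environment variables is shown here. The variable names `PROXY_USER`, `PROXY_PASS`, `PROXY_HOST`, and `PROXY_PORT` and both helper functions are hypothetical. Percent-encoding the credentials is standard URL practice, but whether CloakBrowser decodes them that way is an assumption to verify against its documentation.

```python
import os
from typing import Mapping, Optional
from urllib.parse import quote


def build_proxy_url(
    user: str,
    password: str,
    host: str = "proxy.16yun.cn",
    port: int = 8888,
    scheme: str = "http",
) -> str:
    """Assemble a proxy URL, percent-encoding characters such as '@' or ':'."""
    return f"{scheme}://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}"


def proxy_from_env(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Read proxy settings from the environment; return None if credentials are missing."""
    user = env.get("PROXY_USER")
    password = env.get("PROXY_PASS")
    if not user or not password:
        return None
    return build_proxy_url(
        user,
        password,
        host=env.get("PROXY_HOST", "proxy.16yun.cn"),
        port=int(env.get("PROXY_PORT", "8888")),
    )
```

The result can then be passed wherever the examples use the literal `proxy="http://user:pass@proxy.16yun.cn:8888"`.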
Performance Considerations
Render Timing
SPA pages may need 5-10 seconds to render fully. Navigate first, then wait for a content selector instead of relying on a fixed delay:
page.goto(url, wait_until="networkidle")
page.wait_for_selector("article", timeout=15000)
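Not every site wraps its content in an `article` element, so a single selector can time out on pages that rendered correctly. A hedged sketch of trying several candidate selectors in turn is shown below. The selector list and the `wait_for_content` helper are assumptions. The sketch also assumes the page object raises Playwright's `TimeoutError` when a wait expires, falling back to the built-in one if Playwright is not importable.

```python
from typing import Optional, Sequence

try:
    # Assumption: CloakBrowser pages behave like Playwright pages here.
    from playwright.sync_api import TimeoutError as WaitTimeout
except ImportError:  # keeps the sketch importable without Playwright
    WaitTimeout = TimeoutError


def wait_for_content(
    page,
    selectors: Sequence[str] = ("article", "main", "[role='main']"),
    timeout_ms: int = 15000,
) -> Optional[str]:
    """Return the first selector that appears, or None if all of them time out.

    The timeout applies per selector, so the worst case is
    len(selectors) * timeout_ms.
    """
    for selector in selectors:
        try:
            page.wait_for_selector(selector, timeout=timeout_ms)
            return selector
        except WaitTimeout:
            continue  # this selector never appeared; try the next candidate
    return None
```

Call it after `page.goto(...)`. A `None` result is a useful signal that the page may not have rendered and that extraction is likely to come back empty.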
Concurrency Limits
Each CloakBrowser instance uses ~300-500MB RAM:
| Server | Recommended Concurrency |
|---|---|
| 4-core 8GB | 3-5 |
| 8-core 16GB | 8-12 |
| 16-core 32GB | 16-25 |
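One way to enforce such a limit is a bounded thread pool, sketched here with the standard library. The `extract_many` helper is hypothetical, and `extract_one` stands for any function that takes a URL and returns an extraction result. The sketch assumes each call launches and closes its own browser inside the worker thread, since browser objects are generally not safe to share across threads.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterable


def extract_many(
    urls: Iterable[str],
    extract_one: Callable[[str], object],
    max_workers: int = 4,
) -> tuple[dict, dict]:
    """Run extract_one over urls with at most max_workers running at once.

    Returns (results, errors), both keyed by URL, so one failing page
    does not abort the rest of the batch.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(extract_one, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # record the failure and keep going
                errors[url] = exc
    return results, errors
```

For a 4-core 8GB server, `max_workers` would be set somewhere in the 3-5 range from the table.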
Summary
| Scenario | Solution | Benefit |
|---|---|---|
| Static HTML | Trafilatura directly | Fastest, lightest |
| JS-rendered pages | CloakBrowser + Trafilatura | Full content after render |
| Cloudflare protected | CloakBrowser (C++ patches) + residential proxy | Fingerprint + IP bypass |
| Large-scale | Pipeline + proxy pool + caching | Scalable, fault-tolerant |
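For the large-scale row, the caching and proxy-pool pieces might look like the following sketch. It is one possible design, not a prescribed one: rendered HTML is cached on disk under a hash of the URL with a time-to-live, and proxies are taken round-robin from a pool. The `HtmlCache` class, the `render_cached` helper, the one-hour TTL, and the `render(url, proxy)` signature are all assumptions.

```python
import hashlib
import itertools
import time
from pathlib import Path
from typing import Callable, Iterator, Optional


class HtmlCache:
    """Store rendered HTML on disk, keyed by a SHA-256 hash of the URL."""

    def __init__(self, root: str, ttl_s: float = 3600):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self.ttl_s = ttl_s

    def _path(self, url: str) -> Path:
        return self.root / (hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html")

    def get(self, url: str) -> Optional[str]:
        path = self._path(url)
        # Treat missing or expired entries as cache misses.
        if not path.exists() or time.time() - path.stat().st_mtime > self.ttl_s:
            return None
        return path.read_text(encoding="utf-8")

    def put(self, url: str, html: str) -> None:
        self._path(url).write_text(html, encoding="utf-8")


def render_cached(
    url: str,
    render: Callable[[str, str], Optional[str]],
    cache: HtmlCache,
    proxies: Iterator[str],
) -> Optional[str]:
    """Return cached HTML if fresh; otherwise render through the next proxy."""
    cached = cache.get(url)
    if cached is not None:
        return cached
    html = render(url, next(proxies))
    if html:
        cache.put(url, html)
    return html
```

A pool is created once with `itertools.cycle([...])` and shared across calls, so each cache miss moves on to the next proxy while cache hits consume neither a proxy nor a browser launch.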
The combined effect:
Target site sees: residential IP + real Chrome fingerprint + human behavior → shows real content
Scraper gets: complete HTML → Trafilatura extracts clean text and metadata