Trafilatura + CloakBrowser: The Complete Full-Text Scraping Pipeline

Combine CloakBrowser's fingerprint-masked rendering with Trafilatura's extraction for sites behind Cloudflare, SPAs, and login walls.

16Yun Engineering Team | May 1, 2026 | 2 min read

The Problem

Trafilatura extracts content from the HTML it is given; it does not execute JavaScript or pass bot challenges. Many target sites fall into one of these categories:

  • SPAs (Vue/React): Content is rendered dynamically by JavaScript, so the HTML source is just a shell
  • Cloudflare-protected sites: Direct requests return Turnstile challenges, not real content
  • Login-required sites: Unauthenticated requests return redirects or empty pages

Result: Trafilatura returns an empty or incomplete extraction.

Solution: CloakBrowser renders → gets the full HTML → Trafilatura extracts. The former solves "getting the complete page," the latter solves "extracting clean content from it."
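
Since rendering is much more expensive than a plain request, one possible arrangement is to try the static path first and render only when extraction comes back empty or thin. The sketch below is an illustration, not part of either library: `extract_with_fallback`, its three callables, and the `min_chars` threshold are all hypothetical names and values chosen for the example.

```python
from typing import Callable, Optional

def extract_with_fallback(
    url: str,
    fetch_static: Callable[[str], Optional[str]],
    render: Callable[[str], Optional[str]],
    extract: Callable[[str], Optional[str]],
    min_chars: int = 200,  # hypothetical threshold; tune per site
) -> Optional[str]:
    """Try the cheap static path first; render only when it comes back thin."""
    html = fetch_static(url)
    text = extract(html) if html else None
    if text and len(text) >= min_chars:
        return text  # static HTML was enough, no browser needed

    # Static extraction was empty or too short: pay for a full render.
    rendered_html = render(url)
    rendered_text = extract(rendered_html) if rendered_html else None

    # Keep whichever result is longer so a failed render never loses data.
    candidates = [t for t in (text, rendered_text) if t]
    return max(candidates, key=len) if candidates else None
```

In practice `fetch_static` could be `trafilatura.fetch_url`, `extract` a thin wrapper around `trafilatura.extract`, and `render` a function that opens the page in CloakBrowser and returns `page.content()`.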

Pipeline Architecture

┌──────────────────┐     ┌──────────────────┐     ┌──────────────────┐
│  16Yun Proxy     │     │  CloakBrowser    │     │  Trafilatura     │
│  Residential IP  │ ──→ │  Fingerprint +   │ ──→ │  Content +       │
│  geoip matching  │     │  JS Render       │     │  Metadata Extract│
└──────────────────┘     └──────────────────┘     └──────────────────┘
                                                          │
                                                   ┌──────┴──────┐
                                                   │  Storage    │
                                                   │  File/DB    │
                                                   └─────────────┘

Basic Implementation

import trafilatura
from cloakbrowser import launch

def render_and_extract(url, proxy=None):
    """Render page with CloakBrowser, then extract with Trafilatura"""

    browser = launch(
        proxy=proxy,
        headless=False,
        geoip=True,
        humanize=True,
    )
    page = browser.new_page()

    try:
        page.goto(url, wait_until="networkidle")
        page.wait_for_timeout(3000)

        html_content = page.content()

        result = trafilatura.extract(
            html_content,
            output_format="markdown",
            with_metadata=True,
            include_tables=True,
        )

        return result
    finally:
        browser.close()

result = render_and_extract(
    "https://example.com/article",
    proxy="http://user:pass@proxy.16yun.cn:8888",
)
print(result)

Real-World Case: Cloudflare-Protected Blog

import json
from pathlib import Path

import trafilatura
from cloakbrowser import launch_persistent_context

def scrape_cloudflare_protected_site(max_articles=10, out_dir="articles"):
    """Scrape a Cloudflare-protected tech blog"""

    # Create the output directory up front so the file writes below cannot fail on it
    Path(out_dir).mkdir(parents=True, exist_ok=True)

    ctx = launch_persistent_context(
        "./profiles/tech-blog",
        headless=False,
        proxy="http://user:pass@proxy.16yun.cn:8888",
        geoip=True,
        humanize=True,
    )

    try:
        page = ctx.new_page()

        # Visit homepage; Turnstile resolves automatically
        page.goto("https://example.com", wait_until="networkidle")
        page.wait_for_timeout(5000)

        # Get article links
        article_links = page.eval_on_selector_all(
            "a[href*='/blog/']",
            "elements => elements.map(el => el.href)"
        )
        print(f"Found {len(article_links)} articles")

        # Extract each article, capped at max_articles
        batch = article_links[:max_articles]
        for i, article_url in enumerate(batch, start=1):
            print(f"[{i}/{len(batch)}] {article_url}")
            page.goto(article_url, wait_until="networkidle")
            page.wait_for_timeout(2000)

            html_content = page.content()
            result = trafilatura.extract(
                html_content,
                output_format="json",
                with_metadata=True,
                include_tables=True,
            )

            if result:
                data = json.loads(result)
                data["url"] = article_url

                out_path = Path(out_dir) / f"article-{i}.json"
                with open(out_path, "w", encoding="utf-8") as f:
                    json.dump(data, f, ensure_ascii=False, indent=2)
                print(f"  ✅ Saved: {data.get('title', 'untitled')}")
    finally:
        # Always release the browser, even if a page fails to load
        ctx.close()

scrape_cloudflare_protected_site()
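
A selector like `a[href*='/blog/']` will often match the same article several times (teaser image, title, "read more", comment anchors). A small cleanup step before the extraction loop can avoid rendering the same page twice. The helper below is a sketch using only the standard library; `clean_article_links` is a hypothetical name, and treating a trailing slash as insignificant is an assumption that may not hold for every site.

```python
from urllib.parse import urldefrag, urlsplit

def clean_article_links(links, base_host):
    """Drop fragments, off-site links, and duplicates while keeping page order."""
    seen = set()
    cleaned = []
    for link in links:
        url, _fragment = urldefrag(link)   # ".../blog/a#comments" -> ".../blog/a"
        url = url.rstrip("/")              # assumption: ".../a/" and ".../a" are the same page
        if urlsplit(url).hostname != base_host:
            continue                       # skip links that leave the site
        if url in seen:
            continue
        seen.add(url)
        cleaned.append(url)
    return cleaned
```

It could be applied as `article_links = clean_article_links(article_links, "example.com")` right after the links are collected.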

Pipeline Component

Package the render+extract pipeline as a reusable component:

import trafilatura
import json
from cloakbrowser import launch_persistent_context

class RenderingExtractor:
    """Render + Extract pipeline"""

    def __init__(self, proxy, profile_dir="./profiles"):
        self.proxy = proxy
        self.profile_dir = profile_dir

    def extract(self, url, profile_name="default"):
        ctx = launch_persistent_context(
            f"{self.profile_dir}/{profile_name}",
            headless=False,
            proxy=self.proxy,
            geoip=True,
            humanize=True,
        )
        page = ctx.new_page()

        try:
            page.goto(url, wait_until="networkidle")
            page.wait_for_timeout(3000)
            html_content = page.content()
            result = trafilatura.extract(
                html_content,
                output_format="json",
                with_metadata=True,
                include_tables=True,
            )
            return json.loads(result) if result else None
        finally:
            ctx.close()

extractor = RenderingExtractor(
    proxy="http://user:pass@proxy.16yun.cn:8888"
)
result = extractor.extract("https://example.com/article", "site-profile")
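
Rendered fetches fail more often than plain requests (proxy hiccups, slow challenges, navigation timeouts), so a component like this is usually wrapped in a retry. The following is a minimal sketch with exponential backoff; `extract_with_retry`, the attempt count, and the delays are illustrative assumptions rather than anything CloakBrowser or Trafilatura provides.

```python
import time

def extract_with_retry(extract_fn, url, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call extract_fn(url) until it returns a result or attempts run out."""
    last_error = None
    for attempt in range(attempts):
        try:
            result = extract_fn(url)
            if result:
                return result
            last_error = None  # the call worked but produced nothing
        except Exception as exc:  # broad on purpose: error types vary by driver
            last_error = exc
        if attempt < attempts - 1:
            # Exponential backoff: 2s, then 4s with the default base_delay
            sleep(base_delay * (2 ** attempt))
    if last_error is not None:
        raise last_error  # surface the failure from the final attempt
    return None
```

With the class above this could be called as `extract_with_retry(extractor.extract, "https://example.com/article")`. The `sleep` parameter exists only so the delay can be replaced in tests.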

Role of Each Component

Component    | Role                                                 | 16Yun Product
-------------|------------------------------------------------------|---------------------------------
Proxy        | Residential exit IP, avoid CDN blocking              | Crawler / API / Dedicated Proxy
CloakBrowser | Fingerprint masking + JS rendering + anti-bot bypass | geoip=True for timezone matching
Trafilatura  | Extract clean text + metadata from full HTML         | Not directly related

Performance Considerations

Render Timing

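SPA pages may need 5-10 seconds to render fully. Navigate first, then wait for the element that holds the content rather than relying on a fixed delay:

page.goto(url, wait_until="networkidle")
page.wait_for_selector("article", timeout=15000)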
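
Not every site wraps its content in an `article` element, so a slightly more general wait can be useful. The sketch below assumes the page object behaves like a Playwright page (as the calls used throughout this article suggest): `wait_for_selector` raises on timeout, and a comma-separated CSS selector list matches whichever container appears first. The helper name, the default selector list, and the fallback delay are hypothetical.

```python
def wait_for_content(page, selectors=("article", "main", "[role='main']"),
                     timeout_ms=15000, fallback_ms=3000):
    """Wait for any content container; fall back to a fixed delay if none shows up."""
    combined = ", ".join(selectors)  # CSS selector list: any one match is enough
    try:
        page.wait_for_selector(combined, timeout=timeout_ms)
        return True
    except Exception:  # assumed: the driver raises on timeout, as Playwright does
        page.wait_for_timeout(fallback_ms)
        return False
```

Called right after `page.goto(...)`, the return value tells the caller whether a known container was found or whether the HTML was taken after a blind wait and deserves a closer look.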

Concurrency Limits

Each CloakBrowser instance uses ~300-500MB RAM:

Server       | Recommended Concurrency
-------------|------------------------
4-core 8GB   | 3-5
8-core 16GB  | 8-12
16-core 32GB | 16-25
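
One way to enforce such a cap is a fixed-size worker pool, so that no more than the chosen number of browsers is alive at once. This is a sketch under assumptions: `scrape_many` is a hypothetical helper, `extract_fn` is expected to launch and close its own browser per call (as `RenderingExtractor.extract` does), and running one browser per thread is assumed to work with CloakBrowser's synchronous API.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def scrape_many(urls, extract_fn, max_workers=4):
    """Run extract_fn over urls with at most max_workers calls in flight."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(extract_fn, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                results[url] = None  # record the failure, keep the batch going
                print(f"failed: {url}: {exc}")
    return results
```

Pick `max_workers` from the table for your server size. If persistent profiles are used, each worker should get its own profile directory, since Chromium does not allow two running instances to share one.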

Summary

Scenario             | Solution                                       | Problem Solved
---------------------|------------------------------------------------|---------------------------
Static HTML          | Trafilatura directly                           | Fastest, lightest
JS-rendered pages    | CloakBrowser + Trafilatura                     | Full content after render
Cloudflare protected | CloakBrowser (C++ patches) + residential proxy | Fingerprint + IP bypass
Large-scale          | Pipeline + proxy pool + caching                | Scalable, fault-tolerant

The combined effect:

Target site sees: residential IP + real Chrome fingerprint + human behavior → shows real content

Scraper gets: complete HTML → Trafilatura extracts clean text and metadata
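
For the large-scale scenario, the caching piece can be as simple as a file per URL, so that a rerun does not re-render pages that were already extracted. The sketch below is one possible shape using only the standard library; `cached_extract`, the `./cache` directory, and the SHA-256 file naming are illustrative choices, and there is no expiry logic.

```python
import hashlib
import json
from pathlib import Path

def cached_extract(url, extract_fn, cache_dir="./cache"):
    """Return the cached result for url, calling extract_fn only on a miss."""
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = cache / f"{key}.json"

    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))

    result = extract_fn(url)
    if result is not None:  # do not cache failures, so they get retried later
        path.write_text(json.dumps(result, ensure_ascii=False), encoding="utf-8")
    return result
```

Here `extract_fn` is expected to return a JSON-serializable dict or `None`, which matches what `RenderingExtractor.extract` returns.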
