Trafilatura Advanced: Custom Extraction, Language Detection & Performance

Custom extraction strategies, language detection integration, offline HTML batch processing, caching, and incremental updates.

16Yun Engineering Team | Apr 29, 2026 | 1 min read

Custom Extraction Strategies

Exclude Specific Elements

Pre-process HTML to remove unwanted sections before extraction:

import trafilatura
from lxml import html

downloaded = trafilatura.fetch_url("https://example.com/article")
if downloaded is None:
    raise SystemExit("Download failed")

tree = html.fromstring(downloaded)
for element in tree.xpath("//aside | //nav | //div[@class='sidebar']"):
    # drop_tree() removes the element but keeps the text that follows it;
    # getparent().remove(element) would silently delete that tail text too
    element.drop_tree()

cleaned_html = html.tostring(tree, encoding="unicode")
result = trafilatura.extract(cleaned_html, output_format="markdown")
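
The `@class='sidebar'` test above only matches when the attribute is exactly `sidebar`, so an element with `class="sidebar left"` would slip through. A minimal sketch of a helper that builds a token-aware XPath test instead; `class_xpath` and `NOISE_XPATH` are hypothetical names, not part of lxml or Trafilatura:

```python
def class_xpath(tag: str, class_name: str) -> str:
    """Build an XPath that matches `tag` elements carrying `class_name`
    as one of possibly several space-separated class tokens."""
    if not class_name or " " in class_name or "'" in class_name:
        raise ValueError("class_name must be a single token without quotes")
    # Pad both sides with spaces so 'sidebar' does not match 'sidebar-wide'
    return (
        f"//{tag}[contains(concat(' ', normalize-space(@class), ' '), "
        f"' {class_name} ')]"
    )


# Combine several selectors into one union expression for tree.xpath()
NOISE_XPATH = " | ".join(
    ["//aside", "//nav", class_xpath("div", "sidebar")]
)
print(NOISE_XPATH)
```

The resulting string can replace the literal expression passed to `tree.xpath()` above. Recent Trafilatura releases also document a `prune_xpath` argument on `extract()` that takes such expressions directly; check the installed version before relying on it.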

Target Specific Regions

When the page has a clear content container, narrow the extraction scope:

import trafilatura
from lxml import html

downloaded = trafilatura.fetch_url("https://example.com/article")
tree = html.fromstring(downloaded)

article_elem = tree.xpath("//article")
if article_elem:
    article_html = html.tostring(article_elem[0], encoding="unicode")
    result = trafilatura.extract(article_html, output_format="markdown")
else:
    result = trafilatura.extract(downloaded, output_format="markdown")
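
Narrowing the scope can backfire when the `<article>` element wraps only part of the text. One possible safeguard, sketched here with a hypothetical `pick_result` helper and an arbitrary 50% threshold, is to run both extractions and keep the targeted result only if it retains a reasonable share of the full-page text:

```python
from typing import Optional


def pick_result(
    targeted: Optional[str],
    full_page: Optional[str],
    min_ratio: float = 0.5,
) -> Optional[str]:
    """Prefer the targeted extraction unless it looks truncated.

    `targeted` and `full_page` are extraction outputs (or None when
    extraction produced nothing). `min_ratio` is an assumed threshold,
    not a Trafilatura setting.
    """
    if not targeted:
        return full_page
    if not full_page:
        return targeted
    # A much shorter targeted result suggests the container missed content
    if len(targeted) < min_ratio * len(full_page):
        return full_page
    return targeted
```

Here `targeted` would be the result of extracting `article_html` and `full_page` the result of extracting `downloaded`. The price is a second extraction per page, so this suits sites where the container is known to be unreliable.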

Preserve Formatting

result = trafilatura.extract(
    downloaded,
    output_format="markdown",
    include_formatting=True,
    include_tables=True,
)

Language Detection

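Trafilatura can filter by language on multi-language sites when the optional py3langid package is installed:

pip install py3langid

To install py3langid together with Trafilatura's other optional dependencies:

pip install "trafilatura[all]"

With the detector available, passing target_language="en" (an ISO 639-1 code) to extract() makes it return nothing for pages detected as another language.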

Detect During Extraction

import json

import py3langid as langid
import trafilatura

downloaded = trafilatura.fetch_url("https://example.com/article")
result = trafilatura.extract(
    downloaded,
    output_format="json",
    with_metadata=True,
)

if result:
    data = json.loads(result)
    # classify() returns a (language code, score) pair
    lang, score = langid.classify(data.get("text", ""))
    print(f"Detected language: {lang}")

Language Filtering

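from py3langid.langid import LanguageIdentifier, MODEL_FILE

# norm_probs=True rescales the score to a probability between 0 and 1
identifier = LanguageIdentifier.from_pickled_model(MODEL_FILE, norm_probs=True)

text = "一些中文内容..."  # Chinese sample, so the English check below fails
lang, confidence = identifier.classify(text)
if lang == "en" and confidence > 0.8:
    print("Confidently English content")
else:
    print(f"Skipped: detected {lang} ({confidence:.2f})")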
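
| Engine | Accuracy | Speed | Install |
| --- | --- | --- | --- |
| langid (py3langid) | Medium | Fast | pip install py3langid |
| fasttext | High | Medium | separate fasttext package and language-ID model (not a Trafilatura extra) |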
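
For a batch of extracted texts, the filtering rule above can be wrapped in a small function. This is only a sketch: `filter_by_language` is a hypothetical helper, and `detect` stands for any callable that returns a `(language, confidence)` pair with confidence between 0 and 1. It is passed in as an argument so the detection engine can be swapped without touching the filter.

```python
from typing import Callable, Iterable, List, Tuple

Detector = Callable[[str], Tuple[str, float]]


def filter_by_language(
    texts: Iterable[str],
    detect: Detector,
    wanted: str = "en",
    min_confidence: float = 0.8,
) -> List[str]:
    """Keep texts detected as `wanted` with at least `min_confidence`."""
    kept = []
    for text in texts:
        if not text.strip():
            continue  # nothing to classify
        lang, confidence = detect(text)
        if lang == wanted and confidence >= min_confidence:
            kept.append(text)
    return kept


def toy_detect(text: str) -> Tuple[str, float]:
    """Stand-in detector for the demo: treats pure-ASCII text as English."""
    return ("en", 0.99) if text.isascii() else ("zh", 0.99)


print(filter_by_language(["Hello world", "一些中文内容", "  "], toy_detect))
```

In real use, `toy_detect` would be replaced by the classifier of whichever engine is installed.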

Offline HTML Batch Processing

Process a directory of pre-downloaded HTML files without any network access:

from pathlib import Path

import trafilatura

html_dir = Path("./downloaded_pages")
output_dir = Path("./extracted_articles")
output_dir.mkdir(parents=True, exist_ok=True)

for html_file in sorted(html_dir.glob("*.html")):
    # Assumes UTF-8 files; undecodable bytes are replaced instead of raising
    html_content = html_file.read_text(encoding="utf-8", errors="replace")

    result = trafilatura.extract(
        html_content,
        output_format="markdown",
        with_metadata=True,
    )

    if result:
        output_file = output_dir / f"{html_file.stem}.md"
        output_file.write_text(result, encoding="utf-8")
        print(f"✅ {html_file.name} → {output_file.name}")
    else:
        # extract() returns None when no main content is found
        print(f"Skipped {html_file.name}: no content extracted")
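
The loop above assumes every file is UTF-8. Archives of saved pages often mix encodings, so a decoding fallback can keep the batch from stopping or garbling text on the first mismatch. A minimal sketch; `read_html` and its fallback list are illustrative assumptions, not part of Trafilatura, and the list should reflect the encodings actually present in the archive:

```python
from pathlib import Path
from typing import Sequence


def read_html(
    path: Path,
    encodings: Sequence[str] = ("utf-8", "cp1252"),
) -> str:
    """Decode an HTML file by trying each encoding in order.

    Strict decoders should come first: a permissive one placed early
    would accept almost any byte sequence and hide the real encoding.
    """
    raw = path.read_bytes()
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep going with replacement characters
    return raw.decode("utf-8", errors="replace")
```

Calling `read_html(html_file)` would then take the place of the file read in the loop. Another option is to pass the raw bytes from `read_bytes()` to `extract()` and let Trafilatura attempt its own charset detection.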

CLI Batch

trafilatura --input-dir ./downloaded_pages --output-dir ./articles
# --input-file expects a list of URLs, so a single HTML file is piped through stdin
trafilatura --markdown < ./page.html > ./article.md

Performance Optimization

Cache Downloads

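import hashlib
import os

import requests

CACHE_DIR = "./cache"
# Placeholder: a requests-style proxies dict such as {"https": "http://host:port"}, or None
PROXY = None

def cached_fetch(url):
    # MD5 only derives a stable file name here; it is not used for security
    cache_key = hashlib.md5(url.encode()).hexdigest()
    cache_path = os.path.join(CACHE_DIR, f"{cache_key}.html")

    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as f:
            return f.read()

    os.makedirs(CACHE_DIR, exist_ok=True)
    resp = requests.get(url, proxies=PROXY, timeout=20)
    resp.raise_for_status()  # do not cache error pages
    # Guess the charset from the body rather than assuming UTF-8
    resp.encoding = resp.apparent_encoding

    with open(cache_path, "w", encoding="utf-8") as f:
        f.write(resp.text)
    return resp.text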
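
A cache without expiry serves stale pages forever. One way to bound staleness, sketched below, is to compare the cached file's modification time against a maximum age. `fetch_with_ttl`, `max_age_seconds`, and the injected `fetch` callable are hypothetical names introduced for this example; `fetch` could be a thin wrapper around the download call above. SHA-256 stands in for the MD5 key here, and either works as a file name.

```python
import hashlib
import time
from pathlib import Path
from typing import Callable


def fetch_with_ttl(
    url: str,
    fetch: Callable[[str], str],
    cache_dir: Path = Path("./cache"),
    max_age_seconds: float = 24 * 3600,
) -> str:
    """Return cached HTML for `url` unless it is older than `max_age_seconds`."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL to get a filesystem-safe file name
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    cache_path = cache_dir / f"{key}.html"

    if cache_path.exists():
        age = time.time() - cache_path.stat().st_mtime
        if age <= max_age_seconds:
            return cache_path.read_text(encoding="utf-8")

    html = fetch(url)  # cache miss or expired entry: download again
    cache_path.write_text(html, encoding="utf-8")
    return html
```

Passing the downloader as an argument keeps the cache logic independent of proxies and HTTP libraries, and makes it easy to exercise with a fake fetcher.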

Pre-warm Trafilatura

import trafilatura
# Warm up the module
_ = trafilatura.extract("<html><body><p>warmup</p></body></html>")

Limit Processing Scope

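result = trafilatura.extract(
    downloaded,
    output_format="txt",
    # Documents with more elements than this are discarded (extract() returns
    # None), which caps the time spent on oversized pages
    max_tree_size=10000,
)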
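
To pick a sensible `max_tree_size`, it helps to know roughly how many elements typical pages contain. The sketch below counts start tags with the standard-library parser. It is a rough stand-in for illustration, and its count will not necessarily match the size Trafilatura computes on its own parsed tree. `count_elements` is a hypothetical helper.

```python
from html.parser import HTMLParser


class _TagCounter(HTMLParser):
    """Counts every start tag, including self-closing ones."""

    def __init__(self) -> None:
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        self.count += 1


def count_elements(html_text: str) -> int:
    """Return the number of start tags found in `html_text`."""
    counter = _TagCounter()
    counter.feed(html_text)
    counter.close()
    return counter.count


sample = "<html><body><p>one</p><p>two<br/></p></body></html>"
print(count_elements(sample))
```

Running this over a sample of saved pages gives a distribution from which an outlier cutoff can be chosen.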

Incremental Update Strategy

import json
import os
from datetime import datetime

STATE_FILE = "crawl_state.json"

def load_state():
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE, "r", encoding="utf-8") as f:
            return json.load(f)
    return {"processed_urls": [], "last_crawl": None}

def save_state(state):
    with open(STATE_FILE, "w", encoding="utf-8") as f:
        json.dump(state, f, ensure_ascii=False, indent=2)

# Placeholder: replace with the full list of URLs discovered for this crawl
all_urls = ["https://example.com/article"]

state = load_state()
processed = set(state.get("processed_urls", []))

# Only URLs that no earlier run has handled
new_urls = [u for u in all_urls if u not in processed]

for url in new_urls:
    # ... download and extract ...
    processed.add(url)

# sorted() keeps the state file stable between runs
state["processed_urls"] = sorted(processed)
state["last_crawl"] = datetime.now().isoformat()
save_state(state)
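
If the process dies between the loop and `save_state`, all progress from that run is lost, and a crash during the write can leave a truncated JSON file. A minimal sketch of one mitigation: write to a temporary file, swap it in with `os.replace`, and checkpoint every few URLs. `save_state_atomic`, `process_incrementally`, the `handle` callable, and `checkpoint_every` are hypothetical names for this example, and the saved state is reduced to the URL list for brevity.

```python
import json
import os
import tempfile
from typing import Callable, Iterable, Set


def save_state_atomic(state: dict, path: str) -> None:
    """Write JSON to a temp file in the same directory, then rename it over `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(state, f, ensure_ascii=False, indent=2)
        # The rename replaces the old file in one step on the same filesystem
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if writing or renaming failed
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def process_incrementally(
    urls: Iterable[str],
    handle: Callable[[str], None],
    path: str,
    processed: Set[str],
    checkpoint_every: int = 10,
) -> Set[str]:
    """Run `handle` on unseen URLs, saving progress every `checkpoint_every` items."""
    done_since_save = 0
    for url in urls:
        if url in processed:
            continue
        handle(url)  # download and extract; may raise
        processed.add(url)
        done_since_save += 1
        if done_since_save >= checkpoint_every:
            save_state_atomic({"processed_urls": sorted(processed)}, path)
            done_since_save = 0
    # Final save covers the items handled since the last checkpoint
    save_state_atomic({"processed_urls": sorted(processed)}, path)
    return processed
```

A smaller `checkpoint_every` loses less work on a crash at the cost of more frequent disk writes.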

Summary

| Technique | Use Case | Key Code |
| --- | --- | --- |
| Element exclusion | Pages with fixed noise | lxml pre-processing |
| Targeted region | Clean page structure | //article XPath |
| Language detection | Multi-language sites | target_language, py3langid |
| Offline batch | Pre-downloaded HTML | trafilatura --input-dir |
| Download cache | Frequent reprocessing | MD5 hash cache |
| Incremental update | Long-running tasks | JSON state file |
