Trafilatura Advanced: Custom Extraction, Language Detection & Performance
Custom extraction strategies, language detection integration, offline HTML batch processing, caching, and incremental updates.
16Yun Engineering Team · Apr 29, 2026 · 1 min read
Custom Extraction Strategies
Exclude Specific Elements
Pre-process HTML to remove unwanted sections before extraction:
import trafilatura
from lxml import html

downloaded = trafilatura.fetch_url("https://example.com/article")
tree = html.fromstring(downloaded)

# Drop navigation and sidebar noise before extraction.
# drop_tree() keeps the text that follows the removed element (its tail),
# which getparent().remove() would discard along with the element.
for element in tree.xpath("//aside | //nav | //div[@class='sidebar']"):
    element.drop_tree()

cleaned_html = html.tostring(tree, encoding="unicode")
result = trafilatura.extract(cleaned_html, output_format="markdown")
Target Specific Regions
When the page has a clear content container, narrow the extraction scope:
import trafilatura
from lxml import html

downloaded = trafilatura.fetch_url("https://example.com/article")
tree = html.fromstring(downloaded)
article_elem = tree.xpath("//article")

if article_elem:
    # Extract from the first <article> container only
    article_html = html.tostring(article_elem[0], encoding="unicode")
    result = trafilatura.extract(article_html, output_format="markdown")
else:
    # No <article> element: fall back to the full page
    result = trafilatura.extract(downloaded, output_format="markdown")
Preserve Formatting
result = trafilatura.extract(
    downloaded,
    output_format="markdown",
    include_formatting=True,
    include_tables=True,
)
Language Detection
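Trafilatura can filter documents by language during extraction through the target_language parameter of extract(). Detection relies on the optional py3langid package, which the "all" extra installs:
pip install "trafilatura[all]"
For higher accuracy, a separate detector such as fastText's language identification model can be run on the extracted text. It is installed independently of Trafilatura and needs a pretrained model downloaded separately:
pip install fasttext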
Detect During Extraction
import json

import py3langid as langid  # installed by trafilatura[all]
import trafilatura

downloaded = trafilatura.fetch_url("https://example.com/article")
result = trafilatura.extract(
    downloaded,
    output_format="json",
    with_metadata=True,
)

if result:  # extract() returns None when nothing usable is found
    data = json.loads(result)
    # classify() returns a (language code, score) pair
    lang, _score = langid.classify(data.get("text", ""))
    print(f"Detected language: {lang}")
Language Filtering
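from py3langid.langid import LanguageIdentifier, MODEL_FILE

# norm_probs=True rescales the score to a probability between 0 and 1
identifier = LanguageIdentifier.from_pickled_model(MODEL_FILE, norm_probs=True)

text = "一些中文内容..."  # "Some Chinese content..."
lang, confidence = identifier.classify(text)
if lang == "en" and confidence > 0.8:
    print("High-quality English content")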
| Engine | Accuracy | Speed | Install |
|---|---|---|---|
| py3langid | Medium | Fast | pip install "trafilatura[all]" |
| fastText | High | Medium | pip install fasttext (plus a language identification model) |
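The filtering rule above can be wrapped in a small helper so that the detector stays swappable. The sketch below is an illustration, not part of Trafilatura: `keep_language` and the `classify` callable are hypothetical names, and `classify` is assumed to return a `(language_code, confidence)` pair with the confidence between 0 and 1.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical detector signature: text -> (language code, confidence in [0, 1]).
Classifier = Callable[[str], Tuple[str, float]]


def keep_language(
    texts: Iterable[str],
    classify: Classifier,
    target: str = "en",
    min_confidence: float = 0.8,
    min_chars: int = 20,
) -> List[str]:
    """Return only the texts confidently detected as the target language."""
    kept = []
    for text in texts:
        # Very short strings give unreliable predictions, so skip them.
        if len(text.strip()) < min_chars:
            continue
        lang, confidence = classify(text)
        if lang == target and confidence >= min_confidence:
            kept.append(text)
    return kept
```

Any detector that returns a normalized score can be passed in as `classify`, so switching between a fast engine and a more accurate one would not change the calling code.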
Offline HTML Batch Processing
Process a directory of pre-downloaded HTML files with no network access:
from pathlib import Path

import trafilatura

html_dir = Path("./downloaded_pages")
output_dir = Path("./extracted_articles")
output_dir.mkdir(exist_ok=True)

for html_file in html_dir.glob("*.html"):
    with open(html_file, "r", encoding="utf-8") as f:
        html_content = f.read()
    result = trafilatura.extract(
        html_content,
        output_format="markdown",
        with_metadata=True,
    )
    if result:
        output_file = output_dir / f"{html_file.stem}.md"
        with open(output_file, "w", encoding="utf-8") as f:
            f.write(result)
        print(f"✅ {html_file.name} → {output_file.name}")
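For large directories it can help to make the batch loop resumable and tolerant of bad files. The following is a minimal sketch under stated assumptions: `process_directory` and the `extract_fn` parameter are hypothetical names, and the extractor is passed in as a callable so the loop itself has no dependency on a specific library.

```python
from pathlib import Path
from typing import Callable, Dict, Optional

# Hypothetical extractor signature: HTML string -> extracted text, or None.
Extractor = Callable[[str], Optional[str]]


def process_directory(html_dir: Path, output_dir: Path, extract_fn: Extractor) -> Dict[str, int]:
    """Extract every *.html file once and skip files that already have output."""
    output_dir.mkdir(parents=True, exist_ok=True)
    stats = {"written": 0, "skipped": 0, "empty": 0, "failed": 0}
    for html_file in sorted(html_dir.glob("*.html")):
        output_file = output_dir / f"{html_file.stem}.md"
        if output_file.exists():
            # Already extracted in an earlier run.
            stats["skipped"] += 1
            continue
        try:
            # errors="replace" keeps one badly encoded file from stopping the run.
            html_content = html_file.read_text(encoding="utf-8", errors="replace")
            result = extract_fn(html_content)
        except Exception:
            stats["failed"] += 1
            continue
        if result:
            output_file.write_text(result, encoding="utf-8")
            stats["written"] += 1
        else:
            stats["empty"] += 1
    return stats
```

With Trafilatura, `extract_fn` could be something like `lambda h: trafilatura.extract(h, output_format="markdown", with_metadata=True)`. Files that produce no output are retried on the next run because no output file is written for them.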
CLI Batch
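# Convert every file in a directory
trafilatura --input-dir ./downloaded_pages --output-dir ./articles
# Convert a single HTML file via stdin/stdout (--input-file expects a list of URLs, not an HTML file)
trafilatura --output-format markdown < ./page.html > ./article.md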
Performance Optimization
Cache Downloads
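import hashlib
import os

import requests

CACHE_DIR = "./cache"
# Placeholder: set to a requests-style proxies dict, or leave as None for a direct connection
PROXY = None

def cached_fetch(url):
    # MD5 only derives a stable filename from the URL; it is not used for security
    cache_key = hashlib.md5(url.encode("utf-8")).hexdigest()
    cache_path = os.path.join(CACHE_DIR, f"{cache_key}.html")
    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as f:
            return f.read()
    os.makedirs(CACHE_DIR, exist_ok=True)
    resp = requests.get(url, proxies=PROXY, timeout=20)
    resp.raise_for_status()  # do not cache error pages
    resp.encoding = "utf-8"  # assumes the target pages are UTF-8
    with open(cache_path, "w", encoding="utf-8") as f:
        f.write(resp.text)
    return resp.text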
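A cache that never expires will keep serving stale pages. One possible variant, sketched here with hypothetical names (`cached_fetch_ttl`, `fetch`, `max_age_seconds`), refetches an entry once its file is older than a configurable age. The downloader is passed in as a callable, and SHA-256 is swapped in for MD5 purely as a filename hash.

```python
import hashlib
import time
from pathlib import Path
from typing import Callable


def cached_fetch_ttl(
    url: str,
    fetch: Callable[[str], str],
    cache_dir: Path = Path("./cache"),
    max_age_seconds: float = 86400.0,
) -> str:
    """Return cached HTML for url, refetching when the entry is too old."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    cache_path = cache_dir / f"{key}.html"

    if cache_path.exists():
        # The file's modification time records when the page was last fetched.
        age = time.time() - cache_path.stat().st_mtime
        if age <= max_age_seconds:
            return cache_path.read_text(encoding="utf-8")

    html = fetch(url)
    cache_path.write_text(html, encoding="utf-8")
    return html
```

A short maximum age suits frequently updated pages, while a long one suits archives that are reprocessed with different extraction settings.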
Pre-warm Trafilatura
import trafilatura
# Run one throwaway extraction at startup so the first real document does not pay one-time initialization costs
_ = trafilatura.extract("<html><body><p>warmup</p></body></html>")
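Whether a warm-up call is worth it depends on the workload, so it is worth measuring. The helper below is a generic timing sketch, not a Trafilatura feature; `time_calls` is a hypothetical name and it accepts any zero-argument callable.

```python
import time
from typing import Callable, List


def time_calls(func: Callable[[], object], repeats: int = 5) -> List[float]:
    """Return the wall-clock duration of each consecutive call, in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations
```

Calling it in a fresh process with something like `time_calls(lambda: trafilatura.extract(sample_html))`, where `sample_html` is a representative page, shows whether the first duration is noticeably larger than the rest. If it is not, the warm-up step can be dropped.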
Limit Processing Scope
result = trafilatura.extract(
    downloaded,
    output_format="txt",
    max_tree_size=10000,  # discard documents with too many elements
)
Incremental Update Strategy
import json
import os
from datetime import datetime

STATE_FILE = "crawl_state.json"

def load_state():
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE, "r", encoding="utf-8") as f:
            return json.load(f)
    return {"processed_urls": [], "last_crawl": None}

def save_state(state):
    with open(STATE_FILE, "w", encoding="utf-8") as f:
        json.dump(state, f, ensure_ascii=False, indent=2)

all_urls = []  # placeholder: fill with the full list of URLs to crawl

state = load_state()
processed = set(state.get("processed_urls", []))
# Only URLs that were not handled in a previous run
new_urls = [u for u in all_urls if u not in processed]

for url in new_urls:
    # ... download and extract ...
    processed.add(url)

state["processed_urls"] = sorted(processed)
state["last_crawl"] = datetime.now().isoformat()
save_state(state)
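Saving state only once at the end means a crash mid-run loses all progress. The sketch below shows one way to checkpoint periodically and to write the state file by renaming a temporary file into place. The names `save_state_atomic`, `crawl_incrementally`, `handle`, and `checkpoint_every` are hypothetical, and `handle` stands in for the download-and-extract step.

```python
import json
import os
import tempfile
from datetime import datetime
from typing import Callable, Iterable, Set


def save_state_atomic(state: dict, path: str) -> None:
    """Write to a temp file in the same directory, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(state, f, ensure_ascii=False, indent=2)
        # The rename is atomic on POSIX, so readers never see a half-written file.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def crawl_incrementally(
    urls: Iterable[str],
    handle: Callable[[str], object],
    path: str,
    checkpoint_every: int = 10,
) -> Set[str]:
    """Process only unseen URLs and checkpoint progress every few items."""
    processed: Set[str] = set()
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            processed = set(json.load(f).get("processed_urls", []))

    def checkpoint() -> None:
        save_state_atomic(
            {"processed_urls": sorted(processed), "last_crawl": datetime.now().isoformat()},
            path,
        )

    since_save = 0
    for url in urls:
        if url in processed:
            continue
        handle(url)  # an exception here leaves the last checkpoint intact
        processed.add(url)
        since_save += 1
        if since_save >= checkpoint_every:
            checkpoint()
            since_save = 0
    checkpoint()
    return processed
```

A smaller `checkpoint_every` loses less work after a crash at the cost of more disk writes.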
Summary
| Technique | Use Case | Key Code |
|---|---|---|
| Element exclusion | Pages with fixed noise | lxml pre-processing |
| Targeted region | Clean page structure | //article XPath |
| Language detection | Multi-language sites | target_language, py3langid |
| Offline batch | Pre-downloaded HTML | trafilatura --input-dir |
| Download cache | Frequent reprocessing | MD5 hash cache |
| Incremental update | Long-running tasks | JSON state file |