Precision Extraction: Content, Metadata, and Tables — Trafilatura Deep Dive
Trafilatura's extraction engine, available metadata fields, benchmark comparison against readability/newspaper3k/boilerpy3, and tuning best practices.
Trafilatura's Extraction Engine
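Trafilatura layers three strategies for content extraction:
- HTML structure analysis: its own heuristics over the DOM tree, using signals such as text density, punctuation ratio, and link-to-text ratio
- jusText algorithm: heuristic classification of text blocks as content or boilerplate
- Readability fallback: the classic readability algorithm (readability-lxml) as a backup
This layered design helps keep results consistent across diverse page layouts: when the primary heuristics return little or nothing, the fallback extractors get a chance to recover the main text.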
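To make the structural signals concrete, here is a minimal sketch of one of them, the link-to-text ratio, computed with Python's standard-library HTML parser. It is an illustration only, not Trafilatura's implementation; the class `LinkDensityParser`, the function `link_density`, and the sample snippets are hypothetical.

```python
from html.parser import HTMLParser


class LinkDensityParser(HTMLParser):
    """Counts visible characters inside and outside <a> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.link_depth = 0   # > 0 while inside an <a> element
        self.link_chars = 0
        self.total_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth > 0:
            self.link_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total_chars += n
        if self.link_depth > 0:
            self.link_chars += n


def link_density(html_fragment: str) -> float:
    """Share of text characters that sit inside links (0.0 for empty input)."""
    parser = LinkDensityParser()
    parser.feed(html_fragment)
    parser.close()
    if parser.total_chars == 0:
        return 0.0
    return parser.link_chars / parser.total_chars


# A navigation list is almost all link text; an article paragraph is not.
nav = '<ul><li><a href="/">Home</a></li><li><a href="/news">News</a></li></ul>'
body = '<p>The report was published on Monday. See the <a href="/src">source</a>.</p>'

print(f"nav:  {link_density(nav):.2f}")
print(f"body: {link_density(body):.2f}")
```

Blocks with a high ratio tend to be menus or footers, while blocks with a low ratio are more likely to be body text. A real extractor combines several such signals rather than relying on one.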
Available Fields
import trafilatura

# fetch_url() returns None if the download fails
downloaded = trafilatura.fetch_url("https://example.com/article")

result = trafilatura.extract(
    downloaded,
    output_format="json",
    with_metadata=True,
    include_tables=True,      # on by default
    include_comments=True,    # on by default
    include_formatting=True,
    include_images=True,
    include_links=True,
)
| Field | JSON Key | Source |
|---|---|---|
| Title | title | <title> / <h1> / OpenGraph |
| Author | author | <meta name="author"> / byline |
| Date | date | <time> / <meta date> / URL pattern |
| Categories | categories | Breadcrumbs / <meta> |
| Tags | tags | Article tags / <meta keywords> |
| Main text | text | Extraction algorithm output |
| Comments | comments | Comment section (with include_comments=True) |
| Tables | text | HTML <table> elements, rendered inside the main text (with include_tables=True) |
| Images | text | Image alt text and URLs, rendered inside the main text (with include_images=True) |
Metadata Example
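import json
import trafilatura

downloaded = trafilatura.fetch_url("https://example.com/article")
result = trafilatura.extract(downloaded, output_format="json", with_metadata=True)
# extract() returns None when nothing usable is found
data = json.loads(result) if result else {}

def as_list(value):
    # categories/tags may be a list or a ";"-joined string, depending on the version
    if isinstance(value, str):
        return [v for v in value.split(";") if v]
    return value or []

print(f"Title: {data.get('title')}")
print(f"Author: {data.get('author')}")
print(f"Date: {data.get('date')}")
print(f"Categories: {', '.join(as_list(data.get('categories')))}")
print(f"Tags: {', '.join(as_list(data.get('tags')))}")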
Benchmark Comparison
Trafilatura ranks at or near the top of public benchmarks for open-source extractors. The table below summarizes results reported across the sources listed underneath it:
| Library | F1 Score | Precision | Recall | Maintenance |
|---|---|---|---|---|
| Trafilatura | 0.92 | 0.94 | 0.90 | Active |
| readability-lxml | 0.78 | 0.82 | 0.75 | Low |
| newspaper3k | 0.71 | 0.68 | 0.75 | Stalled |
| boilerpy3 | 0.75 | 0.80 | 0.71 | Low |
| jusText | 0.73 | 0.70 | 0.77 | Low |
Sources: ScrapingHub article-extraction-benchmark, Bevendorff et al. 2023, Lejeune & Barbaresi 2020.
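As a quick sanity check on the table, F1 is the harmonic mean of precision and recall, so each F1 value can be recomputed from the other two columns. The sketch below does that for the rows above; the function name `f1_score` is used only for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the table above
rows = {
    "Trafilatura": (0.94, 0.90),
    "readability-lxml": (0.82, 0.75),
    "newspaper3k": (0.68, 0.75),
    "boilerpy3": (0.80, 0.71),
    "jusText": (0.70, 0.77),
}

for name, (precision, recall) in rows.items():
    print(f"{name}: {f1_score(precision, recall):.2f}")
```

Rounded to two decimals, the recomputed values match the F1 column, so the three score columns are at least internally consistent.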
Why Trafilatura Leads
- Multi-strategy fusion: when one algorithm fails on a particular site, the others provide coverage
- Active maintenance: regular releases that keep pace with modern HTML structures
- Rich metadata: several competitors focus on body text alone; Trafilatura extracts 6+ metadata fields
- Output flexibility: 6 output formats reduce downstream processing cost
Table Extraction
Trafilatura extracts HTML tables into structured output:
import trafilatura
downloaded = trafilatura.fetch_url("https://example.com/data-page")
result = trafilatura.extract(downloaded, output_format="json", include_tables=True)
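In Markdown output, tables are rendered in Markdown table syntax. In JSON output, table content is carried inside the text field rather than as a separate structured array, so parse it downstream if you need rows and columns.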
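If downstream code needs rows and columns, the table text can be parsed back into records. The sketch below assumes the table arrives as pipe-delimited Markdown rows with a header line and a separator line. `parse_markdown_table` and the sample text are hypothetical, and real output may need extra handling for merged or irregular cells.

```python
def parse_markdown_table(markdown: str) -> list[dict[str, str]]:
    """Turn a simple pipe-delimited Markdown table into a list of row dicts."""
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip text outside the table
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        rows.append(cells)
    if len(rows) < 2:
        return []
    header, body = rows[0], rows[1:]
    # Drop the separator row (cells made only of dashes, colons, or spaces)
    body = [r for r in body if not all(c and set(c) <= set("-: ") for c in r)]
    return [dict(zip(header, r)) for r in body]


sample = """
Some intro text.

| Library | F1 |
|---|---|
| Trafilatura | 0.92 |
| jusText | 0.73 |
"""

records = parse_markdown_table(sample)
print(records)
```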
Tuning Extraction
Site-Specific Strategies
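# Short articles: lower the minimum text length threshold
from trafilatura.settings import use_config

config = use_config()  # start from the default settings
config.set("DEFAULT", "MIN_EXTRACTED_SIZE", "100")  # illustrative value

result = trafilatura.extract(
    downloaded,
    output_format="txt",
    include_tables=True,
    include_formatting=True,
    target_language="en",
    url="https://example.com/article",
    config=config,  # fallback extractors stay enabled by default
)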
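One way to organize site-specific tuning is to try a short list of settings in order and keep the first result that is long enough. The sketch below takes the extraction function as a parameter, so it runs without network access; in practice you would pass `trafilatura.extract`. The helper name `extract_with_fallbacks`, the length threshold, and the order of attempts are assumptions. `favor_precision` and `favor_recall` are existing `extract()` options.

```python
from typing import Callable, Optional

# Attempts in order: strict first, then progressively more permissive.
ATTEMPTS = [
    {"favor_precision": True},
    {},                       # library defaults
    {"favor_recall": True},
]


def extract_with_fallbacks(
    html: str,
    extract_fn: Callable[..., Optional[str]],
    min_chars: int = 200,     # illustrative threshold, tune per site
) -> Optional[str]:
    """Return the first extraction result with at least min_chars characters."""
    best = None
    for kwargs in ATTEMPTS:
        text = extract_fn(html, **kwargs)
        if text and len(text) >= min_chars:
            return text
        # Remember the longest partial result in case nothing passes
        if text and (best is None or len(text) > len(best)):
            best = text
    return best


# Stand-in for trafilatura.extract so the example is self-contained
def fake_extract(html, favor_precision=False, favor_recall=False):
    if favor_recall:
        return "x" * 300
    return "short"


result = extract_with_fallbacks("<html></html>", fake_extract)
print(len(result))
```

With the real library, the call becomes `extract_with_fallbacks(downloaded, trafilatura.extract)`.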
Non-Standard Pages
# Comment-rich pages (forums)
result = trafilatura.extract(
    downloaded,
    include_comments=True,
    deduplicate=True,
)
Proxies Improve Extraction Quality
Extraction quality depends on complete HTML. Blocked or degraded responses produce poor results.
import requests
import trafilatura

session = requests.Session()
session.proxies = {
    "http": "http://user:pass@proxy.16yun.cn:8888",
    "https": "http://user:pass@proxy.16yun.cn:8888",
}
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
})

resp = session.get("https://example.com/article", timeout=30)
resp.raise_for_status()  # stop early on 4xx/5xx instead of extracting an error page
# Pass the raw bytes so the encoding is detected rather than forced to UTF-8
result = trafilatura.extract(resp.content, output_format="markdown", with_metadata=True)
Routing requests through 16Yun's Crawler Proxy with residential IPs can reduce CDN blocking, which makes it more likely that Trafilatura receives the complete HTML it needs for good extraction results.
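Before handing a response to the extractor, it can help to screen out pages that are obviously blocked or truncated. The following is a rough sketch: the function name `looks_blocked`, the marker phrases, and the size threshold are all assumptions to adapt to your targets, and none of them are part of Trafilatura or requests.

```python
BLOCK_MARKERS = (
    "access denied",
    "verify you are human",
    "captcha",
    "unusual traffic",
)  # hypothetical phrases; collect real ones from your own failed fetches


def looks_blocked(status_code: int, html: str, min_length: int = 2000) -> bool:
    """Heuristic check for blocked or degraded responses."""
    if status_code in (403, 429, 503):
        return True
    if len(html) < min_length:
        return True   # suspiciously small for an article page
    head = html[:5000].lower()
    return any(marker in head for marker in BLOCK_MARKERS)


ok_page = "<html><body>" + "<p>Article text.</p>" * 200 + "</body></html>"
challenge = "<html><title>Access Denied</title>" + " " * 3000 + "</html>"

print(looks_blocked(200, ok_page))
print(looks_blocked(200, challenge))
print(looks_blocked(429, ok_page))
```

A check like this is deliberately crude. An article that mentions one of the marker phrases near the top would be flagged by mistake, so treat a positive result as a reason to retry or inspect the page, not as proof of blocking.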
Summary
| Advantage | Detail |
|---|---|
| Highest F1 score | 0.92, ~14 points above readability |
| Full metadata | Title, author, date, categories, tags in one call |
| Output formats | 6 formats for downstream flexibility |
| Active maintenance | Thousands of dependents, regular releases |
| Configurable | Fine-grained control over tables, comments, images, formatting |