Precision Extraction: Content, Metadata, and Tables — Trafilatura Deep Dive

Trafilatura's extraction engine, available metadata fields, benchmark comparison against readability/newspaper3k/boilerpy3, and tuning best practices.

16Yun Engineering Team | Apr 25, 2026 | 2 min read

Trafilatura's Extraction Engine

Trafilatura combines three strategies for content extraction:

  1. HTML structure analysis — DOM tree, text density, punctuation ratio, link-to-text ratio
  2. jusText algorithm — heuristic text classification (content / noise / heading)
  3. Readability fallback — classic readability algorithm as backup

This multi-engine design helps keep extraction quality consistent across diverse page layouts: when one strategy returns too little text, another can take over.
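To make one of the signals listed above concrete, here is a minimal sketch of a link-to-text ratio computed with the standard library. It is a toy illustration, not Trafilatura's implementation; the `LinkDensityParser` class and `link_density` function are hypothetical names introduced for this example.

```python
from html.parser import HTMLParser


class LinkDensityParser(HTMLParser):
    """Counts text characters, tracking separately the text inside <a> tags."""

    def __init__(self):
        super().__init__()
        self.total_chars = 0
        self.link_chars = 0
        self._link_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._link_depth > 0:
            self._link_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total_chars += n
        if self._link_depth > 0:
            self.link_chars += n


def link_density(html_fragment: str) -> float:
    """Return the share of text characters that sit inside links (0.0 to 1.0)."""
    parser = LinkDensityParser()
    parser.feed(html_fragment)
    parser.close()
    if parser.total_chars == 0:
        return 0.0
    return parser.link_chars / parser.total_chars


article = "<p>Extraction quality depends on complete HTML and clean markup.</p>"
navigation = '<ul><li><a href="/">Home</a></li><li><a href="/blog">Blog</a></li></ul>'

print(link_density(article))     # no links, so the density is 0.0
print(link_density(navigation))  # all text is link text, so the density is 1.0
```

Blocks with a high link density are typically navigation or footers, while article paragraphs sit near zero. A real extractor combines several such signals rather than relying on one.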

Available Fields

import trafilatura

downloaded = trafilatura.fetch_url("https://example.com/article")
result = trafilatura.extract(
    downloaded,
    output_format="json",
    with_metadata=True,
    include_tables=True,
    include_comments=True,
    include_formatting=True,
    include_images=True,
    include_links=True,
)
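The call above can populate the following fields:

| Field      | JSON Key         | Source                                           |
|------------|------------------|--------------------------------------------------|
| Title      | title            | <title> / <h1> / OpenGraph                       |
| Author     | author           | <meta name="author"> / byline                    |
| Date       | date             | <time> / <meta date> / URL pattern               |
| Categories | categories       | Breadcrumbs / <meta>                             |
| Tags       | tags             | Article tags / <meta keywords>                   |
| Main text  | text             | Extraction algorithm output                      |
| Comments   | comments         | Comment section (with include_comments=True)     |
| Tables     | (inline in text) | HTML <table> elements (with include_tables=True) |
| Images     | (inline in text) | Alt text and URLs (with include_images=True)     |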

Metadata Example

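import json
import trafilatura

downloaded = trafilatura.fetch_url("https://example.com/article")
# fetch_url returns None when the download fails, and extract returns None
# when no usable content is found, so check before parsing
result = None
if downloaded:
    result = trafilatura.extract(downloaded, output_format="json", with_metadata=True)

if result:
    data = json.loads(result)
    # Metadata fields can be missing or null, so read them with .get()
    print(f"Title: {data.get('title')}")
    print(f"Author: {data.get('author')}")
    print(f"Date: {data.get('date')}")
    # Categories and tags may arrive as one ';'-separated string rather than
    # a list, so print them as-is instead of joining them
    print(f"Categories: {data.get('categories') or ''}")
    print(f"Tags: {data.get('tags') or ''}")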
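Downstream code usually wants categories and tags as lists. The following is a minimal sketch, assuming these fields may arrive either as a list or as a single ';'-separated string. The separator, the `as_list` helper, and the sample payload are assumptions for illustration, not captured library output.

```python
import json


def as_list(value, sep=";"):
    """Normalize a metadata field to a list of non-empty, stripped strings."""
    if value is None:
        return []
    if isinstance(value, str):
        return [part.strip() for part in value.split(sep) if part.strip()]
    return [str(item).strip() for item in value if str(item).strip()]


# Hypothetical payload shaped like a JSON extraction result (not real output)
sample = (
    '{"title": "Example", "author": null, '
    '"categories": "News;Tech", "tags": ["python", "scraping"]}'
)
data = json.loads(sample)

categories = as_list(data.get("categories"))
tags = as_list(data.get("tags"))
author = data.get("author") or "unknown"

print(categories)  # ['News', 'Tech']
print(tags)        # ['python', 'scraping']
print(author)      # unknown
```

Normalizing at the boundary keeps the rest of the pipeline independent of how a given library version serializes list-like fields.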

Benchmark Comparison

Trafilatura consistently leads public benchmarks. Results from ScrapingHub's article extraction benchmark:

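| Library          | F1 Score | Precision | Recall | Maintenance |
|------------------|----------|-----------|--------|-------------|
| Trafilatura      | 0.92     | 0.94      | 0.90   | Active      |
| readability-lxml | 0.78     | 0.82      | 0.75   | Low         |
| newspaper3k      | 0.71     | 0.68      | 0.75   | Stalled     |
| boilerpy3        | 0.75     | 0.80      | 0.71   | Low         |
| jusText          | 0.73     | 0.70      | 0.77   | Low         |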

Sources: ScrapingHub article-extraction-benchmark, Bevendorff et al. 2023, Lejeune & Barbaresi 2020.
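The F1 score is the harmonic mean of precision and recall, F1 = 2PR / (P + R). As a quick consistency check, the sketch below recomputes F1 from the precision and recall values reported in the table. The `f1_score` helper is a name introduced here, and the script only re-derives figures already given above.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the benchmark table
reported = {
    "Trafilatura": (0.94, 0.90),
    "readability-lxml": (0.82, 0.75),
    "newspaper3k": (0.68, 0.75),
    "boilerpy3": (0.80, 0.71),
    "jusText": (0.70, 0.77),
}

for name, (p, r) in reported.items():
    print(f"{name}: F1 = {f1_score(p, r):.2f}")
```

Each recomputed value rounds to the F1 figure listed in the table, so the three columns are internally consistent.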

Why Trafilatura Leads

  1. Multi-strategy fusion — when one algorithm fails on a particular site, others provide coverage
  2. Active maintenance — continuous updates since 2021, adapting to modern HTML structures
  3. Rich metadata — most competitors extract only body text; Trafilatura extracts 6+ metadata fields
  4. Output flexibility — 6 output formats reduce downstream processing cost

Table Extraction

Trafilatura extracts HTML tables into structured output:

import trafilatura

downloaded = trafilatura.fetch_url("https://example.com/data-page")
result = trafilatura.extract(downloaded, output_format="json", include_tables=True)

In Markdown output, tables are converted to Markdown table syntax. In JSON output, table content is carried inside the text field rather than under a separate key.
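When the Markdown output has to be turned back into rows, a small parser is often enough. This is a minimal sketch, assuming pipe-delimited rows with a dashed separator line after the header. The `parse_markdown_table` helper and the sample text are hypothetical, and real output may differ in spacing.

```python
def parse_markdown_table(markdown: str) -> list:
    """Parse the first pipe-delimited table in the text into a list of row dicts."""
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            if rows:
                break  # the first table has ended
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Skip the separator row, e.g. |---|---|
        if all(cell and set(cell) <= set("-: ") for cell in cells):
            continue
        rows.append(cells)
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


# Hypothetical Markdown text containing one table
sample = """Intro paragraph.

| Library | F1 |
|---|---|
| Trafilatura | 0.92 |
| jusText | 0.73 |

Closing paragraph."""

for row in parse_markdown_table(sample):
    print(row)
```

The parser keeps every cell as a string; converting numeric columns is left to the caller, because extracted tables rarely have reliable types.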

Tuning Extraction

Site-Specific Strategies

# Short articles: favor recall so that brief passages are less likely to be discarded
result = trafilatura.extract(
    downloaded,
    output_format="txt",
    favor_recall=True,
    include_tables=True,
    include_formatting=True,
    target_language="en",
    url="https://example.com/article",
)

Non-Standard Pages

# Comment-rich pages (forums)
result = trafilatura.extract(
    downloaded,
    include_comments=True,
    deduplicate=True,
)
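Once several sites need different settings, it helps to keep the keyword arguments in one place. The following is a minimal sketch of a per-domain options registry. The hostnames, the `SITE_OPTIONS` mapping, and the `options_for` helper are hypothetical; the option names themselves (`include_comments`, `deduplicate`, `favor_precision`, `include_tables`, `output_format`, `url`) are existing `extract` parameters.

```python
from urllib.parse import urlparse

# Baseline keyword arguments used for every site
DEFAULT_OPTIONS = {"output_format": "txt", "include_tables": True}

# Hypothetical per-domain overrides; the hostnames are placeholders
SITE_OPTIONS = {
    "forum.example.com": {"include_comments": True, "deduplicate": True},
    "news.example.com": {"include_comments": False, "favor_precision": True},
}


def options_for(url: str) -> dict:
    """Merge the default options with any overrides registered for the URL's host."""
    host = (urlparse(url).hostname or "").lower()
    options = dict(DEFAULT_OPTIONS)  # copy so the defaults are never mutated
    options.update(SITE_OPTIONS.get(host, {}))
    options["url"] = url
    return options


print(options_for("https://forum.example.com/thread/42"))
```

The result can be passed straight through, for example `trafilatura.extract(downloaded, **options_for(url))`, which keeps site-specific tuning out of the crawling loop.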

Proxies Improve Extraction Quality

Extraction quality depends on complete HTML. Blocked or degraded responses produce poor results.

import requests
import trafilatura

session = requests.Session()
session.proxies = {
    "http": "http://user:pass@proxy.16yun.cn:8888",
    "https": "http://user:pass@proxy.16yun.cn:8888",
}
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
})

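resp = session.get("https://example.com/article", timeout=30)
resp.raise_for_status()  # stop early on blocked or failed responses
# Let requests guess the charset from the body instead of forcing UTF-8
resp.encoding = resp.apparent_encoding
result = trafilatura.extract(resp.text, output_format="markdown", with_metadata=True)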

Using 16Yun's Crawler Proxy with residential IPs can reduce CDN blocking, which makes it more likely that Trafilatura receives the complete HTML it needs for good extraction results.
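Because blocked or degraded responses produce poor results, it is worth screening a response before extracting from it. This is a minimal sketch of such a check. The marker strings, the 500-character threshold, and the `looks_degraded` helper are assumptions to be tuned per target site, not part of Trafilatura or requests.

```python
# Hypothetical markers; tune them for the sites you actually crawl
BLOCK_MARKERS = ("captcha", "access denied", "verify you are human")


def looks_degraded(status_code: int, html: str, min_chars: int = 500) -> bool:
    """Heuristically flag responses that are unlikely to contain a full article."""
    if status_code != 200:
        return True
    if len(html) < min_chars:
        return True
    lowered = html.lower()
    return any(marker in lowered for marker in BLOCK_MARKERS)


# Synthetic pages used only to exercise the check
full_page = "<html><body>" + "<p>Real article text.</p>" * 50 + "</body></html>"
block_page = "<html><body><h1>Access Denied</h1>" + " " * 600 + "</body></html>"

print(looks_degraded(200, full_page))   # False
print(looks_degraded(403, full_page))   # True
print(looks_degraded(200, block_page))  # True
```

A flagged response is a candidate for a retry through a different proxy rather than for extraction. Simple substring markers can produce false positives, for instance on an article that discusses CAPTCHAs, so treat the result as a hint.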

Summary

| Advantage          | Detail                                                         |
|--------------------|----------------------------------------------------------------|
| Highest F1 score   | 0.92, ~14 points above readability                             |
| Full metadata      | Title, author, date, categories, tags in one call              |
| Output formats     | 6 formats for downstream flexibility                           |
| Active maintenance | Thousands of dependents, regular releases                      |
| Configurable       | Fine-grained control over tables, comments, images, formatting |
