Precision Extraction: Content, Metadata, and Tables — Trafilatura Deep Dive

Trafilatura's Extraction Engine

Trafilatura combines three strategies for content extraction:

HTML structure analysis — DOM tree, text density, punctuation ratio, link-to-text ratio
jusText algorithm — heuristic text classification (content / noise / heading)
Readability fallback — classic readability algorithm as backup

This multi-engine design makes extraction more robust across diverse page layouts: when the primary heuristics recover too little text, a fallback can still return the article body.
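
To make one of the structural signals above concrete, here is a minimal sketch of a link-to-text ratio check. It is a toy built on the standard library and not Trafilatura's actual implementation; the class name `LinkDensityParser` and the function `link_density` are made up for illustration.

```python
from html.parser import HTMLParser


class LinkDensityParser(HTMLParser):
    """Count visible text characters inside and outside <a> elements."""

    def __init__(self):
        super().__init__()
        self.link_depth = 0   # > 0 while we are inside an <a> element
        self.link_chars = 0   # characters of text found inside links
        self.total_chars = 0  # characters of text found anywhere

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth > 0:
            self.link_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total_chars += n
        if self.link_depth > 0:
            self.link_chars += n


def link_density(html_fragment: str) -> float:
    """Return the share of text characters that sit inside links (0.0 to 1.0)."""
    parser = LinkDensityParser()
    parser.feed(html_fragment)
    parser.close()
    if parser.total_chars == 0:
        return 0.0
    return parser.link_chars / parser.total_chars


# A navigation bar is almost all link text; an article paragraph is not.
nav = '<div><a href="/">Home</a> <a href="/news">News</a></div>'
body = '<p>The report was published on Monday. <a href="/src">Source</a></p>'
print(f"nav:  {link_density(nav):.2f}")
print(f"body: {link_density(body):.2f}")
```

A block whose ratio is close to 1.0 is likely navigation or a link list, while body text usually scores much lower. Any cut-off value would have to be tuned and is not taken from Trafilatura.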

Available Fields

import trafilatura

downloaded = trafilatura.fetch_url("https://example.com/article")
result = trafilatura.extract(
    downloaded,
    output_format="json",
    with_metadata=True,
    include_tables=True,
    include_comments=True,
    include_formatting=True,
    include_images=True,
    include_links=True,
)

Field	JSON Key	Source
Title	`title`	`<title>` / `<h1>` / OpenGraph
Author	`author`	`<meta name="author">` / byline
Date	`date`	`<time>` / `<meta date>` / URL pattern
Categories	`categories`	Breadcrumbs / `<meta>`
Tags	`tags`	Article tags / `<meta keywords>`
Main text	`text`	Extraction algorithm output
Comments	`comments`	Comment section (with `include_comments=True`)
Tables	in `text`	HTML `<table>` elements, inlined in the main text (with `include_tables=True`)
Images	in `text`	Alt text and URLs, inlined in the main text (with `include_images=True`)

Metadata Example

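import json
import trafilatura

downloaded = trafilatura.fetch_url("https://example.com/article")  # None if the download fails
result = trafilatura.extract(downloaded, output_format="json", with_metadata=True)
if result is None:  # nothing usable was extracted
    raise SystemExit("No content extracted")

data = json.loads(result)

def as_list(value):
    # categories/tags may arrive as a list or as a ';'-separated string, depending on the version
    if isinstance(value, str):
        return [v.strip() for v in value.split(";") if v.strip()]
    return list(value or [])

# .get() avoids a KeyError when a field is absent; missing values print as None
print(f"Title: {data.get('title')}")
print(f"Author: {data.get('author')}")
print(f"Date: {data.get('date')}")
print(f"Categories: {', '.join(as_list(data.get('categories')))}")
print(f"Tags: {', '.join(as_list(data.get('tags')))}")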

Benchmark Comparison

Trafilatura ranks at or near the top of several public benchmarks. The figures below are reported for ScrapingHub's article extraction benchmark:

Library	F1 Score	Precision	Recall	Maintenance
Trafilatura	0.92	0.94	0.90	Active
readability-lxml	0.78	0.82	0.75	Low
newspaper3k	0.71	0.68	0.75	Stalled
boilerpy3	0.75	0.80	0.71	Low
jusText	0.73	0.70	0.77	Low

Sources: ScrapingHub article-extraction-benchmark, Bevendorff et al. 2023, Lejeune & Barbaresi 2020.
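
As a quick sanity check on the table, F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below recomputes it from the precision and recall columns above. It assumes the reported F1 was derived from those aggregate values; a benchmark that averages per-page scores would not have to match exactly.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above
reported = {
    "Trafilatura": (0.94, 0.90, 0.92),
    "readability-lxml": (0.82, 0.75, 0.78),
    "newspaper3k": (0.68, 0.75, 0.71),
    "boilerpy3": (0.80, 0.71, 0.75),
    "jusText": (0.70, 0.77, 0.73),
}

for name, (p, r, f1) in reported.items():
    print(f"{name}: computed {f1_score(p, r):.2f}, reported {f1:.2f}")
```

All five rows agree with the formula to two decimal places, so the three columns are at least internally consistent.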

Why Trafilatura Leads

Multi-strategy fusion — when one algorithm fails on a particular site, others provide coverage
Active maintenance — continuous updates since 2021, adapting to modern HTML structures
Rich metadata — most competitors extract only body text; Trafilatura extracts 6+ metadata fields
Output flexibility — 6 output formats reduce downstream processing cost
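
The multi-strategy point can be sketched as a simple fallback chain: try each extractor in order and accept the first result that looks long enough. This illustrates the general idea only and is not how Trafilatura wires its engines together; `extract_with_fallbacks`, the two toy extractors, and the `min_chars` threshold are all hypothetical.

```python
from typing import Callable, Iterable, Optional

Extractor = Callable[[str], Optional[str]]


def extract_with_fallbacks(
    html: str,
    extractors: Iterable[Extractor],
    min_chars: int = 200,  # hypothetical acceptance threshold
) -> Optional[str]:
    """Return the first extractor result with at least min_chars of text."""
    for extractor in extractors:
        try:
            text = extractor(html)
        except Exception:
            # One failing strategy should not abort the whole chain.
            continue
        if text and len(text.strip()) >= min_chars:
            return text.strip()
    return None


# Toy stand-ins for real extraction strategies (hypothetical).
def empty_extractor(html: str) -> str:
    return ""


def crude_extractor(html: str) -> str:
    return html.replace("<p>", "").replace("</p>", " ")


page = "<p>" + "word " * 60 + "</p>"
result = extract_with_fallbacks(page, [empty_extractor, crude_extractor])
print(len(result) if result else "no extractor produced enough text")
```

The first strategy returns nothing here, so the second one supplies the text. A page on which every strategy comes up short yields `None`.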

Table Extraction

Trafilatura extracts HTML tables into structured output:

import trafilatura

downloaded = trafilatura.fetch_url("https://example.com/data-page")
result = trafilatura.extract(downloaded, output_format="json", include_tables=True)

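In Markdown output, tables convert to Markdown table syntax. In JSON output, table content is carried inside the `text` field rather than in a separate structured array, so parse the Markdown or XML output if you need rows and cells.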
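
For readers unfamiliar with the target syntax, here is a minimal sketch of how rows of cells map to a Markdown table. It only illustrates the output format and is not Trafilatura's converter; `rows_to_markdown` is a hypothetical helper.

```python
from typing import List, Sequence


def rows_to_markdown(rows: Sequence[Sequence[str]]) -> str:
    """Render rows as a Markdown table, treating the first row as the header."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    # Pad short rows so every line has the same number of cells.
    padded: List[List[str]] = [list(row) + [""] * (width - len(row)) for row in rows]

    def fmt(cells: Sequence[str]) -> str:
        # Escape pipes so cell content cannot break the column layout.
        return "| " + " | ".join(str(c).replace("|", "\\|") for c in cells) + " |"

    header, *body = padded
    lines = [fmt(header), fmt(["---"] * width)]
    lines.extend(fmt(row) for row in body)
    return "\n".join(lines)


print(rows_to_markdown([["Library", "F1"], ["Trafilatura", "0.92"]]))
```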

Tuning Extraction

Site-Specific Strategies

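import trafilatura

downloaded = trafilatura.fetch_url("https://example.com/article")

# General-purpose settings: fallbacks stay enabled, tables and formatting are kept.
# Note: none of these arguments lowers the minimum text length; that threshold
# lives in Trafilatura's settings file (MIN_EXTRACTED_SIZE / MIN_OUTPUT_SIZE).
result = trafilatura.extract(
    downloaded,
    output_format="txt",
    no_fallback=False,        # keep the fallback extractors enabled (the default)
    include_tables=True,
    include_formatting=True,
    target_language="en",     # discard documents not detected as English
    url="https://example.com/article",  # source URL, used as context for metadata
)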

Non-Standard Pages

# Comment-rich pages (forums)
result = trafilatura.extract(
    downloaded,
    include_comments=True,
    deduplicate=True,
)
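
Forum pages often repeat the same signature or quoted snippet many times, which is the kind of noise deduplication targets. The sketch below shows the general idea on a list of text segments. It is a simplified stand-in and not Trafilatura's own deduplication logic; `drop_repeated_segments` and `max_repeats` are hypothetical names.

```python
from typing import Dict, List


def drop_repeated_segments(segments: List[str], max_repeats: int = 1) -> List[str]:
    """Keep each segment at most max_repeats times, ignoring case and spacing."""
    seen: Dict[str, int] = {}
    kept: List[str] = []
    for segment in segments:
        # Normalise whitespace and case so near-identical copies share a key.
        key = " ".join(segment.split()).lower()
        seen[key] = seen.get(key, 0) + 1
        if seen[key] <= max_repeats:
            kept.append(segment)
    return kept


posts = ["Hi all", "Sent from my phone", "Thanks!", "sent from  my phone"]
print(drop_repeated_segments(posts))
```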

Proxies Improve Extraction Quality

Extraction quality depends on complete HTML. Blocked or degraded responses produce poor results.

import requests
import trafilatura

session = requests.Session()
session.proxies = {
    "http": "http://user:pass@proxy.16yun.cn:8888",
    "https": "http://user:pass@proxy.16yun.cn:8888",
}
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
})

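resp = session.get("https://example.com/article", timeout=30)
resp.raise_for_status()  # fail early on blocked or error responses
# Pass the raw bytes so Trafilatura can detect the page encoding instead of forcing UTF-8
result = trafilatura.extract(resp.content, output_format="markdown", with_metadata=True)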

Using 16Yun's Crawler Proxy with residential IPs can reduce CDN blocking, which helps Trafilatura receive the complete HTML it needs for good extraction results.
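
Because blocked or truncated responses degrade extraction, it can help to sanity-check a response before passing it on. The following is a rough sketch of such a check. The function name `looks_complete`, the length threshold, and the block-page markers are assumptions chosen for illustration and would need tuning per site.

```python
def looks_complete(status_code: int, html: str, min_chars: int = 2000) -> bool:
    """Heuristically decide whether a response is a full HTML page."""
    if status_code != 200:
        return False
    if len(html) < min_chars:
        # Very short bodies are often error stubs or challenge pages.
        return False
    lowered = html.lower()
    if "</html>" not in lowered:
        # A missing closing tag suggests a truncated download.
        return False
    # Hypothetical markers of block pages; adjust for the sites you crawl.
    block_markers = ("captcha", "access denied", "unusual traffic")
    return not any(marker in lowered for marker in block_markers)


full_page = "<html><body>" + "<p>text</p>" * 300 + "</body></html>"
print(looks_complete(200, full_page))
print(looks_complete(403, full_page))
print(looks_complete(200, "<html>Access Denied</html>"))
```

A response that fails the check can be retried, for example through a different proxy, before extraction is attempted.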

Summary

Advantage	Detail
Highest F1 score	0.92, ~14 points above readability
Full metadata	Title, author, date, categories, tags in one call
Output formats	6 formats for downstream flexibility
Active maintenance	Thousands of dependents, regular releases
Configurable	Fine-grained control over tables, comments, images, formatting