Trafilatura 101: From HTML to Structured Text in 3 Lines

Install

pip install trafilatura

Verify:

import trafilatura
print(trafilatura.__version__)

Three Lines to Extract Article Text

import trafilatura

downloaded = trafilatura.fetch_url("https://example.16yun.cn/article")
result = trafilatura.extract(downloaded)
print(result)

fetch_url() downloads the HTML; extract() strips out navigation, ads, and sidebars, leaving only the main article text.
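
Both calls can come back empty: fetch_url() returns None when the download fails, and extract() returns None when no main text is found. A minimal sketch of a guard around the two calls follows. extract_article is a hypothetical helper, and the two functions are passed in as arguments so the logic can be exercised without network access:

```python
from typing import Callable, Optional


def extract_article(
    url: str,
    fetch: Callable[[str], Optional[str]],
    extract: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Return the main text of url, or None if download or extraction fails."""
    downloaded = fetch(url)
    if downloaded is None:
        # Download failed (network error, blocked request, bad status)
        return None
    # extract() itself returns None when it finds no main content
    return extract(downloaded)
```

With Trafilatura installed, it would be called as extract_article(url, trafilatura.fetch_url, trafilatura.extract).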

Structured Data Output

import trafilatura

downloaded = trafilatura.fetch_url("https://example.16yun.cn/article")
result = trafilatura.extract(downloaded, output_format="json", with_metadata=True)
print(result)

Example output (abridged; the exact keys and values depend on the page and the Trafilatura version):

{
  "title": "Article Title",
  "author": "Author Name",
  "date": "2026-06-16",
  "categories": ["Technology"],
  "tags": ["Python", "Scraping"],
  "text": "Full article text..."
}
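
Because extract() returns the JSON as a string (or None when nothing could be extracted), it has to be parsed before the fields can be used. A minimal sketch using only the standard library follows. The sample string stands in for a real return value, and summarize is a hypothetical helper:

```python
import json
from typing import Optional

# Stand-in for the string returned by trafilatura.extract(..., output_format="json")
SAMPLE = (
    '{"title": "Article Title", "author": "Author Name", '
    '"date": "2026-06-16", "text": "Full article text..."}'
)


def summarize(raw: Optional[str]) -> Optional[dict]:
    """Parse the JSON string and keep a few fields; return None if extraction failed."""
    if raw is None:
        return None
    record = json.loads(raw)
    # Missing or null fields are tolerated: .get() yields None instead of raising
    text = record.get("text") or ""
    return {
        "title": record.get("title"),
        "author": record.get("author"),
        "date": record.get("date"),
        "word_count": len(text.split()),
    }


print(summarize(SAMPLE))
```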

Output Format Comparison

Format	Parameter	Best For
Plain text	`txt` (default)	NLP corpora, full-text search
Markdown	`markdown`	Blog import, documentation
JSON	`json`	Programmatic processing, APIs
CSV	`csv`	Spreadsheet analysis
XML	`xml`	XML pipeline integration
XML-TEI	`xmltei`	Academic / digital humanities

Extraction Options

import trafilatura

downloaded = trafilatura.fetch_url("https://example.16yun.cn/article")
result = trafilatura.extract(
    downloaded,
    output_format="json",
    with_metadata=True,
    include_comments=True,
    include_tables=True,
    include_images=False,
    include_formatting=True,
)

Parameter	Default	Effect
`with_metadata`	`False`	Extract title, author, date, categories, tags
`include_comments`	`True`	Include comment section text
`include_tables`	`True`	Extract table content
`include_images`	`False`	Include image alt text and links
`include_formatting`	`False`	Preserve bold/italic in Markdown
`include_links`	`False`	Include hyperlinks

Using with Proxies

For sites with anti-bot protection, route requests through a proxy:

# Verify proxy connectivity
curl -x http://user:pass@proxy.16yun.cn:8888 https://httpbin.org/ip

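import requests
import trafilatura

session = requests.Session()
# Route both HTTP and HTTPS traffic through the same proxy endpoint
session.proxies = {
    "http": "http://user:pass@proxy.16yun.cn:8888",
    "https": "http://user:pass@proxy.16yun.cn:8888",
}

# Download via the custom session; fail fast on timeouts and HTTP errors
response = session.get("https://example.16yun.cn/article", timeout=30)
response.raise_for_status()
response.encoding = "utf-8"  # assumes a UTF-8 page; adjust if the site uses another charset
result = trafilatura.extract(response.text, output_format="markdown")
print(result)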
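
Credentials containing characters such as @ or : break a hand-written proxy URL, so they need to be percent-encoded. A minimal sketch using only the standard library follows. build_proxies is a hypothetical helper, and the host and port are the placeholders used in the example above:

```python
from urllib.parse import quote


def build_proxies(user: str, password: str, host: str, port: int) -> dict:
    """Build a requests-style proxies mapping with percent-encoded credentials."""
    # safe="" forces reserved characters such as @ : / to be encoded
    auth = f"{quote(user, safe='')}:{quote(password, safe='')}"
    proxy_url = f"http://{auth}@{host}:{port}"
    # The same HTTP proxy endpoint serves both schemes
    return {"http": proxy_url, "https": proxy_url}


print(build_proxies("user", "p@ss:word", "proxy.16yun.cn", 8888))
```

The returned mapping can be assigned to session.proxies directly.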

Recommended 16Yun Product Configurations

Scenario	Proxy Product	Setup
Single article test	Crawler Proxy	`http://user:pass@proxy.16yun.cn:8888`
Batch scraping	API Proxy	Extract IP list via API, rotate per request
Long-running jobs	Dedicated Proxy	Fixed exit IP with retry strategy
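
The rotate-per-request and retry strategies in the table can be sketched as follows. This is an illustration under assumptions, not 16Yun's API: fetch_with_rotation is a hypothetical helper, the proxy list is assumed to have been obtained already, and the download function is passed in so that any HTTP client can be used:

```python
from itertools import cycle, islice
from typing import Callable, Iterable, Optional


def fetch_with_rotation(
    url: str,
    proxies: Iterable[str],
    get: Callable[[str, str], str],
    max_attempts: int = 3,
) -> Optional[str]:
    """Try url through successive proxies; return the first successful body."""
    # cycle() wraps around, so a short proxy list can still serve max_attempts tries
    for proxy in islice(cycle(proxies), max_attempts):
        try:
            return get(url, proxy)
        except OSError:
            # Connection-level failure (requests.RequestException is an OSError
            # subclass): move on to the next proxy
            continue
    return None
```

With requests, the get argument could be a small function that calls session.get with the given proxy for both schemes and returns response.text. For a dedicated proxy, passing a one-element list turns the same loop into a plain retry.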

CLI Usage

No Python code needed:

# Extract plain text
trafilatura -u https://example.16yun.cn/article

# Markdown output
trafilatura -u https://example.16yun.cn/article --output-format markdown

# JSON with metadata
trafilatura -u https://example.16yun.cn/article --output-format json --with-metadata

Troubleshooting

Empty extraction result

The site may render its content with JavaScript (SPA). Trafilatura works on static HTML only.
Check the page source: curl -s https://example.16yun.cn | head -100
For JS-rendered content, render the page with CloakBrowser first (see article 6 in this series).
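
One way to tell whether a page is rendered client-side is to measure how much visible text the raw HTML contains: an SPA shell is mostly script tags. A rough heuristic using only the standard library follows. The class and function names are hypothetical, and any cut-off applied to the result is an assumption to tune per site:

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collect text nodes, skipping the contents of script/style/noscript elements."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.depth = 0  # nesting level inside skipped elements
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text_length(html: str) -> int:
    """Return the number of characters of visible text in html."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return len(" ".join(parser.chunks))
```

A result close to zero for a page that clearly shows an article in the browser suggests the content is injected by JavaScript.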

Request blocked

Verify the proxy configuration with curl.
Some CDNs block non-browser requests. Set a User-Agent on the session:

session.headers.update({"User-Agent": "Mozilla/5.0 ..."})

Character encoding issues

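Set response.encoding explicitly when you know the page's charset, or assign response.apparent_encoding to it so that requests detects the charset from the content.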
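
Forcing UTF-8 garbles pages served in another charset (for example GBK or Latin-1). A minimal sketch of a decode-with-fallback step that works on the raw bytes in response.content follows. decode_html is a hypothetical helper, and the fallback order is an assumption:

```python
from typing import Optional


def decode_html(raw: bytes, declared: Optional[str] = None) -> str:
    """Decode raw bytes, trying the declared charset first, then UTF-8."""
    for encoding in (declared, "utf-8"):
        if not encoding:
            continue
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset: try the next candidate
            continue
    # Last resort: never raises, but may produce replacement characters
    return raw.decode("utf-8", errors="replace")
```

The decoded string can then be passed to trafilatura.extract() in place of response.text.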