Trafilatura 101: From HTML to Structured Text in 3 Lines
Extract the main text and metadata from a web page, with TXT, Markdown, JSON, CSV, and XML output.
16Yun Engineering Team | Apr 21, 2026 | 2 min read
Install
pip install trafilatura
Verify:
import trafilatura
print(trafilatura.__version__)
Three Lines to Extract Article Text
import trafilatura
downloaded = trafilatura.fetch_url("https://example.com/article")
result = trafilatura.extract(downloaded)
print(result)
fetch_url() downloads the HTML, and extract() strips out navigation, ads, and sidebars, leaving only the main article text.
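Both calls can come back empty: fetch_url() returns None when the download fails, and extract() returns None when it finds no main content. A minimal sketch of a guarded wrapper follows. The helper name extract_article and its two optional parameters are illustrative and not part of Trafilatura; the parameters exist only so the function can be tried with stand-in functions.

```python
def extract_article(url, fetch_url=None, extract=None):
    """Return the main text of url, or None if download or extraction fails."""
    if fetch_url is None or extract is None:
        # Imported lazily so the helper can be exercised with stand-ins.
        import trafilatura
        fetch_url = fetch_url or trafilatura.fetch_url
        extract = extract or trafilatura.extract

    downloaded = fetch_url(url)
    if downloaded is None:  # download failed (network error, blocked, bad URL)
        return None
    return extract(downloaded)  # may still be None if no main content is found
```

Calling extract_article("https://example.com/article") with no extra arguments uses the real Trafilatura functions.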
Structured Data Output
import trafilatura
downloaded = trafilatura.fetch_url("https://example.com/article")
result = trafilatura.extract(downloaded, output_format="json", with_metadata=True)
print(result)
Output (abridged; the exact fields depend on the page and the Trafilatura version):
{
  "title": "Article Title",
  "author": "Author Name",
  "date": "2026-06-16",
  "categories": ["Technology"],
  "tags": ["Python", "Scraping"],
  "text": "Full article text..."
}
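With output_format="json", extract() returns a JSON string rather than a dict, so it has to be parsed before individual fields can be read. The sketch below uses only the standard library; parse_extraction is a hypothetical helper name, and the sample string stands in for a real extraction result.

```python
import json

def parse_extraction(result):
    """Turn the JSON string returned by extract() into a dict.

    Returns an empty dict when extraction produced nothing (None).
    """
    if result is None:
        return {}
    return json.loads(result)

# Stand-in for the string returned by extract(..., output_format="json")
sample = '{"title": "Article Title", "author": "Author Name", "text": "Full article text..."}'
record = parse_extraction(sample)
print(record["title"])
```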
Output Format Comparison
| Format | Parameter | Best For |
|---|---|---|
| Plain text | txt (default) | NLP corpora, full-text search |
| Markdown | markdown | Blog import, documentation |
| JSON | json | Programmatic processing, APIs |
| CSV | csv | Spreadsheet analysis |
| XML | xml | XML pipeline integration |
| XML-TEI | xmltei | Academic / digital humanities |
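When results are written to disk, it helps to derive the file extension from the chosen format. The mapping below is an assumption about sensible extensions, not something Trafilatura prescribes, and save_result is a hypothetical helper.

```python
from pathlib import Path

# Assumed file extension for each output_format value in the table above.
EXTENSIONS = {
    "txt": ".txt",
    "markdown": ".md",
    "json": ".json",
    "csv": ".csv",
    "xml": ".xml",
    "xmltei": ".xml",
}

def save_result(result, out_dir, stem, output_format="txt"):
    """Write an extraction result to out_dir/<stem><ext> and return the path."""
    if output_format not in EXTENSIONS:
        raise ValueError(f"unsupported output format: {output_format}")
    path = Path(out_dir) / f"{stem}{EXTENSIONS[output_format]}"
    path.write_text(result, encoding="utf-8")
    return path
```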
Extraction Options
import trafilatura
downloaded = trafilatura.fetch_url("https://example.com/article")
result = trafilatura.extract(
    downloaded,
    output_format="json",
    with_metadata=True,
    include_comments=True,
    include_tables=True,
    include_images=False,
    include_formatting=True,
)
| Parameter | Default | Effect |
|---|---|---|
| with_metadata | False | Extract title, author, date, categories, tags |
| include_comments | True | Include comment section text |
| include_tables | True | Extract table content |
| include_images | False | Include image alt text and links |
| include_formatting | False | Preserve bold/italic in Markdown |
| include_links | False | Include hyperlinks |
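If the same combination of options is used in several places, keeping it in a named dict avoids repeating the keyword arguments. The two option sets and the extract_with helper below are illustrative choices, built only from the parameters in the table above; the extract parameter exists so the helper can be tried with a stand-in.

```python
# Plain text for an NLP corpus: drop comments and tables.
CORPUS_OPTIONS = {
    "output_format": "txt",
    "include_comments": False,
    "include_tables": False,
}

# Markdown for archiving: keep metadata, links, and inline formatting.
ARCHIVE_OPTIONS = {
    "output_format": "markdown",
    "with_metadata": True,
    "include_links": True,
    "include_formatting": True,
}

def extract_with(html, options, extract=None):
    """Call extract() on html with a named option set."""
    if extract is None:
        import trafilatura  # lazy import so a stand-in can be passed instead
        extract = trafilatura.extract
    return extract(html, **options)
```

Usage would look like extract_with(downloaded, ARCHIVE_OPTIONS).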
Using with Proxies
For sites with anti-bot protection, route requests through a proxy:
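# Verify proxy connectivity (shell)
curl -x http://user:pass@proxy.16yun.cn:8888 https://httpbin.org/ip
Then download through the same proxy in Python and pass the HTML to extract():
import trafilatura
import requests
session = requests.Session()
session.proxies = {
    "http": "http://user:pass@proxy.16yun.cn:8888",
    "https": "http://user:pass@proxy.16yun.cn:8888",
}
# Download via the custom session; the timeout avoids hanging on a dead proxy
response = session.get("https://example.com/article", timeout=30)
response.raise_for_status()  # stop early on 4xx/5xx responses
response.encoding = "utf-8"  # assumes the page is UTF-8
result = trafilatura.extract(response.text, output_format="markdown")
print(result)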
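Hard-coding credentials in a script is easy to leak. One possible alternative is to read them from environment variables and build the proxies dict from those. The variable names PROXY_USER and PROXY_PASS and the helper proxy_settings are assumptions for this sketch; the host and port are the ones used above.

```python
import os
from urllib.parse import quote

def proxy_settings(env=os.environ, host="proxy.16yun.cn", port=8888):
    """Build a requests-style proxies dict from environment variables."""
    # Percent-encode credentials so characters like "@" or ":" stay valid in a URL.
    user = quote(env["PROXY_USER"], safe="")
    password = quote(env["PROXY_PASS"], safe="")
    proxy_url = f"http://{user}:{password}@{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}
```

The result can be assigned directly: session.proxies = proxy_settings().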
Recommended 16Yun Product Configurations
| Scenario | Proxy Product | Setup |
|---|---|---|
| Single article test | Crawler Proxy | http://user:pass@proxy.16yun.cn:8888 |
| Batch scraping | API Proxy | Extract IP list via API, rotate per request |
| Long-running jobs | Dedicated Proxy | Fixed exit IP with retry strategy |
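The "rotate per request" and "retry strategy" setups in the table can be sketched as a small loop. This is one possible shape, not 16Yun's recommended implementation: fetch_with_rotation is a hypothetical helper, and fetch(url, proxy) is a caller-supplied function that returns HTML or None.

```python
import itertools
import time

def fetch_with_rotation(url, proxies, fetch, attempts=3, backoff=1.0, sleep=time.sleep):
    """Try url through successive proxies; return the first non-None result."""
    if not proxies:
        raise ValueError("at least one proxy is required")
    pool = itertools.cycle(proxies)  # round-robin over the proxy list
    for attempt in range(attempts):
        proxy = next(pool)
        try:
            html = fetch(url, proxy)
        except OSError:  # requests' exceptions derive from OSError
            html = None
        if html is not None:
            return html
        if attempt < attempts - 1:
            sleep(backoff * 2 ** attempt)  # exponential backoff between tries
    return None
```

A fetch function built on the session example above would set session.proxies from the proxy argument and return response.text.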
CLI Usage
No Python code needed:
# Extract plain text
trafilatura -u https://example.com/article
# Markdown output
trafilatura -u https://example.com/article --output-format markdown
# JSON with metadata
trafilatura -u https://example.com/article --output-format json --with-metadata
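The same commands can be driven from a script when the CLI is already part of a pipeline. The sketch below uses only the flags shown above; trafilatura_command and run_cli are hypothetical helper names, and run_cli assumes the trafilatura executable is on PATH.

```python
import subprocess

def trafilatura_command(url, output_format="txt", with_metadata=False):
    """Build the argument list for the CLI calls shown above."""
    cmd = ["trafilatura", "-u", url]
    if output_format != "txt":  # txt is the default, so no flag is needed
        cmd += ["--output-format", output_format]
    if with_metadata:
        cmd.append("--with-metadata")
    return cmd

def run_cli(url, **kwargs):
    """Run the CLI and return its stdout; raises CalledProcessError on failure."""
    done = subprocess.run(
        trafilatura_command(url, **kwargs),
        capture_output=True, text=True, check=True,
    )
    return done.stdout
```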
Troubleshooting
Empty extraction result
- The site may render its content with JavaScript (SPA). Trafilatura works with static HTML only and does not execute scripts.
- Check the page source:
curl -s https://example.com | head -100
- For JS-rendered content, use CloakBrowser first (see article 6 in this series)
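The page-source check can be roughly automated by counting how much visible text the raw HTML contains. This is a heuristic sketch using the standard library: looks_js_rendered is a hypothetical helper, and the 200-character threshold is an arbitrary assumption that should be tuned per site.

```python
from html.parser import HTMLParser

SKIPPED_TAGS = ("script", "style", "noscript")

class _TextCounter(HTMLParser):
    """Count visible text characters, ignoring script and style content."""

    def __init__(self):
        super().__init__()
        self.visible_chars = 0
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.visible_chars += len(data.strip())

def looks_js_rendered(html, min_chars=200):
    """Return True when the HTML carries too little visible text to extract."""
    counter = _TextCounter()
    counter.feed(html)
    counter.close()
    return counter.visible_chars < min_chars
```

A True result suggests the content is filled in by scripts after load, so a browser-based fetch is needed before extraction.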
Request blocked
- Verify proxy configuration with curl
- Some CDNs block non-browser requests — set a User-Agent:
session.headers.update({"User-Agent": "Mozilla/5.0 ..."})
Character encoding issues
- Set response.encoding explicitly, or let requests auto-detect it
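When the charset is uncertain, another option is to decode the raw bytes yourself with a fallback chain. The sketch below is one such chain under stated assumptions: decode_html is a hypothetical helper, and it tries a declared charset first, then UTF-8, then UTF-8 with replacement characters.

```python
def decode_html(raw, declared=None):
    """Decode response bytes, trying the declared charset, then UTF-8."""
    for encoding in (declared, "utf-8"):
        if not encoding:
            continue
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown charset; try the next candidate
    # Last resort: never fails, but may substitute replacement characters.
    return raw.decode("utf-8", errors="replace")
```

With requests, the input would be response.content, and declared should be a charset you trust, such as one taken from the page's own meta tag.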