Trafilatura 101: From HTML to Structured Text in 3 Lines

Extract the main text and metadata from a web page's HTML, with TXT, Markdown, JSON, CSV, XML, and XML-TEI output.

16Yun Engineering Team | Apr 21, 2026 | 2 min read

Install

pip install trafilatura

Verify:

import trafilatura
print(trafilatura.__version__)

Three Lines to Extract Article Text

import trafilatura

downloaded = trafilatura.fetch_url("https://example.com/article")
result = trafilatura.extract(downloaded)
print(result)

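fetch_url() downloads the HTML; extract() strips out navigation, ads, and sidebars, leaving only the main article text. Both functions return None on failure (a failed download or a page with no extractable content), so check the result before using it.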

Structured Data Output

import trafilatura

downloaded = trafilatura.fetch_url("https://example.com/article")
result = trafilatura.extract(downloaded, output_format="json", with_metadata=True)
print(result)

Example output (abridged and illustrative):

{
  "title": "Article Title",
  "author": "Author Name",
  "date": "2026-06-16",
  "categories": ["Technology"],
  "tags": ["Python", "Scraping"],
  "text": "Full article text..."
}
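With output_format="json", extract() returns a JSON string rather than a dict, so it has to be parsed before individual fields can be read. A minimal sketch using the sample above as a stand-in for a real result; which fields are present on a given page is an assumption, hence the .get() calls.

```python
import json

# `result` stands in for the string returned by
# trafilatura.extract(downloaded, output_format="json", with_metadata=True)
result = """{
  "title": "Article Title",
  "author": "Author Name",
  "date": "2026-06-16",
  "categories": ["Technology"],
  "tags": ["Python", "Scraping"],
  "text": "Full article text..."
}"""

record = json.loads(result)

# .get() yields None (or the given default) for a missing field instead of raising KeyError
title = record.get("title")
date = record.get("date")
word_count = len(record.get("text", "").split())

print(f"{title} ({date}): {word_count} words")
```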

Output Format Comparison

| Format | output_format value | Best For |
|---|---|---|
| Plain text | txt (default) | NLP corpora, full-text search |
| Markdown | markdown | Blog import, documentation |
| JSON | json | Programmatic processing, APIs |
| CSV | csv | Spreadsheet analysis |
| XML | xml | XML pipeline integration |
| XML-TEI | xmltei | Academic / digital humanities |
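When the same pipeline writes several formats to disk, a small mapping from the output_format value to a file extension keeps filenames consistent. This is a sketch only: output_filename and save_result are hypothetical helpers, and the extensions (including .tei.xml) are this example's choice, not something Trafilatura prescribes.

```python
from pathlib import Path

# File extension per output_format value; the extensions are this sketch's choice
EXTENSIONS = {
    "txt": ".txt",
    "markdown": ".md",
    "json": ".json",
    "csv": ".csv",
    "xml": ".xml",
    "xmltei": ".tei.xml",
}


def output_filename(stem: str, output_format: str) -> str:
    """Return a filename such as 'article.md' for the given output format."""
    if output_format not in EXTENSIONS:
        raise ValueError(f"unsupported output format: {output_format!r}")
    return f"{stem}{EXTENSIONS[output_format]}"


def save_result(result: str, output_format: str, stem: str, directory: Path) -> Path:
    """Write an extraction result (the string from extract()) and return its path."""
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / output_filename(stem, output_format)
    path.write_text(result, encoding="utf-8")
    return path
```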

Extraction Options

import trafilatura

downloaded = trafilatura.fetch_url("https://example.com/article")
result = trafilatura.extract(
    downloaded,
    output_format="json",
    with_metadata=True,
    include_comments=True,
    include_tables=True,
    include_images=False,
    include_formatting=True,
)

| Parameter | Default | Effect |
|---|---|---|
| with_metadata | False | Include title, author, date, categories, tags |
| include_comments | True | Include comment section text |
| include_tables | True | Extract table content |
| include_images | False | Include image alt text and links |
| include_formatting | False | Preserve bold/italic in Markdown |
| include_links | False | Include hyperlinks |

Using with Proxies

For sites with anti-bot protection, route requests through a proxy:

# Verify proxy connectivity
curl -x http://user:pass@proxy.16yun.cn:8888 https://httpbin.org/ip

Then download the page through a requests session and pass the HTML to extract():

import trafilatura
import requests

session = requests.Session()
session.proxies = {
    "http": "http://user:pass@proxy.16yun.cn:8888",
    "https": "http://user:pass@proxy.16yun.cn:8888",
}

# Download via the proxied session; fail fast on timeouts and HTTP errors
response = session.get("https://example.com/article", timeout=30)
response.raise_for_status()
response.encoding = "utf-8"  # assumes a UTF-8 page; adjust if the site uses another charset
result = trafilatura.extract(response.text, output_format="markdown")
print(result)

| Scenario | Proxy Product | Setup |
|---|---|---|
| Single article test | Crawler Proxy | http://user:pass@proxy.16yun.cn:8888 |
| Batch scraping | API Proxy | Extract IP list via API, rotate per request |
| Long-running jobs | Dedicated Proxy | Fixed exit IP with retry strategy |
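Two details of the setups above are easy to get wrong: credentials containing characters such as "@" or ":" must be percent-encoded inside the proxy URL, and per-request rotation needs some way to step through the pool. The sketch below covers both with the standard library. The helper names, the second port, and the hard-coded pool are hypothetical; in practice the list would come from the provider's API.

```python
from itertools import cycle
from urllib.parse import quote


def build_proxy_url(user: str, password: str, host: str, port: int) -> str:
    """Build an HTTP proxy URL, percent-encoding the credentials."""
    # Characters such as "@" or ":" in credentials would otherwise break the URL
    return f"http://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}"


def proxy_rotation(proxy_urls):
    """Yield requests-style proxies dicts, cycling through the given proxy URLs."""
    for url in cycle(proxy_urls):
        yield {"http": url, "https": url}


# Hypothetical pool for illustration
pool = [
    build_proxy_url("user", "pass", "proxy.16yun.cn", 8888),
    build_proxy_url("user", "p@ss:word", "proxy.16yun.cn", 8889),
]
rotation = proxy_rotation(pool)
first = next(rotation)   # e.g. session.get(url, proxies=first)
second = next(rotation)
third = next(rotation)   # wraps around to the first proxy again
```

Each dict can be passed to a single call as session.get(url, proxies=...), which overrides the session-wide proxies for that request.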

CLI Usage

No Python code needed:

# Extract plain text
trafilatura -u https://example.com/article

# Markdown output
trafilatura -u https://example.com/article --output-format markdown

# JSON with metadata
trafilatura -u https://example.com/article --output-format json --with-metadata

Troubleshooting

Empty extraction result

  • The site may use JavaScript rendering (SPA). Trafilatura works with static HTML only: it parses the markup it is given and does not execute scripts.
  • Check page source: curl -s https://example.com | head -100
  • For JS-rendered content, use CloakBrowser first (see article 6 in this series)
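The manual page-source check can be roughly automated by measuring how much visible text the raw HTML contains outside script and style elements. This is a heuristic sketch, not part of Trafilatura; looks_js_rendered and the 200-character threshold are assumptions to tune per site.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collect text outside <script>, <style>, and <noscript> elements."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting level inside skipped elements
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.chunks.append(data)


def looks_js_rendered(html: str, min_chars: int = 200) -> bool:
    """Return True when the static HTML carries very little visible text."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    # Collapse whitespace so indentation does not count as content
    visible = " ".join(" ".join(parser.chunks).split())
    return len(visible) < min_chars
```

A True result suggests the content is injected client-side and the page should go through a browser-based renderer before extraction.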

Request blocked

  • Verify proxy configuration with curl
  • Some CDNs block non-browser requests. Set a User-Agent on the session:
    session.headers.update({"User-Agent": "Mozilla/5.0 ..."})

Character encoding issues

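  • Set response.encoding explicitly when the server declares the wrong charset, or assign response.apparent_encoding to let requests guess the encoding from the body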
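Another option is to decode the raw bytes (response.content) yourself and hand the resulting string to extract(). A minimal sketch, assuming a short list of candidate encodings; decode_html is a hypothetical helper, and the candidates should be adapted to the sites being scraped.

```python
def decode_html(raw: bytes, candidates=("utf-8", "cp1252")) -> str:
    """Decode page bytes by trying each candidate encoding in order."""
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep going with replacement characters rather than fail
    return raw.decode("utf-8", errors="replace")
```

UTF-8 goes first because it rejects most byte sequences produced by other encodings, so a successful decode is a strong signal that the guess is right.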
