Smart Crawling: Discover and Scrape Full Websites with Trafilatura

Auto-discover all pages via Sitemap and RSS Feed, filter and deduplicate URLs, batch extract with proxies.

16Yun Engineering Team | Apr 23, 2026 | 1 min read

URL Discovery via Sitemap

Trafilatura auto-discovers sitemap files (XML and TXT):

from trafilatura.sitemaps import sitemap_search

# Returns a list of page URLs found in the site's sitemaps
urls = sitemap_search("https://example.com")
print(f"Found {len(urls)} URLs")

# Preview the first ten
for url in urls[:10]:
    print(url)

sitemap_search() tries common paths (/sitemap.xml, /sitemap_index.xml) and follows nested sitemap index files automatically.
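To illustrate what following a nested sitemap index involves, here is a standard-library sketch. It is not Trafilatura's own implementation; read_sitemap and the two sample documents are hypothetical names made up for this example. It only shows how a sitemap index (which points to other sitemaps) differs from a URL set (which points to pages).

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def read_sitemap(xml_text):
    """Return (kind, links): 'index' for a sitemap index, 'urlset' for a page list."""
    root = ET.fromstring(xml_text)
    kind = "index" if root.tag.endswith("sitemapindex") else "urlset"
    entry = "sm:sitemap" if kind == "index" else "sm:url"
    links = [loc.text.strip() for loc in root.findall(f"{entry}/sm:loc", NS) if loc.text]
    return kind, links


# Illustrative documents, not taken from a real site
INDEX_XML = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-posts.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap-pages.xml</loc></sitemap>
</sitemapindex>"""

URLSET_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/blog/first</loc></url>
  <url><loc>https://example.com/blog/second</loc></url>
</urlset>"""

for document in (INDEX_XML, URLSET_XML):
    kind, links = read_sitemap(document)
    print(kind, links)
```

A crawler that sees kind == "index" would fetch each listed sitemap in turn and repeat the check, which is the recursion sitemap_search() performs for you.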

Direct Sitemap URL

from trafilatura.sitemaps import sitemap_search

# sitemap_search() also accepts the address of a sitemap file directly
urls = sitemap_search("https://example.com/sitemap.xml")
print(f"Extracted {len(urls)} URLs")

URL Filtering

# Keep only blog posts
blog_urls = [u for u in urls if "/blog/" in u]

# Exclude tag and author pages
filtered = [u for u in urls if "/tag/" not in u and "/author/" not in u]

# Same domain only (the trailing slash stops look-alike hosts such as example.com.evil.test)
same_domain = [u for u in urls if u.startswith("https://example.com/")]

| Filter | Code | Use Case |
| --- | --- | --- |
| Path prefix | url.startswith("https://example.com/blog/") | Blog only |
| Path contains | "/article/" in url | Articles only |
| Exclude pattern | "/tag/" not in url | Exclude tag pages |
| Regex | re.search(r"/\d{4}/\d{2}/", url) | Date-based paths |
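The individual filters can be folded into one reusable predicate. The following is a minimal sketch, assuming the site lives on example.com and that blog posts sit under /blog/; keep_url, DATE_PATH and the sample URLs are hypothetical names chosen for illustration. Comparing the parsed hostname rather than a string prefix avoids matching look-alike domains.

```python
import re
from urllib.parse import urlparse

# Matches date-based paths such as /2026/04/
DATE_PATH = re.compile(r"/\d{4}/\d{2}/")


def keep_url(url, host="example.com", include=("/blog/",), exclude=("/tag/", "/author/")):
    """Return True when the URL is on the expected host and passes the path rules."""
    parts = urlparse(url)
    if parts.hostname != host:
        return False
    if any(pattern in parts.path for pattern in exclude):
        return False
    return any(pattern in parts.path for pattern in include)


sample = [
    "https://example.com/blog/2026/04/smart-crawling",
    "https://example.com/blog/tag/python",
    "https://example.com/about",
    "https://example.com.evil.test/blog/post",
]

kept = [u for u in sample if keep_url(u)]
dated = [u for u in kept if DATE_PATH.search(urlparse(u).path)]
print(kept)
print(dated)
```

Only the first sample URL survives: the second is a tag page, the third is outside /blog/, and the fourth is on a different host.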

Feed-Based Updates

For sites with frequent updates, find_feed_urls() locates the site's feeds and returns the article links they list:

from trafilatura.feeds import find_feed_urls

# The result is a list of article URLs, not the feed addresses themselves
article_urls = find_feed_urls("https://example.com")
print(f"Found {len(article_urls)} URLs in feeds")

for url in article_urls:
    print(url)

Supported formats: RSS 2.0, Atom, JSON Feed.
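Where titles are wanted alongside links, an RSS document can also be read with the standard library. This is a minimal sketch, assuming a plain RSS 2.0 feed without namespaces; rss_entries and RSS_SAMPLE are illustrative names and not part of Trafilatura.

```python
import xml.etree.ElementTree as ET


def rss_entries(xml_text):
    """Return (title, link) pairs for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    entries = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        if link:  # skip items without a usable link
            entries.append((title, link))
    return entries


# Illustrative feed, not taken from a real site
RSS_SAMPLE = """<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <link>https://example.com/blog</link>
    <item><title>First post</title><link>https://example.com/blog/first</link></item>
    <item><title>Second post</title><link>https://example.com/blog/second</link></item>
  </channel>
</rss>"""

for title, link in rss_entries(RSS_SAMPLE):
    print(f"{title}: {link}")
```

Atom and JSON Feed use different element names, so a production parser would need a branch per format.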

Incremental Update Strategy

import json
from trafilatura.feeds import find_feed_urls

# Load previously processed URLs
try:
    with open("processed_urls.json", "r", encoding="utf-8") as f:
        processed = set(json.load(f))
except FileNotFoundError:
    processed = set()

# find_feed_urls() returns the article links currently listed in the site's feeds
feed_links = find_feed_urls("https://example.com/blog")

# Keep only links that have not been seen before
new_articles = [url for url in feed_links if url not in processed]
print(f"{len(new_articles)} new articles")

# Record the new URLs so the next run skips them
processed.update(new_articles)
with open("processed_urls.json", "w", encoding="utf-8") as f:
    json.dump(sorted(processed), f)
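The script above records URLs as processed before anything has been extracted, so a crash in a later step would silently skip them on the next run. One possible refinement, sketched here under the assumption that each URL is handled by a callback returning True on success, is to checkpoint after every successful item. load_processed, save_processed, process_new and handle are hypothetical names for this example.

```python
import json
import os
import tempfile


def load_processed(path):
    """Read the set of finished URLs, or an empty set on the first run."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            return set(json.load(f))
    except FileNotFoundError:
        return set()


def save_processed(path, processed):
    """Write to a temp file and rename it, so an interrupted run cannot truncate the file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(sorted(processed), f)
    os.replace(tmp_path, path)


def process_new(urls, path, handle):
    """Call handle(url) for unseen URLs; mark a URL done only when handle returns True."""
    processed = load_processed(path)
    done = []
    for url in urls:
        if url in processed:
            continue
        if handle(url):
            processed.add(url)
            save_processed(path, processed)  # checkpoint after each success
            done.append(url)
    return done
```

A call such as process_new(new_articles, "processed_urls.json", handle) would retry failed URLs on the next run, because only successes reach the checkpoint file. Saving after every item costs extra writes; batching the saves is a reasonable trade-off for large crawls.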

Combined Workflow

import trafilatura
from trafilatura.sitemaps import sitemap_search
from trafilatura.feeds import find_feed_urls

# 1. Historical articles from sitemap
sitemap_urls = sitemap_search("https://example.com")
blog_urls = [u for u in sitemap_urls if "/blog/" in u]

# 2. Latest articles from feeds (find_feed_urls() returns the article links)
feed_article_urls = find_feed_urls("https://example.com/blog")

# 3. Merge and deduplicate
all_urls = sorted(set(blog_urls + feed_article_urls))
print(f"Total unique URLs: {len(all_urls)}")

# 4. Batch extract (first five URLs as a smoke test)
for url in all_urls[:5]:
    downloaded = trafilatura.fetch_url(url)
    if downloaded is None:  # download failed
        print(f"❌ {url}")
        continue
    result = trafilatura.extract(downloaded, output_format="markdown")
    if result:
        print(f"✅ {url}")
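A plain set() only removes exact string duplicates, so the same page can slip through twice when the sitemap and the feed spell its URL differently. The sketch below normalizes URLs before deduplicating. It assumes the target site treats trailing slashes and fragments as insignificant, which is common but not universal; normalize and dedupe are hypothetical helper names.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Lowercase scheme and host, drop the fragment, and strip a trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def dedupe(urls):
    """Return normalized URLs in first-seen order, without duplicates."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize(url)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


candidates = [
    "https://example.com/blog/post-1",
    "https://example.com/blog/post-1/",
    "https://Example.com/blog/post-1#comments",
    "https://example.com/blog/post-2?page=2",
]
print(dedupe(candidates))
```

Query strings are kept because they often select different content; drop known tracking parameters separately if the site uses them.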

Full-Site Scraping with Proxies

Fetch a batch of IPs from the 16Yun API Proxy and rotate through them:

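import os
import random

import requests
import trafilatura

# all_urls is the deduplicated URL list built in the combined workflow

# Fetch a batch of proxy IPs from the 16Yun API
api_url = "http://ip.16yun.cn:817/myip/pl/xxx/?s=xxx&u=user&format=json&count=20"
response = requests.get(api_url, timeout=15)
response.raise_for_status()
proxy_list = response.json()

# Make sure the output directory exists before writing
os.makedirs("articles", exist_ok=True)

for url in all_urls:
    # Pick a random proxy for each request
    proxy = random.choice(proxy_list)
    proxy_url = f"http://user:pass@{proxy['ip']}:{proxy['port']}"
    proxies = {"http": proxy_url, "https": proxy_url}

    try:
        resp = requests.get(url, proxies=proxies, timeout=15)
        resp.raise_for_status()  # treat 4xx/5xx responses as failures
        resp.encoding = "utf-8"  # assumes the target pages are UTF-8
        result = trafilatura.extract(resp.text, output_format="markdown")
        if result:
            # Fall back to "index" when the URL ends with a slash
            name = url.rstrip("/").split("/")[-1] or "index"
            with open(f"articles/{name}.md", "w", encoding="utf-8") as f:
                f.write(result)
            print(f"✅ {url}")
    except Exception as e:
        print(f"❌ {url}: {e}")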
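Naming files after the last URL segment can overwrite articles when two URLs share that segment (for example /2025/review and /2026/review). One way around this, sketched here as an assumption rather than a recommendation from 16Yun or Trafilatura, is to append a short hash of the full URL. article_filename is a hypothetical helper.

```python
import hashlib
import re
from urllib.parse import urlsplit


def article_filename(url, ext=".md"):
    """Build a filesystem-safe name: readable slug plus a short hash of the full URL."""
    path = urlsplit(url).path.strip("/")
    last_segment = path.split("/")[-1]
    # Replace anything outside a conservative character set with hyphens
    slug = re.sub(r"[^A-Za-z0-9._-]+", "-", last_segment).strip("-") or "index"
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:8]
    return f"{slug}-{digest}{ext}"


print(article_filename("https://example.com/2025/review"))
print(article_filename("https://example.com/2026/review"))
print(article_filename("https://example.com/"))
```

The two review URLs produce different names because the hash covers the whole URL, and the bare homepage falls back to the slug "index".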

Error Handling

Error codes from 16Yun's help documentation:

| Status | Cause | Action |
| --- | --- | --- |
| 429 | Rate limited | Reduce concurrency, increase interval, rotate IP |
| 407 | Proxy auth failed | Verify credentials |
| 504 | Target timeout | Retry 2-3 times, skip persistent failures |
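The actions in the table can be sketched as a small retry wrapper. This is an illustration under stated assumptions, not 16Yun's client code: fetch is any callable you supply that returns a (status_code, body) pair, and the exponential backoff schedule is a choice made for this example. fetch_with_retry and RETRYABLE are hypothetical names.

```python
import time

# Statuses worth retrying according to the table above
RETRYABLE = {429, 504}


def fetch_with_retry(fetch, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it succeeds, giving up after `retries` extra attempts.

    Returns the body on HTTP 200, None for persistent or non-retryable failures,
    and raises PermissionError on 407 because retrying cannot fix bad credentials.
    """
    for attempt in range(retries + 1):
        status, body = fetch(url)
        if status == 200:
            return body
        if status == 407:
            raise PermissionError("proxy authentication failed; verify credentials")
        if status not in RETRYABLE or attempt == retries:
            return None
        # Wait longer after each failure: base_delay, 2x, 4x, ...
        sleep(base_delay * 2 ** attempt)
    return None
```

In a real crawl, fetch would wrap requests.get with a freshly chosen proxy, so each retry also rotates the IP as the 429 row suggests.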

CLI Mode

# Discover URLs from a sitemap (--list prints them without downloading)
trafilatura --sitemap "https://example.com/sitemap.xml" --list

# Discover URLs from a feed
trafilatura --feed "https://example.com/feed.xml" --list

# Batch extract from a URL list file
trafilatura -i urls.txt -o ./articles

Important Notes

  • Sitemaps can contain tens of thousands of URLs — filter before extracting
  • Respect robots.txt rules, including any Crawl-delay the site declares
  • Long-running tasks should implement checkpoint/resume by tracking processed URLs
  • 16Yun's Crawler Proxy (tunnel mode) simplifies proxy management — one entry point handles rotation
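The robots.txt note can be checked in code with Python's standard urllib.robotparser. The sketch below parses an inline robots.txt so it runs without network access; the file contents and the "my-crawler" user agent are made up for illustration. In practice you would load the site's real robots.txt, for example with set_url() and read().

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt, not taken from a real site
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 5
"""


def build_parser(robots_text):
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# True when the path is allowed for this user agent
print(parser.can_fetch("my-crawler", "https://example.com/blog/post"))
print(parser.can_fetch("my-crawler", "https://example.com/admin/login"))

# Seconds to wait between requests, or None when no Crawl-delay is declared
print(parser.crawl_delay("my-crawler"))
```

Filtering the URL list through can_fetch() before extraction, and sleeping for the declared delay between requests, keeps a long crawl within the site's stated limits.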
