Smart Crawling: Discover and Scrape Full Websites with Trafilatura
Auto-discover all pages via Sitemap and RSS Feed, filter and deduplicate URLs, batch extract with proxies.
16Yun Engineering Team | Apr 23, 2026 | 1 min read
URL Discovery via Sitemap
Trafilatura auto-discovers sitemap files (XML and TXT):
from trafilatura.sitemaps import sitemap_search
# Import the function from the submodule so it is loaded explicitly
urls = sitemap_search("https://example.com")
print(f"Found {len(urls)} URLs")
for url in urls[:10]:
    print(url)
sitemap_search() tries common paths (/sitemap.xml, /sitemap_index.xml), checks robots.txt for declared sitemaps, and follows nested sitemap index files automatically.
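To make the idea of a nested sitemap index concrete, here is a minimal standard-library sketch that tells an index file apart from a regular URL set. It illustrates the sitemaps.org file format only and is not Trafilatura's implementation; `read_sitemap` and both sample documents are hypothetical names made up for this example.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def read_sitemap(xml_text):
    """Return (kind, locations), where kind is 'index' or 'urlset'."""
    root = ET.fromstring(xml_text)
    # Strip the namespace from the root tag: '{ns}sitemapindex' -> 'sitemapindex'
    tag = root.tag.rsplit("}", 1)[-1]
    kind = "index" if tag == "sitemapindex" else "urlset"
    locs = [el.text.strip() for el in root.findall(".//sm:loc", NS) if el.text]
    return kind, locs

# Hypothetical sample: an index file that points to two child sitemaps
INDEX_XML = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-posts.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap-pages.xml</loc></sitemap>
</sitemapindex>"""

# Hypothetical sample: a regular sitemap that lists page URLs
URLSET_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/blog/first-post</loc></url>
</urlset>"""

for document in (INDEX_XML, URLSET_XML):
    kind, locs = read_sitemap(document)
    print(kind, locs)
```

A crawler that sees the `index` kind would fetch each listed child sitemap in turn, which is the step sitemap_search() takes care of for you.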
Direct Sitemap URL
from trafilatura.sitemaps import sitemap_search
# sitemap_search() also accepts the address of a sitemap file directly
urls = sitemap_search("https://example.com/sitemap.xml")
print(f"Extracted {len(urls)} URLs")
URL Filtering
# Keep only blog posts
blog_urls = [u for u in urls if "/blog/" in u]
# Exclude tag and author pages
filtered = [u for u in urls if "/tag/" not in u and "/author/" not in u]
# Same domain only (the trailing slash avoids matching hosts such as example.com.other.org)
same_domain = [u for u in urls if u.startswith("https://example.com/")]
| Filter | Code | Use Case |
|---|---|---|
| Path prefix | url.startswith("https://example.com/blog/") | Blog only |
| Path contains | "/article/" in url | Articles only |
| Exclude pattern | "/tag/" not in url | Exclude tag pages |
| Regex | re.search(r"/\d{4}/\d{2}/", url) | Date-based paths |
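The filters in the table can be combined into a single helper. The sketch below is one possible arrangement, not part of Trafilatura; `filter_urls` and its parameters are hypothetical names, and the sample URLs are invented for illustration.

```python
import re
from urllib.parse import urlparse

def filter_urls(urls, host, include_prefix="/", exclude_parts=(), path_regex=None):
    """Keep URLs on `host` whose path passes every rule that was given."""
    pattern = re.compile(path_regex) if path_regex else None
    kept = []
    for url in urls:
        parts = urlparse(url)
        if parts.netloc != host:
            continue  # different domain
        if not parts.path.startswith(include_prefix):
            continue  # outside the wanted section
        if any(part in parts.path for part in exclude_parts):
            continue  # matches an excluded pattern
        if pattern and not pattern.search(parts.path):
            continue  # path does not match the regex
        kept.append(url)
    return kept

# Invented sample data
sample = [
    "https://example.com/blog/2024/05/post-a",
    "https://example.com/blog/tag/python",
    "https://example.com/about",
    "https://other.example.org/blog/2024/05/post-b",
]
result = filter_urls(
    sample,
    "example.com",
    include_prefix="/blog/",
    exclude_parts=("/tag/",),
    path_regex=r"/\d{4}/\d{2}/",
)
print(result)  # ['https://example.com/blog/2024/05/post-a']
```

Comparing the parsed host instead of a string prefix keeps look-alike domains out, and matching rules against the path alone avoids accidental hits in the query string.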
Feed-Based Updates
For sites with frequent updates:
from trafilatura.feeds import find_feed_urls
# find_feed_urls() discovers the site's feeds and returns the article links they contain
article_urls = find_feed_urls("https://example.com")
print(f"Found {len(article_urls)} URLs in feeds")
for url in article_urls[:10]:
    print(url)
Supported formats: RSS 2.0, Atom, JSON Feed.
Incremental Update Strategy
import json
from trafilatura.feeds import find_feed_urls
# Load previously processed URLs
try:
    with open("processed_urls.json", "r") as f:
        processed = set(json.load(f))
except FileNotFoundError:
    processed = set()
# Check the latest feed entries (find_feed_urls returns the links found in the feeds)
feed_links = find_feed_urls("https://example.com/blog")
new_articles = [u for u in feed_links if u not in processed]
print(f"{len(new_articles)} new articles")
# Record the new URLs so the next run skips them
processed.update(new_articles)
with open("processed_urls.json", "w") as f:
    json.dump(sorted(processed), f)
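If the script is interrupted while writing processed_urls.json, the state file can end up truncated. A minimal sketch of a more defensive variant, using only the standard library: the helper names (`load_processed`, `save_processed`, `select_new`) are hypothetical, and the demo writes to a temporary directory instead of the working directory.

```python
import json
import os
import tempfile

def load_processed(path):
    """Return the set of already processed URLs, or an empty set on first run."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            return set(json.load(f))
    except FileNotFoundError:
        return set()

def save_processed(path, processed):
    """Write to a temp file first, then swap it in, so a crash cannot truncate the state."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(sorted(processed), f)
    os.replace(tmp_path, path)  # atomic rename on the same filesystem

def select_new(candidates, processed):
    """Return unseen URLs in their original order, without duplicates."""
    seen = set(processed)
    fresh = []
    for url in candidates:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh

# Demo with invented URLs
with tempfile.TemporaryDirectory() as tmp:
    state = os.path.join(tmp, "processed_urls.json")
    processed = load_processed(state)  # empty on the first run
    batch = select_new(
        ["https://example.com/a", "https://example.com/b", "https://example.com/a"],
        processed,
    )
    processed.update(batch)
    save_processed(state, processed)
    reloaded = load_processed(state)
    second = select_new(["https://example.com/b", "https://example.com/c"], reloaded)
    print(batch, second)
```

In a real run you would call `save_processed` after each successfully extracted article, or after each small batch, so that a restart resumes close to where the job stopped.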
Combined Workflow
import trafilatura
from trafilatura.sitemaps import sitemap_search
from trafilatura.feeds import find_feed_urls
# 1. Historical articles from sitemap
sitemap_urls = sitemap_search("https://example.com")
blog_urls = [u for u in sitemap_urls if "/blog/" in u]
# 2. Latest articles from feeds (find_feed_urls returns the links found in the feeds)
feed_article_urls = find_feed_urls("https://example.com/blog")
# 3. Merge and deduplicate
all_urls = sorted(set(blog_urls + feed_article_urls))
print(f"Total unique URLs: {len(all_urls)}")
# 4. Batch extract (first five URLs as a smoke test)
for url in all_urls[:5]:
    downloaded = trafilatura.fetch_url(url)
    if downloaded is None:  # download failed
        print(f"❌ {url}")
        continue
    result = trafilatura.extract(downloaded, output_format="markdown")
    if result:
        print(f"✅ {url}")
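A plain set() only removes exact string duplicates, so the same article can survive twice when the sitemap and the feed spell its URL differently. The sketch below normalizes URLs before comparing them. Treating a trailing slash and a fragment as insignificant is an assumption that holds for many sites but not all, and `normalize` and `dedupe` are hypothetical helper names.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    """Lowercase scheme and host, drop the fragment, and trim a trailing slash."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))

def dedupe(urls):
    """Keep the first spelling of each URL, comparing on the normalized form."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique

# Invented sample data
merged = [
    "https://example.com/blog/a",
    "https://example.com/blog/a/",
    "https://Example.com/blog/a#comments",
    "https://example.com/blog/b",
]
unique_urls = dedupe(merged)
print(unique_urls)  # ['https://example.com/blog/a', 'https://example.com/blog/b']
```

Query strings are left untouched here because they often select different content. Stripping tracking parameters would be a separate, site-specific decision.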
Full-Site Scraping with Proxies
Fetch a pool of IPs from the 16Yun API Proxy and pick a random one for each request:
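import os
import random
import requests
import trafilatura
# Placeholder API URL: substitute the parameters from your own 16Yun account
api_url = "http://ip.16yun.cn:817/myip/pl/xxx/?s=xxx&u=user&format=json&count=20"
response = requests.get(api_url, timeout=15)
proxy_list = response.json()
os.makedirs("articles", exist_ok=True)
for url in all_urls:  # all_urls comes from the combined workflow above
    proxy = random.choice(proxy_list)
    proxy_url = f"http://user:pass@{proxy['ip']}:{proxy['port']}"
    proxies = {"http": proxy_url, "https": proxy_url}
    try:
        resp = requests.get(url, proxies=proxies, timeout=15)
        resp.raise_for_status()  # surface 4xx/5xx responses as errors
        # Pass raw bytes so Trafilatura can detect the page encoding itself
        result = trafilatura.extract(resp.content, output_format="markdown")
        if result:
            # Fall back to "index" when the URL has no usable last segment
            name = url.rstrip("/").split("/")[-1] or "index"
            with open(f"articles/{name}.md", "w", encoding="utf-8") as f:
                f.write(result)
            print(f"✅ {url}")
    except Exception as e:
        print(f"❌ {url}: {e}")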
Error Handling
The error codes below follow 16Yun's help documentation:
| Status | Cause | Action |
|---|---|---|
| 429 | Rate limited | Reduce concurrency, increase interval, rotate IP |
| 407 | Proxy auth failed | Verify credentials |
| 504 | Target timeout | Retry 2-3 times, skip persistent failures |
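The actions in the table can be turned into a small retry loop. This is a sketch under stated assumptions rather than a 16Yun-provided routine: `fetch` is a hypothetical callable returning a `(status, body)` pair, the exponential backoff values are arbitrary, and rotating to a new IP is assumed to happen inside `fetch`.

```python
import time

RETRYABLE = {429, 504}  # rate limited or target timeout: wait, then try again
FATAL = {407}           # proxy authentication failed: retrying will not help

def fetch_with_retry(fetch, url, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) and retry according to the status code it returns."""
    for attempt in range(1, max_attempts + 1):
        status, body = fetch(url)
        if status == 200:
            return body
        if status in FATAL:
            raise PermissionError(f"proxy authentication failed for {url}")
        if status not in RETRYABLE or attempt == max_attempts:
            return None  # give up and let the caller skip this URL
        # Exponential backoff: base_delay, then 2x, 4x, ...
        sleep(base_delay * 2 ** (attempt - 1))
    return None

# Demo with a fake fetcher that fails twice before succeeding
responses = iter([(429, ""), (504, ""), (200, "<html>ok</html>")])
delays = []
body = fetch_with_retry(
    lambda u: next(responses),
    "https://example.com/a",
    sleep=delays.append,  # record the delays instead of actually sleeping
)
print(body, delays)  # <html>ok</html> [1.0, 2.0]
```

Passing `sleep` as a parameter keeps the function easy to test. In production the default `time.sleep` applies.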
CLI Mode
# Discover URLs from sitemap (--list prints them without downloading the pages)
trafilatura --sitemap "https://example.com/sitemap.xml" --list
# Discover URLs from feed
trafilatura --feed "https://example.com/feed.xml" --list
# Batch extract from URL list file
trafilatura --input-file urls.txt --output-dir ./articles
Important Notes
- Sitemaps can contain tens of thousands of URLs, so filter before extracting
- Respect robots.txt rules, including any Crawl-delay directive
- Long-running tasks should implement checkpoint/resume by tracking processed URLs
- 16Yun's Crawler Proxy (tunnel mode) simplifies proxy management: one entry point handles rotation
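The robots.txt note can be checked in code with Python's standard urllib.robotparser module. The sketch below parses an invented robots.txt from a string so it runs offline; the user agent name `my-crawler` and the helper `allowed` are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content for illustration
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 5
"""

parser = RobotFileParser()
# For a live site, call parser.set_url(".../robots.txt") and parser.read() instead
parser.parse(ROBOTS_TXT.splitlines())

def allowed(url, agent="my-crawler"):
    """Return True if the parsed rules permit `agent` to fetch `url`."""
    return parser.can_fetch(agent, url)

delay = parser.crawl_delay("my-crawler")  # seconds to wait between requests, or None

print(allowed("https://example.com/blog/post"))    # True
print(allowed("https://example.com/admin/users"))  # False
print(delay)                                       # 5
```

A crawl loop would drop URLs for which `allowed` returns False and sleep for `delay` seconds between requests whenever a delay is declared.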