10-Day Learn Scrapy Day 5: Core Engineering Lecture

Part 5: Feed Export and Incremental Keys

This is Day 5 of 10 in "10-Day Learn Scrapy". Today's session solves exactly one concrete problem.

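What Are Feed Export and Incremental Keys?

Feed export with incremental keys is a focused unit of scraping work that can be implemented and verified independently. Feed export writes scraped items to files; an incremental key is a stable identifier stored on each item so that a later run can recognize records it has already seen. Goal: by end of day, deliver both JSONL and CSV output, with an incremental key on every item.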

Constraints for this day:

single-module scope only
evidence must include commands, code, outputs, and validation
every failure needs one fix note

Today's repo documentation anchors:

scrapy/scrapy: key directories are docs, extras, scrapy, sep
scrapy/scrapyd: key directories are docs, integration_tests, scrapyd, tests
scrapy-plugins/scrapy-playwright: key directories are docs, examples, scrapy_playwright, tests

Step 1 - Environment and Baseline Setup

cd ~/scrapy-labs/day01/bookslab
scrapy crawl books -O output/day05.jsonl
scrapy crawl books -O output/day05.csv
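
The two commands above run the full crawl twice, once per format. As an alternative sketch, Scrapy's FEEDS setting can declare both outputs so that a single `scrapy crawl books` writes them together. The paths mirror the commands above; placing this in the project's settings.py and the choice of options are assumptions, not part of the original lab.

```python
# settings.py (sketch): write both feeds from one crawl.
# "overwrite": True mirrors the behaviour of the -O flag used above.
FEEDS = {
    "output/day05.jsonl": {
        "format": "jsonlines",
        "encoding": "utf8",
        "overwrite": True,
    },
    "output/day05.csv": {
        "format": "csv",
        "overwrite": True,
    },
}
```

With this in place, both files come from the same set of responses, which makes a row-count comparison between them meaningful.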

Step 2 - Build the Core Module

Core implementation snippet for today:

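# add a stable key for incremental processing
import hashlib


def build_item_key(title: str, upc: str) -> str:
    """Return a 16-hex-character key derived from the title and UPC."""
    # The same (title, upc) pair always hashes to the same key.
    raw = f"{title}|{upc}".encode("utf-8")
    return hashlib.sha1(raw).hexdigest()[:16]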

Step 3 - Run and Capture Outputs

Expected output check:

the crawl writes a structured output file;
critical fields are present and non-empty for sampled rows.
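
One way to script this check, rather than eyeballing the files, is sketched below. It loads both exports and tests a sample of rows for blank critical fields. The helper names and the field list in CRITICAL_FIELDS are assumptions for illustration; adjust them to the fields your spider actually yields.

```python
import csv
import json
from pathlib import Path

# Assumed critical fields; not prescribed by the lab itself.
CRITICAL_FIELDS = ("title", "item_key")


def load_jsonl(path):
    """Return one dict per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def load_csv(path):
    """Return one dict per data row of a CSV file with a header row."""
    with Path(path).open(encoding="utf-8", newline="") as fh:
        return list(csv.DictReader(fh))


def is_blank(value):
    """True for None or for values that are empty after stripping whitespace."""
    return value is None or not str(value).strip()


def sample_is_complete(rows, sample_size=20):
    """True when the first `sample_size` rows have every critical field filled."""
    return all(
        not is_blank(row.get(field))
        for row in rows[:sample_size]
        for field in CRITICAL_FIELDS
    )
```

A typical use would be to load output/day05.jsonl and output/day05.csv, confirm that both contain the same number of rows, and require sample_is_complete to return True for each.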

Step 4 - Validate and Fix Failures

Supporting code snippet for today's flow:

# usage when yielding
item["item_key"] = build_item_key(item.get("title", ""), item.get("upc", ""))
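
The key only pays off if a later run can skip items it has already exported. The sketch below assumes the previous run's JSONL file is kept as the record of seen keys; load_seen_keys and filter_new_items are hypothetical helpers, not Scrapy APIs.

```python
import json
from pathlib import Path


def load_seen_keys(path):
    """Collect item_key values from a previous JSONL export.

    Returns an empty set when the file does not exist (first run).
    """
    p = Path(path)
    if not p.exists():
        return set()
    seen = set()
    with p.open(encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue
            key = json.loads(line).get("item_key")
            if key:
                seen.add(key)
    return seen


def filter_new_items(items, seen):
    """Yield items whose item_key is not yet in `seen`, recording each key."""
    for item in items:
        key = item.get("item_key")
        if key and key not in seen:
            seen.add(key)
            yield item
```

Because -O overwrites its target, the previous export would need to be copied aside before the next crawl. Inside a Scrapy project, the same membership test could live in an item pipeline that drops already-seen items; that placement is a design choice rather than something this lab prescribes.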

Step 5 - Boundary and Acceptance

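Pitfall 1: treating a successful command as proof of good data, without running data-quality checks.
Pitfall 2: relying on manual visual inspection instead of scripted checks.
Pitfall 3: changing several variables in one experiment, so a failure cannot be traced to a single cause.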

Acceptance table:

Check	Pass Criteria	Failure Signal	Fix Direction
Output size	>= 200 rows	far below threshold	inspect pagination/request path
Field quality	missing ratio <= 5%	many empty title/url	revisit selectors and cleaning
Validation script	pass	assert fail	debug failed rows and rerun
Rollback	recover in 10 min	irreversible changes	keep baseline config
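
The first two rows of the table can be turned into a scripted check along these lines. This is a sketch under assumptions: the missing ratio is computed per cell across the chosen fields, and the field names title and url are taken from the Failure Signal column; acceptance_report is a hypothetical helper.

```python
def missing_ratio(rows, fields):
    """Fraction of (row, field) cells that are absent or blank.

    Returns 1.0 when there is nothing to measure, so empty output fails.
    """
    total = len(rows) * len(fields)
    if total == 0:
        return 1.0
    missing = sum(
        1
        for row in rows
        for field in fields
        if row.get(field) is None or not str(row.get(field)).strip()
    )
    return missing / total


def acceptance_report(rows, fields=("title", "url"), min_rows=200, max_missing=0.05):
    """Map each scripted acceptance check to True (pass) or False (fail)."""
    return {
        "output_size": len(rows) >= min_rows,
        "field_quality": missing_ratio(rows, fields) <= max_missing,
    }
```

A validation script could load the JSONL export into a list of dicts, call acceptance_report, and assert that every value is True, which gives the "assert fail" signal named in the table.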

Next Steps

Summarize today's knowledge coverage: core concepts, module implementation, validation and troubleshooting, production boundary
Record one failure and one fix action
Continue to the next Part with the same Step rhythm