10-Day Learn Scrapy Day 4: Core Engineering Lecture

Part 4: Item Pipeline and Data Normalization

This is Day 4 of 10 in "10-Day Learn Scrapy". Today tackles one concrete problem only.

What Is Item Pipeline Cleaning and Normalization?

Item pipeline cleaning and normalization is a focused unit of scraping work that can be implemented and verified independently: every item the spider yields passes through a pipeline component that cleans its fields before export. Goal for today: deliver normalized price, rating, and text fields by the end of the day.


Constraints for this day:

single-module scope only
evidence must include commands, code, outputs, and validation
every failure needs one fix note

Today's repo documentation anchors:

scrapy/scrapy (key directories: docs, extras, scrapy, sep)
scrapy/scrapyd (key directories: docs, integration_tests, scrapyd, tests)
scrapy-plugins/scrapy-playwright (key directories: docs, examples, scrapy_playwright, tests)

Step 1 - Environment and Baseline Setup

cd ~/scrapy-labs/day01/bookslab
scrapy crawl books -O output/day04.json
python scripts/check_schema.py output/day04.json

Step 2 - Build the Core Module

Core implementation snippet for today:

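# pipelines.py
import re

# Matches "51.77" as well as whole-number prices such as "51".
PRICE_RE = re.compile(r"\d+(?:\.\d+)?")


class BooksPipeline:
    def process_item(self, item, spider):
        # Guard against a missing or None price_text before searching.
        raw_price = item.get("price_text") or ""
        m = PRICE_RE.search(raw_price)
        # Keep None when no number is found so validation can flag the row.
        item["price_gbp"] = float(m.group(0)) if m else None
        # Trim surrounding whitespace; a missing title becomes "".
        item["title"] = (item.get("title") or "").strip()
        return item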
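
The snippet above covers price and title, while today's goal also names rating and text fields. The following is a minimal sketch of two helpers that could be called from process_item. It assumes ratings arrive as English number words such as "Three" or as digit strings; the helper names and the word mapping are illustrative, not part of the course repo.

```python
import re

# Assumption: ratings arrive as English number words (e.g. "Three").
# Adjust the mapping to whatever the target site actually emits.
RATING_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}


def normalize_rating(raw):
    """Map a rating word or digit string to an int from 1 to 5, else None."""
    text = (raw or "").strip().lower()
    if text.isdecimal():
        value = int(text)
        return value if 1 <= value <= 5 else None
    return RATING_WORDS.get(text)


def normalize_text(raw):
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return re.sub(r"\s+", " ", raw or "").strip()
```

Returning None for an unrecognized rating, rather than guessing a default, keeps bad rows visible to the validation step later in the day.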

Step 3 - Run and Capture Outputs

Expected output check:

the crawl writes a structured output file;
critical fields are present and non-empty for sampled rows.
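
The two checks above can be scripted rather than eyeballed. This is a rough sketch, assuming the output file is the JSON array written by the crawl command in Step 1; the function names, the sample size, and the field list you pass in are hypothetical and not taken from scripts/check_schema.py.

```python
import json


def find_empty_fields(rows, required_fields):
    """Return (row_index, field) pairs where a required field is missing or empty."""
    problems = []
    for index, row in enumerate(rows):
        for field in required_fields:
            value = row.get(field)
            # Treat None and blank strings as empty; 0 or 0.0 still count as present.
            if value is None or (isinstance(value, str) and not value.strip()):
                problems.append((index, field))
    return problems


def check_output(path, required_fields, sample_size=20):
    """Load the exported JSON array and check the first sample_size rows."""
    with open(path, encoding="utf-8") as handle:
        rows = json.load(handle)
    return find_empty_fields(rows[:sample_size], required_fields)
```

For example, check_output("output/day04.json", ["title", "price_gbp"]) would return an empty list when every sampled row has both fields filled.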

Step 4 - Validate and Fix Failures

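Enable the pipeline in settings.py. The integer sets the run order when several pipelines are enabled: lower values run first, and values are conventionally kept in the 0-1000 range.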

# settings.py
ITEM_PIPELINES = {
    "bookslab.pipelines.BooksPipeline": 300,
}
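
Before rerunning a full crawl, the pipeline can be validated by calling process_item directly on plain dicts. The sketch below uses a stand-in class with the same interface as the Step 2 pipeline so that it runs on its own; in the lab you would import the real class instead. The run_cases helper and the sample inputs are made up for illustration.

```python
import re


class StandInPipeline:
    """Stand-in with the same process_item interface as the Step 2 pipeline."""

    def process_item(self, item, spider):
        m = re.search(r"\d+(?:\.\d+)?", item.get("price_text") or "")
        item["price_gbp"] = float(m.group(0)) if m else None
        item["title"] = (item.get("title") or "").strip()
        return item


def run_cases(pipeline, cases):
    """Run each raw dict through process_item and collect field mismatches."""
    failures = []
    for raw, expected in cases:
        result = pipeline.process_item(dict(raw), spider=None)
        for key, want in expected.items():
            if result.get(key) != want:
                failures.append((raw, key, result.get(key), want))
    return failures


# Made-up sample inputs paired with the cleaned values we expect.
CASES = [
    ({"price_text": "£12.50", "title": "  Sample Book "},
     {"price_gbp": 12.5, "title": "Sample Book"}),
    ({"price_text": "N/A", "title": None},
     {"price_gbp": None, "title": ""}),
]
```

Each entry in the returned list names the input, the field, the actual value, and the expected value, which is enough to write the one fix note that every failure requires.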

Step 5 - Boundary and Acceptance

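Pitfall 1: treating a successful command exit as proof of data quality, without running any checks on the output.
Pitfall 2: relying on manual visual inspection instead of scripted checks.
Pitfall 3: changing several variables in one experiment, so a failure cannot be traced to a single cause.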

Acceptance table:

Check	Pass Criteria	Failure Signal	Fix Direction
Output size	>= 200 rows	far below threshold	inspect pagination/request path
Field quality	missing ratio <= 5%	many empty title/url	revisit selectors and cleaning
Validation script	pass	assert fail	debug failed rows and rerun
Rollback	recover in 10 min	irreversible changes	keep baseline config
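
The first two rows of the acceptance table can be turned into a small report function. This is a sketch under stated assumptions: the thresholds come from the table, while the function name, the report keys, and the choice of required fields are hypothetical.

```python
def acceptance_report(rows, required_fields, min_rows=200, max_missing_ratio=0.05):
    """Check output size and per-field missing ratio against the table thresholds."""
    total = len(rows)
    report = {"rows": total, "size_ok": total >= min_rows, "missing_ratio": {}}
    for field in required_fields:
        # A field counts as missing when it is absent, None, or an empty string.
        missing = sum(1 for row in rows if row.get(field) in (None, ""))
        # With no rows at all, report every field as fully missing.
        report["missing_ratio"][field] = missing / total if total else 1.0
    report["quality_ok"] = all(
        ratio <= max_missing_ratio for ratio in report["missing_ratio"].values()
    )
    report["passed"] = report["size_ok"] and report["quality_ok"]
    return report
```

Calling it with the loaded rows and a field list such as ["title", "url"] gives one dict to paste into the day's evidence alongside the commands and outputs.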

Next Steps

Summarize today's knowledge coverage: core concepts, module implementation, validation and troubleshooting, production boundary
Record one failure and one fix action
Continue to the next Part with the same Step rhythm