10-Day Learn Scrapy Day 1: Core Engineering Lecture

Part 1: Project Bootstrap and First Spider

This is Day 1 of 10 in the "10-Day Learn Scrapy" series. Today's session tackles one concrete problem only.

What Is Project Bootstrap and First Spider?

Project Bootstrap and First Spider is a self-contained unit of scraping work that can be implemented and verified independently. Goal: by the end of the day, deliver a spider that follows pagination and exports clean JSON.

Constraints for this day:

single-module scope only
evidence must include commands, code, outputs, and validation
every failure needs one fix note

Reference repositories for today, with their key directories:

scrapy/scrapy: key directories docs, extras, scrapy, sep
scrapy/scrapyd: key directories docs, integration_tests, scrapyd, tests
scrapy-plugins/scrapy-playwright: key directories docs, examples, scrapy_playwright, tests

Step 1 - Environment and Baseline Setup

mkdir -p ~/scrapy-labs/day01
cd ~/scrapy-labs/day01
python3 -m venv .venv
source .venv/bin/activate
pip install scrapy==2.13.3
scrapy startproject bookslab
cd bookslab
scrapy genspider books example.16yun.cn
# Baseline run: the generated spider yields no items yet.
# Rerun this command after editing the spider in Step 2.
scrapy crawl books -O output/day01.json

Step 2 - Build the Core Module

Core implementation snippet for today:

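# spiders/books.py
import scrapy


class BooksSpider(scrapy.Spider):
    name = "books"
    start_urls = ["https://example.16yun.cn/catalogue/page-1.html"]

    def parse(self, response):
        # Each book on a listing page sits in an <article class="product_pod">.
        for card in response.css("article.product_pod"):
            href = card.css("h3 a::attr(href)").get()
            yield {
                "title": card.css("h3 a::attr(title)").get(),
                # Leave url empty when the link is missing so validation can flag it.
                "url": response.urljoin(href) if href else None,
            }
        # Follow the "next" link until the last page, which has none.
        next_href = response.css("li.next a::attr(href)").get()
        if next_href:
            yield response.follow(next_href, callback=self.parse)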
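
The pagination step depends on turning a relative href into an absolute URL, which response.follow handles for you. A minimal standard-library sketch of that resolution, assuming the next link is a relative path such as page-2.html (resolve_next and the sample values are illustrative, not part of Scrapy or of a real crawl):

```python
from urllib.parse import urljoin


def resolve_next(page_url, next_href):
    """Return the absolute URL of the next page, or None when there is no next link."""
    if not next_href:
        return None
    # A relative href replaces the last path segment of the current page URL.
    return urljoin(page_url, next_href)


current = "https://example.16yun.cn/catalogue/page-1.html"
print(resolve_next(current, "page-2.html"))
print(resolve_next(current, None))
```

If the crawl stops after one page, printing the resolved URL this way is a quick check on whether the selector or the URL join is at fault.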

Step 3 - Run and Capture Outputs

Expected output check:

the crawl writes a structured output file;
critical fields are present and non-empty for sampled rows.
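
The sampled-row check above can be scripted rather than done by eye. A small sketch, assuming title and url are the critical fields and that a fixed-seed random sample of 20 rows is enough for a first pass (empty_fields and its parameters are hypothetical names chosen for this example):

```python
import random


def empty_fields(rows, fields=("title", "url"), sample_size=20, seed=0):
    """Sample rows and return {row_index: [empty field names]} for failing rows only."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible between runs
    picked = rng.sample(range(len(rows)), min(sample_size, len(rows)))
    problems = {}
    for i in sorted(picked):
        missing = [f for f in fields if not rows[i].get(f)]
        if missing:
            problems[i] = missing
    return problems


# Illustrative rows, not real crawl output.
demo = [
    {"title": "A", "url": "https://example.16yun.cn/a"},
    {"title": "", "url": "https://example.16yun.cn/b"},
    {"title": "C"},
]
print(empty_fields(demo))
```

An empty dict means every sampled row has all critical fields; anything else points at the row indexes to inspect.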

Step 4 - Validate and Fix Failures

Supporting code snippet for today's flow:

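# validate_day01.py
import json
from pathlib import Path

rows = json.loads(Path("output/day01.json").read_text(encoding="utf-8"))
# Output size: the crawl must have followed pagination far enough.
assert len(rows) >= 200, f"expected >= 200 rows, got {len(rows)}"
# Field quality: spot-check the first 20 rows for non-empty title and url.
assert all(r.get("title") and r.get("url") for r in rows[:20]), "empty title or url in sample"
print("day01 ok", len(rows))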

Step 5 - Boundary and Acceptance

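Pitfall 1: treating a successful command as proof of good data, without running data-quality checks.
Pitfall 2: relying on manual visual inspection instead of validation scripts.
Pitfall 3: changing several variables in one experiment, so a failure cannot be traced to one cause.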

Acceptance table:

Check	Pass Criteria	Failure Signal	Fix Direction
Output size	>= 200 rows	far below threshold	inspect pagination/request path
Field quality	missing ratio <= 5%	many empty title/url	revisit selectors and cleaning
Validation script	pass	assert fail	debug failed rows and rerun
Rollback	recover in 10 min	irreversible changes	keep baseline config
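
The output-size and field-quality rows of the table can be turned into a repeatable check. A minimal sketch, assuming the 200-row and 5% thresholds from the table and treating title and url as the fields to score (missing_ratio and acceptance are hypothetical helper names for this example):

```python
def missing_ratio(rows, field):
    """Fraction of rows whose `field` is absent or empty; 0.0 for an empty list."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if not r.get(field)) / len(rows)


def acceptance(rows, min_rows=200, max_missing=0.05, fields=("title", "url")):
    """Return a pass/fail flag per check, mirroring the acceptance table."""
    report = {"output_size": len(rows) >= min_rows}
    for f in fields:
        report[f"{f}_quality"] = missing_ratio(rows, f) <= max_missing
    return report


# Illustrative data: 200 rows, 11 of them with an empty title.
demo_rows = [{"title": "" if i < 11 else f"Book {i}", "url": f"u{i}"} for i in range(200)]
print(acceptance(demo_rows))
```

Loading output/day01.json with json.loads and passing the rows to acceptance gives one flag per table row, which makes a failed check easy to map to its fix direction.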

Next Steps

Summarize today's knowledge coverage: core concepts, module implementation, validation and troubleshooting, production boundary
Record one failure and one fix action
Continue to the next Part with the same Step rhythm