Phase 2 Extraction Plan

This plan defines how rules will be produced in controlled batches without reproducing the books.

Non-negotiables (carried into Phase 2)

No full-book OCR/transcription.
No long verbatim passages.
Rules are paraphrased and capped (rule_text ≤ 800 chars).
Every rule includes at least one source pointer in source_refs.
If a rule depends on exact wording, the rule still paraphrases but must include:
- rule_text: “Exact wording required—refer to pointer”
- plus a usable pointer.
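
The per-record constraints above (length cap, at least one pointer) can be checked mechanically. The following is a minimal sketch, not the project's validator: it assumes each rule is a plain dict with the `rule_text` and `source_refs` fields named above, and the function name `check_rule` is hypothetical. It does not check the pointer format itself.

```python
MAX_RULE_TEXT_CHARS = 800  # cap stated in the non-negotiables


def check_rule(rule: dict) -> list[str]:
    """Return a list of constraint violations for one rule record (empty if none)."""
    problems = []

    text = rule.get("rule_text", "")
    if not text:
        problems.append("rule_text is empty")
    if len(text) > MAX_RULE_TEXT_CHARS:
        problems.append(
            f"rule_text is {len(text)} chars (limit {MAX_RULE_TEXT_CHARS})"
        )

    # At least one non-empty pointer string is required in source_refs.
    refs = rule.get("source_refs") or []
    if not any(isinstance(ref, str) and ref.strip() for ref in refs):
        problems.append("source_refs needs at least one non-empty pointer")

    return problems
```

A record that passes returns an empty list; anything else can be reported alongside the batch rather than silently dropped.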

Primary reference PDFs for pointer extraction:

When you say: EXTRACT <CATEGORY> [<SCOPE>]

I will output a bundle that includes:

Rules NDJSON (150–250 rule records)
- Path: spec/rules/<category>/<batch_id>.ndjson
- One JSON object per line, validated against spec/schema/rule.schema.json.
Index deltas for that category
- spec/indexes/keywords_<category>.json
- spec/indexes/source_refs_<category>.json
- spec/indexes/coverage_delta_<category>.json
Coverage notes report
- Short Markdown report describing enforcement split (lint/typeset/postrender/manual)
- plus any known gaps or manual-only areas
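
As an illustration of the rules file in this bundle, here is a small sketch of writing and reading NDJSON and building the output path. The helper names are hypothetical, the batch ID is passed in as an opaque string, and validation against spec/schema/rule.schema.json is deliberately left out because the schema is not reproduced here.

```python
import json
from pathlib import PurePosixPath

MIN_RULES, MAX_RULES = 150, 250  # batch size range stated above


def batch_path(category: str, batch_id: str) -> str:
    """Build spec/rules/<category>/<batch_id>.ndjson from its two parts."""
    return str(PurePosixPath("spec/rules") / category / f"{batch_id}.ndjson")


def to_ndjson(rules: list[dict]) -> str:
    """Serialize rules as one JSON object per line."""
    # json.dumps escapes newlines inside strings, so each record stays on one line.
    return "".join(json.dumps(rule, ensure_ascii=False) + "\n" for rule in rules)


def from_ndjson(text: str) -> list[dict]:
    """Parse NDJSON text back into a list of rule dicts, skipping blank lines."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]


def batch_size_ok(rules: list[dict]) -> bool:
    """True when the batch holds between 150 and 250 rule records."""
    return MIN_RULES <= len(rules) <= MAX_RULES
```

Splitting on "\n" only, rather than on every Unicode line separator, keeps a record intact even if its text contains unusual separator characters.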

<batch_id> format:

Batches are append-only:

If a rule needs revision, mark the old rule as deprecated, add a new rule ID (or a new version segment), and keep both records.
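
The revision flow can be sketched as follows. This is an illustration under assumptions: the `id` and `status` field names and the `"deprecated"` value are hypothetical stand-ins for whatever the rule schema actually defines, and `revise_rule` is not an existing helper.

```python
import copy


def revise_rule(batch: list[dict], old_id: str, new_rule: dict) -> list[dict]:
    """Return a copy of the batch with old_id deprecated and new_rule appended.

    Nothing is deleted: the old record stays in place, marked deprecated.
    """
    if any(rule["id"] == new_rule["id"] for rule in batch):
        raise ValueError(f"rule ID already exists: {new_rule['id']}")

    revised = copy.deepcopy(batch)  # leave the caller's records untouched
    matches = [rule for rule in revised if rule["id"] == old_id]
    if not matches:
        raise KeyError(old_id)

    for rule in matches:
        rule["status"] = "deprecated"  # hypothetical field and value
    revised.append(copy.deepcopy(new_rule))
    return revised
```

Rejecting a reused ID is one way to keep the history unambiguous; a version segment appended to the old ID would satisfy the same check.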

Pointer strings live in source_refs[] and are not quotes.

Preferred pointer format:

Example pattern (not a quote):

Notes:

“book_page” uses the printed page number in the book when present (arabic or roman).
“scan p” uses the PDF page index when printed page numbers are ambiguous.

Rationale:

lint: detectable from AST or text normalization (spacing, punctuation patterns, citation fields).
typeset: enforced via CSS/tokens/paged-media decisions.
postrender: requires layout inspection after rendering.
manual: cannot be reliably automated; must include tags: ["manual_checklist=true"] and be emitted into checklist outputs.
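
The four layers and the manual tag requirement lend themselves to a simple check, and the same data yields the enforcement split for the coverage notes report. The sketch below assumes the layer is stored in a field called `enforcement`, which is a hypothetical name; `tags` and the tag string come from the text above.

```python
from collections import Counter

ENFORCEMENT_LAYERS = ("lint", "typeset", "postrender", "manual")
MANUAL_TAG = "manual_checklist=true"


def enforcement_problems(rule: dict) -> list[str]:
    """Report an unknown layer, or a manual rule that lacks the checklist tag."""
    problems = []
    layer = rule.get("enforcement")  # hypothetical field name
    if layer not in ENFORCEMENT_LAYERS:
        problems.append(f"unknown enforcement layer: {layer!r}")
    elif layer == "manual" and MANUAL_TAG not in rule.get("tags", []):
        problems.append(f"manual rule is missing tag {MANUAL_TAG!r}")
    return problems


def enforcement_split(rules: list[dict]) -> dict[str, int]:
    """Count rules per layer, including layers with zero rules."""
    counts = Counter(rule.get("enforcement") for rule in rules)
    return {layer: counts.get(layer, 0) for layer in ENFORCEMENT_LAYERS}


def checklist_rules(rules: list[dict]) -> list[dict]:
    """Select the rules that should be emitted into checklist outputs."""
    return [rule for rule in rules if MANUAL_TAG in rule.get("tags", [])]
```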

If a concept spans multiple enforcement layers:

Prefer splitting into two rules:
- one lint rule (source cleanliness)
- one postrender rule (layout outcome)
Use dependencies to link them.
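
One possible shape for such a split pair is sketched below. Everything beyond `rule_text`, `source_refs`, and `dependencies` is an assumption made for illustration: the `id` and `enforcement` fields, the `.lint` and `.postrender` ID suffixes, and the choice to make the postrender rule depend on the lint rule rather than the reverse.

```python
def split_rule(base_id: str, lint_text: str, postrender_text: str,
               source_refs: list[str]) -> tuple[dict, dict]:
    """Build a lint rule and a postrender rule for one concept, linked by ID."""
    lint_rule = {
        "id": f"{base_id}.lint",  # hypothetical ID scheme
        "enforcement": "lint",
        "rule_text": lint_text,
        "source_refs": list(source_refs),
        "dependencies": [],
    }
    postrender_rule = {
        "id": f"{base_id}.postrender",
        "enforcement": "postrender",
        "rule_text": postrender_text,
        "source_refs": list(source_refs),
        # Assumed direction: the layout outcome depends on clean source.
        "dependencies": [lint_rule["id"]],
    }
    return lint_rule, postrender_rule


def unresolved_dependencies(rules: list[dict]) -> dict[str, list[str]]:
    """Map each rule ID to the dependency IDs that are missing from the batch."""
    known = {rule["id"] for rule in rules}
    missing = {}
    for rule in rules:
        gaps = [dep for dep in rule.get("dependencies", []) if dep not in known]
        if gaps:
            missing[rule["id"]] = gaps
    return missing
```

Running `unresolved_dependencies` over a whole batch catches a split pair where only one half was emitted.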

For each category batch, include some rules that specifically target degraded inputs:

These rules should generally be rated warn or should. Rate a rule must only when it prevents corruption.