# Phase 2 Extraction Plan

This plan defines how rules will be produced in controlled batches without reproducing the books.

## Non-negotiables (carried into Phase 2)

* No full-book OCR/transcription.
* No long verbatim passages.
* Rules are paraphrased and capped (`rule_text` ≤ 800 chars).
* Every rule includes at least one source pointer in `source_refs`.
* If a rule depends on exact wording, the rule still paraphrases but must include:
  * `rule_text`: “Exact wording required—refer to pointer”
  * plus a usable pointer.

Primary reference PDFs for pointer extraction:

* The Chicago Manual of Style (18th ed).pdf
* Robert Bringhurst – The Elements of Typographic Style.pdf

## Output batching format

When you say:

`EXTRACT []`

I will output a bundle that includes:

1. **Rules NDJSON** (150–250 rule records)
   * Path: `spec/rules//.ndjson`
   * One JSON object per line, validated against `spec/schema/rule.schema.json`.
2. **Index deltas** for that category
   * `spec/indexes/keywords_.json`
   * `spec/indexes/source_refs_.json`
   * `spec/indexes/coverage_delta_.json`
3. **Coverage notes** report
   * Short Markdown report describing enforcement split (lint/typeset/postrender/manual)
   * plus any known gaps or manual-only areas

## Batch naming

`` format:

* `v1__` (e.g. `v1_punctuation_001`)

Batches are append-only:

* If rules need revision, mark old rule `deprecated`, add a new rule ID (or new version segment) and keep both records.

## Pointer scheme details

Pointer strings live in `source_refs[]` and are **not** quotes.

Preferred pointer format:

* `CMOS18 § p`
* `BRING § p`
* Optional disambiguation: `(scan p)`

Example pattern (not a quote):

* `CMOS18 §6.1 p377 (scan p10)`

Notes:

* “book_page” uses the printed page number in the book when present (arabic or roman).
* “scan p” uses the PDF page index when printed page numbers are ambiguous.

## Recommended extraction order (high-impact first)

1. numbers
2. punctuation
3. citations
4. headings
5. tables
6. figures
7. links
8. code
9. layout (widows/orphans, keeps, overflow)
10. front/back matter
11. accessibility
12. i18n

Rationale:

* Numbers/punctuation/citations most directly affect correctness and consistency.
* Layout rules benefit from having structure and tokens in place.

## Enforcement mapping guidelines (honest labeling)

* `lint`: detectable from AST or text normalization (spacing, punctuation patterns, citation fields).
* `typeset`: enforced via CSS/tokens/paged-media decisions.
* `postrender`: requires layout inspection after rendering.
* `manual`: cannot be reliably automated; must include `tags: ["manual_checklist=true"]` and be emitted into checklist outputs.

If a concept spans multiple enforcement layers:

* Prefer splitting into two rules:
  * one lint rule (source cleanliness)
  * one postrender rule (layout outcome)
* Use `dependencies` to link them.

## “Degraded mode” considerations during extraction

For each category batch, include some rules that specifically target degraded inputs:

* hard-wrap repair suggestions
* heading inference warnings
* link sanitation and encoding fixes
* Unicode normalization notes

These rules should generally be `warn` or `should`, unless they prevent corruption (then `must`).
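
The non-negotiables and the pointer scheme described in this plan lend themselves to a mechanical pre-emit check on each NDJSON batch. The following is a minimal Python sketch of such a check, not the project's validator, and it does not replace validation against `spec/schema/rule.schema.json`. Several things in it are assumptions: the `enforcement` field name, the helper names `check_rule` and `check_ndjson`, and the pointer regex, which is generalised from the single example pattern `CMOS18 §6.1 p377 (scan p10)` and may be stricter or looser than the real grammar.

```python
import json
import re

MAX_RULE_TEXT = 800  # cap stated in the non-negotiables

# ASSUMPTION: pointer grammar inferred from the one example pattern in this
# plan; section is any run of non-space characters, page is arabic or roman.
POINTER_RE = re.compile(
    r"^(?P<source>CMOS18|BRING) §(?P<section>\S+) "
    r"p(?P<book_page>[0-9]+|[ivxlcdmIVXLCDM]+)"
    r"(?: \(scan p(?P<scan_page>[0-9]+)\))?$"
)


def check_rule(record):
    """Return a list of human-readable problems for one rule record."""
    problems = []

    text = record.get("rule_text", "")
    if not text:
        problems.append("rule_text is missing or empty")
    elif len(text) > MAX_RULE_TEXT:
        problems.append(f"rule_text is {len(text)} chars, cap is {MAX_RULE_TEXT}")

    refs = record.get("source_refs") or []
    if not refs:
        problems.append("source_refs must contain at least one pointer")
    for ref in refs:
        if not POINTER_RE.match(ref):
            problems.append(f"unrecognised pointer format: {ref!r}")

    # ASSUMPTION: the enforcement layer is stored in a field named "enforcement".
    if record.get("enforcement") == "manual":
        if "manual_checklist=true" not in record.get("tags", []):
            problems.append("manual rule lacks the manual_checklist=true tag")

    return problems


def check_ndjson(lines):
    """Check an iterable of NDJSON lines; return [(line_number, problems)]."""
    report = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines between records
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            report.append((number, [f"invalid JSON: {exc.msg}"]))
            continue
        problems = check_rule(record)
        if problems:
            report.append((number, problems))
    return report
```

In use, `check_ndjson(open(path, encoding="utf-8"))` would be run on a batch file before it is emitted, and an empty report would mean no violations of the checks sketched here. Keeping the pointer check as a regex makes it easy to loosen once real pointers from both books have been collected.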