# Rule Coverage Roadmap (CMOS / Bringhurst)

Goal: move coverage close to **100% of rule-bearing sections** in the Chicago Manual of Style (18th ed) and Bringhurst (Elements of Typographic Style), without reproducing book text.

## Definition of "100%"

Coverage is **section-based**, not "full-book OCR." To keep multi-session work honest and queueable, we track two milestones:

- **Milestone A (mapped):** a section is no longer `uncovered` once at least one paraphrased rule exists with a valid pointer. In practice this corresponds to `status=partial` in the coverage map.
- **Milestone B (complete):** a section is only considered complete once it is marked `covered` or `out_of_scope`.
  - `covered` means a completion pass was done and the section’s rule-bearing guidance is believed to be fully represented as paraphrased rules (no verbatim text).
  - `out_of_scope` means the section is primarily narrative/historical/non-prescriptive and is explicitly excluded.

This avoids over-claiming: `partial` means “some rules exist”, not “we’re done.”

## Inputs (local only)

- `/root/docs/_uploads/The Chicago Manual of Style (18th ed OCR).pdf`
- `/root/docs/_uploads/Robert Bringhurst – The Elements of Typographic Style (OCR).pdf`

OCR is used only to locate sections + page pointers. No verbatim text enters the repo.

## Artifacts (new)

- `spec/coverage/cmos18_sections.json`
- `spec/coverage/bring_sections.json`
- `spec/coverage/README.md`
- `spec/coverage/coverage_summary.json` (optional, derived)

Each entry should track: `section_id`, `title`, `pointer`, `status` (`uncovered|partial|covered|out_of_scope`), and `rule_ids[]`.

## Phases

### Phase 0: Coverage Map Scaffold

- Create coverage files + schema conventions.
- Seed with top-level sections and a few subsections to validate the workflow.

### Phase 1: High-Impact Rules (CMOS)

- numbers, punctuation, citations, headings, links.
- Batch size: 150–250 rules per category.

### Phase 2: Structure + Layout (BRING + CMOS)

- typography, layout, widows/orphans, tables, figures.
- Translate to `typeset`/`postrender` where possible.

### Phase 3: Remaining Categories

- code, front/back matter, accessibility, i18n.

### Phase 4: Enforcement Uplift

- Convert top-impact manual rules to `lint`/`postrender` where feasible.
- Tighten QA gates without breaking deterministic behavior.

## Rules of engagement

- Paraphrase only; no verbatim text.
- If exact wording is required, use: `Exact wording required—refer to pointer`.
- Every rule must include a `source_refs` pointer.
- Prefer **numbers-only** audits when validating coverage maps against OCR.
- Run:
  - `PYTHONPATH=src python3 -m iftypeset.cli validate-spec --spec spec --build-indexes`
  - `./scripts/ci.sh`
- Checkpoint after meaningful batches: `./scripts/checkpoint.sh "extraction batch "`

## Spot checks (numbers only)

To reduce “missing section” drift, use the OCR audit helper to compare **section numbers only** against the coverage map within a scan range (and to sanity-check pointer scan/printed pages).

Example (CMOS18 chapter 13 slice):

```bash
python3 tools/coverage_ocr_audit.py \
  --pdf "/root/docs/_uploads/The Chicago Manual of Style (18th ed OCR).pdf" \
  --coverage spec/coverage/cmos18_sections.json \
  --chapter 13 \
  --scan-start 840 \
  --scan-end 870 \
  --out-json out/coverage-ocr-audit-ch13.json \
  --out-md out/coverage-ocr-audit-ch13.md
```

Interpretation:

- `OCR-only` section ids likely need new coverage entries.
- `Coverage-only` section ids often indicate pointer drift or OCR weirdness (verify the scan page).
- `Printed page mismatches` are strong signals that a pointer is wrong or that the scan page is not the referenced content.

## Progress metrics

- `% sections mapped` per book (non-`uncovered`).
- `% sections complete` per book (`covered` + `out_of_scope`).
- `must` rules implemented coverage (CI floor = 95%).
- Overall implemented coverage (CI floor = 80%).
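
The two per-book section percentages can be derived directly from the `status` field of the coverage entries. The following is a minimal sketch, not the project's actual tooling: it assumes the entries have already been loaded into a list of dicts (the on-disk layout of the coverage JSON files is not specified here), and `coverage_metrics` plus the sample section ids are hypothetical names used only for illustration.

```python
from collections import Counter

# The four status values named in the coverage map conventions.
STATUSES = ("uncovered", "partial", "covered", "out_of_scope")


def coverage_metrics(entries):
    """Return mapped/complete percentages for one book's coverage entries.

    `entries` is assumed to be a list of dicts, each with a `status` key.
    """
    counts = Counter(entry["status"] for entry in entries)

    # Fail loudly on typos such as "coverd" instead of miscounting them.
    unknown = set(counts) - set(STATUSES)
    if unknown:
        raise ValueError(f"unknown status values: {sorted(unknown)}")

    total = len(entries)
    if total == 0:
        return {"total": 0, "pct_mapped": 0.0, "pct_complete": 0.0}

    mapped = total - counts["uncovered"]                   # Milestone A
    complete = counts["covered"] + counts["out_of_scope"]  # Milestone B
    return {
        "total": total,
        "pct_mapped": 100.0 * mapped / total,
        "pct_complete": 100.0 * complete / total,
    }


# Hypothetical sample entries; real section ids and statuses will differ.
sample = [
    {"section_id": "x.1", "status": "uncovered"},
    {"section_id": "x.2", "status": "partial"},
    {"section_id": "x.3", "status": "covered"},
    {"section_id": "x.4", "status": "out_of_scope"},
]

metrics = coverage_metrics(sample)
print(metrics)
```

Note that a `partial` section counts toward `% sections mapped` but not toward `% sections complete`, which matches the Milestone A / Milestone B split.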