Rule Coverage Roadmap (CMOS / Bringhurst)
Goal: move coverage close to 100% of rule-bearing sections in the Chicago Manual of Style (18th ed) and Bringhurst (Elements of Typographic Style), without reproducing book text.
Definition of "100%"
Coverage is section-based, not "full-book OCR." To keep multi-session work honest and queueable, we track two milestones:
- Milestone A (mapped): a section is no longer `uncovered` once at least one paraphrased rule exists with a valid pointer. In practice this corresponds to `status=partial` in the coverage map.
- Milestone B (complete): a section is only considered complete once it is marked `covered` or `out_of_scope`. `covered` means a completion pass was done and the section’s rule-bearing guidance is believed to be fully represented as paraphrased rules (no verbatim text). `out_of_scope` means the section is primarily narrative/historical/non-prescriptive and is explicitly excluded.
This avoids over-claiming: partial means “some rules exist”, not “we’re done.”
Inputs (local only)
- `/root/docs/_uploads/The Chicago Manual of Style (18th ed OCR).pdf`
- `/root/docs/_uploads/Robert Bringhurst – The Elements of Typographic Style (OCR).pdf`
OCR is used only to locate sections + page pointers. No verbatim text enters the repo.
Artifacts (new)
- `spec/coverage/cmos18_sections.json`
- `spec/coverage/bring_sections.json`
- `spec/coverage/README.md`
- `spec/coverage/coverage_summary.json` (optional, derived)
Each entry should track: section_id, title, pointer, status (uncovered|partial|covered|out_of_scope), and rule_ids[].
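A minimal sketch of how one entry could be modelled and checked, assuming only the fields listed above. The class name `CoverageEntry`, the helper `validate_entry`, the choice to store `pointer` as a plain string, and the requirement that `partial`/`covered` entries cite at least one rule are illustrative assumptions, not the repo's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

# Status values taken from the entry description above.
VALID_STATUSES = {"uncovered", "partial", "covered", "out_of_scope"}


@dataclass
class CoverageEntry:
    section_id: str
    title: str
    pointer: str  # assumption: pointer kept as an opaque string
    status: str = "uncovered"
    rule_ids: List[str] = field(default_factory=list)


def validate_entry(entry: CoverageEntry) -> List[str]:
    """Return a list of human-readable problems (empty if the entry looks fine)."""
    problems = []
    if entry.status not in VALID_STATUSES:
        problems.append(f"{entry.section_id}: unknown status {entry.status!r}")
    if not entry.pointer:
        problems.append(f"{entry.section_id}: missing pointer")
    # Assumption: a mapped or completed section must cite at least one rule.
    if entry.status in {"partial", "covered"} and not entry.rule_ids:
        problems.append(f"{entry.section_id}: status {entry.status} needs rule_ids")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to report every issue in a coverage file in a single pass.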
Phases
Phase 0: Coverage Map Scaffold
- Create coverage files + schema conventions.
- Seed with top-level sections and a few subsections to validate the workflow.
Phase 1: High-Impact Rules (CMOS)
- numbers, punctuation, citations, headings, links.
- Batch size: 150–250 rules per category.
Phase 2: Structure + Layout (BRING + CMOS)
- typography, layout, widows/orphans, tables, figures.
- Translate to `typeset`/`postrender` where possible.
Phase 3: Remaining Categories
- code, front/back matter, accessibility, i18n.
Phase 4: Enforcement Uplift
- Convert top-impact manual rules to `lint`/`postrender` where feasible.
- Tighten QA gates without breaking deterministic behavior.
Rules of engagement
- Paraphrase only; no verbatim text.
- If exact wording is required, use: `Exact wording required—refer to pointer.`
- Every rule must include a `source_refs` pointer.
- Prefer numbers-only audits when validating coverage maps against OCR.
- Run:
  - `PYTHONPATH=src python3 -m iftypeset.cli validate-spec --spec spec --build-indexes`
  - `./scripts/ci.sh`
- Checkpoint after meaningful batches:
  - `./scripts/checkpoint.sh "extraction <category> batch <NNN>"`
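The pointer requirement above lends itself to a mechanical pre-check before running the validator. A rough sketch, assuming each rule is a dict with a hypothetical `id` key alongside the `source_refs` field named above; the real rule schema and the checks performed by `validate-spec` may differ.

```python
def rules_missing_source_refs(rules):
    """Return the ids of rules whose source_refs field is absent or empty."""
    missing = []
    for rule in rules:
        # Treat a missing key, None, and an empty list the same way.
        if not rule.get("source_refs"):
            missing.append(rule.get("id", "<no id>"))
    return missing


# Hypothetical sample data; ids and pointers are placeholders.
sample_rules = [
    {"id": "EXAMPLE.001", "source_refs": ["EXAMPLE-POINTER"]},
    {"id": "EXAMPLE.002", "source_refs": []},
    {"id": "EXAMPLE.003"},
]
print(rules_missing_source_refs(sample_rules))
```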
Spot checks (numbers only)
To reduce “missing section” drift, use the OCR audit helper to compare section numbers only against the coverage map within a scan range (and to sanity-check pointer scan/printed pages).
Example (CMOS18 chapter 13 slice):
python3 tools/coverage_ocr_audit.py \
--pdf "/root/docs/_uploads/The Chicago Manual of Style (18th ed OCR).pdf" \
--coverage spec/coverage/cmos18_sections.json \
--chapter 13 \
--scan-start 840 \
--scan-end 870 \
--out-json out/coverage-ocr-audit-ch13.json \
--out-md out/coverage-ocr-audit-ch13.md
Interpretation:
- `OCR-only` section ids likely need new coverage entries.
- `Coverage-only` section ids often indicate pointer drift or OCR weirdness (verify the scan page).
- `Printed page mismatches` are strong signals that a pointer is wrong or that the scan page is not the referenced content.
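The first two categories amount to a set difference over section numbers. The sketch below illustrates that idea only; it is not the logic of `tools/coverage_ocr_audit.py`, and the function name, the `<chapter>.<number>` id pattern, and the sample ids are assumptions.

```python
import re

# Assumption: section ids look like "<chapter>.<number>", e.g. "13.4".
SECTION_ID = re.compile(r"\d+\.\d+")


def compare_section_ids(ocr_ids, coverage_ids):
    """Compare section numbers only; no book text is involved."""
    ocr = {s for s in ocr_ids if SECTION_ID.fullmatch(s)}
    cov = {s for s in coverage_ids if SECTION_ID.fullmatch(s)}

    def sort_key(section_id):
        # Sort numerically so that 13.10 follows 13.9 rather than 13.1.
        chapter, number = section_id.split(".")
        return int(chapter), int(number)

    return {
        "ocr_only": sorted(ocr - cov, key=sort_key),
        "coverage_only": sorted(cov - ocr, key=sort_key),
    }


result = compare_section_ids(["13.1", "13.2", "13.10"], ["13.2", "13.3"])
print(result)
```

Strings that do not match the id pattern are dropped before comparison, which keeps stray OCR fragments out of both result lists.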
Progress metrics
- `% sections mapped` per book (non-`uncovered`).
- `% sections complete` per book (`covered` + `out_of_scope`).
- `must` rules implemented coverage (CI floor = 95%).
- Overall implemented coverage (CI floor = 80%).
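As a sketch of how the two section-level percentages could be derived from a coverage map, assuming each entry is a dict with a `status` field as described earlier. The function name and sample data are hypothetical, and the implemented-coverage CI floors are not computed here.

```python
from collections import Counter


def coverage_metrics(entries):
    """Return (% mapped, % complete) for one book's coverage entries."""
    total = len(entries)
    if total == 0:
        return 0.0, 0.0
    counts = Counter(entry["status"] for entry in entries)
    # Mapped: anything that is no longer "uncovered".
    mapped = total - counts["uncovered"]
    # Complete: only "covered" or "out_of_scope" count.
    complete = counts["covered"] + counts["out_of_scope"]
    return 100.0 * mapped / total, 100.0 * complete / total


# Hypothetical sample: one entry per status value.
sample = [
    {"status": "uncovered"},
    {"status": "partial"},
    {"status": "covered"},
    {"status": "out_of_scope"},
]
print(coverage_metrics(sample))
```

Keeping the two numbers separate mirrors the milestone split: a `partial` section raises the mapped percentage but not the complete one.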