iftypeset/docs/19-coverage-roadmap.md
codex e92f1c3b93
iftypeset: document CI pipeline + Playwright + font contract
2026-01-08 18:10:41 +00:00


Rule Coverage Roadmap (CMOS / Bringhurst)

Goal: move coverage close to 100% of rule-bearing sections in the Chicago Manual of Style (18th ed.) and Bringhurst (The Elements of Typographic Style), without reproducing book text.

Definition of "100%"

Coverage is section-based, not "full-book OCR." To keep multi-session work honest and queueable, we track two milestones:

  • Milestone A (mapped): a section is no longer uncovered once at least one paraphrased rule exists with a valid pointer. In practice this corresponds to status=partial in the coverage map.
  • Milestone B (complete): a section is only considered complete once it is marked covered or out_of_scope.
    • covered means a completion pass was done and the section's rule-bearing guidance is believed to be fully represented as paraphrased rules (no verbatim text).
    • out_of_scope means the section is primarily narrative/historical/non-prescriptive and is explicitly excluded.

This avoids over-claiming: partial means “some rules exist”, not “we're done.”

Inputs (local only)

  • /root/docs/_uploads/The Chicago Manual of Style (18th ed OCR).pdf
  • /root/docs/_uploads/Robert Bringhurst The Elements of Typographic Style (OCR).pdf

OCR is used only to locate sections + page pointers. No verbatim text enters the repo.

Artifacts (new)

  • spec/coverage/cmos18_sections.json
  • spec/coverage/bring_sections.json
  • spec/coverage/README.md
  • spec/coverage/coverage_summary.json (optional, derived)

Each entry should track: section_id, title, pointer, status (uncovered|partial|covered|out_of_scope), and rule_ids[].
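The entry shape above can be sketched as a small Python record with a consistency check. This is a minimal illustration, not the repo's actual schema: the `SectionEntry` and `validate_entry` names, the string type for `pointer`, and the rule that `partial`/`covered` entries need at least one rule id (inferred from the Milestone A definition) are all assumptions.

```python
from dataclasses import dataclass, field

# Status values as listed in this document.
VALID_STATUSES = {"uncovered", "partial", "covered", "out_of_scope"}


@dataclass
class SectionEntry:
    """Hypothetical in-memory form of one coverage-map entry."""
    section_id: str
    title: str
    pointer: str  # assumed to be a plain string; the real pointer format may differ
    status: str = "uncovered"
    rule_ids: list = field(default_factory=list)


def validate_entry(entry: SectionEntry) -> list:
    """Return a list of human-readable problems (empty means the entry looks consistent)."""
    problems = []
    if entry.status not in VALID_STATUSES:
        problems.append(f"{entry.section_id}: unknown status {entry.status!r}")
    if not entry.pointer:
        problems.append(f"{entry.section_id}: missing pointer")
    # Assumption: a mapped or covered section must reference at least one rule.
    if entry.status in {"partial", "covered"} and not entry.rule_ids:
        problems.append(f"{entry.section_id}: status {entry.status} but no rule_ids")
    # Assumption: an uncovered section should not list rules yet.
    if entry.status == "uncovered" and entry.rule_ids:
        problems.append(f"{entry.section_id}: has rule_ids but is still uncovered")
    return problems


if __name__ == "__main__":
    # The section id, pointer, and rule id below are made-up placeholders.
    entry = SectionEntry("13.1", "Example section", "scan p. 841", "partial", [])
    for problem in validate_entry(entry):
        print(problem)
```

A check like this could run before a checkpoint so that a status never drifts away from the rules that justify it.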

Phases

Phase 0: Coverage Map Scaffold

  • Create coverage files + schema conventions.
  • Seed with top-level sections and a few subsections to validate the workflow.

Phase 1: High-Impact Rules (CMOS)

  • numbers, punctuation, citations, headings, links.
  • Batch size: 150–250 rules per category.

Phase 2: Structure + Layout (BRING + CMOS)

  • typography, layout, widows/orphans, tables, figures.
  • Translate to typeset/postrender where possible.

Phase 3: Remaining Categories

  • code, front/back matter, accessibility, i18n.

Phase 4: Enforcement Uplift

  • Convert top-impact manual rules to lint/postrender where feasible.
  • Tighten QA gates without breaking deterministic behavior.

Rules of engagement

  • Paraphrase only; no verbatim text.
  • If exact wording is required, use: Exact wording required—refer to pointer.
  • Every rule must include a source_refs pointer.
  • Prefer numbers-only audits when validating coverage maps against OCR.
  • Run:
    • PYTHONPATH=src python3 -m iftypeset.cli validate-spec --spec spec --build-indexes
    • ./scripts/ci.sh
  • Checkpoint after meaningful batches: ./scripts/checkpoint.sh "extraction <category> batch <NNN>"

Spot checks (numbers only)

To reduce “missing section” drift, use the OCR audit helper to compare section numbers only against the coverage map within a scan range, and to sanity-check each pointer's scan page and printed page.

Example (CMOS18 chapter 13 slice):

python3 tools/coverage_ocr_audit.py \
  --pdf "/root/docs/_uploads/The Chicago Manual of Style (18th ed OCR).pdf" \
  --coverage spec/coverage/cmos18_sections.json \
  --chapter 13 \
  --scan-start 840 \
  --scan-end 870 \
  --out-json out/coverage-ocr-audit-ch13.json \
  --out-md out/coverage-ocr-audit-ch13.md

Interpretation:

  • OCR-only section ids likely need new coverage entries.
  • Coverage-only section ids often indicate pointer drift or OCR weirdness (verify the scan page).
  • Printed page mismatches are strong signals that a pointer is wrong or that the scan page is not the referenced content.
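The first two interpretation buckets amount to a set difference between the section ids seen in the OCR scan range and those present in the coverage map. The sketch below illustrates that comparison only; it is not the implementation of tools/coverage_ocr_audit.py, and the `compare_section_ids` name and the dotted-number id format (such as "13.10") are assumptions.

```python
def _sort_key(section_id: str):
    """Sort dotted ids numerically so that 13.2 comes before 13.10."""
    return [(0, int(part), "") if part.isdigit() else (1, 0, part)
            for part in section_id.split(".")]


def compare_section_ids(ocr_ids, coverage_ids):
    """Split ids into those seen only in the OCR scan and only in the coverage map."""
    ocr, coverage = set(ocr_ids), set(coverage_ids)
    return {
        # Likely need new coverage entries.
        "ocr_only": sorted(ocr - coverage, key=_sort_key),
        # Possible pointer drift or OCR noise; verify the scan page.
        "coverage_only": sorted(coverage - ocr, key=_sort_key),
    }


if __name__ == "__main__":
    # Made-up ids for illustration.
    result = compare_section_ids(["13.1", "13.2", "13.10"], ["13.2", "13.3"])
    print(result)
```

Only section numbers are compared here, which matches the numbers-only constraint: no book text needs to be read or stored to produce the two lists.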

Progress metrics

  • % sections mapped per book (non-uncovered).
  • % sections complete per book (covered + out_of_scope).
  • Implemented coverage of must rules (CI floor = 95%).
  • Overall implemented coverage (CI floor = 80%).
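The two section-level percentages follow directly from the status counts, and the CI floors are simple threshold checks. The sketch below is one possible way to compute them; the function names are hypothetical, and how the implemented-coverage percentages are actually derived in CI is not specified here, so they are taken as plain inputs.

```python
from collections import Counter

# CI floors as stated in this document.
MUST_FLOOR_PCT = 95.0
OVERALL_FLOOR_PCT = 80.0


def section_progress(statuses):
    """Return mapped and complete percentages for one book's section statuses."""
    counts = Counter(statuses)
    total = sum(counts.values())
    if total == 0:
        return {"mapped_pct": 0.0, "complete_pct": 0.0}
    mapped = total - counts["uncovered"]                    # Milestone A: non-uncovered
    complete = counts["covered"] + counts["out_of_scope"]   # Milestone B
    return {
        "mapped_pct": 100.0 * mapped / total,
        "complete_pct": 100.0 * complete / total,
    }


def check_floors(must_pct, overall_pct):
    """Return a list of floor violations (empty means both floors are met)."""
    failures = []
    if must_pct < MUST_FLOOR_PCT:
        failures.append(f"must coverage {must_pct:.1f}% < {MUST_FLOOR_PCT:.0f}%")
    if overall_pct < OVERALL_FLOOR_PCT:
        failures.append(f"overall coverage {overall_pct:.1f}% < {OVERALL_FLOOR_PCT:.0f}%")
    return failures


if __name__ == "__main__":
    print(section_progress(["uncovered", "partial", "covered", "out_of_scope"]))
    print(check_floors(must_pct=96.0, overall_pct=78.5))
```

Keeping mapped and complete as separate numbers preserves the distinction between the two milestones: a book can be fully mapped while still far from complete.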