iftypeset/spec/extraction_plan.md
2026-01-03 20:29:35 +00:00

Phase 2 Extraction Plan

This plan defines how rules will be produced in controlled batches without reproducing the books.

Non-negotiables (carried into Phase 2)

  • No full-book OCR/transcription.
  • No long verbatim passages.
  • Rules are paraphrased and capped (rule_text ≤ 800 chars).
  • Every rule includes at least one source pointer in source_refs.
  • If a rule depends on exact wording, the rule is still paraphrased, but it must include:
    • rule_text: “Exact wording required—refer to pointer”
    • plus a usable pointer.
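
The two mechanical constraints above (the 800-character cap and the required pointer) can be checked per record. The following is a minimal sketch, not the project's validator: the function name `check_non_negotiables` is hypothetical, and only the field names `rule_text` and `source_refs` come from this plan.

```python
MAX_RULE_TEXT_CHARS = 800  # cap stated in this plan


def check_non_negotiables(rule: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passes."""
    problems = []

    text = rule.get("rule_text", "")
    if not text:
        problems.append("rule_text is missing or empty")
    elif len(text) > MAX_RULE_TEXT_CHARS:
        problems.append(
            f"rule_text is {len(text)} chars; cap is {MAX_RULE_TEXT_CHARS}"
        )

    # At least one non-blank pointer string is required.
    refs = rule.get("source_refs") or []
    if not any(isinstance(ref, str) and ref.strip() for ref in refs):
        problems.append("source_refs needs at least one usable pointer")

    return problems
```

A check like this cannot tell whether a passage is paraphrased or verbatim; that part of the non-negotiables still needs human review.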

Primary reference PDFs for pointer extraction:

  • The Chicago Manual of Style (18th ed).pdf
  • Robert Bringhurst The Elements of Typographic Style.pdf

Output batching format

When you say: EXTRACT <CATEGORY> [<SCOPE>]

I will output a bundle that includes:

  1. Rules NDJSON (150–250 rule records)

    • Path: spec/rules/<category>/<batch_id>.ndjson
    • One JSON object per line, validated against spec/schema/rule.schema.json.
  2. Index deltas for that category

    • spec/indexes/keywords_<category>.json
    • spec/indexes/source_refs_<category>.json
    • spec/indexes/coverage_delta_<category>.json
  3. Coverage notes report

    • Short Markdown report describing enforcement split (lint/typeset/postrender/manual)
    • plus any known gaps or manual-only areas
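
As an illustration of the "one JSON object per line" requirement, here is a small sketch of writing and reading a rules batch. The helper names (`batch_path`, `to_ndjson`, `from_ndjson`) are hypothetical, and validation against spec/schema/rule.schema.json is deliberately left out.

```python
import json
from pathlib import PurePosixPath


def batch_path(category: str, batch_id: str) -> str:
    """Build the repo-relative path spec/rules/<category>/<batch_id>.ndjson."""
    return str(PurePosixPath("spec/rules") / category / f"{batch_id}.ndjson")


def to_ndjson(rules: list[dict]) -> str:
    """Serialize records one per line.

    json.dumps escapes newlines inside strings, so a record never spans
    more than one physical line.
    """
    return "".join(
        json.dumps(rule, ensure_ascii=False, sort_keys=True) + "\n"
        for rule in rules
    )


def from_ndjson(text: str) -> list[dict]:
    """Parse NDJSON, skipping blank lines.

    Split on "\n" only: str.splitlines() would also break on characters
    such as U+2028, which may legally appear unescaped inside a record.
    """
    return [json.loads(line) for line in text.split("\n") if line.strip()]
```

Keeping `ensure_ascii=False` leaves characters such as § readable in the file, which makes pointer strings easier to grep.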

Batch naming

<batch_id> format:

  • v1_<category>_<nnn> (e.g. v1_punctuation_001)
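
The naming pattern can be parsed and generated mechanically. This sketch assumes that `<category>` is a lowercase slug (letters, digits, underscores) and that `<nnn>` is exactly three digits; neither detail is fixed by this plan, and the function names are hypothetical.

```python
import re

# Assumed shape: v1_<lowercase slug>_<three digits>
BATCH_ID_RE = re.compile(
    r"v1_(?P<category>[a-z0-9]+(?:_[a-z0-9]+)*)_(?P<seq>\d{3})"
)


def parse_batch_id(batch_id: str):
    """Return (category, sequence number), or None if the ID is malformed."""
    match = BATCH_ID_RE.fullmatch(batch_id)
    if match is None:
        return None
    return match["category"], int(match["seq"])


def next_batch_id(category: str, existing: list[str]) -> str:
    """Return the next unused batch ID for a category."""
    used = [
        parsed[1]
        for parsed in map(parse_batch_id, existing)
        if parsed is not None and parsed[0] == category
    ]
    return f"v1_{category}_{max(used, default=0) + 1:03d}"
```

For example, `parse_batch_id("v1_punctuation_001")` returns `("punctuation", 1)`.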

Batches are append-only:

  • If a rule needs revision, mark the old rule as deprecated, add the revised rule under a new rule ID (or a new version segment), and keep both records.
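
One way to picture the append-only revision step is the sketch below. The field names `id`, `status`, and `supersedes` are assumptions made for illustration; the actual names are whatever spec/schema/rule.schema.json defines.

```python
def revise_rule(records: list[dict], old_id: str, new_rule: dict) -> list[dict]:
    """Return a new record list with the old rule deprecated and the new one appended.

    Nothing is deleted: both the deprecated record and its replacement remain.
    The input list and its dicts are not modified.
    """
    ids = {record["id"] for record in records}
    if old_id not in ids:
        raise KeyError(f"unknown rule id: {old_id}")
    if new_rule["id"] in ids:
        raise ValueError(f"rule id already used: {new_rule['id']}")

    revised = [
        {**record, "status": "deprecated"} if record["id"] == old_id else record
        for record in records
    ]
    # "supersedes" is a hypothetical back-link from the new rule to the old one.
    revised.append({**new_rule, "supersedes": old_id})
    return revised
```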

Pointer scheme details

Pointer strings live in source_refs[] and are not quotes.

Preferred pointer format:

  • CMOS18 §<section> p<book_page>
  • BRING §<section> p<book_page>
  • Optional disambiguation: (scan p<pdf_page_index>)

Example pattern (not a quote):

  • CMOS18 §6.1 p377 (scan p10)

Notes:

  • “book_page” uses the printed page number in the book when present (arabic or roman).
  • “scan p” uses the PDF page index when printed page numbers are ambiguous.
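
The preferred pointer format is regular enough to parse. The sketch below assumes section identifiers are dot-separated alphanumerics and that roman page numbers use only the letters i, v, x, l, c, d, m; the `Pointer` type and `parse_pointer` function are hypothetical names.

```python
import re
from typing import NamedTuple, Optional

POINTER_RE = re.compile(
    r"(?P<source>CMOS18|BRING) "
    r"§(?P<section>[0-9A-Za-z]+(?:\.[0-9A-Za-z]+)*) "
    r"p(?P<book_page>[0-9]+|[ivxlcdmIVXLCDM]+)"
    r"(?: \(scan p(?P<scan_page>[0-9]+)\))?"
)


class Pointer(NamedTuple):
    source: str
    section: str
    book_page: str            # kept as text: may be arabic or roman
    scan_page: Optional[int]  # PDF page index, when given


def parse_pointer(ref: str) -> Optional[Pointer]:
    """Parse a source_refs entry; return None if it is not in the preferred format."""
    match = POINTER_RE.fullmatch(ref.strip())
    if match is None:
        return None
    scan = match["scan_page"]
    return Pointer(
        source=match["source"],
        section=match["section"],
        book_page=match["book_page"],
        scan_page=int(scan) if scan is not None else None,
    )
```

With the example pattern above, `parse_pointer("CMOS18 §6.1 p377 (scan p10)")` returns `Pointer(source='CMOS18', section='6.1', book_page='377', scan_page=10)`.
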

Category extraction order

  1. numbers
  2. punctuation
  3. citations
  4. headings
  5. tables
  6. figures
  7. links
  8. code
  9. layout (widows/orphans, keeps, overflow)
  10. front/back matter
  11. accessibility
  12. i18n

Rationale:

  • Numbers/punctuation/citations most directly affect correctness and consistency.
  • Layout rules benefit from having structure and tokens in place.

Enforcement mapping guidelines (honest labeling)

  • lint: detectable from AST or text normalization (spacing, punctuation patterns, citation fields).
  • typeset: enforced via CSS/tokens/paged-media decisions.
  • postrender: requires layout inspection after rendering.
  • manual: cannot be reliably automated; must include tags: ["manual_checklist=true"] and be emitted into checklist outputs.
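
The labeling rules above can be checked mechanically, and the same labels feed the enforcement split in the coverage notes. In this sketch the field name `enforcement` is an assumption; the tag string and the four layer names come from this plan.

```python
from collections import Counter

ENFORCEMENT_LAYERS = ("lint", "typeset", "postrender", "manual")
MANUAL_TAG = "manual_checklist=true"


def check_enforcement(rule: dict) -> list[str]:
    """Return labeling problems for one rule; an empty list means it is fine."""
    problems = []
    layer = rule.get("enforcement")  # hypothetical field name
    if layer not in ENFORCEMENT_LAYERS:
        problems.append(f"unknown enforcement layer: {layer!r}")
    elif layer == "manual" and MANUAL_TAG not in rule.get("tags", []):
        problems.append(f"manual rule is missing tag {MANUAL_TAG!r}")
    return problems


def enforcement_split(rules: list[dict]) -> dict[str, int]:
    """Count rules per layer, reporting every layer even when its count is zero."""
    counts = Counter(rule.get("enforcement") for rule in rules)
    return {layer: counts.get(layer, 0) for layer in ENFORCEMENT_LAYERS}
```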

If a concept spans multiple enforcement layers:

  • Prefer splitting into two rules:
    • one lint rule (source cleanliness)
    • one postrender rule (layout outcome)
  • Use dependencies to link them.
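
A split concept might be represented as the pair of records built below. This is only a sketch: the ID suffixes, the `enforcement` field name, and the direction of the link (the postrender rule depends on the lint rule) are assumptions, while `rule_text`, `source_refs`, and `dependencies` are the names used in this plan.

```python
def split_concept(
    base_id: str,
    lint_text: str,
    postrender_text: str,
    source_refs: list[str],
) -> tuple[dict, dict]:
    """Build a lint rule and a postrender rule for one concept, linked by dependencies."""
    lint_rule = {
        "id": f"{base_id}.lint",  # hypothetical ID convention
        "enforcement": "lint",
        "rule_text": lint_text,   # source cleanliness
        "source_refs": list(source_refs),
        "dependencies": [],
    }
    postrender_rule = {
        "id": f"{base_id}.postrender",
        "enforcement": "postrender",
        "rule_text": postrender_text,  # layout outcome
        "source_refs": list(source_refs),
        # Assumed direction: the layout check presumes the source is already clean.
        "dependencies": [lint_rule["id"]],
    }
    return lint_rule, postrender_rule
```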

“Degraded mode” considerations during extraction

For each category batch, include some rules that specifically target degraded inputs:

  • hard-wrap repair suggestions
  • heading inference warnings
  • link sanitation and encoding fixes
  • Unicode normalization notes

These rules should generally be labeled warn or should; use must only when the rule prevents corruption.
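
As one example of a degraded-input rule, a Unicode normalization check could emit a warn-level finding with a suggested fix. The choice of NFC as the target form and the shape of the returned finding are assumptions for illustration; `unicodedata.is_normalized` requires Python 3.8 or later.

```python
import unicodedata
from typing import Optional


def normalization_warning(text: str, form: str = "NFC") -> Optional[dict]:
    """Return a warn-level finding if text is not in the given normalization form."""
    if unicodedata.is_normalized(form, text):
        return None
    return {
        "level": "warn",  # would be escalated to must only if corruption is at stake
        "message": f"text is not {form}-normalized",
        "suggestion": unicodedata.normalize(form, text),
    }
```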