iftypeset/docs/03-rule-ingestion-sop.md
2026-01-03 20:29:35 +00:00

4.7 KiB

Rule Ingestion SOP (Chicago / Bringhurst pointers, no quotes)

This is the operator workflow for adding new rules to iftypeset without drifting into “we copied the book”.

Nonnegotiables (repeat them until boring)

  • Never store book text (OCR output, excerpts, paragraphs) in the repo.
  • Rules are paraphrases only + pointer refs.
  • Chicago OCR is allowed ephemerally to locate pointers; temp files must be deleted.
  • If exact wording matters, the rule must say: Exact wording required—refer to pointer.

What youre producing

You add a small batch file under spec/rules/<category>/v1_<category>_<nnn>.ndjson where each line is a JSON rule record validated by spec/schema/rule.schema.json.

Step-by-step workflow

1) Pick the category + severity honestly

  • Category: one of the category_taxonomy buckets in spec/manifest.yaml.
  • Severity:
    • must: blocks release unless profile overrides lower it
    • should: best practice; can be warn in degraded mode
    • warn: advisory

If it cant be automated, tag it manual_checklist=true and set enforcement: manual.

2) Locate the pointer (without storing text)

Bringhurst (text-layer usable)

  • Use tools/bringhurst_locate.py (preferred) or ripgrep directly.
  • Capture only:
    • section identifier
    • book page number (if present)
    • scan page index (optional)

Chicago (image scan)

  • Use tools/chicago_ocr.py grep-only.
  • Do not copy OCR output into files.
  • Use OCR to find:
    • the relevant section (§)
    • the printed page number
    • the scan page index

Pointer format examples (note: these are pointers, not quotes):

  • CMOS18 §6.1 p377 (scan p10)
  • BRING §2.3.2 p39 (scan p412)

Rule: (scan pN) is a single 1-based PDF page index, not a range.

3) Write the rule record (paraphrase only)

Create a new NDJSON line. Keep rule_text ≤ 800 chars. Prefer short, enforceable statements.

Minimal template:

{
  "id": "CMOS.PUNCTUATION.DASHES.EM_DASH",
  "title": "Use em dashes consistently",
  "source_refs": ["CMOS18 §X.Y pNNN (scan pMMM)"],
  "category": "punctuation",
  "severity": "should",
  "applies_to": "all",
  "rule_text": "Paraphrase of the rule (no quotes). If wording matters, say: Exact wording required—refer to pointer.",
  "rationale": "Why it matters (one line).",
  "enforcement": "lint",
  "autofix": "suggest",
  "autofix_notes": "What we can safely fix (short).",
  "tags": ["spacing", "manual_checklist=false"],
  "keywords": ["em dash", "dash", "punctuation"],
  "dependencies": [],
  "exceptions": [],
  "status": "draft"
}

Guidelines:

  • Do not embed long examples. If you need examples, create them under spec/examples/ and reference examples_ref.
  • Prefer splitting cross-layer concepts into two rules:
    • lint rule for source cleanliness
    • postrender rule for layout outcome
  • Use dependencies if rule ordering matters (e.g., “normalize quotes” before “ellipsis spacing”).

4) Tag manual rules so the checklist can be generated

If a rule requires human judgment (e.g., “choose between two valid citation styles”), set:

  • enforcement: manual
  • tags: ["manual_checklist=true"]
  • autofix: none

5) Validate + rebuild indexes (every batch)

Run:

  • PYTHONPATH=src python3 -m iftypeset.cli validate-spec --spec spec --build-indexes
  • PYTHONPATH=src python3 -m iftypeset.cli report --spec spec --out out --build-indexes

Do not merge a batch if schema validation fails.

6) Add fixtures / examples (so rules stay enforced)

For each batch, add at least:

  • 13 spec/examples/* entries that trigger the rule (small, targeted).
  • 1 fixture doc under fixtures/ if the rule affects real documents.

Rules without fixtures drift into “it exists but nothing enforces it.”

7) Promote from draft → active

Only set status: active when:

  • the enforcement implementation exists (lint/typeset/postrender/manual)
  • at least one fixture/example covers it

Common traps (avoid)

  • Copying text into rule_text (even “short” quotes). Dont.
  • Ranges in scan pages: use a single (scan pN) hint.
  • MUST rules that are unenforceable: tag as manual checklist or downgrade.
  • Overfitting to one document: rules should generalize beyond a single sample.
  • “Autofix rewrite” that changes meaning: keep fixes deterministic and reversible.

Review checklist (before shipping a batch)

  • No book text stored in repo (grep your changes).
  • All rules have valid source_refs pointers.
  • rule_text is paraphrase-only and short.
  • Manual rules are tagged correctly.
  • validate-spec + report pass.
  • At least one fixture/example added for the batch.