Rule Ingestion SOP (Chicago / Bringhurst pointers, no quotes)
This is the operator workflow for adding new rules to iftypeset without drifting into “we copied the book”.
Non‑negotiables (repeat them until boring)
- Never store book text (OCR output, excerpts, paragraphs) in the repo.
- Rules are paraphrases plus pointer refs, nothing more.
- Chicago OCR is allowed ephemerally to locate pointers; temp files must be deleted.
- If exact wording matters, the rule must say:
Exact wording required—refer to pointer.
What you’re producing
You add a small batch file under spec/rules/<category>/v1_<category>_<nnn>.ndjson where each line is a JSON rule record validated by spec/schema/rule.schema.json.
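Because every line is an independent JSON record, a batch file can be sanity-checked before the real validator runs. The following is a minimal sketch using only the Python standard library. The function name `load_batch` is hypothetical, skipping blank lines is an assumption, and the check does not replace validation against spec/schema/rule.schema.json.

```python
import json


def load_batch(path):
    """Parse an NDJSON batch file: one JSON object per non-blank line."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for lineno, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # assumption: blank lines are tolerated and skipped
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{lineno}: not valid JSON ({exc.msg})") from exc
            if not isinstance(record, dict):
                raise ValueError(f"{path}:{lineno}: expected a JSON object")
            records.append(record)
    return records
```

A parse failure reports the file and line number, which is usually enough to find a record that was accidentally pretty-printed across several lines.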
Step-by-step workflow
1) Pick the category + severity honestly
- Category: one of the category_taxonomy buckets in spec/manifest.yaml.
- Severity:
  - must: blocks release unless profile overrides lower it
  - should: best practice; can be warn in degraded mode
  - warn: advisory
If it can’t be automated, tag it manual_checklist=true and set enforcement: manual.
2) Locate the pointer (without storing text)
Bringhurst (text-layer usable)
- Use tools/bringhurst_locate.py (preferred) or ripgrep directly.
- Capture only:
  - section identifier
  - book page number (if present)
  - scan page index (optional)
Chicago (image scan)
- Use tools/chicago_ocr.py grep-only.
- Do not copy OCR output into files.
- Use OCR to find:
  - the relevant section (§)
  - the printed page number
  - the scan page index
Pointer format examples (note: these are pointers, not quotes):
CMOS18 §6.1 p377 (scan p10)
BRING §2.3.2 p39 (scan p412)
Rule: (scan pN) is a single 1-based PDF page index, not a range.
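The pointer shape can be checked mechanically. The sketch below is an assumption inferred only from the two example pointers above: the pattern, the name `parse_pointer`, and the choice to treat the page and scan parts as optional are all hypothetical, and the real validator may accept other source keys or page styles.

```python
import re

# Assumed pattern, inferred from the two example pointers only.
POINTER_RE = re.compile(
    r"^(?P<source>[A-Z][A-Z0-9]*) "      # e.g. CMOS18, BRING
    r"§(?P<section>\d+(?:\.\d+)*)"       # e.g. 6.1, 2.3.2
    r"(?: p(?P<page>\d+))?"              # printed page number, if present
    r"(?: \(scan p(?P<scan>\d+)\))?$"    # a single scan page, never a range
)


def parse_pointer(ref):
    """Return the pointer fields as a dict, or None if the shape is wrong."""
    match = POINTER_RE.match(ref)
    if match is None:
        return None  # also rejects ranges such as "(scan p10-12)"
    fields = match.groupdict()
    if fields["scan"] is not None and int(fields["scan"]) < 1:
        return None  # the scan index is 1-based
    return fields
```

Returning None instead of raising keeps the helper easy to use in a loop over every entry in source_refs.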
3) Write the rule record (paraphrase only)
Create a new NDJSON line. Keep rule_text ≤ 800 chars. Prefer short, enforceable statements.
Minimal template:
{
  "id": "CMOS.PUNCTUATION.DASHES.EM_DASH",
  "title": "Use em dashes consistently",
  "source_refs": ["CMOS18 §X.Y pNNN (scan pMMM)"],
  "category": "punctuation",
  "severity": "should",
  "applies_to": "all",
  "rule_text": "Paraphrase of the rule (no quotes). If wording matters, say: Exact wording required—refer to pointer.",
  "rationale": "Why it matters (one line).",
  "enforcement": "lint",
  "autofix": "suggest",
  "autofix_notes": "What we can safely fix (short).",
  "tags": ["spacing", "manual_checklist=false"],
  "keywords": ["em dash", "dash", "punctuation"],
  "dependencies": [],
  "exceptions": [],
  "status": "draft"
}
Guidelines:
- Do not embed long examples. If you need examples, create them under spec/examples/ and reference examples_ref.
- Prefer splitting cross-layer concepts into two rules:
  - lint rule for source cleanliness
  - postrender rule for layout outcome
- Use dependencies if rule ordering matters (e.g., “normalize quotes” before “ellipsis spacing”).
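The template is shown across several lines for readability, but in the batch file each record occupies exactly one line. A minimal sketch of that serialization step, with the 800-character limit from this section checked first; the helper name `to_ndjson_line` is hypothetical and not part of the iftypeset tooling.

```python
import json

MAX_RULE_TEXT = 800  # limit stated in this SOP


def to_ndjson_line(record):
    """Serialize one rule record as a single NDJSON line."""
    if len(record["rule_text"]) > MAX_RULE_TEXT:
        raise ValueError(f"{record['id']}: rule_text exceeds {MAX_RULE_TEXT} chars")
    # ensure_ascii=False keeps characters such as § readable in the file.
    # Without indent, json.dumps escapes newlines inside strings, so the
    # output stays on one line.
    return json.dumps(record, ensure_ascii=False) + "\n"
```

Appending the returned string to the batch file keeps the one-record-per-line invariant even when a field contains a line break.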
4) Tag manual rules so the checklist can be generated
If a rule requires human judgment (e.g., “choose between two valid citation styles”), set:
enforcement: manual
tags: ["manual_checklist=true"]
autofix: none
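These three settings travel together, so a mismatch is easy to catch before validation. A small sketch, assuming records are plain dicts shaped like the template in step 3; the function name is hypothetical, and the last check (a manual_checklist=true tag on a non-manual rule) is an assumption rather than something this SOP states.

```python
def manual_tagging_problems(record):
    """List mismatches between a rule's enforcement and its manual-checklist tagging."""
    problems = []
    tags = record.get("tags", [])
    is_manual = record.get("enforcement") == "manual"
    if is_manual and "manual_checklist=true" not in tags:
        problems.append("manual rule is missing the manual_checklist=true tag")
    if is_manual and record.get("autofix") != "none":
        problems.append("manual rule must set autofix to none")
    # Assumption: the tag should not appear on rules enforced automatically.
    if not is_manual and "manual_checklist=true" in tags:
        problems.append("manual_checklist=true is set but enforcement is not manual")
    return problems
```

An empty list means the record is consistent; anything else can be printed next to the rule id during review.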
5) Validate + rebuild indexes (every batch)
Run:
PYTHONPATH=src python3 -m iftypeset.cli validate-spec --spec spec --build-indexes
PYTHONPATH=src python3 -m iftypeset.cli report --spec spec --out out --build-indexes
Do not merge a batch if schema validation fails.
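To make the "do not merge on failure" rule harder to skip, both commands can be wrapped in one gate that stops at the first non-zero exit. This is a sketch, not part of the iftypeset tooling: `run_all` and `BATCH_COMMANDS` are hypothetical names, and it assumes the interpreter running the script (`sys.executable`) is the same `python3` used above.

```python
import os
import subprocess
import sys

# The two commands from this step; PYTHONPATH=src is applied through extra_env.
BATCH_COMMANDS = [
    [sys.executable, "-m", "iftypeset.cli", "validate-spec",
     "--spec", "spec", "--build-indexes"],
    [sys.executable, "-m", "iftypeset.cli", "report",
     "--spec", "spec", "--out", "out", "--build-indexes"],
]


def run_all(commands, extra_env=None):
    """Run commands in order; return False at the first non-zero exit code."""
    env = {**os.environ, **(extra_env or {})}
    for cmd in commands:
        if subprocess.run(cmd, env=env).returncode != 0:
            return False
    return True


# Usage from the repo root (not executed here):
#   ok = run_all(BATCH_COMMANDS, extra_env={"PYTHONPATH": "src"})
```

Stopping at the first failure means the report step never runs against a spec that failed validation.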
6) Add fixtures / examples (so rules stay enforced)
For each batch, add at least:
- 1–3 spec/examples/* entries that trigger the rule (small, targeted).
- 1 fixture doc under fixtures/ if the rule affects real documents.
Rules without fixtures drift into “it exists but nothing enforces it.”
7) Promote from draft → active
Only set status: active when:
- the enforcement implementation exists (lint/typeset/postrender/manual)
- at least one fixture/example covers it
Common traps (avoid)
- Copying text into rule_text (even “short” quotes). Don’t.
- Ranges in scan pages: use a single (scan pN) hint.
- MUST rules that are unenforceable: tag as manual checklist or downgrade.
- Overfitting to one document: rules should generalize beyond a single sample.
- “Autofix rewrite” that changes meaning: keep fixes deterministic and reversible.
Review checklist (before shipping a batch)
- No book text stored in repo (grep your changes).
- All rules have valid source_refs pointers.
- rule_text is paraphrase-only and short.
- Manual rules are tagged correctly.
- validate-spec + report pass.
- At least one fixture/example added for the batch.