Phase 2 Extraction Plan
This plan defines how rules will be produced in controlled batches without reproducing the books.
Non-negotiables (carried into Phase 2)
- No full-book OCR/transcription.
- No long verbatim passages.
- Rules are paraphrased and capped (rule_text ≤ 800 chars).
- Every rule includes at least one source pointer in source_refs.
- If a rule depends on exact wording, the rule still paraphrases but must include:
  - rule_text: “Exact wording required—refer to pointer”
  - plus a usable pointer.
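The length cap and the pointer requirement can be checked mechanically. A minimal sketch, assuming each rule record is a dict carrying the rule_text and source_refs fields named above; the function name check_non_negotiables and the constant MAX_RULE_TEXT are hypothetical and not part of this plan:

```python
MAX_RULE_TEXT = 800  # cap stated in the non-negotiables


def check_non_negotiables(rule: dict) -> list:
    """Return a list of problems; an empty list means the rule passes."""
    problems = []

    # rule_text must stay within the paraphrase cap.
    text = rule.get("rule_text", "")
    if len(text) > MAX_RULE_TEXT:
        problems.append(
            f"rule_text is {len(text)} chars (cap is {MAX_RULE_TEXT})"
        )

    # At least one non-empty pointer string is required.
    refs = rule.get("source_refs") or []
    if not any(isinstance(r, str) and r.strip() for r in refs):
        problems.append("source_refs must contain at least one pointer")

    return problems
```

A check like this only covers the countable constraints; whether a rule_text is a genuine paraphrase still needs human review.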
Primary reference PDFs for pointer extraction:
- The Chicago Manual of Style (18th ed).pdf
- Robert Bringhurst – The Elements of Typographic Style.pdf
Output batching format
When you say: EXTRACT <CATEGORY> [<SCOPE>]
I will output a bundle that includes:
- Rules NDJSON (150–250 rule records)
  - Path: spec/rules/<category>/<batch_id>.ndjson
  - One JSON object per line, validated against spec/schema/rule.schema.json.
- Index deltas for that category
  - spec/indexes/keywords_<category>.json
  - spec/indexes/source_refs_<category>.json
  - spec/indexes/coverage_delta_<category>.json
- Coverage notes report
  - Short Markdown report describing enforcement split (lint/typeset/postrender/manual)
  - plus any known gaps or manual-only areas
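The bundle paths follow directly from the category and batch ID. The sketch below derives them and serializes records as NDJSON; the helper names bundle_paths and to_ndjson are hypothetical, and the coverage notes report is left out because this plan does not give it a path:

```python
import json


def bundle_paths(category: str, batch_id: str) -> dict:
    """Build the rule and index paths for one batch (path layout from this plan)."""
    rules_path = f"spec/rules/{category}/{batch_id}.ndjson"
    index_paths = [
        f"spec/indexes/{name}_{category}.json"
        for name in ("keywords", "source_refs", "coverage_delta")
    ]
    return {"rules": rules_path, "indexes": index_paths}


def to_ndjson(records: list) -> str:
    """Serialize records as NDJSON: one JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)
```

Schema validation against spec/schema/rule.schema.json would sit between building the records and calling to_ndjson; it is omitted here because it needs a third-party validator.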
Batch naming
<batch_id> format:
v1_<category>_<nnn> (e.g. v1_punctuation_001)
Batches are append-only:
- If a rule needs revision, mark the old rule deprecated, add a new rule ID (or a new version segment), and keep both records.
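One way to picture the naming and append-only conventions is sketched below. It rests on assumptions this plan does not state: that <nnn> is exactly three digits (inferred from the example), that categories are lowercase, and that records carry hypothetical id and deprecated fields.

```python
import re

# Assumption: lowercase category (underscores allowed) and a three-digit counter.
BATCH_ID_RE = re.compile(
    r"^v1_(?P<category>[a-z0-9]+(?:_[a-z0-9]+)*)_(?P<nnn>\d{3})$"
)


def revise(records: list, old_id: str, new_rule: dict) -> list:
    """Return a new list with old_id flagged deprecated and new_rule appended.

    Nothing is deleted, so both the old and the new record survive.
    """
    out = []
    found = False
    for rec in records:
        if rec.get("id") == old_id:  # "id" is a hypothetical field name
            rec = {**rec, "deprecated": True}  # hypothetical flag
            found = True
        out.append(rec)
    if not found:
        raise KeyError(old_id)
    out.append(new_rule)
    return out
```

Copying the matched record instead of mutating it keeps the input list untouched, which makes it easy to diff a batch before and after a revision.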
Pointer scheme details
Pointer strings live in source_refs[] and are not quotes.
Preferred pointer format:
- CMOS18 §<section> p<book_page>
- BRING §<section> p<book_page>
- Optional disambiguation: (scan p<pdf_page_index>)
Example pattern (not a quote):
CMOS18 §6.1 p377 (scan p10)
Notes:
- “book_page” uses the printed page number in the book when present (arabic or roman).
- “scan p” uses the PDF page index when printed page numbers are ambiguous.
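The preferred pointer format is regular enough to parse with a single pattern. A minimal sketch, assuming section identifiers contain no spaces and roman page numbers use only the letters i, v, x, l, c, d, m; the names POINTER_RE and parse_pointer are hypothetical:

```python
import re
from typing import Optional

POINTER_RE = re.compile(
    r"^(?P<source>CMOS18|BRING)"
    r" §(?P<section>\S+)"
    r" p(?P<book_page>[0-9]+|[ivxlcdmIVXLCDM]+)"  # arabic or roman printed page
    r"(?: \(scan p(?P<scan_page>[0-9]+)\))?$"     # optional PDF page index
)


def parse_pointer(pointer: str) -> Optional[dict]:
    """Split a pointer string into its parts, or return None if it is malformed."""
    m = POINTER_RE.match(pointer)
    if m is None:
        return None
    parts = m.groupdict()
    # book_page stays a string because it may be roman; scan_page is an index.
    if parts["scan_page"] is not None:
        parts["scan_page"] = int(parts["scan_page"])
    return parts
```

For the example pattern above, parse_pointer("CMOS18 §6.1 p377 (scan p10)") yields source "CMOS18", section "6.1", book_page "377", and scan_page 10.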
Recommended extraction order (high-impact first)
- numbers
- punctuation
- citations
- headings
- tables
- figures
- links
- code
- layout (widows/orphans, keeps, overflow)
- front/back matter
- accessibility
- i18n
Rationale:
- Numbers/punctuation/citations most directly affect correctness and consistency.
- Layout rules benefit from having structure and tokens in place.
Enforcement mapping guidelines (honest labeling)
- lint: detectable from AST or text normalization (spacing, punctuation patterns, citation fields).
- typeset: enforced via CSS/tokens/paged-media decisions.
- postrender: requires layout inspection after rendering.
- manual: cannot be reliably automated; must include tags: ["manual_checklist=true"] and be emitted into checklist outputs.
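The labeling rule for manual items can be enforced at batch time. The sketch below assumes the enforcement layer is stored in a field called enforcement, which is a hypothetical name; the tag value comes from this plan.

```python
ENFORCEMENT_LAYERS = {"lint", "typeset", "postrender", "manual"}
MANUAL_TAG = "manual_checklist=true"


def check_enforcement(rule: dict) -> list:
    """Return labeling problems for one rule; an empty list means it is fine."""
    problems = []
    layer = rule.get("enforcement")  # hypothetical field name
    if layer not in ENFORCEMENT_LAYERS:
        problems.append(f"unknown enforcement layer: {layer!r}")
    # Manual rules must carry the checklist tag.
    if layer == "manual" and MANUAL_TAG not in rule.get("tags", []):
        problems.append(f"manual rule is missing tag {MANUAL_TAG!r}")
    return problems


def manual_checklist(rules: list) -> list:
    """Select the rules that belong in checklist outputs."""
    return [r for r in rules if MANUAL_TAG in r.get("tags", [])]
```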
If a concept spans multiple enforcement layers:
- Prefer splitting into two rules:
  - one lint rule (source cleanliness)
  - one postrender rule (layout outcome)
- Use dependencies to link them.
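A split of this kind might look like the following sketch. The id and enforcement fields, the _lint and _postrender suffixes, and the direction of the link (the postrender rule depending on the lint rule) are all assumptions made for illustration; only the dependencies field is named in this plan.

```python
def split_rule(base_id: str, lint_text: str, postrender_text: str,
               source_refs: list) -> tuple:
    """Build a lint rule and a postrender rule linked through dependencies."""
    lint_rule = {
        "id": f"{base_id}_lint",          # hypothetical ID convention
        "enforcement": "lint",            # hypothetical field name
        "rule_text": lint_text,
        "source_refs": list(source_refs),
        "dependencies": [],
    }
    postrender_rule = {
        "id": f"{base_id}_postrender",
        "enforcement": "postrender",
        "rule_text": postrender_text,
        "source_refs": list(source_refs),
        # Assumed direction: the layout check builds on clean source.
        "dependencies": [lint_rule["id"]],
    }
    return lint_rule, postrender_rule
```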
“Degraded mode” considerations during extraction
For each category batch, include some rules that specifically target degraded inputs:
- hard-wrap repair suggestions
- heading inference warnings
- link sanitation and encoding fixes
- Unicode normalization notes
These rules should generally be labeled warn or should; use must only when the rule prevents corruption.