# Phase 2 Extraction Plan

This plan defines how rules will be produced in controlled batches without reproducing the books.

## Non-negotiables (carried into Phase 2)

* No full-book OCR/transcription.
* No long verbatim passages.
* Rules are paraphrased and capped (`rule_text` ≤ 800 chars).
* Every rule includes at least one source pointer in `source_refs`.
* If a rule depends on exact wording, the record still does not quote the source. Instead, it must include:
  * `rule_text`: “Exact wording required—refer to pointer”
  * at least one usable pointer in `source_refs`.
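
The two mechanical constraints above (the `rule_text` cap and the required pointer) can be checked per record. The following is a minimal sketch, not the project's validator: the function name `check_rule` is hypothetical, and it assumes "chars" means Python string length (code points).

```python
MAX_RULE_TEXT = 800  # cap stated in the non-negotiables


def check_rule(rule: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passes."""
    problems = []

    text = rule.get("rule_text", "")
    if not isinstance(text, str) or not text.strip():
        problems.append("rule_text is missing or empty")
    elif len(text) > MAX_RULE_TEXT:
        # Assumption: the cap counts code points, i.e. len() of a str.
        problems.append(f"rule_text is {len(text)} chars (cap is {MAX_RULE_TEXT})")

    refs = rule.get("source_refs")
    has_pointer = isinstance(refs, list) and any(
        isinstance(r, str) and r.strip() for r in refs
    )
    if not has_pointer:
        problems.append("source_refs needs at least one pointer")

    return problems
```

This does not detect verbatim passages; that check remains a review step.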

Primary reference PDFs for pointer extraction:

* The Chicago Manual of Style (18th ed).pdf
* Robert Bringhurst – The Elements of Typographic Style.pdf

## Output batching format

When you say: `EXTRACT <CATEGORY> [<SCOPE>]`

I will output a bundle that includes:

1. **Rules NDJSON** (150–250 rule records)
   * Path: `spec/rules/<category>/<batch_id>.ndjson`
   * One JSON object per line, validated against `spec/schema/rule.schema.json`.

2. **Index deltas** for that category
   * `spec/indexes/keywords_<category>.json`
   * `spec/indexes/source_refs_<category>.json`
   * `spec/indexes/coverage_delta_<category>.json`

3. **Coverage notes** report
   * Short Markdown report describing the enforcement split (lint/typeset/postrender/manual)
   * Any known gaps or manual-only areas
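
As a rough illustration of the Rules NDJSON part of the bundle, the sketch below parses one-object-per-line text and checks the 150 to 250 record target. The function names are hypothetical, and validation against `spec/schema/rule.schema.json` is deliberately left out because it would need a JSON Schema library.

```python
import json


def parse_ndjson(text: str) -> list[dict]:
    """Parse NDJSON text: one JSON object per non-blank line."""
    records = []
    # Split on "\n" only; str.splitlines() would also split on characters
    # such as U+2028 that may legally appear inside JSON strings.
    for lineno, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            continue
        obj = json.loads(line)
        if not isinstance(obj, dict):
            raise ValueError(f"line {lineno}: expected a JSON object")
        records.append(obj)
    return records


def batch_size_ok(records: list[dict], low: int = 150, high: int = 250) -> bool:
    """True when the batch holds the planned number of rule records."""
    return low <= len(records) <= high
```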

## Batch naming

`<batch_id>` format:

* `v1_<category>_<nnn>` (e.g. `v1_punctuation_001`)
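
A small sketch of building and parsing batch IDs in this format. It assumes `<nnn>` is a three-digit, zero-padded sequence number (as in the example) and that category names are lowercase letters, digits, and underscores; both are assumptions, and the helper names are hypothetical.

```python
import re

# Assumption: category is lowercase alphanumerics, optionally underscore-separated.
BATCH_ID_RE = re.compile(r"v1_(?P<category>[a-z0-9]+(?:_[a-z0-9]+)*)_(?P<seq>\d{3})")


def make_batch_id(category: str, seq: int) -> str:
    """Build an ID such as v1_punctuation_001."""
    return f"v1_{category}_{seq:03d}"


def parse_batch_id(batch_id: str) -> tuple[str, int]:
    """Split a batch ID into (category, sequence number)."""
    m = BATCH_ID_RE.fullmatch(batch_id)
    if m is None:
        raise ValueError(f"not a valid batch id: {batch_id!r}")
    return m.group("category"), int(m.group("seq"))


def batch_path(batch_id: str) -> str:
    """Path of the Rules NDJSON file for a batch."""
    category, _ = parse_batch_id(batch_id)
    return f"spec/rules/{category}/{batch_id}.ndjson"
```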

Batches are append-only:

* If a rule needs revision, mark the old rule `deprecated`, add the replacement under a new rule ID (or a new version segment), and keep both records.

## Pointer scheme details

Pointer strings live in `source_refs[]` and are **not** quotes.

Preferred pointer format:

* `CMOS18 §<section> p<book_page>`
* `BRING §<section> p<book_page>`
* Optional disambiguation: `(scan p<pdf_page_index>)`

Example pattern (not a quote):

* `CMOS18 §6.1 p377 (scan p10)`

Notes:

* `<book_page>` is the printed page number in the book when present (arabic or roman).
* `scan p<pdf_page_index>` is the PDF page index; add it when printed page numbers are ambiguous.
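
The preferred pointer format is regular enough to parse mechanically. This is a sketch under assumptions: section identifiers contain no spaces, roman page numbers use only the letters i, v, x, l, c, d, m, and the `Pointer` type and `parse_pointer` name are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class Pointer(NamedTuple):
    source: str                # "CMOS18" or "BRING"
    section: str               # e.g. "6.1"
    book_page: str             # printed page, arabic or roman, kept as text
    scan_page: Optional[int]   # PDF page index, if given


POINTER_RE = re.compile(
    r"^(?P<source>CMOS18|BRING) "
    r"§(?P<section>\S+) "
    r"p(?P<book_page>\d+|[ivxlcdmIVXLCDM]+)"
    r"(?: \(scan p(?P<scan_page>\d+)\))?$"
)


def parse_pointer(ref: str) -> Optional[Pointer]:
    """Return the parsed pointer, or None if the string is not in the preferred format."""
    m = POINTER_RE.match(ref.strip())
    if m is None:
        return None
    scan = m.group("scan_page")
    return Pointer(
        m.group("source"),
        m.group("section"),
        m.group("book_page"),
        int(scan) if scan is not None else None,
    )
```

For the example pattern above, `parse_pointer("CMOS18 §6.1 p377 (scan p10)")` yields source `CMOS18`, section `6.1`, book page `377`, and scan page `10`. A `None` result only flags a non-preferred format; it does not by itself mean the pointer is unusable.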

## Recommended extraction order (high-impact first)

1. numbers
2. punctuation
3. citations
4. headings
5. tables
6. figures
7. links
8. code
9. layout (widows/orphans, keeps, overflow)
10. front/back matter
11. accessibility
12. i18n

Rationale:

* Numbers/punctuation/citations most directly affect correctness and consistency.
* Layout rules benefit from having structure and tokens in place.

## Enforcement mapping guidelines (honest labeling)

* `lint`: detectable from AST or text normalization (spacing, punctuation patterns, citation fields).
* `typeset`: enforced via CSS/tokens/paged-media decisions.
* `postrender`: requires layout inspection after rendering.
* `manual`: cannot be reliably automated; must include `tags: ["manual_checklist=true"]` and be emitted into checklist outputs.

If a concept spans multiple enforcement layers:

* Prefer splitting into two rules:
  * one lint rule (source cleanliness)
  * one postrender rule (layout outcome)
* Use `dependencies` to link them.
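
The labeling and splitting guidance can be sketched as two checks plus an example pair. The `tags` and `dependencies` fields come from this plan; the `id` and `enforcement` field names, the example IDs, and the choice to make the postrender rule depend on the lint rule are assumptions for illustration only.

```python
ENFORCEMENT_LAYERS = {"lint", "typeset", "postrender", "manual"}
MANUAL_TAG = "manual_checklist=true"


def check_enforcement(rule: dict) -> list[str]:
    """Flag unknown layers and manual rules that lack the checklist tag."""
    problems = []
    layer = rule.get("enforcement")  # hypothetical field name
    if layer not in ENFORCEMENT_LAYERS:
        problems.append(f"unknown enforcement layer: {layer!r}")
    if layer == "manual" and MANUAL_TAG not in rule.get("tags", []):
        problems.append("manual rule is missing the manual_checklist=true tag")
    return problems


def unresolved_dependencies(rules: list[dict]) -> dict[str, list[str]]:
    """Map each rule ID to the dependency IDs that are not present in `rules`."""
    known = {r["id"] for r in rules}
    missing = {}
    for r in rules:
        gone = [d for d in r.get("dependencies", []) if d not in known]
        if gone:
            missing[r["id"]] = gone
    return missing


# Hypothetical split of one concept into a lint rule and a postrender rule.
split_pair = [
    {"id": "EXAMPLE.LINT.001", "enforcement": "lint",
     "tags": [], "dependencies": []},
    {"id": "EXAMPLE.POSTRENDER.001", "enforcement": "postrender",
     "tags": [], "dependencies": ["EXAMPLE.LINT.001"]},
]
```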

## “Degraded mode” considerations during extraction

For each category batch, include some rules that specifically target degraded inputs:

* hard-wrap repair suggestions
* heading inference warnings
* link sanitation and encoding fixes
* Unicode normalization notes

These rules should generally be `warn` or `should`, unless they prevent corruption (then `must`).
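
As one concrete case, a Unicode normalization rule could be backed by a check like the sketch below. It uses Python's standard `unicodedata` module (`is_normalized` requires Python 3.8+). Treating NFC as the target form, the shape of the returned finding, and the function name are all assumptions.

```python
import unicodedata
from typing import Optional


def unicode_normalization_finding(text: str, form: str = "NFC") -> Optional[dict]:
    """Return a warn-level finding if `text` is not in the given normalization form."""
    if unicodedata.is_normalized(form, text):
        return None
    return {
        "severity": "warn",  # a degraded-input note, not a corruption blocker
        "message": f"text is not {form}-normalized",
        "suggested": unicodedata.normalize(form, text),
    }
```

A string that spells "é" as "e" followed by a combining acute accent (U+0301) produces a finding whose `suggested` value uses the single precomposed character; already-normalized text returns `None`.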