
External Evaluation Prompt — iftypeset (pubstyle)

Goal: confirm there is no fundamental flaw (technical, legal, product) and identify obvious issues early.
Audience: humans or LLM reviewers.
Repo root: ai-workspace/iftypeset/

0) Context (read this first)

iftypeset is a thin, deterministic publishing runtime for Markdown → HTML → PDF that adds:

  • A machine-readable rule registry (rules are paraphrases only) with pointer refs back to primary sources (Chicago / Bringhurst) instead of reproducing book text (an illustrative record appears below).
  • Typeset profiles (screen-first vs print-first vs dense tech, etc.) that map typographic intent into render tokens/CSS.
  • Post-render QA gates that can fail builds when layout degrades (widows/orphans/keeps/overflow/link-wrap/numbering issues).

Key constraints:

  • Do not OCR/transcribe entire books into the repo (copyright). Rules must remain paraphrases with pointers only.
  • Source pointers must be sufficient for someone who has the book to find the guidance, without quoting it.
  • The runtime must be able to run in constrained environments (e.g. Forgejo PDF export workers) and produce deterministic artifacts.
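
For orientation, a rule record in spec/rules/**.ndjson should read roughly like the following (one JSON object per line in the actual files). Field names and the pointer are illustrative only; the authoritative shape is spec/schema/rule.schema.json. The point is that the rule text is a paraphrase and the source is a pointer, never a quotation:

{
  "id": "numbers.spell-out-small",
  "category": "numbers",
  "paraphrase": "Spell out small whole numbers in running prose; use numerals for larger values and for measurements.",
  "source_ref": "CMOS §9.2",
  "enforcement": "lint",
  "tags": ["numbers", "prose"]
}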

1) What to review (map of the repo)

Start here:

  • README.md
  • STATUS.md
  • app/ARCHITECTURE.md
  • app/CLI_SPEC.md
  • docs/01-demo-acceptance.md
  • docs/02-competitor-matrix.md
  • docs/03-rule-ingestion-sop.md
  • docs/04-renderer-strategy.md

Spec + rules:

  • spec/schema/rule.schema.json
  • spec/manifest.yaml
  • spec/profiles/*.yaml
  • spec/quality_gates.yaml
  • spec/rules/**.ndjson
  • spec/indexes/*.json (derived; rebuildable)

Forgejo integration note:

  • forgejo/README.md

2) Quick verification (local)

From ai-workspace/iftypeset/, run:

./scripts/ci.sh

Confirm it:

  • validates the spec
  • generates a coverage report
  • runs unit tests

If it fails, include the command output in your review.
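
One way to capture that output for attachment (assuming a POSIX shell; the exact command is up to you):

./scripts/ci.sh 2>&1 | tee ci-output.txt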

3) Required reviewer metadata (so we can trust the review)

If you are a human reviewer

  • reviewer_background: 1–2 lines (e.g., “publishing/typography”, “security/GRC”, “docs tooling”).
  • tools_used: list (e.g., Prince, Antenna House, Pandoc, Quarto, LaTeX, Typst, WeasyPrint, Paged.js, DocRaptor).
  • date_utc: ISO 8601.

If you are an LLM reviewer

  • llm_name: provider + model string
  • probable_model: if ambiguous
  • cutoff_date: YYYY-MM or unknown
  • response_date_utc: ISO 8601
  • web_access_used: yes|no
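
For example, an LLM reviewer’s metadata block might look like this (all values are placeholders):

{
  "llm_name": "provider/model-string",
  "probable_model": "unknown",
  "cutoff_date": "2025-06",
  "response_date_utc": "2026-01-05T12:00:00Z",
  "web_access_used": "no"
}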

4) Evaluation rubric (scorecard)

Score each category 0–5 and write 1–3 sentences of justification.

4.1 Product + positioning

  1. Problem clarity (0–5)
    Does this solve a real pain for teams shipping PDFs, beyond “another renderer”?

  2. Differentiation (0–5)
    Is the “rule registry + QA gates + deterministic artifacts” wedge clear and credible vs Pandoc/Quarto/Typst/LaTeX, Prince/Antenna House/WeasyPrint/Vivliostyle/Paged.js, DocRaptor, etc.?

  3. Viability (0–5)
    Is this buildable to a useful v0.1 in weeks (not months) with a small team?

4.1a Content + style (docs/readability)

  1. Docs clarity (0–5)
    Can a new contributor follow README.md and get a useful output quickly?

  2. Spec readability (0–5)
    Are spec/manifest.yaml, spec/profiles/*.yaml, and spec/quality_gates.yaml self-explanatory enough for a reviewer?

  3. Market-facing clarity (0–5)
    If this were shown to a buyer, does it read like a product with a clear contract, or a research project?

4.2 Technical architecture

  1. Spec design (0–5)
    Are rule.schema.json, manifest.yaml, and the profile/gate model coherent and extensible?

  2. Enforcement model (0–5)
    Is the split between lint / typeset / postrender / manual realistic? Are “manual checklist” rules handled honestly?

  3. Determinism strategy (0–5)
    Does the repo clearly define what “deterministic” means (inputs, renderer versions, fonts, outputs)?
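
When scoring item 3, note that “deterministic” is only checkable if the build pins its environment and records it in the artifacts. A sketch of the kind of provenance block a reviewer might look for (field names are hypothetical, not taken from the spec):

{
  "renderer": "<engine name>",
  "renderer_version": "<pinned version>",
  "fonts": [{"family": "<family>", "file": "<path in repo>", "sha256": "<hash>"}],
  "profile": "<profile id>",
  "input_sha256": "<hash>",
  "output_pdf_sha256": "<hash>"
}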

4.3 Rules + content quality

  1. Rule record quality (0–5)
    Do rule records look like paraphrases with pointers (not copied text)? Are IDs/tags/keywords useful?

  2. Coverage strategy (0–5)
    Are we prioritizing the right categories first (numbers/punctuation/citations/layout), and is coverage reporting useful?

4.4 UX / operational usability

  1. CLI ergonomics (0–5)
    Is the CLI spec clear for CI usage (exit codes, JSON artifacts, strictness flags)?

  2. Integration story (0–5)
    Is Forgejo integration plausible and incremental (CSS first, then QA gates)?

4.5 Market viability (compare to existing options)

Rate each 0–5 based on your experience (no need to be exhaustive; avoid vendor hype).

  1. Replace vs complement (0–5)
    Is iftypeset best positioned as a replacement for existing toolchains, or as a QA layer you plug into them?

  2. Who pays first (0–5)
    Does the repo make it clear who would adopt/pay first (docs teams, GRC, legal, research, vendors)?

  3. Defensible wedge (0–5)
    Is “publishing CI with hard QA gates + auditable rule registry” a defensible wedge, or easy for existing tools to add?

5) “Fundamental flaw” checklist (answer explicitly)

Mark each: PASS / RISK / FAIL, with a one-line explanation.

  1. Copyright / licensing risk
    Any sign the repo is storing book text rather than paraphrases + pointers?

  2. Determinism risk
    Are we likely to produce different PDFs across machines/runs due to fonts/renderer drift?

  3. QA gate feasibility
    Are the proposed post-render QA gates realistically implementable, or is this a research project?

  4. Scope creep risk
    Does the plan keep a narrow v0.1 “definition of done”, or is it trying to boil the ocean?

  5. Market reality
    Is there a clear “why buy/use this” vs adopting an existing doc toolchain and living with some ugliness?

5a) Section-by-section ratings (required)

Rate each 0–5 and include 1–2 lines of justification. The goal is to catch “obvious issues” early.

  • README.md: clarity + truthfulness (does it match current behavior?)
  • STATUS.md: accuracy + usefulness (is it a reliable snapshot?)
  • app/ARCHITECTURE.md: coherence + feasibility
  • app/CLI_SPEC.md: completeness + CI friendliness
  • docs/01-demo-acceptance.md: crisp v0.1 target or scope creep?
  • docs/02-competitor-matrix.md: honest + actionable (no wishful marketing)
  • docs/03-rule-ingestion-sop.md: safe + repeatable (avoids copyright drift)
  • docs/04-renderer-strategy.md: realistic adapter plan
  • spec/manifest.yaml: enforceable contracts + degraded mode clarity
  • spec/schema/rule.schema.json: schema quality (strict enough, not brittle)
  • spec/profiles/*.yaml: profiles feel sane, not arbitrary
  • spec/quality_gates.yaml: gates are measurable + meaningful
  • spec/rules/**.ndjson: rule quality (paraphrase + pointer discipline)

6) Deliverables quality (what “good” looks like)

Assess whether the repo is on track to produce, for a single Markdown input:

  • render.html + render.css (deterministic)
  • render.pdf (deterministic given pinned engine/fonts)
  • lint-report.json
  • layout-report.json
  • qa-report.json (pass/fail thresholds)
  • coverage-report.json (rule implementation progress)
  • manual-checklist.md (for rules that cannot be automated)

If you think any of these deliverables are unnecessary or missing, say so.
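
As a concrete reference point, a qa-report.json that delivers “pass/fail thresholds” might look roughly like this (structure is illustrative, not the actual schema):

{
  "profile": "print-first",
  "gates": [
    { "id": "widows", "limit": 0, "found": 2, "status": "fail" },
    { "id": "link-wrap", "limit": 0, "found": 0, "status": "pass" }
  ],
  "status": "fail"
}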

7) Patch suggestions (actionable)

Provide 5–15 suggestions in this format:

  • target: file path(s)
  • problem: 1 sentence
  • change: concrete text/code change (copy/pasteable)
  • why: 1 sentence
  • priority: P0 / P1 / P2
  • confidence: high / medium / low

Preferred patch format

If possible, include unified diffs:

--- a/path/file.md
+++ b/path/file.md
@@
 ...

8) Output template (copy/paste)

Use this structure in your response:

  1. Summary (5–10 bullets)
  2. Scorecard (0–5 each)
  3. Fundamental flaw checklist (PASS/RISK/FAIL)
  4. Top risks (P0/P1)
  5. Patch suggestions (with diffs if possible)
  6. Go / No-Go recommendation for v0.1

9) Important constraint for reviewers

Do not paste verbatim passages from Chicago/Bringhurst into your review output. Use pointers only (e.g., BRING §2.1.8 p32) and describe the issue in your own words.

10) Quick market question (optional, but useful)

If you had to ship “good-looking PDFs with hard QA gates” tomorrow, what would you use today, and why would you still choose iftypeset (or not)?