re-voice app proposal: “upload → shadow dossier”

Product goal

Let a user upload any document (PDF/DOCX/MD/HTML/images) and receive a shadow dossier rendered through a chosen style bible (e.g. if://bible/dave/v1.0).

Non-goals (v0)

Perfect fidelity layout extraction (we only need usable text + key figures)
Long-term storage/retention policies (we can stub, then harden)

Architecture (thin UI, strong pipeline)

1) Ingest

Upload endpoint: POST /api/dossiers (multipart)
Compute and persist:
- sha256 of original
- detected mime
- storage pointer (disk/S3/Forgejo blob)
Create Document row: {id, sha256, filename, mime, created_at, owner}

2) Extract → Canonicalize

Use a pluggable extractor chain:

PDF:
1. pdftotext (fast path, text-layer PDFs)
2. OCR fallback (pdftoppm → tesseract) for image-only PDFs
DOCX: pandoc or python-docx
HTML: readability-style boilerplate removal
Images: OCR (tesseract) with basic deskew

Output a canonical block model (enables better prompting + citations):

{
  "doc_id": "…",
  "blocks": [
    {"type":"heading","level":1,"text":"…"},
    {"type":"paragraph","text":"…"},
    {"type":"list","items":["…","…"]}
  ]
}

3) Style bible compiler

Store bibles in-repo as Markdown + a small metadata header (id, version, citation, hard rules).

Compile the bible into:

system_prompt (voice + forbidden/required constraints)
template (required dossier structure)
lint_rules (post-checks: emojis/paragraph, pronouns, required footer, etc.)

4) Generate

Two-step generation is safer and more controllable:

Content distillation (extract doc facts → structured notes)
Style application (render notes into dossier template under bible constraints)

Recommended runtime:

OpenAI-compatible Chat Completions backend (Juakali / OpenWebUI stack)
Persist {model, prompts, output_sha256} for auditability

5) Validate (style linter)

Run a deterministic linter per bible:

hard constraints (e.g., “emoji per paragraph” for Dave)
vocabulary swaps (optional)
required footer/disclaimer
“no secrets” scan (best-effort)

If lint fails: auto-repair pass (LLM) or return “needs revision” with lint report.

5b) Mermaid preflight (PDF export reliability)

If the output includes Mermaid diagrams, run a preflight pass before PDF export:

auto-heal Mermaid blocks (quote labels, normalize headers, balance subgraph/end)
validate Mermaid rendering in the same runtime used by the PDF exporter

In re-voice, this is exposed as:

revoice preflight --style <style> --input <output.md> --source <source-doc>

6) Export + publishing

Outputs:

Markdown (primary)
PDF via existing Forgejo PDF export (.../raw/...&format=pdf) by committing generated Markdown to a repo

Publishing strategy:

Store outputs in a Forgejo repo (per team/project)
Provide immutable links to {sha} + .sha256 sidecars

Security + operational considerations

Run extraction/OCR in a sandboxed worker (CPU/mem/time limits).
Never store API keys in repos; use env/secret manager.
Keep an audit trail: source hash → extracted text hash → output hash → model/prompt hashes.

3.2 KiB Raw Export PDF Blame History