# re-voice app proposal: “upload → shadow dossier”

## Product goal

Let a user upload **any document** (PDF/DOCX/MD/HTML/images) and receive a **shadow dossier** rendered through a chosen **style bible** (e.g. `if://bible/dave/v1.0`).

## Non-goals (v0)

- Perfect fidelity layout extraction (we only need usable text + key figures)
- Long-term storage/retention policies (we can stub, then harden)

## Architecture (thin UI, strong pipeline)

### 1) Ingest

- Upload endpoint: `POST /api/dossiers` (multipart)
- Compute and persist:
  - `sha256` of original
  - detected `mime`
  - storage pointer (disk/S3/Forgejo blob)
- Create `Document` row: `{id, sha256, filename, mime, created_at, owner}`

### 2) Extract → Canonicalize

Use a pluggable extractor chain:

- PDF:
  1. `pdftotext` (fast path, text-layer PDFs)
  2. OCR fallback (`pdftoppm` → `tesseract`) for image-only PDFs
- DOCX: `pandoc` or `python-docx`
- HTML: `readability`-style boilerplate removal
- Images: OCR (`tesseract`) with basic deskew

Output a canonical block model (enables better prompting + citations):

```json
{
  "doc_id": "…",
  "blocks": [
    {"type":"heading","level":1,"text":"…"},
    {"type":"paragraph","text":"…"},
    {"type":"list","items":["…","…"]}
  ]
}
```

### 3) Style bible compiler

Store bibles in-repo as Markdown + a small metadata header (id, version, citation, hard rules).

Compile the bible into:

- `system_prompt` (voice + forbidden/required constraints)
- `template` (required dossier structure)
- `lint_rules` (post-checks: emojis/paragraph, pronouns, required footer, etc.)

### 4) Generate

Two-step generation is safer and more controllable:

1. **Content distillation** (extract doc facts → structured notes)
2. **Style application** (render notes into dossier template under bible constraints)

Recommended runtime:

- OpenAI-compatible Chat Completions backend (Juakali / OpenWebUI stack)
- Persist `{model, prompts, output_sha256}` for auditability

### 5) Validate (style linter)

Run a deterministic linter per bible:

- hard constraints (e.g., “emoji per paragraph” for Dave)
- vocabulary swaps (optional)
- required footer/disclaimer
- “no secrets” scan (best-effort)

If lint fails: auto-repair pass (LLM) or return “needs revision” with lint report.

### 5b) Mermaid preflight (PDF export reliability)

If the output includes Mermaid diagrams, run a preflight pass before PDF export:

- auto-heal Mermaid blocks (quote labels, normalize headers, balance `subgraph/end`)
- validate Mermaid rendering in the same runtime used by the PDF exporter

In `re-voice`, this is exposed as: `revoice preflight --style