# re-voice app proposal: “upload → shadow dossier”

## Product goal

Let a user upload **any document** (PDF/DOCX/MD/HTML/images) and receive a **shadow dossier** rendered through a chosen **style bible** (e.g. `if://bible/dave/v1.0`).

## Non-goals (v0)

- Perfect fidelity layout extraction (we only need usable text + key figures)
- Long-term storage/retention policies (we can stub, then harden)

## Architecture (thin UI, strong pipeline)

### 1) Ingest

- Upload endpoint: `POST /api/dossiers` (multipart)
- Compute and persist:
  - `sha256` of original
  - detected `mime`
  - storage pointer (disk/S3/Forgejo blob)
- Create `Document` row: `{id, sha256, filename, mime, created_at, owner}`

### 2) Extract → Canonicalize

Use a pluggable extractor chain:

- PDF:
  1. `pdftotext` (fast path, text-layer PDFs)
  2. OCR fallback (`pdftoppm` → `tesseract`) for image-only PDFs
- DOCX: `pandoc` or `python-docx`
- HTML: `readability`-style boilerplate removal
- Images: OCR (`tesseract`) with basic deskew

Output a canonical block model (enables better prompting + citations):

```json
{
  "doc_id": "…",
  "blocks": [
    {"type": "heading", "level": 1, "text": "…"},
    {"type": "paragraph", "text": "…"},
    {"type": "list", "items": ["…", "…"]}
  ]
}
```

### 3) Style bible compiler

Store bibles in-repo as Markdown + a small metadata header (id, version, citation, hard rules).

Compile the bible into:

- `system_prompt` (voice + forbidden/required constraints)
- `template` (required dossier structure)
- `lint_rules` (post-checks: emojis/paragraph, pronouns, required footer, etc.)

### 4) Generate

Two-step generation is safer and more controllable:

1. **Content distillation** (extract doc facts → structured notes)
2. **Style application** (render notes into dossier template under bible constraints)

Recommended runtime:

- OpenAI-compatible Chat Completions backend (Juakali / OpenWebUI stack)
- Persist `{model, prompts, output_sha256}` for auditability

### 5) Validate (style linter)

Run a deterministic linter per bible:

- hard constraints (e.g., “emoji per paragraph” for Dave)
- vocabulary swaps (optional)
- required footer/disclaimer
- “no secrets” scan (best-effort)

If lint fails: auto-repair pass (LLM) or return “needs revision” with lint report.

### 6) Export + publishing

Outputs:

- Markdown (primary)
- PDF via existing Forgejo PDF export (`.../raw/...&format=pdf`) by committing generated Markdown to a repo

Publishing strategy:

- Store outputs in a Forgejo repo (per team/project)
- Provide immutable links to `{sha}` + `.sha256` sidecars

## Security + operational considerations

- Run extraction/OCR in a sandboxed worker (CPU/mem/time limits).
- Never store API keys in repos; use env/secret manager.
- Keep an audit trail: source hash → extracted text hash → output hash → model/prompt hashes.
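
The audit trail (source hash → extracted text hash → output hash → model/prompt hashes) could be captured as one small record per generation run. The following is a minimal sketch in Python, not a settled design: `AuditRecord`, `build_audit_record`, `record_id`, and all field names are hypothetical and chosen only for illustration. It uses only the standard library.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


def sha256_hex(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class AuditRecord:
    # Hypothetical field names; the proposal only lists which hashes to keep.
    source_sha256: str
    extracted_sha256: str
    output_sha256: str
    model: str
    prompt_sha256: str


def build_audit_record(source: bytes, extracted_text: str, output_md: str,
                       model: str, prompts: list[str]) -> AuditRecord:
    """Hash each pipeline stage so a dossier can be traced back to its inputs."""
    # Join prompts with a control-character separator so that
    # ["ab", "c"] and ["a", "bc"] produce different hashes.
    prompt_blob = "\x1f".join(prompts).encode("utf-8")
    return AuditRecord(
        source_sha256=sha256_hex(source),
        extracted_sha256=sha256_hex(extracted_text.encode("utf-8")),
        output_sha256=sha256_hex(output_md.encode("utf-8")),
        model=model,
        prompt_sha256=sha256_hex(prompt_blob),
    )


def record_id(record: AuditRecord) -> str:
    """Derive a stable ID from the record (sorted keys give deterministic JSON)."""
    payload = json.dumps(asdict(record), sort_keys=True).encode("utf-8")
    return sha256_hex(payload)
```

Storing hashes rather than raw prompts keeps the record small and avoids copying sensitive text into the audit log. The trade-off is that the prompts themselves must be kept elsewhere if a run ever needs to be replayed.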