re-voice/docs/APP_SPEC.md

92 lines
2.8 KiB
Markdown

# re-voice app proposal: “upload → shadow dossier”
## Product goal
Let a user upload **any document** (PDF/DOCX/MD/HTML/images) and receive a **shadow dossier** rendered through a chosen **style bible** (e.g. `if://bible/dave/v1.0`).
## Non-goals (v0)
- Perfect fidelity layout extraction (we only need usable text + key figures)
- Long-term storage/retention policies (we can stub, then harden)
## Architecture (thin UI, strong pipeline)
### 1) Ingest
- Upload endpoint: `POST /api/dossiers` (multipart)
- Compute and persist:
- `sha256` of original
- detected `mime`
- storage pointer (disk/S3/Forgejo blob)
- Create `Document` row: `{id, sha256, filename, mime, created_at, owner}`
### 2) Extract → Canonicalize
Use a pluggable extractor chain:
- PDF:
1. `pdftotext` (fast path, text-layer PDFs)
2. OCR fallback (`pdftoppm``tesseract`) for image-only PDFs
- DOCX: `pandoc` or `python-docx`
- HTML: `readability`-style boilerplate removal
- Images: OCR (`tesseract`) with basic deskew
Output a canonical block model (enables better prompting + citations):
```json
{
"doc_id": "…",
"blocks": [
{"type":"heading","level":1,"text":"…"},
{"type":"paragraph","text":"…"},
{"type":"list","items":["…","…"]}
]
}
```
### 3) Style bible compiler
Store bibles in-repo as Markdown + a small metadata header (id, version, citation, hard rules).
Compile the bible into:
- `system_prompt` (voice + forbidden/required constraints)
- `template` (required dossier structure)
- `lint_rules` (post-checks: emojis/paragraph, pronouns, required footer, etc.)
### 4) Generate
Two-step generation is safer and more controllable:
1. **Content distillation** (extract doc facts → structured notes)
2. **Style application** (render notes into dossier template under bible constraints)
Recommended runtime:
- OpenAI-compatible Chat Completions backend (Juakali / OpenWebUI stack)
- Persist `{model, prompts, output_sha256}` for auditability
### 5) Validate (style linter)
Run a deterministic linter per bible:
- hard constraints (e.g., “emoji per paragraph” for Dave)
- vocabulary swaps (optional)
- required footer/disclaimer
- “no secrets” scan (best-effort)
If lint fails: auto-repair pass (LLM) or return “needs revision” with lint report.
### 6) Export + publishing
Outputs:
- Markdown (primary)
- PDF via existing Forgejo PDF export (`.../raw/...&format=pdf`) by committing generated Markdown to a repo
Publishing strategy:
- Store outputs in a Forgejo repo (per team/project)
- Provide immutable links to `{sha}` + `.sha256` sidecars
## Security + operational considerations
- Run extraction/OCR in a sandboxed worker (CPU/mem/time limits).
- Never store API keys in repos; use env/secret manager.
- Keep an audit trail: source hash → extracted text hash → output hash → model/prompt hashes.