92 lines
2.8 KiB
Markdown
92 lines
2.8 KiB
Markdown
# re-voice app proposal: “upload → shadow dossier”
|
|
|
|
## Product goal
|
|
|
|
Let a user upload **any document** (PDF/DOCX/MD/HTML/images) and receive a **shadow dossier** rendered through a chosen **style bible** (e.g. `if://bible/dave/v1.0`).
|
|
|
|
## Non-goals (v0)
|
|
|
|
- Perfect fidelity layout extraction (we only need usable text + key figures)
|
|
- Long-term storage/retention policies (we can stub, then harden)
|
|
|
|
## Architecture (thin UI, strong pipeline)
|
|
|
|
### 1) Ingest
|
|
|
|
- Upload endpoint: `POST /api/dossiers` (multipart)
|
|
- Compute and persist:
|
|
- `sha256` of original
|
|
- detected `mime`
|
|
- storage pointer (disk/S3/Forgejo blob)
|
|
- Create `Document` row: `{id, sha256, filename, mime, created_at, owner}`
|
|
|
|
### 2) Extract → Canonicalize
|
|
|
|
Use a pluggable extractor chain:
|
|
|
|
- PDF:
|
|
1. `pdftotext` (fast path, text-layer PDFs)
|
|
2. OCR fallback (`pdftoppm` → `tesseract`) for image-only PDFs
|
|
- DOCX: `pandoc` or `python-docx`
|
|
- HTML: `readability`-style boilerplate removal
|
|
- Images: OCR (`tesseract`) with basic deskew
|
|
|
|
Output a canonical block model (enables better prompting + citations):
|
|
|
|
```json
|
|
{
|
|
"doc_id": "…",
|
|
"blocks": [
|
|
{"type":"heading","level":1,"text":"…"},
|
|
{"type":"paragraph","text":"…"},
|
|
{"type":"list","items":["…","…"]}
|
|
]
|
|
}
|
|
```
|
|
|
|
### 3) Style bible compiler
|
|
|
|
Store bibles in-repo as Markdown + a small metadata header (id, version, citation, hard rules).
|
|
|
|
Compile the bible into:
|
|
- `system_prompt` (voice + forbidden/required constraints)
|
|
- `template` (required dossier structure)
|
|
- `lint_rules` (post-checks: emojis/paragraph, pronouns, required footer, etc.)
|
|
|
|
### 4) Generate
|
|
|
|
Two-step generation is safer and more controllable:
|
|
|
|
1. **Content distillation** (extract doc facts → structured notes)
|
|
2. **Style application** (render notes into dossier template under bible constraints)
|
|
|
|
Recommended runtime:
|
|
- OpenAI-compatible Chat Completions backend (Juakali / OpenWebUI stack)
|
|
- Persist `{model, prompts, output_sha256}` for auditability
|
|
|
|
### 5) Validate (style linter)
|
|
|
|
Run a deterministic linter per bible:
|
|
- hard constraints (e.g., “emoji per paragraph” for Dave)
|
|
- vocabulary swaps (optional)
|
|
- required footer/disclaimer
|
|
- “no secrets” scan (best-effort)
|
|
|
|
If lint fails: auto-repair pass (LLM) or return “needs revision” with lint report.
|
|
|
|
### 6) Export + publishing
|
|
|
|
Outputs:
|
|
- Markdown (primary)
|
|
- PDF via existing Forgejo PDF export (`.../raw/...&format=pdf`) by committing generated Markdown to a repo
|
|
|
|
Publishing strategy:
|
|
- Store outputs in a Forgejo repo (per team/project)
|
|
- Provide immutable links to `{sha}` + `.sha256` sidecars
|
|
|
|
## Security + operational considerations
|
|
|
|
- Run extraction/OCR in a sandboxed worker (CPU/mem/time limits).
|
|
- Never store API keys in repos; use env/secret manager.
|
|
- Keep an audit trail: source hash → extracted text hash → output hash → model/prompt hashes.
|
|
|