re-voice/docs/APP_SPEC.md

# re-voice app proposal: “upload → shadow dossier”

## Product goal

Let a user upload **any document** (PDF/DOCX/MD/HTML/images) and receive a **shadow dossier** rendered through a chosen **style bible** (e.g. `if://bible/dave/v1.0`).

## Non-goals (v0)

- Perfect fidelity layout extraction (we only need usable text + key figures)
- Long-term storage/retention policies (we can stub, then harden)

## Architecture (thin UI, strong pipeline)

### 1) Ingest

- Upload endpoint: `POST /api/dossiers` (multipart)
- Compute and persist:
  - `sha256` of original
  - detected `mime`
  - storage pointer (disk/S3/Forgejo blob)
- Create `Document` row: `{id, sha256, filename, mime, created_at, owner}`

### 2) Extract → Canonicalize

Use a pluggable extractor chain:

- PDF:
  1. `pdftotext` (fast path, text-layer PDFs)
  2. OCR fallback (`pdftoppm` → `tesseract`) for image-only PDFs
- DOCX: `pandoc` or `python-docx`
- HTML: `readability`-style boilerplate removal
- Images: OCR (`tesseract`) with basic deskew

Output a canonical block model (enables better prompting + citations):

```json
{
  "doc_id": "…",
  "blocks": [
    {"type":"heading","level":1,"text":"…"},
    {"type":"paragraph","text":"…"},
    {"type":"list","items":["…","…"]}
  ]
}
```

### 3) Style bible compiler

Store bibles in-repo as Markdown + a small metadata header (id, version, citation, hard rules).

Compile the bible into:
- `system_prompt` (voice + forbidden/required constraints)
- `template` (required dossier structure)
- `lint_rules` (post-checks: emojis/paragraph, pronouns, required footer, etc.)

### 4) Generate

Two-step generation is safer and more controllable:

1. **Content distillation** (extract doc facts → structured notes)
2. **Style application** (render notes into dossier template under bible constraints)

Recommended runtime:
- OpenAI-compatible Chat Completions backend (Juakali / OpenWebUI stack)
- Persist `{model, prompts, output_sha256}` for auditability

### 5) Validate (style linter)

Run a deterministic linter per bible:
- hard constraints (e.g., “emoji per paragraph” for Dave)
- vocabulary swaps (optional)
- required footer/disclaimer
- “no secrets” scan (best-effort)

If lint fails: auto-repair pass (LLM) or return “needs revision” with lint report.

### 6) Export + publishing

Outputs:
- Markdown (primary)
- PDF via existing Forgejo PDF export (`.../raw/...&format=pdf`) by committing generated Markdown to a repo

Publishing strategy:
- Store outputs in a Forgejo repo (per team/project)
- Provide immutable links to `{sha}` + `.sha256` sidecars

## Security + operational considerations

- Run extraction/OCR in a sandboxed worker (CPU/mem/time limits).
- Never store API keys in repos; use env/secret manager.
- Keep an audit trail: source hash → extracted text hash → output hash → model/prompt hashes.