Add model economics (Dave-proof) section
parent a7ec61725b
commit fc2b9032a2
1 changed files with 46 additions and 0 deletions
@@ -91,6 +91,52 @@ If yes: fund this as a research program in citizen‑auditable operational gover
If no: reject it, but name the boundary explicitly.

### Operational finding: Model economics (Dave-proof run)

A recurring objection (“Dave”) is that mental‑health‑adjacent UX requires the largest, most expensive models, and that governance is therefore secondary to model scale. InfraFabric treats that as a falsifiable claim: if the governance stack can enforce output invariants across model tiers, then model choice becomes an economic routing decision (default to a small model, escalate when TRIAGE demands).
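The routing decision described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not InfraFabric's implementation: the `TriageLevel` scale, the `escalate_at` threshold, and the choice of which model is the "small" default and which is the escalation target are all hypothetical. Only the model names come from the reference run.

```python
from enum import IntEnum


class TriageLevel(IntEnum):
    # Hypothetical severity scale; the dossier does not define TRIAGE levels.
    ROUTINE = 0
    ELEVATED = 1
    CRITICAL = 2


# Assumption: "mini" is the cheapest tier and "max" the most expensive one.
DEFAULT_MODEL = "gpt-5.1-codex-mini"
ESCALATION_MODEL = "gpt-5.1-codex-max"


def route_model(triage: TriageLevel,
                escalate_at: TriageLevel = TriageLevel.ELEVATED) -> str:
    """Default to the small model; escalate only when triage demands it."""
    if triage >= escalate_at:
        return ESCALATION_MODEL
    return DEFAULT_MODEL


if __name__ == "__main__":
    for level in TriageLevel:
        print(level.name, "->", route_model(level))
```

The point of the sketch is that routing stays a small, inspectable policy function: the invariants are enforced by the governance stack regardless of which branch is taken.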
This dossier includes one public, reproducible proof run:

**What we tested (public, reproducible):**

- 5 prompts × 3 Codex models = 15 traced runs:
  - `gpt-5.2-codex`
  - `gpt-5.1-codex-mini`
  - `gpt-5.1-codex-max`
- Two “polish invariants” checked on the *final, user-visible* response:
  - language matches the question (`spanish? = N`)
  - no bullet lists (`bullets? = N`)
- Every run exports an evidence bundle, a SHA256 digest, and an offline verifier.
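To make the two polish invariants concrete, here is a rough sketch of how such checks could be expressed. It is an assumption-laden illustration, not the checker used in the run: `has_bullets`, `looks_spanish`, `check_invariants`, and the stopword list are hypothetical, the Spanish test is a crude heuristic rather than real language detection, and it assumes `spanish? = N` means "no Spanish detected in the final response".

```python
import re

# A line counts as a bullet if it starts with a list marker followed by text.
BULLET_LINE = re.compile(r"^\s*[-*+•]\s+\S", re.MULTILINE)

# Tiny, hypothetical stopword list; a real check would use a language detector.
SPANISH_HINTS = {"el", "la", "los", "las", "que", "para", "pero",
                 "porque", "está", "usted", "cómo"}


def has_bullets(text: str) -> bool:
    """Return True if any line of the text looks like a bullet-list item."""
    return BULLET_LINE.search(text) is not None


def looks_spanish(text: str, min_hits: int = 2) -> bool:
    """Crude heuristic: count common Spanish function words."""
    words = re.findall(r"[a-záéíóúñü]+", text.lower())
    return sum(word in SPANISH_HINTS for word in words) >= min_hits


def check_invariants(final_response: str) -> dict[str, str]:
    """Report both invariants in the Y/N style used by the run summary."""
    return {
        "spanish?": "Y" if looks_spanish(final_response) else "N",
        "bullets?": "Y" if has_bullets(final_response) else "N",
    }
```

Both checks run on the final, user-visible text only, which matches the way the invariants are stated above.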
**Result (reference run `20251222T164352Z`):**

- 15/15 responses returned HTTP 200 and produced verifiable evidence bundles.
- 0/15 language leaks in the final output.
- 0/15 bullet‑list violations in the final output.
- 6/15 traces contain a `postprocess_applied` event with before/after SHA256 hashes, showing that a deterministic correction was applied where needed (the correction itself is audited).
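An audited correction of this kind might look like the sketch below. Only the event name `postprocess_applied` and the idea of before/after SHA256 hashes come from the run description. The field names `before_sha256` and `after_sha256`, the `flatten_bullets` rule, and the in-memory `trace` list are assumptions made for illustration.

```python
import hashlib
import re

BULLET_PREFIX = re.compile(r"^\s*[-*+•]\s+")


def sha256_hex(text: str) -> str:
    """Hash the UTF-8 encoding of a response."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def flatten_bullets(text: str) -> str:
    """Deterministically strip bullet markers and join the lines into prose."""
    lines = [BULLET_PREFIX.sub("", line).strip() for line in text.splitlines()]
    return " ".join(line for line in lines if line)


def postprocess(draft: str, trace: list[dict]) -> str:
    """Correct the draft if needed and record an auditable event when it changes."""
    has_bullets = any(BULLET_PREFIX.match(line) for line in draft.splitlines())
    final = flatten_bullets(draft) if has_bullets else draft
    if final != draft:
        trace.append({
            "event": "postprocess_applied",
            "before_sha256": sha256_hex(draft),  # hypothetical field name
            "after_sha256": sha256_hex(final),   # hypothetical field name
        })
    return final
```

Because the correction is a pure function of the draft, anyone holding the draft can recompute both hashes and confirm that the recorded event matches what was actually changed.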
**How to verify (no insider access):**

- Bundle: https://infrafabric.io/static/hosted/emo_dave_proof_bundle_20251222T164352Z.tar.gz
- Instructions: https://infrafabric.io/static/hosted/EMO_DAVE_PROOF_MODEL_COMPARE_20251222T164352Z.md
- Trace IDs (15):
  - `f3a2cc9d-4028-4e1c-a37a-dd1dc107e135`
  - `209ec997-86d8-47c6-b92a-d2cce7de4cec`
  - `b6ff776c-2c00-41d3-a1d7-20ff9ec2d7ee`
  - `96c406a0-be8f-4d6b-87ae-245aad3f700c`
  - `03cce206-d9c2-4883-838c-c871437185c6`
  - `2a48a19f-2593-4110-9bfb-e47fa24fe462`
  - `39551088-70c2-4977-9c75-50307ce92933`
  - `9c3c7002-1e8f-430f-b50e-f5633b8963db`
  - `50b756df-74dc-450c-bd46-524d799ee8db`
  - `7ed26f37-fac7-4a5e-a352-4815411f4933`
  - `9841f968-0b30-4916-985e-c5c82968a09c`
  - `e907e2f5-a1e0-4050-963b-f244f3aacffb`
  - `f2a1a46d-d242-4838-8122-d0eb8f87dfd6`
  - `2fd841ed-0ccb-4ab6-ae39-4586c38a5ec5`
  - `e2e528a9-2983-499a-af51-a23ed05149ed`
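As a first integrity step before following the linked instructions, a reader could hash the downloaded bundle and compare it with the published digest. The sketch below shows only that generic step. It does not reproduce the bundle's own offline verifier, whose interface is not described in this section, and `sha256_file` and `bundle_matches` are hypothetical helper names. The expected digest must be taken from the published instructions.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large bundles need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def bundle_matches(path: Path, expected_hex: str) -> bool:
    """Compare the computed digest with the published one."""
    return hmac.compare_digest(sha256_file(path), expected_hex.strip().lower())
```

A caller would pass the local path of the downloaded `.tar.gz` and the digest string copied from the instructions. A `False` result means the file differs from what was published and should not be trusted.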
**What this does and does not prove:**

- It proves the *stack* can enforce specific invariants (language + formatting) across these model tiers for these prompts, with auditable corrections when needed.
- It does not prove the models are equivalent on clinical judgment, crisis handling, or long‑horizon reasoning. Those require separate validation and are intentionally not claimed here.

---

<a id="bridge-submission-pack"></a>