Add model economics (Dave-proof) section

root 2025-12-22 20:09:35 +00:00
parent a7ec61725b
commit fc2b9032a2

@@ -91,6 +91,52 @@ If yes: fund this as a research program in citizen-auditable operational governance
If no: reject it, but name the boundary explicitly.
### Operational finding: Model economics (Dave-proof run)
A recurring objection (“Dave”) is that mental-health-adjacent UX requires the largest, most expensive models, and that governance is therefore secondary to model scale. InfraFabric treats that as a falsifiable claim: if the governance stack can enforce output invariants across model tiers, then model choice becomes an economic routing decision (default to the small model, escalate when TRIAGE demands).
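As a minimal sketch of that routing decision (only the model names come from this run; the tier ordering and the `triage_level` signal are assumptions for illustration):

```python
# Hypothetical sketch of "default small, escalate when TRIAGE demands".
# The capability ranking below is an assumption, not a published ordering.
MODEL_TIERS = [
    ("gpt-5.1-codex-mini", 0),  # default: cheapest tier
    ("gpt-5.2-codex", 1),
    ("gpt-5.1-codex-max", 2),   # escalation: most capable tier
]

def route_model(triage_level: int) -> str:
    """Pick the cheapest tier whose capability covers the TRIAGE level."""
    for model, capability in MODEL_TIERS:
        if capability >= triage_level:
            return model
    return MODEL_TIERS[-1][0]  # nothing qualifies: fall back to the top tier
```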
This dossier includes one such proof run.
**What we tested (public, reproducible):**
- 5 prompts × 3 Codex models = 15 traced runs:
- `gpt-5.2-codex`
- `gpt-5.1-codex-mini`
- `gpt-5.1-codex-max`
- Two “polish invariants” checked on the *final, user-visible* response (see the sketch after this list):
- language matches the question (`spanish? = N`)
- no bullet lists (`bullets? = N`)
- Every run exports an evidence bundle + SHA256 + offline verifier.
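Both invariants are simple predicates over the final text. A minimal sketch, assuming crude regex heuristics (the dossier does not publish the stack's actual detectors, so every pattern here is illustrative):

```python
import re

def check_invariants(final_text: str, question_lang: str) -> dict:
    """Evaluate the two polish invariants on the final, user-visible response."""
    # Language-leak heuristic: common Spanish function words in a non-Spanish answer.
    spanish_markers = re.compile(r"\b(el|la|los|las|que|para|pero|gracias)\b", re.I)
    spanish_leak = question_lang != "es" and bool(spanish_markers.search(final_text))
    # Bullet heuristic: any line starting with -, *, or the Unicode bullet.
    bullets = any(re.match(r"\s*[-*\u2022]\s+", line)
                  for line in final_text.splitlines())
    return {"spanish?": "Y" if spanish_leak else "N",
            "bullets?": "Y" if bullets else "N"}
```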
**Result (reference run `20251222T164352Z`):**
- 15/15 responses returned HTTP 200 and produced verifiable evidence bundles.
- 0/15 language leaks in the final output.
- 0/15 bullet-list violations in the final output.
- 6/15 traces contain a `postprocess_applied` event with before/after SHA256, showing deterministic correction when needed; the correction itself is audited (a sketch of such an event follows this list).
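A minimal sketch of recording such an audited correction (the event name and the before/after hashes come from the traces; every other field name is an assumption):

```python
import hashlib
import time

def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def apply_postprocess(trace: list, original: str, corrected: str) -> str:
    """Append a postprocess_applied event whenever the corrector changed the text."""
    if corrected != original:
        trace.append({
            "event": "postprocess_applied",   # event name as seen in the traces
            "sha256_before": sha256_hex(original),
            "sha256_after": sha256_hex(corrected),
            "ts": time.time(),                # illustrative timestamp field
        })
    return corrected
```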
**How to verify (no insider access):**
- Bundle: https://infrafabric.io/static/hosted/emo_dave_proof_bundle_20251222T164352Z.tar.gz
- Instructions: https://infrafabric.io/static/hosted/EMO_DAVE_PROOF_MODEL_COMPARE_20251222T164352Z.md (a minimal offline hash check is sketched after the trace IDs)
- Trace IDs (15):
- `f3a2cc9d-4028-4e1c-a37a-dd1dc107e135`
- `209ec997-86d8-47c6-b92a-d2cce7de4cec`
- `b6ff776c-2c00-41d3-a1d7-20ff9ec2d7ee`
- `96c406a0-be8f-4d6b-87ae-245aad3f700c`
- `03cce206-d9c2-4883-838c-c871437185c6`
- `2a48a19f-2593-4110-9bfb-e47fa24fe462`
- `39551088-70c2-4977-9c75-50307ce92933`
- `9c3c7002-1e8f-430f-b50e-f5633b8963db`
- `50b756df-74dc-450c-bd46-524d799ee8db`
- `7ed26f37-fac7-4a5e-a352-4815411f4933`
- `9841f968-0b30-4916-985e-c5c82968a09c`
- `e907e2f5-a1e0-4050-963b-f244f3aacffb`
- `f2a1a46d-d242-4838-8122-d0eb8f87dfd6`
- `2fd841ed-0ccb-4ab6-ae39-4586c38a5ec5`
- `e2e528a9-2983-499a-af51-a23ed05149ed`
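Offline verification reduces to hashing the downloaded bundle and comparing against the digest published in the instructions file. A minimal sketch (the expected digest is whatever that file lists; this is not the project's own verifier):

```python
import hashlib

def verify_bundle(tar_path: str, expected_sha256: str) -> bool:
    """Hash the downloaded tarball and compare to the published digest."""
    h = hashlib.sha256()
    with open(tar_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # stream in 1 MiB chunks
            h.update(chunk)
    actual = h.hexdigest()
    print("OK" if actual == expected_sha256 else "MISMATCH", actual)
    return actual == expected_sha256

# verify_bundle("emo_dave_proof_bundle_20251222T164352Z.tar.gz",
#               "<digest from the instructions .md>")
```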
**What this does and does not prove:**
- It proves the *stack* can enforce specific invariants (language + formatting) across these model tiers for these prompts, with auditable corrections when needed.
- It does not prove the models are equivalent on clinical judgment, crisis handling, or long-horizon reasoning. Those require separate validation and are intentionally not claimed here.
---
<a id="bridge-submission-pack"></a>