Add model economics (Dave-proof) section

2025-12-22 20:09:35 +00:00 · fc2b9032a2
commit fc2b9032a2
parent a7ec61725b
1 changed file with 46 additions and 0 deletions
--- a/DANNY_STOCKER_INFRAFABRIC_DOSSIER.md
+++ b/DANNY_STOCKER_INFRAFABRIC_DOSSIER.md
@@ -91,6 +91,52 @@ If yes: fund this as a research program in citizen‑auditable operational gover

 If no: reject it, but name the boundary explicitly.

+### Operational finding: Model economics (Dave-proof run)
+
+A recurring objection (“Dave”) holds that mental‑health‑adjacent UX requires the largest, most expensive models, and that governance is therefore secondary to model scale. InfraFabric treats that as a falsifiable claim: if the governance stack can enforce output invariants across model tiers, then model choice becomes an economic routing decision (default to a small model, escalate when TRIAGE demands it).
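
As an illustration of what "an economic routing decision" could look like, here is a minimal sketch. It is not InfraFabric's router: the triage level names are hypothetical, and treating `gpt-5.1-codex-mini` as the default tier and `gpt-5.1-codex-max` as the escalation tier is an assumption made only for this example.

```python
# Hypothetical routing sketch; level names and tier assignment are assumptions.
SMALL_MODEL = "gpt-5.1-codex-mini"   # assumed default (cheapest) tier
LARGE_MODEL = "gpt-5.1-codex-max"    # assumed escalation tier

# Invented triage levels that would justify paying for the larger model.
ESCALATING_LEVELS = {"elevated", "crisis"}


def choose_model(triage_level: str) -> str:
    """Default to the small model; escalate only when triage demands it."""
    if triage_level in ESCALATING_LEVELS:
        return LARGE_MODEL
    return SMALL_MODEL


if __name__ == "__main__":
    for level in ("routine", "elevated"):
        print(level, "->", choose_model(level))
```

The point of the sketch is that the routing rule stays small and auditable as long as the output invariants are enforced downstream of whichever model is chosen.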
+
+This dossier includes one public, reproducible proof run.
+
+**What we tested (public, reproducible):**
+- 5 prompts × 3 Codex models = 15 traced runs:
+  - `gpt-5.2-codex`
+  - `gpt-5.1-codex-mini`
+  - `gpt-5.1-codex-max`
+- Two “polish invariants” checked on the *final, user-visible* response:
+  - language matches the question (`spanish? = N`, i.e. no Spanish leaked into the response)
+  - no bullet lists (`bullets? = N`, i.e. none present)
+- Every run exports an evidence bundle, its SHA256 digest, and an offline verifier.
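
To make the bullet-list invariant concrete, here is a minimal sketch of one way such a check could be written. The definition of a bullet line used here (a line that starts with `-`, `*`, `+`, or `•` followed by whitespace) is an assumption; the checker used in the actual run is documented in the published instructions, not here.

```python
import re

# Assumed definition: optional indentation, a bullet marker, then whitespace
# and some content. Hyphens inside a sentence or in "-5" do not match.
BULLET_LINE = re.compile(r"^[ \t]*[-*+\u2022][ \t]+\S", re.MULTILINE)


def has_bullet_list(text: str) -> bool:
    """Return True if any line of the response looks like a bullet item."""
    return BULLET_LINE.search(text) is not None


if __name__ == "__main__":
    print(has_bullet_list("Take a slow breath.\nThen try again."))
    print(has_bullet_list("Options:\n- breathe\n- rest"))
```

Under this definition a response passes the invariant (`bullets? = N`) exactly when `has_bullet_list` returns `False`.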
+
+**Result (reference run `20251222T164352Z`):**
+- 15/15 responses returned HTTP 200 and produced verifiable evidence bundles.
+- 0/15 language leaks in the final output.
+- 0/15 bullet‑list violations in the final output.
+- 6/15 traces contain a `postprocess_applied` event that records before/after SHA256 digests, showing that a deterministic correction was applied where needed and that the correction itself is audited.
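
The audited-correction pattern can be sketched as follows. Only the event name `postprocess_applied` comes from the run above; the field names `before_sha256` and `after_sha256`, and the bullet-stripping rule itself, are hypothetical stand-ins for whatever the stack actually records and applies.

```python
import hashlib
import re
from typing import Optional, Tuple

# Assumed correction: strip leading bullet markers so items become plain lines.
BULLET_PREFIX = re.compile(r"^[ \t]*[-*+\u2022][ \t]+", re.MULTILINE)


def sha256_text(text: str) -> str:
    """Hex SHA256 digest of the UTF-8 encoding of the text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def postprocess(text: str) -> Tuple[str, Optional[dict]]:
    """Apply the correction; return the result and an audit event if it changed."""
    corrected = BULLET_PREFIX.sub("", text)
    if corrected == text:
        return text, None  # nothing to correct, so no event is emitted
    event = {
        "event": "postprocess_applied",
        "before_sha256": sha256_text(text),       # hypothetical field name
        "after_sha256": sha256_text(corrected),   # hypothetical field name
    }
    return corrected, event


if __name__ == "__main__":
    print(postprocess("- breathe\n- rest"))
```

Because the correction is a pure function of its input, re-running it on the same text yields the same pair of digests, which is what lets a third party check the recorded event offline.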
+
+**How to verify (no insider access):**
+- Bundle: https://infrafabric.io/static/hosted/emo_dave_proof_bundle_20251222T164352Z.tar.gz
+- Instructions: https://infrafabric.io/static/hosted/EMO_DAVE_PROOF_MODEL_COMPARE_20251222T164352Z.md
+- Trace IDs (15):
+  - `f3a2cc9d-4028-4e1c-a37a-dd1dc107e135`
+  - `209ec997-86d8-47c6-b92a-d2cce7de4cec`
+  - `b6ff776c-2c00-41d3-a1d7-20ff9ec2d7ee`
+  - `96c406a0-be8f-4d6b-87ae-245aad3f700c`
+  - `03cce206-d9c2-4883-838c-c871437185c6`
+  - `2a48a19f-2593-4110-9bfb-e47fa24fe462`
+  - `39551088-70c2-4977-9c75-50307ce92933`
+  - `9c3c7002-1e8f-430f-b50e-f5633b8963db`
+  - `50b756df-74dc-450c-bd46-524d799ee8db`
+  - `7ed26f37-fac7-4a5e-a352-4815411f4933`
+  - `9841f968-0b30-4916-985e-c5c82968a09c`
+  - `e907e2f5-a1e0-4050-963b-f244f3aacffb`
+  - `f2a1a46d-d242-4838-8122-d0eb8f87dfd6`
+  - `2fd841ed-0ccb-4ab6-ae39-4586c38a5ec5`
+  - `e2e528a9-2983-499a-af51-a23ed05149ed`
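
For readers who want to check the bundle digest by hand, here is a minimal sketch that computes the SHA256 of a downloaded file. It assumes the bundle has already been saved locally; the local filename in the usage line is taken from the bundle URL, and the expected digest must come from the published instructions, which remain the authoritative procedure.

```python
import hashlib


def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file from disk and return its hex SHA256 digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large bundles do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    import sys

    # Usage: python sha256_check.py emo_dave_proof_bundle_20251222T164352Z.tar.gz
    print(sha256_file(sys.argv[1]))
```

Compare the printed value against the digest published with the bundle; any mismatch means the file differs from the one that was exported.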
+
+**What this does and does not prove:**
+- It proves the *stack* can enforce specific invariants (language + formatting) across these model tiers for these prompts, with auditable corrections when needed.
+- It does not prove the models are equivalent on clinical judgment, crisis handling, or long‑horizon reasoning. Those require separate validation and are intentionally not claimed here.
+
 ---

 <a id="bridge-submission-pack"></a>