From fc2b9032a234aa21443ed4abdd7a752f3ec590b5 Mon Sep 17 00:00:00 2001
From: root
Date: Mon, 22 Dec 2025 20:09:35 +0000
Subject: [PATCH] Add model economics (Dave-proof) section

---
 DANNY_STOCKER_INFRAFABRIC_DOSSIER.md | 46 ++++++++++++++++++++++++++++
 1 file changed, 46 insertions(+)

diff --git a/DANNY_STOCKER_INFRAFABRIC_DOSSIER.md b/DANNY_STOCKER_INFRAFABRIC_DOSSIER.md
index 0f7d271..033833e 100644
--- a/DANNY_STOCKER_INFRAFABRIC_DOSSIER.md
+++ b/DANNY_STOCKER_INFRAFABRIC_DOSSIER.md
@@ -91,6 +91,52 @@ If yes: fund this as a research program in citizen‑auditable operational gover
 If no: reject it, but name the boundary explicitly.
+### Operational finding: Model economics (Dave-proof run)
+
+A recurring objection (“Dave”) is that mental‑health‑adjacent UX requires the largest, most expensive models; therefore governance is secondary to model scale. InfraFabric treats that as a falsifiable claim: if the governance stack can enforce output invariants across model tiers, then model choice becomes an economic routing decision (default small, escalate when TRIAGE demands).
+
+This dossier includes one public, reproducible proof run:
+
+**What we tested (public, reproducible):**
+- 5 prompts × 3 Codex models = 15 traced runs:
+  - `gpt-5.2-codex`
+  - `gpt-5.1-codex-mini`
+  - `gpt-5.1-codex-max`
+- Two “polish invariants” checked on the *final, user-visible* response:
+  - language matches the question (`spanish? = N`)
+  - no bullet lists (`bullets? = N`)
+- Every run exports an evidence bundle + SHA256 + offline verifier.
+
+**Result (reference run `20251222T164352Z`):**
+- 15/15 responses returned HTTP 200 and produced verifiable evidence bundles.
+- 0/15 language leaks in the final output.
+- 0/15 bullet‑list violations in the final output.
+- 6/15 traces contain a `postprocess_applied` event with before/after SHA256, showing deterministic correction when needed (the correction itself is audited).
+
+**How to verify (no insider access):**
+- Bundle: https://infrafabric.io/static/hosted/emo_dave_proof_bundle_20251222T164352Z.tar.gz
+- Instructions: https://infrafabric.io/static/hosted/EMO_DAVE_PROOF_MODEL_COMPARE_20251222T164352Z.md
+- Trace IDs (15):
+  - `f3a2cc9d-4028-4e1c-a37a-dd1dc107e135`
+  - `209ec997-86d8-47c6-b92a-d2cce7de4cec`
+  - `b6ff776c-2c00-41d3-a1d7-20ff9ec2d7ee`
+  - `96c406a0-be8f-4d6b-87ae-245aad3f700c`
+  - `03cce206-d9c2-4883-838c-c871437185c6`
+  - `2a48a19f-2593-4110-9bfb-e47fa24fe462`
+  - `39551088-70c2-4977-9c75-50307ce92933`
+  - `9c3c7002-1e8f-430f-b50e-f5633b8963db`
+  - `50b756df-74dc-450c-bd46-524d799ee8db`
+  - `7ed26f37-fac7-4a5e-a352-4815411f4933`
+  - `9841f968-0b30-4916-985e-c5c82968a09c`
+  - `e907e2f5-a1e0-4050-963b-f244f3aacffb`
+  - `f2a1a46d-d242-4838-8122-d0eb8f87dfd6`
+  - `2fd841ed-0ccb-4ab6-ae39-4586c38a5ec5`
+  - `e2e528a9-2983-499a-af51-a23ed05149ed`
+
+**What this does and does not prove:**
+- It proves the *stack* can enforce specific invariants (language + formatting) across these model tiers for these prompts, with auditable corrections when needed.
+- It does not prove the models are equivalent on clinical judgment, crisis handling, or long‑horizon reasoning. Those require separate validation and are intentionally not claimed here.
+
 ---
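
The added section describes an audited correction: when the final response violates an invariant, a deterministic fix is applied and a `postprocess_applied` event records SHA256 digests of the text before and after. The sketch below illustrates that pattern for the "no bullet lists" invariant only. It is not the InfraFabric implementation: the regex, the function names, and every event field other than the `postprocess_applied` name are assumptions made for illustration, and the actual verifier is the one shipped in the bundle.

```python
import hashlib
import re
from typing import Dict, Optional, Tuple

# Assumed definition of a "bullet line": optional indent, a list marker
# (-, *, +, a bullet glyph, or "1." / "1)"), then whitespace.
BULLET_LINE = re.compile(r"^\s*(?:[-*+•]|\d+[.)])\s+")


def sha256_hex(text: str) -> str:
    """Hex SHA256 digest of the UTF-8 encoding of text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def has_bullets(text: str) -> bool:
    """True if any line of text starts with a list marker."""
    return any(BULLET_LINE.match(line) for line in text.splitlines())


def strip_bullets(text: str) -> str:
    """Deterministic correction: drop the leading list marker from each line."""
    return "\n".join(BULLET_LINE.sub("", line) for line in text.splitlines())


def enforce_no_bullets(text: str) -> Tuple[str, Optional[Dict[str, str]]]:
    """Return the user-visible text plus an audit event if a fix was applied."""
    if not has_bullets(text):
        return text, None
    fixed = strip_bullets(text)
    event = {
        "event": "postprocess_applied",      # event name from the dossier
        "invariant": "no_bullets",           # hypothetical field
        "before_sha256": sha256_hex(text),   # hypothetical field name
        "after_sha256": sha256_hex(fixed),   # hypothetical field name
    }
    return fixed, event


if __name__ == "__main__":
    final_text, audit = enforce_no_bullets("Try this:\n- breathe\n- rest")
    print(final_text)
    print(audit)
```

Because the correction is a pure function of the input text, anyone holding the pre-correction text can recompute both digests offline and compare them with the recorded event, which is what makes the correction itself auditable rather than a silent rewrite.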