
# InfraFabric Evaluation - Quick Start

## TL;DR

**Goal:** Get brutal, comparable feedback from 3 AI evaluators (Codex, Gemini, Claude) on InfraFabric

**Time:** 3-6 hours (evaluations run in parallel)

**Output:** Consensus report showing what all evaluators agree on

---

## 3-Step Process

### Step 1: Copy Prompt (5 seconds)

```bash
cat /home/setup/navidocs/INFRAFABRIC_EVAL_PASTE_PROMPT.txt
```

### Step 2: Paste into 3 Sessions (3-6 hours total, run in parallel)

1. **Codex session** → Save output as `codex_infrafabric_eval_2025-11-14.yaml`
2. **Gemini session** → Save output as `gemini_infrafabric_eval_2025-11-14.yaml`
3. **Claude Code session** → Save output as `claude_infrafabric_eval_2025-11-14.yaml`

### Step 3: Merge Results (10 seconds)

```bash
cd /home/setup/navidocs
./merge_evaluations.py codex_*.yaml gemini_*.yaml claude_*.yaml
```

**Output:** `INFRAFABRIC_CONSENSUS_REPORT.md`
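Before running the merger, it can help to confirm that each evaluator actually produced an output file. The following is a minimal sketch that assumes the file naming from Step 2. `find_eval_files` and `missing_evaluators` are hypothetical helpers for illustration, not part of `merge_evaluations.py`.

```python
from pathlib import Path

# Filename prefixes taken from Step 2 of this guide.
EVALUATORS = ("codex", "gemini", "claude")


def find_eval_files(directory):
    """Map each evaluator prefix to its matching YAML files (hypothetical helper)."""
    base = Path(directory)
    return {name: sorted(base.glob(f"{name}_*.yaml")) for name in EVALUATORS}


def missing_evaluators(directory):
    """Return the evaluator prefixes that have no output file yet."""
    return [name for name, files in find_eval_files(directory).items() if not files]


if __name__ == "__main__":
    missing = missing_evaluators(".")
    if missing:
        print("Missing evaluation output for:", ", ".join(missing))
    else:
        print("All 3 evaluation files found; ready to merge.")
```

Run it from the directory that holds the YAML files; an empty "missing" list means the glob patterns in Step 3 will each match at least one file.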

---

## What You'll Get

### 1. Score Consensus
```yaml
overall_score: 6.5/10 (average across 3 evaluators)
variance: 0.25 (low variance = high agreement)
```
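To make the two numbers concrete, here is a small sketch of how an average and a variance could be computed from three scores. The individual scores are hypothetical, and the use of sample variance is an assumption; `merge_evaluations.py` may compute its figures differently.

```python
from statistics import mean, variance

# Hypothetical overall scores from the three evaluators (not real results).
scores = {"codex": 6.0, "gemini": 6.5, "claude": 7.0}

values = list(scores.values())
overall = mean(values)        # arithmetic mean across evaluators
spread = variance(values)     # sample variance (assumed; divides by n - 1)

print(f"overall_score: {overall:.1f}/10")
print(f"variance: {spread:.2f}")
```

A variance near zero means the evaluators scored the project almost identically; a large value suggests at least one evaluator disagrees and is worth reading in detail.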

### 2. IF.* Component Status
```
IF.guard: ✅ Implemented (3/3 agree, 73% complete)
IF.citate: ✅ Implemented (3/3 agree, 58% complete)
IF.sam: 🟡 Partial (3/3 agree - has design, no code)
IF.swarm: ❌ Vaporware (2/3 agree - mentioned but no spec)
```
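The "N/3 agree" figures can be read as a simple majority vote over each evaluator's verdict. The sketch below illustrates that idea under stated assumptions: the per-evaluator verdicts are hypothetical, and `status_consensus` is an illustrative helper rather than the merger's actual logic.

```python
from collections import Counter


def status_consensus(statuses):
    """Return (majority_status, votes, total) for one component."""
    status, votes = Counter(statuses).most_common(1)[0]
    return status, votes, len(statuses)


# Hypothetical per-evaluator verdicts, one entry per evaluator.
verdicts = {
    "IF.guard": ["implemented", "implemented", "implemented"],
    "IF.swarm": ["vaporware", "vaporware", "partial"],
}

for component, statuses in verdicts.items():
    status, votes, total = status_consensus(statuses)
    print(f"{component}: {status} ({votes}/{total} agree)")
```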

### 3. Critical Issues (Ranked by Consensus)
```
P0: API keys exposed (3/3 evaluators - 100% consensus) - 1 hour fix
P0: No authentication (3/3 evaluators - 100% consensus) - 3-5 days
P1: IF.sam not implemented (3/3 evaluators - 100% consensus) - 1-2 weeks
```
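Ranking by consensus can be sketched as: compute the share of evaluators that flagged each issue, then sort by severity first and consensus second. The data layout (`severity`, `flagged_by`) and the helper names are assumptions made for this example, not the schema the merger uses.

```python
TOTAL_EVALUATORS = 3
SEVERITY_ORDER = {"P0": 0, "P1": 1, "P2": 2}  # lower value sorts first


def consensus_pct(flagged_by, total=TOTAL_EVALUATORS):
    """Percentage of evaluators that flagged an issue, rounded to an integer."""
    return round(100 * len(set(flagged_by)) / total)


def rank_issues(issues):
    """Sort by severity, then by descending consensus within each severity."""
    return sorted(
        issues,
        key=lambda i: (SEVERITY_ORDER[i["severity"]], -consensus_pct(i["flagged_by"])),
    )


# Hypothetical findings; the 'flagged_by' lists are illustrative only.
issues = [
    {"title": "IF.sam not implemented", "severity": "P1",
     "flagged_by": ["codex", "gemini", "claude"]},
    {"title": "No authentication", "severity": "P0",
     "flagged_by": ["codex", "gemini"]},
    {"title": "API keys exposed", "severity": "P0",
     "flagged_by": ["codex", "gemini", "claude"]},
]

for issue in rank_issues(issues):
    pct = consensus_pct(issue["flagged_by"])
    print(f'{issue["severity"]}: {issue["title"]} ({pct}% consensus)')
```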

### 4. Buyer Persona Fit
```
1. Academic AI Safety: Fit 7.7/10, WTP 3.3/10 (loves it, won't pay)
2. Enterprise Governance: Fit 6.0/10, WTP 7.0/10 (will pay if production-ready)
```

---

## Why This Works

- ✅ **YAML format** → Easy to diff, merge, and filter programmatically
- ✅ **Mandatory schema** → All evaluators use the same structure
- ✅ **Quantified scores** → No vague assessments; everything is 0-10 or a percentage
- ✅ **Consensus ranking** → Focus first on what all evaluators agree on
- ✅ **File citations** → Every finding links to `file:line` for traceability

---

## Files Reference

| File | Size | Purpose |
|------|------|---------|
| `INFRAFABRIC_EVAL_PASTE_PROMPT.txt` | 9.4KB | Paste this into Codex/Gemini/Claude |
| `INFRAFABRIC_COMPREHENSIVE_EVALUATION_PROMPT.md` | 15KB | Full methodology (reference) |
| `merge_evaluations.py` | 8.9KB | Merges YAML outputs |
| `EVALUATION_WORKFLOW_README.md` | 6.6KB | Detailed workflow guide |
| `EVALUATION_QUICKSTART.md` | This file | Quick reference |

---

## Expected Timeline

| Phase | Duration | Parallelizable? |
|-------|----------|-----------------|
| Start 3 evaluation sessions | 1 minute | Yes |
| Wait for evaluations to complete | 3-6 hours | Yes (all 3 run simultaneously) |
| Download YAML files | 2 minutes | No |
| Run merger | 10 seconds | No |
| Review consensus report | 15-30 minutes | No |
| **Total elapsed time** | **3-6 hours** | (mostly waiting) |

---

## Troubleshooting

**Q: Evaluator isn't following YAML format**
```bash
# Show them the schema again (it's in the prompt)
grep -A 100 "YAML Schema:" INFRAFABRIC_EVAL_PASTE_PROMPT.txt
```

**Q: Merger script fails**
```bash
# Check YAML syntax (substitute whichever output file fails to load)
python3 -c "import yaml; yaml.safe_load(open('codex_infrafabric_eval_2025-11-14.yaml'))"

# Install PyYAML if needed
pip install pyyaml
```

**Q: Want to see just P0 blockers**
```bash
grep -A 5 "P0 Blockers" INFRAFABRIC_CONSENSUS_REPORT.md
```

---

## What to Do with Results

### Priority 1: 100% Consensus P0 Blockers
- **Everyone agrees these are critical**
- Fix immediately before anything else

### Priority 2: IF.* Components (Vaporware → Implemented)
- Components all 3 evaluators flagged as vaporware = remove from docs or build
- Components all 3 flagged as partial = finish implementation

### Priority 3: Market Focus
- Buyer persona with highest `fit_score * willingness_to_pay` = your target customer
- Ignore personas with high fit but low willingness to pay (WTP): interesting, but they won't make money
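As a worked example of the `fit_score * willingness_to_pay` rule, the sketch below ranks the two sample personas shown earlier in this guide. The dictionary layout and the `market_score` helper are assumptions for illustration; only the field names and the sample numbers come from this document.

```python
# Sample values taken from the "Buyer Persona Fit" example output.
personas = [
    {"name": "Academic AI Safety", "fit_score": 7.7, "willingness_to_pay": 3.3},
    {"name": "Enterprise Governance", "fit_score": 6.0, "willingness_to_pay": 7.0},
]


def market_score(persona):
    """Product of fit and willingness to pay (hypothetical helper)."""
    return persona["fit_score"] * persona["willingness_to_pay"]


ranked = sorted(personas, key=market_score, reverse=True)

for persona in ranked:
    print(f'{persona["name"]}: {market_score(persona):.1f}')
```

With these sample numbers, Enterprise Governance scores about 42.0 and Academic AI Safety about 25.4, so the lower-fit persona is still the stronger commercial target.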

### Priority 4: Documentation Cleanup
- Issues with 100% consensus on docs = definitely fix
- Issues with <67% consensus (flagged by fewer than 2 of 3 evaluators) = might be evaluator bias, investigate before acting

---

## Next Session Prompt

After you have the consensus report, create a debug session:

```markdown
# InfraFabric Debug Session

Based on consensus evaluation from Codex, Gemini, and Claude (2025-11-14):

**P0 Blockers (100% consensus):**
1. API keys exposed in docs (1 hour fix)
2. No authentication system (3-5 days)

**IF.* Components to implement:**
1. IF.sam (design exists, no code - 1-2 weeks)
2. [...]

Please implement fixes in priority order, starting with P0s.
```

---

## Key Insight

**Focus on 100% consensus findings first.**

If all 3 evaluators (different architectures, different training data, different biases) independently flag the same issue → it is very likely real and important.

---

**Ready for brutally honest feedback? Copy the prompt and run 3 evaluations in parallel.**