Phase 1: Git Repository Audit (4 Agents, 2,438 files)
- GLOBAL_VISION_REPORT.md - Master audit synthesis (health score 8/10)
- ARCHAEOLOGIST_REPORT.md - Roadmap reconstruction (3 phases, no abandonments)
- INSPECTOR_REPORT.md - Wiring analysis (9/10, zero broken imports)
- SEGMENTER_REPORT.md - Functionality matrix (6/6 core features complete)
- GITEA_SYNC_STATUS_REPORT.md - Sync gap analysis (67 commits behind)
Phase 2: Multi-Environment Audit (3 Agents, 991 files)
- LOCAL_FILESYSTEM_ARTIFACTS_REPORT.md - 949 files scanned, 27 ghost files
- STACKCP_REMOTE_ARTIFACTS_REPORT.md - 14 deployment files, 12 missing from Git
- WINDOWS_DOWNLOADS_ARTIFACTS_REPORT.md - 28 strategic docs recovered
- PHASE_2_DELTA_REPORT.md - Cross-environment delta analysis
Remediation Kit (3 Agents)
- restore_chaos.sh - Master recovery script (1,785 lines, 23 functions)
- test_search_wiring.sh - Integration test suite (10 comprehensive tests)
- ELECTRICIAN_INDEX.md - Wiring fixes documentation
- REMEDIATION_COMMANDS.md - CLI command reference
Redis Knowledge Base
- redis_ingest.py - Automated ingestion (397 lines)
- forensic_surveyor.py - Filesystem scanner with Redis integration
- REDIS_INGESTION_*.md - Complete usage documentation
- Total indexed: 3,432 artifacts across 4 namespaces (1.43 GB)
Dockerfile Updates
- Enabled wkhtmltopdf for PDF export
- Multi-stage Alpine Linux build
- Health check endpoint configured
Security Updates
- Updated .env.example with comprehensive variable documentation
- server/index.js modified for api_search route integration
Audit Summary:
- Total files analyzed: 3,429
- Total execution time: 27 minutes
- Agents deployed: 7 (4 Phase 1 + 3 Phase 2)
- Health score: 8/10 (production ready)
- No lost work detected
- No abandoned features
- Zero critical blockers
Launch Status: APPROVED for December 10, 2025
# InfraFabric Multi-Evaluator Workflow

This directory contains prompts and tools for evaluating InfraFabric using multiple AI evaluators (Codex, Gemini, Claude) and automatically merging their feedback.

## Files

### 1. Prompts

- **`INFRAFABRIC_COMPREHENSIVE_EVALUATION_PROMPT.md`** - Full evaluation framework (7.5KB)
- **`INFRAFABRIC_EVAL_PASTE_PROMPT.txt`** - Concise paste-ready version (3.4KB)

### 2. Tools

- **`merge_evaluations.py`** - Python script to compare and merge YAML outputs

## Workflow

### Step 1: Run Evaluations in Parallel

Copy the paste-ready prompt and run it in 3 separate sessions:

**Session A: Codex**

```bash
# Copy prompt
cat INFRAFABRIC_EVAL_PASTE_PROMPT.txt

# Paste into Codex session
# Save output as: codex_infrafabric_eval_2025-11-14.yaml
```

**Session B: Gemini**

```bash
# Copy prompt
cat INFRAFABRIC_EVAL_PASTE_PROMPT.txt

# Paste into Gemini session
# Save output as: gemini_infrafabric_eval_2025-11-14.yaml
```

**Session C: Claude Code**

```bash
# Copy prompt
cat INFRAFABRIC_EVAL_PASTE_PROMPT.txt

# Paste into Claude Code session
# Save output as: claude_infrafabric_eval_2025-11-14.yaml
```
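
Each session should produce a YAML file following the schema in the prompt. As a rough sketch only (the authoritative schema lives in the prompt files): the two paths the examples later in this document rely on are `executive_summary.overall_score` and `gaps_and_issues.p0_blockers`; every other field name below is an illustrative assumption.

```yaml
# Illustrative sketch of a saved evaluation -- NOT the exact schema.
# Only the two paths used by the yq examples below are confirmed by this README.
executive_summary:
  overall_score: 6.0            # 0-10; averaged by the merger
gaps_and_issues:
  p0_blockers:                  # queried via: yq '.gaps_and_issues.p0_blockers'
    - issue: API keys exposed in documentation
      effort: 1 hour
components:                     # assumed layout for IF.* status consensus
  IF.guard:
    status: implemented         # implemented | partial | vaporware
    completeness: 70
```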

### Step 2: Merge Results

Once you have all 3 YAML files:

```bash
./merge_evaluations.py codex_*.yaml gemini_*.yaml claude_*.yaml
```

This generates: **`INFRAFABRIC_CONSENSUS_REPORT.md`**
## What the Merger Does

The `merge_evaluations.py` script:

1. **Score Consensus**
   - Averages scores across evaluators (overall, conceptual, technical, etc.)
   - Calculates variance and identifies outliers
   - Shows individual scores for comparison

2. **IF.* Component Status**
   - Merges component assessments (implemented/partial/vaporware)
   - Shows consensus level (e.g., "3/3 evaluators agree")
   - Averages completeness percentages for implemented components

3. **Critical Issues (P0/P1/P2)**
   - Aggregates issues across evaluators
   - Ranks by consensus (how many evaluators identified it)
   - Merges effort estimates

4. **Buyer Persona Analysis**
   - Averages fit scores and willingness-to-pay
   - Identifies consensus on target markets
   - Ranks by aggregate fit score
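To show the shape of the score-consensus step (item 1), here is a minimal sketch. It assumes the `executive_summary.overall_score` path used by the yq examples later in this document; the outlier rule and everything else here is illustrative, not the script's actual logic.

```python
# Minimal reimplementation of the score-consensus step, for illustration only.
# Not the actual merge_evaluations.py code. Expects at least two YAML files.
import statistics
import sys

import yaml  # requires PyYAML


def score_consensus(paths, outlier_margin=1.0):
    scores = {}
    for path in paths:
        with open(path) as f:
            doc = yaml.safe_load(f)
        # executive_summary.overall_score matches the yq examples in this README
        scores[path] = float(doc["executive_summary"]["overall_score"])
    mean = statistics.mean(scores.values())
    # Sample variance (n-1 denominator) matches the "Variance: 0.25"
    # line in the example report below.
    variance = statistics.variance(scores.values())
    # Illustrative outlier rule: more than outlier_margin points from the mean.
    outliers = [p for p, s in scores.items() if abs(s - mean) > outlier_margin]
    return {"average": mean, "variance": variance, "scores": scores, "outliers": outliers}


if __name__ == "__main__":
    print(score_consensus(sys.argv[1:]))
```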

## Example Output Structure

```markdown
# InfraFabric Evaluation Consensus Report

**Evaluators:** Codex, Gemini, Claude
**Generated:** 2025-11-14

## Score Consensus

### overall_score
- **Average:** 6.5/10
- **Variance:** 0.25
- **Individual scores:**
  - Codex: 6.0
  - Gemini: 7.0
  - Claude: 6.5
- **Outliers:** None

## IF.* Component Status (Consensus)

### IMPLEMENTED

**IF.guard** (3/3 evaluators agree - 100% consensus)
- Evaluators: Codex, Gemini, Claude
- Average completeness: 73%

**IF.citate** (3/3 evaluators agree - 100% consensus)
- Evaluators: Codex, Gemini, Claude
- Average completeness: 58%

### PARTIAL

**IF.sam** (3/3 evaluators agree - 100% consensus)
- Evaluators: Codex, Gemini, Claude

**IF.optimize** (2/3 evaluators agree - 67% consensus)
- Evaluators: Codex, Claude

### VAPORWARE

**IF.swarm** (2/3 evaluators agree - 67% consensus)
- Evaluators: Gemini, Claude

## P0 Blockers (Consensus)

**API keys exposed in documentation** (3/3 evaluators - 100% consensus)
- Identified by: Codex, Gemini, Claude
- Effort estimates: 1 hour, 30 minutes

**No authentication system** (3/3 evaluators - 100% consensus)
- Identified by: Codex, Gemini, Claude
- Effort estimates: 3-5 days, 1 week

## Buyer Persona Consensus

**Academic AI Safety Researchers**
- Avg Fit Score: 7.7/10
- Avg Willingness to Pay: 3.3/10
- Identified by: Codex, Gemini, Claude

**Enterprise AI Governance Teams**
- Avg Fit Score: 6.0/10
- Avg Willingness to Pay: 7.0/10
- Identified by: Codex, Gemini, Claude
```
## Benefits of This Approach

### 1. Consensus Validation
- **100% consensus** = High-confidence finding (all evaluators agree)
- **67% consensus** = Worth investigating (2/3 agree)
- **33% consensus** = Possible blind spot or edge case (1/3 unique finding)

### 2. Outlier Detection
- Identifies when one evaluator is significantly different from others
- Helps spot biases or unique insights

### 3. Easy Comparison
- YAML format makes `diff` and `grep` trivial
- Programmatic filtering: `yq '.gaps_and_issues.p0_blockers' codex_eval.yaml`

### 4. Aggregated Metrics
- Average scores reduce individual evaluator bias
- Variance shows agreement level
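The example report's numbers check out: with individual scores 6.0, 7.0, and 6.5, the mean is 6.5 and the sample variance (n-1 denominator, as Python's `statistics.variance` computes it) is the 0.25 shown above.

```python
>>> import statistics
>>> scores = [6.0, 7.0, 6.5]
>>> statistics.mean(scores)
6.5
>>> statistics.variance(scores)  # ((0.5)**2 + (0.5)**2 + 0.0) / (3 - 1)
0.25
```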
### 5. Actionable Prioritization
- Issues ranked by consensus (how many evaluators flagged it)
- Effort estimates from multiple perspectives

## Advanced Usage

### Filter by Consensus Level

Show only issues with 100% consensus:
```bash
python3 -c "
with open('INFRAFABRIC_CONSENSUS_REPORT.md') as f:
    for line in f:
        if '100% consensus' in line:
            print(line, end='')
"
```

### Extract P0 Blockers Only

```bash
grep -A 3 "P0 Blockers" INFRAFABRIC_CONSENSUS_REPORT.md
```
### Compare Individual Scores

```bash
# Glob matches the dated filenames used above (e.g. codex_infrafabric_eval_2025-11-14.yaml)
for file in *_eval_*.yaml; do
  echo "=== $file ==="
  yq '.executive_summary.overall_score' "$file"
done
```
## Tips

1. **Run evaluations in parallel** - All 3 can run simultaneously
2. **Use exact YAML schema** - Don't modify the structure
3. **Save raw outputs** - Keep individual evaluations for reference
4. **Version control consensus reports** - Track how assessments evolve over time
5. **Focus on 100% consensus items first** - These are highest-confidence findings

## Next Steps After Consensus Report

1. **P0 Blockers with 100% consensus** → Fix immediately
2. **IF.* components with 100% "vaporware" consensus** → Remove from docs or implement
3. **Buyer personas with highest avg fit + WTP** → Focus GTM strategy
4. **Issues with <67% consensus** → Investigate (might be edge cases or evaluator blind spots)
## Troubleshooting

**Issue:** YAML parse error
- **Fix:** Ensure evaluators used the exact schema (no custom fields at top level); a quick parse check is sketched below
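
To confirm each file parses before merging, a minimal check (assuming PyYAML is installed and files follow the `*_eval_*` naming used above):

```bash
# Pre-merge sanity check: report which evaluation files fail to parse as YAML.
for f in *_eval_*.yaml; do
  python3 -c "import sys, yaml; yaml.safe_load(open(sys.argv[1]))" "$f" \
    && echo "OK: $f" \
    || echo "PARSE ERROR: $f"
done
```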

**Issue:** Missing scores
- **Fix:** Check all evaluators filled in all sections (use schema as checklist)

**Issue:** Consensus report empty
- **Fix:** Verify YAML files are in current directory and named correctly
## Example Session

```bash
# 1. Start evaluations (paste prompt into 3 sessions)
cat INFRAFABRIC_EVAL_PASTE_PROMPT.txt

# 2. Wait for all 3 to complete (1-2 hours each)

# 3. Download YAML outputs to current directory
# codex_infrafabric_eval_2025-11-14.yaml
# gemini_infrafabric_eval_2025-11-14.yaml
# claude_infrafabric_eval_2025-11-14.yaml

# 4. Merge
./merge_evaluations.py *.yaml

# 5. Review consensus
cat INFRAFABRIC_CONSENSUS_REPORT.md

# 6. Act on high-consensus findings
grep -A 3 "100% consensus" INFRAFABRIC_CONSENSUS_REPORT.md
```

---

**Ready to evaluate InfraFabric with brutal honesty and scientific rigor.**