InfraFabric Multi-Evaluator Workflow

This directory contains prompts and tools for evaluating InfraFabric using multiple AI evaluators (Codex, Gemini, Claude) and automatically merging their feedback.

Files

1. Prompts

  • INFRAFABRIC_COMPREHENSIVE_EVALUATION_PROMPT.md - Full evaluation framework (7.5KB)
  • INFRAFABRIC_EVAL_PASTE_PROMPT.txt - Concise paste-ready version (3.4KB)

2. Tools

  • merge_evaluations.py - Python script to compare and merge YAML outputs

Workflow

Step 1: Run Evaluations in Parallel

Copy the paste-ready prompt and run it in 3 separate sessions:

Session A: Codex

# Copy prompt
cat INFRAFABRIC_EVAL_PASTE_PROMPT.txt

# Paste into Codex session
# Save output as: codex_infrafabric_eval_2025-11-14.yaml

Session B: Gemini

# Copy prompt
cat INFRAFABRIC_EVAL_PASTE_PROMPT.txt

# Paste into Gemini session
# Save output as: gemini_infrafabric_eval_2025-11-14.yaml

Session C: Claude Code

# Copy prompt
cat INFRAFABRIC_EVAL_PASTE_PROMPT.txt

# Paste into Claude Code session
# Save output as: claude_infrafabric_eval_2025-11-14.yaml

Step 2: Merge Results

Once you have all 3 YAML files:

./merge_evaluations.py codex_*.yaml gemini_*.yaml claude_*.yaml

This generates: INFRAFABRIC_CONSENSUS_REPORT.md

What the Merger Does

The merge_evaluations.py script does four things (a minimal sketch of the score math follows this list):

  1. Score Consensus

    • Averages scores across evaluators (overall, conceptual, technical, etc.)
    • Calculates variance and identifies outliers
    • Shows individual scores for comparison
  2. IF.* Component Status

    • Merges component assessments (implemented/partial/vaporware)
    • Shows consensus level (e.g., "3/3 evaluators agree")
    • Averages completeness percentages for implemented components
  3. Critical Issues (P0/P1/P2)

    • Aggregates issues across evaluators
    • Ranks by consensus (how many evaluators identified it)
    • Merges effort estimates
  4. Buyer Persona Analysis

    • Averages fit scores and willingness-to-pay
    • Identifies consensus on target markets
    • Ranks by aggregate fit score
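
For orientation, here is a minimal sketch of the score-consensus step from point 1. It assumes PyYAML and the .executive_summary.overall_score field used in the yq example later in this README; the file name and outlier threshold are illustrative, and merge_evaluations.py remains the authoritative implementation.

# consensus_sketch.py (hypothetical name) - illustrates the averaging,
# variance, and outlier logic only; not the real merge_evaluations.py
import sys
import statistics
import yaml

scores = {}
for path in sys.argv[1:]:
    with open(path) as f:
        doc = yaml.safe_load(f)
    scores[path] = float(doc["executive_summary"]["overall_score"])

mean = statistics.mean(scores.values())
variance = statistics.variance(scores.values())  # sample variance, as in the example report
# The 1.5-point deviation threshold is arbitrary; tune to taste
outliers = [p for p, s in scores.items() if abs(s - mean) > 1.5]

print(f"Average: {mean:.1f}/10  Variance: {variance:.2f}")
print("Outliers:", ", ".join(outliers) if outliers else "None")

Run it as: python3 consensus_sketch.py codex_*.yaml gemini_*.yaml claude_*.yaml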

Example Output Structure

# InfraFabric Evaluation Consensus Report

**Evaluators:** Codex, Gemini, Claude
**Generated:** 2025-11-14

## Score Consensus

### overall_score
- **Average:** 6.5/10
- **Variance:** 0.25
- **Individual scores:**
  - Codex: 6.0
  - Gemini: 7.0
  - Claude: 6.5
- **Outliers:** None

## IF.* Component Status (Consensus)

### IMPLEMENTED

**IF.guard** (3/3 evaluators agree - 100% consensus)
- Evaluators: Codex, Gemini, Claude
- Average completeness: 73%

**IF.citate** (3/3 evaluators agree - 100% consensus)
- Evaluators: Codex, Gemini, Claude
- Average completeness: 58%

### PARTIAL

**IF.sam** (3/3 evaluators agree - 100% consensus)
- Evaluators: Codex, Gemini, Claude

**IF.optimize** (2/3 evaluators agree - 67% consensus)
- Evaluators: Codex, Claude

### VAPORWARE

**IF.swarm** (2/3 evaluators agree - 67% consensus)
- Evaluators: Gemini, Claude

## P0 Blockers (Consensus)

**API keys exposed in documentation** (3/3 evaluators - 100% consensus)
- Identified by: Codex, Gemini, Claude
- Effort estimates: 1 hour, 30 minutes

**No authentication system** (3/3 evaluators - 100% consensus)
- Identified by: Codex, Gemini, Claude
- Effort estimates: 3-5 days, 1 week

## Buyer Persona Consensus

**Academic AI Safety Researchers**
- Avg Fit Score: 7.7/10
- Avg Willingness to Pay: 3.3/10
- Identified by: Codex, Gemini, Claude

**Enterprise AI Governance Teams**
- Avg Fit Score: 6.0/10
- Avg Willingness to Pay: 7.0/10
- Identified by: Codex, Gemini, Claude

Benefits of This Approach

1. Consensus Validation

  • 100% consensus = High-confidence finding (all evaluators agree)
  • 67% consensus = Worth investigating (2/3 agree)
  • 33% consensus = Possible blind spot or edge case (1/3 unique finding)

2. Outlier Detection

  • Identifies when one evaluator's scores diverge significantly from the others
  • Helps spot biases or unique insights

3. Easy Comparison

  • YAML format makes diff and grep trivial
  • Programmatic filtering: yq '.gaps_and_issues.p0_blockers' codex_eval.yaml (a Python equivalent follows this list)

4. Aggregated Metrics

  • Average scores reduce individual evaluator bias
  • Variance shows agreement level

5. Actionable Prioritization

  • Issues ranked by consensus (how many evaluators flagged it)
  • Effort estimates from multiple perspectives
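
Combining points 3 and 5 above, the yq filter has a straightforward Python equivalent that also ranks blockers by consensus. It assumes each P0 entry carries a title field, which is a guess at the schema rather than something this README specifies; adjust to your actual output.

# rank_p0.py (hypothetical name) - count how many evaluators flagged each blocker
import sys
from collections import Counter
import yaml

counts = Counter()
for path in sys.argv[1:]:
    with open(path) as f:
        data = yaml.safe_load(f)
    # .gaps_and_issues.p0_blockers matches the yq example above
    for blocker in (data.get("gaps_and_issues") or {}).get("p0_blockers") or []:
        counts[blocker["title"].strip().lower()] += 1

n_evaluators = len(sys.argv[1:])
for title, n in counts.most_common():
    print(f"{n}/{n_evaluators} evaluators: {title}")

Run it as: python3 rank_p0.py codex_*.yaml gemini_*.yaml claude_*.yaml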

Advanced Usage

Filter by Consensus Level

Show only issues with 100% consensus:

python3 -c "
with open('INFRAFABRIC_CONSENSUS_REPORT.md') as f:
    for line in f:
        if '100% consensus' in line:
            print(line, end='')
"

Extract P0 Blockers Only

grep -A 3 "P0 Blockers" INFRAFABRIC_CONSENSUS_REPORT.md

Compare Individual Scores

for file in *_eval.yaml; do
    echo "=== $file ==="
    yq '.executive_summary.overall_score' "$file"
done

Tips

  1. Run evaluations in parallel - All 3 can run simultaneously
  2. Use exact YAML schema - Don't modify the structure
  3. Save raw outputs - Keep individual evaluations for reference
  4. Version control consensus reports - Track how assessments evolve over time
  5. Focus on 100% consensus items first - These are highest-confidence findings

Next Steps After Consensus Report

  1. P0 Blockers with 100% consensus → Fix immediately
  2. IF.* components with 100% "vaporware" consensus → Remove from docs or implement
  3. Buyer personas with highest avg fit + WTP → Focus GTM strategy
  4. Issues with <67% consensus → Investigate (might be edge cases or evaluator blind spots)

Troubleshooting

Issue: YAML parse error

  • Fix: Ensure evaluators used exact schema (no custom fields at top level)

Issue: Missing scores

  • Fix: Check all evaluators filled in all sections (use schema as checklist)

Issue: Consensus report empty

  • Fix: Verify the YAML files are in the current directory and named correctly (see the pre-flight check below)
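
A quick pre-flight check covering all three issues (a sketch; assumes PyYAML and that the evaluation YAMLs sit in the current directory):

python3 -c "
import glob, yaml
files = sorted(glob.glob('*.yaml'))
print('No YAML files found' if not files else f'Checking {len(files)} file(s)')
for path in files:
    try:
        with open(path) as f:
            yaml.safe_load(f)
        print('OK   ', path)
    except yaml.YAMLError as err:
        print('PARSE ERROR', path, ':', err)
"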

Example Session

# 1. Start evaluations (paste prompt into 3 sessions)
cat INFRAFABRIC_EVAL_PASTE_PROMPT.txt

# 2. Wait for all 3 to complete (1-2 hours each)

# 3. Download YAML outputs to current directory
# codex_infrafabric_eval_2025-11-14.yaml
# gemini_infrafabric_eval_2025-11-14.yaml
# claude_infrafabric_eval_2025-11-14.yaml

# 4. Merge
./merge_evaluations.py *.yaml

# 5. Review consensus
cat INFRAFABRIC_CONSENSUS_REPORT.md

# 6. Act on high-consensus findings
grep -A 3 "100% consensus" INFRAFABRIC_CONSENSUS_REPORT.md

Ready to evaluate InfraFabric with brutal honesty and scientific rigor.