# MCP Bridge Production Hardening Scripts
Production-ready deployment tools for running MCP bridge at scale with multiple agents.
## Overview
These scripts solve common production issues when running multiple Claude sessions coordinated via MCP bridge:
- **Idle session detection** - catches workers that go silent and would otherwise miss messages
- **Keep-alive reliability** - continuous polling keeps message delivery at 100%
- **External monitoring** - a watchdog detects silent agents and triggers alerts
- **Task reassignment** - automated recovery when workers fail
- **Push notifications** - filesystem watchers eliminate polling delay
## Scripts
### For Workers
#### `keepalive-daemon.sh`
Background daemon that polls for new messages every 30 seconds.
**Usage:**
```bash
./keepalive-daemon.sh <conversation_id> <worker_token>
```
**Example:**
```bash
./keepalive-daemon.sh conv_abc123def456 token_xyz789abc123 &
```
**Logs:** `/tmp/mcp-keepalive.log`
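The daemon's core is just a fixed-interval loop around the message check. A minimal Python sketch of that loop, with the check injected as a callable (the real script is bash and shells out to the client; `run_keepalive_loop` and its parameters are illustrative, not part of the shipped scripts):

```python
import time

def run_keepalive_loop(check_fn, interval=30, max_iterations=None):
    """Call check_fn (heartbeat update + message check) every `interval` seconds.

    max_iterations=None runs forever, matching the daemon's behavior;
    a finite value is useful for testing.
    """
    count = 0
    while max_iterations is None or count < max_iterations:
        check_fn()
        count += 1
        if max_iterations is None or count < max_iterations:
            time.sleep(interval)
    return count
```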
#### `keepalive-client.py`
Python client that updates heartbeat and checks for messages.
**Usage:**
```bash
python3 keepalive-client.py \
    --conversation-id conv_abc123 \
    --token token_xyz789 \
    --db-path /tmp/claude_bridge_coordinator.db
```
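Each pass amounts to a heartbeat upsert followed by an unread-message query. A sketch of that flow: the `session_status` columns match the schema under Database Schema below, but the `messages` table and its columns are assumptions for illustration, not the actual bridge schema.

```python
import sqlite3
from datetime import datetime, timezone

def heartbeat_and_check(db_path, conversation_id, session_id):
    conn = sqlite3.connect(db_path)
    conn.execute('PRAGMA journal_mode=WAL')  # avoid "database is locked" under concurrency
    # Upsert this worker's heartbeat timestamp
    conn.execute(
        """INSERT INTO session_status (conversation_id, session_id, last_heartbeat)
           VALUES (?, ?, ?)
           ON CONFLICT(conversation_id) DO UPDATE SET
               last_heartbeat = excluded.last_heartbeat,
               status = 'active'""",
        (conversation_id, session_id, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
    # Fetch unread messages for this conversation (table/columns assumed)
    rows = conn.execute(
        "SELECT id, body FROM messages WHERE conversation_id = ? AND read = 0",
        (conversation_id,),
    ).fetchall()
    conn.close()
    return rows
```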
#### `check-messages.py`
Standalone script to check for new messages.
**Usage:**
```bash
python3 check-messages.py \
    --conversation-id conv_abc123 \
    --token token_xyz789
```
#### `fs-watcher.sh`
Filesystem watcher using inotify for push-based notifications (<50ms latency).
**Requirements:** `inotify-tools` (Linux) or `fswatch` (macOS)
**Usage:**
```bash
# Install inotify-tools first
sudo apt-get install -y inotify-tools
# Run watcher
./fs-watcher.sh <conversation_id> <worker_token> &
```
**Benefits:**
- Message latency: <50ms (vs 15-30s with polling)
- Lower CPU usage
- Immediate notification when messages arrive
---
### For Orchestrator
#### `watchdog-monitor.sh`
External monitoring daemon that detects silent workers.
**Usage:**
```bash
./watchdog-monitor.sh &
```
**Configuration:**
- `CHECK_INTERVAL=60` - Check every 60 seconds
- `TIMEOUT_THRESHOLD=300` - Alert if no heartbeat for 5 minutes
**Logs:** `/tmp/mcp-watchdog.log`
**Expected output:**
```
[16:00:00] ✅ All workers healthy
[16:01:00] ✅ All workers healthy
[16:07:00] 🚨 ALERT: Silent workers detected!
conv_worker5 | session_b | 2025-11-13 16:02:45 | 315
[16:07:00] 🔄 Triggering task reassignment...
```
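The stale-heartbeat check behind those alerts can be sketched as a query over `session_status` comparing each `last_heartbeat` against the 300-second `TIMEOUT_THRESHOLD`. The column names follow the schema in this README; the function name and return shape are illustrative:

```python
import sqlite3
from datetime import datetime, timezone

def find_silent_workers(db_path, threshold_secs=300):
    """Return (conversation_id, session_id, age_secs) for workers past the threshold."""
    now = datetime.now(timezone.utc)
    conn = sqlite3.connect(db_path)
    silent = []
    for conv_id, session_id, last_hb in conn.execute(
        "SELECT conversation_id, session_id, last_heartbeat "
        "FROM session_status WHERE status = 'active'"
    ):
        age = (now - datetime.fromisoformat(last_hb)).total_seconds()
        if age > threshold_secs:
            silent.append((conv_id, session_id, int(age)))
    conn.close()
    return silent
```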
#### `reassign-tasks.py`
Task reassignment script triggered by watchdog when workers fail.
**Usage:**
```bash
python3 reassign-tasks.py --silent-workers "<worker_list>"
```
**Logs:** Writes to `audit_log` table in SQLite database
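A hedged sketch of what reassignment involves: mark the silent worker's session failed, return its in-flight tasks to the queue, and record the action. Only the `audit_log` table name comes from the script's description; the `tasks` table and all column names here are assumptions for illustration.

```python
import sqlite3
from datetime import datetime, timezone

def reassign_tasks(db_path, silent_conversations):
    conn = sqlite3.connect(db_path)
    for conv_id in silent_conversations:
        # Mark the silent worker's session as failed
        conn.execute(
            "UPDATE session_status SET status = 'failed' WHERE conversation_id = ?",
            (conv_id,))
        # Return its in-flight tasks to the pending queue (schema assumed)
        conn.execute(
            "UPDATE tasks SET assigned_to = NULL, state = 'pending' "
            "WHERE assigned_to = ? AND state = 'in_progress'",
            (conv_id,))
        # Record the recovery action in the audit trail (columns assumed)
        conn.execute(
            "INSERT INTO audit_log (timestamp, action, detail) VALUES (?, 'reassign', ?)",
            (datetime.now(timezone.utc).isoformat(), conv_id))
    conn.commit()
    conn.close()
```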
---
## Architecture
### Multi-Agent Coordination
```
┌─────────────────────────────────────────────────────────┐
│                      ORCHESTRATOR                       │
│                                                         │
│  • Creates conversations for N workers                  │
│  • Distributes tasks                                    │
│  • Runs watchdog-monitor.sh (monitors heartbeats)       │
│  • Triggers task reassignment on failures               │
└─────────────────┬───────────────────────────────────────┘
      ┌───────────┼───────────┬───────────┐
      │           │           │           │
 ┌────▼────┐ ┌────▼────┐ ┌────▼────┐ ┌────▼────┐
 │Worker 1 │ │Worker 2 │ │Worker 3 │ │Worker N │
 └─────────┘ └─────────┘ └─────────┘ └─────────┘
      │           │           │           │
  keepalive   keepalive   keepalive   keepalive
    daemon      daemon      daemon      daemon
      │           │           │           │
      └───────────┴───────────┴───────────┘
           Updates heartbeat every 30s
```
### Database Schema
The scripts use the following additional table:
```sql
CREATE TABLE IF NOT EXISTS session_status (
    conversation_id TEXT PRIMARY KEY,
    session_id TEXT NOT NULL,
    last_heartbeat TEXT NOT NULL,
    status TEXT DEFAULT 'active'
);
```
---
## Quick Start
### Setup Workers
On each worker machine:
```bash
# 1. Extract credentials from your conversation
CONV_ID="conv_abc123"
WORKER_TOKEN="token_xyz789"
# 2. Start keep-alive daemon
./keepalive-daemon.sh "$CONV_ID" "$WORKER_TOKEN" &
# 3. Verify running
tail -f /tmp/mcp-keepalive.log
```
### Setup Orchestrator
On orchestrator machine:
```bash
# Start external watchdog
./watchdog-monitor.sh &
# Monitor all workers
tail -f /tmp/mcp-watchdog.log
```
---
## Production Deployment Checklist
- [ ] All workers have keep-alive daemons running
- [ ] Orchestrator has external watchdog running
- [ ] SQLite database has `session_status` table created
- [ ] Rate limits increased to 100 req/min (for multi-agent)
- [ ] Logs are being rotated (logrotate)
- [ ] Monitoring alerts configured for watchdog failures
---
## Troubleshooting
### Worker not sending heartbeats
**Symptom:** Watchdog reports worker silent for >5 minutes
**Diagnosis:**
```bash
# Check if daemon is running
ps aux | grep keepalive-daemon
# Check daemon logs
tail -f /tmp/mcp-keepalive.log
```
**Solution:**
```bash
# Restart keep-alive daemon
pkill -f keepalive-daemon
./keepalive-daemon.sh "$CONV_ID" "$WORKER_TOKEN" &
```
### High message latency
**Symptom:** Messages taking >60 seconds to deliver
**Solution:** Switch from polling to filesystem watcher
```bash
# Stop polling daemon
pkill -f keepalive-daemon
# Start filesystem watcher (requires inotify-tools)
./fs-watcher.sh "$CONV_ID" "$WORKER_TOKEN" &
```
**Expected improvement:** 15-30s → <50ms latency
### Database locked errors
**Symptom:** `database is locked` errors in logs
**Solution:** Ensure SQLite WAL mode is enabled
```python
import sqlite3
conn = sqlite3.connect('/tmp/claude_bridge_coordinator.db')
conn.execute('PRAGMA journal_mode=WAL')
conn.close()
```
---
## Performance Metrics
Based on testing with 10 concurrent agents:
| Metric | Polling (30s) | Filesystem Watcher |
|--------|---------------|-------------------|
| Message latency | 15-30s avg | <50ms avg |
| CPU usage | Low (0.1%) | Very Low (0.05%) |
| Message delivery | 100% | 100% |
| Idle detection | 2-5 min | 2-5 min |
| Recovery time | <5 min | <5 min |
---
## Testing
Run the test suite to validate production hardening:
```bash
# Test keep-alive reliability (30 minutes)
python3 test_keepalive_reliability.py
# Test watchdog detection (5 minutes)
python3 test_watchdog_monitoring.py
# Test filesystem watcher latency (1 minute)
python3 test_fs_watcher_latency.py
```
---
## Contributing
See `CONTRIBUTING.md` in the root directory.
---
## License
Same as parent project (see `LICENSE`).
---
**Last Updated:** 2025-11-13
**Status:** Production-ready
**Tested with:** 10 concurrent Claude sessions over 30 minutes