NaviDocs Deployment Runbook

4-Week Sprint Production Deployment Guide

Document Version: 1.0
Last Updated: 2025-11-13
Status: Phase 1 - Ready for Implementation
Owner: S4-H10 (Deployment Checklist Creator & Synthesis Agent)


Executive Summary

This runbook provides step-by-step procedures for deploying the NaviDocs 4-week sprint (Nov 13 - Dec 10, 2025) to production. It covers:

  • Pre-deployment validation (tests, backups, configuration)
  • Zero-downtime deployment (rolling updates, worker coordination)
  • Post-deployment smoke tests (critical flow validation)
  • Rollback procedures (emergency recovery)
  • Monitoring & logging (incident response)

Target Deployment Window: December 8-10, 2025 (after Week 4 completion)
Estimated Deployment Time: 30-45 minutes
Expected Downtime: <2 minutes (for database migration only)


Part 1: Pre-Deployment Checklist

A. Test Coverage Validation

Objective: Ensure code quality and feature completeness before deploying to production.

Unit Tests

# Run all unit tests
npm run test:unit

# Expected output
# PASS  test/services/warranty.service.test.js
# PASS  test/services/event-bus.service.test.js
# PASS  test/services/webhook.service.test.js
# PASS  test/services/notification.service.test.js
# PASS  test/services/sale-workflow.service.test.js
# PASS  test/services/home-assistant.service.test.js
# PASS  test/services/yachtworld.service.test.js
# ============================================
# Test Suites: 7 passed, 7 total
# Tests: 87 passed, 87 total
# Coverage: 75% statements, 82% branches, 68% functions

Pass Criteria:

  • All test suites passing
  • Coverage >70% statements
  • Zero critical failures
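
The coverage criterion can be turned into an automatic gate. The following is a minimal sketch, assuming the test run prints a summary line in the same format as the expected output above (real coverage reporters may format this differently); `coverage_above` is a hypothetical helper name, not part of the project.

```shell
# Sketch: gate on the statement-coverage figure from a summary line such as
# "# Coverage: 75% statements, 82% branches, 68% functions".
# The line format is taken from the expected output in this runbook.
coverage_above() {
  local summary="$1" threshold="$2" pct
  # Extract the integer that precedes "% statements"
  pct=$(printf '%s\n' "$summary" | sed -n 's/.*Coverage: \([0-9][0-9]*\)% statements.*/\1/p')
  if [ -z "$pct" ]; then
    echo "coverage figure not found" >&2
    return 2
  fi
  echo "statement coverage: ${pct}% (required: >${threshold}%)"
  # Strictly greater than, matching the ">70%" pass criterion
  [ "$pct" -gt "$threshold" ]
}
```

A possible use is to save the test output to a log and call `coverage_above "$(grep -m1 'Coverage:' /tmp/unit.log)" 70`, aborting the deployment if it returns non-zero.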

Integration Tests

# Run integration tests (requires test database)
npm run test:integration

# Expected output
# PASS  test/routes/warranty.routes.test.js
# PASS  test/routes/integrations.routes.test.js
# PASS  test/routes/sales.routes.test.js
# PASS  test/workers/warranty-expiration.worker.test.js
# ============================================
# Test Suites: 4 passed, 4 total
# Tests: 42 passed, 42 total
# Coverage: 68% statements, 75% branches

Pass Criteria:

  • All API routes tested
  • Database operations verified
  • Background workers functional

E2E Tests

# Run end-to-end tests against staging environment
npm run test:e2e

# Expected output
# PASS  e2e/warranty-tracking.spec.js (warranty creation, alerts, claim package)
# PASS  e2e/sale-workflow.spec.js (initiate, package generation, transfer)
# PASS  e2e/home-assistant.spec.js (webhook registration, event delivery)
# PASS  e2e/critical-flows.spec.js (login, document upload, export)
# ============================================
# Test Suites: 4 passed, 4 total
# Tests: 18 passed, 18 total
# Duration: 2m 34s

Pass Criteria:

  • All critical user flows pass
  • No timeout failures
  • Performance within acceptable ranges

Security Audit

# Check dependencies for vulnerabilities
npm audit

# Expected output
# 0 vulnerabilities (after fixes applied)

# If vulnerabilities found, fix them:
npm audit fix
# Only if necessary and after reviewing the changes (--force may install
# breaking major-version updates):
# npm audit fix --force

Pass Criteria:

  • Zero critical vulnerabilities
  • Zero high severity vulnerabilities
  • All audits passed

B. Database & Environment Setup

Database Backup

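# Create timestamped backup before any operations
# (sqlite3 .backup takes a consistent snapshot even while the database is in
# use; a plain cp can capture a half-written file)
BACKUP_TIMESTAMP=$(date +%Y%m%d-%H%M%S)
sqlite3 /var/www/navidocs/navidocs.db \
  ".backup '/var/www/navidocs/backups/navidocs.db.backup-${BACKUP_TIMESTAMP}'"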

# Verify backup integrity
sqlite3 /var/www/navidocs/backups/navidocs.db.backup-${BACKUP_TIMESTAMP} ".tables"

# Expected output: Should list all existing tables, for example:
# boats documents organization_settings organizations users
# (warranty_tracking and webhooks appear only after the Step 7 migrations)

Backup Verification Checklist:

  • Backup file created successfully
  • Backup file size > 100KB (contains data)
  • Backup file readable (sqlite3 can open it)
  • Backup location: /var/www/navidocs/backups/

Environment Variables Configuration

File: .env.production

# Required for production deployment
cat > .env.production << 'EOF'
# Application
NODE_ENV=production
PORT=3000
API_BASE_URL=https://api.navidocs.app
APP_BASE_URL=https://app.navidocs.app

# Database
DATABASE_URL=/var/www/navidocs/navidocs.db
DATABASE_BACKUP_DIR=/var/www/navidocs/backups

# Authentication
JWT_SECRET=<use_strong_secret_from_vault>
JWT_EXPIRATION=24h
REFRESH_TOKEN_EXPIRATION=7d

# Email Configuration
SMTP_HOST=<email_provider_host>
SMTP_PORT=587
SMTP_USER=<email_service_account>
SMTP_PASSWORD=<use_password_from_vault>
SMTP_FROM=notifications@navidocs.app
SMTP_FROM_NAME=NaviDocs Notifications

# Webhook Configuration
WEBHOOK_SIGNATURE_SECRET=<use_strong_secret_from_vault>
WEBHOOK_TIMEOUT_MS=30000
WEBHOOK_MAX_RETRIES=3

# Home Assistant Integration
HOME_ASSISTANT_WEBHOOK_TIMEOUT=5000

# Redis/Queue Configuration
REDIS_URL=redis://<redis_host>:6379/0
QUEUE_PREFIX=navidocs:queue:

# MLS Integrations
YACHTWORLD_API_KEY=<get_from_partner>
YACHTWORLD_API_BASE=https://api.yachtworld.com
BOAT_TRADER_API_KEY=<get_from_partner>
BOAT_TRADER_API_BASE=https://api.boattrader.com

# Logging & Monitoring
LOG_LEVEL=info
SENTRY_DSN=<get_from_sentry>
NEW_RELIC_LICENSE_KEY=<get_from_new_relic>

# Security
CORS_ORIGIN=https://app.navidocs.app
RATE_LIMIT_WINDOW_MS=900000
RATE_LIMIT_MAX_REQUESTS=100

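EOF

# Deployment metadata is appended separately: the quoted 'EOF' heredoc above
# writes $(...) literally, while this unquoted heredoc expands it
cat >> .env.production << EOF
# Deployment
DEPLOYMENT_VERSION=$(git rev-parse --short HEAD)
DEPLOYMENT_TIMESTAMP=$(date -u +'%Y-%m-%dT%H:%M:%SZ')
EOF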

Environment Validation Checklist:

  • All required variables defined
  • No hardcoded secrets in code
  • Secrets sourced from vault/secret manager
  • SSL certificate path configured
  • CORS origins correct

SSL Certificate Verification

# Check certificate expiration date
openssl x509 -in /etc/ssl/certs/navidocs.crt -noout -dates

# Expected output similar to:
# notBefore=Nov 13 00:00:00 2024 GMT
# notAfter=Nov 13 23:59:59 2025 GMT

# If certificate expires within 30 days, renew immediately
# Renew using Let's Encrypt (automated)
certbot renew

SSL Checklist:

  • Certificate valid (not expired)
  • Certificate expires >30 days in future
  • Private key exists and is readable
  • Certificate matches domain
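
The 30-day expiry rule can be computed rather than read by eye. This is a minimal sketch that assumes GNU `date` (for the `-d` option); `days_until_expiry` is a hypothetical helper, and the current time is injectable so the calculation can be checked against known dates.

```shell
# Sketch: whole days remaining before a certificate's notAfter date.
# Assumes GNU date. A negative result means the certificate has expired.
days_until_expiry() {
  local not_after="$1" now_epoch="${2:-$(date -u +%s)}" expiry_epoch
  expiry_epoch=$(date -u -d "$not_after" +%s) || return 2
  echo $(( (expiry_epoch - now_epoch) / 86400 ))
}

# Usage against the certificate path used in this runbook:
#   not_after=$(openssl x509 -in /etc/ssl/certs/navidocs.crt -noout -enddate | cut -d= -f2)
#   [ "$(days_until_expiry "$not_after")" -gt 30 ] || echo "Renew certificate"
```

If only a yes/no answer is needed, `openssl x509 -checkend 2592000 -noout -in <cert>` reports the same condition through its exit status (2592000 seconds is 30 days).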

C. Code Review & Quality Gates

Code Review Checklist

  • All pull requests reviewed (minimum 2 reviewers)
  • All review comments resolved
  • No blocking feedback remaining
  • Approval from tech lead obtained

Linting & Format Check

# Check code style
npm run lint

# Expected: 0 errors, 0 warnings

# Auto-format code if needed
npm run format

Linting Checklist:

  • No ESLint errors
  • No Prettier formatting issues
  • No TypeScript type errors (if using TS)

Dependency Check

# Review dependency updates
npm outdated

# Update minor/patch versions if safe
npm update

# Check for unmet or invalid peer dependencies
npm ls 2>&1 | grep -E "UNMET|peer"

# Document major version updates (from npm outdated) for next sprint

Dependency Checklist:

  • No unmet peer dependencies
  • Critical security patches applied
  • Major version updates documented for future

Part 2: Deployment Procedure (Zero-Downtime)

Pre-Deployment Verification (5 minutes)

# 1. Confirm current production state
pm2 list
# Should show both navidocs-api and navidocs-worker running

# 2. Check production database size (to estimate backup/migration time)
du -sh /var/www/navidocs/navidocs.db

# 3. Check system resources
free -h  # RAM available
df -h    # Disk space available (minimum 1GB for backup)
uptime   # System load

Pre-Deployment Criteria:

  • Both services running
  • >1GB disk space available
  • System load <80% (load average from uptime below 0.8 × CPU core count)
  • No active user sessions (off-peak deployment recommended)
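
The disk-space criterion lends itself to a scripted check. The sketch below uses the POSIX output format of `df`; `has_free_space` is a hypothetical helper, and 1048576 KB corresponds to the 1GB minimum stated above.

```shell
# Sketch: succeed only if the filesystem holding a path has at least min_kb free.
has_free_space() {
  local path="$1" min_kb="$2" free_kb
  # df -P prints one stable line per filesystem; with -k, column 4 is the
  # available space in 1K blocks
  free_kb=$(df -Pk "$path" | awk 'NR == 2 { print $4 }')
  echo "free: ${free_kb}KB, required: ${min_kb}KB"
  [ "$free_kb" -ge "$min_kb" ]
}

# Usage for the backup directory in this runbook:
#   has_free_space /var/www/navidocs/backups 1048576 || exit 1
```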

Step 1: Notify Stakeholders & Prepare (2 minutes)

# Send deployment notification to monitoring/alerting
# Notify users of upcoming maintenance window (if necessary)

# Example notification:
cat > /tmp/deployment_notice.txt << 'EOF'
DEPLOYMENT IN PROGRESS
Time: 2025-12-08 02:00 UTC
Duration: ~30 minutes
Services: Will be briefly unavailable (~2 minutes for DB migration)
Impact: All users affected during migration window
Status Page: https://status.navidocs.app
EOF

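# Post to Slack/Teams if integrated (Slack incoming webhooks expect a JSON
# body such as {"text": "..."}, so wrap the plain-text notice with jq first)
# jq -Rs '{text: .}' /tmp/deployment_notice.txt | \
#   curl -X POST -H 'Content-type: application/json' \
#     --data @- \
#     https://hooks.slack.com/services/YOUR/WEBHOOK/URL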

Step 2: Stop Background Workers (3 minutes)

# CRITICAL: Stop workers first to prevent job processing during migration
pm2 stop navidocs-worker

# Verify workers are stopped
pm2 list | grep navidocs-worker
# Should show: stopped

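# Allow up to 2 minutes for shutdown. pm2 stop signals the process and
# force-kills it after kill_timeout (pm2 default: 1600 ms), so in-flight jobs
# finish only if the worker handles the signal and kill_timeout is long enough.
sleep 120

# Check for jobs still waiting in the queue
redis-cli LLEN navidocs:queue:default

# Jobs still queued are not processed while the worker is stopped; they
# remain in Redis until the worker restarts in Step 9.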

Worker Stop Checklist:

  • navidocs-worker process stopped
  • No new jobs being queued
  • In-flight jobs completed or timed out
  • Queue is empty or nearly empty
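
A fixed `sleep` can be replaced by polling with a bounded number of attempts. The following is a sketch only: `wait_for_zero` and `queue_length` are hypothetical helpers, and the probe is passed in by name so the loop itself does not depend on Redis. Because a queue can only drain while workers are running, a loop like this is most useful just before `pm2 stop`.

```shell
# Sketch: poll a probe command until it prints 0 or the attempts run out.
wait_for_zero() {
  local probe="$1" max_checks="$2" interval="$3" i value
  for (( i = 1; i <= max_checks; i++ )); do
    value=$("$probe")
    echo "check ${i}/${max_checks}: ${value}"
    if [ "$value" -eq 0 ]; then
      return 0
    fi
    sleep "$interval"
  done
  return 1
}

# Probe using the queue key from this runbook
queue_length() { redis-cli LLEN navidocs:queue:default; }

# Usage: up to 30 checks, 2 seconds apart
#   wait_for_zero queue_length 30 2 || echo "Queue did not drain"
```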

Step 3: Create Production Backup (5 minutes)

# Create timestamped backup with full verification
BACKUP_TIMESTAMP=$(date +%Y%m%d-%H%M%S)
BACKUP_DIR=/var/www/navidocs/backups
BACKUP_FILE="${BACKUP_DIR}/navidocs.db.backup-${BACKUP_TIMESTAMP}"

# Backup with file locking (SQLite safe copy)
sqlite3 /var/www/navidocs/navidocs.db ".backup '${BACKUP_FILE}'"

# Verify backup size (-k forces kilobyte units regardless of platform defaults)
BACKUP_SIZE=$(du -sk "${BACKUP_FILE}" | cut -f1)
ORIGINAL_SIZE=$(du -sk /var/www/navidocs/navidocs.db | cut -f1)

echo "Original DB: ${ORIGINAL_SIZE}KB"
echo "Backup File: ${BACKUP_SIZE}KB"

# Verify backup integrity (attempt to query)
BACKUP_TABLES=$(sqlite3 "${BACKUP_FILE}" ".tables" 2>/dev/null | wc -w)
ORIGINAL_TABLES=$(sqlite3 /var/www/navidocs/navidocs.db ".tables" 2>/dev/null | wc -w)

echo "Original tables: ${ORIGINAL_TABLES}"
echo "Backup tables: ${BACKUP_TABLES}"

if [ "${BACKUP_TABLES}" -ne "${ORIGINAL_TABLES}" ]; then
  echo "ERROR: Backup verification failed!"
  exit 1
fi

# Keep only last 5 backups (clean up old ones)
cd "${BACKUP_DIR}"
ls -t navidocs.db.backup-* | tail -n +6 | xargs rm -f

echo "Backup created successfully: ${BACKUP_FILE}"

Backup Verification Checklist:

  • Backup file created
  • Backup size reasonable (within 90-110% of original)
  • Backup integrity verified (same table count)
  • Old backups cleaned up (keeping last 5)
  • Backup timestamp recorded for rollback
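
The script above prints both sizes but leaves the 90-110% comparison to the operator. A minimal sketch of that comparison follows; `size_within_tolerance` is a hypothetical helper that takes the two sizes in KB, as reported by `du`.

```shell
# Sketch: check that a backup's size is within 90-110% of the original.
# Uses integer arithmetic only: backup/original >= 0.90  <=>  backup*100 >= original*90
size_within_tolerance() {
  local backup_kb="$1" original_kb="$2"
  [ $(( backup_kb * 100 )) -ge $(( original_kb * 90 )) ] &&
    [ $(( backup_kb * 100 )) -le $(( original_kb * 110 )) ]
}

# Usage with the variables set in the backup step:
#   size_within_tolerance "${BACKUP_SIZE}" "${ORIGINAL_SIZE}" || { echo "ERROR: backup size out of range"; exit 1; }
```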

Step 4: Deploy Code (8 minutes)

# Navigate to production directory
cd /var/www/navidocs

# Fetch latest code from repository
git fetch origin main
git status
# Should show "Your branch is behind 'origin/main'"

# Review changes before merging
git diff HEAD origin/main --stat
# Shows files changed

# Checkout main and pull (assuming CI/CD passed); --ff-only refuses to
# create a merge commit if the local branch has diverged
git checkout main
git pull --ff-only origin main

# Expected: "Fast-forward" message

# Verify deployment branch
git log -1 --oneline
# Should match the release commit hash

Code Deployment Checklist:

  • git fetch successful
  • Changes reviewed (diff --stat)
  • No merge conflicts
  • Correct branch deployed (main)
  • Deployment commit hash recorded
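
Recording the commit hash is easier to rely on during a rollback if it is written to a file rather than noted by hand. The sketch below is one way to do that; `record_commit`, the labels, and the log path are all hypothetical and not part of the project.

```shell
# Sketch: append "<UTC timestamp> <label> <commit hash>" to a release log
# and print the hash.
record_commit() {
  local label="$1" repo_dir="$2" log_file="$3" hash
  hash=$(git -C "$repo_dir" rev-parse HEAD) || return 1
  printf '%s %s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$label" "$hash" >> "$log_file"
  echo "$hash"
}

# Usage: call once before the pull and once after it
#   record_commit before-deploy /var/www/navidocs /var/www/navidocs/backups/release.log
#   record_commit deployed      /var/www/navidocs /var/www/navidocs/backups/release.log
```

The `before-deploy` entry then identifies the exact commit to return to, which matters when a release contains more than one commit.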

Step 5: Install/Update Dependencies (4 minutes)

# Install production dependencies only
npm install --production

# Verify installation
npm list --depth=0
# Should show all required packages

# Check for any installation errors
npm ls --all 2>&1 | grep -i "error\|unmet"

# If errors found, investigate before proceeding

Dependency Installation Checklist:

  • npm install completes without errors
  • No peer dependency warnings
  • node_modules directory created
  • package-lock.json consistent

Step 6: Build Application (3 minutes)

# Build frontend/backend assets if applicable; abort deployment if the build fails
# (test the build command directly: $? checked later would reflect ls/du instead)
if ! npm run build; then
  echo "Build failed! Rolling back..."
  # ORIG_HEAD is the commit that was checked out before the pull in Step 4
  git reset --hard ORIG_HEAD
  npm install --production
  exit 1
fi

# Verify build output
ls -la dist/
# Should contain compiled assets

# Check build size (ensure no unexpected bloat)
du -sh dist/
# Should be <50MB for typical Node.js app

Build Verification Checklist:

  • Build completes successfully
  • Dist directory created with assets
  • Build size reasonable (<50MB)
  • No build warnings (or documented)

Step 7: Run Database Migrations (5 minutes) - CRITICAL

# List pending migrations
npm run migrate:status

# Expected output showing 5 new migrations:
# Pending migrations:
# 1. migrations/20251113_add_warranty_tracking.sql
# 2. migrations/20251113_add_webhooks.sql
# 3. migrations/20251113_add_sale_workflows.sql
# 4. migrations/20251113_add_notification_templates.sql
# 5. migrations/20251120_add_home_assistant_config.sql

# Apply migrations (this is the brief downtime window ~2 minutes)
# If migration fails, rollback:
echo "=== MIGRATION START TIME: $(date) ==="
if ! npm run migrate:up; then
  echo "ERROR: Migration failed! Rolling back..."
  npm run migrate:down
  exit 1
fi

# Expected output:
# Running migration: 20251113_add_warranty_tracking.sql
# Running migration: 20251113_add_webhooks.sql
# Running migration: 20251113_add_sale_workflows.sql
# Running migration: 20251113_add_notification_templates.sql
# Running migration: 20251120_add_home_assistant_config.sql
# ✓ All migrations completed successfully
echo "=== MIGRATION END TIME: $(date) ==="

# Verify migration success (.schema exits 0 and prints nothing for a missing
# table, so test for non-empty output)
SCHEMA=$(sqlite3 /var/www/navidocs/navidocs.db ".schema warranty_tracking")
if [ -z "${SCHEMA}" ]; then
  echo "ERROR: warranty_tracking table missing after migration!"
  exit 1
fi
echo "${SCHEMA}"
# Should output warranty_tracking schema

Migration Verification Checklist:

  • All migrations listed (npm run migrate:status)
  • Migration execution successful
  • New tables created (verify with sqlite3 .schema)
  • New indexes created
  • Data integrity maintained (row counts match)
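
The "row counts match" item can be verified by snapshotting per-table counts before and after the migration and comparing them. This is a sketch under that assumption; `snapshot_row_counts` and `compare_row_counts` are hypothetical helpers. Tables that exist only in the second snapshot (the ones the migrations create) are ignored.

```shell
# Sketch: write "table row_count" lines for every user table in a database.
snapshot_row_counts() {
  local db="$1" table
  for table in $(sqlite3 "$db" "SELECT name FROM sqlite_master WHERE type='table' AND name NOT LIKE 'sqlite_%' ORDER BY name;"); do
    printf '%s %s\n' "$table" "$(sqlite3 "$db" "SELECT COUNT(*) FROM \"$table\";")"
  done
}

# Compare two snapshots: every table in the first must appear in the second
# with the same count. Returns non-zero and prints the differences otherwise.
compare_row_counts() {
  awk '
    NR == FNR { before[$1] = $2; next }
    ($1 in before) {
      seen[$1] = 1
      if (before[$1] != $2) { printf "MISMATCH %s: %s -> %s\n", $1, before[$1], $2; bad = 1 }
    }
    END {
      for (t in before) if (!(t in seen)) { printf "MISSING %s\n", t; bad = 1 }
      exit bad
    }
  ' "$1" "$2"
}

# Usage:
#   snapshot_row_counts /var/www/navidocs/navidocs.db > /tmp/counts.before   # before migrate:up
#   snapshot_row_counts /var/www/navidocs/navidocs.db > /tmp/counts.after    # after migrate:up
#   compare_row_counts /tmp/counts.before /tmp/counts.after || echo "Row counts changed"
```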

Step 8: Restart API Server (2 minutes)

# Restart the API with graceful shutdown (a restart starts a fresh Node.js
# process, so no separate module-cache clearing is needed)
pm2 restart navidocs-api --wait-ready --listen-timeout 5000

# Verify API is running
pm2 list | grep navidocs-api
# Should show: "online"

# Wait for server to be ready (health check)
RETRY_COUNT=0
MAX_RETRIES=30  # 30 * 2 seconds = 60 seconds max wait

while [ $RETRY_COUNT -lt $MAX_RETRIES ]; do
  if curl -sf http://localhost:3000/api/health > /dev/null; then
    echo "✓ API server is responding to health checks"
    break
  fi
  RETRY_COUNT=$((RETRY_COUNT+1))
  echo "Waiting for API server... ($RETRY_COUNT/$MAX_RETRIES)"
  sleep 2
done

if [ $RETRY_COUNT -eq $MAX_RETRIES ]; then
  echo "ERROR: API server failed to start!"
  exit 1
fi

API Server Startup Checklist:

  • Process restarted (pm2 restart)
  • Process shows "online" status
  • Health check endpoint returns 200
  • No errors in logs (pm2 logs)

Step 9: Restart Background Workers (2 minutes)

# Restart workers with the new code
pm2 restart navidocs-worker --wait-ready --listen-timeout 5000

# Verify worker is running
pm2 list | grep navidocs-worker
# Should show: "online"

# Check worker logs for startup messages
pm2 logs navidocs-worker --lines 10 --nostream
# Should show "Worker started" messages

# Monitor queue for 30 seconds (verify jobs are being processed)
for i in {1..15}; do
  QUEUE_SIZE=$(redis-cli LLEN navidocs:queue:default 2>/dev/null || echo "0")
  echo "Queue size: $QUEUE_SIZE (check $i/15)"
  sleep 2
done

Worker Startup Checklist:

  • Process restarted (pm2 restart)
  • Process shows "online" status
  • No errors in logs
  • Jobs being processed from queue

Part 3: Post-Deployment Validation (10 minutes)

A. Health Check (2 minutes)

# 1. Health endpoint
curl -v http://localhost:3000/api/health

# Expected response:
# HTTP/1.1 200 OK
# Content-Type: application/json
# {
#   "status": "ok",
#   "timestamp": "2025-12-08T02:35:00Z",
#   "database": "connected",
#   "redis": "connected",
#   "workers": "running"
# }

Health Check Criteria:

  • HTTP 200 response
  • All services showing as "connected" or "running"
  • No error messages in response

B. Critical Endpoint Tests (3 minutes)

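# Test authentication endpoint and capture the token used by the requests below
LOGIN_RESPONSE=$(curl -s -X POST http://localhost:3000/api/auth/login \
  -H "Content-Type: application/json" \
  -d '{"email":"demo@navidocs.app","password":"test"}')
echo "${LOGIN_RESPONSE}" | jq '.'
AUTH_TOKEN=$(echo "${LOGIN_RESPONSE}" | jq -r '.token')

# Expected: { "token": "...", "user": {...} }
# HTTP 200 on success, 401 if the demo account is missing or the password differs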

# Test boat listing endpoint
curl -H "Authorization: Bearer ${AUTH_TOKEN}" \
  http://localhost:3000/api/boats \
  | jq '.length'

# Expected: Numeric count (could be 0 if no boats)

# Test warranty endpoint
curl -H "Authorization: Bearer ${AUTH_TOKEN}" \
  http://localhost:3000/api/warranties/expiring \
  | jq '.'

# Expected: Array of warranties (could be empty [])

# Test warranty creation
curl -X POST -H "Authorization: Bearer ${AUTH_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "boat_id":"test-boat",
    "item_name":"Engine",
    "purchase_date":"2023-01-15",
    "warranty_period_months":24
  }' \
  http://localhost:3000/api/warranties

# Expected: { "id": "...", "expiration_date": "2025-01-15" }

Endpoint Test Checklist:

  • /api/health returns 200
  • /api/auth/login responds (200 or 401)
  • /api/boats returns data or empty array
  • /api/warranties/expiring returns array
  • POST /api/warranties creates warranty successfully

C. Database Verification (2 minutes)

# Verify all new tables exist
sqlite3 /var/www/navidocs/navidocs.db << 'EOF'
.mode column
.headers on

-- Check warranty_tracking table
SELECT COUNT(*) as warranty_count FROM warranty_tracking;

-- Check webhooks table
SELECT COUNT(*) as webhook_count FROM webhooks;

-- Check sale_workflows table
SELECT COUNT(*) as sale_count FROM sale_workflows;

-- Check notification_templates table
SELECT COUNT(*) as template_count FROM notification_templates;

-- Verify indexes created
SELECT COUNT(*) as index_count FROM sqlite_master
WHERE type='index' AND tbl_name IN (
  'warranty_tracking', 'webhooks', 'sale_workflows'
);
EOF

# Expected output:
# warranty_count: 0 (or >0 if test data inserted)
# webhook_count: 0
# sale_count: 0
# template_count: >0 (seed templates inserted)
# index_count: >5 (all required indexes)

Database Verification Checklist:

  • warranty_tracking table exists
  • webhooks table exists
  • sale_workflows table exists
  • notification_templates table exists
  • All indexes created successfully

D. Smoke Tests (3 minutes)

# Run critical smoke tests
npm run test:smoke

# Expected output:
# PASS  smoke-tests/warranty-creation.spec.js
# PASS  smoke-tests/webhook-delivery.spec.js
# PASS  smoke-tests/notification-sending.spec.js
# PASS  smoke-tests/database-operations.spec.js
# ============================================
# Smoke Tests: 4 passed, 4 total
# Duration: 1m 30s

# If smoke tests fail, check logs:
pm2 logs navidocs-api --lines 50

Smoke Test Criteria:

  • All smoke tests pass
  • No timeout errors
  • No database connectivity errors
  • No authentication errors

E. Error Rate & Logs Monitoring (Continuous for 30 minutes)

# Monitor application logs for errors (pm2 logs streams until interrupted,
# so run each command in its own terminal)
pm2 logs navidocs-api --lines 20

# Monitor worker logs for failed jobs
pm2 logs navidocs-worker --lines 20

# Check error rate in monitoring system
# Example query (if using Sentry/New Relic):
# SELECT COUNT(*) FROM errors WHERE timestamp > now() - 30 minutes

# Alert if:
# - Error rate > 1% of requests
# - Any critical errors in logs
# - Worker jobs consistently failing

# If issues detected:
#   1. Check logs for root cause
#   2. If severe, proceed to ROLLBACK
#   3. If minor, create incident ticket for next sprint

Log Monitoring Checklist:

  • No critical errors in logs
  • Error rate <1% of requests
  • Worker processing jobs successfully
  • No database connection errors
  • No memory leaks (consistent RAM usage)

Part 4: Rollback Procedure (Emergency Recovery)

When to Rollback

Initiate rollback immediately if:

  • API server won't start (after 5 minutes)
  • Database migrations fail
  • Health check endpoints fail
  • Critical business logic broken
  • Error rate >5% of requests
  • Database corrupted or locked

Do NOT rollback for:

  • Minor UI bugs
  • Non-critical feature failures
  • Cosmetic issues
  • Warnings in logs (errors must be critical)

Rollback Steps (Automated Script)

#!/bin/bash
# File: /var/www/navidocs/scripts/rollback.sh
# Emergency rollback script

set -e  # Exit on any error

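# Run from the repository root so the git commands below resolve correctly
cd /var/www/navidocs

ROLLBACK_TIME=$(date -u +%Y-%m-%dT%H:%M:%SZ)  # -u: the trailing Z denotes UTC
CURRENT_VERSION=$(git rev-parse --short HEAD)
BACKUP_DIR=/var/www/navidocs/backups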

echo "============================================"
echo "EMERGENCY ROLLBACK INITIATED"
echo "Time: $ROLLBACK_TIME"
echo "Current Version: $CURRENT_VERSION"
echo "============================================"

# Step 1: Stop all services
echo "Step 1: Stopping services..."
pm2 stop navidocs-api navidocs-worker
sleep 3

# Step 2: Find most recent backup
echo "Step 2: Finding latest backup..."
LATEST_BACKUP=$(ls -t "${BACKUP_DIR}"/navidocs.db.backup-* 2>/dev/null | head -1)

if [ -z "$LATEST_BACKUP" ]; then
  echo "ERROR: No backup found! Manual recovery required."
  exit 1
fi

echo "Using backup: $LATEST_BACKUP"

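# Step 3: Verify backup before restore
echo "Step 3: Verifying backup integrity..."
# A pre-migration backup holds fewer tables than the migrated schema, so check
# integrity and a non-empty table list rather than a fixed table count
BACKUP_CHECK=$(sqlite3 "$LATEST_BACKUP" "PRAGMA integrity_check;" 2>/dev/null || true)
BACKUP_TABLES=$(sqlite3 "$LATEST_BACKUP" ".tables" 2>/dev/null | wc -w)
if [ "$BACKUP_CHECK" != "ok" ] || [ "$BACKUP_TABLES" -eq 0 ]; then
  echo "ERROR: Backup appears corrupted (integrity: ${BACKUP_CHECK:-unreadable}, tables: $BACKUP_TABLES)"
  exit 1
fi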

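# Step 4: Restore database
echo "Step 4: Restoring database from backup..."
# Remove stale WAL/shared-memory files, if any, so SQLite does not apply
# leftover post-backup writes to the restored database
rm -f /var/www/navidocs/navidocs.db-wal /var/www/navidocs/navidocs.db-shm
cp "$LATEST_BACKUP" /var/www/navidocs/navidocs.db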

# Verify restore
RESTORED_TABLES=$(sqlite3 /var/www/navidocs/navidocs.db ".tables" 2>/dev/null | wc -w)
echo "Restored database has $RESTORED_TABLES tables"

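# Step 5: Revert code to previous version
echo "Step 5: Reverting code..."
cd /var/www/navidocs
# HEAD~1 is the previous release only if the deployment added a single commit;
# otherwise export PREVIOUS_VERSION with the last known-good commit hash
PREVIOUS_VERSION=${PREVIOUS_VERSION:-$(git rev-parse HEAD~1)}
git reset --hard "$PREVIOUS_VERSION"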

# Step 6: Reinstall dependencies
echo "Step 6: Reinstalling dependencies..."
npm install --production

# Step 7: Restart services
echo "Step 7: Restarting services..."
pm2 start navidocs-api navidocs-worker

# Step 8: Health check
echo "Step 8: Verifying services..."
sleep 5
pm2 list

# Step 9: Final verification
echo "Step 9: Running health checks..."
RETRY_COUNT=0
while [ $RETRY_COUNT -lt 30 ]; do
  if curl -sf http://localhost:3000/api/health > /dev/null; then
    echo "✓ Rollback successful - API is responding"
    break
  fi
  RETRY_COUNT=$((RETRY_COUNT+1))
  sleep 2
done

if [ $RETRY_COUNT -eq 30 ]; then
  echo "ERROR: Rollback failed - API not responding"
  exit 1
fi

echo "============================================"
echo "ROLLBACK COMPLETE"
echo "Previous Version: $CURRENT_VERSION"
echo "Rolled Back To: $PREVIOUS_VERSION"
echo "Database Restored From: $LATEST_BACKUP"
echo "Time: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
echo "============================================"

# Send notification to Slack/email
# curl -X POST ... # notification code

Manual Rollback (If Automated Fails)

# 1. Stop services
pm2 stop navidocs-api navidocs-worker

# 2. Restore database (replace TIMESTAMP with actual backup timestamp)
TIMESTAMP="20251208-020000"  # Example from backup
cp /var/www/navidocs/backups/navidocs.db.backup-${TIMESTAMP} \
   /var/www/navidocs/navidocs.db

# 3. Verify database integrity
sqlite3 /var/www/navidocs/navidocs.db ".tables"

# 4. Revert code
cd /var/www/navidocs
git log --oneline -5  # Find previous good commit
git reset --hard <previous-commit-hash>

# 5. Reinstall dependencies
npm install --production

# 6. Restart services
pm2 start navidocs-api navidocs-worker

# 7. Check recent logs (--nostream prints and exits so step 8 can run)
pm2 logs navidocs-api --lines 50 --nostream
pm2 logs navidocs-worker --lines 50 --nostream

# 8. Verify health
curl http://localhost:3000/api/health

Rollback Verification Checklist:

  • Services stopped cleanly
  • Database restored from backup
  • Database integrity verified
  • Code reverted to previous version
  • Dependencies reinstalled
  • Services restarted successfully
  • Health check passes
  • No errors in logs

Post-Rollback Actions

# After rollback is complete and verified:

# 1. Document the incident
cat > /tmp/rollback_incident.log << 'EOF'
ROLLBACK INCIDENT REPORT
Date: 2025-12-08
Time: 02:35 UTC
Duration: 25 minutes
Reason: [Root cause analysis]
Version Rolled Back From: [commit hash]
Version Restored To: [commit hash]
Data Loss: [None, or list writes made after the backup that the restore discarded]
Actions Taken: [List all steps]
Root Cause: [Analysis]
Prevention: [How to avoid in future]
EOF

# 2. Notify team
# Email incident report to team
# Post to incident channel in Slack

# 3. Create post-mortem ticket
# Add to sprint backlog: "Post-mortem: Deployment failure on 2025-12-08"

# 4. Review deployment process
# Schedule review meeting for next day
# Document lessons learned

Part 5: Monitoring & Support

Real-Time Monitoring Dashboard

Tools to Monitor:

  1. Error Tracking: Sentry/New Relic
    • Alert if error rate >1% within 5 minutes
    • Critical errors require immediate investigation

  2. Performance Monitoring: New Relic/Datadog
    • API response time <200ms (p95)
    • Database query time <100ms (p95)
    • Worker job processing time <5s (p95)

  3. Infrastructure Monitoring: CloudWatch/Datadog
    • CPU usage <80%
    • Memory usage <85%
    • Disk usage <90%
    • Network throughput normal

  4. Application Logs: PM2/ELK Stack
    • Check for "ERROR" and "CRITICAL" messages
    • Monitor for "OutOfMemory" warnings
    • Check for "Database locked" errors

Incident Response

If Issues Detected During First 30 Minutes:

# Immediate steps:
# 1. Check if issue is configuration (env var, network) or code
pm2 logs navidocs-api --lines 100 --nostream
pm2 logs navidocs-worker --lines 100 --nostream

# 2. If quick fix available (< 5 minutes):
#    - Apply fix
#    - Restart services
#    - Monitor for 10 minutes

# 3. If issue is critical or fix takes >5 minutes:
#    - Execute rollback (see Part 4)
#    - Create incident ticket
#    - Plan hotfix for next deployment

# 4. If issue is intermittent:
#    - Monitor for 15 additional minutes
#    - Check system resources (memory, disk, CPU)
#    - If issue persists, rollback

Deployment Success Criteria

Deployment is SUCCESSFUL if:

  • All tests pass (unit, integration, E2E)
  • Deployment completes without errors
  • All health checks pass
  • All smoke tests pass
  • Error rate <0.1% during first 24 hours
  • No critical issues in logs
  • Database integrity verified
  • All new features working as expected

Deployment is FAILED if:

  • Tests fail before deployment
  • Deployment process errors
  • Health checks fail after deployment
  • Smoke tests fail
  • Error rate >1% during first hour
  • Critical errors in logs
  • Database corruption detected
  • Rollback required

Appendix A: Quick Reference

Deployment Timeline

Pre-Deployment Checks:    5 min (tests, backups)
Stakeholder Notification: 2 min
Stop Workers:             3 min
Database Backup:          5 min
Code Deploy:              8 min
Dependencies:             4 min
Build:                    3 min
Migrations:               5 min
API Restart:              2 min
Worker Restart:           2 min
─────────────────────────────
TOTAL DOWNTIME:           ~2 min (migration window)
TOTAL TIME:              ~39 min (with all steps)

Post-Deployment Validation: 10 min
Monitoring Period:         30 min (continuous)
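
The totals above can be re-derived from the per-step estimates, which is useful when a step's estimate changes. The sketch below simply sums the figures from the timeline; the array order follows the table.

```shell
# Per-step estimates from the timeline, in minutes:
# pre-checks, notification, stop workers, backup, code deploy,
# dependencies, build, migrations, API restart, worker restart
steps=(5 2 3 5 8 4 3 5 2 2)

total=0
for minutes in "${steps[@]}"; do
  total=$(( total + minutes ))
done

echo "Deployment steps: ${total} min"
# Add 10 min of post-deployment validation and 30 min of monitoring
echo "Including validation and monitoring: $(( total + 10 + 30 )) min"
```

The first figure reproduces the ~39 minute total; the second shows the time an operator should reserve end to end.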

Critical Commands

# Health check
curl http://localhost:3000/api/health

# View logs
pm2 logs navidocs-api
pm2 logs navidocs-worker

# View process status
pm2 list

# Restart services
pm2 restart navidocs-api navidocs-worker

# Emergency rollback
/var/www/navidocs/scripts/rollback.sh

# Database backup
sqlite3 /var/www/navidocs/navidocs.db ".backup '/var/www/navidocs/backups/backup.db'"

# Check queue size
redis-cli LLEN navidocs:queue:default

Emergency Contacts

Tech Lead: [Name/Email]
DevOps: [Name/Email]
On-Call: [Phone/Email]
Incident Channel: #incident-response (Slack)

Appendix B: Testing Checklist Template

Use this template for deployment day:

#!/bin/bash
# Pre-Deployment Checklist - Copy and use on deployment day

# -u: the sign-off line below labels this timestamp as UTC
DEPLOYMENT_DATE=$(date -u +%Y-%m-%d)
DEPLOYMENT_TIME=$(date -u +%H:%M:%S)

echo "NaviDocs Deployment Checklist"
echo "Date: $DEPLOYMENT_DATE"
echo "Time: $DEPLOYMENT_TIME"
echo "========================================"

# Tests
echo "[  ] Unit tests passing"
echo "[  ] Integration tests passing"
echo "[  ] E2E tests passing"
echo "[  ] Security audit passed"

# Backups
echo "[  ] Database backup created"
echo "[  ] Backup verified"

# Environment
echo "[  ] .env.production configured"
echo "[  ] SSL certificate valid"
echo "[  ] Secrets in vault, not in code"

# Deployment
echo "[  ] Code reviewed and approved"
echo "[  ] Dependencies check passed"
echo "[  ] Build successful"
echo "[  ] Migrations ready"

# Deployment Steps
echo "[  ] Pre-deployment checks complete"
echo "[  ] Workers stopped"
echo "[  ] Database backed up"
echo "[  ] Code deployed"
echo "[  ] Dependencies installed"
echo "[  ] Build completed"
echo "[  ] Migrations applied"
echo "[  ] API restarted"
echo "[  ] Workers restarted"

# Post-Deployment
echo "[  ] Health checks pass"
echo "[  ] Smoke tests pass"
echo "[  ] Critical endpoints responding"
echo "[  ] Database verified"
echo "[  ] No errors in logs"

# Sign-Off
echo "========================================"
echo "Deployed by: [Your Name]"
echo "Approved by: [Tech Lead Name]"
echo "Timestamp: $DEPLOYMENT_DATE $DEPLOYMENT_TIME UTC"

Document Status: Ready for Phase 2 Synthesis
Next Steps: Await completion of agents S4-H01 through S4-H09, then synthesize all outputs in session-4-handoff.md