NaviDocs Deployment Runbook
4-Week Sprint Production Deployment Guide
Document Version: 1.0
Last Updated: 2025-11-13
Status: Phase 1 - Ready for Implementation
Owner: S4-H10 (Deployment Checklist Creator & Synthesis Agent)
Executive Summary
This runbook provides step-by-step procedures for deploying the NaviDocs 4-week sprint (Nov 13 - Dec 10, 2025) to production. It covers:
- Pre-deployment validation (tests, backups, configuration)
- Zero-downtime deployment (rolling updates, worker coordination)
- Post-deployment smoke tests (critical flow validation)
- Rollback procedures (emergency recovery)
- Monitoring & logging (incident response)
Target Deployment Window: December 8-10, 2025 (after Week 4 completion)
Estimated Deployment Time: 30-45 minutes
Expected Downtime: <2 minutes (for database migration only)
Part 1: Pre-Deployment Checklist
A. Test Coverage Validation
Objective: Ensure code quality and feature completeness before deploying to production.
Unit Tests
# Run all unit tests
npm run test:unit
# Expected output
# PASS test/services/warranty.service.test.js
# PASS test/services/event-bus.service.test.js
# PASS test/services/webhook.service.test.js
# PASS test/services/notification.service.test.js
# PASS test/services/sale-workflow.service.test.js
# PASS test/services/home-assistant.service.test.js
# PASS test/services/yachtworld.service.test.js
# ============================================
# Test Suites: 7 passed, 7 total
# Tests: 87 passed, 87 total
# Coverage: 75% statements, 82% branches, 68% functions
Pass Criteria:
- All test suites passing
- Coverage >70% statements
- Zero critical failures
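The coverage criterion above can be checked mechanically rather than by eye. The following is a minimal sketch, assuming the test runner prints a summary line shaped like the expected output shown earlier (`Coverage: NN% statements, ...`) with an integer percentage. The function name `coverage_gate` is hypothetical and not part of NaviDocs.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of NaviDocs): succeed only when the statements
# coverage found in the given text is strictly above the threshold (default 70).
coverage_gate() {
  local summary="$1" threshold="${2:-70}" pct
  # Pull the integer before "% statements" out of the summary text.
  pct=$(printf '%s\n' "$summary" |
    sed -n 's/.*Coverage: \([0-9]\{1,3\}\)% statements.*/\1/p' | head -n 1)
  [ -n "$pct" ] && [ "$pct" -gt "$threshold" ]
}

# Usage sketch (assumes the summary format above):
# output=$(npm run test:unit 2>&1)
# coverage_gate "$output" 70 || echo "Coverage gate failed"
```

If the project's runner reports coverage in a different format, for example a table or fractional percentages, the `sed` pattern would need to be adapted.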
Integration Tests
# Run integration tests (requires test database)
npm run test:integration
# Expected output
# PASS test/routes/warranty.routes.test.js
# PASS test/routes/integrations.routes.test.js
# PASS test/routes/sales.routes.test.js
# PASS test/workers/warranty-expiration.worker.test.js
# ============================================
# Test Suites: 4 passed, 4 total
# Tests: 42 passed, 42 total
# Coverage: 68% statements, 75% branches
Pass Criteria:
- All API routes tested
- Database operations verified
- Background workers functional
E2E Tests
# Run end-to-end tests against staging environment
npm run test:e2e
# Expected output
# PASS e2e/warranty-tracking.spec.js (warranty creation, alerts, claim package)
# PASS e2e/sale-workflow.spec.js (initiate, package generation, transfer)
# PASS e2e/home-assistant.spec.js (webhook registration, event delivery)
# PASS e2e/critical-flows.spec.js (login, document upload, export)
# ============================================
# Test Suites: 4 passed, 4 total
# Tests: 18 passed, 18 total
# Duration: 2m 34s
Pass Criteria:
- All critical user flows pass
- No timeout failures
- Performance within acceptable ranges
Security Audit
# Check dependencies for vulnerabilities
npm audit
# Expected output
# 0 vulnerabilities (after fixes applied)
# If vulnerabilities found, fix them:
npm audit fix
npm audit fix --force # Only if necessary and reviewed
Pass Criteria:
- Zero critical vulnerabilities
- Zero high severity vulnerabilities
- All audits passed
B. Database & Environment Setup
Database Backup
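# Create timestamped backup before any operations
BACKUP_TIMESTAMP=$(date +%Y%m%d-%H%M%S)
# Use sqlite3's .backup rather than cp: a plain file copy of a live database
# can capture a partially written state
sqlite3 /var/www/navidocs/navidocs.db \
".backup '/var/www/navidocs/backups/navidocs.db.backup-${BACKUP_TIMESTAMP}'"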
# Verify backup integrity
sqlite3 /var/www/navidocs/backups/navidocs.db.backup-${BACKUP_TIMESTAMP} ".tables"
# Expected output: Should list all existing tables
# boats documents organization_settings organizations users warranty_tracking webhooks
Backup Verification Checklist:
- Backup file created successfully
- Backup file size > 100KB (contains data)
- Backup file readable (sqlite3 can open it)
- Backup location: /var/www/navidocs/backups/
Environment Variables Configuration
File: .env.production
# Required for production deployment
cat > .env.production << 'EOF'
# Application
NODE_ENV=production
PORT=3000
API_BASE_URL=https://api.navidocs.app
APP_BASE_URL=https://app.navidocs.app
# Database
DATABASE_URL=/var/www/navidocs/navidocs.db
DATABASE_BACKUP_DIR=/var/www/navidocs/backups
# Authentication
JWT_SECRET=<use_strong_secret_from_vault>
JWT_EXPIRATION=24h
REFRESH_TOKEN_EXPIRATION=7d
# Email Configuration
SMTP_HOST=<email_provider_host>
SMTP_PORT=587
SMTP_USER=<email_service_account>
SMTP_PASSWORD=<use_password_from_vault>
SMTP_FROM=notifications@navidocs.app
SMTP_FROM_NAME=NaviDocs Notifications
# Webhook Configuration
WEBHOOK_SIGNATURE_SECRET=<use_strong_secret_from_vault>
WEBHOOK_TIMEOUT_MS=30000
WEBHOOK_MAX_RETRIES=3
# Home Assistant Integration
HOME_ASSISTANT_WEBHOOK_TIMEOUT=5000
# Redis/Queue Configuration
REDIS_URL=redis://<redis_host>:6379/0
QUEUE_PREFIX=navidocs:queue:
# MLS Integrations
YACHTWORLD_API_KEY=<get_from_partner>
YACHTWORLD_API_BASE=https://api.yachtworld.com
BOAT_TRADER_API_KEY=<get_from_partner>
BOAT_TRADER_API_BASE=https://api.boattrader.com
# Logging & Monitoring
LOG_LEVEL=info
SENTRY_DSN=<get_from_sentry>
NEW_RELIC_LICENSE_KEY=<get_from_new_relic>
# Security
CORS_ORIGIN=https://app.navidocs.app
RATE_LIMIT_WINDOW_MS=900000
RATE_LIMIT_MAX_REQUESTS=100
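EOF
# Deployment metadata is appended with an unquoted heredoc so the command
# substitutions expand; the quoted 'EOF' block above writes its body literally
cat >> .env.production << EOF
# Deployment
DEPLOYMENT_VERSION=$(git rev-parse --short HEAD)
DEPLOYMENT_TIMESTAMP=$(date -u +'%Y-%m-%dT%H:%M:%SZ')
EOF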
Environment Validation Checklist:
- All required variables defined
- No hardcoded secrets in code
- Secrets sourced from vault/secret manager
- SSL certificate path configured
- CORS origins correct
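Part of the checklist above can be automated: confirming that required keys are present and that no template placeholder such as `<use_strong_secret_from_vault>` was left in the file. This is a minimal sketch. The function name `validate_env_file` is hypothetical, and which keys count as required is an assumption to adjust per environment.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report required keys that are missing or empty in an
# env file, and any value that still contains a <placeholder>.
validate_env_file() {
  local file="$1"; shift
  local key status=0
  for key in "$@"; do
    # A key counts as defined only if it has a non-empty value.
    if ! grep -q "^${key}=." "$file"; then
      echo "MISSING: ${key}"
      status=1
    fi
  done
  # Template placeholders look like KEY=<...> or KEY=redis://<redis_host>:6379/0.
  if grep -n '^[A-Z_][A-Z0-9_]*=.*<[^>]*>' "$file"; then
    echo "ERROR: placeholder values listed above must be replaced"
    status=1
  fi
  return "$status"
}

# Usage sketch:
# validate_env_file .env.production NODE_ENV PORT DATABASE_URL JWT_SECRET REDIS_URL
```

The check only inspects the file's text. It does not confirm that the secrets themselves are valid or that they came from the vault.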
SSL Certificate Verification
# Check certificate expiration date
openssl x509 -in /etc/ssl/certs/navidocs.crt -noout -dates
# Expected output similar to:
# notBefore=Nov 13 00:00:00 2024 GMT
# notAfter=Nov 13 23:59:59 2025 GMT
# If certificate expires within 30 days, renew immediately
# Renew using Let's Encrypt (automated)
certbot renew
SSL Checklist:
- Certificate valid (not expired)
- Certificate expires >30 days in future
- Private key exists and is readable
- Certificate matches domain
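The 30-day expiry rule can be tested directly with `openssl x509 -checkend`, which takes a number of seconds and exits non-zero if the certificate expires within that window. The sketch below wraps it. The function name `cert_valid_for_days` is hypothetical, and the certificate path is the one used earlier in this section.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed when the certificate stays valid for at least
# the given number of days (default 30, matching the checklist).
cert_valid_for_days() {
  local cert="$1" days="${2:-30}"
  if [ ! -r "$cert" ]; then
    echo "ERROR: cannot read certificate: $cert" >&2
    return 2
  fi
  # -checkend N exits 0 if the certificate is still valid N seconds from now.
  openssl x509 -in "$cert" -noout -checkend "$((days * 86400))" >/dev/null
}

# Usage sketch:
# cert_valid_for_days /etc/ssl/certs/navidocs.crt 30 || echo "Renew certificate before deploying"
```

This covers the expiry items only. Matching the certificate to the domain and checking the private key remain manual steps.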
C. Code Review & Quality Gates
Code Review Checklist
- All pull requests reviewed (minimum 2 reviewers)
- All review comments resolved
- No blocking feedback remaining
- Approval from tech lead obtained
Linting & Format Check
# Check code style
npm run lint
# Expected: 0 errors, 0 warnings
# Auto-format code if needed
npm run format
Linting Checklist:
- No ESLint errors
- No Prettier formatting issues
- No TypeScript type errors (if using TS)
Dependency Check
# Review dependency updates
npm outdated
# Update minor/patch versions if safe
npm update
# Document major version updates for next sprint
# Check for unmet or missing peer dependencies
npm ls | grep -E "UNMET|peer"
Dependency Checklist:
- No unmet peer dependencies
- Critical security patches applied
- Major version updates documented for future
Part 2: Deployment Procedure (Zero-Downtime)
Pre-Deployment Verification (5 minutes)
# 1. Confirm current production state
pm2 list
# Should show both navidocs-api and navidocs-worker running
# 2. Check production database size (to estimate backup/migration time)
du -sh /var/www/navidocs/navidocs.db
# 3. Check system resources
free -h # RAM available
df -h # Disk space available (minimum 1GB for backup)
uptime # System load
Pre-Deployment Criteria:
- Both services running
- >1GB disk space available
- System load <80% (load average from uptime, relative to the number of CPU cores)
- No active user sessions (off-peak deployment recommended)
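The disk and load criteria can be turned into pass/fail checks. This is a sketch under two assumptions: "system load <80%" is read as the load average divided by the CPU core count, and the host is Linux (for `/proc/loadavg` and `nproc` in the usage line). Both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for the resource criteria above.

# Succeed when the filesystem holding $1 has at least $2 KiB available.
has_free_kib() {
  local path="$1" min_kib="$2" avail
  # df -Pk prints a header plus one data line; column 4 is available KiB.
  avail=$(df -Pk "$path" | awk 'NR == 2 { print $4 }')
  [ -n "$avail" ] && [ "$avail" -ge "$min_kib" ]
}

# Succeed when load average $1 spread over $2 CPU cores is below $3 percent.
load_below_pct() {
  awk -v load="$1" -v cores="$2" -v pct="$3" \
    'BEGIN { exit !((load / cores) * 100 < pct) }'
}

# Usage sketch (Linux): 1 GiB = 1048576 KiB
# has_free_kib /var/www/navidocs 1048576 &&
#   load_below_pct "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)" 80 ||
#   echo "Resource check failed"
```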
Step 1: Notify Stakeholders & Prepare (2 minutes)
# Send deployment notification to monitoring/alerting
# Notify users of upcoming maintenance window (if necessary)
# Example notification:
cat > /tmp/deployment_notice.txt << 'EOF'
DEPLOYMENT IN PROGRESS
Time: 2025-12-08 02:00 UTC
Duration: ~30 minutes
Services: Will be briefly unavailable (~2 minutes for DB migration)
Impact: All users affected during migration window
Status Page: https://status.navidocs.app
EOF
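# Post to Slack/Teams if integrated. Slack incoming webhooks expect a JSON body
# such as {"text": "..."}, so wrap the plain-text notice with jq first
# curl -X POST -H 'Content-type: application/json' \
# --data "$(jq -Rs '{text: .}' /tmp/deployment_notice.txt)" \
# https://hooks.slack.com/services/YOUR/WEBHOOK/URL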
Step 2: Stop Background Workers (3 minutes)
# CRITICAL: Stop workers first to prevent job processing during migration
pm2 stop navidocs-worker
# Verify workers are stopped
pm2 list | grep navidocs-worker
# Should show: stopped
# Allow up to 2 minutes for in-flight jobs to finish or time out
sleep 120
# Check how many jobs are still queued
redis-cli LLEN navidocs:queue:default
# With the worker stopped the queue cannot drain; jobs still queued stay in
# Redis and are processed after the worker restart in Step 9
Worker Stop Checklist:
- navidocs-worker process stopped
- No new jobs being queued
- In-flight jobs completed or timed out
- Queue is empty or nearly empty
Step 3: Create Production Backup (5 minutes)
# Create timestamped backup with full verification
BACKUP_TIMESTAMP=$(date +%Y%m%d-%H%M%S)
BACKUP_DIR=/var/www/navidocs/backups
BACKUP_FILE="${BACKUP_DIR}/navidocs.db.backup-${BACKUP_TIMESTAMP}"
# Backup with file locking (SQLite safe copy)
sqlite3 /var/www/navidocs/navidocs.db ".backup '${BACKUP_FILE}'"
# Verify backup size (du -k reports KiB regardless of the platform's default block size)
BACKUP_SIZE=$(du -sk "${BACKUP_FILE}" | cut -f1)
ORIGINAL_SIZE=$(du -sk /var/www/navidocs/navidocs.db | cut -f1)
echo "Original DB: ${ORIGINAL_SIZE}KB"
echo "Backup File: ${BACKUP_SIZE}KB"
# Verify backup integrity (attempt to query)
BACKUP_TABLES=$(sqlite3 "${BACKUP_FILE}" ".tables" 2>/dev/null | wc -w)
ORIGINAL_TABLES=$(sqlite3 /var/www/navidocs/navidocs.db ".tables" 2>/dev/null | wc -w)
echo "Original tables: ${ORIGINAL_TABLES}"
echo "Backup tables: ${BACKUP_TABLES}"
if [ "${BACKUP_TABLES}" -ne "${ORIGINAL_TABLES}" ]; then
echo "ERROR: Backup verification failed!"
exit 1
fi
# Keep only last 5 backups (clean up old ones)
cd "${BACKUP_DIR}"
ls -t navidocs.db.backup-* | tail -n +6 | xargs rm -f
echo "Backup created successfully: ${BACKUP_FILE}"
Backup Verification Checklist:
- Backup file created
- Backup size reasonable (within 90-110% of original)
- Backup integrity verified (same table count)
- Old backups cleaned up (keeping last 5)
- Backup timestamp recorded for rollback
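The Step 3 script prints both sizes but leaves the 90-110% comparison from the checklist to the operator. The sketch below performs that comparison with integer arithmetic on byte counts. The function name `size_within_tolerance` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed when the backup's size is within 90-110% of the
# original file's size. Returns 2 if either file is unreadable or the original is empty.
size_within_tolerance() {
  local original="$1" backup="$2" orig_bytes backup_bytes
  [ -r "$original" ] && [ -r "$backup" ] || return 2
  # $(( ... )) also strips the padding some wc implementations emit.
  orig_bytes=$(( $(wc -c < "$original") ))
  backup_bytes=$(( $(wc -c < "$backup") ))
  [ "$orig_bytes" -gt 0 ] || return 2
  # Compare backup*100 against 90% and 110% of the original, avoiding floats.
  [ $((backup_bytes * 100)) -ge $((orig_bytes * 90)) ] &&
    [ $((backup_bytes * 100)) -le $((orig_bytes * 110)) ]
}

# Usage sketch:
# size_within_tolerance /var/www/navidocs/navidocs.db "${BACKUP_FILE}" ||
#   { echo "ERROR: backup size outside 90-110% of original"; exit 1; }
```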
Step 4: Deploy Code (8 minutes)
# Navigate to production directory
cd /var/www/navidocs
# Fetch latest code from repository
git fetch origin main
git status
# Should show "Your branch is behind 'origin/main'"
# Review changes before merging
git diff HEAD origin/main --stat
# Shows files changed
# Checkout main and pull (assuming CI/CD passed)
git checkout main
git pull origin main
# Expected: "Fast-forward" message
# Verify deployment branch
git log -1 --oneline
# Should match the release commit hash
Code Deployment Checklist:
- git fetch successful
- Changes reviewed (diff --stat)
- No merge conflicts
- Correct branch deployed (main)
- Deployment commit hash recorded
Step 5: Install/Update Dependencies (4 minutes)
# Install production dependencies only
npm install --production
# Verify installation
npm list --depth=0
# Should show all required packages
# Check for any installation errors
npm ls --all 2>&1 | grep -i "error\|unmet"
# If errors found, investigate before proceeding
Dependency Installation Checklist:
- npm install completes without errors
- No peer dependency warnings
- node_modules directory created
- package-lock.json consistent
Step 6: Build Application (3 minutes)
# Build frontend/backend assets if applicable.
# Test the build's exit status directly: checking $? after ls or du would
# report the status of those commands, not of the build.
if ! npm run build; then
  echo "Build failed! Rolling back..."
  # ORIG_HEAD is the commit that was checked out before 'git pull' in Step 4
  git reset --hard ORIG_HEAD
  npm install --production
  exit 1
fi
# Verify build output
ls -la dist/
# Should contain compiled assets
# Check build size (ensure no unexpected bloat)
du -sh dist/
# Should be <50MB for typical Node.js app
Build Verification Checklist:
- Build completes successfully
- Dist directory created with assets
- Build size reasonable (<50MB)
- No build warnings (or documented)
Step 7: Run Database Migrations (5 minutes) - CRITICAL
# List pending migrations
npm run migrate:status
# Expected output showing 5 new migrations:
# Pending migrations:
# 1. migrations/20251113_add_warranty_tracking.sql
# 2. migrations/20251113_add_webhooks.sql
# 3. migrations/20251113_add_sale_workflows.sql
# 4. migrations/20251113_add_notification_templates.sql
# 5. migrations/20251120_add_home_assistant_config.sql
# Apply migrations (this is the brief downtime window ~2 minutes)
echo "=== MIGRATION START TIME: $(date) ==="
# Test the exit status of migrate:up directly; checking $? after a later
# command would report that command's status instead
if ! npm run migrate:up; then
  echo "ERROR: Migration failed! Rolling back..."
  npm run migrate:down
  exit 1
fi
# Expected output:
# Running migration: 20251113_add_warranty_tracking.sql
# Running migration: 20251113_add_webhooks.sql
# Running migration: 20251113_add_sale_workflows.sql
# Running migration: 20251113_add_notification_templates.sql
# Running migration: 20251120_add_home_assistant_config.sql
# ✓ All migrations completed successfully
echo "=== MIGRATION END TIME: $(date) ==="
# Verify migration success
sqlite3 /var/www/navidocs/navidocs.db ".schema warranty_tracking"
# Should output warranty_tracking schema (empty output means the table was not created)
Migration Verification Checklist:
- All migrations listed (npm run migrate:status)
- Migration execution successful
- New tables created (verify with sqlite3 .schema)
- New indexes created
- Data integrity maintained (row counts match)
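The "row counts match" item has no corresponding command in Step 7. One way to check it, sketched here under the assumption that nothing writes to the database between the two snapshots, is to record per-table row counts before `migrate:up` and compare them afterwards. Both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: record per-table row counts before the migration and
# confirm afterwards that every pre-existing table kept its count.

# Print "table count" lines for every user table in the database at $1.
snapshot_row_counts() {
  local db="$1" table
  sqlite3 "$db" "SELECT name FROM sqlite_master WHERE type='table' AND name NOT LIKE 'sqlite_%' ORDER BY name;" |
    while IFS= read -r table; do
      printf '%s %s\n' "$table" "$(sqlite3 "$db" "SELECT COUNT(*) FROM \"$table\";")"
    done
}

# Succeed when every line of the "before" snapshot also appears in "after".
# Tables added by the migration are ignored; mismatching lines are printed.
row_counts_preserved() {
  local before="$1" after="$2"
  ! grep -Fxv -f "$after" "$before"
}

# Usage sketch:
# snapshot_row_counts /var/www/navidocs/navidocs.db > /tmp/counts.before
# npm run migrate:up
# snapshot_row_counts /var/www/navidocs/navidocs.db > /tmp/counts.after
# row_counts_preserved /tmp/counts.before /tmp/counts.after || echo "Row counts changed"
```

A migration that intentionally moves or deletes rows would fail this check, so a mismatch is a prompt to investigate rather than proof of corruption.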
Step 8: Restart API Server (2 minutes)
# Restart the API with graceful shutdown (the new process starts with a
# fresh Node.js module cache, so no separate cache-clearing step is needed)
pm2 restart navidocs-api --wait-ready --listen-timeout 5000
# Verify API is running
pm2 list | grep navidocs-api
# Should show: "online"
# Wait for server to be ready (health check)
RETRY_COUNT=0
MAX_RETRIES=30 # 30 * 2 seconds = 60 seconds max wait
while [ $RETRY_COUNT -lt $MAX_RETRIES ]; do
if curl -sf http://localhost:3000/api/health > /dev/null; then
echo "✓ API server is responding to health checks"
break
fi
RETRY_COUNT=$((RETRY_COUNT+1))
echo "Waiting for API server... ($RETRY_COUNT/$MAX_RETRIES)"
sleep 2
done
if [ $RETRY_COUNT -eq $MAX_RETRIES ]; then
echo "ERROR: API server failed to start!"
exit 1
fi
API Server Startup Checklist:
- Process restarted (pm2 restart)
- Process shows "online" status
- Health check endpoint returns 200
- No errors in logs (pm2 logs)
Step 9: Restart Background Workers (2 minutes)
# Restart workers with the new code
pm2 restart navidocs-worker --wait-ready --listen-timeout 5000
# Verify worker is running
pm2 list | grep navidocs-worker
# Should show: "online"
# Check worker logs for startup messages
pm2 logs navidocs-worker --lines 10 --nostream
# Should show "Worker started" messages
# Monitor queue for 30 seconds (verify jobs are being processed)
for i in {1..15}; do
QUEUE_SIZE=$(redis-cli LLEN navidocs:queue:default 2>/dev/null || echo "0")
echo "Queue size: $QUEUE_SIZE (check $i/15)"
sleep 2
done
Worker Startup Checklist:
- Process restarted (pm2 restart)
- Process shows "online" status
- No errors in logs
- Jobs being processed from queue
Part 3: Post-Deployment Validation (10 minutes)
A. Health Check (2 minutes)
# 1. Health endpoint
curl -v http://localhost:3000/api/health
# Expected response:
# HTTP/1.1 200 OK
# Content-Type: application/json
# {
# "status": "ok",
# "timestamp": "2025-12-08T02:35:00Z",
# "database": "connected",
# "redis": "connected",
# "workers": "running"
# }
Health Check Criteria:
- HTTP 200 response
- All services showing as "connected" or "running"
- No error messages in response
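An HTTP 200 alone does not confirm the second criterion, since the body could still report a disconnected component. The sketch below checks the body against the field names and values from the expected response above. It uses `grep` rather than a JSON parser to stay dependency-free, and the function name `health_body_ok` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed when a health response body reports every
# component as healthy, using the field names from the expected response.
health_body_ok() {
  local body="$1" pair key value
  for pair in 'status:ok' 'database:connected' 'redis:connected' 'workers:running'; do
    key="${pair%%:*}"
    value="${pair#*:}"
    # Match "key": "value" while tolerating whitespace around the colon.
    if ! printf '%s' "$body" |
      grep -Eq "\"${key}\"[[:space:]]*:[[:space:]]*\"${value}\""; then
      echo "UNHEALTHY: expected ${key}=${value}" >&2
      return 1
    fi
  done
  return 0
}

# Usage sketch:
# health_body_ok "$(curl -sf http://localhost:3000/api/health)" || echo "Health check failed"
```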
B. Critical Endpoint Tests (3 minutes)
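# Test authentication endpoint and capture the token used by the requests below
AUTH_TOKEN=$(curl -s -X POST http://localhost:3000/api/auth/login \
-H "Content-Type: application/json" \
-d '{"email":"demo@navidocs.app","password":"test"}' \
| jq -r '.token')
# Expected response body: { "token": "...", "user": {...} }
# HTTP 200 or 401 (depending on demo account); without a token in the
# response, AUTH_TOKEN is "null" or empty and the requests below will fail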
# Test boat listing endpoint
curl -H "Authorization: Bearer ${AUTH_TOKEN}" \
http://localhost:3000/api/boats \
| jq '.length'
# Expected: Numeric count (could be 0 if no boats)
# Test warranty endpoint
curl -H "Authorization: Bearer ${AUTH_TOKEN}" \
http://localhost:3000/api/warranties/expiring \
| jq '.'
# Expected: Array of warranties (could be empty [])
# Test warranty creation
curl -X POST -H "Authorization: Bearer ${AUTH_TOKEN}" \
-H "Content-Type: application/json" \
-d '{
"boat_id":"test-boat",
"item_name":"Engine",
"purchase_date":"2023-01-15",
"warranty_period_months":24
}' \
http://localhost:3000/api/warranties
# Expected: { "id": "...", "expiration_date": "2025-01-15" }
Endpoint Test Checklist:
- /api/health returns 200
- /api/auth/login responds (200 or 401)
- /api/boats returns data or empty array
- /api/warranties/expiring returns array
- POST /api/warranties creates warranty successfully
C. Database Verification (2 minutes)
# Verify all new tables exist
sqlite3 /var/www/navidocs/navidocs.db << 'EOF'
.mode column
.headers on
-- Check warranty_tracking table
SELECT COUNT(*) as warranty_count FROM warranty_tracking;
-- Check webhooks table
SELECT COUNT(*) as webhook_count FROM webhooks;
-- Check sale_workflows table
SELECT COUNT(*) as sale_count FROM sale_workflows;
-- Check notification_templates table
SELECT COUNT(*) as template_count FROM notification_templates;
-- Verify indexes created
SELECT COUNT(*) as index_count FROM sqlite_master
WHERE type='index' AND tbl_name IN (
'warranty_tracking', 'webhooks', 'sale_workflows'
);
EOF
# Expected output:
# warranty_count: 0 (or >0 if test data inserted)
# webhook_count: 0
# sale_count: 0
# template_count: >0 (seed templates inserted)
# index_count: >5 (all required indexes)
Database Verification Checklist:
- warranty_tracking table exists
- webhooks table exists
- sale_workflows table exists
- notification_templates table exists
- All indexes created successfully
D. Smoke Tests (3 minutes)
# Run critical smoke tests
npm run test:smoke
# Expected output:
# PASS smoke-tests/warranty-creation.spec.js
# PASS smoke-tests/webhook-delivery.spec.js
# PASS smoke-tests/notification-sending.spec.js
# PASS smoke-tests/database-operations.spec.js
# ============================================
# Smoke Tests: 4 passed, 4 total
# Duration: 1m 30s
# If smoke tests fail, check logs:
pm2 logs navidocs-api --lines 50
Smoke Test Criteria:
- All smoke tests pass
- No timeout errors
- No database connectivity errors
- No authentication errors
E. Error Rate & Logs Monitoring (Continuous for 30 minutes)
# Monitor application logs for errors (pm2 logs streams until interrupted,
# so run each command in its own terminal)
pm2 logs navidocs-api --lines 20
# Monitor worker logs for failed jobs
pm2 logs navidocs-worker --lines 20
# Check error rate in monitoring system
# Example query (if using Sentry/New Relic):
# SELECT COUNT(*) FROM errors WHERE timestamp > now() - 30 minutes
# Alert if:
# - Error rate > 1% of requests
# - Any critical errors in logs
# - Worker jobs consistently failing
# If issues detected:
# 1. Check logs for root cause
# 2. If severe, proceed to ROLLBACK
# 3. If minor, create incident ticket for next sprint
Log Monitoring Checklist:
- No critical errors in logs
- Error rate <1% of requests
- Worker processing jobs successfully
- No database connection errors
- No memory leaks (consistent RAM usage)
Part 4: Rollback Procedure (Emergency Recovery)
When to Rollback
Initiate rollback immediately if:
- API server won't start (after 5 minutes)
- Database migrations fail
- Health check endpoints fail
- Critical business logic broken
- Error rate >5% of requests
- Database corrupted or locked
Do NOT rollback for:
- Minor UI bugs
- Non-critical feature failures
- Cosmetic issues
- Warnings in logs (errors must be critical)
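The error-rate thresholds in this runbook (alert above 1% of requests, rollback above 5%) can be encoded so the decision is not made by estimate during an incident. This is a minimal sketch. The function name `error_rate_action` is hypothetical, and the error and request counts are assumed to come from whichever monitoring system is in use.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify an error rate against the runbook thresholds
# (>5% of requests: rollback, >1%: alert, otherwise ok).
error_rate_action() {
  local errors="$1" total="$2"
  awk -v e="$errors" -v t="$total" 'BEGIN {
    if (t <= 0) { print "unknown"; exit }
    # Compare e/t against 5% and 1% without dividing, to avoid rounding issues.
    if (e * 100 > 5 * t)      print "rollback"
    else if (e * 100 > 1 * t) print "alert"
    else                      print "ok"
  }'
}

# Usage sketch:
# error_rate_action 12 4000    # 0.3% of requests
# error_rate_action 260 4000   # 6.5% of requests
```

The other rollback triggers in the list, such as failed migrations or a corrupted database, still require a human judgment and are not covered by this check.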
Rollback Steps (Automated Script)
#!/bin/bash
# File: /var/www/navidocs/scripts/rollback.sh
# Emergency rollback script
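set -e # Exit on any error
cd /var/www/navidocs # Run git commands from the deployment checkout
ROLLBACK_TIME=$(date -u +%Y-%m-%dT%H:%M:%SZ) # -u so the trailing Z (UTC) is accurate
CURRENT_VERSION=$(git rev-parse --short HEAD)
BACKUP_DIR=/var/www/navidocs/backups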
echo "============================================"
echo "EMERGENCY ROLLBACK INITIATED"
echo "Time: $ROLLBACK_TIME"
echo "Current Version: $CURRENT_VERSION"
echo "============================================"
# Step 1: Stop all services
echo "Step 1: Stopping services..."
pm2 stop navidocs-api navidocs-worker
sleep 3
# Step 2: Find most recent backup
echo "Step 2: Finding latest backup..."
LATEST_BACKUP=$(ls -t "${BACKUP_DIR}"/navidocs.db.backup-* 2>/dev/null | head -1)
if [ -z "$LATEST_BACKUP" ]; then
echo "ERROR: No backup found! Manual recovery required."
exit 1
fi
echo "Using backup: $LATEST_BACKUP"
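# Step 3: Verify backup before restore
echo "Step 3: Verifying backup integrity..."
# PRAGMA integrity_check prints "ok" for a healthy database. A fixed
# table-count threshold is unreliable here: the pre-migration backup has
# fewer tables than the migrated database.
INTEGRITY=$(sqlite3 "$LATEST_BACKUP" "PRAGMA integrity_check;" 2>/dev/null || true)
if [ "$INTEGRITY" != "ok" ]; then
  echo "ERROR: Backup appears corrupted (integrity_check: $INTEGRITY)"
  exit 1
fi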
# Step 4: Restore database
echo "Step 4: Restoring database from backup..."
cp "$LATEST_BACKUP" /var/www/navidocs/navidocs.db
# Verify restore
RESTORED_TABLES=$(sqlite3 /var/www/navidocs/navidocs.db ".tables" 2>/dev/null | wc -w)
echo "Restored database has $RESTORED_TABLES tables"
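# Step 5: Revert code to previous version
echo "Step 5: Reverting code..."
cd /var/www/navidocs
# Pass the last known-good commit as the first argument; HEAD~1 is only a
# fallback and is wrong if the deployment pulled more than one commit
PREVIOUS_VERSION=$(git rev-parse "${1:-HEAD~1}")
git reset --hard "$PREVIOUS_VERSION"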
# Step 6: Reinstall dependencies
echo "Step 6: Reinstalling dependencies..."
npm install --production
# Step 7: Restart services
echo "Step 7: Restarting services..."
pm2 start navidocs-api navidocs-worker
# Step 8: Health check
echo "Step 8: Verifying services..."
sleep 5
pm2 list
# Step 9: Final verification
echo "Step 9: Running health checks..."
RETRY_COUNT=0
while [ $RETRY_COUNT -lt 30 ]; do
if curl -sf http://localhost:3000/api/health > /dev/null; then
echo "✓ Rollback successful - API is responding"
break
fi
RETRY_COUNT=$((RETRY_COUNT+1))
sleep 2
done
if [ $RETRY_COUNT -eq 30 ]; then
echo "ERROR: Rollback failed - API not responding"
exit 1
fi
echo "============================================"
echo "ROLLBACK COMPLETE"
echo "Previous Version: $CURRENT_VERSION"
echo "Rolled Back To: $PREVIOUS_VERSION"
echo "Database Restored From: $LATEST_BACKUP"
echo "Time: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
echo "============================================"
# Send notification to Slack/email
# curl -X POST ... # notification code
Manual Rollback (If Automated Fails)
# 1. Stop services
pm2 stop navidocs-api navidocs-worker
# 2. Restore database (replace TIMESTAMP with actual backup timestamp)
TIMESTAMP="20251208-020000" # Example from backup
cp /var/www/navidocs/backups/navidocs.db.backup-${TIMESTAMP} \
/var/www/navidocs/navidocs.db
# 3. Verify database integrity
sqlite3 /var/www/navidocs/navidocs.db ".tables"
# 4. Revert code
cd /var/www/navidocs
git log --oneline -5 # Find previous good commit
git reset --hard <previous-commit-hash>
# 5. Reinstall dependencies
npm install --production
# 6. Restart services
pm2 start navidocs-api navidocs-worker
# 7. Check recent logs (--nostream prints and exits so the next step can run)
pm2 logs navidocs-api --lines 50 --nostream
pm2 logs navidocs-worker --lines 50 --nostream
# 8. Verify health
curl http://localhost:3000/api/health
Rollback Verification Checklist:
- Services stopped cleanly
- Database restored from backup
- Database integrity verified
- Code reverted to previous version
- Dependencies reinstalled
- Services restarted successfully
- Health check passes
- No errors in logs
Post-Rollback Actions
# After rollback is complete and verified:
# 1. Document the incident
cat > /tmp/rollback_incident.log << 'EOF'
ROLLBACK INCIDENT REPORT
Date: 2025-12-08
Time: 02:35 UTC
Duration: 25 minutes
Reason: [Root cause analysis]
Version Rolled Back From: [commit hash]
Version Restored To: [commit hash]
Data Loss: [None, or describe any writes made after the backup was taken; restoring the backup discards them]
Actions Taken: [List all steps]
Root Cause: [Analysis]
Prevention: [How to avoid in future]
EOF
# 2. Notify team
# Email incident report to team
# Post to incident channel in Slack
# 3. Create post-mortem ticket
# Add to sprint backlog: "Post-mortem: Deployment failure on 2025-12-08"
# 4. Review deployment process
# Schedule review meeting for next day
# Document lessons learned
Part 5: Monitoring & Support
Real-Time Monitoring Dashboard
Tools to Monitor:
- Error Tracking: Sentry/New Relic
  - Alert if error rate >1% within 5 minutes
  - Critical errors require immediate investigation
- Performance Monitoring: New Relic/Datadog
  - API response time <200ms (p95)
  - Database query time <100ms (p95)
  - Worker job processing time <5s (p95)
- Infrastructure Monitoring: CloudWatch/Datadog
  - CPU usage <80%
  - Memory usage <85%
  - Disk usage <90%
  - Network throughput normal
- Application Logs: PM2/ELK Stack
  - Check for "ERROR" and "CRITICAL" messages
  - Monitor for "OutOfMemory" warnings
  - Check for "Database locked" errors
Incident Response
If Issues Detected During First 30 Minutes:
# Immediate steps:
# 1. Check if issue is configuration (env var, network) or code
pm2 logs navidocs-api --lines 100
pm2 logs navidocs-worker --lines 100
# 2. If quick fix available (< 5 minutes):
# - Apply fix
# - Restart services
# - Monitor for 10 minutes
# 3. If issue is critical or fix takes >5 minutes:
# - Execute rollback (see Part 4)
# - Create incident ticket
# - Plan hotfix for next deployment
# 4. If issue is intermittent:
# - Monitor for 15 additional minutes
# - Check system resources (memory, disk, CPU)
# - If issue persists, rollback
Deployment Success Criteria
Deployment is SUCCESSFUL if:
- All tests pass (unit, integration, E2E)
- Deployment completes without errors
- All health checks pass
- All smoke tests pass
- Error rate <0.1% during first 24 hours
- No critical issues in logs
- Database integrity verified
- All new features working as expected
Deployment is FAILED if:
- Tests fail before deployment
- Deployment process errors
- Health checks fail after deployment
- Smoke tests fail
- Error rate >1% during first hour
- Critical errors in logs
- Database corruption detected
- Rollback required
Appendix A: Quick Reference
Deployment Timeline
Pre-Deployment Checks: 5 min (tests, backups)
Stakeholder Notification: 2 min
Stop Workers: 3 min
Database Backup: 5 min
Code Deploy: 8 min
Dependencies: 4 min
Build: 3 min
Migrations: 5 min
API Restart: 2 min
Worker Restart: 2 min
─────────────────────────────
TOTAL DOWNTIME: ~2 min (migration window)
TOTAL TIME: ~39 min (with all steps)
Post-Deployment Validation: 10 min
Monitoring Period: 30 min (continuous)
Critical Commands
# Health check
curl http://localhost:3000/api/health
# View logs
pm2 logs navidocs-api
pm2 logs navidocs-worker
# View process status
pm2 list
# Restart services
pm2 restart navidocs-api navidocs-worker
# Emergency rollback
/var/www/navidocs/scripts/rollback.sh
# Database backup
sqlite3 /var/www/navidocs/navidocs.db ".backup '/var/www/navidocs/backups/backup.db'"
# Check queue size
redis-cli LLEN navidocs:queue:default
Emergency Contacts
Tech Lead: [Name/Email]
DevOps: [Name/Email]
On-Call: [Phone/Email]
Incident Channel: #incident-response (Slack)
Appendix B: Testing Checklist Template
Use this template for deployment day:
#!/bin/bash
# Pre-Deployment Checklist - Copy and use on deployment day
# -u reports UTC, matching the "UTC" label on the sign-off timestamp
DEPLOYMENT_DATE=$(date -u +%Y-%m-%d)
DEPLOYMENT_TIME=$(date -u +%H:%M:%S)
echo "NaviDocs Deployment Checklist"
echo "Date: $DEPLOYMENT_DATE"
echo "Time: $DEPLOYMENT_TIME"
echo "========================================"
# Tests
echo "[ ] Unit tests passing"
echo "[ ] Integration tests passing"
echo "[ ] E2E tests passing"
echo "[ ] Security audit passed"
# Backups
echo "[ ] Database backup created"
echo "[ ] Backup verified"
# Environment
echo "[ ] .env.production configured"
echo "[ ] SSL certificate valid"
echo "[ ] Secrets in vault, not in code"
# Deployment
echo "[ ] Code reviewed and approved"
echo "[ ] Dependencies check passed"
echo "[ ] Build successful"
echo "[ ] Migrations ready"
# Deployment Steps
echo "[ ] Pre-deployment checks complete"
echo "[ ] Workers stopped"
echo "[ ] Database backed up"
echo "[ ] Code deployed"
echo "[ ] Dependencies installed"
echo "[ ] Build completed"
echo "[ ] Migrations applied"
echo "[ ] API restarted"
echo "[ ] Workers restarted"
# Post-Deployment
echo "[ ] Health checks pass"
echo "[ ] Smoke tests pass"
echo "[ ] Critical endpoints responding"
echo "[ ] Database verified"
echo "[ ] No errors in logs"
# Sign-Off
echo "========================================"
echo "Deployed by: [Your Name]"
echo "Approved by: [Tech Lead Name]"
echo "Timestamp: $DEPLOYMENT_DATE $DEPLOYMENT_TIME UTC"
Document Status: Ready for Phase 2 Synthesis
Next Steps: Await completion of agents S4-H01 through S4-H09, then synthesize all outputs in session-4-handoff.md