NaviDocs OCR Pipeline
Overview
The OCR pipeline processes PDF documents in the background, extracting text from each page and indexing it in Meilisearch for fast, searchable access.
Architecture
Upload PDF → Create OCR Job → BullMQ Queue → OCR Worker → Database + Meilisearch
Components
- OCR Service (services/ocr.js)
  - Converts PDF pages to images using external tools (pdftoppm or ImageMagick)
  - Runs Tesseract.js OCR on each image
  - Returns structured data with text and confidence scores
- Search Service (services/search.js)
  - Indexes document pages in Meilisearch
  - Builds proper document structure with metadata
  - Supports multi-vertical indexing (boat, marina, property)
- OCR Worker (workers/ocr-worker.js)
  - BullMQ background worker processing jobs from the 'ocr-jobs' queue
  - Updates job progress in real time (0-100%)
  - Saves OCR results to the document_pages table
  - Indexes pages in Meilisearch with full metadata
  - Updates document status to 'indexed' when complete
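The worker's per-page loop described above can be sketched as follows. This is a minimal illustration, not the code in workers/ocr-worker.js: `processDocument` and the injected callbacks `ocrPage`, `savePage`, `indexPage`, and `reportProgress` are hypothetical names standing in for the OCR service, the database write, the Meilisearch call, and the BullMQ progress update.

```javascript
// Hypothetical sketch: all four callbacks are injected stand-ins,
// not the real functions used by workers/ocr-worker.js.
async function processDocument(pageImages, { ocrPage, savePage, indexPage, reportProgress }) {
  const total = pageImages.length;
  for (let i = 0; i < total; i++) {
    const pageNumber = i + 1;
    // 1. OCR the page image.
    const { text, confidence } = await ocrPage(pageImages[i]);
    // 2. Persist the result (document_pages in the real pipeline).
    await savePage({ pageNumber, text, confidence });
    // 3. Make the page searchable.
    await indexPage({ pageNumber, text, confidence });
    // 4. Report whole-number progress in the 0-100 range.
    await reportProgress(Math.round((pageNumber / total) * 100));
  }
  // The document is marked 'indexed' once every page has been handled.
  return { status: 'indexed', pages: total };
}
```

Injecting the callbacks keeps the loop testable without Redis, SQLite, or Meilisearch running.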
Setup
1. Install System Dependencies
The OCR pipeline requires PDF-to-image conversion tools:
# Ubuntu/Debian
sudo apt-get update
sudo apt-get install -y poppler-utils imagemagick tesseract-ocr
# macOS
brew install poppler imagemagick tesseract
# Verify installation
which pdftoppm
which convert
which tesseract
2. Install Node Dependencies
cd server
npm install
3. Start Redis
BullMQ requires Redis for job queue management:
# Using Docker
docker run -d -p 6379:6379 redis:alpine
# Or install locally
sudo apt-get install redis-server
redis-server
4. Start Meilisearch
# Using Docker
docker run -d -p 7700:7700 \
-e MEILI_MASTER_KEY=masterKey \
-v $(pwd)/meili_data:/meili_data \
getmeili/meilisearch:latest
# Or download binary
curl -L https://install.meilisearch.com | sh
./meilisearch --master-key=masterKey
5. Start the OCR Worker
# Run worker directly
node workers/ocr-worker.js
# Or use process manager
pm2 start workers/ocr-worker.js --name ocr-worker
Usage
Creating an OCR Job
import { Queue } from 'bullmq';
import { v4 as uuidv4 } from 'uuid';
const ocrQueue = new Queue('ocr-jobs', {
connection: { host: '127.0.0.1', port: 6379 }
});
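// Create job in database (db is assumed to be an open SQLite handle, e.g. better-sqlite3)
const jobId = uuidv4();
db.prepare(`
  INSERT INTO ocr_jobs (id, document_id, status, created_at)
  VALUES (?, ?, 'pending', ?)
`).run(jobId, documentId, Math.floor(Date.now() / 1000)); // Unix seconds; the column is INTEGER
// Add job to queue, reusing the database ID as the BullMQ job ID
// so that ocrQueue.getJob(jobId) can find it later
await ocrQueue.add('process-document', {
  documentId,
  jobId,
  filePath: '/path/to/document.pdf'
}, { jobId });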
Monitoring Job Progress
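// Get job from queue (this lookup only succeeds if the job was added
// with { jobId } as its BullMQ job ID; otherwise getJob returns undefined)
const job = await ocrQueue.getJob(jobId);
// Check progress: in BullMQ, progress is a property, not a method
const progress = job?.progress ?? 0; // 0-100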
// Check database for status
const jobStatus = db.prepare(`
SELECT status, progress, error FROM ocr_jobs WHERE id = ?
`).get(jobId);
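Because a job moves through pending, processing, and then completed or failed, a caller usually polls until a terminal status appears. The helper below is a sketch under assumptions: `waitForJob` and its `fetchStatus` callback are hypothetical names, and `fetchStatus` is expected to return a row shaped like the SELECT above.

```javascript
// Statuses after which the job will not change again.
const TERMINAL_STATUSES = new Set(['completed', 'failed']);

// fetchStatus is a hypothetical callback, e.g. one that runs the
// ocr_jobs SELECT and returns { status, progress, error }.
async function waitForJob(fetchStatus, { intervalMs = 1000, maxAttempts = 600 } = {}) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const row = await fetchStatus();
    if (row && TERMINAL_STATUSES.has(row.status)) {
      return row;
    }
    // Wait before the next poll so the database is not hammered.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Job did not finish after ${maxAttempts} polls`);
}
```

The `maxAttempts` bound keeps a stuck job from blocking the caller forever; the defaults shown are illustrative, not values taken from the project.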
Searching Indexed Pages
import { searchPages } from './services/search.js';
// Search all pages
const allResults = await searchPages('bilge pump maintenance', {
  limit: 20,
  offset: 0
});
// Search with filters (user-specific)
const userResults = await searchPages('electrical system', {
  filter: `userId = "${userId}" AND vertical = "boating"`,
  limit: 10
});
// Search with organization access
const orgResults = await searchPages('generator', {
  filter: `organizationId IN ["org1", "org2"]`,
  sort: ['pageNumber:asc']
});
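The filter strings above interpolate values directly, which breaks if a value ever contains a double quote. A small builder can quote values consistently. This is a sketch: `quoteFilterValue` and `buildFilter` are hypothetical helpers, and the escaping rule (a backslash before `"` and `\`) is an assumption about Meilisearch's filter grammar that should be checked against the version in use.

```javascript
// Wrap a value in double quotes, escaping backslashes and quotes.
// Assumption: Meilisearch accepts backslash-escaped quotes in filter values.
function quoteFilterValue(value) {
  return `"${String(value).replace(/\\/g, '\\\\').replace(/"/g, '\\"')}"`;
}

// Build a filter expression from optional criteria, joined with AND.
function buildFilter({ userId, vertical, organizationIds } = {}) {
  const clauses = [];
  if (userId !== undefined) {
    clauses.push(`userId = ${quoteFilterValue(userId)}`);
  }
  if (vertical !== undefined) {
    clauses.push(`vertical = ${quoteFilterValue(vertical)}`);
  }
  if (Array.isArray(organizationIds) && organizationIds.length > 0) {
    clauses.push(`organizationId IN [${organizationIds.map(quoteFilterValue).join(', ')}]`);
  }
  return clauses.join(' AND ');
}
```

For example, `buildFilter({ userId: 'user_456', vertical: 'boating' })` produces the same string as the hand-written user filter above.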
Database Schema
ocr_jobs Table
CREATE TABLE ocr_jobs (
  id TEXT PRIMARY KEY,
  document_id TEXT NOT NULL,
  status TEXT DEFAULT 'pending', -- pending, processing, completed, failed
  progress INTEGER DEFAULT 0, -- 0-100
  error TEXT,
  started_at INTEGER,
  completed_at INTEGER,
  created_at INTEGER NOT NULL,
  FOREIGN KEY (document_id) REFERENCES documents(id) ON DELETE CASCADE
);
document_pages Table
CREATE TABLE document_pages (
  id TEXT PRIMARY KEY,
  document_id TEXT NOT NULL,
  page_number INTEGER NOT NULL,
  -- OCR data
  ocr_text TEXT,
  ocr_confidence REAL,
  ocr_language TEXT DEFAULT 'en',
  ocr_completed_at INTEGER,
  -- Search indexing
  search_indexed_at INTEGER,
  meilisearch_id TEXT,
  metadata TEXT, -- JSON
  created_at INTEGER NOT NULL,
  UNIQUE(document_id, page_number),
  FOREIGN KEY (document_id) REFERENCES documents(id) ON DELETE CASCADE
);
Meilisearch Document Structure
Each indexed page follows this structure:
{
  "id": "page_doc_abc123_p7",
  "vertical": "boating",
  "organizationId": "org_xyz789",
  "organizationName": "Smith Family Boats",
  "entityId": "boat_prestige_f49_001",
  "entityName": "Sea Breeze",
  "entityType": "boat",
  "docId": "doc_abc123",
  "userId": "user_456",
  "documentType": "component-manual",
  "title": "8.7 Blackwater System - Maintenance",
  "pageNumber": 7,
  "text": "The blackwater pump is located...",
  "systems": ["plumbing", "waste-management"],
  "categories": ["maintenance", "troubleshooting"],
  "tags": ["bilge", "pump", "blackwater"],
  "boatName": "Sea Breeze",
  "boatMake": "Prestige",
  "boatModel": "F4.9",
  "boatYear": 2024,
  "language": "en",
  "ocrConfidence": 0.94,
  "createdAt": 1740234567,
  "updatedAt": 1740234567
}
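Assembling such a document from a document record and one OCR result might look like the sketch below. `buildPageDocument` is a hypothetical helper covering only a subset of the fields, and the ID pattern `page_<docId>_p<pageNumber>` is inferred from the single example above rather than confirmed by services/search.js.

```javascript
// Hypothetical helper: builds a subset of the fields shown above.
// The id pattern is inferred from the example "page_doc_abc123_p7".
function buildPageDocument(doc, page, nowSeconds = Math.floor(Date.now() / 1000)) {
  return {
    id: `page_${doc.id}_p${page.pageNumber}`,
    vertical: doc.vertical,
    organizationId: doc.organizationId,
    docId: doc.id,
    userId: doc.userId,
    documentType: doc.documentType,
    title: doc.title,
    pageNumber: page.pageNumber,
    text: page.text,
    language: page.language ?? 'en',
    ocrConfidence: page.confidence,
    // Unix timestamps in seconds, matching the example values.
    createdAt: nowSeconds,
    updatedAt: nowSeconds,
  };
}
```

Keeping the ID deterministic means re-indexing the same page overwrites the existing Meilisearch entry instead of creating a duplicate.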
Error Handling
Each stage of the OCR pipeline has its own failure behavior:
- PDF Conversion Errors: Falls back to alternative tools or returns blank page
- OCR Errors: Stores page with empty text and confidence = 0
- Indexing Errors: Logs error but continues processing other pages
- Worker Errors: Updates job status to 'failed' and stores error message
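The OCR and indexing rules in the list above can be sketched as two small wrappers. The names `ocrPageSafely` and `indexPagesSafely`, and the injected `ocrPage`, `indexPage`, and `log` callbacks, are hypothetical; the real worker may structure this differently.

```javascript
// OCR errors: keep the page, but with empty text and confidence 0.
async function ocrPageSafely(image, ocrPage, log = console.error) {
  try {
    return await ocrPage(image);
  } catch (err) {
    log(`OCR failed: ${err.message}`);
    return { text: '', confidence: 0 };
  }
}

// Indexing errors: log the failure and carry on with the remaining pages.
// Returns the number of pages that could not be indexed.
async function indexPagesSafely(pages, indexPage, log = console.error) {
  let failures = 0;
  for (const page of pages) {
    try {
      await indexPage(page);
    } catch (err) {
      failures += 1;
      log(`Indexing failed for page ${page.pageNumber}: ${err.message}`);
    }
  }
  return failures;
}
```

Returning the failure count lets the caller decide whether a partially indexed document should still be treated as complete.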
Performance
Optimization Tips
- Concurrency: Adjust the OCR_CONCURRENCY environment variable (default: 2)
- Rate Limiting: The worker processes at most 5 jobs per minute
- Image Quality: Uses 300 DPI for optimal OCR accuracy
- Cleanup: Temporary image files are automatically deleted
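In BullMQ, concurrency and rate limiting are both worker options, so the first two tips map onto a small options object. The sketch below assumes that mapping; `workerOptionsFromEnv` is a hypothetical helper, and falling back to 2 on an invalid value is an illustrative choice.

```javascript
// Build BullMQ worker options from the environment.
// Falls back to the documented default of 2 when OCR_CONCURRENCY
// is missing or not a positive integer.
function workerOptionsFromEnv(env = process.env) {
  const parsed = Number.parseInt(env.OCR_CONCURRENCY ?? '', 10);
  return {
    concurrency: Number.isInteger(parsed) && parsed > 0 ? parsed : 2,
    // At most 5 jobs per 60,000 ms, i.e. 5 jobs per minute.
    limiter: { max: 5, duration: 60000 },
  };
}

// Intended use (not executed here, requires bullmq and Redis):
//   new Worker('ocr-jobs', processor, { connection, ...workerOptionsFromEnv() });
```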
Benchmarks
- Small PDF (10 pages): ~30-60 seconds
- Medium PDF (50 pages): ~2-5 minutes
- Large PDF (200 pages): ~10-20 minutes
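These figures work out to roughly 3 to 6 seconds per page (the 50-page figure implies a slightly lower floor of about 2.4 seconds). A rough estimator based on that range is sketched below; `estimateOcrSeconds` is a hypothetical helper, and the per-page bounds are derived from the benchmarks above, not measured separately.

```javascript
// Rough processing-time estimate from a per-page range.
// Defaults (3-6 s/page) are derived from the benchmark figures above.
function estimateOcrSeconds(pageCount, { minPerPage = 3, maxPerPage = 6 } = {}) {
  return {
    minSeconds: pageCount * minPerPage,
    maxSeconds: pageCount * maxPerPage,
  };
}
```

Actual throughput depends on hardware, page complexity, and the concurrency setting, so the result is only a planning aid.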
Troubleshooting
PDF Conversion Fails
# Check if tools are installed
node -e "import('./services/ocr.js').then(m => console.log(m.checkPDFTools()))"
# Install missing tools
sudo apt-get install poppler-utils imagemagick
Tesseract Language Data Missing
# Install language data
sudo apt-get install tesseract-ocr-eng tesseract-ocr-fra
# For multiple languages
sudo apt-get install tesseract-ocr-all
Redis Connection Errors
# Check Redis status
redis-cli ping
# Set Redis host/port
export REDIS_HOST=localhost
export REDIS_PORT=6379
Meilisearch Indexing Fails
# Check Meilisearch is running
curl http://localhost:7700/health
# Set environment variables
export MEILISEARCH_HOST=http://localhost:7700
export MEILISEARCH_MASTER_KEY=masterKey
Development
Running Tests
# Test OCR service
node -e "
import('./services/ocr.js').then(async (ocr) => {
const results = await ocr.extractTextFromPDF('/path/to/test.pdf');
console.log(results);
});
"
# Test search service
node -e "
import('./services/search.js').then(async (search) => {
const results = await search.searchPages('test query');
console.log(results);
});
"
Monitoring Worker
# View worker logs
tail -f logs/ocr-worker.log
# Monitor with PM2
pm2 logs ocr-worker
# View queue status
redis-cli
> KEYS bull:ocr-jobs:*
> LLEN bull:ocr-jobs:wait
Production Deployment
Using PM2
# Start worker with PM2
pm2 start workers/ocr-worker.js --name ocr-worker --instances 2
# Save PM2 config
pm2 save
# Auto-start on boot
pm2 startup
Using Docker
FROM node:20-alpine
# Install system dependencies
RUN apk add --no-cache \
    poppler-utils \
    imagemagick \
    tesseract-ocr \
    tesseract-ocr-data-eng
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY . .
CMD ["node", "workers/ocr-worker.js"]
Environment Variables
# Redis
REDIS_HOST=localhost
REDIS_PORT=6379
# Meilisearch
MEILISEARCH_HOST=http://localhost:7700
MEILISEARCH_MASTER_KEY=masterKey
MEILISEARCH_INDEX_NAME=navidocs-pages
# Database
DATABASE_PATH=/data/navidocs.db
# Worker
OCR_CONCURRENCY=2
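One way to read these variables at startup is a single loader that applies defaults and fails fast on a missing secret. This is a sketch: `loadConfig` is a hypothetical helper, treating the values listed above as defaults is an assumption, and requiring MEILISEARCH_MASTER_KEY is an illustrative choice rather than documented behavior.

```javascript
// Read pipeline settings from the environment in one place.
// Assumption: the values listed above double as defaults.
function loadConfig(env = process.env) {
  if (!env.MEILISEARCH_MASTER_KEY) {
    // Fail at startup instead of on the first indexing call.
    throw new Error('MEILISEARCH_MASTER_KEY must be set');
  }
  return {
    redis: {
      host: env.REDIS_HOST ?? 'localhost',
      port: Number(env.REDIS_PORT ?? 6379),
    },
    meilisearch: {
      host: env.MEILISEARCH_HOST ?? 'http://localhost:7700',
      masterKey: env.MEILISEARCH_MASTER_KEY,
      indexName: env.MEILISEARCH_INDEX_NAME ?? 'navidocs-pages',
    },
    databasePath: env.DATABASE_PATH ?? '/data/navidocs.db',
  };
}
```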
License
MIT