# NaviDocs OCR Pipeline

## Overview

The OCR pipeline processes PDF documents in the background, extracting text from each page and indexing it in Meilisearch for fast, searchable access.

## Architecture

```
Upload PDF → Create OCR Job → BullMQ Queue → OCR Worker → Database + Meilisearch
```

### Components

1. **OCR Service** (`services/ocr.js`)
   - Converts PDF pages to images using external tools (pdftoppm or ImageMagick)
   - Runs Tesseract.js OCR on each image
   - Returns structured data with text and confidence scores

2. **Search Service** (`services/search.js`)
   - Indexes document pages in Meilisearch
   - Builds proper document structure with metadata
   - Supports multi-vertical indexing (boat, marina, property)

3. **OCR Worker** (`workers/ocr-worker.js`)
   - BullMQ background worker processing jobs from 'ocr-jobs' queue
   - Updates job progress in real-time (0-100%)
   - Saves OCR results to `document_pages` table
   - Indexes pages in Meilisearch with full metadata
   - Updates document status to 'indexed' when complete

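The worker's 0-100% progress reporting can be pictured as a per-page percentage. The following is a minimal sketch under that assumption, not the code in `workers/ocr-worker.js`; the helper name `progressPercent` is hypothetical.

```javascript
// Hypothetical helper: map pages processed so far onto the 0-100 progress scale.
function progressPercent(pagesDone, totalPages) {
  // Guard against an unknown or empty page count.
  if (!Number.isInteger(totalPages) || totalPages <= 0) return 0;
  // Clamp so that a miscounted page can never push progress outside 0-100.
  const clamped = Math.min(Math.max(pagesDone, 0), totalPages);
  return Math.round((clamped / totalPages) * 100);
}

console.log(progressPercent(7, 20)); // 35
```

A worker could pass a value like this to BullMQ's `job.updateProgress()` after each page and mirror it into the `progress` column of `ocr_jobs`.
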
## Setup

### 1. Install System Dependencies

The OCR pipeline requires PDF to image conversion tools:

```bash
# Ubuntu/Debian
sudo apt-get update
sudo apt-get install -y poppler-utils imagemagick tesseract-ocr

# macOS
brew install poppler imagemagick tesseract

# Verify installation
which pdftoppm
which convert
which tesseract
```

### 2. Install Node Dependencies

```bash
cd server
npm install
```

### 3. Start Redis

BullMQ requires Redis for job queue management:

```bash
# Using Docker
docker run -d -p 6379:6379 redis:alpine

# Or install locally
sudo apt-get install redis-server
redis-server
```

### 4. Start Meilisearch

```bash
# Using Docker (current official images keep their data under /meili_data)
docker run -d -p 7700:7700 \
  -e MEILI_MASTER_KEY=masterKey \
  -v $(pwd)/meili_data:/meili_data \
  getmeili/meilisearch:latest

# Or download binary
curl -L https://install.meilisearch.com | sh
./meilisearch --master-key=masterKey
```

### 5. Start the OCR Worker

```bash
# Run worker directly
node workers/ocr-worker.js

# Or use process manager
pm2 start workers/ocr-worker.js --name ocr-worker
```

## Usage

### Creating an OCR Job

```javascript
import { Queue } from 'bullmq';
import { v4 as uuidv4 } from 'uuid';

// `db` (an open SQLite handle exposing prepare()) and `documentId`
// are assumed to be provided by the calling code.
const ocrQueue = new Queue('ocr-jobs', {
  connection: { host: '127.0.0.1', port: 6379 }
});

// Create job in database (timestamps are integer Unix seconds)
const jobId = uuidv4();
db.prepare(`
  INSERT INTO ocr_jobs (id, document_id, status, created_at)
  VALUES (?, ?, 'pending', ?)
`).run(jobId, documentId, Math.floor(Date.now() / 1000));

// Add job to queue, reusing the database ID as the BullMQ job ID
// so that ocrQueue.getJob(jobId) can find the job later
await ocrQueue.add('process-document', {
  documentId: documentId,
  jobId: jobId,
  filePath: '/path/to/document.pdf'
}, { jobId });
```

### Monitoring Job Progress

```javascript
// Get job from queue. getJob() expects the BullMQ job ID, so this lookup
// only succeeds if the job was added with a matching `jobId` option.
const job = await ocrQueue.getJob(jobId);

// Check progress (in BullMQ, `progress` is a property, not a method)
const progress = job ? job.progress : 0; // 0-100

// Check database for status
const jobStatus = db.prepare(`
  SELECT status, progress, error FROM ocr_jobs WHERE id = ?
`).get(jobId);
```

### Searching Indexed Pages

```javascript
import { searchPages } from './services/search.js';

// Search all pages
const allResults = await searchPages('bilge pump maintenance', {
  limit: 20,
  offset: 0
});

// Search with filters (user-specific)
const userResults = await searchPages('electrical system', {
  filter: `userId = "${userId}" AND vertical = "boating"`,
  limit: 10
});

// Search with organization access
const orgResults = await searchPages('generator', {
  filter: `organizationId IN ["org1", "org2"]`,
  sort: ['pageNumber:asc']
});
```

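Because the filter strings above are built by interpolating IDs, a malformed or hostile value could change the meaning of the filter. One possible guard is sketched below; `buildPageFilter`, `quoteId`, and the allowed-character rule are illustrative assumptions and are not part of `services/search.js`.

```javascript
// Hypothetical helper: build a Meilisearch filter string from validated IDs.
// Values are restricted to letters, digits, "_" and "-" so that a crafted ID
// cannot break out of the quoted string.
const SAFE_ID = /^[A-Za-z0-9_-]+$/;

function quoteId(value) {
  if (!SAFE_ID.test(value)) throw new Error(`Unsafe filter value: ${value}`);
  return `"${value}"`;
}

function buildPageFilter({ userId, vertical, organizationIds = [] }) {
  const clauses = [];
  if (userId) clauses.push(`userId = ${quoteId(userId)}`);
  if (vertical) clauses.push(`vertical = ${quoteId(vertical)}`);
  if (organizationIds.length > 0) {
    clauses.push(`organizationId IN [${organizationIds.map(quoteId).join(', ')}]`);
  }
  // All clauses must hold, so they are joined with AND.
  return clauses.join(' AND ');
}

console.log(buildPageFilter({ userId: 'user_456', vertical: 'boating' }));
// userId = "user_456" AND vertical = "boating"
```

The resulting string can be passed as the `filter` option in the same way as the hand-written filters above.
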
## Database Schema

### ocr_jobs Table

```sql
CREATE TABLE ocr_jobs (
  id TEXT PRIMARY KEY,
  document_id TEXT NOT NULL,
  status TEXT DEFAULT 'pending', -- pending, processing, completed, failed
  progress INTEGER DEFAULT 0, -- 0-100
  error TEXT,
  started_at INTEGER,
  completed_at INTEGER,
  created_at INTEGER NOT NULL,
  FOREIGN KEY (document_id) REFERENCES documents(id) ON DELETE CASCADE
);
```

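The `status` column lists four values but not the order in which a job moves through them. A plausible lifecycle, stated here as an assumption rather than as the worker's actual behavior, is `pending → processing → completed` or `failed`. The sketch below encodes that assumption; `canTransition` is a hypothetical helper.

```javascript
// Assumed lifecycle, inferred from the status values in the schema comment:
// pending -> processing -> completed | failed
const ALLOWED_TRANSITIONS = new Map([
  ['pending', ['processing']],
  ['processing', ['completed', 'failed']],
  ['completed', []],
  ['failed', []]
]);

// Returns true only when moving from `from` to `to` is part of the assumed lifecycle.
function canTransition(from, to) {
  return (ALLOWED_TRANSITIONS.get(from) ?? []).includes(to);
}

console.log(canTransition('pending', 'processing')); // true
console.log(canTransition('completed', 'processing')); // false
```
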
### document_pages Table

```sql
CREATE TABLE document_pages (
  id TEXT PRIMARY KEY,
  document_id TEXT NOT NULL,
  page_number INTEGER NOT NULL,

  -- OCR data
  ocr_text TEXT,
  ocr_confidence REAL,
  ocr_language TEXT DEFAULT 'en',
  ocr_completed_at INTEGER,

  -- Search indexing
  search_indexed_at INTEGER,
  meilisearch_id TEXT,

  metadata TEXT, -- JSON
  created_at INTEGER NOT NULL,

  UNIQUE(document_id, page_number),
  FOREIGN KEY (document_id) REFERENCES documents(id) ON DELETE CASCADE
);
```

## Meilisearch Document Structure

Each indexed page follows this structure:

```json
{
  "id": "page_doc_abc123_p7",
  "vertical": "boating",

  "organizationId": "org_xyz789",
  "organizationName": "Smith Family Boats",

  "entityId": "boat_prestige_f49_001",
  "entityName": "Sea Breeze",
  "entityType": "boat",

  "docId": "doc_abc123",
  "userId": "user_456",

  "documentType": "component-manual",
  "title": "8.7 Blackwater System - Maintenance",
  "pageNumber": 7,
  "text": "The blackwater pump is located...",

  "systems": ["plumbing", "waste-management"],
  "categories": ["maintenance", "troubleshooting"],
  "tags": ["bilge", "pump", "blackwater"],

  "boatName": "Sea Breeze",
  "boatMake": "Prestige",
  "boatModel": "F4.9",
  "boatYear": 2024,

  "language": "en",
  "ocrConfidence": 0.94,

  "createdAt": 1740234567,
  "updatedAt": 1740234567
}
```

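A sketch of how such a document might be assembled from a document record and one OCR'd page follows. The function name `buildPageDocument` and the shapes of `doc` and `page` are assumptions for illustration; the ID pattern `page_<docId>_p<pageNumber>` is inferred from the example above, and only a subset of the fields is shown.

```javascript
// Hypothetical helper: assemble the Meilisearch document for one OCR'd page.
function buildPageDocument(doc, page, nowSeconds = Math.floor(Date.now() / 1000)) {
  return {
    // ID pattern inferred from the example: page_<docId>_p<pageNumber>
    id: `page_${doc.id}_p${page.pageNumber}`,
    vertical: doc.vertical,
    organizationId: doc.organizationId,
    docId: doc.id,
    userId: doc.userId,
    documentType: doc.documentType,
    title: doc.title,
    pageNumber: page.pageNumber,
    text: page.text,
    language: page.language || 'en',
    ocrConfidence: page.confidence,
    // Timestamps are Unix seconds, matching the example values.
    createdAt: nowSeconds,
    updatedAt: nowSeconds
  };
}

const example = buildPageDocument(
  { id: 'doc_abc123', vertical: 'boating', organizationId: 'org_xyz789',
    userId: 'user_456', documentType: 'component-manual',
    title: '8.7 Blackwater System - Maintenance' },
  { pageNumber: 7, text: 'The blackwater pump is located...', confidence: 0.94 },
  1740234567
);
console.log(example.id); // page_doc_abc123_p7
```
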
## Error Handling

The OCR pipeline handles errors gracefully:

- **PDF Conversion Errors**: Falls back to alternative tools or returns blank page
- **OCR Errors**: Stores page with empty text and confidence = 0
- **Indexing Errors**: Logs error but continues processing other pages
- **Worker Errors**: Updates job status to 'failed' and stores error message

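The per-page OCR fallback described above can be sketched as a small wrapper. This is an illustration under stated assumptions, not the code in `services/ocr.js`: `ocrPageSafely` is a hypothetical name, and `recognize` stands in for whatever function performs OCR on one page image and resolves to `{ text, confidence }`.

```javascript
// Hypothetical wrapper around a per-page OCR call.
async function ocrPageSafely(recognize, imagePath) {
  try {
    const { text, confidence } = await recognize(imagePath);
    return { text, confidence, error: null };
  } catch (err) {
    // Mirror the documented fallback: keep the page, but with empty text
    // and confidence 0, so the remaining pages can still be processed.
    return { text: '', confidence: 0, error: err.message };
  }
}

// Usage with a stand-in recognizer that always fails.
ocrPageSafely(async () => { throw new Error('corrupt image'); }, 'page-3.png')
  .then((result) => console.log(result));
```

Returning the error message alongside the blank result lets the caller log it without aborting the job.
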
## Performance

### Optimization Tips

1. **Concurrency**: Adjust `OCR_CONCURRENCY` environment variable (default: 2)
2. **Rate Limiting**: Worker processes max 5 jobs per minute
3. **Image Quality**: Uses 300 DPI for optimal OCR accuracy
4. **Cleanup**: Temporary image files are automatically deleted

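The concurrency and rate-limit settings above map naturally onto BullMQ's `concurrency` and `limiter` worker options. The sketch below shows one way they might be derived; `workerOptionsFromEnv` is a hypothetical helper, and the actual worker may configure these differently.

```javascript
// Hypothetical helper: derive BullMQ Worker options from the environment.
function workerOptionsFromEnv(env = process.env) {
  const parsed = Number.parseInt(env.OCR_CONCURRENCY ?? '', 10);
  return {
    // Fall back to the documented default of 2 when the variable is unset or invalid.
    concurrency: Number.isInteger(parsed) && parsed > 0 ? parsed : 2,
    // At most 5 jobs per 60 seconds, matching the documented rate limit.
    limiter: { max: 5, duration: 60_000 }
  };
}

console.log(workerOptionsFromEnv({ OCR_CONCURRENCY: '4' }).concurrency); // 4
console.log(workerOptionsFromEnv({}).concurrency); // 2
```

An object like this could be spread into the options passed to `new Worker('ocr-jobs', processor, { connection, ...options })`.
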
### Benchmarks

- Small PDF (10 pages): ~30-60 seconds
- Medium PDF (50 pages): ~2-5 minutes
- Large PDF (200 pages): ~10-20 minutes

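These figures work out to roughly 3-6 seconds per page (for example, 30-60 seconds over 10 pages). As a rough planning aid, and only as an approximation derived from the numbers above, a per-page estimate could look like this; `estimateOcrSeconds` is a hypothetical helper.

```javascript
// Rough per-page cost implied by the figures above (an approximation, not a measurement).
const SECONDS_PER_PAGE = { min: 3, max: 6 };

// Returns the estimated processing time range in seconds for a PDF of `pageCount` pages.
function estimateOcrSeconds(pageCount) {
  return {
    min: pageCount * SECONDS_PER_PAGE.min,
    max: pageCount * SECONDS_PER_PAGE.max
  };
}

console.log(estimateOcrSeconds(200)); // { min: 600, max: 1200 }, i.e. 10-20 minutes
```
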
## Troubleshooting

### PDF Conversion Fails

```bash
# Check if tools are installed
node -e "import('./services/ocr.js').then(m => console.log(m.checkPDFTools()))"

# Install missing tools
sudo apt-get install poppler-utils imagemagick
```

### Tesseract Language Data Missing

```bash
# Install language data
sudo apt-get install tesseract-ocr-eng tesseract-ocr-fra

# For multiple languages
sudo apt-get install tesseract-ocr-all
```

### Redis Connection Errors

```bash
# Check Redis status
redis-cli ping

# Set Redis host/port
export REDIS_HOST=localhost
export REDIS_PORT=6379
```

### Meilisearch Indexing Fails

```bash
# Check Meilisearch is running
curl http://localhost:7700/health

# Set environment variables
export MEILISEARCH_HOST=http://localhost:7700
export MEILISEARCH_MASTER_KEY=masterKey
```

## Development

### Running Tests

```bash
# Test OCR service
node -e "
import('./services/ocr.js').then(async (ocr) => {
  const results = await ocr.extractTextFromPDF('/path/to/test.pdf');
  console.log(results);
});
"

# Test search service
node -e "
import('./services/search.js').then(async (search) => {
  const results = await search.searchPages('test query');
  console.log(results);
});
"
```

### Monitoring Worker

```bash
# View worker logs
tail -f logs/ocr-worker.log

# Monitor with PM2
pm2 logs ocr-worker

# View queue status
redis-cli
> KEYS bull:ocr-jobs:*
> LLEN bull:ocr-jobs:wait
```

## Production Deployment

### Using PM2

```bash
# Start worker with PM2
pm2 start workers/ocr-worker.js --name ocr-worker --instances 2

# Save PM2 config
pm2 save

# Auto-start on boot
pm2 startup
```

### Using Docker

```dockerfile
FROM node:20-alpine

# Install system dependencies
RUN apk add --no-cache \
    poppler-utils \
    imagemagick \
    tesseract-ocr \
    tesseract-ocr-data-eng

WORKDIR /app
COPY package*.json ./
# --omit=dev replaces the deprecated --production flag
RUN npm ci --omit=dev

COPY . .

CMD ["node", "workers/ocr-worker.js"]
```

### Environment Variables

```bash
# Redis
REDIS_HOST=localhost
REDIS_PORT=6379

# Meilisearch
MEILISEARCH_HOST=http://localhost:7700
MEILISEARCH_MASTER_KEY=masterKey
MEILISEARCH_INDEX_NAME=navidocs-pages

# Database
DATABASE_PATH=/data/navidocs.db

# Worker
OCR_CONCURRENCY=2
```

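One way the connection-related variables above might be gathered into a single configuration object is sketched below. `loadConfig` and the shape of the returned object are assumptions for illustration; the defaults simply reuse the example values listed above.

```javascript
// Hypothetical loader: read the variables above, falling back to the listed example values.
function loadConfig(env = process.env) {
  return {
    redis: {
      host: env.REDIS_HOST || 'localhost',
      port: Number.parseInt(env.REDIS_PORT || '6379', 10)
    },
    meilisearch: {
      host: env.MEILISEARCH_HOST || 'http://localhost:7700',
      // No default here: a missing key should be noticed, not silently replaced.
      apiKey: env.MEILISEARCH_MASTER_KEY,
      indexName: env.MEILISEARCH_INDEX_NAME || 'navidocs-pages'
    },
    databasePath: env.DATABASE_PATH || '/data/navidocs.db'
  };
}

console.log(loadConfig({ REDIS_PORT: '6380' }).redis); // { host: 'localhost', port: 6380 }
```

Passing `env` as a parameter keeps the loader easy to exercise with test values instead of the real process environment.
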
## License

MIT