Implements complete TOC feature for document navigation with bilingual support.
## TOC Detection & Extraction
- Pattern-based TOC detection with 3 regex patterns
- Heuristic validation (30%+ match ratio, 5+ entries, sequential pages)
- Hierarchical section key parsing (e.g., "4.1.2" → level 3, parent "4.1")
- Database schema with parent-child relationships
- Automatic extraction during OCR post-processing
- Server-side LRU caching (200 entries, 30min TTL)
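The section-key parsing and validation heuristics listed above could look roughly like this. It is a minimal sketch, not the code in toc-extractor.js: the names `parseSectionKey` and `looksLikeToc`, the `{ title, page }` entry shape, and the reading of "sequential pages" as non-decreasing page numbers are all assumptions.

```javascript
// Hypothetical helper: derive nesting level and parent key from a dotted key.
function parseSectionKey(key) {
  const parts = key.split('.').filter(Boolean);
  return {
    key,
    level: parts.length,
    parentKey: parts.length > 1 ? parts.slice(0, -1).join('.') : null,
  };
}

// Hypothetical heuristic check mirroring the thresholds listed above.
// `lines` are the raw text lines of a candidate page; `entries` are the
// lines that matched a TOC pattern, each as { title, page }.
function looksLikeToc(lines, entries) {
  if (entries.length < 5) return false; // needs 5+ entries
  const nonEmpty = lines.filter((line) => line.trim() !== '').length;
  if (nonEmpty === 0 || entries.length / nonEmpty < 0.3) return false; // 30%+ match ratio
  // Assumption: "sequential pages" means page numbers never decrease.
  return entries.every((entry, i) => i === 0 || entry.page >= entries[i - 1].page);
}
```

With these definitions, `parseSectionKey('4.1.2')` yields level 3 and parent key "4.1", matching the example in the list.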
## UI Components
- TocSidebar: Collapsible sidebar (320px) with auto-open on TOC presence
- TocEntry: Recursive component for hierarchical rendering
- Flex layout: Sidebar + PDF viewer side-by-side
- Active page highlighting with real-time sync
- localStorage persistence for sidebar state
## Navigation Features
- Click TOC entry → PDF jumps to page
- Deep link support: URL hash format #p=12
- Page change events: navidocs:pagechange custom event
- URL hash updates on all navigation (next/prev/goTo/TOC)
- Hash change listener for external navigation
- Page clamping and validation
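A deep link such as #p=12 has to be parsed, validated, and clamped before the viewer can jump to it. The following sketch shows one way to do that; `pageFromHash` and `hashForPage` are hypothetical names, and clamping out-of-range values (rather than rejecting them) is an assumption about the intended behavior.

```javascript
// Hypothetical helper: read a page number from a hash such as "#p=12".
// Returns a page clamped to [1, totalPages], or null when the hash
// does not contain a page number at all.
function pageFromHash(hash, totalPages) {
  const match = /^#?p=(\d+)$/.exec(hash);
  if (!match || totalPages < 1) return null;
  const page = Number.parseInt(match[1], 10);
  return Math.min(Math.max(page, 1), totalPages);
}

// Hypothetical inverse: build the hash written to the URL on navigation.
function hashForPage(page) {
  return `#p=${page}`;
}
```

In the viewer, a hashchange listener could pass `window.location.hash` to `pageFromHash` and, on a non-null result, dispatch the navidocs:pagechange custom event. The shape of that event's detail payload is not specified here.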
## Search Integration
- "Jump to section" button in search results
- Shows when result has section field
- Navigates to document with page number and hash
## Accessibility
- ARIA attributes: role, aria-label, aria-expanded, aria-current
- Keyboard navigation: Enter/Space on entries, Tab focus
- Screen reader support with aria-live regions
- Semantic HTML with proper list/listitem roles
## Internationalization (i18n)
- Vue I18n integration with vue-i18n package
- English and French translations
- 8 TOC-specific translation keys
- Language switcher component in document viewer
- Locale persistence in localStorage
## Error Handling
- Specific error messages for each failure case
- Validation before processing (doc exists, has pages, has OCR)
- Non-blocking TOC extraction (doesn't fail OCR jobs)
- Detailed error returns: {success, error, entriesCount, pages}
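The validation and non-blocking behavior described above can be sketched as follows. The result shape follows the {success, error, entriesCount, pages} fields listed; everything else (the function names, the `pageCount` and `ocrCompleted` fields on the document, the error strings, and treating `pages` as an array) is an assumption for illustration.

```javascript
// Hypothetical pre-flight check returning the documented result shape.
function validateForTocExtraction(doc) {
  const fail = (error) => ({ success: false, error, entriesCount: 0, pages: [] });
  if (!doc) return fail('Document not found');
  if (!doc.pageCount) return fail('Document has no pages');
  if (!doc.ocrCompleted) return fail('Document has no OCR text yet');
  return { success: true, error: null, entriesCount: 0, pages: [] };
}

// Non-blocking wrapper: a TOC failure is reported in the return value
// and never thrown, so the surrounding OCR job cannot fail because of it.
// `extract` is an injected async function that does the real work.
async function extractTocSafely(doc, extract) {
  const check = validateForTocExtraction(doc);
  if (!check.success) return check;
  try {
    return await extract(doc);
  } catch (err) {
    return { success: false, error: err.message, entriesCount: 0, pages: [] };
  }
}
```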
## API Endpoints
- GET /api/documents/:id/toc?format=flat|tree
- POST /api/documents/:id/toc/extract
- Cache invalidation on re-extraction
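The server-side cache (200 entries, 30-minute TTL, invalidated on re-extraction) could be built on a plain Map, which iterates in insertion order. This is a sketch under assumptions, not the project's cache: the class name, the injectable clock, and the `documentId:format` key scheme are all hypothetical.

```javascript
// Minimal LRU cache with a per-entry TTL, built on Map insertion order.
class LruTtlCache {
  constructor({ max = 200, ttlMs = 30 * 60 * 1000, now = Date.now } = {}) {
    this.max = max;
    this.ttlMs = ttlMs;
    this.now = now; // injectable clock, mainly for tests
    this.map = new Map();
  }

  get(key) {
    const hit = this.map.get(key);
    if (!hit) return undefined;
    if (this.now() - hit.storedAt > this.ttlMs) {
      this.map.delete(key); // expired
      return undefined;
    }
    // Re-insert so the entry becomes the most recently used.
    this.map.delete(key);
    this.map.set(key, hit);
    return hit.value;
  }

  set(key, value) {
    this.map.delete(key);
    this.map.set(key, { value, storedAt: this.now() });
    if (this.map.size > this.max) {
      // The first key in iteration order is the least recently used.
      this.map.delete(this.map.keys().next().value);
    }
  }

  // Assumed key scheme: one entry per document and response format.
  invalidateDocument(documentId) {
    for (const format of ['flat', 'tree']) {
      this.map.delete(`${documentId}:${format}`);
    }
  }
}
```

A POST to the extract endpoint would then call `invalidateDocument(id)` so the next GET rebuilds both the flat and tree responses.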
## Testing
- Smoke test script: 9 comprehensive tests
- E2E testing guide with 5 manual scenarios
- Tests cover: API, caching, validation, navigation, search
## Database
- Migration 002: document_toc table
- Fields: id, document_id, title, section_key, page_start, level, parent_id, order_index
- Foreign keys with CASCADE delete
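Given the parent_id and order_index columns, the tree format can be derived from the flat rows in a single pass. The sketch below assumes that order_index reflects overall document order and that rows with a missing parent should surface at the top level; `buildTocTree` is a hypothetical name.

```javascript
// Build a nested tree from flat document_toc rows.
function buildTocTree(rows) {
  const byId = new Map();
  for (const row of rows) {
    byId.set(row.id, { ...row, children: [] }); // copy, so input rows stay untouched
  }
  const sorted = [...byId.values()].sort((a, b) => a.order_index - b.order_index);
  const roots = [];
  for (const node of sorted) {
    const parent = node.parent_id != null ? byId.get(node.parent_id) : undefined;
    // Orphaned rows (parent not found) are kept at the top level, not dropped.
    (parent ? parent.children : roots).push(node);
  }
  return roots;
}
```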
## Files Changed
- New: TocSidebar.vue, TocEntry.vue, LanguageSwitcher.vue
- New: toc-extractor.js, toc.js routes, i18n setup
- Modified: DocumentView.vue (sidebar, deep links, events)
- Modified: SearchView.vue (Jump to section button)
- Modified: ocr-worker.js (TOC post-processing)
- New: toc-smoke-test.sh, TOC_E2E_TEST.md
Generated with Claude Code (https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
NaviDocs OCR Pipeline
Overview
The OCR pipeline processes PDF documents in the background, extracting text from each page and indexing it in Meilisearch for fast, searchable access.
Architecture
Upload PDF → Create OCR Job → BullMQ Queue → OCR Worker → Database + Meilisearch
Components
- OCR Service (services/ocr.js)
  - Converts PDF pages to images using external tools (pdftoppm or ImageMagick)
  - Runs Tesseract.js OCR on each image
  - Returns structured data with text and confidence scores
- Search Service (services/search.js)
  - Indexes document pages in Meilisearch
  - Builds proper document structure with metadata
  - Supports multi-vertical indexing (boat, marina, property)
- OCR Worker (workers/ocr-worker.js)
  - BullMQ background worker processing jobs from 'ocr-jobs' queue
  - Updates job progress in real-time (0-100%)
  - Saves OCR results to document_pages table
  - Indexes pages in Meilisearch with full metadata
  - Updates document status to 'indexed' when complete
Setup
1. Install System Dependencies
The OCR pipeline requires PDF-to-image conversion tools (pdftoppm from Poppler, or ImageMagick) and Tesseract:
# Ubuntu/Debian
sudo apt-get update
sudo apt-get install -y poppler-utils imagemagick tesseract-ocr
# macOS
brew install poppler imagemagick tesseract
# Verify installation
which pdftoppm
which convert
which tesseract
2. Install Node Dependencies
cd server
npm install
3. Start Redis
BullMQ requires Redis for job queue management:
# Using Docker
docker run -d -p 6379:6379 redis:alpine
# Or install locally
sudo apt-get install redis-server
redis-server
4. Start Meilisearch
# Using Docker
docker run -d -p 7700:7700 \
  -e MEILI_MASTER_KEY=masterKey \
  -v $(pwd)/data.ms:/data.ms \
  getmeili/meilisearch:latest
# Or download binary
curl -L https://install.meilisearch.com | sh
./meilisearch --master-key=masterKey
5. Start the OCR Worker
# Run worker directly
node workers/ocr-worker.js
# Or use process manager
pm2 start workers/ocr-worker.js --name ocr-worker
Usage
Creating an OCR Job
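import { Queue } from 'bullmq';
import { v4 as uuidv4 } from 'uuid';
// Assumes `db` is an open SQLite handle and `documentId` is already defined
const ocrQueue = new Queue('ocr-jobs', {
  connection: { host: '127.0.0.1', port: 6379 }
});
// Create job in database (created_at is an INTEGER column, so store whole seconds)
const jobId = uuidv4();
db.prepare(`
  INSERT INTO ocr_jobs (id, document_id, status, created_at)
  VALUES (?, ?, 'pending', ?)
`).run(jobId, documentId, Math.floor(Date.now() / 1000));
// Add job to queue, reusing the database ID as the BullMQ job ID
// so the job can be looked up later with ocrQueue.getJob(jobId)
await ocrQueue.add('process-document', {
  documentId: documentId,
  jobId: jobId,
  filePath: '/path/to/document.pdf'
}, { jobId });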
Monitoring Job Progress
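// Get job from queue. getJob() expects the BullMQ job ID, so this lookup only
// works if the job was added with the `jobId` option set to the same value.
const job = await ocrQueue.getJob(jobId);
// Check progress (in BullMQ, `progress` is a property, not a method)
const progress = job ? job.progress : null; // 0-100
// Check database for status
const jobStatus = db.prepare(`
  SELECT status, progress, error FROM ocr_jobs WHERE id = ?
`).get(jobId);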
Searching Indexed Pages
import { searchPages } from './services/search.js';
// Search all pages
const allResults = await searchPages('bilge pump maintenance', {
  limit: 20,
  offset: 0
});
// Search with filters (user-specific)
const userResults = await searchPages('electrical system', {
  filter: `userId = "${userId}" AND vertical = "boating"`,
  limit: 10
});
// Search with organization access
const orgResults = await searchPages('generator', {
  filter: `organizationId IN ["org1", "org2"]`,
  sort: ['pageNumber:asc']
});
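The filter strings above interpolate values directly, which breaks if a value ever contains a double quote. A small builder can quote and escape values before they reach searchPages. This is a sketch: `quoteFilterValue` and `buildPageFilter` are hypothetical helpers, not part of services/search.js, and the backslash escaping should be checked against the filter syntax of the Meilisearch version in use.

```javascript
// Quote a value for a Meilisearch filter expression,
// escaping backslashes and double quotes.
function quoteFilterValue(value) {
  const escaped = String(value).replace(/\\/g, '\\\\').replace(/"/g, '\\"');
  return `"${escaped}"`;
}

// Hypothetical builder for the filter shapes used in the examples above.
function buildPageFilter({ userId, vertical, organizationIds } = {}) {
  const clauses = [];
  if (userId) clauses.push(`userId = ${quoteFilterValue(userId)}`);
  if (vertical) clauses.push(`vertical = ${quoteFilterValue(vertical)}`);
  if (organizationIds && organizationIds.length > 0) {
    const list = organizationIds.map((id) => quoteFilterValue(id)).join(', ');
    clauses.push(`organizationId IN [${list}]`);
  }
  return clauses.join(' AND ');
}
```

For example, `buildPageFilter({ userId, vertical: 'boating' })` produces the same string as the user-specific filter shown earlier, with the user ID safely quoted.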
Database Schema
ocr_jobs Table
CREATE TABLE ocr_jobs (
  id TEXT PRIMARY KEY,
  document_id TEXT NOT NULL,
  status TEXT DEFAULT 'pending', -- pending, processing, completed, failed
  progress INTEGER DEFAULT 0, -- 0-100
  error TEXT,
  started_at INTEGER,
  completed_at INTEGER,
  created_at INTEGER NOT NULL,
  FOREIGN KEY (document_id) REFERENCES documents(id) ON DELETE CASCADE
);
document_pages Table
CREATE TABLE document_pages (
  id TEXT PRIMARY KEY,
  document_id TEXT NOT NULL,
  page_number INTEGER NOT NULL,
  -- OCR data
  ocr_text TEXT,
  ocr_confidence REAL,
  ocr_language TEXT DEFAULT 'en',
  ocr_completed_at INTEGER,
  -- Search indexing
  search_indexed_at INTEGER,
  meilisearch_id TEXT,
  metadata TEXT, -- JSON
  created_at INTEGER NOT NULL,
  UNIQUE(document_id, page_number),
  FOREIGN KEY (document_id) REFERENCES documents(id) ON DELETE CASCADE
);
Meilisearch Document Structure
Each indexed page follows this structure:
{
  "id": "page_doc_abc123_p7",
  "vertical": "boating",
  "organizationId": "org_xyz789",
  "organizationName": "Smith Family Boats",
  "entityId": "boat_prestige_f49_001",
  "entityName": "Sea Breeze",
  "entityType": "boat",
  "docId": "doc_abc123",
  "userId": "user_456",
  "documentType": "component-manual",
  "title": "8.7 Blackwater System - Maintenance",
  "pageNumber": 7,
  "text": "The blackwater pump is located...",
  "systems": ["plumbing", "waste-management"],
  "categories": ["maintenance", "troubleshooting"],
  "tags": ["bilge", "pump", "blackwater"],
  "boatName": "Sea Breeze",
  "boatMake": "Prestige",
  "boatModel": "F4.9",
  "boatYear": 2024,
  "language": "en",
  "ocrConfidence": 0.94,
  "createdAt": 1740234567,
  "updatedAt": 1740234567
}
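A page document like the one above can be assembled from a document record and one OCR result. The sketch below covers only a subset of the fields. The id pattern (page_ + docId + _p + pageNumber) is inferred from the sample and is an assumption, as are the function name and the field names on the `doc` and `page` inputs.

```javascript
// Hypothetical builder for one Meilisearch page document.
// `now` is a Unix timestamp in seconds, matching createdAt in the sample.
function buildPageDocument(doc, page, now = Math.floor(Date.now() / 1000)) {
  return {
    // Assumed id pattern, inferred from "page_doc_abc123_p7".
    id: `page_${doc.id}_p${page.pageNumber}`,
    vertical: doc.vertical,
    organizationId: doc.organizationId,
    docId: doc.id,
    userId: doc.userId,
    documentType: doc.documentType,
    title: doc.title,
    pageNumber: page.pageNumber,
    text: page.text,
    language: page.language ?? 'en',
    ocrConfidence: page.confidence,
    createdAt: now,
    updatedAt: now,
  };
}
```

Meilisearch only accepts document ids made of alphanumeric characters, hyphens, and underscores, so an id built this way stays valid as long as the document id itself does.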
Error Handling
The OCR pipeline handles errors gracefully:
- PDF Conversion Errors: Falls back to alternative tools or returns blank page
- OCR Errors: Stores page with empty text and confidence = 0
- Indexing Errors: Logs error but continues processing other pages
- Worker Errors: Updates job status to 'failed' and stores error message
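The per-page behavior listed above (an OCR failure stores an empty page, an indexing failure is logged and skipped) amounts to isolating each step in its own try/catch. The following is a sketch of that pattern, not the code in ocr-worker.js: `processPages`, `ocrPage`, `indexPage`, and the `indexed` flag are hypothetical, with the two services injected as async functions.

```javascript
// Sketch of per-page error isolation.
// ocrPage(pageNumber) resolves to { text, confidence }; indexPage(page) resolves when indexed.
async function processPages(pageNumbers, { ocrPage, indexPage, log = console.error }) {
  const results = [];
  for (const pageNumber of pageNumbers) {
    let page;
    try {
      page = { pageNumber, ...(await ocrPage(pageNumber)) };
    } catch (err) {
      // OCR failure: keep the page, but with empty text and zero confidence.
      log(`OCR failed on page ${pageNumber}: ${err.message}`);
      page = { pageNumber, text: '', confidence: 0 };
    }
    try {
      await indexPage(page);
      page.indexed = true;
    } catch (err) {
      // Indexing failure: log it and continue with the next page.
      log(`Indexing failed on page ${pageNumber}: ${err.message}`);
      page.indexed = false;
    }
    results.push(page);
  }
  return results;
}
```

Errors that escape this loop (for example, a missing input file) would still reach the worker, which marks the job as 'failed'.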
Performance
Optimization Tips
- Concurrency: Adjust the OCR_CONCURRENCY environment variable (default: 2)
- Rate Limiting: Worker processes max 5 jobs per minute
- Image Quality: Uses 300 DPI for optimal OCR accuracy
- Cleanup: Temporary image files are automatically deleted
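The concurrency and rate-limit settings above map onto BullMQ's `concurrency` and `limiter` worker options. The helper below shows how they might be derived from the environment. It is a sketch: `buildWorkerOptions` is a hypothetical name, and the Redis defaults are taken from the connection example earlier in this README rather than from the worker source.

```javascript
// Hypothetical helper that turns the documented settings into BullMQ Worker options.
function buildWorkerOptions(env = process.env) {
  const parsed = Number.parseInt(env.OCR_CONCURRENCY ?? '', 10);
  return {
    connection: {
      host: env.REDIS_HOST ?? '127.0.0.1',
      port: Number.parseInt(env.REDIS_PORT ?? '6379', 10),
    },
    // Fall back to the documented default of 2 on missing or invalid input.
    concurrency: Number.isInteger(parsed) && parsed > 0 ? parsed : 2,
    // At most 5 jobs per 60,000 ms window.
    limiter: { max: 5, duration: 60_000 },
  };
}
```

The result would be passed as the third argument when constructing the worker, for example `new Worker('ocr-jobs', processor, buildWorkerOptions())`, where `processor` stands for the job handler.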
Benchmarks
- Small PDF (10 pages): ~30-60 seconds
- Medium PDF (50 pages): ~2-5 minutes
- Large PDF (200 pages): ~10-20 minutes
Troubleshooting
PDF Conversion Fails
# Check if tools are installed
node -e "import('./services/ocr.js').then(m => console.log(m.checkPDFTools()))"
# Install missing tools
sudo apt-get install poppler-utils imagemagick
Tesseract Language Data Missing
# Install language data
sudo apt-get install tesseract-ocr-eng tesseract-ocr-fra
# For multiple languages
sudo apt-get install tesseract-ocr-all
Redis Connection Errors
# Check Redis status
redis-cli ping
# Set Redis host/port
export REDIS_HOST=localhost
export REDIS_PORT=6379
Meilisearch Indexing Fails
# Check Meilisearch is running
curl http://localhost:7700/health
# Set environment variables
export MEILISEARCH_HOST=http://localhost:7700
export MEILISEARCH_MASTER_KEY=masterKey
Development
Running Tests
# Test OCR service
node -e "
  import('./services/ocr.js').then(async (ocr) => {
    const results = await ocr.extractTextFromPDF('/path/to/test.pdf');
    console.log(results);
  });
"
# Test search service
node -e "
  import('./services/search.js').then(async (search) => {
    const results = await search.searchPages('test query');
    console.log(results);
  });
"
Monitoring Worker
# View worker logs
tail -f logs/ocr-worker.log
# Monitor with PM2
pm2 logs ocr-worker
# View queue status
redis-cli
> KEYS bull:ocr-jobs:*
> LLEN bull:ocr-jobs:wait
Production Deployment
Using PM2
# Start worker with PM2
pm2 start workers/ocr-worker.js --name ocr-worker --instances 2
# Save PM2 config
pm2 save
# Auto-start on boot
pm2 startup
Using Docker
FROM node:20-alpine
# Install system dependencies
RUN apk add --no-cache \
    poppler-utils \
    imagemagick \
    tesseract-ocr \
    tesseract-ocr-data-eng
WORKDIR /app
COPY package*.json ./
# --omit=dev replaces the deprecated --production flag
RUN npm ci --omit=dev
COPY . .
CMD ["node", "workers/ocr-worker.js"]
Environment Variables
# Redis
REDIS_HOST=localhost
REDIS_PORT=6379
# Meilisearch
MEILISEARCH_HOST=http://localhost:7700
MEILISEARCH_MASTER_KEY=masterKey
MEILISEARCH_INDEX_NAME=navidocs-pages
# Database
DATABASE_PATH=/data/navidocs.db
# Worker
OCR_CONCURRENCY=2
License
MIT