NaviDocs OCR Pipeline

Overview

The OCR pipeline processes PDF documents in the background, extracting text from each page and indexing it in Meilisearch for fast, searchable access.

Architecture

Upload PDF → Create OCR Job → BullMQ Queue → OCR Worker → Database + Meilisearch

Components

OCR Service (services/ocr.js)
- Converts PDF pages to images using external tools (pdftoppm or ImageMagick)
- Runs Tesseract.js OCR on each image
- Returns structured data with text and confidence scores
Search Service (services/search.js)
- Indexes document pages in Meilisearch
- Builds proper document structure with metadata
- Supports multi-vertical indexing (boating, marina, property)
OCR Worker (workers/ocr-worker.js)
- BullMQ background worker processing jobs from 'ocr-jobs' queue
- Updates job progress in real-time (0-100%)
- Saves OCR results to document_pages table
- Indexes pages in Meilisearch with full metadata
- Updates document status to 'indexed' when complete
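
The worker's per-job flow listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual workers/ocr-worker.js: the function name processDocument and the injected callbacks (ocrPages, savePage, indexPage, setProgress, setStatus) are hypothetical stand-ins for the OCR service, the database, and the search service.

```javascript
// Hypothetical sketch of the worker's per-job flow; all names are illustrative.
async function processDocument(job, deps) {
  const { ocrPages, savePage, indexPage, setProgress, setStatus } = deps;

  // ocrPages is assumed to resolve to [{ pageNumber, text, confidence }, ...]
  const pages = await ocrPages(job.filePath);

  for (let i = 0; i < pages.length; i++) {
    await savePage(job.documentId, pages[i]);  // one document_pages row per page
    await indexPage(job.documentId, pages[i]); // one Meilisearch document per page
    // Report progress as a whole percentage in the range 0-100
    await setProgress(Math.round(((i + 1) / pages.length) * 100));
  }

  // Mark the document as searchable once every page has been handled
  await setStatus(job.documentId, 'indexed');
  return { pages: pages.length };
}
```

Passing the collaborators in as callbacks keeps the sketch runnable without Redis, SQLite, or Meilisearch; the real worker presumably calls its services directly.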

Setup

1. Install System Dependencies

The OCR pipeline requires PDF-to-image conversion tools:

# Ubuntu/Debian
sudo apt-get update
sudo apt-get install -y poppler-utils imagemagick tesseract-ocr

# macOS
brew install poppler imagemagick tesseract

# Verify installation
which pdftoppm
which convert
which tesseract

2. Install Node Dependencies

cd server
npm install

3. Start Redis

BullMQ requires Redis for job queue management:

# Using Docker
docker run -d -p 6379:6379 redis:alpine

# Or install locally
sudo apt-get install redis-server
redis-server

4. Start Meilisearch

# Using Docker
docker run -d -p 7700:7700 \
  -e MEILI_MASTER_KEY=masterKey \
  -v $(pwd)/meili_data:/meili_data \
  getmeili/meilisearch:latest

# Or download binary
curl -L https://install.meilisearch.com | sh
./meilisearch --master-key=masterKey

5. Start the OCR Worker

# Run worker directly
node workers/ocr-worker.js

# Or use process manager
pm2 start workers/ocr-worker.js --name ocr-worker

Usage

Creating an OCR Job

import { Queue } from 'bullmq';
import { v4 as uuidv4 } from 'uuid';

const ocrQueue = new Queue('ocr-jobs', {
  connection: { host: '127.0.0.1', port: 6379 }
});

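// Create job in database (assumes `db` is an open SQLite handle
// and `documentId` identifies an existing document)
const jobId = uuidv4();
db.prepare(`
  INSERT INTO ocr_jobs (id, document_id, status, created_at)
  VALUES (?, ?, 'pending', ?)
`).run(jobId, documentId, Math.floor(Date.now() / 1000)); // whole seconds: created_at is INTEGER

// Add job to queue, reusing the database id as the BullMQ job id
// so the job can be looked up later with ocrQueue.getJob(jobId)
await ocrQueue.add('process-document', {
  documentId: documentId,
  jobId: jobId,
  filePath: '/path/to/document.pdf'
}, { jobId });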

Monitoring Job Progress

// Get job from queue (resolves to undefined if no job has this id)
const job = await ocrQueue.getJob(jobId);

// Check progress (in BullMQ, progress is a property, not a method)
const progress = job.progress; // 0-100

// Check database for status
const jobStatus = db.prepare(`
  SELECT status, progress, error FROM ocr_jobs WHERE id = ?
`).get(jobId);
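
For callers that need to block until a job finishes, the status check above can be wrapped in a small polling loop. The sketch below is one possible approach, not part of the codebase: waitForJob is a hypothetical helper, and getStatus stands for any function that returns the { status, progress, error } row for a job id, such as a wrapper around the ocr_jobs query.

```javascript
// Hypothetical helper: poll a status lookup until the job finishes or a timeout elapses.
async function waitForJob(getStatus, jobId, { intervalMs = 1000, timeoutMs = 60000 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    const row = await getStatus(jobId);
    if (!row) throw new Error(`Unknown job: ${jobId}`);
    if (row.status === 'completed') return row;
    if (row.status === 'failed') throw new Error(row.error || 'OCR job failed');
    if (Date.now() >= deadline) throw new Error(`Timed out waiting for job ${jobId}`);
    // Sleep before the next poll
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

Polling the database is the simplest option; BullMQ also provides queue events, which may be a better fit when many jobs are watched at once.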

Searching Indexed Pages

import { searchPages } from './services/search.js';

// Search all pages
const allResults = await searchPages('bilge pump maintenance', {
  limit: 20,
  offset: 0
});

// Search with filters (user-specific)
const userResults = await searchPages('electrical system', {
  filter: `userId = "${userId}" AND vertical = "boating"`,
  limit: 10
});

// Search with organization access
const orgResults = await searchPages('generator', {
  filter: `organizationId IN ["org1", "org2"]`,
  sort: ['pageNumber:asc']
});
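
The filter strings above interpolate values directly, which breaks if a value contains a double quote. A small builder can compose the same expressions more defensively. This is a sketch only: buildFilter and quote are hypothetical helpers that do not exist in services/search.js, and the backslash escaping is an assumption about how quoted filter values should be escaped.

```javascript
// Hypothetical helpers for composing Meilisearch filter expressions.

// Wrap a value in double quotes, escaping backslashes and embedded quotes.
function quote(value) {
  return `"${String(value).replace(/\\/g, '\\\\').replace(/"/g, '\\"')}"`;
}

// Build an AND-joined filter from the optional fields that are present.
function buildFilter({ userId, vertical, organizationIds } = {}) {
  const clauses = [];
  if (userId) clauses.push(`userId = ${quote(userId)}`);
  if (vertical) clauses.push(`vertical = ${quote(vertical)}`);
  if (organizationIds && organizationIds.length > 0) {
    clauses.push(`organizationId IN [${organizationIds.map(quote).join(', ')}]`);
  }
  return clauses.join(' AND ');
}
```

The result can be passed as the filter option, for example searchPages('generator', { filter: buildFilter({ organizationIds: ['org1', 'org2'] }) }).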

Database Schema

ocr_jobs Table

CREATE TABLE ocr_jobs (
  id TEXT PRIMARY KEY,
  document_id TEXT NOT NULL,
  status TEXT DEFAULT 'pending',      -- pending, processing, completed, failed
  progress INTEGER DEFAULT 0,         -- 0-100
  error TEXT,
  started_at INTEGER,
  completed_at INTEGER,
  created_at INTEGER NOT NULL,
  FOREIGN KEY (document_id) REFERENCES documents(id) ON DELETE CASCADE
);

document_pages Table

CREATE TABLE document_pages (
  id TEXT PRIMARY KEY,
  document_id TEXT NOT NULL,
  page_number INTEGER NOT NULL,

  -- OCR data
  ocr_text TEXT,
  ocr_confidence REAL,
  ocr_language TEXT DEFAULT 'en',
  ocr_completed_at INTEGER,

  -- Search indexing
  search_indexed_at INTEGER,
  meilisearch_id TEXT,

  metadata TEXT,                      -- JSON
  created_at INTEGER NOT NULL,

  UNIQUE(document_id, page_number),
  FOREIGN KEY (document_id) REFERENCES documents(id) ON DELETE CASCADE
);

Meilisearch Document Structure

Each indexed page follows this structure:

{
  "id": "page_doc_abc123_p7",
  "vertical": "boating",

  "organizationId": "org_xyz789",
  "organizationName": "Smith Family Boats",

  "entityId": "boat_prestige_f49_001",
  "entityName": "Sea Breeze",
  "entityType": "boat",

  "docId": "doc_abc123",
  "userId": "user_456",

  "documentType": "component-manual",
  "title": "8.7 Blackwater System - Maintenance",
  "pageNumber": 7,
  "text": "The blackwater pump is located...",

  "systems": ["plumbing", "waste-management"],
  "categories": ["maintenance", "troubleshooting"],
  "tags": ["bilge", "pump", "blackwater"],

  "boatName": "Sea Breeze",
  "boatMake": "Prestige",
  "boatModel": "F4.9",
  "boatYear": 2024,

  "language": "en",
  "ocrConfidence": 0.94,

  "createdAt": 1740234567,
  "updatedAt": 1740234567
}
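
To illustrate how such a document might be assembled from a document record and one OCR'd page, here is a minimal sketch. The function buildPageDocument and the shapes of its doc and page arguments are assumptions; the id format is inferred from the sample above ("page_" + docId + "_p" + pageNumber), and only a subset of the fields is shown.

```javascript
// Hypothetical sketch: build the Meilisearch document for one page.
// `now` is a Unix timestamp in seconds, matching createdAt/updatedAt above.
function buildPageDocument(doc, page, now = Math.floor(Date.now() / 1000)) {
  return {
    id: `page_${doc.id}_p${page.pageNumber}`, // format inferred from the sample
    vertical: doc.vertical,
    organizationId: doc.organizationId,
    docId: doc.id,
    userId: doc.userId,
    documentType: doc.documentType,
    title: doc.title,
    pageNumber: page.pageNumber,
    text: page.text,
    language: page.language || 'en',
    ocrConfidence: page.confidence,
    createdAt: now,
    updatedAt: now
  };
}
```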

Error Handling

The OCR pipeline handles failures at each stage and, where possible, keeps processing the rest of the job:

- PDF Conversion Errors: Falls back to alternative tools or returns a blank page
- OCR Errors: Stores the page with empty text and confidence = 0
- Indexing Errors: Logs the error but continues processing other pages
- Worker Errors: Updates job status to 'failed' and stores the error message
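
The OCR and indexing fallbacks above could look roughly like this for a single page. It is a sketch under assumptions rather than the worker's real code: processPageSafely, ocrPage, and indexPage are hypothetical names, and the two callbacks are injected so the example runs on its own.

```javascript
// Hypothetical sketch of per-page error handling with injected callbacks.
async function processPageSafely(pageNumber, { ocrPage, indexPage, log = console.error }) {
  let result;
  try {
    result = await ocrPage(pageNumber);
  } catch (err) {
    // OCR failure: keep the page, but with empty text and zero confidence
    log(`OCR failed on page ${pageNumber}: ${err.message}`);
    result = { text: '', confidence: 0 };
  }

  let indexed = true;
  try {
    await indexPage(pageNumber, result);
  } catch (err) {
    // Indexing failure: log it and let the caller move on to the next page
    log(`Indexing failed on page ${pageNumber}: ${err.message}`);
    indexed = false;
  }

  return { pageNumber, ...result, indexed };
}
```

Errors thrown outside these two try blocks would still propagate to the worker, which is where the job would be marked 'failed'.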

Performance

Optimization Tips

- Concurrency: Adjust the OCR_CONCURRENCY environment variable (default: 2)
- Rate Limiting: The worker processes at most 5 jobs per minute
- Image Quality: Pages are rendered at 300 DPI for optimal OCR accuracy
- Cleanup: Temporary image files are deleted automatically
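
As an illustration of how the concurrency and rate-limit settings above map onto BullMQ, the following sketch derives Worker options from the environment. The helper name workerOptionsFromEnv is hypothetical, and the actual worker may build its options differently.

```javascript
// Hypothetical sketch: derive BullMQ Worker options from environment variables.
function workerOptionsFromEnv(env = process.env) {
  const parsed = Number.parseInt(env.OCR_CONCURRENCY ?? '', 10);
  return {
    // Fall back to the documented default of 2 on missing or invalid input
    concurrency: Number.isInteger(parsed) && parsed > 0 ? parsed : 2,
    // At most 5 jobs per 60,000 ms window
    limiter: { max: 5, duration: 60000 }
  };
}
```

The returned object could be spread into the options argument of a BullMQ Worker alongside the Redis connection settings.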

Benchmarks

- Small PDF (10 pages): ~30-60 seconds
- Medium PDF (50 pages): ~2-5 minutes
- Large PDF (200 pages): ~10-20 minutes
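
These figures work out to roughly 3-6 seconds per page (30-60 s for 10 pages, 600-1200 s for 200 pages). A rough estimator based on that rate might look like the sketch below. The function estimateOcrSeconds is hypothetical, and the per-page rate is derived from the numbers above rather than measured independently, so treat the output as a ballpark only.

```javascript
// Hypothetical estimator: scale the per-page rate implied by the benchmarks above.
function estimateOcrSeconds(pageCount, secondsPerPage = { min: 3, max: 6 }) {
  return {
    min: pageCount * secondsPerPage.min,
    max: pageCount * secondsPerPage.max
  };
}
```

Actual times will vary with hardware, scan quality, and the OCR_CONCURRENCY setting.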

Troubleshooting

PDF Conversion Fails

# Check if tools are installed
node -e "import('./services/ocr.js').then(m => console.log(m.checkPDFTools()))"

# Install missing tools
sudo apt-get install poppler-utils imagemagick

Tesseract Language Data Missing

# Install language data
sudo apt-get install tesseract-ocr-eng tesseract-ocr-fra

# For multiple languages
sudo apt-get install tesseract-ocr-all

Redis Connection Errors

# Check Redis status
redis-cli ping

# Set Redis host/port
export REDIS_HOST=localhost
export REDIS_PORT=6379

Meilisearch Indexing Fails

# Check Meilisearch is running
curl http://localhost:7700/health

# Set environment variables
export MEILISEARCH_HOST=http://localhost:7700
export MEILISEARCH_MASTER_KEY=masterKey

Development

Running Tests

# Test OCR service
node -e "
  import('./services/ocr.js').then(async (ocr) => {
    const results = await ocr.extractTextFromPDF('/path/to/test.pdf');
    console.log(results);
  });
"

# Test search service
node -e "
  import('./services/search.js').then(async (search) => {
    const results = await search.searchPages('test query');
    console.log(results);
  });
"

Monitoring Worker

# View worker logs
tail -f logs/ocr-worker.log

# Monitor with PM2
pm2 logs ocr-worker

# View queue status
redis-cli
> KEYS bull:ocr-jobs:*
> LLEN bull:ocr-jobs:wait

Production Deployment

Using PM2

# Start worker with PM2
pm2 start workers/ocr-worker.js --name ocr-worker --instances 2

# Save PM2 config
pm2 save

# Auto-start on boot
pm2 startup

Using Docker

FROM node:20-alpine

# Install system dependencies
RUN apk add --no-cache \
    poppler-utils \
    imagemagick \
    tesseract-ocr \
    tesseract-ocr-data-eng

WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev

COPY . .

CMD ["node", "workers/ocr-worker.js"]

Environment Variables

# Redis
REDIS_HOST=localhost
REDIS_PORT=6379

# Meilisearch
MEILISEARCH_HOST=http://localhost:7700
MEILISEARCH_MASTER_KEY=masterKey
MEILISEARCH_INDEX_NAME=navidocs-pages

# Database
DATABASE_PATH=/data/navidocs.db

# Worker
OCR_CONCURRENCY=2
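
One way to read these variables in a single place is a small loader like the sketch below. The function loadConfig is hypothetical, and treating the values listed above as fallbacks is an assumption; the actual services may read process.env directly and use different defaults.

```javascript
// Hypothetical sketch: collect connection settings from the environment.
// The fallback values mirror the example values listed above (an assumption).
function loadConfig(env = process.env) {
  const port = Number.parseInt(env.REDIS_PORT ?? '6379', 10);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`Invalid REDIS_PORT: ${env.REDIS_PORT}`);
  }
  return {
    redis: { host: env.REDIS_HOST ?? 'localhost', port },
    meilisearch: {
      host: env.MEILISEARCH_HOST ?? 'http://localhost:7700',
      apiKey: env.MEILISEARCH_MASTER_KEY, // no fallback: left undefined if unset
      indexName: env.MEILISEARCH_INDEX_NAME ?? 'navidocs-pages'
    },
    databasePath: env.DATABASE_PATH ?? '/data/navidocs.db'
  };
}
```

Failing fast on an invalid port surfaces misconfiguration at startup rather than on the first queue connection.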

License

MIT