ggq-admin fb88b291de	feat: Add interactive Table of Contents navigation with i18n support	2025-10-20 13:22:45 +02:00

Implements complete TOC feature for document navigation with bilingual support.

## TOC Detection & Extraction
- Pattern-based TOC detection with 3 regex patterns
- Heuristic validation (30%+ match ratio, 5+ entries, sequential pages)
- Hierarchical section key parsing (e.g., "4.1.2" → level 3, parent "4.1")
- Database schema with parent-child relationships
- Automatic extraction during OCR post-processing
- Server-side LRU caching (200 entries, 30min TTL)

## UI Components
- TocSidebar: Collapsible sidebar (320px) with auto-open on TOC presence
- TocEntry: Recursive component for hierarchical rendering
- Flex layout: Sidebar + PDF viewer side-by-side
- Active page highlighting with real-time sync
- localStorage persistence for sidebar state

## Navigation Features
- Click TOC entry → PDF jumps to page
- Deep link support: URL hash format #p=12
- Page change events: navidocs:pagechange custom event
- URL hash updates on all navigation (next/prev/goTo/TOC)
- Hash change listener for external navigation
- Page clamping and validation

## Search Integration
- "Jump to section" button in search results
- Shows when result has section field
- Navigates to document with page number and hash

## Accessibility
- ARIA attributes: role, aria-label, aria-expanded, aria-current
- Keyboard navigation: Enter/Space on entries, Tab focus
- Screen reader support with aria-live regions
- Semantic HTML with proper list/listitem roles

## Internationalization (i18n)
- Vue I18n integration with vue-i18n package
- English and French translations
- 8 TOC-specific translation keys
- Language switcher component in document viewer
- Locale persistence in localStorage

## Error Handling
- Specific error messages for each failure case
- Validation before processing (doc exists, has pages, has OCR)
- Non-blocking TOC extraction (doesn't fail OCR jobs)
- Detailed error returns: {success, error, entriesCount, pages}

## API Endpoints
- GET /api/documents/:id/toc?format=flat|tree
- POST /api/documents/:id/toc/extract
- Cache invalidation on re-extraction

## Testing
- Smoke test script: 9 comprehensive tests
- E2E testing guide with 5 manual scenarios
- Tests cover: API, caching, validation, navigation, search

## Database
- Migration 002: document_toc table
- Fields: id, document_id, title, section_key, page_start, level, parent_id, order_index
- Foreign keys with CASCADE delete

## Files Changed
- New: TocSidebar.vue, TocEntry.vue, LanguageSwitcher.vue
- New: toc-extractor.js, toc.js routes, i18n setup
- Modified: DocumentView.vue (sidebar, deep links, events)
- Modified: SearchView.vue (Jump to section button)
- Modified: ocr-worker.js (TOC post-processing)
- New: toc-smoke-test.sh, TOC_E2E_TEST.md

Generated with Claude Code (https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

file-safety.js	feat: NaviDocs MVP - Complete codebase extraction from lilian1	2025-10-19 01:55:44 +02:00
ocr-google-drive.js	feat: Add Google Drive OCR integration with hybrid fallback system	2025-10-19 09:04:34 +02:00
ocr-google-vision.js	feat: Add Google Cloud Vision API as primary OCR option	2025-10-19 09:08:38 +02:00
ocr-hybrid.js	feat: Add Google Cloud Vision API as primary OCR option	2025-10-19 09:08:38 +02:00
ocr.js	fix: Complete OCR pipeline with language code mapping	2025-10-19 05:09:51 +02:00
queue.js	feat: NaviDocs MVP - Complete codebase extraction from lilian1	2025-10-19 01:55:44 +02:00
README.md	feat: NaviDocs MVP - Complete codebase extraction from lilian1	2025-10-19 01:55:44 +02:00
search.js	feat: NaviDocs MVP - Complete codebase extraction from lilian1	2025-10-19 01:55:44 +02:00
section-extractor.js	feat: Add interactive Table of Contents navigation with i18n support	2025-10-20 13:22:45 +02:00
toc-extractor.js	feat: Add interactive Table of Contents navigation with i18n support	2025-10-20 13:22:45 +02:00

README.md

NaviDocs Services

This directory contains core business logic services for NaviDocs.

Services

OCR Service (`ocr.js`)

Handles text extraction from PDF documents using Tesseract.js OCR.

Key Functions:

import { extractTextFromPDF, extractTextFromImage, checkPDFTools } from './ocr.js';

// Extract text from PDF (all pages)
const results = await extractTextFromPDF('/path/to/document.pdf', {
  language: 'eng',
  onProgress: (pageNum, total) => {
    console.log(`Processing page ${pageNum}/${total}`);
  }
});

// Result format:
// [
//   { pageNumber: 1, text: "Page content...", confidence: 0.94 },
//   { pageNumber: 2, text: "More content...", confidence: 0.89 },
//   ...
// ]

// Extract from single image
const result = await extractTextFromImage('/path/to/image.png', 'eng');

// Check available PDF tools
const tools = checkPDFTools();
// { pdftoppm: true, imagemagick: true }

Requirements:

Tesseract.js (installed via npm)
PDF conversion tool: poppler-utils (pdftoppm) or imagemagick

Features:

Converts PDF pages to high-quality images (300 DPI)
Runs Tesseract OCR on each page
Returns confidence scores for quality assessment
Graceful error handling per page
Progress callbacks for long documents
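
The per-page confidence scores can feed a simple quality gate. The following is a minimal sketch, assuming the result shape shown above (`pageNumber`, `text`, `confidence`). The helper name `summarizeOcrResults` and the 0.8 threshold are illustrative assumptions and are not part of `ocr.js`.

```javascript
// Hypothetical helper (not part of ocr.js): summarize per-page OCR confidence
// and list the pages that fall below a chosen threshold.
function summarizeOcrResults(results, threshold = 0.8) {
  if (results.length === 0) {
    return { pageCount: 0, averageConfidence: 0, lowConfidencePages: [] };
  }
  const total = results.reduce((sum, page) => sum + page.confidence, 0);
  return {
    pageCount: results.length,
    averageConfidence: total / results.length,
    lowConfidencePages: results
      .filter((page) => page.confidence < threshold)
      .map((page) => page.pageNumber),
  };
}

const summary = summarizeOcrResults([
  { pageNumber: 1, text: 'Page content...', confidence: 0.94 },
  { pageNumber: 2, text: 'More content...', confidence: 0.62 },
]);
// Pages listed here are candidates for a re-scan or manual review.
console.log(summary.lowConfidencePages);
```

Pages flagged this way could be re-scanned or surfaced to the user before indexing; the right threshold depends on the source material.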

Search Service (`search.js`)

Manages document indexing and search using Meilisearch.

Key Functions:

import {
  indexDocumentPage,
  bulkIndexPages,
  removePageFromIndex,
  searchPages
} from './search.js';

// Index a single page
await indexDocumentPage({
  pageId: 'page_doc123_p1',
  documentId: 'doc123',
  pageNumber: 1,
  text: 'Extracted OCR text...',
  confidence: 0.94
});

// Bulk index multiple pages
await bulkIndexPages([
  { pageId: '...', documentId: '...', pageNumber: 1, text: '...', confidence: 0.94 },
  { pageId: '...', documentId: '...', pageNumber: 2, text: '...', confidence: 0.91 }
]);

// Search with filters
const results = await searchPages('bilge pump maintenance', {
  filter: `userId = "user123" AND vertical = "boating"`,
  limit: 20,
  offset: 0
});

// Remove page from index
await removePageFromIndex('doc123', 5);

Features:

Full metadata enrichment from database
Multi-vertical support (boating, marina, property)
Automatic entity/component linking
Tenant isolation via filters
Real-time indexing

Document Structure:

See docs/architecture/meilisearch-config.json for complete schema.

Key fields:

id: Unique page identifier (page_{docId}_p{pageNum})
vertical: boating | marina | property
organizationId, entityId, userId: Access control
text: Full OCR text content
systems, categories, tags: Metadata arrays
Boat-specific: boatMake, boatModel, boatYear, vesselType
OCR metadata: ocrConfidence, language
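
To make the key fields concrete, here is a minimal sketch of how a page-level search document might be assembled. The `id` pattern follows the one listed above; the function names `buildPageId` and `buildPageDocument` and the choice of fields are assumptions for illustration and may differ from what `search.js` does.

```javascript
// Hypothetical sketch: builds the page id using the pattern listed above.
function buildPageId(documentId, pageNumber) {
  return `page_${documentId}_p${pageNumber}`;
}

// Hypothetical sketch: assembles a small subset of the key fields.
// The real document carries more metadata (see meilisearch-config.json).
function buildPageDocument({ documentId, pageNumber, text, confidence, vertical, userId }) {
  return {
    id: buildPageId(documentId, pageNumber),
    vertical,
    userId,
    text,
    ocrConfidence: confidence,
  };
}

console.log(buildPageDocument({
  documentId: 'doc123',
  pageNumber: 5,
  text: 'Extracted OCR text...',
  confidence: 0.94,
  vertical: 'boating',
  userId: 'user123',
}));
```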

Usage Examples

Complete Document Upload Flow

import { v4 as uuidv4 } from 'uuid';
import { Queue } from 'bullmq';

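// Assumes `db` (database handle), `orgId`, `boatId`, `userId`, `fileSize`,
// `fileHash`, and `pageCount` are already defined by the upload handler.

// 1. Upload file and create document record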
const documentId = uuidv4();
const filePath = '/uploads/boat-manual.pdf';

db.prepare(`
  INSERT INTO documents (
    id, organization_id, entity_id, uploaded_by,
    title, document_type, file_path, file_name,
    file_size, file_hash, page_count, status, created_at, updated_at
  ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, 'processing', ?, ?)
`).run(
  documentId,
  orgId,
  boatId,
  userId,
  'Prestige F4.9 Owner Manual',
  'owner-manual',
  filePath,
  'boat-manual.pdf',
  fileSize,
  fileHash,
  pageCount,
  Math.floor(Date.now() / 1000),  // Unix timestamp in whole seconds
  Math.floor(Date.now() / 1000)
);

// 2. Create OCR job
const jobId = uuidv4();
db.prepare(`
  INSERT INTO ocr_jobs (id, document_id, status, created_at)
  VALUES (?, ?, 'pending', ?)
`).run(jobId, documentId, Math.floor(Date.now() / 1000));

// 3. Queue background processing
const ocrQueue = new Queue('ocr-jobs', {
  connection: { host: 'localhost', port: 6379 }
});

await ocrQueue.add('process-document', {
  documentId: documentId,
  jobId: jobId,
  filePath: filePath
});

console.log(`Document ${documentId} queued for OCR processing`);

Search Integration

// User searches for maintenance procedures
const query = 'blackwater pump maintenance';

const results = await searchPages(query, {
  // Only show user's documents
  filter: `userId = "${userId}"`,
  limit: 10
});

// Results include:
results.hits.forEach(hit => {
  console.log(`
    Document: ${hit.title}
    Page: ${hit.pageNumber}
    Boat: ${hit.boatName} (${hit.boatMake} ${hit.boatModel})
    Confidence: ${(hit.ocrConfidence * 100).toFixed(0)}%
    Snippet: ${hit._formatted.text.substring(0, 200)}...
  `);
});
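
The filter above interpolates `userId` directly into the filter string. If that value can ever contain quotes, it is safer to escape it first. The sketch below is one possible approach, assuming Meilisearch accepts backslash-escaped double quotes inside quoted filter values; `quoteFilterValue` and `buildTenantFilter` are hypothetical helpers, not part of `search.js`.

```javascript
// Hypothetical helper: wrap a value in double quotes for a filter expression,
// escaping backslashes and embedded double quotes.
function quoteFilterValue(value) {
  const escaped = String(value).replace(/\\/g, '\\\\').replace(/"/g, '\\"');
  return `"${escaped}"`;
}

// Hypothetical helper: build a tenant-scoped filter string.
function buildTenantFilter({ userId, vertical }) {
  const clauses = [`userId = ${quoteFilterValue(userId)}`];
  if (vertical) {
    clauses.push(`vertical = ${quoteFilterValue(vertical)}`);
  }
  return clauses.join(' AND ');
}

console.log(buildTenantFilter({ userId: 'user123', vertical: 'boating' }));
// userId = "user123" AND vertical = "boating"
```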

Monitoring OCR Progress

// Poll job status
const jobStatus = db.prepare(`
  SELECT status, progress, error FROM ocr_jobs WHERE id = ?
`).get(jobId);

console.log(`Status: ${jobStatus.status}`);
console.log(`Progress: ${jobStatus.progress}%`);

if (jobStatus.status === 'failed') {
  console.error(`Error: ${jobStatus.error}`);
}

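// Or listen for BullMQ progress events. Job objects do not emit events;
// progress is delivered through a QueueEvents listener instead.
import { QueueEvents } from 'bullmq';
const queueEvents = new QueueEvents('ocr-jobs', {
  connection: { host: 'localhost', port: 6379 }
});
queueEvents.on('progress', ({ jobId: queueJobId, data }) => {
  // queueJobId is BullMQ's own job id, not the ocr_jobs row id
  console.log(`Processing job ${queueJobId}: ${data}%`);
});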
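
The status query above reads the job once. To wait until a job finishes, the read can be wrapped in a polling loop. This is a minimal sketch: `waitForJob` is a hypothetical helper, `getStatus` stands for any function that returns the `{ status, progress, error }` row, and the terminal status name `'completed'` is an assumption (only `'pending'` and `'failed'` appear in this README).

```javascript
// Hypothetical polling helper. getStatus may be sync or async and should
// return an object shaped like the ocr_jobs row: { status, progress, error }.
async function waitForJob(getStatus, { intervalMs = 2000, maxAttempts = 150 } = {}) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const job = await getStatus();
    // 'completed' is an assumed status name; adjust to the actual schema.
    if (job.status === 'completed' || job.status === 'failed') {
      return job;
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Job did not finish after ${maxAttempts} attempts`);
}
```

With the database handle from the example above, `getStatus` could be a small closure around the prepared `SELECT` statement.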

Error Handling

All services use consistent error handling:

try {
  await indexDocumentPage(pageData);
} catch (error) {
  if (error.message.includes('Document not found')) {
    // Handle missing document
  } else if (error.message.includes('Meilisearch')) {
    // Handle search service errors
  } else {
    // Generic error handling
  }
}

Common Errors:

OCR extraction failed: PDF conversion tools missing or file corrupted
Failed to index page: Meilisearch unavailable or configuration issue
Document not found: Database record missing
Search failed: Invalid query or filters
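
Matching on message text in several places tends to drift out of sync. One option is to centralize the mapping from the common error messages listed above to a short error kind. This is only a sketch; `classifyServiceError` and the kind names are hypothetical and not part of the services.

```javascript
// Hypothetical mapping from the common error messages above to short kinds.
const ERROR_KINDS = [
  { kind: 'ocr', prefix: 'OCR extraction failed' },
  { kind: 'indexing', prefix: 'Failed to index page' },
  { kind: 'not_found', prefix: 'Document not found' },
  { kind: 'search', prefix: 'Search failed' },
];

// Returns the first matching kind, or 'unknown' when nothing matches.
function classifyServiceError(error) {
  const message = error && error.message ? error.message : '';
  const match = ERROR_KINDS.find(({ prefix }) => message.includes(prefix));
  return match ? match.kind : 'unknown';
}

console.log(classifyServiceError(new Error('Document not found: doc123')));
// not_found
```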

Performance Considerations

OCR Service

Speed: ~3-6 seconds per page (depends on content density)
Quality: 300 DPI provides optimal OCR accuracy
Memory: ~50-100 MB per worker process
Temp Files: Cleaned up automatically after processing

Optimization:

# Process multiple documents in parallel (set in the worker's environment)
OCR_CONCURRENCY=2  # Process 2 documents at once
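
As a worked example of the per-page speed figure, the sketch below turns the quoted ~3-6 seconds per page into a rough time range for one document. `estimateOcrSeconds` is a hypothetical helper, and it assumes the pages of a single document are processed one after another.

```javascript
// Rough estimate only: uses the ~3-6 s/page range quoted above and assumes
// pages within one document are processed sequentially.
function estimateOcrSeconds(pageCount, { minPerPage = 3, maxPerPage = 6 } = {}) {
  return { min: pageCount * minPerPage, max: pageCount * maxPerPage };
}

const estimate = estimateOcrSeconds(120);
console.log(`120 pages: roughly ${estimate.min / 60}-${estimate.max / 60} minutes`);
// 120 pages: roughly 6-12 minutes
```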

Search Service

Indexing: ~10-50ms per page
Search: <50ms for typical queries
Index Size: ~1-2 KB per page

Best Practices:

Use filters for tenant isolation
Limit results with pagination
Bulk index when possible
Use specific search terms
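
For the pagination practice, a page number from the UI has to be converted into the `limit` and `offset` options that `searchPages` accepts. A minimal sketch, with the hypothetical helper name `pageToSearchOptions` and an assumed default page size of 20:

```javascript
// Hypothetical helper: convert a 1-based page number into limit/offset.
// Invalid or out-of-range page numbers fall back to the first page.
function pageToSearchOptions(page, pageSize = 20) {
  const safePage = Math.max(1, Math.floor(page) || 1);
  return { limit: pageSize, offset: (safePage - 1) * pageSize };
}

console.log(pageToSearchOptions(3)); // { limit: 20, offset: 40 }
```

The returned object can be spread into the options passed to `searchPages` alongside the tenant filter.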

Testing

Run the test suite:

# Test OCR pipeline
node scripts/test-ocr.js

# Test individual service
node -e "
  import('./services/ocr.js').then(async (ocr) => {
    const tools = ocr.checkPDFTools();
    console.log('Available tools:', tools);
  });
"

Configuration

Environment variables:

# Meilisearch
MEILISEARCH_HOST=http://localhost:7700
MEILISEARCH_MASTER_KEY=masterKey
MEILISEARCH_INDEX_NAME=navidocs-pages

# Database
DATABASE_PATH=/data/navidocs.db

# Redis (for BullMQ)
REDIS_HOST=localhost
REDIS_PORT=6379
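
A service can read these variables once at startup and pass a plain config object around. The sketch below is one way to do that; `loadConfig` is a hypothetical helper, and the fallback defaults are assumptions taken from the example values above rather than from the actual services.

```javascript
// Hypothetical loader: reads the variables listed above from an env object.
// Defaults mirror the example values in this README and are assumptions.
function loadConfig(env = process.env) {
  const redisPort = Number.parseInt(env.REDIS_PORT ?? '6379', 10);
  if (Number.isNaN(redisPort)) {
    throw new Error(`Invalid REDIS_PORT: ${env.REDIS_PORT}`);
  }
  return {
    meilisearch: {
      host: env.MEILISEARCH_HOST ?? 'http://localhost:7700',
      masterKey: env.MEILISEARCH_MASTER_KEY, // no default: must be provided
      indexName: env.MEILISEARCH_INDEX_NAME ?? 'navidocs-pages',
    },
    databasePath: env.DATABASE_PATH ?? '/data/navidocs.db',
    redis: { host: env.REDIS_HOST ?? 'localhost', port: redisPort },
  };
}

console.log(loadConfig({ REDIS_PORT: '6380' }).redis);
// { host: 'localhost', port: 6380 }
```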

Development

Adding New Search Filters

Edit search.js and add to buildSearchDocument():

// Add custom metadata field
if (metadata.customField) {
  searchDoc.customField = metadata.customField;
}

Update Meilisearch config in docs/architecture/meilisearch-config.json by adding the new field to filterableAttributes:

{
  "settings": {
    "filterableAttributes": [
      "customField"
    ]
  }
}

Supporting New Languages

# Install Tesseract language data (shell)
sudo apt-get install tesseract-ocr-fra  # French
sudo apt-get install tesseract-ocr-spa  # Spanish

// Use in OCR
const results = await extractTextFromPDF(pdfPath, {
  language: 'fra'  // or 'spa', 'deu', etc.
});
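
The `language` option takes three-letter Tesseract codes, while user locales are often two-letter codes such as `fr`. A small mapping layer can bridge the two. This is a sketch under that assumption; `toTesseractLanguage` and the set of supported codes are hypothetical and may not match the mapping used in `ocr.js`.

```javascript
// Hypothetical mapping from two-letter locale codes to Tesseract codes.
const TESSERACT_LANGUAGES = new Map([
  ['en', 'eng'],
  ['fr', 'fra'],
  ['es', 'spa'],
  ['de', 'deu'],
]);

// Accepts 'fr', 'fr-CA', or an already-valid code such as 'fra'.
// Unknown codes fall back to English.
function toTesseractLanguage(code, fallback = 'eng') {
  const normalized = String(code || '').toLowerCase().split('-')[0];
  if ([...TESSERACT_LANGUAGES.values()].includes(normalized)) {
    return normalized;
  }
  return TESSERACT_LANGUAGES.get(normalized) ?? fallback;
}

console.log(toTesseractLanguage('fr-CA')); // fra
```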