Implements complete TOC feature for document navigation with bilingual support.
## TOC Detection & Extraction
- Pattern-based TOC detection with 3 regex patterns
- Heuristic validation (30%+ match ratio, 5+ entries, sequential pages)
- Hierarchical section key parsing (e.g., "4.1.2" → level 3, parent "4.1")
- Database schema with parent-child relationships
- Automatic extraction during OCR post-processing
- Server-side LRU caching (200 entries, 30min TTL)
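
The section-key parsing and the validation thresholds listed above can be sketched as follows. This is a minimal illustration, not the code in toc-extractor.js: the function names, the `entries` shape, and the reading of "sequential pages" as non-decreasing page numbers are all assumptions.

```javascript
// Hypothetical helper: derive hierarchy info from a dotted section key.
function parseSectionKey(sectionKey) {
  const parts = sectionKey.split('.').filter((part) => part.length > 0);
  return {
    level: parts.length,
    parentKey: parts.length > 1 ? parts.slice(0, -1).join('.') : null,
  };
}

// Hypothetical validator applying the thresholds from the list above.
// `lines` are the text lines of a candidate page; `entries` are the lines
// that matched a TOC pattern, each carrying a parsed `page` number.
function looksLikeToc(lines, entries) {
  if (lines.length === 0 || entries.length < 5) return false;
  const matchRatio = entries.length / lines.length;
  if (matchRatio < 0.3) return false;
  // Assumption: "sequential pages" means page numbers never decrease.
  return entries.every((entry, i) => i === 0 || entry.page >= entries[i - 1].page);
}

console.log(parseSectionKey('4.1.2')); // { level: 3, parentKey: '4.1' }
```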
## UI Components
- TocSidebar: Collapsible sidebar (320px) with auto-open on TOC presence
- TocEntry: Recursive component for hierarchical rendering
- Flex layout: Sidebar + PDF viewer side-by-side
- Active page highlighting with real-time sync
- localStorage persistence for sidebar state
## Navigation Features
- Click TOC entry → PDF jumps to page
- Deep link support: URL hash format #p=12
- Page change events: navidocs:pagechange custom event
- URL hash updates on all navigation (next/prev/goTo/TOC)
- Hash change listener for external navigation
- Page clamping and validation
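
A deep link such as #p=12 has to be parsed, validated, and clamped before the viewer jumps to it. The sketch below shows one way to do that; `parsePageHash` and `formatPageHash` are hypothetical names, and the actual logic in DocumentView.vue may differ.

```javascript
// Hypothetical helper: read a page number from a "#p=12" style hash.
// Returns null when the hash does not carry a page number; otherwise the
// page is clamped into the range [1, totalPages].
function parsePageHash(hash, totalPages) {
  const match = /^#p=(\d+)$/.exec(hash);
  if (!match) return null;
  const page = Number.parseInt(match[1], 10);
  const lastPage = Math.max(totalPages, 1);
  return Math.min(Math.max(page, 1), lastPage);
}

// Hypothetical inverse: build the hash written to the URL on navigation.
function formatPageHash(page) {
  return `#p=${page}`;
}

console.log(parsePageHash('#p=12', 40));  // 12
console.log(parsePageHash('#p=999', 40)); // 40
```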
## Search Integration
- "Jump to section" button in search results
- Shows when result has section field
- Navigates to document with page number and hash
## Accessibility
- ARIA attributes: role, aria-label, aria-expanded, aria-current
- Keyboard navigation: Enter/Space on entries, Tab focus
- Screen reader support with aria-live regions
- Semantic HTML with proper list/listitem roles
## Internationalization (i18n)
- Vue I18n integration with vue-i18n package
- English and French translations
- 8 TOC-specific translation keys
- Language switcher component in document viewer
- Locale persistence in localStorage
## Error Handling
- Specific error messages for each failure case
- Validation before processing (doc exists, has pages, has OCR)
- Non-blocking TOC extraction (doesn't fail OCR jobs)
- Detailed error returns: {success, error, entriesCount, pages}
## API Endpoints
- GET /api/documents/:id/toc?format=flat|tree
- POST /api/documents/:id/toc/extract
- Cache invalidation on re-extraction
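
The caching behaviour (200 entries, 30min TTL, invalidation on re-extraction) could look roughly like the sketch below. The class name, the `documentId:format` key scheme, and the injectable clock are assumptions made for illustration; the server may use a library instead.

```javascript
// Hypothetical LRU cache with a per-entry TTL, built on Map insertion order.
class TocCache {
  constructor({ maxEntries = 200, ttlMs = 30 * 60 * 1000, now = Date.now } = {}) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.now = now; // injectable clock, which makes expiry testable
    this.entries = new Map();
  }

  get(key) {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() - entry.storedAt > this.ttlMs) {
      this.entries.delete(key); // expired
      return undefined;
    }
    // Re-insert so the key becomes the most recently used one.
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key, value) {
    this.entries.delete(key);
    this.entries.set(key, { value, storedAt: this.now() });
    if (this.entries.size > this.maxEntries) {
      // The first key in a Map is the least recently used one.
      this.entries.delete(this.entries.keys().next().value);
    }
  }

  // Drop every cached format of one document, e.g. after re-extraction.
  invalidateDocument(documentId) {
    for (const key of this.entries.keys()) {
      if (key.startsWith(`${documentId}:`)) this.entries.delete(key);
    }
  }
}
```

A POST to the extract endpoint would then call something like `cache.invalidateDocument(id)` so that the next GET rebuilds both the flat and tree responses.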
## Testing
- Smoke test script: 9 comprehensive tests
- E2E testing guide with 5 manual scenarios
- Tests cover: API, caching, validation, navigation, search
## Database
- Migration 002: document_toc table
- Fields: id, document_id, title, section_key, page_start, level, parent_id, order_index
- Foreign keys with CASCADE delete
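
Since the table stores parent_id and order_index, the tree format can be derived from the flat rows. The following is a sketch under that assumption; `buildTocTree` is a hypothetical name, and treating rows with a missing parent as top-level entries is a choice made here, not something the schema dictates.

```javascript
// Hypothetical helper: turn flat document_toc rows into a nested tree.
function buildTocTree(rows) {
  const nodes = new Map(rows.map((row) => [row.id, { ...row, children: [] }]));
  const roots = [];
  for (const node of nodes.values()) {
    const parent = node.parent_id != null ? nodes.get(node.parent_id) : undefined;
    // Rows whose parent is missing are treated as top-level entries.
    (parent ? parent.children : roots).push(node);
  }
  const sortDeep = (list) => {
    list.sort((a, b) => a.order_index - b.order_index);
    list.forEach((node) => sortDeep(node.children));
    return list;
  };
  return sortDeep(roots);
}

const tree = buildTocTree([
  { id: 2, parent_id: 1, title: 'Bilge pump', section_key: '4.1', order_index: 1 },
  { id: 1, parent_id: null, title: 'Systems', section_key: '4', order_index: 0 },
]);
console.log(tree[0].children[0].title); // Bilge pump
```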
## Files Changed
- New: TocSidebar.vue, TocEntry.vue, LanguageSwitcher.vue
- New: toc-extractor.js, toc.js routes, i18n setup
- Modified: DocumentView.vue (sidebar, deep links, events)
- Modified: SearchView.vue (Jump to section button)
- Modified: ocr-worker.js (TOC post-processing)
- New: toc-smoke-test.sh, TOC_E2E_TEST.md
Generated with Claude Code (https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
NaviDocs Services
This directory contains core business logic services for NaviDocs.
Services
OCR Service (ocr.js)
Handles text extraction from PDF documents using Tesseract.js OCR.
Key Functions:
import { extractTextFromPDF, extractTextFromImage, checkPDFTools } from './ocr.js';
// Extract text from PDF (all pages)
const results = await extractTextFromPDF('/path/to/document.pdf', {
language: 'eng',
onProgress: (pageNum, total) => {
console.log(`Processing page ${pageNum}/${total}`);
}
});
// Result format:
// [
// { pageNumber: 1, text: "Page content...", confidence: 0.94 },
// { pageNumber: 2, text: "More content...", confidence: 0.89 },
// ...
// ]
// Extract from single image
const result = await extractTextFromImage('/path/to/image.png', 'eng');
// Check available PDF tools
const tools = checkPDFTools();
// { pdftoppm: true, imagemagick: true }
Requirements:
- Tesseract.js (installed via npm)
- PDF conversion tool: poppler-utils (pdftoppm) or imagemagick
Features:
- Converts PDF pages to high-quality images (300 DPI)
- Runs Tesseract OCR on each page
- Returns confidence scores for quality assessment
- Graceful error handling per page
- Progress callbacks for long documents
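
Because every page result carries a confidence score, a caller can flag pages that may need review. The helper below is a sketch that operates on the result format shown above; the name `summarizeConfidence` and the 0.8 threshold are assumptions, not part of ocr.js.

```javascript
// Hypothetical helper: summarize per-page confidence from OCR results shaped
// like [{ pageNumber, text, confidence }, ...].
function summarizeConfidence(results, threshold = 0.8) {
  if (results.length === 0) {
    return { average: 0, lowConfidencePages: [] };
  }
  const total = results.reduce((sum, page) => sum + page.confidence, 0);
  return {
    average: total / results.length,
    // Page numbers whose confidence falls below the (assumed) threshold.
    lowConfidencePages: results
      .filter((page) => page.confidence < threshold)
      .map((page) => page.pageNumber),
  };
}

const summary = summarizeConfidence([
  { pageNumber: 1, text: 'Page content...', confidence: 0.94 },
  { pageNumber: 2, text: 'More content...', confidence: 0.89 },
  { pageNumber: 3, text: 'Smudged scan...', confidence: 0.5 },
]);
console.log(summary.lowConfidencePages); // [ 3 ]
```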
Search Service (search.js)
Manages document indexing and search using Meilisearch.
Key Functions:
import {
indexDocumentPage,
bulkIndexPages,
removePageFromIndex,
searchPages
} from './search.js';
// Index a single page
await indexDocumentPage({
pageId: 'page_doc123_1',
documentId: 'doc123',
pageNumber: 1,
text: 'Extracted OCR text...',
confidence: 0.94
});
// Bulk index multiple pages
await bulkIndexPages([
{ pageId: '...', documentId: '...', pageNumber: 1, text: '...', confidence: 0.94 },
{ pageId: '...', documentId: '...', pageNumber: 2, text: '...', confidence: 0.91 }
]);
// Search with filters
const results = await searchPages('bilge pump maintenance', {
filter: `userId = "user123" AND vertical = "boating"`,
limit: 20,
offset: 0
});
// Remove page from index
await removePageFromIndex('doc123', 5);
Features:
- Full metadata enrichment from database
- Multi-vertical support (boating, marina, property)
- Automatic entity/component linking
- Tenant isolation via filters
- Real-time indexing
Document Structure:
See docs/architecture/meilisearch-config.json for complete schema.
Key fields:
- id: Unique page identifier (page_{docId}_p{pageNum})
- vertical: boating | marina | property
- organizationId, entityId, userId: Access control
- text: Full OCR text content
- systems, categories, tags: Metadata arrays
- Boat-specific: boatMake, boatModel, boatYear, vesselType
- OCR metadata: ocrConfidence, language
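
The id format page_{docId}_p{pageNum} can be built and parsed with two small helpers. These are illustrative only; `buildPageId` and `parsePageId` are hypothetical names and are not claimed to exist in search.js.

```javascript
// Hypothetical helper: build an index id in the page_{docId}_p{pageNum} format.
function buildPageId(documentId, pageNumber) {
  return `page_${documentId}_p${pageNumber}`;
}

// Hypothetical inverse: recover the document id and page number from an id.
// The greedy group lets document ids contain underscores.
function parsePageId(pageId) {
  const match = /^page_(.+)_p(\d+)$/.exec(pageId);
  if (!match) return null;
  return { documentId: match[1], pageNumber: Number.parseInt(match[2], 10) };
}

console.log(buildPageId('doc123', 5)); // page_doc123_p5
```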
Usage Examples
Complete Document Upload Flow
import { v4 as uuidv4 } from 'uuid';
import { Queue } from 'bullmq';
// 1. Upload file and create document record
const documentId = uuidv4();
const filePath = '/uploads/boat-manual.pdf';
db.prepare(`
INSERT INTO documents (
id, organization_id, entity_id, uploaded_by,
title, document_type, file_path, file_name,
file_size, file_hash, page_count, status, created_at, updated_at
) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, 'processing', ?, ?)
`).run(
documentId,
orgId,
boatId,
userId,
'Prestige F4.9 Owner Manual',
'owner-manual',
filePath,
'boat-manual.pdf',
fileSize,
fileHash,
pageCount,
Math.floor(Date.now() / 1000), // Unix timestamp in whole seconds
Math.floor(Date.now() / 1000)
);
// 2. Create OCR job
const jobId = uuidv4();
db.prepare(`
INSERT INTO ocr_jobs (id, document_id, status, created_at)
VALUES (?, ?, 'pending', ?)
`).run(jobId, documentId, Math.floor(Date.now() / 1000));
// 3. Queue background processing
const ocrQueue = new Queue('ocr-jobs', {
connection: { host: 'localhost', port: 6379 }
});
await ocrQueue.add('process-document', {
documentId: documentId,
jobId: jobId,
filePath: filePath
});
console.log(`Document ${documentId} queued for OCR processing`);
Search Integration
// User searches for maintenance procedures
const query = 'blackwater pump maintenance';
const results = await searchPages(query, {
// Only show user's documents
filter: `userId = "${userId}"`,
limit: 10
});
// Results include:
results.hits.forEach(hit => {
console.log(`
Document: ${hit.title}
Page: ${hit.pageNumber}
Boat: ${hit.boatName} (${hit.boatMake} ${hit.boatModel})
Confidence: ${(hit.ocrConfidence * 100).toFixed(0)}%
Snippet: ${hit._formatted.text.substring(0, 200)}...
`);
});
Monitoring OCR Progress
// Poll job status
const jobStatus = db.prepare(`
SELECT status, progress, error FROM ocr_jobs WHERE id = ?
`).get(jobId);
console.log(`Status: ${jobStatus.status}`);
console.log(`Progress: ${jobStatus.progress}%`);
if (jobStatus.status === 'failed') {
console.error(`Error: ${jobStatus.error}`);
}
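// Or subscribe to BullMQ progress events. Job objects are not event
// emitters, so listen on a QueueEvents instance instead.
import { QueueEvents } from 'bullmq';
const queueEvents = new QueueEvents('ocr-jobs', {
  connection: { host: 'localhost', port: 6379 }
});
// queueJobId is the BullMQ job id, which differs from the ocr_jobs row id
queueEvents.on('progress', ({ jobId: queueJobId, data }) => {
  console.log(`Job ${queueJobId}: ${data}%`);
});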
Error Handling
All services use consistent error handling:
try {
await indexDocumentPage(pageData);
} catch (error) {
if (error.message.includes('Document not found')) {
// Handle missing document
} else if (error.message.includes('Meilisearch')) {
// Handle search service errors
} else {
// Generic error handling
}
}
Common Errors:
- OCR extraction failed: PDF conversion tools missing or file corrupted
- Failed to index page: Meilisearch unavailable or configuration issue
- Document not found: Database record missing
- Search failed: Invalid query or filters
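
Matching on message fragments can be centralized in one place so callers branch on a stable category instead of repeating string checks. The sketch below assumes the messages start with the fragments listed above; the category names and `classifyServiceError` are hypothetical.

```javascript
// Hypothetical mapping from message fragments to error categories.
const ERROR_KINDS = [
  ['OCR extraction failed', 'ocr'],
  ['Failed to index page', 'indexing'],
  ['Document not found', 'not_found'],
  ['Search failed', 'search'],
];

// Returns the category of a service error, or 'unknown' if nothing matches.
function classifyServiceError(error) {
  const message = error && error.message ? error.message : '';
  const match = ERROR_KINDS.find(([fragment]) => message.includes(fragment));
  return match ? match[1] : 'unknown';
}

console.log(classifyServiceError(new Error('Document not found: doc123'))); // not_found
```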
Performance Considerations
OCR Service
- Speed: ~3-6 seconds per page (depends on content density)
- Quality: 300 DPI provides optimal OCR accuracy
- Memory: ~50-100 MB per worker process
- Temp Files: Cleaned up automatically after processing
Optimization:
# Process multiple documents in parallel (in worker)
OCR_CONCURRENCY=2  # Process 2 docs at once
Search Service
- Indexing: ~10-50ms per page
- Search: <50ms for typical queries
- Index Size: ~1-2 KB per page
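
The per-page figures quoted in this section can be turned into a rough capacity estimate for a single document. The helper below only multiplies those ranges by a page count; `estimateWorkload` is a hypothetical name, and actual numbers will depend on hardware and content.

```javascript
// Rough estimator using the per-page ranges quoted above:
// OCR 3-6 s, indexing 10-50 ms, index size 1-2 KB.
function estimateWorkload(pageCount) {
  return {
    ocrSeconds: { min: pageCount * 3, max: pageCount * 6 },
    indexingMs: { min: pageCount * 10, max: pageCount * 50 },
    indexKilobytes: { min: pageCount * 1, max: pageCount * 2 },
  };
}

// A 200-page manual: 600-1200 s of OCR, 2-10 s of indexing, 200-400 KB of index.
console.log(estimateWorkload(200));
```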
Best Practices:
- Use filters for tenant isolation
- Limit results with pagination
- Bulk index when possible
- Use specific search terms
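
The first two practices can be combined in a small options builder that always applies a tenant filter and converts a page number into limit and offset. This is a sketch: `buildSearchOptions` is a hypothetical helper, and rejecting quotes and backslashes outright is a conservative choice made here rather than a documented requirement.

```javascript
// Hypothetical helper: build searchPages options with a tenant filter
// and pagination.
function buildSearchOptions({ userId, vertical, page = 1, pageSize = 20 }) {
  const quote = (value) => {
    // Reject characters that could break out of the quoted filter value.
    if (/["\\]/.test(String(value))) {
      throw new Error(`Unsafe filter value: ${value}`);
    }
    return `"${value}"`;
  };
  const clauses = [`userId = ${quote(userId)}`];
  if (vertical) clauses.push(`vertical = ${quote(vertical)}`);
  return {
    filter: clauses.join(' AND '),
    limit: pageSize,
    offset: (Math.max(page, 1) - 1) * pageSize,
  };
}

console.log(buildSearchOptions({ userId: 'user123', vertical: 'boating', page: 2 }));
// { filter: 'userId = "user123" AND vertical = "boating"', limit: 20, offset: 20 }
```

The returned object has the same shape as the options passed to searchPages in the examples above.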
Testing
Run the test suite:
# Test OCR pipeline
node scripts/test-ocr.js
# Test individual service
node -e "
import('./services/ocr.js').then(async (ocr) => {
const tools = ocr.checkPDFTools();
console.log('Available tools:', tools);
});
"
Configuration
Environment variables:
# Meilisearch
MEILISEARCH_HOST=http://localhost:7700
MEILISEARCH_MASTER_KEY=masterKey
MEILISEARCH_INDEX_NAME=navidocs-pages
# Database
DATABASE_PATH=/data/navidocs.db
# Redis (for BullMQ)
REDIS_HOST=localhost
REDIS_PORT=6379
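
A service could read these variables once at startup and fail fast on bad values. The loader below is a sketch: `loadConfig` is a hypothetical function, and using the example values above as defaults is an assumption.

```javascript
// Hypothetical config loader. Defaults mirror the example values above,
// which is an assumption rather than documented behaviour.
function loadConfig(env = process.env) {
  const redisPort = Number.parseInt(env.REDIS_PORT ?? '6379', 10);
  if (!Number.isInteger(redisPort) || redisPort < 1 || redisPort > 65535) {
    throw new Error(`Invalid REDIS_PORT: ${env.REDIS_PORT}`);
  }
  return {
    meilisearch: {
      host: env.MEILISEARCH_HOST ?? 'http://localhost:7700',
      masterKey: env.MEILISEARCH_MASTER_KEY, // no default: secrets must be set explicitly
      indexName: env.MEILISEARCH_INDEX_NAME ?? 'navidocs-pages',
    },
    databasePath: env.DATABASE_PATH ?? '/data/navidocs.db',
    redis: { host: env.REDIS_HOST ?? 'localhost', port: redisPort },
  };
}

console.log(loadConfig({ REDIS_PORT: '6380' }).redis); // { host: 'localhost', port: 6380 }
```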
Development
Adding New Search Filters
Edit search.js and add to buildSearchDocument():
// Add custom metadata field
if (metadata.customField) {
searchDoc.customField = metadata.customField;
}
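Update the Meilisearch config in docs/architecture/meilisearch-config.json by adding the new field to filterableAttributes (JSON does not allow comments, so keep the file comment-free):
{
  "settings": {
    "filterableAttributes": [
      "customField"
    ]
  }
}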
Supporting New Languages
# Install Tesseract language data (shell)
sudo apt-get install tesseract-ocr-fra  # French
sudo apt-get install tesseract-ocr-spa  # Spanish
// Use in OCR
const results = await extractTextFromPDF(pdfPath, {
language: 'fra' // or 'spa', 'deu', etc.
});
See Also
- Worker Documentation: ../workers/README.md
- Meilisearch Config: ../../docs/architecture/meilisearch-config.json
- Database Schema: ../../docs/architecture/database-schema.sql