# NaviDocs Services

This directory contains core business logic services for NaviDocs.

## Services

### OCR Service (`ocr.js`)

Handles text extraction from PDF documents using Tesseract.js OCR.

**Key Functions:**

```javascript
import { extractTextFromPDF, extractTextFromImage, checkPDFTools } from './ocr.js';

// Extract text from PDF (all pages)
const results = await extractTextFromPDF('/path/to/document.pdf', {
  language: 'eng',
  onProgress: (pageNum, total) => {
    console.log(`Processing page ${pageNum}/${total}`);
  }
});

// Result format:
// [
//   { pageNumber: 1, text: "Page content...", confidence: 0.94 },
//   { pageNumber: 2, text: "More content...", confidence: 0.89 },
//   ...
// ]

// Extract from single image
const result = await extractTextFromImage('/path/to/image.png', 'eng');

// Check available PDF tools
const tools = checkPDFTools();
// { pdftoppm: true, imagemagick: true }
```

**Requirements:**
- Tesseract.js (installed via npm)
- PDF conversion tool: `poppler-utils` (pdftoppm) or `imagemagick`

**Features:**
- Converts PDF pages to high-quality images (300 DPI)
- Runs Tesseract OCR on each page
- Returns confidence scores for quality assessment
- Graceful error handling per page
- Progress callbacks for long documents
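
The confidence scores can drive a simple review step. The following is a minimal sketch, not part of `ocr.js`: the helper name `splitByConfidence` and the `0.8` threshold are assumptions chosen for illustration, and the input follows the result format shown earlier.

```javascript
// Hypothetical helper (not part of ocr.js): separates pages that look
// reliable from pages that may need a manual check or a re-scan.
function splitByConfidence(pages, minConfidence = 0.8) {
  const reliable = [];
  const needsReview = [];
  for (const page of pages) {
    // A page without a numeric confidence is treated as needing review
    const ok = typeof page.confidence === 'number' && page.confidence >= minConfidence;
    (ok ? reliable : needsReview).push(page);
  }
  return { reliable, needsReview };
}

// Made-up sample data in the documented result format
const sample = [
  { pageNumber: 1, text: 'Page content...', confidence: 0.94 },
  { pageNumber: 2, text: 'Smudged scan', confidence: 0.52 }
];

const { reliable, needsReview } = splitByConfidence(sample);
console.log(needsReview.map((p) => p.pageNumber)); // [ 2 ]
```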

---

### Search Service (`search.js`)

Manages document indexing and search using Meilisearch.

**Key Functions:**

```javascript
import {
  indexDocumentPage,
  bulkIndexPages,
  removePageFromIndex,
  searchPages
} from './search.js';

// Index a single page
await indexDocumentPage({
  pageId: 'page_doc123_1',
  documentId: 'doc123',
  pageNumber: 1,
  text: 'Extracted OCR text...',
  confidence: 0.94
});

// Bulk index multiple pages
await bulkIndexPages([
  { pageId: '...', documentId: '...', pageNumber: 1, text: '...', confidence: 0.94 },
  { pageId: '...', documentId: '...', pageNumber: 2, text: '...', confidence: 0.91 }
]);

// Search with filters
const results = await searchPages('bilge pump maintenance', {
  filter: `userId = "user123" AND vertical = "boating"`,
  limit: 20,
  offset: 0
});

// Remove page from index
await removePageFromIndex('doc123', 5);
```

**Features:**
- Full metadata enrichment from database
- Multi-vertical support (boating, marina, property)
- Automatic entity/component linking
- Tenant isolation via filters
- Real-time indexing

**Document Structure:**

See `docs/architecture/meilisearch-config.json` for complete schema.

Key fields:
- `id`: Unique page identifier (`page_{docId}_p{pageNum}`)
- `vertical`: boating | marina | property
- `organizationId`, `entityId`, `userId`: Access control
- `text`: Full OCR text content
- `systems`, `categories`, `tags`: Metadata arrays
- Boat-specific: `boatMake`, `boatModel`, `boatYear`, `vesselType`
- OCR metadata: `ocrConfidence`, `language`
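
To make the `id` pattern concrete, here is a small sketch. `buildPageId` is a hypothetical helper rather than a function exported by `search.js`, and the example document uses made-up values for a handful of the fields listed above.

```javascript
// Hypothetical helper following the documented pattern page_{docId}_p{pageNum}
function buildPageId(documentId, pageNumber) {
  if (!Number.isInteger(pageNumber) || pageNumber < 1) {
    throw new RangeError(`pageNumber must be a positive integer, got ${pageNumber}`);
  }
  return `page_${documentId}_p${pageNumber}`;
}

// Illustrative search document; every value here is invented
const exampleDoc = {
  id: buildPageId('doc123', 5),
  vertical: 'boating',
  userId: 'user123',
  text: 'Extracted OCR text...',
  ocrConfidence: 0.94
};

console.log(exampleDoc.id); // page_doc123_p5
```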

---

## Usage Examples

### Complete Document Upload Flow

```javascript
import { v4 as uuidv4 } from 'uuid';
import { Queue } from 'bullmq';

// Assumes `db` (the SQLite database handle) and orgId, boatId, userId,
// fileSize, fileHash, and pageCount are already in scope.

// 1. Upload file and create document record
const documentId = uuidv4();
const filePath = '/uploads/boat-manual.pdf';

// Unix timestamp in whole seconds (Date.now() returns milliseconds)
const now = Math.floor(Date.now() / 1000);

db.prepare(`
  INSERT INTO documents (
    id, organization_id, entity_id, uploaded_by,
    title, document_type, file_path, file_name,
    file_size, file_hash, page_count, status, created_at, updated_at
  ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, 'processing', ?, ?)
`).run(
  documentId,
  orgId,
  boatId,
  userId,
  'Prestige F4.9 Owner Manual',
  'owner-manual',
  filePath,
  'boat-manual.pdf',
  fileSize,
  fileHash,
  pageCount,
  now,
  now
);

// 2. Create OCR job
const jobId = uuidv4();
db.prepare(`
  INSERT INTO ocr_jobs (id, document_id, status, created_at)
  VALUES (?, ?, 'pending', ?)
`).run(jobId, documentId, now);

// 3. Queue background processing
const ocrQueue = new Queue('ocr-jobs', {
  connection: { host: 'localhost', port: 6379 }
});

await ocrQueue.add('process-document', {
  documentId: documentId,
  jobId: jobId,
  filePath: filePath
});

console.log(`Document ${documentId} queued for OCR processing`);
```

### Search Integration

```javascript
import { searchPages } from './search.js';

// User searches for maintenance procedures
const query = 'blackwater pump maintenance';

const results = await searchPages(query, {
  // Only show user's documents. userId is interpolated into the filter,
  // so take it from the authenticated session, never from raw user input.
  filter: `userId = "${userId}"`,
  limit: 10
});

// Results include:
results.hits.forEach(hit => {
  // Meilisearch only returns _formatted when highlighting or cropping is
  // requested, so fall back to the raw text.
  const snippet = (hit._formatted?.text ?? hit.text).substring(0, 200);
  console.log(`
    Document: ${hit.title}
    Page: ${hit.pageNumber}
    Boat: ${hit.boatName} (${hit.boatMake} ${hit.boatModel})
    Confidence: ${(hit.ocrConfidence * 100).toFixed(0)}%
    Snippet: ${snippet}...
  `);
});
```
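
Because the filter is built by string interpolation, a value containing a double quote could change its meaning. The sketch below shows one way to guard against that. Both helpers are hypothetical (they are not part of `search.js`), and the escaping assumes Meilisearch's convention of backslash-escaping quotes inside quoted filter values.

```javascript
// Hypothetical helper: escapes backslashes and double quotes so the value
// cannot terminate the quoted string early.
function quoteFilterValue(value) {
  const escaped = String(value).replace(/\\/g, '\\\\').replace(/"/g, '\\"');
  return `"${escaped}"`;
}

// Hypothetical helper: builds the tenant filter used in the examples above.
function buildUserFilter(userId, vertical) {
  const clauses = [`userId = ${quoteFilterValue(userId)}`];
  if (vertical) clauses.push(`vertical = ${quoteFilterValue(vertical)}`);
  return clauses.join(' AND ');
}

console.log(buildUserFilter('user123', 'boating'));
// userId = "user123" AND vertical = "boating"
```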

### Monitoring OCR Progress

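```javascript
import { QueueEvents } from 'bullmq';

// Poll job status from the database
const jobStatus = db.prepare(`
  SELECT status, progress, error FROM ocr_jobs WHERE id = ?
`).get(jobId);

console.log(`Status: ${jobStatus.status}`);
console.log(`Progress: ${jobStatus.progress}%`);

if (jobStatus.status === 'failed') {
  console.error(`Error: ${jobStatus.error}`);
}

// Or use BullMQ events. Job instances do not emit events themselves,
// so subscribe through QueueEvents on the same queue name.
const queueEvents = new QueueEvents('ocr-jobs', {
  connection: { host: 'localhost', port: 6379 }
});

queueEvents.on('progress', ({ jobId: queueJobId, data }) => {
  // queueJobId is the BullMQ job ID. It matches the ocr_jobs row ID only
  // if that ID was passed as the `jobId` option to ocrQueue.add().
  console.log(`Job ${queueJobId} processing: ${data}%`);
});
```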

---

## Error Handling

All services use consistent error handling:

```javascript
try {
  await indexDocumentPage(pageData);
} catch (error) {
  if (error.message.includes('Document not found')) {
    // Handle missing document
  } else if (error.message.includes('Meilisearch')) {
    // Handle search service errors
  } else {
    // Generic error handling
  }
}
```

**Common Errors:**

- `OCR extraction failed`: PDF conversion tools missing or file corrupted
- `Failed to index page`: Meilisearch unavailable or configuration issue
- `Document not found`: Database record missing
- `Search failed`: Invalid query or filters
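
Since these errors are identified by message text, the matching can be kept in one place. This is only a sketch: `classifyServiceError` and the category names are invented for illustration, and the match strings are the messages listed above.

```javascript
// Hypothetical lookup table: message fragments from the list above,
// mapped to made-up category names.
const ERROR_CATEGORIES = [
  { match: 'OCR extraction failed', category: 'ocr' },
  { match: 'Failed to index page', category: 'indexing' },
  { match: 'Document not found', category: 'not-found' },
  { match: 'Search failed', category: 'search' }
];

// Returns the first matching category, or 'unknown' when nothing matches
function classifyServiceError(error) {
  const message = error instanceof Error ? error.message : String(error);
  const found = ERROR_CATEGORIES.find(({ match }) => message.includes(match));
  return found ? found.category : 'unknown';
}

console.log(classifyServiceError(new Error('Document not found: doc123'))); // not-found
console.log(classifyServiceError(new Error('disk full'))); // unknown
```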

---

## Performance Considerations

### OCR Service

- **Speed**: ~3-6 seconds per page (depends on content density)
- **Quality**: Pages are rendered at 300 DPI, the resolution commonly recommended for Tesseract accuracy
- **Memory**: ~50-100 MB per worker process
- **Temp Files**: Cleaned up automatically after processing

**Optimization:**
```bash
# Environment variable read by the worker:
# process multiple documents in parallel (here, 2 docs at once)
OCR_CONCURRENCY=2
```
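
For a rough sense of scale, the per-page figure can be turned into a processing-time estimate for one document. The helper below is hypothetical and simply multiplies the page count by the ~3-6 seconds per page range quoted above. Real timings will vary with hardware and content.

```javascript
// Hypothetical estimator based on the ~3-6 seconds per page figure.
// Returns the expected range in seconds for a single document.
function estimateOcrSeconds(pageCount, { minPerPage = 3, maxPerPage = 6 } = {}) {
  return { min: pageCount * minPerPage, max: pageCount * maxPerPage };
}

// A 200-page manual: 600 to 1200 seconds
const estimate = estimateOcrSeconds(200);
console.log(`${estimate.min / 60}-${estimate.max / 60} minutes`); // 10-20 minutes
```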

### Search Service

- **Indexing**: ~10-50ms per page
- **Search**: <50ms for typical queries
- **Index Size**: ~1-2 KB per page

**Best Practices:**
- Use filters for tenant isolation
- Limit results with pagination
- Bulk index when possible
- Use specific search terms
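
Pagination maps naturally onto the `limit` and `offset` options that `searchPages` accepts. A minimal sketch, assuming a 1-based page number coming from the UI; `toPagination` is a hypothetical helper and the page size of 20 is an arbitrary default. The result can be spread into the options object, for example `searchPages(query, { filter, ...toPagination(2) })`.

```javascript
// Hypothetical helper: converts a 1-based page number into limit/offset.
// Invalid or missing page numbers fall back to the first page.
function toPagination(page, pageSize = 20) {
  const safePage = Math.max(1, Math.floor(page) || 1);
  return { limit: pageSize, offset: (safePage - 1) * pageSize };
}

console.log(toPagination(3)); // { limit: 20, offset: 40 }
```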

---

## Testing

Run the test suite:

```bash
# Test OCR pipeline
node scripts/test-ocr.js

# Test individual service
node -e "
import('./services/ocr.js').then(async (ocr) => {
  const tools = ocr.checkPDFTools();
  console.log('Available tools:', tools);
});
"
```

---

## Configuration

Environment variables:

```bash
# Meilisearch
MEILISEARCH_HOST=http://localhost:7700
MEILISEARCH_MASTER_KEY=masterKey
MEILISEARCH_INDEX_NAME=navidocs-pages

# Database
DATABASE_PATH=/data/navidocs.db

# Redis (for BullMQ)
REDIS_HOST=localhost
REDIS_PORT=6379
```
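
One way to consume these variables is to read them once into a typed config object. The loader below is a sketch, not the project's actual configuration code: `loadConfig` is a hypothetical name, and the fallback values are assumptions copied from the example values above. In an application it would be called as `loadConfig(process.env)`.

```javascript
// Hypothetical loader: reads the variables listed above from an env object.
// Defaults mirror the example values and are assumptions, not project defaults.
function loadConfig(env) {
  const redisPort = Number.parseInt(env.REDIS_PORT ?? '6379', 10);
  if (Number.isNaN(redisPort)) {
    throw new Error(`REDIS_PORT must be a number, got "${env.REDIS_PORT}"`);
  }
  return {
    meilisearch: {
      host: env.MEILISEARCH_HOST ?? 'http://localhost:7700',
      masterKey: env.MEILISEARCH_MASTER_KEY, // no default: set secrets explicitly
      indexName: env.MEILISEARCH_INDEX_NAME ?? 'navidocs-pages'
    },
    databasePath: env.DATABASE_PATH ?? '/data/navidocs.db',
    redis: { host: env.REDIS_HOST ?? 'localhost', port: redisPort }
  };
}

const config = loadConfig({ REDIS_PORT: '6380' });
console.log(config.redis); // { host: 'localhost', port: 6380 }
```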

---

## Development

### Adding New Search Filters

Edit `search.js` and add to `buildSearchDocument()`:

```javascript
// Add custom metadata field
if (metadata.customField) {
  searchDoc.customField = metadata.customField;
}
```

Update the Meilisearch config in `docs/architecture/meilisearch-config.json` by adding the new field to `filterableAttributes`:

```json
{
  "settings": {
    "filterableAttributes": [
      "customField"
    ]
  }
}
```

### Supporting New Languages

```javascript
// Install Tesseract language data first (shell commands, Debian/Ubuntu):
//   sudo apt-get install tesseract-ocr-fra   # French
//   sudo apt-get install tesseract-ocr-spa   # Spanish

// Use in OCR
const results = await extractTextFromPDF(pdfPath, {
  language: 'fra' // or 'spa', 'deu', etc.
});
```

---

## See Also

- **Worker Documentation**: `../workers/README.md`
- **Meilisearch Config**: `../../docs/architecture/meilisearch-config.json`
- **Database Schema**: `../../docs/architecture/database-schema.sql`