Image Extraction & OCR Design
Purpose: Extract images from PDFs, run OCR on them, and anchor to surrounding text
Last Updated: 2025-10-19
Requirements
- Extract all images from PDF documents
- Run OCR on extracted images (images contain text)
- Anchor images to nearby document text
- Store image positions and relationships
- Display images in document viewer with text
Architecture
1. Image Extraction Pipeline
PDF Upload
↓
OCR Worker Processes PDF
↓
├─ Extract Page Text (existing)
├─ Extract Page Images (NEW)
│ ↓
│ ├─ Save images to: /uploads/{docId}/images/
│ ├─ Run Tesseract OCR on each image
│ └─ Store image metadata + text
└─ Build Image-Text Relationships
↓
Store in Database + Index in Meilisearch
2. Database Schema
-- New table for extracted images
CREATE TABLE document_images (
id TEXT PRIMARY KEY,
documentId TEXT NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
pageNumber INTEGER NOT NULL,
imageIndex INTEGER NOT NULL, -- 0, 1, 2 for multiple images per page
imagePath TEXT NOT NULL, -- /uploads/{docId}/images/page-{N}-img-{M}.png
imageFormat TEXT DEFAULT 'png',
width INTEGER,
height INTEGER,
position JSON, -- {x, y, width, height} on page
extractedText TEXT, -- OCR text from the image
textConfidence REAL, -- Average OCR confidence
anchorTextBefore TEXT, -- Text snippet before image
anchorTextAfter TEXT, -- Text snippet after image
createdAt INTEGER NOT NULL,
UNIQUE(documentId, pageNumber, imageIndex)
);
CREATE INDEX idx_document_images_doc ON document_images(documentId);
CREATE INDEX idx_document_images_page ON document_images(documentId, pageNumber);
3. File Storage Structure
/uploads/
{documentId}/
document.pdf -- Original PDF
images/
page-1-img-0.png -- First image on page 1
page-1-img-1.png -- Second image on page 1
page-2-img-0.png -- First image on page 2
...
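The naming convention above can be captured in one small helper so that the extractor and the API build paths the same way. This is a minimal sketch: `buildImagePath` is a hypothetical name, and it assumes 1-based page numbers and 0-based image indexes, as the example file names suggest.

```javascript
// Hypothetical helper (not part of the existing codebase): builds the
// storage path for an extracted image using the
// /uploads/{docId}/images/page-{N}-img-{M}.png convention.
function buildImagePath(documentId, pageNumber, imageIndex, format = 'png') {
  // Assumption: pages are numbered from 1, as in page-1-img-0.png
  if (!Number.isInteger(pageNumber) || pageNumber < 1) {
    throw new RangeError('pageNumber must be a positive integer');
  }
  // Assumption: images on a page are numbered from 0
  if (!Number.isInteger(imageIndex) || imageIndex < 0) {
    throw new RangeError('imageIndex must be a non-negative integer');
  }
  return `/uploads/${documentId}/images/page-${pageNumber}-img-${imageIndex}.${format}`;
}
```

Validating the indexes here keeps malformed paths out of the `imagePath` column.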
Implementation Plan
Phase 1: Backend Image Extraction
File: server/workers/image-extractor.js
import { fromPath } from 'pdf2pic';
import Jimp from 'jimp';
import Tesseract from 'tesseract.js';
async function extractImagesFromPDF(pdfPath, documentId) {
// 1. Convert PDF pages to images
// 2. For each page, detect image regions
// 3. Crop out images
// 4. Run OCR on each image
// 5. Save images + metadata
// 6. Return array of image objects
}
Dependencies needed:
- pdf2pic - Convert PDF to images
- jimp - Image manipulation
- pdfjs-dist - More precise PDF parsing (optional)
Phase 2: OCR Worker Integration
File: server/workers/ocr-worker.js
Add after page text extraction:
// Extract images from this page
const pageImages = await extractImagesFromPage(pdfPath, pageNum, documentId);
// For each image
for (const img of pageImages) {
// Run Tesseract OCR
const ocrResult = await Tesseract.recognize(img.path);
// Store in database
await db.run(`
INSERT INTO document_images (
id, documentId, pageNumber, imageIndex,
imagePath, extractedText, textConfidence,
position, width, height, createdAt
) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
`, [
img.id, documentId, pageNum, img.index,
// tesseract.js reports confidence on a 0-100 scale; store it as 0-1
img.path, ocrResult.data.text, ocrResult.data.confidence / 100,
// createdAt is NOT NULL in the schema (epoch milliseconds assumed)
JSON.stringify(img.position), img.width, img.height, Date.now()
]);
// Index image text in Meilisearch
await meilisearch.addDocuments([{
id: img.id,
type: 'image',
documentId,
pageNumber: pageNum,
content: ocrResult.data.text,
...
}]);
}
Phase 3: Image-Text Anchoring
Algorithm:
function findAnchorText(pageText, imagePosition, pageNumber) {
// Parse page text with positions
const textBlocks = parseTextWithPositions(pageText);
// Find text blocks near image
const beforeImage = textBlocks.filter(b =>
b.position.y < imagePosition.y &&
b.position.y > imagePosition.y - 100
);
const afterImage = textBlocks.filter(b =>
b.position.y > imagePosition.y + imagePosition.height &&
b.position.y < imagePosition.y + imagePosition.height + 100
);
return {
before: beforeImage.map(b => b.text).join(' '),
after: afterImage.map(b => b.text).join(' ')
};
}
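The anchoring rule can be tried out without a PDF parser by passing in already-positioned text blocks. The sketch below assumes blocks shaped like `{ text, position: { x, y, width, height } }`, a y axis that grows downward, and a 100-unit search window; `findAnchorTextFromBlocks` and the sample data are illustrative, not existing code.

```javascript
// Sketch: same filtering rule as the algorithm above, but operating on
// pre-parsed text blocks. Assumes y grows downward and that blocks and
// imagePosition share the same units.
function findAnchorTextFromBlocks(textBlocks, imagePosition, maxDistance = 100) {
  const imageTop = imagePosition.y;
  const imageBottom = imagePosition.y + imagePosition.height;
  const byY = (a, b) => a.position.y - b.position.y;

  // Blocks that start within maxDistance above the image, in reading order
  const before = textBlocks
    .filter(b => b.position.y < imageTop && b.position.y > imageTop - maxDistance)
    .sort(byY);

  // Blocks that start within maxDistance below the image, in reading order
  const after = textBlocks
    .filter(b => b.position.y > imageBottom && b.position.y < imageBottom + maxDistance)
    .sort(byY);

  return {
    before: before.map(b => b.text).join(' '),
    after: after.map(b => b.text).join(' ')
  };
}

// Illustrative data only
const sampleBlocks = [
  { text: 'Unrelated heading', position: { x: 50, y: 20, width: 300, height: 20 } },
  { text: 'The following diagram illustrates', position: { x: 50, y: 170, width: 400, height: 20 } },
  { text: 'As shown in the figure above', position: { x: 50, y: 520, width: 400, height: 20 } }
];
const sampleAnchors = findAnchorTextFromBlocks(
  sampleBlocks,
  { x: 100, y: 200, width: 400, height: 300 }
);
```

Sorting by `y` keeps the joined snippets in reading order even if the parser returns blocks out of sequence. The heading at `y = 20` falls outside the window and is ignored.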
Phase 4: API Endpoints
New routes:
// Get all images for a document
GET /api/documents/:id/images
Response: [
{
id, pageNumber, imageIndex, imagePath,
extractedText, position,
anchorTextBefore, anchorTextAfter
}
]
// Get specific image file
GET /api/images/:imageId
Response: PNG file (stream)
// Get images for a specific page
GET /api/documents/:id/pages/:pageNum/images
Response: Array of images on that page
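Because `position` is written with `JSON.stringify`, each row has to be decoded before it is returned from these routes. The following is a minimal sketch of that mapping; `toImageResponse` is a hypothetical helper, and it assumes the database driver returns rows keyed by the schema's column names.

```javascript
// Hypothetical mapper from a document_images row to the response shape
// listed above. Assumes `position` is stored as a JSON string.
function toImageResponse(row) {
  return {
    id: row.id,
    pageNumber: row.pageNumber,
    imageIndex: row.imageIndex,
    imagePath: row.imagePath,
    // Images without recognizable text may have a NULL extractedText
    extractedText: row.extractedText ?? '',
    position: row.position ? JSON.parse(row.position) : null,
    anchorTextBefore: row.anchorTextBefore ?? null,
    anchorTextAfter: row.anchorTextAfter ?? null
  };
}
```

A route handler could then return `rows.map(toImageResponse)` for both the per-document and per-page endpoints.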
Phase 5: Frontend Integration
Document Viewer Updates:
- Fetch images for current page
- Display images at correct positions
- Show extracted text on hover
- Link to anchor text
<template>
<div class="pdf-page">
<canvas ref="pdfCanvas"></canvas>
<!-- Overlay images on canvas -->
<div v-for="img in pageImages" :key="img.id"
class="page-image-overlay"
:style="{
left: img.position.x + 'px',
top: img.position.y + 'px',
width: img.position.width + 'px',
height: img.position.height + 'px'
}">
<img :src="`/api/images/${img.id}`"
:alt="img.extractedText"
@click="showImageDetail(img)" />
</div>
</div>
</template>
Technical Challenges
Challenge 1: Image Detection in PDF
Problem: PDFs can embed images in various ways
Solutions:
- Use pdfjs-dist to parse PDF structure and find image objects
- Alternative: Use pdf2image + image detection algorithms
- Fallback: User manual image selection
Challenge 2: Image Quality for OCR
Problem: Extracted images may be low quality
Solutions:
- Use high DPI when converting PDF to images (300+ DPI)
- Apply image enhancement before OCR (contrast, sharpening)
- Use Tesseract preprocessing options
Challenge 3: Positioning Accuracy
Problem: Mapping PDF coordinates to canvas coordinates
Solutions:
- Store positions as percentages, not absolute pixels
- Scale positions based on viewport
- Test with various PDF sizes
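Storing positions as fractions of the page size and scaling them to the viewport could look like the sketch below. It assumes positions are measured from the top-left corner of the page; PDF user space usually has its origin at the bottom-left, so a y flip may be needed first. The names `toRelativePosition` and `toOverlayStyle` are hypothetical.

```javascript
// Convert an absolute position (in page units) to fractions of the page size.
// Assumption: x and y are measured from the top-left corner of the page.
function toRelativePosition(position, pageWidth, pageHeight) {
  return {
    x: position.x / pageWidth,
    y: position.y / pageHeight,
    width: position.width / pageWidth,
    height: position.height / pageHeight
  };
}

// Scale a fractional position to the rendered canvas size and return
// CSS values suitable for an absolutely positioned overlay.
function toOverlayStyle(relative, viewportWidth, viewportHeight) {
  return {
    left: `${Math.round(relative.x * viewportWidth)}px`,
    top: `${Math.round(relative.y * viewportHeight)}px`,
    width: `${Math.round(relative.width * viewportWidth)}px`,
    height: `${Math.round(relative.height * viewportHeight)}px`
  };
}
```

With this split, the stored values do not depend on the zoom level, and only `toOverlayStyle` needs to run again when the viewer resizes.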
Performance Considerations
Storage
- Images: ~500KB per image (PNG)
- 100-page doc with 5 images/page: ~250MB
- Solution: Store as JPEG with quality 85%, or use WebP
Processing Time
- Image extraction: ~1s per page
- OCR per image: ~2-3s per image
- 100-page doc with 5 images/page: ~18-27 minutes when processed sequentially (100 s extraction + 500 images × 2-3 s OCR)
- Solution: Process in background, show progress
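The storage and time figures can be checked with a few lines of arithmetic. This sketch assumes strictly sequential processing and uses the per-page and per-image numbers quoted above; `estimateProcessing` is a hypothetical helper for capacity planning only.

```javascript
// Rough capacity estimate, assuming pages and images are processed one at a time.
function estimateProcessing({ pages, imagesPerPage, bytesPerImage, extractSecPerPage, ocrSecPerImage }) {
  const imageCount = pages * imagesPerPage;
  return {
    imageCount,
    // Decimal megabytes (1 MB = 1,000,000 bytes)
    storageMB: (imageCount * bytesPerImage) / 1_000_000,
    totalMinutes: (pages * extractSecPerPage + imageCount * ocrSecPerImage) / 60
  };
}

const base = { pages: 100, imagesPerPage: 5, bytesPerImage: 500_000, extractSecPerPage: 1 };
const lowEstimate = estimateProcessing({ ...base, ocrSecPerImage: 2 });
const highEstimate = estimateProcessing({ ...base, ocrSecPerImage: 3 });
```

For the 100-page example this gives 500 images, 250 MB of PNG storage, and roughly 18 to 27 minutes of sequential processing, which is why background processing with progress reporting matters.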
Meilisearch Indexing
- Index image text separately
- Tag with type: 'image' for filtering
- Include documentId, pageNumber for joining
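One way to shape the indexed document and the matching filter is sketched below. The helper names are hypothetical, and the filter string assumes that `type`, `documentId`, and `pageNumber` have been added to the index's filterable attributes in Meilisearch.

```javascript
// Hypothetical helper: the search document for one extracted image.
function buildImageSearchDocument(image) {
  return {
    id: image.id,
    type: 'image',
    documentId: image.documentId,
    pageNumber: image.pageNumber,
    content: image.extractedText ?? ''
  };
}

// Hypothetical helper: a filter expression restricting results to image
// hits of one document, optionally narrowed to a single page.
function buildImageFilter(documentId, pageNumber) {
  // Escape backslashes and single quotes so the value stays inside its quotes
  const safeId = String(documentId).replace(/\\/g, '\\\\').replace(/'/g, "\\'");
  const clauses = ["type = 'image'", `documentId = '${safeId}'`];
  if (Number.isInteger(pageNumber)) {
    clauses.push(`pageNumber = ${pageNumber}`);
  }
  return clauses.join(' AND ');
}
```

The resulting string would be passed as the `filter` search parameter, so that image hits can be joined back to their document and page.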
Testing Strategy
Test Cases
- Single image per page
  - Extract: ✓
  - OCR: ✓
  - Display: ✓
- Multiple images per page
  - Extract all: ✓
  - Correct order: ✓
  - No duplicates: ✓
- Images with text
  - OCR accuracy > 80%: ✓
  - Text searchable: ✓
- Large PDFs (100+ pages)
  - Processing completes: ✓
  - Progress tracking: ✓
  - No memory leaks: ✓
- Edge cases
  - No images: Handle gracefully
  - Corrupted images: Skip and log
  - Very large images: Resize before OCR
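For the "very large images" case, the target size can be computed before handing the image to a resize routine. This is a minimal sketch: `fitWithin` is a hypothetical helper, and the 4000-pixel limit is an assumed threshold, not a measured one.

```javascript
// Compute dimensions that fit within maxDimension on the longest side
// while preserving the aspect ratio. Images already small enough are
// returned unchanged with scale 1.
function fitWithin(width, height, maxDimension = 4000) {
  const longest = Math.max(width, height);
  if (longest <= maxDimension) {
    return { width, height, scale: 1 };
  }
  const scale = maxDimension / longest;
  return {
    width: Math.round(width * scale),
    height: Math.round(height * scale),
    scale
  };
}
```

Returning `scale` alongside the new size lets OCR bounding boxes be mapped back to the coordinates of the original image.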
Migration Path
Step 1: Add Database Table
sqlite3 data/navidocs.db < migrations/add_document_images.sql
Step 2: Install Dependencies
npm install pdf2pic jimp tesseract.js
Step 3: Deploy OCR Worker Update
pm2 restart ocr-worker
Step 4: Process Existing Documents
node scripts/reprocess-documents-with-images.js
Future Enhancements
- Image Classification
  - Diagrams vs photos vs charts
  - Use ML model for categorization
- Smart Cropping
  - Detect diagram boundaries automatically
  - Remove whitespace
- Image Search
  - Search by image content (visual similarity)
  - Search text within images
- Annotations
  - Allow users to annotate images
  - Link annotations to text
API Examples
Get Document Images
curl http://localhost:8001/api/documents/{id}/images
Response:
{
"images": [
{
"id": "img-uuid-1",
"documentId": "doc-123",
"pageNumber": 1,
"imageIndex": 0,
"imagePath": "/uploads/doc-123/images/page-1-img-0.png",
"extractedText": "Figure 1: System Architecture Diagram showing...",
"textConfidence": 0.89,
"position": { "x": 100, "y": 200, "width": 400, "height": 300 },
"anchorTextBefore": "The following diagram illustrates",
"anchorTextAfter": "As shown in the figure above"
}
]
}
Status: Design Complete - Ready for Implementation
Estimated Dev Time: 2-3 days
Priority: HIGH (Essential feature)