navidocs/docs/features/IMAGE_EXTRACTION_DESIGN.md
ggq-admin 4b91896838 feat: Add image extraction design, database schema, and migration
- Comprehensive image extraction architecture design
- Database schema for document_images table
- Migration 004: Add document_images table with indexes
- Migration runner script
- Design and status documentation

Prepares foundation for image extraction feature with OCR on images.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-19 19:47:30 +02:00


# Image Extraction & OCR Design
**Purpose:** Extract images from PDFs, run OCR on them, and anchor each image to its surrounding text

**Last Updated:** 2025-10-19

---
## Requirements
1. **Extract all images from PDF documents**
2. **Run OCR on extracted images** (images contain text)
3. **Anchor images to nearby document text**
4. **Store image positions and relationships**
5. **Display images in document viewer with text**
---
## Architecture
### 1. Image Extraction Pipeline
```
PDF Upload
    ↓
OCR Worker Processes PDF
    ├─ Extract Page Text (existing)
    ├─ Extract Page Images (NEW)
    │   ↓
    │   ├─ Save images to: /uploads/{docId}/images/
    │   ├─ Run Tesseract OCR on each image
    │   └─ Store image metadata + text
    └─ Build Image-Text Relationships
    ↓
Store in Database + Index in Meilisearch
```
### 2. Database Schema
```sql
-- New table for extracted images
CREATE TABLE document_images (
  id TEXT PRIMARY KEY,
  documentId TEXT NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
  pageNumber INTEGER NOT NULL,
  imageIndex INTEGER NOT NULL,  -- 0, 1, 2 for multiple images per page
  imagePath TEXT NOT NULL,      -- /uploads/{docId}/images/page-{N}-img-{M}.png
  imageFormat TEXT DEFAULT 'png',
  width INTEGER,
  height INTEGER,
  position JSON,                -- {x, y, width, height} on page
  extractedText TEXT,           -- OCR text from the image
  textConfidence REAL,          -- Average OCR confidence
  anchorTextBefore TEXT,        -- Text snippet before image
  anchorTextAfter TEXT,         -- Text snippet after image
  createdAt INTEGER NOT NULL,
  UNIQUE(documentId, pageNumber, imageIndex)
);
CREATE INDEX idx_document_images_doc ON document_images(documentId);
CREATE INDEX idx_document_images_page ON document_images(documentId, pageNumber);
```
### 3. File Storage Structure
```
/uploads/
  {documentId}/
    document.pdf          -- Original PDF
    images/
      page-1-img-0.png    -- First image on page 1
      page-1-img-1.png    -- Second image on page 1
      page-2-img-0.png    -- First image on page 2
      ...
```
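
The naming scheme above can be captured in a small path helper. This is a minimal sketch: `buildImagePath` and its parameter names are illustrative and not part of the existing codebase, and page numbers are assumed to be 1-based as in the listing.

```javascript
// Hypothetical helper: builds the storage path for an extracted image,
// following the /uploads/{docId}/images/page-{N}-img-{M}.png convention.
function buildImagePath(documentId, pageNumber, imageIndex, format = 'png') {
  if (!Number.isInteger(pageNumber) || pageNumber < 1) {
    throw new RangeError('pageNumber must be an integer >= 1');
  }
  if (!Number.isInteger(imageIndex) || imageIndex < 0) {
    throw new RangeError('imageIndex must be an integer >= 0');
  }
  return `/uploads/${documentId}/images/page-${pageNumber}-img-${imageIndex}.${format}`;
}

console.log(buildImagePath('doc-123', 1, 0));
// prints: /uploads/doc-123/images/page-1-img-0.png
```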
---
## Implementation Plan
### Phase 1: Backend Image Extraction
**File:** `server/workers/image-extractor.js`
```javascript
import { fromPath } from 'pdf2pic';
import Jimp from 'jimp';
import Tesseract from 'tesseract.js';

async function extractImagesFromPDF(pdfPath, documentId) {
  // 1. Convert PDF pages to images
  // 2. For each page, detect image regions
  // 3. Crop out images
  // 4. Run OCR on each image
  // 5. Save images + metadata
  // 6. Return array of image objects
}
```
**Dependencies needed:**
- `pdf2pic` - Convert PDF pages to images
- `jimp` - Image manipulation (cropping, enhancement)
- `tesseract.js` - OCR on the extracted images
- `pdfjs-dist` - More precise PDF parsing (optional)
### Phase 2: OCR Worker Integration
**File:** `server/workers/ocr-worker.js`
Add after page text extraction:
```javascript
// Extract images from this page
const pageImages = await extractImagesFromPage(pdfPath, pageNum, documentId);

// For each image
for (const img of pageImages) {
  // Run Tesseract OCR
  const ocrResult = await Tesseract.recognize(img.path);
  // Tesseract.js reports confidence on a 0-100 scale; store it as 0-1
  // to match the API response example
  const confidence = ocrResult.data.confidence / 100;

  // Store in database. createdAt is NOT NULL in the schema, so it must be
  // supplied; epoch milliseconds are assumed here.
  await db.run(`
    INSERT INTO document_images (
      id, documentId, pageNumber, imageIndex,
      imagePath, extractedText, textConfidence,
      position, width, height, createdAt
    ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
  `, [
    img.id, documentId, pageNum, img.index,
    img.path, ocrResult.data.text, confidence,
    JSON.stringify(img.position), img.width, img.height, Date.now()
  ]);

  // Index image text in Meilisearch
  await meilisearch.addDocuments([{
    id: img.id,
    type: 'image',
    documentId,
    pageNumber: pageNum,
    content: ocrResult.data.text,
    // ...
  }]);
}
```
### Phase 3: Image-Text Anchoring
**Algorithm:**
```javascript
function findAnchorText(pageText, imagePosition, pageNumber) {
  // Parse page text with positions
  const textBlocks = parseTextWithPositions(pageText);

  // Find text blocks within 100 units above the image
  const beforeImage = textBlocks.filter(b =>
    b.position.y < imagePosition.y &&
    b.position.y > imagePosition.y - 100
  );

  // Find text blocks within 100 units below the image
  const afterImage = textBlocks.filter(b =>
    b.position.y > imagePosition.y + imagePosition.height &&
    b.position.y < imagePosition.y + imagePosition.height + 100
  );

  return {
    before: beforeImage.map(b => b.text).join(' '),
    after: afterImage.map(b => b.text).join(' ')
  };
}
```
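
To make the filtering concrete, here is a worked example of the same idea that takes already-positioned text blocks as input. It is a sketch under assumptions: the block shape `{ text, position: { y } }`, a top-left origin with `y` growing downward, and the name `findAnchorTextFromBlocks` are all illustrative.

```javascript
// Sketch: same distance filter as findAnchorText, applied to text blocks
// that already carry a vertical position (assumed shape: { text, position: { y } }).
function findAnchorTextFromBlocks(textBlocks, imagePosition, maxDistance = 100) {
  const top = imagePosition.y;
  const bottom = imagePosition.y + imagePosition.height;
  const byY = (a, b) => a.position.y - b.position.y;

  // filter() returns new arrays, so sorting does not mutate the input
  const before = textBlocks
    .filter(b => b.position.y < top && b.position.y > top - maxDistance)
    .sort(byY);
  const after = textBlocks
    .filter(b => b.position.y > bottom && b.position.y < bottom + maxDistance)
    .sort(byY);

  return {
    before: before.map(b => b.text).join(' '),
    after: after.map(b => b.text).join(' ')
  };
}

const blocks = [
  { text: 'Introduction', position: { y: 20 } },
  { text: 'The following diagram illustrates', position: { y: 150 } },
  { text: 'As shown in the figure above', position: { y: 540 } },
  { text: 'Conclusion', position: { y: 900 } }
];
const anchors = findAnchorTextFromBlocks(blocks, { x: 100, y: 200, width: 400, height: 300 });
console.log(anchors.before); // The following diagram illustrates
console.log(anchors.after);  // As shown in the figure above
```

Blocks further than `maxDistance` from the image edge ("Introduction" and "Conclusion" here) are ignored, which keeps the stored anchor snippets short.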
### Phase 4: API Endpoints
**New routes:**
```javascript
// Get all images for a document
GET /api/documents/:id/images
Response: {
  images: [
    {
      id, documentId, pageNumber, imageIndex, imagePath,
      extractedText, textConfidence, position,
      anchorTextBefore, anchorTextAfter
    }
  ]
}
// Get specific image file
GET /api/images/:imageId
Response: PNG file (stream)
// Get images for a specific page
GET /api/documents/:id/pages/:pageNum/images
Response: Array of images on that page
```
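
Because `position` is stored as a JSON string in `document_images`, the list endpoints need to parse it before responding. The mapper below is a minimal sketch of that step; `toImageResponse` is a hypothetical name, and the field set follows the JSON example in the API Examples section.

```javascript
// Hypothetical mapper from a document_images row to the API response shape.
function toImageResponse(row) {
  let position = null;
  try {
    position = row.position ? JSON.parse(row.position) : null;
  } catch {
    position = null; // malformed JSON: return null rather than failing the request
  }
  return {
    id: row.id,
    documentId: row.documentId,
    pageNumber: row.pageNumber,
    imageIndex: row.imageIndex,
    imagePath: row.imagePath,
    extractedText: row.extractedText ?? '',
    textConfidence: row.textConfidence ?? null,
    position,
    anchorTextBefore: row.anchorTextBefore ?? null,
    anchorTextAfter: row.anchorTextAfter ?? null
  };
}

const exampleRow = {
  id: 'img-uuid-1',
  documentId: 'doc-123',
  pageNumber: 1,
  imageIndex: 0,
  imagePath: '/uploads/doc-123/images/page-1-img-0.png',
  extractedText: null,
  position: '{"x":100,"y":200,"width":400,"height":300}'
};
console.log(toImageResponse(exampleRow).position.width); // 400
```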
### Phase 5: Frontend Integration
**Document Viewer Updates:**
1. Fetch images for current page
2. Display images at correct positions
3. Show extracted text on hover
4. Link to anchor text
```vue
<template>
  <div class="pdf-page">
    <canvas ref="pdfCanvas"></canvas>
    <!-- Overlay images on canvas -->
    <div v-for="img in pageImages" :key="img.id"
         class="page-image-overlay"
         :style="{
           left: img.position.x + 'px',
           top: img.position.y + 'px',
           width: img.position.width + 'px',
           height: img.position.height + 'px'
         }">
      <img :src="`/api/images/${img.id}`"
           :alt="img.extractedText"
           @click="showImageDetail(img)" />
    </div>
  </div>
</template>
```
---
## Technical Challenges
### Challenge 1: Image Detection in PDF
**Problem:** PDFs can embed images in various ways
**Solutions:**
- Use `pdfjs-dist` to parse PDF structure and find image objects
- Alternative: Rasterize pages with `pdf2pic`, then apply image detection algorithms
- Fallback: Let the user select image regions manually
### Challenge 2: Image Quality for OCR
**Problem:** Extracted images may be low quality
**Solutions:**
- Use high DPI when converting PDF to images (300+ DPI)
- Apply image enhancement before OCR (contrast, sharpening)
- Use Tesseract preprocessing options
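
As a rough guide to what "300+ DPI" means in pixels: PDF page sizes are expressed in points (1/72 inch), so the rendered size scales by `dpi / 72`. The helper below is a sketch; `pageSizeInPixels` is an illustrative name.

```javascript
// Convert a PDF page size in points (1/72 inch) to pixel dimensions at a given DPI.
function pageSizeInPixels(widthPt, heightPt, dpi = 300) {
  return {
    width: Math.round((widthPt * dpi) / 72),
    height: Math.round((heightPt * dpi) / 72)
  };
}

// US Letter is 612 x 792 points
console.log(pageSizeInPixels(612, 792)); // { width: 2550, height: 3300 }
```

At 300 DPI a single Letter page is already about 8.4 megapixels, which is worth keeping in mind for memory use when rendering many pages.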
### Challenge 3: Positioning Accuracy
**Problem:** Mapping PDF coordinates to canvas coordinates
**Solutions:**
- Store positions as percentages, not absolute pixels
- Scale positions based on viewport
- Test with various PDF sizes
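
The percentage approach can be sketched as two small functions: one that normalizes a position against the page size at extraction time, and one that scales it to the current viewport at render time. The function names and the 800 x 1000 page size are assumptions for illustration.

```javascript
// Normalize an absolute position to fractions (0..1) of the page size.
function toRelativePosition(pos, pageWidth, pageHeight) {
  return {
    x: pos.x / pageWidth,
    y: pos.y / pageHeight,
    width: pos.width / pageWidth,
    height: pos.height / pageHeight
  };
}

// Scale a relative position to CSS pixel values for a given viewport size.
function toViewportStyle(rel, viewportWidth, viewportHeight) {
  return {
    left: `${Math.round(rel.x * viewportWidth)}px`,
    top: `${Math.round(rel.y * viewportHeight)}px`,
    width: `${Math.round(rel.width * viewportWidth)}px`,
    height: `${Math.round(rel.height * viewportHeight)}px`
  };
}

// An image extracted from an 800 x 1000 page, shown in a 400 x 500 viewport
const rel = toRelativePosition({ x: 100, y: 200, width: 400, height: 300 }, 800, 1000);
const overlayStyle = toViewportStyle(rel, 400, 500);
console.log(overlayStyle);
// { left: '50px', top: '100px', width: '200px', height: '150px' }
```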
---
## Performance Considerations
### Storage
- **Images:** ~500KB per image (PNG)
- **100-page doc with 5 images/page:** ~250MB
- **Solution:** Store as JPEG with quality 85%, or use WebP
### Processing Time
- **Image extraction:** ~1s per page
- **OCR per image:** ~2-3s per image
- **100-page doc with 5 images/page:** ~18-27 minutes if processed sequentially (100 s of extraction plus 500 images × 2-3 s of OCR)
- **Solution:** Process in background, show progress
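
The arithmetic behind the storage and processing estimates can be reproduced with a small calculator. This is a sketch that assumes strictly sequential processing, 1 KB = 1000 bytes, and the per-page and per-image rates listed above; `estimateProcessing` is an illustrative name.

```javascript
// Estimate storage and sequential processing time from the per-unit figures above.
function estimateProcessing({
  pages,
  imagesPerPage,
  bytesPerImage = 500 * 1000, // ~500KB per PNG
  extractSecPerPage = 1,      // ~1s per page
  ocrSecPerImage = [2, 3]     // ~2-3s per image (min, max)
}) {
  const images = pages * imagesPerPage;
  const minSec = pages * extractSecPerPage + images * ocrSecPerImage[0];
  const maxSec = pages * extractSecPerPage + images * ocrSecPerImage[1];
  return {
    images,
    storageMB: (images * bytesPerImage) / 1e6,
    minMinutes: Math.round(minSec / 60),
    maxMinutes: Math.round(maxSec / 60)
  };
}

const estimate = estimateProcessing({ pages: 100, imagesPerPage: 5 });
console.log(estimate);
// { images: 500, storageMB: 250, minMinutes: 18, maxMinutes: 27 }
```

OCR dominates the total, so running recognition for several images in parallel is the most direct way to shorten it.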
### Meilisearch Indexing
- Index image text separately
- Tag with `type: 'image'` for filtering
- Include `documentId`, `pageNumber` for joining
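
A sketch of the search document and a matching filter expression follows. `toSearchDocument` and `imageFilter` are hypothetical helpers. Filtering on `type` and `documentId` also assumes both attributes are listed in the index's `filterableAttributes` setting, and that `documentId` contains no double quotes.

```javascript
// Build the Meilisearch document for one extracted image.
function toSearchDocument(image) {
  return {
    id: image.id,
    type: 'image', // lets queries separate image text from page text
    documentId: image.documentId,
    pageNumber: image.pageNumber,
    content: image.extractedText || ''
  };
}

// Build a filter expression that restricts a search to images of one document.
function imageFilter(documentId) {
  return `type = "image" AND documentId = "${documentId}"`;
}

const searchDoc = toSearchDocument({
  id: 'img-uuid-1',
  documentId: 'doc-123',
  pageNumber: 1,
  extractedText: 'Figure 1: System Architecture Diagram'
});
console.log(searchDoc.type);         // image
console.log(imageFilter('doc-123')); // type = "image" AND documentId = "doc-123"
```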
---
## Testing Strategy
### Test Cases
1. **Single image per page**
   - Extract: ✓
   - OCR: ✓
   - Display: ✓
2. **Multiple images per page**
   - Extract all: ✓
   - Correct order: ✓
   - No duplicates: ✓
3. **Images with text**
   - OCR accuracy > 80%: ✓
   - Text searchable: ✓
4. **Large PDFs (100+ pages)**
   - Processing completes: ✓
   - Progress tracking: ✓
   - No memory leaks: ✓
5. **Edge cases**
   - No images: Handle gracefully
   - Corrupted images: Skip and log
   - Very large images: Resize before OCR
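
For the "very large images" case, the target size can be computed before handing the image to the resizer. This is a sketch; `fitWithin` is an illustrative name and the 4000 px limit is an arbitrary assumption, not a measured threshold.

```javascript
// Compute dimensions that fit within maxSide while preserving aspect ratio.
// Images already within the limit are returned unchanged.
function fitWithin(width, height, maxSide = 4000) {
  const longest = Math.max(width, height);
  if (longest <= maxSide) {
    return { width, height, resized: false };
  }
  const scale = maxSide / longest;
  return {
    width: Math.round(width * scale),
    height: Math.round(height * scale),
    resized: true
  };
}

console.log(fitWithin(8000, 6000)); // { width: 4000, height: 3000, resized: true }
console.log(fitWithin(1200, 800));  // { width: 1200, height: 800, resized: false }
```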
---
## Migration Path
### Step 1: Add Database Table
```bash
sqlite3 data/navidocs.db < migrations/add_document_images.sql
```
### Step 2: Install Dependencies
```bash
npm install pdf2pic jimp tesseract.js
```
### Step 3: Deploy OCR Worker Update
```bash
pm2 restart ocr-worker
```
### Step 4: Process Existing Documents
```bash
node scripts/reprocess-documents-with-images.js
```
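
The contents of the reprocessing script are not specified here. One possible shape is a sequential loop that records failures and keeps going, sketched below. `listDocuments` and `processDocument` are hypothetical async functions supplied by the caller, not existing NaviDocs APIs.

```javascript
// Sketch of a sequential reprocessing loop: one failing document
// is recorded in the summary and does not abort the run.
async function reprocessAll(listDocuments, processDocument) {
  const docs = await listDocuments();
  const summary = { processed: 0, failed: [] };
  for (const doc of docs) {
    try {
      await processDocument(doc);
      summary.processed += 1;
    } catch (err) {
      summary.failed.push({ id: doc.id, error: err.message });
    }
  }
  return summary;
}
```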
---
## Future Enhancements
1. **Image Classification**
   - Diagrams vs photos vs charts
   - Use ML model for categorization
2. **Smart Cropping**
   - Detect diagram boundaries automatically
   - Remove whitespace
3. **Image Search**
   - Search by image content (visual similarity)
   - Search text within images
4. **Annotations**
   - Allow users to annotate images
   - Link annotations to text
---
## API Examples
### Get Document Images
```bash
curl http://localhost:8001/api/documents/{id}/images
```
Response:
```json
{
"images": [
{
"id": "img-uuid-1",
"documentId": "doc-123",
"pageNumber": 1,
"imageIndex": 0,
"imagePath": "/uploads/doc-123/images/page-1-img-0.png",
"extractedText": "Figure 1: System Architecture Diagram showing...",
"textConfidence": 0.89,
"position": { "x": 100, "y": 200, "width": 400, "height": 300 },
"anchorTextBefore": "The following diagram illustrates",
"anchorTextAfter": "As shown in the figure above"
}
]
}
```
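
On the client, the `images` array from this response will typically need grouping by page before rendering. The helper below is a sketch of that step; `groupImagesByPage` is a hypothetical name.

```javascript
// Group images by pageNumber and order each page's images by imageIndex.
function groupImagesByPage(images) {
  const byPage = new Map();
  for (const img of images) {
    if (!byPage.has(img.pageNumber)) {
      byPage.set(img.pageNumber, []);
    }
    byPage.get(img.pageNumber).push(img);
  }
  for (const list of byPage.values()) {
    list.sort((a, b) => a.imageIndex - b.imageIndex);
  }
  return byPage;
}

const grouped = groupImagesByPage([
  { id: 'img-b', pageNumber: 1, imageIndex: 1 },
  { id: 'img-a', pageNumber: 1, imageIndex: 0 },
  { id: 'img-c', pageNumber: 2, imageIndex: 0 }
]);
console.log(grouped.get(1).map(img => img.id)); // [ 'img-a', 'img-b' ]
```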
---
**Status:** Design Complete - Ready for Implementation
**Estimated Dev Time:** 2-3 days
**Priority:** HIGH (Essential feature)