# Image Extraction & OCR Design

**Purpose:** Extract images from PDFs, run OCR on them, and anchor them to the surrounding text

**Last Updated:** 2025-10-19

---

## Requirements

1. **Extract all images from PDF documents**
2. **Run OCR on extracted images** (images contain text)
3. **Anchor images to nearby document text**
4. **Store image positions and relationships**
5. **Display images in document viewer with text**

---

## Architecture

### 1. Image Extraction Pipeline

```
PDF Upload
    ↓
OCR Worker Processes PDF
    ↓
    ├─ Extract Page Text (existing)
    ├─ Extract Page Images (NEW)
    │   ↓
    │   ├─ Save images to: /uploads/{docId}/images/
    │   ├─ Run Tesseract OCR on each image
    │   └─ Store image metadata + text
    └─ Build Image-Text Relationships
    ↓
Store in Database + Index in Meilisearch
```

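The flow above can be sketched as a single per-page step with the extraction, OCR, and storage functions passed in. This is a minimal sketch under stated assumptions, not the worker's actual code: `processPage` and every function on `deps` (`extractText`, `extractImages`, `runOcr`, `saveImage`) are hypothetical placeholders.

```javascript
// Hypothetical per-page pipeline step. Every function on `deps` is an
// assumed placeholder for the real extraction, OCR, and storage code.
async function processPage(pageNum, deps) {
  // Existing behaviour: extract the page text
  const pageText = await deps.extractText(pageNum);

  // New behaviour: extract images, OCR each one, store metadata + text
  const images = await deps.extractImages(pageNum);
  const stored = [];
  for (const img of images) {
    const ocr = await deps.runOcr(img);
    const record = { ...img, pageNumber: pageNum, extractedText: ocr.text };
    await deps.saveImage(record);
    stored.push(record);
  }

  // Anchoring and search indexing would build on these two results
  return { pageText, images: stored };
}
```

Passing the dependencies in keeps the ordering of the steps testable without a real PDF, Tesseract, or database.
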
### 2. Database Schema

```sql
-- New table for extracted images
CREATE TABLE document_images (
  id TEXT PRIMARY KEY,
  documentId TEXT NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
  pageNumber INTEGER NOT NULL,
  imageIndex INTEGER NOT NULL,       -- 0, 1, 2 for multiple images per page
  imagePath TEXT NOT NULL,           -- /uploads/{docId}/images/page-{N}-img-{M}.png
  imageFormat TEXT DEFAULT 'png',
  width INTEGER,
  height INTEGER,
  position JSON,                     -- {x, y, width, height} on page
  extractedText TEXT,                -- OCR text from the image
  textConfidence REAL,               -- Average OCR confidence
  anchorTextBefore TEXT,             -- Text snippet before image
  anchorTextAfter TEXT,              -- Text snippet after image
  createdAt INTEGER NOT NULL,
  UNIQUE(documentId, pageNumber, imageIndex)
);

CREATE INDEX idx_document_images_doc ON document_images(documentId);
CREATE INDEX idx_document_images_page ON document_images(documentId, pageNumber);
```

### 3. File Storage Structure

```
/uploads/
  {documentId}/
    document.pdf              -- Original PDF
    images/
      page-1-img-0.png        -- First image on page 1
      page-1-img-1.png        -- Second image on page 1
      page-2-img-0.png        -- First image on page 2
      ...
```

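The naming convention above (1-based page number, 0-based image index) could be captured in one small helper so that the worker and the API build identical paths. This is a sketch only; `buildImagePath` is a hypothetical name, not existing code.

```javascript
// Build the storage path for an extracted image, following the
// page-{N}-img-{M}.{format} convention shown above.
// `buildImagePath` is a hypothetical helper name, not existing code.
function buildImagePath(documentId, pageNumber, imageIndex, format = 'png') {
  if (!Number.isInteger(pageNumber) || pageNumber < 1) {
    throw new RangeError('pageNumber must be a positive integer');
  }
  if (!Number.isInteger(imageIndex) || imageIndex < 0) {
    throw new RangeError('imageIndex must be a non-negative integer');
  }
  return `/uploads/${documentId}/images/page-${pageNumber}-img-${imageIndex}.${format}`;
}

console.log(buildImagePath('doc-123', 1, 0));
// /uploads/doc-123/images/page-1-img-0.png
```
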
---

## Implementation Plan

### Phase 1: Backend Image Extraction

**File:** `server/workers/image-extractor.js`

```javascript
import { fromPath } from 'pdf2pic';
import Jimp from 'jimp';
import Tesseract from 'tesseract.js';

async function extractImagesFromPDF(pdfPath, documentId) {
  // 1. Convert PDF pages to images
  // 2. For each page, detect image regions
  // 3. Crop out images
  // 4. Run OCR on each image
  // 5. Save images + metadata
  // 6. Return array of image objects
}
```

**Dependencies needed:**
- `pdf2pic` - Convert PDF to images
- `jimp` - Image manipulation
- `tesseract.js` - OCR on extracted images
- `pdfjs-dist` - More precise PDF parsing (optional)

### Phase 2: OCR Worker Integration

**File:** `server/workers/ocr-worker.js`

Add after page text extraction:

```javascript
// Extract images from this page
const pageImages = await extractImagesFromPage(pdfPath, pageNum, documentId);

// For each image
for (const img of pageImages) {
  // Run Tesseract OCR
  const ocrResult = await Tesseract.recognize(img.path);

  // tesseract.js reports confidence on a 0-100 scale; store it as 0-1
  // so it matches the value shown in the API response example
  const textConfidence = ocrResult.data.confidence / 100;

  // Store in database. createdAt is NOT NULL in the schema, so it must be
  // set here (epoch milliseconds is an assumption about the convention).
  await db.run(`
    INSERT INTO document_images (
      id, documentId, pageNumber, imageIndex,
      imagePath, extractedText, textConfidence,
      position, width, height, createdAt
    ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
  `, [
    img.id, documentId, pageNum, img.index,
    img.path, ocrResult.data.text, textConfidence,
    JSON.stringify(img.position), img.width, img.height, Date.now()
  ]);

  // Index image text in Meilisearch
  await meilisearch.addDocuments([{
    id: img.id,
    type: 'image',
    documentId,
    pageNumber: pageNum,
    content: ocrResult.data.text
    // ... (remaining fields elided)
  }]);
}
```

### Phase 3: Image-Text Anchoring

**Algorithm:**
```javascript
function findAnchorText(pageText, imagePosition, pageNumber) {
  // Parse page text with positions
  // (parseTextWithPositions still has to be implemented; each block is
  // expected to carry { text, position: { y } })
  const textBlocks = parseTextWithPositions(pageText);

  // Find text blocks up to 100 units above the image's top edge
  // (assumes y increases down the page)
  const beforeImage = textBlocks.filter(b =>
    b.position.y < imagePosition.y &&
    b.position.y > imagePosition.y - 100
  );

  // Find text blocks up to 100 units below the image's bottom edge
  const afterImage = textBlocks.filter(b =>
    b.position.y > imagePosition.y + imagePosition.height &&
    b.position.y < imagePosition.y + imagePosition.height + 100
  );

  return {
    before: beforeImage.map(b => b.text).join(' '),
    after: afterImage.map(b => b.text).join(' ')
  };
}
```

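To make the windowing concrete, here is a self-contained variant that takes the text blocks directly instead of parsing them. It is a sketch under assumptions: the block shape `{ text, position: { y } }`, the name `findAnchorBlocks`, and the sample data are all illustrative, and it additionally sorts blocks top to bottom, which the algorithm above does not do.

```javascript
// Standalone variant of the anchoring step that takes text blocks directly.
// Assumes each block is { text, position: { y } } with y measured downward
// from the top of the page, in the same units as the image position.
function findAnchorBlocks(textBlocks, imagePosition, maxDistance = 100) {
  const top = imagePosition.y;
  const bottom = imagePosition.y + imagePosition.height;
  const byY = (a, b) => a.position.y - b.position.y;

  // filter() returns new arrays, so sorting does not mutate the input
  const before = textBlocks
    .filter(b => b.position.y < top && b.position.y > top - maxDistance)
    .sort(byY);
  const after = textBlocks
    .filter(b => b.position.y > bottom && b.position.y < bottom + maxDistance)
    .sort(byY);

  return {
    before: before.map(b => b.text).join(' '),
    after: after.map(b => b.text).join(' ')
  };
}

const blocks = [
  { text: 'Far above.', position: { y: 20 } },
  { text: 'The following diagram illustrates', position: { y: 150 } },
  { text: 'As shown in the figure above', position: { y: 540 } },
  { text: 'Far below.', position: { y: 900 } }
];
const anchors = findAnchorBlocks(blocks, { x: 100, y: 200, width: 400, height: 300 });
console.log(anchors.before); // The following diagram illustrates
console.log(anchors.after);  // As shown in the figure above
```

Only blocks whose `y` falls inside the 100-unit window on either side of the image are kept, so the two distant blocks are ignored.
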
### Phase 4: API Endpoints

**New routes:**

```javascript
// Get all images for a document
GET /api/documents/:id/images
Response: [
  {
    id, pageNumber, imageIndex, imagePath,
    extractedText, position,
    anchorTextBefore, anchorTextAfter
  }
]

// Get specific image file
GET /api/images/:imageId
Response: PNG file (stream)

// Get images for a specific page
GET /api/documents/:id/pages/:pageNum/images
Response: Array of images on that page
```

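Because `position` is written with `JSON.stringify` in Phase 2, a route handler would need to parse it before returning a row. The following is a minimal sketch of that mapping; `rowToApiImage` is a hypothetical helper, and the fallback values for missing fields are assumptions rather than decided behaviour.

```javascript
// Convert a document_images row into the object returned by the API.
// `rowToApiImage` is a hypothetical helper; it assumes `position` is stored
// as a JSON string, as in the Phase 2 INSERT.
function rowToApiImage(row) {
  let position = null;
  try {
    position = row.position ? JSON.parse(row.position) : null;
  } catch {
    position = null; // malformed JSON: return the image without a position
  }
  return {
    id: row.id,
    pageNumber: row.pageNumber,
    imageIndex: row.imageIndex,
    imagePath: row.imagePath,
    extractedText: row.extractedText ?? '',
    position,
    anchorTextBefore: row.anchorTextBefore ?? null,
    anchorTextAfter: row.anchorTextAfter ?? null
  };
}
```

Swallowing a parse error keeps one bad row from failing the whole listing; whether that is preferable to surfacing the error is a design choice left open here.
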
### Phase 5: Frontend Integration

**Document Viewer Updates:**

1. Fetch images for current page
2. Display images at correct positions
3. Show extracted text on hover
4. Link to anchor text

```vue
<template>
  <div class="pdf-page">
    <canvas ref="pdfCanvas"></canvas>

    <!-- Overlay images on canvas -->
    <div v-for="img in pageImages" :key="img.id"
         class="page-image-overlay"
         :style="{
           left: img.position.x + 'px',
           top: img.position.y + 'px',
           width: img.position.width + 'px',
           height: img.position.height + 'px'
         }">
      <img :src="`/api/images/${img.id}`"
           :alt="img.extractedText"
           @click="showImageDetail(img)" />
    </div>
  </div>
</template>
```

---

## Technical Challenges

### Challenge 1: Image Detection in PDF

**Problem:** PDFs can embed images in various ways
**Solutions:**
- Use `pdfjs-dist` to parse PDF structure and find image objects
- Alternative: Use `pdf2pic` + image detection algorithms
- Fallback: Let users select image regions manually

### Challenge 2: Image Quality for OCR

**Problem:** Extracted images may be low quality
**Solutions:**
- Use high DPI when converting PDF to images (300+ DPI)
- Apply image enhancement before OCR (contrast, sharpening)
- Use Tesseract preprocessing options

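The DPI setting determines how image regions found in PDF coordinates map onto the rendered page bitmap. PDF user space defaults to 72 units per inch, so the scale factor is `dpi / 72`. The helper below is a sketch: `pointsToPixels` is a hypothetical name, and it ignores page rotation and the PDF convention of a bottom-left origin.

```javascript
// PDF user space uses 72 units (points) per inch by default, so rendering
// at a given DPI scales every coordinate by dpi / 72.
function pointsToPixels(rect, dpi = 300) {
  const scale = dpi / 72;
  return {
    x: Math.round(rect.x * scale),
    y: Math.round(rect.y * scale),
    width: Math.round(rect.width * scale),
    height: Math.round(rect.height * scale)
  };
}

// A US Letter page (612 x 792 points) rendered at 300 DPI
console.log(pointsToPixels({ x: 0, y: 0, width: 612, height: 792 }, 300));
// { x: 0, y: 0, width: 2550, height: 3300 }
```

At 300 DPI a full Letter page is about 8.4 megapixels, which is worth keeping in mind for the memory and storage figures later in this document.
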
### Challenge 3: Positioning Accuracy

**Problem:** Mapping PDF coordinates to canvas coordinates
**Solutions:**
- Store positions as percentages, not absolute pixels
- Scale positions based on viewport
- Test with various PDF sizes

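Storing positions relative to the page size and scaling them at render time could look like the sketch below. Both function names are hypothetical, positions are stored as fractions (0-1) rather than literal percentages, and the page and viewport sizes in the example are made up for illustration.

```javascript
// Convert an absolute position into fractions of the page size (0-1).
function toRelativePosition(position, pageWidth, pageHeight) {
  return {
    x: position.x / pageWidth,
    y: position.y / pageHeight,
    width: position.width / pageWidth,
    height: position.height / pageHeight
  };
}

// Scale a relative position to the size the viewer renders the page at,
// rounded to whole pixels and formatted for a CSS style binding.
function toViewportStyle(relative, viewportWidth, viewportHeight) {
  return {
    left: `${Math.round(relative.x * viewportWidth)}px`,
    top: `${Math.round(relative.y * viewportHeight)}px`,
    width: `${Math.round(relative.width * viewportWidth)}px`,
    height: `${Math.round(relative.height * viewportHeight)}px`
  };
}

// An image on an 800 x 1000 page, shown in a 400 x 500 viewport
const relative = toRelativePosition({ x: 100, y: 200, width: 400, height: 300 }, 800, 1000);
console.log(toViewportStyle(relative, 400, 500));
// { left: '50px', top: '100px', width: '200px', height: '150px' }
```
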
---

## Performance Considerations

### Storage
- **Images:** ~500KB per image (PNG)
- **100-page doc with 5 images/page:** ~250MB
- **Solution:** Store as JPEG with quality 85%, or use WebP

### Processing Time
- **Image extraction:** ~1s per page
- **OCR per image:** ~2-3s per image
- **100-page doc with 5 images/page:** ~18-27 minutes if processed sequentially (100 × 1s extraction + 500 × 2-3s OCR)
- **Solution:** Process in background, show progress

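The per-page and per-image figures above can be combined into a quick estimate for other document sizes. This is a back-of-the-envelope sketch: `estimateCost` is a hypothetical helper, the defaults simply restate the rough figures listed above, and it assumes strictly sequential processing.

```javascript
// Back-of-the-envelope estimate from per-page and per-image figures.
// Assumes strictly sequential processing; defaults are rough estimates.
function estimateCost({ pages, imagesPerPage, kbPerImage = 500,
                        extractSecPerPage = 1, ocrSecPerImage = [2, 3] }) {
  const imageCount = pages * imagesPerPage;
  const [ocrMin, ocrMax] = ocrSecPerImage;
  const extractSeconds = pages * extractSecPerPage;
  return {
    imageCount,
    storageMB: (imageCount * kbPerImage) / 1000,
    minMinutes: (extractSeconds + imageCount * ocrMin) / 60,
    maxMinutes: (extractSeconds + imageCount * ocrMax) / 60
  };
}

const estimate = estimateCost({ pages: 100, imagesPerPage: 5 });
console.log(estimate.imageCount, estimate.storageMB); // 500 250
console.log(Math.round(estimate.minMinutes), Math.round(estimate.maxMinutes)); // 18 27
```

Running OCR on several images in parallel would shorten the wall-clock time, at the cost of higher memory use per worker.
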
### Meilisearch Indexing
- Index image text separately
- Tag with `type: 'image'` for filtering
- Include `documentId`, `pageNumber` for joining

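A sketch of what the indexed document and a matching filter expression might look like. Both helper names are hypothetical. The filter string uses Meilisearch's filter expression syntax and only works if `type` and `documentId` are configured as filterable attributes on the index, which this design does not yet specify.

```javascript
// Build the search document for one extracted image. Field names follow
// the bullets above; `toSearchDocument` itself is a hypothetical helper.
function toSearchDocument(image) {
  return {
    id: image.id,
    type: 'image',
    documentId: image.documentId,
    pageNumber: image.pageNumber,
    content: image.extractedText || ''
  };
}

// Build a filter expression restricting results to the images of one
// document. Backslashes and double quotes in the id are escaped.
function imageFilter(documentId) {
  const escaped = String(documentId).replace(/\\/g, '\\\\').replace(/"/g, '\\"');
  return `type = "image" AND documentId = "${escaped}"`;
}

console.log(imageFilter('doc-123'));
// type = "image" AND documentId = "doc-123"
```
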
---

## Testing Strategy

### Test Cases

1. **Single image per page**
   - Extract: ✓
   - OCR: ✓
   - Display: ✓

2. **Multiple images per page**
   - Extract all: ✓
   - Correct order: ✓
   - No duplicates: ✓

3. **Images with text**
   - OCR accuracy > 80%: ✓
   - Text searchable: ✓

4. **Large PDFs (100+ pages)**
   - Processing completes: ✓
   - Progress tracking: ✓
   - No memory leaks: ✓

5. **Edge cases**
   - No images: Handle gracefully
   - Corrupted images: Skip and log
   - Very large images: Resize before OCR

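For the "very large images" case, the target size can be computed before handing the image to whatever library does the resizing. The sketch below only does the arithmetic; `fitWithin` is a hypothetical helper and the 2000-pixel default limit is an arbitrary assumption, not a tuned value.

```javascript
// Compute dimensions that fit inside maxDimension on the longest side while
// keeping the aspect ratio. Images already small enough are left unchanged.
// The default limit of 2000 px is an illustrative assumption.
function fitWithin(width, height, maxDimension = 2000) {
  const longest = Math.max(width, height);
  if (longest <= maxDimension) return { width, height };
  const scale = maxDimension / longest;
  return {
    width: Math.max(1, Math.round(width * scale)),
    height: Math.max(1, Math.round(height * scale))
  };
}

console.log(fitWithin(4000, 2000)); // { width: 2000, height: 1000 }
console.log(fitWithin(800, 600));   // { width: 800, height: 600 }
```
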
---

## Migration Path

### Step 1: Add Database Table
```bash
sqlite3 data/navidocs.db < migrations/add_document_images.sql
```

### Step 2: Install Dependencies
```bash
npm install pdf2pic jimp tesseract.js
```

### Step 3: Deploy OCR Worker Update
```bash
pm2 restart ocr-worker
```

### Step 4: Process Existing Documents
```bash
node scripts/reprocess-documents-with-images.js
```

---

## Future Enhancements

1. **Image Classification**
   - Diagrams vs photos vs charts
   - Use ML model for categorization

2. **Smart Cropping**
   - Detect diagram boundaries automatically
   - Remove whitespace

3. **Image Search**
   - Search by image content (visual similarity)
   - Search text within images

4. **Annotations**
   - Allow users to annotate images
   - Link annotations to text

---

## API Examples

### Get Document Images
```bash
# Replace doc-123 with the document ID (literal braces would be
# interpreted by curl as URL globbing)
curl http://localhost:8001/api/documents/doc-123/images
```

Response:
```json
{
  "images": [
    {
      "id": "img-uuid-1",
      "documentId": "doc-123",
      "pageNumber": 1,
      "imageIndex": 0,
      "imagePath": "/uploads/doc-123/images/page-1-img-0.png",
      "extractedText": "Figure 1: System Architecture Diagram showing...",
      "textConfidence": 0.89,
      "position": { "x": 100, "y": 200, "width": 400, "height": 300 },
      "anchorTextBefore": "The following diagram illustrates",
      "anchorTextAfter": "As shown in the figure above"
    }
  ]
}
```

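On the client side, a response shaped like the one above could be grouped by page so the viewer can look up the images for the page it is rendering. This is a sketch only; `groupImagesByPage` is a hypothetical helper and it assumes the `images` wrapper key shown in the example response.

```javascript
// Group the `images` array from the response by page number so the viewer
// can look up the images for the page it is rendering.
function groupImagesByPage(response) {
  const byPage = new Map();
  for (const image of response.images ?? []) {
    const list = byPage.get(image.pageNumber) ?? [];
    list.push(image);
    byPage.set(image.pageNumber, list);
  }
  // Keep the images on each page in extraction order
  for (const list of byPage.values()) {
    list.sort((a, b) => a.imageIndex - b.imageIndex);
  }
  return byPage;
}
```
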
---

**Status:** Design Complete - Ready for Implementation
**Estimated Dev Time:** 2-3 days
**Priority:** HIGH (Essential feature)