# Image Extraction & OCR Design

**Purpose:** Extract images from PDFs, run OCR on them, and anchor to surrounding text
**Last Updated:** 2025-10-19

---

## Requirements

1. **Extract all images from PDF documents**
2. **Run OCR on extracted images** (images contain text)
3. **Anchor images to nearby document text**
4. **Store image positions and relationships**
5. **Display images in document viewer with text**

---

## Architecture

### 1. Image Extraction Pipeline

```
PDF Upload
    ↓
OCR Worker Processes PDF
    ↓
    ├─ Extract Page Text (existing)
    ├─ Extract Page Images (NEW)
    │   ↓
    │   ├─ Save images to: /uploads/{docId}/images/
    │   ├─ Run Tesseract OCR on each image
    │   └─ Store image metadata + text
    └─ Build Image-Text Relationships
        ↓
Store in Database + Index in Meilisearch
```

### 2. Database Schema

```sql
-- New table for extracted images
CREATE TABLE document_images (
  id TEXT PRIMARY KEY,
  documentId TEXT NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
  pageNumber INTEGER NOT NULL,
  imageIndex INTEGER NOT NULL,    -- 0, 1, 2 for multiple images per page
  imagePath TEXT NOT NULL,        -- /uploads/{docId}/images/page-{N}-img-{M}.png
  imageFormat TEXT DEFAULT 'png',
  width INTEGER,
  height INTEGER,
  position JSON,                  -- {x, y, width, height} on page
  extractedText TEXT,             -- OCR text from the image
  textConfidence REAL,            -- Average OCR confidence
  anchorTextBefore TEXT,          -- Text snippet before image
  anchorTextAfter TEXT,           -- Text snippet after image
  createdAt INTEGER NOT NULL,
  UNIQUE(documentId, pageNumber, imageIndex)
);

CREATE INDEX idx_document_images_doc ON document_images(documentId);
CREATE INDEX idx_document_images_page ON document_images(documentId, pageNumber);
```

### 3. File Storage Structure

```
/uploads/
  {documentId}/
    document.pdf          -- Original PDF
    images/
      page-1-img-0.png    -- First image on page 1
      page-1-img-1.png    -- Second image on page 1
      page-2-img-0.png    -- First image on page 2
      ...
```

---

## Implementation Plan

### Phase 1: Backend Image Extraction

**File:** `server/workers/image-extractor.js`

```javascript
import { fromPath } from 'pdf2pic';
import Jimp from 'jimp';
import Tesseract from 'tesseract.js';

async function extractImagesFromPDF(pdfPath, documentId) {
  // 1. Convert PDF pages to images
  // 2. For each page, detect image regions
  // 3. Crop out images
  // 4. Run OCR on each image
  // 5. Save images + metadata
  // 6. Return array of image objects
}
```

**Dependencies needed:**
- `pdf2pic` - Convert PDF to images
- `jimp` - Image manipulation
- `tesseract.js` - OCR on extracted images
- `pdfjs-dist` - More precise PDF parsing (optional)

### Phase 2: OCR Worker Integration

**File:** `server/workers/ocr-worker.js`

Add after page text extraction:

```javascript
// Extract images from this page
const pageImages = await extractImagesFromPage(pdfPath, pageNum, documentId);

// For each image
for (const img of pageImages) {
  // Run Tesseract OCR
  const ocrResult = await Tesseract.recognize(img.path);

  // Store in database (createdAt is NOT NULL in the schema, so it must be set here)
  await db.run(`
    INSERT INTO document_images (
      id, documentId, pageNumber, imageIndex,
      imagePath, extractedText, textConfidence,
      position, width, height, createdAt
    ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
  `, [
    img.id,
    documentId,
    pageNum,
    img.index,
    img.path,
    ocrResult.data.text,
    ocrResult.data.confidence / 100, // Tesseract.js reports 0-100; stored as 0-1
    JSON.stringify(img.position),
    img.width,
    img.height,
    Date.now()
  ]);

  // Index image text in Meilisearch
  await meilisearch.addDocuments([{
    id: img.id,
    type: 'image',
    documentId,
    pageNumber: pageNum,
    content: ocrResult.data.text,
    ...
  }]);
}
```

### Phase 3: Image-Text Anchoring

**Algorithm:**

```javascript
function findAnchorText(pageText, imagePosition, pageNumber) {
  // Parse page text with positions
  const textBlocks = parseTextWithPositions(pageText);

  // Find text blocks near image.
  // Assumes y grows downward, so "before" means a smaller y value.
  // Blocks within 100 units above the image's top edge:
  const beforeImage = textBlocks.filter(b =>
    b.position.y < imagePosition.y &&
    b.position.y > imagePosition.y - 100
  );

  // Blocks within 100 units below the image's bottom edge:
  const afterImage = textBlocks.filter(b =>
    b.position.y > imagePosition.y + imagePosition.height &&
    b.position.y < imagePosition.y + imagePosition.height + 100
  );

  return {
    before: beforeImage.map(b => b.text).join(' '),
    after: afterImage.map(b => b.text).join(' ')
  };
}
```

### Phase 4: API Endpoints

**New routes:**

```javascript
// Get all images for a document
GET /api/documents/:id/images
Response: [
  {
    id, pageNumber, imageIndex, imagePath,
    extractedText, position, anchorTextBefore, anchorTextAfter
  }
]

// Get specific image file
GET /api/images/:imageId
Response: PNG file (stream)

// Get images for a specific page
GET /api/documents/:id/pages/:pageNum/images
Response: Array of images on that page
```

### Phase 5: Frontend Integration

**Document Viewer Updates:**
1. Fetch images for current page
2. Display images at correct positions
3. Show extracted text on hover
4. Link to anchor text

```vue
```

---

## Technical Challenges

### Challenge 1: Image Detection in PDF

**Problem:** PDFs can embed images in various ways

**Solutions:**
- Use `pdfjs-dist` to parse PDF structure and find image objects
- Alternative: Use `pdf2pic` + image detection algorithms
- Fallback: Manual image selection by the user

### Challenge 2: Image Quality for OCR

**Problem:** Extracted images may be low quality

**Solutions:**
- Use high DPI when converting PDF to images (300+ DPI)
- Apply image enhancement before OCR (contrast, sharpening)
- Use Tesseract preprocessing options

### Challenge 3: Positioning Accuracy

**Problem:** Mapping PDF coordinates to canvas coordinates

**Solutions:**
- Store positions as percentages, not absolute pixels
- Scale positions based on viewport
- Test with various PDF sizes

---

## Performance Considerations

### Storage

- **Images:** ~500KB per image (PNG)
- **100-page doc with 5 images/page:** ~250MB
- **Solution:** Store as JPEG with quality 85%, or use WebP

### Processing Time

- **Image extraction:** ~1s per page
- **OCR per image:** ~2-3s per image
- **100-page doc with 5 images/page:** ~18-27 minutes (100 pages x 1s extraction + 500 images x 2-3s OCR)
- **Solution:** Process in background, show progress

### Meilisearch Indexing

- Index image text separately
- Tag with `type: 'image'` for filtering
- Include `documentId`, `pageNumber` for joining

---

## Testing Strategy

### Test Cases

1. **Single image per page**
   - Extract: ✓
   - OCR: ✓
   - Display: ✓

2. **Multiple images per page**
   - Extract all: ✓
   - Correct order: ✓
   - No duplicates: ✓

3. **Images with text**
   - OCR accuracy > 80%: ✓
   - Text searchable: ✓

4. **Large PDFs (100+ pages)**
   - Processing completes: ✓
   - Progress tracking: ✓
   - No memory leaks: ✓

5. **Edge cases**
   - No images: Handle gracefully
   - Corrupted images: Skip and log
   - Very large images: Resize before OCR

---

## Migration Path

### Step 1: Add Database Table

```bash
sqlite3 data/navidocs.db < migrations/add_document_images.sql
```

### Step 2: Install Dependencies

```bash
npm install pdf2pic jimp tesseract.js
```

### Step 3: Deploy OCR Worker Update

```bash
pm2 restart ocr-worker
```

### Step 4: Process Existing Documents

```bash
node scripts/reprocess-documents-with-images.js
```

---

## Future Enhancements

1. **Image Classification**
   - Diagrams vs photos vs charts
   - Use ML model for categorization

2. **Smart Cropping**
   - Detect diagram boundaries automatically
   - Remove whitespace

3. **Image Search**
   - Search by image content (visual similarity)
   - Search text within images

4. **Annotations**
   - Allow users to annotate images
   - Link annotations to text

---

## API Examples

### Get Document Images

```bash
curl http://localhost:8001/api/documents/{id}/images
```

Response:

```json
{
  "images": [
    {
      "id": "img-uuid-1",
      "documentId": "doc-123",
      "pageNumber": 1,
      "imageIndex": 0,
      "imagePath": "/uploads/doc-123/images/page-1-img-0.png",
      "extractedText": "Figure 1: System Architecture Diagram showing...",
      "textConfidence": 0.89,
      "position": {
        "x": 100,
        "y": 200,
        "width": 400,
        "height": 300
      },
      "anchorTextBefore": "The following diagram illustrates",
      "anchorTextAfter": "As shown in the figure above"
    }
  ]
}
```

---

**Status:** Design Complete - Ready for Implementation
**Estimated Dev Time:** 2-3 days
**Priority:** HIGH (Essential feature)
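
---

Challenge 3 proposes storing positions as percentages of the page rather than absolute pixels. A minimal sketch of that conversion, applied to the `position` object from the API example; the helper names `toPercentPosition` and `toViewportPosition` and the 800 x 1000 page size are illustrative assumptions, not part of the design:

```javascript
// Hypothetical helpers: neither name appears in the design above.

// Convert an absolute box (in page units) into percentages of the page size.
function toPercentPosition(position, pageWidth, pageHeight) {
  return {
    x: (position.x * 100) / pageWidth,
    y: (position.y * 100) / pageHeight,
    width: (position.width * 100) / pageWidth,
    height: (position.height * 100) / pageHeight
  };
}

// Scale a percentage box to the pixel size of the page as rendered in the viewer.
function toViewportPosition(percent, viewportWidth, viewportHeight) {
  return {
    x: (percent.x * viewportWidth) / 100,
    y: (percent.y * viewportHeight) / 100,
    width: (percent.width * viewportWidth) / 100,
    height: (percent.height * viewportHeight) / 100
  };
}

// Assumed page size of 800 x 1000 units, rendered by the viewer at 1600 x 2000 px.
const stored = toPercentPosition({ x: 100, y: 200, width: 400, height: 300 }, 800, 1000);
const onScreen = toViewportPosition(stored, 1600, 2000);

console.log(stored);   // { x: 12.5, y: 20, width: 50, height: 30 }
console.log(onScreen); // { x: 200, y: 400, width: 800, height: 600 }
```

Because the stored values do not depend on render scale, the same record can be reused at any zoom level; only the viewport dimensions passed to the second helper change.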