Image Extraction Feature - Implementation Status
Date: 2025-10-19
Priority: HIGH (Essential Feature)
Status: In Progress
✅ Completed (Foundation)
1. Architecture Design
- Comprehensive design document created
- Database schema designed
- API endpoints planned
- Frontend integration designed
- Testing strategy defined
Document: /docs/features/IMAGE_EXTRACTION_DESIGN.md
2. Database Schema
- document_images table created
- Indexes added for performance
- Foreign key relationships established
- Migration script (004_add_document_images.sql) applied successfully
Table Structure:
document_images (
id, documentId, pageNumber, imageIndex,
imagePath, width, height, position,
extractedText, -- OCR from image
textConfidence,
anchorTextBefore, -- Nearby text
anchorTextAfter
)
🚧 Remaining Work (Estimated: 2-3 days)
Phase 1: Backend Image Extraction (8-10 hours)
Dependencies to install:
npm install pdf-img-convert sharp
# or
npm install pdfjs-dist
Files to create/modify:
- server/workers/image-extractor.js (NEW)
  - Extract images from each PDF page
  - Save to /uploads/{docId}/images/
  - Return image metadata (position, size)
- server/workers/ocr-worker.js (MODIFY)
  - After extracting page text, extract images
  - Run Tesseract OCR on each image
  - Store in document_images table
  - Index image text in Meilisearch
Key Implementation:
// Extract images from page
const pageImages = await extractImagesFromPage(pdfPath, pageNum, docId);

for (const img of pageImages) {
  // Run OCR on image
  const ocrResult = await Tesseract.recognize(img.path);

  // Save to database
  await db.run(`INSERT INTO document_images ...`);

  // Index in Meilisearch
  await meilisearch.addDocuments([{
    id: img.id,
    type: 'image',
    content: ocrResult.data.text,
    ...
  }]);
}
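The extractImagesFromPage helper referenced above does not exist yet. Below is a minimal sketch of what it could look like, assuming the pdf-img-convert route from the dependency list (its page_numbers/scale options) and a simplified first pass that renders the whole page at high resolution; true embedded-image extraction (e.g. via pdfjs-dist operator lists) would keep the same signature and only change the body.

// server/workers/image-extractor.js — a sketch, not the final implementation.
// Assumes pdf-img-convert's convert(path, { page_numbers, scale }) options.
const path = require('path');
const fs = require('fs/promises');
const pdf2img = require('pdf-img-convert');
const sharp = require('sharp');

async function extractImagesFromPage(pdfPath, pageNum, docId) {
  // Render the requested page to a PNG buffer (page_numbers is 1-based).
  const [pagePng] = await pdf2img.convert(pdfPath, {
    page_numbers: [pageNum],
    scale: 2.0, // higher scale = better OCR input, larger files
  });

  const outDir = path.join('uploads', String(docId), 'images');
  await fs.mkdir(outDir, { recursive: true });

  const imagePath = path.join(outDir, `page-${pageNum}-0.png`);
  const buffer = Buffer.from(pagePng);
  await fs.writeFile(imagePath, buffer);

  // Record dimensions so the frontend can position and scale the overlay.
  const { width, height } = await sharp(buffer).metadata();

  return [{
    path: imagePath,
    pageNumber: pageNum,
    imageIndex: 0,
    width,
    height,
    // Full-page render, so the "image" covers the whole page; real embedded-image
    // extraction would fill in the actual bounding box here.
    position: { x: 0, y: 0, widthPct: 100, heightPct: 100 },
  }];
}

module.exports = { extractImagesFromPage };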
Phase 2: Image-Text Anchoring (4-6 hours)
Algorithm:
- Parse page OCR results to get text positions
- For each image, find text within 100px above/below
- Store as anchorTextBefore and anchorTextAfter
- Helps users understand image context (a sketch of the matching step follows below)
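A possible shape for that matching step, assuming OCR words arrive with pixel bounding boxes (tesseract.js exposes word-level bbox data, though the exact field names here should be treated as assumptions) and that each image has a bounding box in the same page coordinate space:

// Given OCR words with bounding boxes and an image's bounding box (same pixel space),
// collect nearby text above and below the image. Field names (bbox.x0, etc.) follow
// tesseract.js word output, but verify against the actual OCR result shape.
const ANCHOR_RANGE_PX = 100;

function findAnchorText(words, imageBox) {
  const before = [];
  const after = [];

  for (const word of words) {
    const { x0, y0, x1, y1 } = word.bbox;

    // Ignore words with no horizontal overlap with the image.
    const overlapsHorizontally = x1 >= imageBox.x0 && x0 <= imageBox.x1;
    if (!overlapsHorizontally) continue;

    if (y1 <= imageBox.y0 && imageBox.y0 - y1 <= ANCHOR_RANGE_PX) {
      before.push(word.text); // ends within 100px above the image
    } else if (y0 >= imageBox.y1 && y0 - imageBox.y1 <= ANCHOR_RANGE_PX) {
      after.push(word.text);  // starts within 100px below the image
    }
  }

  return {
    anchorTextBefore: before.join(' '),
    anchorTextAfter: after.join(' '),
  };
}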
Phase 3: API Endpoints (3-4 hours)
New routes to add:
// server/routes/images.js
// Get all images for a document
GET /api/documents/:id/images
Response: Array of image objects
// Get specific image file
GET /api/images/:imageId
Response: PNG/JPEG file (stream)
// Get images for specific page
GET /api/documents/:id/pages/:pageNum/images
Response: Array of images on that page
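A rough sketch of server/routes/images.js, assuming an Express router (matching the server/routes/ layout) and a db helper with all/get methods like the one used in the worker snippet; adjust to whatever query layer the project actually uses:

// server/routes/images.js — a sketch under stated assumptions, not the final routes.
const path = require('path');
const express = require('express');
const db = require('../db'); // assumed path to the project's existing database helper
const router = express.Router();

// Get all images for a document
router.get('/api/documents/:id/images', async (req, res) => {
  const images = await db.all(
    'SELECT * FROM document_images WHERE documentId = ? ORDER BY pageNumber, imageIndex',
    [req.params.id]
  );
  res.json(images);
});

// Get images for a specific page
router.get('/api/documents/:id/pages/:pageNum/images', async (req, res) => {
  const images = await db.all(
    'SELECT * FROM document_images WHERE documentId = ? AND pageNumber = ?',
    [req.params.id, req.params.pageNum]
  );
  res.json(images);
});

// Stream a specific image file
router.get('/api/images/:imageId', async (req, res) => {
  const image = await db.get(
    'SELECT imagePath FROM document_images WHERE id = ?',
    [req.params.imageId]
  );
  if (!image) return res.status(404).json({ error: 'Image not found' });
  res.sendFile(path.resolve(image.imagePath));
});

module.exports = router;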
Phase 4: Frontend Integration (6-8 hours)
Files to modify:
- client/src/views/DocumentView.vue
  - Fetch images for current page
  - Overlay images on PDF canvas at correct positions
  - Show extracted text on hover
  - Click to view full-size
- client/src/components/ImageOverlay.vue (NEW)
  - Display image at correct position
  - Show OCR text tooltip
  - Full-screen modal on click
Example:
<div class="pdf-page">
  <canvas ref="pdfCanvas"></canvas>

  <!-- Image overlays -->
  <div v-for="img in pageImages"
       :key="img.id"
       class="image-overlay"
       :style="getImagePosition(img)"
       @click="showImageDetail(img)">
    <img :src="`/api/images/${img.id}`" :alt="img.extractedText" />
    <div class="ocr-text-tooltip">{{ img.extractedText }}</div>
  </div>
</div>
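The fetching behind pageImages could live in the useDocumentImages.js composable listed under "Files To Create". A sketch, assuming Vue 3's Composition API and the Phase 3 endpoints; error handling and caching are left out for brevity:

// client/src/composables/useDocumentImages.js
import { ref, watchEffect } from 'vue';

export function useDocumentImages(documentId, pageNumber) {
  const pageImages = ref([]);
  const loading = ref(false);

  watchEffect(async () => {
    // documentId / pageNumber are expected to be refs so changes re-trigger the fetch
    if (!documentId.value || !pageNumber.value) return;
    loading.value = true;
    try {
      const res = await fetch(
        `/api/documents/${documentId.value}/pages/${pageNumber.value}/images`
      );
      pageImages.value = res.ok ? await res.json() : [];
    } finally {
      loading.value = false;
    }
  });

  return { pageImages, loading };
}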
Phase 5: Testing (2-3 hours)
- Test with PDFs containing diagrams
- Test with PDFs containing charts/graphs
- Test with PDFs containing photos with text
- Test multi-image per page
- Verify OCR accuracy on images
- Test search includes image text
- Performance test with 100+ page PDFs
Technical Challenges & Solutions
Challenge 1: Image Extraction Quality
Problem: PDF images may be embedded at low resolution
Solutions:
- Use high DPI conversion (300 DPI minimum)
- Apply image enhancement before OCR
- Use Sharp library for image processing
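For example, a preprocessing pass with Sharp before handing an image to Tesseract might look like the following; the exact chain (upscale, grayscale, normalize, sharpen) is a starting point to tune, not a fixed recipe:

// Enhance an extracted image before OCR. All steps are standard sharp operations;
// whether each one actually helps depends on the source material.
const sharp = require('sharp');

async function prepareForOcr(inputPath, outputPath) {
  await sharp(inputPath)
    .resize({ width: 2000 }) // upscale small embedded images for better OCR
    .grayscale()             // drop color noise
    .normalize()             // stretch contrast
    .sharpen()
    .toFile(outputPath);
  return outputPath;
}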
Challenge 2: Positioning Accuracy
Problem: Mapping PDF coordinates to screen coordinates
Solutions:
- Store positions as percentages
- Scale based on viewport
- Test with various zoom levels
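Concretely, that means storing each image's box as percentages of the page and converting to CSS at render time. The getImagePosition helper used in the Phase 4 template could be as small as this (the position shape matches the one assumed in the extractor sketch above):

// Convert a stored percentage-based position into absolute CSS over the PDF canvas.
// Assumes position = { x, y, widthPct, heightPct } as percentages of the page, and
// that .pdf-page is position: relative so the overlay tracks the canvas at any zoom.
function getImagePosition(img) {
  const { x, y, widthPct, heightPct } = img.position;
  return {
    position: 'absolute',
    left: `${x}%`,
    top: `${y}%`,
    width: `${widthPct}%`,
    height: `${heightPct}%`,
  };
}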
Challenge 3: Performance
Problem: Image extraction + OCR is slow
Estimated Stats:
- Image extraction: ~1s per page
- OCR per image: ~2-3s per image
- 100-page doc with 5 images/page = ~20 minutes
Solutions:
- Process in background (already implemented via BullMQ)
- Show progress updates
- Allow partial results (show pages as they complete)
- Cache OCR results
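Progress reporting can piggyback on the existing BullMQ job; updateProgress is part of BullMQ's job API, and the per-page loop below mirrors the Phase 1 snippet (the job.data field names are assumptions, not the project's actual payload):

// Inside the OCR worker's processor: report progress per page so the UI can show
// partial results while long documents are still running.
async function processDocument(job) {
  const { pdfPath, docId, pageCount } = job.data; // field names are illustrative

  for (let pageNum = 1; pageNum <= pageCount; pageNum++) {
    // ...extract page text, then images, then per-image OCR (Phase 1)...
    await job.updateProgress(Math.round((pageNum / pageCount) * 100));
  }
}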
Dependencies Needed
{
"pdf-img-convert": "^1.0.0", // Convert PDF pages to images
"sharp": "^0.33.0", // Image processing & optimization
"tesseract.js": "^5.0.0" // Already installed for page OCR
}
Or alternative:
{
"pdfjs-dist": "^4.0.0", // Already installed, can extract images directly
"jimp": "^0.22.0" // Alternative to Sharp
}
Implementation Plan (Recommended Order)
Week 1: Backend Implementation
Day 1-2:
- Install dependencies
- Implement image extraction
- Test with sample PDFs
Day 2-3:
- Add OCR on images
- Store in database
- Index in Meilisearch
Week 2: Frontend & Testing
Day 4-5:
- Create API endpoints
- Test endpoints with curl/Postman
Day 5-6:
- Implement frontend image display
- Add image overlays to document viewer
Day 7:
- End-to-end testing
- Performance optimization
- Bug fixes
Files Created So Far
docs/features/IMAGE_EXTRACTION_DESIGN.md
docs/features/IMAGE_EXTRACTION_STATUS.md (this file)
server/migrations/004_add_document_images.sql
server/run-migration.js
Files To Create
server/workers/image-extractor.js
server/routes/images.js
client/src/components/ImageOverlay.vue
client/src/composables/useDocumentImages.js
Files To Modify
server/workers/ocr-worker.js (add image extraction)
server/index.js (add image routes)
client/src/views/DocumentView.vue (display images)
Current System State
Services Running:
- ✅ Backend API (port 8001)
- ✅ Frontend (port 8080)
- ✅ OCR Worker (processing jobs)
- ✅ Meilisearch (port 7700)
- ✅ Redis (port 6379)
Database:
- ✅ document_images table created
- ✅ Indexes added
- ✅ Ready for image data
What Works Now:
- ✅ PDF upload
- ✅ Page text OCR
- ✅ Search text
- ✅ View PDFs
- ❌ Image extraction (not implemented yet)
- ❌ Image OCR (not implemented yet)
- ❌ Image display (not implemented yet)
Next Steps
Option A: Continue Implementation Now
I can continue implementing the remaining phases. This will take approximately 2-3 days to complete fully.
Option B: Implement in Stages
- Start with Phase 1 (backend extraction) - 1 day
- Test with existing PDFs
- Then add frontend display - 1 day
- Finally add OCR on images - 1 day
Option C: Minimal Viable Feature
- Extract images only (no OCR yet) - 4 hours
- Display images in viewer - 4 hours
- Add OCR later as enhancement
Recommendation: Option B (staged approach) allows for testing and feedback at each phase.
Questions to Decide
- Which PDF library?
  - pdf-img-convert (simpler, good for extraction)
  - pdfjs-dist (already installed, more control)
- Image format?
  - PNG (lossless, larger files ~500KB/image)
  - JPEG (smaller ~100KB/image, slight quality loss)
  - WebP (best compression, modern format)
- OCR on all images?
  - Yes: More searchable content, slower processing
  - Selective: Only images that look like they contain text
  - User-triggered: Extract on demand
- Priority vs PDF Page Navigation Bug?
  - Fix page navigation first (30 min)
  - Implement image extraction first (2-3 days)
  - Do both in parallel
Summary
Completed:
- ✅ Architecture designed
- ✅ Database schema created
- ✅ Migration applied
Remaining:
- ⏳ Backend implementation (8-10 hours)
- ⏳ API endpoints (3-4 hours)
- ⏳ Frontend integration (6-8 hours)
- ⏳ Testing (2-3 hours)
Total Estimated Time: 20-25 hours (2.5-3 days)
Ready to proceed with implementation? Let me know which option you prefer and I'll start building!