navidocs/server
ggq-admin 09d9f1b601 Implement PDF image extraction with OCR in OCR worker
This commit adds comprehensive image extraction and OCR functionality to the OCR worker:

Features:
- Created image-extractor.js worker module with extractImagesFromPage() function
- Uses pdftoppm (with ImageMagick fallback) to convert PDF pages to high-res images
- Saves images to /uploads/{documentId}/images/page-{N}-img-{M}.png
- Returns image metadata: id, path, position, width, height
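
A minimal sketch of the path layout and the pdftoppm-then-ImageMagick fallback described above. The helper names, the injected `run` function, and the 300 DPI default are assumptions for illustration. The actual image-extractor.js may differ.

```javascript
// Sketch only: helper names and the 300 DPI default are assumptions, not taken from image-extractor.js.

// Follows the layout in the commit message: /uploads/{documentId}/images/page-{N}-img-{M}.png
function buildImagePath(documentId, pageNumber, imageIndex) {
  return `/uploads/${documentId}/images/page-${pageNumber}-img-${imageIndex}.png`;
}

// pdftoppm takes an output prefix; with -singlefile it writes exactly "<prefix>.png".
function buildPdftoppmArgs(pdfPath, pageNumber, outputPath, dpi = 300) {
  const prefix = outputPath.replace(/\.png$/, '');
  return ['-png', '-r', String(dpi), '-f', String(pageNumber), '-l', String(pageNumber),
    '-singlefile', pdfPath, prefix];
}

// ImageMagick selects a page with "file.pdf[n]", where n is zero-based.
function buildConvertArgs(pdfPath, pageNumber, outputPath, dpi = 300) {
  return ['-density', String(dpi), `${pdfPath}[${pageNumber - 1}]`, outputPath];
}

// Try pdftoppm first, fall back to ImageMagick's convert if it fails.
// `run(command, args)` is a hypothetical injected runner that returns a promise.
async function renderPage(run, pdfPath, pageNumber, outputPath) {
  try {
    await run('pdftoppm', buildPdftoppmArgs(pdfPath, pageNumber, outputPath));
    return 'pdftoppm';
  } catch (err) {
    await run('convert', buildConvertArgs(pdfPath, pageNumber, outputPath));
    return 'imagemagick';
  }
}
```

In Node, `run` could be a promisified `child_process.execFile`. Injecting it keeps the fallback logic testable without either tool installed.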

OCR Worker Integration:
- Imports image-extractor module and extractTextFromImage from OCR service
- After processing page text, extracts images from each page
- Runs Tesseract OCR on extracted images
- Stores image data in document_images table with extracted text and confidence
- Indexes images in Meilisearch with type='image' for searchability
- Updates document.imageCount and sets imagesExtracted flag
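
A rough sketch of how an extracted image and its OCR result could be shaped for the Meilisearch index and the document counters. The function names and every field other than id, path, type, imageCount, and imagesExtracted are assumptions. The real worker and the document_images schema may use different names.

```javascript
// Sketch only: documentId, pageNumber, imagePath, text, and confidence are hypothetical field names.

// Build a search record for one image, tagged type='image' as in the commit message.
function toImageSearchDocument(documentId, pageNumber, image, ocr) {
  return {
    id: image.id,
    type: 'image',
    documentId,
    pageNumber,
    imagePath: image.path,
    text: (ocr.text || '').trim(),
    confidence: ocr.confidence ?? 0,
  };
}

// Derive the document-level fields the worker updates after all pages are processed.
function summarizeImages(searchDocuments) {
  return {
    imageCount: searchDocuments.length,
    imagesExtracted: true,
  };
}
```

Here `imagesExtracted` is set even when no images were found, on the assumption that the flag records that extraction ran rather than that it produced results.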

Database:
- Uses existing document_images table from migration 004
- Stores image metadata, OCR text, and confidence scores

Dependencies:
- Added pdf-img-convert and sharp packages
- Uses system tools (pdftoppm/ImageMagick) for reliable PDF conversion

Testing:
- Created test-image-extraction.js to verify image extraction
- Created test-full-pipeline.js to test end-to-end extraction + OCR
- Successfully tested with 05-versions-space.pdf test document

Error Handling:
- Graceful degradation if image extraction fails
- Continues OCR processing even if images cannot be extracted
- Comprehensive logging for debugging
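
A minimal sketch of the graceful-degradation pattern described above: a failed image extraction is logged and yields an empty list, so text OCR for the page can continue. The wrapper name and the argument order passed to the extractor are assumptions.

```javascript
// Sketch only: extractImagesSafely and its parameters are hypothetical, not taken from the worker.

// Call the extractor, but never let its failure abort the surrounding OCR job.
async function extractImagesSafely(extractFn, pdfPath, documentId, pageNumber, log = console) {
  try {
    return await extractFn(pdfPath, documentId, pageNumber);
  } catch (err) {
    log.warn(`Image extraction failed for page ${pageNumber}: ${err.message}`);
    return []; // the page is treated as having no images
  }
}
```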

Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-19 19:54:25 +02:00
config chore(debug): log tenant token parent uid for troubleshooting 2025-10-19 17:11:05 +02:00
db feat: NaviDocs MVP - Complete codebase extraction from lilian1 2025-10-19 01:55:44 +02:00
examples feat: NaviDocs MVP - Complete codebase extraction from lilian1 2025-10-19 01:55:44 +02:00
middleware feat: NaviDocs MVP - Complete codebase extraction from lilian1 2025-10-19 01:55:44 +02:00
migrations feat: Add image extraction design, database schema, and migration 2025-10-19 19:47:30 +02:00
routes chore(env): add MEILISEARCH_SEARCH_KEY for dev; adjust routes to use search key fallback 2025-10-19 17:27:18 +02:00
scripts feat: NaviDocs MVP - Complete codebase extraction from lilian1 2025-10-19 01:55:44 +02:00
services feat: Add Google Cloud Vision API as primary OCR option 2025-10-19 09:08:38 +02:00
test/data chore: Local development environment setup 2025-10-19 04:42:55 +02:00
workers Implement PDF image extraction with OCR in OCR worker 2025-10-19 19:54:25 +02:00
.env.example feat: Complete frontend UI polish with Meilisearch-inspired design 2025-10-19 16:40:48 +02:00
API_SUMMARY.md feat: NaviDocs MVP - Complete codebase extraction from lilian1 2025-10-19 01:55:44 +02:00
index.js feat: NaviDocs MVP - Complete codebase extraction from lilian1 2025-10-19 01:55:44 +02:00
package.json Implement PDF image extraction with OCR in OCR worker 2025-10-19 19:54:25 +02:00
run-migration.js feat: Add image extraction design, database schema, and migration 2025-10-19 19:47:30 +02:00
test-full-pipeline.js Implement PDF image extraction with OCR in OCR worker 2025-10-19 19:54:25 +02:00
test-image-extraction.js Implement PDF image extraction with OCR in OCR worker 2025-10-19 19:54:25 +02:00
test-routes.js feat: NaviDocs MVP - Complete codebase extraction from lilian1 2025-10-19 01:55:44 +02:00