navidocs

dannystocker/navidocs

Fork 0

Commit graph

Author	SHA1	Message	Date
ggq-admin	09d9f1b601	Implement PDF image extraction with OCR in OCR worker This commit adds comprehensive image extraction and OCR functionality to the OCR worker: Features: - Created image-extractor.js worker module with extractImagesFromPage() function - Uses pdftoppm (with ImageMagick fallback) to convert PDF pages to high-res images - Images saved to /uploads/{documentId}/images/page-{N}-img-{M}.png - Returns image metadata: id, path, position, width, height OCR Worker Integration: - Imports image-extractor module and extractTextFromImage from OCR service - After processing page text, extracts images from each page - Runs Tesseract OCR on extracted images - Stores image data in document_images table with extracted text and confidence - Indexes images in Meilisearch with type='image' for searchability - Updates document.imageCount and sets imagesExtracted flag Database: - Uses existing document_images table from migration 004 - Stores image metadata, OCR text, and confidence scores Dependencies: - Added pdf-img-convert and sharp packages - Uses system tools (pdftoppm/ImageMagick) for reliable PDF conversion Testing: - Created test-image-extraction.js to verify image extraction - Created test-full-pipeline.js to test end-to-end extraction + OCR - Successfully tested with 05-versions-space.pdf test document Error Handling: - Graceful degradation if image extraction fails - Continues OCR processing even if images cannot be extracted - Comprehensive logging for debugging Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-10-19 19:54:25 +02:00
ggq-admin	155a8c0305	feat: NaviDocs MVP - Complete codebase extraction from lilian1 ## Backend (server/) - Express 5 API with security middleware (helmet, rate limiting) - SQLite database with WAL mode (schema from docs/architecture/) - Meilisearch integration with tenant tokens - BullMQ + Redis background job queue - OCR pipeline with Tesseract.js - File safety validation (extension, MIME, size) - 4 API route modules: upload, jobs, search, documents ## Frontend (client/) - Vue 3 with Composition API (<script setup>) - Vite 5 build system with HMR - Tailwind CSS (Meilisearch-inspired design) - UploadModal with drag-and-drop - FigureZoom component (ported from lilian1) - Meilisearch search integration with tenant tokens - Job polling composable - Clean SVG icons (no emojis) ## Code Extraction - ✅ manuals.js → UploadModal.vue, useJobPolling.js - ✅ figure-zoom.js → FigureZoom.vue - ✅ service-worker.js → client/public/service-worker.js (TODO) - ✅ glossary.json → Merged into Meilisearch synonyms - ❌ Discarded: quiz.js, persona.js, gamification.js (Frank-AI junk) ## Documentation - Complete extraction plan in docs/analysis/ - README with quick start guide - Architecture summary in docs/architecture/ ## Build Status - Server dependencies: ✅ Installed (234 packages) - Client dependencies: ✅ Installed (160 packages) - Client build: ✅ Successful (2.63s) 🤖 Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>	2025-10-19 01:55:44 +02:00

Author

SHA1

Message

Date

ggq-admin

09d9f1b601

Implement PDF image extraction with OCR in OCR worker

This commit adds comprehensive image extraction and OCR functionality to the OCR worker:

Features:
- Created image-extractor.js worker module with extractImagesFromPage() function
- Uses pdftoppm (with ImageMagick fallback) to convert PDF pages to high-res images
- Images saved to /uploads/{documentId}/images/page-{N}-img-{M}.png
- Returns image metadata: id, path, position, width, height

OCR Worker Integration:
- Imports image-extractor module and extractTextFromImage from OCR service
- After processing page text, extracts images from each page
- Runs Tesseract OCR on extracted images
- Stores image data in document_images table with extracted text and confidence
- Indexes images in Meilisearch with type='image' for searchability
- Updates document.imageCount and sets imagesExtracted flag

Database:
- Uses existing document_images table from migration 004
- Stores image metadata, OCR text, and confidence scores

Dependencies:
- Added pdf-img-convert and sharp packages
- Uses system tools (pdftoppm/ImageMagick) for reliable PDF conversion

Testing:
- Created test-image-extraction.js to verify image extraction
- Created test-full-pipeline.js to test end-to-end extraction + OCR
- Successfully tested with 05-versions-space.pdf test document

Error Handling:
- Graceful degradation if image extraction fails
- Continues OCR processing even if images cannot be extracted
- Comprehensive logging for debugging

Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-10-19 19:54:25 +02:00

ggq-admin

155a8c0305

feat: NaviDocs MVP - Complete codebase extraction from lilian1

## Backend (server/)
- Express 5 API with security middleware (helmet, rate limiting)
- SQLite database with WAL mode (schema from docs/architecture/)
- Meilisearch integration with tenant tokens
- BullMQ + Redis background job queue
- OCR pipeline with Tesseract.js
- File safety validation (extension, MIME, size)
- 4 API route modules: upload, jobs, search, documents

## Frontend (client/)
- Vue 3 with Composition API (<script setup>)
- Vite 5 build system with HMR
- Tailwind CSS (Meilisearch-inspired design)
- UploadModal with drag-and-drop
- FigureZoom component (ported from lilian1)
- Meilisearch search integration with tenant tokens
- Job polling composable
- Clean SVG icons (no emojis)

## Code Extraction
- ✅ manuals.js → UploadModal.vue, useJobPolling.js
- ✅ figure-zoom.js → FigureZoom.vue
- ✅ service-worker.js → client/public/service-worker.js (TODO)
- ✅ glossary.json → Merged into Meilisearch synonyms
- ❌ Discarded: quiz.js, persona.js, gamification.js (Frank-AI junk)

## Documentation
- Complete extraction plan in docs/analysis/
- README with quick start guide
- Architecture summary in docs/architecture/

## Build Status
- Server dependencies: ✅ Installed (234 packages)
- Client dependencies: ✅ Installed (160 packages)
- Client build: ✅ Successful (2.63s)

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>

2025-10-19 01:55:44 +02:00

2 commits