Implements multi-format document upload capability expanding beyond PDFs.
Changes:
- server/package.json: Add mammoth (DOCX) and xlsx (Excel) dependencies
- server/services/file-safety.js: Expand allowed file types and MIME types
  - Added getFileCategory() function to classify file types
  - Support for images, Office docs, and text files
  - Flexible MIME validation for text files
- server/services/document-processor.js: NEW routing service
  - processImageFile(): Tesseract OCR for JPG/PNG/WebP
  - processWordDocument(): Mammoth for DOCX text extraction
  - processExcelDocument(): XLSX for spreadsheet data extraction
  - processTextFile(): Native reading for TXT/MD files
  - Unified interface with processDocument() router
- server/workers/ocr-worker.js: Switch from extractTextFromPDF to processDocument
  - Now handles all file types through unified processor
- client/src/components/UploadModal.vue: Update UI for multi-format
  - File input accepts all new file types
  - Updated help text to show supported formats
Supported formats: PDF, JPG, PNG, WebP, DOCX, XLSX, TXT, MD
Text extraction methods: Native (Office/text), Tesseract OCR (images), PDF.js (PDFs)
Search indexing: All file types processed and indexed in Meilisearch
Session: Cloud Session 2 - Multi-Format Upload Support
Branch: feature/multiformat
Status: Complete - Ready for testing
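
The classification and routing described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in file-safety.js or document-processor.js: the category names, the extension map, and the injected `handlers` argument are assumptions made so the example runs without Tesseract, Mammoth, or XLSX installed.

```javascript
const path = require('path');

// Extension-to-category map covering only the formats listed in the commit
// message. The category names ('pdf', 'image', ...) are hypothetical.
const CATEGORY_BY_EXTENSION = {
  '.pdf': 'pdf',
  '.jpg': 'image',
  '.png': 'image',
  '.webp': 'image',
  '.docx': 'word',
  '.xlsx': 'excel',
  '.txt': 'text',
  '.md': 'text',
};

// Assumed shape of getFileCategory(): the real function may take a MIME
// type as well, or return different labels.
function getFileCategory(filename) {
  const ext = path.extname(filename).toLowerCase();
  return CATEGORY_BY_EXTENSION[ext] || null;
}

// Router sketch: looks up the category and delegates to the matching
// handler. Handlers are passed in here purely for illustration.
async function processDocument(filePath, handlers) {
  const category = getFileCategory(filePath);
  if (!category) {
    throw new Error(`Unsupported file type: ${filePath}`);
  }
  const handler = handlers[category];
  if (!handler) {
    throw new Error(`No handler registered for category: ${category}`);
  }
  return handler(filePath);
}
```

In the worker, the handlers would presumably be processImageFile, processWordDocument, processExcelDocument, processTextFile, and the existing PDF extractor.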
Fixed:
- Price: €800K-€1.5M, Sunseeker added
- Agent 1: Joe Trader persona + actual sale ads research
- Ignored meilisearch binary + data/ (too large for GitHub)
- SESSION_DEBUG_BLOCKERS.md created
Ready for Session 1 launch.
🤖 Generated with Claude Code
This commit adds comprehensive image extraction and OCR functionality to the OCR worker:
Features:
- Created image-extractor.js worker module with extractImagesFromPage() function
- Uses pdftoppm (with ImageMagick fallback) to convert PDF pages to high-res images
- Images saved to /uploads/{documentId}/images/page-{N}-img-{M}.png
- Returns image metadata: id, path, position, width, height
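
As a rough sketch of the path convention and page rendering above: the helper names, the 300 DPI default, and the choice of pdftoppm flags are assumptions for illustration, not the contents of image-extractor.js. The flags themselves (`-png`, `-r`, `-f`, `-l`, `-singlefile`) are standard pdftoppm options.

```javascript
const path = require('path');

// Builds /uploads/{documentId}/images/page-{N}-img-{M}.png.
// Hypothetical helper; the real module may assemble the path differently.
function imageOutputPath(uploadsRoot, documentId, pageNumber, imageIndex) {
  return path.posix.join(
    uploadsRoot,
    String(documentId),
    'images',
    `page-${pageNumber}-img-${imageIndex}.png`
  );
}

// Arguments for rendering a single PDF page to PNG with pdftoppm.
// The 300 DPI default is an assumed value for "high-res".
function pdftoppmArgs(pdfPath, outputPath, pageNumber, dpi = 300) {
  // pdftoppm appends ".png" itself, so pass the path without the extension.
  const outputRoot = outputPath.replace(/\.png$/, '');
  return [
    '-png',
    '-r', String(dpi),
    '-f', String(pageNumber), // first page to render
    '-l', String(pageNumber), // last page to render
    '-singlefile',            // do not add a page-number suffix
    pdfPath,
    outputRoot,
  ];
}
```

The argument array could then be passed to `child_process.execFile('pdftoppm', args, callback)`.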
OCR Worker Integration:
- Imports image-extractor module and extractTextFromImage from OCR service
- After processing page text, extracts images from each page
- Runs Tesseract OCR on extracted images
- Stores image data in document_images table with extracted text and confidence
- Indexes images in Meilisearch with type='image' for searchability
- Updates document.imageCount and sets imagesExtracted flag
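
One way the search record for an extracted image might be shaped is shown below. Only `type: 'image'` comes from this commit; every other field name and the id scheme are assumptions for illustration.

```javascript
// Builds a search document for one extracted image.
// Field names other than `type` are hypothetical.
function buildImageSearchDocument(image, documentId) {
  return {
    // Primary key uses only characters a Meilisearch document id allows
    // (alphanumerics, hyphens, underscores), assuming both ids do too.
    id: `${documentId}-${image.id}`,
    type: 'image',
    documentId,
    text: image.extractedText || '',
    confidence: image.confidence ?? null,
    imagePath: image.path,
  };
}
```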
Database:
- Uses existing document_images table from migration 004
- Stores image metadata, OCR text, and confidence scores
Dependencies:
- Added pdf-img-convert and sharp packages
- Uses system tools (pdftoppm/ImageMagick) for reliable PDF conversion
Testing:
- Created test-image-extraction.js to verify image extraction
- Created test-full-pipeline.js to test end-to-end extraction + OCR
- Successfully tested with 05-versions-space.pdf test document
Error Handling:
- Graceful degradation if image extraction fails
- Continues OCR processing even if images cannot be extracted
- Comprehensive logging for debugging
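
The graceful-degradation behaviour can be sketched as a try/catch around the image step only. The function name, the `page.number` field, and the injected dependencies are hypothetical; the worker's actual structure may differ.

```javascript
// Processes one page: text extraction must succeed, image extraction may
// fail without aborting the job. Dependencies are injected for illustration.
async function processPage(page, { extractText, extractImages, logger = console }) {
  const text = await extractText(page);

  let images = [];
  try {
    images = await extractImages(page);
  } catch (err) {
    // Log and continue: the page text is still stored and indexed.
    logger.warn(`Image extraction failed for page ${page.number}: ${err.message}`);
  }

  return { text, images };
}
```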
Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Import dotenv in worker to load .env configuration
- Specify explicit path to server/.env file
- Update Meilisearch config to use changeme123 as default key
- Add debug logging to Meilisearch client initialization
- Add meilisearch-data/ to .gitignore
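
A sketch of the configuration loading described above, under stated assumptions: the environment variable names `MEILISEARCH_HOST` and `MEILISEARCH_MASTER_KEY`, the default host, and the worker living one directory below server/ are all guesses. Only the `changeme123` default key comes from this commit.

```javascript
const path = require('path');

// Explicit location of server/.env, assuming the worker file sits in
// server/workers/.
function envFilePath(workerDir) {
  return path.resolve(workerDir, '..', '.env');
}

// Resolves Meilisearch settings from the environment, falling back to
// defaults. Variable names and the host default are hypothetical.
function meilisearchConfig(env = process.env) {
  return {
    host: env.MEILISEARCH_HOST || 'http://127.0.0.1:7700',
    apiKey: env.MEILISEARCH_MASTER_KEY || 'changeme123',
  };
}

// In the worker, before reading the config:
//   require('dotenv').config({ path: envFilePath(__dirname) });
```

A hard-coded fallback key is convenient for local development but should not reach a deployed environment.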
OCR pipeline works end to end, with 85% OCR confidence:
- PDF upload ✅
- Queue processing ✅
- PDF to image conversion ✅
- Tesseract OCR ✅
- Database storage ✅
Remaining issue: Meilisearch authentication needs to be resolved
to enable search indexing.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Replace Tesseract.js with local tesseract CLI due to CDN 404 issues
- Fix queue name mismatch (ocr-processing vs ocr-jobs)
- Local tesseract uses pre-installed training data
- Faster and more reliable than downloading from CDN
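
Calling the local tesseract binary might look roughly like this. The signature of extractTextFromImage and the `eng` default are assumptions. Passing `stdout` as the output base makes the CLI print the recognized text instead of writing a file. This sketch returns text only, not a confidence score.

```javascript
const { execFile } = require('child_process');

// CLI arguments: tesseract <image> stdout -l <lang>
function tesseractArgs(imagePath, lang = 'eng') {
  return [imagePath, 'stdout', '-l', lang];
}

// Runs the locally installed tesseract and resolves with the recognized
// text. Rejects if the binary is missing or exits with an error.
function extractTextFromImage(imagePath, lang = 'eng') {
  return new Promise((resolve, reject) => {
    execFile('tesseract', tesseractArgs(imagePath, lang), (err, stdout) => {
      if (err) {
        reject(err);
        return;
      }
      resolve(stdout.trim());
    });
  });
}
```

Using execFile rather than exec avoids passing the image path through a shell.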
🤖 Generated with Claude Code
Co-Authored-By: Claude <noreply@anthropic.com>