navidocs/server/package.json
ggq-admin 09d9f1b601 Implement PDF image extraction with OCR in OCR worker
This commit adds comprehensive image extraction and OCR functionality to the OCR worker:

Features:
- Created image-extractor.js worker module with extractImagesFromPage() function
- Uses pdftoppm (with ImageMagick fallback) to convert PDF pages to high-res images
- Images saved to /uploads/{documentId}/images/page-{N}-img-{M}.png
- Returns image metadata: id, path, position, width, height
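
As a rough illustration of the naming scheme and the pdftoppm invocation described above, a minimal sketch might look like the following. The helper names `imagePath` and `pdftoppmArgs` and the 300 DPI default are assumptions made for this example; they are not taken from image-extractor.js.

```javascript
// Hypothetical helpers, for illustration only.

// Build the storage path stated in the commit message:
// /uploads/{documentId}/images/page-{N}-img-{M}.png
function imagePath(documentId, pageNumber, imageIndex) {
  return `/uploads/${documentId}/images/page-${pageNumber}-img-${imageIndex}.png`;
}

// Build the argument list for rendering one page to PNG with pdftoppm.
// -png selects PNG output, -r sets the resolution in DPI,
// -f / -l restrict rendering to a single page.
function pdftoppmArgs(pdfPath, pageNumber, outputPrefix, dpi = 300) {
  return [
    '-png',
    '-r', String(dpi),
    '-f', String(pageNumber),
    '-l', String(pageNumber),
    pdfPath,
    outputPrefix,
  ];
}
```

The argument list could be passed to a process spawner such as `execFile('pdftoppm', args)`. pdftoppm appends the page number to the output prefix, so the rendered file would likely need to be renamed to match the `page-{N}-img-{M}.png` pattern.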

OCR Worker Integration:
- Imports image-extractor module and extractTextFromImage from OCR service
- After processing page text, extracts images from each page
- Runs Tesseract OCR on extracted images
- Stores image data in document_images table with extracted text and confidence
- Indexes images in Meilisearch with type='image' for searchability
- Updates document.imageCount and sets imagesExtracted flag
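
The per-page flow listed above could be sketched roughly as follows. Every function passed in through `deps` is a hypothetical stand-in for the real extractor, OCR service, database layer, and Meilisearch client, and the record shape is an assumption rather than the actual document_images schema.

```javascript
// Sketch only: extractImages, ocrImage, saveImage and indexImage are
// hypothetical injected dependencies, not the worker's real function names.
async function processPageImages(documentId, pageNumber, deps) {
  const { extractImages, ocrImage, saveImage, indexImage } = deps;

  // 1. Extract the images found on this page.
  const images = await extractImages(documentId, pageNumber);

  const stored = [];
  for (const image of images) {
    // 2. Run OCR on each extracted image.
    const { text, confidence } = await ocrImage(image.path);

    // 3. Combine image metadata with the OCR result.
    const record = { ...image, documentId, pageNumber, text, confidence, type: 'image' };

    // 4. Persist the record, then make it searchable.
    await saveImage(record);
    await indexImage(record);
    stored.push(record);
  }

  // The caller can use stored.length to update the document's image count.
  return stored;
}
```

Passing the collaborators in as arguments keeps the sketch testable without Tesseract, SQLite, or Meilisearch running.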

Database:
- Uses existing document_images table from migration 004
- Stores image metadata, OCR text, and confidence scores

Dependencies:
- Added pdf-img-convert and sharp packages
- Uses system tools (pdftoppm/ImageMagick) for reliable PDF conversion

Testing:
- Created test-image-extraction.js to verify image extraction
- Created test-full-pipeline.js to test end-to-end extraction + OCR
- Successfully tested with 05-versions-space.pdf test document

Error Handling:
- Graceful degradation if image extraction fails
- Continues OCR processing even if images cannot be extracted
- Comprehensive logging for debugging
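
One way to get the graceful degradation described above is to wrap the extraction call so that a failure is logged and treated as "no images". This is a minimal sketch under that assumption; `extractImagesSafely` and its parameters are hypothetical names, not the worker's actual code.

```javascript
// Hypothetical wrapper: returns an empty list instead of throwing, so the
// surrounding OCR job can continue with page text only.
async function extractImagesSafely(extractImages, pageNumber, log = console) {
  try {
    return await extractImages(pageNumber);
  } catch (err) {
    // Log and degrade: image extraction is treated as optional.
    log.warn(`Image extraction failed on page ${pageNumber}: ${err.message}`);
    return [];
  }
}
```

With this shape, the caller never needs its own try/catch around image extraction, and a missing pdftoppm or ImageMagick binary would only reduce the result to text-only OCR.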

Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-19 19:54:25 +02:00

{
  "name": "navidocs-server",
  "version": "1.0.0",
  "description": "NaviDocs backend API - Boat manual management with OCR and search",
  "type": "module",
  "main": "index.js",
  "scripts": {
    "start": "node index.js",
    "dev": "node --watch index.js",
    "init-db": "node db/init.js"
  },
  "keywords": [
    "boat",
    "manuals",
    "ocr",
    "meilisearch"
  ],
  "author": "",
  "license": "MIT",
  "dependencies": {
    "bcrypt": "^5.1.0",
    "better-sqlite3": "^11.0.0",
    "bullmq": "^5.0.0",
    "cors": "^2.8.5",
    "dotenv": "^16.0.0",
    "express": "^5.0.0",
    "express-rate-limit": "^7.0.0",
    "file-type": "^19.0.0",
    "helmet": "^7.0.0",
    "ioredis": "^5.0.0",
    "jsonwebtoken": "^9.0.0",
    "meilisearch": "^0.41.0",
    "multer": "^1.4.5-lts.1",
    "pdf-img-convert": "^2.0.0",
    "pdf-parse": "^1.1.1",
    "sharp": "^0.34.4",
    "tesseract.js": "^5.0.0",
    "uuid": "^10.0.0"
  },
  "devDependencies": {
    "@types/node": "^20.0.0"
  }
}