Commit graph

7 commits

Author SHA1 Message Date
Claude
f0096a6bd6
Feature: Multi-format upload support (JPG, PNG, DOCX, XLSX, TXT, MD)
Implements multi-format document upload capability expanding beyond PDFs.

Changes:
- server/package.json: Add mammoth (DOCX) and xlsx (Excel) dependencies
- server/services/file-safety.js: Expand allowed file types and MIME types
  - Added getFileCategory() function to classify file types
  - Support for images, Office docs, and text files
  - Flexible MIME validation for text files
- server/services/document-processor.js: NEW routing service
  - processImageFile(): Tesseract OCR for JPG/PNG/WebP
  - processWordDocument(): Mammoth for DOCX text extraction
  - processExcelDocument(): XLSX for spreadsheet data extraction
  - processTextFile(): Native reading for TXT/MD files
  - Unified interface with processDocument() router
- server/workers/ocr-worker.js: Switch from extractTextFromPDF to processDocument
  - Now handles all file types through unified processor
- client/src/components/UploadModal.vue: Update UI for multi-format
  - File input accepts all new file types
  - Updated help text to show supported formats

Supported formats: PDF, JPG, PNG, WebP, DOCX, XLSX, TXT, MD
Text extraction methods: Native (Office/text), Tesseract OCR (images), PDF.js (PDFs)
Search indexing: All file types processed and indexed in Meilisearch
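
The category-based routing described above might look roughly like the sketch below. Only the names getFileCategory() and processDocument() come from this commit; the extension table, the category labels, and the `handlers` map are assumptions for illustration, and the real handlers call Tesseract, Mammoth, XLSX, and PDF.js rather than stubs.

```javascript
// Hypothetical extension-to-category table; the real one lives in file-safety.js.
const CATEGORY_BY_EXTENSION = new Map([
  ['pdf', 'pdf'],
  ['jpg', 'image'], ['jpeg', 'image'], ['png', 'image'], ['webp', 'image'],
  ['docx', 'word'],
  ['xlsx', 'excel'],
  ['txt', 'text'], ['md', 'text'],
]);

// Classify a file by its extension; returns null for unsupported types.
function getFileCategory(filename) {
  const dot = filename.lastIndexOf('.');
  if (dot === -1) return null;
  const ext = filename.slice(dot + 1).toLowerCase();
  return CATEGORY_BY_EXTENSION.get(ext) ?? null;
}

// Route a file to the handler registered for its category.
// `handlers` is an assumed map such as { image: processImageFile, ... }.
function processDocument(filename, handlers) {
  const category = getFileCategory(filename);
  if (category === null || typeof handlers[category] !== 'function') {
    throw new Error(`Unsupported file type: ${filename}`);
  }
  return handlers[category](filename);
}
```

Keeping classification separate from routing lets the upload validator and the OCR worker share one definition of "supported".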

Session: Cloud Session 2 - Multi-Format Upload Support
Branch: feature/multiformat
Status: Complete - Ready for testing
2025-11-13 12:54:44 +00:00
Danny Stocker
58b344aa31 FINAL: P0 blockers fixed + Joe Trader + ignore binaries
Fixed:
- Price: €800K-€1.5M, Sunseeker added
- Agent 1: Joe Trader persona + actual sale ads research
- Ignored meilisearch binary + data/ (too large for GitHub)
- SESSION_DEBUG_BLOCKERS.md created

Ready for Session 1 launch.

🤖 Generated with Claude Code
2025-11-13 01:29:59 +01:00
ggq-admin
fb88b291de feat: Add interactive Table of Contents navigation with i18n support
Implements complete TOC feature for document navigation with bilingual support.

## TOC Detection & Extraction
- Pattern-based TOC detection with 3 regex patterns
- Heuristic validation (30%+ match ratio, 5+ entries, sequential pages)
- Hierarchical section key parsing (e.g., "4.1.2" → level 3, parent "4.1")
- Database schema with parent-child relationships
- Automatic extraction during OCR post-processing
- Server-side LRU caching (200 entries, 30min TTL)
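
A minimal sketch of the section key parsing and the heuristic validation listed above. The function names, the entry shape (`{ page }`), and the reading of "sequential pages" as non-decreasing page numbers are assumptions; the actual logic in toc-extractor.js may differ.

```javascript
// Derive nesting level and parent key from a dotted section key,
// e.g. "4.1.2" gives level 3 and parent "4.1".
function parseSectionKey(sectionKey) {
  const parts = sectionKey.split('.').filter((p) => p.length > 0);
  return {
    level: parts.length,
    parentKey: parts.length > 1 ? parts.slice(0, -1).join('.') : null,
  };
}

// Apply the three heuristics to a list of candidate TOC entries.
// `matchedLines` and `totalLines` count the lines that matched a TOC pattern
// versus all lines inspected (hypothetical parameters).
function looksLikeToc(entries, matchedLines, totalLines) {
  if (totalLines === 0 || matchedLines / totalLines < 0.3) return false;
  if (entries.length < 5) return false;
  // "Sequential pages" is interpreted here as non-decreasing page numbers.
  return entries.every((e, i) => i === 0 || e.page >= entries[i - 1].page);
}
```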

## UI Components
- TocSidebar: Collapsible sidebar (320px) with auto-open on TOC presence
- TocEntry: Recursive component for hierarchical rendering
- Flex layout: Sidebar + PDF viewer side-by-side
- Active page highlighting with real-time sync
- localStorage persistence for sidebar state

## Navigation Features
- Click TOC entry → PDF jumps to page
- Deep link support: URL hash format #p=12
- Page change events: navidocs:pagechange custom event
- URL hash updates on all navigation (next/prev/goTo/TOC)
- Hash change listener for external navigation
- Page clamping and validation
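
The deep-link format and page clamping above can be sketched as two small pure functions. The function names are hypothetical, and returning null for a malformed hash is an assumption about how invalid input is treated.

```javascript
// Parse a location hash such as "#p=12" into a page number clamped to
// the range [1, pageCount]. Returns null when the hash carries no valid page.
function parsePageFromHash(hash, pageCount) {
  const match = /^#p=(\d+)$/.exec(hash);
  if (!match || pageCount < 1) return null;
  const page = Number.parseInt(match[1], 10);
  return Math.min(Math.max(page, 1), pageCount);
}

// Build the hash for a given page; the inverse of parsePageFromHash.
function pageToHash(page) {
  return `#p=${page}`;
}
```

In the viewer, a hashchange listener could call parsePageFromHash(window.location.hash, pageCount) and then dispatch the navidocs:pagechange event; the shape of the event's detail payload is not specified in this commit.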

## Search Integration
- "Jump to section" button in search results
- Shows when result has section field
- Navigates to document with page number and hash

## Accessibility
- ARIA attributes: role, aria-label, aria-expanded, aria-current
- Keyboard navigation: Enter/Space on entries, Tab focus
- Screen reader support with aria-live regions
- Semantic HTML with proper list/listitem roles

## Internationalization (i18n)
- Vue I18n integration with vue-i18n package
- English and French translations
- 8 TOC-specific translation keys
- Language switcher component in document viewer
- Locale persistence in localStorage

## Error Handling
- Specific error messages for each failure case
- Validation before processing (doc exists, has pages, has OCR)
- Non-blocking TOC extraction (doesn't fail OCR jobs)
- Detailed error returns: {success, error, entriesCount, pages}
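
One way the pre-processing validation could produce the {success, error, entriesCount, pages} shape is sketched below. The function name, the `doc` fields (`pageCount`, `hasOcrText`), and the error strings are hypothetical; only the return shape and the three checks come from this commit.

```javascript
// Check the three preconditions and return a uniform result object.
function checkTocPreconditions(doc) {
  const fail = (error) => ({ success: false, error, entriesCount: 0, pages: 0 });
  if (!doc) return fail('Document not found');
  if (!doc.pageCount || doc.pageCount < 1) return fail('Document has no pages');
  if (!doc.hasOcrText) return fail('Document has no OCR text yet');
  // Preconditions hold; entriesCount is filled in later by the extractor.
  return { success: true, error: null, entriesCount: 0, pages: doc.pageCount };
}
```

Returning a result object instead of throwing is what lets the OCR worker log a TOC failure and carry on with the job.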

## API Endpoints
- GET /api/documents/:id/toc?format=flat|tree
- POST /api/documents/:id/toc/extract
- Cache invalidation on re-extraction
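
The caching behaviour (200 entries, 30 minute TTL, invalidation on re-extraction) could be approximated with a Map-based LRU such as the sketch below. This is an illustration under stated assumptions, not the project's cache: the class name, the injectable clock, and the method names are all hypothetical.

```javascript
// Minimal LRU cache with a per-entry TTL, relying on Map insertion order.
class LruTtlCache {
  constructor(maxEntries = 200, ttlMs = 30 * 60 * 1000, now = Date.now) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.now = now; // injectable clock (an assumption, to ease testing)
    this.map = new Map();
  }

  get(key) {
    const entry = this.map.get(key);
    if (!entry) return undefined;
    if (this.now() - entry.storedAt >= this.ttlMs) {
      this.map.delete(key); // entry has expired
      return undefined;
    }
    // Re-insert so the key becomes the most recently used.
    this.map.delete(key);
    this.map.set(key, entry);
    return entry.value;
  }

  set(key, value) {
    this.map.delete(key);
    this.map.set(key, { value, storedAt: this.now() });
    if (this.map.size > this.maxEntries) {
      // The first key in iteration order is the least recently used.
      this.map.delete(this.map.keys().next().value);
    }
  }

  // Drop a stale entry, e.g. after POST /api/documents/:id/toc/extract.
  invalidate(key) {
    return this.map.delete(key);
  }
}
```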

## Testing
- Smoke test script: 9 comprehensive tests
- E2E testing guide with 5 manual scenarios
- Tests cover: API, caching, validation, navigation, search

## Database
- Migration 002: document_toc table
- Fields: id, document_id, title, section_key, page_start, level, parent_id, order_index
- Foreign keys with CASCADE delete
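
Given the document_toc fields above, the `format=tree` response could be derived from the flat rows roughly as follows. The function name and the `children` property are assumptions; treating rows with a missing parent as top-level is also a choice made for this sketch.

```javascript
// Convert flat document_toc rows into a nested tree using parent_id,
// preserving order_index within each level.
function buildTocTree(rows) {
  const nodes = new Map();
  for (const row of rows) nodes.set(row.id, { ...row, children: [] });

  const roots = [];
  const ordered = [...rows].sort((a, b) => a.order_index - b.order_index);
  for (const row of ordered) {
    const node = nodes.get(row.id);
    const parent = row.parent_id == null ? undefined : nodes.get(row.parent_id);
    if (parent) parent.children.push(node);
    else roots.push(node); // top-level entry, or an orphan whose parent is absent
  }
  return roots;
}
```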

## Files Changed
- New: TocSidebar.vue, TocEntry.vue, LanguageSwitcher.vue
- New: toc-extractor.js, toc.js routes, i18n setup
- Modified: DocumentView.vue (sidebar, deep links, events)
- Modified: SearchView.vue (Jump to section button)
- Modified: ocr-worker.js (TOC post-processing)
- New: toc-smoke-test.sh, TOC_E2E_TEST.md

Generated with Claude Code (https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-20 13:22:45 +02:00
ggq-admin
09d9f1b601 Implement PDF image extraction with OCR in OCR worker
This commit adds comprehensive image extraction and OCR functionality to the OCR worker:

Features:
- Created image-extractor.js worker module with extractImagesFromPage() function
- Uses pdftoppm (with ImageMagick fallback) to convert PDF pages to high-res images
- Images saved to /uploads/{documentId}/images/page-{N}-img-{M}.png
- Returns image metadata: id, path, position, width, height
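
The path convention and metadata shape above can be illustrated with a small sketch. The function names and the `id` format are hypothetical; only the path pattern and the field names (id, path, position, width, height) come from this commit.

```javascript
// Build the storage path for image M on page N of a document.
function buildImagePath(documentId, pageNumber, imageIndex) {
  return `/uploads/${documentId}/images/page-${pageNumber}-img-${imageIndex}.png`;
}

// Assemble the metadata record returned for one extracted image.
function buildImageMetadata(documentId, pageNumber, imageIndex, { position, width, height }) {
  return {
    id: `${documentId}-p${pageNumber}-i${imageIndex}`, // hypothetical id format
    path: buildImagePath(documentId, pageNumber, imageIndex),
    position,
    width,
    height,
  };
}
```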

OCR Worker Integration:
- Imports image-extractor module and extractTextFromImage from OCR service
- After processing page text, extracts images from each page
- Runs Tesseract OCR on extracted images
- Stores image data in document_images table with extracted text and confidence
- Indexes images in Meilisearch with type='image' for searchability
- Updates document.imageCount and sets imagesExtracted flag

Database:
- Uses existing document_images table from migration 004
- Stores image metadata, OCR text, and confidence scores

Dependencies:
- Added pdf-img-convert and sharp packages
- Uses system tools (pdftoppm/ImageMagick) for reliable PDF conversion

Testing:
- Created test-image-extraction.js to verify image extraction
- Created test-full-pipeline.js to test end-to-end extraction + OCR
- Successfully tested with 05-versions-space.pdf test document

Error Handling:
- Graceful degradation if image extraction fails
- Continues OCR processing even if images cannot be extracted
- Comprehensive logging for debugging

Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-19 19:54:25 +02:00
ggq-admin
b152df159d feat: Add dotenv loading to OCR worker for environment configuration
- Import dotenv in worker to load .env configuration
- Specify explicit path to server/.env file
- Update Meilisearch config to use changeme123 as default key
- Add debug logging to Meilisearch client initialization
- Add meilisearch-data/ to .gitignore

OCR pipeline is fully functional with 85% confidence:
- PDF upload
- Queue processing
- PDF to image conversion
- Tesseract OCR
- Database storage

Remaining issue: Meilisearch authentication needs to be resolved
to enable search indexing.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-19 09:00:16 +02:00
ggq-admin
af02363299 fix: Switch to local system tesseract command for OCR
- Replace Tesseract.js with local tesseract CLI due to CDN 404 issues
- Fix queue name mismatch (ocr-processing vs ocr-jobs)
- Local tesseract uses pre-installed training data
- Faster and more reliable than downloading from CDN
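
A minimal sketch of calling the local tesseract CLI from Node, assuming the binary is on PATH. The function names and the default language are assumptions. The commit does not say which of the two queue names was kept, so the constant's value below is a guess; the point is only that producer and worker should share a single definition.

```javascript
const { execFile } = require('node:child_process');

// Shared by the upload route and the worker so the names cannot drift apart.
// The chosen value is an assumption.
const OCR_QUEUE_NAME = 'ocr-jobs';

// Build CLI arguments. Passing "stdout" as the output base makes tesseract
// write the recognized text to standard output instead of a file.
function buildTesseractArgs(imagePath, language = 'eng') {
  return [imagePath, 'stdout', '-l', language];
}

// Run tesseract on one image and resolve with the recognized text.
function runTesseract(imagePath, language = 'eng') {
  return new Promise((resolve, reject) => {
    execFile('tesseract', buildTesseractArgs(imagePath, language), (err, stdout) => {
      if (err) reject(err);
      else resolve(stdout.trim());
    });
  });
}
```

Using execFile with an argument array, rather than a shell string, avoids quoting problems with uploaded file names.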

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-19 04:48:18 +02:00
ggq-admin
155a8c0305 feat: NaviDocs MVP - Complete codebase extraction from lilian1
## Backend (server/)
- Express 5 API with security middleware (helmet, rate limiting)
- SQLite database with WAL mode (schema from docs/architecture/)
- Meilisearch integration with tenant tokens
- BullMQ + Redis background job queue
- OCR pipeline with Tesseract.js
- File safety validation (extension, MIME, size)
- 4 API route modules: upload, jobs, search, documents
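
The three file safety checks (extension, MIME, size) might be combined as in the sketch below. The allowed-type table, the 50 MB limit, the function name, and the error strings are all hypothetical; the commit names only the three kinds of check.

```javascript
// Hypothetical allow-list and size limit for the PDF-only MVP.
const ALLOWED_TYPES = new Map([['pdf', 'application/pdf']]);
const MAX_SIZE_BYTES = 50 * 1024 * 1024;

// Validate an upload and collect every failed check.
function validateUpload({ filename, mimeType, sizeBytes }) {
  const errors = [];
  const dot = filename.lastIndexOf('.');
  const ext = dot === -1 ? '' : filename.slice(dot + 1).toLowerCase();

  if (!ALLOWED_TYPES.has(ext)) {
    errors.push(`Extension not allowed: .${ext}`);
  } else if (ALLOWED_TYPES.get(ext) !== mimeType) {
    errors.push(`MIME type mismatch: ${mimeType}`);
  }
  if (sizeBytes <= 0 || sizeBytes > MAX_SIZE_BYTES) {
    errors.push('File size out of range');
  }
  return { valid: errors.length === 0, errors };
}
```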

## Frontend (client/)
- Vue 3 with Composition API (<script setup>)
- Vite 5 build system with HMR
- Tailwind CSS (Meilisearch-inspired design)
- UploadModal with drag-and-drop
- FigureZoom component (ported from lilian1)
- Meilisearch search integration with tenant tokens
- Job polling composable
- Clean SVG icons (no emojis)

## Code Extraction
- manuals.js → UploadModal.vue, useJobPolling.js
- figure-zoom.js → FigureZoom.vue
- service-worker.js → client/public/service-worker.js (TODO)
- glossary.json → Merged into Meilisearch synonyms
- Discarded: quiz.js, persona.js, gamification.js (Frank-AI junk)

## Documentation
- Complete extraction plan in docs/analysis/
- README with quick start guide
- Architecture summary in docs/architecture/

## Build Status
- Server dependencies: Installed (234 packages)
- Client dependencies: Installed (160 packages)
- Client build: Successful (2.63s)

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-19 01:55:44 +02:00