Open
Labels: bug (Something isn't working)
Description
Describe the issue
Our PDF processing pipeline extracts text from uploaded documents, but it has no fallback when text extraction fails or returns no content. We need a fallback mechanism that runs OCR in that case, so users can still query the document even if it is an image-based (scanned) PDF or its text is unreadable.
Steps to Reproduce
- Upload a scanned PDF (or an image-based PDF).
- Attempt to extract text from the document.
- Observe that the system fails to retrieve any content and does not try OCR.
Expected Behavior
- If text extraction fails, the system should automatically attempt OCR.
- OCR-extracted text should be processed for search queries.
- Errors should be logged if OCR also fails.
- Users should receive a meaningful message if the document is entirely unreadable.
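
The expected behavior above could be sketched roughly as follows. This is only an illustration, not the pipeline's actual code: `extract_with_ocr_fallback`, `ExtractionResult`, `UNREADABLE_MESSAGE`, and the logger name are hypothetical, and the text extractor and OCR engine are passed in as plain callables because the issue does not say which libraries the pipeline uses.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical logger name; the real pipeline may use a different one.
logger = logging.getLogger("pdf_pipeline")

# Hypothetical user-facing message for documents that cannot be read at all.
UNREADABLE_MESSAGE = (
    "We could not read any text from this document, even with OCR. "
    "Please upload a clearer copy."
)


@dataclass
class ExtractionResult:
    text: str
    source: Optional[str]  # "text_layer", "ocr", or None if nothing was readable
    message: Optional[str] = None  # set only when the document is unreadable


def extract_with_ocr_fallback(
    pdf_bytes: bytes,
    extract_text: Callable[[bytes], str],
    run_ocr: Callable[[bytes], str],
) -> ExtractionResult:
    """Try normal text extraction first, then fall back to OCR."""
    # Step 1: regular text extraction. An exception or an empty/whitespace-only
    # result both count as failure.
    try:
        text = extract_text(pdf_bytes)
    except Exception:
        logger.warning("Text extraction failed; falling back to OCR", exc_info=True)
        text = ""
    if text and text.strip():
        return ExtractionResult(text=text, source="text_layer")

    # Step 2: OCR fallback. Failures here are logged as errors.
    try:
        text = run_ocr(pdf_bytes)
    except Exception:
        logger.error("OCR fallback failed", exc_info=True)
        text = ""
    if text and text.strip():
        return ExtractionResult(text=text, source="ocr")

    # Step 3: nothing readable; log it and return a meaningful message.
    logger.error("Document is unreadable: text extraction and OCR both produced no text")
    return ExtractionResult(text="", source=None, message=UNREADABLE_MESSAGE)
```

The returned `text` would then go through the same indexing path used for search queries regardless of `source`, and `message` can be shown to the user when it is set. Treating an empty extraction result as a failure matters here, since a scanned PDF may yield no text without raising an error.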
Relevant Logs/Error Messages
N/A
Priority
Medium