[Bug]: Enhance PDF Processing with OCR for Unreadable Documents #16

@MridulTi

Describe the issue

Currently, our PDF processing pipeline extracts text from documents. However, if text extraction fails, we need a fallback mechanism to perform OCR and retrieve the content. This ensures users can still query the document even if it's an image-based PDF or has unreadable text.

Steps to Reproduce

  1. Upload a scanned PDF (or an image-based PDF).
  2. Attempt to extract text from the document.
  3. Observe that the system fails to retrieve any content and does not try OCR.

Expected Behavior

  • If text extraction fails, the system should automatically attempt OCR.
  • OCR-extracted text should be processed for search queries.
  • Errors should be logged if OCR also fails.
  • Users should receive a meaningful message if the document is entirely unreadable.
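
The expected behavior above could be sketched roughly as follows. This is a minimal illustration and not the project's actual code: `extract_with_ocr_fallback`, `ExtractionResult`, and the `min_chars` threshold are hypothetical names, and the two extractor callables (`extract_text`, `run_ocr`) stand in for whatever PDF text library and OCR engine the pipeline ends up using.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Optional

logger = logging.getLogger("pdf_pipeline")  # hypothetical logger name


@dataclass
class ExtractionResult:
    text: str
    method: Optional[str]  # "text", "ocr", or None if nothing was readable
    message: str           # user-facing status message


def _safe_call(extractor: Callable[[str], str], pdf_path: str, label: str) -> str:
    """Run an extractor, log any exception, and return stripped text ('' on failure)."""
    try:
        return (extractor(pdf_path) or "").strip()
    except Exception:
        logger.exception("%s extraction raised an error for %s", label, pdf_path)
        return ""


def extract_with_ocr_fallback(
    pdf_path: str,
    extract_text: Callable[[str], str],  # assumed: returns embedded text
    run_ocr: Callable[[str], str],       # assumed: returns OCR-recognized text
    min_chars: int = 1,                  # assumed threshold for "readable"
) -> ExtractionResult:
    # Step 1: try regular text extraction.
    text = _safe_call(extract_text, pdf_path, "Text")
    if len(text) >= min_chars:
        return ExtractionResult(text, "text", "Text extracted directly.")

    # Step 2: extraction failed or returned nothing, so fall back to OCR.
    logger.warning("No extractable text in %s; attempting OCR", pdf_path)
    ocr_text = _safe_call(run_ocr, pdf_path, "OCR")
    if len(ocr_text) >= min_chars:
        return ExtractionResult(ocr_text, "ocr", "Text recovered via OCR.")

    # Step 3: both paths failed; log it and return a meaningful message.
    logger.error("Text extraction and OCR both failed for %s", pdf_path)
    return ExtractionResult(
        "", None,
        "This document could not be read. Please upload a clearer copy.",
    )
```

Passing the extractors in as callables keeps the fallback logic independent of any particular PDF or OCR library and makes it easy to test with stubs, for example `extract_with_ocr_fallback("scan.pdf", lambda p: "", lambda p: "recognized text")`.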

Relevant Logs/Error Messages

NA

Priority

Medium

Labels

bug