Replace PyPDF2 with pypdfium2 #38

Open
yiwei-ang wants to merge 3 commits into alejandro-ao:main from yiwei-ang:feature/pdfium

yiwei-ang commented Aug 23, 2023

I really appreciate @alejandro-ao for creating a good video demonstrating the perfect blend of OpenAI, PDF readers, and Streamlit!

I've tried the tool on several PDFs and found an issue with text extraction quality using PyPDF2: the contents of a PDF are not extracted fully.

After looking into https://github.com/py-pdf/benchmarks, it seems we can switch to pypdfium2, which offers similar functionality with better text extraction quality and faster extraction (verified on my end!).
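For reference, here is a minimal sketch of what text extraction with pypdfium2 can look like. It is an illustration, not the exact diff in this PR: the function names `extract_text_pdfium` and `join_page_texts` are hypothetical, and it assumes pypdfium2 version 4 or later, where `PdfTextPage.get_text_range()` is available.

```python
from typing import Iterable, List


def join_page_texts(page_texts: Iterable[str]) -> str:
    """Join per-page text with newlines, skipping pages that yielded no text."""
    return "\n".join(t.strip() for t in page_texts if t and t.strip())


def extract_text_pdfium(source) -> str:
    """Extract text from one PDF (a file path or raw bytes) using pypdfium2.

    Hypothetical helper; assumes pypdfium2 >= 4.
    """
    # Imported lazily so join_page_texts works without pypdfium2 installed.
    import pypdfium2 as pdfium

    pdf = pdfium.PdfDocument(source)
    page_texts: List[str] = []
    try:
        for index in range(len(pdf)):
            page = pdf[index]
            textpage = page.get_textpage()
            # With no arguments, get_text_range() returns the whole page's text.
            page_texts.append(textpage.get_text_range())
            # Release PDFium handles explicitly, child objects first.
            textpage.close()
            page.close()
    finally:
        pdf.close()
    return join_page_texts(page_texts)
```

Usage would be something like `text = extract_text_pdfium("report.pdf")`, where the file name is a placeholder. Joining pages with a newline keeps the last word of one page from fusing with the first word of the next before the text is chunked.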

yiwei-ang changed the title from "Replace pypdfium2 with" to "Replace PyPDF2 with pypdfium2" Aug 23, 2023
IlianP commented Sep 8, 2023

As a side note, LangChain also supports pypdfium2 as a document loader:
https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf#using-pypdfium2

costabm commented Nov 2, 2023

I have added this important feature to my larger pull request (my first one ever). I gave you credit there, but I'm not sure this is the right way to do it.
