Description
Issue: Image extraction does not handle occlusions (images overlapped by text)
I’m working with textbook-style PDFs (e.g. educational/science books) where diagrams and figures are often partially overlapped by text, labels, or callouts. While PyMuPDF is excellent at extracting embedded images and layout information, I’m running into consistent issues when images are occluded by text or drawn as part of the page content.
Observed behavior
In these cases:
- page.get_images(full=True) does not return the diagram as a clean image
- page.get_text("dict") correctly reports text blocks overlapping the diagram
- The diagram itself is either:
- split into multiple fragments, or
- not extractable as a single image asset, or
- only recoverable via full-page rendering
This makes it difficult to reliably extract diagrams as standalone images when text overlaps them.
Expected behavior
Ideally, there would be a way to:
- Extract visual image regions even when they are partially covered by text
- Identify a figure-level bounding box that includes occluded content
- Or have clearer guidance on whether this is intentionally unsupported due to PDF format limitations
I understand that PDFs are geometry-based and may not encode semantic “figure” concepts, but from a user perspective it’s unclear whether this limitation is fundamental or if there are recommended PyMuPDF approaches to mitigate it.
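To make the "figure-level bounding box" idea concrete, here is a rough sketch of one way fragment rectangles could be merged into a single box. It is plain Python and does not call PyMuPDF. The names merge_fragments and gap are hypothetical, and the rectangles are assumed to be (x0, y0, x1, y1) tuples in page coordinates, for example collected from image or drawing bounding boxes.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in page coordinates


def _close(a: Rect, b: Rect, gap: float) -> bool:
    """True if a and b overlap or lie within `gap` points of each other."""
    return not (
        a[2] + gap < b[0] or b[2] + gap < a[0]
        or a[3] + gap < b[1] or b[3] + gap < a[1]
    )


def _union(a: Rect, b: Rect) -> Rect:
    """Smallest rectangle containing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def merge_fragments(rects: List[Rect], gap: float = 5.0) -> List[Rect]:
    """Greedily merge rectangles that touch or nearly touch.

    Hypothetical helper: repeats until no pair of rectangles is
    within `gap` points of each other.
    """
    merged = list(rects)
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if _close(merged[i], merged[j], gap):
                    merged[i] = _union(merged[i], merged[j])
                    del merged[j]
                    changed = True
                    break
            if changed:
                break
    return merged


if __name__ == "__main__":
    fragments = [(0, 0, 10, 10), (12, 0, 20, 10), (100, 100, 110, 110)]
    print(merge_fragments(fragments))
```

This only groups geometry. It cannot tell a diagram from a decorative rule or a sidebar, and the gap value is a guess that would need tuning per document.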
What I’ve tried
- page.get_images(full=True)
- page.get_image_bbox(xref)
- page.get_text("dict") with manual bounding-box heuristics
- Rendering full pages as a fallback (page.get_pixmap)
Rendering works, but it loses the ability to extract individual diagrams as separate assets.
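A possible middle ground is to render only a clipped region instead of the whole page. The sketch below assumes a figure bounding box is already known from some heuristic. The helpers pad_bbox and crop_figure are hypothetical names, and the margin and dpi defaults are arbitrary. The PyMuPDF calls used are page.get_pixmap with its clip and dpi parameters and Pixmap.save. Any text drawn over the figure will still appear in the rendered crop.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in page coordinates


def pad_bbox(bbox: Box, page_box: Box, margin: float = 5.0) -> Box:
    """Grow bbox by `margin` points on every side, clamped to the page."""
    x0, y0, x1, y1 = bbox
    px0, py0, px1, py1 = page_box
    return (
        max(px0, x0 - margin),
        max(py0, y0 - margin),
        min(px1, x1 + margin),
        min(py1, y1 + margin),
    )


def crop_figure(pdf_path: str, page_number: int, bbox: Box,
                out_path: str, dpi: int = 200) -> None:
    """Render one region of a page to an image file (hypothetical helper)."""
    # Imported here so pad_bbox stays usable without PyMuPDF installed.
    # Older releases expose the same module under the name `fitz`.
    import pymupdf

    doc = pymupdf.open(pdf_path)
    try:
        page = doc[page_number]
        clip = pymupdf.Rect(pad_bbox(bbox, tuple(page.rect)))
        pix = page.get_pixmap(clip=clip, dpi=dpi)  # renders the clip area only
        pix.save(out_path)
    finally:
        doc.close()
```

Usage would look like crop_figure("book.pdf", 0, (72, 100, 300, 260), "figure.png"), where the file names and coordinates are placeholders. This produces a raster crop, not the original embedded image, so it does not solve the occlusion problem itself.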
Questions
Is handling image occlusion by text:
- A known limitation of PyMuPDF and/or the PDF format?
- Something that newer versions or APIs aim to improve?
- Out of scope by design (requiring vision-based post-processing)?
Any clarification on the intended behavior or best practices would be really helpful.
Thanks, and I appreciate the work that’s gone into PyMuPDF.