Image extraction does not handle occlusions (images partially overlapped by text) #4861
Replies: 3 comments
Please provide an example, otherwise we can't deal with this.
I think your post is actually an inquiry, so I'm going to transfer it to the [...]
The title of your post is incorrect or misleading: You are probably confusing vector graphics with images.
Vector graphics consist of drawing primitives like lines, curves or rectangles. These objects are painted on the page together with other content like text. The painting commands are formulated in PDF's mini-language, which is similar to PostScript, and are contained e.g. in a page's [...]
What your human eye classifies as an "image" may in reality be a bunch of hundreds or thousands of small vector primitives, together creating the impression of some chart. Inside the PDF, nothing exists that would make this collection identifiable as one entity.
But there is hope even in those cases - provided you are willing to invest some effort:
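A minimal sketch of what that effort could look like, not taken from the reply itself: collect the bounding boxes of the page's vector primitives with `page.get_drawings()`, merge boxes that touch or nearly touch into larger regions, and render each sufficiently large region with a clipped `page.get_pixmap()`. The helper names (`cluster_rects`, `render_vector_figures`), the tolerance, the size threshold and the output file names are illustrative assumptions.

```python
def _near(a, b, tol):
    """True if rectangles a and b (x0, y0, x1, y1) overlap or lie within tol points."""
    return (a[0] - tol <= b[2] and b[0] - tol <= a[2]
            and a[1] - tol <= b[3] and b[1] - tol <= a[3])


def _union(a, b):
    """Smallest rectangle containing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def cluster_rects(rects, tol=3.0):
    """Hypothetical helper: merge nearby rectangles until no two are within tol."""
    clusters = [tuple(r) for r in rects]
    merged = True
    while merged:
        merged = False
        result = []
        for r in clusters:
            for i, c in enumerate(result):
                if _near(r, c, tol):
                    result[i] = _union(r, c)  # grow the existing cluster
                    merged = True
                    break
            else:
                result.append(r)  # r starts a new cluster
        clusters = result
    return clusters


def render_vector_figures(pdf_path, page_number=0, min_side=40.0, dpi=150):
    """Hypothetical helper: save each clustered drawing region as a PNG."""
    import pymupdf  # PyMuPDF; older releases are imported as `fitz`

    saved = []
    with pymupdf.open(pdf_path) as doc:
        page = doc[page_number]
        # Each path dict from get_drawings() carries its bounding box in "rect".
        rects = [tuple(path["rect"]) for path in page.get_drawings()]
        for n, (x0, y0, x1, y1) in enumerate(cluster_rects(rects)):
            if x1 - x0 < min_side or y1 - y0 < min_side:
                continue  # assumption: small clusters are rules or underlines
            pix = page.get_pixmap(clip=pymupdf.Rect(x0, y0, x1, y1), dpi=dpi)
            name = f"page{page_number}-figure{n}.png"  # illustrative file name
            pix.save(name)
            saved.append(name)
    return saved
```

The clipped rendering contains everything painted inside the region, including overlapping text such as labels or callouts. Recent PyMuPDF releases also provide `Page.cluster_drawings()`, which performs a comparable grouping and may replace `cluster_rects` if your version has it.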
Issue: Image extraction does not handle occlusions (images overlapped by text)
I’m working with textbook-style PDFs (e.g. educational/science books) where diagrams and figures are often partially overlapped by text, labels, or callouts. While PyMuPDF is excellent at extracting embedded images and layout information, I’m running into consistent issues when images are occluded by text or drawn as part of the page content.
Observed behavior
In these cases:
- page.get_images(full=True) does not return the diagram as a clean image
- page.get_text("dict") correctly reports text blocks overlapping the diagram
This makes it difficult to reliably extract diagrams as standalone images when text overlaps them.
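To make the observation concrete, here is a hedged sketch of how the overlap could be detected: compare each embedded image's bounding box with the bounding boxes of the page's text blocks and keep the text blocks that lie mostly inside the image. The helper names (`overlap_ratio`, `occluding_text`, `report_occlusions`) and the 0.5 threshold are assumptions for illustration. The sketch uses `page.get_image_info(xrefs=True)` to obtain image boxes.

```python
def overlap_ratio(inner, outer):
    """Fraction of rectangle `inner` (x0, y0, x1, y1) covered by `outer`."""
    w = min(inner[2], outer[2]) - max(inner[0], outer[0])
    h = min(inner[3], outer[3]) - max(inner[1], outer[1])
    area = (inner[2] - inner[0]) * (inner[3] - inner[1])
    if w <= 0 or h <= 0 or area <= 0:
        return 0.0
    return (w * h) / area


def occluding_text(image_bbox, text_bboxes, min_ratio=0.5):
    """Hypothetical helper: text boxes lying at least `min_ratio` inside the image."""
    return [t for t in text_bboxes if overlap_ratio(t, image_bbox) >= min_ratio]


def report_occlusions(pdf_path, page_number=0):
    """Hypothetical helper: map each image xref to the text boxes overlapping it."""
    import pymupdf  # PyMuPDF; older releases are imported as `fitz`

    with pymupdf.open(pdf_path) as doc:
        page = doc[page_number]
        # Block type 0 is text; type 1 is an image block.
        text_bboxes = [
            tuple(b["bbox"])
            for b in page.get_text("dict")["blocks"]
            if b["type"] == 0
        ]
        report = {}
        for info in page.get_image_info(xrefs=True):
            report[info["xref"]] = occluding_text(tuple(info["bbox"]), text_bboxes)
        return report
```

This only tells you where text and images overlap. It cannot remove the text from the image, because the embedded image data never contained that text in the first place: the text is painted separately on top of it.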
Expected behavior
Ideally, there would be a way to:
I understand that PDFs are geometry-based and may not encode semantic “figure” concepts, but from a user perspective it’s unclear whether this limitation is fundamental or if there are recommended PyMuPDF approaches to mitigate it.
What I’ve tried
- page.get_images(full=True)
- page.get_image_bbox(xref)
- page.get_text("dict") with manual bounding-box heuristics
- Rendering with page.get_pixmap
Rendering works, but it loses the ability to extract individual diagrams as separate assets.
Questions
Is handling image occlusion by text:
Any clarification on the intended behavior or best practices would be really helpful.
Thanks, and I appreciate the work that’s gone into PyMuPDF.