Image extraction does not handle occlusions (images partially overlapped by text) #4861

dhruvsahgal2003 · 2026-01-12T06:37:54Z

Issue: Image extraction does not handle occlusions (images overlapped by text)

I’m working with textbook-style PDFs (e.g. educational/science books) where diagrams and figures are often partially overlapped by text, labels, or callouts. While PyMuPDF is excellent at extracting embedded images and layout information, I’m running into consistent issues when images are occluded by text or drawn as part of the page content.

Observed behavior

In these cases:

- page.get_images(full=True) does not return the diagram as a clean image
- page.get_text("dict") correctly reports text blocks overlapping the diagram
- The diagram itself is either:
  - split into multiple fragments, or
  - not extractable as a single image asset, or
  - only recoverable via full-page rendering

This makes it difficult to reliably extract diagrams as standalone images when text overlaps them.

Expected behavior

Ideally, there would be a way to:

- Extract visual image regions even when they are partially covered by text
- Identify a figure-level bounding box that includes occluded content
- Or have clearer guidance on whether this is intentionally unsupported due to PDF format limitations

I understand that PDFs are geometry-based and may not encode semantic “figure” concepts, but from a user perspective it’s unclear whether this limitation is fundamental or if there are recommended PyMuPDF approaches to mitigate it.

What I’ve tried

- page.get_images(full=True)
- page.get_image_bbox(xref)
- page.get_text("dict") with manual bounding-box heuristics
- Rendering full pages as a fallback (page.get_pixmap)

Rendering works, but it loses the ability to extract individual diagrams as separate assets.

Questions

Is handling image occlusion by text:

- A known limitation of PyMuPDF and/or the PDF format?
- Something that newer versions or APIs aim to improve?
- Out of scope by design (requiring vision-based post-processing)?

Any clarification on the intended behavior or best practices would be really helpful.

Thanks, and I appreciate the work that’s gone into PyMuPDF.

JorjMcKie · 2026-01-12T07:29:09Z

Maintainer

Please provide an example, otherwise we can't deal with this.
Also please look at Page.cluster_drawings(): it could be that you are referring to vector graphics, which usually consist of a plethora of small vectors and require extra handling.

JorjMcKie · 2026-01-12T08:47:32Z

Maintainer

I think your post is actually an inquiry, so I'm going to transfer it to the Discussions tab.
In case a reasonable enhancement can be identified, you can still submit a more focused request.

JorjMcKie · 2026-01-12T09:09:36Z

Maintainer

The title of your post is incorrect or misleading:
Embedded images can always be extracted in their entirety, whether or not they are covered by something else.

You are probably confusing vector graphics with images. Vector graphics consist of drawing primitives like lines, curves or rectangles. These objects are painted on the page together with other content like text. The painting commands are written in PDF's mini-language, which is similar to PostScript, and are stored e.g. in a page's /Contents objects. This is completely controlled by the PDF creator: painting instructions for text, vectors and other objects may occur here in any sequence.
The sequence of the painting instructions influences what is visible: content painted later can overlay earlier content, partly or completely, and the result is additionally influenced by transparency etc.

What your human eye classifies as an "image" may in reality be a bunch of hundreds or thousands of small vector primitives, together creating the impression of some chart. Inside the PDF, nothing exists that would make this collection identifiable as one entity.

But there is hope even in those cases - provided you are willing to invest some effort:
PyMuPDF allows you to remove unwanted objects temporarily. You could e.g. remove pesky text in some page region. Then take a "photo" (= Pixmap restricted to that region).
