Image extraction does not handle occlusions (images partially overlapped by text) #4860

@dhruvsahgal2003

Issue: Image extraction does not handle occlusions (images overlapped by text)

I’m working with textbook-style PDFs (e.g. educational/science books) where diagrams and figures are often partially overlapped by text, labels, or callouts. PyMuPDF is excellent at extracting embedded images and layout information, but I consistently run into problems when an image is occluded by text or is drawn as part of the page content rather than stored as an embedded image.

Observed behavior

In these cases:

  • page.get_images(full=True) does not return the diagram as a single, complete image
  • page.get_text("dict") correctly reports text blocks overlapping the diagram
  • The diagram itself is one of the following:
    • split into multiple fragments
    • not extractable as a single image asset
    • only recoverable via full-page rendering

This makes it difficult to reliably extract diagrams as standalone images when text overlaps them.
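
For illustration, the overlap described above could be checked roughly as follows. This is a sketch, not the exact code behind this report: the helper name `overlapping_text` is made up, and it assumes that `page.get_text("dict")` lists image blocks (`"type" == 1`) next to text blocks (`"type" == 0`), each with a `"bbox"` of `(x0, y0, x1, y1)`.

```python
def overlapping_text(page):
    """Pair each image block bbox with the text block bboxes that intersect it.

    `page` is expected to be a PyMuPDF Page. Only get_text("dict") is used.
    """
    blocks = page.get_text("dict")["blocks"]
    images = [tuple(b["bbox"]) for b in blocks if b["type"] == 1]
    texts = [tuple(b["bbox"]) for b in blocks if b["type"] == 0]

    def intersects(a, b):
        # Boxes are (x0, y0, x1, y1); boxes that only touch do not count.
        return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

    # A list of pairs (not a dict), so repeated image boxes are kept.
    return [(img, [t for t in texts if intersects(img, t)]) for img in images]
```

With a real document, the page could come from something like `pymupdf.open("textbook.pdf")[0]` (the file name is a placeholder; older releases import the package as `fitz`). An image that reports one or more overlapping text boxes is a candidate for the occlusion problem.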


Expected behavior

Ideally, there would be a way to:

  • Extract visual image regions even when they are partially covered by text
  • Identify a figure-level bounding box that includes occluded content
  • Or have clearer guidance on whether this is intentionally unsupported due to PDF format limitations

I understand that PDFs are geometry-based and may not encode semantic “figure” concepts, but from a user perspective it’s unclear whether this limitation is fundamental or if there are recommended PyMuPDF approaches to mitigate it.
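
To make the "figure-level bounding box" idea concrete, here is one possible heuristic, offered as a sketch under assumptions rather than a recommended approach: take the rectangles reported by `page.get_image_info()` and `page.get_drawings()` and union those that overlap or lie within a small gap of each other. The helper names (`merge_boxes`, `figure_boxes`) and the 5-point default gap are arbitrary choices for illustration.

```python
def _near(a, b, gap):
    # True if boxes (x0, y0, x1, y1) overlap or are at most `gap` apart.
    return (a[0] - gap <= b[2] and b[0] - gap <= a[2]
            and a[1] - gap <= b[3] and b[1] - gap <= a[3])


def _union(a, b):
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def merge_boxes(boxes, gap=5.0):
    """Union boxes that overlap or lie within `gap` points of each other."""
    merged = []
    for box in boxes:
        box = tuple(box)
        absorbed = True
        while absorbed:  # the box grows as it absorbs neighbours, so re-check
            absorbed = False
            for other in merged:
                if _near(box, other, gap):
                    merged.remove(other)
                    box = _union(box, other)
                    absorbed = True
                    break
        merged.append(box)
    return merged


def figure_boxes(page, gap=5.0):
    """Candidate figure regions from image and vector-drawing rectangles."""
    boxes = [tuple(info["bbox"]) for info in page.get_image_info()]
    boxes += [tuple(d["rect"]) for d in page.get_drawings()]
    return merge_boxes(boxes, gap)
```

This is purely geometric, so page-wide background rectangles, table rules, or decorative lines would be merged into "figures" too and would need filtering first. Recent PyMuPDF releases also provide `Page.cluster_drawings()` for grouping vector graphics, which may make the drawing half of this unnecessary.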


What I’ve tried

  • page.get_images(full=True)
  • page.get_image_bbox(xref)
  • page.get_text("dict") with manual bounding-box heuristics
  • Rendering full pages as a fallback (page.get_pixmap)

Rendering works, but the result is a single full-page image, so individual diagrams are no longer available as separate assets.
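
One partial workaround for the rendering fallback is to rasterize only a region of the page using the `clip` argument of `page.get_pixmap()`. The sketch below assumes the regions are already known as rect-like `(x0, y0, x1, y1)` tuples, for example from a bounding-box heuristic. The function name `save_regions`, the file-name pattern, and the 200 dpi default are illustrative choices.

```python
def save_regions(page, boxes, prefix="figure", dpi=200):
    """Render each box as its own PNG and return the file names written.

    `page` is expected to be a PyMuPDF Page; `boxes` are (x0, y0, x1, y1).
    """
    paths = []
    for i, box in enumerate(boxes):
        # clip restricts rendering to this region of the page
        pix = page.get_pixmap(clip=box, dpi=dpi)
        path = f"{prefix}-p{page.number}-{i}.png"
        pix.save(path)
        paths.append(path)
    return paths
```

This yields one file per region, but each file is a rendering of everything drawn inside the clip, so any overlapping text is baked into the image. It addresses the "separate assets" part of the problem, not the occlusion itself.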


Questions

Is handling image occlusion by text:

  1. A known limitation of PyMuPDF and/or the PDF format?
  2. Something that newer versions or APIs aim to improve?
  3. Out of scope by design (requiring vision-based post-processing)?

Any clarification on the intended behavior or best practices would be really helpful.

Thanks, and I appreciate the work that’s gone into PyMuPDF.
