
[Improvement Plan] FrameFinderLE - Reduce VLM Sensitivity and Improve Keyframe Quality #1

@ThuyHaLE


[Motivation] Upgrade FrameFinderLE to Improve Quality and Robustness

🚧 Limitations in v.1.0.1

  1. Over-reliance on a Single Visual Language Model (VLM)

    • The system currently uses the LLaVA model to generate captions from keyframes.
    • No comparison is made with other VLMs, making the system sensitive to the specific characteristics of the chosen model.
    • Retrieval results may be biased or inaccurate if captions do not align well with the database concepts.
  2. Inconsistent Keyframe Quality

    • Keyframes are extracted with the PySceneDetect scene-detection method, which may produce excessive or redundant frames (noise).
    • Visual artifacts such as scene transition logos in news videos are often mistakenly included as representative frames.
  3. Limited Vietnamese Language Support

    • Vietnamese queries are not fully supported and may produce poor matches or misinterpretations, because the system lacks a dedicated language-processing pipeline.

🔥 Motivation for v.1.1.0

  1. Reduce Dependency on VLM-Based Captions

    • Integrate multimodal captioning using:
      • Captions from VLM
      • Speech transcripts
      • OCR-extracted text
    • Leverage the richer and more diverse annotations from the new dataset (publicly released on June 1, 2025, by The News Event Retrieval and Explanation Grand Challenge at ACM Multimedia 2025) to enhance the system's semantic understanding.
  2. Improve Keyframe Extraction Quality

    • Instead of detecting keyframes directly, videos will be segmented into mini-clips representing finer-grained events.
    • Only representative frames from meaningful clips will be selected.
    • This helps eliminate redundancy and reduces irrelevant content like transition logos.
  3. Lay the Foundation for Reasoning (Optional/Future)

    • Open the possibility to support reasoning-based explanation in the Automatic Mode (e.g., for competitions or future extensions).
    • Combine multimodal inputs (caption, transcript, OCR) with LLMs to generate explanations, without compromising the core design as a retrieval-first system.
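The multimodal captioning direction in item 1 above could be sketched as a simple fusion step that merges the three text sources into a single retrieval document. This is only an illustration of the idea; `FrameAnnotations` and `fuse_annotations` are hypothetical names, and the actual pipeline would feed the fused text into the existing embedding/indexing stage:

```python
from dataclasses import dataclass

@dataclass
class FrameAnnotations:
    """Per-keyframe text sources (hypothetical container)."""
    vlm_caption: str = ""   # caption from the VLM
    transcript: str = ""    # speech transcript aligned to this frame's clip
    ocr_text: str = ""      # text extracted from the frame by OCR

def fuse_annotations(ann: FrameAnnotations) -> str:
    """Merge the available text sources into one retrieval document.

    Tagging each source lets a downstream embedder or reranker weight
    them differently; empty sources are simply omitted, so captions no
    longer need to carry the whole semantic load on their own.
    """
    parts = []
    if ann.vlm_caption:
        parts.append(f"[caption] {ann.vlm_caption}")
    if ann.transcript:
        parts.append(f"[speech] {ann.transcript}")
    if ann.ocr_text:
        parts.append(f"[ocr] {ann.ocr_text}")
    return " ".join(parts)

# Usage: a news keyframe where OCR catches the on-screen banner
doc = fuse_annotations(FrameAnnotations(
    vlm_caption="A news anchor at a desk",
    transcript="Tonight's top story...",
    ocr_text="BREAKING NEWS",
))
```

Keeping the source tags in the fused document also leaves the door open for the reasoning extension in item 3, since an LLM can cite which modality supported a given match.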

📌 Planned Enhancements

  • Create a new branch enhancement/multimodal-captioning
  • Design a new multimodal captioning pipeline (VLM + transcript + OCR)
  • Build a video segmentation module for mini-clip generation
  • Update keyframe extraction logic
  • Re-benchmark retrieval quality and system accuracy
  • Add optional Vietnamese query support
  • Update README.md and CHANGELOG.md after finalizing
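The mini-clip segmentation and representative-frame selection from the list above could look roughly like the sketch below. It works on precomputed frame-to-frame difference scores (in practice these would come from a cut detector such as PySceneDetect); the threshold, `min_len`, and function names are all hypothetical, and the key idea is that very short clips (transition logos, wipes) are dropped before any keyframe is selected:

```python
def segment_miniclips(frame_diffs, threshold=0.5, min_len=3):
    """Split a video into mini-clips at points where the difference
    between consecutive frames exceeds `threshold`.

    `frame_diffs[i]` is the difference score between frame i and
    frame i + 1, so a video with N frames has N - 1 scores. Clips
    shorter than `min_len` frames are treated as transitions
    (logos, wipes) and discarded, which removes the noise that a
    plain per-frame detector would keep.
    Returns a list of (start, end) frame ranges, end exclusive.
    """
    boundaries = (
        [0]
        + [i + 1 for i, d in enumerate(frame_diffs) if d > threshold]
        + [len(frame_diffs) + 1]
    )
    return [(a, b) for a, b in zip(boundaries, boundaries[1:]) if b - a >= min_len]

def representative_frame(clip, quality):
    """Pick the highest-quality frame inside the clip as its keyframe.

    `quality[i]` is any per-frame score (e.g. sharpness); this is a
    stand-in for whatever selection metric the module ends up using.
    """
    start, end = clip
    return max(range(start, end), key=lambda i: quality[i])

# Usage: 7 frames with one hard cut between frames 2 and 3
clips = segment_miniclips([0.1, 0.1, 0.9, 0.1, 0.1, 0.1])
keyframes = [representative_frame(c, [1, 5, 2, 9, 3, 4, 8]) for c in clips]
```

Selecting one frame per surviving clip bounds the keyframe count by the number of distinct events, rather than by raw scene-change activity, which directly addresses the redundancy noted in the v1.0.1 limitations.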

This issue will remain open for iterative updates and documentation of the enhancement progress. Once completed, the feature will be merged and documented officially in the repository.

Labels: enhancement (New feature or request)