
[Improvement Plan] FrameFinderLE - Reduce VLM Sensitivity and Improve Keyframe Quality #1

@ThuyHaLE


[Motivation] Upgrade FrameFinderLE to Improve Quality and Robustness

🚧 Limitations in v.1.0.1

  1. Over-reliance on a Single Visual Language Model (VLM)

    • The system currently uses the LLaVA model to generate captions from keyframes.
    • No comparison is made with other VLMs, making the system sensitive to the specific characteristics of the chosen model.
    • Retrieval results may be biased or inaccurate if captions do not align well with the database concepts.
  2. Inconsistent Keyframe Quality

    • Keyframes are extracted with the PySceneDetect scene-detection method, which may produce excessive or redundant frames (noise).
    • Visual artifacts such as scene transition logos in news videos are often mistakenly included as representative frames.
  3. Limited Vietnamese Language Support

    • Vietnamese queries are not fully supported and may produce poor matches or misinterpretations, because the system lacks a dedicated language-processing pipeline.

🔥 Motivation for v.1.1.0

  1. Reduce Dependency on VLM-Based Captions

    • Integrate multimodal captioning using:
      • Captions from VLM
      • Speech transcripts
      • OCR-extracted text
    • Leverage the richer and more diverse annotations from the new dataset (publicly released on June 1, 2025, by The News Event Retrieval and Explanation Grand Challenge at ACM Multimedia 2025) to enhance the system's semantic understanding.
  2. Improve Keyframe Extraction Quality

    • Instead of detecting keyframes directly, videos will be segmented into mini-clips representing finer-grained events.
    • Only representative frames from meaningful clips will be selected.
    • This helps eliminate redundancy and reduces irrelevant content like transition logos.
  3. Lay the Foundation for Reasoning (Optional/Future)

    • Open the possibility to support reasoning-based explanation in the Automatic Mode (e.g., for competitions or future extensions).
    • Combine multimodal inputs (caption, transcript, OCR) with LLMs to generate explanations, without compromising the core design as a retrieval-first system.
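The multimodal captioning direction in item 1 above could be sketched as a simple fusion step that merges the three text sources into a single retrieval document. This is only an illustration of the idea; `FrameAnnotations` and `fuse_annotations` are hypothetical names, and the actual pipeline would feed the fused text into the existing embedding/indexing stage:

```python
from dataclasses import dataclass

@dataclass
class FrameAnnotations:
    """Per-keyframe text sources (hypothetical container)."""
    vlm_caption: str = ""   # caption from the VLM
    transcript: str = ""    # speech transcript aligned to this frame's clip
    ocr_text: str = ""      # text extracted from the frame by OCR

def fuse_annotations(ann: FrameAnnotations) -> str:
    """Merge the available text sources into one retrieval document.

    Tagging each source lets a downstream embedder or reranker weight
    them differently; empty sources are simply omitted, so captions no
    longer need to carry the whole semantic load on their own.
    """
    parts = []
    if ann.vlm_caption:
        parts.append(f"[caption] {ann.vlm_caption}")
    if ann.transcript:
        parts.append(f"[speech] {ann.transcript}")
    if ann.ocr_text:
        parts.append(f"[ocr] {ann.ocr_text}")
    return " ".join(parts)

# Usage: a news keyframe where OCR catches the on-screen banner
doc = fuse_annotations(FrameAnnotations(
    vlm_caption="A news anchor at a desk",
    transcript="Tonight's top story...",
    ocr_text="BREAKING NEWS",
))
```

Keeping the source tags in the fused document also leaves the door open for the reasoning extension in item 3, since an LLM can cite which modality supported a given match.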

📌 Planned Enhancements

  • Create a new branch enhancement/multimodal-captioning
  • Design a new multimodal captioning pipeline (VLM + transcript + OCR)
  • Build a video segmentation module for mini-clip generation
  • Update keyframe extraction logic
  • Re-benchmark retrieval quality and system accuracy
  • Add optional Vietnamese query support
  • Update README.md and CHANGELOG.md after finalizing
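The mini-clip segmentation and representative-frame selection from the list above could look roughly like the sketch below. It works on precomputed frame-to-frame difference scores (in practice these would come from a cut detector such as PySceneDetect); the threshold, `min_len`, and function names are all hypothetical, and the key idea is that very short clips (transition logos, wipes) are dropped before any keyframe is selected:

```python
def segment_miniclips(frame_diffs, threshold=0.5, min_len=3):
    """Split a video into mini-clips at points where the difference
    between consecutive frames exceeds `threshold`.

    `frame_diffs[i]` is the difference score between frame i and
    frame i + 1, so a video with N frames has N - 1 scores. Clips
    shorter than `min_len` frames are treated as transitions
    (logos, wipes) and discarded, which removes the noise that a
    plain per-frame detector would keep.
    Returns a list of (start, end) frame ranges, end exclusive.
    """
    boundaries = (
        [0]
        + [i + 1 for i, d in enumerate(frame_diffs) if d > threshold]
        + [len(frame_diffs) + 1]
    )
    return [(a, b) for a, b in zip(boundaries, boundaries[1:]) if b - a >= min_len]

def representative_frame(clip, quality):
    """Pick the highest-quality frame inside the clip as its keyframe.

    `quality[i]` is any per-frame score (e.g. sharpness); this is a
    stand-in for whatever selection metric the module ends up using.
    """
    start, end = clip
    return max(range(start, end), key=lambda i: quality[i])

# Usage: 7 frames with one hard cut between frames 2 and 3
clips = segment_miniclips([0.1, 0.1, 0.9, 0.1, 0.1, 0.1])
keyframes = [representative_frame(c, [1, 5, 2, 9, 3, 4, 8]) for c in clips]
```

Selecting one frame per surviving clip bounds the keyframe count by the number of distinct events, rather than by raw scene-change activity, which directly addresses the redundancy noted in the v1.0.1 limitations.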

This issue will remain open for iterative updates and documentation of the enhancement progress. Once completed, the feature will be merged and documented officially in the repository.

Labels: enhancement (New feature or request)