
MARS: Multimodal Agentic Room Search 🪐

[Figure: MARS funnel visualization]

MARS is an Agentic RAG (Retrieval-Augmented Generation) pipeline designed to bridge the "Semantic Gap" and "Visual Gap" in e-commerce search. Unlike standard vector search engines that rely solely on pixel similarity, MARS employs a fine-tuned Vision-Language Model (VLM) to reason about stylistic coherence and functional constraints in interior design.

The Problem

Standard retrieval methods fail at complex multimodal intent:

  1. Keyword Search (BM25): Finds "Chairs," but returns office chairs for a rustic living room.
  2. Vector Search (CLIP/SigLIP): Finds "Wood texture," but returns a wooden table when the user asked for a chair.
  3. The Result: Users get items that match keywords or colors, but fail the "Vibe Check" or functional logic.

My Solution: The MARS Funnel

MARS utilizes a 3-stage funnel to balance retrieval speed with agentic reasoning:

| Stage | Name | Technology | Responsibility | Input $\to$ Output |
| --- | --- | --- | --- | --- |
| 1 | The Filter | BM25 / SQL | Semantic Locking: guarantees the retrieved item is the correct category (e.g., "Lamp"). | 50k DB $\to$ 200 Candidates |
| 2 | The Vibe Check | SigLIP (so400m) | Visual Retrieval: finds items that match the color palette and texture of the user's room. | 200 $\to$ 20 Candidates |
| 3 | The Agent | Fine-Tuned Qwen2-VL | Design Reasoning: critiques the pair for style (Modern vs. Rustic) and function (Indoor vs. Outdoor). | 20 $\to$ Top 5 Picks |
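
To make Stage 1 concrete, here is a minimal sketch of the "Semantic Locking" step using rank_bm25. The catalog schema, the `bm25_filter` name, and the top-k cut-off of 200 are illustrative assumptions, not the repository's exact code.

```python
# Stage 1 sketch: BM25 "Semantic Locking" over product titles (illustrative schema).
from rank_bm25 import BM25Okapi

def bm25_filter(query: str, catalog: list[dict], top_k: int = 200) -> list[dict]:
    corpus = [item["title"].lower().split() for item in catalog]
    bm25 = BM25Okapi(corpus)
    scores = bm25.get_scores(query.lower().split())
    ranked = sorted(range(len(catalog)), key=lambda i: scores[i], reverse=True)
    return [catalog[i] for i in ranked[:top_k]]

# Example: only chair-like items survive to the visual stage.
catalog = [{"title": "Outdoor Lounge Chair"}, {"title": "Oak Dining Table"}, {"title": "Desk Lamp"}]
print(bm25_filter("chair", catalog, top_k=2))
```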

Methodology

1. Synthetic Data Engineering (Teacher-Student)

Since no dataset exists for "Design Reasoning," I engineered a proprietary dataset:

  • Source: Amazon Berkeley Objects (ABO) + Room Scene Dataset.
  • Hard Negative Mining: Used SigLIP embeddings to mine hard negatives (items that look visually similar but are the wrong match), along with hard positives and random control pairs (see the sketch after this list).
  • The Teacher: Used Qwen2.5-VL-7B-Instruct to label 1,500 image pairs with a score (0.0-1.0) and a written critique.
  • Prompt Engineering: Designed a "Lenient Critic" prompt to distinguish between stylistic contrast (good) and functional clashes (bad).
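
A minimal sketch of the hard-negative mining idea: embed the room scene and candidate products with SigLIP, then keep the most visually similar items from the wrong category as hard negatives. The checkpoint name and the `category`/`image_path` fields are assumptions for illustration; the actual mining notebook may differ.

```python
# Hard-negative mining sketch with SigLIP image embeddings (illustrative).
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model_id = "google/siglip-so400m-patch14-384"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).eval()

@torch.no_grad()
def embed_images(paths: list[str]) -> torch.Tensor:
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

def mine_hard_negatives(room_path: str, items: list[dict], wanted_category: str, k: int = 5):
    """Items that *look* like the room but are the wrong category (e.g., a wooden table, not a chair)."""
    room = embed_images([room_path])                          # (1, d)
    cand = embed_images([it["image_path"] for it in items])   # (n, d)
    sims = (cand @ room.T).squeeze(-1)                        # cosine similarity per item
    order = sims.argsort(descending=True).tolist()
    negatives = [items[i] for i in order if items[i]["category"] != wanted_category]
    return negatives[:k]
```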

2. Supervised Fine-Tuning (SFT)

  • Student Model: Qwen2-VL-2B-Instruct.
  • Technique: QLoRA (4-bit quantization + LoRA adapters).
  • Training: Optimized on a single P100 using bitsandbytes and peft.
  • Objective: Minimized cross-entropy loss on generating the JSON structure { "score": X, "rationale": "..." } (a condensed setup sketch follows this list).
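
Below is a condensed sketch of a QLoRA setup for the 2B student, assuming the standard transformers + peft + bitsandbytes recipe (and a recent transformers release that ships Qwen2VLForConditionalGeneration). The rank, target modules, and other hyperparameters are illustrative, not the exact training configuration.

```python
# QLoRA sketch: 4-bit base model + LoRA adapters on the attention projections (illustrative).
import torch
from transformers import BitsAndBytesConfig, Qwen2VLForConditionalGeneration
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # language-side attention only
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# The SFT target is the critic's JSON verdict, e.g. '{"score": 0.8, "rationale": "..."}',
# optimized with the usual token-level cross-entropy loss (e.g. via TRL's SFTTrainer).
```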

3. Inference Optimization

  • Hybrid Retrieval: Implemented a weighted RRF (Reciprocal Rank Fusion) over the text and visual rankings (sketched after this list).
  • Batched Inference: Optimized the Agent to process candidates in batches of 4, reducing re-ranking latency from 3 minutes to ~25 seconds.
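
A minimal sketch of weighted Reciprocal Rank Fusion over the BM25 and SigLIP rankings. The weights and the constant k = 60 are conventional defaults assumed here, not values taken from the notebooks.

```python
# Weighted Reciprocal Rank Fusion (RRF) sketch: fuse text and visual results by rank, not raw score.
from collections import defaultdict

def weighted_rrf(rankings: dict[str, list[str]], weights: dict[str, float], k: int = 60) -> list[str]:
    """rankings maps a retriever name to an ordered list of item ids (best first)."""
    fused = defaultdict(float)
    for name, ranked_ids in rankings.items():
        w = weights.get(name, 1.0)
        for rank, item_id in enumerate(ranked_ids, start=1):
            fused[item_id] += w / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

# Example: the text and visual retrievers disagree; fusion balances the two lists.
print(weighted_rrf(
    rankings={"bm25": ["chair_12", "chair_07", "table_03"], "siglip": ["table_03", "chair_12", "chair_07"]},
    weights={"bm25": 0.4, "siglip": 0.6},
))
```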

Demo Results

Query: "I need a chair for this Yard."

| Approach | Result | Why it failed / succeeded |
| --- | --- | --- |
| Classical (BM25) | Returned an Office Chair. | Matched the text "Chair", ignored the environment. |
| Vector (SigLIP) | Returned a Wooden Dining Chair. | Matched the "Wood" texture of the fence, ignored function. |
| MARS Agent | Returned an Outdoor Lounge Set. | Reasoned that "Indoor wood furniture degrades outside." |

Tech Stack

  • Core: PyTorch, Transformers, Pandas.
  • Models: Qwen2-VL (2B), Qwen2.5-VL (7B), SigLIP-so400m-patch14-384.
  • Retrieval: RankBM25, FAISS (concept), Torch Tensor Operations.
  • Training: LoRA, QLoRA, TRL, Accelerate.

Usage & Reproducing Results

This project is built as a modular pipeline on Kaggle. You can reproduce the results by running the notebooks in the following order:

Phase 1: Data Engineering

Phase 2: Model Training

  • Step 3: Style Search - SFT - Qwen2VL2B
    • Action: Fine-tuning the lightweight Qwen2-VL-2B model using QLoRA on the synthetic dataset to create the "Agentic Critic."
    • Hardware: Trained on a single P100 GPU (approx. 2 hrs).

Phase 3: The Full Pipeline

  • Step 4 (Demo): Style Search - Hybrid IR Engine
    • Action: The End-to-End MARS Pipeline.
    • Workflow: User Query $\to$ BM25 Filter $\to$ SigLIP Vector Search $\to$ Fine-Tuned Agent Critique.
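
As a rough end-to-end sketch of the Step 4 workflow, the driver below chains the three stages and parses the agent's JSON verdicts. The stage functions are passed in as callables because the actual notebook interfaces are not reproduced here; all names and defaults are illustrative.

```python
# End-to-end MARS driver sketch (hypothetical interfaces): BM25 filter -> SigLIP rank -> agent critique.
import json

def mars_pipeline(query, room_image, catalog, bm25_filter, siglip_rank, agent_critique,
                  batch_size=4, top_n=5):
    candidates = bm25_filter(query, catalog, top_k=200)           # Stage 1: semantic locking
    candidates = siglip_rank(room_image, candidates, top_k=20)    # Stage 2: visual "vibe check"
    picks = []
    for i in range(0, len(candidates), batch_size):               # Stage 3: batched agent re-ranking
        batch = candidates[i:i + batch_size]
        for item, raw in zip(batch, agent_critique(room_image, batch)):
            try:
                verdict = json.loads(raw)                         # expects {"score": ..., "rationale": "..."}
            except (json.JSONDecodeError, TypeError):
                continue                                          # skip malformed generations
            picks.append({"item": item, "score": float(verdict["score"]),
                          "rationale": verdict.get("rationale", "")})
    picks.sort(key=lambda p: p["score"], reverse=True)
    return picks[:top_n]
```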
