MARS is an Agentic RAG (Retrieval-Augmented Generation) pipeline designed to bridge the "Semantic Gap" and "Visual Gap" in e-commerce search. Unlike standard vector search engines, which rely solely on surface-level visual similarity, MARS employs a fine-tuned Vision-Language Model (VLM) to reason about stylistic coherence and functional constraints in interior design.
Standard retrieval methods fail at complex multimodal intent:
- Keyword Search (BM25): Finds "Chairs," but returns office chairs for a rustic living room.
- Vector Search (CLIP/SigLIP): Finds "Wood texture," but returns a wooden table when the user asked for a chair.
- The Result: Users get items that match keywords or colors but fail the "Vibe Check" or violate functional constraints.
MARS uses a 3-stage funnel to balance retrieval speed with agentic reasoning (sketched in code after the table):
| Stage | Name | Technology | Responsibility | Candidate Pool |
|---|---|---|---|---|
| 1 | The Filter | BM25 / SQL | Semantic Locking: Guarantees the retrieved item is the correct category (e.g., "Lamp"). | 50k DB |
| 2 | The Vibe Check | SigLIP (so400m) | Visual Retrieval: Finds items that match the color palette and texture of the user's room. | 200 |
| 3 | The Agent | Fine-Tuned Qwen2-VL | Design Reasoning: Critiques each (room, item) pair for style (Modern vs. Rustic) and function (Indoor vs. Outdoor). | 20 |
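To make the funnel concrete, here is a minimal sketch of how the three stages could compose in Python. All names (`mars_funnel`, `agent_critique`, the catalog and embedding structures) are hypothetical placeholders under the assumptions stated in the docstring, not the project's actual API:

```python
import torch
from rank_bm25 import BM25Okapi

def mars_funnel(query_text, room_emb, catalog, item_embs, agent_critique):
    """Sketch of the 3-stage funnel: BM25 lock -> SigLIP vibe check -> agent re-rank.

    catalog: list of dicts with a "title" field; item_embs: L2-normalized SigLIP
    embeddings (one row per item); agent_critique: the fine-tuned VLM scorer
    (a hypothetical callable returning {"score": ..., "rationale": ...}).
    """
    # Stage 1 (The Filter): keep only items whose text matches the query terms.
    bm25 = BM25Okapi([item["title"].lower().split() for item in catalog])
    text_scores = bm25.get_scores(query_text.lower().split())
    stage1 = [i for i, s in enumerate(text_scores) if s > 0]

    # Stage 2 (The Vibe Check): cosine similarity between room and item embeddings.
    sims = item_embs[stage1] @ room_emb  # rows are normalized, so dot = cosine
    top = sims.topk(min(200, len(stage1))).indices.tolist()
    stage2 = [stage1[i] for i in top]

    # Stage 3 (The Agent): the fine-tuned critic scores each (room, item) pair.
    scored = [(i, agent_critique(room_emb, catalog[i])["score"]) for i in stage2]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:20]
```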
Since no dataset exists for "Design Reasoning," I engineered a proprietary dataset:
- Source: Amazon Berkeley Objects (ABO) + Room Scene Dataset.
- Hard Negative Mining: Used SigLIP to mine "Hard Negatives" (items that look similar but are wrong), "Hard Positives," and random control pairs (see the mining sketch after this list).
- The Teacher: Used Qwen2.5-VL-7B-Instruct to label 1,500 image pairs with a score (0.0-1.0) and a natural-language rationale.
- Prompt Engineering: Designed a "Lenient Critic" prompt to distinguish between stylistic contrast (good) and functional clashes (bad).
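To make the mining step concrete, here is a minimal sketch of hard-negative mining with SigLIP embeddings. Model loading follows the standard `transformers` API; the helpers (`embed_images`, `mine_hard_negatives`) and the category-masking logic are illustrative assumptions, not the project's exact code:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, SiglipModel

processor = AutoProcessor.from_pretrained("google/siglip-so400m-patch14-384")
model = SiglipModel.from_pretrained("google/siglip-so400m-patch14-384").eval()

@torch.no_grad()
def embed_images(paths, batch_size=16):
    """Return L2-normalized SigLIP image embeddings for a list of file paths."""
    chunks = []
    for i in range(0, len(paths), batch_size):
        batch = [Image.open(p).convert("RGB") for p in paths[i:i + batch_size]]
        inputs = processor(images=batch, return_tensors="pt")
        feats = model.get_image_features(**inputs)
        chunks.append(feats / feats.norm(dim=-1, keepdim=True))
    return torch.cat(chunks)

def mine_hard_negatives(room_emb, item_embs, item_categories, query_category, top_k=5):
    """Hard negatives = the most visually similar items from the *wrong* category:
    they look right (palette/texture) but fail the functional constraint."""
    sims = item_embs @ room_emb  # rows are normalized, so dot product = cosine
    correct = torch.tensor([c == query_category for c in item_categories])
    sims = sims.masked_fill(correct, float("-inf"))  # exclude the right category
    return sims.topk(top_k).indices.tolist()
```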
- Student Model: Qwen2-VL-2B-Instruct.
- Technique: QLoRA (4-bit quantization + LoRA adapters).
- Training: Optimized on a single P100 GPU using `bitsandbytes` and `peft` (setup sketched below).
- Objective: Minimized cross-entropy (CE) loss on generating the JSON structure `{ "score": X, "rationale": "..." }`.
- Hybrid Retrieval: Implemented a weighted RRF (Reciprocal Rank Fusion) over the text and visual rankings (a minimal sketch follows this list).
- Batched Inference: Optimized the Agent to process candidates in batches of 4, reducing re-ranking latency from 3 minutes to ~25 seconds.
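Weighted RRF is simple enough to show in full. The equal weights and the smoothing constant `k = 60` below are conventional defaults, not the project's tuned values:

```python
def weighted_rrf(text_ranking, visual_ranking, w_text=0.5, w_visual=0.5, k=60):
    """Weighted Reciprocal Rank Fusion: score(d) = sum_i w_i / (k + rank_i(d)).

    Each ranking is a list of item IDs ordered best-first; an item missing from
    one ranking simply contributes nothing from that modality.
    """
    scores = {}
    for weight, ranking in ((w_text, text_ranking), (w_visual, visual_ranking)):
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] = scores.get(item_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

For example, `weighted_rrf(["a", "b"], ["b", "c"])` ranks `b` first because it appears in both lists, even though it tops only one of them.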
Query: "I need a chair for this yard." (accompanied by a photo of the user's yard)
| Approach | Result | Why it failed/succeeded |
|---|---|---|
| Classical (BM25) | Returned an Office Chair. | Matched text "Chair", ignored environment. |
| Vector (SigLIP) | Returned a Wooden Dining Chair. | Matched "Wood" texture of the fence, ignored function. |
| MARS Agent | Returned an Outdoor Lounge Set. | Reasoned that "Indoor wood furniture degrades outside." |
- Core: PyTorch, Transformers, Pandas.
- Models: Qwen2-VL (2B), Qwen2.5-VL (7B), SigLIP-so400m-patch14-384.
- Retrieval: RankBM25, FAISS (concept), Torch Tensor Operations.
- Training: LoRA, QLoRA, TRL, Accelerate.
This project is built as a modular pipeline on Kaggle. You can reproduce the results by running the notebooks in the following order:
- Step 1: Style Search - Data Prep - SigLIP pair generation
  - Action: Cleaning the Amazon Berkeley Objects dataset and using SigLIP (so400m) to mine "Hard Positives" and "Hard Negatives."
  - Output: Style Search Siglip generated pairs
- Step 2: Style Search - Data Prep - Qwen2.5VL7B Teacher
  - Action: Using the 7B model as a "Teacher" to label the pairs with scores (0.0-1.0) and natural-language rationales.
  - Output: Style Search Qwen2.5VL generated pairs
- Step 3: Style Search - SFT - Qwen2VL2B
  - Action: Fine-tuning the lightweight Qwen2-VL-2B model using QLoRA on the synthetic dataset to create the "Agentic Critic."
  - Hardware: Trained on a single P100 GPU (approx. 2 hrs).
- Step 4 (Demo): Style Search - Hybrid IR Engine
  - Action: The End-to-End MARS Pipeline.
  - Workflow: User Query -> BM25 Filter -> SigLIP Vector Search -> Fine-Tuned Agent Critique (sketched below).
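Finally, a rough sketch of the batched agent critique that closes the pipeline (Stage 3). Chat-template and multi-image batching details vary across `transformers` versions, so treat this as an outline rather than drop-in code; the JSON-extraction fallback is an assumption about how malformed generations are handled:

```python
import json
import re
import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")
processor.tokenizer.padding_side = "left"  # required for batched generation
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct", torch_dtype=torch.float16, device_map="auto"
)

@torch.no_grad()
def critique_batch(room_img, item_imgs, question):
    """Critique a batch of (room, item) pairs in one padded forward pass."""
    texts, images = [], []
    for item_img in item_imgs:  # e.g., 4 candidates per batch
        messages = [{"role": "user", "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": question},
        ]}]
        texts.append(processor.apply_chat_template(messages, add_generation_prompt=True))
        images.append([room_img, item_img])  # one image list per prompt
    inputs = processor(text=texts, images=images, padding=True, return_tensors="pt")
    inputs = inputs.to(model.device)
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    replies = processor.batch_decode(
        out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
    )
    results = []
    for reply in replies:
        match = re.search(r"\{.*\}", reply, flags=re.DOTALL)  # pull out the JSON blob
        try:
            results.append(json.loads(match.group(0)) if match
                           else {"score": 0.0, "rationale": reply})
        except json.JSONDecodeError:
            results.append({"score": 0.0, "rationale": reply})
    return results
```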