MARS is an Agentic RAG (Retrieval-Augmented Generation) pipeline designed to bridge the "Semantic Gap" and "Visual Gap" in e-commerce search. Unlike standard vector search engines, which rely solely on surface-level visual similarity, MARS employs a fine-tuned Vision-Language Model (VLM) to reason about stylistic coherence and functional constraints in interior design.
Standard retrieval methods fail at complex multimodal intent:
- Keyword Search (BM25): Finds "Chairs," but returns office chairs for a rustic living room.
- Vector Search (CLIP/SigLIP): Finds "Wood texture," but returns a wooden table when the user asked for a chair.
- The Result: Users get items that match keywords or colors but fail the "Vibe Check" or violate functional constraints.
MARS uses a 3-stage funnel to balance retrieval speed with agentic reasoning (sketched in code after the table):
| Stage | Name | Technology | Responsibility | Candidate Pool |
|---|---|---|---|---|
| 1 | The Filter | BM25 / SQL | Semantic Locking: Guarantees the retrieved item is the correct category (e.g., "Lamp"). | 50k DB |
| 2 | The Vibe Check | SigLIP (so400m) | Visual Retrieval: Finds items that match the color palette and texture of the user's room. | 200 |
| 3 | The Agent | Fine-Tuned Qwen2-VL | Design Reasoning: Critiques each (room, item) pair for style (Modern vs. Rustic) and function (Indoor vs. Outdoor). | 20 |
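To make the funnel concrete, here is a minimal sketch of how the three stages could compose in Python. All names (`mars_funnel`, `agent_critique`, the catalog and embedding structures) are hypothetical placeholders under the assumptions stated in the docstring, not the project's actual API:

```python
import torch
from rank_bm25 import BM25Okapi

def mars_funnel(query_text, room_emb, catalog, item_embs, agent_critique):
    """Sketch of the 3-stage funnel: BM25 lock -> SigLIP vibe check -> agent re-rank.

    catalog: list of dicts with a "title" field; item_embs: L2-normalized SigLIP
    embeddings (one row per item); agent_critique: the fine-tuned VLM scorer
    (a hypothetical callable returning {"score": ..., "rationale": ...}).
    """
    # Stage 1 (The Filter): keep only items whose text matches the query terms.
    bm25 = BM25Okapi([item["title"].lower().split() for item in catalog])
    text_scores = bm25.get_scores(query_text.lower().split())
    stage1 = [i for i, s in enumerate(text_scores) if s > 0]

    # Stage 2 (The Vibe Check): cosine similarity between room and item embeddings.
    sims = item_embs[stage1] @ room_emb  # rows are normalized, so dot = cosine
    top = sims.topk(min(200, len(stage1))).indices.tolist()
    stage2 = [stage1[i] for i in top]

    # Stage 3 (The Agent): the fine-tuned critic scores each (room, item) pair.
    scored = [(i, agent_critique(room_emb, catalog[i])["score"]) for i in stage2]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:20]
```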
Since no dataset exists for "Design Reasoning," I engineered a proprietary dataset:
- Source: Amazon Berkeley Objects (ABO) + Room Scene Dataset.
- Hard Negative Mining: Used SigLIP to mine "Hard Negatives" (items that look similar but are wrong), "Hard Positives," and random control pairs (see the mining sketch after this list).
- The Teacher: Used Qwen2.5-VL-7B-Instruct to label 1,500 image pairs with a score (0.0-1.0) and a natural-language rationale.
- Prompt Engineering: Designed a "Lenient Critic" prompt to distinguish between stylistic contrast (good) and functional clashes (bad).
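To make the mining step concrete, here is a minimal sketch of hard-negative mining with SigLIP embeddings. Model loading follows the standard `transformers` API; the helpers (`embed_images`, `mine_hard_negatives`) and the category-masking logic are illustrative assumptions, not the project's exact code:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, SiglipModel

processor = AutoProcessor.from_pretrained("google/siglip-so400m-patch14-384")
model = SiglipModel.from_pretrained("google/siglip-so400m-patch14-384").eval()

@torch.no_grad()
def embed_images(paths, batch_size=16):
    """Return L2-normalized SigLIP image embeddings for a list of file paths."""
    chunks = []
    for i in range(0, len(paths), batch_size):
        batch = [Image.open(p).convert("RGB") for p in paths[i:i + batch_size]]
        inputs = processor(images=batch, return_tensors="pt")
        feats = model.get_image_features(**inputs)
        chunks.append(feats / feats.norm(dim=-1, keepdim=True))
    return torch.cat(chunks)

def mine_hard_negatives(room_emb, item_embs, item_categories, query_category, top_k=5):
    """Hard negatives = the most visually similar items from the *wrong* category:
    they look right (palette/texture) but fail the functional constraint."""
    sims = item_embs @ room_emb  # rows are normalized, so dot product = cosine
    correct = torch.tensor([c == query_category for c in item_categories])
    sims = sims.masked_fill(correct, float("-inf"))  # exclude the right category
    return sims.topk(top_k).indices.tolist()
```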
- Student Model: Qwen2-VL-2B-Instruct.
- Technique: QLoRA (4-bit quantization + LoRA adapters).
- Training: Optimized on a single P100 GPU using `bitsandbytes` and `peft` (setup sketched below).
- Objective: Minimized cross-entropy (CE) loss on generating the JSON structure `{ "score": X, "rationale": "..." }`.
- Hybrid Retrieval: Implemented a weighted RRF (Reciprocal Rank Fusion) over the text and visual rankings (a minimal sketch follows this list).
- Batched Inference: Optimized the Agent to process candidates in batches of 4, reducing re-ranking latency from 3 minutes to ~25 seconds.
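Weighted RRF is simple enough to show in full. The equal weights and the smoothing constant `k = 60` below are conventional defaults, not the project's tuned values:

```python
def weighted_rrf(text_ranking, visual_ranking, w_text=0.5, w_visual=0.5, k=60):
    """Weighted Reciprocal Rank Fusion: score(d) = sum_i w_i / (k + rank_i(d)).

    Each ranking is a list of item IDs ordered best-first; an item missing from
    one ranking simply contributes nothing from that modality.
    """
    scores = {}
    for weight, ranking in ((w_text, text_ranking), (w_visual, visual_ranking)):
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] = scores.get(item_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

For example, `weighted_rrf(["a", "b"], ["b", "c"])` ranks `b` first because it appears in both lists, even though it tops only one of them.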
Query: "I need a chair for this yard." (accompanied by a photo of the user's yard)
| Approach | Result | Why it failed/succeeded |
|---|---|---|
| Classical (BM25) | Returned an Office Chair. | Matched text "Chair", ignored environment. |
| Vector (SigLIP) | Returned a Wooden Dining Chair. | Matched "Wood" texture of the fence, ignored function. |
| MARS Agent | Returned an Outdoor Lounge Set. | Reasoned that "Indoor wood furniture degrades outside." |
- Core: PyTorch, Transformers, Pandas.
- Models: Qwen2-VL (2B), Qwen2.5-VL (7B), SigLIP-so400m-patch14-384.
- Retrieval: RankBM25, FAISS (concept), Torch Tensor Operations.
- Training: LoRA, QLoRA, TRL, Accelerate.
This project is built as a modular pipeline on Kaggle. You can reproduce the results by running the notebooks in the following order:
- Step 1: Style Search - Data Prep - SigLIP pair generation
  - Action: Cleaning the Amazon Berkeley Objects dataset and using SigLIP (so400m) to mine "Hard Positives" and "Hard Negatives."
  - Output: Style Search Siglip generated pairs
- Step 2: Style Search - Data Prep - Qwen2.5VL7B Teacher
  - Action: Using the 7B model as a "Teacher" to label the pairs with scores (0.0-1.0) and natural-language rationales.
  - Output: Style Search Qwen2.5VL generated pairs
- Step 3: Style Search - SFT - Qwen2VL2B
  - Action: Fine-tuning the lightweight Qwen2-VL-2B model using QLoRA on the synthetic dataset to create the "Agentic Critic."
  - Hardware: Trained on a single P100 GPU (approx. 2 hrs).
- Step 4 (Demo): Style Search - Hybrid IR Engine
  - Action: The End-to-End MARS Pipeline.
  - Workflow: User Query -> BM25 Filter -> SigLIP Vector Search -> Fine-Tuned Agent Critique (sketched below).
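Finally, a rough sketch of the batched agent critique that closes the pipeline (Stage 3). Chat-template and multi-image batching details vary across `transformers` versions, so treat this as an outline rather than drop-in code; the JSON-extraction fallback is an assumption about how malformed generations are handled:

```python
import json
import re
import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")
processor.tokenizer.padding_side = "left"  # required for batched generation
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct", torch_dtype=torch.float16, device_map="auto"
)

@torch.no_grad()
def critique_batch(room_img, item_imgs, question):
    """Critique a batch of (room, item) pairs in one padded forward pass."""
    texts, images = [], []
    for item_img in item_imgs:  # e.g., 4 candidates per batch
        messages = [{"role": "user", "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": question},
        ]}]
        texts.append(processor.apply_chat_template(messages, add_generation_prompt=True))
        images.append([room_img, item_img])  # one image list per prompt
    inputs = processor(text=texts, images=images, padding=True, return_tensors="pt")
    inputs = inputs.to(model.device)
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    replies = processor.batch_decode(
        out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
    )
    results = []
    for reply in replies:
        match = re.search(r"\{.*\}", reply, flags=re.DOTALL)  # pull out the JSON blob
        try:
            results.append(json.loads(match.group(0)) if match
                           else {"score": 0.0, "rationale": reply})
        except json.JSONDecodeError:
            results.append({"score": 0.0, "rationale": reply})
    return results
```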