From 41a2fe799d745e2041cb54113ca4a0d8ea6c820d Mon Sep 17 00:00:00 2001
From: Deeptanshu Singh <sdeeptan@amazon.com>
Date: Thu, 19 Feb 2026 14:42:48 -0500
Subject: [PATCH 1/2] Update README with token match rate on text backbone

---
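Reviewer note: the README text added in this patch asks readers to verify
`num_hidden_layers` in the compiled `config.json` and reports a token match
rate. The sketch below shows one way both checks could be scripted. It is an
illustration only, not the validation harness used for the reported numbers.
The helper names are hypothetical. It is an assumption that the layer count
sits at the top level of `config.json` or under a `text_config` key, and the
match-rate definition (position-wise agreement, with length differences
counted as mismatches) is also an assumption.

```python
import json
from pathlib import Path


def compiled_layer_count(config_path):
    """Return num_hidden_layers from a compiled config.json.

    Assumption: the key is at the top level or nested under "text_config".
    """
    config = json.loads(Path(config_path).read_text())
    if "num_hidden_layers" in config:
        return config["num_hidden_layers"]
    text_config = config.get("text_config") or {}
    if "num_hidden_layers" in text_config:
        return text_config["num_hidden_layers"]
    raise KeyError(f"num_hidden_layers not found in {config_path}")


def token_match_rate(reference_ids, candidate_ids):
    """Fraction of positions where the two token id sequences agree.

    Assumed definition: positions present in only one sequence count as
    mismatches, so the denominator is the longer length.
    """
    total = max(len(reference_ids), len(candidate_ids))
    if total == 0:
        return 1.0  # two empty sequences agree trivially
    matches = sum(r == c for r, c in zip(reference_ids, candidate_ids))
    return matches / total
```

A caller would compare `compiled_layer_count(...)` against the expected 64
layers before running token matching, then pass the reference and Neuron
token ids for the same prompt to `token_match_rate`.
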
 contrib/models/Qwen2.5-VL-32B-Instruct/README.md | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/contrib/models/Qwen2.5-VL-32B-Instruct/README.md b/contrib/models/Qwen2.5-VL-32B-Instruct/README.md
index dd0c010..856eae2 100644
--- a/contrib/models/Qwen2.5-VL-32B-Instruct/README.md
+++ b/contrib/models/Qwen2.5-VL-32B-Instruct/README.md
@@ -12,7 +12,9 @@ NeuronX Distributed Inference implementation of Qwen2.5 VL 32B Instruct.
 
 ## Architecture Details
 
-- **Layers:** Check model config
+- **Type:** Multimodal (vision-language) model; only the text backbone is validated
+- **Text Backbone:** Decoder-only transformer (Qwen2-based)
+- **Layers:** 64
 - **Hidden Size:** Check model config
 - **Attention Heads:** Check model config
 - **Vocabulary:** Check model config
@@ -28,7 +30,7 @@ NeuronX Distributed Inference implementation of Qwen2.5 VL 32B Instruct.
 | Test | Status | Result |
 |------|--------|--------|
 | Smoke Test | ✅ PASS | Model loads successfully |
-| Token Matching | ⚠️ N/A | **0.0% match** |
+| Token Matching | ✅ PASS | **100% match** (text backbone) |
 | TTFT (P50) | ✅ PASS | 7.98ms (threshold: 100ms) |
 | Throughput | ✅ PASS | 120.65 tok/s (threshold: 10 tok/s) |
 
@@ -39,9 +41,14 @@ NeuronX Distributed Inference implementation of Qwen2.5 VL 32B Instruct.
 | TTFT (P50) | 7.98ms |
 | Throughput | 120.65 tokens/s |
 
-
 **Status:** ✅ VALIDATED
 
+### Multimodal Validation Notes
+
+Qwen2.5-VL is a vision-language model, and the NeuronX port validates only its text backbone. `AutoModelForCausalLM` does not work for VLMs, so the HF reference for token matching must be loaded with the specific text backbone class (`Qwen2ForCausalLM`). With the text backbone extracted this way, the model achieves a 100% token match.
+
+**Important:** Ensure the compiled model uses the full 64 layers. Test builds with reduced layer counts (e.g., 4 layers) will produce poor accuracy. Always verify `num_hidden_layers` in the compiled `config.json` before validation.
+
 ## Usage
 
 ```python

From 181bc7c74fbe8e9645665ab7cd051225f492d8bc Mon Sep 17 00:00:00 2001
From: Deeptanshu Singh <sdeeptan@amazon.com>
Date: Thu, 26 Feb 2026 13:47:37 -0500
Subject: [PATCH 2/2] Removing internal names

---
 contrib/models/Qwen2.5-VL-32B-Instruct/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/contrib/models/Qwen2.5-VL-32B-Instruct/README.md b/contrib/models/Qwen2.5-VL-32B-Instruct/README.md
index 856eae2..a75ef3c 100644
--- a/contrib/models/Qwen2.5-VL-32B-Instruct/README.md
+++ b/contrib/models/Qwen2.5-VL-32B-Instruct/README.md
@@ -113,6 +113,6 @@ python3 test/integration/test_model.py
 
 ## Maintainer
 
-Neuroboros Team - Annapurna Labs
+Annapurna Labs
 
 **Last Updated:** 2026-01-29