95 changes: 69 additions & 26 deletions contrib/models/OLMo-2-1124-7B/README.md
@@ -6,68 +6,83 @@ NeuronX Distributed Inference implementation of OLMo 2 1124 7B.

- **HuggingFace ID:** `allenai/OLMo-2-1124-7B`
- **Model Type:** Decoder-only transformer
- **License:** Check HuggingFace model card
- **Parameters:** ~7B
- **License:** Apache 2.0

## Architecture Details

- **Layers:** Check model config
- **Hidden Size:** Check model config
- **Attention Heads:** Check model config
- **Vocabulary:** Check model config
- **Max Position Embeddings:** Check model config
- **Layers:** 32 decoder layers
- **Hidden Size:** 4096
- **Attention Heads:** 32
- **Key-Value Heads:** 32
- **Head Dimension:** 128
- **Intermediate Size:** 11008
- **Vocabulary:** 100,352 tokens
- **Max Position Embeddings:** 4096
- **Position Encoding:** RoPE (theta=500000)
- **Normalization:** RMSNorm
- **Activation:** SiLU (SwiGLU)
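
The head dimension and the per-rank projection width follow from the figures above. The sketch below is only an arithmetic check: it assumes heads are split evenly across tensor-parallel ranks, and `per_rank_layout` is a hypothetical helper, not part of the NeuronX Distributed Inference API.

```python
# Values copied from the Architecture Details list above.
HIDDEN_SIZE = 4096
NUM_ATTENTION_HEADS = 32
NUM_KEY_VALUE_HEADS = 32


def per_rank_layout(tp_degree):
    """Hypothetical helper: split heads evenly across tensor-parallel ranks."""
    if NUM_ATTENTION_HEADS % tp_degree or NUM_KEY_VALUE_HEADS % tp_degree:
        raise ValueError("tp_degree must divide the number of Q and KV heads")
    head_dim = HIDDEN_SIZE // NUM_ATTENTION_HEADS
    q_heads_per_rank = NUM_ATTENTION_HEADS // tp_degree
    kv_heads_per_rank = NUM_KEY_VALUE_HEADS // tp_degree
    return {
        "head_dim": head_dim,
        "q_heads_per_rank": q_heads_per_rank,
        "kv_heads_per_rank": kv_heads_per_rank,
        # Width of the Q (and here also K) projection output held by one rank.
        "q_width_per_rank": q_heads_per_rank * head_dim,
    }


layout = per_rank_layout(tp_degree=8)
print(layout["head_dim"], layout["q_width_per_rank"])  # 128 512
```

With TP=8 each rank holds 4 of the 32 heads, so its slice of the Q or K projection is 512 wide.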

### OLMo2-Specific Features

1. **Post-layer normalization**: RMSNorm applied AFTER attention and MLP (not before like LLaMA)
2. **Q-K normalization**: RMSNorm on Q and K projections BEFORE reshaping to heads
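
To illustrate the first point, the sketch below contrasts the two residual orderings on plain Python lists. It is a toy model, not the code in `src/modeling_olmo2.py`: `attn` and `mlp` are stand-in callables, the norm has no learned weight, and the epsilon value is an assumption.

```python
import math


def rms_norm(x, eps=1e-6):
    """Weightless RMSNorm over a flat vector (eps is an assumed value)."""
    scale = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * scale for v in x]


def add(a, b):
    """Element-wise residual addition."""
    return [p + q for p, q in zip(a, b)]


def pre_norm_block(x, attn, mlp):
    """LLaMA-style ordering: normalize the INPUT of each sub-layer."""
    h = add(x, attn(rms_norm(x)))
    return add(h, mlp(rms_norm(h)))


def post_norm_block(x, attn, mlp):
    """OLMo 2-style ordering: normalize the OUTPUT of each sub-layer."""
    h = add(x, rms_norm(attn(x)))
    return add(h, rms_norm(mlp(h)))


# Stand-in sub-layers, purely for illustration.
toy_attn = lambda v: [2.0 * t for t in v]
toy_mlp = lambda v: [t + 1.0 for t in v]

x = [1.0, 2.0, 3.0, 4.0]
print(pre_norm_block(x, toy_attn, toy_mlp))
print(post_norm_block(x, toy_attn, toy_mlp))
```

The two orderings give different outputs for the same weights, so a port cannot reuse a LLaMA decoder layer unchanged.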

## Validation Results

**Validated:** 2026-01-29
**Configuration:** TP=2, batch_size=1, seq_len=128, bfloat16
**Validated:** 2026-02-05
**Configuration:** TP=8, batch_size=1, seq_len=128, bfloat16

### Test Results

| Test | Status | Result |
|------|--------|--------|
| Smoke Test | ✅ PASS | Model loads successfully |
| Token Matching | ⚠️ LOW | **4.7% match** |
| TTFT (P50) | ✅ PASS | 55.36ms (threshold: 100ms) |
| Throughput | ✅ PASS | 17.99 tok/s (threshold: 10 tok/s) |
| Token Matching | ✅ PASS | **100% match** |
| TTFT (P50) | ✅ PASS | ~55ms (threshold: 100ms) |
| Throughput | ✅ PASS | ~18 tok/s (threshold: 10 tok/s) |

### Performance Metrics

| Metric | Value |
|--------|-------|
| TTFT (P50) | 55.36ms |
| Throughput | 17.99 tokens/s |

| TTFT (P50) | ~55ms |
| Throughput | ~18 tokens/s |

**Status:** ✅ VALIDATED

## Usage

```python
from transformers import AutoTokenizer, GenerationConfig
from neuronx_distributed_inference.models.config import NeuronConfig
import torch
from transformers import AutoTokenizer
from neuronx_distributed_inference.utils.hf_adapter import load_pretrained_config

# Import model classes from src
from src.modeling_olmo_2_1124_7b import NeuronOLMo211247BForCausalLM, OLMo211247BInferenceConfig
from src.modeling_olmo2 import (
    NeuronOlmo2ForCausalLM,
    Olmo2InferenceConfig,
    Olmo2NeuronConfig,
)

model_path = "/path/to/OLMo-2-1124-7B/"
compiled_model_path = "/path/to/compiled/"

# Configure
neuron_config = NeuronConfig(
    tp_degree=2,
neuron_config = Olmo2NeuronConfig(
    tp_degree=8,
    batch_size=1,
    seq_len=512,
    seq_len=128,
    torch_dtype=torch.bfloat16,
)

config = OLMo211247BInferenceConfig(
    neuron_config,
    load_config=load_pretrained_config(model_path),
config = Olmo2InferenceConfig.from_pretrained(
    model_path,
    neuron_config=neuron_config,
)

# Compile and load
model = NeuronOLMo211247BForCausalLM(model_path, config)
model = NeuronOlmo2ForCausalLM(model_path, config)
model.compile(compiled_model_path)
model.load(compiled_model_path)

@@ -76,6 +91,28 @@ tokenizer = AutoTokenizer.from_pretrained(model_path)
# ... (see integration test for full example)
```

## Implementation Notes

### Q-K Normalization with Tensor Parallelism

This model uses Q-K normalization where RMSNorm is applied to Q and K projections BEFORE reshaping to heads. This requires special handling with tensor parallelism (TP > 1):

**The Challenge:**
- Q/K projections are sharded across TP ranks (4096 → 512 per rank with TP=8)
- RMSNorm variance must be computed over the FULL dimension (4096), not the sharded dimension (512)
- A naive implementation computes the variance over the sharded dimension only, so each rank normalizes its shard by a different scale and the result is incorrect

**The Solution:**
The `ShardedRMSNorm` class uses an all-reduce to compute variance correctly:
1. Compute local sum of squares (not mean) over sharded dimension
2. All-reduce across TP ranks to get global sum of squares
3. Divide by FULL dimension size to get correct variance
4. Apply normalization with the correct variance

This fix was critical for achieving 100% token match accuracy with TP=8.

See `NEURON_PORT_DEBUGGING_GUIDE.md` for detailed documentation of this issue and solution.
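
The four steps above can be checked numerically without any accelerator. The sketch below is a pure-Python simulation, not the `ShardedRMSNorm` class itself: the TP ranks are modeled as list slices, the all-reduce is a plain `sum`, and the function names and epsilon value are assumptions made for illustration.

```python
import math

EPS = 1e-6  # assumed epsilon; the real value comes from the model config


def rmsnorm_full(x, weight, eps=EPS):
    """Reference RMSNorm over the full, unsharded vector."""
    variance = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(variance + eps)
    return [v * scale * w for v, w in zip(x, weight)]


def shard(values, tp_degree):
    """Split a vector into tp_degree contiguous, equally sized shards."""
    size = len(values) // tp_degree
    return [values[r * size:(r + 1) * size] for r in range(tp_degree)]


def rmsnorm_naive_sharded(x_shards, w_shards, eps=EPS):
    """Incorrect: every rank uses the variance of its own shard."""
    out = []
    for xs, ws in zip(x_shards, w_shards):
        out.extend(rmsnorm_full(xs, ws, eps))
    return out


def rmsnorm_allreduce_sharded(x_shards, w_shards, full_dim, eps=EPS):
    """Correct: variance is taken over the full dimension."""
    # Step 1: local sum of squares (not the mean) on each rank.
    local_sums = [sum(v * v for v in xs) for xs in x_shards]
    # Step 2: all-reduce(SUM) across ranks, simulated here by a plain sum.
    global_sum = sum(local_sums)
    # Step 3: divide by the FULL dimension size, not the shard size.
    variance = global_sum / full_dim
    scale = 1.0 / math.sqrt(variance + eps)
    # Step 4: each rank scales its own shard with the shared variance.
    out = []
    for xs, ws in zip(x_shards, w_shards):
        out.extend(v * scale * w for v, w in zip(xs, ws))
    return out


# Toy sizes (64 wide, TP=8) standing in for 4096 wide, TP=8.
dim, tp = 64, 8
x = [(i % 7 - 3) * 0.5 + i * 0.01 for i in range(dim)]
w = [1.0 + 0.001 * i for i in range(dim)]

reference = rmsnorm_full(x, w)
fixed = rmsnorm_allreduce_sharded(shard(x, tp), shard(w, tp), full_dim=dim)
naive = rmsnorm_naive_sharded(shard(x, tp), shard(w, tp))

print(max(abs(a - b) for a, b in zip(reference, fixed)))  # close to zero
print(max(abs(a - b) for a, b in zip(reference, naive)))  # clearly non-zero
```

On real hardware the simulated sum would be a collective such as `torch.distributed.all_reduce` over the tensor-parallel group; which collective `ShardedRMSNorm` actually calls is not stated here.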

## Compatibility Matrix

| Instance/Version | 2.20+ | 2.19 and earlier |
@@ -102,8 +139,14 @@ python3 test/integration/test_model.py

* allenai/OLMo-2-1124-7B

## Notes

- Post-layer normalization architecture (different from LLaMA's pre-norm)
- Q-K RMSNorm requires special handling for tensor parallelism
- Perfect accuracy validation (100% token match with TP=8)

## Maintainer

Neuroboros Team - Annapurna Labs
Annapurna Labs

**Last Updated:** 2026-01-29
**Last Updated:** 2026-02-05