Add Trinity model family (AfmoeForCausalLM) contrib by jimburtoft · Pull Request #55 · aws-neuron/neuronx-distributed-inference

jimburtoft · 2026-02-27T04:57:29Z

Description

Unified NxDI implementation for the Arcee AI Trinity model family (AfmoeForCausalLM). A single modeling_trinity.py supports all three model sizes -- Nano (~6B), Mini (26B), and Large (250B) -- with config-driven differences only.
Trinity is a Mixture-of-Experts architecture with several unique features: gated attention (sigmoid gate before o_proj), mixed sliding/full attention, QK normalization, conditional RoPE, expert bias in routing, and route_scale baked into weights.

Model Information

Model Name: Trinity (Nano, Mini, Large)
Model Architecture: Mixture-of-Experts decoder-only transformer (AfmoeForCausalLM)
Purpose: Text generation (causal language modeling)

Checklist

Required Components

Accuracy Test (test/integration/test_model.py)
- Integration test validates model accuracy via logit comparison and top-k token verification
- Test can compile and run the model on Neuron (validated on trn2 and inf2)
README.md with the following sections:
- Usage Example: Code examples for all three model sizes (Nano, Mini, Large)
- Compatibility Matrix: Table showing tested instance types (trn2.3xlarge, trn2.48xlarge, inf2.8xlarge, inf2.xlarge) with SDK 2.27
- Example Checkpoints: Links to arcee-ai/Trinity-Nano-Preview, arcee-ai/Trinity-Mini, arcee-ai/Trinity-Large-Preview
- Testing Instructions: Commands to run the test suite
Source Code (src/)
- modeling_trinity.py (1328 lines) following NxD Inference patterns
- Properly structured in the contrib folder hierarchy

Optional Components

Unit Tests (CPU or Neuron-based)
- Unit test directory included (test/unit/) but no unit tests yet

Folder Structure

/contrib/models/Trinity/
README.md

/src
init.py
modeling_trinity.py
/test
init.py
- /unit
  init.py
- /integration
  init.py
  test_model.py

Testing

How did you test this change?
Each model size was compiled and loaded on the appropriate Neuron instance. Forward passes were run on 3 standardized prompts and top-1 token predictions were verified for coherence. Multi-token generation (5 tokens) was tested via naive autoregressive loop. CPU reference comparison is in process, but all outputs are coherent and grammatically correct.

Test Results:

Model	Instance	TP	Compile	Load	Forward	Status
Nano	trn2.3xlarge	2	5.1 min	2.2 min	0.50s	PASS
Nano	inf2.8xlarge	1	reused	47.7s	0.73s	PASS
Nano	inf2.xlarge	1	--	OOM	--	FAIL (16GB system RAM)
Mini	trn2.3xlarge	4	4.9 min	4.1 min	0.37s	PASS
Large	trn2.48xlarge	64	8.6 min	15.6 min	1.15s	PASS
Large	trn2.48xlarge	32	10.1 min	--	--	FAIL (HBM OOM per NC)

Sample first-token predictions (all models):

"Hello, how are you?" -> I
"Explain quantum computing in simple terms." -> Quantum / What / Answer (varies by size)
"Write a Python function that calculates the Fibonacci sequence." -> The

Compatibility

Tested with:

Neuron SDK Version(s): 2.27 (NxDI 0.7.15063, neuronx-cc 2.22.12471)
Instance Type(s): trn2.3xlarge, trn2.48xlarge, inf2.8xlarge, inf2.xlarge
PyTorch Version: 2.9.0 (torch-neuronx 2.9.0.2.11)
Python Version: 3.12
Transformers Version: 4.56.2
DLAMI: Deep Learning AMI Neuron (Ubuntu 24.04) 20260126
Venv: /opt/aws_neuronx_venv_pytorch_inference_vllm_0_13/

Additional Information

Key porting challenges solved:

Gated attention -- Sigmoid gate applied to attention output before o_proj. Solved via inline override of attention forward methods (required for Neuron tracer compatibility).
route_scale -- NxDI MoE v2 does not support route_scale natively. Baked into expert down_proj weights during weight conversion.
Expert bias -- Created custom RouterTopKWithBias subclass since NxDI routing does not support learned bias.
Gate weight padding at high TP -- When num_attention_heads is not divisible by tp_degree (e.g., Large: 48 heads / TP=64), gate weights are padded with interleaved layout matching Q projection.
Trinity-Large requires TP=64 on trn2.48xlarge (all 64 logical NeuronCores). TP=32 causes HBM OOM because sharded weights (~23.5GB) fill the ~24GB HBM per physical NC with no room for scratchpad.

Known limitations:

MoE v2 NKI kernel accumulates in bf16, causing slightly higher numerical divergence vs CPU reference. Top-1 token accuracy is preserved.
NKI flash attention requires padding_side="right" on the tokenizer.
inf2.xlarge cannot run Nano (16GB system RAM insufficient for weight loading).
Related Issues
N/A -- This is a new model contribution.

By submitting this PR, I confirm that:

I have read and followed the contributing guidelines (../contrib/CONTRIBUTING.md)
This is a community contribution and may have limited testing compared to officially-supported models
The code follows best practices and is well-documented
All required components listed above are included

Unified NxDI implementation supporting all three Arcee AI Trinity sizes (Nano ~6B, Mini ~26B, Large ~250B) from a single modeling_trinity.py. Validated on SDK 2.27 (NxDI 0.7.15063, neuronx-cc 2.22.12471): - Nano: inf2.8xlarge (TP=1) and trn2.3xlarge (TP=2) - Mini: trn2.3xlarge (TP=4) - Large: trn2.48xlarge (TP=64)

Add layer_to_cache_size_mapping in setup_attr_for_model() to provide per-layer KV cache sizes for mixed attention models. Without this, KVCacheManager sizes all layers to sliding_window, causing a tensor shape mismatch in compute_for_token_gen when seq_len > sliding_window. Update README with validated max sequence lengths: - Nano TP=2: 40960, TP=4: 49152 (trn2.3xlarge) - Mini TP=4: 32768 (trn2.3xlarge) - Large TP=64: 30720 (trn2.48xlarge) All verified with actual token generation at max seq_len.

aarondou approved these changes Feb 27, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Trinity model family (AfmoeForCausalLM) contrib#55

Add Trinity model family (AfmoeForCausalLM) contrib#55
jimburtoft wants to merge 2 commits intoaws-neuron:mainfrom
jimburtoft:contrib/trinity-model

jimburtoft commented Feb 27, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jimburtoft commented Feb 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Model Information

Checklist

Required Components

Optional Components

Folder Structure

Testing

Test Results:

Sample first-token predictions (all models):

Compatibility

Additional Information

Known limitations:

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jimburtoft commented Feb 27, 2026 •

edited

Loading