Add Trinity model family (AfmoeForCausalLM) contrib#55
Open
jimburtoft wants to merge 2 commits into aws-neuron:main from
Conversation
Unified NxDI implementation supporting all three Arcee AI Trinity sizes (Nano ~6B, Mini ~26B, Large ~250B) from a single modeling_trinity.py. Validated on SDK 2.27 (NxDI 0.7.15063, neuronx-cc 2.22.12471):
- Nano: inf2.8xlarge (TP=1) and trn2.3xlarge (TP=2)
- Mini: trn2.3xlarge (TP=4)
- Large: trn2.48xlarge (TP=64)
aarondou approved these changes on Feb 27, 2026
Add layer_to_cache_size_mapping in setup_attr_for_model() to provide per-layer KV cache sizes for mixed attention models. Without this, KVCacheManager sizes all layers to sliding_window, causing a tensor shape mismatch in compute_for_token_gen when seq_len > sliding_window.

Update README with validated max sequence lengths:
- Nano TP=2: 40960, TP=4: 49152 (trn2.3xlarge)
- Mini TP=4: 32768 (trn2.3xlarge)
- Large TP=64: 30720 (trn2.48xlarge)

All verified with actual token generation at max seq_len.
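The idea behind the fix can be sketched in a few lines (illustrative names only, not the exact NxDI API): full-attention layers need a KV cache sized to the full sequence length, while sliding-window layers never attend beyond the last sliding_window tokens.

```python
# Illustrative sketch (not the exact NxDI API): build a per-layer KV cache
# size list for a model that mixes sliding-window and full-attention layers.
def build_layer_to_cache_size_mapping(layer_types, max_seq_len, sliding_window):
    # Sliding-window layers only ever attend to the last `sliding_window`
    # tokens, so their cache can be smaller; full-attention layers must
    # cache every position up to max_seq_len.
    return [
        sliding_window if layer_type == "sliding_attention" else max_seq_len
        for layer_type in layer_types
    ]

# Example: a 4-layer model alternating full and sliding attention.
mapping = build_layer_to_cache_size_mapping(
    ["full_attention", "sliding_attention", "full_attention", "sliding_attention"],
    max_seq_len=40960,
    sliding_window=4096,
)
# → [40960, 4096, 40960, 4096]
```

Sizing every layer uniformly to sliding_window is what produced the shape mismatch once seq_len exceeded the window.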
Description
Unified NxDI implementation for the Arcee AI Trinity model family (AfmoeForCausalLM). A single modeling_trinity.py supports all three model sizes -- Nano (~6B), Mini (26B), and Large (250B) -- with config-driven differences only.
Trinity is a Mixture-of-Experts architecture with several unique features: gated attention (sigmoid gate before o_proj), mixed sliding/full attention, QK normalization, conditional RoPE, expert bias in routing, and route_scale baked into weights.
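As a rough illustration of the gated-attention piece (hypothetical names and a single position; the real implementation lives in modeling_trinity.py), the attention output is modulated elementwise by a sigmoid gate before the output projection:

```python
import math

# Hypothetical sketch of gated attention for one position: a sigmoid gate
# computed from gate logits scales the attention output elementwise before
# the output projection (o_proj). Names and shapes are illustrative.
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_attention_output(attn_out, gate_logits):
    # Each attention-output channel is damped by its gate value in (0, 1).
    return [a * sigmoid(g) for a, g in zip(attn_out, gate_logits)]

out = gated_attention_output([1.0, -2.0, 0.5], [0.0, 10.0, -10.0])
# gate ≈ [0.5, ~1.0, ~0.0] → out ≈ [0.5, -2.0, 0.0]
```

A saturated gate (large positive logit) passes the channel through unchanged; a large negative logit suppresses it almost entirely.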
Model Information
Model Name: Trinity (Nano, Mini, Large)
Model Architecture: Mixture-of-Experts decoder-only transformer (AfmoeForCausalLM)
Purpose: Text generation (causal language modeling)
Checklist
Required Components
Optional Components
Folder Structure
/contrib/models/Trinity/
README.md
__init__.py
modeling_trinity.py
__init__.py
__init__.py
__init__.py
test_model.py
Testing
How did you test this change?
Each model size was compiled and loaded on the appropriate Neuron instance. Forward passes were run on 3 standardized prompts and top-1 token predictions were verified for coherence. Multi-token generation (5 tokens) was tested via a naive autoregressive loop. A CPU reference comparison is in progress, but all outputs are coherent and grammatically correct.
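The 5-token generation check amounts to a minimal greedy loop like the following (a stand-in `model` callable replaces the actual compiled Neuron model):

```python
# Minimal greedy autoregressive loop, mirroring the naive 5-token generation
# check; `model` is a stand-in callable that maps token ids to logits over
# the vocabulary for the last position (the real model runs on Neuron).
def greedy_generate(model, input_ids, num_new_tokens=5):
    ids = list(input_ids)
    for _ in range(num_new_tokens):
        logits = model(ids)
        next_id = max(range(len(logits)), key=logits.__getitem__)  # top-1 token
        ids.append(next_id)
    return ids

# Toy model: always assigns the highest logit to token 1.
ids = greedy_generate(lambda ids: [0.0, 1.0, 0.5], [7], num_new_tokens=3)
# → [7, 1, 1, 1]
```

Greedy top-1 decoding keeps the check deterministic, which is what makes the per-token comparisons against a CPU reference meaningful.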
Test Results:
Sample first-token predictions (all models):
Compatibility
Tested with:
Additional Information
Key porting challenges solved:
Known limitations:
Related Issues
N/A -- This is a new model contribution.
By submitting this PR, I confirm that: