Skip to content

Add Trinity model family (AfmoeForCausalLM) contrib#55

Open
jimburtoft wants to merge 2 commits intoaws-neuron:mainfrom
jimburtoft:contrib/trinity-model
Open

Add Trinity model family (AfmoeForCausalLM) contrib#55
jimburtoft wants to merge 2 commits intoaws-neuron:mainfrom
jimburtoft:contrib/trinity-model

Conversation

@jimburtoft
Copy link

@jimburtoft jimburtoft commented Feb 27, 2026

Description

Unified NxDI implementation for the Arcee AI Trinity model family (AfmoeForCausalLM). A single modeling_trinity.py supports all three model sizes -- Nano (~6B), Mini (26B), and Large (250B) -- with config-driven differences only.
Trinity is a Mixture-of-Experts architecture with several unique features: gated attention (sigmoid gate before o_proj), mixed sliding/full attention, QK normalization, conditional RoPE, expert bias in routing, and route_scale baked into weights.

Model Information

Model Name: Trinity (Nano, Mini, Large)
Model Architecture: Mixture-of-Experts decoder-only transformer (AfmoeForCausalLM)
Purpose: Text generation (causal language modeling)

Checklist

Required Components

  • Accuracy Test (test/integration/test_model.py)
    • Integration test validates model accuracy via logit comparison and top-k token verification
    • Test can compile and run the model on Neuron (validated on trn2 and inf2)
  • README.md with the following sections:
    • Usage Example: Code examples for all three model sizes (Nano, Mini, Large)
    • Compatibility Matrix: Table showing tested instance types (trn2.3xlarge, trn2.48xlarge, inf2.8xlarge, inf2.xlarge) with SDK 2.27
    • Example Checkpoints: Links to arcee-ai/Trinity-Nano-Preview, arcee-ai/Trinity-Mini, arcee-ai/Trinity-Large-Preview
    • Testing Instructions: Commands to run the test suite
  • Source Code (src/)
    • modeling_trinity.py (1328 lines) following NxD Inference patterns
    • Properly structured in the contrib folder hierarchy

Optional Components

  • Unit Tests (CPU or Neuron-based)
    • Unit test directory included (test/unit/) but no unit tests yet

Folder Structure

/contrib/models/Trinity/
README.md

  • /src
    init.py
    modeling_trinity.py
  • /test
    init.py
    • /unit
      init.py
    • /integration
      init.py
      test_model.py

Testing

How did you test this change?
Each model size was compiled and loaded on the appropriate Neuron instance. Forward passes were run on 3 standardized prompts and top-1 token predictions were verified for coherence. Multi-token generation (5 tokens) was tested via naive autoregressive loop. CPU reference comparison is in process, but all outputs are coherent and grammatically correct.

Test Results:

Model Instance TP Compile Load Forward Status
Nano trn2.3xlarge 2 5.1 min 2.2 min 0.50s PASS
Nano inf2.8xlarge 1 reused 47.7s 0.73s PASS
Nano inf2.xlarge 1 -- OOM -- FAIL (16GB system RAM)
Mini trn2.3xlarge 4 4.9 min 4.1 min 0.37s PASS
Large trn2.48xlarge 64 8.6 min 15.6 min 1.15s PASS
Large trn2.48xlarge 32 10.1 min -- -- FAIL (HBM OOM per NC)

Sample first-token predictions (all models):

  • "Hello, how are you?" -> I
  • "Explain quantum computing in simple terms." -> Quantum / What / Answer (varies by size)
  • "Write a Python function that calculates the Fibonacci sequence." -> The

Compatibility

Tested with:

  • Neuron SDK Version(s): 2.27 (NxDI 0.7.15063, neuronx-cc 2.22.12471)
  • Instance Type(s): trn2.3xlarge, trn2.48xlarge, inf2.8xlarge, inf2.xlarge
  • PyTorch Version: 2.9.0 (torch-neuronx 2.9.0.2.11)
  • Python Version: 3.12
  • Transformers Version: 4.56.2
  • DLAMI: Deep Learning AMI Neuron (Ubuntu 24.04) 20260126
  • Venv: /opt/aws_neuronx_venv_pytorch_inference_vllm_0_13/

Additional Information

Key porting challenges solved:

  1. Gated attention -- Sigmoid gate applied to attention output before o_proj. Solved via inline override of attention forward methods (required for Neuron tracer compatibility).
  2. route_scale -- NxDI MoE v2 does not support route_scale natively. Baked into expert down_proj weights during weight conversion.
  3. Expert bias -- Created custom RouterTopKWithBias subclass since NxDI routing does not support learned bias.
  4. Gate weight padding at high TP -- When num_attention_heads is not divisible by tp_degree (e.g., Large: 48 heads / TP=64), gate weights are padded with interleaved layout matching Q projection.
  5. Trinity-Large requires TP=64 on trn2.48xlarge (all 64 logical NeuronCores). TP=32 causes HBM OOM because sharded weights (~23.5GB) fill the ~24GB HBM per physical NC with no room for scratchpad.

Known limitations:

  • MoE v2 NKI kernel accumulates in bf16, causing slightly higher numerical divergence vs CPU reference. Top-1 token accuracy is preserved.
  • NKI flash attention requires padding_side="right" on the tokenizer.
  • inf2.xlarge cannot run Nano (16GB system RAM insufficient for weight loading).
    Related Issues
    N/A -- This is a new model contribution.

By submitting this PR, I confirm that:

  • I have read and followed the contributing guidelines (../contrib/CONTRIBUTING.md)
  • This is a community contribution and may have limited testing compared to officially-supported models
  • The code follows best practices and is well-documented
  • All required components listed above are included

Unified NxDI implementation supporting all three Arcee AI Trinity sizes
(Nano ~6B, Mini ~26B, Large ~250B) from a single modeling_trinity.py.

Validated on SDK 2.27 (NxDI 0.7.15063, neuronx-cc 2.22.12471):
- Nano: inf2.8xlarge (TP=1) and trn2.3xlarge (TP=2)
- Mini: trn2.3xlarge (TP=4)
- Large: trn2.48xlarge (TP=64)
Add layer_to_cache_size_mapping in setup_attr_for_model() to provide
per-layer KV cache sizes for mixed attention models. Without this,
KVCacheManager sizes all layers to sliding_window, causing a tensor
shape mismatch in compute_for_token_gen when seq_len > sliding_window.

Update README with validated max sequence lengths:
- Nano TP=2: 40960, TP=4: 49152 (trn2.3xlarge)
- Mini TP=4: 32768 (trn2.3xlarge)
- Large TP=64: 30720 (trn2.48xlarge)

All verified with actual token generation at max seq_len.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants