Ultra-low memory footprint, high inference speed, and mathematical precision.
Trion Core is a next-generation Large Language Model (LLM) engine based on the BitNet b1.58 architecture. Unlike standard models, which store weights in 16-bit FP16, it stores them as ternary values {-1, 0, 1}, i.e. roughly 1.58 bits per weight.
This revolutionary approach enables:
- Up to 70% reduction in VRAM/memory usage.
- Replacement of matrix multiplications (MatMul) with simple additions.
- Significantly lower training time and energy consumption.
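The headline memory figure can be sanity-checked with a back-of-the-envelope calculation: a ternary weight carries log2(3) ≈ 1.58 bits of information versus 16 bits for FP16, so weight storage alone shrinks far more than 70%; the smaller end-to-end figure reflects activations, the KV cache, and packing overhead that remain in higher precision. A minimal sketch:

```python
import math

# A ternary weight in {-1, 0, 1} carries log2(3) bits of information.
bits_per_ternary_weight = math.log2(3)   # ~1.58 bits
bits_per_fp16_weight = 16.0

# Ideal compression ratio for weight storage relative to FP16.
ratio = bits_per_ternary_weight / bits_per_fp16_weight
print(f"{bits_per_ternary_weight:.2f} bits/weight, "
      f"{ratio:.1%} of FP16 size")      # ~1.58 bits, ~10% of FP16
```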
Trion Core utilizes Absmean Quantization to compress weights into ternary values.
For a weight matrix $W \in \mathbb{R}^{n \times m}$, the scale $\gamma$ is the mean absolute value of its entries:

$$\gamma = \frac{1}{nm} \sum_{i,j} |W_{ij}|$$

The resulting ternary weights are obtained by scaling, rounding, and clipping each entry:

$$\widetilde{W} = \mathrm{RoundClip}\left(\frac{W}{\gamma + \epsilon},\ -1,\ 1\right) \in \{-1, 0, 1\}$$

Activations are quantized to 8-bit integers with per-token absmax scaling, so the inner products stay in low precision end to end.
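The quantization step above can be sketched in a few lines of numpy (function and variable names here are illustrative, not Trion Core's actual API):

```python
import numpy as np

def absmean_quantize(W: np.ndarray, eps: float = 1e-5):
    """Absmean quantization as in BitNet b1.58 (sketch).

    Scales W by the mean absolute value of its entries, then rounds
    and clips each entry to the nearest value in {-1, 0, 1}.
    Returns the ternary matrix and the scale gamma.
    """
    gamma = np.abs(W).mean()                          # absmean scale
    W_ternary = np.clip(np.round(W / (gamma + eps)), -1, 1)
    return W_ternary.astype(np.int8), gamma

W = np.array([[0.9, -0.04, -1.3],
              [0.2,  0.0,   0.7]])
Wq, gamma = absmean_quantize(W)
print(Wq)   # every entry is -1, 0, or 1
```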
Heavy matrix multiplications are replaced by Sparse Additions, dramatically boosting performance on consumer hardware like the GTX 1050.
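With ternary weights, a matrix-vector product needs no multiplications at all: each output is the sum of the inputs where the weight is +1, minus those where it is -1, with zeros skipped entirely. A minimal sketch of this idea (a real kernel would use packed bit operations, not a Python loop):

```python
import numpy as np

def ternary_matvec(W_ternary: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Compute y = W @ x using only additions and subtractions."""
    y = np.empty(W_ternary.shape[0])
    for i, row in enumerate(W_ternary):
        # Add inputs under +1 weights, subtract those under -1, skip zeros.
        y[i] = x[row == 1].sum() - x[row == -1].sum()
    return y

W = np.array([[1, 0, -1],
              [0, 1,  1]], dtype=np.int8)
x = np.array([2.0, 3.0, 5.0])
print(ternary_matvec(W, x))   # equals W @ x, with no multiplies
```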
Roadmap:
- v2.0: Trainable ternary weights (straight-through estimator, STE)
- Activation quantization
- KV-cache optimized inference
- Larger-scale dataset experiments
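The STE item above refers to the standard trick for training quantized weights: the forward pass uses the ternary weights, while the backward pass treats quantization as the identity so gradients flow to latent full-precision weights. A hand-rolled sketch of that idea, assuming the absmean scheme described earlier (names are illustrative):

```python
import numpy as np

def ste_forward(W_latent: np.ndarray, eps: float = 1e-5):
    """Forward pass: quantize latent weights to {-1, 0, 1}."""
    gamma = np.abs(W_latent).mean()
    Wq = np.clip(np.round(W_latent / (gamma + eps)), -1, 1)
    return Wq, gamma

def ste_backward(grad_wrt_Wq: np.ndarray) -> np.ndarray:
    """Backward pass: pretend quantization is the identity,
    so dL/dW_latent is approximated by dL/dWq."""
    return grad_wrt_Wq

W = np.array([0.8, -0.1, -1.2])          # latent full-precision weights
Wq, _ = ste_forward(W)                   # ternary weights used in forward
grad = np.array([0.5, 0.5, 0.5])         # pretend upstream gradient
W_updated = W - 0.1 * ste_backward(grad) # SGD step on the latent weights
```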
System data flow visualized (GitHub Mermaid integration):

```mermaid
graph TD
    A[Input Text] -->|Tokenizer| B(Token IDs)
    B --> C{Trion Embedding}
    C -->|FP32| D[Layer 1: BitGhostBlock]
    D -->|RMSNorm| E[Attention Mechanism]
    E -->|Identity Init| F[MLP: 1.58-bit Linear]
    F -->|BitQuant| G[Layer N...]
    G --> H[RMSNorm Final]
    H --> I[Output Head]
    I -->|Logits| J[Next Token Prediction]
    style C fill:#222,stroke:#00bcd4,stroke-width:2px
    style F fill:#440000,stroke:#ff0000,stroke-width:2px
    style I fill:#222,stroke:#00bcd4,stroke-width:2px
```
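RMSNorm appears twice in the data flow above (before the attention mechanism and before the output head). It normalizes each vector by its root mean square rather than by mean and variance, which makes it cheaper than LayerNorm. A minimal sketch (the learnable gain is an assumption of standard RMSNorm, not confirmed for Trion Core specifically):

```python
import numpy as np

def rms_norm(x: np.ndarray, gain: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Normalize x to unit root-mean-square, then apply a learnable gain."""
    rms = np.sqrt(np.mean(x * x) + eps)
    return gain * x / rms

x = np.array([1.0, -2.0, 3.0])
g = np.ones(3)                 # gain initialized to 1
print(rms_norm(x, g))          # x rescaled to have RMS ~= 1
```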