Currently a small(er) nanoGPT, written from scratch in tinygrad. Since tinygrad can accelerate on virtually any backend, it is probably the simplest way to write accelerated training code on modern Macs.
- CausalAttention with RoPE [layers/attention.py] (RoPE sketch below)
- Basic FFNs [layers/feedforward.py]
- SwiGLU FFNs [layers/feedforward.py] (sketch below)
- Mixture of Experts [layers/moe.py] (dense-routing sketch below)
- SGD [optimizers/sgd.py]
- Adam [optimizers/sgd.py]
- Muon [optimizers/muon.py] (Newton-Schulz sketch below)
- LayerNorm [utils/transformer_methods.py] (sketch below)
- Cross Entropy [utils/loss_functions.py] (sketch below)
- Naive character-level tokenization [utils/dataloader.py] (sketch below)
- Byte-pair encoding [utils/dataloader.py]
- Diffusion text modelling
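
A few rough sketches of the core pieces follow; they are illustrative only and do not reproduce the code in this repo. First, GPT-NeoX-style (rotate-half) rotary position embeddings applied to a query or key tensor; the function name and the (batch, heads, seq, head_dim) layout are assumptions:

```python
import math
from tinygrad import Tensor

def apply_rope(x: Tensor, base: float = 10000.0) -> Tensor:
    # x: (batch, heads, seq_len, head_dim) query or key tensor (layout is an assumption).
    # Rotate channel pairs by position-dependent angles so query-key dot products
    # encode relative offsets.
    seq_len, head_dim = x.shape[-2], x.shape[-1]
    half = head_dim // 2
    inv_freq = (Tensor.arange(half).float() * (-math.log(base) / half)).exp()  # (half,)
    angles = Tensor.arange(seq_len).float().reshape(seq_len, 1) * inv_freq     # (seq_len, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return (x1 * cos - x2 * sin).cat(x1 * sin + x2 * cos, dim=-1)
```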
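
The SwiGLU feed-forward block is a gated variant of the basic FFN: down-project the elementwise product of a SiLU-activated gate and a linear up-projection. A minimal sketch on top of tinygrad's nn.Linear, with assumed weight names and no biases:

```python
from tinygrad import Tensor
from tinygrad.nn import Linear

class SwiGLU:
    # out = W_down(silu(W_gate x) * (W_up x)), where silu(g) = g * sigmoid(g)
    def __init__(self, dim: int, hidden: int):
        self.w_gate = Linear(dim, hidden, bias=False)  # weight names are assumptions
        self.w_up   = Linear(dim, hidden, bias=False)
        self.w_down = Linear(hidden, dim, bias=False)

    def __call__(self, x: Tensor) -> Tensor:
        gate = self.w_gate(x)
        return self.w_down(gate * gate.sigmoid() * self.w_up(x))
```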
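
For mixture of experts, the sketch below uses dense (softmax-weighted) routing so every token passes through every expert, just to show the shape of the computation; layers/moe.py presumably uses sparse top-k routing instead, and the two-Linear ReLU expert here is a stand-in:

```python
from tinygrad import Tensor
from tinygrad.nn import Linear

class DenseMoE:
    # Every expert processes every token; a softmax gate mixes their outputs.
    def __init__(self, dim: int, hidden: int, n_experts: int):
        self.gate = Linear(dim, n_experts, bias=False)
        self.experts = [(Linear(dim, hidden, bias=False), Linear(hidden, dim, bias=False))
                        for _ in range(n_experts)]

    def __call__(self, x: Tensor) -> Tensor:
        weights = self.gate(x).softmax(axis=-1)                        # (..., n_experts)
        outs = [down(up(x).relu()).unsqueeze(-1) for up, down in self.experts]
        outs = outs[0].cat(*outs[1:], dim=-1)                          # (..., dim, n_experts)
        return (outs * weights.unsqueeze(-2)).sum(axis=-1)             # gate-weighted mix
```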
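
Muon's distinctive step is orthogonalizing each 2D weight update with a few Newton-Schulz iterations before applying it; below is a sketch of just that step, using the commonly quoted quintic coefficients, not necessarily what optimizers/muon.py does:

```python
from tinygrad import Tensor

def newton_schulz_orthogonalize(g: Tensor, steps: int = 5) -> Tensor:
    # Push the singular values of a 2D (momentum-smoothed) gradient toward 1,
    # so the update direction is roughly orthogonal. Coefficients are the
    # widely used quintic-iteration values.
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / ((g * g).sum().sqrt() + 1e-7)   # scale so the spectral norm is <= ~1
    tall = x.shape[0] > x.shape[1]
    if tall: x = x.transpose()              # iterate on the wide orientation
    for _ in range(steps):
        s = x @ x.transpose()
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.transpose() if tall else x
```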
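
LayerNorm and cross entropy both reduce to a handful of tensor ops; minimal from-scratch sketches (the one-hot trick for gathering target log-probabilities is illustrative, not necessarily how utils/loss_functions.py does it):

```python
from tinygrad import Tensor

def layer_norm(x: Tensor, weight: Tensor, bias: Tensor, eps: float = 1e-5) -> Tensor:
    # Normalize each vector over its last dimension, then apply a learned scale and shift.
    mean = x.mean(axis=-1, keepdim=True)
    var = ((x - mean) ** 2).mean(axis=-1, keepdim=True)
    return (x - mean) / (var + eps).sqrt() * weight + bias

def cross_entropy(logits: Tensor, targets: Tensor) -> Tensor:
    # logits: (N, vocab_size), targets: (N,) integer token ids.
    log_probs = logits.log_softmax(axis=-1)
    one_hot = (Tensor.arange(logits.shape[-1]) == targets.reshape(-1, 1)).float()
    return -(log_probs * one_hot).sum(axis=-1).mean()  # mean negative log-likelihood
```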
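
Finally, the naive character-level tokenizer is small enough to show whole; something along these lines (exact names in utils/dataloader.py may differ):

```python
def build_char_tokenizer(text: str):
    # Every distinct character in the corpus gets its own integer id.
    vocab = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(vocab)}
    itos = {i: ch for ch, i in stoi.items()}
    encode = lambda s: [stoi[c] for c in s]
    decode = lambda ids: "".join(itos[i] for i in ids)
    return encode, decode, len(vocab)

encode, decode, vocab_size = build_char_tokenizer("tiny shakespeare sample text")
assert decode(encode("tiny")) == "tiny"
```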