This repo collects post-training methods for Large Language Models (LLMs) with small, focused implementations and runnable examples. The goal is to make alignment and reinforcement post-training practical, understandable, and reproducible.
- Post-training methods that start from a pretrained model.
- Minimal, readable implementations over full-scale training stacks.
- RL fundamentals in Gymnasium to build intuition for later LLM alignment.
python3 -m venv .venv
. .venv/bin/activate
python -m pip install -U pip
pip install -r requirements.txt

- gymnasium/: RL foundations (CartPole examples).
- chess/: Toy chess Q-learning (KQ vs K).
- requirements.txt: Python dependencies.
- README.md: Learning path and run instructions.
Runs a single random rollout to verify environment setup.
Code: gymnasium/cartpole_random.py
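For orientation, a random rollout amounts to roughly the following (a minimal sketch of the idea, not necessarily the script's exact contents):

```python
import gymnasium as gym

# One random CartPole episode: sample actions until the pole falls over
# or the episode is truncated.
env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()  # random policy
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated
env.close()
print(f"episode return: {total_reward}")
```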
python gymnasium/cartpole_random.py

Trains a discretized Q-learning agent and evaluates it.
Code: gymnasium/cartpole_q_learning.py
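The key trick is binning CartPole's continuous observations so that a tabular Q-function applies. A minimal sketch of that discretization (the bounds and bin counts here are illustrative, not the script's actual settings):

```python
import numpy as np

# Map CartPole's 4 continuous observation values onto integer bin indices
# so a state can index a Q-table. Bounds and bin counts are illustrative.
BINS = 8
LOW = np.array([-2.4, -3.0, -0.21, -3.0])
HIGH = np.array([2.4, 3.0, 0.21, 3.0])
EDGES = [np.linspace(lo, hi, BINS - 1) for lo, hi in zip(LOW, HIGH)]

def discretize(obs):
    """Return one bin index per observation dimension."""
    return tuple(int(np.digitize(x, edges)) for x, edges in zip(obs, EDGES))

q_table = np.zeros((BINS,) * 4 + (2,))  # CartPole has 2 discrete actions
state = discretize(np.array([0.01, -0.2, 0.03, 0.1]))
print(state, q_table[state])
```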
python gymnasium/cartpole_q_learning.py

Evaluation renders by default.
Trains a Q-learning agent on a toy chess endgame (King + Queen vs King).
Code: chess/chess_q_learning.py
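One way to keep a toy endgame tabular is to index states by piece squares. The encoding below is a hypothetical illustration, not necessarily what chess/chess_q_learning.py does:

```python
# Hypothetical state encoding for KQ vs K: the squares (0-63) of the
# white king, white queen, and black king, plus a sparse Q-table.
from typing import Dict, Tuple

State = Tuple[int, int, int]  # (white_king_sq, white_queen_sq, black_king_sq)
q_table: Dict[Tuple[State, int], float] = {}  # (state, action) -> value

def q_value(state: State, action: int) -> float:
    # Unseen state-action pairs default to 0.0.
    return q_table.get((state, action), 0.0)
```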
python chess/chess_q_learning.py

Each step builds on the previous one; diagrams show simplified dataflow.
Topics:
- Value-based RL (Q-learning intuition and DQN basics; see the update sketch after the diagram)
- Exploration vs exploitation
- Why policy optimization is needed
- Where classical RL begins to fail for LLMs
Diagram:
flowchart LR
Env["Environment"] -->|"State s(t)"| Agent["Agent"]
Agent -->|"Action a(t)"| Env
Env -->|"Reward r(t+1)"| Agent
Env -->|"State s(t+1)"| Agent
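The value-based item above boils down to a one-line temporal-difference update plus epsilon-greedy exploration. A minimal sketch:

```python
import numpy as np

def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step: nudge Q(s, a) toward the bootstrapped target."""
    target = reward + gamma * np.max(q[next_state])
    q[state][action] += alpha * (target - q[state][action])

def choose_action(q, state, n_actions, epsilon=0.1, rng=None):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))  # explore
    return int(np.argmax(q[state]))          # exploit
```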
Topics:
- Dataset formatting (prompt/response pairs)
- Loss functions (cross-entropy; see the masking sketch after the diagram)
- Establishing evaluation baselines
- How SFT turns a pretrained model into an instruction follower
Diagram:
flowchart LR
Pretrained["Pretrained Model"] --> Train["SFT Training"]
Data["SFT Data (Instruction + Response)"] --> Train
Train --> Instruction["Instruction-Following Model"]
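Conceptually, the SFT loss is token-level cross-entropy on the response with prompt tokens masked out. A minimal PyTorch-style sketch (label value -100 follows cross_entropy's ignore_index convention):

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, input_ids, prompt_len):
    """Cross-entropy on response tokens only.

    logits: (seq_len, vocab) model outputs, input_ids: (seq_len,) token ids,
    prompt_len: number of prompt tokens excluded from the loss.
    """
    labels = input_ids.clone()
    labels[:prompt_len] = -100  # ignore prompt positions
    # Shift so every position predicts the next token.
    return F.cross_entropy(logits[:-1], labels[1:], ignore_index=-100)

# Toy example with random logits.
vocab, seq_len, prompt_len = 32, 10, 4
logits = torch.randn(seq_len, vocab)
input_ids = torch.randint(0, vocab, (seq_len,))
print(sft_loss(logits, input_ids, prompt_len))
```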
Includes:
- DPO
- IPO
- KTO
- ORPO
Focus:
- Align models directly using preference pairs (see the DPO sketch after the diagram)
- Often cheaper and more stable than PPO
Diagram:
flowchart LR
Prompt[User Prompt] --> Base[Base Model]
Base --> Responses[Candidate Responses]
Responses --> Prefs[Preference Labels]
Prefs --> Opt[Preference Optimization]
Base --> Opt
Opt --> Aligned[Aligned Model]
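As a concrete member of this family, the DPO objective rewards the policy for widening its chosen-vs-rejected log-probability margin relative to a frozen reference model. A minimal sketch, assuming per-sequence log-probabilities are already computed:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO: push the policy's (chosen - rejected) margin above the reference's."""
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy batch of summed sequence log-probabilities.
batch = 4
print(dpo_loss(torch.randn(batch), torch.randn(batch),
               torch.randn(batch), torch.randn(batch)))
```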
Coverage:
- Reward model training from preference data
- PPO-based RLHF (see the clipped-objective sketch after the diagram)
- GRPO-style policy optimization (group-relative baselines; no separate value model, and often no learned reward model when verifiable rewards are available)
- Why RL can still help (reasoning, safety shaping, controllability)
Diagram:
flowchart LR
SFT[Supervised Fine-Tuning] --> PPO[PPO RLHF]
Pref[Preference Data] --> RM[Reward Model Training]
RM --> PPO
PPO --> Aligned[Final Aligned Model]
SFT --> GRPO[GRPO Optimization]
Pref --> GRPO
GRPO --> Aligned
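Both optimizers in the diagram share a clipped policy-gradient core; GRPO's main twist is a group-relative baseline in place of a learned value model. A minimal sketch of the two pieces:

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective used in PPO-style RLHF (per-sample log-probs)."""
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()

def grpo_advantages(rewards):
    """GRPO-style baseline: normalize rewards within a group of responses
    sampled for the same prompt, instead of training a value model."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

rewards = torch.tensor([0.1, 0.9, 0.4, 0.7])   # rewards for one prompt's group
advantages = grpo_advantages(rewards)
logp_old = torch.randn(4)
logp_new = logp_old + 0.05 * torch.randn(4)
print(ppo_clip_loss(logp_new, logp_old, advantages))
```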
- Replaces human labels with LLM judgement (see the judge-loop sketch after the diagram)
- Enables preference optimization at scale
- Reduces dependence on human annotators
Diagram:
flowchart LR
Prompt[User Prompt] --> Policy[Policy Model]
Policy --> Responses[Candidate Responses]
Responses --> Judge[Judge Model]
Judge --> Prefs[AI Preference Labels]
Prefs --> Update[Preference Optimization]
Policy --> Update
Update --> Updated[Updated Policy Model]
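The loop itself is simple enough to sketch; generate and judge_prefers_first below are hypothetical placeholders for whatever model-serving calls are actually used:

```python
from typing import Callable, List, Tuple

def collect_ai_preferences(
    prompts: List[str],
    generate: Callable[[str], str],                         # policy: prompt -> response
    judge_prefers_first: Callable[[str, str, str], bool],   # judge: (prompt, a, b) -> bool
) -> List[Tuple[str, str, str]]:
    """Build (prompt, chosen, rejected) triples with an LLM judge instead of humans."""
    dataset = []
    for prompt in prompts:
        a, b = generate(prompt), generate(prompt)  # two candidate responses
        chosen, rejected = (a, b) if judge_prefers_first(prompt, a, b) else (b, a)
        dataset.append((prompt, chosen, rejected))
    return dataset  # feed into DPO/IPO/etc. exactly like human-labeled pairs
```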
- Constitutional AI foundations
- Self-critique loops guided by a ruleset (a single pass is sketched after the diagram)
- Safety layered across the pipeline
Diagram:
flowchart LR
Output[Model Output] --> Critique[Self-Critique]
Constitution[Constitution / Rules] --> Critique
Critique --> Revise[Revision]
Revise --> Safer[Safer Output]
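A single critique-and-revise pass, sketched with a placeholder generate call and illustrative rules (not a real constitution):

```python
from typing import Callable, List

CONSTITUTION: List[str] = [
    "Do not provide instructions that could cause physical harm.",
    "Acknowledge uncertainty instead of guessing.",
]

def critique_and_revise(prompt: str, draft: str, generate: Callable[[str], str]) -> str:
    """Ask the model to critique its own draft against the rules, then revise it."""
    rules = "\n".join(f"- {r}" for r in CONSTITUTION)
    critique = generate(
        f"Rules:\n{rules}\n\nPrompt: {prompt}\nDraft answer: {draft}\n"
        "List any ways the draft violates the rules."
    )
    revised = generate(
        f"Prompt: {prompt}\nDraft answer: {draft}\nCritique: {critique}\n"
        "Rewrite the answer so it follows the rules."
    )
    return revised  # revised outputs can then feed SFT or preference data
```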
- Standardized evaluation harnesses
- Benchmark frameworks
- Reasoning, helpfulness, and safety scoring (a minimal scoring loop is sketched after the diagram)
Diagram:
flowchart LR
Model --> Bench[Benchmark Suite]
Bench --> Metrics[Metrics + Regression Tracking]
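At its simplest, a harness is a scoring loop over a fixed benchmark. The sketch below assumes an exact-match QA format and a placeholder generate function:

```python
from typing import Callable, Dict, List

def exact_match_accuracy(
    benchmark: List[Dict[str, str]],   # [{"prompt": ..., "answer": ...}, ...]
    generate: Callable[[str], str],
) -> float:
    """Fraction of benchmark prompts the model answers exactly (case-insensitive)."""
    correct = sum(
        generate(ex["prompt"]).strip().lower() == ex["answer"].strip().lower()
        for ex in benchmark
    )
    return correct / len(benchmark)

# Tracking this number per checkpoint is the regression signal in the diagram.
```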
- Alignment-preserving distillation (soft-label loss sketched after the diagram)
- Smaller, deployable aligned models
- Practical deployment-focused tradeoffs
Diagram:
flowchart LR
Teacher[Large Aligned Model] --> Data[Distillation Data]
Student[Smaller Model] --> Distill[Distillation Training]
Data --> Distill
Distill --> Small[Smaller Aligned Model]
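One common recipe is soft-label distillation: a KL term between temperature-softened teacher and student distributions. A minimal PyTorch-style sketch:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 so gradient magnitudes stay comparable across temperatures."""
    t = temperature
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    teacher_p = F.softmax(teacher_logits / t, dim=-1)
    return F.kl_div(student_logp, teacher_p, reduction="batchmean") * (t * t)

# Toy batch of (batch, vocab) logits.
teacher = torch.randn(4, 32)
student = torch.randn(4, 32)
print(distillation_loss(student, teacher))
```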