A custom GPT architecture implementation built from first principles, featuring memory-efficient training and deep interpretability analysis.
Read the full technical breakdown on Medium: *Building a 163M-Parameter GPT Model from Scratch*
- Project Overview
- Key Features & Optimizations
- Training Journey
- Interpretability & Analysis
- Getting Started
- Future Roadmap
This repository contains a manual implementation of the GPT-2 Transformer decoder, designed to explore the internal mechanics of Large Language Models without relying on high-level abstractions like Hugging Face's Trainer.
The project is optimized for a single NVIDIA T4 GPU, using mixed-precision training and gradient accumulation to stay within its memory budget. It also emphasizes mechanistic interpretability, with diagnostic tools that visualize semantic learning and highlight failure points.
| Feature | Description |
|---|---|
| Custom Causal Attention | Manual Query-Key-Value attention with masked self-attention (sketched below the table). |
| Mixed Precision (AMP) | `torch.cuda.amp` reduces VRAM usage by ~40% while preserving convergence. |
| Gradient Accumulation | Decouples batch size from GPU memory; enables effective batch sizes of 64+ on a single GPU. |
| Deep Inspection Hooks | Forward-pass hooks to extract and visualize attention scores without training overhead. |
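To make the first table row concrete, here is a minimal sketch of masked Q/K/V self-attention. The class name, shapes, and the `last_att` cache are illustrative assumptions, not the repo's actual module:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Masked multi-head self-attention built directly from Q/K/V projections."""

    def __init__(self, d_model: int, n_heads: int, max_len: int = 1024):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)  # fused Q, K, V projection
        self.proj = nn.Linear(d_model, d_model)
        # Lower-triangular mask: position i may only attend to positions <= i.
        mask = torch.tril(torch.ones(max_len, max_len)).view(1, 1, max_len, max_len)
        self.register_buffer("mask", mask)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # Reshape to (B, n_heads, T, head_dim) for per-head attention.
        q = q.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        # Scaled dot-product scores, causal mask applied before the softmax.
        att = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        self.last_att = att.detach()  # cached so inspection hooks can read it
        out = (att @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(out)
```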
- Dataset: Tiny Shakespeare
- Objective: Architectural validation
- Result: Loss 1.1 (Overfitting)
- Insight: Model captures Early Modern English grammar and vocabulary; attention masking and positional encodings validated.
- Dataset: Enwik8 (English Wikipedia)
- Compute: Single NVIDIA T4 (Google Colab)
- Constraint: 1.5 hours training time
- Result: Loss 3.4
- Challenge: The loss plateaued; rather than tuning hyperparameters blindly, an Analysis Suite was developed to diagnose the model's internal states.
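Fitting this run on a single T4 is what the mixed-precision and gradient-accumulation features in the table above are for. A minimal sketch of such a training loop, assuming a hypothetical `model(x, y)` that returns logits and loss (the real entry point is `train.py`):

```python
import torch

def train_epoch(model, loader, optimizer, accum_steps: int = 8):
    """One epoch with AMP and gradient accumulation (illustrative, not train.py)."""
    scaler = torch.cuda.amp.GradScaler()
    optimizer.zero_grad(set_to_none=True)
    for step, (x, y) in enumerate(loader):
        x, y = x.cuda(), y.cuda()
        # Autocast runs eligible ops in float16, cutting activation memory.
        with torch.cuda.amp.autocast():
            _, loss = model(x, y)  # assumes the model returns (logits, loss)
        # Divide so the accumulated gradient averages over micro-batches,
        # giving an effective batch of micro_batch_size * accum_steps.
        scaler.scale(loss / accum_steps).backward()
        if (step + 1) % accum_steps == 0:
            scaler.step(optimizer)  # unscales gradients, skips step on inf/NaN
            scaler.update()
            optimizer.zero_grad(set_to_none=True)
```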
A Depth Scan tool visualizes attention-head entropy across layers (a sketch of the hook mechanism follows this list):
- Diagonal and local attention patterns dominate.
- Observation: Model learns local dependencies, e.g., adjectives modifying nearby nouns.
- Final layers show vertical attention stripes ("Attention Sinks").
- Observation: Model over-focuses on high-frequency or dominant tokens, reducing semantic specificity.
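The Depth Scan can be reproduced with forward hooks that read each layer's cached attention map and reduce it to a per-head entropy. This sketch assumes the `CausalSelfAttention` class from the earlier sketch (with its `last_att` cache); the repo's actual hook API may differ:

```python
import torch

def depth_scan(model):
    """Attach hooks that record mean per-head attention entropy per layer."""
    entropies, handles = {}, []

    def make_hook(name):
        def hook(module, inputs, output):
            att = module.last_att                      # (B, heads, T, T), post-softmax
            ent = -(att * (att + 1e-9).log()).sum(-1)  # Shannon entropy over key axis
            entropies[name] = ent.mean(dim=(0, 2))     # average over batch and positions
        return hook

    for name, module in model.named_modules():
        if isinstance(module, CausalSelfAttention):    # class from the sketch above
            handles.append(module.register_forward_hook(make_hook(name)))
    return entropies, handles

# Usage: register the hooks, run one forward pass, inspect `entropies`,
# then call h.remove() on each handle. Low entropy in late layers
# corresponds to the vertical "attention sink" stripes noted above.
```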
### 1. Clone the Repository
```bash
git clone https://github.com/yourusername/gpt-from-scratch.git
cd gpt-from-scratch
```

### 2. Install Dependencies
```bash
pip install -r requirements.txt
```

### 3. Train the Model
```bash
python train.py
```
(Default configuration uses Mixed Precision & Gradient Accumulation.)

### 4. Analyze the Model
```bash
python analyze.py --checkpoint checkpoints/model_final.pt
```
Generates attention maps and performs a depth scan.
- RoPE Implementation: Replace absolute positional encodings with Rotary Embeddings (a reference sketch follows this list).
- DDP Support: Scale training across multiple GPUs using `DistributedDataParallel`.
- KV Caching: Optimize inference for longer sequence generation.
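For reference, a minimal sketch of the rotary-embedding roadmap item, using the interleaved-pair formulation from the RoFormer paper (illustrative only; not yet implemented in this repo):

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate channel pairs by a position-dependent angle.

    x: (batch, heads, T, head_dim), head_dim must be even.
    Applied to queries and keys before the attention dot product.
    """
    B, H, T, D = x.shape
    # One frequency per channel pair, decaying geometrically with depth.
    inv_freq = 1.0 / (base ** (torch.arange(0, D, 2, device=x.device).float() / D))
    angles = torch.arange(T, device=x.device).float()[:, None] * inv_freq[None, :]
    cos, sin = angles.cos(), angles.sin()          # each (T, D/2)
    x1, x2 = x[..., 0::2], x[..., 1::2]            # even/odd channel pairs
    rotated = torch.stack((x1 * cos - x2 * sin,    # 2D rotation of each pair
                           x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)                     # interleave back to (B, H, T, D)
```

Because the rotation is applied to `q` and `k` rather than added to the token embeddings, the attention scores depend only on relative offsets between positions.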
Built from first principles using PyTorch.