Skip to content

A simple mini LLM training project to train your LLM from scratch, from the dataset to the DPO without needing a datacenter.

License

Notifications You must be signed in to change notification settings

barbidoux/Lumi-Lab

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

25 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

πŸ€– Lumi-Lab - Production-Grade Mini-LLM Training Pipeline

A complete, config-driven codebase for training mini-LLMs (decoder-only transformers) optimized for personal machines with RTX 4090 GPUs. Implements a modular pipeline: corpus preparation, tokenizer training, pre-training, and SFT (Supervised Fine-Tuning). DPO (Direct Preference Optimization) is available as an experimental feature.

⚠️ EXPERIMENTAL PROJECT: Lumi-Lab is in active development and experimental phase. While pre-training and SFT have been tested on tiny models (23M params), this is a research/educational project, not production software. DPO implementation is untested. Use at your own risk and expect bugs. Contributions, testing, and feedback are welcome!

πŸ’‘ Works with more modest GPUs too! While optimized for RTX 4090 (16GB), this pipeline works on any NVIDIA GPU with CUDA support (even 8GB cards like RTX 3060). You'll just need more time and patience - reduce batch sizes, use gradient accumulation, and maybe grab an extra coffee β˜• while training.

✨ Key Features

  • βš™οΈ 100% Config-Driven: All hyperparameters in JSON configs (WebUI-ready architecture)
  • πŸ”§ Modular Pipeline: Corpus β†’ Tokenizer β†’ Packing β†’ Training β†’ Evaluation
  • 🌐 Multi-Dataset Training: Weighted sampling for curriculum learning
  • 🎯 Chinchilla-Optimal: Scientifically-based token budgets (20 tokens/parameter)
  • πŸ”’ Frozen Tokenizer System: One global tokenizer prevents subtle bugs
  • πŸ“Š Industrial SFT: Two-stage approach with LoRA for memory efficiency
  • ⚑ FlashAttention-2: ~50% VRAM reduction, automatic fallback to SDPA
  • πŸ”„ Production Checkpoints: Complete state preservation (RNG, optimizer, scheduler)
  • 🎲 Deterministic Training: Perfect reproducibility for research

πŸ“Š Benchmark Results

Pipeline validation on Micro model (9.4M parameters) - Full FROM SCRATCH test

End-to-End Pipeline Performance

Stage Script Duration Data Volume Key Metric
Corpus Preparation 10_prepare_corpus.py 30 min 200M tokens 166k tok/s streaming
Dataset Packing 30_pack_dataset.py 2 min 200M tokens 800 tok/s tokenization
Pre-training 40_pretrain.py 19.5 min 6,200 steps Loss: 6.80 β†’ 5.80
Pretrain Evaluation 50_evaluate_pretrain.py 12 sec 100 samples PPL: 609.86, BoolQ: 30%
SFT Corpus Prep 60_prepare_sft_corpus.py 2 min 6M tokens 3 datasets, 24k convos
SFT Training 70_train_sft.py 11m 50s 2,700 steps Loss: 4.70
SFT Evaluation 80_evaluate_sft.py 67.5 sec 50 samples PPL: 266.40, BoolQ: 49%
TOTAL - ~75 min 206M tokens Full pipeline validated

Quality Improvements (Pre-training β†’ Post-SFT)

Metric Pre-training Post-SFT Improvement
Perplexity 609.86 266.40 56% reduction
BoolQ Accuracy 30% 49% +19 points
Smoke Test Quality - 0.571/1.0 Meaningful responses

Note: Micro model (9.4M params) is designed for pipeline validation, not production. Larger models (23M+) achieve significantly better metrics following the same training protocol.

Tiny Model (23M Parameters) - Full Training Report

Complete end-to-end training executed December 12, 2025. See Full Training Report for detailed analysis.

Stage Duration Key Output
Tokenizer Training 3 min 32K vocab (SentencePiece)
Pretraining (30K steps) 2h 33m Loss: 10.5 β†’ 2.92
SFT Training (LoRA) 36 min Loss: 6.38 β†’ 2.93
Total Pipeline ~8 hours RTX 4090

Quality Improvements (23M Model)

Metric After Pretrain After SFT Improvement
Perplexity 587.78 137.32 -76.6%
BoolQ Accuracy 31% 37% +6 points
Smoke Test Quality N/A 0.51/1.0 8 categories tested

Smoke Test Category Breakdown

Category Score Category Score
Basic Knowledge 0.73 Instruction Following 0.53
Wikipedia Style 0.53 Conversation 0.53
Multilingual 0.53 Language Understanding 0.43
Creativity 0.43 Reasoning 0.37

Key Insight: The 23M model shows 4.3x perplexity improvement after SFT, demonstrating effective fine-tuning. However, small model capacity limits complex reasoning (0.37 score). For production use, scale to 124M+ parameters.


πŸ“‹ Table of Contents

πŸš€ Quick Start

Complete Pipeline - Tiny 23M Model (Chinchilla-Optimal)

# 0. Setup environment
export PYTHONPATH="${PYTHONPATH}:$(pwd)"
mkdir -p logs

# 1. Prepare tokenizer corpus (100M tokens from 4 datasets)
python scripts/10_prepare_corpus.py \
    --config config/pretrain/corpus/tokenizer_training_mix.json \
    --output-dir data/datasets/tokenizer_corpus 

# 2. Train global tokenizer (32K vocab - ONLY ONCE!)
python scripts/20_train_tokenizer.py \
    --config config/pretrain/tokenizer/spm32k.json \
    --output-dir data/models/tokenizers/spm_32k

# 3. Prepare pretrain corpus (600M tokens, Chinchilla-optimal)
python scripts/10_prepare_corpus.py \
    --config config/pretrain/corpus/tiny_23M_chinchilla_500M.json \
    --output-dir data/datasets/tiny_23M_corpus 

# 4. Pack dataset (tokenize + create sequences)
python scripts/30_pack_dataset.py \
    --config config/pretrain/packing/default.json \
    --corpus-dir data/datasets/tiny_23M_corpus \
    --tokenizer-dir data/models/tokenizers/spm_32k \
    --output-dir data/processed/tiny_23M_1024

# 5. Pre-train model (23M params, ~2-3 hours on RTX 4090)
accelerate launch --mixed_precision bf16 scripts/40_pretrain.py \
    --config config/pretrain/training/chinchilla_tiny_500m.json \
    --data_dirs data/processed/tiny_23M_1024 \
    --tokenizer_dir data/models/tokenizers/spm_32k \
    --output_dir checkpoints/pretrain

# 6. Evaluate
python scripts/50_evaluate_pretrain.py \
    --model_path checkpoints/pretrain/tiny/final \
    --tokenizer_path data/models/tokenizers/spm_32k \
    --output_dir evaluation_results/pretrain

Expected Results:

  • Loss: ~3.0-3.5 (600M tokens)
  • Perplexity: ~20-35 on validation
  • Training time: ~2-3 hours (RTX 4090, FlashAttention-2)
  • VRAM: ~6-8 GB

Fast Testing Pipeline (Demo Dataset)

# Quick validation with tiny dataset (~10M tokens, 15 minutes)
python scripts/10_prepare_corpus.py \
    --config config/pretrain/corpus/demo_fast.json \
    --output-dir data/datasets/demo_corpus

python scripts/20_train_tokenizer.py \
    --config config/pretrain/tokenizer/spm32k.json \
    --output-dir data/models/tokenizers/spm_32k

python scripts/30_pack_dataset.py \
    --config config/pretrain/packing/default.json \
    --corpus-dir data/datasets/demo_corpus \
    --tokenizer-dir data/models/tokenizers/spm_32k \
    --output-dir data/processed/demo_1024

accelerate launch --mixed_precision bf16 scripts/40_pretrain.py \
    --config config/pretrain/training/chinchilla_tiny_500m.json \
    --data_dirs data/processed/demo_1024 \
    --tokenizer_dir data/models/tokenizers/spm_32k \
    --output_dir checkpoints/demo \
    --num_workers 2

βš™οΈ Installation

System Requirements

  • GPU: RTX 4090 (24 GB VRAM) recommended
    • Also works on more modest GPUs (RTX 3060 8GB, RTX 3070 8GB, etc.)
    • You'll just need to reduce batch sizes and be patient β˜•
  • RAM: 32 GB recommended (minimum 16 GB)
  • Python: 3.11+
  • CUDA: 11.8+ or 12.1+
  • Disk Space: 100 GB free (datasets + checkpoints)

Quick Install

git clone https://github.com/your-username/Lumi-Lab.git
cd Lumi-Lab
pip install -r requirements.txt

FlashAttention-2 Installation (Recommended)

FlashAttention-2 provides ~50% VRAM reduction and ~30% speedup:

# For CUDA 12.1+
pip install flash-attn --no-build-isolation

# For CUDA 11.8
pip install flash-attn==2.5.8 --no-build-isolation

Automatic Fallback: If FlashAttention-2 is not available, Lumi-Lab automatically falls back to PyTorch SDPA (Scaled Dot-Product Attention). No code changes needed.

Requirements:

  • CUDA 11.6+ or CUDA 12.x
  • PyTorch 2.0+ with CUDA
  • GPU architecture SM 8.0+ (RTX 30/40 series)

Installation Help: FlashAttention GitHub

Verify Installation

python -c "import torch; print(f'PyTorch: {torch.__version__}'); print(f'CUDA: {torch.cuda.is_available()}')"
python -c "from transformers import __version__; print(f'Transformers: {__version__}')"
python -c "from accelerate import __version__; print(f'Accelerate: {__version__}')"

Expected output:

PyTorch: 2.3.0+cu121
CUDA: True
Transformers: 4.40.0+
Accelerate: 0.30.0+

πŸ”’ One Tokenizer to Rule Them All

Lumi-Lab implements a frozen tokenizer system that ensures perfect consistency across all datasets and prevents subtle bugs from tokenizer mismatches.

🎯 Why This Matters

Traditional multi-dataset training often fails silently due to:

  • ❌ Different tokenizers across datasets β†’ incompatible token spaces
  • ❌ Vocabulary size mismatches β†’ index out of bounds errors
  • ❌ Special token inconsistencies β†’ corrupted training signals
  • ❌ Silent degradation β†’ models that "work" but perform poorly

Our Solution: Train one global tokenizer once, freeze it, use it everywhere.

πŸ”¨ Workflow

# 1. Train the global tokenizer (ONLY ONCE at project start)
python scripts/20_train_tokenizer.py \
    --config config/pretrain/tokenizer/spm32k.json \
    --output-dir data/models/tokenizers/spm_32k

# 2. Prepare all datasets with the frozen tokenizer
python scripts/30_pack_dataset.py \
    --config config/pretrain/packing/default.json \
    --corpus-dir data/datasets/corpus_A \
    --tokenizer-dir data/models/tokenizers/spm_32k \
    --output-dir data/processed/corpus_A_1024

python scripts/30_pack_dataset.py \
    --config config/pretrain/packing/default.json \
    --corpus-dir data/datasets/corpus_B \
    --tokenizer-dir data/models/tokenizers/spm_32k \
    --output-dir data/processed/corpus_B_1024

# 3. Train with guaranteed consistency
accelerate launch scripts/40_pretrain.py \
    --config config/pretrain/training/chinchilla_tiny_500m.json \
    --data_dirs data/processed/corpus_A_1024 data/processed/corpus_B_1024 \
    --data_weights 0.5 0.5 \
    --tokenizer_dir data/models/tokenizers/spm_32k \
    --output_dir checkpoints/pretrain

πŸ›‘οΈ Built-in Safety Features

  • SHA256 Verification: Every dataset includes tokenizer hash
  • Blocking Errors: Training fails immediately on mismatch (no silent bugs)
  • Automatic Validation: Checked at dataset load time
  • Clear Error Messages: Exact fix instructions provided

πŸ“„ Tokenizer Metadata

Every packed dataset includes verification metadata in final_manifest.json:

{
  "tokenizer_config_hash": "a1b2c3d4e5f6...",
  "statistics": {
    "vocab_size": 32768,
    "total_tokens": 600000824,
    "total_sequences": 585938
  }
}

❌ Common Mistakes (Now Prevented!)

# ❌ OLD WAY: Inconsistent tokenizers
python scripts/10_prepare_corpus.py --config dataset_A.json  # Creates tokenizer A
python scripts/10_prepare_corpus.py --config dataset_B.json  # Creates tokenizer B
# β†’ Silent incompatibility, degraded training

# βœ… NEW WAY: One frozen tokenizer
python scripts/20_train_tokenizer.py --config tokenizer_config.json  # THE tokenizer
python scripts/30_pack_dataset.py --tokenizer-dir tokenizer/ ...     # Uses THE tokenizer
python scripts/30_pack_dataset.py --tokenizer-dir tokenizer/ ...     # Uses THE tokenizer
# β†’ Guaranteed consistency

πŸ—οΈ Project Structure

Lumi-Lab/
β”œβ”€β”€ config/                                # All configurations (100% config-driven)
β”‚   β”œβ”€β”€ architectures/                     # Model architectures
β”‚   β”‚   β”œβ”€β”€ tiny.json                     # 23M params (6 layers, 256 dim)
β”‚   β”‚   β”œβ”€β”€ small.json                    # 42M params (12 layers, 512 dim)
β”‚   β”‚   └── base.json                     # 124M params (24 layers, 768 dim)
β”‚   β”œβ”€β”€ pretrain/                          # Pre-training configs
β”‚   β”‚   β”œβ”€β”€ corpus/                        # Corpus preparation configs
β”‚   β”‚   β”‚   β”œβ”€β”€ tokenizer_training_mix.json      # 100M tokens for tokenizer
β”‚   β”‚   β”‚   β”œβ”€β”€ tiny_23M_chinchilla_500M.json    # 600M tokens (Chinchilla-optimal)
β”‚   β”‚   β”‚   β”œβ”€β”€ model_124M_phase_a.json          # Phase A: high-quality (1.5B tokens)
β”‚   β”‚   β”‚   └── model_124M_phase_b.json          # Phase B: diversity (1.5B tokens)
β”‚   β”‚   β”œβ”€β”€ sources/                       # Data source configs
β”‚   β”‚   β”‚   β”œβ”€β”€ c4_english.json
β”‚   β”‚   β”‚   β”œβ”€β”€ gutenberg_books.json
β”‚   β”‚   β”‚   β”œβ”€β”€ fineweb_edu.json
β”‚   β”‚   β”‚   └── vietgpt_wikipedia.json
β”‚   β”‚   β”œβ”€β”€ tokenizer/                     # Tokenizer training
β”‚   β”‚   β”‚   └── spm32k.json               # 32K vocab SentencePiece
β”‚   β”‚   β”œβ”€β”€ packing/                       # Dataset packing
β”‚   β”‚   β”‚   └── default.json              # Sequence length, split ratio
β”‚   β”‚   └── training/                      # Training hyperparameters
β”‚   β”‚       └── chinchilla_tiny_500m.json # Complete training config
β”‚   β”œβ”€β”€ sft/                               # Supervised fine-tuning
β”‚   β”‚   β”œβ”€β”€ datasets/                      # SFT dataset configs
β”‚   β”‚   β”‚   β”œβ”€β”€ alpaca.json
β”‚   β”‚   β”‚   β”œβ”€β”€ dolly15k.json
β”‚   β”‚   β”‚   β”œβ”€β”€ oasst1.json
β”‚   β”‚   β”‚   └── mixed_sft.json
β”‚   β”‚   └── training/                      # SFT training configs (LoRA)
β”‚   β”‚       β”œβ”€β”€ lora_tiny_23m.json
β”‚   β”‚       β”œβ”€β”€ lora_small.json
β”‚   β”‚       └── lora_base.json
β”‚   β”œβ”€β”€ evaluation/                        # Evaluation configs
β”‚   β”‚   β”œβ”€β”€ sft_standard.json
β”‚   β”‚   └── smoke_test_prompts.json
β”‚   └── dpo/                               # DPO configs
β”‚       β”œβ”€β”€ datasets/                      # DPO dataset configs
β”‚       β”‚   β”œβ”€β”€ orca_dpo.json
β”‚       β”‚   └── hh_rlhf.json
β”‚       └── training/                      # DPO training configs
β”‚           └── dpo_tiny.json
β”‚
β”œβ”€β”€ scripts/                               # Training pipeline scripts
β”‚   β”œβ”€β”€ 10_prepare_corpus.py              # Clean, deduplicate, shard corpus
β”‚   β”œβ”€β”€ 20_train_tokenizer.py             # Train SentencePiece tokenizer (32K vocab)
β”‚   β”œβ”€β”€ 30_pack_dataset.py                # Tokenize + pack into sequences
β”‚   β”œβ”€β”€ 40_pretrain.py                    # Pre-train from scratch (multi-dataset)
β”‚   β”œβ”€β”€ 50_evaluate_pretrain.py           # Evaluate pretrained model
β”‚   β”œβ”€β”€ 60_prepare_sft_corpus.py          # Prepare SFT corpus (templates)
β”‚   β”œβ”€β”€ 70_train_sft.py                   # SFT with LoRA adapters
β”‚   β”œβ”€β”€ 80_evaluate_sft.py                # Evaluate SFT model
β”‚   β”œβ”€β”€ 90_prepare_dpo_corpus.py          # Prepare DPO corpus (⚠️ experimental, untested)
β”‚   β”œβ”€β”€ 95_train_dpo.py                   # DPO alignment training (⚠️ experimental, untested)
β”‚   β”œβ”€β”€ 98_evaluate_dpo.py                # Evaluate DPO model (⚠️ experimental, untested)
β”‚   └── 100_serve.py                      # Inference server (CLI + API)
β”‚
β”œβ”€β”€ utils/                                 # Shared utilities
β”‚   β”œβ”€β”€ dataset_utils.py                  # PackedDataset, MultiDatasetSampler
β”‚   β”œβ”€β”€ model_utils.py                    # Model creation, checkpoints
β”‚   β”œβ”€β”€ sft_templates.py                  # ChatML, Instruct, Alpaca templates
β”‚   β”œβ”€β”€ sft_evaluation.py                 # SFT-specific evaluation
β”‚   β”œβ”€β”€ dpo_utils.py                      # DPO multi-dataset loading, metrics
β”‚   β”œβ”€β”€ dpo_evaluation.py                 # DPO-specific evaluation
β”‚   └── debug/                            # Debug utilities
β”‚       β”œβ”€β”€ corpus_cache.py
β”‚       β”œβ”€β”€ tokenizer_validation.py
β”‚       └── adaptive_estimation.py
β”‚
β”œβ”€β”€ data/                                  # Data directory (created automatically)
β”‚   β”œβ”€β”€ datasets/                          # Raw corpus (cleaned, sharded)
β”‚   β”‚   β”œβ”€β”€ tokenizer_corpus/             # 100M tokens for tokenizer training
β”‚   β”‚   └── tiny_23M_corpus/              # 600M tokens for pretraining
β”‚   β”œβ”€β”€ models/                            # Trained models
β”‚   β”‚   └── tokenizers/                   # Tokenizers
β”‚   β”‚       └── spm_32k/                  # 32K vocab SentencePiece
β”‚   β”‚           β”œβ”€β”€ spm32k.model
β”‚   β”‚           β”œβ”€β”€ spm32k.vocab
β”‚   β”‚           β”œβ”€β”€ tokenizer_config.json
β”‚   β”‚           └── TOKENIZER_CARD.md     # Tokenizer documentation
β”‚   └── processed/                         # Packed datasets (tokenized sequences)
β”‚       β”œβ”€β”€ tiny_23M_1024/                # 585K sequences, 1024 tokens each
β”‚       └── demo_1024/                    # Demo dataset
β”‚
β”œβ”€β”€ checkpoints/                           # Model checkpoints
β”‚   β”œβ”€β”€ pretrain/                          # Pretrained models
β”‚   β”‚   └── tiny/
β”‚   β”‚       β”œβ”€β”€ step_5000/
β”‚   β”‚       β”œβ”€β”€ step_10000/
β”‚   β”‚       └── final/
β”‚   └── sft/                               # Fine-tuned models
β”‚       └── tiny_lora/
β”‚
β”œβ”€β”€ evaluation_results/                    # Evaluation outputs
β”‚   β”œβ”€β”€ pretrain/
β”‚   └── sft/
β”‚
β”œβ”€β”€ logs/                                  # Training logs
β”‚
β”œβ”€β”€ CLAUDE.md                              # Complete technical documentation
β”œβ”€β”€ MANIFESTO.md                           # Project philosophy
β”œβ”€β”€ SECURITY.md                            # Security guidelines
β”œβ”€β”€ CITATION.md                            # Academic citation
β”œβ”€β”€ Makefile                               # Convenience commands
β”œβ”€β”€ requirements.txt                       # Python dependencies
└── README.md                              # This file

🧠 Architecture

LLaMA-like Decoder-Only Transformer

Lumi-Lab implements a modern decoder-only transformer following LLaMA design principles:

Core Components

  • πŸ”§ RMSNorm: Root Mean Square normalization (faster, more stable than LayerNorm)
  • ⚑ SwiGLU: Swish-Gated Linear Units activation (superior to ReLU/GELU)
  • πŸ“ RoPE: Rotary Position Embeddings (better length extrapolation)
  • 🚫 No Bias Terms: Cleaner scaling, fewer parameters
  • 🎯 Causal Attention: Autoregressive text generation
  • πŸ’Ύ FlashAttention-2: Memory-efficient attention with ~50% VRAM savings

Why LLaMA Architecture?

  • βœ… Proven: State-of-the-art results from 6M to 70B+ parameters
  • βœ… Efficient: Optimized for modern GPU training
  • βœ… Scalable: Consistent design principles across all model sizes
  • βœ… Compatible: Full HuggingFace Transformers integration
  • βœ… Memory-Optimal: RMSNorm + no bias + FlashAttention-2

Model Architecture Diagram

flowchart TB
    subgraph Input["Input Processing"]
        A[Token IDs] --> B[Embedding Layer<br/>vocab=32768, d_model]
        B --> C[RoPE Position<br/>Encoding]
    end

    subgraph Block["Transformer Block Γ— N layers"]
        C --> D[RMSNorm]
        D --> E[Multi-Head Attention<br/>head_dim=64, causal]
        E --> F[+ Residual]
        F --> G[RMSNorm]
        G --> H[SwiGLU FFN<br/>d_ff = 4 Γ— d_model]
        H --> I[+ Residual]
    end

    subgraph Output["Output Layer"]
        I --> J[Final RMSNorm]
        J --> K[LM Head<br/>vocab=32768]
        K --> L[Logits]
    end

    style A fill:#e3f2fd
    style L fill:#e8f5e9
    style E fill:#fff3e0
    style H fill:#fce4ec
Loading

πŸ“– For complete documentation, see CLAUDE.md

πŸ“Š Model Configurations

Model Layers d_model Heads head_dim FFN Parameters VRAM (bf16) Chinchilla Tokens
tiny 6 256 4 64 1024 23M ~6-8 GB ~500M (20Γ—)
small 12 512 8 64 2048 42M ~8-12 GB ~840M (20Γ—)
base 24 768 12 64 3072 124M ~12-18 GB ~2.5B (20Γ—)

Scaling Strategy

Consistent design principles:

  • head_dim = 64 (constant across all sizes)
  • d_ff = d_model Γ— 4 (FFN ratio = 4.0)
  • vocab_size = 32768 (SentencePiece)
  • sequence_length = 1024 (adjustable)

Chinchilla-Optimal Training

Based on Chinchilla paper, optimal token count β‰ˆ 20 tokens per parameter:

  • 23M model: ~500M tokens (600M for safety margin)
  • 42M model: ~840M tokens (~1B recommended)
  • 124M model: ~2.5B tokens (~3B recommended)

Training with fewer tokens = underfitting. Training with more = diminishing returns.

πŸ“– For complete training guidelines, see CLAUDE.md

πŸ”„ Complete Training Pipeline

Pipeline Overview

flowchart LR
    subgraph Corpus["πŸ“š Corpus Preparation"]
        A[Raw Data<br/>HuggingFace] --> B[10_prepare_corpus.py<br/>Clean, Dedupe, Shard]
        B --> C[Cleaned JSONL<br/>+ Manifest]
    end

    subgraph Tokenizer["πŸ”€ Tokenization"]
        C --> D[20_train_tokenizer.py<br/>SentencePiece 32K]
        D --> E[spm_32k.model<br/>SHA256 verified]
        C --> F[30_pack_dataset.py<br/>Tokenize + Pack]
        E --> F
        F --> G[Binary Sequences<br/>.bin + .idx]
    end

    subgraph Training["πŸš€ Training Pipeline"]
        G --> H[40_pretrain.py<br/>Multi-dataset, Deterministic]
        H --> I[Base Model<br/>Checkpointed]
        I --> J[60_prepare_sft_corpus.py<br/>ChatML Templates]
        J --> K[70_train_sft.py<br/>LoRA Adapters]
        K --> L[SFT Model<br/>Aligned]
    end

    subgraph Eval["πŸ“Š Evaluation"]
        I --> M[50_evaluate_pretrain.py]
        L --> N[80_evaluate_sft.py]
        M --> O[Perplexity, BoolQ]
        N --> O
    end

    style A fill:#e1f5fe
    style L fill:#c8e6c9
    style O fill:#fff3e0
Loading

Stage 1: Corpus Preparation

Purpose: Transform raw datasets into cleaned, deduplicated, sharded text corpus.

python scripts/10_prepare_corpus.py \
    --config config/pretrain/corpus/tiny_23M_chinchilla_500M.json \
    --output-dir data/datasets/tiny_23M_corpus \
    --shard-size 10000 \
    --log-level INFO

Operations:

  • Loading from HuggingFace datasets
  • Text cleaning (control chars, URLs, duplicates)
  • MinHash deduplication (configurable threshold)
  • Quality filtering (length, language detection)
  • Sharding (memory-efficient processing)

Output: data/datasets/tiny_23M_corpus/

  • Sharded JSONL files (shard_0000.jsonl, shard_0001.jsonl, ...)
  • manifest.json (metadata, statistics)
  • DATA_CARD.md (documentation)

Stage 2: Tokenizer Training

Purpose: Train a single global SentencePiece tokenizer (32K vocabulary).

python scripts/20_train_tokenizer.py \
    --config config/pretrain/tokenizer/spm32k.json \
    --output-dir data/models/tokenizers/spm_32k \
    --log-level INFO

Critical: Run this ONLY ONCE at project start. All datasets must use the same frozen tokenizer.

Output: data/models/tokenizers/spm_32k/

  • spm32k.model (SentencePiece model)
  • spm32k.vocab (vocabulary)
  • tokenizer_config.json (HuggingFace format)
  • TOKENIZER_CARD.md (documentation + SHA256 hash)

Quality Metrics:

  • Compression ratio: ~3.5-4.5 chars/token
  • UNK rate: <0.1%
  • Overall quality: 95-100/100

Stage 3: Dataset Packing

Purpose: Tokenize corpus and pack into fixed-length sequences for training.

python scripts/30_pack_dataset.py \
    --config config/pretrain/packing/default.json \
    --corpus-dir data/datasets/tiny_23M_corpus \
    --tokenizer-dir data/models/tokenizers/spm_32k \
    --output-dir data/processed/tiny_23M_1024 \
    --log-level INFO

Operations:

  • Token encoding with frozen tokenizer
  • Sequence packing (length=1024)
  • Train/val split (95/5 default)
  • Binary format (.bin + .idx) for memory-mapped loading
  • Tokenizer SHA256 verification

Output: data/processed/tiny_23M_1024/

  • train.bin + train.idx (training data)
  • val.bin + val.idx (validation data)
  • final_manifest.json (metadata + tokenizer hash)

Stage 4: Pre-training

Purpose: Train the language model from scratch.

accelerate launch --mixed_precision bf16 scripts/40_pretrain.py \
    --config config/pretrain/training/chinchilla_tiny_500m.json \
    --data_dirs data/processed/tiny_23M_1024 \
    --tokenizer_dir data/models/tokenizers/spm_32k \
    --output_dir checkpoints/pretrain \
    --num_workers 4 \
    --log-level INFO

Features:

  • Multi-dataset training: Weighted sampling for curriculum learning
  • Deterministic training: Full RNG state preservation
  • Auto-resume: Resumes from latest checkpoint automatically
  • Progress tracking: Shows equivalent epochs, overfitting detection
  • Evaluation: Validation perplexity every eval_steps
  • Checkpointing: Complete state saved every save_steps

Expected Metrics (23M model, 600M tokens):

  • Training time: ~2-3 hours (RTX 4090, FlashAttention-2)
  • Final loss: ~3.0-3.5
  • Validation perplexity: ~20-35
  • VRAM usage: ~6-8 GB (bf16 + FlashAttention-2)

Multi-Dataset Example:

accelerate launch --mixed_precision bf16 scripts/40_pretrain.py \
    --config config/pretrain/training/chinchilla_tiny_500m.json \
    --data_dirs data/processed/corpus_A_1024 data/processed/corpus_B_1024 \
    --data_weights 0.7 0.3 \
    --tokenizer_dir data/models/tokenizers/spm_32k \
    --output_dir checkpoints/pretrain \
    --num_workers 4

This trains with 70% corpus A, 30% corpus B (deterministic weighted sampling).

Resume Training:

# Automatic: finds latest checkpoint
accelerate launch --mixed_precision bf16 scripts/40_pretrain.py \
    --config config/pretrain/training/chinchilla_tiny_500m.json \
    --data_dirs data/processed/tiny_23M_1024 \
    --tokenizer_dir data/models/tokenizers/spm_32k \
    --output_dir checkpoints/pretrain \
    --resume_from_checkpoint auto

# Manual: specify checkpoint path
accelerate launch --mixed_precision bf16 scripts/40_pretrain.py \
    --config config/pretrain/training/chinchilla_tiny_500m.json \
    --data_dirs data/processed/tiny_23M_1024 \
    --tokenizer_dir data/models/tokenizers/spm_32k \
    --output_dir checkpoints/pretrain \
    --resume_from_checkpoint checkpoints/pretrain/tiny/step_10000

Stage 5: Evaluation

Purpose: Evaluate pretrained model with perplexity and benchmarks.

python scripts/50_evaluate_pretrain.py \
    --model_path checkpoints/pretrain/tiny/final \
    --tokenizer_path data/models/tokenizers/spm_32k \
    --output_dir evaluation_results/pretrain/tiny \
    --log-level INFO

Metrics:

  • WikiText-2 Perplexity: Standard language modeling metric
  • BoolQ Zero-shot: Yes/No reasoning capability
  • Generation Quality: Coherence, fluency, factuality

Expected Results (23M model):

  • Perplexity: ~20-35 (WikiText-2)
  • BoolQ Accuracy: ~55-65% (random = 50%)

πŸ“– Detailed Usage Guide

Config-Driven Philosophy

All hyperparameters are in JSON configs. The CLI only accepts:

  • βœ… Paths: --config, --data_dirs, --tokenizer_dir, --output_dir
  • βœ… Runtime flags: --num_workers, --log-level, --resume_from_checkpoint
  • ❌ No hyperparameters: --learning_rate, --batch_size, etc. β†’ Use config files!

Why?

  • WebUI-Ready: Easy to build UI on top (just file pickers + buttons)
  • Reproducible: All experiments fully documented in configs
  • Version-Controlled: Configs tracked in git, hyperparameters never lost
  • No CLI Hell: No 50-argument commands

Multi-Dataset Training Strategy

Curriculum Learning: Train on multiple datasets with different characteristics.

# 1. Prepare multiple corpuses
python scripts/10_prepare_corpus.py \
    --config config/pretrain/corpus/high_quality.json \
    --output-dir data/datasets/high_quality_corpus

python scripts/10_prepare_corpus.py \
    --config config/pretrain/corpus/diverse.json \
    --output-dir data/datasets/diverse_corpus

# 2. Pack both with the same tokenizer
python scripts/30_pack_dataset.py \
    --config config/pretrain/packing/default.json \
    --corpus-dir data/datasets/high_quality_corpus \
    --tokenizer-dir data/models/tokenizers/spm_32k \
    --output-dir data/processed/high_quality_1024

python scripts/30_pack_dataset.py \
    --config config/pretrain/packing/default.json \
    --corpus-dir data/datasets/diverse_corpus \
    --tokenizer-dir data/models/tokenizers/spm_32k \
    --output-dir data/processed/diverse_1024

# 3. Train with weighted sampling (70% high-quality, 30% diverse)
accelerate launch --mixed_precision bf16 scripts/40_pretrain.py \
    --config config/pretrain/training/chinchilla_tiny_500m.json \
    --data_dirs data/processed/high_quality_1024 data/processed/diverse_1024 \
    --data_weights 0.7 0.3 \
    --tokenizer_dir data/models/tokenizers/spm_32k \
    --output_dir checkpoints/pretrain_multidataset \
    --num_workers 4

Benefits:

  • Better generalization
  • Reduced overfitting
  • Curriculum learning (easy β†’ hard)
  • Domain adaptation

Supervised Fine-Tuning (SFT)

Two-stage industrial SFT pipeline with LoRA for memory efficiency.

Stage 1: Prepare SFT Corpus

python scripts/60_prepare_sft_corpus.py \
    --config config/sft/datasets/alpaca.json \
    --tokenizer-dir data/models/tokenizers/spm_32k \
    --output-dir data/sft_processed/alpaca_1024 \
    --log-level INFO

Features:

  • Conversation templates (ChatML, Instruct, Alpaca)
  • Quality filtering
  • Tokenization with frozen global tokenizer
  • Proper label masking (only train on assistant responses)

Stage 2: Train with LoRA

accelerate launch --mixed_precision bf16 scripts/70_train_sft.py \
    --config config/sft/training/lora_tiny_23m.json \
    --model_path checkpoints/pretrain/tiny/final \
    --data_dirs data/sft_processed/alpaca_1024 \
    --tokenizer_dir data/models/tokenizers/spm_32k \
    --output_dir checkpoints/sft/tiny_alpaca \
    --log-level INFO

LoRA Benefits:

  • ~95% VRAM reduction vs full fine-tuning
  • Fast training (10-20x speedup)
  • Modular adapters (swap for different use cases)
  • No catastrophic forgetting of pretrain knowledge

Multi-Dataset SFT:

accelerate launch --mixed_precision bf16 scripts/70_train_sft.py \
    --config config/sft/training/lora_tiny_23m.json \
    --model_path checkpoints/pretrain/tiny/final \
    --data_dirs data/sft_processed/alpaca_1024 data/sft_processed/oasst1_1024 \
    --data_weights 0.6 0.4 \
    --tokenizer_dir data/models/tokenizers/spm_32k \
    --output_dir checkpoints/sft/tiny_mixed \
    --num_workers 4

πŸ“– For complete SFT configuration, see CLAUDE.md

Direct Preference Optimization (DPO)

⚠️ EXPERIMENTAL - UNTESTED: The DPO implementation is experimental and has not been validated. Code is complete and follows the same architecture as pretrain/SFT, but requires real-world testing. Use at your own risk. Contributions and testing are welcome!

DPO aligns models with human preferences using chosen/rejected response pairs. Run after SFT to improve helpfulness, harmlessness, and instruction-following quality.

Step 9a: Prepare DPO Corpus

python scripts/90_prepare_dpo_corpus.py \
    --config config/dpo/datasets/orca_dpo.json \
    --tokenizer_dir data/models/tokenizers/spm_32k \
    --output_dir data/dpo_processed/orca

Available datasets (config/dpo/datasets/):

  • orca_dpo.json: Intel ORCA DPO pairs (~13K, GPT-4 distilled) - Recommended for tiny/small models
  • hh_rlhf.json: Anthropic Helpful-Harmless RLHF (~161K pairs)

Output: Sharded JSONL corpus with manifest + tokenizer verification

Step 9b: DPO Training

accelerate launch --mixed_precision bf16 scripts/95_train_dpo.py \
    --config config/dpo/training/dpo_tiny.json \
    --model_path checkpoints/sft/tiny/final \
    --data_dirs data/dpo_processed/orca \
    --tokenizer_dir data/models/tokenizers/spm_32k \
    --output_dir checkpoints/dpo/tiny

Key config parameters (config/dpo/training/dpo_tiny.json):

  • beta=0.1: KL penalty (higher = stronger preference enforcement)
  • learning_rate=5e-7: Much lower than SFT (careful alignment)
  • max_steps=2000: DPO converges faster than SFT

Multi-dataset training:

--data_dirs data/dpo_processed/orca data/dpo_processed/hh_rlhf \
--data_weights 0.7 0.3

VRAM: ~8-10GB (tiny model with LoRA + gradient checkpointing) Time: ~2-3 hours on RTX 4090 (2K steps)

Step 9c: Evaluate DPO

python scripts/98_evaluate_dpo.py \
    --config config/evaluation/dpo_standard.json \
    --model_path checkpoints/dpo/tiny/final \
    --tokenizer_dir data/models/tokenizers/spm_32k \
    --eval_data_dir data/dpo_processed/orca \
    --output_file evaluation_results/dpo/tiny/results.json

DPO-specific metrics:

  • Reward margin: Difference between chosen/rejected rewards (higher = better alignment)
  • Win rate: % of times chosen > rejected (should be > 0.5)
  • Perplexity comparison: Chosen vs rejected responses
  • BoolQ accuracy: General reasoning benchmark
  • Generation quality: Smoke tests + comparison prompts

Expected results (tiny 23M model):

  • Reward margin: > 0.2
  • Win rate: > 0.70
  • BoolQ accuracy: ~72-76% (slight improvement over SFT)

Inference and Serving

Interactive CLI

python scripts/100_serve.py \
    --model_path checkpoints/sft/tiny_alpaca/final \
    --tokenizer_path data/models/tokenizers/spm_32k \
    --mode chat \
    --temperature 0.7 \
    --top_p 0.9 \
    --max_length 256

Modes:

  • chat: Interactive conversational interface
  • generate: Single prompt generation
  • api: REST API server (FastAPI)

REST API Server

python scripts/100_serve.py \
    --model_path checkpoints/sft/tiny_alpaca/final \
    --tokenizer_path data/models/tokenizers/spm_32k \
    --mode api \
    --port 8000

Example Request:

curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Explain quantum computing in simple terms.",
    "max_length": 200,
    "temperature": 0.7
  }'

🎯 GPU Optimizations & Lower-End Hardware

Lumi-Lab is optimized for RTX 4090 (24 GB VRAM) but works on any NVIDIA GPU with CUDA support. Even 8GB cards like RTX 3060 or RTX 3070 will work - you'll just need to adjust batch sizes, use gradient accumulation, and grab more coffee β˜• while training takes longer.

Memory Optimization Techniques

Technique VRAM Reduction Speed Impact When to Use
FlashAttention-2 ~50% +30% speed Always (if available)
bf16 Mixed Precision ~50% +20% speed Always (default)
Gradient Checkpointing ~40% -10% speed Large models or OOM
LoRA (SFT) ~95% +200% speed SFT stage
Gradient Accumulation Variable Neutral Increase effective batch size

VRAM Usage by Model Size

Model Parameters Batch=8 (bf16 + Flash) Batch=4 (bf16 + Flash) Batch=2 (bf16 + Flash) Works on 8GB GPU?
tiny 23M ~6-8 GB ~4-5 GB ~3-4 GB βœ… Yes
small 42M ~8-12 GB ~5-7 GB ~4-5 GB βœ… Yes (batch=2)
base 124M ~12-18 GB ~8-10 GB ~6-7 GB ⚠️ Tight (batch=1-2)

Note for 8GB GPUs: Start with tiny model, batch_size=2-4, gradient_accumulation=8-16. It will work, just take longer!

Troubleshooting OOM (Out of Memory)

If you encounter OOM errors:

  1. Enable FlashAttention-2 (if not already):

    pip install flash-attn --no-build-isolation
  2. Reduce batch size in config:

    {
      "training_params": {
        "per_device_train_batch_size": 4,  // Was 8
        "gradient_accumulation_steps": 8   // Was 4 (keeps effective batch size)
      }
    }
  3. Enable gradient checkpointing in config:

    {
      "hardware_params": {
        "gradient_checkpointing": true
      }
    }
  4. Use smaller sequence length:

    {
      "packing_params": {
        "sequence_length": 512  // Was 1024
      }
    }
  5. Reduce num_workers (dataloader workers use RAM):

    --num_workers 2  # Or even 1

Performance Tuning

For Maximum Speed:

{
  "hardware_params": {
    "use_flash_attn": true,
    "bf16": true,
    "gradient_checkpointing": false,
    "dataloader_num_workers": 4
  },
  "training_params": {
    "per_device_train_batch_size": 16,
    "gradient_accumulation_steps": 2
  }
}

For Minimum VRAM:

{
  "hardware_params": {
    "use_flash_attn": true,
    "bf16": true,
    "gradient_checkpointing": true,
    "dataloader_num_workers": 1
  },
  "training_params": {
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 8
  }
}

Effective Batch Size = batch_size Γ— gradient_accumulation_steps Γ— num_gpus

Example: 4 Γ— 8 Γ— 1 = 32 (same as batch=32 in one step, but uses 1/8th the memory)

πŸ”§ Troubleshooting

Common Issues

1. Tokenizer Mismatch Error

ValueError: Tokenizer mismatch between datasets
   Dataset hash:   a1b2c3d4...
   Tokenizer hash: e5f6g7h8...

Fix:

# Re-pack the dataset with the correct tokenizer
python scripts/30_pack_dataset.py \
    --config config/pretrain/packing/default.json \
    --corpus-dir data/datasets/problem_corpus \
    --tokenizer-dir data/models/tokenizers/spm_32k \
    --output-dir data/processed/problem_dataset_1024

2. CUDA Out of Memory

RuntimeError: CUDA out of memory. Tried to allocate X GB

Fix: See RTX 4090 Optimizations section above.

3. FlashAttention Not Available

WARNING: FlashAttention-2 not available, falling back to SDPA

This is OK! SDPA (PyTorch native) works fine, just uses more VRAM.

To fix (optional):

pip install flash-attn --no-build-isolation

4. NaN/Inf Loss During Training

Step 1234: loss=nan, lr=3.0e-04

Causes:

  • Learning rate too high
  • Mixed precision instability
  • Bad data

Fix:

  1. Reduce learning rate in config:

    {
      "training_params": {
        "learning_rate": 1e-4  // Was 3e-4
      }
    }
  2. Increase gradient clipping:

    {
      "training_params": {
        "max_grad_norm": 0.5  // Was 1.0
      }
    }
  3. Check data quality: Inspect corpus for corrupted samples

5. Resume Training Not Working

FileNotFoundError: checkpoint not found

Fix:

# Use auto-detection
accelerate launch scripts/40_pretrain.py \
    --config config.json \
    --data_dirs data/processed/dataset \
    --tokenizer_dir tokenizer/ \
    --output_dir checkpoints/pretrain \
    --resume_from_checkpoint auto  # Finds latest automatically

# Or specify exact path
--resume_from_checkpoint checkpoints/pretrain/tiny/step_10000

Getting Help

If you encounter issues:

  1. Check logs: logs/ directory contains detailed training logs
  2. Check configs: Verify all paths and hyperparameters
  3. Check GPU: nvidia-smi to monitor VRAM/utilization
  4. Check documentation: CLAUDE.md for complete command reference
  5. Open an issue: GitHub Issues

πŸ“ˆ Monitoring and Evaluation

Training Metrics

Monitor these metrics during training:

  • Loss: Should decrease steadily (target: ~3.0-3.5 for 600M tokens)
  • Learning Rate: Should increase during warmup, then decay
  • Perplexity: Should decrease (target: ~20-35 for 23M model)
  • Equivalent Epochs: Watch for overfitting after ~3-4 epochs
  • VRAM Usage: Should be stable (no memory leaks)

Progress Tracking During Training

During training, you'll see:

Step 2500: loss=4.23, lr=2.5e-04
πŸ“Š Evaluation...
   Progress: 2500/30000 steps (8.3%), ~1.28 epochs
   Validation perplexity: 45.32

Step 5000: loss=3.87, lr=3.0e-04
πŸ’Ύ Checkpoint saved at step 5000 -> checkpoints/pretrain/tiny/step_5000
πŸ“Š Evaluation...
   Progress: 5000/30000 steps (16.7%), ~2.56 epochs
   Validation perplexity: 38.15

Progress Information:

  • Current step / total steps
  • Progress percentage
  • Equivalent epochs (for overfitting detection!)
  • Validation perplexity

Overfitting Detection:

  • If perplexity increases after ~3-4 equivalent epochs β†’ stop early
  • If perplexity plateaus β†’ consider more data or longer training
  • If perplexity decreases steadily β†’ continue training

Evaluation Benchmarks

Perplexity (WikiText-2):

  • Random model: >10,000
  • Undertrained: 100-200
  • Good: 20-50
  • Excellent: <20

BoolQ Zero-shot:

  • Random: 50%
  • Undertrained: 50-55%
  • Good: 60-70%
  • Excellent: >70%

Expected Performance Targets

Model Training Time Final Loss Perplexity BoolQ VRAM
tiny (23M, 600M tokens) 2-3h 3.0-3.5 20-35 55-65% 6-8 GB
small (42M, 1B tokens) 8-12h 2.5-3.0 15-25 65-75% 8-12 GB
base (124M, 3B tokens) 24-48h 2.0-2.5 10-20 70-80% 12-18 GB

πŸ”¬ Technical Reference

Script Reference

Script Purpose Input Output
10_prepare_corpus.py Corpus preparation HuggingFace datasets JSONL shards + manifest
20_train_tokenizer.py Tokenizer training JSONL corpus SentencePiece model
30_pack_dataset.py Dataset packing JSONL + tokenizer Binary sequences (.bin/.idx)
40_pretrain.py Pre-training Packed data Model checkpoints
50_evaluate_pretrain.py Pretrain evaluation Model checkpoint Metrics JSON
60_prepare_sft_corpus.py SFT corpus prep Instruct datasets Formatted conversations
70_train_sft.py SFT training Base model + SFT data LoRA adapters
80_evaluate_sft.py SFT evaluation SFT model Comprehensive metrics
90_prepare_dpo_corpus.py DPO corpus prep ⚠️ Preference data Triplet format
95_train_dpo.py DPO training ⚠️ SFT model + DPO data Aligned model
98_evaluate_dpo.py DPO evaluation ⚠️ DPO model Preference metrics
100_serve.py Model serving Any model CLI/API interface

Configuration Hierarchy

config/
β”œβ”€β”€ architectures/           # Model sizes (micro, tiny, small, base)
β”‚   └── *.json              # n_layer, d_model, n_head, vocab_size
β”‚
β”œβ”€β”€ pretrain/
β”‚   β”œβ”€β”€ corpus/             # Data sources + token budgets
β”‚   β”œβ”€β”€ tokenizer/          # SentencePiece params (vocab_size, model_type)
β”‚   β”œβ”€β”€ packing/            # Sequence length, train/val split
β”‚   └── training/           # LR, batch, optimizer, scheduler
β”‚
β”œβ”€β”€ sft/
β”‚   β”œβ”€β”€ datasets/           # Sources, templates, quality filters
β”‚   └── training/           # LoRA params (r, alpha, target_modules)
β”‚
β”œβ”€β”€ dpo/                    # ⚠️ Experimental
β”‚   β”œβ”€β”€ datasets/           # Preference pair configs
β”‚   └── training/           # DPO-specific params (beta, reference)
β”‚
└── evaluation/             # Metrics, generation params, benchmarks

Key Design Patterns

Pattern Implementation Benefit
Config-Driven All hyperparams in JSON WebUI-ready, reproducible
Manifest Validation SHA256 tokenizer hashes Prevents silent mismatches
Memory Streaming Sharded JSONL + .bin/.idx Handles unlimited corpus sizes
Deterministic Training Full RNG state preservation Perfect reproducibility
Multi-Dataset Sampling Weighted interleaving Curriculum learning
LoRA Adapters Rank-decomposed fine-tuning 95% VRAM reduction

πŸ“š Additional Documentation

🀝 Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

πŸ“„ License

This project is licensed under the Good Luck With That Public License - see the LICENSE file for details.

πŸ™ Acknowledgments

  • LLaMA (Meta AI): Architecture inspiration
  • FlashAttention-2 (Dao et al.): Memory-efficient attention
  • Chinchilla (DeepMind): Optimal scaling laws
  • HuggingFace: Transformers library
  • PyTorch Team: PyTorch framework

β˜• Support the Project

If you find Lumi-Lab useful and want to support continued development, consider buying me a coffee β˜•!

This project is built with limited resources (one laptop and lots of coffee) and your support helps me spend more time:

  • πŸ”§ Improving the pipeline and documentation
  • ⚑ Adding new features and optimizations
  • πŸ“š Creating educational content and tutorials
  • β˜• Keeping the coffee machine running

ko-fi

πŸ“§ Contact


Built with ❀️ (and β˜•) for the ML community

"I don't have an excavator, I have a fork. But technical complexity has never been an obstacle, only an exciting puzzle. I would rather dig happily with a fork than wait for someone to hand me a bulldozer."

Train your own mini-LLM, learn the fundamentals, and have fun!

About

A simple mini LLM training project to train your LLM from scratch, from the dataset to the DPO without needing a datacenter.

Topics

Resources

License

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published