Introduction to generative language modeling using an n-gram model.
This project is an assignment for the Park Tudor data science class. See assignment.md for detailed instructions.
This repo requires Python 3.12 or later. There are no additional dependencies.
| Name | Description |
|---|---|
| assignment.md | The instructions for the assignment |
| tiny_shakespeare.txt | The dataset we use to train our language model |
| -- | -- |
| dataset.py | Utilities for loading and splitting the dataset |
| model.py | The n-gram model implementation |
| -- | -- |
| train.py | A CLI script to train the model |
| generate.py | A CLI script to generate text with the model |
| grade.py | A CLI script to grade the assignment |
| -- | -- |
| grading_utils.py | Utilities for grading; can be ignored |
The Tiny Shakespeare dataset was downloaded from Andrej Karpathy's GitHub.
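
For readers new to the idea, here is a minimal sketch of what a character-level n-gram model does: count which character follows each context of `n - 1` characters, then sample from those counts to generate text. This is an illustrative toy only. The function names `train_ngram` and `generate` are hypothetical and are not the interface defined in `model.py` or required by `assignment.md`.

```python
import random
from collections import Counter, defaultdict


def train_ngram(text: str, n: int) -> dict[str, Counter]:
    """Count which character follows each (n - 1)-character context."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for i in range(len(text) - n + 1):
        context = text[i : i + n - 1]
        next_char = text[i + n - 1]
        counts[context][next_char] += 1
    return counts


def generate(
    counts: dict[str, Counter],
    seed: str,
    length: int,
    n: int,
    rng: random.Random,
) -> str:
    """Extend `seed` by up to `length` characters sampled from the counts."""
    out = seed
    for _ in range(length):
        # The context is the last n - 1 characters generated so far.
        context = out[len(out) - (n - 1):]
        followers = counts.get(context)
        if not followers:
            break  # Unseen context: this toy model simply stops.
        chars = list(followers)
        weights = list(followers.values())
        # Sample the next character in proportion to how often it was seen.
        out += rng.choices(chars, weights=weights, k=1)[0]
    return out


if __name__ == "__main__":
    # Hypothetical toy corpus, not the Tiny Shakespeare dataset.
    model = train_ngram("abababab", n=2)
    print(generate(model, seed="a", length=5, n=2, rng=random.Random(0)))
    # prints "ababab"
```

A real implementation would likely also need a strategy for unseen contexts (for example backing off to a shorter context), which this sketch deliberately leaves out.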