This project is a minimalist implementation of a GPT-style language model built entirely from scratch in Python using PyTorch. The design closely follows the Transformer architecture described in the "Attention Is All You Need" paper, with tokenization handled by OpenAI's tiktoken library. It is intended as an educational yet functional base for understanding and experimenting with transformer-based LLMs.
- GPT architecture (customized GPT-2 style)
- Single transformer block for simplicity and efficiency
- Token and positional embeddings
- Multi-head self-attention layers
- Feed-forward layers
- Layer normalization
- Dropout regularization
- Training loop with validation
- Supports custom datasets
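A minimal sketch of how these components fit together is shown below. The class and argument names (`GPTModel`, `emb_dim`, `n_heads`, `context_len`) and the hyperparameter values are illustrative assumptions, not necessarily the exact ones used in this repository:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, emb_dim, n_heads, dropout):
        super().__init__()
        self.norm1 = nn.LayerNorm(emb_dim)
        self.attn = nn.MultiheadAttention(emb_dim, n_heads, dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(emb_dim)
        self.ff = nn.Sequential(
            nn.Linear(emb_dim, 4 * emb_dim), nn.GELU(), nn.Linear(4 * emb_dim, emb_dim)
        )
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        # Causal mask: each position may only attend to itself and earlier positions.
        seq_len = x.size(1)
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + self.drop(attn_out)                      # residual around attention
        x = x + self.drop(self.ff(self.norm2(x)))        # residual around feed-forward
        return x

class GPTModel(nn.Module):
    def __init__(self, vocab_size, emb_dim=768, n_heads=12, context_len=256, dropout=0.1):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, emb_dim)     # token embeddings
        self.pos_emb = nn.Embedding(context_len, emb_dim)    # positional embeddings
        self.drop = nn.Dropout(dropout)
        self.block = TransformerBlock(emb_dim, n_heads, dropout)  # single block
        self.final_norm = nn.LayerNorm(emb_dim)
        self.out_head = nn.Linear(emb_dim, vocab_size, bias=False)

    def forward(self, idx):
        # idx: (batch, seq_len) of token ids
        seq_len = idx.size(1)
        pos = torch.arange(seq_len, device=idx.device)
        x = self.drop(self.tok_emb(idx) + self.pos_emb(pos))
        x = self.block(x)
        return self.out_head(self.final_norm(x))         # logits over the vocabulary
```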
This model was trained on the text of a Harry Potter book, consisting of:
- 66,569 characters
- 14,708 tokens (after tokenization)
This relatively compact dataset allows for quick experimentation while still providing rich language patterns for the model to learn.
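For reference, the counts above can be reproduced with tiktoken. The file name below is a placeholder, and the GPT-2 encoding is assumed:

```python
import tiktoken

# Placeholder path; substitute the text file you actually train on.
with open("dataset.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

enc = tiktoken.get_encoding("gpt2")   # assuming the GPT-2 BPE vocabulary
token_ids = enc.encode(raw_text)
print(f"{len(raw_text):,} characters, {len(token_ids):,} tokens")
```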
You can train this model on:
- Public domain books (e.g., from Project Gutenberg)
- Small dialogue datasets like `blended_skill_talk`
- Movie quotes or other curated text files
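A rough sketch of preparing such a corpus is shown below; the file path is a placeholder, and the `blended_skill_talk` field names (`free_messages`, `guided_messages`) are taken from its dataset card and may need adjusting:

```python
from datasets import load_dataset

# Option 1: a local plain-text file (e.g., a Project Gutenberg download).
with open("book.txt", "r", encoding="utf-8") as f:   # placeholder path
    raw_text = f.read()

# Option 2: a small dialogue dataset from the Hugging Face Hub.
ds = load_dataset("blended_skill_talk", split="train")
raw_text = "\n".join(
    utterance
    for example in ds
    for utterance in example["free_messages"] + example["guided_messages"]
)
```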
Install the dependencies:

```
pip install torch tiktoken datasets tqdm
```

Modify p1.py to load and preprocess your dataset, and then run:

```
python main.py
```

You can save the model using:

```python
torch.save(model.state_dict(), "gpt_model_124M.pth")
```

Create a script like test.py to load the model weights and generate text based on a starting prompt.
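A sketch of what such a script might look like follows. The import path, the `GPTModel` constructor arguments, and the context length are assumptions that should be matched to your own training code:

```python
# test.py -- minimal greedy-generation sketch.
import torch
import tiktoken
from main import GPTModel  # hypothetical import; point this at your model definition

enc = tiktoken.get_encoding("gpt2")
model = GPTModel(vocab_size=enc.n_vocab, emb_dim=768, n_heads=12, context_len=256)
model.load_state_dict(torch.load("gpt_model_124M.pth", map_location="cpu"))
model.eval()

prompt = "Harry looked at the"
ids = torch.tensor(enc.encode(prompt)).unsqueeze(0)   # shape: (1, seq_len)

with torch.no_grad():
    for _ in range(50):                       # generate 50 new tokens
        logits = model(ids[:, -256:])         # crop to the assumed context window
        next_id = torch.argmax(logits[:, -1, :], dim=-1, keepdim=True)  # greedy pick
        ids = torch.cat([ids, next_id], dim=1)

print(enc.decode(ids.squeeze(0).tolist()))
```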
Make sure to include the following in your .gitignore to avoid pushing large or unnecessary files:
```
__pycache__/
*.pyc
*.pth
*.pkl
*.pt
.env
venv/
datasets/
*.log
.cache/
.ipynb_checkpoints
```
Below is the transformer block architecture used in this implementation:
- This model uses a single transformer block rather than the 12 blocks of the original GPT-2 (small) model, to reduce computational overhead.
- Each block contains masked multi-head attention, layer norms, and feed-forward networks.
- Input is tokenized and passed through token & positional embeddings.
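To make the masking concrete, here is a from-scratch sketch of causal (single-head) scaled dot-product attention; it illustrates the idea rather than reproducing this repository's exact implementation:

```python
import math
import torch

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head causal attention sketch; x has shape (batch, seq_len, emb_dim)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v                         # project to Q, K, V
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))    # scaled dot products
    seq_len = x.size(1)
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))            # hide future tokens
    return torch.softmax(scores, dim=-1) @ v                    # weighted sum of values
```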
Note: This project is intended for educational purposes and is optimized for training on small to medium datasets on limited hardware.
