A complete PyTorch implementation and replication of the groundbreaking Vision Transformer (ViT) paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" applied to the FoodVision Mini dataset.
This project demonstrates how to replicate a state-of-the-art machine learning research paper from scratch, building a Vision Transformer architecture using PyTorch and applying it to image classification tasks. The implementation focuses on understanding the core concepts by building each component step-by-step.
- Getting Setup
- Data Preparation
- Model Architecture
- Training
- Evaluation
- Transfer Learning
- Results
- Requirements
- Usage
- Key Concepts
- Python 3.8+
- PyTorch 1.12+ (with CUDA support recommended)
- torchvision 0.13+
- GPU with CUDA support (recommended)
torch>=1.12.0
torchvision>=0.13.0
matplotlib
torchinfo

The notebook automatically handles dependency installation and downloads required helper modules from the pytorch-going-modular repository.
The project uses the FoodVision Mini dataset containing three food categories:
- 🍕 Pizza
- 🥩 Steak
- 🍣 Sushi
Dataset Statistics:
- Image size: 224×224 pixels (as per ViT paper specifications)
- Batch size: 32
- Data split: Train/Test
- Transforms: Resize and normalization
The dataset is automatically downloaded and prepared using the provided data setup utilities.
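For reference, a minimal sketch of what the manual transforms might look like, assuming the torchvision transforms API (the exact transform values in the notebook may differ):

```python
from torchvision import transforms

IMG_SIZE = 224   # image size used in the ViT paper
BATCH_SIZE = 32

# Resize images to 224x224 and convert them to tensors scaled to [0, 1].
manual_transforms = transforms.Compose([
    transforms.Resize((IMG_SIZE, IMG_SIZE)),
    transforms.ToTensor(),
])
```

These transforms are then passed to the dataloader helper shown in the Usage section below.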
The ViT architecture is built from four main mathematical equations, implemented as modular components:
Patch Embedding:
- Converts input images into a sequence of learnable patch embeddings
- Patch size: 16×16 pixels
- Adds positional embeddings and a class token

Multi-Head Self-Attention (MSA):
- Core attention mechanism of the Transformer
- Enables the model to focus on relevant image patches
- Multiple attention heads for diverse representation learning

Multilayer Perceptron (MLP):
- Feed-forward network component
- Used in Transformer Encoder blocks and the output layer
- Provides non-linear transformations

Transformer Encoder:
- Alternating layers of MSA and MLP
- Residual connections and layer normalization
- Multiple encoder blocks stacked together
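To illustrate how these components translate into code, here is a minimal sketch of a patch embedding layer and a pre-norm Transformer encoder block built from standard PyTorch modules. The class names and the ViT-Base hyperparameters (768-dim embeddings, 12 heads, MLP size 3072) are illustrative, not the notebook's exact code:

```python
import torch
from torch import nn

class PatchEmbedding(nn.Module):
    """Splits an image into fixed-size patches and projects each patch to an embedding vector."""
    def __init__(self, in_channels=3, patch_size=16, embedding_dim=768):
        super().__init__()
        # A Conv2d with kernel_size == stride == patch_size extracts non-overlapping
        # patches and linearly projects them in a single operation.
        self.patcher = nn.Conv2d(in_channels, embedding_dim,
                                 kernel_size=patch_size, stride=patch_size)
        self.flatten = nn.Flatten(start_dim=2, end_dim=3)  # flatten the spatial patch grid

    def forward(self, x):
        # (batch, channels, H, W) -> (batch, embedding_dim, H/P, W/P)
        x = self.patcher(x)
        # -> (batch, embedding_dim, num_patches) -> (batch, num_patches, embedding_dim)
        return self.flatten(x).permute(0, 2, 1)

class TransformerEncoderBlock(nn.Module):
    """Pre-norm Transformer encoder block: MSA + MLP, each with a residual connection."""
    def __init__(self, embedding_dim=768, num_heads=12, mlp_size=3072, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(embedding_dim)
        self.attn = nn.MultiheadAttention(embed_dim=embedding_dim, num_heads=num_heads,
                                          dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(embedding_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embedding_dim, mlp_size),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(mlp_size, embedding_dim),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        # Multi-head self-attention with residual connection
        x_norm = self.norm1(x)
        attn_out, _ = self.attn(query=x_norm, key=x_norm, value=x_norm, need_weights=False)
        x = x + attn_out
        # MLP block with residual connection
        return x + self.mlp(self.norm2(x))
```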
The implementation includes:
- Custom ViT: Built from scratch following the paper
- Pretrained ViT: Using torchvision's pretrained models for transfer learning
- Optimizer: Adam (as commonly used with Transformers)
- Loss Function: Cross-entropy loss
- Device: CUDA GPU (if available), CPU fallback
- Epochs: Configurable (typically 10-50 epochs)
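A minimal sketch of this configuration in code, assuming `vit_model` is the custom ViT instance created in the Usage section below; the learning rate and weight decay shown are illustrative values in the spirit of the paper, not necessarily the notebook's exact settings:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(params=vit_model.parameters(),
                             lr=3e-3,            # illustrative value
                             betas=(0.9, 0.999),
                             weight_decay=0.3)   # the paper trains with weight decay; tune as needed
```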
- Data Loading: Efficient DataLoaders with proper transforms
- Forward Pass: Through custom ViT architecture
- Loss Calculation: Cross-entropy for classification
- Backpropagation: Gradient computation and parameter updates
- Validation: Regular evaluation on test set
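The `engine.train` helper used in the notebook wraps a loop along these lines; the sketch below is a simplified single-epoch version, not the helper's exact code:

```python
def train_one_epoch(model, dataloader, loss_fn, optimizer, device):
    """Runs one training epoch: forward pass, loss, backpropagation, parameter update."""
    model.train()
    total_loss, total_acc = 0.0, 0.0
    for X, y in dataloader:
        X, y = X.to(device), y.to(device)
        y_logits = model(X)                  # forward pass through the ViT
        loss = loss_fn(y_logits, y)          # cross-entropy loss
        optimizer.zero_grad()
        loss.backward()                      # backpropagation
        optimizer.step()                     # parameter update
        total_loss += loss.item()
        total_acc += (y_logits.argmax(dim=1) == y).float().mean().item()
    return total_loss / len(dataloader), total_acc / len(dataloader)
```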
- Training/Validation Loss
- Training/Validation Accuracy
- Loss curves visualization
- Model performance comparison
- Loss and accuracy curves
- Sample predictions on test images
- Attention visualization (where applicable)
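Loss and accuracy curves can be plotted directly from the `results` dictionary returned by `engine.train`, assumed here to contain per-epoch `train_loss`, `test_loss`, `train_acc`, and `test_acc` lists:

```python
import matplotlib.pyplot as plt

def plot_loss_curves(results):
    """Plots training/test loss and accuracy side by side."""
    epochs = range(len(results["train_loss"]))

    plt.figure(figsize=(12, 5))
    plt.subplot(1, 2, 1)
    plt.plot(epochs, results["train_loss"], label="train_loss")
    plt.plot(epochs, results["test_loss"], label="test_loss")
    plt.title("Loss")
    plt.xlabel("Epochs")
    plt.legend()

    plt.subplot(1, 2, 2)
    plt.plot(epochs, results["train_acc"], label="train_acc")
    plt.plot(epochs, results["test_acc"], label="test_acc")
    plt.title("Accuracy")
    plt.xlabel("Epochs")
    plt.legend()
    plt.show()
```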
The project demonstrates the power of transfer learning by:
- Using pretrained ViT models from torchvision
- Fine-tuning on the FoodVision Mini dataset
- Comparing custom implementation vs. pretrained performance
Benefits of Transfer Learning:
- Faster convergence
- Better performance with limited data
- Reduced computational requirements
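A minimal sketch of the pretrained route using the torchvision 0.13+ weights API; the classifier-head replacement assumes the default ViT-B/16 layout in torchvision, so adjust if using a different variant:

```python
import torch
import torchvision

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load ViT-B/16 with pretrained ImageNet weights.
weights = torchvision.models.ViT_B_16_Weights.DEFAULT
pretrained_vit = torchvision.models.vit_b_16(weights=weights).to(device)

# Freeze the backbone so only the new classifier head is trained.
for param in pretrained_vit.parameters():
    param.requires_grad = False

# Replace the classification head to match FoodVision Mini's 3 classes.
class_names = ["pizza", "steak", "sushi"]
pretrained_vit.heads = torch.nn.Linear(in_features=768,
                                       out_features=len(class_names)).to(device)

# The pretrained weights ship with their own preprocessing transforms.
pretrained_transforms = weights.transforms()
```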
The notebook provides comprehensive results including:
- Model accuracy comparisons
- Training time analysis
- Loss convergence patterns
- Predictions on custom images (including the famous "pizza-dad" image)
- Open in Google Colab: Click the "Open in Colab" badge
- Local Setup:
git clone https://github.com/NANDAGOPALNG/Vision_Transformer_Paper_Replication
cd Vision_Transformer_Paper_Replication
jupyter notebook Vision_Transformers_paper_replicating.ipynb
# 1. Environment Setup
device = "cuda" if torch.cuda.is_available() else "cpu"
# 2. Data Preparation
train_dataloader, test_dataloader, class_names = data_setup.create_dataloaders(
train_dir=train_dir,
test_dir=test_dir,
transform=manual_transforms,
batch_size=BATCH_SIZE
)
# 3. Model Creation
vit_model = VisionTransformer(
img_size=224,
patch_size=16,
num_classes=len(class_names),
# ... other parameters
)
# 4. Training
results = engine.train(
model=vit_model,
train_dataloader=train_dataloader,
test_dataloader=test_dataloader,
optimizer=optimizer,
loss_fn=loss_fn,
epochs=epochs,
device=device
)

- Patch-based Processing: Treats image patches as "words" in a sequence
- Self-Attention: Captures global dependencies across the entire image
- No Convolutions: Pure transformer architecture for computer vision
- Scalability: Performance improves with larger datasets and models
- Modular Design: Each component built as separate, reusable classes
- Paper Fidelity: Close adherence to original paper specifications
- Educational Focus: Step-by-step breakdown for learning purposes
- Practical Application: Real-world dataset and evaluation
By working through this implementation, you'll understand:
- How to read and replicate machine learning research papers
- Vision Transformer architecture and its components
- PyTorch implementation of complex neural networks
- Transfer learning techniques and benefits
- Computer vision pipeline: data → model → training → evaluation
- Original ViT Paper: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
- Attention is All You Need: Original Transformer Paper
- Learn PyTorch: Educational Resource
Important: This implementation focuses on educational understanding rather than production optimization. The goal is to demonstrate how mathematical concepts from research papers translate into working code.
Feel free to contribute by:
- Improving code documentation
- Adding new features or optimizations
- Fixing bugs or issues
- Enhancing visualization tools
This project is for educational purposes. Please refer to the original paper and PyTorch licensing for commercial use.
Author: NANDAGOPALNG
Project Type: Educational Implementation
Framework: PyTorch
Domain: Computer Vision, Deep Learning, Transformer Architecture