
Vision Transformer Paper Replication

A complete PyTorch implementation and replication of the groundbreaking Vision Transformer (ViT) paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" applied to the FoodVision Mini dataset.

🎯 Project Overview

This project demonstrates how to replicate a state-of-the-art machine learning research paper from scratch, building a Vision Transformer architecture using PyTorch and applying it to image classification tasks. The implementation focuses on understanding the core concepts by building each component step-by-step.

📋 Table of Contents

  1. Getting Set Up
  2. Data Preparation
  3. Model Architecture
  4. Training
  5. Evaluation
  6. Transfer Learning
  7. Results
  8. Requirements
  9. Usage
  10. Key Concepts

🚀 Getting Set Up

Prerequisites

  • Python 3.8+
  • PyTorch 1.12+ (with CUDA support recommended)
  • torchvision 0.13+
  • GPU with CUDA support (recommended)

Dependencies

torch>=1.12.0
torchvision>=0.13.0
matplotlib
torchinfo

The notebook automatically handles dependency installation and downloads required helper modules from the pytorch-going-modular repository.
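
For a local setup, the same dependencies can also be installed manually with pip, for example:

pip install "torch>=1.12.0" "torchvision>=0.13.0" matplotlib torchinfo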

📊 Data Preparation

The project uses the FoodVision Mini dataset containing three food categories:

  • 🍕 Pizza
  • 🥩 Steak
  • 🍣 Sushi

Dataset Statistics:

  • Image size: 224×224 pixels (as per ViT paper specifications)
  • Batch size: 32
  • Data split: Train/Test
  • Transforms: Resize and normalization

The dataset is automatically downloaded and prepared using the provided data setup utilities.
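
A minimal sketch of how the transforms and DataLoaders described above might be created. It assumes the data_setup helper module is importable and that the dataset lives under a local data/pizza_steak_sushi/ folder (both illustrative assumptions; adjust the import path and directories to your setup):

from torchvision import transforms
from going_modular import data_setup  # helper module; import path may differ in your setup

IMG_SIZE = 224     # 224x224 images, as per the ViT paper
BATCH_SIZE = 32

# Resize images and convert them to tensors in [0, 1]
manual_transforms = transforms.Compose([
    transforms.Resize((IMG_SIZE, IMG_SIZE)),
    transforms.ToTensor(),
])

train_dataloader, test_dataloader, class_names = data_setup.create_dataloaders(
    train_dir="data/pizza_steak_sushi/train",  # illustrative path
    test_dir="data/pizza_steak_sushi/test",    # illustrative path
    transform=manual_transforms,
    batch_size=BATCH_SIZE
)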

🏗️ Model Architecture

Vision Transformer Components

The ViT architecture is built from four main mathematical equations, implemented as modular components:

1. Patch Embedding (Equation 1)

  • Converts input images into sequences of learnable patches
  • Patch size: 16×16 pixels
  • Adds positional embeddings and class token (see the sketch below)
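
A minimal sketch of the patch embedding layer, assuming ViT-Base hyperparameters (16×16 patches, 768-dimensional embeddings); the class and argument names are illustrative rather than the exact ones used in the notebook:

import torch
from torch import nn

class PatchEmbedding(nn.Module):
    """Turns a 2D image into a sequence of flattened, projected patches (Equation 1)."""
    def __init__(self, in_channels=3, patch_size=16, embedding_dim=768):
        super().__init__()
        # A Conv2d with kernel_size == stride == patch_size extracts and linearly
        # projects non-overlapping patches in a single operation.
        self.patcher = nn.Conv2d(in_channels, embedding_dim,
                                 kernel_size=patch_size, stride=patch_size)
        self.flatten = nn.Flatten(start_dim=2, end_dim=3)

    def forward(self, x):
        x = self.patcher(x)        # (batch, 768, 14, 14) for a 224x224 input
        x = self.flatten(x)        # (batch, 768, 196)
        return x.permute(0, 2, 1)  # (batch, 196, 768): a sequence of patch embeddings

The learnable class token and positional embeddings can then be added as nn.Parameter tensors before the sequence enters the encoder.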

2. Multi-Head Self-Attention (MSA) - Equation 2

  • Core attention mechanism of the Transformer
  • Enables the model to focus on relevant image patches
  • Multiple attention heads for diverse representation learning (see the sketch below)
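
A minimal sketch of the MSA block built on PyTorch's nn.MultiheadAttention (ViT-Base uses 12 heads); names and default values are illustrative:

from torch import nn

class MSABlock(nn.Module):
    """LayerNorm followed by multi-head self-attention (Equation 2)."""
    def __init__(self, embedding_dim=768, num_heads=12, attn_dropout=0.0):
        super().__init__()
        self.layer_norm = nn.LayerNorm(embedding_dim)
        self.attn = nn.MultiheadAttention(embed_dim=embedding_dim,
                                          num_heads=num_heads,
                                          dropout=attn_dropout,
                                          batch_first=True)  # inputs are (batch, seq, dim)

    def forward(self, x):
        x = self.layer_norm(x)
        attn_output, _ = self.attn(query=x, key=x, value=x, need_weights=False)
        return attn_output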

3. Multilayer Perceptron (MLP) - Equation 3

  • Feed-forward network component
  • Used in Transformer Encoder blocks and output layer
  • Provides non-linear transformations (see the sketch below)
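
A minimal sketch of the MLP block, using the ViT-Base hidden size of 3072; names and dropout values are illustrative:

from torch import nn

class MLPBlock(nn.Module):
    """LayerNorm followed by a two-layer feed-forward network with GELU (Equation 3)."""
    def __init__(self, embedding_dim=768, mlp_size=3072, dropout=0.1):
        super().__init__()
        self.layer_norm = nn.LayerNorm(embedding_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embedding_dim, mlp_size),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(mlp_size, embedding_dim),
            nn.Dropout(dropout)
        )

    def forward(self, x):
        return self.mlp(self.layer_norm(x))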

4. Transformer Encoder

  • Alternating layers of MSA and MLP
  • Residual connections and layer normalization
  • Multiple encoder blocks stacked together (see the sketch below)
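
A minimal sketch of one encoder block combining the MSA and MLP sub-blocks sketched above, each wrapped in a residual connection; a full ViT-Base stacks 12 such blocks:

from torch import nn

class TransformerEncoderBlock(nn.Module):
    """One Transformer Encoder block: MSA and MLP sub-blocks with residual connections."""
    def __init__(self, embedding_dim=768, num_heads=12, mlp_size=3072, dropout=0.1):
        super().__init__()
        self.msa_block = MSABlock(embedding_dim, num_heads)
        self.mlp_block = MLPBlock(embedding_dim, mlp_size, dropout)

    def forward(self, x):
        x = self.msa_block(x) + x   # residual connection around attention
        x = self.mlp_block(x) + x   # residual connection around the MLP
        return x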

Model Variants

The implementation includes:

  • Custom ViT: Built from scratch following the paper
  • Pretrained ViT: Using torchvision's pretrained models for transfer learning

🔧 Training

Training Configuration

  • Optimizer: Adam (as commonly used with Transformers; see the setup sketch below)
  • Loss Function: Cross-entropy loss
  • Device: CUDA GPU (if available), CPU fallback
  • Epochs: Configurable (typically 10-50 epochs)
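
A minimal sketch of that configuration; the hyperparameter values are illustrative starting points, and vit_model is assumed to have been created already (see the model creation code under Usage):

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

optimizer = torch.optim.Adam(params=vit_model.parameters(),  # vit_model created elsewhere
                             lr=3e-3,                        # illustrative learning rate
                             betas=(0.9, 0.999),
                             weight_decay=0.3)               # illustrative weight decay
loss_fn = torch.nn.CrossEntropyLoss()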

Training Process

  1. Data Loading: Efficient DataLoaders with proper transforms
  2. Forward Pass: Through custom ViT architecture
  3. Loss Calculation: Cross-entropy for classification
  4. Backpropagation: Gradient computation and parameter updates
  5. Validation: Regular evaluation on test set

📈 Evaluation

Metrics Tracked

  • Training/Validation Loss
  • Training/Validation Accuracy
  • Loss curves visualization
  • Model performance comparison

Visualization Tools

  • Loss and accuracy curves (see the plotting sketch below)
  • Sample predictions on test images
  • Attention visualization (where applicable)
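
A minimal sketch of plotting the loss curves, assuming the results dictionary returned by engine.train() (see Usage) holds per-epoch "train_loss" and "test_loss" lists (an assumption about the helper's output format):

import matplotlib.pyplot as plt

# results is assumed to be the dict returned by engine.train()
epochs_range = range(len(results["train_loss"]))

plt.figure(figsize=(8, 4))
plt.plot(epochs_range, results["train_loss"], label="train loss")
plt.plot(epochs_range, results["test_loss"], label="test loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.title("Loss curves")
plt.legend()
plt.show()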

🎯 Transfer Learning

The project demonstrates the power of transfer learning by:

  • Using pretrained ViT models from torchvision (see the sketch below)
  • Fine-tuning on the FoodVision Mini dataset
  • Comparing custom implementation vs. pretrained performance
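
A minimal sketch of the transfer learning setup using torchvision's pretrained ViT-B/16 (requires torchvision 0.13+); the freezing strategy and head replacement shown here are illustrative:

import torch
import torchvision

# Load a ViT-B/16 pretrained on ImageNet, along with its matching transforms
weights = torchvision.models.ViT_B_16_Weights.DEFAULT
pretrained_vit = torchvision.models.vit_b_16(weights=weights)
pretrained_transforms = weights.transforms()  # use these when building the DataLoaders

# Freeze the backbone so only the new classification head is trained
for param in pretrained_vit.parameters():
    param.requires_grad = False

# Replace the classification head for the 3 FoodVision Mini classes
pretrained_vit.heads = torch.nn.Linear(in_features=768, out_features=3)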

Benefits of Transfer Learning:

  • Faster convergence
  • Better performance with limited data
  • Reduced computational requirements

📊 Results

The notebook provides comprehensive results including:

  • Model accuracy comparisons
  • Training time analysis
  • Loss convergence patterns
  • Predictions on custom images (including the famous "pizza-dad" image)

💻 Usage

Running the Notebook

  1. Open in Google Colab: Click the "Open in Colab" badge
  2. Local Setup:
    git clone https://github.com/NANDAGOPALNG/Vision_Transformer_Paper_Replication
    cd Vision_Transformer_Paper_Replication
    jupyter notebook Vision_Transformers_paper_replicating.ipynb

Key Code Sections

# 1. Environment Setup
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# 2. Data Preparation  
train_dataloader, test_dataloader, class_names = data_setup.create_dataloaders(
    train_dir=train_dir,
    test_dir=test_dir,
    transform=manual_transforms,
    batch_size=BATCH_SIZE
)

# 3. Model Creation
vit_model = VisionTransformer(
    img_size=224,
    patch_size=16,
    num_classes=len(class_names),
    # ... other parameters
)

# 4. Training
results = engine.train(
    model=vit_model,
    train_dataloader=train_dataloader,
    test_dataloader=test_dataloader,
    optimizer=optimizer,
    loss_fn=loss_fn,
    epochs=epochs,
    device=device
)

🧠 Key Concepts

Vision Transformer Innovation

  • Patch-based Processing: Treats image patches as "words" in a sequence (see the worked example below)
  • Self-Attention: Captures global dependencies across the entire image
  • No Convolutions: Pure transformer architecture for computer vision
  • Scalability: Performance improves with larger datasets and models
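
For the 224×224 images and 16×16 patches used here, the sequence length and flattened patch dimension work out as follows:

IMG_SIZE, PATCH_SIZE, CHANNELS = 224, 16, 3

num_patches = (IMG_SIZE // PATCH_SIZE) ** 2     # 14 * 14 = 196 patches per image
patch_dim = PATCH_SIZE * PATCH_SIZE * CHANNELS  # 16 * 16 * 3 = 768 values per patch

print(f"Sequence length (patches per image): {num_patches}")
print(f"Flattened patch dimension: {patch_dim}")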

Implementation Highlights

  • Modular Design: Each component built as separate, reusable classes
  • Paper Fidelity: Close adherence to original paper specifications
  • Educational Focus: Step-by-step breakdown for learning purposes
  • Practical Application: Real-world dataset and evaluation

🎓 Learning Outcomes

By working through this implementation, you'll understand:

  • How to read and replicate machine learning research papers
  • Vision Transformer architecture and its components
  • PyTorch implementation of complex neural networks
  • Transfer learning techniques and benefits
  • Computer vision pipeline: data → model → training → evaluation

🔗 References

  • Dosovitskiy et al., "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" (https://arxiv.org/abs/2010.11929)
  • The pytorch-going-modular repository, source of the data_setup and engine helper modules used here

📝 Notes

Important: This implementation focuses on educational understanding rather than production optimization. The goal is to demonstrate how mathematical concepts from research papers translate into working code.

🀝 Contributing

Feel free to contribute by:

  • Improving code documentation
  • Adding new features or optimizations
  • Fixing bugs or issues
  • Enhancing visualization tools

📄 License

This project is for educational purposes. Please refer to the original paper and PyTorch licensing for commercial use.


Author: NANDAGOPALNG
Project Type: Educational Implementation
Framework: PyTorch
Domain: Computer Vision, Deep Learning, Transformer Architecture
