Specialized Track: Performance Engineering & Optimization for AI/ML Systems
Master GPU optimization, CUDA programming, model compression, and high-performance inference systems for production AI/ML workloads.
- Overview
- Learning Path
- Prerequisites
- Course Structure
- Projects
- Setup Instructions
- Learning Resources
- Assessment & Certification
- Community & Support
- Career Path
This learning repository is designed to transform senior AI infrastructure engineers into specialized AI/ML Performance Engineers: experts who optimize deep learning models and infrastructure for production deployment at scale.
- GPU Architecture & CUDA Programming: Master low-level GPU optimization and custom kernel development
- Performance Profiling: Use NVIDIA Nsight, PyTorch Profiler, and other tools to identify bottlenecks
- Model Compression: Implement quantization, pruning, knowledge distillation, and TensorRT conversion
- Transformer Optimization: Build custom CUDA kernels for Flash Attention, RoPE, and LayerNorm
- High-Performance Inference: Design LLM serving systems with continuous batching and PagedAttention
- Distributed Optimization: Optimize multi-GPU training and inference pipelines
- Production Deployment: Deploy optimized models with monitoring and cost optimization
This course is designed for:
- Senior AI Infrastructure Engineers looking to specialize in performance optimization
- ML Platform Engineers who need deep GPU and optimization expertise
- Performance Engineers transitioning to AI/ML workloads
- Research Engineers deploying models to production at scale
Required Skills:
- Strong Python programming (3+ years)
- PyTorch or TensorFlow experience (2+ years)
- Linux/Unix system administration
- Git version control
- Docker and Kubernetes basics
- Understanding of transformer architectures (GPT, BERT, LLaMA)
Recommended Background:
- Computer architecture fundamentals
- C++ programming
- Distributed systems concepts
- Production ML experience
Hardware Requirements:
- NVIDIA GPU (minimum: RTX 3090, A10G, or cloud GPU instance)
- Recommended: A100, A10G, or H100
- 64GB+ RAM
- 500GB+ SSD storage
- Ubuntu 20.04/22.04 or similar Linux distribution
Software Prerequisites:
- CUDA Toolkit 12.0+
- Python 3.10+
- PyTorch 2.1+
- Docker
- Git
```
Prerequisites ──> GPU Fundamentals ──> CUDA Programming ──> Performance Profiling
                                                                 │
                                                                 ▼
Production Deployment <── Distributed Inference <── Model Compression
          │                         │                       │
          ▼                         ▼                       ▼
      Project 3              Advanced Topics    Transformer Optimization
   (LLM Inference)                                          │
                                                            ▼
                                                      Project 1 & 2
                                              (Compression & CUDA Kernels)
```
- Total Duration: 200-250 hours (10-12 weeks full-time, 20-25 weeks part-time)
- Lessons: 8 modules, 2-3 weeks each
- Projects: 3 major projects, 40-80 hours each
- Assessments: Weekly quizzes + 3 practical exams
Learning Objectives:
- Understand GPU architecture (CUDA cores, Tensor Cores, memory hierarchy)
- Master GPU memory management (global, shared, registers, L1/L2 cache)
- Learn CUDA execution model (grids, blocks, threads, warps)
- Understand memory bandwidth and compute-bound operations
Topics:
- NVIDIA GPU architecture evolution (Pascal → Ampere → Hopper)
- CUDA programming model fundamentals
- Memory hierarchy and bandwidth optimization
- Warp-level operations and thread divergence
- Occupancy and resource utilization
Deliverables:
- Quiz: GPU architecture and CUDA model
- Exercise: Memory bandwidth analysis
- Lab: Simple CUDA kernel profiling
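To make the memory-bandwidth exercise concrete, here is a minimal sketch (assuming a CUDA-capable GPU is visible to PyTorch; buffer sizes and iteration counts are illustrative) that times a device-to-device copy with CUDA events and reports the achieved bandwidth, which you can compare against your GPU's published peak:

```python
import torch

# Effective bandwidth of a device-to-device copy: bytes moved (read + write) / elapsed time.
n = 256 * 1024 * 1024                      # 256M float32 elements = 1 GiB per buffer
src = torch.empty(n, dtype=torch.float32, device="cuda")
dst = torch.empty_like(src)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
dst.copy_(src)                             # warmup
torch.cuda.synchronize()

iters = 20
start.record()
for _ in range(iters):
    dst.copy_(src)
end.record()
torch.cuda.synchronize()

seconds = start.elapsed_time(end) / 1000 / iters
bytes_moved = 2 * src.numel() * src.element_size()   # each copy reads src and writes dst
print(f"effective bandwidth: {bytes_moved / seconds / 1e9:.1f} GB/s")
```

A plain copy should reach a large fraction of peak bandwidth; kernels that fall well below it are usually limited by uncoalesced or strided access.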
Learning Objectives:
- Write efficient CUDA kernels from scratch
- Optimize memory access patterns (coalescing, alignment)
- Use shared memory and warp primitives
- Integrate CUDA with PyTorch (C++ extensions)
Topics:
- CUDA kernel syntax and launch configurations
- Memory coalescing and alignment
- Shared memory optimization and bank conflicts
- Warp-level primitives (`__shfl`, reductions)
- PyTorch C++ extensions with pybind11
- Autograd integration for custom operators
Deliverables:
- Quiz: CUDA programming concepts
- Exercise: Implement vectorized operations
- Lab: Build PyTorch CUDA extension
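As a preview of the lab, below is a minimal sketch of a PyTorch CUDA extension compiled at runtime with `torch.utils.cpp_extension.load_inline`. The kernel (`scale_kernel`) and launch parameters are illustrative placeholders, and real projects typically build extensions ahead of time with setuptools or CMake:

```python
import torch
from torch.utils.cpp_extension import load_inline

# CUDA source: a trivial elementwise kernel plus a C++ wrapper that launches it.
cuda_src = r"""
#include <torch/extension.h>

__global__ void scale_kernel(const float* x, float* y, float alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per element
    if (i < n) y[i] = alpha * x[i];
}

torch::Tensor scale(torch::Tensor x, float alpha) {
    auto y = torch::empty_like(x);                   // assumes a contiguous float32 CUDA tensor
    int n = x.numel();
    int threads = 256;
    int blocks = (n + threads - 1) / threads;        // ceil-divide so every element is covered
    scale_kernel<<<blocks, threads>>>(x.data_ptr<float>(), y.data_ptr<float>(), alpha, n);
    return y;
}
"""

cpp_src = "torch::Tensor scale(torch::Tensor x, float alpha);"  # declaration used for the binding

ext = load_inline(name="scale_ext", cpp_sources=cpp_src,
                  cuda_sources=cuda_src, functions=["scale"])

x = torch.randn(1_000_000, device="cuda")
print(torch.allclose(ext.scale(x, 2.0), 2.0 * x))
```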
Learning Objectives:
- Profile GPU applications with NVIDIA Nsight Compute
- Analyze system-wide performance with Nsight Systems
- Perform roofline analysis
- Identify memory vs compute bottlenecks
Topics:
- NVIDIA Nsight Compute deep dive
- NVIDIA Nsight Systems for end-to-end profiling
- PyTorch Profiler and TensorBoard integration
- Roofline model and performance analysis
- Memory bandwidth vs compute utilization
- Kernel optimization strategies
Deliverables:
- Quiz: Profiling tools and metrics
- Exercise: Roofline analysis case study
- Lab: Profile and optimize transformer model
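Before reaching for Nsight, the built-in PyTorch Profiler already answers "which kernels dominate?". A minimal sketch (the model, shapes, and iteration counts are arbitrary placeholders):

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.TransformerEncoderLayer(d_model=512, nhead=8).cuda().eval()
x = torch.randn(64, 128, 512, device="cuda")   # (seq, batch, d_model)

with torch.no_grad(), profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
                              record_shapes=True) as prof:
    for _ in range(10):
        model(x)
    torch.cuda.synchronize()

# Top kernels by GPU time; export a Chrome trace for timeline inspection if needed.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
# prof.export_chrome_trace("trace.json")
```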
Learning Objectives:
- Understand transformer architecture bottlenecks
- Implement Flash Attention algorithm
- Build custom CUDA kernels for RoPE, LayerNorm, GELU
- Optimize attention memory usage
Topics:
- Transformer architecture deep dive
- Attention mechanism bottlenecks
- Flash Attention algorithm and implementation
- Rotary Position Embeddings (RoPE) optimization
- Fused kernel design (LayerNorm + GELU)
- Memory-efficient attention patterns
Deliverables:
- Quiz: Transformer optimization techniques
- Exercise: Flash Attention analysis
- Project 2: Custom CUDA Kernels for Transformers (60 hours)
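The memory argument behind Flash Attention is easy to see empirically: naive attention materializes the full (seq x seq) score matrix per head, while a fused kernel never does. The sketch below (assuming PyTorch 2.x and a CUDA GPU; `scaled_dot_product_attention` dispatches to a Flash-style fused kernel when one is available for the inputs) compares peak memory of the two paths:

```python
import torch
import torch.nn.functional as F

q = torch.randn(4, 16, 2048, 64, device="cuda", dtype=torch.float16)  # (batch, heads, seq, head_dim)
k, v = torch.randn_like(q), torch.randn_like(q)

def naive_attention(q, k, v):
    scores = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)  # materializes (seq, seq) per head
    return torch.softmax(scores, dim=-1) @ v

torch.cuda.reset_peak_memory_stats()
out_naive = naive_attention(q, k, v)
torch.cuda.synchronize()
print(f"naive peak: {torch.cuda.max_memory_allocated() / 2**20:.0f} MiB")

torch.cuda.reset_peak_memory_stats()
out_fused = F.scaled_dot_product_attention(q, k, v)  # fused, memory-efficient path when available
torch.cuda.synchronize()
print(f"fused peak: {torch.cuda.max_memory_allocated() / 2**20:.0f} MiB")
print("max abs diff:", (out_naive - out_fused).abs().max().item())
```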
Learning Objectives:
- Implement post-training quantization (PTQ) and QAT
- Apply structured pruning techniques
- Implement knowledge distillation
- Convert models to TensorRT
Topics:
- Quantization: INT8, FP16, mixed precision
- PyTorch quantization APIs
- Pruning: magnitude-based, structured, iterative
- Knowledge distillation frameworks
- TensorRT conversion and optimization
- Calibration strategies for quantization
Deliverables:
- Quiz: Compression techniques
- Exercise: Quantization sensitivity analysis
- Project 1: Automated Model Compression Pipeline (40 hours)
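As a first taste of PTQ, here is a minimal sketch of dynamic INT8 quantization with the stock PyTorch API (the toy MLP and sizes are illustrative; dynamic quantization runs on CPU, stores Linear weights as INT8, and quantizes activations on the fly):

```python
import os
import tempfile
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768)).eval()

# Dynamic post-training quantization: Linear weights stored as INT8.
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

def size_mb(m):
    # Serialize the state dict to measure on-disk size.
    with tempfile.NamedTemporaryFile(delete=False) as f:
        torch.save(m.state_dict(), f.name)
        size = os.path.getsize(f.name) / 1e6
    os.remove(f.name)
    return size

x = torch.randn(4, 768)
print(f"fp32: {size_mb(model):.1f} MB, int8: {size_mb(quantized):.1f} MB")
print("max abs diff:", (model(x) - quantized(x)).abs().max().item())
```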
Learning Objectives:
- Implement tensor parallelism for large models
- Design efficient multi-GPU serving systems
- Optimize cross-GPU communication
- Build load balancing systems
Topics:
- Tensor parallelism fundamentals
- Pipeline parallelism for inference
- NCCL and inter-GPU communication
- Load balancing strategies
- Multi-GPU memory management
- Scaling efficiency analysis
Deliverables:
- Quiz: Distributed inference
- Exercise: Tensor parallelism implementation
- Lab: Multi-GPU serving system
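The core idea of tensor parallelism fits in a few lines: shard a weight matrix across devices, let each rank compute its slice independently, and reassemble the result with a collective. The single-process sketch below fakes two ranks with `torch.chunk`; in a real system each shard lives on its own GPU and the concatenation is an NCCL all-gather:

```python
import torch

# Column-parallel linear layer: split the weight along the output dimension across "ranks".
torch.manual_seed(0)
d_in, d_out, world = 512, 1024, 2
W = torch.randn(d_out, d_in)
x = torch.randn(8, d_in)

full = x @ W.t()                           # reference: unsharded computation

shards = W.chunk(world, dim=0)             # each rank holds d_out / world output rows
partials = [x @ w.t() for w in shards]     # computed independently per rank, no communication
gathered = torch.cat(partials, dim=-1)     # all-gather along the feature dimension

print(torch.allclose(full, gathered, atol=1e-5))
```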
Learning Objectives:
- Design high-throughput inference APIs
- Implement continuous batching
- Build monitoring and observability systems
- Optimize cost per inference
Topics:
- REST and gRPC APIs for inference
- Continuous batching and request scheduling
- Prometheus and Grafana monitoring
- SLA management and autoscaling
- Cost optimization strategies
- Deployment with Docker and Kubernetes
Deliverables:
- Quiz: Production deployment
- Exercise: Design serving architecture
- Project 3: High-Performance LLM Inference System (80 hours)
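To illustrate the request-scheduling side of batched serving, here is a toy asyncio sketch of dynamic micro-batching: requests queue up, and a background loop drains up to `MAX_BATCH` of them (or waits at most `MAX_WAIT_MS`) before running a single forward pass. All names, the tiny model, and the limits are placeholders; true continuous batching for LLMs additionally admits and evicts sequences between decode steps:

```python
import asyncio
import torch

MAX_BATCH, MAX_WAIT_MS = 16, 5
queue: asyncio.Queue = asyncio.Queue()
model = torch.nn.Linear(512, 512).eval()

async def batching_loop():
    while True:
        x, fut = await queue.get()                        # wait for the first request
        batch, futures = [x], [fut]
        loop = asyncio.get_running_loop()
        deadline = loop.time() + MAX_WAIT_MS / 1000
        while len(batch) < MAX_BATCH:                     # collect more until full or deadline
            timeout = deadline - loop.time()
            if timeout <= 0:
                break
            try:
                x, fut = await asyncio.wait_for(queue.get(), timeout)
                batch.append(x)
                futures.append(fut)
            except asyncio.TimeoutError:
                break
        with torch.no_grad():
            out = model(torch.stack(batch))               # one forward pass for the whole batch
        for f, o in zip(futures, out):
            f.set_result(o)

async def infer(x: torch.Tensor) -> torch.Tensor:
    fut = asyncio.get_running_loop().create_future()
    await queue.put((x, fut))
    return await fut                                      # resolved by batching_loop

# usage (e.g. in a FastAPI startup hook): asyncio.create_task(batching_loop())
# then each request handler does: result = await infer(torch.randn(512))
```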
Learning Objectives:
- Implement speculative decoding
- Use PagedAttention for memory efficiency
- Explore INT4 quantization
- Learn latest optimization techniques
Topics:
- PagedAttention implementation
- Speculative decoding algorithms
- INT4 and sub-byte quantization
- Continuous batching advanced patterns
- Flash Decoding for inference
- Latest research in LLM optimization
Deliverables:
- Quiz: Advanced optimization
- Exercise: PagedAttention analysis
- Lab: Implement speculative decoding
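Speculative decoding is easier to grasp with a toy: a cheap draft model proposes K tokens autoregressively, the expensive target model scores all K positions in one batched pass, and the longest prefix that agrees with the target's own (here, greedy) choices is accepted. The "models" below are stand-in embedding tables, not real LMs, so only the control flow is meaningful:

```python
import torch

torch.manual_seed(0)
VOCAB, K = 100, 4
draft = torch.nn.Embedding(VOCAB, VOCAB)    # stand-in: last token -> logits for the next token
target = torch.nn.Embedding(VOCAB, VOCAB)   # stand-in for the expensive model

def greedy_next(model, tokens):
    return model(tokens[-1:]).argmax(-1)     # (1,) next-token id

def speculative_step(tokens):
    # 1) draft proposes K tokens autoregressively (cheap)
    proposal = tokens
    for _ in range(K):
        proposal = torch.cat([proposal, greedy_next(draft, proposal)])
    # 2) target scores every proposed position in a single batched pass
    drafted = proposal[-K:]
    target_choice = target(proposal[-K - 1:-1]).argmax(-1)   # conditioned on each previous token
    # 3) accept the longest matching prefix; on the first mismatch, emit the target's token instead
    n_accept = int((drafted == target_choice).long().cumprod(0).sum())
    accepted = drafted[:n_accept]
    correction = target_choice[n_accept:n_accept + 1]         # empty slice when all K were accepted
    return torch.cat([tokens, accepted, correction])

tokens = torch.tensor([1])
for _ in range(5):
    tokens = speculative_step(tokens)
print(tokens.tolist())
```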
Complexity: Intermediate+
Build a production-ready compression pipeline that applies quantization, pruning, knowledge distillation, and TensorRT conversion to reduce model size by 75% and speed up inference by 3x while retaining at least 98% of baseline accuracy.
Technologies: PyTorch, TensorRT, ONNX, Neural Compressor
Performance Targets:
- 3x inference speedup
- 75% model size reduction
- <2% accuracy degradation
Key Features:
- Post-training quantization (INT8/FP16)
- Quantization-aware training
- Structured pruning with fine-tuning
- TensorRT engine building
- Automated benchmarking
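For the automated benchmarking step, wall-clock timing on the host is misleading because CUDA launches are asynchronous; a small helper based on CUDA events (warmup and iteration counts, and the toy workload, are placeholders) gives per-call GPU latency you can compare before and after compression:

```python
import torch

def benchmark(fn, *args, warmup=10, iters=100):
    for _ in range(warmup):                 # let caches warm up and autotuners settle
        fn(*args)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()                # wait for all timed work to finish
    return start.elapsed_time(end) / iters  # milliseconds per call

model = torch.nn.Linear(4096, 4096).cuda().half().eval()
x = torch.randn(64, 4096, device="cuda", dtype=torch.float16)
with torch.no_grad():
    print(f"{benchmark(model, x):.3f} ms/iter")
```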
Complexity: Advanced
Develop custom CUDA kernels to optimize critical transformer operations, achieving 3x+ speedup over standard PyTorch implementations through Flash Attention, fused RoPE, optimized LayerNorm, and GELU.
Technologies: CUDA, C++, PyTorch C++ Extensions, Triton, Nsight
Performance Targets:
- Flash Attention: 3x speedup
- Fused kernels: 3.5x speedup
- 80%+ memory bandwidth utilization
- 70%+ compute utilization
Key Features:
- Flash Attention v2 implementation
- Fused RoPE kernel
- Welford-based LayerNorm
- Vectorized GELU
- PyTorch integration
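The "Welford-based LayerNorm" feature refers to Welford's single-pass, numerically stable mean/variance recurrence, which is what each thread block accumulates inside a fused LayerNorm kernel. The reference below shows the same recurrence in plain Python checked against `F.layer_norm` (float64 and the explicit loop are for clarity only; the real kernel does this per row in registers and shared memory):

```python
import torch
import torch.nn.functional as F

def welford_layernorm(x, eps=1e-5):
    mean, m2 = 0.0, 0.0
    for i, v in enumerate(x.tolist(), start=1):  # single pass over the row
        delta = v - mean
        mean += delta / i                        # running mean
        m2 += delta * (v - mean)                 # running sum of squared deviations
    var = m2 / len(x)                            # biased variance, as LayerNorm uses
    return (x - mean) / (var + eps) ** 0.5

row = torch.randn(1024, dtype=torch.float64)
ref = F.layer_norm(row, row.shape, eps=1e-5)
print(torch.allclose(welford_layernorm(row), ref, atol=1e-6))
```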
Complexity: Advanced+
Build a production-grade LLM serving system capable of 1000+ requests/second with P99 latency <100ms using continuous batching, PagedAttention, and advanced scheduling.
Technologies: PyTorch, vLLM, FastAPI, FlashAttention, Triton
Performance Targets:
- 1000+ req/sec throughput
- <100ms P99 latency
- 85%+ GPU utilization
- 70% memory savings with PagedAttention
Key Features:
- Continuous batching engine
- PagedAttention implementation
- Dynamic request scheduling
- Streaming inference support
- Prometheus monitoring
# Clone repository
git clone https://github.com/ai-infra-curriculum/ai-infra-performance-learning.git
cd ai-infra-performance-learning
# Create virtual environment
python3.10 -m venv venv
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Verify CUDA installation
python -c "import torch; print(f'CUDA Available: {torch.cuda.is_available()}')"
python -c "import torch; print(f'CUDA Version: {torch.version.cuda}')"

# Ubuntu 22.04 (adjust for your distribution)
wget https://developer.download.nvidia.com/compute/cuda/12.2.0/local_installers/cuda_12.2.0_535.54.03_linux.run
sudo sh cuda_12.2.0_535.54.03_linux.run
# Add to PATH
echo 'export PATH=/usr/local/cuda-12.2/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.2/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
# Verify
nvcc --version

# Nsight Compute
sudo apt-get install nvidia-nsight-compute
# Nsight Systems
sudo apt-get install nvidia-nsight-systems
# Verify
ncu --version
nsys --version

# CMake (for CUDA compilation)
sudo apt-get install cmake
# Build essentials
sudo apt-get install build-essential
# pybind11 for PyTorch extensions
pip install pybind11

See individual project directories for detailed setup instructions.
GPU & CUDA:
- CUDA C++ Programming Guide (NVIDIA)
- "Programming Massively Parallel Processors" by Hwu, Kirk, Hajj
- CUDA Best Practices Guide (NVIDIA)
Model Optimization:
- "A White Paper on Neural Network Quantization" (Qualcomm)
- "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference" (Google)
- "Learning both Weights and Connections for Efficient Neural Networks" (Han et al.)
Transformer Optimization:
- "Flash Attention: Fast and Memory-Efficient Exact Attention with IO-Awareness" (Dao et al., 2022)
- "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning" (Dao, 2023)
- "Attention is All You Need" (Vaswani et al., 2017)
LLM Serving:
- "Efficient Memory Management for Large Language Model Serving with PagedAttention" (vLLM paper)
- "Orca: A Distributed Serving System for Transformer-Based Generative Models"
- PyTorch - Deep learning framework
- TensorRT - NVIDIA inference optimization
- ONNX Runtime - Cross-platform inference
- vLLM - LLM serving reference
- Flash Attention - Optimized attention
- Triton - GPU programming language
- DeepSpeed - Optimization library
- NVIDIA Deep Learning Institute - GPU Programming
- NVIDIA DLI - Optimizing Deep Learning Models
- Coursera - GPU Programming Specialization
- YouTube: CUDA Programming tutorials by NVIDIA
- NVIDIA Developer Forums
- PyTorch Discussion Forums
- r/MachineLearning Performance threads
- MLPerf benchmarking community
Each module includes:
- Pre-quiz: Assess baseline knowledge
- Mid-module checkpoints: Verify understanding
- Post-quiz: Comprehensive module assessment
Passing Score: 80% or higher
Three major practical exams aligned with projects:
- Compression Exam (Module 5): Compress a given model to meet performance targets
- CUDA Exam (Module 4): Implement custom CUDA kernel from specification
- Inference Exam (Module 7): Deploy serving system meeting SLA requirements
Passing Criteria: Meet all performance targets
Requirements:
- Complete all 8 modules with 80%+ quiz scores
- Submit all 3 projects with passing grades
- Pass all 3 practical examinations
Certificate: "AI/ML Performance Engineer - Advanced Specialization"
- Discussion Forum: GitHub Discussions
- Office Hours: Weekly live Q&A sessions (schedule TBD)
- Slack Community: Join #ai-performance-engineering channel
- Email Support: [email protected]
We welcome contributions! See CONTRIBUTING.md for guidelines.
Areas for contribution:
- Additional exercises and labs
- Bug fixes and improvements
- Performance benchmarks
- Documentation enhancements
- New optimization techniques
This project follows the Contributor Covenant Code of Conduct. Please read and adhere to it in all interactions.
```
      AI Infrastructure Engineer (Level 2)
                      │
  AI/ML Performance Engineer (Level 2.5D)  ← YOU ARE HERE
                      │
      ┌───────────────┼───────────────┐
      │               │               │
Senior Performance  Principal     Performance
  Engineer (L3)     Architect      Team Lead
```
| Skill | Entry | After Course | Expert |
|---|---|---|---|
| GPU Programming | Basic | Advanced | ★★★★★ |
| CUDA Kernels | None | Intermediate | ★★★★ |
| Model Compression | Basic | Advanced | ★★★★★ |
| Performance Profiling | Basic | Advanced | ★★★★★ |
| LLM Serving | None | Advanced | ★★★★ |
| Production Deployment | Intermediate | Advanced | ★★★★★ |
Based on industry data (2024 US market):
- Entry Performance Engineer: $140K - $180K
- Senior Performance Engineer: $180K - $240K
- Principal Performance Engineer: $240K - $350K+
Specialized AI/ML Performance Engineers command a 20-30% premium over general infrastructure roles.
- Senior AI Infrastructure Architect track
- Principal AI Infrastructure Engineer (technical leadership)
- AI Infrastructure Team Lead (people management)
- Specialized roles: MLOps, ML Platform, AI Security
| Week | Module | Activities | Hours |
|---|---|---|---|
| 1-2 | Module 1-2 | GPU Fundamentals + CUDA | 50 |
| 3 | Module 3 | Performance Profiling | 25 |
| 4-5 | Project 1 | Model Compression Pipeline | 40 |
| 6-7 | Module 4 | Transformer Optimization | 40 |
| 8-10 | Project 2 | Custom CUDA Kernels | 60 |
| 11-12 | Module 5-6 | Compression + Distributed | 65 |
| 13-16 | Project 3 | LLM Inference System | 80 |
| 17-18 | Module 7-8 | Production + Advanced | 45 |
Total: ~18 weeks (full-time) or 36 weeks (part-time, 20 hrs/week)
This learning repository is licensed under the MIT License.
Course materials, code examples, and projects are freely available for educational purposes.
- GitHub: @ai-infra-curriculum
- Email: [email protected]
- Website: ai-infra-curriculum.com
This curriculum was developed with input from:
- Senior ML Infrastructure Engineers at major tech companies
- NVIDIA Developer Relations team
- Academic researchers in GPU optimization
- Production ML teams deploying LLMs at scale
Ready to become an AI/ML Performance Engineering expert?
Start with Module 1: GPU Fundamentals →
or jump into Project 1: Model Compression →