| Original | Smiling Edit | Mustache Edit | Glasses Edit |
|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() |
This project implements a Conditional Variational Autoencoder (CVAE) capable of editing facial images by adding or modifying three key attributes:
- Eyeglasses
- Smiling
- Mustache
The model learns to generate realistic facial variations while conditioning on selected attributes, enabling controlled face synthesis and attribute transfer.
The CVAE architecture consists of four main components:
### Encoder Network
- Input concatenates RGB image with attribute channels (3 + 3 = 6 channels)
- Four convolutional layers with LeakyReLU activation
- Channel progression: 128 → 256 → 512 → 1024
- Stride-2 convolutions for spatial downsampling
- Output flattened to 1024 × 4 × 4 = 16,384 dimensions
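A minimal PyTorch sketch of this encoder, assuming 4×4 kernels with stride 2 and padding 1 and a LeakyReLU slope of 0.2 (these details are assumptions; `models/cvae.py` in this repository is authoritative):

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Encoder sketch: 6-channel input (RGB + 3 attribute maps) -> flattened features."""
    def __init__(self, in_channels=6, base_channels=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, base_channels, 4, 2, 1),           # 64x64 -> 32x32
            nn.LeakyReLU(0.2),
            nn.Conv2d(base_channels, base_channels * 2, 4, 2, 1),     # 32x32 -> 16x16
            nn.LeakyReLU(0.2),
            nn.Conv2d(base_channels * 2, base_channels * 4, 4, 2, 1), # 16x16 -> 8x8
            nn.LeakyReLU(0.2),
            nn.Conv2d(base_channels * 4, base_channels * 8, 4, 2, 1), # 8x8 -> 4x4
            nn.LeakyReLU(0.2),
        )

    def forward(self, x):
        # Output: (batch, 1024 * 4 * 4) = (batch, 16384)
        return self.net(x).flatten(start_dim=1)
```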
### Latent Space
- Continuous representation of facial features
- Dimension: 128 (configurable)
- Mean (μ) and log-variance (logvar) computed via fully connected layers
- Sampled from learned Gaussian distribution using reparameterization trick
- Conditioned by concatenating the attribute vector c to z (decoder input: [z, c])
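A sketch of the latent heads and the reparameterization step; tensor shapes follow the numbers above, and all variable names are illustrative:

```python
import torch
import torch.nn as nn

latent_dim, num_attrs, feat_dim = 128, 3, 1024 * 4 * 4

fc_mu = nn.Linear(feat_dim, latent_dim)      # mean head
fc_logvar = nn.Linear(feat_dim, latent_dim)  # log-variance head

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps with eps ~ N(0, I); differentiable w.r.t. mu and logvar."""
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + eps * std

# Example with dummy encoder features and a binary attribute vector c
h = torch.randn(8, feat_dim)                      # flattened encoder output
c = torch.randint(0, 2, (8, num_attrs)).float()   # {0, 1}^3 attribute labels
mu, logvar = fc_mu(h), fc_logvar(h)
z = reparameterize(mu, logvar)
z_cond = torch.cat([z, c], dim=1)                 # decoder is conditioned on [z, c]
```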
### Decoder Network
- Fully connected layer expands latent + attribute vectors to 1024 × 4 × 4
- Four transposed convolutional layers with ReLU activation
- Channel progression: 1024 → 512 → 256 → 128 → 3
- Tanh activation for final layer (output range [-1, 1])
- Generates realistic 64 × 64 RGB facial images
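A matching decoder sketch under the same assumptions (4×4 transposed-convolution kernels, stride 2, padding 1):

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Decoder sketch: [z, c] -> 64x64 RGB image in [-1, 1]."""
    def __init__(self, latent_dim=128, num_attrs=3, base_channels=128):
        super().__init__()
        self.top_channels = base_channels * 8  # 1024
        self.fc = nn.Linear(latent_dim + num_attrs, self.top_channels * 4 * 4)
        self.net = nn.Sequential(
            nn.ConvTranspose2d(base_channels * 8, base_channels * 4, 4, 2, 1),  # 4x4 -> 8x8
            nn.ReLU(),
            nn.ConvTranspose2d(base_channels * 4, base_channels * 2, 4, 2, 1),  # 8x8 -> 16x16
            nn.ReLU(),
            nn.ConvTranspose2d(base_channels * 2, base_channels, 4, 2, 1),      # 16x16 -> 32x32
            nn.ReLU(),
            nn.ConvTranspose2d(base_channels, 3, 4, 2, 1),                      # 32x32 -> 64x64
            nn.Tanh(),  # map outputs to [-1, 1]
        )

    def forward(self, z_cond):
        h = self.fc(z_cond).view(-1, self.top_channels, 4, 4)
        return self.net(h)
```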
### Loss Function
- Reconstruction Loss (MSE): Ensures fidelity to input images
- KL Divergence: Regularizes the latent space distribution
- Total Loss: `L = MSE + β · KLD`, where β = 4.0
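In code, this loss might look like the following sketch (the `reduction="sum"` choice is an assumption; per-batch averaging would also be reasonable):

```python
import torch
import torch.nn.functional as F

def cvae_loss(recon_x, x, mu, logvar, beta=4.0):
    """Total loss L = MSE + beta * KLD."""
    mse = F.mse_loss(recon_x, x, reduction="sum")
    # KL divergence between N(mu, sigma^2) and the standard normal N(0, I)
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return mse + beta * kld
```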
### Key Features
- Attribute conditioning enables controlled generation
- Variational inference provides diverse outputs
- Spatially-conditioned attributes for better control
- End-to-end differentiable training pipeline
### CelebA (CelebFaces Attributes Dataset)
- 202,599 face images (178 × 218 pixels original)
- 40 binary attribute labels per image
- High-quality celebrity face photographs
- Preprocessing: center crop the 178 × 218 originals, then resize to 64 × 64
### Selected Attributes (3 of the 40 available)
- Eyeglasses - presence of eyewear
- Smiling - smiling expression
- Mustache - presence of facial hair
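The three labels can be pulled from the standard CelebA attribute file, which encodes each attribute as ±1. The file name below is an assumption based on the common Kaggle release:

```python
import pandas as pd

# Assumed file name from the common Kaggle CelebA release; attributes are +1 / -1.
attrs = pd.read_csv("list_attr_celeba.csv", index_col=0)
labels = (attrs[["Eyeglasses", "Smiling", "Mustache"]] + 1) // 2  # map {-1, +1} -> {0, 1}
print(labels.head())
```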
### Image Normalization
- Mean: [0.5, 0.5, 0.5]
- Std Dev: [0.5, 0.5, 0.5]
- Normalized range: [-1, 1]
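As a torchvision sketch (the square 178-pixel center crop is an assumption; see the preprocessing note above):

```python
from torchvision import transforms

transform = transforms.Compose([
    transforms.CenterCrop(178),   # square crop of the 178x218 originals (assumed)
    transforms.Resize(64),        # downsample to 64x64
    transforms.ToTensor(),        # scales pixel values to [0, 1]
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),  # -> [-1, 1]
])
```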
### Requirements
- Python 3.8+
- PyTorch 1.9+
- torchvision
- numpy, pandas, matplotlib
- Pillow
- tqdm
- kaggle (for dataset download)
```bash
git clone https://github.com/chaitra-samant/cvae-celeba-project.git
cd cvae-celeba-project
python -m venv venv
source venv/bin/activate   # macOS/Linux
.\venv\Scripts\activate    # Windows
pip install -r requirements.txt
```
- Visit https://kaggle.com/account
- Scroll to the API section
- Click "Create New API Token" (this downloads `kaggle.json`)
- Place the file in the appropriate location:
  - macOS/Linux: `~/.kaggle/kaggle.json`
  - Windows: `C:\Users\<Your-Username>\.kaggle\kaggle.json`

Create the `.kaggle` directory if it does not exist.
```bash
python download_data.py
```
This downloads and extracts the dataset (~1.3 GB).
The model can be trained using the Jupyter notebook or Python scripts:
```bash
python main.py
```
Or with custom hyperparameters:
```bash
python main.py --epochs 75 --lr 5e-5 --batch-size 128 --latent-dim 128
```
Available arguments (a minimal `argparse` sketch follows this list):
- `--epochs`: Number of training epochs (default: 75)
- `--lr`: Learning rate (default: 5e-5)
- `--batch-size`: Batch size for training (default: 128)
- `--latent-dim`: Dimensionality of the latent space (default: 128)
- `--beta`: KL divergence weight (default: 4.0)
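Below is a hypothetical `argparse` wiring consistent with these flags; the actual `main.py` in this repository is authoritative:

```python
import argparse

parser = argparse.ArgumentParser(description="Train the CVAE on CelebA")
parser.add_argument("--epochs", type=int, default=75, help="number of training epochs")
parser.add_argument("--lr", type=float, default=5e-5, help="learning rate")
parser.add_argument("--batch-size", type=int, default=128, help="training batch size")
parser.add_argument("--latent-dim", type=int, default=128, help="latent space dimensionality")
parser.add_argument("--beta", type=float, default=4.0, help="KL divergence weight")
args = parser.parse_args()
```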
Training produces the following outputs:
- Generated samples: `samples_64/`
- Trained model checkpoint: `cvae_eyeglasses_smiling_mustache.pth`
- Training logs and metrics, saved automatically
The model generates facial variations by conditioning on specific attributes. For each sample, the model generates four variations using the same latent vector:
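A sketch of that procedure, reusing the `Decoder` sketch from the architecture section (the attribute ordering `[eyeglasses, smiling, mustache]` is an assumption):

```python
import torch

decoder = Decoder()  # the Decoder sketch from the architecture section above
decoder.eval()

z = torch.randn(1, 128)  # one latent vector shared by all variations
conditions = {
    "neutral":   [0.0, 0.0, 0.0],
    "smiling":   [0.0, 1.0, 0.0],
    "all_three": [1.0, 1.0, 1.0],
}

with torch.no_grad():
    for name, attrs in conditions.items():
        c = torch.tensor([attrs])
        img = decoder(torch.cat([z, c], dim=1))  # (1, 3, 64, 64) in [-1, 1]
        print(name, tuple(img.shape))
```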
Example 1: Expression Modification

| Neutral | Smiling Only |
|---|---|
| ![]() | ![]() |
Example 2: Original vs. All Three Attributes

| Neutral | Eyeglasses + Smiling + Mustache |
|---|---|
| ![]() | ![]() |
- Initial Learning Rate: 5e-5 (Adam optimizer)
- Total Epochs: 75
- Batch Size: 128
- Image Resolution: 64 × 64 pixels
- Generated Samples: 64 variations
```
cvae-celeba-project/
├── main.py                  # Main training script
├── download_data.py         # Dataset download utility
├── requirements.txt         # Project dependencies
├── notebooks/
│   └── cvae-model.ipynb     # Complete training notebook
├── models/
│   └── cvae.py              # CVAE architecture
├── data/
│   └── celeba_loader.py     # Data loading utilities
├── utils/
│   ├── training.py          # Training loop functions
│   └── visualization.py     # Image visualization tools
├── samples_64/              # Generated samples directory
├── results/                 # Output images and results
└── README.md                # This file
```
### Model Configuration
- Model: Conditional Variational Autoencoder
- Input Size: 64 × 64 RGB images
- Latent Dimension: 128
- Number of Attributes: 3
- Base Channels: 128
### Training Hyperparameters
- Optimizer: Adam
- Learning Rate: 5e-5
- Batch Size: 128
- Number of Epochs: 75
- Beta (KL weight): 4.0
- Loss: Reconstruction (MSE) + KL Divergence
### Data Loading
- Num Workers: 4 (parallel data loading)
- Pin Memory: Enabled (GPU optimization)
- Drop Last Batch: Enabled (consistent batch sizes)
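A sketch of a loader with these settings, using a dummy in-memory dataset in place of the real CelebA loader:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-in for the CelebA dataset: 64x64 RGB images plus 3 binary attributes.
dataset = TensorDataset(
    torch.randn(1024, 3, 64, 64),
    torch.randint(0, 2, (1024, 3)).float(),
)

loader = DataLoader(
    dataset,
    batch_size=128,
    shuffle=True,
    num_workers=4,    # parallel worker processes for loading
    pin_memory=True,  # page-locked memory speeds host-to-GPU copies
    drop_last=True,   # drop the final partial batch for consistent sizes
)
```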
### Hardware Requirements
- Minimum GPU Memory: 8 GB VRAM (for batch size 128)
- Recommended GPU: NVIDIA RTX 2060 or better
- Training Time: ~15-20 hours on a single GPU
- Support for additional facial attributes (from 40 available)
- Real-time attribute editing interface
- Improved image quality with higher resolution models (128×128, 256×256)
- Interactive web application for face editing
- StyleGAN2 integration for enhanced image quality
- Disentangled representation learning
- Kingma, D. P., & Welling, M. (2013). Auto-Encoding Variational Bayes. arXiv preprint arXiv:1312.6114
- Liu, Z., Luo, P., Wang, X., & Tang, X. (2015). Deep Learning Face Attributes in the Wild. ICCV
- Yan, X., Yang, J., Sohn, K., & Lee, H. (2016). Attribute2Image: Conditional Image Generation from Visual Attributes. ECCV
- Sohn, K., Lee, H., & Yan, X. (2015). Learning Structured Output Representation using Deep Conditional Generative Models. NIPS







