This repository is part of my personal learning journey to understand Vision-Language Models (VLMs), specifically how models interpret an image and generate meaningful captions from it.
My curiosity stemmed from a simple question:
"How does a model actually understand an image and then generate a relevant caption for it?"
This project is built alongside Umar Jabil's excellent video, which I highly recommend watching.
All credit goes to him for the concepts and guidance.
This file implements the Vision Transformer (ViT) portion of the SigLIP architecture. It includes:
- `SigLIPVisionConfig`: Configuration class for setting model parameters.
- `SigLIPVisionEmbeddings`: Responsible for converting image patches into embedding vectors.
- `SigLIPMLP`: Standard feed-forward network used within transformer blocks.
- `SigLIPAttention`: Multi-head self-attention mechanism for capturing spatial relationships.
- `SigLIPEncoderLayer`: A single transformer encoder block (Attention + MLP + LayerNorm).
- `SigLIPEncoder`: Stacked encoder layers forming the transformer backbone.
- `SigLIPVisionTransformer`: Combines embeddings and encoder to form the ViT.
- `SigLIPVisionModel`: High-level model class wrapping the Vision Transformer (see the usage sketch below).
This file provides a lightweight preprocessing module for preparing image and text inputs for PaLI-Gemma-style vision-language models. It includes:
- `add_image_tokens_to_prompt`: Prepends `<image>` tokens and a `<bos>` token to the prompt, and appends a newline for compatibility with the training format.
- `rescale`: Scales pixel values (e.g., by `1/255.0`) and converts the array to a float dtype.
- `resize`: Resizes a `PIL.Image` to a target `(height, width)` using the specified resampling method.
- `normalize`: Normalizes an image by subtracting the mean and dividing by the standard deviation.
- `process_images`: Applies resize → CHW conversion → rescaling → normalization to a list of images (see the sketch after this list).
- `PaligemmaProcessor`: Main callable that processes images and text into `pixel_values`, `input_ids`, and `attention_mask` for model input.
- This repository is educational in nature and may not be optimized for performance.
- You are encouraged to tweak, explore, and build on top of this to better understand the internals of vision-language models.
- Check out the linked video to follow along with the architecture and coding.
Happy building! 🛠️
Feel free to fork, contribute, or drop suggestions.