This repository is part of my personal learning journey to understand Vision-Language Models (VLMs), specifically how models interpret an image and generate meaningful captions from it.
My curiosity stemmed from a simple question:
"How does a model actually understand an image and then generate a relevant caption for it?"
This project is built alongside Umar Jabil's excellent video, which I highly recommend watching.
All credit goes to him for the concepts and guidance.
This file implements the Vision Transformer (ViT) portion of the SigLIP architecture. It includes:
- `SigLIPVisionConfig`: Configuration class for setting model parameters.
- `SigLIPVisionEmbeddings`: Responsible for converting image patches into embedding vectors.
- `SigLIPMLP`: Standard feed-forward network used within transformer blocks.
- `SigLIPAttention`: Multi-head self-attention mechanism for capturing spatial relationships.
- `SigLIPEncoderLayer`: A single transformer encoder block (Attention + MLP + LayerNorm).
- `SigLIPEncoder`: Stacked encoder layers forming the transformer backbone.
- `SigLIPVisionTransformer`: Combines embeddings and encoder to form the ViT.
- `SigLIPVisionModel`: High-level model class wrapping the Vision Transformer (see the usage sketch below).
This file provides a lightweight preprocessing module for preparing image and text inputs for PaLI-Gemma-style vision-language models. It includes:
- `add_image_tokens_to_prompt`: Prepends `<image>` tokens and a `<bos>` token to the prompt, and appends a newline for compatibility with the training format.
- `rescale`: Scales pixel values (e.g., by `1/255.0`) and converts the array to a float dtype.
- `resize`: Resizes a `PIL.Image` to a target `(height, width)` using the specified resampling method.
- `normalize`: Normalizes an image by subtracting the mean and dividing by the standard deviation.
- `process_images`: Applies resize → CHW conversion → rescaling → normalization to a list of images (see the sketch after this list).
- `PaligemmaProcessor`: Main callable that processes images and text into `pixel_values`, `input_ids`, and `attention_mask` for model input.
- This repository is educational in nature and may not be optimized for performance.
- You are encouraged to tweak, explore, and build on top of this to better understand the internals of vision-language models.
- Check out the linked video to follow along with the architecture and coding.
Happy building! 🛠️
Feel free to fork, contribute, or drop suggestions.