
@Phoenixcoder-6

The question was: how can Transformers be used for tasks other than natural language processing, such as computer vision?
Here is the clearest explanation I have found.

In NLP:

  • A sentence is a sequence of words.
  • Transformer sees each word as a token.
  • It learns how words relate to each other using self-attention.

Example:
In the sentence "The cat sat on the mat",
"cat" and "sat" are related,
and "cat" and "mat" are also related (because the cat is on the mat).

The transformer automatically figures out these relationships.
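To make this concrete, here is a minimal NumPy sketch of self-attention over a sequence of token vectors. The `self_attention` function is purely illustrative: a real Transformer adds learned query/key/value projections and multiple heads, which this sketch omits.

```python
import numpy as np

def self_attention(X):
    """Scaled dot-product self-attention over a sequence of token vectors.

    X: (seq_len, d) array of token embeddings. For simplicity this sketch
    uses X itself as queries, keys, and values (no learned projections).
    """
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)  # (seq_len, seq_len) pairwise similarities
    # Softmax over each row: how much each token attends to every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ X  # each token becomes a weighted mix of all tokens

# Toy "sentence": 6 tokens, each a 4-dimensional embedding
tokens = np.random.rand(6, 4)
out = self_attention(tokens)
print(out.shape)  # (6, 4): one updated vector per token
```

The key point is the `(seq_len, seq_len)` score matrix: every token's new representation depends on every other token, which is exactly the relationship-finding described above.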

In Computer Vision:

An image is not a sequence; it's a 2D grid (height × width × channels).
So before feeding it to a transformer, we:

  • Cut the image into small square patches (say 16×16 pixels).
  • Flatten each patch into a long vector (just line up its pixel values).
  • Project each flattened patch to a fixed-size embedding vector (like the embedding layer for text tokens).

Now, treat patches like "words" and the image like a "sentence"!
Then self-attention can figure out which parts of the image should attend to which others.
Maybe the eyes should attend to the nose for face recognition;
maybe the wheels should attend to the car body for car detection.
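The patch pipeline above can be sketched in a few lines of NumPy. The `patchify` helper and the random `W_embed` matrix are illustrative stand-ins: a real Vision Transformer learns the projection weights and also adds position embeddings, which this sketch omits.

```python
import numpy as np

def patchify(image, patch=16):
    """Cut an H x W x C image into flattened square patches (ViT-style)."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0
    # Split the grid into (num_rows, patch, num_cols, patch, C) blocks
    patches = image.reshape(H // patch, patch, W // patch, patch, C)
    patches = patches.transpose(0, 2, 1, 3, 4)      # (nRows, nCols, p, p, C)
    return patches.reshape(-1, patch * patch * C)   # (num_patches, p*p*C)

# 64x64 RGB image -> 16 patches, each flattened to 16*16*3 = 768 values
img = np.random.rand(64, 64, 3)
seq = patchify(img)
print(seq.shape)  # (16, 768)

# A linear "embedding" projects each patch vector to the model width;
# an untrained random matrix stands in here for the learned layer.
W_embed = np.random.rand(768, 128)
embedded = seq @ W_embed  # (16, 128): a "sentence" of 16 patch "words"
```

After this step the image really is a sequence of vectors, so the same self-attention machinery used for text applies unchanged.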

✨ Why does this help?
In traditional CNNs, each convolutional filter looks only at a small local region (say 3×3 pixels).
In transformers, every patch can look at every other patch — even if they are far away!
So Transformers can capture global relationships better, even within a single layer.
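A quick back-of-the-envelope comparison of the two receptive fields. The formula below assumes stride-1 3×3 convolutions, and `conv_receptive_field` is just an illustrative helper:

```python
import numpy as np

def conv_receptive_field(num_layers, kernel=3):
    """Width (in pixels) of the receptive field after stacking
    `num_layers` stride-1 convolutions with the given kernel size."""
    return (kernel - 1) * num_layers + 1

# A CNN needs many stacked layers before distant pixels interact:
print(conv_receptive_field(1))   # 3
print(conv_receptive_field(10))  # 21

# One self-attention layer over N patches produces an N x N weight matrix,
# so every patch can draw on every other patch immediately. Uniform weights
# here just illustrate the dense (all-to-all) connectivity pattern.
N = 16
attn_weights = np.full((N, N), 1.0 / N)
print(attn_weights.shape)  # (16, 16): every patch attends to every patch
```

This is the sense in which attention is "global from layer one", whereas a convolution's view grows only linearly with depth.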

@Phoenixcoder-6
Author

Check this out

@sumansuhag

Hi @Phoenixcoder-6, you did a great job explaining how Transformers work in computer vision, especially with your "patches as words" comparison! The way they capture relationships across the whole image is truly a game-changer for understanding complex scenes.

This raises a question: beyond raw performance, how can we use the global context of Transformers in CV to make models easier to interpret? Why do they attend to certain patches that are far apart? Also, do you think we'll see more hybrid models that combine CNNs and Transformers? That could bring together the best of both worlds in feature extraction for better efficiency and accuracy.

