
@Phoenixcoder-6

The question was: how can Transformers be used for tasks other than natural language processing, such as computer vision?
Here is the clearest explanation I have found.

In NLP:

  • A sentence is a sequence of words.
  • Transformer sees each word as a token.
  • It learns how words relate to each other using self-attention.

Example:
In the sentence "The cat sat on the mat",
"cat" and "sat" are related,
and "cat" and "mat" are also related (because the cat is on the mat).

The transformer automatically figures out these relationships.
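To make this concrete, here is a minimal NumPy sketch of self-attention over a sequence of token vectors. The `self_attention` function is purely illustrative: a real Transformer adds learned query/key/value projections and multiple heads, which this sketch omits.

```python
import numpy as np

def self_attention(X):
    """Scaled dot-product self-attention over a sequence of token vectors.

    X: (seq_len, d) array of token embeddings. For simplicity this sketch
    uses X itself as queries, keys, and values (no learned projections).
    """
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)  # (seq_len, seq_len) pairwise similarities
    # Softmax over each row: how much each token attends to every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ X  # each token becomes a weighted mix of all tokens

# Toy "sentence": 6 tokens, each a 4-dimensional embedding
tokens = np.random.rand(6, 4)
out = self_attention(tokens)
print(out.shape)  # (6, 4): one updated vector per token
```

The key point is the `(seq_len, seq_len)` score matrix: every token's new representation depends on every other token, which is exactly the relationship-finding described above.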

In Computer Vision:

An image is not a sequence; it's a 2D grid (height × width × channels).
So before feeding it to a transformer, we:

  • Cut the image into small square patches (say 16×16 pixels).
  • Flatten each patch into a long vector (just line up its pixel values).
  • Project each flattened patch to a fixed-size embedding vector (like the embedding layer for text tokens).

Now, treat patches like "words" and the image like a "sentence"!
Then self-attention can figure out which parts of the image should attend to which others.
Maybe the eyes should attend to the nose for face recognition;
maybe the wheels should attend to the car body for car detection.
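The patch pipeline above can be sketched in a few lines of NumPy. The `patchify` helper and the random `W_embed` matrix are illustrative stand-ins: a real Vision Transformer learns the projection weights and also adds position embeddings, which this sketch omits.

```python
import numpy as np

def patchify(image, patch=16):
    """Cut an H x W x C image into flattened square patches (ViT-style)."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0
    # Split the grid into (num_rows, patch, num_cols, patch, C) blocks
    patches = image.reshape(H // patch, patch, W // patch, patch, C)
    patches = patches.transpose(0, 2, 1, 3, 4)      # (nRows, nCols, p, p, C)
    return patches.reshape(-1, patch * patch * C)   # (num_patches, p*p*C)

# 64x64 RGB image -> 16 patches, each flattened to 16*16*3 = 768 values
img = np.random.rand(64, 64, 3)
seq = patchify(img)
print(seq.shape)  # (16, 768)

# A linear "embedding" projects each patch vector to the model width;
# an untrained random matrix stands in here for the learned layer.
W_embed = np.random.rand(768, 128)
embedded = seq @ W_embed  # (16, 128): a "sentence" of 16 patch "words"
```

After this step the image really is a sequence of vectors, so the same self-attention machinery used for text applies unchanged.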

✨ Why does this help?
In traditional CNNs, each convolutional filter looks only at a small local region (say 3×3 pixels).
In transformers, every patch can look at every other patch — even if they are far away!
So Transformers can capture global relationships better, even within a single layer.
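A quick back-of-the-envelope comparison of the two receptive fields. The formula below assumes stride-1 3×3 convolutions, and `conv_receptive_field` is just an illustrative helper:

```python
import numpy as np

def conv_receptive_field(num_layers, kernel=3):
    """Width (in pixels) of the receptive field after stacking
    `num_layers` stride-1 convolutions with the given kernel size."""
    return (kernel - 1) * num_layers + 1

# A CNN needs many stacked layers before distant pixels interact:
print(conv_receptive_field(1))   # 3
print(conv_receptive_field(10))  # 21

# One self-attention layer over N patches produces an N x N weight matrix,
# so every patch can draw on every other patch immediately. Uniform weights
# here just illustrate the dense (all-to-all) connectivity pattern.
N = 16
attn_weights = np.full((N, N), 1.0 / N)
print(attn_weights.shape)  # (16, 16): every patch attends to every patch
```

This is the sense in which attention is "global from layer one", whereas a convolution's view grows only linearly with depth.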

@Phoenixcoder-6
Author

Check this out

@sumansuhag

Hi @Phoenixcoder-6, you did a great job explaining how Transformers work in computer vision, especially with your "patches as words" comparison! The way they capture relationships across the whole image is truly a game-changer for understanding complex scenes.

This raises a question: beyond raw performance, how can we use the global context of Transformers in CV to make models easier to interpret? Why do they attend to certain patches that are far apart? Also, do you think we'll see more hybrid models that combine CNNs and Transformers? That could bring together the best of both worlds in feature extraction for better efficiency and accuracy.

