This project visualizes how different distance metrics affect clustering in embedding spaces, using a small sample news dataset.
Embeddings map objects to vectors in a continuous space. The choice of distance metric significantly impacts how we measure similarity between these vectors, and consequently how algorithms like K-means form clusters. This visualization helps build an intuition for those differences.
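As one illustration of that sensitivity, here is a minimal sketch (not from this project; the data is a random stand-in for article embeddings) of steering the same K-means run toward Euclidean or cosine geometry. scikit-learn's KMeans is built around Euclidean distance, but L2-normalizing the vectors first makes Euclidean distance monotone in cosine distance, so the second run behaves like spherical (cosine) k-means:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 32))  # random stand-in for article embeddings

# scikit-learn's KMeans always minimizes Euclidean distance internally.
euclidean_labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

# On unit vectors, squared Euclidean distance equals 2 * cosine distance,
# so L2-normalizing first turns KMeans into spherical (cosine) k-means.
cosine_labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(normalize(X))

# Permutation-invariant agreement between the two partitions (1.0 = identical).
print(adjusted_rand_score(euclidean_labels, cosine_labels))
```

An adjusted Rand score well below 1.0 means the two metrics genuinely partition the same vectors differently.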
Showing the top 10 recommendations for a random article under each of the metrics described below (a code sketch of this lookup appears after the figure):

Figure from https://www.maartengrootendorst.com/blog/distances/
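A minimal sketch of that top-10 lookup, assuming an `embeddings` array with one row per article (filled with random numbers here purely for illustration); `scipy.spatial.distance.cdist` accepts each metric by name:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Stand-in data: 500 articles embedded as 64-dimensional vectors.
rng = np.random.default_rng(42)
embeddings = rng.normal(size=(500, 64))

query_idx = int(rng.integers(len(embeddings)))  # pick a random article
query = embeddings[query_idx : query_idx + 1]   # keep 2-D shape for cdist

for metric in ("euclidean", "cityblock", "cosine"):
    dists = cdist(query, embeddings, metric=metric)[0]
    dists[query_idx] = np.inf          # exclude the article itself
    top10 = np.argsort(dists)[:10]     # indices of the 10 nearest articles
    print(f"{metric:>9}: {top10.tolist()}")
```

With real embeddings the three lists typically overlap but are rarely identical, which is exactly the effect the figure highlights.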
- Euclidean Distance: The straight-line distance between two points in space, calculated as the square root of the sum of squared differences between coordinates. It creates spherical clusters and works best in low-dimensional spaces with dense, real-valued data where scale is meaningful, but suffers in high dimensions and is sensitive to outliers.
- Manhattan Distance: The sum of absolute differences between coordinates, representing the distance a taxi would drive in a grid-like city layout. It creates more diamond-shaped clusters, is more robust to outliers than Euclidean distance, and often performs better in high-dimensional spaces, making it suitable for discrete features.
- Minkowski Distance: A generalization of both Euclidean (p=2) and Manhattan (p=1) distances, allowing flexibility in how differences between dimensions are aggregated. By adjusting the parameter p, you can control the balance between focusing on large differences (higher p) versus treating all differences more equally (lower p).
- Cosine Distance: Measures the angle between vectors regardless of their magnitude, focusing purely on orientation. It's particularly valuable for text and high-dimensional sparse data, where the direction of vectors often carries more semantic meaning than their length, making it the standard choice for comparing document embeddings and semantic search.
- Jaccard Distance: Measures dissimilarity between sets, calculated as 1 minus the ratio of the intersection to the union of the sets. It focuses entirely on presence/absence rather than values, treating features as binary and ignoring magnitude completely, making it ideal for binary data, set comparisons, and collaborative filtering. (Each of these metrics is computed in the sketch after this list.)
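To make the definitions concrete, a short sketch computing each of the five distances between two made-up vectors with SciPy (the vectors and the choice of p=3 are purely illustrative):

```python
import numpy as np
from scipy.spatial.distance import cityblock, cosine, euclidean, jaccard, minkowski

# Two made-up vectors, chosen only to keep the numbers easy to follow.
u = np.array([1.0, 0.0, 2.0, 3.0])
v = np.array([0.0, 1.0, 2.0, 5.0])

print("euclidean    :", euclidean(u, v))     # sqrt of summed squared diffs
print("manhattan    :", cityblock(u, v))     # sum of absolute diffs
print("minkowski p=3:", minkowski(u, v, 3))  # higher p stresses large diffs
print("cosine       :", cosine(u, v))        # 1 - cos(angle between u and v)

# Jaccard compares sets, so binarize to presence/absence first:
# distance = 1 - |intersection| / |union| of the nonzero positions.
print("jaccard      :", jaccard(u > 0, v > 0))
```

Raising p above 2 pushes Minkowski distance toward the single largest coordinate difference (the Chebyshev limit as p goes to infinity), while p=1 recovers Manhattan distance exactly.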