This project visualizes how different distance metrics affect clustering in embedding spaces, using a small sample news dataset.
Embeddings map objects to vectors in a continuous space. The choice of distance metric significantly impacts how we measure similarity between these vectors, and consequently how algorithms like K-means form clusters. This visualization helps build an intuition for those differences.
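As one illustration of that sensitivity, here is a minimal sketch (not from this project; the data is a random stand-in for article embeddings) of steering the same K-means run toward Euclidean or cosine geometry. scikit-learn's KMeans is built around Euclidean distance, but L2-normalizing the vectors first makes Euclidean distance monotone in cosine distance, so the second run behaves like spherical (cosine) k-means:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 32))  # random stand-in for article embeddings

# scikit-learn's KMeans always minimizes Euclidean distance internally.
euclidean_labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

# On unit vectors, squared Euclidean distance equals 2 * cosine distance,
# so L2-normalizing first turns KMeans into spherical (cosine) k-means.
cosine_labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(normalize(X))

# Permutation-invariant agreement between the two partitions (1.0 = identical).
print(adjusted_rand_score(euclidean_labels, cosine_labels))
```

An adjusted Rand score well below 1.0 means the two metrics genuinely partition the same vectors differently.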
Showing the top 10 recommendations for a random article under each of the metrics described below (a code sketch of this lookup appears after the figure):

Figure from https://www.maartengrootendorst.com/blog/distances/
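A minimal sketch of that top-10 lookup, assuming an `embeddings` array with one row per article (filled with random numbers here purely for illustration); `scipy.spatial.distance.cdist` accepts each metric by name:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Stand-in data: 500 articles embedded as 64-dimensional vectors.
rng = np.random.default_rng(42)
embeddings = rng.normal(size=(500, 64))

query_idx = int(rng.integers(len(embeddings)))  # pick a random article
query = embeddings[query_idx : query_idx + 1]   # keep 2-D shape for cdist

for metric in ("euclidean", "cityblock", "cosine"):
    dists = cdist(query, embeddings, metric=metric)[0]
    dists[query_idx] = np.inf          # exclude the article itself
    top10 = np.argsort(dists)[:10]     # indices of the 10 nearest articles
    print(f"{metric:>9}: {top10.tolist()}")
```

With real embeddings the three lists typically overlap but are rarely identical, which is exactly the effect the figure highlights.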
- Euclidean Distance: The straight-line distance between two points in space, calculated as the square root of the sum of squared differences between coordinates. It creates spherical clusters and works best in low-dimensional spaces with dense, real-valued data where scale is meaningful, but suffers in high dimensions and is sensitive to outliers.
- Manhattan Distance: The sum of absolute differences between coordinates, representing the distance a taxi would drive in a grid-like city layout. It creates more diamond-shaped clusters, is more robust to outliers than Euclidean distance, and often performs better in high-dimensional spaces, making it suitable for discrete features.
- Minkowski Distance: A generalization of both Euclidean (p=2) and Manhattan (p=1) distances, allowing flexibility in how differences between dimensions are aggregated. By adjusting the parameter p, you can control the balance between focusing on large differences (higher p) versus treating all differences more equally (lower p).
- Cosine Distance: Measures the angle between vectors regardless of their magnitude, focusing purely on orientation. It's particularly valuable for text and high-dimensional sparse data, where the direction of vectors often carries more semantic meaning than their length, making it the standard choice for comparing document embeddings and semantic search.
- Jaccard Distance: Measures dissimilarity between sets, calculated as 1 minus the ratio of the intersection to the union of the sets. It focuses entirely on presence/absence rather than values, treating features as binary and ignoring magnitude completely, making it ideal for binary data, set comparisons, and collaborative filtering. (Each of these metrics is computed in the sketch after this list.)
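To make the definitions concrete, a short sketch computing each of the five distances between two made-up vectors with SciPy (the vectors and the choice of p=3 are purely illustrative):

```python
import numpy as np
from scipy.spatial.distance import cityblock, cosine, euclidean, jaccard, minkowski

# Two made-up vectors, chosen only to keep the numbers easy to follow.
u = np.array([1.0, 0.0, 2.0, 3.0])
v = np.array([0.0, 1.0, 2.0, 5.0])

print("euclidean    :", euclidean(u, v))     # sqrt of summed squared diffs
print("manhattan    :", cityblock(u, v))     # sum of absolute diffs
print("minkowski p=3:", minkowski(u, v, 3))  # higher p stresses large diffs
print("cosine       :", cosine(u, v))        # 1 - cos(angle between u and v)

# Jaccard compares sets, so binarize to presence/absence first:
# distance = 1 - |intersection| / |union| of the nonzero positions.
print("jaccard      :", jaccard(u > 0, v > 0))
```

Raising p above 2 pushes Minkowski distance toward the single largest coordinate difference (the Chebyshev limit as p goes to infinity), while p=1 recovers Manhattan distance exactly.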