Why Should Euclidean Distance Not Be the Default Distance Measure?
Last Updated on October 19, 2022 by Editorial Team
Author(s): Harjot Kaur
Originally published on Towards AI the World’s Leading AI and Technology News and Media Company. If you are building an AI-related product or service, we invite you to consider becoming an AI sponsor. At Towards AI, we help scale AI and technology startups. Let us help you unleash your technology to the masses.
The sustained development of technologies, data storage resources, and computing resources, gives rise to the production, storage, and processing of an exponentially growing volume of data. Typically, data mining applications involve the handling of this huge amount of high-dimensional data.
The major techniques used in data mining rely on the measures deduced from the Euclidean distance to assess differences in distances between objects. However, many studies show that most of those existing techniques are unsuitable for high-dimensional data due to the phenomena known as the “curse of dimensionality”. This curse of mining high-dimensional data arises when the concepts like proximity, distance, or the nearest neighbor become less meaningful with the increasing dimensionality of data sets. Studies show that the relative distance between the farthest point and the nearest point converges to 0 with increased dimensionality d:
In other words, the discrimination between the nearest and the farthest neighbors in the sample population becomes rather weaker in high-dimensional space.
Further, the paper, On the Surprising Behavior of Distance Metrics in High Dimensional Space, proves that the Euclidean (L2) norm (read more on vector norms here), is often not a desirable metric for high-dimensional data mining applications. For a wide variety of distance functions, because of the concentration of distance in high-dimensional spaces, the ratio of the distances of the nearest and farthest neighbors to a given target is almost one. As a result, there is no variation between the distances of different data points.
Typically, most real-world problems operate in a high-dimensional data space, hence, Euclidean distance is generally not a desirable metric for high-dimensional data mining applications.
The aforementioned paper investigates the behavior of the Lk norm in high-dimensional space. Based on these results, for a given value of the high-dimensionality d, it may be preferable to use a lower value of k. In other words, for a high-dimensional application, the L1 distance is more favorable than the L2.
Sahar Sohangir and Dingding Wang, in the paper, Improved sqrt‑cosine similarity measurement propose an improved sqrt-cosine (ISC) similarity measurement metric. ISC is expressed as follows:
ISC is an extension of the Hellinger distance, which is of the L1 norm (it is proven that in the high-dimensional data, the L1 norm works better than the L2 norm). In the above equation, instead of using the L1 norm, we use the square root of the L1 norm.
The paper impress that most applications consider cosine similarity as “state of the art” in similarity measurement. ISC performs favorably when compared with cosine similarity and other popular techniques for measuring similarities in the high-dimensional data space.
Guidepost to selecting the ‘best’ distance measure
We must understand that there may never be the ‘best’ distance measure, but there always exists a ‘right’ measure. The selection of the right distance measure is dependent on factors like data distribution, data dimensionality, data type, the expectations/goal we pursue, etc. For instance, if we are dealing with text data, we know cosine similarity tends to work better. Similarly, data with higher noise and outliers may not be easily dealt with Euclidean distance. Again, Euclidean distance performs very well on two-dimensional data, where the aim is to measure magnitude.
An excerpt from the paper, A Comparison Study on Similarity and Dissimilarity Measures in Clustering Continuous Data, gives a fair amount of direction to selecting the ‘right’ distance measure.
Conclusion
Generally, most real-world problems operate in a high-dimensional data space which makes Euclidean distance not the most desirable measure. For high-dimensional data space, improved sqrt-cosine (ISC) has been proven to perform better than most measures. However, there cannot be one distance measure fitting every situation. Data practitioners must carefully examine factors like data distribution, data dimensionality, data type, noise, the expectations/goal we pursue, etc., before choosing the ‘right’ measure and STOP setting ‘Euclidean distance as the default distance measure’.
Lastly, thank you for your patience in reading up till the end and if you’ve found this piece useful, then give me a clap or two! and if not, do write back with your comments and questions; I will be happy to answer and connect for a discussion on Linkedin.
References:
a) https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0144059&type=printable
c) http://kops.uni-konstanz.de/bitstream/handle/123456789/5849/P506.pdf?sequence=1&isAllowed=y
d) https://bib.dbvis.de/uploadedFiles/155.pdf
e) https://journalofbigdata.springeropen.com/articles/10.1186/s40537-017-0083-6
Why Should Euclidean Distance Not Be the Default Distance Measure? was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
Join thousands of data leaders on the AI newsletter. It’s free, we don’t spam, and we never share your email address. Keep up to date with the latest work in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor.
Published via Towards AI