Why Should Euclidean Distance Not Be the Default Distance Measure?
Author(s): Harjot Kaur
The sustained development of technologies, data storage, and computing resources gives rise to the production, storage, and processing of an exponentially growing volume of data. Typically, data mining applications involve handling this huge amount of high-dimensional data.
The major techniques used in data mining rely on measures derived from the Euclidean distance to assess differences between objects. However, many studies show that most of these techniques are unsuitable for high-dimensional data due to the phenomenon known as the "curse of dimensionality". This curse arises because concepts like proximity, distance, or the nearest neighbor become less meaningful as the dimensionality of the data set increases. Studies show that the relative difference between the distances to the farthest and the nearest point converges to 0 as the dimensionality d increases.
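One standard way to state this result (a sketch of the usual formalization, where D_max and D_min denote the distances from a query point to its farthest and nearest neighbors among the sample points in d dimensions; it holds under fairly general assumptions on the data distribution) is:

$$\lim_{d \to \infty} \frac{D_{\max}^{(d)} - D_{\min}^{(d)}}{D_{\min}^{(d)}} \;=\; 0 \quad \text{(in probability)}$$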
In other words, the discrimination between the nearest and the farthest neighbors in the sample becomes much weaker in high-dimensional space.
Further, the paper On the Surprising Behavior of Distance Metrics in High Dimensional Space shows that the Euclidean (L2) norm is often not a desirable metric for high-dimensional data mining applications. For a wide variety of distance functions, because distances concentrate in high-dimensional spaces, the ratio of the distances from a given query point to its nearest and farthest neighbors approaches one. As a result, there is almost no variation between the distances of different data points.
Most real-world problems operate in a high-dimensional data space; hence, Euclidean distance is generally not a desirable metric for such data mining applications.
Is there a better distance measure for high-dimensional data?
The aforementioned paper investigates the behavior of the Lk norm in high-dimensional space. Based on its results, for a given (high) dimensionality d, it may be preferable to use a lower value of k. In other words, for a high-dimensional application, the L1 distance is preferable to the L2.
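To make this concrete, here is a minimal simulation sketch (assuming NumPy, with uniformly distributed random points chosen purely for illustration) of how the relative contrast (D_max − D_min)/D_min shrinks as dimensionality grows, and how it stays larger for the L1 norm than for the L2 norm:

```python
import numpy as np

rng = np.random.default_rng(42)

def relative_contrast(points, query, ord):
    """Relative contrast (D_max - D_min) / D_min under a given Lk norm."""
    dists = np.linalg.norm(points - query, ord=ord, axis=1)
    return (dists.max() - dists.min()) / dists.min()

n_points = 1000
for d in (2, 10, 100, 1000):
    points = rng.random((n_points, d))             # hypothetical uniform data
    query = rng.random(d)
    c1 = relative_contrast(points, query, ord=1)   # L1 (Manhattan)
    c2 = relative_contrast(points, query, ord=2)   # L2 (Euclidean)
    print(f"d={d:5d}  L1 contrast={c1:7.3f}  L2 contrast={c2:7.3f}")
```

As d grows, both contrasts shrink toward 0, but the L2 contrast collapses faster, which matches the behavior described in the paper.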
Sahar Sohangir and Dingding Wang, in the paper Improved sqrt-cosine similarity measurement, propose an improved sqrt-cosine (ISC) similarity measure.
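For two n-dimensional non-negative feature vectors x and y, ISC is expressed as follows (following the definition given in the paper):

$$\mathrm{ISC}(x, y) = \frac{\sum_{i=1}^{n} \sqrt{x_i\, y_i}}{\sqrt{\sum_{i=1}^{n} x_i}\;\sqrt{\sum_{i=1}^{n} y_i}}$$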
ISC is an extension of the Hellinger distance, which is based on the L1 norm (and, as discussed above, the L1 norm works better than the L2 norm on high-dimensional data). In the equation above, instead of using the L1 norm in the denominator, we use the square root of the L1 norm.
The paper stresses that most applications consider cosine similarity the "state of the art" in similarity measurement. ISC performs favorably when compared with cosine similarity and other popular similarity measures in high-dimensional data spaces.
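Here is a minimal NumPy sketch of the ISC formula above, alongside standard cosine similarity for comparison; the toy term-frequency vectors are purely illustrative:

```python
import numpy as np

def cosine_similarity(x, y):
    """Standard cosine similarity: dot product of L2-normalized vectors."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def isc_similarity(x, y):
    """Improved sqrt-cosine (ISC) similarity for non-negative vectors:
    sum of sqrt(x_i * y_i), divided by the square roots of the
    L1 norms of x and y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    numerator = np.sum(np.sqrt(x * y))
    denominator = np.sqrt(np.sum(x)) * np.sqrt(np.sum(y))
    return numerator / denominator

# Toy term-frequency vectors for two documents (hypothetical data)
doc_a = np.array([3, 0, 1, 2, 0])
doc_b = np.array([2, 1, 0, 3, 1])
print("cosine:", round(cosine_similarity(doc_a, doc_b), 4))
print("ISC:   ", round(isc_similarity(doc_a, doc_b), 4))
```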
Guidepost to selecting the "best" distance measure
We must understand that there may never be a single "best" distance measure, but there is always a "right" one for the task at hand. The selection of the right distance measure depends on factors like the data distribution, the dimensionality, the data type, the presence of noise and outliers, and the goal we pursue. For instance, if we are dealing with text data, cosine similarity tends to work better. Similarly, data with heavy noise and outliers may not be handled well by Euclidean distance. Euclidean distance, in turn, performs very well on low-dimensional data where the aim is to measure magnitude.
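As a small illustration of treating the metric as an explicit choice rather than a default (assuming scikit-learn is available; the random data here is purely illustrative), the same nearest-neighbor query can be run under several metrics and the results compared:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.random((200, 50))       # hypothetical 50-dimensional data set
query = rng.random((1, 50))     # a single query point

# Run the same 3-nearest-neighbor search under three different metrics;
# the neighbors (and their ordering) can differ from metric to metric.
for metric in ("euclidean", "manhattan", "cosine"):
    nn = NearestNeighbors(n_neighbors=3, metric=metric).fit(X)
    distances, indices = nn.kneighbors(query)
    print(f"{metric:>9}: neighbors {indices[0]}, distances {np.round(distances[0], 3)}")
```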
An excerpt from the paper A Comparison Study on Similarity and Dissimilarity Measures in Clustering Continuous Data gives a fair amount of direction for selecting the "right" distance measure.
Conclusion
Generally, most real-world problems operate in a high-dimensional data space, which makes Euclidean distance not the most desirable measure. For high-dimensional data, the improved sqrt-cosine (ISC) similarity has been shown to perform better than most common measures. However, there cannot be one distance measure that fits every situation. Data practitioners must carefully examine factors like data distribution, dimensionality, data type, noise, and the goal they pursue before choosing the "right" measure, and stop setting Euclidean distance as the default distance measure.
Lastly, thank you for your patience in reading up to the end. If you've found this piece useful, give me a clap or two! If not, do write back with your comments and questions; I will be happy to answer and to connect for a discussion on LinkedIn.
References:
a) https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0144059&type=printable
b) http://kops.uni-konstanz.de/bitstream/handle/123456789/5849/P506.pdf?sequence=1&isAllowed=y
c) https://bib.dbvis.de/uploadedFiles/155.pdf
d) https://journalofbigdata.springeropen.com/articles/10.1186/s40537-017-0083-6