Here Is Why You Probably Use numpy.std Incorrectly

Last Updated on December 17, 2022 by Editorial Team

Author(s): Leon Eversberg

Originally published on Towards AI.

How to estimate the standard deviation correctly in Python

A bell-shaped normal distribution. Image by OpenClipart-Vectors on Pixabay

A normally distributed population with size N can be described by its mean μ and its standard deviation σ. This distribution is also known as a bell curve.

The standard deviation (std) can be computed using Eq. 1.

$$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}$$

Eq. 1: The standard deviation σ of a population with size N and mean μ
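To make the population formula concrete, here is a minimal sketch that evaluates it by hand in plain Python. The eight values are a hypothetical population chosen only because the arithmetic comes out to round numbers; they do not appear in the original article.

```python
import math

# Hypothetical population of N = 8 values (illustrative only)
population = [2, 4, 4, 4, 5, 5, 7, 9]

N = len(population)
mu = sum(population) / N  # population mean: 5.0

# Eq. 1: square root of the mean squared deviation from mu
sigma = math.sqrt(sum((x - mu) ** 2 for x in population) / N)

print(sigma)  # 2.0
```

The squared deviations sum to 32, so dividing by N = 8 gives a variance of 4 and a standard deviation of 2.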

However, if we compute the standard deviation in Python with NumPy and pandas, we get different results.

In this article, you will learn why this is the case.

np.std vs pandas std

Here is the output of NumPy's std and the output of pandas' std for some example data points X.

import numpy as np
import pandas as pd

X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

df = pd.DataFrame({'X': X})

print(f"numpy std(X): {np.std(X)}")
# >> numpy std(X): 2.8722813232690143

print(f"pandas std(df): {df.std()}")
# >> pandas std(df): X    3.02765
# >> dtype: float64

As you can see, NumPy gives us a standard deviation of about 2.87, whereas pandas gives us about 3.03. So, which one is correct?

Bessel's Correction

To understand the problem, we have to dive deeper into the topic of the standard deviation.

As engineers or scientists, we usually do not know the true population mean μ. However, we can calculate the sample mean m from our n data points x with the well-known Eq. 2.

$$m = \frac{1}{n} \sum_{i=1}^{n} x_i$$

Eq. 2: The sample mean for data points x

By replacing the population mean μ with the sample mean m, and the population size N with the sample size n in Eq. 1, we get the sample standard deviation s according to Eq. 3.

$$s = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - m)^2}$$

Eq. 3: The sample standard deviation

Given enough data points, we want the sample standard deviation s to be as close as possible to the true standard deviation of the population σ. Mathematically, this can be expressed using the expected value: an estimator is unbiased if its expected value equals the quantity it estimates, for example E[s²] = σ² for the variance.

It turns out that this is not true for the sample standard deviation. Eq. 4 shows that the squared sample standard deviation (the sample variance) is biased by the factor (n − 1)/n. For the full proof, see reference [1].

$$\mathbb{E}\left[s^2\right] = \frac{n-1}{n}\,\sigma^2$$

Eq. 4: The sample standard deviation is biased [1]
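The bias can also be checked empirically. The following is a minimal simulation sketch, not a substitute for the proof in [1]. It assumes normally distributed data, and the values of `sigma`, `n`, `trials` and the random seed are arbitrary choices made for illustration. It also assumes that the factor (n − 1)/n applies to the squared estimate s² (the variance).

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed, chosen arbitrarily

sigma = 2.0        # assumed true population standard deviation
n = 5              # a small sample size makes the bias easy to see
trials = 200_000   # number of simulated samples

# Each row is one sample of n points drawn from a normal distribution
samples = rng.normal(loc=0.0, scale=sigma, size=(trials, n))

# Uncorrected sample variance (divisor n), one value per row
s_squared = samples.var(axis=1, ddof=0)

print(f"mean of s^2:         {s_squared.mean():.3f}")
print(f"(n - 1)/n * sigma^2: {(n - 1) / n * sigma**2:.3f}")
print(f"sigma^2:             {sigma**2:.3f}")
```

Averaged over many samples, s² should land close to (n − 1)/n · σ² = 3.2 rather than σ² = 4, which is the systematic underestimate that Bessel's correction removes.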

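However, this can be corrected by replacing the n in Eq. 3 with (n − 1).

This is called Bessel's correction, leading us to the estimator of Eq. 5, written here as s_{n−1}. Strictly speaking, the correction makes the estimate of the variance unbiased; its square root still slightly underestimates σ on average, but less than Eq. 3 does.

$$s_{n-1} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - m)^2}$$

Eq. 5: Bessel-corrected estimate of the population standard deviation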
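As a quick check, both divisors can be applied by hand to the ten data points from the earlier example. This is a plain-Python sketch; the variable names are illustrative and not taken from any library.

```python
import math

X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

n = len(X)
m = sum(X) / n                                    # sample mean
sum_sq_dev = sum((x - m) ** 2 for x in X)         # sum of squared deviations

s_uncorrected = math.sqrt(sum_sq_dev / n)         # divisor n
s_corrected = math.sqrt(sum_sq_dev / (n - 1))     # divisor n - 1 (Bessel)

print(f"{s_uncorrected:.4f}")  # 2.8723
print(f"{s_corrected:.4f}")    # 3.0277
```

These are the two values that NumPy and pandas printed above, so the discrepancy comes down to the divisor alone.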

Explaining np.std vs pandas std

Given this knowledge, we can now explain the difference between np.std and pandas std functions.

By default, NumPy uses the divisor n (Eq. 3), whereas pandas uses Bessel's correction with the divisor n − 1 (Eq. 5). We can change NumPy's calculation by specifying the parameter ddof.

ddof: int, optional

Means Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements. By default ddof is zero.
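Because ddof only changes the divisor, the two results always differ by the factor sqrt(n / (n − 1)), which approaches 1 as the sample grows. The sketch below illustrates this with randomly generated data; the sample sizes and the seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # arbitrary seed for reproducibility

for n in (5, 50, 5000):
    x = rng.normal(size=n)
    std_ddof0 = np.std(x)            # divisor n (NumPy default)
    std_ddof1 = np.std(x, ddof=1)    # divisor n - 1
    ratio = std_ddof1 / std_ddof0    # should equal sqrt(n / (n - 1))
    print(f"n={n:>4}: ratio={ratio:.5f}, sqrt(n/(n-1))={np.sqrt(n / (n - 1)):.5f}")
```

For a handful of data points the choice of ddof changes the result noticeably, while for thousands of points it is negligible.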

Going back to our initial Python example, we can now use the parameter ddof in np.std to apply Bessel's correction.

import numpy as np
import pandas as pd

X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

df = pd.DataFrame({'X': X})

print(f"pandas std(df): {df.std()}")
# >> pandas std(df): X    3.02765
# >> dtype: float64

print(f"numpy std (X, ddof=1): {np.std(X, ddof=1)}")
# >> numpy std (X, ddof=1): 3.0276503540974917

Setting the parameter ddof=1 in np.std now gives us a standard deviation of about 3.03, the same result as pandas.

Conclusion

Bessel's correction removes the bias of the sample variance, and with it most of the bias of the sample standard deviation s, by replacing n with n − 1.

NumPy's std function uses the divisor n − ddof. By default, NumPy uses ddof=0 and pandas uses ddof=1.

To sum it up, if you have a sample of data points from a population and you want a bias-corrected estimate of the population standard deviation with NumPy, use np.std(X, ddof=1).
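The alignment also works in the other direction, since pandas' std accepts a ddof argument as well. A short sketch that makes both libraries agree under either convention, using the example data from above:

```python
import numpy as np
import pandas as pd

X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
series = pd.Series(X)

# Bessel-corrected: the pandas default matches NumPy with ddof=1
corrected_match = np.isclose(series.std(), np.std(X, ddof=1))

# Uncorrected: the NumPy default matches pandas with ddof=0
uncorrected_match = np.isclose(series.std(ddof=0), np.std(X))

print(corrected_match, uncorrected_match)
```

Passing ddof explicitly in both libraries avoids relying on either default and makes the intended estimator visible in the code.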

References

[1] The Department of Mathematics and Computer Science, Bessel's Correction (accessed: 14.12.2022)

M. Holický, Introduction to Probability and Statistics for Engineers (2013), Springer, Berlin, Heidelberg.

pandas.DataFrame.std: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.std.html (accessed: 14.12.2022)

numpy.std: https://numpy.org/doc/stable/reference/generated/numpy.std.html (accessed: 14.12.2022)


Here Is Why You Probably Use numpy.std Incorrectly was originally published in Towards AI on Medium.
