Here Is Why You Probably Use numpy.std Incorrectly
Last Updated on December 17, 2022 by Editorial Team
Author(s): Leon Eversberg
Originally published on Towards AI.
How to estimate the standard deviation correctly in Python
A normally distributed population with size N can be described with its mean μ and its standard deviation σ. This distribution is also known as a bell curve.
The standard deviation (std) can be computed using Eq. 1.

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2} \tag{1}$$
However, if we compute the standard deviation in Python with NumPy and pandas, we get different results.
In this article, you will learn why this is the case.
np.std vs pandas std
Here is the output of NumPy's std and the output of pandas' std for some example data points X.
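import numpy as np
import pandas as pd

# Ten example data points, once as a list and once as a DataFrame column
X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
df = pd.DataFrame({'X': X})

# Standard deviation according to NumPy (default settings)
print(f"numpy std(X): {np.std(X)}")
>> numpy std(X): 2.8722813232690143

# Standard deviation according to pandas (default settings)
print(f"pandas std(df): {df.std()}")
>> pandas std(df): X 3.02765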
As you can see, NumPy gives us a standard deviation of 2.87 and pandas gives us a standard deviation of 3.03. So, which one is correct?
Bessel's Correction
To understand the problem, we have to dive deeper into the topic of the standard deviation.
As engineers or scientists, we usually do not know the true population mean μ. However, we can calculate the sample mean m from our n data points x with the well-known Eq. 2.

$$m = \frac{1}{n}\sum_{i=1}^{n} x_i \tag{2}$$
By replacing the population mean μ with the sample mean m, and the population size N with the sample size n in Eq. 1, we get the sample standard deviation s according to Eq. 3.

$$s = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - m)^2} \tag{3}$$
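As a quick check, Eq. 2 and Eq. 3 can be evaluated by hand for the ten data points from the example above. This is a minimal sketch in plain Python; the helper names `sample_mean` and `sample_std_uncorrected` are made up for illustration and are not part of NumPy or pandas.

```python
import math

def sample_mean(x):
    """Eq. 2: arithmetic mean of the data points."""
    return sum(x) / len(x)

def sample_std_uncorrected(x):
    """Eq. 3: divide the summed squared deviations by n, then take the root."""
    m = sample_mean(x)
    n = len(x)
    return math.sqrt(sum((xi - m) ** 2 for xi in x) / n)

X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
m = sample_mean(X)
s = sample_std_uncorrected(X)

print(m)            # 5.5
print(round(s, 4))  # 2.8723
```

The result agrees with the value that np.std(X) printed earlier, which already hints at which formula NumPy uses by default.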
Given enough data points, we want the sample standard deviation s to be as close as possible to the true standard deviation of the population σ. Mathematically, this can be expressed using the expected value: on average, the squared estimate s² should equal the population variance σ².

$$E[s^2] = \sigma^2$$

It turns out that this is not true for the sample standard deviation of Eq. 3. Eq. 4 shows that it is biased because of the additional factor (n − 1)/n. For the full proof, see reference [1].

$$E[s^2] = \frac{n-1}{n}\,\sigma^2 \tag{4}$$
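However, this can be corrected by replacing the n in Eq. 3 with (n − 1).
This is called Bessel's correction, leading us to the corrected estimator of Eq. 5, whose square is an unbiased estimator of the population variance σ². Strictly speaking, the square root s itself still slightly underestimates σ, but less than without the correction.

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - m)^2} \tag{5}$$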
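The bias can also be made visible with a small simulation instead of a proof. The sketch below is an illustration under assumed settings (σ = 2, sample size n = 5, 200,000 repetitions, fixed seed), none of which come from the original example: it draws many small samples from a normal distribution with known σ and averages the variance estimates obtained with divisor n and with divisor n − 1.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable
sigma = 2.0        # true population standard deviation (chosen for this demo)
n = 5              # a small sample size makes the bias easy to see
trials = 200_000   # number of independent samples

# Each row is one sample of n points from N(0, sigma^2).
samples = rng.normal(loc=0.0, scale=sigma, size=(trials, n))

# Averaging the variance estimates over all trials approximates E[s^2].
mean_var_n = np.var(samples, axis=1, ddof=0).mean()
mean_var_n_minus_1 = np.var(samples, axis=1, ddof=1).mean()

print(f"true variance:   {sigma**2:.3f}")
print(f"divide by n:     {mean_var_n:.3f}")
print(f"divide by n - 1: {mean_var_n_minus_1:.3f}")
```

With these settings, the first average should land near (n − 1)/n · σ² = 3.2 and the second near σ² = 4.0, up to simulation noise.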
Explaining np.std vs pandas std
Given this knowledge, we can now explain the difference between np.std and pandas std functions.
By default, NumPy uses 1/n (Eq. 3), whereas pandas uses Bessel's correction with 1/(n − 1) (Eq. 5). We can change NumPy's calculation by specifying the parameter ddof, which the NumPy documentation describes as follows:
ddof: int, optional
Means Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements. By default ddof is zero.
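Because ddof only changes the divisor, the two results are tied together by a fixed factor: both use the same sum of squared deviations, so the ddof=1 value equals the ddof=0 value multiplied by sqrt(n / (n − 1)). A minimal sketch of that relationship, using the same ten data points (the variable names are chosen here for illustration):

```python
import math
import numpy as np

X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
n = len(X)

std_ddof0 = np.std(X)          # divisor n (NumPy default)
std_ddof1 = np.std(X, ddof=1)  # divisor n - 1 (Bessel's correction)

# Same sum of squared deviations, so the results differ by sqrt(n / (n - 1)).
rescaled = std_ddof0 * math.sqrt(n / (n - 1))

print(math.isclose(rescaled, std_ddof1))  # True
```

The factor sqrt(n / (n − 1)) approaches 1 as n grows, so the choice of ddof matters most for small samples.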
Going back to our initial Python example, we can now use the parameter ddof in np.std to get the Bessel-corrected estimate of the standard deviation.
import numpy as np
import pandas as pd

X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
df = pd.DataFrame({'X': X})

# pandas divides by n - 1 by default (ddof=1)
print(f"pandas std(df): {df.std()}")
>> pandas std(df): X 3.02765

# NumPy divides by n - 1 only when ddof=1 is passed explicitly
print(f"numpy std (X, ddof=1): {np.std(X, ddof=1)}")
>> numpy std (X, ddof=1): 3.0276503540974917
Setting the parameter ddof=1 in np.std now gives us a standard deviation of 3.03, the same result as pandas.
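The adjustment also works in the other direction: the pandas std method accepts a ddof parameter as well, so passing ddof=0 reproduces NumPy's default. The sketch below illustrates this with a Series built from the same data, and it also shows one assumption worth checking in your own code: once the values are converted to a NumPy array, calling .std() on that array follows NumPy's default again, not pandas'.

```python
import numpy as np
import pandas as pd

X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
series = pd.Series(X, name="X")

pandas_population_std = series.std(ddof=0)  # override pandas' default ddof=1
numpy_population_std = np.std(X)            # NumPy's default ddof=0

# Caution: the underlying NumPy array uses NumPy's default divisor n.
array_std = series.to_numpy().std()

print(np.isclose(pandas_population_std, numpy_population_std))  # True
print(np.isclose(array_std, numpy_population_std))              # True
```

Passing ddof explicitly in both libraries avoids having to remember which default applies where.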
Conclusion
Bessel's correction removes the bias of the sample variance s² by replacing n with n − 1, and it is the standard way to estimate a population's standard deviation from a sample.
NumPy's std function divides by n − ddof. By default, NumPy uses ddof = 0 and pandas uses ddof = 1.
To sum it up, if you have some data points from a population and you want the Bessel-corrected estimate of the standard deviation with NumPy, use np.std(X, ddof=1).
References
[1] The Department of Mathematics and Computer Science, Bessel's Correction (accessed: 14.12.2022)
[2] M. Holický, Introduction to Probability and Statistics for Engineers (2013), Springer, Berlin, Heidelberg.
[3] pandas.DataFrame.std: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.std.html (accessed: 14.12.2022)
[4] numpy.std: https://numpy.org/doc/stable/reference/generated/numpy.std.html (accessed: 14.12.2022)
Here Is Why You Probably Use numpy.std Incorrectly was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.