Last Updated on July 25, 2023 by Editorial Team

Author(s): Hitesh Hinduja

Originally published on Towards AI.

Comparative Analysis of Popular Statistical Tests: Which One to Use When?

Let me begin by sharing my experience in detail. During my early years in the corporate world, my mentor imparted a piece of advice that has stayed with me to this day. He said, “If you want to excel in data science, you need to have a strong grasp of statistics. Visualize every problem as a statistical analysis and break it down into its components.”

Over the course of several months, my mentor explained to me nearly 27 different statistical tests that he had used during his 22+ year tenure in the corporate world. He gave me a comprehensive understanding of why each test was used and how it proved beneficial to his problem. Rather than simply describing the formula for each test, he focused on explaining each test based on its need and requirements.

My mentor would often tell me, “Many students are intimidated by statistics due to its complex derivations and mathematical terms. They try to understand the tests in a theoretical way.” For instance, he cited the example of Pearson’s correlation test. The standard answer to the question, “What is Pearson’s correlation test?” is “A statistical measure that shows the strength and direction of the linear relationship between two continuous variables. It is denoted by the symbol ‘r’ and can range from -1 to +1, with +1 denoting a positive relationship and -1 denoting a negative relationship.” While this answer is good, candidates often get confused or fail to provide a satisfactory response when asked about where the test is used and how it compares with other correlation tests.

During my mentorship under him, he taught me these tests and gave me homework to prepare a comparative analysis of all the tests he had explained. It took me 2–3 weeks to delve deeper into the subject, and I eventually created a small Excel file with a comparative analysis. We had detailed discussions, and after 4–5 months of these sessions, I became well-versed with each test that he had explained.

This is not the end of the story. Once, there was a technical discussion between the SVP of technology, my mentor, and a few others. Suddenly, I was called into the room to participate in the meeting. The audience was so technically sound that I found it difficult to grasp their discussion, and my brain was processing at just 50% of its usual capacity! Nonetheless, I tried my best to comprehend some aspects of the discussion and made notes of the topics and terms that I didn’t understand.

I still vividly remember the day when my diary was filled with nine pages of notes from that 40–45 minute meeting. Suddenly, my mentor asked me a question about improving the existing statistical model created by the previous team, and I was expected to provide suggestions on how we could enhance the model. He informed me about the three statistical tests that were used in the analysis. Since I could relate to all the lessons that I had received from him over the last few months, I started asking follow-up questions promptly.

Throughout my career in AI, statistics has consistently played a significant role. Over the past 8+ years, I have worked with and utilized over 85 different statistical tests, including but not limited to Theil’s U, ANOVA (in its multiple types), Cochran’s Q test, and many more. In my experience, every time I encounter a problem, having a strong practical understanding of the appropriate statistical test to use has been a win-win situation for me. Therefore, my advice to everyone is to have a solid foundation in statistics and be familiar with well-known statistical tests as a starting point. In this blog, we will study several statistical tests, each with a simple explanation followed by a technical explanation. Finally, we will compare all of these tests in a table. So, let’s get started. Storytime!

Following are the statistical tests we will be reading about, each described with real-life scenarios:

  • Chi-square test
  • Mann-Whitney U test
  • Kruskal-Wallis test
  • Fisher’s exact test
  • Shapiro-Wilk test

Chi-square test

Explanation with fun

Imagine you work at a candy factory that produces bags of M&Ms. You have noticed that the bags seem to have different proportions of colors than what you expected. You suspect that the machine that sorts the candies may not be working properly. To test this hypothesis, you decide to conduct a chi-square test.

You randomly sample several bags of M&Ms and count the number of candies of each color. You then compare these counts to the expected proportions based on the manufacturer’s stated ratios. You plug these numbers into a formula and get a chi-square value. If the value is high, it means that the observed counts are significantly different from the expected counts, and you can reject the null hypothesis that the machine is working properly.

To explain this in a funny way, imagine that the M&Ms are actually aliens from different planets. The red ones are from Mars, the green ones are from Venus, the blue ones are from Neptune, and so on. Your job is to ensure that they are all sorted correctly, but you suspect that the machine is mixing them up. So, you count the number of aliens from each planet in several bags and compare them to the expected numbers based on intergalactic treaties. If the actual counts are significantly different, it means that the machine is causing an intergalactic diplomatic incident, and you need to fix it before the aliens invade Earth!

Why Chi-square test:

Let’s start with an example to understand why we need the Chi-square test.

In a class of 100 students, there are 50 boys and 50 girls. The class teacher wants to see if there is any relation between gender and subject preference. Here, gender is a categorical variable, and subject preference (yes or no) is nominal data.

So, whenever we need to check whether two categorical variables are related to or independent of each other, the chi-square test is definitely the one to use.

The chi-square test is used to find if there is any relation between two or more categorical variables, especially those that are nominal in nature.

  • The Null hypothesis (H0) is that there is no association between the two variables.
  • The Alternate Hypothesis (H1) is that there is a significant association between the two variables.

There are two types of chi-square tests:

  • The test for independence finds relationships between two categorical variables.
  • The goodness of fit test finds the relationship between observed values and theoretical values.

(There is one more type, known as the Test of Homogeneity. To keep the length of this blog in check, we are not covering it here.)

How is the Chi-square test performed:

It compares the observed and expected frequencies of each cell in the contingency table.

If the calculated chi-squared statistic is greater than the critical value at a specified level of significance, then we reject the null hypothesis and conclude that there is a significant association between the two variables. It is a one-sided test.

The chi-square test statistic depends on the size of the difference between observed and expected values, the degrees of freedom, and the sample size.

The chi-square test statistic: χ² = ∑ (Oi − Ei)² / Ei

Where:

  • Oi = Observed values
  • Ei = Expected values
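
To make this concrete, here is a minimal sketch of the classroom example using scipy.stats.chi2_contingency; the counts in the table are hypothetical.

```python
# Minimal sketch: chi-square test of independence on a hypothetical
# gender vs. subject-preference table.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: boys, girls; columns: prefer the subject (yes, no) -- made-up counts
observed = np.array([[30, 20],
                     [20, 30]])

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.3f}, p = {p:.4f}, dof = {dof}")

if p < 0.05:
    print("Reject H0: gender and subject preference appear associated.")
else:
    print("Fail to reject H0: no evidence of an association.")
```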

Where is the Chi-square used:

This test is widely used in fields such as biology, marketing, and the social sciences, where researchers or analysts need to analyze data from multiple categorical variables to understand patterns, trends, and relationships.

Advantages:

  • The chi-square test is a non-parametric test, which means that it can be used to analyze data that does not follow a normal distribution. This makes it a flexible tool that can be used in a wide range of applications.
  • Apart from this, it is very simple to use, and it allows you to test multiple variables simultaneously, which is a great feature.

Mann-Whitney U test

Explanation with fun

Imagine you have two groups of people: Group A, who are the die-hard fans of pizza, and Group B, who swear by burgers. Now, you want to know which group eats more on average, but you don’t want to count the number of slices of pizza or burgers each person eats because that’s too much work. So instead, you decide to do the Mann-Whitney U test.

You line up all the people in both groups based on how much they eat, from the person who eats the least to the person who eats the most. Then, you give each person a rank based on their position in the line. For example, the person who eats the least gets a rank of 1, the second-least eater gets a rank of 2, and so on.

Now, you add up the ranks for each group. Let’s say group A has a total rank of 100, and group B has a total rank of 150. You realize that the higher rank sum (in this case, group B) suggests that they eat more on average than group A.

So, in short, the Mann-Whitney U test helps you figure out which group of foodies eats more without having to count every single slice or burger.

Why Mann-Whitney Test:

The Mann-Whitney U Test, also known as the Mann-Whitney-Wilcoxon test, is a non-parametric statistical test used to compare two samples or groups. It assesses whether two sampled groups are likely to derive from the same population.

  • The null hypothesis (H0) is that the two populations are equal
  • The alternative hypothesis (H1) is that the two populations are not equal

Some researchers interpret this as comparing the medians between the two populations (in contrast, parametric tests compare the means between two independent groups). In certain situations where the data are similarly shaped (see assumptions), this is valid — but it should be noted that the medians are not actually involved in the calculation of the Mann-Whitney U test statistic. Two groups could have the same median and be significantly different according to the Mann-Whitney U test.

When, Where, and How to Use the Mann-Whitney U Test

Non-parametric tests (sometimes referred to as ‘distribution-free tests’) are used when you assume the data in your populations of interest do not have a Normal distribution. You can think of the Mann-Whitney U test as analogous to the Student’s t-test, which you would use when assuming your two populations are normally distributed, as defined by their means and standard deviations (the parameters of the distributions).

The Mann-Whitney U Test is a common statistical test that is used in many fields, including economics, biological sciences, and epidemiology. It is particularly useful when you are assessing the difference between two independent groups with low numbers of individuals in each group (usually less than 30), which are not normally distributed, and where the data are continuous.

If you are interested in comparing more than two groups that have skewed data, a Kruskal-Wallis One-Way analysis of variance (ANOVA) should be used.

The U statistics for the two samples are calculated as:

U1 = n1n2 + (n1(n1 + 1))/2 − R1

U2 = n1n2 + (n2(n2 + 1))/2 − R2

Where:

  • n1 = sample size of sample 1
  • n2 = sample size of sample 2
  • R1 = sum of ranks of sample 1
  • R2 = sum of ranks of sample 2

After calculation, the smaller of U1 and U2 is selected as the U statistic.
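
As an illustration, here is a minimal sketch using scipy.stats.mannwhitneyu, which ranks the pooled observations and computes the U statistic and p-value for you; the "amount eaten" data are made up.

```python
# Minimal sketch: Mann-Whitney U test on two hypothetical groups of foodies.
from scipy.stats import mannwhitneyu

pizza_fans  = [3, 5, 2, 6, 4, 5, 3]  # made-up weekly servings eaten
burger_fans = [6, 8, 7, 9, 5, 7, 8]  # made-up weekly servings eaten

u_stat, p = mannwhitneyu(pizza_fans, burger_fans, alternative="two-sided")
print(f"U = {u_stat}, p = {p:.4f}")
```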

Mann-Whitney U Test Assumptions

Some key assumptions for Mann-Whitney U Test are detailed below:

  • The variable being compared between the two groups must be continuous (able to take any number in a range — for example, age, weight, height, or heart rate). This is because the test is based on ranking the observations in each group.
  • The data are assumed to take a non-Normal, or skewed distribution. If your data are normally distributed, the unpaired Student’s t-test should be used to compare the two groups instead.
  • While the data in both groups are not assumed to be Normal, the data are assumed to be similar in shape across the two groups.
  • The data should be two randomly selected independent samples, meaning the groups have no relationship to each other. If samples are paired (for example, two measurements from the same group of participants), then a paired samples t-test should be used instead.
  • Sufficient sample size is needed for a valid test, usually more than 5 observations in each group.

Kruskal-Wallis test

Explanation with fun

Once upon a time, there was a kingdom ruled by three kings who had a never-ending rivalry. They were always arguing about whose kingdom had the best knights. One day, they decided to settle the argument once and for all by having a knight’s tournament.

Each king selected their top five knights to compete, but the problem was that they couldn’t agree on the tournament format. The first king wanted to have a single-elimination tournament, where the winner of each match moves on to the next round. The second king wanted to have a round-robin tournament where each knight would fight against every other knight. The third king suggested a hybrid of the two, where there would be multiple rounds of single-elimination matches.

They couldn’t decide which format to use, so they turned to the kingdom’s statistician, Kruskal. Kruskal suggested using the Kruskal-Wallis test to compare the average rankings of each knight across the different tournament formats.

The kings were skeptical at first, thinking that the Kruskal-Wallis test was some kind of magic spell. But Kruskal explained that it’s just a statistical test that compares the ranks of the knights across the different groups.

In the end, they ran the Kruskal-Wallis test and found that there was no significant difference in the average rankings of the knights across the three tournament formats. The kings were surprised but pleased that they could finally settle the argument without any hurt feelings or injuries.

And so, the kingdom lived happily ever after, with the three kings still bickering but now armed with the knowledge of the Kruskal-Wallis test.

(partial credits to ChatGPT for improvising this story)

When, Why, and How to use the Kruskal-Wallis Test

The Kruskal-Wallis test is a non-parametric alternative to One-Way ANOVA, which is used to compare three or more groups. This test is used when the assumptions for ANOVA are not met, such as the assumption of normality. It is also known as the one-way ANOVA on ranks, as the ranks of the data values are used in the test instead of the actual data points.

The Kruskal-Wallis test determines whether the medians of two or more groups are different. To perform this test, you need to calculate the test statistic, called the H statistic, and compare it to a critical cut-off value (under the null hypothesis, H approximately follows a chi-square distribution).

The hypotheses for the test are:

  • H0: population medians are equal.
  • H1: population medians are not equal.

The H statistic is calculated as:

H = (12 / (n(n + 1))) ∑ (Tj² / nj) − 3(n + 1)

Where:

  • n = sum of sample sizes for all samples,
  • c = number of samples,
  • Tj = sum of ranks in the jth sample,
  • nj = size of the jth sample.

The Kruskal-Wallis test will tell you if there is a significant difference between groups. However, it won’t tell you which groups are different. To identify the different groups, a post hoc test needs to be conducted.
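
For illustration, here is a minimal sketch using scipy.stats.kruskal; the knight rankings under each tournament format are made up.

```python
# Minimal sketch: Kruskal-Wallis test on hypothetical knight rankings
# from three tournament formats.
from scipy.stats import kruskal

single_elim = [1, 4, 7, 10, 13]   # made-up rankings
round_robin = [2, 5, 8, 11, 14]   # made-up rankings
hybrid      = [3, 6, 9, 12, 15]   # made-up rankings

h_stat, p = kruskal(single_elim, round_robin, hybrid)
print(f"H = {h_stat:.3f}, p = {p:.4f}")

if p < 0.05:
    print("Reject H0: at least one format differs (run a post hoc test).")
else:
    print("Fail to reject H0: no significant difference between formats.")
```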

Where is this test used?

  • Medical research: The Kruskal-Wallis test can be used to compare the efficacy of different treatments in clinical trials.
  • Psychology: It can be used to compare the scores of different groups on psychological tests or surveys.
  • Sociology: The Kruskal-Wallis test can be used to compare the income levels of people from different regions.
  • Biology: It can be used to compare the growth rates of different plants or the survival rates of different animal species in different environments.

Fisher’s exact test

Explanation with fun

Imagine you’re a farmer who has two types of chickens — red chickens and blue chickens. You’ve been keeping track of the number of eggs each chicken lays in a week, and you’ve noticed that the red chickens tend to lay more eggs on average than the blue chickens. But you’re not sure if this difference is just due to chance or if it’s really significant.

So, you decide to conduct a study to see if there’s really a difference in egg-laying ability between the two types of chickens. You randomly select 10 red chickens and 10 blue chickens, and you count the number of eggs each chicken lays in a week. You end up with the following data:

Red chickens: 25, 24, 23, 26, 28, 27, 26, 22, 25, 24

Blue chickens: 19, 18, 20, 21, 22, 19, 20, 18, 20, 21

You calculate the average number of eggs laid by each group — the red chickens have an average of 25.0 eggs per week, while the blue chickens have an average of 19.8 eggs per week. It seems like there might be a real difference between the two groups, but how can you be sure?

Enter Fisher’s exact test. This test is used to determine if there is a significant difference between two groups for a categorical variable (in this case, the variable is “chicken color” and the categories are “red” and “blue”). The test calculates the probability of observing the data you collected, assuming that there is no real difference between the two groups. If the probability is very low (typically less than 5%), then you can reject the null hypothesis and conclude that there is a significant difference between the groups.

So, you run Fisher’s exact test on your chicken data and it gives you a p-value of 0.03. This means that there is only a 3% chance of observing the data you collected if there is really no difference between the red and blue chickens. Since this probability is so low, you can conclude that there is a significant difference between the two groups — the red chickens really do lay more eggs on average than the blue chickens.

And that, my friends, is how Fisher’s exact test can help you determine if your chickens are really laying eggs like champs or if it’s just a matter of chance!

Doesn’t this sound similar to the chi-square test? If you’re thinking that, it means you have been reading the stories with all your heart!

Don’t worry; we will cover how it differs from the chi-square test, along with when to use which kind of test, a little later in this blog, summarised in a table format.

What is Fisher’s exact test?

Fisher’s exact test determines whether a statistically significant association exists between two categorical variables.

For example, does a relationship exist between gender (Male/Female) and voting Yes or No on a referendum?

The hypotheses for the test are:

  • H0: There is no association between gender and voting. They are independent.
  • H1: A relationship between gender and voting exists in the population.

Typically, you’ll display data for Fisher’s exact test in a two-way contingency table. Frequently, this analysis assesses 2×2 contingency tables, but there are extensions for two-way tables with any number of rows and columns.

When to Use Fisher’s Exact Test vs. Chi-Square

When reading the description above, you might have thought that Fisher’s exact test sounds like the Chi-Square Test of Independence. And you’re right! They both serve the same purpose — assessing a relationship between categorical variables.

However, differences in the underlying methodology affect when you should use each method.

The Chi-Square Test of Independence is a more traditional hypothesis test that uses a test statistic (chi-square) and its sampling distribution to calculate the p-value. However, the chi-square sampling distribution only approximates the correct distribution, providing better p-values as the cell values in the table increase. Consequently, chi-square p-values are invalid when you have small cell counts.

On the other hand, Fisher’s exact test doesn’t use the chi-square statistic and sampling distribution. Instead, it calculates the number of all possible contingency tables with the same row and column totals (i.e., marginal distributions) as the observed table. Then it calculates the probability for the p-value by finding the proportion of possible tables that are more extreme than the observed table. Technically, Fisher’s exact test is appropriate for all sample sizes. However, the number of possible tables grows at an exponential rate and soon becomes unwieldy. Hence, statisticians use this test for smaller sample sizes.

p = ((a + b)! (c + d)! (a + c)! (b + d)!) / (a! b! c! d! N!)

Where:

  • a, b, c, and d are the individual cell frequencies of the 2×2 contingency table
  • N is the total frequency (a + b + c + d).
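
Here is a minimal sketch using scipy.stats.fisher_exact on a hypothetical 2×2 gender-vs-vote table; the counts are made up and deliberately small.

```python
# Minimal sketch: Fisher's exact test on a small hypothetical 2x2 table.
from scipy.stats import fisher_exact

# Rows: male, female; columns: voted yes, voted no -- made-up counts
table = [[8, 2],
         [1, 9]]

odds_ratio, p = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.2f}, p = {p:.4f}")
```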

Chi-square is generally best for larger samples and Fisher’s is better for smaller samples.

Here are the guidelines for when to use Fisher’s exact test:

  • Cell counts are smaller than 20
  • A cell has an expected value of 5 or less.
  • The column or row marginal values are extremely uneven

Shapiro-Wilk test

Explanation with fun

Imagine you’re at a party with your friends, and someone brings out a new game they made up called “The Normality Test”. The game involves everyone standing in a line and taking turns telling a joke. The catch is, each person’s joke has to be funnier than the previous one, or they’re out of the game.

Now, imagine that your friend Bob is really excited to play this game, but he’s not very good at telling jokes. In fact, his jokes are usually pretty terrible. So when it’s Bob’s turn, he tells a joke that falls completely flat. Everyone groans and shakes their head. But Bob insists that it was a good joke and that everyone is just being too critical.

This is where the Shapiro-Wilk test comes in. It’s like a statistical version of “The Normality Test”. It checks whether a set of data is “normal” or “abnormal” based on how closely it fits a theoretical normal distribution. If the data is too far from normal, it fails the test.

So, in our party game example, we could use the Shapiro-Wilk test to determine whether Bob’s sense of humor is “normal” or “abnormal”. If his joke is so bad that it doesn’t even get a chuckle, we might say that his humor is abnormal and that he failed the Shapiro-Wilk test.

Of course, this is just a silly example, but the Shapiro-Wilk test is actually a useful tool for checking whether data is normally distributed, which is an important assumption for many statistical analyses.

(partial credits to ChatGPT for improvising this story)

Crisp Explanation

The Shapiro-Wilk test is a way to tell if a random sample comes from a normal distribution. The test gives you a W value; small values indicate your sample is not normally distributed (you can reject the null hypothesis that your population is normally distributed if your values are under a certain threshold). The formula for the W value is:

W = (∑ ai x(i))² / ∑ (xi − x̄)²

where:

  • x(i) are the ordered sample values (x(1) being the smallest)
  • ai are constants generated from the means, variances, and covariances of the order statistics of a sample of size n drawn from a normal distribution.

The test has limitations, most importantly that it is biased by sample size: the larger the sample, the more likely you are to get a statistically significant result.
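
A minimal sketch using scipy.stats.shapiro, run here on synthetic data drawn from a normal distribution:

```python
# Minimal sketch: Shapiro-Wilk normality test on synthetic data.
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(42)
sample = rng.normal(loc=0.0, scale=1.0, size=50)  # synthetic normal data

w_stat, p = shapiro(sample)
print(f"W = {w_stat:.4f}, p = {p:.4f}")

if p < 0.05:
    print("Reject H0: the sample does not look normally distributed.")
else:
    print("Fail to reject H0: no evidence against normality.")
```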

Quick summary of the technical details that we covered in the previous sections:

| Test | Data type | Purpose | When to use |
| --- | --- | --- | --- |
| Chi-square* | Categorical (nominal) | Test the association between two categorical variables | Larger samples with adequate expected cell counts |
| Mann-Whitney U | Continuous, non-normal | Compare two independent groups | Small (usually <30 per group), skewed, independent samples |
| Kruskal-Wallis | Continuous/ordinal, non-normal | Compare three or more groups | When ANOVA’s normality assumption is not met |
| Fisher’s exact | Categorical (2×2 tables) | Test the association between two categorical variables | Small samples or expected cell counts of 5 or less |
| Shapiro-Wilk | Continuous | Test whether a sample comes from a normal distribution | Before deciding between parametric and non-parametric tests |

* indicates some special cases. For example, in the chi-square test for independence, we may assume that the data follow a bivariate distribution and thus call the test parametric because of the underlying distribution of the data being analyzed, but chi-square is generally considered a non-parametric test.

So that’s it from our side folks. In conclusion, statistical tests play a crucial role in analyzing data and making inferences from it. The choice of the appropriate test depends on the type of data, research question, and assumptions made about the data. In this blog, we have discussed the chi-square test, Mann-Whitney U test, Kruskal-Wallis test, Fisher’s exact test, and Shapiro-Wilk test. Each test has its own limitations, assumptions, and applications. Understanding these tests can help researchers to choose the appropriate test for their research question and analyze their data correctly. With personal experiences and a non-technical approach, this blog can be a helpful guide for both beginners and experts in the field of statistics.

Just before I go, how do we decide on the test if we are in two minds? 😂

“And if all else fails, you can always resort to the ultimate statistical tool: flipping a coin and crossing your fingers. Just kidding, please don’t do that. Stick to these tested and proven statistical tests for accurate and reliable results!”

Hitesh Hinduja & Durga Sai Rongala

Hitesh Hinduja | LinkedIn

Durga Sai Rongala | LinkedIn
