Master LLMs with our FREE course in collaboration with Activeloop & Intel Disruptor Initiative. Join now!

Publication

Fishing: The Bayesian Way of Analyzing Zero-inflated Data
Latest

Fishing: The Bayesian Way of Analyzing Zero-inflated Data

Last Updated on January 7, 2023 by Editorial Team

Last Updated on June 17, 2022 by Editorial Team

Author(s): Dr. Marc Jacobs

Originally published on Towards AI the World’s Leading AI and Technology News and Media Company. If you are building an AI-related product or service, we invite you to consider becoming an AI sponsor. At Towards AI, we help scale AI and technology startups. Let us help you unleash your technology to the masses.

In past posts, I have shown several ways to apply Bayesian analysis for mostly normally distributed data. In this example, I want to use a discrete response, and what better distribution to start with than an inflated one. You read correctly — this post will be about modeling, in a Bayesian way, a zero-inflated response.

A zero-inflated response distribution is one in which there are more zero’s than you would normally like, but they are there. Hence, you need to deal with them and there are various ways to cope. The most straightforward is the application of a zero-inflated model or applying a hurdle model. They work a bit differently, but what they do have in common is that you split your modeling into a part where you model the zero’s, and a part in which you model the rest. This will be done via two separate models (the hurdle) or via a mixture distribution (the zero-inflated model).

Once again, the codes are at the bottom to not contaminate the story. Also, the codes, in the end, show more information than what I have here. The trick is to run everything I did and augment it. Not just follow my steps.

Alright, so let's go and take a look at the dataset, which can be found here. As you can see, we have a lot of discrete data, both binary as well as ordinal or multinomial. Discrete data is fun to analyze albeit a bit challenging, but should not be a big issue for any analyst. Just think proportions | probabilities | and rates and you will be fine.

The dataset as pulled from the website. Has some nice data. Lets explore!

Next up is the drawing part in which I want to see what I have. The dataset has information on fish, of course, and the metric that matters the most is the count. We are working with count data here.

The relationship between several variables and the number of fish caught.

The zero-inflated nature of the data becomes immediately apparent (although you could already see it coming in the dotplot) when doing a distribution plot. The spike at the beginning will give you problems if you try and analyze via the standard Poisson, or even when using a Gamma-Poisson (Negative Binomial).

Big peak at the beginning is the tell-tale of a zero-inflated distribution which needs to be addressed.
Same here. Trying to split the data could give you a glimpse about factors that influence the zero-inflation, so you directly model it (for instance by using a hurdle-model) but it does not seem that one singular factor can be found.

Of to the modeling part. Of course, I am going Bayesian which means I will have to deal with the three musketeers of Bayesian analysis: prior, likelihood, and posterior. Since I know that the response resembles a zero-inflated distribution, I can ask the brms package to show me the priors the model would automatically assume. Thereby seeing how it wants me to feed the hyperparameters should I decide to bring my own priors (of course I will!).

Priors from the model based on the data and the priors I used. As you can see, the model does not give informative priors, and I do, because I made sure that the variation in the prior distribution is small enough for them to pick a direction. Hence, they are informative. How I chose them is for now irrelevant, since I encountered this dataset, and I did not do any prior investigation. In a full-fledged Bayesian analysis, the research on finding informative priors is of course KEY.
And the results coming from the model. Just check the differences between the priors and the posteriors. That will give you an idea about the influence of the dataset, the likelihood!
Chains converged very nicely.
No problems here. I am looking at decent sampling space, and that is what I have found. There is relationship between the mean and the sd of the sampling space but that is not a problem. Also, that the mean & sd of the observed response is not in line with the samples drawn from the posterior is no real problem. Remember: Posterior = Prior * Likelihood. Or, New Knowledge = Prior Knowledge * Latest Data. Hence, it could very well be that the latest dataset does not connect to the prior knowledge held. Long live science!

When it comes to Bayesian Analysis you do want to see that your model is able to approximate the underlying distribution via which the data came to exist (that sounds a bit paradoxical — distribution is a man-made construct). Hence, what you want to see when checking the appropriateness of the Bayesian model is that the chains converged nicely and that the distribution in the model used was appropriately sufficient. We can do that by looking at the PIT which transforms the data into a uniform distribution. If the QQ-plot shows that the samples are on the diagonal line, the underlying distribution (which is not uniform here) is probably what thought it would be, which is a zero-inflated distribution. The lines are not perfect here, but I will give it a pass. Not the easiest data to the model.

The Zero-Inflated Poisson (left) is a way better fit then the Normal Distribution.
Looking good as well.
Looking good again.
Nothing to complain.

So, from the above, it seems the model is quite able to model the data, and hence we can use the model to take a better look at the underlying relationships which you find by extracting a conditional distributions plot. What you see below is the relationship between the variables persons, child, and camper with the count.

Conditional probability plot, here in counts.

I hope this example was sufficient to get you underway with modeling a Zero-Inflated dataset. Go through the codes down below and please let me know if something is amiss!

Enjoy!


Fishing: The Bayesian Way of Analyzing Zero-inflated Data was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

Join thousands of data leaders on the AI newsletter. It’s free, we don’t spam, and we never share your email address. Keep up to date with the latest work in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor.

Published via Towards AI

Feedback ↓