Fishing: The Bayesian Way of Analyzing Zero-inflated Data

Last Updated on January 7, 2023 by Editorial Team

Last Updated on June 17, 2022 by Editorial Team

Author(s): Dr. Marc Jacobs

Originally published on Towards AI the World’s Leading AI and Technology News and Media Company. If you are building an AI-related product or service, we invite you to consider becoming an AI sponsor. At Towards AI, we help scale AI and technology startups. Let us help you unleash your technology to the masses.

In past posts, I have shown several ways to apply Bayesian analysis for mostly normally distributed data. In this example, I want to use a discrete response, and what better distribution to start with than an inflated one. You read correctly — this post will be about modeling, in a Bayesian way, a zero-inflated response.

A zero-inflated response distribution is one in which there are more zero’s than you would normally like, but they are there. Hence, you need to deal with them and there are various ways to cope. The most straightforward is the application of a zero-inflated model or applying a hurdle model. They work a bit differently, but what they do have in common is that you split your modeling into a part where you model the zero’s, and a part in which you model the rest. This will be done via two separate models (the hurdle) or via a mixture distribution (the zero-inflated model).

Once again, the codes are at the bottom to not contaminate the story. Also, the codes, in the end, show more information than what I have here. The trick is to run everything I did and augment it. Not just follow my steps.

Alright, so let's go and take a look at the dataset, which can be found here. As you can see, we have a lot of discrete data, both binary as well as ordinal or multinomial. Discrete data is fun to analyze albeit a bit challenging, but should not be a big issue for any analyst. Just think proportions | probabilities | and rates and you will be fine.

The dataset as pulled from the website. Has some nice data. Lets explore!

Next up is the drawing part in which I want to see what I have. The dataset has information on fish, of course, and the metric that matters the most is the count. We are working with count data here.

The relationship between several variables and the number of fish caught.

The zero-inflated nature of the data becomes immediately apparent (although you could already see it coming in the dotplot) when doing a distribution plot. The spike at the beginning will give you problems if you try and analyze via the standard Poisson, or even when using a Gamma-Poisson (Negative Binomial).

Big peak at the beginning is the tell-tale of a zero-inflated distribution which needs to be addressed.

Same here. Trying to split the data could give you a glimpse about factors that influence the zero-inflation, so you directly model it (for instance by using a hurdle-model) but it does not seem that one singular factor can be found.

Of to the modeling part. Of course, I am going Bayesian which means I will have to deal with the three musketeers of Bayesian analysis: prior, likelihood, and posterior. Since I know that the response resembles a zero-inflated distribution, I can ask the brms package to show me the priors the model would automatically assume. Thereby seeing how it wants me to feed the hyperparameters should I decide to bring my own priors (of course I will!).

Priors from the model based on the data and the priors I used. As you can see, the model does not give informative priors, and I do, because I made sure that the variation in the prior distribution is small enough for them to pick a direction. Hence, they are informative. How I chose them is for now irrelevant, since I encountered this dataset, and I did not do any prior investigation. In a full-fledged Bayesian analysis, the research on finding informative priors is of course KEY.

And the results coming from the model. Just check the differences between the priors and the posteriors. That will give you an idea about the influence of the dataset, the likelihood!

No problems here. I am looking at decent sampling space, and that is what I have found. There is relationship between the mean and the sd of the sampling space but that is not a problem. Also, that the mean & sd of the observed response is not in line with the samples drawn from the posterior is no real problem. Remember: Posterior = Prior * Likelihood. Or, New Knowledge = Prior Knowledge * Latest Data. Hence, it could very well be that the latest dataset does not connect to the prior knowledge held. Long live science!

When it comes to Bayesian Analysis you do want to see that your model is able to approximate the underlying distribution via which the data came to exist (that sounds a bit paradoxical — distribution is a man-made construct). Hence, what you want to see when checking the appropriateness of the Bayesian model is that the chains converged nicely and that the distribution in the model used was appropriately sufficient. We can do that by looking at the PIT which transforms the data into a uniform distribution. If the QQ-plot shows that the samples are on the diagonal line, the underlying distribution (which is not uniform here) is probably what thought it would be, which is a zero-inflated distribution. The lines are not perfect here, but I will give it a pass. Not the easiest data to the model.

The Zero-Inflated Poisson (left) is a way better fit then the Normal Distribution.

So, from the above, it seems the model is quite able to model the data, and hence we can use the model to take a better look at the underlying relationships which you find by extracting a conditional distributions plot. What you see below is the relationship between the variables persons, child, and camper with the count.

Conditional probability plot, here in counts.

I hope this example was sufficient to get you underway with modeling a Zero-Inflated dataset. Go through the codes down below and please let me know if something is amiss!

Enjoy!

Fishing: The Bayesian Way of Analyzing Zero-inflated Data was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

Join thousands of data leaders on the AI newsletter. It’s free, we don’t spam, and we never share your email address. Keep up to date with the latest work in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor.

Published via Towards AI

Frequently Used, Contextual References

Resources

Publication

Fishing: The Bayesian Way of Analyzing Zero-inflated Data

Author(s): Dr. Marc Jacobs

Towards AI Team

Feedback ↓ Cancel reply

Popular posts

Best Laptops for Deep Learning, Machine Learning (ML), and Data Science for 2023

Best Workstations for Deep Learning, Data Science, and Machine Learning (ML) for 2022

Descriptive Statistics for Data-driven Decision Making with Python

Best Machine Learning (ML) Books - Free and Paid - Editorial Recommendations for 2022

Best Data Science Books - Free and Paid - Editorial Recommendations for 2022

Updates

Recent Posts

The Fundamental Mathematics of Machine Learning

Built-In AI Web APIs Will Enable A New Generation Of AI Startups

Auditing Predictive A.I. Models for Bias and Fairness

Why is Llama 3.1 Such a Big deal?

5 AI Real-World Projects To Set Foot in The Door

The World’s Leading AI and Technology Publication.

Company

CONTACT US

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Frequently Used, Contextual References

Resources

Publication

Fishing: The Bayesian Way of Analyzing Zero-inflated Data

Author(s): Dr. Marc Jacobs

Towards AI Team

Related posts

Feedback ↓ Cancel reply

Popular posts

Updates

Recent Posts

The World’s Leading AI and Technology Publication.

Company

CONTACT US

GDPR CCPA Statement