
Analyzing Ordinal Data in SAS using the Binary, Binomial, and Beta Distribution.

Last Updated on March 24, 2022 by Editorial Team

Author(s): Dr. Marc Jacobs


This post will build on previous posts: an introductory post on PROC GLIMMIX and a post showing how to analyze ordinal data using the ordinal and multinomial distributions. It will extend those posts by analyzing the same dataset, diarrhea scores measured in pigs across time. Here, diarrhea is measured subjectively using an ordinal scoring system.

So, let's move to the Binomial distribution and its continuous counterpart, the Beta distribution. The Binary and Binomial distributions both deal with discrete outcomes and transform them into probabilities/proportions. Below you can see an example coming from an ordinal model using the cumulative probability distribution of the cumlogit link.

Probability plot showing the cumulative probability.
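For orientation, a cumulative logit model like the one behind this plot can be fitted in PROC GLIMMIX roughly as follows. This is a minimal sketch, not the author's exact code; the dataset and variable names (Feaces, TT for treatment, Block, Score) are my assumptions.

proc glimmix data=feaces;
  class tt block;
  /* multinomial distribution with the cumulative logit link
     turns the ordinal scores into cumulative probabilities */
  model score = tt / dist=multinomial link=cumlogit solution;
  random block;
run;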

However, sometimes it is just not possible to estimate the effect of a treatment in an ordinal manner. Here, scores 2 and 3 add less than 15% to the total scale. Hence, it would perhaps be wise to combine them and compare aggregated scores, for instance by collapsing four groups into two (0 & 1 vs 2 & 3). If you do so, you also need to specify what exactly counts as diarrhea, because you can do 0 & 1 vs 2 & 3, or you can do 0 vs 1 & 2 & 3. Statistics won't help you make that decision; it has to come from content knowledge.

Now, if you want to analyze a binary division, you need to determine whether you want to analyze it as a proportion or as a rate (see the sketch after this list):

  1. Proportion = ratio of the same two metrics → diarrhea / total feces
  2. Rate = ratio of two distinct metrics → diarrhea / total days measured
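To make the distinction concrete, here is a minimal sketch of the dichotomization and of both ratios. It is not the author's code; the dataset name Feaces and the variables TT, Pen, Day, and Score are assumptions.

data feaces_binary;
  set feaces;                               /* hypothetical input with ordinal score 0-3 */
  if score in (0, 1) then diarrhea = 0;     /* the 0 & 1 vs 2 & 3 split */
  else if score in (2, 3) then diarrhea = 1;
run;

proc sql;
  create table pen_summary as
  select pen, tt,
         sum(diarrhea) / count(*)            as proportion,  /* same metric in both parts */
         sum(diarrhea) / count(distinct day) as rate         /* two distinct metrics */
  from feaces_binary
  group by pen, tt;
quit;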

In terms of data management, the data needs to be made appropriate for analysis by a binary or binomial distribution. Since the Binary / Binomial can deal with the time component (unlike the Ordinal or Multinomial distribution), we want to create a dataset that can accommodate this kind of analysis. Below you see the final dataset in which we have, per pen, the treatment, block, day, and faecal score. There is no longer a frequency metric included.

The dataset Feaces_2 is the one I need for using the Binary distribution and modeling the faecal score across time per treatment.
And the distribution of diarrhea across time. Remember, now that we have a dichotomous split, it is much easier to look at the data, but at the cost of information.
The code to produce the plot above.
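That plotting code is shown as an image in the original; a rough equivalent using PROC SGPANEL, under the same naming assumptions as before, might look like this:

proc sgpanel data=feaces_2;
  panelby tt;                               /* one panel per treatment */
  vbar day / response=diarrhea stat=mean;   /* mean of a 0/1 variable = share with diarrhea */
run;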

Now, let's move on to the actual modeling. As I said, I will use the binary distribution and the logit link. That is the same link as I used for the Ordinal / Multinomial model. It also means that comparisons will be done using the Odds Ratio.

GLIMMIX code for the Binary distribution. Notice that I have to specify an event score; here, the presence of diarrhea.
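The call itself is an image in the original; its gist, with names assumed as before, is probably close to this sketch:

proc glimmix data=feaces_2;
  class tt block day pen;
  /* event='1' makes GLIMMIX model the probability of diarrhea being present */
  model diarrhea(event='1') = tt day tt*day / dist=binary link=logit solution oddsratio;
  random block;
run;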

The code for the Binary distribution did not run, which is not strange, since it often does not. This is because of the way the model needs to assess the variance in the data: by looking between rows. If there is not enough variance, or if there is not enough data, the model just won't converge. No matter what.

Common issues: the model doesn't converge with too many pens / days, and it doesn't converge when the ratio of 0s to 1s is too lopsided (e.g., 0 = 534; 1 = 45).
And when something does not converge, you get nothing, which most often shows up as an inability to estimate the variance. And without variance estimates, you have nothing.
Let's try again by simplifying the model, deleting observations, and decreasing the acceptance threshold.
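In GLIMMIX terms, that usually means dropping terms, subsetting the data, and loosening the convergence criteria. A hypothetical version of such a rescue attempt (the PCONV= value and the subset dataset feaces_sub are my assumptions, not the author's code):

proc glimmix data=feaces_sub pconv=1e-4;    /* looser pseudo-likelihood convergence criterion */
  class tt block day;
  model diarrhea(event='1') = tt day / dist=binary link=logit solution;  /* interaction dropped */
  random block;
  nloptions maxiter=500;                    /* give the optimizer more iterations */
run;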
Then it runs, but not exactly the results you would like to have.

So, let's try the Binary distribution on a different dataset. Most of the time, there is just not much you can do from a model perspective if the data does not hold the granularity needed.

This should be a more fruitful dataset.
Let's run it again. As you can see, I included a covariance model on the error part, using a first-order autoregressive structure to deal with correlated errors (see the sketch below).
If the residuals look weird to you, relax; they do not. Using a Binary distribution, you cannot expect homogeneous or normal errors. What you should expect are errors that follow the Binary distribution, meaning that they show a split and are less present at the tails than in the middle of the distribution.
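The AR(1) specification mentioned above sits on the R-side of the model, roughly like this (the dataset name feaces_3 and the subject structure are assumptions):

proc glimmix data=feaces_3;
  class tt block day pen;
  model diarrhea(event='1') = tt day tt*day / dist=binary link=logit solution oddsratio;
  random block;
  /* AR(1) residual covariance: errors on adjacent days correlate most strongly */
  random day / subject=pen(tt) residual type=ar(1);
run;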

Now, let's venture from the Binary distribution to the Binomial distribution. They are very much the same, except that in a Binary N = 1, whereas for the Binomial N = N: you conduct multiple independent trials from which to assess the probability. The Binary distribution is often referred to as the Bernoulli distribution.

Bernoulli trial using p=0.5. Since it is a single trial, and p=0.5, the outcomes 0 or 1 are equally likely.
Binomial distribution using the same p as in the Bernoulli example. This time, we run 50 independent trials. We get a nice distribution that looks like a Normal distribution.
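Both panels can be reproduced with a small simulation; this sketch draws from a Bernoulli(0.5) and from a Binomial(n = 50, p = 0.5) and plots the latter:

data sim;
  call streaminit(2022);                  /* reproducible random draws */
  do i = 1 to 10000;
    bern  = rand('BERNOULLI', 0.5);       /* single trial: 0 or 1, equally likely */
    binom = rand('BINOMIAL', 0.5, 50);    /* number of successes in 50 trials */
    output;
  end;
run;

proc sgplot data=sim;
  histogram binom;                        /* roughly bell-shaped around 25 */
run;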

To go from using the Binary (or Bernoulli) distribution to using the Binomial distribution, we need to change the dataset to accommodate the Y/N necessity: the number of wins given the number of games. Of course, the Y/N is already a proportion and thus a probability distribution by itself.

Below you can see the transformation from the dataset used for the Binary distribution to the dataset used for the Binomial distribution. I am still trying to model across time, but this time I had to aggregate the data at the week level. This will make the model more stable.

From Binary to Binomial modeling.
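The transformation is shown as an image in the original; a plausible reconstruction, aggregating the daily binary records into weekly counts per pen (the week derivation and all names are assumptions):

data feaces_week;
  set feaces_2;
  week = ceil(day / 7);        /* collapse days into weeks */
run;

proc sql;
  create table feaces_binom as
  select tt, block, pen, week,
         sum(diarrhea) as y,   /* days with diarrhea in that week */
         count(*)      as n    /* days observed in that week */
  from feaces_week
  group by tt, block, pen, week;
quit;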
And a plot showing the dataset to model from. Since we are dealing with proportions, there are boundaries (0 and 1) and those boundaries can make modeling this kind of data quite challenging.
PROC GLIMMIX code. In orange is what I had to change: the specification of the outcome and the distribution. I also included a random block and an unstructured covariance matrix for the error part of the model. Hence, I am going full out here.
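Since the orange-highlighted code is an image, here is a sketch of what such an events/trials call could look like, using the weekly counts Y and N assumed above:

proc glimmix data=feaces_binom;
  class tt block week pen;
  /* events/trials syntax: y successes out of n trials per pen-week */
  model y/n = tt week tt*week / dist=binomial link=logit solution oddsratio;
  random block;                                     /* random block effect */
  random week / subject=pen(tt) residual type=un;   /* unstructured error covariance */
run;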
Residuals are not really handy to look at: remember, this is NOT a Normal distribution, so the assumption of normally distributed errors no longer applies. The same goes for homogeneity, since we now have natural boundaries. Hence, it is better to look at the estimates coming from the model.

In conclusion, there is not enough variation within this dataset to get a proper model, and I detected a lot of boundary values. In addition, the animals were challenged with sub-optimal feed, which means that the challenge was not strong enough to get valuable diarrhea results. In other words, the dataset does not contain the level of granularity that I need.

So let's go and try a different dataset with the same type of modeling. Below you can see the results. Once again, don't focus too much on the residuals. Even if they look very 'Normal' now, you should not expect them to be. We are not modeling data using a Normal distribution.

Looking much better: plenty of variance to model. The left and right plots (not the middle one) show the difference between using day as a dummy variable and actually modeling day as a time variable. I always found the right plot more informative, although some might argue it is not what the trial was designed to do. I can agree with that, but then the trial can only be used to discuss three time points.

Now, how would it look if I did not model the old dataset per week, but just modeled it overall: the proportion of diarrhea within 42 weeks? To do so, I need to transform the data until I end up with the one on the right.

From week to overall binary modeling.
The code to obtain the results.
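A sketch of that overall model: aggregate over the whole period, then model the overall proportion, with ILINK putting the least squares means on the probability scale (names assumed as before):

proc sql;
  create table feaces_overall as
  select tt, block, pen,
         sum(diarrhea) as y, count(*) as n
  from feaces_2
  group by tt, block, pen;
quit;

proc glimmix data=feaces_overall;
  class tt block;
  model y/n = tt / dist=binomial link=logit solution;
  random block;
  lsmeans tt / ilink diff cl;   /* ILINK reports the estimated probability as 'Mean' */
run;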
And the results. The Mean in the TT Least Squares Means table is the estimated probability.

So, analyzing diarrhea scores via a binary/binomial distribution requires you to decide which scores constitute diarrhea and which do not: it has to be binary. Binary data is yes / no data in its rawest form and is the most difficult to analyze. Binomial data is data in the form of a numerator/denominator and often gives you a more stable model. Analyzing the data in a binary/binomial way requires the transformation of the dataset.

Let's see how far we can go using its continuous counterpart: the Beta distribution.

Modeling diarrhea using the Beta distribution means that you are venturing into the world of continuous proportions. Below, you can see an example of the Beta distribution and its two parameters, alpha and beta. As you can immediately see, alpha and beta can be specified separately, but they are still entwined. This becomes clear when you look at the formulas for the mean and the variance (reproduced below).

The Beta distribution.
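For reference, for a Beta(alpha, beta) distribution those formulas are:

mean = alpha / (alpha + beta)
variance = (alpha * beta) / ((alpha + beta)^2 * (alpha + beta + 1))

Both moments involve alpha and beta jointly, which is exactly why the two parameters are entwined.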
And an introductory comparison plot showing Linear Regression, Logit Regression, and Beta Regression. The nice thing about the Beta is that it can model proportions in a continuous way, meaning that you can almost approach it as you would a Normal distribution, but for proportions.
Data: from Binomial to Beta. This means that we have to actually include the proportion in the data frame, which is done by dividing Y by N. An important note is that in a Beta the values 0 and 1 are NOT allowed. Hence, if you obtain a 0 or a 1, you must add or subtract a very small number, like 0.00001. This way, you trick the algorithm without actually changing anything.
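A data-step sketch of that transformation, building on the weekly counts assumed earlier:

data feaces_beta;
  set feaces_binom;
  prop = y / n;                           /* continuous proportion */
  if prop = 0 then prop = 0.00001;        /* Beta excludes exact 0 ... */
  else if prop = 1 then prop = 0.99999;   /* ... and exact 1 */
run;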
Plotting the proportions across days and treatments.
The same as above, but using different plots. These plots can help tremendously in forecasting convergence success and, if the model does not converge, dictate to some extent where the bottleneck might be.
To get some feeling for the data, you can ask for the raw means. The means represent the % of pens with diarrhea scores 2 and 3.
The actual code to model the data.
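That code is an image in the original; the core of a Beta model in GLIMMIX, under the same naming assumptions, would look roughly like:

proc glimmix data=feaces_beta;
  class tt block week pen;
  /* dist=beta with the logit link keeps predictions inside (0, 1) */
  model prop = tt week tt*week / dist=beta link=logit solution;
  random block;
  lsmeans tt*week / ilink cl;   /* estimated proportions per treatment and week */
run;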
Residuals do not look good. Since we are modeling using a Beta, I would expect the residuals to be much more like the Normal. Of course, the absence of proper residuals is seldom reflected in the LSMEANS.
Modeling time using a natural cubic spline instead of dummy variables.
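In GLIMMIX, such a spline can be built with an EFFECT statement; a sketch under the same assumptions, with time treated as continuous and knots placed at percentiles:

proc glimmix data=feaces_beta;
  class tt block pen;
  /* natural cubic spline basis for week; NATURALCUBIC requires the TPF basis */
  effect spl_week = spline(week / naturalcubic basis=tpf(noint) knotmethod=percentiles(5));
  model prop = tt spl_week tt*spl_week / dist=beta link=logit solution;
  random block;
run;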
Residuals looking better and a very nice model plot to the right. Please note the very large confidence intervals.

Let's again use a different dataset to see if we have more success in using the Beta this time.

Plots of the data.
Residuals looking very good, and so do the estimates coming from dummy variables and splines. Nevertheless, the large amount of variation shows that there really are no differences between treatments, and there is too much noise to even model a slope.

In summary, the Beta Distribution models proportions, just like the binary and the binomial distribution. Compared to the binary/binomial distribution, the beta distribution models proportions on a continuous spectrum. To use the beta distribution, you need to have proportions in the dataset, and no proportion can be 0 or 1. Compared to the other distributions, the beta distribution is the easiest to model and the easiest to understand.

Hope you enjoyed it!

