Using PROC GLIMMIX in SAS — examples

Last Updated on March 24, 2022 by Editorial Team

Author(s): Dr. Marc Jacobs


Analyzing non-normal data in SAS — log data, mortality, litter, and preference scores.

This post is the last in a series showing the potential of PROC GLIMMIX, which is the de facto SAS procedure for fitting Generalized Linear Mixed Models. After an introductory post and examples using the Multinomial, Binary, Binomial, Beta, Poisson, and Negative Binomial distributions, it is now time to go deeper into situations often encountered when analyzing data in the animal sciences. To be more precise, I will show examples of bacterial counts, litter scores, mortality scores, and preference scores. Enjoy!

First off, bacterial counts and how to analyze data on the log scale. Data that have such a wide distribution that they need to be analyzed on a log scale can be surprisingly difficult to handle inside a GLIMMIX model, especially when you use the log function. I will show you later what I mean.

Below you see entero data at two time points — five and fifteen — across the blocks of a Randomized Complete Block Design.

Entero data on the raw scale (left) and the log scale (right).

Below you can see the datasets used. This was a nested design having data across departments, blocks, and pens.

Values for e.coli, entero, clostridium, and lactobacillus all have a very wide range, which indicates the need for a log scale. How to apply that log scale is up to you (data vs model).
Slide output that shows the choices made on modeling the four outcomes, and their rates. Especially those rates can be quite difficult to manage.
Bacterial counts — Lacto/Clos. This rate can be analyzed using PROC MIXED and very nicely follows a Normal distribution. Even with two additional random components (a variance component and an unstructured covariance), the model runs very smoothly. Results can be trusted.
Estimates of the rate per treatment and day
Comparisons across days.
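For readers who want to set up a similar model, a minimal PROC MIXED sketch is given below. The dataset and variable names (bacteria, lacto_clos, treatment, day, block, department, pen) are assumptions for illustration, not the original code, and the random and repeated choices mirror the two extra components mentioned above.

proc mixed data=bacteria;
  class treatment day block department pen;
  model lacto_clos = treatment day treatment*day / ddfm=kr;
  random block;                                          /* variance component for the RCBD blocks */
  repeated day / subject=pen(department*block) type=un;  /* unstructured covariance across days */
  lsmeans treatment*day / diff cl;
run;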

The above looked quite straightforward. A different song will be sung when you analyze the E.coli/Entero rate. Two straightforward plots down below easily indicate that you are dealing with a rate that is moving at the boundary of space.

And the Beta distribution does not know how to handle it. At the very least, you expect the Beta to move more towards a Normal distribution, despite its natural limits of 0 and 1.
The negative binomial does not do a better job than the Beta.
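For reference, the kind of Beta specification that struggles here looks roughly like the sketch below; the dataset and variable names are assumptions, with log_rate standing in for log(E.coli)/log(Entero).

proc glimmix data=bacteria;
  class treatment day block;
  model log_rate = treatment day treatment*day / dist=beta link=logit;  /* rate built from log-scaled counts */
  random block;
  lsmeans treatment*day / diff ilink;
run;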

The problem in this example is not the distribution, but the data. The way the data is set up produces a strange rate that cannot be mimicked by the majority of the known distributions. Of course, a mixture distribution could be used, but what about changing the rate?

So, that is what I did. I changed the rate by using the raw data, and not the log scales. If you plot this rate, it looks much more Normal. As such, the Negative Binomial will have no problem modeling it.

Setting up the rate in a different way — by using the raw scores instead of the log-transformed scores.
The negative binomial can now model the data much better, as can be seen from the covariance estimates. This is because I am applying the log transformation on the raw rate INSIDE the model. The absence of a standard error, however, is still an issue. For now, we leave it.
Comparing MEANS to LSMEANS. Seems acceptable to me.
Comparing Treatments — log-transformed scores

In summary, I showed the analysis of four different variables, on the log scale, which we then combined into two new variables:

  1. a rate of E.Coli and Enterobacteriaceae
  2. a ratio of Lactobacillus and Clostridium

The ratio of Lactobacillus/Clostridium can be analyzed like the FCR in PROC MIXED. The rate of log(E.Coli) / log(Enterobacteriaceae) presents problems that are more difficult to overcome — the distribution is difficult to model. It is best to use E.Coli / Enterobacteriaceae as a raw variable and apply the log transformation in the model using the negative binomial distribution.
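A minimal sketch of that recommendation, assuming hypothetical variable names (ecoli and entero for the raw counts), could look like this:

data bacteria2;
  set bacteria;
  ecoli_entero = ecoli / entero;   /* raw rate, not log(ecoli)/log(entero) */
run;

proc glimmix data=bacteria2;
  class treatment day block;
  model ecoli_entero = treatment day treatment*day / dist=negbin link=log;  /* log transformation happens inside the model */
  random block;
  lsmeans treatment*day / diff ilink cl;
run;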

Let’s proceed to the analysis of live birth and stillbirth. In a previous post, I already showed you how to use the Beta-Binomial to simulate a sample size for these outcomes. Here, I will show you the usefulness of the Negative Binomial distribution.

The dataset looks like this, in which we have columns for total born, live birth, stillbirth, and mummified. We have this data per sow and per cycle.

TG = total born, LG = live birth, DG = stillbirth, MM = mummified
And the distributions for each of them. Stillbirth clearly looks like a Poisson, whereas Total and Live Birth look like a Normal Distribution.
Distribution of Total Birth. The Normal Distribution looks like a good fit, but the data is discrete.
Distribution of Live Birth. The Normal Distribution looks like a good fit, but the data is discrete.

Let's analyze the data using PROC MIXED and PROC GLIMMIX.
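As a rough sketch, with assumed dataset and variable names (litter, lb, treatment, sow, cycle), the two competing specifications could look like this:

proc mixed data=litter;                      /* Normal distribution on a discrete count */
  class treatment sow cycle;
  model lb = treatment cycle / ddfm=kr;
  repeated cycle / subject=sow type=cs;
  lsmeans treatment / diff cl;
run;

proc glimmix data=litter;                    /* Negative Binomial keeps the outcome discrete */
  class treatment sow cycle;
  model lb = treatment cycle / dist=negbin link=log;
  random cycle / subject=sow type=cs residual;
  lsmeans treatment / diff ilink cl;
run;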

Using the normal distribution for Total / Live Birth. Looks like a great fit, but it is also very strange to apply a continuous distribution to discrete data.
What if a treatment is included? Now, it looks much better, but do not let the plots fool you. It is still discrete data, and applying a continuous distribution to discrete data often means that the model will underestimate the standard error.
How can a treatment (3) produce on average 15 piglets and 1/3rd of a piglet?

The problem with using a continuous distribution on discrete data is that the results will be presented on a continuum, which results in estimating parts of a piglet.

Born alive, using the Normal Distribution.
Normal vs. Negative Binomial Distribution for Live Birth. You cannot compare the AIC scores since the link functions are not the same. However, I would choose the Negative Binomial any day over the Normal Distribution when using discrete data. Unless, perhaps, you include a sandwich estimator.

Below, you see the estimates provided via the Normal distribution and the Negative Binomial distribution. Of course, the Normal makes no sense here.

Normal vs. Negative Binomial Distribution for LB. For the Negative Binomial, the results will be predicted proportions, which you have to multiply by the total to get the predicted integer.
There is no data scale for the differences. You need to revert to the Odds Ratios.

So, analyzing total, still, and live birth in a correct manner is not easy, although it seems very easy. Total birth and live birth seem to follow a normal distribution, but cannot be analyzed as such — how do you interpret a treatment gain of 1/5th of a piglet?

Hence, for total birth and live birth, a Negative Binomial model will do the trick to keep the data discrete (counts) whilst dealing with the skewness of the variance relative to the mean. The results will be predicted proportions, which you have to multiply by the total to get the predicted integer. Using the Negative Binomial model will also increase the standard error of the LSMEANS and LS-differences, because it is the correct model.
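One way to obtain such predicted proportions (an assumption on my part, not necessarily the author's exact setup) is to include the log of total born as an offset, so the model estimates live born relative to total born:

data litter2;
  set litter;               /* assumed dataset from the previous sketch */
  log_tg = log(tg);         /* tg = total born (assumed variable name)  */
run;

proc glimmix data=litter2;
  class treatment sow cycle;
  model lb = treatment cycle / dist=negbin link=log offset=log_tg;
  random cycle / subject=sow type=cs residual;
  lsmeans treatment / diff ilink cl;   /* ilink then returns proportions of total born */
run;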

Let's continue with data in which variation is extremely low, such as mortality or footpad lesions. Mortality data in the animal sciences are often difficult to model because they operate at the lower boundary of the scale — luckily.

When dealing with proportions, you can run into problems when there is a complete or quasi-complete separation of values. An example is when you have extreme categories in which treatment 1 shows a 99% ‘Yes’ and treatment 2 a 99% ‘No’. As a result, the treatment completely separates the scores.

This is relatively common in binary/binomial data:

  1. Every value of the response variable in a given level of the Trt factor is 0 → Probability (Success | Trt) = 0%
  2. Every value of the response variable in a given level of the Trt factor is 1 → Probability (Success | Trt) = 100%

The model will fail to converge, and even if it does converge, the estimates are not to be trusted and the inference is not valid. Also, the maximum likelihood estimates will drift toward infinity.

Below is a 2*2 table showing two outcomes for two groups. Comparing these groups can be done via the Odds Ratio, the Relative Risk, the Absolute Risk, and the Risk difference.

Straightforward analyses of mortality scores. Sometimes, a model does not have to be more than that.

To analyze the mortality data using a 2*2 table, we will need to transform the data.

From raw form to summary data fit to be analyzed using PROC FREQ.
Here, I am asking for the Odds Ratio, Relative Risk, Risk Difference, and exact tests. I wanted to see how helpful the statistics would be. However, by just looking at the plot, I already know statistics are just plain useless here.
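A sketch of such a PROC FREQ call, with assumed dataset and variable names (mort_summary, treatment, dead, count), might look like this:

proc freq data=mort_summary;
  weight count;                                    /* summary counts per treatment-by-outcome cell */
  tables treatment*dead / relrisk riskdiff chisq;  /* odds ratio, relative risks, and risk difference */
  exact fisher or;                                 /* exact tests and exact confidence limits for the odds ratio */
run;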
Analyzing Mortality as Event Data using PROC GLIMMIX.
And the LSMEANS and differences. This is what testing at the boundary looks like.
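The events/trials syntax behind this kind of analysis looks roughly like the sketch below; dataset and variable names are assumptions.

proc glimmix data=mortality;
  class treatment pen;
  model dead/total = treatment / dist=binomial link=logit;  /* events/trials syntax */
  random intercept / subject=pen(treatment);                /* pen as the experimental unit */
  lsmeans treatment / diff ilink cl;
run;

With event counts this close to zero, expect exactly the boundary behavior described above: convergence warnings and estimates that cannot be trusted.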
The different datasets in which I included days as well.
Beta model for Footpad lesions. Did not converge as you can see from the failed estimates.
And the clear message it failed to converge.
Different data, different analysis. I am looking at the effect of the day by using the Cochran-Mantel-Haenszel statistics, which look at change. Not significant. A look at the plots would have revealed the same.
Footpad — Contingency Table showing the CMH statistics. On the left, you can see the plots.
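A sketch of the corresponding PROC FREQ request, with assumed dataset and variable names, is shown below; the first variable in the three-way TABLES request is the stratification variable.

proc freq data=footpad;
  weight count;                        /* omit if the data are one row per animal */
  tables treatment*day*lesion / cmh;   /* CMH statistics for day by lesion, stratified by treatment */
run;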
And the PROC GLIMMIX analysis. Looking at the total, you can already see that you are, once again, testing at the boundary.
Using a GLIMMIX model with a Binomial distribution with values at the boundary of space — a great recipe for non-convergence or truly useless estimates.

Let's see if we can model mortality data using the Beta distribution on another dataset.

Create new variables that are not exactly 0 or 1, as the Beta distribution does not allow values of 0 or 1. Here, you see the overall mortality for a period of 35 days. The spread is sufficient to start modeling, although one treatment shows very strange behavior.
The model doesn't run perfectly; however, it's acceptable.
Means are shown in proportions, so *100 will give you % of mortality.
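One common way to create such variables (an assumption on my part, not necessarily the exact adjustment used here) is to shrink the observed proportions slightly away from 0 and 1 before fitting the Beta model:

data broiler2;
  set broiler;                               /* assumed: mort_prop per pen, n birds per pen */
  mort_adj = (mort_prop*(n - 1) + 0.5) / n;  /* shrinks exact 0s and 1s into the open (0,1) interval */
run;

proc glimmix data=broiler2;
  class treatment block;
  model mort_adj = treatment / dist=beta link=logit;
  random block;
  lsmeans treatment / diff ilink cl;         /* ilink returns proportions; *100 gives % mortality */
run;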

Let's see if we can analyze the mortality data per week.

Mortality in Broilers — Model by Time using the Beta distribution. Once again, we are modeling at the boundary of space.

In summary, data on mortality & footpad lesions are notorious for not having enough variation to analyze. This is often because no animal died, and as a result, every animal in every pen gets the same score. In some cases, it is, therefore, best to just use a contingency table to analyze the events. Also, the creation of binary classes and analyzing them as proportions can help get a model from which inferences may be made. However, when there truly is not enough variation, it is best to just report the observed number of events. No statistics!

Last, but not least, let's dive into preference studies. Preference studies are quite funny to analyze since they measure the amount of feed eaten by an animal and split that amount by treatment. Hence, for each animal, you will have an amount eaten per feed, which can be summed across the feeds it could choose from to get a total. More specifically, if you look at the dataset below, you can see that the total amount eaten is a number that is replicated, and that the proportion of intake per row is the actual number of interest. However, since we only have two treatments, we are dealing with a zero-sum game. If you know how much is eaten of the first treatment, you also know how much is eaten of the other.

Mutually dependent data.

As you can see, per week & per animal, the total feed intake of two products was measured. The design of this experiment and the aggregation of its data make the data dependent:

  1. Total = 842
  2. IB = 642
  3. SO must be 200
Here you can see the data over time in the form of a spaghetti plot.
And the data in the form of a boxplot.
And the analysis using a Beta distribution. The trick in analyzing this kind of data is using the group= option in the covariance structure. This way, I ask the model to build two covariance structures that are mirrors of each other, which lets me estimate the complete set of data, despite it being a zero-sum game.

The random week type=un group=type statement means that I specified a separate unstructured covariance matrix for the IB and SO groups. Since the data are mutually exclusive, this makes perfect sense and allows you to use ALL the data.
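Put together, the model could look like the sketch below. The response and dataset names are assumptions, but animalnr, week, and type follow the statement quoted above.

proc glimmix data=preference;
  class animalnr week type;
  model prop_intake = type week type*week / dist=beta link=logit;
  random week / residual subject=animalnr type=un group=type;  /* separate unstructured covariance per product */
  lsmeans type*week / diff ilink cl;
run;

Because the two rows per animal are mirror images, comparing the LSMEANS against the observed values (see the comparison below) is the sanity check that nothing went wrong.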

Here you can see how I mimicked the covariance structure by using the group= option.
And a comparison of the LSMEANS and the observed values.

In summary, preference data is by definition dependent data. If you know the amount eaten of A, you also know the amount of B, IF you have two treatments. As such, they work like dummy variables. This dependency needs to be included in the mixed model to make it run — random week / residual subject=animalnr type=un group=type. The statement will actually analyze the data twice, so you must compare the results on the raw data with the LSMEANS to make sure you did not make a mistake. Because of the dependency of the data, a mistake is easy to make.

This post marks the end of the PROC GLIMMIX series. Below you can find one last reminder table showing what each specific distribution needs!

