Analyzing non-normal data in SAS — log data, mortality, litter, and preference scores.
Author(s): Dr. Marc Jacobs
Originally published on Towards AI.
This post is the last in a series showing the potential of PROC GLIMMIX, the de facto SAS procedure for fitting Generalized Linear Mixed Models. After an introductory post and examples using the Multinomial, Binary, Binomial, Beta, Poisson, and Negative Binomial distributions, it is now time to go deeper into situations often encountered when analyzing data in the animal sciences. To be more precise, I will show examples of bacterial counts, litter scores, mortality scores, and preference scores. Enjoy!
First off, bacterial counts and how to analyze data on the log scale. Data with a distribution so wide that it needs to be analyzed on a log scale can be surprisingly difficult to model in GLIMMIX, especially when you apply the log function directly. I will show you later what I mean.
Below you see Enterobacteriaceae (entero) data at two time points, five and fifteen, across the blocks of a Randomized Complete Block Design.
Below you can see the datasets used. This was a nested design with data across departments, blocks, and pens.
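To make that setup concrete, a minimal GLIMMIX sketch for such a nested design could look like the snippet below. This is not the original program: the dataset and variable names (bacteria, entero, treatment, department, block, pen) are assumptions, and the point is simply that the raw counts are modeled with a Negative Binomial and its log link, so the log transformation happens inside the model. With the two time points, you could run this per time point or add time and its interaction with treatment.

```
/* Sketch only: dataset and variable names are assumed, not the original program */
proc glimmix data=bacteria;
  class treatment department block pen;
  /* Raw counts with a log link: the log lives inside the model, not in the data */
  model entero = treatment / dist=negbin link=log;
  /* Nested design: blocks within departments; pen is the observational unit */
  random department block(department);
  lsmeans treatment / ilink cl;  /* back-transformed means on the data scale */
run;
```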
The above looked quite straightforward. A different song will be sung when you analyze the E.coli/Entero rate. Two simple plots below readily show that you are dealing with a rate that sits at the boundary of its space.
The problem in this example is not the distribution, but the data. The way the data is set up produces a strange rate that cannot be mimicked by the majority of the known distributions. Of course, a mixture distribution could be used, but what about changing the rate?
So, that is what I did: I changed the rate by using the raw data instead of the log-scale values. If you plot this rate, it looks much more Normal, and the Negative Binomial will have no problem modeling it.
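As a sketch of that approach (the names bacteria, ecoli, entero, treatment, and block are assumptions), the rate is computed from the raw counts in a data step and then handed to GLIMMIX with a Negative Binomial and log link:

```
/* Sketch only: build the rate from raw counts, not from log-scale values */
data rates;
  set bacteria;
  rate = ecoli / entero;
run;

proc glimmix data=rates;
  class treatment block;
  model rate = treatment / dist=negbin link=log;
  random block;
  lsmeans treatment / ilink cl;
run;
```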
In summary, I showed the analysis of four different variables on the log scale, which were then combined into two new variables:
- a rate of E.Coli and Enterobacteriaceae
- a ratio of Lactobacillus and Clostridium
The ratio of Lactobacillus/Clostridium can be analyzed like the FCR in PROC MIXED. The rate of log(E.Coli)/log(Enterobacteriaceae) presents problems that are more difficult to overcome, because its distribution is hard to model. It is best to use E.Coli/Enterobacteriaceae as a raw variable and to apply the log transformation inside the model by using the Negative Binomial distribution.
Let's proceed to the analysis of live births and stillbirths. In a previous post, I already showed you how to use the Beta-Binomial to simulate a sample size for these outcomes. Here, I will show you the usefulness of the Negative Binomial distribution.
The dataset looks like this: we have columns for total born, live born, stillborn, and mummified, recorded per sow and per cycle.
Let's analyze the data using PROC MIXED and PROC GLIMMIX.
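As a hedged sketch of the comparison (dataset and variable names such as litters, totalborn, treatment, sow, and cycle are assumptions, not the original code):

```
/* Normal model: treats the count as continuous, so the LSMEANS
   come back as fractions of a piglet */
proc mixed data=litters;
  class treatment sow cycle;
  model totalborn = treatment cycle;
  random sow;
  lsmeans treatment / cl;
run;

/* Negative Binomial model: keeps the outcome discrete and respects
   the mean-variance relationship; ILINK back-transforms the LSMEANS */
proc glimmix data=litters;
  class treatment sow cycle;
  model totalborn = treatment cycle / dist=negbin link=log;
  random sow;
  lsmeans treatment / ilink cl;
run;
```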
The problem with using a continuous distribution on discrete data is that the results will be presented on a continuum, which results in estimating parts of a piglet.
Below, you see the estimates provided via the Normal distribution and the Negative Binomial distribution. Of course, the Normal makes no sense here.
So, analyzing total, still, and live births correctly is not easy, although it seems very easy. Total births and live births seem to follow a normal distribution, but cannot be analyzed as such: how do you interpret a treatment gain of one-fifth of a piglet?
Hence, for total and live births, a Negative Binomial model will do the trick: it keeps the data discrete (counts) whilst dealing with how the variance scales with the mean. The results will be predicted proportions, which you have to multiply by the total to get the predicted counts. Using the Negative Binomial model will also increase the standard errors of the LSMEANS and LS-differences, simply because it is the correct model.
Let's continue with data in which variation is extremely low, such as mortality or footpad lesions. Mortality data in the animal sciences are often difficult to model because, luckily, they operate at the lower boundary of the scale.
When dealing with proportions, you can run into problems when there is a complete or quasi-complete separation of values. An example is when you have extreme categories in which treatment 1 shows a 99% ‘Yes’ and treatment 2 a 99% ‘No’. As a result, the treatment completely separates the scores.
This is relatively common in binary/binomial data:
- Every value of the response variable in a given level of the Trt factor is 0 → P(Success | Trt) = 0%
- Every value of the response variable in a given level of the Trt factor is 1 → P(Success | Trt) = 100%
The model will fail to converge, and even if it does converge, the estimates are not to be trusted and the inference is not valid, because the maximum likelihood estimates run off to infinity.
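A toy illustration of the problem (the numbers below are made up purely to show the mechanism): when every animal in one treatment dies and none in the other, the maximum likelihood estimate of the treatment effect on the logit scale is infinite.

```
/* Made-up data: treatment A has 0% mortality, treatment B has 100% */
data separated;
  input trt $ dead total;
  datalines;
A   0 100
B 100 100
;

proc glimmix data=separated;
  class trt;
  model dead/total = trt / dist=binomial link=logit;
run;
/* Expect convergence warnings and an extreme estimate with an enormous
   standard error for trt; such output should not be trusted */
```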
Below is a 2×2 table showing two outcomes for two groups. Comparing these groups can be done via the Odds Ratio, the Relative Risk, the Absolute Risk, and the Risk Difference.
To analyze the mortality data using a 2×2 table, we will need to transform the data.
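A sketch of what that can look like with PROC FREQ; the counts below are hypothetical, as are the variable names:

```
/* Hypothetical 2x2 mortality counts per treatment */
data mort2x2;
  input trt $ dead $ count;
  datalines;
T1 yes  4
T1 no  96
T2 yes 10
T2 no  90
;

proc freq data=mort2x2 order=data;
  weight count;
  tables trt*dead / chisq relrisk riskdiff;  /* odds ratio, relative risk, risk difference */
  exact fisher;                              /* exact test, useful with sparse cells */
run;
```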
Let's see if we can model the mortality data with the Beta distribution, using another dataset.
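A minimal sketch of such a Beta model, assuming pen-level mortality proportions are stored in a variable called propdead (the dataset and variable names are assumptions). Note that the Beta distribution requires responses strictly between 0 and 1, which already hints at why pens without any deaths make this difficult.

```
/* Sketch only: Beta model for pen-level mortality proportions */
proc glimmix data=mortality method=laplace;
  class treatment block;
  model propdead = treatment / dist=beta link=logit;
  random block;
  lsmeans treatment / ilink cl;
run;
```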
Let's see if we can analyze the mortality data per week.
In summary, data on mortality and footpad lesions are notorious for not having enough variation to analyze. Often no animal dies, and as a result, every animal in every pen gets the same score. In some cases, it is therefore best to just use a contingency table to analyze the events. Also, creating binary classes and analyzing them as proportions can help to obtain a model from which inferences may be made. However, when there truly is not enough variation, it is best to just report the observed number of events. No statistics!
Last but not least, let's dive into preference studies. Preference studies are quite fun to analyze, since they measure the amount of feed eaten by an animal and split that amount by treatment. Hence, for each animal you have an amount eaten per feed, which can be summed across the feeds it could choose from to give a total. More specifically, if you look at the dataset below, you can see that the total amount eaten is a number that is replicated across rows and that the proportion of intake per row is the actual number of interest. However, since we only have two treatments, we are dealing with a zero-sum game: if you know how much was eaten of the first treatment, you also know the other.
As you can see, per week and per animal, the total feed intake of two products was measured. The design of this experiment and the aggregation of its data make the data dependent:
- Total = 842
- IB = 642
- SO must be 200
The random week / type=un group=type statement means that I specified a separate unstructured covariance matrix for the IB and the SO group. Since the data are mutually exclusive, this makes perfect sense and allows you to use ALL the data.
In summary, preference data are by definition dependent data. If you know the amount eaten of A, you also know B, IF you have two treatments. As such, they work like dummy variables. This dependency needs to be included in the mixed model to make it run: random week / residual subject=animalnr type=un group=type. The statement will actually analyze the data in duplicate, so you must compare the results on the raw data with the LSMEANS to make sure you did not make a mistake. Because of the dependency in the data, a mistake is easy to make.
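Built around the RANDOM statement quoted above, a sketch of the full call could look as follows; the dataset and the remaining variable names (preference, intake, type, week, animalnr) are assumptions.

```
/* Sketch only: the RANDOM statement mirrors the one described in the text */
proc glimmix data=preference;
  class type week animalnr;
  model intake = type week type*week;
  /* A separate unstructured covariance matrix per feed type (group=type),
     with the animal as subject, carries the dependency between the two
     rows each animal contributes per week */
  random week / residual subject=animalnr type=un group=type;
  lsmeans type / cl;
run;
```

As noted above, compare the LSMEANS with simple means on the raw data to check that the dependency was handled correctly.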
This post marks the end of the PROC GLIMMIX series. Below you can find the last reminder table showing what each specific distribution needs!