

Generalized Linear Mixed Models in SAS

Last Updated on July 26, 2023 by Editorial Team

Author(s): Dr. Marc Jacobs

Originally published on Towards AI.


Distributions, link functions, scales, overdispersion, and normality of residuals.

All statistical models have a systematic and a random part. In ANOVA and GLM models, the systematic part is the fixed effect and the random part is the error / unexplained variance. In mixed models, we have at least two random parts:

  1. A variable intended to capture and explain variance (e.g., block)
  2. Error / unexplained variance

Our understanding of the world depends on the data that we collect and observe. This is why the design of our experiment is so important. Based on these observations, we build a statistical model which includes fixed and random parts. We build a model to describe/explain the world. We apply probability theory to the model to make inferences:

  1. What is the chance that the model does not describe the world accurately?
  2. What is the chance that what we have observed is due to chance? (which, of course, can never literally happen, since chance by itself does not play a causal role)
Matrix showing that the type of response variable(s) and type of effects included determine the procedure best used. In SAS, PROC GLIMMIX can account for all, even though it is still very much in development.
Comparison of the same type of analysis via PROC MIXED or PROC GLIMMIX.

Linear Mixed Models (LMMs) can only analyze normally distributed data. In Generalized Linear Mixed Models (GLMMs), the response (y) can come from a variety of distributions. Whilst in LMMs we analyze the response directly, in GLMMs we apply a link function to transform the data inside the model.

The linear predictor is the aggregated form of the model formula: the sum of all fixed and random effects (in matrix notation, η = Xβ + Zu). The link function g(·) connects the expected value of the response to this linear predictor: g(E[y]) = η.

There are a lot of options available in SAS to do so.

Not every link function can be applied to every distribution!
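As a minimal sketch of how this looks in code (the dataset mydata and the variables y, trt, and block are hypothetical), the distribution and link are specified as options on the MODEL statement of PROC GLIMMIX:

  proc glimmix data=mydata;
    class trt block;
    model y = trt / dist=gamma link=log;  /* distribution and link are chosen here */
    random block;                         /* the random part of the mixed model */
    lsmeans trt / ilink;                  /* back-transform the means to the data scale */
  run;

Swapping the dist= and link= options is all it takes to move between distributions, provided the combination is valid.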

Below is an example of data that needs to be log-transformed (either raw or inside the model) before it can be analyzed.

Comparison of data on the normal scale and the log scale. In PROC GLIMMIX, such transformations can be done inside the model, instead of having to transform the raw data, as you must when using PROC MIXED.

On data that comes from a log-normal distribution, using the normal distribution on the raw data is just plain wrong.

You enter, and SAS provides. SAS could not care less what you include.
That does not look very good. Remember, you did not do any inside transformations, so the residuals should be normal and they should be homogeneous. They are neither. As such, the LSMEANS per treatment, as seen on the right, are neither valid nor very useful.
Using the Normal Distribution on transformed data is better.
As can be seen by the residual output.
The Estimate and Mean columns are now the same, since the data were transformed beforehand. Normally they differ, since they represent the link-version of the estimate and its inverse link. Or, said otherwise, the model scale and the data scale of the observed data.
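For reference, a minimal sketch of this transform-first route (again with hypothetical dataset and variable names):

  data logged;
    set mydata;
    logy = log(y);  /* transform the raw data before modeling */
  run;

  proc mixed data=logged;
    class trt block;
    model logy = trt;    /* PROC MIXED assumes normality; there is no link function */
    random block;
    lsmeans trt / diff;  /* estimates live on the log scale */
  run;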

Now, let's use a different distribution instead. Since we transformed the data using a log transformation, we could opt for a lognormal distribution inside the model. This has the added benefit that an inverse transformation is done automatically.
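A sketch of what that might look like, assuming the same hypothetical dataset:

  proc glimmix data=mydata;
    class trt block;
    model y = trt / dist=lognormal link=identity;  /* transformation happens inside the model */
    random block;
    lsmeans trt / ilink;                           /* requests data-scale means as well */
  run;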

The transformation is done inside of the model, using the distribution and link function. Here, I specified link=identity, which means no transformation is applied, only a different distribution.
Residuals look good!
You can see that there is no difference between the transform-the-raw-data approach and the PROC GLIMMIX approach. This is because I used the identity link.

The data and model scale are the same in PROC MIXED, since we only use the normal distribution; there is no link function in PROC MIXED. In PROC GLIMMIX, one determines the model scale by choosing the link function. The model scale leads to non-observable estimates, non-linear functions of the response variable, which are used for inference. One can then revert to the data scale by using the ILINK option in PROC GLIMMIX, which transforms the estimates back to observables. This only works for means and not for differences, since the link function is a non-linear function for which no closed-form back-transformation of a difference exists.

Link functions are typically nonlinear, and a difference is not preserved under a nonlinear function. Hence, applying the inverse link to a difference usually produces a nonsense result. To estimate differences on the data scale, you have to be smarter (for logit scales, you have odds ratios). The distribution and link can change when you estimate; the linear predictor never changes. Since we used the identity link (link=identity), the results were the same on the model scale and on the data scale. The identity link performs no transformation within the model.
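As a sketch of how the two scales are requested in code (assuming, hypothetically, a gamma distribution with a log link for positive continuous data):

  proc glimmix data=mydata;
    class trt block;
    model y = trt / dist=gamma link=log;  /* model scale = log scale */
    random block;
    lsmeans trt / ilink;  /* adds a back-transformed Mean column on the data scale */
    lsmeans trt / diff;   /* differences remain on the model (link) scale */
  run;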

Here you can see that the model and data scale remain the same because of the identity link.

Let's use a new example, in which we take observations coming from a binomial distribution.

Binomial data
Treating Binomial data as Normal is wrong
Model scale and data scale are the same — because I use link=identity.
And the differences.
And the residuals.
Using the correct distribution.
By creating the correct dataset
Using the Normal Distribution, the mean is almost the same, but the standard errors are different
And now, to compare treatment estimates, we need to revert to Odds Ratios since the link functions are non-linear for differences.
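A sketch of what such a binomial model might look like (the dataset bindata and the variables events, trials, trt, and pen are hypothetical):

  proc glimmix data=bindata;
    class trt pen;
    model events/trials = trt / dist=binomial link=logit;  /* events/trials syntax */
    random pen;
    lsmeans trt / ilink;           /* back-transformed proportions per treatment */
    lsmeans trt / diff oddsratio;  /* treatment differences expressed as odds ratios */
  run;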

So, using the dist=binomial and link=logit instead of dist=Gaussian and link=identity has resulted in:

  1. Smaller p-values.
  2. Not being able to directly obtain the difference between two treatment proportions.
  3. Having to resort to Odds Ratios to interpret treatment differences.
The Odds Ratio is one of those great metrics in which the value can achieve two-digit magnitudes but still mean very little.

So, Linear Mixed Models (LMMs) can only analyze normally distributed data, while Generalized Linear Mixed Models (GLMMs) can handle a wide range of distributions. To do so, a GLMM transforms the data inside the model using the link function (model scale). To obtain model results on the original scale (data scale), we use the inverse link. Using a GLMM for data with a non-normal distribution will provide you with the correct results! For binomial data, Odds Ratios will now have to be used to interpret treatment differences.

Let's talk more about probability and distributions. In statistics, a probability distribution is a mathematical function that provides the probabilities of occurrence of different possible outcomes in an experiment.

The world-famous Normal (Gaussian) distribution. Amazingly, many variables in the world can be approximated using the Normal Distribution. Then again, it is just a distribution, not the truth.
Left: Normal Distribution. Right: a Binomial Distribution pushed to its limits. Many distributions will look like a Normal Distribution given certain parameter values. But even if they look like it, they are not, which can be seen by the white space in between the bars. Whilst the left distribution is very much continuous, the right is absolutely discrete.

The probability distribution is different for continuous outcomes and discrete outcomes (integers). You can see some examples in the table below.

This is not an exhaustive table. Many more probability distributions exist, and I have not even discussed mixtures, although the negative binomial (which is a gamma-Poisson mixture) comes very close.
Different distributions can be used to fit different responses or purposes. Here, you can see examples coming from animal science, but once again, this is far from exhaustive.

And if we are going to talk distributions, we absolutely NEED to talk about the mean-variance relationship, which in large part determines how distributions behave and how useful they will be in analyzing different types of data.

The Normal Distribution is such a nice distribution because we can estimate the mean and the variance separately! Hence, if an underlying process is normal, it will be very easy to estimate the properties of the theoretical normal distribution!

Here, you can see how normal the Normal Distribution feels — just include a mean and a variance.
Some technical detail on the Normal, but it basically says that the mean and the variance can be determined separately. That is a very big piece of freedom to have!
The Poisson Distribution, however, has only a single parameter, which serves as both its mean and its variance. Hence, no real freedom.
It can mimic a Normal Distribution in its shape, but not in its properties.
Using the Normal Distribution on Binomial Data. It may follow its shape, but for sure it cannot follow its properties.

The difficulty of dealing with non-normal data comes from the dependency between the mean estimator and the variance estimator. In addition, the data are discrete and thus no longer intuitive (most of us were taught about the normal distribution in high school). So, in the absence of a mean, we now have to talk about proportions, and we use distributions to model the data as we see them, extrapolate to the population, and base inference (p-values) on that. To do so, we have two major types of distributions: continuous and discrete.

The normal distribution is continuous, the most widely used, and the easiest to interpret, since it has two parameters that can be estimated separately: mean and variance. For distributions in which the mean and variance are linked, overdispersion occurs often; the variance is underestimated, leading to false positives.

And so we arrive at the topic of overdispersion, which is where the mean-variance relationship becomes extremely interesting. Below, you see the characteristics of four distributions and how their mean and variance are specified. The Poisson Distribution immediately sticks out.

Overdispersion was briefly mentioned in the previous part and is something we will highlight further before proceeding to the more practical examples. Overdispersion is not an issue in Linear Mixed Models as LMMs only deal with data coming from a Normal Distribution, and in a Normal Distribution, the variance is estimated separately from the mean. In GLMMs, however, some form of dispersion is practically always present, meaning that the variance is either underestimated (overdispersion) or overestimated (underdispersion).

Let's use an example to make the issue clear.

Dataset showing count data, the favourite type of data for a Poisson Distribution, since it is discrete and often has a large frequency at one. However, that is not the case here, so let's see how the Poisson fares.

A simple and straightforward way to assess if the Poisson is a good distribution to fit is to plot the mean-variance relationship across groups. The dots should be on the red line.

They clearly are not.
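A quick sketch of how such a check could be made in SAS (the dataset counts and the variables trt and count are hypothetical):

  /* mean and variance of the counts, per treatment group */
  proc means data=counts noprint nway;
    class trt;
    var count;
    output out=mv mean=mean var=variance;
  run;

  /* for a Poisson, the dots should fall on the variance = mean reference line */
  proc sgplot data=mv;
    scatter x=mean y=variance;
    lineparm x=0 y=0 slope=1;
  run;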

Overdispersion often happens because the linear predictor is incomplete, a result of not having included enough of, or the most important, variables. To detect overdispersion, you need to check whether there is more variance in the model than expected given the sample size, by dividing the variance detected by the degrees of freedom. To get this metric, you need to change the estimation method in GLIMMIX, which opens up quite some complex statistical theory:

  1. Pseudo-Likelihood
  2. Laplace (method=laplace)
  3. Quadrature (method=quad(qpoints=5))

Options two and three cannot be used in PROC GLIMMIX if you have an R-side random/residual statement included. To check numerically for overdispersion, divide the Pearson Chi-Square by its Degrees of Freedom. If the value is between 0.5 and 1.5 you are OK, but there is no official cut-off.

0.5–1.5 = OK

In this example, we actually detect underdispersion. When there is no dispersion, the Pearson Chi-Square / DF value should be 1, meaning that the variance is estimated correctly given the degrees of freedom present. A value outside the range [0.5, 1.5] would constitute too much dispersion. In extreme cases, you will see a value between 5 and 10, meaning that the variance is underestimated by a factor of 5 or 10.

Let's test all of this using an example:

Detecting overdispersion statistically. To do so, the method needs to be Laplace or Quadrature (because they mimic Maximum Likelihood). Because it is a Poisson, I also included an offset variable, which brings the data more in line with the nature of the Poisson, which is a rate.
Some new features that you might not recognize from PROC MIXED code.
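A sketch of such a model (the offset variable logdays, i.e., the log of the exposure time, is hypothetical):

  proc glimmix data=counts method=laplace;  /* Laplace mimics Maximum Likelihood */
    class trt block;
    model count = trt / dist=poisson link=log offset=logdays;  /* offset turns counts into rates */
    random block;  /* G-side effect only: R-side effects are not allowed with Laplace/Quadrature */
  run;

With METHOD=LAPLACE, the fit statistics for the conditional distribution include the Pearson Chi-Square / DF ratio used above to judge dispersion.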
Dealing with the data by using a different distribution, but the same link function. This means that statistics are comparable!
The models fit the data equally well, despite the Negative Binomial having an additional parameter to work wonders with. The fact that the scale parameter has a low estimate might explain why going from a Poisson Distribution with one parameter to a Negative Binomial with two does not add much to your effort of explaining the data.
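A sketch of the Negative Binomial counterpart; because the log link is kept, the information criteria of the two models can be compared directly:

  proc glimmix data=counts method=laplace;
    class trt block;
    model count = trt / dist=negbin link=log offset=logdays;  /* same link, extra scale parameter */
    random block;
  run;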

To recap: in the Normal Distribution, the mean and variance are estimated separately. In other distributions (e.g., Poisson, Binomial, Beta), the variance is a function of the mean. This often creates overdispersion, an underestimation of the variance leading to false positives. Underdispersion is less common, but can surely happen. In both cases, the best approach is to change the distribution to one that provides you with more freedom to model.

In most models, there is the assumption that the residuals of a model are i.i.d., that is, independent and identically distributed random variables. How this works, you can see in the simulation below.

The observed value is determined by a ‘true’ value and some error. The error, or residual, is assumed to come from a Normal Distribution in which each draw is uncorrelated with every other draw. Hence, the black values are what you would see in a dataset, and the purple and lime values are the estimates of the fixed effects and the residuals. This is what Maximum Likelihood does: it works backwards.
Distribution of values simulated from a Normal Distribution. However, the observed data is not Normal. That is not a problem, as long as the residuals follow a Normal Distribution.
And, luckily, they do, but we already knew that because we simulated them.
The model used to obtain the i.i.d residuals.
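A sketch of such a simulation and the model used to retrieve the i.i.d. residuals (effect sizes and the seed are made up):

  data sim_normal;
    call streaminit(2023);
    do trt = 1 to 3;
      do rep = 1 to 50;
        eps = rand('normal', 0, 1);  /* i.i.d. draws from a Normal Distribution */
        y   = 10 + 2*trt + eps;      /* observed = 'true' value + error */
        output;
      end;
    end;
  run;

  ods graphics on;
  proc glimmix data=sim_normal plots=residualpanel;
    class trt;
    model y = trt / dist=normal link=identity;
  run;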

Now, let's simulate a Poisson distribution and see what the residuals will look like.
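A sketch of the simulation (again with made-up effect sizes):

  data sim_poisson;
    call streaminit(2023);
    do trt = 1 to 3;
      do rep = 1 to 50;
        lambda = exp(0.5 + 0.4*trt);       /* the mean grows with treatment on the log scale */
        count  = rand('poisson', lambda);  /* discrete, non-negative draws */
        output;
      end;
    end;
  run;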

Left: a traditional count-data distribution with the mode at one and a right-skewed slope. Right: an effort to see if a normal distribution could do the trick. It could, but it would have to go negative, and there are no negative counts. Hence, not a good fit, and the kernel already starts to hint at a Poisson (or a Gamma).
An easier way to assess the viability of a Poisson is to create a mean-variance table. That does not look good for a Poisson.
Let's mimic the mean and the variance.

A Poisson distribution based on the mean looks different from one based on the variance. In a Poisson distribution, a larger λ will cause a longer tail. Let's see what happens if we start modeling the data.

Residuals look normal, but for sure not homogeneous. Surprising how the Normal Distribution is still able to produce such a graph.
The residuals coming from a Poisson Distribution. For sure less normal and also not homogeneous, but why would we expect such a thing in the first place?
Here, a plot in which a Poisson and a Negative Binomial are overlaid. The Negative Binomial looks to have the better claim.
Using the Negative Binomial Distribution does not really provide a better outlook either.
Residuals coming from a Normal Distribution, Poisson Distribution, and Negative Binomial Distribution.

The above exercise has valuable lessons:

  1. Do NOT use the plots from the GLIMMIX model to assess whether one model is better than another.
  2. Stick to the information criteria described in the Linear Mixed Model course.
  3. It is best to look at the LSMEANS, the standard errors, and metrics for over-/underdispersion.
  4. In Generalized Linear Mixed Models, the errors need not come from a normal distribution.

Hence, using residual plots to assess model fit and to compare models is tricky and will lead to incorrect conclusions, even if the plots look the way you are used to. It is best to look at the LSMEANS, the standard errors, and metrics for over-/underdispersion.

Hope you enjoyed this! More to come on specific types of data and how to deal with them!
