Day 4 Linear mixed models

September 8th

4.1 Announcements

  • Assignment 1 grades are posted. Remember that you have one chance to resubmit for full points.
  • Assignment 2 is due on Wednesday.
  • Last chance to join the mixed models workshop this weekend.

Semester project:

  • Proposal due September 24th – see example 1 | see example 2
  • Must contain: Background, Objectives.
  • Data for the project must be ready to go.
  • Final product: Manuscript + Tutorial.
  • Working in teams vs. solo.
  • Start as soon as possible!

This week

  • Today: intro to mixed models + blocks modeling discussion
  • Wednesday: mixed models applied to the fungicide experiment + Kahoot

4.2 Statistical models

Statistical models can typically be defined as the combination of a deterministic component (e.g., the linear predictor) and a random component (e.g., the distribution of the data).

Learn more about this way of describing statistical models in Stroup et al. (2024) page 4 (and all of Chapter 1 in general).

4.3 The general linear model

The general linear model is one of the most commonly used statistical models. We call it “general” because it assumes that the data arise from a normal distribution. [Note: Do not mix up “general” (normal distribution) with “generalized” (multiple possible distributions).] We call it “linear” because the parameters enter the model linearly. For example, the predictor \(\beta_0 +\beta_1 x + \beta_2 x^2\) corresponds to a linear model, because the parameters enter linearly (even though \(x\) enters as a square). The predictor \(\alpha \exp(\beta x)\) corresponds to a nonlinear model, because the parameter \(\beta\) enters through the exponent.
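To see why \(\beta_0 +\beta_1 x + \beta_2 x^2\) is still a linear model, note that `lm()` can fit it directly; a minimal sketch with simulated data (variable names and coefficient values are illustrative):

```r
# y is a quadratic function of x, but linear in the parameters
set.seed(1)
x <- seq(0, 10, length.out = 50)
y <- 2 + 1.5 * x - 0.3 * x^2 + rnorm(50, sd = 1)

# lm() handles the squared term because the *parameters* enter linearly
m_quad <- lm(y ~ x + I(x^2))
coef(m_quad)  # three estimates: (Intercept), x, I(x^2)
```

The nonlinear predictor \(\alpha \exp(\beta x)\), by contrast, would require `nls()` or a similar nonlinear fitter.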

The general linear model can be written as \[\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol\varepsilon, \\ \boldsymbol{\varepsilon}\sim N(\boldsymbol{0}, \mathbf{\Sigma}),\] where \(\mathbf{y}\) is the vector of the data, \(\mathbf{X}\) is the design matrix containing the predictor values, \(\boldsymbol{\beta}\) is a vector containing all parameters, and \(\boldsymbol\varepsilon\) is a vector containing the residuals (i.e., the differences between observed and predicted values). Note that \(\boldsymbol\varepsilon\) arises from a multivariate normal distribution, with mean zero and variance-covariance matrix \(\mathbf{\Sigma} = \sigma^2 \mathbf{I}\).

Recall some cool properties:

  • The least squares estimate of \(\boldsymbol\beta\), \(\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\), is also the MLE and is unbiased.
  • \(\hat\sigma^2 = \frac{SSE}{df_e}\), not the MLE, is the unbiased estimate of \(\sigma^2\).
  • \(CI(\hat\mu_{ij}) = \hat\mu_{ij} \pm t_{df, \frac{\alpha}{2}} \cdot se(\hat{\mu}_{ij})\).
  • Under a CRD, \(se(\hat\mu_{ij}) = \sqrt{\frac{\sigma^2_e}{r}}\), where \(\sigma^2_e\) is the residual variance, and \(r\) is the number of repetitions.
  • ANOVA tables help us distinguish the most important sources of variability in the data. See Gelman (2005).
    • Essentially, an analysis of variance (ANOVA) divides the predictors into bins and analyzes whether each bin is relevant to explain variability in the data.
    • Typically, ANOVA tables show the sources of variation, their degrees of freedom, an F statistic and, sometimes, a p-value.
    • Sources of variation may come from treatment factors, or elements of the design.
    • Degrees of freedom tell us how many independent pieces of information support each source.
    • The F statistic serves as a multivariate t statistic, since it is used to test multiple predictors at once.
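The first two properties can be checked numerically; a minimal base-R sketch on simulated data (names and values are illustrative):

```r
set.seed(42)
n <- 30
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)

m <- lm(y ~ x)
X <- model.matrix(m)

# least squares estimate by hand: (X'X)^{-1} X'y
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

# unbiased residual variance: SSE / df_e
sse        <- sum(resid(m)^2)
sigma2_hat <- sse / m$df.residual

all.equal(as.numeric(beta_hat), unname(coef(m)))  # TRUE
all.equal(sigma2_hat, sigma(m)^2)                 # TRUE
```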

4.3.1 Applied example

Swine nutritionists are studying the effect of dietary treatments on the % of fecal dry matter, and whether said effects change at different points in time. In this experiment, 450 pigs were allocated to pens of 5 pigs each (90 pens). Each pen is independently assigned a dietary treatment. There are 6 dietary treatments: a negative control, a positive control, and 4 different pre-commercial additives.
So, there are 15 pens per treatment. One (randomly selected) pig per pen was observed at 3 points in time.

  • What is the experimental unit?
  • What is the observational unit?
  • What is the treatment structure?
  • What is the design structure?

A reasonable model could be

\[y_{ijk} \sim N(\mu_{ijk}, \sigma^2),\]

\[\mu_{ijk} = \eta_{ijk},\] \[\eta_{ijk} = \eta_0 + D_i + T_j + (DT)_{ij},\]

where:

  • \(y_{ijk}\) is the observed dry matter (%) for the \(i\)th dietary treatment, \(j\)th time, and \(k\)th repetition,
  • \(\mu_{ijk}\) is the expected value for the \(i\)th dietary treatment, \(j\)th time, and \(k\)th repetition,
  • \(\sigma^2\) is the residual variance,
  • \(\eta_{ijk}\), the linear predictor, is equivalent to the expected value because the link function is the identity function,
  • \(\eta_0\) is the overall mean of the linear predictor,
  • \(D_i\) is the effect of the \(i\)th dietary treatment,
  • \(T_j\) is the effect of the \(j\)th time,
  • \((DT)_{ij}\) is the interaction for the \(i\)th dietary treatment and \(j\)th time.

A reasonable ANOVA shell could be:

Table 4.1: ANOVA table for a CRD with a two-way factorial treatment structure.

  Source   df
  Diet     d-1
  Time     t-1
  D x T    (d-1)(t-1)
  Error    N-(d*t)
  Total    N-1

Table 4.1: ANOVA table for the diet CRD.

  Source   df
  Diet     5
  Time     2
  D x T    10
  Error    252
  Total    269

Let’s implement that model:

# load the data 
pigs_crd <- read.csv("../../data_confid/fecalm_swine_crd.csv")
pigs_crd$Day <- as.factor(pigs_crd$Day)

# fit the model
m <- lm(F.Dry.matter ~ Day*Trt, data = pigs_crd)
# model checks
plot(predict(m), abs(resid(m)))

car::qqPlot(m)

## [1] 109 148
# anova
car::Anova(m, type = 2, test.statistic = "F")
## Anova Table (Type II tests)
## 
## Response: F.Dry.matter
##            Sum Sq  Df F value    Pr(>F)    
## Day         850.7   2  8.6720 0.0002279 ***
## Trt         896.5   5  3.6556 0.0032772 ** 
## Day:Trt     673.1  10  1.3723 0.1934086    
## Residuals 12360.9 252                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can even get the standard errors and, together with the degrees of freedom, get the 95% confidence intervals.

library(emmeans)
# get the s.e. by hand
n.reps <- 45 # observations per treatment mean: 15 pens x 3 time points
sigma2.hat <- sigma(m)^2

# here you have the s.e.
(se.mu <- sqrt(sigma2.hat/n.reps))
## [1] 1.044041
# get the CI
# get the mean 
mu <- as.data.frame(emmeans(m, ~Trt, at = list(Trt = "A")))[,2]
# get the t statistic for the CI
t.star <- qt(p = .975, df = m$df.residual)

(lower.CI_TrtA <- mu - t.star*se.mu)
## [1] 16.40749
(upper.CI_TrtA <- mu + t.star*se.mu)
## [1] 20.5198
mg_means <- emmeans(m, ~Trt)
head(mg_means)
##  Trt emmean   SE  df lower.CL upper.CL
##  A     18.5 1.04 252     16.4     20.5
##  B     23.6 1.04 252     21.5     25.6
##  C     20.1 1.04 252     18.1     22.2
##  D     18.3 1.04 252     16.3     20.4
##  E     19.8 1.04 252     17.8     21.9
##  F     18.6 1.04 252     16.5     20.6
## 
## Results are averaged over the levels of: Day 
## Confidence level used: 0.95

4.4 Linear mixed models applied to designed experiments

When the independence assumption cannot be taken for granted, especially when we know the dependence pattern in the data, we must account for it. Take the previous example, but in a blocked design instead of a CRD. The blocks are rooms: it is hard to find a single room that fits all 90 pens, so the pens are split across rooms, and pens sharing a room are no longer truly independent.

Consider an experiment with the same treatments and the same number of EUs, but run in 3 different rooms with 30 pens each. Each room has 5 pens of each dietary treatment.

  • What is the experimental unit?
  • What is the observational unit?
  • What is the treatment structure?
  • What is the design structure?

Let’s re-do the statistical model and ANOVA shell:

A reasonable model could be

\[y_{ijkl}|b_k \sim N(\mu_{ijkl}, \sigma^2),\]

\[\mu_{ijkl} = \eta_{ijkl},\] \[\eta_{ijkl} = \eta_0 + D_i + T_j + (DT)_{ij} + b_k,\]

where:

  • \(y_{ijkl}\) is the observed dry matter (%) for the \(i\)th dietary treatment, \(j\)th time, \(k\)th room, and \(l\)th repetition,
  • \(\mu_{ijkl}\) is the expected value for the \(i\)th dietary treatment, \(j\)th time, \(k\)th room, and \(l\)th repetition,
  • \(\sigma^2\) is the residual variance,
  • \(\eta_{ijkl}\), the linear predictor, is equivalent to the expected value because the link function is the identity function,
  • \(\eta_0\) is the overall mean of the linear predictor,
  • \(D_i\) is the effect of the \(i\)th dietary treatment,
  • \(T_j\) is the effect of the \(j\)th time,
  • \((DT)_{ij}\) is the interaction for the \(i\)th dietary treatment and \(j\)th time,
  • \(b_k\) is the effect of the \(k\)th room, \(b_k \sim N(0, \sigma^2_b)\).

A reasonable ANOVA shell could be:

Table 4.2: ANOVA table for a GRCBD with a two-way factorial treatment structure.

  Source   df
  Room     r-1
  Diet     d-1
  Time     t-1
  D x T    (d-1)(t-1)
  Error    N-(d*t)-(r-1)
  Total    N-1

Table 4.2: ANOVA table for the diet GRCBD.

  Source   df
  Room     2
  Diet     5
  Time     2
  D x T    10
  Error    250
  Total    269
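The shell's degrees of freedom can be double-checked with quick arithmetic (d = 6 diets, t = 3 times, r = 3 rooms, N = 270 observations):

```r
d <- 6    # dietary treatments
t <- 3    # time points
r <- 3    # rooms (blocks)
N <- 270  # total observations: 90 pens x 3 time points

df_room  <- r - 1                # 2
df_diet  <- d - 1                # 5
df_time  <- t - 1                # 2
df_int   <- (d - 1) * (t - 1)    # 10
df_error <- N - d * t - (r - 1)  # 250
df_total <- N - 1                # 269

# the sources must add up to the total
df_room + df_diet + df_time + df_int + df_error == df_total  # TRUE
```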

Let’s implement that model:

# load the data 
pigs_grcbd <- read.csv("../../data_confid/fecalm_swine_grcbd.csv")
pigs_grcbd$Day <- as.factor(pigs_grcbd$Day)
pigs_grcbd$Room <- as.factor(pigs_grcbd$Room)

# fit the model
library(lme4)
m_random <- lmer(F.Dry.matter ~ Day*Trt + (1|Room), data = pigs_grcbd)
# model checks
plot(predict(m_random), abs(resid(m_random)))

# anova
car::Anova(m_random, type = 2, test.statistic = "F")
## Analysis of Deviance Table (Type II Wald F tests with Kenward-Roger df)
## 
## Response: F.Dry.matter
##               F Df Df.res    Pr(>F)    
## Day     10.7115  2    250 3.441e-05 ***
## Trt      4.7730  5    250 0.0003483 ***
## Day:Trt  1.1153 10    250 0.3509474    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Notice that the 252 residual degrees of freedom from the CRD are now spread across different levels, connected to different variance components: blocks and random error.

4.4.1 General notation for linear mixed models

The standard notation for mixed models is \[\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{u} + \boldsymbol\varepsilon, \\ \boldsymbol{u}\sim N(\boldsymbol{0}, \mathbf{G}), \\ \boldsymbol{\varepsilon}\sim N(\boldsymbol{0}, \mathbf{R}),\] where:

  • \(\mathbf{y}\) is the vector containing the data,
  • \(\boldsymbol{\beta}\) is the vector containing the model fixed effects (usually treatments),
  • \(\mathbf{X}\) is the matrix containing information about the treatment allocation,
  • \(\boldsymbol{u}\) is the vector containing the model random effects (usually elements of the design). Note that \(\boldsymbol{u}\) arises from a normal distribution with mean 0 and variance-covariance matrix \(\mathbf{G}\).
  • \(\mathbf{Z}\) is the matrix containing information about the design,
  • \(\boldsymbol\varepsilon\) is the vector of model residuals. Note that \(\boldsymbol{\varepsilon}\) arises from a normal distribution with mean 0 and variance-covariance matrix \(\mathbf{R}\).

Note that the conditional distribution of \(\mathbf{y}\) is \[\mathbf{y} \vert \boldsymbol{u} \sim N(\mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{u},\mathbf{R}).\]

The marginal distribution of \(\mathbf{y}\) is
\[\mathbf{y} \sim N(\mathbf{X}\boldsymbol{\beta}, \mathbf{Z}\mathbf{G}\mathbf{Z}' + \mathbf{R}).\]
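To make \(\mathbf{Z}\mathbf{G}\mathbf{Z}' + \mathbf{R}\) concrete, here is a tiny base-R sketch for 2 blocks of 3 observations each (the variance values are made up for illustration):

```r
sigma2_b <- 4  # block variance (illustrative)
sigma2_e <- 1  # residual variance (illustrative)

# Z maps each of the 6 observations to one of the 2 blocks
Z <- kronecker(diag(2), matrix(1, nrow = 3, ncol = 1))
G <- sigma2_b * diag(2)
R <- sigma2_e * diag(6)

# marginal variance-covariance matrix of y
V <- Z %*% G %*% t(Z) + R

V[1, 2]  # same block: covariance = sigma2_b (4)
V[1, 4]  # different blocks: covariance = 0
V[1, 1]  # variance = sigma2_b + sigma2_e (5)
```

The off-diagonal blocks show exactly how sharing a room (block) induces the dependence discussed above.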

So, essentially, moving from the CRD to the blocked design left the treatment effects, their effect sizes, and the estimated means unchanged, but the variance-covariance matrix did change. As a consequence, a few things changed:

  • Shrinkage
  • Estimation of \(\sigma^2\)
  • Estimation of \(se(\hat\mu)\)
  • Degrees of freedom change

4.4.2 Class discussion: Blocks - random or fixed?

Treatment point estimates don’t change whether blocks are fixed or random.
# dplyr and ggplot2 are needed for the plotting code below
library(dplyr)
library(ggplot2)

m_fixed <- lm(F.Dry.matter ~ Day*Trt + Room, data = pigs_grcbd)
mg_means_fixed <- emmeans(m_fixed, ~Trt, contr = list(c(1, -1, 0, 0, 0, 0)))
## NOTE: Results may be misleading due to involvement in interactions
mg_means_random <- emmeans(m_random, ~Trt, contr = list(c(1, -1, 0, 0, 0, 0)))
## NOTE: Results may be misleading due to involvement in interactions
as.data.frame(mg_means_fixed$emmeans) %>%
  mutate(blocks = "fixed") %>% 
  bind_rows(as.data.frame(mg_means_random$emmeans) %>%
              mutate(blocks = "random")) %>% 
  ggplot(aes(Trt, emmean))+
  geom_errorbar(aes(ymin = emmean-SE, ymax = emmean+SE, 
                    color = blocks), 
                position = position_dodge(width = .2),
                width = 0)+
  geom_text(aes(x = 4.5, y = 22.75), label = "s.e.(difference)")+
  geom_errorbar(aes(x = 5.4, y = 22, ymin = 22 , ymax = 22 + SE),
                width = 0.1,
                data = as.data.frame(mg_means_random$contrasts))+
  geom_errorbar(aes(x = 5.2, y = 22, ymin = 22 , ymax = 22 + SE),
                width = 0.1,
  data = as.data.frame(mg_means_fixed$contrasts))+
  geom_point(aes(color = blocks), position = position_dodge(width = .2))+
  theme_classic()+
  labs(color = "Blocks modeled as",
       y = "Fecal DM (%)",
       x = "Treatment")+
  theme(legend.position = "bottom")


4.5 Homework

  • Read Chapter 2 in Stroup’s GLMM book (1st ed. superior to 2nd ed.), particularly the “What Would Fisher Do” ANOVA shells.