Day 18 Applications of Smoothing Splines

18.1 Announcements

  • Assignment 5 is due tonight

18.2 Splines review

  • Piecewise polynomial functions
    • pros of local approximation of polynomials without cons of global approximation of polynomials
  • maximal smoothness
  • PROS:
    • Flexibility
    • No need to specify functional
  • CONS:
    • Less interpretable
    • Computational cost
    • Flexibility must be selected by adjusting the smoothing parameter \(\lambda\)

We can represent the data with the equation \[y_i = \beta_0 + f(x_i) + \varepsilon_i, \\ \varepsilon_i \sim N(0, \sigma^2),\] where \(f(\cdot)\) is estimated with the function \(g(\cdot)\) that minimizes \(||\mathbf{y}-\mathbf{g}||^2 + \lambda J_{md}(g)\). Reember that \(\mathbf{g} = [g(\mathbf{x}_1), g(\mathbf{x}_2), \dots, g(\mathbf{x}_n)]'\) \(J_{md}(g)\) is a penalty funcitonal that regulates the wiggliness of \(g(\cdot)\) and \(\lambda\) controls the smoothness of \(g(\cdot)\).

Importance of basis dimension

  • Impact on flexibility of the model
  • Checking if basis dimension is too small (residuals vs. neighbours)
  • Comparing effective degrees of freedom with basis dimension
  • Implemented in gam.check
    • k' gives the maximum possible EDF for that smooth
    • compare k' with EDF in a model

18.3 Applications of splines

library(tidyverse) # data wrangling 
library(ggpubr) # data viz
library(lme4) # fit LMMs (only supports simple random effects - sompound symmetry)
library(mgcv) # fit GAMs (supports both random effects and splines)

18.3.0.1 Spatial effects

Traditional approach

The data below come from a multi-environment experiment that studies yield of lupin for different genotypes.

dat <- drop_na(agridat::verbyla.lupin)

dat %>% 
  ggplot(aes(col, row))+
  geom_tile(aes(fill = yield))+
  facet_grid(year~loc)+
  coord_fixed()+
  scico::scale_fill_scico()+
  theme_pubclean()

A (typically) reasonable model could be \[y_{ijk} \sim N(\mu_{ijk}, \sigma^2),\\ \mu_{ijk} = \mu_0 + T_i + l_j + b_{k(j)}, \\ l_j \sim N(0, \sigma^2_l), \\ b_{k(j)} \sim N(0, \sigma^2_b),\] where:

  • \(y_{ijk}\) is the observed yield and arises from a normal distribution with mean \(\mu_{ijk}\) and variance \(\sigma^2\),
  • \(\mu_0\) is the overall mean yield for all fields,
  • \(l_j\) is the (random) effect of the \(j\)th field that arises from a normal distribution with variance \(\sigma^2_l\) (note that \(\mu_0 + l_j\) is the overall yield for the \(j\)th field),
  • \(b_{k(j)}\) is the (random) effect of the \(k\)th block in the \(j\)th field that arises from a normal distribution with variance \(\sigma^2_b\),
  • \(T_i\) is the treatment effect of the \(i\)th genotype.
tictoc::tic()
m1 <- lmer(yield ~ gen + (1|site/rep), data = dat)
tictoc::toc()
## 0.17 sec elapsed
VarCorr(m1)
##  Groups   Name        Std.Dev.
##  rep:site (Intercept) 0.10140 
##  site     (Intercept) 0.27897 
##  Residual             0.30284
dat$resid_1 <- resid(m1)
dat %>% 
  ggplot(aes(col, row))+
  geom_tile(aes(fill = resid_1))+
  facet_grid(~site)+
  coord_fixed()+
  scico::scale_fill_scico()+
  labs(main = "Residuals assuming within-block homogeneity")+
  theme_pubclean()
## Ignoring unknown labels:
## • main : "Residuals assuming within-block homogeneity"

Relaxing an assumption from the traditional approach

Alternatively, we could add an element to the linear predictor. Previously, we said \(\mu_{ijk} = \mu_0 + T_i + l_j + b_{k(j)}\).

Potential risks of assuming within-block homogeneity:

  • loss of power

Alternative: expand the linear predictor into \(\mu_{ijk} = \mu_0 + T_i + l_j + b_{k(j)} + f(\mathbf{s}_{ijk})\), where \(\mathbf{s}_{ijk}\) is a vector of the coordinates of the observation \(ijk\), and \(f(\cdot)\) is a smooth function of space. We consider \(f(\cdot)\) can be approximated using smoothing splines.

tictoc::tic()
m2 <- gam(yield ~ gen + s(rep, site, bs = "re") + s(row, col, bs = "tp", by = site), data = dat)
tictoc::toc()
## 10.76 sec elapsed
summary(m2)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## yield ~ gen + s(rep, site, bs = "re") + s(row, col, bs = "tp", 
##     by = site)
## 
## Parametric coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.89485    0.04871  18.372  < 2e-16 ***
## genGungurru  -0.02161    0.03232  -0.669 0.503922    
## genIllyarrie -0.12520    0.03255  -3.846 0.000126 ***
## genMerrit    -0.01378    0.03300  -0.417 0.676425    
## genMyallie   -0.07358    0.03716  -1.980 0.047935 *  
## genUnicrop   -0.18396    0.03238  -5.681 1.65e-08 ***
## genWarrah    -0.17843    0.03243  -5.501 4.54e-08 ***
## genYandee    -0.15374    0.03242  -4.742 2.35e-06 ***
## genYorrel    -0.09957    0.03254  -3.060 0.002256 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                       edf Ref.df      F  p-value    
## s(rep,site)        21.824 27.000 26.549  < 2e-16 ***
## s(row,col):siteS01  2.000  2.000  0.911 0.402579    
## s(row,col):siteS02 17.825 19.763  5.263  < 2e-16 ***
## s(row,col):siteS03 12.999 15.385  2.526 0.000966 ***
## s(row,col):siteS04  2.000  2.000  0.757 0.469094    
## s(row,col):siteS05  3.745  4.893  1.816 0.126446    
## s(row,col):siteS06  2.000  2.000  1.288 0.276202    
## s(row,col):siteS07  9.206 12.532  1.016 0.432038    
## s(row,col):siteS08 13.272 17.680  2.008 0.007607 ** 
## s(row,col):siteS09  2.000  2.000  3.518 0.029936 *  
## s(row,col):siteS10  2.000  2.000  2.655 0.070672 .  
## s(row,col):siteS11  6.062  8.237  0.901 0.519681    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.536   Deviance explained =   57%
## GCV = 0.087403  Scale est. = 0.080877  n = 1392
dat$resid_2 <- resid(m2)
dat %>% 
  ggplot(aes(col, row))+
  geom_tile(aes(fill = resid_2))+
  facet_grid(~site)+
  coord_fixed()+
  labs(main = "Residuals relaxing the assumption of within-block homogeneity")+
  scico::scale_fill_scico()+
  theme_pubclean()
## Ignoring unknown labels:
## • main : "Residuals relaxing the assumption of within-block
##   homogeneity"

An intuitive visualization of splines

d_predict <- dat %>% 
  transmute(gen = "Merrit", 
            row, col, site, rep)
d_predict$yhat_1 <- predict(m1, newdata = d_predict)
d_predict$yhat_2 <- predict(m2, newdata = d_predict)
sp_effects_trad <- 
  d_predict %>% 
  ggplot(aes(col, row))+
  geom_tile(aes(fill = yhat_1))+
  facet_grid(~site)+
  coord_fixed()+
  scico::scale_fill_scico(palette = "bamako")+
  labs(fill = expression(Yield~(ton~ha^{-1})))+
  theme_pubclean()

sp_effects_splines <- 
  d_predict %>% 
  ggplot(aes(col, row))+
  geom_tile(aes(fill = yhat_2))+
  facet_grid(~site)+
  coord_fixed()+
  scico::scale_fill_scico(palette = "bamako")+
  labs(fill = expression(Yield~(ton~ha^{-1})))+
  theme_pubclean()

ggarrange(sp_effects_trad, sp_effects_splines, 
          common.legend = T,
          ncol = 1)

  • Some thoughts on missing data imputation.
  • Use of multilevel models for missing data imputation
  • Use of GAM models for missing data imputation

18.3.0.2 Modeling time

The data below come from an experiment aiming to test whether a diet treatment in cows can reduce the (GH) gas emmission.

The experiment consists of 4 groups of cows that were hosted turns in the same facility. Each group contained similar cows and they were randomly divided into 2 subgroups. Those subgroups were assigned to a corral and each corral was assigned with a diet treatment.

dat <- read.csv("../../data_confid/fluxes.csv")

dat %>% 
  ggplot(aes(doy, GHG_flux))+
  geom_point(aes(fill = diet), shape= 21)+
  scale_fill_manual(values = c("#FFBC42", "#218380"))+
  labs(fill  ="Diet", 
       y = expression(gas~flux~cow^{-1}))+
  theme_pubclean()

We can generally describe any point as \[y_{ijk} \sim D(\mu_{ijk}, \psi), \\ g(\mu_{ijk}) = \eta_{ijk}\] where \(y_{ijk}\) is the observation of the flux per head for the \(i\)th diet, \(j\)th group and \(l\)th moment.

  • Discussion of predictor \(\eta_{ijk}\)

We can, for example, use \[\eta_{ijk} = \mu_0 + D_i + f(t_{ij}) + b_{k} + u_{ik},\]

where

  • \(\mu_0\) is the overall mean,
  • \(D_i\) is the effect of the \(i\)th diet,
  • \(f(t_{ij})\) is the function of time,
  • \(b_{k}\) is the effect of the \(k\)th group.
dat$diet_group <- factor(paste(dat$diet, dat$group))
m1 <- gam(GHG_flux ~ diet + 
            s(doy, bs = "tp", k =35) + 
            s(group, bs = "re") +
            s(diet_group, bs = "re") , 
          data = dat,
          family = Gamma(link = "log"))

dat_plot <- expand.grid(doy = seq(min(dat$doy), max(dat$doy), length.out = 1000), 
                        diet = unique(dat$diet)) %>% 
  mutate(group = case_when(doy <35 ~ 1, 
                           between(doy, 37, 61) ~ 2, 
                           between(doy, 61, 71.4) ~ 3, 
                           between(doy, 71.4, 117.8) ~ 4, 
                           doy>117.8 ~ 5, 
                           TRUE ~ 6)) %>% 
  filter(group %in% unique(dat$group)) %>% 
  mutate(across(c( diet), ~as.factor(.)), 
         diet_group = factor(paste(diet, group)))

pred_plot <- predict(m1, newdata = dat_plot, type = "response", se = T)

dat_plot$pred_m1 <- pred_plot$fit
dat_plot$se_m1 <- pred_plot$se.fit

dat_plot %>% 
  ggplot(aes(doy, pred_m1))+
  geom_point(aes(doy, GHG_flux, color = diet), data = dat, alpha = .3)+
  geom_ribbon(aes(group = paste(diet, group),
                  fill = diet,
                  ymin = pred_m1 - (se_m1*1.96) , ymax = pred_m1 +(se_m1*1.96)), 
              alpha = .3)+
  geom_line(aes(group = paste(diet, group), color = diet))+
  scale_fill_manual(values = c("#FFBC42", "#218380"))+
  scale_color_manual(values = c("#FFBC42", "#218380"))+
  coord_cartesian(ylim = c(0, 800))+
  theme_pubclean()+
  labs(fill  ="Diet", color = "Diet",
       y = expression(gas~flux~cow^{-1}))

summary(m1)
## 
## Family: Gamma 
## Link function: log 
## 
## Formula:
## GHG_flux ~ diet + s(doy, bs = "tp", k = 35) + s(group, bs = "re") + 
##     s(diet_group, bs = "re")
## 
## Parametric coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    5.43170    0.03345 162.388  < 2e-16 ***
## dietTreatment -0.13537    0.04848  -2.792  0.00534 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                     edf Ref.df      F p-value    
## s(doy)        3.217e+01  33.67 22.911 < 2e-16 ***
## s(group)      1.245e-05   1.00  0.000 0.52837    
## s(diet_group) 2.212e+00   6.00  1.833 0.00177 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.585   Deviance explained =   58%
## GCV = 0.14934  Scale est. = 0.12529   n = 1000