Day 18 Applications of Smoothing Splines

18.1 Announcements

Assignment 5 is due tonight

18.2 Splines review

Piecewise polynomial functions
- pros of local approximation of polynomials without cons of global approximation of polynomials
maximal smoothness
PROS:
- Flexibility
- No need to specify functional
CONS:
- Less interpretable
- Computational cost
- Flexibility must be selected by adjusting the smoothing parameter \(\lambda\)

We can represent the data with the equation \[y_i = \beta_0 + f(x_i) + \varepsilon_i, \\ \varepsilon_i \sim N(0, \sigma^2),\] where \(f(\cdot)\) is estimated with the function \(g(\cdot)\) that minimizes \(||\mathbf{y}-\mathbf{g}||^2 + \lambda J_{md}(g)\). Reember that \(\mathbf{g} = [g(\mathbf{x}_1), g(\mathbf{x}_2), \dots, g(\mathbf{x}_n)]'\) \(J_{md}(g)\) is a penalty funcitonal that regulates the wiggliness of \(g(\cdot)\) and \(\lambda\) controls the smoothness of \(g(\cdot)\).

Importance of basis dimension

Impact on flexibility of the model
Checking if basis dimension is too small (residuals vs. neighbours)
Comparing effective degrees of freedom with basis dimension
Implemented in gam.check
- k' gives the maximum possible EDF for that smooth
- compare k' with EDF in a model

18.3 Applications of splines

library(tidyverse) # data wrangling 
library(ggpubr) # data viz
library(lme4) # fit LMMs (only supports simple random effects - sompound symmetry)
library(mgcv) # fit GAMs (supports both random effects and splines)

18.3.0.1 Spatial effects

Traditional approach

The data below come from a multi-environment experiment that studies yield of lupin for different genotypes.

dat <- drop_na(agridat::verbyla.lupin)

dat %>% 
  ggplot(aes(col, row))+
  geom_tile(aes(fill = yield))+
  facet_grid(year~loc)+
  coord_fixed()+
  scico::scale_fill_scico()+
  theme_pubclean()

A (typically) reasonable model could be \[y_{ijk} \sim N(\mu_{ijk}, \sigma^2),\\ \mu_{ijk} = \mu_0 + T_i + l_j + b_{k(j)}, \\ l_j \sim N(0, \sigma^2_l), \\ b_{k(j)} \sim N(0, \sigma^2_b),\] where:

\(y_{ijk}\) is the observed yield and arises from a normal distribution with mean \(\mu_{ijk}\) and variance \(\sigma^2\),
\(\mu_0\) is the overall mean yield for all fields,
\(l_j\) is the (random) effect of the \(j\)th field that arises from a normal distribution with variance \(\sigma^2_l\) (note that \(\mu_0 + l_j\) is the overall yield for the \(j\)th field),
\(b_{k(j)}\) is the (random) effect of the \(k\)th block in the \(j\)th field that arises from a normal distribution with variance \(\sigma^2_b\),
\(T_i\) is the treatment effect of the \(i\)th genotype.

tictoc::tic()
m1 <- lmer(yield ~ gen + (1|site/rep), data = dat)
tictoc::toc()

## 0.17 sec elapsed

VarCorr(m1)

##  Groups   Name        Std.Dev.
##  rep:site (Intercept) 0.10140 
##  site     (Intercept) 0.27897 
##  Residual             0.30284

dat$resid_1 <- resid(m1)

dat %>% 
  ggplot(aes(col, row))+
  geom_tile(aes(fill = resid_1))+
  facet_grid(~site)+
  coord_fixed()+
  scico::scale_fill_scico()+
  labs(main = "Residuals assuming within-block homogeneity")+
  theme_pubclean()

## Ignoring unknown labels:
## • main : "Residuals assuming within-block homogeneity"

Relaxing an assumption from the traditional approach

Alternatively, we could add an element to the linear predictor. Previously, we said \(\mu_{ijk} = \mu_0 + T_i + l_j + b_{k(j)}\).

Potential risks of assuming within-block homogeneity:

loss of power

Alternative: expand the linear predictor into \(\mu_{ijk} = \mu_0 + T_i + l_j + b_{k(j)} + f(\mathbf{s}_{ijk})\), where \(\mathbf{s}_{ijk}\) is a vector of the coordinates of the observation \(ijk\), and \(f(\cdot)\) is a smooth function of space. We consider \(f(\cdot)\) can be approximated using smoothing splines.

tictoc::tic()
m2 <- gam(yield ~ gen + s(rep, site, bs = "re") + s(row, col, bs = "tp", by = site), data = dat)
tictoc::toc()

## 10.76 sec elapsed

summary(m2)

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## yield ~ gen + s(rep, site, bs = "re") + s(row, col, bs = "tp", 
##     by = site)
## 
## Parametric coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.89485    0.04871  18.372  < 2e-16 ***
## genGungurru  -0.02161    0.03232  -0.669 0.503922    
## genIllyarrie -0.12520    0.03255  -3.846 0.000126 ***
## genMerrit    -0.01378    0.03300  -0.417 0.676425    
## genMyallie   -0.07358    0.03716  -1.980 0.047935 *  
## genUnicrop   -0.18396    0.03238  -5.681 1.65e-08 ***
## genWarrah    -0.17843    0.03243  -5.501 4.54e-08 ***
## genYandee    -0.15374    0.03242  -4.742 2.35e-06 ***
## genYorrel    -0.09957    0.03254  -3.060 0.002256 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                       edf Ref.df      F  p-value    
## s(rep,site)        21.824 27.000 26.549  < 2e-16 ***
## s(row,col):siteS01  2.000  2.000  0.911 0.402579    
## s(row,col):siteS02 17.825 19.763  5.263  < 2e-16 ***
## s(row,col):siteS03 12.999 15.385  2.526 0.000966 ***
## s(row,col):siteS04  2.000  2.000  0.757 0.469094    
## s(row,col):siteS05  3.745  4.893  1.816 0.126446    
## s(row,col):siteS06  2.000  2.000  1.288 0.276202    
## s(row,col):siteS07  9.206 12.532  1.016 0.432038    
## s(row,col):siteS08 13.272 17.680  2.008 0.007607 ** 
## s(row,col):siteS09  2.000  2.000  3.518 0.029936 *  
## s(row,col):siteS10  2.000  2.000  2.655 0.070672 .  
## s(row,col):siteS11  6.062  8.237  0.901 0.519681    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.536   Deviance explained =   57%
## GCV = 0.087403  Scale est. = 0.080877  n = 1392

dat$resid_2 <- resid(m2)

dat %>% 
  ggplot(aes(col, row))+
  geom_tile(aes(fill = resid_2))+
  facet_grid(~site)+
  coord_fixed()+
  labs(main = "Residuals relaxing the assumption of within-block homogeneity")+
  scico::scale_fill_scico()+
  theme_pubclean()

## Ignoring unknown labels:
## • main : "Residuals relaxing the assumption of within-block
##   homogeneity"

An intuitive visualization of splines

d_predict <- dat %>% 
  transmute(gen = "Merrit", 
            row, col, site, rep)
d_predict$yhat_1 <- predict(m1, newdata = d_predict)
d_predict$yhat_2 <- predict(m2, newdata = d_predict)

sp_effects_trad <- 
  d_predict %>% 
  ggplot(aes(col, row))+
  geom_tile(aes(fill = yhat_1))+
  facet_grid(~site)+
  coord_fixed()+
  scico::scale_fill_scico(palette = "bamako")+
  labs(fill = expression(Yield~(ton~ha^{-1})))+
  theme_pubclean()

sp_effects_splines <- 
  d_predict %>% 
  ggplot(aes(col, row))+
  geom_tile(aes(fill = yhat_2))+
  facet_grid(~site)+
  coord_fixed()+
  scico::scale_fill_scico(palette = "bamako")+
  labs(fill = expression(Yield~(ton~ha^{-1})))+
  theme_pubclean()

ggarrange(sp_effects_trad, sp_effects_splines, 
          common.legend = T,
          ncol = 1)

Some thoughts on missing data imputation.
Use of multilevel models for missing data imputation
Use of GAM models for missing data imputation

18.3.0.2 Modeling time

The data below come from an experiment aiming to test whether a diet treatment in cows can reduce the (GH) gas emmission.

The experiment consists of 4 groups of cows that were hosted turns in the same facility. Each group contained similar cows and they were randomly divided into 2 subgroups. Those subgroups were assigned to a corral and each corral was assigned with a diet treatment.

dat <- read.csv("../../data_confid/fluxes.csv")

dat %>% 
  ggplot(aes(doy, GHG_flux))+
  geom_point(aes(fill = diet), shape= 21)+
  scale_fill_manual(values = c("#FFBC42", "#218380"))+
  labs(fill  ="Diet", 
       y = expression(gas~flux~cow^{-1}))+
  theme_pubclean()

We can generally describe any point as \[y_{ijk} \sim D(\mu_{ijk}, \psi), \\ g(\mu_{ijk}) = \eta_{ijk}\] where \(y_{ijk}\) is the observation of the flux per head for the \(i\)th diet, \(j\)th group and \(l\)th moment.

Discussion of predictor \(\eta_{ijk}\)

We can, for example, use \[\eta_{ijk} = \mu_0 + D_i + f(t_{ij}) + b_{k} + u_{ik},\]

where

\(\mu_0\) is the overall mean,
\(D_i\) is the effect of the \(i\)th diet,
\(f(t_{ij})\) is the function of time,
\(b_{k}\) is the effect of the \(k\)th group.

dat$diet_group <- factor(paste(dat$diet, dat$group))
m1 <- gam(GHG_flux ~ diet + 
            s(doy, bs = "tp", k =35) + 
            s(group, bs = "re") +
            s(diet_group, bs = "re") , 
          data = dat,
          family = Gamma(link = "log"))

dat_plot <- expand.grid(doy = seq(min(dat$doy), max(dat$doy), length.out = 1000), 
                        diet = unique(dat$diet)) %>% 
  mutate(group = case_when(doy <35 ~ 1, 
                           between(doy, 37, 61) ~ 2, 
                           between(doy, 61, 71.4) ~ 3, 
                           between(doy, 71.4, 117.8) ~ 4, 
                           doy>117.8 ~ 5, 
                           TRUE ~ 6)) %>% 
  filter(group %in% unique(dat$group)) %>% 
  mutate(across(c( diet), ~as.factor(.)), 
         diet_group = factor(paste(diet, group)))

pred_plot <- predict(m1, newdata = dat_plot, type = "response", se = T)

dat_plot$pred_m1 <- pred_plot$fit
dat_plot$se_m1 <- pred_plot$se.fit

dat_plot %>% 
  ggplot(aes(doy, pred_m1))+
  geom_point(aes(doy, GHG_flux, color = diet), data = dat, alpha = .3)+
  geom_ribbon(aes(group = paste(diet, group),
                  fill = diet,
                  ymin = pred_m1 - (se_m1*1.96) , ymax = pred_m1 +(se_m1*1.96)), 
              alpha = .3)+
  geom_line(aes(group = paste(diet, group), color = diet))+
  scale_fill_manual(values = c("#FFBC42", "#218380"))+
  scale_color_manual(values = c("#FFBC42", "#218380"))+
  coord_cartesian(ylim = c(0, 800))+
  theme_pubclean()+
  labs(fill  ="Diet", color = "Diet",
       y = expression(gas~flux~cow^{-1}))

summary(m1)

## 
## Family: Gamma 
## Link function: log 
## 
## Formula:
## GHG_flux ~ diet + s(doy, bs = "tp", k = 35) + s(group, bs = "re") + 
##     s(diet_group, bs = "re")
## 
## Parametric coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    5.43170    0.03345 162.388  < 2e-16 ***
## dietTreatment -0.13537    0.04848  -2.792  0.00534 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                     edf Ref.df      F p-value    
## s(doy)        3.217e+01  33.67 22.911 < 2e-16 ***
## s(group)      1.245e-05   1.00  0.000 0.52837    
## s(diet_group) 2.212e+00   6.00  1.833 0.00177 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.585   Deviance explained =   58%
## GCV = 0.14934  Scale est. = 0.12529   n = 1000