Day 1 Welcome to STAT 870!

August 25th, 2025

1.1 About this course:

About me
About you:

Figure 1.1: Students of this class

In rounds: What’s your major and what do you expect to learn in 870?

1.1.1 Logistics

Website
Syllabus
Statistical programming requirements
Rough mindmap of the course (on whiteboard)
Semester project
Grades: A (100-89.999999999(!!!)), B (89.99-79.99), C (79.99-69.99), D (69.99-59.99), F (<59.99).
Attendance policies & kahoots

1.2 Learning goals

By the end of this course, you should be able to:

Identify the treatment design, experiment design, experimental unit and observational unit of simple and complex designed experiments.
Distinguish the benefits/disadvantages of different experiment designs.
Write the statistical model that corresponds to data generated by designed experiments.
Write the Materials and Methods section in a paper (or thesis) that describes the designed experiment.

1.3 Overview of the course

Figure 1.2: Mindmap

1.4 What is messy data anyways?

Complex dependence structures
Complex structure in covariates

1.5 Linear Models review

Perhaps the most common model of all time (default in most software) is

\[y_{i} = \mu_i + \varepsilon_i, \ \varepsilon_i \sim N(0, \sigma^2),\]

where:

\(y_{i}\) is the observed value for the \(i\)th observation,
\(\mu_i\) is the expected value for the \(i\)th observation,
\(\varepsilon_i\) is the residual (i.e., the difference between observed and expected). All residuals are iid normal.
This example uses the model equation form.

The model above can also be written usign the probability distribution form. The probability distribution form is much more flexible because it is compatible with other distributions beyond the normal. It goes like this:

\[y_{i} \sim N(\mu_i, \sigma^2),\] where the elements are the same as described above.

Likewise, we can use the vectorized notation of the probability distribution form, and say that

\[\mathbf{y} \sim N(\boldsymbol\mu, \sigma^2 \mathbf{I}),\] where:

\(\mathbf{y} \equiv [y_1, y_2, ..., y_n]'\) is the vector of observed values,
\(\boldsymbol\mu \equiv [\mu_1, \mu_2, ..., \mu_n]'\) is the vector of expected values,
\(\sigma^2 \mathbf{I}\) is the variance-covariance matrix. Note that, more generally, we can call the variance-covariance matrix \(\boldsymbol{\Sigma}\) or \(\mathbf{V}\).

Note that \[\sigma^2 \mathbf{I} = \sigma^2 \begin{bmatrix} 1 & 0 & 0 & \dots & 0 \\ 0 & 1 & 0 & \dots & 0 \\ 0 & 0 & 1 & \dots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \dots & 1 \end{bmatrix} = \begin{bmatrix} \sigma^2 & 0 & 0 & \dots & 0 \\ 0 & \sigma^2 & 0 & \dots & 0 \\ 0 & 0 & \sigma^2 & \dots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \dots & \sigma^2 \end{bmatrix}.\]

The assumptions behind this model are:

Normal distribution of the data
Constant variance
Independence
Linearity

In this course, we will mostly deal with cases where:

The assumption of independence does not hold (basic and complex designed experiments),
The assumption of normality does not make sense (e.g., data are counts, proportions, or stricly positive),
The assumption of constant variance does not hold/make sense (e.g., larger biomass is associated to larger variance of said biomass).

We will approach most problems with a 3-step approach (Chapter 2 in Stroup et al.):

What is the distribution of the data?
What is the link function?
What is the blueprint of the design/data?

Figure 1.3: Common variable distributions. Page 60 in Stroup et al. (2024)

1.6 On notation

scalars: \(y\), \(\sigma\), \(\beta_0\)
vectors: \(\mathbf{y} \equiv [y_1, y_2, ..., y_n]'\), \(\boldsymbol{\beta} \equiv [\beta_1, \beta_2, ..., \beta_p]'\), \(\boldsymbol{u}\)
matrices: \(\mathbf{X}\), \(\Sigma\)
probability distribution: \(y \sim N(0, \sigma^2)\), \(\mathbf{y} \sim N(\boldsymbol{0}, \sigma^2\mathbf{I})\).

1.7 Homework & Announcements

Install R and RStudio.
Submit Assignment 1 by next Monday.
Read Chapter 2 in Stroup et al. (2024)
Applied Linear Mixed Models Workshop - Sept 13 & 14

STAT 870 - Analysis of Messy Data