# Multilevel Models (a.k.a. Hierarchical Models)

Bayesian hierarchical models (also called multilevel models) are a statistical model written at *multiple* levels (hierarchical form) that estimates the parameters of the posterior distribution using the Bayesian approach. The sub-models combine to form the hierarchical model, and Bayes' theorem is used to integrate them with the observed data and to account for all the uncertainty that is present. The result of this integration is the posterior distribution, also known as an updated probability estimate, as additional evidence of the likelihood function is integrated together with the prior distribution of the parameters.

Hierarchical modeling is used when information is available at several different levels of observation units. The hierarchical form of analysis and organization helps to understand multiparameter problems and also plays an important role in the development of computational strategies.

Hierarchical models are mathematical statements that involve several parameters, so that the estimates of some parameters depend significantly on the values of other parameters. The figure below shows a hierarchical model in which there is a \(\phi\) hyperparameter that parameterizes the parameters \(\theta_1, \theta_2, \dots, \theta_N\) that are finally used to infer the posterior density of some variable of interest \(\mathbf{y} = y_1, y_2, \dots, y_N\).

*Hierarchical Model*

## When to use Multilevel Models?

Multilevel models are particularly suitable for research projects where participant data is organized at more than one level, *i.e.* nested data. Units of analysis are usually individuals (at a lower level) that are nested in contextual/aggregate units (at a higher level). An example is when we are measuring the performance of individuals and we have additional information about belonging to different groups such as sex, age group, hierarchical level, educational level or housing status.

There is a main assumption that cannot be violated in multilevel models which is **exchangeability** (de Finetti, 1974; Nau, 2001). Yes, this is the same assumption that we discussed in 2. **What is Bayesian Statistics?**. This assumption assumes that groups are exchangeable. The figure below shows a graphical representation of the exchangeability. The groups shown as "cups" that contain observations shown as "balls". If in the model's inferences, this assumption is violated, then multilevel models are not appropriate. This means that, since there is no theoretical justification to support exchangeability, the inferences of the multilevel model are not robust and the model can suffer from several pathologies and should not be used for any scientific or applied analysis.

*Exchangeability – Images from Michael Betancourt*

## Hyperpriors

As the priors of the parameters are sampled from another prior of the hyperparameter (upper-level's parameter), which are called **hyperpriors**. This makes one group's estimates help the model to better estimate the other groups by providing more **robust and stable estimates**.

We call the global parameters as **population effects** (or population-level effects, also sometimes called fixed effects) and the parameters of each group as **group effects** (or group-level effects, also sometimes called random effects). That is why multilevel models are also known as mixed models in which we have both fixed effects and random effects.

## Three Approaches to Multilevel Models

Multilevel models generally fall into three approaches:

**Random-intercept model**: each group receives a**different intercept**in addition to the global intercept.**Random-slope model**: each group receives**different coefficients**for each (or a subset of) independent variable(s) in addition to a global intercept.**Random-intercept-slope model**: each group receives**both a different intercept and different coefficients**for each independent variable in addition to a global intercept.

### Random-Intercept Model

The first approach is the **random-intercept model** in which we specify a different intercept for each group, in addition to the global intercept. These group-level intercepts are sampled from a hyperprior.

To illustrate a multilevel model, I will use the linear regression example with a Gaussian/normal likelihood function. Mathematically a Bayesian multilevel random-slope linear regression model is:

\[ \begin{aligned} \mathbf{y} &\sim \text{Normal}\left( \alpha + \alpha_j + \mathbf{X} \cdot \boldsymbol{\beta}, \sigma \right) \\ \alpha &\sim \text{Normal}(\mu_\alpha, \sigma_\alpha) \\ \alpha_j &\sim \text{Normal}(0, \tau) \\ \boldsymbol{\beta} &\sim \text{Normal}(\mu_{\boldsymbol{\beta}}, \sigma_{\boldsymbol{\beta}}) \\ \tau &\sim \text{Cauchy}^+(0, \psi_{\alpha})\\ \sigma &\sim \text{Exponential}(\lambda_\sigma) \end{aligned} \]The priors on the global intercept \(\alpha\), global coefficients \(\boldsymbol{\beta}\) and error \(\sigma\), along with the Gaussian/normal likelihood on \(\mathbf{y}\) are the same as in the linear regression model. But now we have **new parameters**. The first are the **group intercepts** prior \(\alpha_j\) that denotes that every group \(1, 2, \dots, J\) has its own intercept sampled from a normal distribution centered on 0 with a standard deviation \(\psi_\alpha\). This group intercept is added to the linear predictor inside the Gaussian/normal likelihood function. The **group intercepts' standard deviation** \(\tau\) have a hyperprior (being a prior of a prior) which is sampled from a positive-constrained Cauchy distribution (a special case of the Student-\(t\) distribution with degrees of freedom \(\nu = 1\)) with mean 0 and standard deviation \(\sigma_\alpha\). This makes the group-level intercept's dispersions being sampled from the same parameter \(\tau\) which allows the model to use information from one group intercept to infer robust information regarding another group's intercept dispersion and so on.

This is easily accomplished with Turing:

```
using Turing
using Statistics: mean, std
using Random:seed!
seed!(123)
@model varying_intercept(X, idx, y; n_gr=length(unique(idx)), predictors=size(X, 2)) = begin
#priors
α ~ Normal(mean(y), 2.5 * std(y)) # population-level intercept
β ~ filldist(Normal(0, 2), predictors) # population-level coefficients
σ ~ Exponential(1 / std(y)) # residual SD
#prior for variance of random intercepts
#usually requires thoughtful specification
τ ~ truncated(Cauchy(0, 2), 0, Inf) # group-level SDs intercepts
αⱼ ~ filldist(Normal(0, τ), n_gr) # group-level intercepts
#likelihood
ŷ = α .+ X * β .+ αⱼ[idx]
y ~ MvNormal(ŷ, σ)
end;
```

### Random-Slope Model

The second approach is the **random-slope model** in which we specify a different slope for each group, in addition to the global intercept. These group-level slopes are sampled from a hyperprior.

To illustrate a multilevel model, I will use the linear regression example with a Gaussian/normal likelihood function. Mathematically a Bayesian multilevel random-slope linear regression model is:

\[ \begin{aligned} \mathbf{y} &\sim \text{Normal}\left( \alpha + \mathbf{X} \cdot \boldsymbol{\beta}_j \cdot \boldsymbol{\tau}, \sigma \right) \\ \alpha &\sim \text{Normal}(\mu_\alpha, \sigma_\alpha) \\ \boldsymbol{\beta}_j &\sim \text{Normal}(0, 1) \\ \boldsymbol{\tau} &\sim \text{Cauchy}^+(0, \psi_{\boldsymbol{\beta}})\\ \sigma &\sim \text{Exponential}(\lambda_\sigma) \end{aligned} \]Here we have a similar situation from before with the same hyperprior, but now it is a hyperprior for the the group coefficients' standard deviation prior: \(\boldsymbol{\beta}_j\). This makes the group-level coefficients's dispersions being sampled from the same parameter \(\tau\) which allows the model to use information from one group coefficients to infer robust information regarding another group's coefficients dispersion and so on.

In Turing we can accomplish this as:

```
@model varying_slope(X, idx, y; n_gr=length(unique(idx)), predictors=size(X, 2)) = begin
#priors
α ~ Normal(mean(y), 2.5 * std(y)) # population-level intercept
σ ~ Exponential(1 / std(y)) # residual SD
#prior for variance of random slopes
#usually requires thoughtful specification
τ ~ filldist(truncated(Cauchy(0, 2), 0, Inf), n_gr) # group-level slopes SDs
βⱼ ~ filldist(Normal(0, 1), predictors, n_gr) # group-level standard normal slopes
#likelihood
ŷ = α .+ X * βⱼ * τ
y ~ MvNormal(ŷ, σ)
end;
```

### Random-Intercept-Slope Model

The third approach is the **random-intercept-slope model** in which we specify a different intercept and slope for each group, in addition to the global intercept. These group-level intercepts and slopes are sampled from hyperpriors.

To illustrate a multilevel model, I will use the linear regression example with a Gaussian/normal likelihood function. Mathematically a Bayesian multilevel random-intercept-slope linear regression model is:

\[ \begin{aligned} \mathbf{y} &\sim \text{Normal}\left( \alpha + \alpha_j + \mathbf{X} \cdot \boldsymbol{\beta}_j \cdot \boldsymbol{\tau}_{\boldsymbol{\beta}}, \sigma \right) \\ \alpha &\sim \text{Normal}(\mu_\alpha, \sigma_\alpha) \\ \alpha_j &\sim \text{Normal}(0, \tau_{\alpha}) \\ \boldsymbol{\beta}_j &\sim \text{Normal}(0, 1) \\ \tau_{\alpha} &\sim \text{Cauchy}^+(0, \psi_{\alpha})\\ \boldsymbol{\tau}_{\boldsymbol{\beta}} &\sim \text{Cauchy}^+(0, \psi_{\boldsymbol{\beta}})\\ \sigma &\sim \text{Exponential}(\lambda_\sigma) \end{aligned} \]Here we have a similar situation from before with the same hyperpriors, but now we fused both random-intercept and random-slope together.

In Turing we can accomplish this as:

```
@model varying_intercept_slope(X, idx, y; n_gr=length(unique(idx)), predictors=size(X, 2)) = begin
#priors
α ~ Normal(mean(y), 2.5 * std(y)) # population-level intercept
σ ~ Exponential(1 / std(y)) # residual SD
#prior for variance of random intercepts and slopes
#usually requires thoughtful specification
τₐ ~ truncated(Cauchy(0, 2), 0, Inf) # group-level SDs intercepts
τᵦ ~ filldist(truncated(Cauchy(0, 2), 0, Inf), n_gr) # group-level slopes SDs
αⱼ ~ filldist(Normal(0, τₐ), n_gr) # group-level intercepts
βⱼ ~ filldist(Normal(0, 1), predictors, n_gr) # group-level standard normal slopes
#likelihood
ŷ = α .+ αⱼ[idx] .+ X * βⱼ * τᵦ
y ~ MvNormal(ŷ, σ)
end;
```

## Example - Cheese Ratings

For our example, I will use a famous dataset called `cheese`

(Boatwright, McCulloch & Rossi, 1999), which is data from cheese ratings. A group of 10 rural and 10 urban raters rated 4 types of different cheeses (A, B, C and D) in two samples. So we have \(4 \cdot 20 \cdot2 = 160\) observations and 4 variables:

`cheese`

: type of cheese from`A`

to`D`

`rater`

: id of the rater from`1`

to`10`

`background`

: type of rater, either`rural`

or`urban`

`y`

: rating of the cheese

Ok let's read our data with `CSV.jl`

and output into a `DataFrame`

from `DataFrames.jl`

:

```
using DataFrames, CSV, HTTP
url = "https://raw.githubusercontent.com/storopoli/Bayesian-Julia/master/datasets/cheese.csv"
cheese = CSV.read(HTTP.get(url).body, DataFrame)
describe(cheese)
```

```
4×7 DataFrame
Row │ variable mean min median max nmissing eltype
│ Symbol Union… Any Union… Any Int64 DataType
─────┼───────────────────────────────────────────────────────────────
1 │ cheese A D 0 String1
2 │ rater 5.5 1 5.5 10 0 Int64
3 │ background rural urban 0 String7
4 │ y 70.8438 33 71.5 91 0 Int64
```

As you can see from the `describe()`

output, the mean cheese ratings is around 70 ranging from 33 to 91.

In order to prepare the data for Turing, I will convert the `String`

s in variables `cheese`

and `background`

to `Int`

s. Regarding `cheese`

, I will create 4 dummy variables one for each cheese type; and `background`

will be converted to integer data taking two values: one for each background type. My intent is to model `background`

as a group both for intercept and coefficients. Take a look at how the data will look like for the first 5 observations:

```
for c in unique(cheese[:, :cheese])
cheese[:, "cheese_$c"] = ifelse.(cheese[:, :cheese] .== c, 1, 0)
end
cheese[:, :background_int] = map(cheese[:, :background]) do b
b == "rural" ? 1 :
b == "urban" ? 2 : missing
end
first(cheese, 5)
```

```
5×9 DataFrame
Row │ cheese rater background y cheese_A cheese_B cheese_C cheese_D background_int
│ String1 Int64 String7 Int64 Int64 Int64 Int64 Int64 Int64
─────┼───────────────────────────────────────────────────────────────────────────────────────────
1 │ A 1 rural 67 1 0 0 0 1
2 │ A 1 rural 66 1 0 0 0 1
3 │ B 1 rural 51 0 1 0 0 1
4 │ B 1 rural 53 0 1 0 0 1
5 │ C 1 rural 75 0 0 1 0 1
```

Now let's us instantiate our model with the data. Here, I will specify a vector of `Int`

s named `idx`

to represent the different observations' group memberships. This will be used by Turing when we index a parameter with the `idx`

, *e.g.* `αⱼ[idx]`

.

```
X = Matrix(select(cheese, Between(:cheese_A, :cheese_D)));
y = cheese[:, :y];
idx = cheese[:, :background_int];
```

The first model is the `varying_intercept`

:

```
model_intercept = varying_intercept(X, idx, y)
chain_intercept = sample(model_intercept, NUTS(), MCMCThreads(), 2_000, 4)
summarystats(chain_intercept) |> DataFrame |> println
```

```
9×8 DataFrame
Row │ parameters mean std naive_se mcse ess rhat ess_per_sec
│ Symbol Float64 Float64 Float64 Float64 Float64 Float64 Float64
─────┼────────────────────────────────────────────────────────────────────────────────────────
1 │ α 70.7049 5.70978 0.0638372 0.164864 1353.74 1.00015 27.5033
2 │ β[1] 3.25087 1.25589 0.0140412 0.0210992 3448.24 0.999756 70.0563
3 │ β[2] -11.5816 1.24749 0.0139474 0.0202346 3511.89 0.999775 71.3494
4 │ β[3] 7.18857 1.23827 0.0138443 0.0209749 3381.35 0.99972 68.6972
5 │ β[4] 1.23144 1.25783 0.0140629 0.0212949 3141.82 0.999772 63.8308
6 │ σ 6.00683 0.274647 0.00307065 0.0042206 4766.07 1.0003 96.83
7 │ τ 6.55056 6.34343 0.0709217 0.161918 1757.96 1.00416 35.7156
8 │ αⱼ[1] -3.45676 5.64177 0.0630769 0.165614 1335.34 1.00019 27.1295
9 │ αⱼ[2] 3.69883 5.64633 0.0631279 0.16462 1334.24 1.0002 27.1071
```

Here we can see that the model has a population-level intercept `α`

along with population-level coefficients `β`

s for each `cheese`

dummy variable. But notice that we have also group-level intercepts for each of the groups `αⱼ`

s. Specifically, `αⱼ[1]`

are the rural raters and `αⱼ[2]`

are the urban raters.

Now let's go to the second model, `varying_slope`

:

```
model_slope = varying_slope(X, idx, y)
chain_slope = sample(model_slope, NUTS(), MCMCThreads(), 2_000, 4)
summarystats(chain_slope) |> DataFrame |> println
```

```
12×8 DataFrame
Row │ parameters mean std naive_se mcse ess rhat ess_per_sec
│ Symbol Float64 Float64 Float64 Float64 Float64 Float64 Float64
─────┼──────────────────────────────────────────────────────────────────────────────────────────
1 │ α 74.4228 9.35066 0.104544 0.835716 22.6265 1.69451 0.169834
2 │ σ 6.57412 0.27803 0.00310847 0.0100486 153.598 1.05749 1.1529
3 │ τ[1] 7.91347 6.42468 0.0718301 0.45446 36.0057 1.31823 0.270258
4 │ τ[2] 6.99387 5.31401 0.0594125 0.260981 132.606 1.04219 0.995337
5 │ βⱼ[1,1] 0.179605 0.769462 0.00860285 0.0283767 164.827 1.05396 1.23719
6 │ βⱼ[2,1] -1.01473 1.00317 0.0112158 0.0357563 302.922 1.02735 2.27373
7 │ βⱼ[3,1] 0.268153 1.0751 0.01202 0.0734929 42.1253 1.22108 0.316192
8 │ βⱼ[4,1] -0.219553 1.04667 0.0117021 0.0788742 32.7925 1.33597 0.24614
9 │ βⱼ[1,2] 0.0524376 0.885558 0.00990083 0.0521062 59.0526 1.14982 0.443248
10 │ βⱼ[2,2] -0.917036 0.961937 0.0107548 0.0242328 1785.24 1.00291 13.3999
11 │ βⱼ[3,2] 0.587982 0.830407 0.00928423 0.0191609 1876.36 1.00898 14.0839
12 │ βⱼ[4,2] 0.281483 0.864789 0.00966863 0.0475261 59.3133 1.1729 0.445205
```

Here we can see that the model has still a population-level intercept `α`

. But now our population-level coefficients `β`

s are replaced by group-level coefficients `βⱼ`

s along with their standard deviation `τᵦ`

s. Specifically `βⱼ`

's first index denotes the 4 dummy `cheese`

variables' and the second index are the group membership. So, for example `βⱼ[1,1]`

is the coefficient for `cheese_A`

and rural raters (group 1).

Now let's go to the third model, `varying_intercept_slope`

:

```
model_intercept_slope = varying_intercept_slope(X, idx, y)
chain_intercept_slope = sample(model_intercept_slope, NUTS(), MCMCThreads(), 2_000, 4)
summarystats(chain_intercept_slope) |> DataFrame |> println
```

```
15×8 DataFrame
Row │ parameters mean std naive_se mcse ess rhat ess_per_sec
│ Symbol Float64 Float64 Float64 Float64 Float64 Float64 Float64
─────┼──────────────────────────────────────────────────────────────────────────────────────────
1 │ α 71.7565 8.02456 0.0897173 0.367218 237.98 1.02027 0.92551
2 │ σ 5.87263 0.261976 0.00292898 0.00554319 1981.28 1.00224 7.70523
3 │ τₐ 6.38706 6.52443 0.0729454 0.165376 1822.61 1.00578 7.08817
4 │ τᵦ[1] 6.09447 5.37293 0.0600712 0.173534 716.495 1.0069 2.78647
5 │ τᵦ[2] 6.75202 6.16084 0.0688802 0.344854 134.199 1.03435 0.521901
6 │ αⱼ[1] -3.75526 5.52406 0.0617609 0.155873 1279.59 1.00054 4.97637
7 │ αⱼ[2] 3.42156 5.5182 0.0616954 0.155046 1301.3 1.00054 5.0608
8 │ βⱼ[1,1] 0.208397 0.809126 0.00904631 0.0192226 2199.62 1.00105 8.55439
9 │ βⱼ[2,1] -0.911891 1.0565 0.0118121 0.0272848 1513.56 1.0019 5.88628
10 │ βⱼ[3,1] 0.518425 0.877849 0.00981465 0.0213334 1869.26 1.00255 7.26958
11 │ βⱼ[4,1] -0.0271807 0.915114 0.0102313 0.0511887 103.484 1.03649 0.402451
12 │ βⱼ[1,2] 0.231053 0.824283 0.00921577 0.0211158 1357.62 1.00276 5.27981
13 │ βⱼ[2,2] -0.91239 1.04135 0.0116427 0.024434 2210.34 1.00319 8.59606
14 │ βⱼ[3,2] 0.521613 0.931685 0.0104166 0.0312735 582.521 1.00624 2.26544
15 │ βⱼ[4,2] 0.0614634 0.803627 0.00898482 0.0189511 2000.35 1.00192 7.77942
```

Now we have fused the previous model in one. We still have a population-level intercept `α`

. But now we have in the same model group-level intercepts for each of the groups `αⱼ`

s and group-level along with their standard deviation `τₐ`

. We also have the coefficients `βⱼ`

s with their standard deviation `τᵦ`

s. The parameters are interpreted exactly as the previous cases.

## References

Boatwright, P., McCulloch, R., & Rossi, P. (1999). Account-level modeling for trade promotion: An application of a constrained parameter hierarchical model. Journal of the American Statistical Association, 94(448), 1063–1073.

de Finetti, B. (1974). Theory of Probability (Volume 1). New York: John Wiley & Sons.

Nau, R. F. (2001). De Finetti was Right: Probability Does Not Exist. Theory and Decision, 51(2), 89–124. https://doi.org/10.1023/A:1015525808214