library(ggplot2)
library(lme4)
library(emmeans)
Mixed model simulation
I have started to believe that a good part of understanding the process of a particular statistical model involves being able to set up a simulation of it. Here we do a simulation of data expected from a fish gut microbiota pilot study that I was involved in
Let’s go
While we obtain abundance data on many different microbes at once (a community profile) in such studies, the first step is to understand what we can do with just one. This means this approach is applicable to each microbe separately and does not examine the data at community level.
In this simulation, there are 4 gut locations (mouth, foregut, midgut and hindgut) examined from 10 fish (A to J).
The model is:
\[Y_{abundance} = beta1_{gut.location} + sigma1_{fish} + sigma2_{unexplained}\]
Packages
Building the data set
There are 4 gut locations and 10 fish, thus there are 40 * 10 rows required.
<-
fish_data expand.grid(
gut_location = c("mouth", "foregut", "midgut","hindgut"),
fish_id = LETTERS[1:10]
|>
) ::as_tibble()
tibble
fish_data
# A tibble: 40 × 2
gut_location fish_id
<fct> <fct>
1 mouth A
2 foregut A
3 midgut A
4 hindgut A
5 mouth B
6 foregut B
7 midgut B
8 hindgut B
9 mouth C
10 foregut C
# ℹ 30 more rows
Define parameters
Let’s assume these proportion trends and that there are about 200 individuals. This gives us the deterministic part of the model.
<-
gut_location_proportion c("mouth" = 0.01,
"foregut" = 0.5,
"midgut" = 0.35,
"hindgut" = 0.10
)
<- 200 * gut_location_proportion
gut_location_means
gut_location_means
mouth foregut midgut hindgut
2 100 70 20
Next we define the variability of fish individuals and unexplained variation.
set.seed(1010)
<-
fish_variation rnorm(10, sd = 15) |>
round(digits = 0) |>
setNames(unique(fish_data$fish_id))
fish_variation
A B C D E F G H I J
2 -13 20 6 -4 20 29 -35 -9 -8
<-
residual_variation rnorm(n = nrow(fish_data), sd = 5) |>
round(0)
residual_variation
[1] -1 -4 -7 0 4 -4 -4 1 -5 -5 0 4 -5 2 -3 4 -2 -5 -1 -9 -5 10 -1 2 -5
[26] 2 -1 -2 -3 -5 -2 2 0 3 -3 1 2 -3 2 7
Simulating the counts
Then we use the above to create observed counts
set.seed(1010)
<-
fish_data_build |>
fish_data ::mutate(
dplyrlocation_count = gut_location_means[gut_location],
fish_variation = fish_variation[fish_id],
residual_variation = residual_variation,
observed_count = location_count + fish_variation + residual_variation,
observed_count = ifelse(observed_count < 0, 0, observed_count)
)
fish_data_build
# A tibble: 40 × 6
gut_location fish_id location_count fish_variation residual_variation
<fct> <fct> <dbl> <dbl> <dbl>
1 mouth A 2 2 -1
2 foregut A 100 2 -4
3 midgut A 70 2 -7
4 hindgut A 20 2 0
5 mouth B 2 -13 4
6 foregut B 100 -13 -4
7 midgut B 70 -13 -4
8 hindgut B 20 -13 1
9 mouth C 2 20 -5
10 foregut C 100 20 -5
# ℹ 30 more rows
# ℹ 1 more variable: observed_count <dbl>
And lastly, clean up the data
<-
fish_data_final |>
fish_data_build ::select(gut_location, fish_id, observed_count)
dplyr
fish_data_final
# A tibble: 40 × 3
gut_location fish_id observed_count
<fct> <fct> <dbl>
1 mouth A 3
2 foregut A 98
3 midgut A 65
4 hindgut A 22
5 mouth B 0
6 foregut B 83
7 midgut B 53
8 hindgut B 8
9 mouth C 17
10 foregut C 115
# ℹ 30 more rows
Data exploration
Counts by fish individuals
|>
fish_data_final ggplot(aes(x = gut_location, y = observed_count, group = fish_id)) +
geom_line()
In another way
|>
fish_data_final ggplot(aes(x = gut_location, y = observed_count, group = fish_id)) +
geom_line() +
geom_point() +
facet_wrap(vars(fish_id), nrow = 2)
Hand calculating parameters
|>
fish_data_final ::group_by(gut_location) |>
dplyr::summarise(
dplyrmean_count = mean(observed_count)
)
# A tibble: 4 × 2
gut_location mean_count
<fct> <dbl>
1 mouth 6.6
2 foregut 99.9
3 midgut 68.8
4 hindgut 23.1
|>
fish_data_final ::group_by(fish_id) |>
dplyr::summarise(
dplyrmean_count = mean(observed_count)
|>
) ::ungroup() |>
dplyr::summarise(
dplyrsd_count = sd(mean_count)
)
# A tibble: 1 × 1
sd_count
<dbl>
1 16.5
Modelling
We fit the model like so
<- lmer(observed_count ~ gut_location + (1|fish_id), data = fish_data_final)
fit1
plot(fit1, pch = 19)
And obtain a model summary
<- summary(fit1)
fit1_summary fit1_summary
Linear mixed model fit by REML ['lmerMod']
Formula: observed_count ~ gut_location + (1 | fish_id)
Data: fish_data_final
REML criterion at convergence: 279.4
Scaled residuals:
Min 1Q Median 3Q Max
-2.10601 -0.40453 0.02645 0.36144 2.65347
Random effects:
Groups Name Variance Std.Dev.
fish_id (Intercept) 259.93 16.122
Residual 48.95 6.997
Number of obs: 40, groups: fish_id, 10
Fixed effects:
Estimate Std. Error t value
(Intercept) 6.600 5.558 1.188
gut_locationforegut 93.300 3.129 29.818
gut_locationmidgut 62.200 3.129 19.879
gut_locationhindgut 16.500 3.129 5.273
Correlation of Fixed Effects:
(Intr) gt_lctnf gt_lctnm
gt_lctnfrgt -0.281
gt_lctnmdgt -0.281 0.500
gt_lctnhndg -0.281 0.500 0.500
And we obtain the values we used for the data simulation
$coefficients fit1_summary
Estimate Std. Error t value
(Intercept) 6.6 5.557727 1.187536
gut_locationforegut 93.3 3.128957 29.818245
gut_locationmidgut 62.2 3.128957 19.878830
gut_locationhindgut 16.5 3.128957 5.273323
$varcor fit1_summary
Groups Name Std.Dev.
fish_id (Intercept) 16.1224
Residual 6.9966
Lastly, a final plot …
<-
fit1_means ::emmeans(fit1, specs = 'gut_location') |>
emmeansas.data.frame()
Cannot use mode = "kenward-roger" because *pbkrtest* package is not installed
|>
fish_data_final ggplot(aes(x = gut_location, y = observed_count)) +
geom_jitter(height = 0, width = 0.15) +
geom_point(data = fit1_means,
mapping = aes(y = emmean),
color = 'red', size = 4, shape = 'cross') +
geom_line(data = fit1_means,
mapping = aes(y = emmean, group = 1),
color = 'red') +
geom_errorbar(data = fit1_means,
mapping = aes(y = NULL, ymax = upper.CL, ymin = lower.CL),
color = 'red',
width = 0.1)