W. Zane Billings – Increase vs post titer as regression outcome

For a vaccine study with an immunogenicity endpoint, two commonly used outcomes are the raw post-vaccination titer, or the fold change in titers between the post-vaccination and pre-vaccination endpoint. Both provide interesting information. However, under many of the simple statistical models that are used for both outcomes, the model results are deterministically related, which I want to show here.

Generative model

First, we generate an example dataset. This dataset will have two columns, just pre-titer and post-titer with no other confounders. Here’s the population parameters. Throughout this example, I will assume all effects for the titer values occur on the log scale for simplicity. Though we could have a lot of arguments about the generative model, I specifically chose a very simple and easy to work with generative model that I think illustrates the point that I want to make.

sim_parms <- list(
    S <- 32482304,
    N <- 1000,
    alpha <- 3,
    beta <- 0.5,
    pre_mean <- 2,
    pre_var <- 2,
    error_var <- 2
)
str(sim_parms)

List of 7
 $ : num 32482304
 $ : num 1000
 $ : num 3
 $ : num 0.5
 $ : num 2
 $ : num 2
 $ : num 2

Now we generate the data. Even though HAI titers always have the \(\times 5\) in the formulas, I left that off here for simplicity. Since it’s a constant multiplier it doesn’t affect what’s going on here.

gen_data <- function(S, N, alpha, beta, pre_mean, pre_var, error_var) {
    set.seed(S)
    dat <- tibble::tibble(
        log_pre = rnorm(N, pre_mean, pre_var),
        pre = (2 ^ log_pre),
        noise = rnorm(N, 0, error_var),
        log_post = alpha + beta * log_pre + noise,
        post = (2 ^ log_post),
        TI = log_post - log_pre,
        FC = 2 ^ TI
    )
    return(dat)
}
sim_dat <- do.call(gen_data, sim_parms)
print(sim_dat, n = 5)

# A tibble: 1,000 × 7
  log_pre     pre  noise log_post  post    TI     FC
    <dbl>   <dbl>  <dbl>    <dbl> <dbl> <dbl>  <dbl>
1   6.95  124.    -1.72      4.75  27.0 -2.20  0.218
2  -0.501   0.707  1.83      4.58  23.9  5.08 33.8  
3   2.02    4.04   0.706     4.71  26.2  2.70  6.49 
4   0.832   1.78   1.82      5.23  37.6  4.40 21.1  
5   5.91   60.0   -1.23      4.72  26.4 -1.19  0.439
# … with 995 more rows

If we plot the pre-titers vs the post-titers, we see a cloud of points around the regression line, as we would expect.

ggplot(sim_dat) +
    aes(x = pre, y = post) +
    geom_point() +
    geom_abline(slope = beta, intercept = alpha, color = "red") +
    scale_x_continuous(trans = "log2", breaks = scales::breaks_log(base = 2)) +
    scale_y_continuous(trans = "log2", breaks = scales::breaks_log(base = 2))

The relationship between titer increase and pretiter is also completely determined by this linear model. For the specific set of parameters I chose for the simulation, we see a negative effect of pre-titer on titer increase (which we think is true in real life), but with this generative model, this is entirely dependent on the value of \(\beta\). Because I set a value of \(\beta\) which is in \((0, 1)\), we observe a positive correlation between pre and post- titer, and a negative correlation between pre-titer and titer increase. We will see why shortly.

ggplot(sim_dat) +
    aes(x = pre, y = FC) +
    geom_point() +
    scale_x_continuous(trans = "log2", breaks = scales::breaks_log(base = 2)) +
    scale_y_continuous(trans = "log2", breaks = scales::breaks_log(base = 2))

Regression offsets

We can derive the regression line for the case where the titer increase is the outcome of interest.

\[ \begin{aligned} \log_2(\text{post}) &= \alpha + \beta \cdot \log_2(\text{pre}) \\ \log_2(\text{post}) - \log_2(\text{pre}) &= \alpha + \beta \cdot \log_2(\text{pre}) - \log_2(\text{pre})\\ \text{titer increase} &= \alpha + (\beta - 1) \cdot \log_2 (\text{pre}) \end{aligned} \] That is, we can include the negative pre-titer as an offset term in the linear model for titer increase to recover the same coefficients as we would in the model for post-titer. This is why we got the specific qualitative behavior we observed earlier.

Note that a regression offset is a term that is included on the right-hand side of a linear model, but the coefficient is pre-specified as 1, and not estimated as part of the model fit.

mod1 <- lm(log_post ~ log_pre, data = sim_dat)
mod2 <- lm(TI ~ log_pre, data = sim_dat)
mod3 <- lm(TI ~ log_pre, data = sim_dat, offset = (-1 * log_pre))

Here we can see that model 1 and model 3 recover the parameters from the generative model, while if we fit model 2, we get a biased coefficient that is exactly equal to 1 minus the estimated parameter from the other two models. So while both models produce a “correct” inference, we need to be aware why the models are different.

dplyr::bind_rows(
    "mod1" = coef(mod1),
    "mod2" = coef(mod2),
    "mod3" = coef(mod3),
    .id = "model"
)

# A tibble: 3 × 3
  model `(Intercept)` log_pre
  <chr>         <dbl>   <dbl>
1 mod1           3.05   0.479
2 mod2           3.05  -0.521
3 mod3           3.05   0.479

Note that adding linear confounders will not change what happens here. For example, we could add \(\beta_2 \cdot \mathrm{age}\) into our generative model and our regression models, and we would see the same pattern. However, if we incorporate nonlinearity (e.g. interaction effects or hierarchical effects), understanding how the models are related might not be that simple to solve, because in those cases we can’t necessarily combine the offset term with something else to simplify the model.

Other outcomes

If we were to construct an outcome that is not linear in the pre-titer value, this equivalence would no longer hold. For example, if instead of using the log fold change as the outcome, we instead used

\[y = \frac{\log_2 (\text{post})}{\log_2 (\text{pre})},\] the models are no longer equivalent in this way. This model is not commonly used because the ratio of the logarithms is not as easily interpreted as the log fold change, but we could do that if we wanted. If we fit a model with that outcome we would get \[ \begin{aligned} \frac{\log_2 (\text{post})}{\log_2 (\text{pre})} &= \frac{\alpha + \beta\cdot\log_2 (\text{pre})}{\log_2 (\text{pre})} \\ &=\beta + \alpha\frac{1}{\log_2 (\text{pre})}. \end{aligned} \]

x <- 1 / sim_dat$log_pre
y <- sim_dat$log_post / sim_dat$log_pre
mod4 <- lm(y ~ sim_dat$log_pre)
mod5 <- lm(y ~ x)
dplyr::bind_rows(
    "mod4" = coef(mod4),
    "mod5" = coef(mod5),
    .id = "model"
)

# A tibble: 2 × 4
  model `(Intercept)` `sim_dat$log_pre`      x
  <chr>         <dbl>             <dbl>  <dbl>
1 mod4          1.30              0.177 NA    
2 mod5          0.899            NA      0.916

Uh-oh, those aren’t the numbers I said we should get! Actually, a major problem with this model is that those transformations kind of wreck our numeric precision. If we plot the transformed data, we can see that regression line still passes through the middle of the data really well, we just get some crazy points that ruin our ability to estimate this with a plain linear regression model.

plot(x, y)
abline(a = beta, b = alpha)

So I guess that’s probably another reason this outcome doesn’t get used much.

Details

Last updated at 2023-10-28 22:18:34.847873.

source code

sessionInfo()

R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] ggplot2_3.3.6

loaded via a namespace (and not attached):
 [1] vctrs_0.5.0       cli_3.4.1         knitr_1.40        rlang_1.0.6      
 [5] xfun_0.34         stringi_1.7.8     generics_0.1.3    renv_0.16.0      
 [9] jsonlite_1.8.3    glue_1.6.2        colorspace_2.0-3  htmltools_0.5.3  
[13] fansi_1.0.3       scales_1.2.1      rmarkdown_2.17    grid_4.3.1       
[17] evaluate_0.17     munsell_0.5.0     tibble_3.1.8      fastmap_1.1.0    
[21] yaml_2.3.6        lifecycle_1.0.3   zlib_0.0.1        stringr_1.4.1    
[25] compiler_4.3.1    dplyr_1.0.10      pkgconfig_2.0.3   htmlwidgets_1.5.4
[29] farver_2.1.1      digest_0.6.30     R6_2.5.1          tidyselect_1.2.0 
[33] utf8_1.2.2        pillar_1.8.1      magrittr_2.0.3    withr_2.5.0      
[37] tools_4.3.1       gtable_0.3.1

Reuse

CC BY-SA 4.0

Citation

BibTeX citation:

@online{billings2023,
  author = {Billings, Zane},
  title = {Increase Vs Post Titer as Regression Outcome},
  date = {2023-10-28},
  url = {https://wzbillings.com/posts/2023-10-28_TI-vs-Post-Regression},
  langid = {en}
}

For attribution, please cite this work as:

Billings, Zane. 2023. “Increase Vs Post Titer as Regression Outcome.” October 28, 2023. https://wzbillings.com/posts/2023-10-28_TI-vs-Post-Regression.