Normal approximation vs. empirical simulation

Asymptotic normal approximation

Let the vector {\bf p} be the true fractions across D categories. Consider C total counts sampled from a Dirichlet-multinomial (DMN) distribution with overdispersion \tau, where \tau=1 reduces to the multinomial distribution. The centered log ratio (CLR) of the i^{th} estimated fraction, \hat p_i is

\text{clr}(\hat p_i) = \log(\hat p_i) - \frac{1}{D}\sum_{j=1}^D \log(\hat p_j) , and we show that the sampling variance is well approximated by

\text{var}[\text{clr}(\hat p_i)] = \frac{\tau}{C} \left[ \frac{1}{\hat p_i} - \frac{2}{ D \hat p_i} + \frac{1}{D^2}\sum_{j=1}^D \frac{1}{\hat p_j} \right] .

Simulations

The sampling variance is derived from asymototic theory, so we examine its behavior for finite total counts. Here we evaluate the empirical variance from 1,000 draws from a Dirichlet-multinomial distribution while varying D, \tau, and C. A pseudocount of 0.5 is added to the observed counts since the asymptotic theory is not defined for counts of zero.

Here we plot the standard deviation after CLR transform from the empirical DMN and the asymptotic normal approximation under a range of conditions. Results are shown for instances with at least 2 counts.

D=2 categories

D=15 categories

Interpretation

The asymptotic standard deviation shows good agreement with the empirical results even for small values of C, when at least 2 counts are observed. In practice, it is often reasonable to assume a sufficient number of counts before a variable is included in an analysis. Importantly, with less than 2 counts the asymptotic theory gives a larger standard deviation than the emprical results (results not shown). Therefore, this approach is conservative and should not underestimate the true amount of variation. The asymptotic normal approximation is most accurate for large total counts C, large proportions p, and small overdispersion \tau.

Consideration of overdispersion

Based on Equation (2), the variance of the CLR-transformed proportions is a linear function of \tau. Importantly, downstream analysis of the CLR-transformed proportions with a precision-weighted linear (mixed) model or a variance stabilizing transform depends only on the relative variances. Since relative variances are invariant to the scale of \tau, for these applications the value of \tau can be set to 1 instead of being estimated from the data.

For other applications, crumblr can estimate \tau from the data by using crumblr(counts, tau=NULL). This calls dmn.mle() to estimate the parameters of the DMN distribution and is substantially faster than alternatives.

But note that due the theoretical properties of the variance estimate of CLR-transformed proportions, the precision weights are invariant to the scale of \tau. We can see this empirically:

library(crumblr)

data(IFNCellCounts)

counts <- df_cellCounts

# run crumblr with different tau values
# show part of the weights matrix
crumblr(counts, tau=1)$weights[1:3, 1:3]

##                  ctrl101 ctrl1015 ctrl1016
## B cells         2.704599 5.000000 3.094438
## CD14+ Monocytes 1.822824 4.143929 2.716263
## CD4 T cells     2.044461 3.615231 2.433429

crumblr(counts, tau=5)$weights[1:3, 1:3]

##                  ctrl101 ctrl1015 ctrl1016
## B cells         2.704599 5.000000 3.094438
## CD14+ Monocytes 1.822824 4.143929 2.716263
## CD4 T cells     2.044461 3.615231 2.433429

crumblr(counts, tau=NULL)$weights[1:3, 1:3]

##                  ctrl101 ctrl1015 ctrl1016
## B cells         2.704599 5.000000 3.094438
## CD14+ Monocytes 1.822824 4.143929 2.716263
## CD4 T cells     2.044461 3.615231 2.433429

Session Info

## R version 4.5.1 (2025-06-13)
## Platform: aarch64-apple-darwin23.6.0
## Running under: macOS Sonoma 14.7.1
## 
## Matrix products: default
## BLAS/LAPACK: /opt/homebrew/Cellar/openblas/0.3.30/lib/libopenblasp-r0.3.30.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] muscat_1.22.0            variancePartition_1.37.4 BiocParallel_1.42.1     
##  [4] limma_3.64.3             lubridate_1.9.4          forcats_1.0.0           
##  [7] stringr_1.5.1            dplyr_1.1.4              purrr_1.1.0             
## [10] readr_2.1.5              tidyr_1.3.1              tibble_3.3.0            
## [13] tidyverse_2.0.0          glue_1.8.0               HMP_2.0.1               
## [16] dirmult_0.1.3-5          crumblr_0.99.22          ggplot2_3.5.2           
## [19] BiocStyle_2.36.0        
## 
## loaded via a namespace (and not attached):
##   [1] splines_4.5.1               bitops_1.0-9                ggplotify_0.1.2            
##   [4] rpart_4.1.24                lifecycle_1.0.4             Rdpack_2.6.4               
##   [7] edgeR_4.6.3                 doParallel_1.0.17           globals_0.18.0             
##  [10] lattice_0.22-7              MASS_7.3-65                 backports_1.5.0            
##  [13] magrittr_2.0.3              sass_0.4.10                 rmarkdown_2.29             
##  [16] jquerylib_0.1.4             yaml_2.3.10                 sctransform_0.4.2          
##  [19] minqa_1.2.8                 RColorBrewer_1.1-3          abind_1.4-8                
##  [22] EnvStats_3.1.0              glmmTMB_1.1.11              GenomicRanges_1.60.0       
##  [25] BiocGenerics_0.54.0         yulab.utils_0.2.0           circlize_0.4.16            
##  [28] GenomeInfoDbData_1.2.14     IRanges_2.42.0              S4Vectors_0.46.0           
##  [31] ggrepel_0.9.6               pbkrtest_0.5.5              irlba_2.3.5.1              
##  [34] listenv_0.9.1               tidytree_0.4.6              vegan_2.7-1                
##  [37] parallelly_1.45.1           pkgdown_2.1.3               permute_0.9-8              
##  [40] codetools_0.2-20            DelayedArray_0.34.1         scuttle_1.18.0             
##  [43] tidyselect_1.2.1            shape_1.4.6.1               aplot_0.2.8                
##  [46] UCSC.utils_1.4.0            farver_2.1.2                ScaledMatrix_1.16.0        
##  [49] lme4_1.1-37                 viridis_0.6.5               matrixStats_1.5.0          
##  [52] stats4_4.5.1                jsonlite_2.0.0              BiocNeighbors_2.2.0        
##  [55] GetoptLong_1.0.5            scater_1.36.0               iterators_1.0.14           
##  [58] systemfonts_1.2.3           foreach_1.5.2               tools_4.5.1                
##  [61] progress_1.2.3              treeio_1.32.0               ragg_1.4.0                 
##  [64] Rcpp_1.1.0                  blme_1.0-6                  gridExtra_2.3              
##  [67] SparseArray_1.8.1           xfun_0.52                   mgcv_1.9-3                 
##  [70] DESeq2_1.48.1               MatrixGenerics_1.20.0       GenomeInfoDb_1.44.1        
##  [73] withr_3.0.2                 numDeriv_2016.8-1.1         BiocManager_1.30.26        
##  [76] fastmap_1.2.0               boot_1.3-31                 rsvd_1.0.5                 
##  [79] caTools_1.18.3              digest_0.6.37               timechange_0.3.0           
##  [82] R6_2.6.1                    gridGraphics_0.5-1          textshaping_1.0.1          
##  [85] colorspace_2.1-1            gtools_3.9.5                RhpcBLASctl_0.23-42        
##  [88] generics_0.1.4              data.table_1.17.8           corpcor_1.6.10             
##  [91] prettyunits_1.2.0           httr_1.4.7                  htmlwidgets_1.6.4          
##  [94] S4Arrays_1.8.1              pkgconfig_2.0.3             gtable_0.3.6               
##  [97] rpart.plot_3.1.3            ComplexHeatmap_2.24.1       SingleCellExperiment_1.30.1
## [100] XVector_0.48.0              remaCor_0.0.18              htmltools_0.5.8.1          
## [103] bookdown_0.43               TMB_1.9.17                  zigg_0.0.2                 
## [106] clue_0.3-66                 scales_1.4.0                Biobase_2.68.0             
## [109] png_0.1-8                   fANCOVA_0.6-1               reformulas_0.4.1           
## [112] ggfun_0.2.0                 knitr_1.50                  tzdb_0.5.0                 
## [115] reshape2_1.4.4              rjson_0.2.23                nlme_3.1-168               
## [118] nloptr_2.2.1                cachem_1.1.0                GlobalOptions_0.1.2        
## [121] KernSmooth_2.23-26          vipor_0.4.7                 desc_1.4.3                 
## [124] pillar_1.11.0               grid_4.5.1                  vctrs_0.6.5                
## [127] gplots_3.2.0                BiocSingular_1.24.0         beachmat_2.24.0            
## [130] cluster_2.1.8.1             beeswarm_0.4.0              evaluate_1.0.4             
## [133] mvtnorm_1.3-3               cli_3.6.5                   locfit_1.5-9.12            
## [136] compiler_4.5.1              rlang_1.1.6                 crayon_1.5.3               
## [139] future.apply_1.20.0         labeling_0.4.3              ggbeeswarm_0.7.2           
## [142] plyr_1.8.9                  fs_1.6.6                    stringi_1.8.7              
## [145] viridisLite_0.4.2           lmerTest_3.1-3              lazyeval_0.2.2             
## [148] aod_1.3.3                   Matrix_1.7-3                hms_1.1.3                  
## [151] patchwork_1.3.1             future_1.67.0               statmod_1.5.0              
## [154] SummarizedExperiment_1.38.1 rbibutils_2.3               Rfast_2.1.5.1              
## [157] broom_1.0.9                 RcppParallel_5.1.10.9000    bslib_0.9.0                
## [160] ggtree_3.16.3               ape_5.8-1

Developed by Gabriel Hoffman