
Using crumblr in practice
Developed by Gabriel Hoffman
Run on 2025-08-11 14:19:38.671664
Source:vignettes/crumblr.Rmd
crumblr.Rmd
Introduction
Changes in cell type composition play an important role in health and
disease. Recent advances in single cell technology have enabled
measurement of cell type composition at increasing cell lineage
resolution across large cohorts of individuals. Yet this raises new
challenges for statistical analysis of these compositional data to
identify changes associated with a phenotype. We introduce
crumblr
, a scalable statistical method for analyzing count
ratio data using precision-weighted linear models incorporating random
effects for complex study designs. Uniquely, crumblr
performs tests of association at multiple levels of the cell lineage
hierarchy using multivariate regression to increase power over tests of
a single cell component. In simulations, crumblr
increases
power compared to existing methods, while controlling the false positive
rate.
The crumblr
package integrates with the variancePartition
and dreamlet
packages in the Bioconductor ecosystem.
Here we consider counts for 8 cell types from quantified using single cell RNA-seq data from unstimulated and interferon β stimulated PBMCs from 8 subjects (Kang, et al., 2018).
The functions here incorporate the precision weights:
Installation
To install this package, start R and enter:
# 1) Make sure Bioconductor is installed
if (!require("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
# 2) Install crumblr and dependencies:
# From Bioconductor
BiocManager::install("crumblr")
Analysis workflow
Process data
Here we evaluate whether the observed cell proportions change in response to interferon β. Given the results here, we cannot reject the null hypothesis that interferon β does not affect the cell type proportions.
library(crumblr)
# Load cell counts, clustering and metadata
# from Kang, et al. (2018) https://doi.org/10.1038/nbt.4042
data(IFNCellCounts)
# Apply crumblr transformation
# cobj is an EList object compatable with limma workflow
# cobj$E stores transformed values
# cobj$weights stores precision weights
# corresponding to the regularized inverse variance
cobj <- crumblr(df_cellCounts)
Variance partitioning
Decomposing the variance illustrates that more variation is explained by subject than stimulation status.
library(variancePartition)
# Partition variance into components for Subject (i.e. ind)
# and stimulation status, and residual variation
form <- ~ (1 | ind) + (1 | StimStatus)
vp <- fitExtractVarPartModel(cobj, form, info)
# Plot variance fractions
fig.vp <- plotPercentBars(vp)
fig.vp
PCA
Performing PCA on the transformed cell counts indicates that the samples cluster based on subject rather than stimulation status.
library(ggplot2)
# Perform PCA
# use crumblr::standardize() to get values with
# approximately equal sampling variance,
# which is a key property for downstream PCA and clustering analysis.
pca <- prcomp(t(standardize(cobj)))
# merge with metadata
df_pca <- merge(pca$x, info, by = "row.names")
# Plot PCA
# color by Subject
# shape by Stimulated vs unstimulated
ggplot(df_pca, aes(PC1, PC2, color = as.character(ind), shape = StimStatus)) +
geom_point(size = 3) +
theme_classic() +
theme(aspect.ratio = 1) +
scale_color_discrete(name = "Subject") +
xlab("PC1") +
ylab("PC2")
Differential testing
# Use variancePartition workflow to analyze each cell type
# Perform regression on each cell type separately
# then use eBayes to shrink residual variance
# Also compatible with limma::lmFit()
fit <- dream(cobj, ~ StimStatus + ind, info)
fit <- eBayes(fit)
# Extract results for each cell type
topTable(fit, coef = "StimStatusstim", number = Inf)
## logFC AveExpr t P.Value adj.P.Val B
## CD8 T cells -0.25085170 0.0857175 -4.0787416 0.002436375 0.01949100 -1.279815
## Dendritic cells 0.37386979 -2.1849234 3.1619195 0.010692544 0.02738587 -2.638507
## CD14+ Monocytes -0.10525402 1.2698117 -3.1226341 0.011413912 0.02738587 -2.709377
## B cells -0.10478652 0.5516882 -3.0134349 0.013692935 0.02738587 -2.940542
## CD4 T cells -0.07840101 2.0201947 -2.2318104 0.050869691 0.08139151 -4.128069
## FCGR3A+ Monocytes 0.07425165 -0.2567492 1.6647681 0.128337022 0.17111603 -4.935304
## NK cells 0.10270672 0.3797777 1.5181860 0.161321761 0.18436773 -5.247806
## Megakaryocytes 0.01377768 -1.8655172 0.1555131 0.879651456 0.87965146 -6.198336
Multivariate testing along a tree
We can gain power by jointly testing multiple cell types using a
multivariate statistical model, instead of testing one cell type at a
time. Here we construct a hierarchical clustering between cell types
based on gene expression from pseudobulk, and perform a multivariate
test for each internal node of the tree based on its leaf nodes. The
results for the leaves are the same as from topTable()
above. At each internal node treeTest()
performs a fixed
effects meta-analysis of the coefficients of the leaves while modeling
the covariance between coefficient estimates. In the backend, this is
implemented using variancePartition::mvTest()
and remaCor
package.
Here the hierarchical clustering, hcl
, is precomputed
from pseudobulk gene expression using
buildClusterTreeFromPB()
.
# Perform multivariate test across the hierarchy
res <- treeTest(fit, cobj, hcl, coef = "StimStatusstim")
# Plot hierarchy and testing results
plotTreeTest(res)
# Plot hierarchy and regression coefficients
plotTreeTestBeta(res)
Combined plotting
plotTreeTestBeta(res) +
theme(legend.position = "bottom", legend.box = "vertical") |
plotForest(res, hide = FALSE) |
fig.vp
Hierarchical clustering
The hierarchical clustering used by treeTest()
can be
computed a number of ways, depending on the available data and
biological question. For example, see Article for details about how hierarchical
clustering was run in this dataset.
In general, hierarchical clustering can be computed from
- pseudobulked single cell gene expression
hcl <- buildClusterTreeFromPB(pb)
- cell type frequencies
# correlation matrix between all cell types
C <- cor(t(standardize(cobj)))
# convert to distance
dm <- as.dist(1 - abs(C))
# eval hierarchical clustering
hcl <- hclust(dm)
- Newick formated tree computed from external data
# Make sure packages are installed
# BiocManager::install(c("ctc", "ape", "phylogram"))
library(ape)
library(ctc)
library(phylogram)
library(tidyverse)
# Write tree to file,
# edit manually
# then read back into R
#
# Specify tree as text in Newick format
txt = "((CD14+ Monocytes,(B cells,(Dendritic cells,Megakaryocytes))),(CD8 T cells,(NK cells,(CD4 T cells,FCGR3A+ Monocytes))));"
# read from text
hcl_from_txt <- read.tree(text = txt) %>%
as.dendrogram.phylo %>%
as.hclust
# Alternatively,
# write existing tree to file
# and edit manaully
write(hc2Newick(hcl),file='hcl.nwk')
hcl_from_txt2 <- read.tree(file = 'hcl.nwk') %>%
as.dendrogram.phylo %>%
as.hclust
Session Info
## R version 4.5.1 (2025-06-13)
## Platform: aarch64-apple-darwin23.6.0
## Running under: macOS Sonoma 14.7.1
##
## Matrix products: default
## BLAS/LAPACK: /opt/homebrew/Cellar/openblas/0.3.30/lib/libopenblasp-r0.3.30.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] muscat_1.22.0 crumblr_0.99.22 variancePartition_1.37.4
## [4] BiocParallel_1.42.1 limma_3.64.3 ggplot2_3.5.2
## [7] BiocStyle_2.36.0
##
## loaded via a namespace (and not attached):
## [1] splines_4.5.1 bitops_1.0-9 ggplotify_0.1.2
## [4] tibble_3.3.0 lifecycle_1.0.4 Rdpack_2.6.4
## [7] edgeR_4.6.3 doParallel_1.0.17 globals_0.18.0
## [10] lattice_0.22-7 MASS_7.3-65 backports_1.5.0
## [13] magrittr_2.0.3 sass_0.4.10 rmarkdown_2.29
## [16] jquerylib_0.1.4 yaml_2.3.10 sctransform_0.4.2
## [19] minqa_1.2.8 RColorBrewer_1.1-3 abind_1.4-8
## [22] EnvStats_3.1.0 glmmTMB_1.1.11 GenomicRanges_1.60.0
## [25] purrr_1.1.0 BiocGenerics_0.54.0 yulab.utils_0.2.0
## [28] circlize_0.4.16 GenomeInfoDbData_1.2.14 IRanges_2.42.0
## [31] S4Vectors_0.46.0 ggrepel_0.9.6 pbkrtest_0.5.5
## [34] irlba_2.3.5.1 listenv_0.9.1 tidytree_0.4.6
## [37] parallelly_1.45.1 pkgdown_2.1.3 codetools_0.2-20
## [40] DelayedArray_0.34.1 scuttle_1.18.0 tidyselect_1.2.1
## [43] shape_1.4.6.1 aplot_0.2.8 UCSC.utils_1.4.0
## [46] farver_2.1.2 lme4_1.1-37 ScaledMatrix_1.16.0
## [49] viridis_0.6.5 matrixStats_1.5.0 stats4_4.5.1
## [52] jsonlite_2.0.0 GetoptLong_1.0.5 BiocNeighbors_2.2.0
## [55] scater_1.36.0 iterators_1.0.14 systemfonts_1.2.3
## [58] foreach_1.5.2 tools_4.5.1 progress_1.2.3
## [61] treeio_1.32.0 ragg_1.4.0 Rcpp_1.1.0
## [64] blme_1.0-6 glue_1.8.0 gridExtra_2.3
## [67] SparseArray_1.8.1 xfun_0.52 mgcv_1.9-3
## [70] DESeq2_1.48.1 MatrixGenerics_1.20.0 GenomeInfoDb_1.44.1
## [73] dplyr_1.1.4 withr_3.0.2 numDeriv_2016.8-1.1
## [76] BiocManager_1.30.26 fastmap_1.2.0 boot_1.3-31
## [79] caTools_1.18.3 digest_0.6.37 rsvd_1.0.5
## [82] R6_2.6.1 gridGraphics_0.5-1 textshaping_1.0.1
## [85] colorspace_2.1-1 gtools_3.9.5 RhpcBLASctl_0.23-42
## [88] tidyr_1.3.1 generics_0.1.4 data.table_1.17.8
## [91] corpcor_1.6.10 prettyunits_1.2.0 httr_1.4.7
## [94] htmlwidgets_1.6.4 S4Arrays_1.8.1 pkgconfig_2.0.3
## [97] gtable_0.3.6 ComplexHeatmap_2.24.1 SingleCellExperiment_1.30.1
## [100] XVector_0.48.0 remaCor_0.0.18 htmltools_0.5.8.1
## [103] bookdown_0.43 TMB_1.9.17 zigg_0.0.2
## [106] clue_0.3-66 scales_1.4.0 Biobase_2.68.0
## [109] png_0.1-8 fANCOVA_0.6-1 reformulas_0.4.1
## [112] ggfun_0.2.0 knitr_1.50 reshape2_1.4.4
## [115] rjson_0.2.23 nlme_3.1-168 nloptr_2.2.1
## [118] cachem_1.1.0 GlobalOptions_0.1.2 stringr_1.5.1
## [121] KernSmooth_2.23-26 parallel_4.5.1 vipor_0.4.7
## [124] desc_1.4.3 pillar_1.11.0 grid_4.5.1
## [127] vctrs_0.6.5 gplots_3.2.0 BiocSingular_1.24.0
## [130] beachmat_2.24.0 cluster_2.1.8.1 beeswarm_0.4.0
## [133] evaluate_1.0.4 mvtnorm_1.3-3 cli_3.6.5
## [136] locfit_1.5-9.12 compiler_4.5.1 rlang_1.1.6
## [139] crayon_1.5.3 future.apply_1.20.0 labeling_0.4.3
## [142] plyr_1.8.9 fs_1.6.6 ggbeeswarm_0.7.2
## [145] stringi_1.8.7 viridisLite_0.4.2 lmerTest_3.1-3
## [148] lazyeval_0.2.2 aod_1.3.3 Matrix_1.7-3
## [151] hms_1.1.3 patchwork_1.3.1 future_1.67.0
## [154] statmod_1.5.0 SummarizedExperiment_1.38.1 rbibutils_2.3
## [157] Rfast_2.1.5.1 broom_1.0.9 RcppParallel_5.1.10.9000
## [160] bslib_0.9.0 ggtree_3.16.3 ape_5.8-1