
Cell- versus gene-major storage

General matrices can be stored in either row-major or column-major format, depending on whether rows or columns need to be accessed efficiently. Similarly, sparse matrices with many zero entries can be stored in compressed sparse row (CSR) or compressed sparse column (CSC) format. In both cases, the best format is the one that matches the dominant access pattern.
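To make the CSR/CSC distinction concrete, here is a minimal pure-Python sketch on a toy matrix. The helper names (`to_csr`, `to_csc`, `read_major`) are illustrative and not part of any library. The sketch only shows that a row is one contiguous slice in CSR and a column is one contiguous slice in CSC.

```python
# Toy 3 x 4 matrix with many zeros (rows x columns).
dense = [
    [5, 0, 0, 2],
    [0, 0, 3, 0],
    [0, 7, 0, 1],
]


def to_csr(matrix):
    """Compressed sparse row: nonzero values are stored row by row."""
    data, indices, indptr = [], [], [0]
    for row in matrix:
        for j, value in enumerate(row):
            if value != 0:
                data.append(value)
                indices.append(j)
        # indptr[k + 1] marks where row k ends in data/indices.
        indptr.append(len(data))
    return data, indices, indptr


def to_csc(matrix):
    """Compressed sparse column: equivalent to CSR of the transpose."""
    transposed = [list(col) for col in zip(*matrix)]
    return to_csr(transposed)


def read_major(compressed, k):
    """Return (minor index, value) pairs for row k (CSR) or column k (CSC)."""
    data, indices, indptr = compressed
    start, stop = indptr[k], indptr[k + 1]
    return list(zip(indices[start:stop], data[start:stop]))


csr = to_csr(dense)
csc = to_csc(dense)
print(read_major(csr, 2))  # row 2 as one contiguous slice of the CSR arrays
print(read_major(csc, 3))  # column 3 as one contiguous slice of the CSC arrays
```

Reading a column from the CSR arrays, by contrast, would require scanning every row's slice for the matching column index.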

The H5AD format from AnnData is widely used to store single-cell gene expression data. Many standard analyses are cell-centric tasks: QC, annotation, PCA, UMAP, K-Nearest Neighbors. Therefore, by default, an H5AD file stores the sparse read count matrix in a format that prioritizes efficient access to cells rather than genes. Setting aside the details of the underlying data storage model, we term this a cell-major format. Most H5AD files use this default.

However, differential expression analysis is a gene-centric task. For each annotated cell type, lucida fits a negative binomial regression model to each gene. Therefore, lucida analysis can be accelerated by storing the read count matrix in a gene-major format, which H5AD also supports (Figure 1).

Figure 1: Intuition for cell- versus gene-major storage formats. The data matrix is shown on the left with genes as rows, cells as columns, and entries with corresponding labels. Using cell-major format, accessing gene 1 (in red) requires reading across the entire dataset. Using gene-major order, this requires only a single efficient query of 3 adjacent entries. This illustration shows row- versus column-major order for storing general matrices, and serves as an analogy for sparse matrices.
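The intuition in Figure 1 can be reproduced with a few lines of arithmetic. The sketch below assumes a toy dense matrix of 3 genes by 3 cells (sizes chosen only to mirror the 3-entry example) and computes where the entries of one gene land in a flat array under each layout. The function names are illustrative.

```python
N_GENES, N_CELLS = 3, 3  # toy sizes, assumed for illustration


def offset_gene_major(gene, cell):
    """Flat position when all cells of a gene are stored together."""
    return gene * N_CELLS + cell


def offset_cell_major(gene, cell):
    """Flat position when all genes of a cell are stored together."""
    return cell * N_GENES + gene


gene = 0  # "gene 1" in the figure, using 0-based indexing
gene_major = [offset_gene_major(gene, c) for c in range(N_CELLS)]
cell_major = [offset_cell_major(gene, c) for c in range(N_CELLS)]

print(gene_major)  # [0, 1, 2]: adjacent positions, one short read
print(cell_major)  # [0, 3, 6]: positions spread across the whole array
```

With realistic dimensions the stride in the cell-major layout equals the number of genes, so reading one gene touches storage across the entire matrix.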

Empirical testing of H5AD formats

Load times were benchmarked for 12 ways of storing an H5AD file. The results show that a gene-major format can outperform the default cell-major format (Figure 2). Here and in other datasets, the best performance was observed with gene-major format, LZF compression, and cells sorted by annotated class and biological sample identifier.

Figure 2: Load times for H5AD files using gene-wise access. Comparison was performed for 12 options for storing H5AD files. Sorted indicates whether cells were sorted by cell type and sample identifier. Sparse indicates the format for storing the sparse count matrix: either CSR for cell-major format (default), or CSC for gene-major format. Compression indicates the compression algorithm. Benchmarking was performed on a 1K1K dataset of 36K genes and 1.2M cells on an M3 Mac with 36 GB RAM. For each format, the time to read in 1000 genes for 161K effector memory CD8-positive, alpha-beta T cells is shown. Left and right panels show the same results, with different x-axis scales.
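One plausible reason sorting helps, offered here as intuition rather than as a description of how the benchmark was run, is that sorted cells of a given type occupy a single contiguous index range instead of many scattered ones. The pure-Python sketch below counts those ranges for a made-up list of cell-type labels. `contiguous_runs` is an illustrative helper.

```python
def contiguous_runs(labels, target):
    """Return [start, stop) index ranges where labels equal target."""
    runs, start = [], None
    for i, label in enumerate(labels):
        if label == target and start is None:
            start = i                      # a new run begins
        elif label != target and start is not None:
            runs.append((start, i))        # the current run ends
            start = None
    if start is not None:
        runs.append((start, len(labels)))  # run extends to the end
    return runs


# Made-up cell-type labels in their original (unsorted) order.
cell_types = ["T", "B", "T", "NK", "B", "T", "T", "NK"]

unsorted_runs = contiguous_runs(cell_types, "T")
sorted_runs = contiguous_runs(sorted(cell_types), "T")

print(unsorted_runs)  # [(0, 1), (2, 3), (5, 7)]: three separate reads
print(sorted_runs)    # [(4, 8)]: one contiguous read
```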

Convert H5AD file

We developed the recode_h5ad.py script that uses the AnnData library in order to:

  • Convert read count matrix to gene-major format
  • Sort cells by annotated class and biological sample identifier
  • Create variable libSize storing the number of reads for each cell. This is computed from raw/X if it exists, and from X otherwise
  • Write to new H5AD file using LZF compression

The result is a valid H5AD file that standard tools can read, optimized for gene-centric access.

Here, we download the script and examine the arguments:

# Download python script
SRC=https://raw.githubusercontent.com/GabrielHoffman/GenomicDataStream_analysis/refs/heads/main/recode_h5ad.py
wget $SRC

# Install anndata if needed
conda install anndata

# Examine arguments
python recode_h5ad.py --help
usage: recode_h5ad.py [-h] --input INPUT --output OUTPUT [--ondisk]
                      [--sortBy SORTBY] [--compression {None,gzip,lzf}]
                      [--format {CSR,CSC}] [--noLibSize]

Convert an AnnData .h5ad file so that X (and raw/X) is stored in CSC sparse format (v1.1)

options:
  -h, --help            show this help message and exit
  --input INPUT         Input .h5ad file
  --output OUTPUT       Output .h5ad file
  --ondisk              Use file-backed mode to reduce memory usage
  --sortBy SORTBY       Cols to sort by in _decreasing_ order of importance
  --compression {None,gzip,lzf}
                        Optional compression for output file (gzip or lzf).
                        Default: None
  --format {CSR,CSC}    Store sparse count matrix in CSR or CSC format. CSR
                        allows faster access to cells, CSC gives faster
                        access to genes. Default: CSR
  --noLibSize           Skip computing libSize for each cell

Now, we convert an H5AD file to CSC (i.e. gene-major) format, sorting the cells by class, subclass and SampleID, and saving using LZF compression.

# H5AD=(Original H5AD file)
# OUTFILE=(New H5AD file)

python recode_h5ad.py \
  --input "$H5AD" \
  --sortBy class,subclass,SampleID \
  --format CSC \
  --compression lzf \
  --output "$OUTFILE"
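
After conversion, the storage layout can be checked without loading the matrix. The sketch below assumes the h5py package is installed and relies on the `encoding-type` attribute that recent AnnData versions write on each stored element. The function name `matrix_encoding` is illustrative, and files written by very old AnnData versions may lack the attribute.

```python
import h5py


def matrix_encoding(path, key="X"):
    """Return the AnnData 'encoding-type' attribute of a stored matrix."""
    with h5py.File(path, "r") as f:
        value = f[key].attrs.get("encoding-type", "unknown")
    # Some files store the attribute as bytes rather than str.
    return value.decode() if isinstance(value, bytes) else str(value)


# Example call with a hypothetical file name; a gene-major file
# is expected to report "csc_matrix":
# print(matrix_encoding("output.h5ad"))
```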