

A scalable interface between genomic data and analysis underneath R

GenomicDataStream reads genomic data files (VCF, BCF, BGEN, PGEN, BED, H5AD, HDF5, DelayedArray) into R/Rcpp in chunks for analysis with the Armadillo, Eigen, and Rcpp libraries. Modern datasets are often too big to fit into memory, and many analyses operate on a small chunk of features at a time. Yet in practice, many implementations require the whole dataset to be stored in memory. Others pair an analysis with a specific data format in a way that the two components can't be separated for use in other applications, for example a regression analysis tied to genotype data from a VCF file.

The GenomicDataStream interface separates:

  1. data source
  2. streaming chunks of features into a data matrix
  3. downstream analysis
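
The three-part separation above can be sketched in a few lines of C++. This is only an illustration of the design idea, not the GenomicDataStream API: `Chunk`, `ChunkSource`, `InMemorySource`, and `featureMeans` are hypothetical names, and the in-memory source stands in for a real file reader.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical names throughout; this is not the GenomicDataStream API.

// A chunk of features: each inner vector holds one feature's values across samples.
struct Chunk {
    std::vector<std::vector<double>> features;
};

// Parts (1) and (2): a data source that hands out features one chunk at a time.
class ChunkSource {
public:
    virtual ~ChunkSource() = default;
    // Fills `out` with the next chunk; returns false once the source is exhausted.
    virtual bool nextChunk(Chunk& out) = 0;
};

// Stand-in source backed by memory. A file-backed reader would implement
// the same interface without the analysis code changing.
class InMemorySource : public ChunkSource {
public:
    InMemorySource(std::vector<std::vector<double>> data, std::size_t chunkSize)
        : data_(std::move(data)), chunkSize_(chunkSize) {
        assert(chunkSize_ > 0);
    }

    bool nextChunk(Chunk& out) override {
        out.features.clear();
        if (pos_ >= data_.size()) return false;
        const std::size_t end = std::min(pos_ + chunkSize_, data_.size());
        out.features.assign(data_.begin() + static_cast<std::ptrdiff_t>(pos_),
                            data_.begin() + static_cast<std::ptrdiff_t>(end));
        pos_ = end;
        return true;
    }

private:
    std::vector<std::vector<double>> data_;
    std::size_t chunkSize_;
    std::size_t pos_ = 0;
};

// Part (3): a downstream analysis that only ever sees chunks.
// Here it computes the mean of each feature.
std::vector<double> featureMeans(ChunkSource& src) {
    std::vector<double> means;
    Chunk chunk;
    while (src.nextChunk(chunk)) {
        for (const auto& f : chunk.features) {
            double sum = 0.0;
            for (double v : f) sum += v;
            means.push_back(f.empty() ? 0.0 : sum / static_cast<double>(f.size()));
        }
    }
    return means;
}
```

Because `featureMeans` depends only on the `ChunkSource` interface, swapping the data format means writing a new source class, and at most one chunk is held in memory at a time.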

GenomicDataStream provides interfaces at both the C++ and R levels. The C++ interface prioritizes efficiency, while the R interface wraps the C++ backend for non-technical users.

See header-only C++ library documentation

Install

# Install latest version of GenomicDataStream and dependencies
BiocManager::install("GabrielHoffman/GenomicDataStream")

Supported formats

Genetic data

Format     Version   Support
BGEN       1.1       biallelic variants
BGEN       1.2, 1.3  phased or unphased biallelic variants
PGEN       plink2    biallelic variants
BED        plink1    biallelic variants
VCF / BCF  4.x       biallelic variants with GT/GP fields, continuous dosage with DS field
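
For readers unfamiliar with how GT and GP fields relate to a dosage value, the sketch below shows the usual conversion for a biallelic, diploid variant. It assumes the alternate allele is the one being counted and that GP is ordered (hom-ref, het, hom-alt) as in the VCF specification; `dosageFromGT` and `dosageFromGP` are hypothetical helpers, and the convention GenomicDataStream itself uses is not confirmed here.

```cpp
#include <array>
#include <cassert>

// Hypothetical helpers; not part of GenomicDataStream.

// Dosage from a hard-call genotype (GT): the number of alternate alleles.
// a1 and a2 are allele codes for a biallelic variant: 0 = reference, 1 = alternate.
inline double dosageFromGT(int a1, int a2) {
    assert((a1 == 0 || a1 == 1) && (a2 == 0 || a2 == 1));
    return static_cast<double>(a1 + a2);
}

// Expected dosage from genotype probabilities (GP), assumed to be ordered
// P(hom-ref), P(het), P(hom-alt): E[dosage] = 0*P0 + 1*P1 + 2*P2.
inline double dosageFromGP(const std::array<double, 3>& gp) {
    return gp[1] + 2.0 * gp[2];
}
```

A DS field already stores this continuous value, so no conversion is needed in that case.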

Single cell data

Count matrices for single cell data are stored in the H5AD format. This format, based on HDF5, can store millions of cells since it is designed for sparse counts (i.e. many entries are 0) and uses built-in compression. H5AD enables file-backed random access for analyzing a subset of the data without reading the entire file into memory.
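
To illustrate why sparse storage allows access to a subset of the data, here is a minimal in-memory sketch of compressed sparse row (CSR) layout, one of the sparse encodings H5AD can use. It performs no HDF5 I/O; `CsrMatrix` and `denseRow` are hypothetical names chosen for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical in-memory stand-in for a CSR-encoded count matrix.
struct CsrMatrix {
    std::size_t nrow;
    std::size_t ncol;
    std::vector<double> data;          // non-zero values, stored row by row
    std::vector<std::size_t> indices;  // column index of each non-zero value
    std::vector<std::size_t> indptr;   // row r occupies data[indptr[r] .. indptr[r+1])
};

// Expand a single row to dense form, touching only that row's non-zero entries.
std::vector<double> denseRow(const CsrMatrix& m, std::size_t row) {
    assert(row < m.nrow);
    assert(m.indptr.size() == m.nrow + 1);
    std::vector<double> out(m.ncol, 0.0);
    for (std::size_t k = m.indptr[row]; k < m.indptr[row + 1]; ++k) {
        out[m.indices[k]] = m.data[k];
    }
    return out;
}
```

Only the slice of `data` and `indices` delimited by `indptr` is read for a given row, which is what makes reading a subset from a file-backed store cheap relative to loading the whole matrix.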

Key Dependencies

Package         Ref              Role
vcfppR          Bioinformatics   C++ API for htslib
htslib          GigaScience      C API for VCF/BCF files
pgenlibr        GigaScience      R/C++ API for plink files
beachmat        PLoS Comp Biol   C++ API for accessing data owned by R
DelayedArray    -                R interface for handling on-disk data formats
Rcpp            J Stat Software  API for R/C++ integration
RcppEigen       J Stat Software  API for Rcpp access to Eigen matrix library
RcppArmadillo   J Stat Software  API for Rcpp access to Armadillo matrix library
Eigen           -                C++ library for linear algebra with advanced features
Armadillo       J Open Src Soft  User-friendly C++ library for linear algebra
RcppParallel    -                oneAPI Threading Building Blocks for parallel analysis