GenomicDataStream
A scalable interface between data and analysis
A scalable interface between data and analysis underneath R
GenomicDataStream reads genomic data files (VCF, BCF, BGEN, PGEN, BED, H5AD, HDF5, DelayedArray) into R/Rcpp in chunks for analysis with the Armadillo, Eigen, and Rcpp libraries. Modern datasets are often too big to fit into memory, and many analyses operate on a small chunk of features at a time. Yet in practice, many implementations require the whole dataset to be stored in memory. Others pair an analysis with a specific data format in a way that the two components can't be separated for use in other applications, for example regression analysis paired with genotype data from a VCF file.

The GenomicDataStream C++ interface separates

  1. data source
  2. streaming chunks of features into a data matrix
  3. downstream analysis
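
The separation listed above can be illustrated with a minimal, library-independent sketch. All names here (`ChunkSource`, `VectorSource`, `featureMeans`) are hypothetical and are not part of the GenomicDataStream API; the in-memory source merely stands in for a file reader. The point is only that the analysis sees chunks through one interface and never touches the underlying storage.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical types for illustration only; not the GenomicDataStream API.
using Feature = std::vector<double>;  // values of one feature across samples
using Chunk = std::vector<Feature>;   // a small block of features

// (1) Data source: anything that can hand out the next chunk of features.
struct ChunkSource {
    virtual ~ChunkSource() = default;
    // Fill `chunk` with up to chunkSize features; return false when exhausted.
    virtual bool getNextChunk(Chunk &chunk) = 0;
};

// (2) Streaming: an in-memory source standing in for a file-backed reader.
class VectorSource : public ChunkSource {
public:
    VectorSource(std::vector<Feature> features, std::size_t chunkSize)
        : features_(std::move(features)), chunkSize_(chunkSize) {}

    bool getNextChunk(Chunk &chunk) override {
        chunk.clear();
        if (chunkSize_ == 0 || pos_ >= features_.size()) return false;
        const std::size_t end = std::min(pos_ + chunkSize_, features_.size());
        chunk.assign(features_.begin() + pos_, features_.begin() + end);
        pos_ = end;
        return true;
    }

private:
    std::vector<Feature> features_;
    std::size_t chunkSize_;
    std::size_t pos_ = 0;
};

// (3) Downstream analysis: depends only on the ChunkSource interface.
// Appends one mean per feature to `means` and returns the number of chunks read.
std::size_t featureMeans(ChunkSource &source, std::vector<double> &means) {
    std::size_t nChunks = 0;
    Chunk chunk;
    while (source.getNextChunk(chunk)) {
        ++nChunks;
        for (const Feature &f : chunk) {
            double sum = 0.0;
            for (double v : f) sum += v;
            means.push_back(f.empty() ? 0.0 : sum / f.size());
        }
    }
    return nChunks;
}
```

Swapping `VectorSource` for another class derived from `ChunkSource` would leave `featureMeans` unchanged, which is the property the three-way split is meant to provide.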

GenomicDataStream provides interfaces at both the C++ and R levels. The C++ interface prioritizes efficiency, while the R interface wraps the C++ backend for non-technical users.

Example code with C++17

#include <RcppArmadillo.h>
#include <GenomicDataStream.h>
// use namespace for GenomicDataStream
using namespace gds;
// parameters
string file = "test.vcf.gz";
string field = "DS"; // read dosage field
string region = ""; // no region filter
string samples = "-"; // no samples filter
double MAF = 0.05; // minor allele freq filter
double minVariance = 0; // retain features with var > minVariance
int chunkSize = 4; // each chunk will read 4 variants
// initialize parameters
Param param( file, region, samples, MAF, minVariance, chunkSize);
param.setField(field);
// Initialise GenomicDataStream to read
// VCF/BCF/BGEN/PGEN with same interface
shared_ptr<GenomicDataStream> gdsStream = createFileView( param );
// declare DataChunk storing an Armadillo matrix for each chunk
DataChunk<arma::mat> chunk;
// Store meta-data about each variant
VariantInfo *info;
// loop through chunks
while( gdsStream->getNextChunk( chunk ) ){
    // get data from chunk
    arma::mat X = chunk.getData();
    // get variant information
    info = chunk.getInfo<VariantInfo>();
    // Do analysis with variants in this chunk
    analysis_function(X, info);
}
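
The example leaves `analysis_function` undefined. As one possible illustration of per-chunk work, the sketch below computes a minor allele frequency from dosage values and keeps variants above a threshold. It uses `std::vector` rather than `arma::mat` so that it compiles without dependencies. The names `minorAlleleFreq` and `filterByMAF` are hypothetical, the dosages are assumed to be diploid values in [0, 2], and nothing here is meant to describe how GenomicDataStream applies its own `MAF` parameter internally.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of GenomicDataStream.
// Assumes diploid dosages in [0, 2], one value per sample.
double minorAlleleFreq(const std::vector<double> &dosage) {
    if (dosage.empty()) return 0.0;
    double sum = 0.0;
    for (double d : dosage) sum += d;
    // Alternate allele frequency: mean dosage divided by ploidy (2)
    const double altFreq = sum / (2.0 * dosage.size());
    // The minor allele is whichever of the two alleles is rarer
    return std::min(altFreq, 1.0 - altFreq);
}

// Return the indices of variants whose minor allele frequency is >= minMAF.
// Each inner vector holds the dosages of one variant across samples.
std::vector<std::size_t> filterByMAF(
    const std::vector<std::vector<double>> &variants, double minMAF) {
    std::vector<std::size_t> keep;
    for (std::size_t j = 0; j < variants.size(); ++j) {
        if (minorAlleleFreq(variants[j]) >= minMAF) keep.push_back(j);
    }
    return keep;
}
```

Inside the chunk loop, a function of this kind would be called once per chunk, so memory use scales with `chunkSize` rather than with the number of variants in the file.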

Key Dependencies

| Package | Ref | Role |
| --- | --- | --- |
| vcfppR | Bioinformatics | C++ API for htslib |
| htslib | GigaScience | C API for VCF/BCF files |
| pgenlibr | GigaScience | R/C++ API for plink files |
| beachmat | PLoS Comp Biol | C++ API for accessing data owned by R |
| Rcpp | J Stat Software | API for R/C++ integration |
| RcppEigen | J Stat Software | API for Rcpp access to Eigen matrix library |
| RcppArmadillo | J Stat Software | API for Rcpp access to Armadillo matrix library |
| Eigen | | C++ library for linear algebra with advanced features |
| Armadillo | J Open Src Soft | User-friendly C++ library for linear algebra |
| RcppParallel | | oneAPI Threading Building Blocks for parallel analysis |

Notes

GenomicDataStream provides flexibility in terms of data input types and matrix libraries. This can be useful in many cases, but the large number of dependencies can require installation of additional libraries and increase compile times. Some of these dependencies can be avoided by removing support for some capabilities with compiler flags in Makevars:

-D DISABLE_DELAYED_STREAM
           Omit DelayedStream class, remove dependence on Rcpp and beachmat

-D DISABLE_EIGEN
           Omit support for Eigen matrix library, and remove dependence on RcppEigen and Eigen

-D DISABLE_RCPP
           Omit support for Rcpp matrix library, and remove dependence on Rcpp

-D DISABLE_PLINK
           Omit support for PLINK files (PGEN, BED), and remove dependence on pgenlibr
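
Flags of this kind are commonly passed to the compiler through a Makevars variable such as `PKG_CPPFLAGS` and then checked with preprocessor guards. The sketch below shows that general pattern only; it is not taken from the library's source. The macro names come from the list above, while the function `enabledFeatures` and the feature labels are hypothetical.

```cpp
#include <string>
#include <vector>

// Hypothetical illustration of compile-time feature switches.
// Each guarded line is compiled only when the matching flag is NOT defined.
std::vector<std::string> enabledFeatures() {
    std::vector<std::string> features;
#ifndef DISABLE_DELAYED_STREAM
    features.push_back("DelayedStream");
#endif
#ifndef DISABLE_EIGEN
    features.push_back("Eigen");
#endif
#ifndef DISABLE_RCPP
    features.push_back("Rcpp");
#endif
#ifndef DISABLE_PLINK
    features.push_back("PLINK");
#endif
    return features;
}
```

Compiled without any of the flags, this function reports all four features; adding, for example, `-D DISABLE_EIGEN -D DISABLE_PLINK` would drop those two entries, along with any headers that a real implementation would include inside the same guards.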



Developed by Gabriel Hoffman at the Center for Disease Neurogenomics at the Icahn School of Medicine at Mount Sinai.