Data as a GenomicDataStream
R and C++ code
Developed by Gabriel Hoffman
Run on 2025-07-01 10:08:56
Source:vignettes/GenomicDataStream.Rmd
GenomicDataStream.Rmd
GenomicDataStream
is designed to read chunks of
features rather than chunks of samples. Features are
stored as columns in the matrix returned by R/C++, independent of
the underlying data storage format.
Usage
GenomicDataStreamRegression
implements regression models
(linear and GLMs) that stream chucks of features using the
GenomicDataStream
interface. In general, variants from
genetic data are used as covariates in lmFitFeatures()
, and
genes from single cell data are used as responses in
lmFitResponses()
.
Example code with R
Read genotype data into R
library(GenomicDataStream)
library(GenomicDataStreamRegression)
# VCF file
file <- system.file("extdata", "test.vcf.gz", package = "GenomicDataStream")
# initialize
gds = GenomicDataStream(file, "DS", chunkSize=5, initialize=TRUE)
n = 60
y = rnorm(n)
names(y) = paste0("I", seq(n))
design = model.matrix(~1, data.frame(y))
# loop until break
while( 1 ){
# get data chunk
# data$X matrix with features as columns
# data$info information about each feature as rows
dat = getNextChunk(gds)
# check if end of stream
if( atEndOfStream(gds) ) break
# do analysis on this chunk of data
# from GenomicDataStreamRegression
fit = lmFitFeatures(y, design, dat$X)
}
Use R to run analysis at C++ level
library(GenomicDataStream)
library(GenomicDataStreamRegression)
# VCF file
file <- system.file("extdata", "test.vcf.gz", package = "GenomicDataStream")
# create object, but don't read yet
# Read DS field storing dosage
gds = GenomicDataStream(file, "DS", chunkSize=5)
n = 60
y = rnorm(n)
names(y) = paste0("I", seq(n))
design = model.matrix(~1, data.frame(y))
# regression of y ~ design + X[,j]
# where X[,j] is the jth variant in the GenomicDataStream
# data in GenomicDataStream is only accessed at C++ level
# from GenomicDataStreamRegression
fit = lmFitFeatures(y, design, gds)
Example code with C++17
#include <RcppArmadillo.h>
#include <GenomicDataStream.h>
// use namespace for GenomicDataStream
using namespace gds;
// parameters
string file = "test.vcf.gz";
string field = "DS"; // read dosage field
string region = ""; // no region filter
string samples = "-"; // no samples filter
double MAF = 0.05; // minor allele freq filter
double minVariance = 0; // retain features with var > minVariance
int chunkSize = 4; // each chunk will read 4 variants
// initialize parameters
Param param( file, region, samples, MAF, minVariance, chunkSize);
param.setField(field);
// Initialise GenomicDataStream to read
// VCF/BCF/BGEN/PGEN with same interface
shared_ptr<GenomicDataStream> gdsStream = createFileView( param );
// declare DataChunk storing an Armadillo matrix for each chunk
DataChunk<arma::mat> chunk;
// Store meta-data about each variant
VariantInfo *info;
// loop through chunks
while( gdsStream->getNextChunk( chunk ) ){
// get data from chunk
arma::mat X = chunk.getData();
// get variant information
info = chunk.getInfo<VariantInfo>();
// Do analysis with variants in this chunk
analysis_function(X, info);
}
Accessing data
Data types
At the C++ level data from the GenomicDataStream
can be
accessed as multiple types a according to the templated definition of
DataChunk
. Above, each chunk
is a
DataChunk<arma::mat>
so the data is returned as an
Armadillo matrix of doubles (i.e. arma::mat
).
Suppported data types for dense matricies are:
Library | Namespace | Type |
---|---|---|
Rcpp | Rcpp |
NumericMatrix |
Armadillo | arma |
mat |
Eigen | Eigen |
MatrixXd |
STL | std |
vector<double> |
Importantly, each of these types wraps an array of
double
storing data in column-major order. The types just
provide different interfaces to the underlying array for use in
downstream analysis. In each case, the raw data is read as an
double
array, a constructor is called for the requested
data type using the array as the underlying data and an object is
returned to the user. In fact, the constructors to the dense matrix
types for Eigen and Armadillo return objects that point to the original
double
array, without allocating new memory.
GenomicDataStream
also supports the following sparse
matrix types:
Library | Namespace | Type |
---|---|---|
Armadillo | arma |
sp_mat |
Eigen | Eigen |
SparseMatrix |
In each case, a constructor is called for the requested data type the
converter double
array to a sparse matrix.
Session info
## R version 4.4.2 (2024-10-31)
## Platform: aarch64-apple-darwin23.6.0
## Running under: macOS Sonoma 14.7.1
##
## Matrix products: default
## BLAS: /Users/gabrielhoffman/prog/R-4.4.2/lib/libRblas.dylib
## LAPACK: /opt/homebrew/Cellar/r/4.4.3_1/lib/R/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] vctrs_0.6.5 cli_3.6.4 knitr_1.50 rlang_1.1.6 xfun_0.52
## [6] purrr_1.0.4 textshaping_1.0.0 jsonlite_2.0.0 htmltools_0.5.8.1 ragg_1.3.3
## [11] sass_0.4.9 rmarkdown_2.29 evaluate_1.0.3 jquerylib_0.1.4 fastmap_1.2.0
## [16] yaml_2.3.10 lifecycle_1.0.4 memoise_2.0.1 compiler_4.4.2 fs_1.6.6
## [21] htmlwidgets_1.6.4 systemfonts_1.2.1 digest_0.6.37 R6_2.6.1 magrittr_2.0.3
## [26] bslib_0.9.0 tools_4.4.2 pkgdown_2.0.9 cachem_1.1.0 desc_1.4.3
<>