Extending GenomicDataStream
File and data types
Developed by Gabriel Hoffman
Run on 2025-07-01 10:08:55
Source: vignettes/extensions.Rmd
Additional file types
Consider extending the package to support a new file of type X. First, create a new class Xstream that inherits from GenomicDataStream. Xstream must define a constructor taking a Param object and implement getNextChunk() for each supported data type.
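The steps above can be sketched as follows. This is a minimal, self-contained illustration and not the package's actual API: the Param, DataChunk and GenomicDataStream types in the sketch namespace are simplified, hypothetical stand-ins for the real classes, and the "file" is synthesized in memory so the example needs no external libraries.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

namespace sketch {

// Hypothetical stand-in for Param; the real struct has more fields.
struct Param {
    std::string file;       // path to the input file
    std::size_t chunkSize;  // number of features per chunk
};

// Hypothetical stand-in for DataChunk: a column-major matrix plus feature IDs.
struct DataChunk {
    std::vector<double> data;           // nSamples x nFeatures, column-major
    std::vector<std::string> features;  // one ID per column
};

// Hypothetical stand-in for the GenomicDataStream base class.
class GenomicDataStream {
public:
    explicit GenomicDataStream(Param p) : param(std::move(p)) {}
    virtual ~GenomicDataStream() = default;
    // Fill `chunk` with the next block of features; return false at end of data.
    virtual bool getNextChunk(DataChunk &chunk) = 0;

protected:
    Param param;
};

// Stream for the imaginary file type X. A real implementation would parse
// param.file; here the values are dummies generated on the fly.
class Xstream : public GenomicDataStream {
public:
    Xstream(const Param &p, std::size_t samples, std::size_t features)
        : GenomicDataStream(p), nSamples(samples), nFeatures(features) {}

    bool getNextChunk(DataChunk &chunk) override {
        if (next >= nFeatures) return false;
        // Guard against chunkSize == 0, which would never advance.
        const std::size_t step = std::max<std::size_t>(param.chunkSize, 1);
        const std::size_t end = std::min(next + step, nFeatures);
        chunk.data.clear();
        chunk.features.clear();
        for (std::size_t j = next; j < end; ++j) {
            chunk.features.push_back("feature_" + std::to_string(j));
            // Each feature occupies one contiguous column of nSamples values.
            for (std::size_t i = 0; i < nSamples; ++i)
                chunk.data.push_back(static_cast<double>(j));
        }
        next = end;
        return true;
    }

private:
    std::size_t nSamples;
    std::size_t nFeatures;
    std::size_t next = 0;  // index of the first feature not yet returned
};

}  // namespace sketch
```

A caller would construct the stream once and then loop with `while (stream.getNextChunk(chunk)) { ... }`, handling one block of features per iteration.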
A DataChunk stores information about each feature and sample in a class inheriting from DataInfo. Streams for genotype data store feature information in the VariantInfo class, while DelayedStream uses MatrixInfo. These classes or a custom class can be used to store information for a new file type.
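As a rough illustration of the custom-class option, the sketch below derives an info class for the imaginary file type X. The DataInfo base shown here is a simplified, hypothetical stand-in, and the names XInfo, append, getFeatureName and getQuality are invented for the example; the real base class may expose a different interface.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

namespace sketch {

// Hypothetical stand-in for DataInfo: stores one ID per feature.
class DataInfo {
public:
    virtual ~DataInfo() = default;
    std::size_t size() const { return ids.size(); }
    const std::string &getFeatureName(std::size_t i) const { return ids.at(i); }

protected:
    std::vector<std::string> ids;
};

// Custom info class for file type X: adds a per-feature quality score.
class XInfo : public DataInfo {
public:
    void append(std::string id, double quality) {
        ids.push_back(std::move(id));
        qualities.push_back(quality);
    }
    double getQuality(std::size_t i) const { return qualities.at(i); }

private:
    std::vector<double> qualities;  // kept parallel to `ids`
};

}  // namespace sketch
```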
While nothing else about the underlying data format is assumed, GenomicDataStream is designed to read chunks of features rather than chunks of samples. Features are stored as columns in the matrix returned to R/C++, independent of the underlying data storage format.
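Because features are columns and the data are column-major, the values of one feature across all samples sit in one contiguous block. The sketch below assumes a chunk held in a plain std::vector<double>; the helper names featureColumn and featureMean are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Pointer to the first value of feature j in a column-major chunk
// with nSamples rows. Element (i, j) lives at index i + j * nSamples.
inline const double *featureColumn(const std::vector<double> &chunk,
                                   std::size_t nSamples, std::size_t j) {
    assert((j + 1) * nSamples <= chunk.size());
    return chunk.data() + j * nSamples;
}

// Mean of feature j over all samples, reading one contiguous block.
inline double featureMean(const std::vector<double> &chunk,
                          std::size_t nSamples, std::size_t j) {
    assert(nSamples > 0);
    const double *col = featureColumn(chunk, nSamples, j);
    double sum = 0.0;
    for (std::size_t i = 0; i < nSamples; ++i) sum += col[i];
    return sum / static_cast<double>(nSamples);
}
```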
Additional matrix types
First, take a look at the currently supported data types, and note that each of these types wraps an array of double storing data in column-major order. In fact, the constructors for the dense matrix types in Eigen and Armadillo return objects that point to the original double array, without allocating new memory. So any new data type should have a constructor that takes a double array.
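A new matrix type that follows this pattern could look like the sketch below. MatrixView is a hypothetical class written for this example, not part of the package; it plays the same role as Eigen::Map or Armadillo's auxiliary-memory constructor by wrapping an existing double array without copying it.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical matrix type: a non-owning, column-major view over a
// double array. The caller keeps ownership of the memory, and the
// array must outlive the view.
class MatrixView {
public:
    MatrixView(double *data, std::size_t nRows, std::size_t nCols)
        : ptr(data), rows(nRows), cols(nCols) {}

    // Element (i, j) is stored at offset i + j * rows.
    double &operator()(std::size_t i, std::size_t j) {
        assert(i < rows && j < cols);
        return ptr[i + j * rows];
    }

    std::size_t n_rows() const { return rows; }
    std::size_t n_cols() const { return cols; }
    const double *data() const { return ptr; }

private:
    double *ptr;
    std::size_t rows;
    std::size_t cols;
};
```

Writes through the view modify the original array, which is the behavior to expect from any wrapper that avoids allocating new memory.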
The simplest new type to implement would be a float version of an existing type. This could reduce memory usage in downstream analyses and increase speed at the cost of numerical precision.
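One detail worth noting for a float type: it cannot point at the original double array, because the element size differs, so the chunk has to be converted with a copy. The sketch below shows that conversion on a plain std::vector; the function name toFloat is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Convert a chunk of doubles to floats. This allocates a new buffer
// (typically half the size in bytes) and rounds each value to the
// nearest representable float.
inline std::vector<float> toFloat(const std::vector<double> &x) {
    std::vector<float> out(x.size());
    std::transform(x.begin(), x.end(), out.begin(),
                   [](double v) { return static_cast<float>(v); });
    return out;
}
```

Values such as 0.5 survive the conversion exactly, while values such as 0.1 pick up a larger rounding error than they have in double precision.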
Importing into other projects
GenomicDataStream is written so that core functions are in C++17 with no dependence on R or Rcpp. On top of that, there is a thin wrapper that uses Rcpp to interface between R and the lower-level library.
The C++ code is divided into two sections:
inst/include: header-only C++17 code with no R dependencies
src/export.cpp: Rcpp layer between the header-only library and the R interface
GenomicDataStream provides flexibility in terms of data input types and matrix libraries. This can be useful in many cases, but the large number of dependencies can require installation of additional libraries and increase compile times. Some of these dependencies can be avoided by removing support for some capabilities with compiler flags in Makevars:
-D DISABLE_DELAYED_STREAM: Omit the DelayedStream class, and remove dependence on Rcpp and beachmat
-D DISABLE_EIGEN: Omit support for the Eigen matrix library, and remove dependence on RcppEigen and Eigen
-D DISABLE_RCPP: Omit support for the Rcpp matrix library, and remove dependence on Rcpp
-D DISABLE_PLINK: Omit support for PLINK files (PGEN, BED), and remove dependence on pgenlibr
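Flags passed with -D define preprocessor macros, so each capability can be compiled out with a guard around the relevant code. The sketch below illustrates the mechanism only: it assumes the guards take the form #ifndef DISABLE_..., and the function enabledBackends is hypothetical. In an R package such flags are commonly added to the PKG_CPPFLAGS line of Makevars, though the variable this package uses is not stated here.

```cpp
#include <string>
#include <vector>

// List the capabilities that were compiled in. Each block disappears
// entirely when the matching -D DISABLE_... flag is given to the compiler,
// so the headers it would include are never needed.
inline std::vector<std::string> enabledBackends() {
    std::vector<std::string> out;
#ifndef DISABLE_DELAYED_STREAM
    out.push_back("DelayedStream");
#endif
#ifndef DISABLE_EIGEN
    out.push_back("Eigen");
#endif
#ifndef DISABLE_RCPP
    out.push_back("Rcpp");
#endif
#ifndef DISABLE_PLINK
    out.push_back("PLINK");
#endif
    return out;
}
```

Compiling with, for example, -D DISABLE_EIGEN would drop the "Eigen" entry, and in real code the guarded region would also contain the corresponding #include lines.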