
GenomicDataStream is designed to stream chunks of features rather than chunks of samples. Features are stored as columns in the matrix returned by the R and C++ interfaces, independent of the underlying data storage format.
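To make the layout concrete, here is a minimal base R sketch of feature-wise chunking. It does not use GenomicDataStream itself. The matrix `X`, the feature names, and the `chunks` list are hypothetical stand-ins for illustration. Each chunk keeps every sample (row) and only a subset of features (columns).

```r
# Toy matrix: 6 samples (rows) x 12 features (columns); simulated, not real data
set.seed(1)
X <- matrix(rnorm(6 * 12), nrow = 6, ncol = 12,
            dimnames = list(paste0("I", 1:6), paste0("snp", 1:12)))

chunkSize <- 5

# Group column indices into consecutive chunks of at most chunkSize features
idx <- split(seq_len(ncol(X)), ceiling(seq_len(ncol(X)) / chunkSize))

# Each chunk contains all samples and a slice of the features
chunks <- lapply(idx, function(j) X[, j, drop = FALSE])

# Rows stay constant across chunks; only the number of columns varies
chunkDims <- sapply(chunks, dim)
print(chunkDims)
```

With 12 features and a chunk size of 5, the last chunk is smaller than the others, so downstream code should not assume a fixed number of columns per chunk.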

Usage

GenomicDataStreamRegression implements regression models (linear and GLMs) that stream chunks of features using the GenomicDataStream interface. In general, variants from genetic data are used as covariates in lmFitFeatures(), and genes from single cell data are used as responses in lmFitResponses().
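The two roles a feature can play are sketched below in base R, using `lm.fit()` from the stats package on simulated data. This is only an illustration of the models, assuming an intercept-only design. It is not how lmFitFeatures() or lmFitResponses() are implemented, and the matrices `X` and `Y` are hypothetical.

```r
set.seed(1)
n <- 60
y <- rnorm(n)
design <- matrix(1, n, 1)   # intercept-only design matrix

# Simulated genotype-like matrix: 5 variants as columns (hypothetical data)
X <- matrix(rbinom(n * 5, 2, 0.3), n, 5)
colnames(X) <- paste0("snp", 1:5)

# Feature as covariate: fit y ~ design + X[, j] separately for each variant j
betaFeatures <- apply(X, 2, function(x) {
  fit <- lm.fit(cbind(design, x), y)
  unname(fit$coefficients[2])   # coefficient for the variant
})

# Feature as response: fit Y[, j] ~ design for each gene j
# Simulated expression-like matrix: 3 genes as columns (hypothetical data)
Y <- matrix(rnorm(n * 3), n, 3)
colnames(Y) <- paste0("gene", 1:3)
betaResponses <- lm.fit(design, Y)$coefficients   # one column per gene
```

In the covariate case the design is shared and one extra column changes per fit. In the response case the design is fixed and only the left-hand side changes, which is why a single call can handle all columns of `Y` at once.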

Example code with R

Read genotype data into R
library(GenomicDataStream) 
library(GenomicDataStreamRegression) 

# VCF file
file <- system.file("extdata", "test.vcf.gz", package = "GenomicDataStream")

# initialize 
gds <- GenomicDataStream(file, "DS", chunkSize=5, initialize=TRUE)

n <- 60
y <- rnorm(n)
design <- matrix(1, n, 1)
rownames(design) <- paste0("I", seq(n))

# loop until break
while( 1 ){

  # get data chunk
  # dat$X: matrix with features as columns
  # dat$info: information about each feature, one row per feature
  dat <- getNextChunk(gds)

  # check if end of stream 
  if( atEndOfStream(gds) ) break
  
  # do analysis on this chunk of data
  fit <- lmFitFeatures(y, design, dat$X)
}
Use R to run analysis at C++ level
library(GenomicDataStream) 
library(GenomicDataStreamRegression) 

# VCF file
file <- system.file("extdata", "test.vcf.gz", package = "GenomicDataStream")

# create object, but don't read data yet
# use the DS field, which stores dosage
gds <- GenomicDataStream(file, "DS", chunkSize=5)

n <- 60
y <- rnorm(n)
design <- matrix(1, n, 1)
rownames(design) <- paste0("I", seq(n))

# regression of y ~ design + X[,j]
#   where X[,j] is the jth variant in the GenomicDataStream
# data in GenomicDataStream is only accessed at C++ level 
fit <- lmFitFeatures(y, design, gds)
## preprojection: 1

Session info

## R version 4.5.1 (2025-06-13)
## Platform: aarch64-apple-darwin23.6.0
## Running under: macOS Sonoma 14.7.1
## 
## Matrix products: default
## BLAS/LAPACK: /opt/homebrew/Cellar/openblas/0.3.33/lib/libopenblasp-r0.3.33.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] GenomicDataStreamRegression_0.99.0 GenomicDataStream_0.99.0          
## 
## loaded via a namespace (and not attached):
##   [1] tidyselect_1.2.1            dplyr_1.2.1                 farver_2.1.2               
##   [4] S7_0.2.2                    fastmap_1.2.0               SingleCellExperiment_1.32.0
##   [7] digest_0.6.39               lifecycle_1.0.5             statmod_1.5.1              
##  [10] magrittr_2.0.5              compiler_4.5.1              progress_1.2.3             
##  [13] rlang_1.2.0                 sass_0.4.10                 tools_4.5.1                
##  [16] yaml_2.3.12                 knitr_1.51                  prettyunits_1.2.0          
##  [19] S4Arrays_1.10.1             htmlwidgets_1.6.4           reticulate_1.46.0          
##  [22] DelayedArray_0.36.1         RColorBrewer_1.1-3          abind_1.4-8                
##  [25] HDF5Array_1.38.0            withr_3.0.2                 purrr_1.2.2                
##  [28] BiocGenerics_0.56.0         desc_1.4.3                  grid_4.5.1                 
##  [31] stats4_4.5.1                beachmat_2.26.0             Rhdf5lib_1.32.0            
##  [34] ggplot2_4.0.3               scales_1.4.0                MASS_7.3-65                
##  [37] dichromat_2.0-0.1           SummarizedExperiment_1.40.0 cli_3.6.6                  
##  [40] crayon_1.5.3                rmarkdown_2.31              reformulas_0.4.4           
##  [43] ragg_1.5.2                  generics_0.1.4              otel_0.2.0                 
##  [46] RcppParallel_5.1.11-2       fastglmm_0.4.6              minqa_1.2.8                
##  [49] cachem_1.1.0                rhdf5_2.54.1                stringr_1.6.0              
##  [52] splines_4.5.1               parallel_4.5.1              XVector_0.50.0             
##  [55] matrixStats_1.5.0           vctrs_0.7.3                 boot_1.3-32                
##  [58] Matrix_1.7-5                jsonlite_2.0.0              carData_3.0-6              
##  [61] car_3.1-5                   hms_1.1.4                   IRanges_2.44.0             
##  [64] S4Vectors_0.48.1            pbmcapply_1.5.1             Formula_1.2-5              
##  [67] systemfonts_1.3.2           h5mread_1.2.1               limma_3.66.0               
##  [70] beachmat.hdf5_1.8.0         jquerylib_0.1.4             glue_1.8.1                 
##  [73] nloptr_2.2.1                pkgdown_2.2.0               codetools_0.2-20           
##  [76] stringi_1.8.7               gtable_0.3.6                GenomicRanges_1.62.1       
##  [79] lme4_2.0-1                  tibble_3.3.1                pillar_1.11.1              
##  [82] BatchRegression_0.0.21      htmltools_0.5.9             Seqinfo_1.0.0              
##  [85] rhdf5filters_1.22.0         R6_2.6.1                    Rdpack_2.6.6               
##  [88] textshaping_1.0.5           evaluate_1.0.5              lattice_0.22-9             
##  [91] Biobase_2.70.0              rbibutils_2.4.1             png_0.1-9                  
##  [94] bslib_0.10.0                Rcpp_1.1.1-1.1              nlme_3.1-169               
##  [97] SparseArray_1.10.10         anndataR_1.1.3              xfun_0.57                  
## [100] fs_2.1.0                    MatrixGenerics_1.22.0       pkgconfig_2.0.3
