Skip to contents

Read genomic data files (VCF, BCF, BGEN, h5ad) into R/Rcpp in chunks for analysis with Armadillo or Eigen libraries

Interface to GenomicDataStream C++ code

Usage

GenomicDataStream(
  file,
  field = "",
  region = "",
  samples = "-",
  MAF = 0,
  minVariance = 0,
  chunkSize = 10000,
  missingToMean = TRUE,
  initialize = FALSE
)

Arguments

file

file in VCF/BCF/BGEN/PGEN format with index

field

field of VCF/BCF to read. Ignored for other file types

region

target in the format chr2:1-12345. Multiple regions can be separated by one of ",\n\t", for example "chr2:1-12345, chr3:1000-8000". Setting region to "" includes all variants

samples

string of comma separated sample IDs to extract: "ID1,ID2,ID3". "-" indicates all samples

MAF

minor allele frequency filter applied to variants with max value <= 2

minVariance

minimum variance filter applied to variants with max value < 2

chunkSize

number of variants to return per chunk

missingToMean

if true, set missing values to the mean dosage value. if false, set to NaN

initialize

default FALSE. If TRUE, file info is read from path, otherwise store path until GenomicDataStream is initialized later

Value

object of class GenomicDataStream

Details

Consider minor allele frequency (MAF) \(f\) and Hardy-Weinberg equilibrium, the allelic states have probability \(f^2, 2f(1-f), (1-f)^2\). If the variant has mean \(\mu\) and variance \(\sigma^2\), MAF can be estimated from the mean as \(min(\mu/2, 1 - \mu/2)\) or from the variance as \(p = 1+sqrt(1-2\sigma^2))/2\) and MAF = \(min(p, 1-p)\). In addition the sample variance of the variant is \(2(1-f)f\). Therefore, setting a MAF cutoff corresponds to a variance cutoff that can also apply to multi-allelic variants.

Examples

file <- system.file("extdata", "test.vcf.gz", package = "GenomicDataStream")

# initialize
obj <- GenomicDataStream(file, "DS", chunkSize = 5, initialize = TRUE)

obj
#> 		 GenomicDataStream 
#> 
#>   file:          test.vcf.gz 
#>   initialized:   TRUE 
#>   stream type:   vcf.gz 
#>   field:         DS 
#>   region:         
#>   samples:       60 
#>   MAF:           0 
#>   minVar cutoff: 0 
#>   missingToMean: TRUE 
#>   chunkSize:     5 
#>   features read: 0 
#>   end of stream: FALSE 

# loop until break
while (1) {
  # get data chunk
  # data$X matrix with features as columns
  # data$info information about each feature as rows
  dat <- getNextChunk(obj)

  if (atEndOfStream(obj)) break

  print(dat$info)
}
#>   CHROM   POS          ID A1 A2
#> 1     1 10000 1:10000:C:A  C  A
#> 2     1 11000 1:11000:T:C  T  C
#> 3     1 12000 1:12000:T:C  T  C
#> 4     1 13000 1:13000:T:C  T  C
#> 5     1 14000 1:14000:G:C  G  C
#>   CHROM   POS          ID A1 A2
#> 1     1 15000 1:15000:A:C  A  C
#> 2     1 16000 1:16000:G:A  G  A
#> 3     1 17000 1:17000:C:A  C  A
#> 4     1 18000 1:18000:C:G  C  G
#> 5     1 19000 1:19000:T:G  T  G