Impute many z-statistics — impute

Impute many z-statistics given observed z-statistics and a reference panel.

Usage

impute_region(
  df,
  gds,
  region,
  flankWidth,
  method = c("decorrelate", "Ledoit-Wolf", "OAS", "Touloumis", "Schafer-Strimmer"),
  lambda = NULL,
  ...
)

Arguments

df: data.frame with columns ID, z, GWAS_A1, GWAS_A2, CHROM, POS, REF_A1, REF_A2.
gds: GenomicDataStream of reference panel
region: genomic region to impute
flankWidth: additional window added to region
method: method used to estimate the shrinkage parameter lambda. Default is "decorrelate".
lambda: (default: NULL) value used to shrink correlation matrix. Only used if method is "decorrelate"
...: additional arguments passed to imputez() and imputezDecorr()
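
To make the role of lambda concrete, the sketch below applies linear shrinkage of a correlation matrix toward the identity. This is an assumption about the general form of the shrinkage, not a statement of what impute_region() does internally; shrink_cor() is a hypothetical helper and the 3 x 3 LD matrix is toy data.

```r
# Hypothetical helper (not part of the package): linear shrinkage of a
# correlation matrix toward the identity, assuming the form
#   (1 - lambda) * R + lambda * I
shrink_cor <- function(R, lambda) {
  stopifnot(lambda >= 0, lambda <= 1)
  (1 - lambda) * R + lambda * diag(nrow(R))
}

# Toy LD (correlation) matrix for 3 variants
R <- matrix(c(1.0, 0.8, 0.6,
              0.8, 1.0, 0.7,
              0.6, 0.7, 1.0), nrow = 3, byrow = TRUE)

# lambda = 0.5 halves the off-diagonal entries; the diagonal stays 1
R_half <- shrink_cor(R, 0.5)

# lambda = 1 gives the identity, i.e. no LD information is retained
R_full <- shrink_cor(R, 1)
```

Under this form, larger values of lambda make the matrix better conditioned but carry less information from neighbouring variants.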

Value

tibble storing imputed results:

ID: variant identifier
A1, A2: alleles of the variant
z.stat: imputed z-statistic
se: standard error of imputed z-statistic
r2.pred: metric of accuracy of the imputed z-statistic based on its variance
lambda: shrinkage parameter
maf: minor allele frequency in reference panel
nVariants: number of variants used in imputation
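
As a rough guide to how these quantities relate, the sketch below imputes a single z-statistic with the standard conditional-Gaussian formula, given an LD matrix and the observed z-statistics. It assumes this common formulation and may differ from the package's actual implementation; impute_one() is a hypothetical helper, the data are toy values, and the output names only mirror the columns shown in the example output.

```r
# Hypothetical helper (not part of the package): impute the z-statistic of
# one target variant from the observed variants, assuming
#   z_hat = R_to %*% solve(R_oo) %*% z_obs
impute_one <- function(R, z_obs, target) {
  obs <- setdiff(seq_len(nrow(R)), target)
  # Imputation weights: solve(R_oo) %*% R_ot
  w <- solve(R[obs, obs], R[obs, target])
  z_hat <- sum(w * z_obs)
  # Variance of the imputed statistic: R_to %*% solve(R_oo) %*% R_ot
  r2 <- sum(R[target, obs] * w)
  c(z.stat = z_hat, se = sqrt(r2), r2.pred = r2)
}

# Toy LD matrix; variant 2 is unobserved
R <- matrix(c(1.0, 0.8, 0.6,
              0.8, 1.0, 0.7,
              0.6, 0.7, 1.0), nrow = 3, byrow = TRUE)

# z-statistics for variants 1 and 3, in that order
res <- impute_one(R, z_obs = c(2.0, 1.0), target = 2)
```

In this formulation the variance of the imputed statistic lies between 0 and 1 and doubles as the accuracy metric. When the LD matrix is close to the identity, both the imputed z-statistic and the accuracy metric approach zero.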

Examples

library(GenomicDataStream)
library(mvtnorm)
library(dplyr)
#> 
#> Attaching package: ‘dplyr’
#> The following objects are masked from ‘package:stats’:
#> 
#>     filter, lag
#> The following objects are masked from ‘package:base’:
#> 
#>     intersect, setdiff, setequal, union

# VCF file for reference
file <- system.file("extdata", "test.vcf.gz", package = "GenomicDataStream")

# initialize data stream
gds <- GenomicDataStream(file, "DS", initialize=TRUE)

# read genotype data from reference
dat <- getNextChunk(gds)

# simulate z-statistics with correlation structure
# from the LD of the reference panel
set.seed(1)
z <- c(rmvnorm(1, rep(0, 10), cor(dat$X)))

# Combine z-statistics with variant ID, position, etc
df <- dat$info %>%
    mutate(z = z, GWAS_A1 = A1, GWAS_A2 = A2) %>%
    rename(REF_A1 = A1, REF_A2 = A2)

# Given observed z-statistics and a GenomicDataStream of the reference panel,
# impute z-statistics for variants that lack them.
# Here, drop variant 2 and then impute its z-statistic
# within the given region.
region <- "1:1000-100000"
impute_region(df[-2,], gds, region, 1000)
#> # A tibble: 1 × 9
#>   ID          A1    A2         z.stat        se  r2.pred lambda    maf nVariants
#>   <chr>       <chr> <chr>       <dbl>     <dbl>    <dbl>  <dbl>  <dbl>     <int>
#> 1 1:11000:T:C T     C     -0.00000550 0.0000291 8.45e-10  1.000 0.0159         9