Impute many z-statistics given observed z-statistics and a reference panel
Usage
impute_region(
  df,
  gds,
  region,
  flankWidth,
  method = c("decorrelate", "Ledoit-Wolf", "OAS", "Touloumis", "Schafer-Strimmer"),
  lambda = NULL,
  ...
)
Arguments
- df
data.frame with columns ID, z, GWAS_A1, GWAS_A2, CHROM, POS, REF_A1, REF_A2
- gds
GenomicDataStream of reference panel
- region
genomic region to impute
- flankWidth
additional window added to region
- method
method used to estimate the shrinkage parameter lambda. Default is "decorrelate"
- lambda
(default: NULL) value used to shrink the correlation matrix. Only used if method is "decorrelate"
- ...
additional arguments passed to imputez() and imputezDecorr()
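As a minimal sketch of the input layout described for df, the block below builds a small data.frame with the listed columns and checks that none are missing. The variant IDs, z-statistics, alleles, and column types (character CHROM, integer POS) are made up for illustration and are assumptions, not requirements stated by the package.

```r
# Hypothetical summary statistics for three variants (all values are made up)
df <- data.frame(
  ID      = c("1:10000:C:A", "1:11000:T:C", "1:12000:T:C"),
  z       = c(1.2, -0.4, 2.1),
  GWAS_A1 = c("C", "T", "T"),
  GWAS_A2 = c("A", "C", "C"),
  CHROM   = c("1", "1", "1"),           # assumed to be character
  POS     = c(10000L, 11000L, 12000L),  # assumed to be integer
  REF_A1  = c("C", "T", "T"),
  REF_A2  = c("A", "C", "C"),
  stringsAsFactors = FALSE
)

# Columns listed in the df argument description
required <- c("ID", "z", "GWAS_A1", "GWAS_A2",
              "CHROM", "POS", "REF_A1", "REF_A2")

# Stop early if any required column is absent
missing_cols <- setdiff(required, colnames(df))
stopifnot(length(missing_cols) == 0)
```

A check like this before calling impute_region() can surface a misnamed column early.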
Value
tibble storing imputed results:
- ID
variant identifier
- z.stat
imputed z-statistic
- sigSq
variance of imputed z-statistic
- r2.pred
metric of accuracy of the imputed z-statistic based on its variance
- lambda
shrinkage parameter
- maf
minor allele frequency in reference panel
- nVariants
number of variants used in imputation
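To make the returned quantities more concrete, here is a rough base R sketch of the standard Gaussian conditional-expectation approach to z-statistic imputation with a shrunken correlation matrix. It is not the package's implementation: the toy correlation matrix, the linear shrinkage toward the identity, and the mapping of `z_imp` and `var_imp` onto z.stat and sigSq are all assumptions made for illustration.

```r
# Toy LD (correlation) matrix for 3 variants; values are made up
C <- matrix(c(1.0, 0.8, 0.5,
              0.8, 1.0, 0.6,
              0.5, 0.6, 1.0), nrow = 3, byrow = TRUE)

# Assumed shrinkage form: linear shrinkage toward the identity matrix
lambda <- 0.1
C_shrunk <- (1 - lambda) * C + lambda * diag(3)

obs <- c(1, 3)         # variants with observed z-statistics
mis <- 2               # variant to impute
z_obs <- c(1.5, 0.7)   # hypothetical observed z-statistics

# Imputation weights: w = C_oo^{-1} C_om
w <- solve(C_shrunk[obs, obs], C_shrunk[obs, mis])

# Imputed z-statistic: weighted sum of the observed z-statistics
z_imp <- sum(w * z_obs)

# Variance of the imputed z-statistic: w' C_oo w = C_mo C_oo^{-1} C_om
var_imp <- sum(w * C_shrunk[mis, obs])
```

Under this formulation, larger values of lambda pull the weights toward zero, so both the imputed z-statistic and its variance shrink toward zero.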
Examples
library(GenomicDataStream)
library(mvtnorm)
library(dplyr)
#>
#> Attaching package: ‘dplyr’
#> The following objects are masked from ‘package:stats’:
#>
#> filter, lag
#> The following objects are masked from ‘package:base’:
#>
#> intersect, setdiff, setequal, union
# VCF file for reference
file <- system.file("extdata", "test.vcf.gz", package = "GenomicDataStream")
# initialize data stream
gds <- GenomicDataStream(file, "DS", initialize=TRUE)
# read genotype data from reference
dat <- getNextChunk(gds)
# simulate z-statistics with correlation structure
# from the LD of the reference panel
set.seed(1)
z <- c(rmvnorm(1, rep(0, 10), cor(dat$X)))
# Combine z-statistics with variant ID, position, etc
df <- dat$info %>%
  mutate(z = z, GWAS_A1 = A1, GWAS_A2 = A2) %>%
  rename(REF_A1 = A1, REF_A2 = A2)
# Given observed z-statistics and a
# GenomicDataStream of the reference panel,
# impute z-statistics for variants missing z-statistics.
# Here, drop variant 2 and then impute its z-statistic
# Impute variants in the given region
region <- "1:1000-100000"
impute_region(df[-2,], gds, region, 1000)
#> # A tibble: 1 × 9
#>   ID          A1    A2         z.stat        se  r2.pred lambda    maf nVariants
#>   <chr>       <chr> <chr>       <dbl>     <dbl>    <dbl>  <dbl>  <dbl>     <int>
#> 1 1:11000:T:C T     C     -0.00000550 0.0000291 8.45e-10  1.000 0.0159         9
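In the output above, r2.pred is close to zero, which suggests the imputed value carries little information in this toy example. One possible follow-up, sketched below, is to keep only variants whose r2.pred exceeds a cutoff. The mock results table and the cutoff `r2_min` are hypothetical and chosen only for illustration; the package does not prescribe a threshold here.

```r
# Mock results using some of the columns listed under Value (values are made up)
res <- data.frame(
  ID      = c("1:11000:T:C", "1:12000:T:C", "1:13000:C:T"),
  z.stat  = c(-0.2, 1.7, 0.9),
  r2.pred = c(0.05, 0.92, 0.71),
  maf     = c(0.016, 0.21, 0.35),
  stringsAsFactors = FALSE
)

# Hypothetical accuracy cutoff; pick a value suited to the analysis
r2_min <- 0.6

# Keep only variants whose predicted accuracy meets the cutoff
res_keep <- res[res$r2.pred >= r2_min, ]
```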