Add SNP data to a GRN object and associate SNPs to peaks.

This function accepts a vector of SNP IDs (rsID), retrieves their genomic positions and overlaps them with the peaks to extend the peak metadata (`GRN@data$peaks$counts_metadata`) by storing the number, positions and rsids of all overlapping SNPs per peak (new columns starting with `SNP_`). Optionally, SNPs in LD with the user-provided SNPs can be identified using the LDlinkR package. Note that only SNPs in LD are associated with a peak for those SNPs directly overlapping a peak. That is, if a user-provided SNP does not overlap with any peak, neither the SNP itself nor any of the SNPs in LD will be associated with any peak, even if an LD SNP overlaps another peak. The results of are stored in GRN@annotation$SNPs (full, unfiltered table) and GRN@annotation$SNPs_filtered (filtered table), and rapid re-filtering is possible without re-querying the database (time-consuming)

addSNPData(
  GRN,
  SNP_IDs,
  EnsemblVersion = NULL,
  add_SNPs_LD = FALSE,
  requeryLD = FALSE,
  population = "CEU",
  r2d = "r2",
  token = NULL,
  filter = "R2 > 0.8",
  forceRerun = FALSE
)

Arguments

GRN: Object of class GRN
SNP_IDs: Character vector. No default. Vector of SNP IDs (rsID) that should be integrated and overlapped with the peaks.
EnsemblVersion: NULL or Character(1). Default NULL. Only relevant if source is not set to custom, ignored otherwise. The Ensembl version to use for the retrieval of gene IDs from their provided database names (e.g., JASPAR) via biomaRt. By default (NULL), the newest version is selected for the most recent genome assembly versions is used (see biomaRt::listEnsemblArchives() for supported versions). This parameter can override this to use a custom (older) version instead.
add_SNPs_LD: TRUE or FALSE. Default FALSE. Should SNPs in LD with any of the user-provided SNPs that overlap a peak be identified and added to the peak? If set to TRUE, LDlinkR::LDproxy_batch will be used to identify SNPs in LD based on the user-provided SNP_IDs argument, a specific (set of) populations (argument population) and a value for r2d. The full table (stored in GRN@annotation$SNPs) is then subsequently filtered, see also the filter argument.
requeryLD: TRUE or FALSE. Default FALSE. Only applicable if add_SNPs_LD = TRUE and ignored otherwise. Should LDlinkR::LDproxy_batch be re-executed if already present? As this is very time-consuming, querying the database is only performed if this parameter is set to TRUE or if GRN@annotation$SNPs is missing.
population: Character vector. Default CEU. Only applicable if add_SNPs_LD = TRUE and ignored otherwise. Population code(s) from the 1000 Genomes project to be used for LDlinkR::LDproxy_batch. Multiple codes are allowed. For all valid codes, see https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/README_populations.md
r2d: r2 or d. Default r2. Only applicable if add_SNPs_LD = TRUE and ignored otherwise. See the help of the function LDlinkR::LDproxy_batch for more details.
token: Character or NULL. Default NULL. Only applicable if add_SNPs_LD = TRUE and ignored otherwise. LDlink provided user token. Register for token at https://ldlink.nih.gov/?tab=apiaccess. Has to be done only once and is very simple and straight forward but unfortunately necessary. An example, non-functional token is 2c49a3b54g04.
filter: Character. Default R2 > 0.8. Only applicable if add_SNPs_LD = TRUE and ignored otherwise. Filter criteria for the output table as generated by LDlinkR::LDproxy_batch. dplyr::filter style is used to specify filters, and multiple filtering criteria can be used (e.g., R2 > 0.8 & MAF > 0.01). The filtered table is stored in GRN@annotation$SNPs_filtered. Note that re-filtering is quick without the need to re-query the database unless requeryLD = TRUE.
forceRerun: TRUE or FALSE. Default FALSE. Force execution, even if the GRN object already contains the result. Overwrites the old results.

Value

An updated GRN object, with additional information added from this function.

Details

`biomaRt` is used to retrieve genomic positions for the user-defined SNPs, which can take a long time depending on the number of SNPs provided. Similarly, querying the LDlink servers may take a long time.

Examples

# See the Workflow vignette on the GRaNIE website for examples
GRN = loadExampleObject()
#> Downloading GRaNIE example object from https://git.embl.de/grp-zaugg/GRaNIE/-/raw/master/data/GRN.rds
#> Finished successfully. You may explore the example object. Start by typing the object name to the console to see a summaty. Happy GRaNIE'ing!
GRN = addSNPData(GRN, SNP_IDs = c("rs7570219", "rs6445264", "rs12067275"), forceRerun = FALSE)
#> INFO [2024-04-04 17:34:10]  Not retrieving SNPS in LD, use only user-provided SNPs
#> INFO [2024-04-04 17:34:10]  Quering biomaRt for 3 rsid annotation, this may take a while, particularly if the number of provided IDs is large
#> INFO [2024-04-04 17:34:21]  Retrieving BioMart database succeeded
#> INFO [2024-04-04 17:34:22]  Retrieving genome annotation succeeded
#> INFO [2024-04-04 17:34:22]  Retrieved annotation for a total of 3 SNPs out of 3 that were provided
#> INFO [2024-04-04 17:34:24]  Overlapping peaks and SNPs
#> INFO [2024-04-04 17:34:24]  Added SNP information to GRN@data$peaks$counts_metadata
#> INFO [2024-04-04 17:34:24]  Statistics: 
#> INFO [2024-04-04 17:34:24]   Number of peaks with 0 SNPs associated: 74997
#> INFO [2024-04-04 17:34:24]   Number of peaks with 1 SNPs associated: 3
#> INFO [2024-04-04 17:34:24] Finished successfully. Execution time: 14.6 secs