This is one of the main integrative functions of the GRaNIE package. It has two main functions: First, filtering both TF-peak and peak-gene connections according to different criteria such as FDR and other properties Second, joining the three major elements that an eGRN consist of (TFs, peaks, genes) into one data frame, with one row per unique TF-peak-gene connection. After successful execution, the connections (along with additional feature metadata) can be retrieved with the function getGRNConnections. Note that a previously stored eGRN graph is reset upon successful execution of this function along with printing a descriptive warning, and re-running the function build_eGRN_graph is necessary when any of the network functions of the package shall be executed. If the filtered connections changed, all network related enrichment functions also have to be rerun. Internally, before joining them, both TF-peak links and peak-gene connections are filtered separately for reasons of memory and computational efficacy: First filtering out unwanted links dramatically reduces the memory needed for the full eGRN. Peak-gene p-value adjustment is only done after all filtering steps on the remaining set of connections to lower the statistical burden of multiple-testing adjustment; therefore, this may lead to initially counter-intuitive effects such as a particular connections not being included anymore as compared to a filtering based on different thresholds, or the FDR being different for the same reason.

filterGRNAndConnectGenes(
  GRN,
  TF_peak.fdr.threshold = 0.2,
  TF_peak.connectionTypes = "all",
  peak.SNP_filter = list(min_nSNPs = 0, filterType = "orthogonal"),
  peak_gene.p_raw.threshold = NULL,
  peak_gene.fdr.threshold = 0.2,
  peak_gene.fdr.method = "BH",
  peak_gene.IHW.covariate = NULL,
  peak_gene.IHW.nbins = "auto",
  outputFolder = NULL,
  gene.types = c("protein_coding"),
  allowMissingTFs = FALSE,
  allowMissingGenes = TRUE,
  peak_gene.r_range = c(0, 1),
  peak_gene.selection = "all",
  peak_gene.maxDistance = NULL,
  filterTFs = NULL,
  filterGenes = NULL,
  filterPeaks = NULL,
  TF_peak_FDR_selectViaCorBins = FALSE,
  filterLoops = TRUE,
  resetGraphAndStoreInternally = TRUE,
  silent = FALSE,
  forceRerun = FALSE
)

Arguments

GRN

Object of class GRN

TF_peak.fdr.threshold

Numeric[0,1]. Default 0.2. Maximum FDR for the TF-peak links. Set to 1 or NULL to disable this filter.

TF_peak.connectionTypes

Character vector. Default all. TF-peak connection types to consider. The special keyword all denotes all connection types (e.g., expression and TFActivity) that are found in the GRN object. By default, only expression is present in the object, so all and expression are usually equivalent unless calculation of TF-peak links based on TF activity has also been enabled.

peak.SNP_filter

Named list. Default list(min_nSNPs = 0, filterType = "orthogonal"). Filters related to SNP data if they have been added with the function addSNPData, ignored otherwise. The named list must contain at least two elements: First, min_nSNPs, an integer >= 0 that denotes how many SNPs a peak has to overlap with at least to pass the filter or be considered for inclusion. Second, filterType, a character that must either be orthogonal or extra and denotes whether the SNP filter is orthogonal to the other filters (i.e, an alternative way of when a peak is considered for being kept) or whether the SNP filter is in addition to all other filters. For more help, see the Vignettes.

peak_gene.p_raw.threshold

Numeric[0,1]. Default NULL. Threshold for the peak-gene connections, based on the raw p-value. All peak-gene connections with a larger raw p-value will be filtered out.

peak_gene.fdr.threshold

Numeric[0,1]. Default 0.2. Threshold for the peak-gene connections, based on the FDR. All peak-gene connections with a larger FDR will be filtered out.

peak_gene.fdr.method

Character. Default "BH". One of: "holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none", "IHW". Method for adjusting p-values for multiple testing. If set to "IHW", the package IHW is required (as it is listed under Suggests, it may not be installed), and independent hypothesis weighting will be performed, and a suitable covariate has to be specified for the parameter peak_gene.IHW.covariate.

peak_gene.IHW.covariate

Character. Default NULL. Name of the covariate to use for IHW (column name from the table that is returned with the function getGRNConnections. Only relevant if peak_gene.fdr.method is set to "IHW". You have to make sure the specified covariate is suitable for IHW, see the diagnostic plots that are generated in this function for this. For many datasets, the peak-gene distance (called peak_gene.distance in the object) seems suitable.

peak_gene.IHW.nbins

Integer or "auto". Default "auto". Number of bins for IHW. Only relevant if peak_gene.fdr.method is set to "IHW".

outputFolder

Character or NULL. Default NULL. If set to NULL, the default output folder as specified when initiating the object in initializeGRN will be used. Otherwise, all output from this function will be put into the specified folder. If a folder is provided, while we recommend specifying an absolute path, a relative one also works.

gene.types

Character vector of supported gene types. Default c("protein_coding", "lincRNA"). Filter for gene types to retain, genes with gene types not listed here are filtered. The special keyword "all" indicates no filter and retains all gene types. The specified names must match the names as stored in the GRN object (see GRN@annotation$genes$gene.type) and correspond 1:1 to the gene type names as provided by biomaRt, with the exception of lncRNAs, which is internally renamed to lincRNAs when first fetching all gene types. This is done due to a recent change in biomaRt and aims at keeping backwards compatibility with GRN objects.

allowMissingTFs

TRUE or FALSE. Default FALSE. Should connections be returned for which the TF is NA (i.e., connections consisting only of peak-gene links?). If set to TRUE, this generally greatly increases the number of connections but it may not be what you aim for.

allowMissingGenes

TRUE or FALSE. Default TRUE. Should connections be returned for which the gene is NA (i.e., connections consisting only of TF-peak links?). If set to TRUE, this generally increases the number of connections.

peak_gene.r_range

Numeric(2). Default c(0,1). Filter for lower and upper limit for the peak-gene links. Only links will be retained if the correlation coefficient is within the specified interval. This filter is usually used to filter out negatively correlated peak-gene links.

peak_gene.selection

"all" or "closest". Default "all". Filter for the selection of genes for each peak. If set to "all", all previously identified peak-gene are used, while "closest" only retains the closest gene for each peak that is retained until the point the filter is applied.

peak_gene.maxDistance

Integer >0. Default NULL. Maximum peak-gene distance to retain a peak-gene connection.

filterTFs

Character vector. Default NULL. Vector of TFs (as named in the GRN object) to retain. All TFs not listed will be filtered out.

filterGenes

Character vector. Default NULL. Vector of gene IDs (as named in the GRN object) to retain. All genes not listed will be filtered out.

filterPeaks

Character vector. Default NULL. Vector of peak IDs (as named in the GRN object) to retain. All peaks not listed will be filtered out.

TF_peak_FDR_selectViaCorBins

TRUE or FALSE. Default FALSE. Use a modified procedure for selecting TF-peak links. Instead of selecting solely based on the user-specified FDR, this procedure first identifies the correlation bin closest to 0 that contains at least one significant TF-peak link according to the chosen TF_peak.fdr.threshold. This is done spearately for both FDR directions. It then retains all TF-peak links that have a correlation bin at least as extreme as the identified pair. For example, if the correlation bin [0.35,0.40] contains a significant TF-peak link while [0,0.05], [0.05,0.10], ..., [0.30,0.35] do not, all TF-peak links with a correlation of at least 0.35 or above are selected (i.e, bins [0.35,0.40], [0.40,0.45], ..., [0.95,1.00]). Thus, for the final selection, also links with a higher FDR but a more extreme correlation may be selected.

filterLoops

TRUE or FALSE. Default TRUE. If a TF regulates itself (i.e., the TF and the gene are the same entity), should such loops be filtered from the GRN?

resetGraphAndStoreInternally

TRUE or FALSE. Default TRUE. If set to TRUE, the stored eGRN graph (slot graph) is reset due to the potentially changed connections that would otherwise cause conflicts in the information stored in the object. Also, a GRN object is returned. If set to FALSE, only the new filtered connections are returned and the object is not altered.

silent

TRUE or FALSE. Default FALSE. Print progress messages and filter statistics.

forceRerun

TRUE or FALSE. Default FALSE. Force execution, even if the GRN object already contains the result. Overwrites the old results.

Value

An updated GRN object, with additional information added from this function. The filtered and merged TF-peak and peak-gene connections in the slot GRN@connections$all.filtered and can be retrieved (along with other feature metadata) using the function getGRNConnections.

Examples

# See the Workflow vignette on the GRaNIE website for examples
GRN = loadExampleObject()
#> Downloading GRaNIE example object from https://git.embl.de/grp-zaugg/GRaNIE/-/raw/master/data/GRN.rds
#> Finished successfully. You may explore the example object. Start by typing the object name to the console to see a summaty. Happy GRaNIE'ing!
GRN = filterGRNAndConnectGenes(GRN)
#> INFO [2024-04-04 17:36:31] Filter GRN network
#> INFO [2024-04-04 17:36:31] Data already exists in object or the specified file already exists. Set forceRerun = TRUE to regenerate and overwrite.
#> INFO [2024-04-04 17:36:31]  Finished successfully. Execution time: 0 secs