This is one of the main integrative functions of the GRaNIE package. It serves two main purposes. First, it filters both TF-peak and peak-gene connections according to different criteria such as FDR and other properties. Second, it joins the three major elements that an eGRN consists of (TFs, peaks, genes) into one data frame, with one row per unique TF-peak-gene connection. After successful execution, the connections (along with additional feature metadata) can be retrieved with the function getGRNConnections.

Note that a previously stored eGRN graph is reset upon successful execution of this function, and a descriptive warning is printed. The function build_eGRN_graph must then be re-run before any of the network functions of the package can be executed. If the filtered connections changed, all network-related enrichment functions also have to be rerun.

Internally, TF-peak links and peak-gene connections are filtered separately before they are joined, for reasons of memory and computational efficiency: filtering out unwanted links first dramatically reduces the memory needed for the full eGRN. Peak-gene p-value adjustment is done only after all filtering steps, on the remaining set of connections, to lower the statistical burden of multiple-testing adjustment. This may lead to initially counter-intuitive effects: compared to a filtering based on different thresholds, a particular connection may no longer be included, or its FDR may differ.
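
The effect of adjusting p-values only after filtering can be illustrated with base R alone. The following is a minimal sketch with made-up p-values and a made-up filter mask, not the package's internal code; it only shows that the same raw p-value can receive a different adjusted value depending on how many links remain when p.adjust is applied.

```r
# Hypothetical raw p-values for six peak-gene links (illustrative values only)
p_raw <- c(0.001, 0.01, 0.02, 0.04, 0.3, 0.8)

# Hypothetical mask: which links survive the earlier filtering steps
keep <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)

# Adjust over all links versus only over the retained links
fdr_all      <- p.adjust(p_raw, method = "BH")
fdr_filtered <- p.adjust(p_raw[keep], method = "BH")

# The link with raw p = 0.04 is element 4 before and element 3 after filtering;
# its adjusted value differs because fewer tests are corrected for
fdr_all[4]
fdr_filtered[3]
```

With a threshold between the two adjusted values, the same link would pass in one setting and fail in the other, which is the kind of behavior described above.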

filterGRNAndConnectGenes(
  GRN,
  TF_peak.fdr.threshold = 0.2,
  TF_peak.connectionTypes = "all",
  peak_gene.p_raw.threshold = NULL,
  peak_gene.fdr.threshold = 0.2,
  peak_gene.fdr.method = "BH",
  peak_gene.IHW.covariate = NULL,
  peak_gene.IHW.nbins = "auto",
  gene.types = c("protein_coding"),
  allowMissingTFs = FALSE,
  allowMissingGenes = TRUE,
  peak_gene.r_range = c(0, 1),
  peak_gene.selection = "all",
  peak_gene.maxDistance = NULL,
  filterTFs = NULL,
  filterGenes = NULL,
  filterPeaks = NULL,
  TF_peak_FDR_selectViaCorBins = FALSE,
  filterLoops = TRUE,
  outputFolder = NULL,
  resetGraphAndStoreInternally = TRUE,
  silent = FALSE
)

Arguments

GRN

Object of class GRN

TF_peak.fdr.threshold

Numeric[0,1]. Default 0.2. Maximum FDR for the TF-peak links. Set to 1 or NULL to disable this filter.

TF_peak.connectionTypes

Character vector. Default "all". TF-peak connection types to consider. The special keyword "all" denotes all connection types (e.g., expression and TFActivity) that are found in the GRN object. By default, only expression is present in the object, so "all" and "expression" are usually equivalent unless calculation of TF-peak links based on TF activity has also been enabled.

peak_gene.p_raw.threshold

Numeric[0,1]. Default NULL. Threshold for the peak-gene connections, based on the raw p-value. All peak-gene connections with a larger raw p-value will be filtered out.

peak_gene.fdr.threshold

Numeric[0,1]. Default 0.2. Threshold for the peak-gene connections, based on the FDR. All peak-gene connections with a larger FDR will be filtered out.

peak_gene.fdr.method

Character. Default "BH". One of: "holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none", "IHW". Method for adjusting p-values for multiple testing. If set to "IHW", independent hypothesis weighting is performed. This requires the package IHW (as it is listed under Suggests, it may not be installed) and a suitable covariate, which has to be specified via the parameter peak_gene.IHW.covariate.

peak_gene.IHW.covariate

Character. Default NULL. Name of the covariate to use for IHW (column name from the table that is returned by the function getGRNConnections). Only relevant if peak_gene.fdr.method is set to "IHW". You have to make sure the specified covariate is suitable for IHW; see the diagnostic plots that are generated by this function. For many datasets, the peak-gene distance (called peak_gene.distance in the object) seems suitable.

peak_gene.IHW.nbins

Integer or "auto". Default "auto". Number of bins for IHW. Only relevant if peak_gene.fdr.method is set to "IHW".

gene.types

Character vector of supported gene types. Default c("protein_coding"). Filter for gene types to retain; genes with gene types not listed here are filtered out. The special keyword "all" indicates no filter and retains all gene types. The specified names must match the names as stored in the GRN object (see GRN@annotation$genes$gene.type) and correspond 1:1 to the gene type names as provided by biomaRt, with the exception of lncRNAs, which are internally renamed to lincRNAs when first fetching all gene types. This is done due to a recent change in biomaRt and aims at keeping backwards compatibility with GRN objects.

allowMissingTFs

TRUE or FALSE. Default FALSE. Should connections be returned for which the TF is NA (i.e., connections consisting only of peak-gene links)? If set to TRUE, this generally greatly increases the number of connections, but it may not be what you aim for.

allowMissingGenes

TRUE or FALSE. Default TRUE. Should connections be returned for which the gene is NA (i.e., connections consisting only of TF-peak links)? If set to TRUE, this generally increases the number of connections.
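
How allowMissingTFs and allowMissingGenes affect the joined table can be pictured as the difference between inner and outer joins. The sketch below uses base R merge on two tiny made-up tables; the table contents and the column names peak.ID and gene.ENSEMBL are assumptions for illustration, and the mapping of the two parameters to join types is an interpretation of the descriptions above, not the package's actual implementation.

```r
# Hypothetical TF-peak and peak-gene links (illustrative IDs only)
TF_peak <- data.frame(TF.ID   = c("TF1", "TF2"),
                      peak.ID = c("peakA", "peakB"),
                      stringsAsFactors = FALSE)
peak_gene <- data.frame(peak.ID      = c("peakA", "peakC"),
                        gene.ENSEMBL = c("gene1", "gene2"),
                        stringsAsFactors = FALSE)

# Analogous to allowMissingGenes = TRUE: TF-peak links without a gene
# are kept, with NA in the gene column
with_missing_genes <- merge(TF_peak, peak_gene, by = "peak.ID", all.x = TRUE)

# Analogous to allowMissingTFs = TRUE: peak-gene links without a TF
# are kept, with NA in the TF column
with_missing_TFs <- merge(TF_peak, peak_gene, by = "peak.ID", all.y = TRUE)

# Both set to FALSE: only complete TF-peak-gene triplets remain
complete_only <- merge(TF_peak, peak_gene, by = "peak.ID")
```

In this toy example, each outer join yields two rows (one of them with an NA), while the inner join yields the single complete connection TF1-peakA-gene1.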

peak_gene.r_range

Numeric(2). Default c(0,1). Lower and upper limit for the correlation coefficient of the peak-gene links. Links are retained only if their correlation coefficient is within the specified interval. This filter is usually used to filter out negatively correlated peak-gene links.

peak_gene.selection

"all" or "closest". Default "all". Filter for the selection of genes for each peak. If set to "all", all previously identified peak-gene links are used, while "closest" retains only the closest gene for each peak among the links that are still retained at the point the filter is applied.

peak_gene.maxDistance

Integer >0. Default NULL. Maximum peak-gene distance to retain a peak-gene connection.
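
The three peak-gene filters peak_gene.r_range, peak_gene.maxDistance and peak_gene.selection can be sketched on a toy table in base R. All values below are made up, the column name gene.ENSEMBL and the order of the steps are assumptions, and the boundary handling (lower limit excluded, upper limit and maximum distance included) is inferred from the interval notation in the example log output rather than confirmed; the package may implement these steps differently.

```r
# Hypothetical peak-gene links (illustrative values only)
links <- data.frame(
  peak.ID            = c("peakA", "peakA", "peakB", "peakB", "peakC"),
  gene.ENSEMBL       = c("gene1", "gene2", "gene3", "gene4", "gene5"),
  peak_gene.distance = c(5000, 20000, 3000, 8000, 250000),
  peak_gene.r        = c(0.6, 0.4, -0.3, 0.2, 0.5),
  stringsAsFactors   = FALSE
)

r_range     <- c(0, 1)   # as in peak_gene.r_range
maxDistance <- 100000    # as in peak_gene.maxDistance

# Keep links whose correlation lies in (0, 1] and whose distance is small enough
in_range <- links$peak_gene.r > r_range[1] & links$peak_gene.r <= r_range[2]
near     <- links$peak_gene.distance <= maxDistance
filtered <- links[in_range & near, ]

# "closest": per peak, keep only the nearest gene among the links still retained
ordered <- filtered[order(filtered$peak.ID, filtered$peak_gene.distance), ]
closest <- ordered[!duplicated(ordered$peak.ID), ]
closest
```

Note that gene3 is the nearest gene to peakB overall, but it is removed by the correlation filter first, so "closest" selects gene4 in this sketch. This mirrors the statement that only links retained up to that point are considered.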

filterTFs

Character vector. Default NULL. Vector of TFs (as named in the GRN object) to retain. All TFs not listed will be filtered out.

filterGenes

Character vector. Default NULL. Vector of gene IDs (as named in the GRN object) to retain. All genes not listed will be filtered out.

filterPeaks

Character vector. Default NULL. Vector of peak IDs (as named in the GRN object) to retain. All peaks not listed will be filtered out.

TF_peak_FDR_selectViaCorBins

TRUE or FALSE. Default FALSE. Use a modified procedure for selecting TF-peak links that is based on the user-specified FDR but also retains links that may have a higher FDR but a more extreme correlation.

filterLoops

TRUE or FALSE. Default TRUE. If a TF regulates itself (i.e., the TF and the gene are the same entity), should such loops be filtered from the GRN?
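
Conceptually, a self-loop is a row in which the TF and the target gene refer to the same entity. The following base R sketch shows one way such rows could be detected, assuming that both are identified by Ensembl IDs in columns named TF.ENSEMBL and gene.ENSEMBL; the IDs are made up and the comparison is an illustration, not necessarily how the package identifies loops.

```r
# Hypothetical joined connections (illustrative IDs only)
connections <- data.frame(
  TF.ID        = c("TF1", "TF2", "TF2"),
  TF.ENSEMBL   = c("ENSG01", "ENSG02", "ENSG02"),
  gene.ENSEMBL = c("ENSG01", "ENSG03", NA),
  stringsAsFactors = FALSE
)

# A row is a self-loop if the gene is known and identical to the TF;
# rows with a missing gene are never treated as loops here
is_loop <- !is.na(connections$gene.ENSEMBL) &
  connections$TF.ENSEMBL == connections$gene.ENSEMBL

no_loops <- connections[!is_loop, ]
no_loops
```

Checking for NA explicitly avoids NA values in the logical index, which would otherwise introduce all-NA rows when subsetting.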

outputFolder

Character or NULL. Default NULL. If set to NULL, the default output folder as specified when initiating the object in initializeGRN will be used. Otherwise, all output from this function will be put into the specified folder. We recommend specifying an absolute path.

resetGraphAndStoreInternally

TRUE or FALSE. Default TRUE. If set to TRUE, the stored eGRN graph (slot graph) is reset due to the potentially changed connections that would otherwise cause conflicts in the information stored in the object. Also, a GRN object is returned. If set to FALSE, only the new filtered connections are returned and the object is not altered.

silent

TRUE or FALSE. Default FALSE. Controls whether progress messages and filter statistics are printed. If set to TRUE, they are suppressed.

Value

An updated GRN object, with additional information added from this function. The filtered and merged TF-peak and peak-gene connections are stored in the slot GRN@connections$all.filtered and can be retrieved (along with other feature metadata) using the function getGRNConnections.

Examples

# See the Workflow vignette on the GRaNIE website for examples
GRN = loadExampleObject()
#> Downloading GRaNIE example object from https://git.embl.de/grp-zaugg/GRaNIE/-/raw/master/data/GRN.rds
#> Finished successfully. You may explore the example object. Start by typing the object name to the console to see a summaty. Happy GRaNIE'ing!
GRN = filterGRNAndConnectGenes(GRN)
#> INFO [2023-03-06 16:39:43] Filter GRN network
#> INFO [2023-03-06 16:39:43] 
#> 
#> Real data
#> INFO [2023-03-06 16:39:43] Inital number of rows left before all filtering steps: 23096
#> INFO [2023-03-06 16:39:43]  Filter network and retain only rows with TF-peak connections with an FDR < 0.2
#> INFO [2023-03-06 16:39:43]   Number of TF-peak rows before filtering TFs: 23096
#> INFO [2023-03-06 16:39:43]   Number of TF-peak rows after filtering TFs: 4907
#> INFO [2023-03-06 16:39:43] 2. Filter peak-gene connections
#> INFO [2023-03-06 16:39:43]  Filter genes by gene type, keep only the following gene types: protein_coding
#> INFO [2023-03-06 16:39:43]   Number of peak-gene rows before filtering by gene type: 18828
#> INFO [2023-03-06 16:39:43]   Number of peak-gene rows after filtering by gene type: 14944
#> INFO [2023-03-06 16:39:43] 3. Merging TF-peak with peak-gene connections and filter the combined table...
#> INFO [2023-03-06 16:39:44] Inital number of rows left before filtering steps: 5485
#> INFO [2023-03-06 16:39:44]  Filter TF-TF self-loops
#> INFO [2023-03-06 16:39:44]   Number of rows before filtering genes: 5485
#> INFO [2023-03-06 16:39:44]   Number of rows after filtering genes: 3211
#> INFO [2023-03-06 16:39:44]  Filter network and retain only rows with peak_gene.r in the following interval: (0 - 1]
#> INFO [2023-03-06 16:39:44]   Number of rows before filtering: 3211
#> INFO [2023-03-06 16:39:44]   Number of rows after filtering: 1767
#> INFO [2023-03-06 16:39:44]  Calculate FDR based on remaining rows, filter network and retain only rows with peak-gene connections with an FDR < 0.2
#> INFO [2023-03-06 16:39:44]   Number of rows before filtering genes (including/excluding NA): 1767/1767
#> INFO [2023-03-06 16:39:44]   Number of rows after filtering genes (including/excluding NA): 392/392
#> INFO [2023-03-06 16:39:44] Final number of rows left after all filtering steps: 392
#> INFO [2023-03-06 16:39:44] 
#> 
#> Permuted data
#> Error in dplyr::left_join(., GRN@annotation$TFs %>% dplyr::select("TF.ID",     "TF.name", "TF.ENSEMBL"), by = c("TF.ID")): Join columns in `x` must be present in the data.
#>  Problem with `TF.ID`.