filterGRNAndConnectGenes.Rd
This is one of the main integrative functions of the GRaNIE
package. It has two main functions:
First, filtering both TF-peak and peak-gene connections according to different criteria such as FDR and other properties.
Second, joining the three major elements that an eGRN consists of (TFs, peaks, genes) into one data frame, with one row per unique TF-peak-gene connection.
After successful execution, the connections (along with additional feature metadata) can be retrieved with the function getGRNConnections.
Note that a previously stored eGRN graph is reset upon successful execution of this function, and a descriptive warning is printed.
Re-running the function build_eGRN_graph is then necessary before any of the network functions of the package can be executed.
If the filtered connections changed, all network-related enrichment functions also have to be rerun.
Internally, before joining them, both TF-peak links and peak-gene connections are filtered separately for reasons of memory and computational efficiency:
Filtering out unwanted links first dramatically reduces the memory needed for the full eGRN. Peak-gene p-value adjustment is only done after all filtering steps, on the remaining set of
connections, to lower the statistical burden of multiple-testing adjustment. This may lead to initially counter-intuitive effects: a particular connection may no longer be included compared to a
filtering based on different thresholds, or its FDR may differ, for the same reason.
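The effect of adjusting only after filtering can be illustrated with base R's p.adjust alone. The following is a minimal sketch, not GRaNIE code: the raw p-values are invented, and the keep vector is a hypothetical stand-in for the earlier filtering steps. It shows why the same connection can end up with a different FDR depending on which other connections survive; in this toy case the FDR is lower on the filtered set because fewer tests are adjusted for.

```r
# Toy raw p-values for six hypothetical peak-gene connections (invented numbers)
p_raw <- c(0.001, 0.02, 0.03, 0.2, 0.5, 0.9)

# Benjamini-Hochberg adjustment over all six connections
fdr_all <- p.adjust(p_raw, method = "BH")

# Hypothetical earlier filters (e.g., gene type or correlation range) keep four rows
keep <- c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE)
fdr_filtered <- p.adjust(p_raw[keep], method = "BH")

# The second connection (raw p = 0.02) is present in both sets,
# but its adjusted value depends on how many tests remain
fdr_connection2 <- c(all = fdr_all[2], filtered = fdr_filtered[2])
print(fdr_connection2)
```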
filterGRNAndConnectGenes(
GRN,
TF_peak.fdr.threshold = 0.2,
TF_peak.connectionTypes = "all",
peak_gene.p_raw.threshold = NULL,
peak_gene.fdr.threshold = 0.2,
peak_gene.fdr.method = "BH",
peak_gene.IHW.covariate = NULL,
peak_gene.IHW.nbins = "auto",
gene.types = c("protein_coding"),
allowMissingTFs = FALSE,
allowMissingGenes = TRUE,
peak_gene.r_range = c(0, 1),
peak_gene.selection = "all",
peak_gene.maxDistance = NULL,
filterTFs = NULL,
filterGenes = NULL,
filterPeaks = NULL,
TF_peak_FDR_selectViaCorBins = FALSE,
filterLoops = TRUE,
outputFolder = NULL,
resetGraphAndStoreInternally = TRUE,
silent = FALSE
)
Object of class GRN
Numeric[0,1]. Default 0.2. Maximum FDR for the TF-peak links. Set to 1 or NULL to disable this filter.
Character vector. Default all. TF-peak connection types to consider. The special keyword all denotes all connection types (e.g., expression and TFActivity) that are found in the GRN object. By default, only expression is present in the object, so all and expression are usually equivalent unless calculation of TF-peak links based on TF activity has also been enabled.
Numeric[0,1]. Default NULL. Threshold for the peak-gene connections, based on the raw p-value. All peak-gene connections with a larger raw p-value will be filtered out.
Numeric[0,1]. Default 0.2. Threshold for the peak-gene connections, based on the FDR. All peak-gene connections with a larger FDR will be filtered out.
Character. Default "BH". One of: "holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none", "IHW".
Method for adjusting p-values for multiple testing.
If set to "IHW", independent hypothesis weighting will be performed. This requires the package IHW (as it is listed under Suggests, it may not be installed) and a suitable covariate, which has to be specified for the parameter peak_gene.IHW.covariate.
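All listed options except "IHW" carry the same names as the methods accepted by base R's p.adjust. The sketch below compares a few of them on invented p-values; that the function passes these names on to p.adjust internally is an assumption based on the matching names, not something this page states.

```r
# Methods accepted by stats::p.adjust in base R
print(p.adjust.methods)

# Toy raw p-values (invented for illustration)
p_raw <- c(0.001, 0.01, 0.04, 0.3)

# Compare a few adjustment methods side by side (one column per method)
adjusted <- sapply(c("BH", "bonferroni", "none"),
                   function(m) p.adjust(p_raw, method = m))
print(adjusted)
```

With "none", the values are returned unchanged, so peak_gene.fdr.threshold would then effectively act on raw p-values.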
Character. Default NULL. Name of the covariate to use for IHW (column name from the table that is returned by the function getGRNConnections). Only relevant if peak_gene.fdr.method is set to "IHW". You have to make sure the specified covariate is suitable for IHW; see the diagnostic plots that are generated in this function for this. For many datasets, the peak-gene distance (called peak_gene.distance in the object) seems suitable.
Integer or "auto". Default "auto". Number of bins for IHW. Only relevant if peak_gene.fdr.method is set to "IHW".
Character vector of supported gene types. Default c("protein_coding").
Filter for gene types to retain; genes with gene types not listed here are filtered out. The special keyword "all" indicates no filter and retains all gene types.
The specified names must match the names as stored in the GRN object (see GRN@annotation$genes$gene.type) and
correspond 1:1 to the gene type names as provided by biomaRt, with the exception of lncRNAs,
which is internally renamed to lincRNAs when first fetching all gene types. This is done due to a recent change in biomaRt and aims at
keeping backwards compatibility with GRN objects.
TRUE or FALSE. Default FALSE. Should connections be returned for which the TF is NA (i.e., connections consisting only of peak-gene links)? If set to TRUE, this generally greatly increases the number of connections, but it may not be what you aim for.
TRUE or FALSE. Default TRUE. Should connections be returned for which the gene is NA (i.e., connections consisting only of TF-peak links)? If set to TRUE, this generally increases the number of connections.
Numeric(2). Default c(0,1). Lower and upper limit for the correlation coefficient of the peak-gene links. Links are retained only if their correlation coefficient is within the specified interval. This filter is usually used to filter out negatively correlated peak-gene links.
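What such an interval filter does can be sketched on a toy table in base R. This is not the package's implementation: the data and the peak.ID column are invented, and treating the lower bound as exclusive is an assumption taken from the "(0 - 1]" interval that appears in the example log output for this function.

```r
# Toy peak-gene table; values and the peak.ID column are invented
peak_gene <- data.frame(
  peak.ID     = c("peak1", "peak2", "peak3", "peak4"),
  peak_gene.r = c(-0.6, 0, 0.35, 1),
  stringsAsFactors = FALSE
)

r_range <- c(0, 1)

# Keep rows with r in (lower, upper]; the open lower bound is an assumption
# based on the "(0 - 1]" interval printed in the log output
in_range <- peak_gene$peak_gene.r > r_range[1] &
  peak_gene$peak_gene.r <= r_range[2]
kept <- peak_gene[in_range, ]
print(kept)
```

Under this reading, negatively correlated and uncorrelated (r = 0) links are dropped, while a perfectly correlated link (r = 1) is kept.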
"all"
or "closest". Default "all". Filter for the selection of genes for each peak. If set to "all", all previously identified peak-gene links are used, while "closest" only retains the closest gene for each peak that is still retained at the point the filter is applied.
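As a rough illustration of the "closest" behaviour, the sketch below keeps the gene with the smallest distance for each peak in a toy table. The data and the column names peak.ID and gene.ENSEMBL are invented for this example, and how GRaNIE itself measures distance or resolves ties is not stated here, so this should be read as an approximation of the idea only.

```r
# Toy peak-gene links with distances in base pairs (invented values)
peak_gene <- data.frame(
  peak.ID            = c("peak1", "peak1", "peak2", "peak2", "peak2"),
  gene.ENSEMBL       = c("geneA", "geneB", "geneB", "geneC", "geneD"),
  peak_gene.distance = c(5000, 1200, 80000, 300, 45000),
  stringsAsFactors = FALSE
)

# Sort by distance, then keep the first (i.e., closest) row per peak
by_distance <- peak_gene[order(peak_gene$peak_gene.distance), ]
closest <- by_distance[!duplicated(by_distance$peak.ID), ]

# Restore a stable ordering by peak for display
closest <- closest[order(closest$peak.ID), ]
print(closest)
```

Because the filter acts on the links that remain at that point, a gene removed by an earlier filter (for example by gene type) cannot be selected as the closest one.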
Integer >0. Default NULL. Maximum peak-gene distance to retain a peak-gene connection.
Character vector. Default NULL. Vector of TFs (as named in the GRN object) to retain. All TFs not listed will be filtered out.
Character vector. Default NULL. Vector of gene IDs (as named in the GRN object) to retain. All genes not listed will be filtered out.
Character vector. Default NULL. Vector of peak IDs (as named in the GRN object) to retain. All peaks not listed will be filtered out.
TRUE or FALSE. Default FALSE. Use a modified procedure for selecting TF-peak links that is based on the user-specified FDR but also retains links that may have a higher FDR but a more extreme correlation.
TRUE or FALSE. Default TRUE. If a TF regulates itself (i.e., the TF and the gene are the same entity), should such loops be filtered from the GRN?
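Conceptually, a self-loop is a row in which the TF and the target gene carry the same identifier. The sketch below removes such rows from a toy connection table; it is not the package's code. TF.ENSEMBL is a column mentioned in the example output of this function, whereas gene.ENSEMBL, peak.ID, and all IDs are assumptions made up for illustration.

```r
# Toy TF-peak-gene connections (all identifiers invented)
connections <- data.frame(
  TF.ENSEMBL   = c("ENSG_TF1", "ENSG_TF2", "ENSG_TF3", "ENSG_TF1"),
  peak.ID      = c("peak1", "peak2", "peak3", "peak4"),
  gene.ENSEMBL = c("ENSG_G1", "ENSG_TF2", "ENSG_G2", NA),
  stringsAsFactors = FALSE
)

# A row is a self-loop when TF and gene share the same ID;
# rows with a missing gene (NA) are never treated as loops here
is_loop <- !is.na(connections$gene.ENSEMBL) &
  connections$TF.ENSEMBL == connections$gene.ENSEMBL

no_loops <- connections[!is_loop, ]
print(no_loops)
```

The explicit NA check matters when connections without a gene are allowed, since comparing against NA would otherwise yield NA rather than FALSE.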
Character or NULL. Default NULL. If set to NULL, the default output folder as specified when initiating the object in initializeGRN will be used. Otherwise, all output from this function will be put into the specified folder. We recommend specifying an absolute path.
TRUE or FALSE. Default TRUE. If set to TRUE, the stored eGRN graph (slot graph) is reset due to the potentially changed connections that
would otherwise cause conflicts in the information stored in the object. Also, a GRN object is returned. If set to FALSE, only the new filtered connections are returned and the object is not altered.
TRUE or FALSE. Default FALSE. Print progress messages and filter statistics.
An updated GRN object, with additional information added from this function.
The filtered and merged TF-peak and peak-gene connections are stored in the slot GRN@connections$all.filtered and can be retrieved (along with other feature metadata) using the function getGRNConnections.
# See the Workflow vignette on the GRaNIE website for examples
GRN = loadExampleObject()
#> Downloading GRaNIE example object from https://git.embl.de/grp-zaugg/GRaNIE/-/raw/master/data/GRN.rds
#> Finished successfully. You may explore the example object. Start by typing the object name to the console to see a summaty. Happy GRaNIE'ing!
GRN = filterGRNAndConnectGenes(GRN)
#> INFO [2023-03-06 16:39:43] Filter GRN network
#> INFO [2023-03-06 16:39:43]
#>
#> Real data
#> INFO [2023-03-06 16:39:43] Inital number of rows left before all filtering steps: 23096
#> INFO [2023-03-06 16:39:43] Filter network and retain only rows with TF-peak connections with an FDR < 0.2
#> INFO [2023-03-06 16:39:43] Number of TF-peak rows before filtering TFs: 23096
#> INFO [2023-03-06 16:39:43] Number of TF-peak rows after filtering TFs: 4907
#> INFO [2023-03-06 16:39:43] 2. Filter peak-gene connections
#> INFO [2023-03-06 16:39:43] Filter genes by gene type, keep only the following gene types: protein_coding
#> INFO [2023-03-06 16:39:43] Number of peak-gene rows before filtering by gene type: 18828
#> INFO [2023-03-06 16:39:43] Number of peak-gene rows after filtering by gene type: 14944
#> INFO [2023-03-06 16:39:43] 3. Merging TF-peak with peak-gene connections and filter the combined table...
#> INFO [2023-03-06 16:39:44] Inital number of rows left before filtering steps: 5485
#> INFO [2023-03-06 16:39:44] Filter TF-TF self-loops
#> INFO [2023-03-06 16:39:44] Number of rows before filtering genes: 5485
#> INFO [2023-03-06 16:39:44] Number of rows after filtering genes: 3211
#> INFO [2023-03-06 16:39:44] Filter network and retain only rows with peak_gene.r in the following interval: (0 - 1]
#> INFO [2023-03-06 16:39:44] Number of rows before filtering: 3211
#> INFO [2023-03-06 16:39:44] Number of rows after filtering: 1767
#> INFO [2023-03-06 16:39:44] Calculate FDR based on remaining rows, filter network and retain only rows with peak-gene connections with an FDR < 0.2
#> INFO [2023-03-06 16:39:44] Number of rows before filtering genes (including/excluding NA): 1767/1767
#> INFO [2023-03-06 16:39:44] Number of rows after filtering genes (including/excluding NA): 392/392
#> INFO [2023-03-06 16:39:44] Final number of rows left after all filtering steps: 392
#> INFO [2023-03-06 16:39:44]
#>
#> Permuted data
#> Error in dplyr::left_join(., GRN@annotation$TFs %>% dplyr::select("TF.ID", "TF.name", "TF.ENSEMBL"), by = c("TF.ID")): Join columns in `x` must be present in the data.
#> ✖ Problem with `TF.ID`.