This function adds both RNA and peak data to a GRN object, along with data normalization. In addition, and highly recommended, sample metadata can be optionally provided.

addData(
  GRN,
  counts_peaks,
  normalization_peaks = "DESeq2_sizeFactors",
  idColumn_peaks = "peakID",
  counts_rna,
  normalization_rna = "limma_quantile",
  idColumn_RNA = "ENSEMBL",
  sampleMetadata = NULL,
  additionalParams.l = list(),
  allowOverlappingPeaks = FALSE,
  keepOriginalReadCounts = FALSE,
  EnsemblVersion = NULL,
  genomeAnnotationSource = "AnnotationHub",
  forceRerun = FALSE
)

Arguments

GRN

Object of class GRN

counts_peaks

Data frame. No default. Counts for the peaks, with raw or normalized counts for each peak (rows) across all samples (columns). In addition to the count data, it must also contain one ID column with a particular format, see the argument idColumn_peaks below. Row names are ignored, column names must be set to the sample names and must match those from the RNA counts and the sample metadata table.

normalization_peaks

Character. Default DESeq2_sizeFactors. Normalization procedure for peak data. Must be one of limma_cyclicloess, limma_quantile, limma_scale, csaw_cyclicLoess_orig, csaw_TMM, EDASeq_GC_peaks, gcqn_peaks, DESeq2_sizeFactors, none.

idColumn_peaks

Character. Default peakID. Name of the column in the counts_peaks data frame that contains peak IDs. The required format must be chr:start-end", with chr denoting the abbreviated chromosome name, and start and end the begin and end of the peak coordinates, respectively. End must be bigger than start. Examples for valid peak IDs are chr1:400-800 or chrX:20-25.

counts_rna

Data frame. No default. Counts for the RNA-seq data, with raw or normalized counts for each gene (rows) across all samples (columns). In addition to the count data, it must also contain one ID column with a particular format, see the argument idColumn_rna below. Row names are ignored, column names must be set to the sample names and must match those from the RNA counts and the sample metadata table.

normalization_rna

Character. Default limma_quantile. Normalization procedure for peak data. Must be one of limma_cyclicloess, limma_quantile, limma_scale, csaw_cyclicLoess_orig, csaw_TMM, DESeq2_sizeFactors, none.

idColumn_RNA

Character. Default ENSEMBL. Name of the column in the counts_rna data frame that contains Ensembl IDs.

sampleMetadata

Data frame. Default NULL. Optional, additional metadata for the samples, such as age, sex, gender etc. If provided, the @seealso [plotPCA_all()] function can then incorporate and plot it. Sample names must match with those from both peak and RNA-Seq data. The first column is expected to contain the sample IDs, the actual column name is irrelevant.

additionalParams.l

Named list. Default list(). Additional parameters for the chosen normalization method. Currently, only the GC-aware normalization methods EDASeq_GC_peaks and gcqn_peaks are supported here. Both support the parameters roundResults (logical flag, TRUE or FALSE) and nBins (Integer > 0), and EDASeq_GC_peaks supports three additional parameters: withinLane_method (one of: "loess","median","upper","full") and betweenLane_method (one of: "median","upper","full"). For more information, see the EDASeq vignette.

allowOverlappingPeaks

TRUE or FALSE. Default FALSE. Should overlapping peaks be allowed (then only a warning is issued when overlapping peaks are found) or (the default) should an error be raised?

keepOriginalReadCounts

TRUE or FALSE. Default FALSE. Should the original read counts as provided to the function be kept in addition to storing the rad counts after a (if any) normalization? This increases the memory footprint of the object because 2 additional count matrices have to be stored.

EnsemblVersion

NULL or Character(1). Default NULL. The Ensembl version to use for genome annotation retrieval via biomaRt, which is only used to populate the gene annotation metadata that is stored in GRN@annotation$genes. By default (NULL), the newest version is selected for the most recent genome assembly versions is used (see biomaRt::listEnsemblArchives() for supported versions). This parameter can override this to use a custom (older) version instead.

genomeAnnotationSource

AnnotationHub or biomaRt. Default AnnotationHub. Annotation source to retrieve genome annotation data from.

forceRerun

TRUE or FALSE. Default FALSE. Force execution, even if the GRN object already contains the result. Overwrites the old results.

Value

An updated GRN object, with added data from this function(e.g., slots GRN@data$peaks and GRN@data$RNA)

Details

If the ChIPseeker package is installed, additional peak annotation is provided in the annotation slot and a peak annotation QC plot is produced as part of peak-gene QC. This is fully optional, however, and has no consequences for downstream functions. Normalizing the data sensibly is very important. When quantileis chose, limma::normalizeQuantiles is used, which in essence does the following: Each quantile of each column is set to the mean of that quantile across arrays. The intention is to make all the normalized columns have the same empirical distribution. This will be exactly true if there are no missing values and no ties within the columns: the normalized columns are then simply permutations of one another.

See also

Examples

# See the Workflow vignette on the GRaNIE website for examples
# library(readr)
# rna.df   = read_tsv("https://www.embl.de/download/zaugg/GRaNIE/rna.tsv.gz")
# peaks.df = read_tsv("https://www.embl.de/download/zaugg/GRaNIE/peaks.tsv.gz")
# meta.df  = read_tsv("https://www.embl.de/download/zaugg/GRaNIE/sampleMetadata.tsv.gz")
# GRN = loadExampleObject()
# We omit sampleMetadata = meta.df in the following line, becomes too long otherwise
# GRN = addData(GRN, counts_peaks = peaks.df, counts_rna = rna.df, forceRerun = FALSE)