1 Introduction

2 Preliminaries

3 Download epigenomic data

4 Benchmark chromswitch performance

4.1 Prepare benchmarking dataset
4.2 Prepare genome-wide windows for validation
4.3 Run chromswitch on benchmark dataset
4.4 Random predictor
4.5 Shuffled labels
4.6 ROC curves
4.7 AUROC

5 Robustness experiments

5.1 Tuning parameters
5.2 Sample size and class imbalance
5.3 Window size
5.4 Analysis of a dispersed mark

5.4.1 Download data
5.4.2 Run chromswitch using H3K27me3 peaks

6 Genome-wide validation

6.1 Prepare GTEx data
6.2 Run chromswitch genome wide

6.2.1 Summary strategy
6.2.2 Binary strategy

6.3 Evaluate results

7 Session Info

1 Introduction

This document executes all the analysis presented in chromswitch: A flexible method for detecting chromatin state switches, from downloading the data, to running experiments, to generating the tables and figures included in the paper. The dropdown in the top right corner of this HTML doument controls whether code is shown or hidden. The document can be navigated using the menu on the left.

About the project:

In this paper, we present an R/Bioconductor package, chromswitch, which implements a flexible method to detect chromatin state switches between samples in two biological conditions in a specific genomic region of interest given peaks or chromatin state calls from ChIP-seq data. chromswitch is available from Bioconductor 3.6 at https://bioconductor.org/packages/chromswitch. We analyze its performance on a benchmark dataset, study its robustness to changes in tuning parameters, sample size and imbalance, window size, and varying mark profiles; and perform a genome-wide validation.

How this repository is set up:

This HTML file is generated by rendering analysis.Rmd, and executes all the analysis
input contains a few individual data and metadata files needed for the analysis
data stores data which is downloaded during the analysis
figures contains all figures produced in the analysis; the subfolder heatmaps stores the figures output by the tool
functions stores a few long functions used (source()’d by) the analysis
computations stores the results of computationally-intensive steps of the analysis, which are saved in the main document after being executed once, and then loaded for further analysis
tables stores the data behind each figure as TSV files, and tables presented in the supplementary material

How to re-run the analysis:

To run the analysis, this R Markdown file analysis.Rmd needs to be knit. This involves executing every chunk, or piece of code, sequentially. To feasibly handle chunks with computationally-intensive code, we typically evaluate the chunk once, and then set eval = FALSE for that chunk. The results are saved as an .RData file in the computations folder, and then loaded in a following chunk so that the results can be processed and visualized. To re-run the analysis, including these computationally-intensive steps, that option needs to be changed back to eval = TRUE for every chunk.

Since the analysis would take multiple days to run, we recommend running the analysis from the command line, using the run.sh bash script. If HPC resources are available, much of the work can be parallelized by modifying that script and setting the n_cores variable in the next section accordingly.

Dependencies:

We used R 3.3.1 and Bioconductor 3.5. The R packages required are loaded below, see the exact versions in Session Info. All the data needed for the analysis is provided in the repository or downloaded in this document.

2 Preliminaries

Load general packages:

# Visualization
library(RColorBrewer)
library(ggplot2)

# Install and download custom ggplot2 theme
devtools::install_github("sjessa/ggmin")
library(ggmin) 

# Genomics
library(GenomicRanges)
library(rtracklayer)
library(Homo.sapiens)
library(biomaRt)

# General-purpose
library(pROC)
library(xtable)
library(BiocParallel)
library(parallel)
library(magrittr)
library(plyr)
library(purrr)
library(tidyverse)
library(dplyr)
library(stringi)

# Source functions
source("functions/plotting_functions.R")
source("functions/subsampling_functions.R")
source("functions/misc_functions.R")

Install chromswitch from the GitHub repository:

devtools::install_github("sjessa/chromswitch@v0.99.0")
library(chromswitch)

Note on chromswitch version: chromswitch is now available from Bioconductor 3.6, but for reproducibility, the above command will download v0.99.0 of chromswitch, with which the bulk of the analysis was done: https://github.com/sjessa/chromswitch/releases/tag/v0.99.0.

Set-up for parallel computations:

# For other parallel computations
n_cores <- 1

chromswitch is implemented in a relatively modular way, with a function for each step of the algorithm. It also provides two wrapper functions, one for each feature construction strategy, which runs the whole analysis. Although the wrapper functions in chromswitch can analyze multiple query regions in parallel, we will parallelize outside the function in order to catch any errors that might arise.

safeCallSummary <- purrr::safely(chromswitch::callSummary)
safeCallBinary  <- purrr::safely(chromswitch::callBinary)

3 Download epigenomic data

The data we analyze comes from the Roadmap Epigenomics Program. We evaluate three types of input:

H3K4me3 peaks
DNase I peaks/DNase I hypersensitive sites
ChromHMM segmentation learned on the core histone marks in the Roadmap analysis, filtered to regions assigned the state “Active TSS”.

We will work with 23 adult tissue samples from Roadmap. 7 are brain tissues, and 16 are various other tissues.

# Read in the samples
meta <- read_tsv("input/sample_metadata.tsv")
meta

ABCDEFGHIJ0123456789

Sample <chr>	Tissue <chr>	Condition <chr>
E065	Aorta	Other
E066	Adult_Liver	Other
E067	Brain_Angular_Gyrus	Brain
E068	Brain_Anterior_Caudate	Brain
E069	Brain_Cingulate_Gyrus	Brain
E071	Brain_Hippocampus_Middle	Brain
E072	Brain_Inferior_Temporal_Lobe	Brain
E073	Brain_Mid_Frontal_Lobe	Brain
E074	Brain_Substantia_Nigra	Brain
E075	Colonic_Mucosa	Other

table(meta$Condition)

## 
## Brain Other 
##     7    16

# Download the file if it doesn't exist
noClobberDownload <- function(url, destpath) {
    if(!(file.exists(destpath))) download.file(url, destpath)
}

We retrieve the data from this link: http://egg2.wustl.edu/roadmap/data/byFileType/.

urls_k4 <- paste0("http://egg2.wustl.edu/roadmap/data/byFileType/peaks/consolidated/narrowPeak/",
                      meta$Sample, "-H3K4me3.narrowPeak.gz")
basenames_k4 <- lapply(urls_k4, basename) %>% unlist()
destpaths_k4 <- paste0("data/H3K4me3_peaks/", basenames_k4)

dl_k4 <- mcMap(noClobberDownload, urls_k4, destpaths_k4, mc.cores = n_cores)

extra_cols <- c("name" = "character", "score" = "integer", "strand" = "character",
                "signalValue" = "numeric", "pValue" = "numeric", "qValue" = "numeric",
                "peak" = "numeric")

pk_k4me3 <- lapply(destpaths_k4, rtracklayer::import, format = "bed",
                   extraCols = extra_cols)
names(pk_k4me3) <- meta$Sample

save(pk_k4me3, file = "computations/pk_k4me3.RData")

urls_dnase <- paste0("http://egg2.wustl.edu/roadmap/data/byFileType/peaks/consolidatedImputed/narrowPeak/",
               meta$Sample,
               "-DNase.imputed.narrowPeak.bed.nPk.gz")
basenames_dnase <- lapply(urls_dnase, basename) %>% unlist()
destpaths_dnase <- paste0("data/DNase_peaks/", basenames_dnase)

dl_dnase <- mcMap(noClobberDownload, urls_dnase, destpaths_dnase,
                  mc.cores = n_cores)

pk_dnase <- lapply(destpaths_dnase, rtracklayer::import, format = "bed")
names(pk_dnase) <- meta$Sample

save(pk_dnase, file = "computations/pk_dnase.RData")

To convert the chromatin state segmentations into a format that can be used by chromswitch, we will filter to regions assigned “Active_TSS”, the state associated with active transcription.

urls_st <- paste0("http://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/coreMarks/jointModel/final/",
                      meta$Sample, "_15_coreMarks_mnemonics.bed.gz")
basenames_st <- lapply(urls_st, basename) %>% unlist()
destpaths_st <- paste0("data/ChromHMM_segmentations/", basenames_st)

dl_st <- mcMap(noClobberDownload, urls_st, destpaths_st, mc.cores = 6)

extra_cols <- c("State" = "character")

chmm_states <- lapply(destpaths_st, rtracklayer::import, format = "bed",
                   extraCols = extra_cols) %>% 
    lapply(as.data.frame) %>% 
    # We will treat regions assigned the state associated with active
    # transcription as the epigenomic features in our analysis
    lapply(filter, State == "1_TssA") %>% 
    lapply(makeGRangesFromDataFrame)

names(chmm_states) <- meta$Sample

save(chmm_states, file = "computations/chmm_states.RData")

load("computations/pk_k4me3.RData")
load("computations/pk_dnase.RData")
load("computations/chmm_states.RData")

4 Benchmark `chromswitch` performance

4.1 Prepare benchmarking dataset

We manually assembled a list of 120 genes on which to benchmark chromswitch performance. 60 of these genes exhibit chromatin state switches between brain and other tissues based on change in H3K4me3 signal (whose enrichment at promoters is associated with active transcription), DNase I hypersensitivity (associated with open chromatin), and expression (the functional consequence of chromatin changes). The other 60 are negative examples, where there was no change across the tissues in H3K4me3, DNase I, or expression data. These genes are specified in the file input/benchmark.tsv.

To define windows for these genes to use as input for our experiment, we retrieve a list of genes and coordinates in the human genome from UCSC. This file is saved in the data directory, and we obtained it using the UCSC Table Browser, with the following selections:

group: Genes and Gene Predictions
track: RefSeq
Fields: name, chrom, strand, txStart, txEnd, and name2

all_genes <- read_tsv("data/refseq_genes.hg19.tsv",
                     col_names = c("transcript_id", "chromosome", "strand",
                                   "txStart", "txEnd", "gene_name"), skip = 1)

head(all_genes)

ABCDEFGHIJ0123456789

transcript_id <chr>	chromosome <chr>	strand <chr>	txStart <int>	txEnd <int>	gene_name <chr>
NM_032291	chr1	+	66999638	67216822	SGIP1
NM_001308203	chr1	+	66999251	67216822	SGIP1
NM_001350218	chr1	+	66999638	67216822	SGIP1
NM_001350217	chr1	+	66999638	67216822	SGIP1
NM_013943	chr1	+	25071759	25170815	CLIC4
NM_018090	chr1	+	16767166	16786584	NECAP2

Define 5kbp windows around the TSS:

benchmark <- read_tsv("input/benchmark.tsv") %>% 
    left_join(all_genes, by = c("transcript_id", "gene_name")) %>% 
    # Define window around the TSS
  mutate(start = case_when(
    strand == "+" ~ as.numeric(txStart) - 2500,
    strand == "-" ~ as.numeric(txEnd) - 2500)) %>%
  mutate(end = case_when(
    strand == "+" ~ as.numeric(txStart) + 2500,
    strand == "-" ~ as.numeric(txEnd) + 2500)) %T>% 
    write_tsv("input/benchmark_coords.5kbp.tsv") %>% 
    makeGRangesFromDataFrame(keep.extra.columns = TRUE) %>% 
    keepStandardChromosomes()

Table S1

benchmark %>%
  as.data.frame %>%
  dplyr::select(gene_name, transcript_id, chr = seqnames, start, end, switch) %>% 
  arrange(desc(switch), gene_name) %T>% 
  write_tsv("tables/tableS2.tsv") %>% 
  xtable()

ABCDEFGHIJ0123456789

gene_name <chr>	transcript_id <chr>	chr <fctr>	start <int>	end <int>	switch <int>
ADCYAP1R1	NM_001199636	chr7	31089575	31094575	1
ADGRA1	NM_001083909	chr10	134899007	134904007	1
AJAP1	NM_018836	chr1	4712604	4717604	1
AK5	NM_012093	chr1	77745786	77750786	1
ATCAY	NM_033064	chr19	3878117	3883117	1
C2orf80	NM_001099334	chr2	209052273	209057273	1
CACNG3	NM_006539	chr16	24264373	24269373	1
CDH10	NM_006727	chr5	24642587	24647587	1
CDH18	NM_004934	chr5	19985853	19990853	1
CHRM1	NM_000738	chr11	62686512	62691512	1

4.2 Prepare genome-wide windows for validation

In order to later validate chromswitch genome-wide, we also clean up the gene list to define 5kbp windows genome-wide.

refseq_unique <- all_genes %>% 
      # Keep the longest transcript per gene
      mutate(length = txEnd - txStart) %>%
      group_by(gene_name) %>%
      dplyr::slice(which.max(length)) %>% 
      ungroup() %>% 
      # Filter out genes where the length of the gene is less than half the
      # window size
      filter(length >= 2500)

getRefSeq <- function(window) {
    
  d <- window / 2
    
  refseq_unique %>%
      # Define window around the TSS
      mutate(start = case_when(
          strand == "+" ~ as.numeric(txStart) - d,
          strand == "-" ~ as.numeric(txEnd) - d)) %>%
      mutate(end = case_when(
          strand == "+" ~ as.numeric(txStart) + d,
          strand == "-" ~ as.numeric(txEnd) + d)) %T>% 
      write_tsv(paste0("input/benchmark_coords.", window, ".tsv")) %>% 
      makeGRangesFromDataFrame(keep.extra.columns = TRUE) %>% 
      keepStandardChromosomes(pruning.mode = "coarse")
}

refseq_5kbp <- getRefSeq(5000)

4.3 Run `chromswitch` on benchmark dataset

Here we evaluate each feature matrix construction strategy on each type of input.

Summary strategy applied to H3K4me3

bench_s_k_raw <- mclapply(benchmark, safeCallSummary,
                       peaks = pk_k4me3,
                       metadata = meta,
                       mark = "H3K4me3",
                       filter = TRUE,
                       filter_columns = c("qValue", "signalValue"),
                       filter_thresholds = c(10, 4),
                       normalize = TRUE,
                       normalize_columns = c("qValue", "signalValue"),
                       summarize_columns = c("qValue", "signalValue"),
                       length = FALSE,
                       fraction = TRUE,
                       n = FALSE,
                       optimal_clusters = TRUE,
                       heatmap = FALSE,
                       BPPARAM = SerialParam(), mc.cores = n_cores) %>% 
  purrr::transpose()

bench_s_k <- bench_s_k_raw$result[map_lgl(bench_s_k_raw$error, is_null)] %>%
    bind_rows()

save(bench_s_k, file = "computations/bench_s_k.RData")

Summary strategy applied to DNase I

bench_s_d_raw <- mclapply(benchmark, safeCallSummary,
                               peaks = pk_dnase,
                               metadata = meta,
                               mark = "DNase",
                               filter = FALSE,
                               normalize = FALSE,
                               summarize_columns = NULL,
                               length = FALSE,
                               fraction = TRUE,
                               n = FALSE,
                               optimal_clusters = TRUE,
                               heatmap = FALSE,
                               BPPARAM = SerialParam(),
                               mc.cores = n_cores) %>% 
    purrr::transpose()

bench_s_d <- bench_s_d_raw$result[map_lgl(bench_s_d_raw$error, is_null)] %>%
    bind_rows()

save(bench_s_d, file = "computations/bench_s_d.RData")

Binary strategy applied to H3K4me3

bench_b_k_raw <- mclapply(benchmark,
                          safeCallBinary,
                          peaks = pk_k4me3,
                          metadata = meta,
                          filter = TRUE,
                          filter_columns = c("qValue","signalValue"),
                          filter_thresholds = c(10, 4),
                          reduce = TRUE,
                          gap = 300,
                          p = 0.4,
                          optimal_clusters = TRUE,
                          n_features = TRUE, mc.cores = n_cores) %>% 
  purrr::transpose()

bench_b_k <- bench_b_k_raw$result[map_lgl(bench_b_k_raw$error,
                                                is_null)] %>%
  bind_rows()

save(bench_b_k, file = "computations/bench_b_k.RData")

Binary strategy applied to DNase I

bench_b_d_raw <- mclapply(benchmark,
                          safeCallBinary,
                          peaks = pk_dnase,
                          metadata = meta,
                          filter = FALSE,
                          reduce = TRUE,
                          gap = 300,
                          p = 0.4,
                          optimal_clusters = TRUE,
                          n_features = TRUE, mc.cores = n_cores) %>% 
  purrr::transpose()

bench_b_d <- bench_b_d_raw$result[map_lgl(bench_b_d_raw$error, is_null)] %>%
  bind_rows()

save(bench_b_d, file = "computations/bench_b_d.RData")

Summary strategy applied to ChromHMM

bench_s_c_raw <- mclapply(benchmark,
                          safeCallSummary,
                          peaks = chmm_states,
                          metadata = meta,
                          mark = "State",
                          filter = FALSE,
                          normalize = FALSE,
                          summarize_columns = NULL,
                          length = FALSE,
                          fraction = TRUE,
                          n = FALSE,
                          heatmap = FALSE,
                          optimal_clusters = TRUE,
                          BPPARAM = SerialParam(),
                          mc.cores = n_cores) %>% 
  purrr::transpose()

bench_s_c <- bench_s_c_raw$result[map_lgl(bench_s_c_raw$error, is_null)] %>%
    bind_rows()

save(bench_s_c, file = "computations/bench_s_c.RData")

Binary strategy applied to ChromHMM

bench_b_c_raw <- mclapply(benchmark,
                          safeCallBinary,
                          peaks = chmm_states,
                          metadata = meta,
                          filter = FALSE,
                          reduce = FALSE,
                          p = 0.4,
                          optimal_clusters = TRUE,
                          n_features = TRUE,
                          mc.cores = n_cores) %>% 
  purrr::transpose()

bench_b_c <- bench_b_c_raw$result[map_lgl(bench_b_c_raw$error,
                                          is_null)] %>%
    bind_rows()

save(bench_b_c, file = "computations/bench_b_c.RData")

load("computations/bench_s_k.RData")
load("computations/bench_s_d.RData")
load("computations/bench_s_c.RData")
load("computations/bench_b_k.RData")
load("computations/bench_b_d.RData")
load("computations/bench_b_c.RData")

4.4 Random predictor

We compare with a random predictor by drawing from from a $Uniform(0, 1)$ distribution, repeating this 10,000 times, and then averaging the results to assign a score. This is done for each region. It should track with the diagonal and have an AUROC of around 0.5

#' @param k, the number of query regions to make a prediction for
#' @param N, the number of times to draw
predictRandom <- function(k, N) {
    
    rep(k, N) %>%
        mclapply(runif, mc.cores = 2) %>% 
        data.frame() %>% 
        rowMeans()
}

# Repeat twice, to get one set of random predictions per method
draws <- replicate(2, predictRandom(120, 10000), simplify = FALSE)
save(draws, file = "computations/draws.RData")

load("computations/draws.RData")

bench_s_k <- bench_s_k %>%
    add_column(Random = draws[[1]], .after = "Consensus")

bench_b_d <- bench_b_d %>%
    add_column(Random = draws[[2]], .after = "Consensus")

4.5 Shuffled labels

Our method predicts chromatin state switches by computing a score of the similarity between the known biological condition labels and inferred clusters. We can assess how these scores look when we shuffle the class labels. We cluster samples based on the feature matrix as usual, but before calculating any of the external validity indices, we randomly shuffle the labels.

predictShuffle <- function(clusters, N) {
  
  shuffle <- function(labels, clusters) {
    
    shuf <- sample(labels)
    contingency <- table(shuf, clusters)
    
    data.frame(
      ARI_shuffle          = mclust::adjustedRandIndex(clusters, shuf),
      NMI_shuffle          = chromswitch:::NMI(clusters, shuf),
      V_measure_shuffle    = chromswitch:::vMeasure(contingency)
    )
  }
  
  # Repeat N times
  replicate(N, shuffle(labels = meta$Condition, clust = unlist(clusters)),
            simplify = FALSE) %>%
    bind_rows() %>%
    colMeans()
}

shuffle_s_k <- bench_s_k %>%
    dplyr::select(matches("^E[0-9]{3}")) %>%
    t() %>%
    as.data.frame %>%
    mclapply(predictShuffle, N = 10000, mc.cores = n_cores) %>% 
    as.data.frame() %>% 
    t() %>% 
    as.data.frame()

shuffle_s_d <- bench_s_d %>%
    dplyr::select(matches("^E[0-9]{3}")) %>%
    t() %>%
    as.data.frame %>%
    mclapply(predictShuffle, N = 10000, mc.cores = n_cores) %>% 
    as.data.frame() %>% 
    t() %>% 
    as.data.frame()

shuffle_s_c <- bench_s_c %>%
    dplyr::select(matches("^E[0-9]{3}")) %>%
    t() %>%
    as.data.frame %>%
    mclapply(predictShuffle, N = 10000, mc.cores = n_cores) %>% 
    as.data.frame() %>% 
    t() %>% 
    as.data.frame()

save(shuffle_s_k, shuffle_s_d, shuffle_s_c,
     file = "computations/shuffle_s.RData")

shuffle_b_k <- bench_b_k %>%
    dplyr::select(matches("^E[0-9]{3}")) %>%
    t() %>%
    as.data.frame %>%
    mclapply(predictShuffle, N = 10000, mc.cores = n_cores) %>% 
    as.data.frame() %>% 
    t() %>% 
    as.data.frame()

shuffle_b_d <- bench_b_d %>%
    dplyr::select(matches("^E[0-9]{3}")) %>%
    t() %>%
    as.data.frame %>%
    mclapply(predictShuffle, N = 10000, mc.cores = n_cores) %>% 
    as.data.frame() %>% 
    t() %>% 
    as.data.frame()

shuffle_b_c <- bench_b_c %>%
    dplyr::select(matches("^E[0-9]{3}")) %>%
    t() %>%
    as.data.frame %>%
    mclapply(predictShuffle, N = 10000, mc.cores = n_cores) %>% 
    as.data.frame() %>% 
    t() %>% 
    as.data.frame()

save(shuffle_b_c, shuffle_b_d, shuffle_b_k,
     file = "computations/shuffle_b.RData")

load("computations/shuffle_s.RData")
load("computations/shuffle_b.RData")

addShuf <- function(df) {
  df %>% 
    mutate(Consensus_shuffle =
               rowMeans(dplyr::select(., ARI_shuffle, NMI_shuffle, V_measure_shuffle)))
}

bench_s_k <- bind_cols(bench_s_k, shuffle_s_k) %>% addShuf()
bench_s_d <- bind_cols(bench_s_d, shuffle_s_d) %>% addShuf()
bench_s_c <- bind_cols(bench_s_c, shuffle_s_c) %>% addShuf()

bench_b_k <- bind_cols(bench_b_k, shuffle_b_k) %>% addShuf()
bench_b_d <- bind_cols(bench_b_d, shuffle_b_d) %>% addShuf()
bench_b_c <- bind_cols(bench_b_c, shuffle_b_c) %>% addShuf()

4.6 ROC curves

The problem of a detecting a chromatin state switch in a region can be reframed as the problem of classifying the region as containing a switch between conditions or not. Therefore, our method can be regarded as a binary classifier, and evaluated accordingly. We evaluate the performance of our method on the benchmark set by computing Receiver-Operating Characteristic (ROC) curves based on the consensus score computed by chromswitch. ROC curves measure the tradeoff between the true positive rate and the false positive rate as the threshold on the consensus score is varied, as well as the area under the ROC curve (AUROC) which takes on a value of 1 for a perfect classifier and can be interpreted as the probability that chromswitch will score a randomly chosen region corresponding to a chromatin state switch higher than a randomly chosen region which does not exhibit a switch.

calcROC <- function(scores, method, input, smooth) {
    
    # Compute ROC info using each score as a predictor
    roc_out <- scores %>% 
        dplyr::select(
        matches("Consensus|Random")) %>% 
        map(~ pROC::roc(response = scores$switch,
                       predictor = as.numeric(.),
                       smooth = smooth))
    
    # Extract values to plot ROC curves
    roc_val <- roc_out %>%
        map(`[`, c("sensitivities", "specificities")) %>%
        map(as.data.frame) %>%
        map2_df(names(roc_out), function(df, index) mutate(df, index = index)) %>%
        mutate(TPR = sensitivities,
               FPR = 1 - specificities,
               Method = method,
               Input = input)

    return(roc_val)
}

benchmark_roc <- bind_rows(
  calcROC(bench_s_k, "Summary", "H3K4me3",  smooth = TRUE),
  calcROC(bench_s_d, "Summary", "DNase I",  smooth = TRUE),
  calcROC(bench_s_c, "Summary", "ChromHMM", smooth = TRUE),
  calcROC(bench_b_k, "Binary",  "H3K4me3",  smooth = TRUE),
  calcROC(bench_b_d, "Binary",  "DNase I",  smooth = TRUE),
  calcROC(bench_b_c, "Binary",  "ChromHMM", smooth = TRUE)) %>% 
  filter(grepl("Consensus|Random", index)) %>%
  filter(index != "Random" |
          (index == "Random") & (Method == "Binary") & (Input == "DNase I")) %>% 
  mutate(Input = ifelse(index == "Random", "NA", Input)) %T>% 
  write_tsv("tables/benchmark_roc.tsv")

roc_gg <- benchmark_roc %>% 
  makeLabel()

Figure 1b

roc_gg %>%
    filter(Input != "DNase I") %>%
    filter(index != "Consensus_shuffle") %>%
    ggplot(aes(x = FPR, y = TPR, colour = Label, linetype = Label)) +
    geom_line(size = 1) +
    scale_colour_manual(name = "Strategy", values = pal_strategy) +
    scale_linetype_manual(name = "Strategy", values = linetype_strategy) +
    theme_min() +
    xlab("False Positive Rate") + ylab("True Positive Rate") +
    theme(legend.position = "bottom") +
    labs(colour = "") + labs(linetype = "") +
    guides(colour = guide_legend(ncol = 1))

Figure S2

roc_gg %>% 
    filter(index != "Random") %>% 
    ggplot(aes(x = FPR, y = TPR, colour = Label, linetype = Label)) +
    geom_line(size = 1) +
    scale_colour_manual(name = "Strategy", values = pal_strategy) +
    scale_linetype_manual(name = "Strategy", values = linetype_strategy2) +
    theme_min() +
    # theme(legend.position = "bottom") +
    xlab("False Positive Rate") + ylab("True Positive Rate") +
    theme(legend.position = "bottom") +
    facet_wrap(~ Input) +
    labs(colour = "") + labs(linetype = "") +
    guides(colour = guide_legend(ncol = 1))

4.7 AUROC

calcAUC <- function(scores, method, input) {

    scores %>%
        dplyr::select(
        matches("Consensus|Random")) %>%
        map(~pROC::roc(response = scores$switch, predictor = as.numeric(.))) %>%
        map_df("auc") %>%
        mutate(Method = method,
               Input = input)
}

benchmark_auc <- bind_rows(
  calcAUC(bench_s_k, "Summary", "H3K4me3"),
  calcAUC(bench_s_d, "Summary", "DNase I"),
  calcAUC(bench_s_c, "Summary", "ChromHMM"),
  calcAUC(bench_b_k, "Binary", "H3K4me3"),
  calcAUC(bench_b_d, "Binary", "DNase I"),
  calcAUC(bench_b_c, "Binary", "ChromHMM")) %>% 
  select(Method, Input, Consensus, Consensus_shuffle, Random)

benchmark_auc

ABCDEFGHIJ0123456789

Method <chr>	Input <chr>	Consensus <dbl>	Consensus_shuffle <dbl>	Random <dbl>
Summary	H3K4me3	0.9833333	0.6530556	0.5844444
Summary	DNase I	0.9466667	0.5350000	NA
Summary	ChromHMM	0.9373611	0.7661111	NA
Binary	H3K4me3	0.9969444	0.5513889	NA
Binary	DNase I	0.9902778	0.7527778	0.4975000
Binary	ChromHMM	0.9731944	0.6450000	NA

Figure S2

# A little manipulation to make the table for the supplementary

benchmark_auc %>%
  gather(Stat, Score, Consensus, Consensus_shuffle) %>%
  rowwise() %>%
  mutate(Method = ifelse(grepl("shuffle", Stat),
                         paste0(Method, " (shuffled)"), Method)) %>%
  ungroup() %>% 
  dplyr::select(Method, Input, Score) %>%
  spread(Input, Score) %T>%
  write_tsv("tables/benchmark_auc.tsv") %>% 
  xtable() # Generate latex code

ABCDEFGHIJ0123456789

	Method <chr>	ChromHMM <dbl>	DNase I <dbl>	H3K4me3 <dbl>
1	Binary	0.9731944	0.9902778	0.9969444
2	Binary (shuffled)	0.6450000	0.7527778	0.5513889
3	Summary	0.9373611	0.9466667	0.9833333
4	Summary (shuffled)	0.7661111	0.5350000	0.6530556

5 Robustness experiments

5.1 Tuning parameters

There are two parameters used in the feature construction process for the binary approach, gap which controls whether nearby peaks within samples are connected, and p which controls how much overlap is required to call two peaks the same. To assess the impact of these parameters, we’ll perform a grid search on various combinations.

Since we want to investigate the effect of varying the tuning parameters, we select the best threshold on the Consensus score on this benchmarking dataset:

getBest <- function(bench_out) {
    
    best_consensus <- pROC::roc(predictor = bench_out$Consensus,
                            response = bench_out$switch) %>%
    pROC::coords("best",
                 ret = c("threshold", "sensitivity",
                         "specificity", "accuracy", "ppv",
                 as.list = TRUE, drop = TRUE))
    
    best_threshold <- best_consensus["threshold"]
    return(best_threshold)
}

best_k <- getBest(bench_b_k)
best_k

## threshold 
## 0.2156931

grid <- expand.grid(gap = seq(200, 500, by = 50),
                    p_overlap = seq(0.2, 0.8, by = 0.1))

evaluateParams <- function(gap, p, best_threshold, ...) {
    
    out <- benchmark %>% 
        mclapply(safeCallBinary, gap = gap, p = p, ...) %>% 
        purrr::transpose()
    
    ok <- map_lgl(out$error, is_null)
    out <- out$result[ok] %>%
        bind_rows() %>% 
        mutate(pred = ifelse(Consensus >= best_threshold, 1, 0))
    
    # Now we can tally the results by comparing the predictions
    # to the known classes (switch or not)
    stats <- table(benchmark$switch[ok], out$pred)
    
    tn = fn = tp = fp = 0
    try(tn <- stats["0", "0"], silent = TRUE)
    try(fn <- stats["1", "0"], silent = TRUE)
    try(tp <- stats["1", "1"], silent = TRUE)
    try(fp <- stats["0", "1"], silent = TRUE)
    
    df <- data.frame(gap, p, tp, tn, fp, fn)
    return(list(predictions = out$pred, stats = df))
} 

safeEvalParams <- purrr::safely(evaluateParams)
grid_out <- Map(function(gap, p) safeEvalParams(gap, p,
                                                   best_threshold = best_k,
                                                   peaks = pk_k4me3,
                                                   metadata = meta,
                                                   filter = TRUE,
                                                   filter_columns = c("qValue", "signalValue"),
                                                   filter_thresholds = c(10, 4),
                                                   reduce = TRUE,
                                                   heatmap = FALSE,
                                                   BPPARAM = SerialParam(),
                                                   optimal_clusters = TRUE,
                                                   mc.cores = n_cores),
                   grid$gap, grid$p_overlap) %>%
    purrr::transpose()

save(grid_out, file = "computations/grid.RData")

Figure S4

load("computations/grid.RData")
makeGrid(grid_out, "H3K4me3")

load("tables/grid_scores.H3K4me3.RData")
plotGrid(grid_scores)

5.2 Sample size and class imbalance

Next, we investigate how sensitive or robust chromswitch is to small numbers of samples and/or to datasets with high class imbalance (i.e. when samples in one condition greatly outnumber samples in the other). To do so, we will subsample from the samples in the full 23-tissue dataset to create many smaller datasets with fewer samples and greater class imbalance, and evaluate how chromswitch performs in classifying the benchmark regions using these smaller datasets.

For the summary feature matrix construction strategy, the computation of the feature vector for each sample is independent of all other samples, so we can save the feature matrices and subsample rows from the matrices. Here we compute and save the feature matrix for each region:

getSummaryFeatures <- function(query) {
  
  cols <- c("qValue", "signalValue")
  
  ft_mat <- pk_k4me3 %>% 
    filterPeaks(columns = cols,
                thresholds = c(10, 4)) %>% 
    normalizePeaks(columns = cols) %>% 
    retrievePeaks(metadata = meta, region = query) %>% 
    summarizePeaks(mark = "H3K4me3", cols = cols,
                   length = FALSE, fraction = TRUE, n = FALSE)
  
  return(ft_mat)
  
}

safeGetFeatures <- purrr::safely(getSummaryFeatures)

ft_s_k_raw <- benchmark %>% 
  mclapply(safeGetFeatures, mc.cores = n_cores) %>% 
  purrr::transpose()

ft_s_k <- ft_s_k_raw$result
save(ft_s_k, ft_s_k_raw,
     file = "computations/benchmark_features_s_k.RData")

For the binary strategy, the feature matrices are derived from the union of peaks supplied across samples, and are therefore dependent on the exact set of samples, so we will only save the local peaks (the peaks in the query region) for each sample.

getBinaryPeaks <- function(query) {
  
  pks <- pk_k4me3 %>% 
    retrievePeaks(metadata = meta, region = query) %>% 
    reducePeaks(gap = 300)
  
  return(pks)
  
}

safeGetBinaryPeaks <- purrr::safely(getBinaryPeaks)

pk_b_k_raw <- benchmark %>% 
  mclapply(safeGetBinaryPeaks, mc.cores = n_cores) %>% 
  purrr::transpose()
  
pk_b_k  <- pk_b_k_raw$result

# Extract the peaks from the localPeaks object
lpk_b_k <- lapply(pk_b_k, chromswitch:::lpkPeaks)

save(pk_b_k_raw, pk_b_k, lpk_b_k,
     file = "computations/benchmark_peaks_b_k.RData")

load("computations/benchmark_features_s_k.RData")
load("computations/benchmark_peaks_b_k.RData")

These are the different datasets we’ll construct, each with a different number and proportion of brain & other samples.

combos <- data.frame(
    n_other = c(2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 16, 16, 16, 10, 12, 15, 16, 16),
    n_brain = c(2, 3, 4, 5, 6, 7, 7,  7,  7,  7,  7,  2,  3,  5,  4,  3,  4,  5)
)

combos

ABCDEFGHIJ0123456789

n_other <dbl>	n_brain <dbl>
2	2
3	3
4	4
5	5
6	6
7	7
8	7
10	7
12	7
14	7

lvls_combos <- c(unlist(Map(function(x, y)
    paste0(x ," brain vs. ", y, " other"),
    combos$n_brain, combos$n_other)))

idx_other <- which(meta$Condition == "Other")
idx_brain <- which(meta$Condition == "Brain")

Since in some cases it’s possible that there are few enough combinations that we can exhaustively try all of them, we will do so when there are fewer than 100 combinations. Otherwise, we will evaluate 100 randomly subsample ddatasets to for each combination.

How many possible datasets exist for each pair of n_brain and n_other?

n_pos <- function(n_other, n_brain, idx_other) choose(length(idx_other), n_other) * choose(length(idx_brain), n_brain)

combos$n_pos <- unlist(Map(function(n_other, n_brain) n_pos(n_other, n_brain, idx_other),
                            combos$n_other, combos$n_brain))
combos

ABCDEFGHIJ0123456789

n_other <dbl>	n_brain <dbl>	n_pos <dbl>
2	2	2520
3	3	19600
4	4	63700
5	5	91728
6	6	56056
7	7	11440
8	7	12870
10	7	8008
12	7	1820
14	7	120

For the code below, the functions that carry out the subsampling experiment are sourced from functions/subsampling_functions.R.

safeSubsampleFt <- purrr::safely(subsampleFromFeatures)
safeSubsamplePk <- purrr::safely(subsampleFromPeaks)

subsamp_s_raw <- mcMap(function(n_other, n_brain)
    safeSubsampleFt(ft_s_k, n_other, n_brain, idx_other),
    combos$n_other, combos$n_brain,
    mc.cores = n_cores) %>%
    purrr::transpose()

subsamp_s <- subsamp_s_raw$result
save(subsamp_s, subsamp_s_raw, file = "computations/subsamp_s.RData")

subsamp_b_raw <- mcMap(function(n_other, n_brain)
  safeSubsamplePk(pk_b_k, n_other, n_brain, idx_other),
    combos$n_other, combos$n_brain,
    mc.cores = n_cores) %>%
    purrr::transpose()

subsamp_b <- subsamp_b_raw$result
save(subsamp_b, subsamp_b_raw, file = "computations/subsamp_b.RData")

Now we can process the results, and calculate mean AUC scores for each combination so we can quantify performance.

load("computations/subsamp_s.RData")
load("computations/subsamp_b.RData")

auc_s <- subsamp_s %>% map_df("auc") %>% 
  mutate(Method = "Summary", Input = "H3K4me3")

auc_b <- subsamp_b %>% map("scores") %>% 
  map_df("auc") %>% 
  mutate(Method = "Binary", Input = "H3K4me3")

# Simplify the facet labels
lvls_combos_new <- c(unlist(Map(function(x, y)
    paste0(x ," vs. ", y),
    combos$n_brain, combos$n_other)))
lvls_combos_new[1] <- "2 brain vs. 2 other"

# Calculate the mean AUC for each combination of n_brain, n_other
mean_auc <- bind_rows(auc_s, auc_b) %>% 
    rowwise() %>% 
    mutate(dataset = ifelse(n_brain == 2 && n_other == 2,
                            "2 brain vs. 2 other",
                            paste0(n_brain, " vs. ", n_other))) %>%
    ungroup() %>% 
    mutate(dataset = factor(dataset, levels = lvls_combos_new)) %>%
    dplyr::select(dataset, iteration, AUC = Consensus, Method, Input) %>%
    group_by(dataset, Method, Input) %>%
    summarise(mean_AUC = paste0("AUC: ", round(mean(AUC), 2))) %T>%
    write_tsv("tables/subsampling_auc.tsv")

roc_s <- subsamp_s %>% map_df("roc") %>%
  mutate(Method = "Summary", Input = "H3K4me3")

roc_b <- subsamp_b %>% map("scores") %>% 
  map_df("roc") %>% 
  mutate(Method = "Binary", Input = "H3K4me3")

roc_df <- bind_rows(roc_s, roc_b) %>% 
  rowwise() %>% 
  mutate(dataset = ifelse(n_brain == 2 && n_other == 2,
                          "2 brain vs. 2 other",
                          paste0(n_brain, " vs. ", n_other))) %>%
  ungroup() %>%
  mutate(dataset = factor(dataset, levels = lvls_combos_new)) %>%
  filter(index == "Consensus") %>% 
  dplyr::select(-index)

# Ensures that the iterations are plot in an order that looks nice
# with the colours of the lines
roc_df$iteration <- factor(roc_df$iteration,
                           levels = sort(unique(roc_df$iteration),
                                         decreasing = TRUE))

In the figures below, the ROC curve for every individual subsample dataset is plot in a shade of gray, while the mean across iterations, for each combination, is plot in red (summary strategy) or blue (binary strategy).

Figure S3a

roc_df %T>%
    write_tsv("tables/subsampling_roc.tsv") %>% 
    filter(Method == "Summary" & Input == "H3K4me3") %>%
    plotSubsamp("red", "Summary", "H3K4me3")

Figure S3b

roc_df %>%
    filter(Method == "Binary" & Input == "H3K4me3") %>%
    plotSubsamp("blue", "Binary", "H3K4me3")

5.3 Window size

Since the chromswitch algorithm depends on the data in the genomic query window, which is selected by the user, we next considered the effect of variations in the window size. We studied how the consensus score changed as a function of the window size for three genes in our benchmark dataset, using H3K4me3 peaks.

window_genes <- c("GRIK2", "CHRM1", "TTYH1")

windows <- refseq_5kbp %>%
    as.data.frame %>%
    filter(gene_name %in% window_genes) %>%
    select(-start, -end)

get_window <- function(window_size, genes) {

    d <- window_size / 2

    genes %>%
        # Define window around the TSS
        mutate(start = case_when(
            strand == "+" ~ as.numeric(txStart) - d,
            strand == "-" ~ as.numeric(txEnd) - d)) %>%
        mutate(end = case_when(
            strand == "+" ~ as.numeric(txStart) + d,
            strand == "-" ~ as.numeric(txEnd) + d)) %>%
         mutate(gene_name = paste0(gene_name, "_", window_size/1000, "kbp"))
}

# Window sizes to test, in base pairs
sizes <- c(seq(1000, 10000, by = 1000),
           seq(10000, 50000, by = 10000),
           100000, 250000, 500000)

windows_gr <- lapply(sizes, get_window, windows) %>% bind_rows() %>%
  makeGRangesFromDataFrame(keep.extra.columns = TRUE)

win_s <- callSummary(windows_gr,
         peaks = pk_k4me3,
         metadata = meta,
         mark = "H3K4me3",
         filter = TRUE,
         filter_columns = c("qValue", "signalValue"),
         filter_thresholds = c(10, 4),
         normalize = TRUE,
         normalize_columns = c("qValue", "signalValue"),
         summarize_columns = c("qValue", "signalValue"),
         length = FALSE,
         fraction = TRUE,
         n = FALSE,
         optimal_clusters = TRUE,
         heatmap = TRUE,
         titles = paste0(windows_gr$gene_name, "_summary"),
         outdir = "figures/heatmaps")

win_s

save(win_s, file = "computations/windowsize_s.RData")

win_b <- callBinary(query = win_to_test,
                          peaks = pk_k4me3,
                          metadata = meta,
                          filter = TRUE,
                          filter_columns = c("qValue","signalValue"),
                          filter_thresholds = c(10, 4),
                          reduce = TRUE,
                          gap = 300,
                          p = 0.4,
                          optimal_clusters = TRUE,
                          heatmap = TRUE,
                          titles = paste0(windows_gr$gene_name, "_binary"),
                          outdir = "figures/heatmaps",
                          n_features = TRUE)

win_b

save(win_b, file = "computations/windowsize_b.RData")

Figure S5

load("computations/windowsize_s.RData")
load("computations/windowsize_b.RData")

windows <- bind_rows(mutate(win_s, method = "Summary"), mutate(win_b, method = "Binary")) %>% 
  select(gene_name, Consensus, method) %>% 
  separate(gene_name, into = c("gene_name", "window_size"), sep = "_") %>% 
  separate(window_size, into = c("window_size", "drop"), sep = "k") %>% 
  select(-drop) %>% 
  mutate(window_size = as.numeric(window_size)) %T>%
  write_tsv("tables/window_size.tsv")

# Plot TTYH1, "zoomed" in in the 0-50kbp range and in the 0-500kbp range
ttyh1_short <- windows %>% 
  filter(gene_name == "TTYH1") %>% 
  filter(window_size <= 50) %>% 
  mutate(range = "50kbp")

ttyh1_long <- windows %>% 
  filter(gene_name == "TTYH1") %>% 
  filter(window_size <= 500) %>% 
  mutate(range = "500kbp")

bind_rows(ttyh1_short, ttyh1_long) %>% 
  mutate(range = factor(range, levels = c("50kbp", "500kbp"))) %>% 
  ggplot(aes(x = window_size, y = Consensus)) +
  geom_point(aes(colour = method)) +
  geom_line(aes(colour = method), alpha = 0.8, size = 0.5) +
  scale_colour_manual(name = "Method", values = pal_methods) +
  xlab("Window Size (kbp)") + ylab("Consensus score") +
  facet_wrap(~ range, ncol = 1, scales = "free_x") +
  ggmin::theme_min() +
  theme(legend.position = c(0.9, 0.33)) +
  theme(strip.background = element_blank(), strip.text.x = element_blank())

5.4 Analysis of a dispersed mark

This package was initially designed to facilitate the task of identifying epigenetic changes for challenging datasets with abnormal patterns of enrichment of certain histone modifications, like H3K27me3. To analyze how chromswitch performs for a mark with a broad profile of enrichment, we will investigate a few regions where we expect a difference in K27me3 enrichment between human embryonic stem cells (ESC) and differentiated adult tissues.

5.4.1 Download data

We use ESC samples from the Roadmap Epigenomics Project, compared to the same adult tissues we used above.

esc_meta <- meta %>%
  filter(Condition == "Other") %>%
  mutate(Condition = "Adult") %>%
  bind_rows(read_tsv("input/esc_metadata.tsv"))

## Parsed with column specification:
## cols(
##   Sample = col_character(),
##   Tissue = col_character(),
##   Condition = col_character()
## )

esc_meta

ABCDEFGHIJ0123456789

Sample <chr>	Tissue <chr>	Condition <chr>
E065	Aorta	Adult
E066	Adult_Liver	Adult
E075	Colonic_Mucosa	Adult
E077	Duodenum_Mucosa	Adult
E079	Esophagus	Adult
E095	Left_Ventricle	Adult
E098	Pancreas	Adult
E100	Psoas_Muscle	Adult
E101	Rectal_Mucosa.Donor_29	Adult
E102	Rectal_Mucosa.Donor_31	Adult

k27m_regions <- c(GRanges("chr7:27122134-27298855"), # HOX
                  GRanges("chr16:54125741-57125741"), # IRX3 3 Mbp region
                  GRanges("chr16:54313663-54328454")) # IRX3

mcols(k27m_regions)$id <- c("HOX", "IRX3_3Mbp", "IRX3")

Download narrow peaks for H3K27me3:

urls_k27 <- paste0("http://egg2.wustl.edu/roadmap/data/byFileType/peaks/consolidated/narrowPeak/",
                      esc_meta$Sample, "-H3K27me3.narrowPeak.gz")
basenames_k27 <- lapply(urls_k27, basename) %>% unlist()
destpaths_k27 <- paste0("data/H3K27me3_peaks/", basenames_k27)

dl_k27 <- mcMap(noClobberDownload, urls_k27, destpaths_k27, mc.cores = n_cores)

extra_cols <- c("name" = "character", "score" = "integer", "strand" = "character",
                "signalValue" = "numeric", "pValue" = "numeric", "qValue" = "numeric",
                "peak" = "numeric")

pk_k27me3 <- lapply(destpaths_k27, rtracklayer::import, format = "bed",
                   extraCols = extra_cols)
names(pk_k27me3) <- esc_meta$Sample

save(pk_k27me3, file = "computations/pk_k27me3.RData")

5.4.2 Run `chromswitch` using H3K27me3 peaks

k27m_s <- callSummary(k27m_regions,
            peaks = pk_k27me3,
            metadata = esc_meta,
            mark = "H3K27me3",
            filter = TRUE,
            filter_columns = c("qValue", "signalValue"),
            filter_thresholds = c(10, 10),
            normalize = TRUE,
            normalize_columns = c("qValue", "signalValue"),
            summarize_columns = c("qValue", "signalValue"),
            length = FALSE,
            fraction = TRUE,
            n = FALSE,
            optimal_clusters = TRUE,
            heatmap = TRUE,
            outdir = "figures/heatmaps",
            titles = paste0(k27m_regions$id, "_summary"))

save(k27m_s, file = "computations/h3k27me3_summary.RData")

k27m_b <- callBinary(query = k27m_regions,
           peaks = pk_k27me3,
           metadata = esc_meta,
           filter = TRUE,
           filter_columns = c("qValue","signalValue"),
           filter_thresholds = c(10, 10),
           reduce = TRUE,
           gap = 300,
           p = 0.4,
           optimal_clusters = TRUE,
           heatmap = TRUE,
           n_features = TRUE,
           outdir = "figures/heatmaps",
           titles = paste0(k27m_regions$id, "_binary"))

save(k27m_b, file = "computations/h3k27me3_binary.RData")

Presented in Figures S6-8

load("computations/h3k27me3_summary.RData")
load("computations/h3k27me3_binary.RData")

# By the summary strategy
k27m_s %>% 
  mutate(Method = "Summary") %>% 
  select(region, id, k, Average_Silhouette, Consensus, Method)

ABCDEFGHIJ0123456789

region <chr>	id <chr>	k <int>	Average_Silhouette <dbl>	Consensus <dbl>	Method <chr>
chr7:27122134-27298855	HOX	2	0.7676883	0.6610259	Summary
chr16:54125741-57125741	IRX3_3Mbp	5	0.7081706	0.5046871	Summary
chr16:54313663-54328454	IRX3	2	0.7829639	0.5168098	Summary

# By the binary strategy
k27m_b %>% 
  mutate(Method = "Binary") %>% 
  select(region, id, k, Average_Silhouette, Consensus, Method)

ABCDEFGHIJ0123456789

region <chr>	id <chr>	k <int>	Average_Silhouette <dbl>	Consensus <dbl>	Method <chr>
chr7:27122134-27298855	HOX	8	0.6307768	0.5760195	Binary
chr16:54125741-57125741	IRX3_3Mbp	2	0.5255516	0.5168098	Binary
chr16:54313663-54328454	IRX3	7	0.8333333	0.6283202	Binary

6 Genome-wide validation

To validate chromswitch, we will evaluate whether it’s capable of detecting chromatin state switches that result in brain-specific expression. We start by running chromswitch genome-wide to 5kb windows around TSS. Chromswitch makes a simple assessment of how chromatin states inferred from the data associate with the biological conditions of the samples, but we will need a more refined set of candidates to reasonably expect an association with gene expression data. Therefore, as candidates for validation, we select genes where the chromswitch finds an open chromatin state in brain tissues and closed chromatin in other tissues, evaluate whether these correspond to brain-specific expression.

6.1 Prepare GTEx data

We downloaded data from the Genotype-Tissue Expression Project database. The file data/GTEx_Analysis_v6p_RNA-seq_RNA-SeQCv1.1.8_gene_median_rpkm.gct was downloaded from the RNA-seq section at this link: https://gtexportal.org/home/datasets and contains the mean RPKM per tissue type per gene.

First we assign more concise names to the tissues and filter to a set which is comparable to our epigenomic dataset.

keep_tissue <- c("Amygdala", "Anterior cingulate cortex", "Basal ganglia", 
                 "Cerebella hemisphere", "Cerebellum", "Cortex",
                 "Frontal cortex",
                 "Hippocampus", "Hypothalamus", "Substantia nigra", "Aorta",
                 "Sigmoid colon", "Transverse colon", "Esophagus",
                 "Heart (atrial appendage)", "Heart (left ventricle)",
                 "Kidney", "Liver", "Lung", "Skeletal muscle",
                 "Pancreas", "Small intestine", "Stomach")

brain <-  c("Amygdala", "Anterior cingulate cortex", "Basal ganglia", 
            "Cerebella hemisphere", "Cerebellum", "Cortex", "Frontal cortex",
            "Hippocampus", "Hypothalamus", "Substantia nigra")

other <- keep_tissue[!(keep_tissue %in% brain)]

gtex_rpkm <- read_tsv(
      "data/GTEx_Analysis_v6p_RNA-seq_RNA-SeQCv1.1.8_gene_median_rpkm.gct",
      skip = 2) %>%
    dplyr::rename(gene_name = Description) %>%
    unite(col = "unite", sep = "=", Name, gene_name) %>%
    gather(key = "Tissue", value = "mean_RPKM", 2:length(.)) %>% 
    mutate(Tissue = recode(.x = Tissue,
        `Brain - Amygdala` = "Amygdala",
        `Brain - Anterior cingulate cortex (BA24)` = "Anterior cingulate cortex",
        `Brain - Caudate (basal ganglia)` = "Basal ganglia",
        `Brain - Cerebellar Hemisphere` = "Cerebellar hemisphere",
        `Brain - Cerebellum` = "Cerebellum",
        `Brain - Cortex` = "Cortex",
        `Brain - Frontal Cortex (BA9)` = "Frontal cortex",
        `Brain - Hippocampus` = "Hippocampus",
        `Brain - Hypothalamus` = "Hypothalamus",
        `Brain - Nucleus accumbens (basal ganglia)` = "Nucleus accumbens",
        `Brain - Putamen (basal ganglia)` = "Putamen",
        `Brain - Substantia nigra` = "Substantia nigra",
        `Adipose - Subcutaneous` = "Subcutaneous adipose",
        `Adipose - Visceral (Omentum)` = "Omentum",
        `Artery - Aorta` = "Aorta",
        `Artery - Coronary` = "Coronary artery",
        `Artery - Tibial` = "Tibial artery",
        `Bladder` = "Bladder",
        `Breast - Mammary Tissue` = "Breast",
        `Cervix - Ectocervix` = "Ectocervix",
        `Cervix - Endocervix` = "Endocervix", 
        `Colon - Sigmoid` = "Sigmoid colon",
        `Colon - Transverse` = "Transverse colon",
        `Esophagus - Gastroesophageal Junction` = "Esophagus (GEJ)",
        `Esophagus - Mucosa` = "Esophageal mucosa",
        `Esophagus - Muscularis` = "Esophagus (Muscularis)",
        `Fallopian Tube` = "Fallopian tube",
        `Heart - Atrial Appendage` = "Heart (atrial appendage)",
        `Heart - Left Ventricle` = "Heart (left ventricle)",
        `Kidney - Cortex` = "Kidney",
        `Liver` = "Liver",
        `Lung` = "Lung",
        `Muscle - Skeletal` = "Skeletal muscle",
        `Pancreas` = "Pancreas",
        `Prostate` = "Prostate",
        `Skin - Nut Sun Exposed (Suprapubic)` = "Skin",
        `Small Intestine - Terminal Ileum` = "Small intestine",
        `Spleen` = "Spleen",
        `Stomach` = "Stomach",
        `Thyroid` = "Thyroid",
        `Uterus` = "Uterus",
        `Vagina` = "Vagina", .default = "drop")) %>% 
    filter(Tissue %in% keep_tissue) %>% 
    spread(Tissue, mean_RPKM) %>% 
    separate(unite, into = c("Name", "gene_name"), sep = "=")

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   Name = col_character(),
##   Description = col_character()
## )

## See spec(...) for full column specifications.

Calculate the log2 fold change of expression between brain tissues and other tissues and the dataset.

gtex_fc <- gtex_rpkm

# For each gene, calculate the median RPKM across the tissues in each group
gtex_fc$median_brain_RPKM <- apply(select_(gtex_rpkm, brain), 1, median)
gtex_fc$median_other_RPKM <- apply(select_(gtex_rpkm, other), 1, median)

# Calculate log2 FC with a pseudocount
gtex_fc <- gtex_fc %>%
    mutate(FC     = (median_brain_RPKM + 1) / (median_other_RPKM + 1),
           log_FC = log(FC, base = 2))

6.2 Run `chromswitch` genome wide

6.2.1 Summary strategy

We evaluate the summary strategy on each of the three input types.

H3K4me3:

val_s_k_raw <- refseq_5kbp %>% 
  mclapply(safeCallSummary,
           peaks = pk_k4me3,
           metadata = meta,
           mark = "H3K4me3",
           filter = TRUE,
           filter_columns = c("qValue", "signalValue"),
           filter_thresholds = c(10, 4),
           normalize = TRUE,
           normalize_columns = c("qValue", "signalValue"),
           summarize_columns = c("qValue", "signalValue"),
           length = FALSE,
           fraction = TRUE,
           n = FALSE,
           heatmap = FALSE,
           estimate_state = TRUE,
           signal_col = "signalValue",
           test_condition = "Brain", 
           optimal_clusters = TRUE,
           BPPARAM = SerialParam(),
           mc.cores = n_cores) %>% 
  purrr::transpose()

val_s_k <- val_s_k_raw$result[map_lgl(val_s_k_raw$error, is_null)] %>% 
  bind_rows()

save(val_s_k, val_s_k_raw, file = "computations/validation_s_k.RData")

ChromHMM:

val_s_c_raw <- refseq_5kbp %>%
    mclapply(safeCallSummary,
             peaks = chmm_states,
             metadata = meta,
             mark = "State",
             filter = FALSE,
             normalize = FALSE,
             summarize_columns = NULL,
             length = FALSE,
             fraction = TRUE,
             n = FALSE,
             heatmap = FALSE,
             optimal_clusters = TRUE,
             estimate_state = TRUE,
             signal_col = "fraction",
             test_condition = "Brain",
             BPPARAM = SerialParam(),
             mc.cores = n_cores) %>% 
    purrr::transpose()

val_s_c <- val_s_c_raw$result[map_lgl(val_s_c_raw$error, is_null)] %>% 
    bind_rows()

save(val_s_c_raw, val_s_c, file = "computations/validation_s_c.RData")

DNase I:

val_s_d_raw <- refseq_5kbp %>%
    mclapply(safeCallSummary,
             peaks = pk_dnase,
             metadata = meta,
             mark = "DNase",
             filter = FALSE,
             normalize = FALSE,
             summarize_columns = NULL,
             length = FALSE,
             fraction = TRUE,
             n = FALSE,
             heatmap = FALSE,
             estimate_state = TRUE,
             signal_col = "fraction",
             test_condition = "Brain",
             optimal_clusters = TRUE,
             BPPARAM = SerialParam(), # Parallelize outside
             mc.cores = n_cores) %>% 
    purrr::transpose() 

val_s_d <- val_s_d_raw$result[map_lgl(val_s_d_raw$error, is_null)] %>%
    bind_rows()

save(val_s_d_raw, val_s_d, file = "computations/validation_s_d.RData")

6.2.2 Binary strategy

H3K4me3:

val_b_h_raw <- refseq_5kbp %>% 
  mclapply(safeCallBinary,
           peaks = pk_k4me3,
           metadata = meta,
           filter = TRUE,
           filter_columns = c("qValue", "signalValue"),
           filter_thresholds = c(10, 4),
           reduce = TRUE,
           gap = 300,
           p = 0.4,
           heatmap = FALSE,
           n_features = TRUE,
           estimate_state = TRUE,
           test_condition = "Brain",
           optimal_clusters = TRUE,
           BPPARAM = SerialParam(),
           mc.cores = n_cores) %>% 
  purrr::transpose()

val_b_h <- val_b_h_raw$result[map_lgl(val_b_h_raw$error, is_null)] %>% 
  bind_rows()

save(val_b_h, val_b_h_raw, file = "computations/validation_b_h.RData")

DNase I:

val_b_d_raw <- refseq_5kbp[1] %>% 
  mclapply(safeCallBinary,
           peaks = pk_dnase,
           metadata = meta,
           filter = FALSE,
           reduce = TRUE,
           gap = 300,
           p = 0.4,
           heatmap = FALSE,
           n_features = TRUE,
           estimate_state = TRUE,
           test_condition = "Brain",
           optimal_clusters = TRUE,
           BPPARAM = SerialParam(),
           mc.cores = n_cores) %>% 
  purrr::transpose()

val_b_d <- val_b_d_raw$result[map_lgl(val_b_d_raw$error, is_null)] %>% 
  bind_rows()

save(val_b_d, val_b_d_raw, file = "computations/validation_b_d.RData")

ChromHMM:

val_b_c_raw <- refseq_5kbp %>% 
  mclapply(safeCallBinary,
           peaks = chmm_states,
           metadata = meta,
           filter = FALSE,
           reduce = FALSE,
           gap = 300,
           p = 0.4,
           heatmap = FALSE,
           n_features = TRUE,
           estimate_state = TRUE,
           test_condition = "Brain",
           optimal_clusters = TRUE,
           BPPARAM = SerialParam(),
           mc.cores = n_cores) %>% 
  purrr::transpose()

val_b_c <- val_b_c_raw$result[map_lgl(val_b_c_raw$error, is_null)] %>% 
  bind_rows()

save(val_b_c, val_b_c_raw, file = "computations/validation_b_c.RData")

load("computations/validation_s_k.RData")
load("computations/validation_s_d.RData")
load("computations/validation_s_c.RData")
load("computations/validation_b_h.RData")
val_b_k <- val_b_h # Rename this object for consistency
load("computations/validation_b_d.RData")
load("computations/validation_b_c.RData")

6.3 Evaluate results

First, we will plot the mean RPKM across candidate genes and assess how the distributions differ betwen brain and other tissues.

Define a function to get the candidate switches for each threshold:

joinGtex <- function(val_df) {
  
  val_df %>% 
    dplyr::select(gene_name, Consensus, state, k) %>% 
    inner_join(gtex_fc, by = "gene_name")
}

thresholdScores <- function(threshold, scores) {
    
    scores %>%
        dplyr::filter(Consensus >= threshold) %>%
        mutate(Threshold = threshold)
}

getCandidates <- function(df, method, input) {
  
  thresholds <- c(0.25, 0.5, 0.75)
  
  df <- df %>% filter(state == "ON")
  df_thresholded <- map_df(thresholds, thresholdScores, df) %>% 
    mutate(Threshold = paste0("Consensus >= ", Threshold),
           Method = method,
           Input = input)
}

val_out <- list(val_s_k, val_s_d, val_s_c,
                val_b_k, val_b_d, val_b_c)

eval_methods <- c(rep("Summary", 3), rep("Binary", 3))
eval_input <- rep(c("H3K4me3", "DNase I", "ChromHMM"), 2)

val_gtex <- val_out %>%
  mclapply(joinGtex, mc.cores = 2)

val_candidates <- val_gtex %>% 
  {mcMap(getCandidates, df = ., method = eval_methods, input = eval_input,
         mc.cores = 2)} %>% 
  bind_rows()

val_candidates_long <- val_candidates %>% 
  gather(Tissue, mean_RPKM, 6:26) %>% 
  filter(Threshold == "Consensus >= 0.75")

# Count candidates
counts <- ddply(.data = val_candidates,
                .(Method, Input, Threshold),
                summarize,
                # Define a string with the number of candidates which we can plot
                n = paste0(unique(Input), "\n(n=", length(gene_name), ")")) %>% 
    filter(Threshold == "Consensus >= 0.75")

val_candidates_long_clean <- val_candidates_long %>% 
    mutate(Label = ifelse(Tissue %in% brain, "Brain", "Other")) %>% 
    mutate(Tissue = factor(Tissue, levels = keep_tissue)) %>% 
    makeLabelSimple() %T>%
    write_tsv("tables/validation_boxplot.tsv")

Figure 1c

val_candidates_long_clean %>%
    filter(Tissue %in% keep_tissue) %>%
    filter(Input == "H3K4me3") %>%
    mutate(Method = ifelse(Method == "Summary", "Summary (n=220)", 
                           "Binary (n=125)")) %>% 
    ggplot(aes(x = Tissue, y = mean_RPKM)) +
    geom_boxplot(width = 0.65, outlier.shape = NA, aes(fill = Method),
                 position = position_dodge(width = 0.8)) +
    scale_fill_manual(values = c("Summary (n=220)" = "red",
                                 "Binary (n=125)" = "#2f60af")) +
    facet_wrap(~ Input, ncol = 1) +
    ggmin::theme_min() +
    coord_cartesian(ylim = c(0, 55)) +
    ylab("Mean RPKM") +
    theme(text = element_text(size = 10),
          axis.text.x = element_text(angle = 90, hjust = 1),
      axis.ticks.x = element_blank(),
          legend.position = c(0.9, 0.7))

Figure S10

val_candidates_long_clean %>% 
    ggplot(aes(x = Tissue, y = mean_RPKM)) +
    geom_boxplot(outlier.shape = NA, aes(fill = Label), width = 0.65) +
    scale_fill_manual(values = pal_strategy) +
    ggmin::theme_min() +
    facet_grid(Input ~ Method) +
    coord_cartesian(ylim = c(0, 55)) +
    ylab("Mean RPKM") +
    theme(text = element_text(size = 10),
          axis.text.x = element_text(angle = 90, hjust = 1, size = 7),
          strip.text.y = element_blank(),
          legend.position = "none") +
    geom_text(data = counts, aes(x = 19, y = 52, label = n), size = 3,
                  colour = "black", inherit.aes = FALSE, parse = FALSE)

Second, to quantify validation, we will assess how the difference in expression between brain and other tissues associates with the consensus score output by chromswitch. To do so, we calculate the log2 fold change of the mean RPKM across tissues in each group, across candidate switches at each threshold on the consensus score.

thresholds <- seq(0, 1, by = 0.1)

getFC <- function(threshold, scores) {

    scores %>%
        dplyr::filter(Consensus >= threshold) %>%
        `[[`("log_FC") %>%
        mean()
}

val_fc <- data.frame(thresholds)

val_fc$Summary_H3K4me3  <- data.frame(thresholds) %>% 
   apply(FUN = getFC, MARGIN = 1, val_gtex[[1]])

val_fc$Summary_DNase    <- data.frame(thresholds) %>% 
   apply(FUN = getFC, MARGIN = 1, val_gtex[[2]])

val_fc$Summary_ChromHMM <- data.frame(thresholds) %>% 
   apply(FUN = getFC, MARGIN = 1, val_gtex[[3]])

val_fc$Binary_H3K4me3   <- data.frame(thresholds) %>% 
   apply(FUN = getFC, MARGIN = 1, val_gtex[[4]])

val_fc$Binary_DNase     <- data.frame(thresholds) %>% 
   apply(FUN = getFC, MARGIN = 1, val_gtex[[5]])

val_fc$Binary_ChromHMM  <- data.frame(thresholds) %>% 
   apply(FUN = getFC, MARGIN = 1, val_gtex[[6]])

save(val_fc, file = "computations/validation_fc.RData")

val_fc_tidy <- val_fc %>% 
  gather(strategy, median_logFC, 2:length(.)) %>% 
  separate(strategy, into = c("Method", "Input"), sep = "_") %>% 
  mutate(Input = recode(.x = Input, DNase = "DNase I")) %>% 
  makeLabelSimple() %T>%
  write_tsv("tables/validation_foldchange.tsv")

Figure 1d

val_fc_tidy %>% 
  filter(Input != "DNase I") %>% 
  ggplot(aes(x = thresholds, y = median_logFC)) +
  stat_smooth(aes(colour = Label, linetype = Label), se = FALSE) +
  scale_colour_manual(name = "Strategy", values = pal_strategy) +
  scale_linetype_manual(name = "Strategy", values = linetype_strategy) +
  ggmin::theme_min() +
  ylab("Mean log2 fold change of expression") + xlab("Threshold") +
  theme(legend.position = "bottom") +
  labs(colour = "") + labs(linetype = "") +
  guides(colour = guide_legend(ncol = 1))

## `geom_smooth()` using method = 'loess'

Figure S11

val_fc_tidy %>% 
  ggplot(aes(x = thresholds, y = median_logFC)) +
  stat_smooth(aes(colour = Label, linetype = Label), se = FALSE) +
  scale_colour_manual(name = "Strategy", values = pal_strategy) +
  scale_linetype_manual(name = "Strategy", values = linetype_strategy) +
  ggmin::theme_min() +
  # ggtitle("Median Log FC of expression across candidate genes") +
  ylab("Mean log2 fold change of expression") + xlab("Threshold") +
  theme(legend.position = "bottom") +
  labs(colour = "") + labs(linetype = "") +
  guides(colour = guide_legend(ncol = 1))

## `geom_smooth()` using method = 'loess'

7 Session Info

sessionInfo()

## R version 3.4.1 (2017-06-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)
## 
## Matrix products: default
## BLAS: /mnt/RICHARDS_JBOD1/KLEINMAN_SCRATCH/modules/software/R/3.4.1/lib64/R/lib/libRblas.so
## LAPACK: /mnt/RICHARDS_JBOD1/KLEINMAN_SCRATCH/modules/software/R/3.4.1/lib64/R/lib/libRlapack.so
## 
## locale:
## [1] C
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] bindrcpp_0.2                           
##  [2] chromswitch_0.99.0                     
##  [3] stringi_1.1.6                          
##  [4] forcats_0.2.0                          
##  [5] stringr_1.2.0                          
##  [6] dplyr_0.7.4                            
##  [7] readr_1.1.1                            
##  [8] tidyr_0.7.2                            
##  [9] tibble_1.3.4                           
## [10] tidyverse_1.2.1                        
## [11] purrr_0.2.4                            
## [12] plyr_1.8.4                             
## [13] magrittr_1.5                           
## [14] BiocParallel_1.10.1                    
## [15] xtable_1.8-2                           
## [16] pROC_1.10.0                            
## [17] biomaRt_2.32.1                         
## [18] Homo.sapiens_1.3.1                     
## [19] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
## [20] org.Hs.eg.db_3.4.1                     
## [21] GO.db_3.4.1                            
## [22] OrganismDbi_1.18.1                     
## [23] GenomicFeatures_1.28.5                 
## [24] AnnotationDbi_1.38.2                   
## [25] Biobase_2.36.2                         
## [26] rtracklayer_1.36.6                     
## [27] GenomicRanges_1.28.6                   
## [28] GenomeInfoDb_1.12.3                    
## [29] IRanges_2.10.3                         
## [30] S4Vectors_0.14.4                       
## [31] BiocGenerics_0.22.1                    
## [32] ggmin_0.0.0.9000                       
## [33] ggplot2_2.2.1                          
## [34] RColorBrewer_1.1-2                     
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-131               bitops_1.0-6              
##  [3] matrixStats_0.52.2         lubridate_1.7.1           
##  [5] devtools_1.13.4            bit64_0.9-7               
##  [7] httr_1.3.1                 rprojroot_1.2             
##  [9] tools_3.4.1                backports_1.1.1           
## [11] R6_2.2.2                   DBI_0.7                   
## [13] lazyeval_0.2.1             colorspace_1.3-2          
## [15] withr_2.1.0                mnormt_1.5-5              
## [17] bit_1.1-12                 curl_3.0                  
## [19] compiler_3.4.1             git2r_0.19.0              
## [21] cli_1.0.0                  rvest_0.3.2               
## [23] graph_1.54.0               xml2_1.1.1                
## [25] DelayedArray_0.2.7         labeling_0.3              
## [27] scales_0.5.0               psych_1.7.8               
## [29] RBGL_1.52.0                digest_0.6.12             
## [31] Rsamtools_1.28.0           foreign_0.8-69            
## [33] rmarkdown_1.8              XVector_0.16.0            
## [35] pkgconfig_2.0.1            htmltools_0.3.6           
## [37] readxl_1.0.0               rlang_0.1.4               
## [39] rstudioapi_0.7             RSQLite_2.0               
## [41] BiocInstaller_1.26.1       bindr_0.1                 
## [43] jsonlite_1.5               RCurl_1.95-4.8            
## [45] GenomeInfoDbData_0.99.0    Matrix_1.2-12             
## [47] Rcpp_0.12.13               munsell_0.4.3             
## [49] yaml_2.1.15                SummarizedExperiment_1.6.5
## [51] zlibbioc_1.22.0            grid_3.4.1                
## [53] blob_1.1.0                 crayon_1.3.4              
## [55] lattice_0.20-35            Biostrings_2.44.2         
## [57] haven_1.1.0                hms_0.4.0                 
## [59] knitr_1.17                 codetools_0.2-15          
## [61] reshape2_1.4.2             XML_3.98-1.9              
## [63] glue_1.2.0                 evaluate_0.10.1           
## [65] modelr_0.1.1               cellranger_1.1.0          
## [67] gtable_0.2.0               assertthat_0.2.0          
## [69] broom_0.4.3                GenomicAlignments_1.12.2  
## [71] memoise_1.1.0              cluster_2.0.6

chromswitch: A flexible method to detect chromatin state switches - Analysis

Selin Jessa and Claudia L. Kleinman

27 December, 2017

1 Introduction

2 Preliminaries

3 Download epigenomic data

4 Benchmark `chromswitch` performance

4.1 Prepare benchmarking dataset

4.2 Prepare genome-wide windows for validation

4.3 Run `chromswitch` on benchmark dataset

4.4 Random predictor

4.5 Shuffled labels

4.6 ROC curves

4.7 AUROC

5 Robustness experiments

5.1 Tuning parameters

5.2 Sample size and class imbalance

5.3 Window size

5.4 Analysis of a dispersed mark

5.4.1 Download data

5.4.2 Run `chromswitch` using H3K27me3 peaks

6 Genome-wide validation

6.1 Prepare GTEx data

6.2 Run `chromswitch` genome wide

6.2.1 Summary strategy

6.2.2 Binary strategy

6.3 Evaluate results

7 Session Info

chromswitch: A flexible method to detect chromatin state switches - Analysis

Selin Jessa and Claudia L. Kleinman

27 December, 2017

1 Introduction

2 Preliminaries

3 Download epigenomic data

4 Benchmark chromswitch performance

4.1 Prepare benchmarking dataset

4.2 Prepare genome-wide windows for validation

4.3 Run chromswitch on benchmark dataset

4.4 Random predictor

4.5 Shuffled labels

4.6 ROC curves

4.7 AUROC

5 Robustness experiments

5.1 Tuning parameters

5.2 Sample size and class imbalance

5.3 Window size

5.4 Analysis of a dispersed mark

5.4.1 Download data

5.4.2 Run chromswitch using H3K27me3 peaks

6 Genome-wide validation

6.1 Prepare GTEx data

6.2 Run chromswitch genome wide

6.2.1 Summary strategy

6.2.2 Binary strategy

6.3 Evaluate results

7 Session Info

4 Benchmark `chromswitch` performance

4.3 Run `chromswitch` on benchmark dataset

5.4.2 Run `chromswitch` using H3K27me3 peaks

6.2 Run `chromswitch` genome wide