Processing math: 100%
  • 1 Introduction
  • 2 Preliminaries
  • 3 Download epigenomic data
  • 4 Benchmark chromswitch performance
    • 4.1 Prepare benchmarking dataset
    • 4.2 Prepare genome-wide windows for validation
    • 4.3 Run chromswitch on benchmark dataset
    • 4.4 Random predictor
    • 4.5 Shuffled labels
    • 4.6 ROC curves
    • 4.7 AUROC
  • 5 Robustness experiments
    • 5.1 Tuning parameters
    • 5.2 Sample size and class imbalance
    • 5.3 Window size
    • 5.4 Analysis of a dispersed mark
      • 5.4.1 Download data
      • 5.4.2 Run chromswitch using H3K27me3 peaks
  • 6 Genome-wide validation
    • 6.1 Prepare GTEx data
    • 6.2 Run chromswitch genome wide
      • 6.2.1 Summary strategy
      • 6.2.2 Binary strategy
    • 6.3 Evaluate results
  • 7 Session Info

1 Introduction

This document executes all the analysis presented in chromswitch: A flexible method for detecting chromatin state switches, from downloading the data, to running experiments, to generating the tables and figures included in the paper. The dropdown in the top right corner of this HTML doument controls whether code is shown or hidden. The document can be navigated using the menu on the left.

About the project:

In this paper, we present an R/Bioconductor package, chromswitch, which implements a flexible method to detect chromatin state switches between samples in two biological conditions in a specific genomic region of interest given peaks or chromatin state calls from ChIP-seq data. chromswitch is available from Bioconductor 3.6 at https://bioconductor.org/packages/chromswitch. We analyze its performance on a benchmark dataset, study its robustness to changes in tuning parameters, sample size and imbalance, window size, and varying mark profiles; and perform a genome-wide validation.

How this repository is set up:

  • This HTML file is generated by rendering analysis.Rmd, and executes all the analysis
  • input contains a few individual data and metadata files needed for the analysis
  • data stores data which is downloaded during the analysis
  • figures contains all figures produced in the analysis; the subfolder heatmaps stores the figures output by the tool
  • functions stores a few long functions used (source()’d by) the analysis
  • computations stores the results of computationally-intensive steps of the analysis, which are saved in the main document after being executed once, and then loaded for further analysis
  • tables stores the data behind each figure as TSV files, and tables presented in the supplementary material

How to re-run the analysis:

To run the analysis, this R Markdown file analysis.Rmd needs to be knit. This involves executing every chunk, or piece of code, sequentially. To feasibly handle chunks with computationally-intensive code, we typically evaluate the chunk once, and then set eval = FALSE for that chunk. The results are saved as an .RData file in the computations folder, and then loaded in a following chunk so that the results can be processed and visualized. To re-run the analysis, including these computationally-intensive steps, that option needs to be changed back to eval = TRUE for every chunk.

Since the analysis would take multiple days to run, we recommend running the analysis from the command line, using the run.sh bash script. If HPC resources are available, much of the work can be parallelized by modifying that script and setting the n_cores variable in the next section accordingly.

Dependencies:

We used R 3.3.1 and Bioconductor 3.5. The R packages required are loaded below, see the exact versions in Session Info. All the data needed for the analysis is provided in the repository or downloaded in this document.

2 Preliminaries

Load general packages:

# Visualization
library(RColorBrewer)
library(ggplot2)

# Install and download custom ggplot2 theme
devtools::install_github("sjessa/ggmin")
library(ggmin) 

# Genomics
library(GenomicRanges)
library(rtracklayer)
library(Homo.sapiens)
library(biomaRt)

# General-purpose
library(pROC)
library(xtable)
library(BiocParallel)
library(parallel)
library(magrittr)
library(plyr)
library(purrr)
library(tidyverse)
library(dplyr)
library(stringi)

# Source functions
source("functions/plotting_functions.R")
source("functions/subsampling_functions.R")
source("functions/misc_functions.R")

Install chromswitch from the GitHub repository:

devtools::install_github("sjessa/chromswitch@v0.99.0")
library(chromswitch)

Note on chromswitch version: chromswitch is now available from Bioconductor 3.6, but for reproducibility, the above command will download v0.99.0 of chromswitch, with which the bulk of the analysis was done: https://github.com/sjessa/chromswitch/releases/tag/v0.99.0.

Set-up for parallel computations:

# For other parallel computations
n_cores <- 1

chromswitch is implemented in a relatively modular way, with a function for each step of the algorithm. It also provides two wrapper functions, one for each feature construction strategy, which runs the whole analysis. Although the wrapper functions in chromswitch can analyze multiple query regions in parallel, we will parallelize outside the function in order to catch any errors that might arise.

safeCallSummary <- purrr::safely(chromswitch::callSummary)
safeCallBinary  <- purrr::safely(chromswitch::callBinary)

3 Download epigenomic data

The data we analyze comes from the Roadmap Epigenomics Program. We evaluate three types of input:

  1. H3K4me3 peaks
  2. DNase I peaks/DNase I hypersensitive sites
  3. ChromHMM segmentation learned on the core histone marks in the Roadmap analysis, filtered to regions assigned the state “Active TSS”.

We will work with 23 adult tissue samples from Roadmap. 7 are brain tissues, and 16 are various other tissues.

# Read in the samples
meta <- read_tsv("input/sample_metadata.tsv")
meta
ABCDEFGHIJ0123456789
Sample
<chr>
Tissue
<chr>
Condition
<chr>
E065AortaOther
E066Adult_LiverOther
E067Brain_Angular_GyrusBrain
E068Brain_Anterior_CaudateBrain
E069Brain_Cingulate_GyrusBrain
E071Brain_Hippocampus_MiddleBrain
E072Brain_Inferior_Temporal_LobeBrain
E073Brain_Mid_Frontal_LobeBrain
E074Brain_Substantia_NigraBrain
E075Colonic_MucosaOther
table(meta$Condition)
## 
## Brain Other 
##     7    16
# Download the file if it doesn't exist
noClobberDownload <- function(url, destpath) {
    if(!(file.exists(destpath))) download.file(url, destpath)
}

We retrieve the data from this link: http://egg2.wustl.edu/roadmap/data/byFileType/.

urls_k4 <- paste0("http://egg2.wustl.edu/roadmap/data/byFileType/peaks/consolidated/narrowPeak/",
                      meta$Sample, "-H3K4me3.narrowPeak.gz")
basenames_k4 <- lapply(urls_k4, basename) %>% unlist()
destpaths_k4 <- paste0("data/H3K4me3_peaks/", basenames_k4)

dl_k4 <- mcMap(noClobberDownload, urls_k4, destpaths_k4, mc.cores = n_cores)

extra_cols <- c("name" = "character", "score" = "integer", "strand" = "character",
                "signalValue" = "numeric", "pValue" = "numeric", "qValue" = "numeric",
                "peak" = "numeric")

pk_k4me3 <- lapply(destpaths_k4, rtracklayer::import, format = "bed",
                   extraCols = extra_cols)
names(pk_k4me3) <- meta$Sample

save(pk_k4me3, file = "computations/pk_k4me3.RData")
urls_dnase <- paste0("http://egg2.wustl.edu/roadmap/data/byFileType/peaks/consolidatedImputed/narrowPeak/",
               meta$Sample,
               "-DNase.imputed.narrowPeak.bed.nPk.gz")
basenames_dnase <- lapply(urls_dnase, basename) %>% unlist()
destpaths_dnase <- paste0("data/DNase_peaks/", basenames_dnase)

dl_dnase <- mcMap(noClobberDownload, urls_dnase, destpaths_dnase,
                  mc.cores = n_cores)

pk_dnase <- lapply(destpaths_dnase, rtracklayer::import, format = "bed")
names(pk_dnase) <- meta$Sample

save(pk_dnase, file = "computations/pk_dnase.RData")

To convert the chromatin state segmentations into a format that can be used by chromswitch, we will filter to regions assigned “Active_TSS”, the state associated with active transcription.

urls_st <- paste0("http://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/coreMarks/jointModel/final/",
                      meta$Sample, "_15_coreMarks_mnemonics.bed.gz")
basenames_st <- lapply(urls_st, basename) %>% unlist()
destpaths_st <- paste0("data/ChromHMM_segmentations/", basenames_st)

dl_st <- mcMap(noClobberDownload, urls_st, destpaths_st, mc.cores = 6)

extra_cols <- c("State" = "character")

chmm_states <- lapply(destpaths_st, rtracklayer::import, format = "bed",
                   extraCols = extra_cols) %>% 
    lapply(as.data.frame) %>% 
    # We will treat regions assigned the state associated with active
    # transcription as the epigenomic features in our analysis
    lapply(filter, State == "1_TssA") %>% 
    lapply(makeGRangesFromDataFrame)

names(chmm_states) <- meta$Sample

save(chmm_states, file = "computations/chmm_states.RData")
load("computations/pk_k4me3.RData")
load("computations/pk_dnase.RData")
load("computations/chmm_states.RData")

4 Benchmark chromswitch performance

4.1 Prepare benchmarking dataset

We manually assembled a list of 120 genes on which to benchmark chromswitch performance. 60 of these genes exhibit chromatin state switches between brain and other tissues based on change in H3K4me3 signal (whose enrichment at promoters is associated with active transcription), DNase I hypersensitivity (associated with open chromatin), and expression (the functional consequence of chromatin changes). The other 60 are negative examples, where there was no change across the tissues in H3K4me3, DNase I, or expression data. These genes are specified in the file input/benchmark.tsv.

To define windows for these genes to use as input for our experiment, we retrieve a list of genes and coordinates in the human genome from UCSC. This file is saved in the data directory, and we obtained it using the UCSC Table Browser, with the following selections:

  • group: Genes and Gene Predictions
  • track: RefSeq
  • Fields: name, chrom, strand, txStart, txEnd, and name2
all_genes <- read_tsv("data/refseq_genes.hg19.tsv",
                     col_names = c("transcript_id", "chromosome", "strand",
                                   "txStart", "txEnd", "gene_name"), skip = 1)

head(all_genes)
ABCDEFGHIJ0123456789
transcript_id
<chr>
chromosome
<chr>
strand
<chr>
txStart
<int>
txEnd
<int>
gene_name
<chr>
NM_032291chr1+6699963867216822SGIP1
NM_001308203chr1+6699925167216822SGIP1
NM_001350218chr1+6699963867216822SGIP1
NM_001350217chr1+6699963867216822SGIP1
NM_013943chr1+2507175925170815CLIC4
NM_018090chr1+1676716616786584NECAP2

Define 5kbp windows around the TSS:

benchmark <- read_tsv("input/benchmark.tsv") %>% 
    left_join(all_genes, by = c("transcript_id", "gene_name")) %>% 
    # Define window around the TSS
  mutate(start = case_when(
    strand == "+" ~ as.numeric(txStart) - 2500,
    strand == "-" ~ as.numeric(txEnd) - 2500)) %>%
  mutate(end = case_when(
    strand == "+" ~ as.numeric(txStart) + 2500,
    strand == "-" ~ as.numeric(txEnd) + 2500)) %T>% 
    write_tsv("input/benchmark_coords.5kbp.tsv") %>% 
    makeGRangesFromDataFrame(keep.extra.columns = TRUE) %>% 
    keepStandardChromosomes()

Table S1

benchmark %>%
  as.data.frame %>%
  dplyr::select(gene_name, transcript_id, chr = seqnames, start, end, switch) %>% 
  arrange(desc(switch), gene_name) %T>% 
  write_tsv("tables/tableS2.tsv") %>% 
  xtable()
ABCDEFGHIJ0123456789
gene_name
<chr>
transcript_id
<chr>
chr
<fctr>
start
<int>
end
<int>
switch
<int>
ADCYAP1R1NM_001199636chr731089575310945751
ADGRA1NM_001083909chr101348990071349040071
AJAP1NM_018836chr1471260447176041
AK5NM_012093chr177745786777507861
ATCAYNM_033064chr19387811738831171
C2orf80NM_001099334chr22090522732090572731
CACNG3NM_006539chr1624264373242693731
CDH10NM_006727chr524642587246475871
CDH18NM_004934chr519985853199908531
CHRM1NM_000738chr1162686512626915121

4.2 Prepare genome-wide windows for validation

In order to later validate chromswitch genome-wide, we also clean up the gene list to define 5kbp windows genome-wide.

refseq_unique <- all_genes %>% 
      # Keep the longest transcript per gene
      mutate(length = txEnd - txStart) %>%
      group_by(gene_name) %>%
      dplyr::slice(which.max(length)) %>% 
      ungroup() %>% 
      # Filter out genes where the length of the gene is less than half the
      # window size
      filter(length >= 2500)
getRefSeq <- function(window) {
    
  d <- window / 2
    
  refseq_unique %>%
      # Define window around the TSS
      mutate(start = case_when(
          strand == "+" ~ as.numeric(txStart) - d,
          strand == "-" ~ as.numeric(txEnd) - d)) %>%
      mutate(end = case_when(
          strand == "+" ~ as.numeric(txStart) + d,
          strand == "-" ~ as.numeric(txEnd) + d)) %T>% 
      write_tsv(paste0("input/benchmark_coords.", window, ".tsv")) %>% 
      makeGRangesFromDataFrame(keep.extra.columns = TRUE) %>% 
      keepStandardChromosomes(pruning.mode = "coarse")
}

refseq_5kbp <- getRefSeq(5000)

4.3 Run chromswitch on benchmark dataset

Here we evaluate each feature matrix construction strategy on each type of input.

Summary strategy applied to H3K4me3

bench_s_k_raw <- mclapply(benchmark, safeCallSummary,
                       peaks = pk_k4me3,
                       metadata = meta,
                       mark = "H3K4me3",
                       filter = TRUE,
                       filter_columns = c("qValue", "signalValue"),
                       filter_thresholds = c(10, 4),
                       normalize = TRUE,
                       normalize_columns = c("qValue", "signalValue"),
                       summarize_columns = c("qValue", "signalValue"),
                       length = FALSE,
                       fraction = TRUE,
                       n = FALSE,
                       optimal_clusters = TRUE,
                       heatmap = FALSE,
                       BPPARAM = SerialParam(), mc.cores = n_cores) %>% 
  purrr::transpose()

bench_s_k <- bench_s_k_raw$result[map_lgl(bench_s_k_raw$error, is_null)] %>%
    bind_rows()

save(bench_s_k, file = "computations/bench_s_k.RData")

Summary strategy applied to DNase I

bench_s_d_raw <- mclapply(benchmark, safeCallSummary,
                               peaks = pk_dnase,
                               metadata = meta,
                               mark = "DNase",
                               filter = FALSE,
                               normalize = FALSE,
                               summarize_columns = NULL,
                               length = FALSE,
                               fraction = TRUE,
                               n = FALSE,
                               optimal_clusters = TRUE,
                               heatmap = FALSE,
                               BPPARAM = SerialParam(),
                               mc.cores = n_cores) %>% 
    purrr::transpose()

bench_s_d <- bench_s_d_raw$result[map_lgl(bench_s_d_raw$error, is_null)] %>%
    bind_rows()

save(bench_s_d, file = "computations/bench_s_d.RData")

Binary strategy applied to H3K4me3

bench_b_k_raw <- mclapply(benchmark,
                          safeCallBinary,
                          peaks = pk_k4me3,
                          metadata = meta,
                          filter = TRUE,
                          filter_columns = c("qValue","signalValue"),
                          filter_thresholds = c(10, 4),
                          reduce = TRUE,
                          gap = 300,
                          p = 0.4,
                          optimal_clusters = TRUE,
                          n_features = TRUE, mc.cores = n_cores) %>% 
  purrr::transpose()

bench_b_k <- bench_b_k_raw$result[map_lgl(bench_b_k_raw$error,
                                                is_null)] %>%
  bind_rows()

save(bench_b_k, file = "computations/bench_b_k.RData")

Binary strategy applied to DNase I

bench_b_d_raw <- mclapply(benchmark,
                          safeCallBinary,
                          peaks = pk_dnase,
                          metadata = meta,
                          filter = FALSE,
                          reduce = TRUE,
                          gap = 300,
                          p = 0.4,
                          optimal_clusters = TRUE,
                          n_features = TRUE, mc.cores = n_cores) %>% 
  purrr::transpose()

bench_b_d <- bench_b_d_raw$result[map_lgl(bench_b_d_raw$error, is_null)] %>%
  bind_rows()

save(bench_b_d, file = "computations/bench_b_d.RData")

Summary strategy applied to ChromHMM

bench_s_c_raw <- mclapply(benchmark,
                          safeCallSummary,
                          peaks = chmm_states,
                          metadata = meta,
                          mark = "State",
                          filter = FALSE,
                          normalize = FALSE,
                          summarize_columns = NULL,
                          length = FALSE,
                          fraction = TRUE,
                          n = FALSE,
                          heatmap = FALSE,
                          optimal_clusters = TRUE,
                          BPPARAM = SerialParam(),
                          mc.cores = n_cores) %>% 
  purrr::transpose()

bench_s_c <- bench_s_c_raw$result[map_lgl(bench_s_c_raw$error, is_null)] %>%
    bind_rows()

save(bench_s_c, file = "computations/bench_s_c.RData")

Binary strategy applied to ChromHMM

bench_b_c_raw <- mclapply(benchmark,
                          safeCallBinary,
                          peaks = chmm_states,
                          metadata = meta,
                          filter = FALSE,
                          reduce = FALSE,
                          p = 0.4,
                          optimal_clusters = TRUE,
                          n_features = TRUE,
                          mc.cores = n_cores) %>% 
  purrr::transpose()

bench_b_c <- bench_b_c_raw$result[map_lgl(bench_b_c_raw$error,
                                          is_null)] %>%
    bind_rows()

save(bench_b_c, file = "computations/bench_b_c.RData")
load("computations/bench_s_k.RData")
load("computations/bench_s_d.RData")
load("computations/bench_s_c.RData")
load("computations/bench_b_k.RData")
load("computations/bench_b_d.RData")
load("computations/bench_b_c.RData")

4.4 Random predictor

We compare with a random predictor by drawing from from a Uniform(0,1) distribution, repeating this 10,000 times, and then averaging the results to assign a score. This is done for each region. It should track with the diagonal and have an AUROC of around 0.5

#' @param k, the number of query regions to make a prediction for
#' @param N, the number of times to draw
predictRandom <- function(k, N) {
    
    rep(k, N) %>%
        mclapply(runif, mc.cores = 2) %>% 
        data.frame() %>% 
        rowMeans()
}

# Repeat twice, to get one set of random predictions per method
draws <- replicate(2, predictRandom(120, 10000), simplify = FALSE)
save(draws, file = "computations/draws.RData")
load("computations/draws.RData")

bench_s_k <- bench_s_k %>%
    add_column(Random = draws[[1]], .after = "Consensus")

bench_b_d <- bench_b_d %>%
    add_column(Random = draws[[2]], .after = "Consensus")

4.5 Shuffled labels

Our method predicts chromatin state switches by computing a score of the similarity between the known biological condition labels and inferred clusters. We can assess how these scores look when we shuffle the class labels. We cluster samples based on the feature matrix as usual, but before calculating any of the external validity indices, we randomly shuffle the labels.

predictShuffle <- function(clusters, N) {
  
  shuffle <- function(labels, clusters) {
    
    shuf <- sample(labels)
    contingency <- table(shuf, clusters)
    
    data.frame(
      ARI_shuffle          = mclust::adjustedRandIndex(clusters, shuf),
      NMI_shuffle          = chromswitch:::NMI(clusters, shuf),
      V_measure_shuffle    = chromswitch:::vMeasure(contingency)
    )
  }
  
  # Repeat N times
  replicate(N, shuffle(labels = meta$Condition, clust = unlist(clusters)),
            simplify = FALSE) %>%
    bind_rows() %>%
    colMeans()
}
shuffle_s_k <- bench_s_k %>%
    dplyr::select(matches("^E[0-9]{3}")) %>%
    t() %>%
    as.data.frame %>%
    mclapply(predictShuffle, N = 10000, mc.cores = n_cores) %>% 
    as.data.frame() %>% 
    t() %>% 
    as.data.frame()

shuffle_s_d <- bench_s_d %>%
    dplyr::select(matches("^E[0-9]{3}")) %>%
    t() %>%
    as.data.frame %>%
    mclapply(predictShuffle, N = 10000, mc.cores = n_cores) %>% 
    as.data.frame() %>% 
    t() %>% 
    as.data.frame()

shuffle_s_c <- bench_s_c %>%
    dplyr::select(matches("^E[0-9]{3}")) %>%
    t() %>%
    as.data.frame %>%
    mclapply(predictShuffle, N = 10000, mc.cores = n_cores) %>% 
    as.data.frame() %>% 
    t() %>% 
    as.data.frame()

save(shuffle_s_k, shuffle_s_d, shuffle_s_c,
     file = "computations/shuffle_s.RData")
shuffle_b_k <- bench_b_k %>%
    dplyr::select(matches("^E[0-9]{3}")) %>%
    t() %>%
    as.data.frame %>%
    mclapply(predictShuffle, N = 10000, mc.cores = n_cores) %>% 
    as.data.frame() %>% 
    t() %>% 
    as.data.frame()

shuffle_b_d <- bench_b_d %>%
    dplyr::select(matches("^E[0-9]{3}")) %>%
    t() %>%
    as.data.frame %>%
    mclapply(predictShuffle, N = 10000, mc.cores = n_cores) %>% 
    as.data.frame() %>% 
    t() %>% 
    as.data.frame()

shuffle_b_c <- bench_b_c %>%
    dplyr::select(matches("^E[0-9]{3}")) %>%
    t() %>%
    as.data.frame %>%
    mclapply(predictShuffle, N = 10000, mc.cores = n_cores) %>% 
    as.data.frame() %>% 
    t() %>% 
    as.data.frame()

save(shuffle_b_c, shuffle_b_d, shuffle_b_k,
     file = "computations/shuffle_b.RData")
load("computations/shuffle_s.RData")
load("computations/shuffle_b.RData")

addShuf <- function(df) {
  df %>% 
    mutate(Consensus_shuffle =
               rowMeans(dplyr::select(., ARI_shuffle, NMI_shuffle, V_measure_shuffle)))
}

bench_s_k <- bind_cols(bench_s_k, shuffle_s_k) %>% addShuf()
bench_s_d <- bind_cols(bench_s_d, shuffle_s_d) %>% addShuf()
bench_s_c <- bind_cols(bench_s_c, shuffle_s_c) %>% addShuf()

bench_b_k <- bind_cols(bench_b_k, shuffle_b_k) %>% addShuf()
bench_b_d <- bind_cols(bench_b_d, shuffle_b_d) %>% addShuf()
bench_b_c <- bind_cols(bench_b_c, shuffle_b_c) %>% addShuf()

4.6 ROC curves

The problem of a detecting a chromatin state switch in a region can be reframed as the problem of classifying the region as containing a switch between conditions or not. Therefore, our method can be regarded as a binary classifier, and evaluated accordingly. We evaluate the performance of our method on the benchmark set by computing Receiver-Operating Characteristic (ROC) curves based on the consensus score computed by chromswitch. ROC curves measure the tradeoff between the true positive rate and the false positive rate as the threshold on the consensus score is varied, as well as the area under the ROC curve (AUROC) which takes on a value of 1 for a perfect classifier and can be interpreted as the probability that chromswitch will score a randomly chosen region corresponding to a chromatin state switch higher than a randomly chosen region which does not exhibit a switch.

calcROC <- function(scores, method, input, smooth) {
    
    # Compute ROC info using each score as a predictor
    roc_out <- scores %>% 
        dplyr::select(
        matches("Consensus|Random")) %>% 
        map(~ pROC::roc(response = scores$switch,
                       predictor = as.numeric(.),
                       smooth = smooth))
    
    # Extract values to plot ROC curves
    roc_val <- roc_out %>%
        map(`[`, c("sensitivities", "specificities")) %>%
        map(as.data.frame) %>%
        map2_df(names(roc_out), function(df, index) mutate(df, index = index)) %>%
        mutate(TPR = sensitivities,
               FPR = 1 - specificities,
               Method = method,
               Input = input)

    return(roc_val)
}
benchmark_roc <- bind_rows(
  calcROC(bench_s_k, "Summary", "H3K4me3",  smooth = TRUE),
  calcROC(bench_s_d, "Summary", "DNase I",  smooth = TRUE),
  calcROC(bench_s_c, "Summary", "ChromHMM", smooth = TRUE),
  calcROC(bench_b_k, "Binary",  "H3K4me3",  smooth = TRUE),
  calcROC(bench_b_d, "Binary",  "DNase I",  smooth = TRUE),
  calcROC(bench_b_c, "Binary",  "ChromHMM", smooth = TRUE)) %>% 
  filter(grepl("Consensus|Random", index)) %>%
  filter(index != "Random" |
          (index == "Random") & (Method == "Binary") & (Input == "DNase I")) %>% 
  mutate(Input = ifelse(index == "Random", "NA", Input)) %T>% 
  write_tsv("tables/benchmark_roc.tsv")

roc_gg <- benchmark_roc %>% 
  makeLabel()

Figure 1b

roc_gg %>%
    filter(Input != "DNase I") %>%
    filter(index != "Consensus_shuffle") %>%
    ggplot(aes(x = FPR, y = TPR, colour = Label, linetype = Label)) +
    geom_line(size = 1) +
    scale_colour_manual(name = "Strategy", values = pal_strategy) +
    scale_linetype_manual(name = "Strategy", values = linetype_strategy) +
    theme_min() +
    xlab("False Positive Rate") + ylab("True Positive Rate") +
    theme(legend.position = "bottom") +
    labs(colour = "") + labs(linetype = "") +
    guides(colour = guide_legend(ncol = 1))

Figure S2

roc_gg %>% 
    filter(index != "Random") %>% 
    ggplot(aes(x = FPR, y = TPR, colour = Label, linetype = Label)) +
    geom_line(size = 1) +
    scale_colour_manual(name = "Strategy", values = pal_strategy) +
    scale_linetype_manual(name = "Strategy", values = linetype_strategy2) +
    theme_min() +
    # theme(legend.position = "bottom") +
    xlab("False Positive Rate") + ylab("True Positive Rate") +
    theme(legend.position = "bottom") +
    facet_wrap(~ Input) +
    labs(colour = "") + labs(linetype = "") +
    guides(colour = guide_legend(ncol = 1))

4.7 AUROC

calcAUC <- function(scores, method, input) {

    scores %>%
        dplyr::select(
        matches("Consensus|Random")) %>%
        map(~pROC::roc(response = scores$switch, predictor = as.numeric(.))) %>%
        map_df("auc") %>%
        mutate(Method = method,
               Input = input)
}

benchmark_auc <- bind_rows(
  calcAUC(bench_s_k, "Summary", "H3K4me3"),
  calcAUC(bench_s_d, "Summary", "DNase I"),
  calcAUC(bench_s_c, "Summary", "ChromHMM"),
  calcAUC(bench_b_k, "Binary", "H3K4me3"),
  calcAUC(bench_b_d, "Binary", "DNase I"),
  calcAUC(bench_b_c, "Binary", "ChromHMM")) %>% 
  select(Method, Input, Consensus, Consensus_shuffle, Random)

benchmark_auc
ABCDEFGHIJ0123456789
Method
<chr>
Input
<chr>
Consensus
<dbl>
Consensus_shuffle
<dbl>
Random
<dbl>
SummaryH3K4me30.98333330.65305560.5844444
SummaryDNase I0.94666670.5350000NA
SummaryChromHMM0.93736110.7661111NA
BinaryH3K4me30.99694440.5513889NA
BinaryDNase I0.99027780.75277780.4975000
BinaryChromHMM0.97319440.6450000NA

Figure S2

# A little manipulation to make the table for the supplementary

benchmark_auc %>%
  gather(Stat, Score, Consensus, Consensus_shuffle) %>%
  rowwise() %>%
  mutate(Method = ifelse(grepl("shuffle", Stat),
                         paste0(Method, " (shuffled)"), Method)) %>%
  ungroup() %>% 
  dplyr::select(Method, Input, Score) %>%
  spread(Input, Score) %T>%
  write_tsv("tables/benchmark_auc.tsv") %>% 
  xtable() # Generate latex code
ABCDEFGHIJ0123456789
 
 
Method
<chr>
ChromHMM
<dbl>
DNase I
<dbl>
H3K4me3
<dbl>
1Binary0.97319440.99027780.9969444
2Binary (shuffled)0.64500000.75277780.5513889
3Summary0.93736110.94666670.9833333
4Summary (shuffled)0.76611110.53500000.6530556

5 Robustness experiments

5.1 Tuning parameters

There are two parameters used in the feature construction process for the binary approach, gap which controls whether nearby peaks within samples are connected, and p which controls how much overlap is required to call two peaks the same. To assess the impact of these parameters, we’ll perform a grid search on various combinations.

Since we want to investigate the effect of varying the tuning parameters, we select the best threshold on the Consensus score on this benchmarking dataset:

getBest <- function(bench_out) {
    
    best_consensus <- pROC::roc(predictor = bench_out$Consensus,
                            response = bench_out$switch) %>%
    pROC::coords("best",
                 ret = c("threshold", "sensitivity",
                         "specificity", "accuracy", "ppv",
                 as.list = TRUE, drop = TRUE))
    
    best_threshold <- best_consensus["threshold"]
    return(best_threshold)
}

best_k <- getBest(bench_b_k)
best_k
## threshold 
## 0.2156931
grid <- expand.grid(gap = seq(200, 500, by = 50),
                    p_overlap = seq(0.2, 0.8, by = 0.1))

evaluateParams <- function(gap, p, best_threshold, ...) {
    
    out <- benchmark %>% 
        mclapply(safeCallBinary, gap = gap, p = p, ...) %>% 
        purrr::transpose()
    
    ok <- map_lgl(out$error, is_null)
    out <- out$result[ok] %>%
        bind_rows() %>% 
        mutate(pred = ifelse(Consensus >= best_threshold, 1, 0))
    
    # Now we can tally the results by comparing the predictions
    # to the known classes (switch or not)
    stats <- table(benchmark$switch[ok], out$pred)
    
    tn = fn = tp = fp = 0
    try(tn <- stats["0", "0"], silent = TRUE)
    try(fn <- stats["1", "0"], silent = TRUE)
    try(tp <- stats["1", "1"], silent = TRUE)
    try(fp <- stats["0", "1"], silent = TRUE)
    
    df <- data.frame(gap, p, tp, tn, fp, fn)
    return(list(predictions = out$pred, stats = df))
} 

safeEvalParams <- purrr::safely(evaluateParams)
grid_out <- Map(function(gap, p) safeEvalParams(gap, p,
                                                   best_threshold = best_k,
                                                   peaks = pk_k4me3,
                                                   metadata = meta,
                                                   filter = TRUE,
                                                   filter_columns = c("qValue", "signalValue"),
                                                   filter_thresholds = c(10, 4),
                                                   reduce = TRUE,
                                                   heatmap = FALSE,
                                                   BPPARAM = SerialParam(),
                                                   optimal_clusters = TRUE,
                                                   mc.cores = n_cores),
                   grid$gap, grid$p_overlap) %>%
    purrr::transpose()

save(grid_out, file = "computations/grid.RData")

Figure S4

load("computations/grid.RData")
makeGrid(grid_out, "H3K4me3")

load("tables/grid_scores.H3K4me3.RData")
plotGrid(grid_scores)

5.2 Sample size and class imbalance

Next, we investigate how sensitive or robust chromswitch is to small numbers of samples and/or to datasets with high class imbalance (i.e. when samples in one condition greatly outnumber samples in the other). To do so, we will subsample from the samples in the full 23-tissue dataset to create many smaller datasets with fewer samples and greater class imbalance, and evaluate how chromswitch performs in classifying the benchmark regions using these smaller datasets.

For the summary feature matrix construction strategy, the computation of the feature vector for each sample is independent of all other samples, so we can save the feature matrices and subsample rows from the matrices. Here we compute and save the feature matrix for each region:

getSummaryFeatures <- function(query) {
  
  cols <- c("qValue", "signalValue")
  
  ft_mat <- pk_k4me3 %>% 
    filterPeaks(columns = cols,
                thresholds = c(10, 4)) %>% 
    normalizePeaks(columns = cols) %>% 
    retrievePeaks(metadata = meta, region = query) %>% 
    summarizePeaks(mark = "H3K4me3", cols = cols,
                   length = FALSE, fraction = TRUE, n = FALSE)
  
  return(ft_mat)
  
}

safeGetFeatures <- purrr::safely(getSummaryFeatures)

ft_s_k_raw <- benchmark %>% 
  mclapply(safeGetFeatures, mc.cores = n_cores) %>% 
  purrr::transpose()

ft_s_k <- ft_s_k_raw$result
save(ft_s_k, ft_s_k_raw,
     file = "computations/benchmark_features_s_k.RData")

For the binary strategy, the feature matrices are derived from the union of peaks supplied across samples, and are therefore dependent on the exact set of samples, so we will only save the local peaks (the peaks in the query region) for each sample.

getBinaryPeaks <- function(query) {
  
  pks <- pk_k4me3 %>% 
    retrievePeaks(metadata = meta, region = query) %>% 
    reducePeaks(gap = 300)
  
  return(pks)
  
}

safeGetBinaryPeaks <- purrr::safely(getBinaryPeaks)

pk_b_k_raw <- benchmark %>% 
  mclapply(safeGetBinaryPeaks, mc.cores = n_cores) %>% 
  purrr::transpose()
  
pk_b_k  <- pk_b_k_raw$result

# Extract the peaks from the localPeaks object
lpk_b_k <- lapply(pk_b_k, chromswitch:::lpkPeaks)

save(pk_b_k_raw, pk_b_k, lpk_b_k,
     file = "computations/benchmark_peaks_b_k.RData")
load("computations/benchmark_features_s_k.RData")
load("computations/benchmark_peaks_b_k.RData")

These are the different datasets we’ll construct, each with a different number and proportion of brain & other samples.

combos <- data.frame(
    n_other = c(2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 16, 16, 16, 10, 12, 15, 16, 16),
    n_brain = c(2, 3, 4, 5, 6, 7, 7,  7,  7,  7,  7,  2,  3,  5,  4,  3,  4,  5)
)

combos
ABCDEFGHIJ0123456789
n_other
<dbl>
n_brain
<dbl>
22
33
44
55
66
77
87
107
127
147
lvls_combos <- c(unlist(Map(function(x, y)
    paste0(x ," brain vs. ", y, " other"),
    combos$n_brain, combos$n_other)))

idx_other <- which(meta$Condition == "Other")
idx_brain <- which(meta$Condition == "Brain")

Since in some cases it’s possible that there are few enough combinations that we can exhaustively try all of them, we will do so when there are fewer than 100 combinations. Otherwise, we will evaluate 100 randomly subsample ddatasets to for each combination.

How many possible datasets exist for each pair of n_brain and n_other?

n_pos <- function(n_other, n_brain, idx_other) choose(length(idx_other), n_other) * choose(length(idx_brain), n_brain)

combos$n_pos <- unlist(Map(function(n_other, n_brain) n_pos(n_other, n_brain, idx_other),
                            combos$n_other, combos$n_brain))
combos
ABCDEFGHIJ0123456789
n_other
<dbl>
n_brain
<dbl>
n_pos
<dbl>
222520
3319600
4463700
5591728
6656056
7711440
8712870
1078008
1271820
147120

For the code below, the functions that carry out the subsampling experiment are sourced from functions/subsampling_functions.R.

safeSubsampleFt <- purrr::safely(subsampleFromFeatures)
safeSubsamplePk <- purrr::safely(subsampleFromPeaks)
subsamp_s_raw <- mcMap(function(n_other, n_brain)
    safeSubsampleFt(ft_s_k, n_other, n_brain, idx_other),
    combos$n_other, combos$n_brain,
    mc.cores = n_cores) %>%
    purrr::transpose()

subsamp_s <- subsamp_s_raw$result
save(subsamp_s, subsamp_s_raw, file = "computations/subsamp_s.RData")
subsamp_b_raw <- mcMap(function(n_other, n_brain)
  safeSubsamplePk(pk_b_k, n_other, n_brain, idx_other),
    combos$n_other, combos$n_brain,
    mc.cores = n_cores) %>%
    purrr::transpose()

subsamp_b <- subsamp_b_raw$result
save(subsamp_b, subsamp_b_raw, file = "computations/subsamp_b.RData")

Now we can process the results, and calculate mean AUC scores for each combination so we can quantify performance.

load("computations/subsamp_s.RData")
load("computations/subsamp_b.RData")

auc_s <- subsamp_s %>% map_df("auc") %>% 
  mutate(Method = "Summary", Input = "H3K4me3")

auc_b <- subsamp_b %>% map("scores") %>% 
  map_df("auc") %>% 
  mutate(Method = "Binary", Input = "H3K4me3")

# Simplify the facet labels
lvls_combos_new <- c(unlist(Map(function(x, y)
    paste0(x ," vs. ", y),
    combos$n_brain, combos$n_other)))
lvls_combos_new[1] <- "2 brain vs. 2 other"

# Calculate the mean AUC for each combination of n_brain, n_other
mean_auc <- bind_rows(auc_s, auc_b) %>% 
    rowwise() %>% 
    mutate(dataset = ifelse(n_brain == 2 && n_other == 2,
                            "2 brain vs. 2 other",
                            paste0(n_brain, " vs. ", n_other))) %>%
    ungroup() %>% 
    mutate(dataset = factor(dataset, levels = lvls_combos_new)) %>%
    dplyr::select(dataset, iteration, AUC = Consensus, Method, Input) %>%
    group_by(dataset, Method, Input) %>%
    summarise(mean_AUC = paste0("AUC: ", round(mean(AUC), 2))) %T>%
    write_tsv("tables/subsampling_auc.tsv")

roc_s <- subsamp_s %>% map_df("roc") %>%
  mutate(Method = "Summary", Input = "H3K4me3")

roc_b <- subsamp_b %>% map("scores") %>% 
  map_df("roc") %>% 
  mutate(Method = "Binary", Input = "H3K4me3")

roc_df <- bind_rows(roc_s, roc_b) %>% 
  rowwise() %>% 
  mutate(dataset = ifelse(n_brain == 2 && n_other == 2,
                          "2 brain vs. 2 other",
                          paste0(n_brain, " vs. ", n_other))) %>%
  ungroup() %>%
  mutate(dataset = factor(dataset, levels = lvls_combos_new)) %>%
  filter(index == "Consensus") %>% 
  dplyr::select(-index)

# Ensures that the iterations are plot in an order that looks nice
# with the colours of the lines
roc_df$iteration <- factor(roc_df$iteration,
                           levels = sort(unique(roc_df$iteration),
                                         decreasing = TRUE))

In the figures below, the ROC curve for every individual subsample dataset is plot in a shade of gray, while the mean across iterations, for each combination, is plot in red (summary strategy) or blue (binary strategy).

Figure S3a

roc_df %T>%
    write_tsv("tables/subsampling_roc.tsv") %>% 
    filter(Method == "Summary" & Input == "H3K4me3") %>%
    plotSubsamp("red", "Summary", "H3K4me3")

Figure S3b

roc_df %>%
    filter(Method == "Binary" & Input == "H3K4me3") %>%
    plotSubsamp("blue", "Binary", "H3K4me3")

5.3 Window size

Since the chromswitch algorithm depends on the data in the genomic query window, which is selected by the user, we next considered the effect of variations in the window size. We studied how the consensus score changed as a function of the window size for three genes in our benchmark dataset, using H3K4me3 peaks.

window_genes <- c("GRIK2", "CHRM1", "TTYH1")

windows <- refseq_5kbp %>%
    as.data.frame %>%
    filter(gene_name %in% window_genes) %>%
    select(-start, -end)

get_window <- function(window_size, genes) {

    d <- window_size / 2

    genes %>%
        # Define window around the TSS
        mutate(start = case_when(
            strand == "+" ~ as.numeric(txStart) - d,
            strand == "-" ~ as.numeric(txEnd) - d)) %>%
        mutate(end = case_when(
            strand == "+" ~ as.numeric(txStart) + d,
            strand == "-" ~ as.numeric(txEnd) + d)) %>%
         mutate(gene_name = paste0(gene_name, "_", window_size/1000, "kbp"))
}

# Window sizes to test, in base pairs
sizes <- c(seq(1000, 10000, by = 1000),
           seq(10000, 50000, by = 10000),
           100000, 250000, 500000)

windows_gr <- lapply(sizes, get_window, windows) %>% bind_rows() %>%
  makeGRangesFromDataFrame(keep.extra.columns = TRUE)
win_s <- callSummary(windows_gr,
         peaks = pk_k4me3,
         metadata = meta,
         mark = "H3K4me3",
         filter = TRUE,
         filter_columns = c("qValue", "signalValue"),
         filter_thresholds = c(10, 4),
         normalize = TRUE,
         normalize_columns = c("qValue", "signalValue"),
         summarize_columns = c("qValue", "signalValue"),
         length = FALSE,
         fraction = TRUE,
         n = FALSE,
         optimal_clusters = TRUE,
         heatmap = TRUE,
         titles = paste0(windows_gr$gene_name, "_summary"),
         outdir = "figures/heatmaps")

win_s

save(win_s, file = "computations/windowsize_s.RData")
win_b <- callBinary(query = win_to_test,
                          peaks = pk_k4me3,
                          metadata = meta,
                          filter = TRUE,
                          filter_columns = c("qValue","signalValue"),
                          filter_thresholds = c(10, 4),
                          reduce = TRUE,
                          gap = 300,
                          p = 0.4,
                          optimal_clusters = TRUE,
                          heatmap = TRUE,
                          titles = paste0(windows_gr$gene_name, "_binary"),
                          outdir = "figures/heatmaps",
                          n_features = TRUE)

win_b

save(win_b, file = "computations/windowsize_b.RData")

Figure S5

load("computations/windowsize_s.RData")
load("computations/windowsize_b.RData")

windows <- bind_rows(mutate(win_s, method = "Summary"), mutate(win_b, method = "Binary")) %>% 
  select(gene_name, Consensus, method) %>% 
  separate(gene_name, into = c("gene_name", "window_size"), sep = "_") %>% 
  separate(window_size, into = c("window_size", "drop"), sep = "k") %>% 
  select(-drop) %>% 
  mutate(window_size = as.numeric(window_size)) %T>%
  write_tsv("tables/window_size.tsv")

# Plot TTYH1, "zoomed" in in the 0-50kbp range and in the 0-500kbp range
ttyh1_short <- windows %>% 
  filter(gene_name == "TTYH1") %>% 
  filter(window_size <= 50) %>% 
  mutate(range = "50kbp")

ttyh1_long <- windows %>% 
  filter(gene_name == "TTYH1") %>% 
  filter(window_size <= 500) %>% 
  mutate(range = "500kbp")

bind_rows(ttyh1_short, ttyh1_long) %>% 
  mutate(range = factor(range, levels = c("50kbp", "500kbp"))) %>% 
  ggplot(aes(x = window_size, y = Consensus)) +
  geom_point(aes(colour = method)) +
  geom_line(aes(colour = method), alpha = 0.8, size = 0.5) +
  scale_colour_manual(name = "Method", values = pal_methods) +
  xlab("Window Size (kbp)") + ylab("Consensus score") +
  facet_wrap(~ range, ncol = 1, scales = "free_x") +
  ggmin::theme_min() +
  theme(legend.position = c(0.9, 0.33)) +
  theme(strip.background = element_blank(), strip.text.x = element_blank())

5.4 Analysis of a dispersed mark

This package was initially designed to facilitate the task of identifying epigenetic changes for challenging datasets with abnormal patterns of enrichment of certain histone modifications, like H3K27me3. To analyze how chromswitch performs for a mark with a broad profile of enrichment, we will investigate a few regions where we expect a difference in K27me3 enrichment between human embryonic stem cells (ESC) and differentiated adult tissues.

5.4.1 Download data

We use ESC samples from the Roadmap Epigenomics Project, compared to the same adult tissues we used above.

esc_meta <- meta %>%
  filter(Condition == "Other") %>%
  mutate(Condition = "Adult") %>%
  bind_rows(read_tsv("input/esc_metadata.tsv"))
## Parsed with column specification:
## cols(
##   Sample = col_character(),
##   Tissue = col_character(),
##   Condition = col_character()
## )
esc_meta
ABCDEFGHIJ0123456789
Sample
<chr>
Tissue
<chr>
Condition
<chr>
E065AortaAdult
E066Adult_LiverAdult
E075Colonic_MucosaAdult
E077Duodenum_MucosaAdult
E079EsophagusAdult
E095Left_VentricleAdult
E098PancreasAdult
E100Psoas_MuscleAdult
E101Rectal_Mucosa.Donor_29Adult
E102Rectal_Mucosa.Donor_31Adult
k27m_regions <- c(GRanges("chr7:27122134-27298855"), # HOX
                  GRanges("chr16:54125741-57125741"), # IRX3 3 Mbp region
                  GRanges("chr16:54313663-54328454")) # IRX3

mcols(k27m_regions)$id <- c("HOX", "IRX3_3Mbp", "IRX3")

Download narrow peaks for H3K27me3:

urls_k27 <- paste0("http://egg2.wustl.edu/roadmap/data/byFileType/peaks/consolidated/narrowPeak/",
                      esc_meta$Sample, "-H3K27me3.narrowPeak.gz")
basenames_k27 <- lapply(urls_k27, basename) %>% unlist()
destpaths_k27 <- paste0("data/H3K27me3_peaks/", basenames_k27)

dl_k27 <- mcMap(noClobberDownload, urls_k27, destpaths_k27, mc.cores = n_cores)

extra_cols <- c("name" = "character", "score" = "integer", "strand" = "character",
                "signalValue" = "numeric", "pValue" = "numeric", "qValue" = "numeric",
                "peak" = "numeric")

pk_k27me3 <- lapply(destpaths_k27, rtracklayer::import, format = "bed",
                   extraCols = extra_cols)
names(pk_k27me3) <- esc_meta$Sample

save(pk_k27me3, file = "computations/pk_k27me3.RData")

5.4.2 Run chromswitch using H3K27me3 peaks

k27m_s <- callSummary(k27m_regions,
            peaks = pk_k27me3,
            metadata = esc_meta,
            mark = "H3K27me3",
            filter = TRUE,
            filter_columns = c("qValue", "signalValue"),
            filter_thresholds = c(10, 10),
            normalize = TRUE,
            normalize_columns = c("qValue", "signalValue"),
            summarize_columns = c("qValue", "signalValue"),
            length = FALSE,
            fraction = TRUE,
            n = FALSE,
            optimal_clusters = TRUE,
            heatmap = TRUE,
            outdir = "figures/heatmaps",
            titles = paste0(k27m_regions$id, "_summary"))

save(k27m_s, file = "computations/h3k27me3_summary.RData")
k27m_b <- callBinary(query = k27m_regions,
           peaks = pk_k27me3,
           metadata = esc_meta,
           filter = TRUE,
           filter_columns = c("qValue","signalValue"),
           filter_thresholds = c(10, 10),
           reduce = TRUE,
           gap = 300,
           p = 0.4,
           optimal_clusters = TRUE,
           heatmap = TRUE,
           n_features = TRUE,
           outdir = "figures/heatmaps",
           titles = paste0(k27m_regions$id, "_binary"))

save(k27m_b, file = "computations/h3k27me3_binary.RData")

Presented in Figures S6-8

load("computations/h3k27me3_summary.RData")
load("computations/h3k27me3_binary.RData")

# By the summary strategy
k27m_s %>% 
  mutate(Method = "Summary") %>% 
  select(region, id, k, Average_Silhouette, Consensus, Method)
ABCDEFGHIJ0123456789
region
<chr>
id
<chr>
k
<int>
Average_Silhouette
<dbl>
Consensus
<dbl>
Method
<chr>
chr7:27122134-27298855HOX20.76768830.6610259Summary
chr16:54125741-57125741IRX3_3Mbp50.70817060.5046871Summary
chr16:54313663-54328454IRX320.78296390.5168098Summary
# By the binary strategy
k27m_b %>% 
  mutate(Method = "Binary") %>% 
  select(region, id, k, Average_Silhouette, Consensus, Method)
ABCDEFGHIJ0123456789
region
<chr>
id
<chr>
k
<int>
Average_Silhouette
<dbl>
Consensus
<dbl>
Method
<chr>
chr7:27122134-27298855HOX80.63077680.5760195Binary
chr16:54125741-57125741IRX3_3Mbp20.52555160.5168098Binary
chr16:54313663-54328454IRX370.83333330.6283202Binary

6 Genome-wide validation

To validate chromswitch, we will evaluate whether it’s capable of detecting chromatin state switches that result in brain-specific expression. We start by running chromswitch genome-wide to 5kb windows around TSS. Chromswitch makes a simple assessment of how chromatin states inferred from the data associate with the biological conditions of the samples, but we will need a more refined set of candidates to reasonably expect an association with gene expression data. Therefore, as candidates for validation, we select genes where the chromswitch finds an open chromatin state in brain tissues and closed chromatin in other tissues, evaluate whether these correspond to brain-specific expression.

6.1 Prepare GTEx data

We downloaded data from the Genotype-Tissue Expression Project database. The file data/GTEx_Analysis_v6p_RNA-seq_RNA-SeQCv1.1.8_gene_median_rpkm.gct was downloaded from the RNA-seq section at this link: https://gtexportal.org/home/datasets and contains the mean RPKM per tissue type per gene.

First we assign more concise names to the tissues and filter to a set which is comparable to our epigenomic dataset.

keep_tissue <- c("Amygdala", "Anterior cingulate cortex", "Basal ganglia", 
                 "Cerebella hemisphere", "Cerebellum", "Cortex",
                 "Frontal cortex",
                 "Hippocampus", "Hypothalamus", "Substantia nigra", "Aorta",
                 "Sigmoid colon", "Transverse colon", "Esophagus",
                 "Heart (atrial appendage)", "Heart (left ventricle)",
                 "Kidney", "Liver", "Lung", "Skeletal muscle",
                 "Pancreas", "Small intestine", "Stomach")

brain <-  c("Amygdala", "Anterior cingulate cortex", "Basal ganglia", 
            "Cerebella hemisphere", "Cerebellum", "Cortex", "Frontal cortex",
            "Hippocampus", "Hypothalamus", "Substantia nigra")

other <- keep_tissue[!(keep_tissue %in% brain)]

gtex_rpkm <- read_tsv(
      "data/GTEx_Analysis_v6p_RNA-seq_RNA-SeQCv1.1.8_gene_median_rpkm.gct",
      skip = 2) %>%
    dplyr::rename(gene_name = Description) %>%
    unite(col = "unite", sep = "=", Name, gene_name) %>%
    gather(key = "Tissue", value = "mean_RPKM", 2:length(.)) %>% 
    mutate(Tissue = recode(.x = Tissue,
        `Brain - Amygdala` = "Amygdala",
        `Brain - Anterior cingulate cortex (BA24)` = "Anterior cingulate cortex",
        `Brain - Caudate (basal ganglia)` = "Basal ganglia",
        `Brain - Cerebellar Hemisphere` = "Cerebellar hemisphere",
        `Brain - Cerebellum` = "Cerebellum",
        `Brain - Cortex` = "Cortex",
        `Brain - Frontal Cortex (BA9)` = "Frontal cortex",
        `Brain - Hippocampus` = "Hippocampus",
        `Brain - Hypothalamus` = "Hypothalamus",
        `Brain - Nucleus accumbens (basal ganglia)` = "Nucleus accumbens",
        `Brain - Putamen (basal ganglia)` = "Putamen",
        `Brain - Substantia nigra` = "Substantia nigra",
        `Adipose - Subcutaneous` = "Subcutaneous adipose",
        `Adipose - Visceral (Omentum)` = "Omentum",
        `Artery - Aorta` = "Aorta",
        `Artery - Coronary` = "Coronary artery",
        `Artery - Tibial` = "Tibial artery",
        `Bladder` = "Bladder",
        `Breast - Mammary Tissue` = "Breast",
        `Cervix - Ectocervix` = "Ectocervix",
        `Cervix - Endocervix` = "Endocervix", 
        `Colon - Sigmoid` = "Sigmoid colon",
        `Colon - Transverse` = "Transverse colon",
        `Esophagus - Gastroesophageal Junction` = "Esophagus (GEJ)",
        `Esophagus - Mucosa` = "Esophageal mucosa",
        `Esophagus - Muscularis` = "Esophagus (Muscularis)",
        `Fallopian Tube` = "Fallopian tube",
        `Heart - Atrial Appendage` = "Heart (atrial appendage)",
        `Heart - Left Ventricle` = "Heart (left ventricle)",
        `Kidney - Cortex` = "Kidney",
        `Liver` = "Liver",
        `Lung` = "Lung",
        `Muscle - Skeletal` = "Skeletal muscle",
        `Pancreas` = "Pancreas",
        `Prostate` = "Prostate",
        `Skin - Nut Sun Exposed (Suprapubic)` = "Skin",
        `Small Intestine - Terminal Ileum` = "Small intestine",
        `Spleen` = "Spleen",
        `Stomach` = "Stomach",
        `Thyroid` = "Thyroid",
        `Uterus` = "Uterus",
        `Vagina` = "Vagina", .default = "drop")) %>% 
    filter(Tissue %in% keep_tissue) %>% 
    spread(Tissue, mean_RPKM) %>% 
    separate(unite, into = c("Name", "gene_name"), sep = "=")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   Name = col_character(),
##   Description = col_character()
## )
## See spec(...) for full column specifications.

Calculate the log2 fold change of expression between brain tissues and other tissues and the dataset.

gtex_fc <- gtex_rpkm

# For each gene, calculate the median RPKM across the tissues in each group
gtex_fc$median_brain_RPKM <- apply(select_(gtex_rpkm, brain), 1, median)
gtex_fc$median_other_RPKM <- apply(select_(gtex_rpkm, other), 1, median)

# Calculate log2 FC with a pseudocount
gtex_fc <- gtex_fc %>%
    mutate(FC     = (median_brain_RPKM + 1) / (median_other_RPKM + 1),
           log_FC = log(FC, base = 2))

6.2 Run chromswitch genome wide

6.2.1 Summary strategy

We evaluate the summary strategy on each of the three input types.

H3K4me3:

val_s_k_raw <- refseq_5kbp %>% 
  mclapply(safeCallSummary,
           peaks = pk_k4me3,
           metadata = meta,
           mark = "H3K4me3",
           filter = TRUE,
           filter_columns = c("qValue", "signalValue"),
           filter_thresholds = c(10, 4),
           normalize = TRUE,
           normalize_columns = c("qValue", "signalValue"),
           summarize_columns = c("qValue", "signalValue"),
           length = FALSE,
           fraction = TRUE,
           n = FALSE,
           heatmap = FALSE,
           estimate_state = TRUE,
           signal_col = "signalValue",
           test_condition = "Brain", 
           optimal_clusters = TRUE,
           BPPARAM = SerialParam(),
           mc.cores = n_cores) %>% 
  purrr::transpose()

val_s_k <- val_s_k_raw$result[map_lgl(val_s_k_raw$error, is_null)] %>% 
  bind_rows()

save(val_s_k, val_s_k_raw, file = "computations/validation_s_k.RData")

ChromHMM:

val_s_c_raw <- refseq_5kbp %>%
    mclapply(safeCallSummary,
             peaks = chmm_states,
             metadata = meta,
             mark = "State",
             filter = FALSE,
             normalize = FALSE,
             summarize_columns = NULL,
             length = FALSE,
             fraction = TRUE,
             n = FALSE,
             heatmap = FALSE,
             optimal_clusters = TRUE,
             estimate_state = TRUE,
             signal_col = "fraction",
             test_condition = "Brain",
             BPPARAM = SerialParam(),
             mc.cores = n_cores) %>% 
    purrr::transpose()

val_s_c <- val_s_c_raw$result[map_lgl(val_s_c_raw$error, is_null)] %>% 
    bind_rows()

save(val_s_c_raw, val_s_c, file = "computations/validation_s_c.RData")

DNase I:

val_s_d_raw <- refseq_5kbp %>%
    mclapply(safeCallSummary,
             peaks = pk_dnase,
             metadata = meta,
             mark = "DNase",
             filter = FALSE,
             normalize = FALSE,
             summarize_columns = NULL,
             length = FALSE,
             fraction = TRUE,
             n = FALSE,
             heatmap = FALSE,
             estimate_state = TRUE,
             signal_col = "fraction",
             test_condition = "Brain",
             optimal_clusters = TRUE,
             BPPARAM = SerialParam(), # Parallelize outside
             mc.cores = n_cores) %>% 
    purrr::transpose() 

val_s_d <- val_s_d_raw$result[map_lgl(val_s_d_raw$error, is_null)] %>%
    bind_rows()

save(val_s_d_raw, val_s_d, file = "computations/validation_s_d.RData")

6.2.2 Binary strategy

H3K4me3:

val_b_h_raw <- refseq_5kbp %>% 
  mclapply(safeCallBinary,
           peaks = pk_k4me3,
           metadata = meta,
           filter = TRUE,
           filter_columns = c("qValue", "signalValue"),
           filter_thresholds = c(10, 4),
           reduce = TRUE,
           gap = 300,
           p = 0.4,
           heatmap = FALSE,
           n_features = TRUE,
           estimate_state = TRUE,
           test_condition = "Brain",
           optimal_clusters = TRUE,
           BPPARAM = SerialParam(),
           mc.cores = n_cores) %>% 
  purrr::transpose()

val_b_h <- val_b_h_raw$result[map_lgl(val_b_h_raw$error, is_null)] %>% 
  bind_rows()

save(val_b_h, val_b_h_raw, file = "computations/validation_b_h.RData")

DNase I:

val_b_d_raw <- refseq_5kbp[1] %>% 
  mclapply(safeCallBinary,
           peaks = pk_dnase,
           metadata = meta,
           filter = FALSE,
           reduce = TRUE,
           gap = 300,
           p = 0.4,
           heatmap = FALSE,
           n_features = TRUE,
           estimate_state = TRUE,
           test_condition = "Brain",
           optimal_clusters = TRUE,
           BPPARAM = SerialParam(),
           mc.cores = n_cores) %>% 
  purrr::transpose()

val_b_d <- val_b_d_raw$result[map_lgl(val_b_d_raw$error, is_null)] %>% 
  bind_rows()

save(val_b_d, val_b_d_raw, file = "computations/validation_b_d.RData")

ChromHMM:

val_b_c_raw <- refseq_5kbp %>% 
  mclapply(safeCallBinary,
           peaks = chmm_states,
           metadata = meta,
           filter = FALSE,
           reduce = FALSE,
           gap = 300,
           p = 0.4,
           heatmap = FALSE,
           n_features = TRUE,
           estimate_state = TRUE,
           test_condition = "Brain",
           optimal_clusters = TRUE,
           BPPARAM = SerialParam(),
           mc.cores = n_cores) %>% 
  purrr::transpose()

val_b_c <- val_b_c_raw$result[map_lgl(val_b_c_raw$error, is_null)] %>% 
  bind_rows()

save(val_b_c, val_b_c_raw, file = "computations/validation_b_c.RData")
load("computations/validation_s_k.RData")
load("computations/validation_s_d.RData")
load("computations/validation_s_c.RData")
load("computations/validation_b_h.RData")
val_b_k <- val_b_h # Rename this object for consistency
load("computations/validation_b_d.RData")
load("computations/validation_b_c.RData")

6.3 Evaluate results

First, we will plot the mean RPKM across candidate genes and assess how the distributions differ betwen brain and other tissues.

Define a function to get the candidate switches for each threshold:

joinGtex <- function(val_df) {
  
  val_df %>% 
    dplyr::select(gene_name, Consensus, state, k) %>% 
    inner_join(gtex_fc, by = "gene_name")
}

thresholdScores <- function(threshold, scores) {
    
    scores %>%
        dplyr::filter(Consensus >= threshold) %>%
        mutate(Threshold = threshold)
}

getCandidates <- function(df, method, input) {
  
  thresholds <- c(0.25, 0.5, 0.75)
  
  df <- df %>% filter(state == "ON")
  df_thresholded <- map_df(thresholds, thresholdScores, df) %>% 
    mutate(Threshold = paste0("Consensus >= ", Threshold),
           Method = method,
           Input = input)
}
val_out <- list(val_s_k, val_s_d, val_s_c,
                val_b_k, val_b_d, val_b_c)

eval_methods <- c(rep("Summary", 3), rep("Binary", 3))
eval_input <- rep(c("H3K4me3", "DNase I", "ChromHMM"), 2)

val_gtex <- val_out %>%
  mclapply(joinGtex, mc.cores = 2)

val_candidates <- val_gtex %>% 
  {mcMap(getCandidates, df = ., method = eval_methods, input = eval_input,
         mc.cores = 2)} %>% 
  bind_rows()

val_candidates_long <- val_candidates %>% 
  gather(Tissue, mean_RPKM, 6:26) %>% 
  filter(Threshold == "Consensus >= 0.75")

# Count candidates
counts <- ddply(.data = val_candidates,
                .(Method, Input, Threshold),
                summarize,
                # Define a string with the number of candidates which we can plot
                n = paste0(unique(Input), "\n(n=", length(gene_name), ")")) %>% 
    filter(Threshold == "Consensus >= 0.75")
val_candidates_long_clean <- val_candidates_long %>% 
    mutate(Label = ifelse(Tissue %in% brain, "Brain", "Other")) %>% 
    mutate(Tissue = factor(Tissue, levels = keep_tissue)) %>% 
    makeLabelSimple() %T>%
    write_tsv("tables/validation_boxplot.tsv")

Figure 1c

val_candidates_long_clean %>%
    filter(Tissue %in% keep_tissue) %>%
    filter(Input == "H3K4me3") %>%
    mutate(Method = ifelse(Method == "Summary", "Summary (n=220)", 
                           "Binary (n=125)")) %>% 
    ggplot(aes(x = Tissue, y = mean_RPKM)) +
    geom_boxplot(width = 0.65, outlier.shape = NA, aes(fill = Method),
                 position = position_dodge(width = 0.8)) +
    scale_fill_manual(values = c("Summary (n=220)" = "red",
                                 "Binary (n=125)" = "#2f60af")) +
    facet_wrap(~ Input, ncol = 1) +
    ggmin::theme_min() +
    coord_cartesian(ylim = c(0, 55)) +
    ylab("Mean RPKM") +
    theme(text = element_text(size = 10),
          axis.text.x = element_text(angle = 90, hjust = 1),
      axis.ticks.x = element_blank(),
          legend.position = c(0.9, 0.7))

Figure S10

val_candidates_long_clean %>% 
    ggplot(aes(x = Tissue, y = mean_RPKM)) +
    geom_boxplot(outlier.shape = NA, aes(fill = Label), width = 0.65) +
    scale_fill_manual(values = pal_strategy) +
    ggmin::theme_min() +
    facet_grid(Input ~ Method) +
    coord_cartesian(ylim = c(0, 55)) +
    ylab("Mean RPKM") +
    theme(text = element_text(size = 10),
          axis.text.x = element_text(angle = 90, hjust = 1, size = 7),
          strip.text.y = element_blank(),
          legend.position = "none") +
    geom_text(data = counts, aes(x = 19, y = 52, label = n), size = 3,
                  colour = "black", inherit.aes = FALSE, parse = FALSE)

Second, to quantify validation, we will assess how the difference in expression between brain and other tissues associates with the consensus score output by chromswitch. To do so, we calculate the log2 fold change of the mean RPKM across tissues in each group, across candidate switches at each threshold on the consensus score.

thresholds <- seq(0, 1, by = 0.1)

getFC <- function(threshold, scores) {

    scores %>%
        dplyr::filter(Consensus >= threshold) %>%
        `[[`("log_FC") %>%
        mean()
}

val_fc <- data.frame(thresholds)

val_fc$Summary_H3K4me3  <- data.frame(thresholds) %>% 
   apply(FUN = getFC, MARGIN = 1, val_gtex[[1]])

val_fc$Summary_DNase    <- data.frame(thresholds) %>% 
   apply(FUN = getFC, MARGIN = 1, val_gtex[[2]])

val_fc$Summary_ChromHMM <- data.frame(thresholds) %>% 
   apply(FUN = getFC, MARGIN = 1, val_gtex[[3]])

val_fc$Binary_H3K4me3   <- data.frame(thresholds) %>% 
   apply(FUN = getFC, MARGIN = 1, val_gtex[[4]])

val_fc$Binary_DNase     <- data.frame(thresholds) %>% 
   apply(FUN = getFC, MARGIN = 1, val_gtex[[5]])

val_fc$Binary_ChromHMM  <- data.frame(thresholds) %>% 
   apply(FUN = getFC, MARGIN = 1, val_gtex[[6]])

save(val_fc, file = "computations/validation_fc.RData")
val_fc_tidy <- val_fc %>% 
  gather(strategy, median_logFC, 2:length(.)) %>% 
  separate(strategy, into = c("Method", "Input"), sep = "_") %>% 
  mutate(Input = recode(.x = Input, DNase = "DNase I")) %>% 
  makeLabelSimple() %T>%
  write_tsv("tables/validation_foldchange.tsv")

Figure 1d

val_fc_tidy %>% 
  filter(Input != "DNase I") %>% 
  ggplot(aes(x = thresholds, y = median_logFC)) +
  stat_smooth(aes(colour = Label, linetype = Label), se = FALSE) +
  scale_colour_manual(name = "Strategy", values = pal_strategy) +
  scale_linetype_manual(name = "Strategy", values = linetype_strategy) +
  ggmin::theme_min() +
  ylab("Mean log2 fold change of expression") + xlab("Threshold") +
  theme(legend.position = "bottom") +
  labs(colour = "") + labs(linetype = "") +
  guides(colour = guide_legend(ncol = 1))
## `geom_smooth()` using method = 'loess'

Figure S11

val_fc_tidy %>% 
  ggplot(aes(x = thresholds, y = median_logFC)) +
  stat_smooth(aes(colour = Label, linetype = Label), se = FALSE) +
  scale_colour_manual(name = "Strategy", values = pal_strategy) +
  scale_linetype_manual(name = "Strategy", values = linetype_strategy) +
  ggmin::theme_min() +
  # ggtitle("Median Log FC of expression across candidate genes") +
  ylab("Mean log2 fold change of expression") + xlab("Threshold") +
  theme(legend.position = "bottom") +
  labs(colour = "") + labs(linetype = "") +
  guides(colour = guide_legend(ncol = 1))
## `geom_smooth()` using method = 'loess'

7 Session Info

sessionInfo()
## R version 3.4.1 (2017-06-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)
## 
## Matrix products: default
## BLAS: /mnt/RICHARDS_JBOD1/KLEINMAN_SCRATCH/modules/software/R/3.4.1/lib64/R/lib/libRblas.so
## LAPACK: /mnt/RICHARDS_JBOD1/KLEINMAN_SCRATCH/modules/software/R/3.4.1/lib64/R/lib/libRlapack.so
## 
## locale:
## [1] C
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] bindrcpp_0.2                           
##  [2] chromswitch_0.99.0                     
##  [3] stringi_1.1.6                          
##  [4] forcats_0.2.0                          
##  [5] stringr_1.2.0                          
##  [6] dplyr_0.7.4                            
##  [7] readr_1.1.1                            
##  [8] tidyr_0.7.2                            
##  [9] tibble_1.3.4                           
## [10] tidyverse_1.2.1                        
## [11] purrr_0.2.4                            
## [12] plyr_1.8.4                             
## [13] magrittr_1.5                           
## [14] BiocParallel_1.10.1                    
## [15] xtable_1.8-2                           
## [16] pROC_1.10.0                            
## [17] biomaRt_2.32.1                         
## [18] Homo.sapiens_1.3.1                     
## [19] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
## [20] org.Hs.eg.db_3.4.1                     
## [21] GO.db_3.4.1                            
## [22] OrganismDbi_1.18.1                     
## [23] GenomicFeatures_1.28.5                 
## [24] AnnotationDbi_1.38.2                   
## [25] Biobase_2.36.2                         
## [26] rtracklayer_1.36.6                     
## [27] GenomicRanges_1.28.6                   
## [28] GenomeInfoDb_1.12.3                    
## [29] IRanges_2.10.3                         
## [30] S4Vectors_0.14.4                       
## [31] BiocGenerics_0.22.1                    
## [32] ggmin_0.0.0.9000                       
## [33] ggplot2_2.2.1                          
## [34] RColorBrewer_1.1-2                     
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-131               bitops_1.0-6              
##  [3] matrixStats_0.52.2         lubridate_1.7.1           
##  [5] devtools_1.13.4            bit64_0.9-7               
##  [7] httr_1.3.1                 rprojroot_1.2             
##  [9] tools_3.4.1                backports_1.1.1           
## [11] R6_2.2.2                   DBI_0.7                   
## [13] lazyeval_0.2.1             colorspace_1.3-2          
## [15] withr_2.1.0                mnormt_1.5-5              
## [17] bit_1.1-12                 curl_3.0                  
## [19] compiler_3.4.1             git2r_0.19.0              
## [21] cli_1.0.0                  rvest_0.3.2               
## [23] graph_1.54.0               xml2_1.1.1                
## [25] DelayedArray_0.2.7         labeling_0.3              
## [27] scales_0.5.0               psych_1.7.8               
## [29] RBGL_1.52.0                digest_0.6.12             
## [31] Rsamtools_1.28.0           foreign_0.8-69            
## [33] rmarkdown_1.8              XVector_0.16.0            
## [35] pkgconfig_2.0.1            htmltools_0.3.6           
## [37] readxl_1.0.0               rlang_0.1.4               
## [39] rstudioapi_0.7             RSQLite_2.0               
## [41] BiocInstaller_1.26.1       bindr_0.1                 
## [43] jsonlite_1.5               RCurl_1.95-4.8            
## [45] GenomeInfoDbData_0.99.0    Matrix_1.2-12             
## [47] Rcpp_0.12.13               munsell_0.4.3             
## [49] yaml_2.1.15                SummarizedExperiment_1.6.5
## [51] zlibbioc_1.22.0            grid_3.4.1                
## [53] blob_1.1.0                 crayon_1.3.4              
## [55] lattice_0.20-35            Biostrings_2.44.2         
## [57] haven_1.1.0                hms_0.4.0                 
## [59] knitr_1.17                 codetools_0.2-15          
## [61] reshape2_1.4.2             XML_3.98-1.9              
## [63] glue_1.2.0                 evaluate_0.10.1           
## [65] modelr_0.1.1               cellranger_1.1.0          
## [67] gtable_0.2.0               assertthat_0.2.0          
## [69] broom_0.4.3                GenomicAlignments_1.12.2  
## [71] memoise_1.1.0              cluster_2.0.6