26 minute read

RNA Quest: Rethinking RNA-Seq reports through reanalysis of previously published neuronal transcriptomes

by Ryan Pevey

If you’re reading this right now, this is a rough draft and I’m working on outputting the final post. Feel free to check back in or drop me a message or comment and I can keep you posted on when I’m done.

Also welcome to evolio, I hope you enjoy the experience.

Intro

Introduce the problem of reporting and the gap in communication between bioinformaticists and biologists. I sit at this nexus, with training and experiencce in boths worlds and I can act as a liason between the two.

Summary

This reports the example results of the differential expression analysis for the Pai et al. epilepsy neurons dataset. The dataset consists of three temporal lobe control post-mortem samples (TL), and three temporal lobe with epilepsy patient samples (TLE). In the original paper, four populations of cells were isolated from the samples via Flourescence activated cell sorting (FACS). The main paper focused on glial cells but here I’ve focused on the neuronal isolates.

Results summary text here

Data management

The analysis starts with loading all of the proper software and data files for downstream analysis. If you’re not a programmer and you’ve ever been curious about the steps that go into this type of analysis click on the buttons labelled ‘Show’ and it will expand the code that goes into each step. It’s also worth it to note that this script is only the very last stage of the analysis. I started by downloading the fastq files for each sample from the NCBI website, then aligned reads to the reference genome and produced a set of counts files which contain the number of reads counted for each genetic feature in the reference transcriptome. Those counts files are the input files for this analysis. If you are not a biologist, I’ve tried to make this report focused on their viewpoint, to help facilitate a deeper exploration of the dataset that we’ve created for them.

Click to expand code
# setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
# input_directory = '../../data/'
# setwd(input_directory)

Load libraries

Click to expand code
library(ggplot2)
library(limma)
library(Glimma)
library(viridis)
# library(AnnotationDbi)
# library(org.Hs.eg.db)
library(DESeq2)
library(treemap)
library(vsn)
# library(plotly)
library(pheatmap)
library(EnhancedVolcano)
# library(htmlwidgets)
library(kableExtra)
library(DT)

Read in datasets

Click to expand code
# Import read count matrix
dat <- read.delim("../../data/merged.tsv", header = TRUE, sep = "\t", check.names = FALSE)

# Remove special tagged rows, the first four, from the head of dat.
dat <- dat[5:length(dat[, 1]), ]
# Extract Ensembl ID and gene symbols
geneID <- as.data.frame(cbind(rownames(dat), dat$gene_symbols))
colnames(geneID) <- c("EnsID", "gene_symbols")
dat <- dat[, 2:7]

# Import sample metadata.
metadata <- read.delim("../../data/SraRunMetadata.csv", header = TRUE, sep = ",")

# Metadata condition as factor
metadata$condition <- factor(metadata$condition)
levels(metadata$condition) <- c("TL", "TLE")
metadata$ID <- c("Ctrl1", "Ctrl2", "Ctrl3", "TLE1", "TLE2", "TLE3")

# Create sample information table
coldata <- metadata[, c("ID", "condition", "Run")]
rownames(coldata) <- metadata$ID
# Assign row names of coldata to column names of dat
colnames(dat) <- rownames(coldata)

# Load RDS files for following code chunk
dds <- readRDS("../../bin/dds_DE_TL-TLE.rds")

# Import results tables
resTable <- read.csv(paste0("../../results/", "TL_TLE_allGenes_DEresults.csv"), header = TRUE)

Create Differential expression object

# This block of code is not executed, but provides the details of how the DESeq object (dds) was
# created. Create sample information table
coldata <- metadata[, c("condition", "ID", "Run")]
rownames(coldata) <- metadata$SampleID
colnames(dat)  #<- metadata$SampleID

# Create DESeqDataSet object
dds <- DESeqDataSetFromMatrix(countData = dat, colData = coldata, design = ~condition)
dds

# Pre-filter low count genes Smallest group size is the number of samples in each group (3 in this
# dataset).
smallestGroupSize <- 3
keep <- rowSums(counts(dds) >= 10) >= smallestGroupSize
dds <- dds[keep, ]

# Differential expression analysis
dds <- DESeq(dds)
# Remove nonconverged genes
dds <- dds[which(mcols(dds)$betaConv), ]

Exploratory data analysis

Exploratory Data Analysis (EDA) is the process of visually and statistically examining data to identify patterns, trends, and potential issues before formal analysis. It helps ensure data quality, detect outliers, and guide analytical decisions. If you aren’t doing the analyses yourself, you might normally not even be presented these figures. But they’re important sanity checks to ensure data quality.

All samples total gene counts

Gene counts across all samples are shown both before and after normalization. The first plot illustrates that raw counts are already within a comparable range across samples. Normalization further refines the data, ensuring consistency for downstream analysis, as seen in the second plot.

boxplot(log2(counts(dds, normalized = FALSE) + 1), axes = FALSE, ylab = "log2 counts")
axis(1, labels = FALSE, at = c(seq(1, 40, 1)))
axis(2, labels = TRUE)
title("Non-normalized Gene counts")
text(x = seq_along(colnames(dds)), y = -2, labels = colnames(dds), adj = 0.5, xpd = TRUE, cex = 0.75)

boxplot(log2(counts(dds, normalized = TRUE) + 1), axes = FALSE, ylab = "log2 counts")
axis(1, labels = FALSE, at = c(seq(1, 40, 1)))
axis(2, labels = TRUE)
title("Normalized Gene counts")
text(x = seq_along(colnames(dds)), y = -2, labels = colnames(dds), adj = 0.5, xpd = TRUE, cex = 0.75)

Plot dispersion estimates

A diagnostic plot of how well the data fits the model created by DESeq2 for differential expression testing. Good fit of the data to the model produces a scatter of points (black and blue) around the fitted curve (red). Low expressing genes, with counts below 10 reads across all samples, have been filtered out for computational efficiency.

# Plot dispersion estimates, good fit of the data to the model produces a scatter of points around
# the fitted curve.
plotDispEsts(dds)
title("Dispersion estimates")

Gene Biotype treemap for all genes

The treemap below shows the relative proportion of each gene biotype for all genes in the dataset. The area of the rectangle is proportional to the percentage of each biotype. The majority are protein coding which is normal, but protein coding genes make up a significantly higher proportion of differentially expressed genes (right figure).

biotype <- data.frame(table(mcols(dds)$GENETYPE))
biotype$Var1 <- gsub("_", " ", biotype$Var1)
biotype <- biotype[order(biotype$Freq, decreasing = TRUE), ]
treemap(biotype, index = "Var1", vSize = "Freq", type = "index", palette = viridis(length(biotype$Var1)),
    aspRatio = 1.618/1, title = "", inflate.labels = TRUE, lowerbound.cex.labels = 0)

biotype$Percent <- sapply(biotype$Freq, function(prop) round((prop/sum(biotype$Freq)) * 100, digits = 2))
colnames(biotype) <- c("Biotype", "Frequency", "Percent")
datatable(as.data.frame(biotype), options = list(scrollX = TRUE, scrollCollapse = TRUE), rownames = FALSE,
    caption = paste0("These are the percent of each gene biotype for all genes in the dataset."))

Differential Expression Analysis Results

Temporal lobe neurons of Epilepsy patients (TLE) vs. Temporal lobe neurons of control post-mortem tissue (ref.)

Alright, the meat of the analysis. The star of the show. RESULTS SUMMARY TEXT

# Build results table Set contrast groups, reference level should be listed last.
contrast <- c("condition", "TLE", "TL")
res <- results(dds, contrast = contrast, alpha = 0.05, pAdjustMethod = "BH")

# Create significant results table ordered by fold change
res05 <- res[which(res$padj < 0.05), ]
res05 <- res05[order(-res05$log2FoldChange), ]

There are 1576 significant differentially expressed genes (FDR < 0.05) when comparing TLE to TL as reference. 758 are up-regulated, 818 are down-regulated. The results of the DE analysis are summarized in the table below. Overall, roughly the same number of genes(~3%) are down-regulated compared to up-regulated genes.

Summary of results

summary(res)
## 
## out of 26557 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up)       : 758, 2.9%
## LFC < 0 (down)     : 818, 3.1%
## outliers [1]       : 36, 0.14%
## low counts [2]     : 7209, 27%
## (mean count < 22)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

Up-regulated vs. Down-regulated genes

bar <- data.frame(direction = c("up", "down"), genes = c(length(which(res05$log2FoldChange > 0)), length(which(res05$log2FoldChange <
    0))))

# Create stacked barplot
ggplot(bar, aes(x = direction, y = genes, fill = direction)) + geom_col(position = "stack") + scale_fill_viridis(2,
    begin = 0.3, end = 0.7, direction = 1, discrete = TRUE) + labs(x = NULL, y = "Number of genes", fill = "Direction") +
    theme(legend.title = element_text(hjust = 0.5), panel.grid.major = element_blank(), panel.background = element_blank()) +
    scale_x_discrete(limits = rev)

Results table: TLE vs. TL (ref.)

resTable[, 3:7] <- round(resTable[, 3:7], digits = 3)
datatable(as.data.frame(resTable), options = list(scrollX = TRUE, scrollCollapse = TRUE), caption = paste0("Table 1: These are the statistically significant differentially expressed genes (FDR < 0.05) when comparing ",
    contrast[2], " and ", contrast[3], " (ref.). The columns are ordered by log2FoldChange with up-regulated genes at the top and down-regulted genes at the bottom. Numeric values listed to 3 significant digits. You can reorder the columns or search for a gene of interest by Ensembl ID or Gene name in the search bar in the top right of the table."))

Gene Biotype treemap TLE vs. TL (ref.)

The treemaps below show the relative proportion of each gene feature biotype for all of the significant genes in the above result table. Think of it like a pie chart, but better. The area of each of the rectangles is proportional to the percentage of each biotype when comparing TLE and TL (ref.). The majority are protein coding however there are a large amount of lncRNA as well. TDP-43 is known to interact with miRNA as well.

# Create treemap of gene biotypes
resTable$GENETYPE <- as.factor(resTable$GENETYPE)
biotype <- data.frame(table(resTable$GENETYPE))
biotype$Var1 <- gsub("_", " ", biotype$Var1)
biotype <- biotype[order(biotype$Freq, decreasing = TRUE), ]
treemap(biotype, index = "Var1", vSize = "Freq", type = "index", palette = viridis(length(biotype$Var1)),
    aspRatio = 1.618/1, title = "", inflate.labels = TRUE, lowerbound.cex.labels = 0)

biotype$Percent <- sapply(biotype$Freq, function(prop) round((prop/sum(biotype$Freq)) * 100, digits = 2))
colnames(biotype) <- c("Biotype", "Frequency", "Percent")
datatable(as.data.frame(biotype), options = list(scrollX = TRUE, scrollCollapse = TRUE), rownames = FALSE,
    caption = paste0("These are the percent of each gene biotype when comparing ", contrast[2], " and ",
        contrast[3], " (ref.). The majority are protein coding however there are a large amount of lncRNA as well. TDP-43 is known to interact with miRNA as well."))

Log fold change shrinkage for visualization and ranking

LFC shrinkage is an important regularization technique that is used to improve the reliability of fold change estimates, especially for low expression or high variability genes.

# Log fold change shrinkage for visualization and ranking plotMA(res, ylim=c(-3,3))
resLFC <- lfcShrink(dds, contrast = contrast, type = "ashr")
# plotMA(resLFC, ylim=c(-3,3))
res <- resLFC

# Write out results table Order results table by smallest p-value
colnames(res)[5] <- "FDR"
res <- res[order(res$FDR), ]

Heatmaps

Heatmaps provide a visual representation of gene expression across multiple samples, using color gradients to indicate expression levels. Clustering patterns can reveal relationships between genes and conditions, helping to identify co-regulated genes and distinct expression profiles.

Up-regulated genes: TLE vs. TL (ref.)

These are the most up-regulated genes in the epilepsy neurons.

# rlog for visualization and ranking, generally works well for small datasets
rld <- rlog(dds, blind = FALSE)
# I'll use rld for visualization moving forward, because of the small sample size.

# Data Quality assessment by sample clustering and visualization
ann_colors = viridis(3, begin = 0, end = 1, direction = -1)
ann_colors = list(Condition = c(TL = ann_colors[2], TLE = ann_colors[3]))
df <- data.frame(Condition = colData(dds)[, "condition"])
rownames(df) <- colnames(dds)  # Ensure row names match colnames in heatmap
# df$Condition

# Declare expression matrix
rld <- rld[, c(which(rld$condition == contrast[2]), which(rld$condition == contrast[3]))]
# Extract sample names in the correct order (TL first, then TLE)
ordered_samples <- colnames(rld)[order(rld$condition)]
# Up-regulated genes
geneNum <- 40
selectUp <- assay(rld)[rownames(res05), ordered_samples][1:geneNum, ]
selectUpNames <- rownames(res05)[1:geneNum]
selectUpNames <- which(rownames(dds) %in% selectUpNames)
zUp <- (selectUp - rowMeans(selectUp))/sd(selectUp)  #z-score
pheatmap(zUp, cluster_rows = TRUE, show_rownames = TRUE, cluster_cols = FALSE, annotation_col = df, annotation_colors = ann_colors,
    labels_row = mcols(rld)$geneNew[selectUpNames], color = viridis(n = 256, begin = 0, end = 1, direction = -1))


Down-regulated genes: TLE vs. TL (ref.)

These are the most down-regulated genes in the epilepsy neurons.

# Down-regulated genes
geneNum <- 40
selectDown <- assay(rld)[rownames(res05), ordered_samples][(length(res05[, 1]) - geneNum):length(res05[,
    1]), ]
selectDownNames <- rownames(res05)[(length(res05[, 1]) - geneNum):length(res05[, 1])]
selectDownNames <- which(rownames(dds) %in% selectDownNames)
zDown <- (selectDown - rowMeans(selectDown))/sd(selectDown)  #z-score
pheatmap(zDown, cluster_rows = TRUE, show_rownames = TRUE, cluster_cols = TRUE, annotation_col = df,
    annotation_colors = ann_colors, labels_row = mcols(rld)$geneNew[selectDownNames], color = viridis(n = 256,
        begin = 0, end = 1, direction = -1))


Up-regulated and Down-regulated genes: TLE vs. TL (ref.)

Both the top up-regulated and down-regulated genes in the dataset.

# Up and Down-regulated genes
selectUpDownNames <- c(selectUpNames, selectDownNames)
zUpDown <- rbind(zUp, zDown)  #z-score
pheatmap(zUpDown, fontsize = 8, cluster_rows = TRUE, show_rownames = TRUE, cluster_cols = FALSE, annotation_col = df,
    annotation_colors = ann_colors, labels_row = mcols(rld)$geneNew[selectUpDownNames], color = viridis(n = 256,
        begin = 0, end = 1, direction = -1))

Volcano plot: TLE vs. TL (ref.)

Gene expression plots

Gene expression plots visualize how a gene’s expression varies across conditions or samples, helping identify trends, variability, and potential differential expression. Higher expression levels indicate greater transcript abundance, while statistical comparisons (e.g., boxplots, dot plots) help assess significance between groups.

Most up-regulated gene: TLE vs. TL (ref.)

# Plot counts highest foldChange gene
ens <- rownames(res)[which.max(res$log2FoldChange)]
plt <- plotCounts(dds, gene = ens, intgroup = "condition", normalized = TRUE, transform = TRUE, returnData = TRUE)
y_max <- max(plt$count, na.rm = TRUE)
padj_value <- format(res[ens, "FDR"], digits = 3)

p <- ggplot(plt, aes(x = condition, y = count, color = condition)) + geom_boxplot(color = "black", outlier.shape = NA,
    width = 0.25) + geom_point(cex = 5, position = position_jitter(w = 0.1, h = 0)) + scale_color_viridis_d(begin = 0.75,
    end = 0.25, direction = 1) + labs(title = bquote(italic(.(ens))), y = "Counts", x = NULL) + theme_bw() +
    theme(text = element_text(size = 30), plot.title = element_text(hjust = 0.5), legend.position = "none",
        panel.border = element_blank(), panel.grid = element_blank(), axis.line = element_line("black"),
        axis.text.x = element_text(angle = 45, hjust = 1)) + scale_y_log10() + annotate("text", x = 1.5,
    y = y_max * 1.5, label = paste("FDR =", padj_value)) + annotate("segment", x = 1, xend = 2, y = y_max *
    1.3, yend = y_max * 1.3) + annotate("segment", x = 1, xend = 1, y = y_max * 1.3, yend = y_max * 1.2) +
    annotate("segment", x = 2, xend = 2, y = y_max * 1.3, yend = y_max * 1.2)

p


Most down-regulated gene: TLE vs. TL (ref.)

# Plot counts lowest foldChange gene
ens <- rownames(res)[which.min(res$log2FoldChange)]
plt <- plotCounts(dds, gene = ens, intgroup = "condition", normalized = TRUE, transform = TRUE, returnData = TRUE)
y_max <- max(plt$count, na.rm = TRUE)
padj_value <- format(res[ens, "FDR"], digits = 3)

p <- ggplot(plt, aes(x = condition, y = count, color = condition)) + geom_boxplot(color = "black", outlier.shape = NA,
    width = 0.25) + geom_point(cex = 5, position = position_jitter(w = 0.1, h = 0)) + scale_color_viridis_d(begin = 0.75,
    end = 0.25, direction = 1) + labs(title = bquote(italic(.(ens))), y = "Counts", x = NULL) + theme_bw() +
    theme(text = element_text(size = 30), plot.title = element_text(hjust = 0.5), legend.position = "none",
        panel.border = element_blank(), panel.grid = element_blank(), axis.line = element_line("black"),
        axis.text.x = element_text(angle = 45, hjust = 1)) + scale_y_log10() + annotate("text", x = 1.5,
    y = y_max * 1.5, label = paste("FDR =", padj_value)) + annotate("segment", x = 1, xend = 2, y = y_max *
    1.3, yend = y_max * 1.3) + annotate("segment", x = 1, xend = 1, y = y_max * 1.3, yend = y_max * 1.2) +
    annotate("segment", x = 2, xend = 2, y = y_max * 1.3, yend = y_max * 1.2)

p


Plot notable genes of interest: TLE vs. TL (ref.)

Gene expression dashboard.

# Plot counts gene of interest
geneOfInt <- "LGR6"
ens <- geneOfInt
plt <- plotCounts(dds, gene = geneOfInt, intgroup = "condition", normalized = TRUE, transform = TRUE,
    returnData = TRUE)
y_max <- max(plt$count, na.rm = TRUE)
padj_value <- format(res[ens, "FDR"], digits = 3)

p <- ggplot(plt, aes(x = condition, y = count, color = condition)) + geom_boxplot(color = "black", outlier.shape = NA,
    width = 0.25) + geom_point(cex = 5, position = position_jitter(w = 0.1, h = 0)) + scale_color_viridis_d(begin = 0.75,
    end = 0.25, direction = 1) + labs(title = bquote(italic(.(ens))), y = "Counts", x = NULL) + theme_bw() +
    theme(text = element_text(size = 30), plot.title = element_text(hjust = 0.5), legend.position = "none",
        panel.border = element_blank(), panel.grid = element_blank(), axis.line = element_line("black"),
        axis.text.x = element_text(angle = 45, hjust = 1)) + scale_y_log10() + annotate("text", x = 1.5,
    y = y_max * 1.5, label = paste("FDR =", padj_value)) + annotate("segment", x = 1, xend = 2, y = y_max *
    1.3, yend = y_max * 1.3) + annotate("segment", x = 1, xend = 1, y = y_max * 1.3, yend = y_max * 1.2) +
    annotate("segment", x = 2, xend = 2, y = y_max * 1.3, yend = y_max * 1.2)

p

Discussion

Congrats, you can now hand your report off to collaborators.


Methods

Pai et al. (DOI: 10.1186/s40478-022-01453-1) methods summary

My methods.

Scripts

This is the R script that I used for this analysis, it has been ported into this report file. The following code is not executed, but is presented for transparency and reproducibility. Again, this is only the script that I used for this analysis. If you want all of the scripts for the full analysis you can find it on my github repository for this project here: rnaQuest at GitHub.

#This script performs differential expression analysis on the Pai et al. dataset (DOI: 10.1186/s40478-022-01453-1) using DESeq2.

library(ggplot2)
library(limma)
library(Glimma)
library(viridis)
library(AnnotationDbi)
library(org.Hs.eg.db)
library(DESeq2)
library(treemap)
library(vsn)
library(plotly)
library(pheatmap)
library(EnhancedVolcano)
library(htmlwidgets)

#####################
###Data Management###
#####################

###Import read count matrix for TLE study
setwd(rstudioapi::selectDirectory(caption = "Select Project Directory"))
dat <- read.delim('data/merged.tsv', header = TRUE, sep="\t", check.names = FALSE)
dat[1:6,1:6]

#Remove special tagged rows, the first four, from the head of dat.
length(dat[,1])
head(dat)[,1:3]
dat <- dat[5:length(dat[,1]),]
head(dat)[,1:3]
length(dat[,1])
#Extract Ensembl ID and gene symbols
geneID <- as.data.frame(cbind(rownames(dat), dat$gene_symbols))
colnames(geneID) <- c('EnsID', 'gene_symbols')
head(geneID)
dat <- dat[,2:7]
head(dat)

#Import sample metadata.
metadata <- read.delim('data/SraRunMetadata.csv', header=TRUE, sep=",")
head(metadata)

#Metadata condition as factor
metadata$condition <- factor(metadata$condition)
levels(metadata$condition) <- c('TL','TLE')
metadata$ID <- c('Ctrl1', 'Ctrl2', 'Ctrl3', 'TLE1', 'TLE2', 'TLE3')

#Create sample information table
coldata <- metadata[,c('ID','condition', 'Run')]
rownames(coldata) <- metadata$ID
coldata
#Assign row names of coldata to column names of dat
colnames(dat) <- rownames(coldata)
head(dat)


#######################
###DESeq DE analysis###
#######################

#Create DESeqDataSet object
dds <- DESeqDataSetFromMatrix(countData = dat,
                              colData = coldata,
                              design = ~ condition)
dds

#Replace Ensembl ID with gene symbol were available
IDs <- rownames(dds)
head(IDs)
head(geneID)
colnames(geneID)[1:2] <- c('ensemblID', 'geneID')
#check that IDs and geneID match each other
all(IDs == geneID$ensemblID)
length(which(IDs!=geneID$ensemblID))
length(IDs)
length(geneID$ensemblID)
tail(IDs)
tail(geneID$ensemblID)
#Create a new gene feature column that uses gene symbol, if available, and Ensembl ID otherwise.
geneID$geneNew <- ifelse(is.na(geneID$geneID), 
                         geneID$ensemblID, 
                         geneID$geneID)

#Check that geneNew shows desired output for missing gene symbols
head(geneID)
#Spot check random IDs and ensemblIDs for sanity check
rndm <- round(runif(10, min = 0, max = length(IDs)), digits = 0)
IDs[rndm]
geneID[rndm,]

#Add gene annotations
gene_info <- AnnotationDbi::select(org.Hs.eg.db, 
                    keys = geneID$ensemblID, 
                    columns = c("SYMBOL", "MAP", "GENETYPE", "GENENAME"), 
                    keytype = "ENSEMBL")
head(gene_info)
#return only unique mappings
gene_info <- gene_info[isUnique(gene_info$ENSEMBL),]
#geneID and gene_info do not match yet
all(geneID$ensemblID == gene_info$ENSEMBL)
gene_info <- merge(geneID, gene_info, 
                   by.x = "ensemblID",  # Column in colData to match
                   by.y = "ENSEMBL",    # Column in gene_info to match
                   all.x = TRUE)  # Keep all colData rows
# Restore original order of colData
gene_info <- gene_info[match(geneID$ensemblID, gene_info$ensemblID), ]
#Now they do match!
all(geneID$ensemblID == gene_info$ENSEMBL)
head(geneID)
head(gene_info)

#Add geneID to dds metadata
mcols(dds) <- cbind(mcols(dds), gene_info)
rownames(dds) <- geneID$geneNew

#Pre-filter low count genes
#Smallest group size is the number of samples in each group (3 in this ).
smallestGroupSize <- 3
keep <- rowSums(counts(dds) >= 10) >= smallestGroupSize
dds <- dds[keep,]

#Confirm factor levels
dds$ID
dds$condition
dds$Run
# mcols(dds)$MAP

#Differential expression analysis
dds <- DESeq(dds)
#Remove nonconverged genes
dds <- dds[which(mcols(dds)$betaConv),]
boxplot(log2(counts(dds, normalized = TRUE)+1), axes = FALSE, ylab = 'log2 counts')
axis(1, labels = FALSE, at = c(seq(1,40,1)))
axis(2, labels = TRUE)
text(x = seq_along(colnames(dds)), y = -2, labels = colnames(dds), 
     # srt = 45,    #rotate
     adj = 0.5,    #justify
     xpd = TRUE, #print outside of plot area
     cex = 0.75)  #smaller font size

#Plot dispersion estimates
plotDispEsts(dds)

#Save an rds file to start the script here
# saveRDS(dds, 'bin/dds_DE_TL-TLE.rds')
# dds <- readRDS('bin/dds_DE_TL-TLE.rds')


#########################
###Build results table###
#########################

levels(dds$condition)
#Set contrast groups, reference level should be listed last.
contrast <- c('condition', 'TLE', 'TL')
res <- results(dds,
               contrast = contrast,
               alpha = 0.1,
               pAdjustMethod = 'BH')
res
summary(res)

#How many adjusted p-values are less than 0.1?
sum(res$padj < 0.1, na.rm=TRUE)
table(res$padj < 0.1)
#Create significant results table ordered by fold change
res10 <- res[which(res$padj < 0.1),]
res10 <- res10[order(-res10$log2FoldChange),]
#How many adjusted p-values are less than 0.05?
sum(res$padj < 0.05, na.rm=TRUE)
table(res$padj < 0.05)

#How many up-regulated (FDR < 0.1)
length(which(res10$log2FoldChange > 0))
#How many down-regulated (FDR < 0.1)
length(which(res10$log2FoldChange < 0))
bar <- data.frame(direction = c('up','down'), genes = c(length(which(res10$log2FoldChange > 0)),length(which(res10$log2FoldChange < 0))))
bar

#Log fold change shrinkage for visualization and ranking, especially since we only have 3 samples per group
plotMA(res, ylim=c(-3,3))
resultsNames(dds)
resLFC <- lfcShrink(dds, contrast = contrast, type = 'ashr')
resLFC
plotMA(resLFC, ylim=c(-3,3))
res <- resLFC

#Add geneID variables
res$geneNew <- row.names(res)
res_annotated <- merge(as.data.frame(res), gene_info, by="geneNew", all.x=TRUE)
head(mcols(dds)[3])
head(res_annotated)

#Write out results table
#Reorder columns and order results table by smallest p-value
colnames(res_annotated)[1] <- 'gene'
colnames(res_annotated)[6] <- 'FDR'
res_annotated <- res_annotated[order(res_annotated$FDR),]
head(res_annotated)
# write.csv(as.data.frame(res_annotated), file=paste0('results/','TL','_','TLE','_allGenes_DEresults.csv'), row.names = TRUE)

#Glimma MD plot
res.df <- as.data.frame(res_annotated)
res.df$log10MeanNormCount <- log10(res.df$baseMean)
idx <- rowSums(counts(dds)) > 0
res.df <- res.df[idx,]
res.df$FDR[is.na(res.df$FDR)] <- 1
res.df <- res.df[isUnique(res.df$gene),]
exm <- res.df[,c(1:6,13)]
rownames(exm) <- exm$gene
anno <- res.df[,c(1,7:12)]
rownames(anno) <- anno$gene
# color_vector <- viridis(100)[cut(exm$FDR, breaks=100)]
# de_colors <- setNames(viridis(3), c('DE', 'Not Sig', 'Not DE'))
de_colors <- viridis(3)
status <- ifelse(exm$FDR < 0.05 & abs(exm$log2FoldChange) > 1, 'DE',
                 ifelse(exm$FDR > 0.05 & abs(exm$log2FoldChange) < 1, 'Not Sig', 'Not DE'))

glMDPlot(exm,
         xval = 'log10MeanNormCount',
         yval = 'log2FoldChange',
         anno = anno,
         groups = dds$condition,
         # cols = de_colors,
         cols = c('#BEBEBEFF', '#440154FF', '#21908CFF'),
         status = status,
         samples = colnames(dds)
)


#########
###EDA###
#########

#Create treemap of gene biotypes for all genes
biotype_all <- data.frame(table(mcols(dds)$GENETYPE))
biotype_all$Var1 <- gsub('_',' ',biotype_all$Var1)
biotype_all <- biotype_all[order(biotype_all$Freq, decreasing = TRUE),]
# png(filename='results/figs/tree.png',width=800, height=1295)
treemap(biotype_all,
        index = 'Var1',
        vSize = 'Freq',
        type = 'index',
        palette = viridis(length(biotype_all$Var1)),
        aspRatio = 1/1.618,
        title = '',
        inflate.labels = TRUE,
        lowerbound.cex.labels = 0
)
# dev.off()

#Create treemap of gene biotypes for significant results genes
res_annotated$GENETYPE <- as.factor(res_annotated$GENETYPE)
levels(res_annotated$GENETYPE)
biotype_sig <- data.frame(
  table(
    res_annotated$GENETYPE[which(res_annotated$FDR < 0.05)]
    )
  )
biotype_sig$Var1 <- gsub('_',' ',biotype_sig$Var1)
biotype_sig <- biotype_sig[order(biotype_sig$Freq, decreasing = TRUE),]
# png(filename='results/figs/tree.png',width=800, height=1295)
treemap(biotype_sig,
        index = 'Var1',
        vSize = 'Freq',
        type = 'index',
        palette = viridis(length(biotype_sig$Var1)),
        aspRatio = 1/1.618,
        title = '',
        inflate.labels = TRUE,
        lowerbound.cex.labels = 0
)
# dev.off()
biotype <- merge(biotype_all, biotype_sig, by = 'Var1')
colnames(biotype) <- c('biotype','AllGenes','SigGenes')
biotype <- biotype[order(biotype$AllGenes, decreasing = TRUE),]
# Combine rows with low counts into 'other' for the Chi-squared test.
biotype <- rbind(biotype,c('other', sum(biotype$AllGenes[4:8]), sum(biotype$SigGenes[4:8])))
biotype <- biotype[c(1:3,9),]
biotype$AllGenes <- as.integer(biotype$AllGenes)
biotype$SigGenes <- as.integer(biotype$SigGenes)
biotype$AllProp <- round(biotype$AllGenes/sum(biotype$AllGenes), digits = 3)
biotype$SigProp <- round(biotype$SigGenes/sum(biotype$SigGenes), digits = 3)
biotype <- biotype[,c(1,2,4,3,5)]
biotype
# Perform Chi-Squared Goodness-of-Fit Test
# Compute total counts
total_all <- sum(biotype$AllGenes)  # Total genes
total_sig <- sum(biotype$SigGenes)  # Total significant genes
# Compute expected counts under the null hypothesis
biotype$Expected <- total_sig * (biotype$AllGenes / total_all)
chi_sq_test <- chisq.test(biotype$SigGenes, p = biotype$AllGenes / total_all, rescale.p = TRUE)
# Print results
print(chi_sq_test)

#VST for visualization and ranking, generally scales well for large datasets
vsd <- vst(dds, blind = FALSE)
head(assay(vsd), 3)
meanSdPlot(assay(vsd))
#NTD for visualization and ranking 
ntd <- normTransform(dds)
head(assay(ntd), 3)
meanSdPlot(assay(ntd))
#rlog for visualization and ranking, generally works well for small datasets
rld <- rlog(dds, blind = FALSE)
head(assay(rld), 3)
meanSdPlot(assay(rld))
#I'll use rld for visualization moving forward, because of the small sample size.

#Dimensional reduction EDA
#PCA plot
plotPCA(rld, intgroup='condition')
plotPCA(rld, intgroup='condition', pcsToUse=3:4)

#MDS plot
# rld <- calcNormFactors(rld)
sampleDists <- dist(t(assay(rld)))
sampleDistMatrix <- as.matrix(sampleDists)
mdsData <- data.frame(cmdscale(sampleDistMatrix))
mds <- cbind(mdsData, as.data.frame(colData(rld)))
ggplot(mds, aes(X1,X2,color=condition,shape=condition)) + geom_point(size=3)

#######################
###Visualize results###
#######################

#Create a function that will output a standalone html file of a plotly figure for integration into the subsequent markdown report.
# setwd('results/figs/')
# outdir <- getwd()
saveInteractivePlot <- function(plot, outdir, filename) {
  # Ensure the output directory exists
  if (!dir.exists(outdir)) {
    dir.create(outdir, recursive = TRUE)
  }
  # Construct full file path
  file_path <- file.path(outdir, filename)
  # Save the widget
  saveWidget(plot, file_path, selfcontained = TRUE)
  # Read the saved HTML file
  html_content <- readLines(file_path)
  # Write the modified HTML file back
  writeLines(html_content, file_path)
  message("Interactive plot saved successfully: ", file_path)
}

#Data Quality assessment by sample clustering and visualization
ann_colors = viridis(3, begin = 0, end = 1, direction = -1)
ann_colors = list(Condition = c(TL=ann_colors[2], TLE=ann_colors[3]))
# df <- as.data.frame(colData(dds)[,'condition'])
df <- data.frame(Condition = colData(dds)[, 'condition'])
rownames(df) <- colnames(dds)  # Ensure row names match colnames in heatmap

select <- order(rowMeans(counts(dds, normalized = TRUE)),
                decreasing = TRUE)[1:10]
#Declare expression matrix
# rld <- rld[,c(which(rld$condition=='TL'),which(rld$condition=='TLE'))]
rld <- rld[,c(which(rld$condition==contrast[2]),which(rld$condition==contrast[3]))]
# Extract sample names in the correct order (TL first, then TLE)
ordered_samples <- colnames(rld)[order(rld$condition)]
pheatmap(assay(rld)[select,ordered_samples],
         cluster_rows = TRUE,
         show_rownames = TRUE,
         cluster_cols = FALSE,
         annotation_col = df,
         annotation_colors = ann_colors,
         color = viridis(n=256, begin = 0, end = 1, direction = -1),
         labels_row = mcols(rld)$geneNew
         )

# select <- assay(rld)[head(order(res$FDR),100),]
#Up-regulated genes
geneNum <- 40
selectUp <- assay(rld)[rownames(res10),ordered_samples][1:geneNum,]
selectUpNames <- rownames(res10)[1:geneNum]
selectUpNames <- which(rownames(dds)%in%selectUpNames)
zUp <- (selectUp - rowMeans(selectUp))/sd(selectUp) #z-score
pheatmap(zUp,
         cluster_rows = TRUE,
         show_rownames = TRUE,
         cluster_cols = FALSE,
         annotation_col = df,
         annotation_colors = ann_colors,
         labels_row = mcols(rld)$geneNew[selectUpNames],
         color = viridis(n=256, begin = 0, end = 1, direction = -1))

#Down-regulated genes
geneNum <- 40
selectDown <- assay(rld)[rownames(res10),ordered_samples][(length(res10[,1])-geneNum):length(res10[,1]),]
selectDownNames <- rownames(res10)[(length(res10[,1])-geneNum):length(res10[,1])]
selectDownNames <- which(rownames(dds)%in%selectDownNames)
zDown <- (selectDown - rowMeans(selectDown))/sd(selectDown) #z-score
pheatmap(zDown,
         cluster_rows = TRUE,
         show_rownames = TRUE,
         cluster_cols = FALSE,
         annotation_col = df,
         annotation_colors = ann_colors,
         labels_row = mcols(rld)$geneNew[selectDownNames],
         color = viridis(n=256, begin = 0, end = 1, direction = -1))

#Up and Down-regulated genes
selectUpDownNames <- c(selectUpNames,selectDownNames)
zUpDown <- rbind(zUp, zDown) #z-score
# setwd('results/figs/')
# tiff(filename= paste0('TLE','_','TL','_upDownHeatmap.tiff'),width=1200, height=1000)
gap_index <- sum(df$Condition == "TL")
pheatmap(zUpDown,
         fontsize = 8,
         cluster_rows = TRUE,
         show_rownames = TRUE,
         cluster_cols = FALSE,
         annotation_col = df,
         annotation_colors = ann_colors,
         # gaps_col = gap_index,
         labels_row = mcols(rld)$geneNew[selectUpDownNames],
         labels_col = "",
         show_colnames = TRUE,
         color = viridis(n=256, begin = 0, end = 1, direction = -1))
# dev.off()

#Volcano plot
EnhancedVolcano(res, lab = rownames(res), x = 'log2FoldChange', y = 'pvalue',
                # title = paste0(contrast[2], ' vs. ', contrast[3]),
                title = bquote('TL vs. TLE'),
                subtitle = NULL,
                legendPosition = 'none',
                # pCutoff = 10e-5,
                # FCcutoff = 1,
                boxedLabels = TRUE,
                drawConnectors = TRUE,
                colConnectors = 'black',
                widthConnectors = 1.0,
                arrowheads = FALSE,
                col = c('grey30', viridis(3, direction = -1)), colAlpha = 0.7,
                gridlines.major = FALSE,
                gridlines.minor = FALSE) #+ coord_flip()



#Plot counts highest foldChange gene
plotCounts(dds, gene = rownames(res)[which.max(res$log2FoldChange)], intgroup = 'condition', normalized = TRUE, transform = TRUE)

plt <- plotCounts(dds, gene = rownames(res)[which.max(res$log2FoldChange)], intgroup = 'condition', normalized = TRUE, transform = TRUE, returnData=TRUE)

ggplot(plt, aes(x=condition, y=count)) + 
  geom_point(position=position_jitter(w=0.1,h=0)) +
  labs(title = rownames(res)[which.max(res$log2FoldChange)]) +
  # labs(title = rownames(res)[which.min(res$padj)]) +
  theme(plot.title = element_text(hjust = 0.5)) +
  #scale_y_log10(breaks=c(25,100,400))
  scale_y_log10()

#Create annotation function for significance bracket
annotate_significance <- function(p) {
  # Extract gene of interest data
  y_max <- max(plt$count, na.rm = TRUE)
  padj_value <- res[ens, "padj"]
  # Compute positions for bracket and text
  y_bracket <- y_max * 1.30  # 5% above max expression
  y_text <- y_max * 1.5     # 10% above max
  # Add annotation to the plot
  p <- p +
    # Horizontal line (bracket)
    annotate("segment", x = 1, xend = 2, y = y_bracket, yend = y_bracket) +
    # Vertical ticks at each end
    annotate("segment", x = 1, xend = 1, y = y_bracket, yend = y_bracket * 0.90) +
    annotate("segment", x = 2, xend = 2, y = y_bracket, yend = y_bracket * 0.90) +
    # P-value text annotation
    annotate("text", x = 1.5, y = y_text, 
             label = paste("FDR =", format(padj_value, digits=3)))
  p  # Return the modified plot
}

#Plot counts gene of interest
geneOfInt <- 'LGR6'
ens <- which(res$geneNew == geneOfInt)
# plotCounts(dds, gene = rownames(res)[ens], intgroup = 'group')

plt <- plotCounts(dds, gene = rownames(res)[ens], intgroup = 'condition', normalized = TRUE, transform = TRUE, returnData=TRUE)

p <- ggplot(plt, aes(x=condition, y=count, color = condition)) + 
  # geom_boxplot(color = 'black', outlier.shape = NA) +
  geom_point(position=position_jitter(w=0.1,h=0)) +
  scale_color_viridis_d(begin = 0.75, end = 0.25, direction = 1) +
  labs(title = geneOfInt) +
  # labs(title = rownames(res)[which.min(res$padj)]) +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5),
        legend.position = 'none',
        panel.border = element_blank(),
        panel.grid = element_blank(),
        axis.line = element_line('black')) +
  #scale_y_log10(breaks=c(25,100,400))
  scale_y_log10()
p

max(plt$count)
# setwd('results/figs/')
# tiff(filename= paste0(geneOfInt,'_plotCounts.tiff'),width=800, height=1295)
p <- ggplot(plt, aes(x=condition, y=count, color = condition)) + 
  geom_boxplot(color = 'black', outlier.shape = NA, width = 0.25) +
  geom_point(cex = 5, position=position_jitter(w=0.1,h=0)) +
  scale_color_viridis_d(begin = 0.75, end = 0.25, direction = 1) +
  labs(title = bquote(italic(.(geneOfInt))),
       y = 'Counts',
       x = NULL) +
  theme_bw() +
  theme(text = element_text(size=30),
        plot.title = element_text(hjust = 0.5),
        legend.position = 'none',
        panel.border = element_blank(),
        panel.grid = element_blank(),
        axis.line = element_line('black'),
        axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_y_log10()
p

#Add significance bracket
p + 
  annotate("text",
         x=1.5,
         y=max(plt$count)*1.5,
         label=paste("FDR =",format(res[ens, "padj"],digits=3))
         ) +
  annotate("segment", x = 1, xend = 2, y = max(plt$count) * 1.30, yend = max(plt$count) * 1.30, size = 0.5)

# dev.off()

annotate_significance(p)

p <- ggplot(plt, aes(x=condition, y=count, color = condition)) + 
  geom_boxplot(color = 'black', outlier.shape = NA, width = 0.25) +
  geom_point(cex = 5, position=position_jitter(w=0.1,h=0)) +
  scale_color_viridis_d(begin = 0.75, end = 0.25, direction = 1) +
  labs(title = geneOfInt,
       y = 'Counts',
       x = NULL) +
  theme_bw() +
  theme(text = element_text(size=30),
        plot.title = element_text(hjust = 0.5),
        legend.position = 'none',
        panel.border = element_blank(),
        panel.grid = element_blank(),
        axis.line = element_line('black'),
        axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_y_log10()
  # annotate("text",
  #          x=1.5,
  #          y=max(plt$count)*1.05,
  #          label=paste("FDR =",format(res[ens, "padj"],digits=3))
  #          )

p_pltly <- ggplotly(annotate_significance(p))
p_pltly

#Save the interactive plot out as a standalone html file
#Uncomment the next line to run the function from source
# source("bin/rnaQuest_html.R")
saveInteractivePlot(p_pltly, 'results/figs/', paste0(geneOfInt, "_plotlyCounts.html"))

#Gene set enrichment analysis
#Coming soon!

Session info

Session Info provides details about the computational environment used for this analysis, including the versions of R, loaded packages, and system settings. This ensures that the analysis can be reproduced accurately in the future, even if software updates change how certain functions behave.

sessionInfo()
## R version 4.4.1 (2024-06-14 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: America/Phoenix
## tzcode source: internal
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] DT_0.33                     kableExtra_1.4.0           
##  [3] EnhancedVolcano_1.22.0      ggrepel_0.9.6              
##  [5] pheatmap_1.0.12             vsn_3.72.0                 
##  [7] treemap_2.4-4               DESeq2_1.44.0              
##  [9] SummarizedExperiment_1.34.0 Biobase_2.64.0             
## [11] MatrixGenerics_1.16.0       matrixStats_1.5.0          
## [13] GenomicRanges_1.56.2        GenomeInfoDb_1.40.1        
## [15] IRanges_2.38.1              S4Vectors_0.42.1           
## [17] BiocGenerics_0.50.0         viridis_0.6.5              
## [19] viridisLite_0.4.2           Glimma_2.14.0              
## [21] limma_3.60.6                ggplot2_3.5.1              
## 
## loaded via a namespace (and not attached):
##  [1] gridExtra_2.3           formatR_1.14            rlang_1.1.4            
##  [4] magrittr_2.0.3          gridBase_0.4-7          compiler_4.4.1         
##  [7] systemfonts_1.2.1       vctrs_0.6.5             stringr_1.5.1          
## [10] pkgconfig_2.0.3         crayon_1.5.3            fastmap_1.2.0          
## [13] XVector_0.44.0          labeling_0.4.3          promises_1.3.2         
## [16] rmarkdown_2.29          UCSC.utils_1.0.0        preprocessCore_1.66.0  
## [19] xfun_0.51               zlibbioc_1.50.0         cachem_1.1.0           
## [22] jsonlite_1.8.8          later_1.4.1             DelayedArray_0.30.1    
## [25] BiocParallel_1.38.0     irlba_2.3.5.1           parallel_4.4.1         
## [28] R6_2.6.1                bslib_0.9.0             stringi_1.8.4          
## [31] RColorBrewer_1.1-3      SQUAREM_2021.1          jquerylib_0.1.4        
## [34] Rcpp_1.0.14             knitr_1.49              httpuv_1.6.15          
## [37] Matrix_1.7-0            igraph_2.1.4            tidyselect_1.2.1       
## [40] rstudioapi_0.17.1       abind_1.4-8             yaml_2.3.10            
## [43] codetools_0.2-20        affy_1.82.0             lattice_0.22-6         
## [46] tibble_3.2.1            shiny_1.10.0            withr_3.0.2            
## [49] evaluate_1.0.3          xml2_1.3.6              pillar_1.10.1          
## [52] affyio_1.74.0           BiocManager_1.30.25     generics_0.1.3         
## [55] invgamma_1.1            truncnorm_1.0-9         munsell_0.5.1          
## [58] scales_1.3.0            ashr_2.2-63             xtable_1.8-4           
## [61] glue_1.7.0              tools_4.4.1             data.table_1.17.0      
## [64] locfit_1.5-9.11         grid_4.4.1              crosstalk_1.2.1        
## [67] edgeR_4.2.2             colorspace_2.1-1        GenomeInfoDbData_1.2.12
## [70] cli_3.6.3               mixsqp_0.3-54           S4Arrays_1.4.1         
## [73] svglite_2.1.3           dplyr_1.1.4             gtable_0.3.6           
## [76] sass_0.4.9              digest_0.6.36           SparseArray_1.4.8      
## [79] htmlwidgets_1.6.4       farver_2.1.2            htmltools_0.5.8.1      
## [82] lifecycle_1.0.4         httr_1.4.7              statmod_1.5.0          
## [85] mime_0.12

Updated:

Leave a comment