s
#Tip: To run any line in RStudio you can use Ctrl+Enter or Command+Enter.
#To use any package in R, we need to load its library first.To run DESeq2
#you just need to type the following.
library("DESeq2")
## Loading required package: S4Vectors
## Loading required package: stats4
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
##
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
##
## The following objects are masked from 'package:stats':
##
## IQR, mad, xtabs
##
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, as.vector, cbind,
## colnames, do.call, duplicated, eval, evalq, Filter, Find, get,
## grep, grepl, intersect, is.unsorted, lapply, lengths, Map,
## mapply, match, mget, order, paste, pmax, pmax.int, pmin,
## pmin.int, Position, rank, rbind, Reduce, rownames, sapply,
## setdiff, sort, table, tapply, union, unique, unlist, unsplit
##
## Loading required package: IRanges
## Loading required package: GenomicRanges
## Loading required package: GenomeInfoDb
## Loading required package: SummarizedExperiment
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
##
## Loading required package: Rcpp
## Loading required package: RcppArmadillo
source("https://bioinfo.umassmed.edu/content/pub/funcs.R")
##
## Attaching package: 'gplots'
##
## The following object is masked from 'package:IRanges':
##
## space
##
## The following object is masked from 'package:stats':
##
## lowess
#1. Reading the data.
# We already made gene quantifications and merged expected counts into
# a single table. Here we prepared another table using whole reads
# (not only using reduced reads) to give you an idea about the complete
# picture of DESeq analysis.
url <- "https://galaxyweb.umassmed.edu/pub/class/data.tsv"
#install a package
install.packages("RCurl")
library(RCurl)
file <- textConnection(getURL(url))
#file <- "/project/umw_biocore/class/data.tsv"
#1. Read the data with row.names. We have the gene names in the
# first column in this table. When we set row.names=1 It will remove
# the first column from the data and put them to the row name section.
rsem <- read.table(file,sep="\t", header=TRUE, row.names=1)
#2. Creating the data structure for DESeq Analysis, we need to select
# the columns that we are going to use in the analysis. Here we are
# going to use experiment and control data that will be 6 columns.
# Here in this step make sure to include the columns you want to use in
# your analysis.
columns <- c("exper_rep1","exper_rep2","exper_rep3",
"control_rep1","control_rep2","control_rep3")
data <- data.frame(rsem[, columns])
#3. Scatter plots are a key approach to assess differences between conditions.
# We plot the average expression value (counts or TPMs for example) of
# two conditions.First calculate the average reads per gene. To find the
# average we sum experiments per gene and divide it to the number of replicas
# In this case we will divide the sum to 3. We will calculate the average for
# control libraries too. After that, we are ready to merge the average values
# from experiment and control to create a two column data structure using the
# cbind function that merges tables.
avgall<-cbind(rowSums(data[c("exper_rep1","exper_rep2","exper_rep3")])/3,
rowSums(data[c("control_rep1","control_rep2","control_rep3")])/3)
# We can also change the column names using colnames function.
colnames(avgall)<-c("Treat", "Control")
#4.There are several packages in R that can be used to
# generate scatter plots. The same plot can be generated using
# the ggplot package.
gdat<-data.frame(avgall)
ggplot() +
geom_point(data=gdat, aes_string(x="Treat", y="Control"),
colour="black", alpha=6/10, size=3) +
scale_x_log10() +scale_y_log10()
#5. In Eukaryotes only a subset of all genes are expressed in
# a given cell. Expression is therefore a bimodal distribution,
# with non-expressed genes having counts that result from experimental
# and biological noise. It is important to filter out the genes
# that are not expressed before doing differential gene expression.
# You can decide which cutoff separates expressed vs non-expressed
# genes by looking your histogram we created.
# In our case a total sum of 10 counts separates well expressed
# from non-expressed genes
sumd <- rowSums(data)
hist(log10(sumd), breaks=100)
abline(v=1)
#To make all to all scater plot, use the function below
all2all(data)
######################################
## LETS START DESeq ANALYSIS
######################################
#
#6. The goal of Differential gene expression analysis is to find
# genes or transcripts whose difference in expression, when accounting
# for the variance within condition, is higher than expected by chance.
# The first step is to indicate the condition that each column (experiment)
# in the table represent.
# Here we define the correspondence between columns and conditions.
# Make sure the order of the columns matches to your table.
columns
## [1] "exper_rep1" "exper_rep2" "exper_rep3" "control_rep1"
## [5] "control_rep2" "control_rep3"
conds <- factor( c("Control","Control", "Control",
"Treat", "Treat","Treat") )
# DESeq will compute the probability that a gene is differentially
# expressed (DE) for ALL genes in the table. It outputs both a
# nominal and a multiple hypothesis corrected p-value (padj).
# Because we are testing DE for over 20,000 genes, we do need to correct
# for multiple hypothesis testing. To find genes that are significantly
# DE, we select the ones has lower padj values and higher fold changes and
# visualize them on our scatter plot with different color.
# padj values are corrected p-values which are multiplied by the number
# of comparisons. Here we are going to use 0.01 for padj value
# and > 1 log2foldchange. (1/2 < foldChange < 2)
de_res <- runDESeq(data, columns, conds, padj=0.01, log2FoldChange=1, non_expressed_cutoff=10)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
overlaid_data <- overlaySig(gdat, de_res$res_selected)
#7. Make ggplot and overlay the significantly DE genes in the scatter plot
ggplot() +
geom_point(data=overlaid_data, aes_string(x="Treat", y="Control",
colour="Legend"), alpha=6/10, size=3) +
scale_colour_manual(values=c("All"="darkgrey","Significant"="red"))+
scale_x_log10() +scale_y_log10()
#8. The Second way to visualize it, we use MA plots.
# For MA Plot there is another builtin function that you can use.
plotMA(de_res$res_detected,ylim=c(-2,2),main="DESeq2");