Table of contents




expected learning outcome

To understand the basics of RNA-Seq data, how to use RNA-Seq for different objectives and to familiarize yourself with some standard software packages for such analysis.




getting started

Sample pooling has revolutionized sequencing. It is now possible to sequence 10s of samples together. Different objectives require different sequencing depths. Doing differential gene expression analysis requires less sequencing depth than transcript reconstruction so when pooling samples it is critical to keep the objective of the experiment in mind.

In this activity we will use subsets of experimentally generated datasets. One dataset was generated for differential gene expression analysis while the other towards transcript annotation.

For quantification we will use a set of data generated from the same strain as the genome reference mouse (C57BL/6J). We selected three replicates from control (wild type) and three from a knock-out strain. The idea is to find genes that are in the same pathway as the gene that were knock out. We will use a reduced genome consisting of the first 9.5 million bases of mouse chromosome 16 and the first 50.5 million bases of chromosome 7.







Quantification and differential gene expression analysis

The main goal of this activity is to go through a standard method to obtain gene expression values and perform differential gene expression analysis from an RNA-Seq experiment.

We will start by alignment and visualizing the data using the TopHat2 spliced aligner. We will then perform gene quantification using the RSEM program and finally differential gene expression analysis of the estimated counts using DESeq.

This activity should also serve as a review of the previous two classes as you will work on the hpcc

 

Before you start

1. Connect to cluster using the command below with your username and password

 ssh -X your_user@ghpcc06.umassrc.org

2. Please start an interactive job in the cluster using the command below for any of your operations in the cluster. Head node is used only for job submissions. An interactive session is a way to log in to a node and use it as if it were your local workstation for the length of the session. You may use this same script in the future whenever you want to start an work on a node and run programs in interactive session

 /project/umw_biocore/bin/qlogin

3. Prepare your working directory

/project/umw_biocore/bin/class/prepWD.bash



Exercise 1: prepare for genomic alignment

Both TopHat and RSEM rely on bowtie to perform read alignment (similar to the BWA aligner you used in genome assembly tutorial). Bowtie like BWA uses very efficient genome compression algorithm (Burrows-Wheeler transform) that allows for quick matching of sequences with less than 3 missmatches. To use these alignments it is necessary to create the BW transform of our genome before mapping reads. The bowtie-build2 program in the Bowtie distribution creates the BW index. Change your directory to genome.quantification. Invoke the BW transform on the mm10.fa file found in the directory genome.quantification:

1. First load necessary modules

module load RSEM/1.2.11
module load tophat/2.0.9
module load bowtie2/2.3.2
module load samtools/0.0.19
module load java/1.7.0_25

2. Build bowtie2 index files

cd ~/RNASeqWS/transcriptomics/genome.quantification
bowtie2-build -f mm10.fa mm10

We named it mm10 (following the UCSC genome browser naming convention. Although we named it similarly to the full genome, this sequence file only contains a very small region of the mouse genome. Our alignment database will be called mm10, and will include partial sequences for chromosomes 7 and 16.

3. In addition we also will prepare the transcriptome for RSEM alignment. RSEM will align directly to the set of transcripts included (ucsc.gtf file). The transcript file was downloaded directly from the UCSC table browser. The file does not contain the information necessary to map isoforms to genes, we therefore compiled a table, ucsc_into_genesymbol.rsem that contains this information. To generate the necessary index files use the command below. Please not that the \ is used to span multiple lines:

rsem-prepare-reference \
--gtf ucsc.gtf --transcript-to-gene-map ucsc_into_genesymbol.rsem \
mm10.fa mm10.rsem

Expected result: Your genome.quantification directory now should contain the following files (Tip: use ls -1):

Bowtie2 indeces:

mm10.1.bt2
mm10.2.bt2
mm10.3.bt2
mm10.4.bt2
mm10.rev.1.bt2
mm10.rev.2.bt2

Bowtie indeces:

mm10.rsem.ti
mm10.rsem.grp
mm10.rsem.chrlist
mm10.rsem.transcripts.fa
mm10.rsem.seq
mm10.rsem.idx.fa
mm10.rsem.1.ebwt
mm10.rsem.2.ebwt
mm10.rsem.3.ebwt
mm10.rsem.4.ebwt
mm10.rsem.rev.1.ebwt
mm10.rsem.rev.2.ebwt

 

Can you tell what are these indeces for? What is the difference between the fasta files used for each index set?





exercise 2 Quantify with the RSEM program

RSEM depends on an existing annotation and will only scores transcripts that are present in the given annotation file. We will compare the alignments produced by RSEM and tophat and this will become clear.

The first step is to prepare the transcript set that we will quantify. We selected the UCSC genes which is a very comprehensive, albeit a bit noisy dataset. As with all the data in this activity we will only use the subset of the genes that map to the genome regions we are using.

2.1 Calculate expression

RSEM now is ready to align and then attempt to perform read assignment and counting for each isoform in the file provided above. You must process each one of the 6 libraries:

cd ~/RNASeqWS/transcriptomics
mkdir rsem

rsem-calculate-expression --paired-end -p 2 \
 --output-genome-bam fastq.quantification/control_rep1.1.fq \
fastq.quantification/control_rep1.2.fq genome.quantification/mm10.rsem rsem/ctrl1.rsem

And similarly for each of the other 5 libraries

/project/umw_biocore/bin/class/runRSEM.bash

You should take the time to familiarize yourself with the output

2.2 Create consolidated table

In the bin directory (we provide a simple script to take all the independent RSEM output and combine it into a single table, which is then useful for inspection and ready for differential gene expression analysis.

To find out what the script does you may type the following command in the transcriptomics directory

cd ~/RNASeqWS/transcriptomics
perl bin/rsem.to.table.pl -help

We will generate two tables with isoform and gene level expression:

perl bin/rsem.to.table.pl -out rsem.gene.summary.count.txt -indir rsem -gene_iso genes -quantType expected_count
perl bin/rsem.to.table.pl -out rsem.isoforms.summary.count.txt -indir rsem -gene_iso isoforms -quantType expected_count

You should now take the time to inspect these tables, find genes that look affected by the experiment.

2.3 Visualize the raw data:

We are now ready to look at the data. Download the IGV program from the IGV site and transfer your files from the hpcc to your laptop. In general you will not transfer the files. There are ways to access the files from your dekstop or laptop directly at the HPCC, we'll cover this later. Since the files are very small transfer time will not be a problem. Use Filezila or the scp command if you are on a mac, to copy the files onto your laptop/desktop, please ensure you note to which directory you transfer the files as later you will need to load them into IGV.

Launch the IGV browser, and use the File -> load to load the files onto the browser. Load only one or two .bam files to begin with. Don't forget to choose mm10 as a genome build from the corner

A few genes are good examples of differentially expressed genes. For example the whole region around the key Fgf21 gene is upregulated in experiment vs controls, while the gene Crebbp is downregulated in experiments vs controls. To point your browser to either gene just type or copy the name of the gene in the location box at the top.

We will revisit these genes below when we do differential gene expression.





exercise 3: Genome alignment of RNA-seq reads

The fastq.quantification folder contains a relative small set of illumina sequencing reads. We will examine this set by first directly mapping to the reduced mouse genome.

Make sure you are in the transcriptomics directory for this activity. genome.quantification, fastq.quantification should be subdirectories. Check this before you proceed.

1. To avoid cluttering the workspace we will direct the output of each exercise to its own directory. In this case for example:

mkdir ~/RNASeqWS/transcriptomics/tophat

2. Then align each of the libraries to the genome. The fastq.quantification subdirectory contains six different libraries, three for a control experiment from wild type mouse liver and from mouse that are deficient in two different proteins. Each genotype was sequenced in triplicates using paired-end 50 base paired reads.

To first explore the data visually in IGV, we’ll use the TopHat2 aligner to map these reads to our reduced genome:

You have to be in this directory to run tophat alignment commands

cd ~/RNASeqWS/transcriptomics
tophat2 --library-type fr-firststrand --segment-length 20 -G  genome.quantification/ucsc.gtf \
-o  tophat/ctrl1 genome.quantification/mm10 fastq.quantification/control_rep1.1.fq \
fastq.quantification/control_rep1.2.fq

 

And using this command as a template, align the other libraries or use the script below (Your homework)

/project/umw_biocore/bin/class/runTophat.bash

 

Question: What percent of reads were mapped for the library?


Check the tophat reports for the library you aligned, in particular the align_summary.txt file

4. Tophat always reports its alignment in a file named “accepted_hits.bam”. To make things clear we’ll move this files onto a clean directory. Move the files by, for example, doing

mv tophat/ctrl1/accepted_hits.bam tophat/ctrl1.bam

For the rest of the libraries (Your homework)

/project/umw_biocore/bin/class/moveBamFiles.bash 

To visualize the alignments we sort this file and generate indexes (for rapid data access):

cd ~/RNASeqWS/transcriptomics/tophat
samtools sort ctrl1.bam ctrl1.sorted
samtools index ctrl1.sorted.bam

and with all the other libraries: (Homework)

/project/umw_biocore/bin/class/makeSortIndex.bash

Now we can compare the rsem and tophat alignments on IGV to see the difference




exercise 4: Differential gene expression analysis with DESeq

DESeq is an R package available via Bioconductor and is designed to normalise count data from high-throughput sequencing assays such as RNA-Seq and test for differential expression.

Please finish the tutorial to learn how to run R commands
tryr.codeschool.com

To run R you just need to type the following in the command line:

module load R/3.0.1
Rscript  bin/rsem.quant.r

We will revisit this R section in the 6th week.

Note:In the R code change the naming of the control and expr according to the order of colnames(d).
colData = as.data.frame((colnames(d)));
colData[1,2] = "control";
colData[2,2] = "expr";
colData[3,2] = "expr";
colData[4,2] = "control";
colData[5,2] = "expr";
colData[6,2] = "control";




Tophat2 Mapping & Visualization Summary:

1. Map all the libraries using Tophat2.

2. Download and install igv to your laptops
http://www.broadinstitute.org/software/igv

3. sort and make index files of your bam files using the commands below. In this way you can visualize them in IGV.

a. # Load the module

module load samtools/0.0.19


b. # Go to the directory that we are going to use
cd ~/RNASeqWS/transcriptomics


c. #Use samtools sort function to sort your bam file and give a new name ctrl1.sorted.bam
samtools sort -f tophat/th.quant.ctrl1/accepted_hits.bam tophat/ctrl1.sorted.bam


d. #Use samtools index function to be able visualize the bam file in IGV. This will make a tophat/ctrl1.sorted.bam.bai file.
samtools index  tophat/ctrl1.sorted.bam


4. Copy *.sorted.bam and *.sorted.bam.bai files to your computer using FileZilla. These files supposed to be under ~/RNASeqWS/transcriptomics/tophat directory.

5. Open IGV select mouse (mm10) from the corner.

6. Open your bam file using File => Load from file

7. Do the steps 3-6 for other libraries.

8. Go to fgf21 gene from entering gene name to the coordinate box at the top.