Six-week basic computing workshop

Week 3: Awk, grep, make, UCSC Table Browser, bed12 file and bedtools

(A $ sign at the beginning of a line indicates that the text after the $ is a command.)
  1. Connect to the cluster using ssh.
  2. $ ssh 
  3. Prepare your working directory.
  4. $ clear 
    $ mkdir ~/class/week3
    $ cd ~/class/week3
  5. We can install bedtools into our home directory. Before installing it, you may want to check whether it already exists on the system and which version it is. To check:
  6. $ bedtools --version 

    If a version is printed and it is sufficient for your purpose, you do not have to install bedtools; skip the installation steps below and start using it directly.
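
The same check can be scripted. A minimal sketch, assuming a POSIX shell: `command -v` reports whether bedtools is on the PATH without failing loudly when it is missing. The variable name bedtools_status is only illustrative.

```shell
# Report whether bedtools is already available on the PATH.
if command -v bedtools >/dev/null 2>&1; then
    # Found: record the version string it prints.
    bedtools_status="found ($(bedtools --version 2>&1))"
else
    bedtools_status="not found, install it as described in the next steps"
fi
echo "bedtools: $bedtools_status"
```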

  7. Use the wget command to download bedtools into your directory.
  8. $ wget
  9. After downloading the latest version of bedtools, unpack it using the command below:
  10. $ tar xvfz BEDTools.v2.17.0.tar.gz
  11. Enter the unpacked directory to compile the source.
  12. $ cd bedtools-2.17.0
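
The tar flags used above are x (extract), v (print each file name as it is processed), f (read the named archive) and z (decompress with gzip). A small sketch on a throwaway archive, so the real download is not touched; the names demo_src and demo.tar.gz are made up for illustration.

```shell
# Work in a temporary directory so nothing in ~/class/week3 is changed.
workdir=$(mktemp -d)
cd "$workdir"

# Build a tiny gzip-compressed archive to practice on.
mkdir demo_src
echo "hello" > demo_src/README
tar czf demo.tar.gz demo_src
rm -r demo_src

# t lists the contents without extracting, which shows the top-level directory name.
listing=$(tar tzf demo.tar.gz)
echo "$listing"

# x extracts, recreating demo_src/ in the current directory.
tar xzf demo.tar.gz
extracted=$(cat demo_src/README)
echo "$extracted"
```

Listing an archive first is a quick way to learn which directory to cd into after unpacking.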
  13. To compile the source you need gcc. If make does not work on your system, install gcc or load it as a module. Check the available modules:
  14. $ module avail
  15. Load the gcc module:
  16. $ module load gcc/4.8.1
  17. You are ready to compile the code using `make`. You can also read the README or INSTALL files to learn more about installation.
  18. $ make
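  19. Download the RepeatMasker and RefSeq annotations for mm10 from the UCSC Table Browser. The commands below assume they are saved as mm10_rmsk.bed and mm10_annot.bed.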

  20. Upload them to your ~/class/week3/ directory using FileZilla.
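  21. Add the path of the bedtools bin directory to your PATH, either with export in the current shell or through your ~/.bashrc file (replace /home/ak97w with your own home directory). To use export:
  22. $ export PATH=$PATH:/home/ak97w/class/week3/bedtools-2.17.0/bin
  23. Or, to keep the change for future logins, add the same line to your ~/.bashrc file: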
  24. $ vi ~/.bashrc
    press i to enter insert mode
    export PATH=$PATH:/home/ak97w/class/week3/bedtools-2.17.0/bin
    press Esc, then type :wq! and press Enter to save and quit
  25. To apply the changes in ~/.bashrc to your current shell:
  26. $ source ~/.bashrc
  27. Now bedtools should run without the full path:
  28. $ bedtools
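
Why this works: the shell looks for a command name in each directory listed in PATH, in order. A minimal sketch using a stand-in script instead of bedtools; the name mytool and the temporary directory are made up for illustration.

```shell
# Create a throwaway bin directory containing one executable script.
demo_bin=$(mktemp -d)
printf '#!/bin/sh\necho "mytool ran"\n' > "$demo_bin/mytool"
chmod +x "$demo_bin/mytool"

# Before: the shell cannot find mytool by name.
command -v mytool >/dev/null 2>&1 || echo "mytool is not on PATH yet"

# Append the directory to PATH for the current shell session only.
export PATH="$PATH:$demo_bin"

# After: the bare name resolves to the script in demo_bin.
resolved=$(command -v mytool)
output=$(mytool)
echo "$resolved -> $output"
```

An export typed at the prompt is lost when you log out, which is why the ~/.bashrc route is the one to use for a lasting change.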
  29. We want to select the tRNA entries in our mm10_rmsk.bed RepeatMasker file:
  30. $ grep tRNA mm10_rmsk.bed > mm10_tRNA.bed
  31. Now I want to find the tRNAs that overlap genes, to find out their distances to the transcription start sites of those genes:
  32. $ intersectBed -a mm10_tRNA.bed -b mm10_annot.bed -wao > mm10_tRNA_annot_intersect.tsv

    Look at the output file: each tRNA and the gene it overlaps are on the same line. The columns from column 7 onward come from the annotation file.
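
A sketch of that column layout, assuming the tRNA file has 6 BED columns and the annotation file is a 12-column BED. Both are assumptions about the downloaded files, and the line below is made-up data, not real bedtools output.

```shell
# One made-up line shaped like `intersectBed -wao` output:
# fields 1-6 from the tRNA file, 7-18 from the annotation file, 19 = overlap in bp.
line=$(printf 'chr1\t1000\t1072\ttRNA-Ala\t0\t+\tchr1\t900\t5000\tNM_000001\t0\t+\t950\t4900\t0\t2\t100,200,\t0,3900,\t72')

# Count the tab-separated fields.
nfields=$(printf '%s\n' "$line" | awk -F'\t' '{print NF}')
echo "fields: $nfields"

# Pull out the columns that matter for a distance calculation.
summary=$(printf '%s\n' "$line" | awk -F'\t' '{print "tRNA=" $4 " tRNA_start=" $2 " gene=" $10 " gene_start=" $8}')
echo "$summary"
```

Running `awk -F'\t' '{print NF}'` on the first lines of your own output file is a quick way to confirm how many columns each input contributed.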

  33. To keep the distance calculations simple, I will select only the tRNAs and genes that are both on the Watson (+) strand:
  34. $ grep "+.*+"  mm10_tRNA_annot_intersect.tsv > mm10_tRNA_annot_intersect_pp.bed
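  35. To find the distance between the 5' ends of the tRNAs and the gene start sites (gene start minus tRNA start, so a tRNA downstream of the gene start gives a negative value):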
  36. $ awk '{print $4" dist:"($8-$2)"\t geneId:"$10}'  mm10_tRNA_annot_intersect_pp.bed
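
The strand filter and the awk step can be tried on a few made-up lines before running them on the real file. A minimal sketch; the file names toy_intersect.tsv and toy_pp.tsv and all coordinates are invented, and each line has 6 tRNA columns, 6 gene columns and the overlap in bp.

```shell
# Work in a temporary directory so the real files are not touched.
workdir=$(mktemp -d)
cd "$workdir"

# Three made-up overlap lines: strands are +/+, +/- and -/-.
printf 'chr1\t1000\t1072\ttRNA-Ala\t0\t+\tchr1\t900\t5000\tNM_000001\t0\t+\t72\n'  > toy_intersect.tsv
printf 'chr1\t7000\t7073\ttRNA-Gly\t0\t+\tchr1\t6500\t9000\tNM_000002\t0\t-\t73\n' >> toy_intersect.tsv
printf 'chr2\t300\t371\ttRNA-Leu\t0\t-\tchr2\t100\t800\tNM_000003\t0\t-\t71\n'     >> toy_intersect.tsv

# Keep lines containing two "+" characters, i.e. tRNA and gene both on the + strand.
grep "+.*+" toy_intersect.tsv > toy_pp.tsv
kept=$(wc -l < toy_pp.tsv | tr -d ' ')
echo "lines kept: $kept"

# Same awk command as above: gene start ($8) minus tRNA start ($2).
result=$(awk '{print $4" dist:"($8-$2)"\t geneId:"$10}' toy_pp.tsv)
echo "$result"
```

Only the first line survives the filter, and its tRNA starts 100 bp after the gene start, so the command prints dist:-100. Note that the pattern matches any two + characters on a line, so a feature name containing + would also pass.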
  37. To run computations interactively on the cluster, start an interactive job with qlogin:
  38. $ /project/umw_biocore/bin/qlogin
  39. If you want to monitor the jobs you are running
  40. $ bjobs
  41. To list the jobs you ran recently
  42. $ bjobs -a
  43. To submit a job

    $  /project/umw_biocore/bin/r "your_command" a_name_for_command
    $  /project/umw_biocore/bin/r "ls -l" myls

    Your homework (deadline: Thursday, Jul 31, 17:00):

    1. Please create a “homework/week3” directory in your home folder.

    2. Download all exonic regions for mouse mm10 from UCSC. Upload the file to ~/homework/week3 using FileZilla.

    3. Find exonic repeat regions and report their distances to 3' ends of the exons for both strands.

    4. Example output:
      Repeat: Alu1 GeneId: NM_183838  Distance: 14 Strand:+
      Repeat: Alu2 GeneId: NM_232342  Distance: 30 Strand:-
    5. Write the awk command(s) you used in a Readme.txt file in the same folder.