Six weeks basic computing workshop


Week 2: Awk (if), sort, grep, rm commands, UCSC Table Browser and bed file

(A $ sign at the beginning of a line indicates that the text after it is a command.)
  1. Connect to the cluster using ssh.
  2. $ ssh your_user@ghpcc06.umassrc.org 
  3. Clean the terminal.
  4. $ clear 
  5. Go to home directory
  6. $ cd ~/
  7. Create nested directories with "-p". Without it, mkdir fails when the parent directory does not exist yet.
  8. $  mkdir class/week2 
    mkdir: cannot create directory `class/week2': No such file or directory
    
    $ mkdir -p class/week2
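
The difference between the two mkdir calls can be checked safely in a scratch directory. The sketch below is only an illustration, not part of the workshop steps; the temporary directory created by mktemp and the variable names are assumptions made for the example.

```shell
# Work in a throwaway directory so nothing in your home folder is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Without -p, mkdir fails because the parent "class" does not exist yet.
if mkdir class/week2 2>/dev/null; then
    plain_result="created"
else
    plain_result="failed"
fi

# With -p, the missing parent is created first, then week2 inside it.
mkdir -p class/week2
if [ -d class/week2 ]; then
    p_result="created"
else
    p_result="failed"
fi

echo "mkdir:    $plain_result"
echo "mkdir -p: $p_result"
```
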
  9. Go to class/week2
  10. $ cd ~/class/week2
  11. Create a six column bed file
  12. $ vi lib1.bed
    press i for insert mode, then type the lines below (separate the columns with tabs)
    chr3	3000	5000	Fgf21	0	+
    chr4	1000	7000	Swr1	0	-
    chr1	2000	3000	Gypsy	0	+
    chr2	4000	6000	Tdp43	0	-
  13. To copy a line in vi
  14. Press esc, move the cursor to the line you want to copy, and press yy
  15. Paste copied line
  16. press p
  17. Copy 2 lines
  18. press y2y
  19. Delete a line
  20. press dd
  21. Delete four lines
  22. press d4d
  23. Save and exit
  24. press esc and press :wq! and enter
  25. Save without exit
  26. press esc and press :w and enter
  27. Copy files using an FTP client: connect to the host with FileZilla. Drag and drop the files you want to transfer from the remote machine to your local machine or vice versa.


  28. To remove files, use the rm command.
  29. $ rm file_name
    $ rm *.bed
  30. Delete a directory if it is empty
  31. $ rmdir directory_name
  32. When you try to remove a directory with rmdir, you may get an error such as "rmdir: 'dir': Directory not empty" and the directory will not be deleted. To remove a directory that still contains files or other directories, use the command below. (Please be careful with rm commands: deleted files cannot be recovered.)
  33. $ rm -rf directory_name
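
A cautious way to try this out is in a scratch directory, listing the contents before deleting them. The sketch below assumes a throwaway directory from mktemp and the hypothetical names demo_dir and a.bed; it uses rm -r without -f, which is enough when the files are writable.

```shell
# Build a small non-empty directory inside a throwaway location.
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p demo_dir/sub
touch demo_dir/sub/a.bed

# rmdir refuses to delete a directory that still has content.
if rmdir demo_dir 2>/dev/null; then
    rmdir_result="removed"
else
    rmdir_result="refused"
fi

# Look at what is about to be deleted before running rm.
ls -R demo_dir

# rm -r removes the directory together with everything inside it.
rm -r demo_dir
if [ -e demo_dir ]; then
    rm_result="still there"
else
    rm_result="removed"
fi

echo "rmdir: $rmdir_result"
echo "rm -r: $rm_result"
```
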
  34. Get help on a command using man. Press q to exit.
    $ man ls
    $ man rm
  35. In order to sort the lines in a file you would use a command line like:

    $ cd ~/class/week2
    $ sort -k1 lib1.bed
  36. To ignore leading blanks in the sort key, use the b option
  37. $ sort -k1b lib1.bed
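  38. For numeric sort, use the n option on a numeric column such as the start coordinate in column 2 (column 1 starts with "chr", so a numeric sort would compare every key as 0)
  39. $ sort -k2,2n lib1.bed
  40. To reverse the sort
  41. $ sort -k2,2n -r lib1.bed 
    $ sort -k1r lib1.bed 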
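
A minimal sketch of how the sort keys behave on the sample file. It rebuilds lib1.bed in a scratch directory (an assumption, so that your own file is left alone) and restricts each key to a single column with the -kN,N form, which is a choice made for this example rather than something the steps above require.

```shell
# Recreate the four-line sample file with tab-separated columns.
workdir=$(mktemp -d)
cd "$workdir"
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
    chr3 3000 5000 Fgf21 0 + \
    chr4 1000 7000 Swr1  0 - \
    chr1 2000 3000 Gypsy 0 + \
    chr2 4000 6000 Tdp43 0 - > lib1.bed

# Text sort on column 1 only (chromosome name).
by_chrom=$(sort -k1,1 lib1.bed)

# Numeric sort on column 2 only (start coordinate), ascending.
by_start=$(sort -k2,2n lib1.bed)

# Same key, reversed: largest start coordinate first.
by_start_desc=$(sort -k2,2nr lib1.bed)

echo "$by_chrom"
echo "---"
echo "$by_start"
echo "---"
echo "$by_start_desc"
```

With this input the chromosome sort starts with the chr1 line (Gypsy), the ascending numeric sort starts with start coordinate 1000 (Swr1), and the reversed one starts with 4000 (Tdp43).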
  42. Copy the file
  43. $ cp lib1.bed lib2.bed
  44. grep command: Search a word in all the bed files in the directory

  45. Grep Tutorial
    $ grep Swr1 *.bed 
    $ grep Swr1 lib1.bed 
    $ grep "+" lib1.bed
    chr3	3000	5000	Fgf21	0	+
    chr1	2000	3000	Gypsy	0	+
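
The grep calls above can be sketched end to end as follows. The scratch directory and variable names are assumptions for the example; -c (count matching lines) and -v (print non-matching lines) are standard grep options that the steps above do not use but that are handy for checking results.

```shell
# Recreate the sample file and a copy of it in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir"
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
    chr3 3000 5000 Fgf21 0 + \
    chr4 1000 7000 Swr1  0 - \
    chr1 2000 3000 Gypsy 0 + \
    chr2 4000 6000 Tdp43 0 - > lib1.bed
cp lib1.bed lib2.bed

# Searching several files prefixes each hit with the file name.
hits=$(grep Swr1 *.bed)

# In a basic regular expression "+" is an ordinary character,
# so this counts the lines that contain a plus sign.
plus_count=$(grep -c "+" lib1.bed)

# -v inverts the match: lines without a plus sign (the minus strand here).
minus_lines=$(grep -v "+" lib1.bed)

echo "$hits"
echo "plus strand lines: $plus_count"
echo "$minus_lines"
```
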
    
  46. Print the 4th column of a bed file, which holds the gene names
  47. $ awk '{print $4}' lib1.bed 
    Fgf21
    Swr1
    Gypsy
    Gypsy
    Gypsy
    Gypsy
    Tdp43
    
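  48. If the file contains repeated genes, you can use the commands below to remove the duplicates. Note that uniq only collapses repeated lines that are adjacent, while sort -u sorts the lines first and then removes every duplicate.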
  49. $ awk '{print $4}' lib1.bed | sort -u
    $ awk '{print $4}' lib1.bed | uniq
  50. Or write the results to a file first and then remove the duplicates from that file
  51. $ awk '{print $4}' lib1.bed > genenames.txt
    $ sort -u genenames.txt
    $ uniq genenames.txt
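
The two commands give different results when the repeated names are not next to each other. The sketch below shows this with a small hypothetical genenames.txt in which Gypsy appears both before and after Fgf21; the file contents and the scratch directory are assumptions for the example.

```shell
# A gene list where one duplicate of Gypsy is separated from the others.
workdir=$(mktemp -d)
cd "$workdir"
printf '%s\n' Gypsy Fgf21 Gypsy Gypsy Swr1 > genenames.txt

# uniq only merges neighbouring repeats, so Gypsy still appears twice.
uniq_out=$(uniq genenames.txt)

# sort -u sorts first, so every duplicate is removed.
sort_u_out=$(sort -u genenames.txt)

# Sorting before uniq gives the same result as sort -u.
sort_uniq_out=$(sort genenames.txt | uniq)

echo "$uniq_out"
echo "---"
echo "$sort_u_out"
```
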
  52. Select only the + strand lines from the bed file and print the gene names
  53. $ awk '{ if($6=="+" ) print $4 }' lib1.bed 
  54. Select only + strand from the bed file and write whole rows with $0
  55. $ awk '{ if($6=="+" ) print $0 }' lib1.bed 
  56. Write first four columns of the genes in "+" strand
  57. $ awk '{ if($6=="+" ) print $1"\t"$2"\t"$3"\t"$4 }' lib1.bed 
  58. Get the lines whose end coordinates are over 4000
  59. $ awk '{ if( $3 > 4000 ) print $0 }' lib1.bed 
  60. Get the lines whose gene lengths (end minus start) are over 2000
  61. $ awk '{ if( ($3-$2) > 2000 ) print $0 }' lib1.bed 
  62. Add 10 nucleotides to the end positions of the genes on the "+" strand
  63. $ awk '{ if($6=="+" ) print $1"\t"$2"\t"($3+10)"\t"$4 }' lib1.bed
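
The awk filters above can be tried on the sample file as sketched below. Two things here are stylistic alternatives rather than what the steps use: the condition is written as a pattern in front of the action instead of an if inside it, and OFS (awk's output field separator) is set to a tab so that the columns do not have to be joined with "\t" by hand. The scratch directory and variable names are assumptions.

```shell
# Recreate the four-line sample file with tab-separated columns.
workdir=$(mktemp -d)
cd "$workdir"
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
    chr3 3000 5000 Fgf21 0 + \
    chr4 1000 7000 Swr1  0 - \
    chr1 2000 3000 Gypsy 0 + \
    chr2 4000 6000 Tdp43 0 - > lib1.bed

# Gene names on the + strand: the pattern $6 == "+" selects the lines.
plus_genes=$(awk '$6 == "+" { print $4 }' lib1.bed)

# Genes longer than 2000 nucleotides, printed with their length.
long_genes=$(awk '($3 - $2) > 2000 { print $4, $3 - $2 }' lib1.bed)

# First four columns of + strand genes with the end moved by 10;
# OFS makes the commas in print come out as tabs.
extended=$(awk 'BEGIN { OFS = "\t" } $6 == "+" { print $1, $2, $3 + 10, $4 }' lib1.bed)

echo "$plus_genes"
echo "$long_genes"
echo "$extended"
```

On this input only Swr1 (length 6000) passes the length filter, since Fgf21 and Tdp43 are exactly 2000 long and the comparison is strict.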
  64. Download a file from the UCSC Genome Browser. Go to genome.ucsc.org and click the Table Browser link.

  65. Upload the file to your week2 directory using your FTP client and find the lengths of the 3' UTRs with awk. Print the first 10 lines
  66. $ awk '{ print ($3-$2) }' hg19ref.bed | head
  67. Print the lengths with their names into a file
  68. $ awk '{ print $4 " Len: " ($3-$2) }' hg19ref.bed > utr3len.txt
  69. Print only the first 10 lines with their names into a file
  70. $ awk '{ print $4 " Len: " ($3-$2) }' hg19ref.bed | head > firstten.txt
  71. Search for a pattern in vi using regular expressions. Try the commands below.
  72. $ vi hg19ref.bed
    /_utr3.*_r
    /chr1\t8404073\t8404227
  73. Replace with vi; % applies the substitution to every line, and g (global) replaces every match on a line instead of only the first
  74. :%s/utr3/UTR3/g
    press u to undo
  75. Pattern search and replace with vi
  76. :%s/_utr3.*_r/UTR3 r/g
    :%s/_utr3.*_f/UTR3 f/g
  77. Pattern search and replace with vi using a captured group: part of the match is stored in the special variable \1 and used in the replacement. This does the two replacements above with one command. Anything inside brackets [] is a set of alternatives, so [fr] means f or r at that position, and \([fr]\) puts whichever one matched into \1.
  78. :%s/_utr3.*_\([fr]\)/UTR3 \1/g
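
The same substitution can be tried outside the editor with sed, whose s command uses the same \( \) grouping and \1 back-reference. This is a swapped-in tool for illustration, not part of the vi steps, and the two entry names in names.txt are hypothetical examples of the "_utr3 ... _f" / "_utr3 ... _r" naming that the patterns above assume.

```shell
# Two made-up names: one ending in _f, one ending in _r.
workdir=$(mktemp -d)
cd "$workdir"
printf '%s\n' \
    'NM_000001_utr3_0_0_chr1_100_f' \
    'NM_000002_utr3_1_0_chr2_200_r' > names.txt

# _utr3.*_ matches from "_utr3" up to the last underscore that is
# followed by f or r; \([fr]\) captures that letter and \1 puts it back.
renamed=$(sed 's/_utr3.*_\([fr]\)/UTR3 \1/' names.txt)

echo "$renamed"
```

Because .* is greedy, everything between "_utr3" and the final "_f" or "_r" is dropped, leaving names such as "NM_000001UTR3 f".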


For more vi regular expression tutorials:
http://www.thegeekstuff.com/2009/04/vi-vim-editor-search-and-replace-examples/
https://www.youtube.com/watch?v=dsKGMxoydCc
http://www.softpanorama.org/Editors/Vimorama/vim_regular_expressions.shtml


Your homework (deadline: Thursday, Jul 24, 17:00):


  1. Please create a “homework/week2” directory in your home folder.

  2. Download all intronic regions for mouse mm10 from UCSC. Upload the file to ~/homework/week2 using FileZilla.

  3. Find the long intronic regions whose lengths are over 10K. Put the lines on the positive strand into intron_watson.bed and the lines on the negative strand into intron_crick.bed.

  4. Create another file and write the lengths in the following form => gene_id Len: length
  5. Ex:
    NM_183838  Len: 14543
    NM_232342  Len: 53322
    ...
    
  6. Write the awk command(s) you used into a Readme.txt file in the same folder.