MCB 5472 : Database searches, Blast and Blastall

Please let me know, how far you got during the lab. If most students didn't finish, we will continue this next week!

Questions you should answer are given in blue. Please send your answers per email to gogarten@uconn.edu with subject MCB5472

Database Searches :

  1. For a scientist of your choice (e.g., your advisor, or someone who publishes in your field of interest), using ISI's Web of Science database (Click on the Web of Science link on top of the page. Note: there might be a limit on the number of simultaneous users, if you don't get in, try again later.). To search for articles that cite this author:
    - Select the web of science link
    - click on cited reference search
    - enter an author of your choice, and start the search
    - check all the articles you want to trace (or select all)
    - click finish search
    - explore some of the links that come with the individual article
    For which author did you search?
    Which was the most cited article?
    How often was it cited?
    When was the most recent citation?
    Did you find any interesting article (if yes, list title)?            
    Was this article available online
    ?
     
  2. Repeat the search using Pubmed as available on Entrez. Use this link (it installs "uconn-tools"). If you frequently work from outside the University, and if you have no done so already, set up one of your browsers to connect via a uconn proxy (the automatic configuartion works fine).

  3. Repeat the search using Google scholar.   Enter the author name as "Initals Familyname" (e.g. "JP Gogarten). If the page says "Insufficient data in this page. Try to ask Scholar for 100 results by clicking here", click on the "here"
    Why would the H value be a reasonable measure for an authors importance?
    Where you able to determine the H-value of the scientist chosen in 1? If yes, what was this value?

    [Asides: If you are interested in publicatin statistics, Indiana University has developped Scholarometer, a plugin for the google's chrome browser, that also calculates H values.; but realistically, now that goggle scholar does these automatically, the alternatives become less useful. A problem with google scholar is that you cannot add back articels that Google has decided are duplicates].

  4. If you have not done the blast search exercise using the NCBIs web interface form last week, do it now.
  1. Using Blast on a "local" machine:

Many completed genomes can be found on the NCBI FTP site (ftp://ftp.ncbi.nih.gov) in the folder named genomes. The eukaryotic genomes are accessible from the genomes directory while all prokaryotic (bacteria + archaea) genomes are stored in the the subfolder /bacteria (a clear case of archaeal discrimination). Go into the bacteria sub-folder and download two or more genomes or your choice. to do so, click on the appropriately named folder (e.g., Pyrococcus abyssi, Pyrococcus is a member of the archaeal domain). Among the files in these folders we are are interested in the .ffn file which contains all nucleotide sequences for individual predicted ORF's, the .faa file which contains the predicted ORF's amino acids sequences, and the .fna file, which contains the whole genome. These files are in a fasta format. For this exercise we will only use the .faa files. If a folder contains more than one .faa file, take both of these, and copy them into a single file*. On to your computer rename the files into something you recognize. (e.g., p_abyssi.faa and t_maritima.faa). Note: the *.ptt and *.rnt files contain tables where genes are located in the genome sequence.

Load the two .faa files into a folder in your home directory on bbcxsrv1.biotech.uconn.edu.
As usual there are many ways to do this.

Alternative #1: The easiest might be to use the finder. Select Go->connect to server in the finder-menu. In the address field enter afp://bbcxsrv1.biotech.uconn.edu . When prompted enter your username and password (if you still need to change your password, see below first)

Alternative #2: ssh "always" works, but it is very pedestrian: Save the two files onto your computer. Then open a terminal window on your mac and type
ssh your_account_name@bbcxsrv1.biotech.uconn.edu   enter your password when prompted.
passwd    enter a new non-trivial password. Remember UNIX is case sensitive. This permanently changes your password. DO NOT LOOSE IT.
mkdir blasttest    creates a directory named blasttest

Open another terminal window on your mac and type
sftp your_account_name@bbcxsrv1.biotech.uconn.edu  enter your password when prompted. This establishes a secure ftp connection to the XSERV cluster
cd blasttest changes to the directory blasttest. lcd changes the directory on your local machine, lpwd reports your local directory name
mput *.faa transfers all files from your local directory that end with .faa from the local machine to the remote machine.

Alternative 3: use FUGU or filezilla, SSH client and Jellyfissh, or finder and jellyfissh. Jellyfissh is a simple program that keeps track of the ssh connections. It saves you from typing ssh username .... JELLYFISSH maintains a list of data for your ssh connections, FileZilla and FUGU are file transfer programs with GUI that works, even if afp is not installed on the server.

Whatever works for you - get the two files into your account on the cluster and establish a terminal with an ssh connection to the cluster.

On the remote machine, you should NOT run long processes on the master node. Either qrsh to a node that is not busy (type qrsh **), or submit your command to the queue using qsub (maybe later).

On the XSERV cluster using the ssh connection established earlier type
qrsh <enter> enter your password when prompted,

[aside: If the XSERV cluster is very busy, the normal queue may be so long that your qrsh command joins the end of the queue and is not immediately excuted. In this circumstance use
qrsh -l hipri=TRUE <enter> enter your password when prompted, ]

cd blasttest # (If you haven't done above you need to make the directory 1st. Use your GUI connection, or type mkdir blasttest)
formatdb -i p_abyssi.faa -o T -p T
The -i is to indicate your infile.name, -p T is for a protein sequence, for a nucleotide sequence, you need to use -p F (for False); the -o T instructs the formatdb program to create indices. For more information on the formatdb program check http://bioinformatics.ubc.ca/resources/tools/formatdb or type  formatdb - <enter>
You need to replace p_abyssi.faa with one of the genomes you downloaded. This genome will be converted into a searchable database.

If you want to use the new blast+ program, you could use the command
legacy_blast.pl formatdb -i p_abyssi.faa -o T -p T
(but this works fro only about 50% of the blastall commands)

To do a blast search of every sequence in t_maritima.faa against the P.abyssi genome, we use blastall. To get information on the program type
blastall <enter>
For details check one of the many help pages on blastall, examples are here and here.

We will use the following options:
blastall -i t_maritima.faa -d p_abyssi.faa -o blast.out -p blastp -e 10e-9 -m 9 -a2
-i the query sequence(s)
-d the database
-o the output file
-p the blast program to use
-e the e-value cut off
-m the format of the output. option 8 gives a tabular output, that can be easily read into excel spreadsheet or into a database program. option 9 is similar but includes comment lines.
the individual columns in the output denote the following
# Query id # Subject id # % identity # alignment length # number of mismatches # number of gap openings # position of query start # position of query end # position of subject start # position of subject end # e-value of a hit # bit score of a hit

Which option turns off the the low complexity filter?
Which option, and which setting, sets the wordsize to 2?
Which option allows to use two processors?

Inspect the output using more blast.out

Often one is only interested in the top scoring hit for each query. A simple perlscript (here) goes through the blastall output in m8 or m9 format and retains only the top scoring hit, and it adds a header line that will be correctly interpreted by EXCEL. Save this script into your blasttest folder on the cluster, and read through the script using a text editor (vi or textwrangler).
A modified version of this program (here) adds another column to the listing that for each alignment gives the bitscore per residue in the alignment.

To run the blast.out file through the perl script type
perl extract_lines.pl blast.out
This will create a file in your directory called blast.out.top that only contains the top scoring hits, and from which the comment lines are removed.

Import this file into an excel spreadsheet. If you have an afp connection to the cluster, you can start EXCEL on you local machine, open an empty spreadsheet, select Data -> Get external data; select your file (you'll need to tell it to look for any file).

Sorting the data on the different columns you can figure out which gene was found to be most similar between the two genomes.
Which were the genomes you selected?
Which gene pair had the highest % identity?

To retrieve the target sequence, you can use the command fastacmd. If the GI number of the target is 14521962, and the database is p_all.faa the command is as follows:
fastacmd -s 14521962 -d p_all.faa

By default the databank is nr as installed on the cluster. Thus fastacmd -s 14521962 also works (and you get more annotation), provided the sequences in your databank are also in nr.

Next, we will calculate a histogram for the % identity values for all the significant alignments.

FOR THE OLD VERSION OF EXCEL: In Excel select Tools -> Data Analysis, select histogram. (If this does not show-up in your menu, you need to install the analysis tool pack.) In the window select the column you want to analyze, to beautify things, you might add a columns with bin ranges - see demo in class.

For the 2008 version of Excel for Macs: This version no longer allows to calculate histograms via the add-ons, but there are spreadsheets abailable on the net into which you can copy paste the the data you want to plot (a histrogram generating spreadsheet for Excel2008 is here)

Alternatively, you can use R to calculate and plot a histogram (R is a software package for statistical analysis and graph plotting -- R is free, open source, and probably the most powerful software package available. You can also do publication quality plotting. There is a steep learning curve at the beginning, but it is a worthwhile investment of time! See here, here and here for more info.) An R script to plot histograms is here, a read_me word document is here You do not need to learn or understand R today (or even for this course), but it will do no harm to open the R script in a text editor and possibly do some minor editing.
Download the script from here (control click and 'save link as...') in the blasttest folder on the cluster. The data to be analyzed and plotted need to be in a separate file called indata. The first line should contain the title of the plot (no_spaces,_use_underlines), the following lines should contain one data point each. (one way to generate the indata file is to load the blast output into EXCEL, copy/paste the column to be analyzed into textdocument (text wranger) and save it as file as indata (no extension).

R CMD BATCH histogramScript_pdf.R

The script should have generated two plots labelled plot1.pdf and plot2.pdf (the latter compares the data to a normal distribution). Using the afp connection you made earlier (or one of the alternative methods described above) locate this PDF file in Finder and open it.

Do you observe a smooth distribution?
Do any genes constitute a second peak of more similar sequences, compared to the bulk of the comparisons?
Can you can up with other interesting thing you could do with these data in excel?
(histogram for the values in other columns, correlation between the values in different columns, calculate additional columns with corrected values).

6. More blast from the command line:

Inteins are molecular parasites that invade in the most conserved regions of the most conserved proteins (see here). Two of these inteins have invaded the Archaeal/Vacuolar type ATPase/ATP synthase (the gene is sometimes known as vma-1). We are particularily interested in the one in insertion site b, often denoted as vma-1b.

The sequence surrounding the insertion site is very conserved, the intein itself accumulates substitutions at a much higher rate.

Reads from a second generation metagenome sequencing project are at

/Users/jpgogarten/Thane_smetagenomes/santa_pola_37_percent_reads_longer_than_30.txt

We will use these sequences to search for inteins that are located in vma-1b

The sequence of a V-ATPases with intein from Candidatus Nanosalinarum is

>gi|339757395|gb|EGQ40976.1| archaeal/vacuolar-type H+ ATPase subunit A [Candidatus Nanosalinarum sp. J07AB56]
MTQETIESEGEIYKITGPVVVAEDLDCQMNDVVYVGGEELLGEVIQIEGGQAYIQVYEETTGVSPGEPVK NTGEPLSVELGPGLLGSIYDGLQRPLPELEEQMGSFIQRGEDAPGINPEEEYSFEPAVEEGDHVEPGDVI GTVQVSYGEHKVLLPPHSDGGEVEEVREGNYTVTDHVAELDTGEKVSMRQEVPIRDQRPAEEQLPPEVPL ITGQRVFDGLFPLAKGGTAAVPGGFGTGKCVVGDTPVALADGSKRAIEDIYHDYEGRGDQTQRENERWTD LDNGPRVLSRKKGEIVEKQVSTVYKGKTDSTVSVGTRSGREVELTPVHRLFVLTPEMEIEEREAQNLQKG DTLLVPRNLPVDSSNVEIEVAEALPEKRVVGNGFEAVREAITVLEENHGTRKAAAEHLGIDDHRMDAYAT GRNRPRVADATAILEEAGKTHKLRCVKGEKQSKPTRIPEEVDEDFAELLGLLLGDGTVKPRSVQFYNNNE RLLDRVEHLAEKLFGLKSARTTANTVESVRVDSKVLRDLLVYLGFPETEKSLNCSVPETVMKGSEDVVAA FVRGYFLADGHFSEYEAELSTSSRRMQEDLAYALTRIGITPRVSEKQTDKNPHFRVRFSGDQLIEFHRKL EADYDKFEQIEEYLDRVEDHFRGTESVDIAPETVRSAFESSEATRADLKSAGAKLSNYETQEERISVPAL QNFADVTKNKHLAEMAGNHLEHFHPDRVASIETHQETKEVYDLTVPDTHNFVGGNAPMLLHNTVTQQSLA KFSDADVIVYIGCGERGNEMTEVLEEFPELEDPQTGEKLIDRTVLIANTSNMPVAAREASIYTGVTIAEY FRDQGLDVALMADSTSRWAEAMREVSARLEEMPGERGYPAYLASRLAEFYERGGRVRPLGPEEPGSVSII GAVSPPGGDFSEPVTQNTLRVVKNFFALDKDLAEKRHFPSINWNDSYSGYSETLSDYWQEEVDEDWNQNV QRLRDLLQESDDLEETVQLVGKDALPDRDRLTLEIGDMLKEFYLQQNAFHPVDQYSSPQKTFDMLEVILE YADHAYEALDEGALVEDIVSLNSRAEIGRIKTAEDHREKLEEVREQMQDEFGEVAEQ

The sequence of the selfspliching domain of the intein is

>Candidatus Nanosalinarum sp. J07AB56 intein splicing domain
CVVGDTPVALADGSKRAIEDIYHDYEGRGDQTQRENERWTDLDNGPRVLSRKKGEIVE KQVSTVYKGKTDSTVSVGTRSGREVELTPVHRLFVLTPEM-EIEEREAQNLQKGDTLL VPRNLPVDSSNVHFHPDRVASIETHQETKEVYDLTVPDTHNFVGGNAPMLLHN

The sequence of the selfspliching domain with 18 nucleotides upstream and downstream of the einsertion site is:

> Candidatus Nanosalinarum sp. J07AB56 intein splicing domain +-18 nucleotides
LAKGGTAAVPGGFGTGKCVVGDTPVALADGSKRAIEDIYHDYEGRGDQTQRENERWTDLDNGPRVLSRKKGEIVE KQVSTVYKGKTDSTVSVGTRSGREVELTPVHRLFVLTPEMEIEEREAQNLQKGDTLL VPRNLPVDSSNVHFHPDRVASIETHQETKEVYDLTVPDTHNFVGGNAPMLLHNTVTQQSLAKFSDADVIVY

We will attempt to use tblastn to determine how many intein are encoded in the read collection, and howmany possible insertion sites are present.

cd $HOME

mkdir blasttest2

cd blasttest2

cp /Users/jpgogarten/Thane_smetagenomes/santa_pola_37_percent_reads_longer_than_30.txt reads.ffn

(this is s big file, and copying it may take a few minutes)

While the file is being copied, decide which would be an appropiate query sequence (see above). Use a texteditor to put together a appropriate query. Save this sequence into a file called query.fa

formatdb -i reads.ffn -o T -p F

The commant for the search could be something like this:

blastall -i query.fa -d reads.ffn -p tblastn -e 10e-2 -a2

If you want to write the output to a file

blastall -i query.fa -d reads.ffn -o blastout1 -p tblastn -e 10e-2 -a2

If you want to use tabular output, use the -m 9 option.

What was your query, and which reasoning did you use to decide on it?

How many reads include the intein target sequence?

Howmany of these did include the intein?

[Aside: If you don't have decided on a student project, one possibility would be to devise an automated strategy to use ngs reads from a population or from metagenomes to determine how many vma-1b target sites are occupied by an intein].

=========================================================

The following is for your information only. We will not do this in class.

You can install blastall on your own machine. The program is available at the NCBI ftp site (see above).

In case you want to do this, you need to add the directory where blast is residing to your PATH, i.e, the list of directories where the operating system will be looking for things. The path is stored in a variable called $PATH .
To add a new directory to the end of $PATH type the following at the command prompt in a terminal window:
PATH=$PATH:/Blast-2.2.10/bin
export PATH


(for the advanced: you can add these commands to a file called .profile. This file is "hidden" and it is read every time you log in. You can do this using the text editor vi. To do so type
vi .profile
<type "i" to enter insert mode>
PATH=$PATH:/Blast-2.2.10/bin
export PATH

<type ESC to leave insert mode>
:wq

(: opens a command line. w writes to file, q quites. To quit without changes type :q!)


To submit a job to the queue, you need to have a script that executes the blast search.
The first line of the script tells the the operating system what to do with the following lines. For example, the first line in the file named blastall.sh could be
#! /bin/bash/
The next line could be the command you want to execute:
blastall -i t_maritima.faa -d p_abyssi.faa -p blastp -m 8 -e 10e-4 -o test.out -a 2
To make the file executable type the following at the command line:
> chmod 755 blastall.sh
To submit the file to the queue type at the command line:
> qsub -cwd blastall.sh
# -cwd tells the queue to use the current working directory.



* You can use a text editor for this (textwrangler), but the files might be pretty large. An alternative is to use unix:

open a terminal window (the application is in the utility folder).
cd to the directory where you downloaded the two files (e.g., cd Desktop)
cat name1.faa name2.faa > new_name_for_both_faa_files.faa
where name1.faa name2.faa are the names of the files you want to copy. cat list the content of one ore more files and > directs the output from the default (the screen) to a file

** Screen is a unix program that can be very useful, especially, if you want to go home and keep the cluster running on a program that you started. The best is to start screen immediately after you connect to the server via ssh. To get going type screen -a <return>, then type qrsh <return> this sends you to a compute node, you might need to enter your password again. Once you are ready to leave, enter the following keys <ctrl> and a at the same time followed by a d. This detaches the screen. logout from the master node; go home; ....; when you want to check the program, login to the master node and type screen -r to reattach to the screen (if you have several screens running you get instructions how to connect to a particular screen).