The National Center for Biotechnology Information (NCBI) Web site is the primary repository of biological information in the US. The NCBI site includes the Basic Local Alignment Search Tool (BLAST), which can be used to search the GenBank database of DNA sequences and find regions of local similarity. This lab, based on exercises developed at NCBI and the University of New Hampshire, will familiarize you with the use of BLAST.
In 1990, Michael Crichton published the book Jurassic Park, about the resurrection of dinosaurs using blood from the stomachs of insects that had been encased in amber. At one point in the book, Dr. Henry Wu is asked to explain some of the DNA techniques used in reconstructing the extinct dinosaur genomes. Dr. Wu describes the use of restriction enzymes and how the fragmented pieces of dino DNA can be spliced together with these enzymes. He also alludes to the fact that they don't have the entire genome but that they "fill in the gaps" with modern-day frog DNA. At one point during his discussion, he points to a computer screen and remarks, "Here you see the actual structure of a small fragment of dinosaur DNA."
gcgttgctgg cgtttttcca taggctccgc ccccctgacg agcatcacaa aaatcgacgc ggtggcgaaa cccgacagga ctataaagat accaggcgtt tccccctgga agctccctcg tgttccgacc ctgccgctta ccggatacct gtccgccttt ctcccttcgg gaagcgtggc tgctcacgct gtaggtatct cagttcggtg taggtcgttc gctccaagct gggctgtgtg ccgttcagcc cgaccgctgc gccttatccg gtaactatcg tcttgagtcc aacccggtaa agtaggacag gtgccggcag cgctctgggt cattttcggc gaggaccgct ttcgctggag atcggcctgt cgcttgcggt attcggaatc ttgcacgccc tcgctcaagc cttcgtcact ccaaacgttt cggcgagaag caggccatta tcgccggcat ggcggccgac gcgctgggct ggcgttcgcg acgcgaggct ggatggcctt ccccattatg attcttctcg cttccggcgg cccgcgttgc aggccatgct gtccaggcag gtagatgacg accatcaggg acagcttcaa cggctcttac cagcctaact tcgatcactg gaccgctgat cgtcacggcg atttatgccg caagtcagag gtggcgaaac ccgacaagga ctataaagat accaggcgtt tcccctggaa gcgctctcct gttccgaccc tgccgcttac cggatacctg tccgcctttc tcccttcggg ctttctcatt gctcacgctg taggtatctc agttcggtgt aggtcgttcg ctccaagctg acgaaccccc cgttcagccc gaccgctgcg ccttatccgg taactatcgt cttgagtcca acacgactta acgggttggc atggattgta ggcgccgccc tataccttgt ctgcctcccc gcggtgcatg gagccgggcc acctcgacct gaatggaagc cggcggcacc tcgctaacgg ccaagaattg gagccaatca attcttgcgg agaactgtga atgcgcaaac caacccttgg ccatcgcgtc cgccatctcc agcagccgca cgcggcgcat ctcgggcagc gttgggtcct gcgcatgatc gtgctagcct gtcgttgagg acccggctag gctggcgggg ttgccttact atgaatcacc gatacgcgag cgaacgtgaa gcgactgctg ctgcaaaacg tctgcgacct atgaatggtc ttcggtttcc gtgtttcgta aagtctggaa acgcggaagt cagcgccctg
In 1992, Dr. Mark Boguski at NCBI entered this sequence into a text editor and searched it against all of the DNA sequences known at the time. Dr. Boguski wrote up his findings and submitted a manuscript to the journal BioTechniques as a joke. His manuscript was accepted and published. (Boguski, M.S. A Molecular Biologist Visits Jurassic Park. (1992) BioTechniques 12(5):668-669). You will reproduce this experiment using BLAST.
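The sequence above is printed in blocks of ten bases separated by spaces. If you would rather clean it up before pasting it into a search form, the following is a minimal sketch in Python. It assumes the sequence is lowercase a/c/g/t text; the function name to_fasta and the header text are hypothetical and chosen only for illustration.

```python
def to_fasta(raw_sequence, header="query", width=60):
    """Return the sequence as a FASTA record with all whitespace removed."""
    # Remove spaces, tabs, and newlines, then normalize to lowercase.
    bases = "".join(raw_sequence.split()).lower()
    # Stop if anything other than a, c, g, t, or n (unknown base) is present.
    unexpected = set(bases) - set("acgtn")
    if unexpected:
        raise ValueError(f"unexpected characters: {sorted(unexpected)}")
    # Break the cleaned sequence into fixed-width lines.
    lines = [bases[i:i + width] for i in range(0, len(bases), width)]
    return "\n".join([f">{header}"] + lines)


# The first four blocks of the "dinosaur DNA" sequence, as printed above.
fragment = "gcgttgctgg cgtttttcca taggctccgc ccccctgacg"
print(to_fasta(fragment, header="jurassic_park_fragment"))
```

The first line of the output begins with ">" and names the sequence; the remaining lines hold the bases without spaces.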
EXERCISE 1: From the main BLAST page select Nucleotide-nucleotide BLAST (blastn). This brings up a web page where you can specify your query sequence along with various parameters (including the genetic database to use). Cut and paste the above "dinosaur DNA" sequence into the window labeled Search, select the "nr" database under the "Other" option, and then click the BLAST! button to start the search. Click the Format! button on the web page that appears. After a short delay, the results of your search will be displayed on the page.
The most obvious feature of the resulting page is the graphic near the top, which depicts the "hits" or database matches for your query sequence. The number of hits depends on the degree of similarity found between your input sequence and the sequences in the database. The uppermost red line in the graphic represents your query sequence. The colored lines below represent the "hits" or sequences that closely match your query sequence. Lavender lines represent close or identical matches, while green, blue and black lines are weaker matches. The text immediately below the graphic describes the DNA sequences represented by the lines in the graphic, with the best matches presented first. The hyperlink at the start of each line of text will take you to an entry in the DNA sequence database that corresponds to the gene named in that line of text.
For each of the top three matches, click on the link to the left and report the entry that appears under the heading SOURCE ORGANISM.
If a sequence does not correspond to a natural organism but instead represents a man-made construct, the SOURCE ORGANISM entry will identify it as an artificial sequence. How many of the top ten matches are artificial sequences? Identify any actual organisms in the top ten.
In practice, researchers rarely have complete and exact DNA samples. Some mistakes will undoubtedly occur in extracting sequences from samples, and gaps may occur as pieces of a sample are lost or incorrectly combined. This is why BLAST reports multiple matches and provides matching information via the colored lines and overall score. Advanced users of BLAST can specify additional search parameters to control how similar a match must be in order to be reported.
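To make the idea of "how similar a match must be" concrete, here is a small Python sketch that computes the percent identity of two sequences that have already been aligned. This is a simplified illustration and not BLAST's own scoring method; the function name percent_identity and the two short example sequences are made up for this handout. A dash marks a gap.

```python
def percent_identity(aligned_query, aligned_subject):
    """Percentage of alignment columns in which both sequences have the same base."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_query:
        raise ValueError("aligned sequences must not be empty")
    # Count columns where the bases agree; a gap never counts as a match.
    matches = sum(
        1
        for q, s in zip(aligned_query, aligned_subject)
        if q == s and q != "-"
    )
    return 100.0 * matches / len(aligned_query)


# Made-up example: one mismatch (t vs. a) and one gap in ten columns.
query = "acgtacgtac"
subject = "acgaacg-ac"
print(percent_identity(query, subject))  # prints 80.0
```

A match with a few wrong bases or a short gap still scores highly, which is why imperfect samples can still be identified.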
EXERCISE 2: Introduce errors into the Jurassic Park sequence by deleting the first two lines and last two lines in the sequence, and randomly changing five bases in the remaining sequence. How, if at all, do these changes affect the search results?
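You can make these changes by hand, but the same kind of damage can also be sketched in a few lines of Python. The example below is one possible approach, not a required part of the exercise. It assumes the sequence is lowercase with the spaces already removed, and it trims a fixed number of bases from each end in place of "two lines" (the value 20 is an arbitrary assumption). The function name introduce_errors is hypothetical.

```python
import random


def introduce_errors(sequence, trim=20, substitutions=5, seed=None):
    """Trim bases from both ends, then change a few random bases."""
    rng = random.Random(seed)  # pass a seed to get repeatable results
    # Drop `trim` bases from the start and from the end.
    bases = list(sequence[trim:len(sequence) - trim])
    if substitutions > len(bases):
        raise ValueError("not enough bases left to substitute")
    # Pick distinct positions so that exactly `substitutions` bases change.
    for position in rng.sample(range(len(bases)), substitutions):
        alternatives = [b for b in "acgt" if b != bases[position]]
        bases[position] = rng.choice(alternatives)
    return "".join(bases)


# Made-up input so the example runs on its own; use the real sequence instead.
original = "acgt" * 20
damaged = introduce_errors(original, trim=10, substitutions=5, seed=1)
print(len(original), len(damaged))  # prints 80 60
```

Each chosen position is always replaced with a different base, so the damaged sequence differs from the trimmed original in exactly five places.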
Mark Boguski's published article was brought to Crichton's attention. For the sequel, "The Lost World", Mr. Crichton used Dr. Boguski as a consultant. Dr. Boguski constructed an interesting sequence from existing species and also embedded a message in the protein translation of the DNA sequence, which he submitted for use in the book. Here is the sequence Dr. Boguski gave Crichton for "The Lost World":
gaattccgga agcgagcaag agataagtcc tggcatcaga tacagttgga gataaggacg gacgtgtggc agctcccgca gaggattcac tggaagtgca ttacctatcc catgggagcc atggagttcg tggcgctggg ggggccggat gcgggctccc ccactccgtt ccctgatgaa gccggagcct tcctggggct gggggggggc gagaggacgg aggcgggggg gctgctggcc tcctaccccc cctcaggccg cgtgtccctg gtgccgtggg cagacacggg tactttgggg accccccagt gggtgccgcc cgccacccaa atggagcccc cccactacct ggagctgctg caaccccccc ggggcagccc cccccatccc tcctccgggc ccctactgcc actcagcagc gggcccccac cctgcgaggc ccgtgagtgc gtcatggcca ggaagaactg cggagcgacg gcaacgccgc tgtggcgccg ggacggcacc gggcattacc tgtgcaactg ggcctcagcc tgcgggctct accaccgcct caacggccag aaccgcccgc tcatccgccc caaaaagcgc ctgcgggtga gtaagcgcgc aggcacagtg tgcagccacg agcgtgaaaa ctgccagaca tccaccacca ctctgtggcg tcgcagcccc atgggggacc ccgtctgcaa caacattcac gcctgcggcc tctactacaa actgcaccaa gtgaaccgcc ccctcacgat gcgcaaagac ggaatccaaa cccgaaaccg caaagtttcc tccaagggta aaaagcggcg ccccccgggg gggggaaacc cctccgccac cgcgggaggg ggcgctccta tggggggagg gggggacccc tctatgcccc ccccgccgcc ccccccggcc gccgcccccc ctcaaagcga cgctctgtac gctctcggcc ccgtggtcct ttcgggccat tttctgccct ttggaaactc cggagggttt tttggggggg gggcgggggg ttacacggcc cccccggggc tgagcccgca gatttaaata ataactctga cgtgggcaag tgggccttgc tgagaagaca gtgtaacata ataatttgca cctcggcaat tgcagagggt cgatctccac tttggacaca acagggctac tcggtaggac cagataagca ctttgctccc tggactgaaa aagaaaggat ttatctgttt gcttcttgct gacaaatccc tgtgaaaggt aaaagtcgga cacagcaatc gattatttct cgcctgtgtg aaattactgt gaatattgta aatatatata tatatatata tatatctgta tagaacagcc tcggaggcgg catggaccca gcgtagatca tgctggattt gtactgccgg aattc
EXERCISE 3: Once again, invoke Nucleotide-nucleotide BLAST (blastn) and copy and paste all or part of this new "Lost World" sequence into the Search window and submit it to BLAST. Click the link to the left of the highest-scoring match in the list of sequences. Which organism is this DNA sequence from?
Continue looking down the list of matches until you find a different organism with the next-highest match. What is this organism? Is it in any way related to dinosaurs?
EXERCISE 4: From the main BLAST page, click on the link for Translated query vs. protein database (blastx). Copy and paste this same "Lost World" sequence into the Search window and submit it to BLAST. Make sure to include the entire sequence for this exercise.
On the results page, find the match "GATA Binding Protein" and click on the corresponding score for the alignment (201) in the right-hand column. The resulting page will show the query sequence written as a protein (using the 20 letters corresponding to amino acids). The matching sequence of amino acids from the database is shown below the query sequence, with dashes representing gaps. Dr. Boguski's message is hidden in the query sequence in the positions corresponding to dashes in the subject sequence. What is his message?
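Before comparing your query to the protein database, blastx translates the DNA into amino acids in all six reading frames (three on each strand). The Python sketch below illustrates that translation step using the standard genetic code. It is only an illustration of the idea, not the code BLAST uses; the function names and the short test sequence are made up for this handout.

```python
from itertools import product

# Standard genetic code, listed in the conventional t, c, a, g codon order.
BASES = "tcag"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def clean(dna):
    """Remove whitespace and convert to lowercase."""
    return "".join(dna.split()).lower()


def translate(dna, frame=0):
    """Translate DNA to one-letter amino acids, starting at offset 0, 1, or 2."""
    dna = clean(dna)
    protein = []
    for start in range(frame, len(dna) - 2, 3):
        # "*" marks a stop codon; "X" marks a codon with an unknown base.
        protein.append(CODON_TABLE.get(dna[start:start + 3], "X"))
    return "".join(protein)


def reverse_complement(dna):
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    return clean(dna).translate(str.maketrans("acgt", "tgca"))[::-1]


def six_frame_translations(dna):
    """Translate all three frames on both strands."""
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna, offset)
        frames[f"-{offset + 1}"] = translate(reverse_complement(dna), offset)
    return frames


# Made-up test sequence: atg (M), gcc (A), taa (stop).
print(translate("atggcctaa"))  # prints MA*
```

To see all six translations of the "Lost World" sequence, pass it to six_frame_translations and print each entry. Only one frame will contain the long protein that BLAST matches.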
EXERCISE 5: Using the Internet or other references, answer the following questions about bioinformatics. Be sure to identify the source for your answer. If the source is on the Internet, list an additional source to corroborate your answer.
- When was the term bioinformatics first coined?
- How many genes are there in the human genome?
- In addition to the simple examples we have discussed in class, describe a recent advance in biology research that was made possible by computer technology.