Name: _________________________________________

CSC 121: Computers and Scientific Thinking
Fall 2020

Lab 2: Applications in Biology/Bioinformatics


The National Center for Biotechnology Information (NCBI), which is part of the National Institutes of Health (NIH), is the primary repository of biological information in the US. The NCBI Web site includes the Basic Local Alignment Search Tool (BLAST), which can be used to search the GenBank database of DNA sequences and find regions of local similarity. This lab, based on exercises developed at NCBI/NIH and the University of New Hampshire, will familiarize you with the use of BLAST.



Jurassic Park Dino-DNA Analysis

In 1990, Michael Crichton published the book Jurassic Park about the resurrection of dinosaurs using the blood from the stomachs of insects that had been encased in amber. At one point in the book, Dr. Henry Wu is asked to explain some of the DNA techniques used in reconstructing the extinct dinosaur genomes. Dr. Wu describes the use of restriction enzymes and how the fragmented pieces of dino DNA can be spliced together with these enzymes. He also alludes to the fact that they don't have the entire genome but that they "fill in the gaps" with modern day frog DNA. At one point during his discussion he points to a computer screen and remarks "Here you see the actual structure of a small fragment of dinosaur DNA."

gcgttgctgg cgtttttcca taggctccgc ccccctgacg agcatcacaa aaatcgacgc ggtggcgaaa cccgacagga ctataaagat accaggcgtt tccccctgga agctccctcg tgttccgacc ctgccgctta ccggatacct gtccgccttt ctcccttcgg gaagcgtggc tgctcacgct gtaggtatct cagttcggtg taggtcgttc gctccaagct gggctgtgtg ccgttcagcc cgaccgctgc gccttatccg gtaactatcg tcttgagtcc aacccggtaa agtaggacag gtgccggcag cgctctgggt cattttcggc gaggaccgct ttcgctggag atcggcctgt cgcttgcggt attcggaatc ttgcacgccc tcgctcaagc cttcgtcact ccaaacgttt cggcgagaag caggccatta tcgccggcat ggcggccgac gcgctgggct ggcgttcgcg acgcgaggct ggatggcctt ccccattatg attcttctcg cttccggcgg cccgcgttgc aggccatgct gtccaggcag gtagatgacg accatcaggg acagcttcaa cggctcttac cagcctaact tcgatcactg gaccgctgat cgtcacggcg atttatgccg caagtcagag gtggcgaaac ccgacaagga ctataaagat accaggcgtt tcccctggaa gcgctctcct gttccgaccc tgccgcttac cggatacctg tccgcctttc tcccttcggg ctttctcatt gctcacgctg taggtatctc agttcggtgt aggtcgttcg ctccaagctg acgaaccccc cgttcagccc gaccgctgcg ccttatccgg taactatcgt cttgagtcca acacgactta acgggttggc atggattgta ggcgccgccc tataccttgt ctgcctcccc gcggtgcatg gagccgggcc acctcgacct gaatggaagc cggcggcacc tcgctaacgg ccaagaattg gagccaatca attcttgcgg agaactgtga atgcgcaaac caacccttgg ccatcgcgtc cgccatctcc agcagccgca cgcggcgcat ctcgggcagc gttgggtcct gcgcatgatc gtgctagcct gtcgttgagg acccggctag gctggcgggg ttgccttact atgaatcacc gatacgcgag cgaacgtgaa gcgactgctg ctgcaaaacg tctgcgacct atgaatggtc ttcggtttcc gtgtttcgta aagtctggaa acgcggaagt cagcgccctg

In 1992, Dr. Mark Boguski at NIH entered this sequence into a text editor and searched all of the known DNA sequences at the time. Dr. Boguski wrote up his findings and submitted a manuscript to the journal BioTechniques, as a tongue-in-cheek joke. His manuscript was accepted and published [Boguski, M.S. A Molecular Biologist Visits Jurassic Park. (1992) BioTechniques 12(5):668-669]. You will reproduce this experiment using BLAST.

EXERCISE 1: From the main BLAST page, select nucleotide blast. This brings up a web page where you can specify your query sequence along with various parameters (including the genetic database to use). Copy-and-paste the above "dinosaur DNA" sequence into the window labeled Enter Query Sequence, select the Nucleotide collection (nr/nt) database, and then click the BLAST button to start the search. After a short delay, the results of your search will be displayed on the page.
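
Optional: the sequence above is printed in blocks of ten bases separated by spaces. If you would rather tidy it up before pasting, the following Python sketch strips the whitespace and wraps the result as a FASTA record, a common plain-text sequence format. The function name to_fasta, the record name, and the 60-character line width are illustrative choices made for this handout, not requirements of BLAST.

```python
import textwrap


def to_fasta(raw_sequence, name="query", width=60):
    """Return raw_sequence as a FASTA record with spaces and digits removed."""
    # Keep letters only, so block spacing and position numbers are dropped.
    bases = "".join(ch for ch in raw_sequence.lower() if ch.isalpha())
    # Anything other than a, c, g, t (or n for "unknown") is probably a paste error.
    unexpected = set(bases) - set("acgtn")
    if unexpected:
        raise ValueError(f"unexpected characters: {sorted(unexpected)}")
    # Header line starts with ">", followed by the bases in fixed-width lines.
    return f">{name}\n" + "\n".join(textwrap.wrap(bases, width))


# The first three blocks of the "dinosaur DNA" sequence, as a short demonstration.
print(to_fasta("gcgttgctgg cgtttttcca taggctccgc", name="dino_fragment"))
```

The printed record can be pasted into the Enter Query Sequence window in place of the raw text.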

The results page allows you to look at the closest matches (or "hits"), both textually and visually. By default, the Descriptions tab lists the hits in a table, with columns for the sequence description, measures of how close the match was, and background information about the sequence (Accession). These same hits are shown as colored lines under the Graphic Summary, with red signifying sections that match closely and the colors lavender, green, blue, and black signifying less perfect matches.

View the top hits for your "dinosaur DNA" sequence under the Descriptions tab. You may find an entry "Streptomyces coelicolor strain M1154/pAMX4/pGP1416 chromosome, complete genome" at the top of the matches - ignore this, as it is a recent entry of questionable origin (a synthetic sequence from an unpublished article). For each of the remaining top three hits in the table, click on the Accession link to the right and report the entry that appears under the heading SOURCE ORGANISM.








If a sequence does not correspond to a natural organism, the SOURCE ORGANISM entry will identify it as an artificial sequence. How many of the top ten matches (again, ignoring the questionable one) are artificial sequences?








In practice, researchers rarely have complete and exact DNA samples. Some mistakes will undoubtedly occur in extracting sequences from samples, and gaps may occur as pieces of a sample are lost or incorrectly combined. This is why BLAST reports multiple matches and provides matching information via the colored lines and overall score. Advanced users of BLAST can specify additional search parameters to control how similar a match must be in order to be reported.
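
To get a feel for how a single error lowers a similarity measure, consider the Python sketch below. It computes a simple percent identity between two sequences of equal length. This is only an illustration: BLAST's own scores come from local alignments that allow gaps and are not computed this way, and the function name percent_identity is made up for this handout.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of positions at which two equal-length sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    if not seq_a:
        return 0.0
    # Compare the sequences position by position and count the agreements.
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


original = "gcgttgctgg"    # first block of the Jurassic Park sequence
with_error = "gcgtagctgg"  # same block with the fifth base changed

print(percent_identity(original, original))    # 100.0
print(percent_identity(original, with_error))  # 90.0
```

One changed base out of ten drops the identity from 100% to 90%, which is the kind of imperfection a search tool must tolerate in order to report near matches.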

EXERCISE 2: Introduce errors into the Jurassic Park sequence by deleting the second, third, and next-to-last lines in the sequence, and randomly changing another line to whatever bases you want. Do these changes affect the search results you obtain (compared to the matches from the original search)? How do any changes impact the scores of the matches and the lines in the Graphic Summary?
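
You can make these edits by hand, but the same kind of damage can also be scripted. The Python sketch below assumes the sequence has been split into a list of lines; it removes the lines at chosen positions and overwrites one surviving line with random bases. The function name introduce_errors and the six short demonstration lines are assumptions made for illustration.

```python
import random


def introduce_errors(lines, delete_indexes, rng):
    """Drop the lines at delete_indexes, then randomize one surviving line."""
    kept = [line for i, line in enumerate(lines) if i not in delete_indexes]
    # Pick one remaining line and replace every character with a random base.
    target = rng.randrange(len(kept))
    kept[target] = "".join(rng.choice("acgt") for _ in kept[target])
    return kept


# Six short lines standing in for the full sequence.
lines = ["gcgttgctgg", "cgtttttcca", "taggctccgc",
         "ccccctgacg", "agcatcacaa", "aaatcgacgc"]

# Second, third, and next-to-last lines (list positions count from 0).
damaged = introduce_errors(lines, {1, 2, len(lines) - 2}, random.Random(0))
print("\n".join(damaged))
```

Passing a seeded random.Random object makes the damage repeatable, so you can rerun the same search later and compare results.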










The Lost World Dino-DNA Analysis

After Dr. Boguski's article appeared in 1992, it was brought to Michael Crichton's attention. Crichton, who was working on the sequel to Jurassic Park, reached out to Boguski and asked him to consult on the book. Dr. Boguski constructed an interesting sequence that he felt was more scientifically plausible, and this sequence appeared in The Lost World.

gaattccgga agcgagcaag agataagtcc tggcatcaga tacagttgga gataaggacg gacgtgtggc agctcccgca gaggattcac tggaagtgca ttacctatcc catgggagcc atggagttcg tggcgctggg ggggccggat gcgggctccc ccactccgtt ccctgatgaa gccggagcct tcctggggct gggggggggc gagaggacgg aggcgggggg gctgctggcc tcctaccccc cctcaggccg cgtgtccctg gtgccgtggg cagacacggg tactttgggg accccccagt gggtgccgcc cgccacccaa atggagcccc cccactacct ggagctgctg caaccccccc ggggcagccc cccccatccc tcctccgggc ccctactgcc actcagcagc gggcccccac cctgcgaggc ccgtgagtgc gtcatggcca ggaagaactg cggagcgacg gcaacgccgc tgtggcgccg ggacggcacc gggcattacc tgtgcaactg ggcctcagcc tgcgggctct accaccgcct caacggccag aaccgcccgc tcatccgccc caaaaagcgc ctgcgggtga gtaagcgcgc aggcacagtg tgcagccacg agcgtgaaaa ctgccagaca tccaccacca ctctgtggcg tcgcagcccc atgggggacc ccgtctgcaa caacattcac gcctgcggcc tctactacaa actgcaccaa gtgaaccgcc ccctcacgat gcgcaaagac ggaatccaaa cccgaaaccg caaagtttcc tccaagggta aaaagcggcg ccccccgggg gggggaaacc cctccgccac cgcgggaggg ggcgctccta tggggggagg gggggacccc tctatgcccc ccccgccgcc ccccccggcc gccgcccccc ctcaaagcga cgctctgtac gctctcggcc ccgtggtcct ttcgggccat tttctgccct ttggaaactc cggagggttt tttggggggg gggcgggggg ttacacggcc cccccggggc tgagcccgca gatttaaata ataactctga cgtgggcaag tgggccttgc tgagaagaca gtgtaacata ataatttgca cctcggcaat tgcagagggt cgatctccac tttggacaca acagggctac tcggtaggac cagataagca ctttgctccc tggactgaaa aagaaaggat ttatctgttt gcttcttgct gacaaatccc tgtgaaaggt aaaagtcgga cacagcaatc gattatttct cgcctgtgtg aaattactgt gaatattgta aatatatata tatatatata tatatctgta tagaacagcc tcggaggcgg catggaccca gcgtagatca tgctggattt gtactgccgg aattc

EXERCISE 3: Once again, invoke nucleotide blast and copy-and-paste this new Lost World sequence into the Enter Query Sequence window and submit it to BLAST. Click the Accession link to the right of the highest-scoring sequence match in the list. Which organism is this DNA sequence from?


Continue looking down the list of matches until you find a different organism with the next-highest match. What is this organism?



In the book, it is theorized that birds evolved from dinosaurs. If Boguski's DNA sequence were real, would the genetic evidence presented by these two top matches support this theory? Explain your answer.






EXERCISE 4: Apparently, Dr. Boguski couldn't resist sneaking a hidden message into his Lost World sequence. He inserted bases into his sequence which, when translated into amino acids, spell out a 4-word message. To view his hidden message, go to the main BLAST page and select the blastx option. Copy-and-paste the Lost World sequence into the Enter Query Sequence window and submit it to BLAST. Make sure to include the entire sequence for this exercise.

Once the matches appear (this can take several seconds), click on the Description labeled erythroid transcription factor [Taeniopygia guttata]. The resulting page will show the query sequence (labeled Query) written as a protein, using the 20 letters corresponding to amino acids. The matching sequence of amino acids from the database (labeled Sbjct) is shown below the query sequence, with dashes representing gaps. Dr. Boguski's message is hidden in the query sequence corresponding to the first four gaps (dashes) in the subject sequence. What is his 4-word message?






EXERCISE 5: Access the GenBank Overview page (linked at the bottom of the main NCBI site) and answer the following questions: