CSC 121: Computers and Scientific Thinking
Fall 2024

Lab 3: Applications in Biology/Bioinformatics

The National Center for Biotechnology Information (NCBI), which is part of the National Institutes of Health (NIH), is the primary repository of biological information in the US. The NCBI Web site includes the Basic Local Alignment Search Tool (BLAST), which can be used to search the GenBank database of DNA sequences and find regions of local similarity. This lab, based on exercises developed at NCBI/NIH and the University of New Hampshire, will familiarize you with the use of BLAST.

Jurassic Park Dino-DNA Analysis

In 1990, Michael Crichton published the book Jurassic Park about the resurrection of dinosaurs using the blood from the stomachs of insects that had been encased in amber. At one point in the book, Dr. Henry Wu is asked to explain some of the DNA techniques used in reconstructing the extinct dinosaur genomes. Dr. Wu describes the use of restriction enzymes and how the fragmented pieces of dino DNA can be spliced together with these enzymes. He also alludes to the fact that they don't have the entire genome but that they "fill in the gaps" with modern day frog DNA. At one point during his discussion, he points to a computer screen and remarks "Here you see the actual structure of a small fragment of dinosaur DNA."

gcgttgctgg cgtttttcca taggctccgc ccccctgacg agcatcacaa aaatcgacgc ggtggcgaaa cccgacagga ctataaagat accaggcgtt tccccctgga agctccctcg tgttccgacc ctgccgctta ccggatacct gtccgccttt ctcccttcgg gaagcgtggc tgctcacgct gtaggtatct cagttcggtg taggtcgttc gctccaagct gggctgtgtg ccgttcagcc cgaccgctgc gccttatccg gtaactatcg tcttgagtcc aacccggtaa agtaggacag gtgccggcag cgctctgggt cattttcggc gaggaccgct ttcgctggag atcggcctgt cgcttgcggt attcggaatc ttgcacgccc tcgctcaagc cttcgtcact ccaaacgttt cggcgagaag caggccatta tcgccggcat ggcggccgac gcgctgggct ggcgttcgcg acgcgaggct ggatggcctt ccccattatg attcttctcg cttccggcgg cccgcgttgc aggccatgct gtccaggcag gtagatgacg accatcaggg acagcttcaa cggctcttac cagcctaact tcgatcactg gaccgctgat cgtcacggcg atttatgccg caagtcagag gtggcgaaac ccgacaagga ctataaagat accaggcgtt tcccctggaa gcgctctcct gttccgaccc tgccgcttac cggatacctg tccgcctttc tcccttcggg ctttctcatt gctcacgctg taggtatctc agttcggtgt aggtcgttcg ctccaagctg acgaaccccc cgttcagccc gaccgctgcg ccttatccgg taactatcgt cttgagtcca acacgactta acgggttggc atggattgta ggcgccgccc tataccttgt ctgcctcccc gcggtgcatg gagccgggcc acctcgacct gaatggaagc cggcggcacc tcgctaacgg ccaagaattg gagccaatca attcttgcgg agaactgtga atgcgcaaac caacccttgg ccatcgcgtc cgccatctcc agcagccgca cgcggcgcat ctcgggcagc gttgggtcct gcgcatgatc gtgctagcct gtcgttgagg acccggctag gctggcgggg ttgccttact atgaatcacc gatacgcgag cgaacgtgaa gcgactgctg ctgcaaaacg tctgcgacct atgaatggtc ttcggtttcc gtgtttcgta aagtctggaa acgcggaagt cagcgccctg

In 1992, Dr. Mark Boguski at NIH entered this sequence into a text editor and searched all of the known DNA sequences at the time. Dr. Boguski wrote up his findings and submitted a manuscript to the journal BioTechniques, as a tongue-in-cheek joke. His manuscript was accepted and published [Boguski, M.S. A Molecular Biologist Visits Jurassic Park. (1992) BioTechniques 12(5):668-669]. You will reproduce this experiment using BLAST.

EXERCISE 1:   From the main BLAST page, select nucleotide blast. This brings up a web page where you can specify your query sequence along with various parameters (including the genetic database to use). Copy-and-paste the above "dinosaur DNA" sequence into the window labeled Enter Query Sequence, using the default Nucleotide collection (nr/nt) database, and then click the BLAST button to start the search. After a short delay, the results of your search will be displayed in the page. By default, the Descriptions tab lists the closest matches (or "hits") in a table, with columns for the sequence description, measures of how close the match was, and background information about the sequence (Accession). Click on the "Total Score" column heading so that the hits are ranked by total score, the best measure of an overall match. List the name (what appears in the "Description" column) for each of the top three matches.
EXERCISE 2:    Clicking on the Accession link displays background information on a match, including its source organism. If a sequence does not correspond to a natural organism, the SOURCE ORGANISM entry will identify it as an artificial sequence. How many of the top ten matches (again, ranked by "Total Score") are artificial sequences? For any hit that is not an artificial sequence, check the date it was published (under the JOURNAL entry). Was it published before 1990, when Jurassic Park was published?
EXERCISE 3:    When you click on the "Graphic Summary" tab at the top of the hits table, you are shown the matches in a visual form. The colors of the lines show how close the match was between the search sample and the database hit, with red signifying sections that match closely and the colors lavender, green, blue, and black signifying less perfect matches. There may also be colorless sections, denoting gaps in the matching sequence. Describe the lines you see for your search hits. Are there any colors besides red? Do any of the lines contain gaps?

In practice, researchers rarely have complete and exact DNA samples. Some mistakes will undoubtedly occur in extracting sequences from samples, and gaps may occur as pieces of a sample are lost or incorrectly combined. This is why BLAST reports multiple matches and provides matching information via the colored lines and overall score. Advanced users of BLAST can specify additional search parameters to control how similar a match must be in order to be reported.
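To get a feel for what "how similar a match must be" means, here is a minimal Python sketch that computes the percentage of positions at which two equal-length fragments agree. This is only an illustration: BLAST performs local alignment with its own scoring scheme and can handle gaps, which this sketch does not attempt. The function name percent_identity is hypothetical and not part of any BLAST tool.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of positions at which two equal-length sequences agree.

    Illustrative only: no gaps, no alignment, not BLAST's scoring.
    """
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    if not a:
        raise ValueError("sequences must be non-empty")
    # Compare position by position, ignoring upper/lower case.
    matches = sum(x == y for x, y in zip(a.lower(), b.lower()))
    return 100.0 * matches / len(a)


# The first ten nucleotides of the Jurassic Park sequence, compared with
# themselves and with a copy in which two nucleotides were changed by hand.
print(percent_identity("gcgttgctgg", "gcgttgctgg"))
print(percent_identity("gcgttgctgg", "gcgatgctgc"))
```

A copy with a few substitutions still scores high, which is the intuition behind BLAST reporting close but imperfect matches.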

EXERCISE 4:   Introduce errors into the Jurassic Park sequence by deleting the second, third, and next-to-last lines in the search sequence, and randomly changing another line to whatever nucleotides (C, G, A or T) you want. Do these changes affect the search results you obtain (compared to the matches from the original search)? Are the top 3 matches the same as before? Are the "Total Scores" impacted? How about the lines in the Graphic Summary?
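If you would rather not edit nucleotides by hand, the random substitutions can be generated with a short script. The following is a minimal Python sketch, assuming the sequence is written in lowercase with spaces between groups, as in this handout. The function name substitute_random and its parameters are hypothetical. It only substitutes nucleotides; deleting lines, as the exercise also asks, is still done by hand.

```python
import random

NUCLEOTIDES = "acgt"  # assumes lowercase input, as in the handout


def substitute_random(seq: str, n_changes: int, seed: int = 0) -> str:
    """Return a copy of seq with n_changes nucleotides replaced at random.

    Spaces and any other non-nucleotide characters are left untouched.
    Each chosen position receives a nucleotide different from the original.
    """
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    bases = list(seq)
    positions = [i for i, ch in enumerate(bases) if ch in NUCLEOTIDES]
    for i in rng.sample(positions, n_changes):
        alternatives = [b for b in NUCLEOTIDES if b != bases[i]]
        bases[i] = rng.choice(alternatives)
    return "".join(bases)


original = "gcgttgctgg cgtttttcca"
mutated = substitute_random(original, 3)
print(original)
print(mutated)
```

Changing the seed argument produces a different set of substitutions, so you can try several damaged versions of the same sequence.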

The Lost World Dino-DNA Analysis

After Dr. Boguski's article appeared in 1992, it was brought to Michael Crichton's attention. Crichton, who was working on the sequel to Jurassic Park, reached out to Boguski and asked him to consult on the book. Dr. Boguski constructed an interesting sequence that he felt was more scientifically plausible, and this sequence appeared in The Lost World.

gaattccgga agcgagcaag agataagtcc tggcatcaga tacagttgga gataaggacg gacgtgtggc agctcccgca gaggattcac tggaagtgca ttacctatcc catgggagcc atggagttcg tggcgctggg ggggccggat gcgggctccc ccactccgtt ccctgatgaa gccggagcct tcctggggct gggggggggc gagaggacgg aggcgggggg gctgctggcc tcctaccccc cctcaggccg cgtgtccctg gtgccgtggg cagacacggg tactttgggg accccccagt gggtgccgcc cgccacccaa atggagcccc cccactacct ggagctgctg caaccccccc ggggcagccc cccccatccc tcctccgggc ccctactgcc actcagcagc gggcccccac cctgcgaggc ccgtgagtgc gtcatggcca ggaagaactg cggagcgacg gcaacgccgc tgtggcgccg ggacggcacc gggcattacc tgtgcaactg ggcctcagcc tgcgggctct accaccgcct caacggccag aaccgcccgc tcatccgccc caaaaagcgc ctgcgggtga gtaagcgcgc aggcacagtg tgcagccacg agcgtgaaaa ctgccagaca tccaccacca ctctgtggcg tcgcagcccc atgggggacc ccgtctgcaa caacattcac gcctgcggcc tctactacaa actgcaccaa gtgaaccgcc ccctcacgat gcgcaaagac ggaatccaaa cccgaaaccg caaagtttcc tccaagggta aaaagcggcg ccccccgggg gggggaaacc cctccgccac cgcgggaggg ggcgctccta tggggggagg gggggacccc tctatgcccc ccccgccgcc ccccccggcc gccgcccccc ctcaaagcga cgctctgtac gctctcggcc ccgtggtcct ttcgggccat tttctgccct ttggaaactc cggagggttt tttggggggg gggcgggggg ttacacggcc cccccggggc tgagcccgca gatttaaata ataactctga cgtgggcaag tgggccttgc tgagaagaca gtgtaacata ataatttgca cctcggcaat tgcagagggt cgatctccac tttggacaca acagggctac tcggtaggac cagataagca ctttgctccc tggactgaaa aagaaaggat ttatctgttt gcttcttgct gacaaatccc tgtgaaaggt aaaagtcgga cacagcaatc gattatttct cgcctgtgtg aaattactgt gaatattgta aatatatata tatatatata tatatctgta tagaacagcc tcggaggcgg catggaccca gcgtagatca tgctggattt gtactgccgg aattc
EXERCISE 5:   Once again, invoke nucleotide blast and copy-and-paste this new Lost World sequence into the Enter Query Sequence window and submit it to BLAST. As before, click on the "Total Score" header to sort the hits by total score. Click the Accession link to the right of the highest-scoring sequence match in the list. Which source organism is this DNA sequence from? Similarly, what is the next highest-scoring source organism?
EXERCISE 6:   In the book, it is theorized that birds evolved from dinosaurs. If that were true, you might expect the DNA of modern birds to have similarities with dinosaur DNA. Do either of the top two matches support this theory? What explanation might there be for the other top match? Explain your answers.

The BLAST page provides tools for searching the DNA databases and viewing the results in various ways. The default blastn option that we have been using displays the hits as nucleotide sequences (i.e., sequences of nucleotides C, G, A, and T). Alternatively, by selecting the blastx option, the results may instead be viewed as proteins. Apparently, Dr. Boguski couldn't resist sneaking a hidden message into his Lost World sequence. He inserted nucleotides into his sequence which, when interpreted as proteins, spelled out a 4-word message.
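Interpreting nucleotides as a protein means reading them three at a time (a codon) and mapping each codon to an amino acid letter. The following minimal Python sketch does this with the standard genetic code. It is an illustration, not the blastx implementation: blastx translates the query in all six reading frames (three forward, three on the reverse strand), while this sketch handles only forward frames. The names translate and frame are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
# '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str, frame: int = 0) -> str:
    """Translate a DNA string into one-letter amino acid codes.

    frame (0, 1, or 2) is the number of leading nucleotides to skip.
    Whitespace is ignored; unrecognized codons are written as 'X'.
    """
    seq = "".join(dna.split()).upper()
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


# First ten nucleotides of the Lost World sequence, in two reading frames.
print(translate("gaattccgga"))           # codons GAA TTC CGG
print(translate("gaattccgga", frame=1))  # codons AAT TCC GGA
```

Shifting the frame by one nucleotide changes every codon, which is why the same DNA can translate to very different protein sequences.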

EXERCISE 7:   To view Dr. Boguski's hidden message, go to the main BLAST page and select the blastx option. Copy-and-paste the Lost World sequence into the Enter Query Sequence window and submit it to BLAST. Once the matches appear (this can take several seconds), sort the hits by "Total Score" as before, then click on the description of the top hit. The resulting page will show the query sequence (labeled Query) written as a protein, using the 20 letters corresponding to amino acids. The matching sequence of amino acids from the database (labeled Sbjct) is shown below the query sequence, with dashes representing gaps. Dr. Boguski's message is hidden in the query sequence corresponding to the first four gaps (sequences of dashes) in the subject sequence. What is his 4-word message?

Exploring the NCBI Site

The NCBI Web site has a wide variety of resources pertaining to biotechnology and biomedicine, beyond the genetic databases and tools of BLAST. Links at the bottom of the page are especially useful if you are looking for information or a specific tool.

EXERCISE 8:   Access the GenBank Overview page (found by clicking on the GenBank link at the bottom of the About NCBI site) and answer the following questions:

Submit a document containing your answers to all of the lab questions via BlueLine.