Question: Blast Dinosaur DNA   

In order to use the necessary website effectively, you may need
to instruct your browser to enable pop-up windows. This can be done
in the preferences of the web browser (e.g., on
Mac OS: Firefox → Preferences → Content → uncheck “Block pop-up
windows”).

There are several questions embedded in this exercise. They are
numbered 1-7; some have multiple parts, and some are embedded
in paragraphs (so read carefully!). When you have completed this
lab, write up your answers and print them out; this will function
as your lab notebook. Be sure to include your name,
date, and a title for the lab.

First, go to the main NCBI web site
(http://www.ncbi.nlm.nih.gov/). From the main NCBI web
site, click on the BLAST link (right-hand side, under Popular
Resources). BLAST comes in a number of variants; the most common are
listed in Table 1.

Table 1. Common Variants of BLAST

BLAST variant   Query sequence type           Database type
blastn          DNA                           DNA
blastp          protein                       protein
blastx          DNA (translated to protein)   protein
tblastn         protein                       DNA (translated to protein)
tblastx         DNA (translated to protein)   DNA (translated to protein)

BLAST stands for Basic Local Alignment Search
Tool. It is a quick way to take an unknown sequence and
align it against an existing sequence database. The key
to BLAST is that it divides the unknown sequence (the query) into
short segments (words), matches those to existing sequences, and then
extends each match into the neighboring sequence. The
original BLAST publication is by Altschul et al. (1990. Basic local
alignment search tool. J Mol Biol. 215(3):403-10). After
making the best alignment possible based on the BLAST algorithm,
the match is assigned an alignment score that
incorporates the number of correct matches, the number of misses,
and the number and length of gaps. Note that the
alignment score is a function not just of the factors mentioned
above, but also of the overall length of the sequence (e.g., a short
query will have a low maximum score even if it matches another
sequence perfectly).
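
The seed-and-extend idea described above can be sketched in a few lines of Python. This is a simplified illustration, not the actual BLAST implementation: the function names and the word_size parameter are assumptions made for this example, only exact word matches are used as seeds, and a seed is extended (without gaps) only until the first mismatch, whereas the real program keeps extending while the score stays within a drop-off threshold.

```python
def find_seeds(query, subject, word_size=11):
    """Return (query_pos, subject_pos) pairs where a query word occurs exactly in the subject."""
    # Index every word of length word_size found in the subject.
    index = {}
    for j in range(len(subject) - word_size + 1):
        index.setdefault(subject[j:j + word_size], []).append(j)
    # Look up every query word in that index.
    seeds = []
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            seeds.append((i, j))
    return seeds


def extend_seed(query, subject, i, j, word_size):
    """Extend an exact seed left and right while the bases keep matching (no gaps)."""
    start_q, start_s = i, j
    while start_q > 0 and start_s > 0 and query[start_q - 1] == subject[start_s - 1]:
        start_q -= 1
        start_s -= 1
    end_q, end_s = i + word_size, j + word_size
    while end_q < len(query) and end_s < len(subject) and query[end_q] == subject[end_s]:
        end_q += 1
        end_s += 1
    # Half-open ranges: query[start_q:end_q] matches subject[start_s:end_s].
    return start_q, end_q, start_s, end_s


# Toy sequences (made up for illustration) with a short word size.
toy_query = "TTACGTACGGA"
toy_subject = "GGGACGTACGCC"
for q_pos, s_pos in find_seeds(toy_query, toy_subject, word_size=4):
    print(q_pos, s_pos, extend_seed(toy_query, toy_subject, q_pos, s_pos, 4))
```

Indexing the subject words first is what makes the lookup fast: each query word is checked with one dictionary lookup instead of a scan of the whole subject.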

Simplified alignment score example:

AACGTTTCCAGTCCAAATAGCTAGGC

===--===   =-===-==-======

AACCGTTC   TACAATTACCTAGGC

Hits (+1): 18, Misses (-2): 5, Gaps (existence -2,
extension -1): 1 gap of length 3

The gap costs -2 for its first position and -1 for each of the two
positions that extend it, so:

Alignment Score = 18 * 1 + 5 * (-2) - 2 - 2 = 4
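
The simplified scoring scheme above can be checked with a short Python sketch. The function name and the use of "-" to mark gap positions are assumptions made for this example; the gap convention assumed here is that the first position of a gap costs the existence penalty and each further position costs the extension penalty, which is what the terms 18 * 1 + 5 * (-2) - 2 - 2 express.

```python
def alignment_score(query_row, subject_row, match=1, mismatch=-2,
                    gap_open=-2, gap_extend=-1):
    """Score two aligned rows of equal length; '-' marks a gap position."""
    if len(query_row) != len(subject_row):
        raise ValueError("aligned rows must have the same length")
    score = 0
    in_gap = False
    for q, s in zip(query_row, subject_row):
        if q == "-" or s == "-":
            # First gap position pays gap_open, later ones pay gap_extend.
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if q == s else mismatch
            in_gap = False
    return score


query_row = "AACGTTTCCAGTCCAAATAGCTAGGC"
subject_row = "AACCGTTC---TACAATTACCTAGGC"
print(alignment_score(query_row, subject_row))  # 18 - 10 - 2 - 1 - 1 = 4
```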

Another important result from a BLAST search is the E
value. This is the number of matches with the
same or a better alignment score that you would expect by chance
alone. Typically, an E value <0.05 is required for
the match (or hit) to be considered significant. There are a number of
additional references that you can use to learn about BLAST; I have
listed three of these below.

http://www.youtube.com/watch?v=HXEpBnUbAMo (YouTube video)

http://www.ncbi.nlm.nih.gov/books/NBK21097/ (information from NCBI)

http://www.ncbi.nlm.nih.gov/books/NBK1734/ (information from NCBI)
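
To make the E value described above more concrete, here is a minimal Python sketch of the Karlin-Altschul estimate, E = K * m * n * exp(-lambda * S), where m is the query length, n is the total database length, and S is the raw alignment score. The function name is made up for this example, and the lambda and K values below are illustrative placeholders, not the values NCBI uses for any particular search; the real values depend on the scoring scheme and are reported with the search results.

```python
import math


def expected_hits(score, query_len, db_len, lam, k):
    """Karlin-Altschul estimate of chance hits scoring at least `score`."""
    return k * query_len * db_len * math.exp(-lam * score)


# Illustrative placeholder parameters (assumptions, not NCBI's actual values).
LAMBDA = 1.28
K = 0.46
DB_LEN = 10**11  # hypothetical total number of bases in the database

# A higher score gives a smaller E value; a bigger database gives a larger one.
for score in (20, 40, 60):
    print(score, expected_hits(score, query_len=60, db_len=DB_LEN, lam=LAMBDA, k=K))
```

Because the score sits in the exponent, even a modest increase in score shrinks the E value by many orders of magnitude, while the search-space terms m and n only scale it linearly.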

“Jurassic Park” Dinosaur-DNA Analysis

In 1990 Michael Crichton wrote a book called Jurassic
Park, which was later made into a movie by Steven Spielberg in
1993. The story begins with the resurrection of dinosaurs using
blood from the digestive tracts of insects that had been encased in
tree sap, which later turned into amber. At one point in the book,
Dr. Henry Wu is asked to explain some DNA techniques used in
reconstructing the extinct dinosaur genomes. Dr. Wu alludes to the
fact that they do not have the entire genome sequenced, but that
they “fill in the gaps” with modern day frog DNA. (As an aside, we
know now that there would be better sources of modern comparable
DNA. For one point of extra credit, make a suggestion and explain
why this would work better than frog DNA.  If you have no
idea, this might provide a hint: http://xkcd.com/1211/) At one
point during his discussion he points to a computer screen and
remarks, “Here you see the actual structure of a small fragment of
dinosaur DNA.” The DNA sequence Dr. Wu refers to is found on page
103 of the book and is seen below.  

>JurassicPark DinoDNA p 103
gcgttgctgg cgtttttcca taggctccgc ccccctgacg agcatcacaa aaatcgacgc
ggtggcgaaa cccgacagga ctataaagat accaggcgtt tccccctgga agctccctcg
tgttccgacc ctgccgctta ccggatacct gtccgccttt ctcccttcgg gaagcgtggc
tgctcacgct gtaggtatct cagttcggtg taggtcgttc gctccaagct gggctgtgtg
ccgttcagcc cgaccgctgc gccttatccg gtaactatcg tcttgagtcc aacccggtaa
agtaggacag gtgccggcag cgctctgggt cattttcggc gaggaccgct ttcgctggag
atcggcctgt cgcttgcggt attcggaatc ttgcacgccc tcgctcaagc cttcgtcact
ccaaacgttt cggcgagaag caggccatta tcgccggcat ggcggccgac gcgctgggct
ggcgttcgcg acgcgaggct ggatggcctt ccccattatg attcttctcg cttccggcgg
cccgcgttgc aggccatgct gtccaggcag gtagatgacg accatcaggg acagcttcaa
cggctcttac cagcctaact tcgatcactg gaccgctgat cgtcacggcg atttatgccg
caagtcagag gtggcgaaac ccgacaagga ctataaagat accaggcgtt tcccctggaa
gcgctctcct gttccgaccc tgccgcttac cggatacctg tccgcctttc tcccttcggg
ctttctcatt gctcacgctg taggtatctc agttcggtgt aggtcgttcg ctccaagctg
acgaaccccc cgttcagccc gaccgctgcg ccttatccgg taactatcgt cttgagtcca
acacgactta acgggttggc atggattgta ggcgccgccc tataccttgt ctgcctcccc
gcggtgcatg gagccgggcc acctcgacct gaatggaagc cggcggcacc tcgctaacgg
ccaagaattg gagccaatca attcttgcgg agaactgtga atgcgcaaac caacccttgg
ccatcgcgtc cgccatctcc agcagccgca cgcggcgcat ctcgggcagc gttgggtcct
gcgcatgatc gtgctagcct gtcgttgagg acccggctag gctggcgggg ttgccttact
atgaatcacc gatacgcgag cgaacgtgaa gcgactgctg ctgcaaaacg tctgcgacct
atgaatggtc ttcggtttcc gtgtttcgta aagtctggaa acgcggaagt cagcgccctg

In 1992 Dr. Mark Boguski at NCBI entered this sequence into a
text editor and searched all of the known DNA sequences at the
time. Mark wrote up his findings and submitted a manuscript to the
journal BioTechniques, as a tongue-in-cheek joke. His manuscript
was accepted and published (Boguski, M.S. 1992. A molecular
biologist visits Jurassic Park.
BioTechniques 12(5):668-669). In 1992, the tools you now
have were not available and access to databases was very awkward
and time-consuming. In fact, most databases at that time did not
allow public access. You will be able to easily reproduce this
experiment using BLAST and your favorite web browser in less than
1/100th of the time it took Dr. Boguski.

Part 1: If you followed the instructions on the
first page, you are already at the NCBI BLAST page; if not, click
this link: http://blast.ncbi.nlm.nih.gov/Blast.cgi. From the main
BLAST page, select nucleotide blast (which is also called blastn;
the n stands for nucleotides). This brings up a web page where you
can specify your query sequence (i.e., the sequence you will use to
search the databases), along with various
parameters.  

Cut and paste the above “dinosaur DNA” sequence into the large
box labeled “Enter accession number(s), gi(s), or FASTA
sequence(s).” The > symbol indicates the sequence is in FASTA
format, the first line of which is for information/comment and is
not used in the actual search. (Note that if you had a sequence
saved as a text file, you could have clicked the Choose File button
and uploaded the file that contained the query sequence or multiple
query sequences. Also, note that when you point and click your
mouse somewhere outside the big box, the FASTA description line
becomes the Job Title.)  In the area of the webpage
titled Choose Search Set, click the Others (nr, etc.) radio
button.
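
As a small illustration of the FASTA format described above, the following Python sketch separates the description line (the one starting with >) from the sequence and strips the spaces and line breaks used for layout. The function name parse_fasta is an assumption made for this example and is not part of any NCBI tool; the sample text is just the first two lines of the sequence above.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (description, sequence) tuples."""
    records = []
    description, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if description is not None:
                records.append((description, "".join(chunks)))
            description, chunks = line[1:].strip(), []
        elif description is not None:
            # Drop the spaces that separate the blocks of ten bases.
            chunks.append("".join(line.split()).upper())
    if description is not None:
        records.append((description, "".join(chunks)))
    return records


example = """>JurassicPark DinoDNA p 103
gcgttgctgg cgtttttcca taggctccgc ccccctgacg agcatcacaa aaatcgacgc
ggtggcgaaa cccgacagga ctataaagat accaggcgtt tccccctgga agctccctcg
"""
for description, sequence in parse_fasta(example):
    print(description, len(sequence))  # JurassicPark DinoDNA p 103 120
```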

1. How many sequences are currently in the nr database?
Hint: use the NCBI website to help you find the
answers to these questions, in particular by clicking on the question
mark next to the drop-down menu specifying “nucleotide collection
(nr/nt)”.

It is also possible to restrict the search to, or exclude, sequences from
particular organisms by entering a taxon name in the
“Organism” box. For example, if you type in “Drosophila” you
will only retrieve sequences from Drosophila species. This
works at varying taxonomic levels: “Drosophila
melanogaster” would retrieve sequences only from this
particular species, while “Diptera” would find all sequences from
organisms within this order (including Drosophila). Use of
this tool can decrease the amount of computational resources (and
hence time) needed for a search. In our example, we do not know the
source of the DNA, but based on the narrative above you might make
a prediction.

2. Suggest a reasonable organismal limitation that one could make.
Type this into the organism box and list the
taxon and taxid, which will appear after you type in the
organism.

For this exercise we will not use the limitation function and
instead search the entire nucleotide (nr/nt) database.

The next parameter we will briefly explore is the program
selection. There are three choices: “Highly similar sequences
(megablast)”, “More dissimilar sequences (discontiguous megablast)”, and
“Somewhat similar sequences (blastn)”. Click on the question mark
after “Choose a BLAST algorithm” to learn more about these
options. (Hint: click on the “more” option to use NCBI
resources.) Now choose the “Highly similar sequences (megablast)”
option.

Check the box in front of “Show results in a new window” and
then click the “BLAST” button to start the search. A new web page
will appear. (As mentioned above, the sequence you submit for
searching the database is called the query sequence. An identical
or similar sequence that is found in the database by the query
sequence is called the subject sequence.) Using resources found on
NCBI and other places, answer the following questions about your
results.

3. a. What is the color of the top line in the graphic (not the multicolored one)?

b. What does this color mean?

c. Scroll down to the table just below the
graphic with all the lines. You can find out the actual source of
the “dinosaur” DNA by clicking the Accession number on the first
line. The first listed accession number is the best hit (= best
alignment).  Clicking on this will give you access to the
GenBank page that contains lots of other information about the
source of the “dinosaur” DNA. What is the source of the
“dinosaur” DNA?

d. What is a vector used for in molecular biology?
If you aren’t familiar with this term, work with
your lab partner to understand it. Don’t be afraid to use Google.
Check in with your instructor to verify your understanding.

e. If this sequence were actual dinosaur DNA, would your results make
sense? Explain.

f. Go back to the tab with the results webpage
that you were looking at in parts a-b above. What are the
Max score and E value for the first hit?
I am not looking
for definitions here, but rather the actual values of the two for
the first hit. (You will need this information
later.)

g. Next, scroll down below the “Description”
table to the “Alignments”. This shows the alignment of the query
and subject sequences. The alignment is shown in pieces, with the
highest scoring pieces shown first. Take some time to examine the
first alignment and answer the following questions using the “Range
2” piece. (In this case, the first subject sequence is referred to
as the best hit because it is the most similar to the query
sequence.)

i. Using the positions (numbers) next to the sequences,
which nucleotide is located at query nucleotide position 302
(T, A, G, or C)?

ii. What position does this correspond to in the subject sequence
(what is the number)?

iii. What is the percent identity of this range of the two sequences?

iv. In your own words, what does the answer to part iii mean?
(This was actually mentioned in the YouTube video.)

v. How many gaps are there in this range of aligned sequences?
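
For reference, the identity and gap figures that BLAST reports for an aligned range can be reproduced by hand. The Python sketch below is an illustration only: the function name is an assumption, "-" marks a gap position, and the input is the toy alignment from the simplified scoring example earlier in this handout, not the alignment you obtain in this exercise. Percent identity is taken as identical positions divided by the full alignment length, gap positions included.

```python
def identity_and_gaps(query_row, subject_row):
    """Return (identities, gap_positions, alignment_length, percent_identity)."""
    if len(query_row) != len(subject_row):
        raise ValueError("aligned rows must have the same length")
    pairs = list(zip(query_row, subject_row))
    # A position counts as an identity only if both rows hold the same base.
    identities = sum(1 for q, s in pairs if q == s and q != "-")
    # Every column that contains a '-' in either row is one gap position.
    gaps = sum(1 for q, s in pairs if q == "-" or s == "-")
    length = len(pairs)
    return identities, gaps, length, 100.0 * identities / length


identities, gaps, length, percent = identity_and_gaps(
    "AACGTTTCCAGTCCAAATAGCTAGGC",
    "AACCGTTC---TACAATTACCTAGGC",
)
print(f"Identities = {identities}/{length} ({percent:.0f}%), Gaps = {gaps}/{length}")
```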

Now highlight and copy the DNA sequence below (60 bases), which
is the fourth line of the putative dinosaur DNA from page 2
above.

tgctcacgct gtaggtatct cagttcggtg taggtcgttc gctccaagct gggctgtgtg

 

Go back to the blastn page and paste the above sequence into the
big box. Make sure the non-redundant database is still
selected and click the BLAST button.

4. a. What color is the top line in the graphic?

b. What does that color mean?

c. What are the Max score and E value for the best hit?

d. Why do you think that the colored line, hit score, and E value are
different from those above?

Part 2: Mark Boguski’s published
article was brought to Crichton’s attention. For the sequel,
The Lost World, Crichton used Mark as a consultant. Mark
constructed an interesting sequence from existing species and also
embedded a message in the protein translation of the DNA sequence
that he submitted for use in the book. Here is the sequence Mark
gave Crichton for The Lost World. Copy the following sequence and then paste
it into the big box on the blastn webpage. Make sure the
non-redundant database is selected. Click the BLAST button.

>LostWorld DinoDNA p 135

gaattccgga agcgagcaag agataagtcc tggcatcaga tacagttgga gataaggacg
gacgtgtggc agctcccgca gaggattcac tggaagtgca ttacctatcc catgggagcc
atggagttcg tggcgctggg ggggccggat gcgggctccc ccactccgtt ccctgatgaa
gccggagcct tcctggggct gggggggggc gagaggacgg aggcgggggg gctgctggcc
tcctaccccc cctcaggccg cgtgtccctg gtgccgtggg cagacacggg tactttgggg
accccccagt gggtgccgcc cgccacccaa atggagcccc cccactacct ggagctgctg
caaccccccc ggggcagccc cccccatccc tcctccgggc ccctactgcc actcagcagc
gggcccccac cctgcgaggc ccgtgagtgc gtcatggcca ggaagaactg cggagcgacg
gcaacgccgc tgtggcgccg ggacggcacc gggcattacc tgtgcaactg ggcctcagcc
tgcgggctct accaccgcct caacggccag aaccgcccgc tcatccgccc caaaaagcgc
ctgcgggtga gtaagcgcgc aggcacagtg tgcagccacg agcgtgaaaa ctgccagaca
tccaccacca ctctgtggcg tcgcagcccc atgggggacc ccgtctgcaa caacattcac
gcctgcggcc tctactacaa actgcaccaa gtgaaccgcc ccctcacgat gcgcaaagac
ggaatccaaa cccgaaaccg caaagtttcc tccaagggta aaaagcggcg ccccccgggg
gggggaaacc cctccgccac cgcgggaggg ggcgctccta tggggggagg gggggacccc
tctatgcccc ccccgccgcc ccccccggcc gccgcccccc ctcaaagcga cgctctgtac
gctctcggcc ccgtggtcct ttcgggccat tttctgccct ttggaaactc cggagggttt
tttggggggg gggcgggggg ttacacggcc cccccggggc tgagcccgca gatttaaata
ataactctga cgtgggcaag tgggccttgc tgagaagaca gtgtaacata ataatttgca
cctcggcaat tgcagagggt cgatctccac tttggacaca acagggctac tcggtaggac
cagataagca ctttgctccc tggactgaaa aagaaaggat ttatctgttt gcttcttgct
gacaaatccc tgtgaaaggt aaaagtcgga cacagcaatc gattatttct cgcctgtgtg
aaattactgt gaatattgta aatatatata tatatatata tatatctgta tagaacagcc
tcggaggcgg catggaccca gcgtagatca tgctggattt gtactgccgg aattc

 

 

Click the Accession number (blue hypertext) of the best hit,
which will take you to the GenBank page.

5. Which organism is this DNA sequence from?

6. To learn more about the organism, click the
ORGANISM link in the GenBank (gb) record. Do the same thing for the
second-highest-scoring match. Is either of these organisms
related to dinosaurs? Explain.

Part 3: From the BLAST homepage
(http://blast.ncbi.nlm.nih.gov/Blast.cgi), click on blastx, which
will translate the DNA sequence you are submitting into an amino
acid sequence (protein), and then search the protein sequence against
protein databases. Once again, copy and paste this same “Lost
World” sequence from above into the big box (make sure to include
the entire sequence for this exercise); make sure the non-redundant
database is selected and click BLAST.
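
The translation step that blastx performs before searching can be sketched in Python. This is a minimal illustration using the standard genetic code, not NCBI's implementation; the function names are made up for this example, and any codon containing a character other than A, C, G, or T is shown as X. blastx translates the query in all six reading frames (three on each strand), which is what six_frames mimics.

```python
BASES = "TCAG"
# Standard genetic code, listed in TCAG order for all three codon positions.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna, frame=0):
    """Translate DNA starting at offset `frame` (0, 1, or 2); '*' is a stop codon."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(dna):
    """Return translations for frames +1..+3 and -1..-3, keyed by frame number."""
    dna = dna.upper()
    reverse = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna, offset)
        frames[-(offset + 1)] = translate(reverse, offset)
    return frames


# First 30 bases of the "Lost World" sequence, with the layout spaces removed.
fragment = "gaattccggaagcgagcaagagataagtcc"
for frame, protein in sorted(six_frames(fragment).items()):
    print(frame, protein)
```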

7. On the results page, look at the best
alignment by clicking on the description for the alignment (i.e.,
click on the “erythroid transcription
factor”). Carefully examine the alignment of the two
sequences. Notice that there are some gaps (represented
by dashes, -) where the query sequence has amino acids while the
other sequence does not. Mark’s message is contained in
the query sequence where the subject sequence has
gaps. Read the gaps. What is his message?
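
If you would rather not read the gaps by eye, the same idea can be expressed as a short Python sketch that collects the query residues sitting opposite dashes in the subject row. The function name and the two aligned rows below are made up for illustration; they are not the actual blastx alignment, so you would paste in the query and subject rows from your own results.

```python
def residues_opposite_gaps(query_row, subject_row):
    """Collect runs of query residues that are aligned to '-' in the subject."""
    runs, current = [], []
    for q, s in zip(query_row, subject_row):
        if s == "-" and q != "-":
            current.append(q)
        elif current:
            # The gap has ended, so close the current run.
            runs.append("".join(current))
            current = []
    if current:
        runs.append("".join(current))
    return runs


# Hypothetical aligned rows (not taken from the real search results).
toy_query = "MEFSMILEVALGTIMEPDA"
toy_subject = "MEF-----VALG----PDA"
print(residues_opposite_gaps(toy_query, toy_subject))  # ['SMILE', 'TIME']
```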
  
