GENEALOGY-DNA-L Archives


From: Ted Kandell
Subject: Uploading a complete mitochondrial sequence to NCBI GenBank, and why this is so useful for the genetic genealogist.
Date: Sun, 12 Mar 2006 02:41:34 -0800 (PST)


With Bennett Greenspan's kind offer of FTDNA as the address for the contact information, I've uploaded my full mtDNA sequence to NCBI GenBank (http://www.ncbi.nlm.nih.gov), the US government-run site where all published genetic sequences are stored, for everything from the bird flu virus to animals and humans. This site also aggregates all the sequences uploaded into the European genetic databases, and so has about 80 million individual sequences in its database so far, and this is growing exponentially. It's a requirement for publication that any genetic sequence referenced in a paper for the first time must first be uploaded to GenBank. This ensures that all researchers out there will have access to the data, and that it will contribute to the pool of general scientific knowledge.

Each uploaded genetic sequence is given an Accession Number. This is mine, along with a link to the page:

DQ377992 http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=88942033

See that Bennett and I are the "authors". I'm the very first private individual to upload a sequence to GenBank. I may be the first, but hopefully I won't be the last ... and I'm trying to work out a system, with the folks at NCBI and also FTDNA, so that people who have gotten their full mtDNA sequences can easily upload them too. If you think about it, there would be tremendous benefits to science in general if people did this: there are currently about 1750 complete mtDNA sequences in the GenBank CoreNucleotide database, and with the current number of full sequences that have already been done or ordered from FTDNA, that number could easily be doubled immediately.

Here is a query to show all the full mtDNA sequences in GenBank CoreNucleotide (just a summary description of each one):

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Search&db=Nucleotide&term=Homo[Organism]+AND+mitochondri*[Title]+AND+16000:17000[SLEN]+NOT+partial[All+Fields]+NOT+EST[keyword]&dispmax=100&doptcmdl=DocSum
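For anyone who would rather build this kind of link in a script than by hand, here is a minimal sketch in Python. It only reassembles the search term and parameters that appear in the link above; the function names are made up for illustration, and nothing here contacts NCBI.

```python
from urllib.parse import urlencode

# Base address copied from the link above.
BASE_URL = "http://www.ncbi.nlm.nih.gov/entrez/query.fcgi"


def full_mtdna_term(min_len=16000, max_len=17000):
    """Assemble the Entrez search term used in the link above."""
    parts = [
        "Homo[Organism]",
        "AND mitochondri*[Title]",
        f"AND {min_len}:{max_len}[SLEN]",   # sequence length range
        "NOT partial[All Fields]",
        "NOT EST[keyword]",
    ]
    return " ".join(parts)


def full_mtdna_url(per_page=100):
    """Build the summary-view search URL with the same parameters as the link."""
    params = {
        "cmd": "Search",
        "db": "Nucleotide",
        "term": full_mtdna_term(),
        "dispmax": per_page,
        "doptcmdl": "DocSum",
    }
    return BASE_URL + "?" + urlencode(params)


print(full_mtdna_term())
print(full_mtdna_url())
```

The generated URL percent-encodes the brackets and the asterisk, so it looks a little different from the hand-written link, but the search term it carries is the same.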

You might ask, why should anyone bother to do this, aside from some altruistic motive of contributing to the general knowledge? Well, there are real benefits for the genetic genealogist, as I will explain.

Now, before you get all confused as to what you are looking at here, I can give you a brief description of what you are seeing:

There are descriptions of what this is ("mitochondrion, complete genome"), what species it belongs to, what part of the genome it is ("organelle=mitochondrion"), the length of the uploaded sequence, the haplogroup ("haplotype=HV*", as they term it), and the authors and the contact information (more about this below). There is also an optional comment, which can be used to add any other relevant information, such as the ethnic origin, etc.

Then come "annotations": These are descriptions of the various features of the mtDNA, such as the D-Loop (the control region), and the various genes in the coding region. This is *not* something that I had to add myself - all human mtDNA sequences are more or less identical in this respect, with the same genes and regions - the only differences might be a mutation here or there that actually changes an amino acid in a protein.

The important part is at the bottom, which is the actual full sequence. There is another way of looking at this, if you select "FASTA" from the "Display" pulldown menu on the upper left:

http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&qty=1&c_start=1&list_uids=88942033&dopt=fasta&dispmax=5&sendto=&from=begin&to=end&extrafeatpresent=1

This shows the sequence in the standard FASTA format, which is just the sequence of bases, with a one-line arbitrary header at the start.
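Because the format is this simple, it is easy to read in a script. The following is a minimal sketch of a FASTA reader in Python; the example record is a toy with a made-up header and made-up bases, not a real GenBank entry.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            # A new header line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())   # sequence lines are concatenated
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Toy input: hypothetical header and bases, for illustration only.
example = """>toy_record made-up header
ACGTACGTAC
GGTTAACC
"""

for header, seq in parse_fasta(example):
    print(header, len(seq))
```

The header line is kept as one opaque string, since (as noted above) its content is arbitrary.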

This is somewhat useful as a place to store your sequence so that it will never be lost, but the real usefulness comes with the ability to compare your sequence in several ways with all the others stored out there in GenBank. There is an online tool provided by NCBI, called BLAST, that lets you do searches on arbitrary DNA and amino acid sequences to find the most similar ones, and filter the results based on the criteria you choose. You can use an accession number as the sequence to search for, or you can cut and paste a FASTA sequence, or just an arbitrary sequence of bases or amino acids.

Here is a search using my uploaded sequence for the first 10 similar human mitochondrial sequences, displayed in a format that makes it easy to compare the differences (you have to hit "Blast!" on the bottom of the page, then "Format!" on the next page to see the results for all of these queries):

http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?ALIGNMENTS=10&ALIGNMENT_VIEW=FlatQueryAnchored&AUTO_FORMAT=Semiauto&CLIENT=web&DATABASE=nr&DESCRIPTIONS=10&ENTREZ_QUERY=human%5BORGN%5D+and+mitochondri*%5BTitle%5D&ENTREZ_QUERY=All+organisms&EXPECT=10&FORMAT_BLOCK_ON_RESPAGE=None&FORMAT_ENTREZ_QUERY=All+organisms&FORMAT_OBJECT=Alignment&FORMAT_TYPE=HTML&GET_SEQUENCE=on&LAYOUT=TwoWindows&MASK_CHAR=2&MASK_COLOR=1&MEGABLAST=on&NCBI_GI=on&PAGE=MegaBlast&PERC_IDENT=None,+1,+-2&PROGRAM=blastn&QUERY=DQ377992&SERVICE=plain&SET_DEFAULTS=Yes&SET_DEFAULTS.x=13&SET_DEFAULTS.y=8&SHOW_LINKOUT=on&SHOW_OVERVIEW=on&WORD_SIZE=28&END_OF_HTTPGET=Yes

This result page gives a summary, with color-coded lines to show the regions compared (in this case they are all the same) and the degree of difference, with the ability to mouse-over to see the sequence referred to. Then, the actual sequences are compared, one underneath the other, base by base. The top line is the query sequence. If any of the results have insertions, in that position there is a dash on the top line to indicate that another match has an addition at this locus. If any of the compared sequences have a deletion relative to the query sequence, they in turn have a dash at that position. Identical bases are shown with a dot. At the sides, you can see first the "GI" number of the sequence (another kind of accession number), with a link to it, and then the starting and ending positions on each line for that sequence (60 bases). If there are relative insertions or deletions, notice that these numbers become "out of sync" relative to the original query sequence.
Now, 60 bases per line is a bit much without a "ruler guide", so you have to do some counting ...
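The counting can be left to a script. Here is a minimal sketch in Python that reads one query row and one hit row in the dot-and-dash notation described above and lists the differences by query position. The function name and the two rows are made up for illustration; they are not copied from the actual BLAST output.

```python
def differences(query_row, hit_row, query_start=1):
    """List (query_position, query_base, hit_base) for one pair of rows.

    A dash in the query row marks a column where some hit has an extra
    base; it does not use up a query position, so an insertion is
    reported at the last query position before it.
    """
    if len(query_row) != len(hit_row):
        raise ValueError("rows must be the same length")
    diffs = []
    pos = query_start - 1            # last query position consumed so far
    for q, h in zip(query_row, hit_row):
        if q == "-":
            if h not in ".-":        # this hit carries the inserted base
                diffs.append((pos, "-", h))
            continue
        pos += 1
        if h != "." and h != q:      # dot means identical to the query
            diffs.append((pos, q, h))
    return diffs


# Hypothetical rows: a substitution, an insertion and a deletion.
query = "ACGT-ACGT"
hit   = ".T..G.-.."
print(differences(query, hit, query_start=101))
# [(102, 'C', 'T'), (104, '-', 'G'), (106, 'C', '-')]
```

Positions here are counted along the query only, which is why the hit's own start and end numbers drift "out of sync" once it has an insertion or deletion.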

But what you can see here is this: if you look at positions 2706 and 7028, you can see that two of the sequences, the first and third hits (they are in order of similarity), match my sequence in these positions, but the rest don't. This is because the rest of the sequences, with 2706A and 7028C, are in haplogroup H. So what are the other two? Neither has 14766T (numbered relative to the CRS; the displayed numbering is off by one in the first and off by 2 in the third), so they both must be inside haplogroup HV. The third hit has 72C and 15904T (one off compared to the numbers relative to the CRS, because of the insertion at 310.1), which puts it in haplogroup V. The first hit, however, doesn't have these, or any of the other mutations diagnostic of any other sub-haplogroup of HV, so it too is HV*.
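The reasoning in the paragraph above can be written down as a small decision function. This is only a sketch in Python using the handful of markers named here (2706A, 7028C, 14766T, 72C, 15904T); a real classification would check many more positions, and the function name and example inputs are made up for illustration.

```python
def classify(bases):
    """Rough haplogroup call from a dict of {CRS position: observed base}.

    Only the markers mentioned in the text are checked, so "HV*" here
    just means "inside HV, and none of the named H or V markers".
    """
    if bases.get(14766) == "T":
        return "outside HV"          # 14766T is absent inside HV
    if bases.get(2706) == "A" and bases.get(7028) == "C":
        return "H"
    if bases.get(72) == "C" and bases.get(15904) == "T":
        return "V"
    return "HV*"


# Illustrative inputs; the bases that are not named in the text are
# placeholders standing for "not the diagnostic allele".
print(classify({2706: "A", 7028: "C", 14766: "C"}))                       # H
print(classify({2706: "G", 7028: "T", 14766: "C", 72: "C", 15904: "T"}))  # V
print(classify({2706: "G", 7028: "T", 14766: "C", 72: "T", 15904: "C"}))  # HV*
```

Note that the positions must already be converted to CRS numbering; the raw alignment numbers can be off by one or two because of insertions, as described above.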

(All this is from the following papers - I haven't memorized these! http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=15382008&query_hl=45&itool=pubmed_docsum
and
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=15466285&query_hl=48&itool=pubmed_docsum)

Here BTW is a comparison of my sequence with just the above HV* and V close matches, for clarity, and the Revised Cambridge Reference Sequence (the last hit), to make it easy to identify which bases are changed:

http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?ALIGNMENTS=10&ALIGNMENT_VIEW=FlatQueryAnchored&AUTO_FORMAT=Semiauto&CLIENT=web&DATABASE=nr&DESCRIPTIONS=10&ENTREZ_QUERY=AY713976+OR+AY495118+OR+J01415&ENTREZ_QUERY=All+organisms&EXPECT=10&FORMAT_BLOCK_ON_RESPAGE=None&FORMAT_ENTREZ_QUERY=All+organisms&FORMAT_OBJECT=Alignment&FORMAT_TYPE=HTML&GET_SEQUENCE=on&LAYOUT=TwoWindows&MASK_CHAR=2&MASK_COLOR=1&MEGABLAST=on&NCBI_GI=on&PAGE=MegaBlast&PERC_IDENT=None,+1,+-2&PROGRAM=blastn&QUERY=DQ377992&SERVICE=plain&SET_DEFAULTS=Yes&SET_DEFAULTS.x=30&SET_DEFAULTS.y=9&SHOW_LINKOUT=on&SHOW_OVERVIEW=on&WORD_SIZE=28&END_OF_HTTPGET=Yes

And here is a pairwise comparison of my sequence and the other HV* sequence, AY713976, with the differences highlighted in red, to make them easy to see:

http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?ALIGNMENTS=10&ALIGNMENT_VIEW=PairwiseWithIdentities&AUTO_FORMAT=Semiauto&CLIENT=web&DATABASE=nr&DESCRIPTIONS=10&ENTREZ_QUERY=AY713976&ENTREZ_QUERY=All+organisms&EXPECT=10&FORMAT_BLOCK_ON_RESPAGE=None&FORMAT_ENTREZ_QUERY=All+organisms&FORMAT_OBJECT=Alignment&FORMAT_TYPE=HTML&GET_SEQUENCE=on&LAYOUT=TwoWindows&MASK_CHAR=2&MASK_COLOR=1&MEGABLAST=on&NCBI_GI=on&PAGE=MegaBlast&PERC_IDENT=None,+1,+-2&PROGRAM=blastn&QUERY=DQ377992&SERVICE=plain&SET_DEFAULTS=Yes&SET_DEFAULTS.x=22&SET_DEFAULTS.y=12&SHOW_LINKOUT=on&SHOW_OVERVIEW=on&WORD_SIZE=28&END_OF_HTTPGET=Yes

This shows, at the top of the comparison, that the two sequences are identical in 16563 out of 16569 positions, with no gaps (insertions or deletions). Each is exactly 3 mutations off from the common root of HV*. So where does AY713976 come from?
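The arithmetic behind these figures is simple enough to check in a few lines of Python. The numbers are the ones quoted above; the function name is just for illustration.

```python
def percent_identity(identical, aligned_length):
    """Share of aligned positions that are identical, as a percentage."""
    return 100.0 * identical / aligned_length


identical, length = 16563, 16569
mismatches = length - identical
print(mismatches)                                      # 6
print(round(percent_identity(identical, length), 2))   # 99.96

# Six differences in total fits with each sequence being 3 mutations
# away from the common root, on separate branches.
print(3 + 3 == mismatches)                             # True
```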

It is an HV* sequence from India, of all places:

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=Nucleotide&list_uids=51450294&dopt=GenBank

You can then click on the link under the title of the paper in which it was described, the "PUBMED" number, to read that paper:

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&query_hl=1&list_uids=15467980


Now, does this mean that my mtDNA sequence is from India? Well, not really, even if it is the closest one in GenBank (for now). See, the only people who match my sequence and haplogroup exactly are a small group of Ashkenazi Jews, from pretty much the same area of Belarus, Lithuania, and the adjacent areas of Poland and Ukraine etc., all close to where my maternal ancestors came from (Belarus). So far, no one at all from outside this background has matched this sequence.

It seems that my sequence is "on the other side" of the common root of HV from the Indian sequence, each just 3 mutations away from it, and neither very far from it. There are several other HV* Indian sequences in the above paper, some of which also have 150T (unlike mine), and so form an Indian HV* clade.

There have been other HV* sequences described, including one from Italy, but again this Italian one has no mutations in common with mine or the Indian one, and is itself 6 mutations distant from the common HV root. So, the conclusion that I would draw from this is that this sequence comes from halfway between Europe and India, i.e. the Near East, which makes plausible sense if you consider the ethnic and historical background. Of course, if more sequences or matches show up, this conclusion may have to be revised, but it is a good working guess for now based on the evidence.

Here is something else very useful. Blast can produce a tree of a set of full sequences, based on their degree of similarity. For example, you can get a tree of the 10 closest to mine that were found in the above Blast query by just clicking on the "Tree View" button right under the list of the alignments. These trees can be formatted in several different ways, and the titles for the sequences can be displayed on the trees too.
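To give a feel for how a tree can come out of nothing more than pairwise similarity, here is a minimal sketch in Python. It is not how Blast builds its tree; it uses plain UPGMA clustering on the number of differing positions as a simple stand-in, and the four short sequences are toys made up for illustration.

```python
from itertools import combinations


def hamming(a, b):
    """Number of differing positions between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b))


def upgma(seqs):
    """Cluster aligned sequences into a tree of nested tuples (UPGMA).

    seqs: dict of {name: aligned sequence}, all the same length.
    """
    clusters = list(seqs)                     # leaves start as clusters
    size = {c: 1 for c in clusters}
    dist = {}
    for a, b in combinations(clusters, 2):
        dist[(a, b)] = dist[(b, a)] = float(hamming(seqs[a], seqs[b]))
    while len(clusters) > 1:
        # Join the closest pair of clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: dist[p])
        merged = (a, b)
        for c in clusters:
            if c not in (a, b):
                # Size-weighted average distance to the new cluster.
                d = (dist[(a, c)] * size[a] + dist[(b, c)] * size[b]) / (size[a] + size[b])
                dist[(merged, c)] = dist[(c, merged)] = d
        size[merged] = size[a] + size[b]
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
    return clusters[0]


# Toy aligned sequences, for illustration only.
toy = {
    "query": "AAAAAA",
    "hit1":  "AAAAAT",
    "hit2":  "TTTAAA",
    "hit3":  "TTTAAC",
}
print(upgma(toy))
# (('query', 'hit1'), ('hit2', 'hit3'))
```

With only a few sequences the grouping is coarse; the more sequences there are to compare, the more finely the branches can be resolved.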

If you think about it, this is what we all want to know: where does my mtDNA haplotype come from, and who is it closest to, with comparisons in several flexible and easy-to-read graphical formats. However, this of course all depends on one thing: the number of sequences out there to compare against. This is where you people come in. The more people upload their full mtDNA sequences to GenBank CoreNucleotide, the more accurate the comparison becomes, and the better defined the trees become, so the divergence estimates become more accurate too.

People may have a question about this though: they can see my name there as an author, and I'm not making any secret that this is my sequence - that is just me - but other people may have very justifiable privacy concerns. I think there's a way around this. The person can submit their sequence to GenBank through their lab, anonymously, and only the lab would associate their Accession Number with their identity, unless the person themselves wanted to publicize it. This way, if other people, matches for example, wanted to contact the testee, they could email the lab and ask about the particular Accession, and if this was set up, these emails could be forwarded anonymously, something like ySearch or Mitosearch. However, even if the person didn't want to do this, other researchers could include their sequence in their studies, just as has been done already for many other public sequences. "Free research!"

Of course, to make this useful, there has to be some specific information about the origin of the haplotype if it's known. A guideline to do this would have to be negotiated with the people at NCBI - they've been very helpful. For example, there are special "qualifiers" in the "source" section that can be used for the ethnicity, aside from the "note". There is a qualifier for the longitude and latitude ("lat_long"), and another for the country, province and locality ("country"). These however have been defined so far to be the location of collection, not the known origin of the sequence, so they haven't been used in this case. If there was a need, NCBI could define other fields to hold this sort of data (but not the GEDCOM, or ancestors' names of course - but these can be in Mitosearch, with a reference to the GenBank Accession there). With sequences submitted up till now, this hasn't been a problem, because all the extra information has been referenced in the published research paper, but these unpublished sequences would have to carry this information since they won't have any outside references (at least initially).

Actual submission of sequences to GenBank could be done by just one click, similar to how haplotypes are currently uploaded to ySearch and Mitosearch from the FTDNA site. All that it would take is for the sequence in FASTA format to be emailed to NCBI, with a temporary random identifier in the header, and then this identifier would be referenced when NCBI emails back the new Accession Number. Of course, there would have to be a way of specifying the extra qualifier data, the haplogroup, the ethnic origin, the location, etc. if they are known.
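As a rough idea of what the lab's side of this could look like, here is a minimal sketch in Python that wraps a sequence as FASTA under a temporary random identifier. Everything about it is an assumption for illustration: the "TMP-" prefix, the 70-base line width and the function name are made up, and no such submission procedure has been agreed with NCBI.

```python
import uuid


def make_submission_fasta(sequence, line_width=70):
    """Return (temp_id, fasta_text) for a sequence to be submitted.

    temp_id is a throwaway random identifier (hypothetical format) that
    the lab would keep, so that the Accession Number sent back later can
    be matched to the right testee.
    """
    temp_id = "TMP-" + uuid.uuid4().hex[:12]
    seq = "".join(sequence.split()).upper()      # drop whitespace, normalize case
    lines = [seq[i:i + line_width] for i in range(0, len(seq), line_width)]
    fasta = ">" + temp_id + "\n" + "\n".join(lines) + "\n"
    return temp_id, fasta


# Toy sequence, for illustration only.
temp_id, fasta = make_submission_fasta("acgt" * 40)
print(fasta)
```

The extra qualifier data (haplogroup, ethnic origin, location and so on) is left out here, since how it would be specified is one of the things still to be worked out.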

So, if lots of people who are now getting their full sequences want to submit them, these things will get worked out between the labs, the testees, and NCBI, now that there is a need.

Of course, even without submitting one's sequence to GenBank, having the sequence available in FASTA format alone will still make it easy to cut and paste and do Blast queries and generate trees too.

And don't forget that, by doing this, other scientists may find some of these sequences interesting for reasons other than population genetics, for example medical studies, and might want to contact the submitter to see if they would participate. Here is where the advancement of knowledge will really take place: a huge pool of data that can be used to help research conditions like the tendency toward Type II diabetes and other metabolic conditions, which could be related to the mitochondrial haplotype. Could this lead to new treatments or cures?

What do people think about this? If anyone finds this idea interesting, they are welcome to respond to me directly, and then we can get a group together to work out the issues, like how to submit sequences easily and automatically if the testee wants, and how to ensure privacy while still leaving a way to optionally contact the submitter.

Ted Kandell

DQ377992 in GenBank Nucleotide
YGYTX on ySearch





