GENEALOGY-DNA-L Archives



From: James Heald
Subject: [DNA] Comparison of Chinese, Yoruba, Watson and Venter genome y-snps
Date: Fri, 03 Apr 2009 17:45:29 +0100


I have uploaded a couple of spreadsheet files,
http://www.healds.org.uk/dna/snps/genome_snps_comparison.zip (189k)
and
http://www.healds.org.uk/dna/snps/genome_other_snps_comparison.zip (597k)
comparing the y-SNPs reported by the Chinese, Yoruba, Watson and Venter
genome projects to each other, and to the list of SNPs with more-or-less
ascertained positions from ISOGG's haplogroup tree, Adriano Squecco's
23&Me spreadsheet, and the HapMap data.

The first file contains SNPs that were reported by either the Chinese or
the Yoruba project or both, with comparisons to the other sources; the
second, SNPs from other sources that were /not/ reported in the Chinese
or Yoruba sequenced genomes. In each case, the SNPs are sorted into
relevant haplogroup order where known, otherwise by position on the Y
chromosome.

Errors, omissions, corrections, suggestions etc would be very much
appreciated, because (as will become clear) there is a lot I don't
understand in this data. But here are some preliminary comments:


Apparent Genotypes
==================

* Chinese: O1a* (M119)
* Yoruba: E1b1a7a* (U174)
* Watson: R1b1b2a1a1* (U106)
* Venter: R1b1b2a1a1d1 (L44/L45/L46/L47)

* The reference sequence ("Hugo") is a composite from several
individuals, and its genotype varies from base to base. I would very
much like to know if anybody can shed any further light on this,
particularly if any particular stretches are known to all be from the
same individual.

From the data, it appears that Hugo is mostly R1b.
* At least some of Hugo is from R1b1b2a1a2d3a (S144)
* There is also some from G2a3b1a (S131), especially between circa
12942936 and 13714104
* It is possible that there may be other R1b individuals whose data also
went into the composite - eg perhaps Venter himself? Does anyone know
more about this?


Assessment of Data
==================

From the genotypes, one would expect to see SNPs in the Yoruba genome
vis-a-vis the reference from E1b1a7a up to DE and back down to NOP;
SNPs from the Chinese genome from O1a up to NO; and SNPs from both the
Yoruba and the Chinese from P to R1b1b2a1a2d3a, and also from G to
G2a3b1a when HUGO is in G.
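The expectation above amounts to walking up the tree from the sample's haplogroup to its common ancestor with the reference, and back down again. A minimal sketch of that walk follows. The tree here is a toy, with only the nodes named in this post; "ROOT" is a placeholder for everything above DE and NOP, and all intermediate branches are omitted, so it is an illustration rather than the real ISOGG topology.

```python
# Toy, heavily simplified tree: child -> parent. Intermediate nodes are
# omitted, and "ROOT" is a placeholder for everything above DE and NOP.
PARENT = {
    "DE": "ROOT",
    "E1b1a7a": "DE",
    "NOP": "ROOT",
    "NO": "NOP",
    "O1a": "NO",
    "P": "NOP",
    "R1b1b2a1a2d3a": "P",
}


def lineage(node):
    """Return the nodes from `node` up to the root, inclusive."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def branches_between(sample, reference):
    """Return the branches separating two haplogroups.

    Each branch is named by the node at its lower end. Branches above the
    most recent common ancestor are shared by both lineages and cancel out.
    """
    up, down = lineage(sample), lineage(reference)
    shared = set(up) & set(down)
    going_up = [n for n in up if n not in shared]
    going_down = [n for n in reversed(down) if n not in shared]
    return going_up + going_down


print(branches_between("O1a", "R1b1b2a1a2d3a"))
print(branches_between("E1b1a7a", "R1b1b2a1a2d3a"))
```

SNPs on any branch in the returned list would be expected to show up as differences between that sample and the reference, wherever the reference is from the R1b individual.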

By and large, these are indeed observed. Comparing principally against
the current ascertainments on Adriano's 23&Me sheet, with some
additional ascertainments from ISOGG and from the HapMap data,

* The Chinese data, abstracted from the 385 Mb file YH-SNPs.gff
downloaded from http://yh.genomics.org.cn/download.jsp, shows 222 of the
226 SNPs that might be expected. The anomalies, where no SNP is
reported where one might have been expected, are at the following
chromosome positions:
- 3445259 (rs2552661); "derived under R"
- 10599615 (rs9786465); "P231 (R1)"
- 5072892 (rs2571764 / rs56285826); "derived between R1b and R1b1"
- 4026708 (rs4032353); "derived between R1b1 and R1b1b"

All of these were also unreported in the Yoruba genome.

In addition, there are two locations where SNPs are reported for all
three other genomes, and which might therefore be presumed to be "HUGO
private mutations", but where no Chinese SNP is reported. These are:
- 27198031 (rs35733966)
- 57435386 (rs12171801)


* The Yoruba SNP list was inferred from the dbSNP submission at
http://www.ncbi.nlm.nih.gov/SNP/snp_viewBatch.cgi?sbid=856991, as
described in my posting of 23 March, ten days ago. The last part of
Submitter SNP_ID is presumed to be the SNP position.
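A minimal sketch of that presumption: take the trailing run of digits in each Submitter SNP_ID as the chromosome position. The ID string in the example is made up purely for illustration; the real IDs in the dbSNP batch may be laid out differently, and the function name is hypothetical.

```python
import re


def position_from_submitter_id(submitter_id):
    """Return the trailing run of digits in the ID as an int, or None."""
    match = re.search(r"(\d+)$", submitter_id.strip())
    return int(match.group(1)) if match else None


# Made-up ID, for illustration only.
print(position_from_submitter_id("HYPOTHETICAL_chrY_2715226"))
```

IDs that do not end in digits come back as None, so they can be listed and checked by hand rather than silently dropped.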

This matches 461 of the 484 SNPs that might be expected. The other 23
are highlighted in yellow in the appropriate column on the two spreadsheets.

In addition, there are 14 SNPs described by the 23&Me sheet as "derived
under E1b1" for which no SNP is reported by the Yoruba genome team.
However, as the only E1b1 members that have 23&Me tests are from
haplogroups E1b1a8 and E1b1a8a, it is entirely possible that many or all
of these mutations are in fact "derived under E1b1a8", so indeed should
not be seen in an E1b1a7 genome.


* The Watson SNP list was taken from the 26 Mb file
watson-454-snp-v01.txt.gz, downloaded from
ftp://jimwatsonsequence.cshl.edu/jimwatsonsequence/. This appears to
give more SNPs than selecting the Y chromosome and "downloading HapMap
GFF file" from the James Watson genome browser.

Nevertheless, only 10 of the expected 28 SNPs were reported -- making
this apparently far and away the flakiest data set.

It is also noticeable, looking on the second spreadsheet, that SNPs that
were reported in Watson but nobody else appear to be clustered in runs,
uninterrupted by Venter SNPs.

I don't understand either this, or why so many SNPs appear to be
"missing" in the Watson data.


* The Venter SNP list was taken from the 259Mb file
HuRef.InternalHuRef-NCBI.gff, downloaded from
ftp://ftp.jcvi.org/pub/data/huref/.

25 of the expected 28 SNPs are included, the exceptions being
- 13714104 (rs9786910); "derived under G"
- 13381370 (rs35285796); "L78 (derived under G2a3)"
- 5815550 (rs2566671); "L2/S139 (R1b1b2a1a2d3)"
All of these were also missing in the Watson list.

A complication with the Venter data is that the original file reports
a great many stretches of "heterozygous SNPs" in parts of the Y
chromosome that nobody else seems to note as heterozygous, and these
Venter SNPs seem to have been assessed by comparison with a sequence
quite different from the HUGO reference Y chromosome.

I'm not sure I understand exactly what's going on here, but I am
assuming that these were stretches for which, because of the shotgun
method used, the Venter team didn't know whether they aligned with the Y
chromosome or somewhere completely different, and that they seem to have
been assessed against that other location rather than against the Y.

This also seems to raise a question, that where the Venter team reported
*no* SNP, does this really mean that there was no SNP? Or could it be
that they were comparing with some quite different chromosome, and there
was no SNP with respect to that? I wasn't sure which stretches of the
genome this issue might affect, so I wasn't sure how much of the Venter
"no SNP" results this might be a problem with.

Because of these issues, I have only partly included the data from the
Venter file. I have included all SNPs ("homozygous" and
"heterozygous"), but then I called it a day, so I haven't included all
the chromosome locations for MNPs and "mixed sequences", only those
which correspond to SNPs reported by one of the other sources. I've
instead extracted the Y-chromosome contents of the Venter SNP gff file,
and put them up so that they are there in full for anyone that wants
them, at http://www.healds.org.uk/dna/snps/Venter_ysnps.zip (97k).
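For anyone wanting to repeat the extraction, a minimal sketch follows. It relies only on the GFF convention that fields are tab-separated and the first field names the sequence. How the Y chromosome is labelled in the HuRef file is an assumption here ("Y" or "chrY"), so check the file and adjust `y_names` accordingly.

```python
def y_chromosome_records(lines, y_names=("Y", "chrY")):
    """Yield GFF records whose first (seqname) column is one of y_names."""
    for line in lines:
        # Skip blank lines and comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        seqname = line.split("\t", 1)[0]
        if seqname in y_names:
            yield line.rstrip("\n")


# Made-up records, for illustration only.
sample = [
    "##gff-version 2\n",
    "chr1\tsrc\tSNP\t100\t100\t.\t+\t.\tnote\n",
    "chrY\tsrc\tSNP\t200\t200\t.\t+\t.\tnote\n",
]
for record in y_chromosome_records(sample):
    print(record)
```

Because it works line by line, the same function can be fed an open file handle, so the 259Mb file never has to be held in memory all at once.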


RS numbers
----------

I also came across an odd issue with a few RS numbers quoted for
particular locations, which it seems are now considered by dbSNP to be
"not uniquely located", and are no longer included in dbSNP's "map" of
the Y chromosome, the chromosome report at
ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/chr_rpts/

These 'withdrawn?' RSs include eg:
- 3129109 (rs2652921)
- 4402048 (rs2452344)
- 4703090 (rs11096443), all reported with RS numbers by the Chinese team,

and eg:
- 142179559 (rs28392688); P139, co-defining Haplogroup F
- 3604898 (rs35699596); P254, defining Haplogroup F4
- 4985637 (rs4252209); MEH2, defining Haplogroup Q1a

and others.

Typically the dbSNP server responds "Mapped unambiguously on
non-reference assembly only", and comes up with more than one chromosome
on the reference assembly where the flanking sequences could
(more-or-less?) match.

But I don't understand why these RSs are not included in the "chromosome
report", when there are others with more than one chromosome which are
included (which I've tagged with 'M' in a column on the sheet); and when
there are separate chromosome reports for "Multi" and "NotOn". Seems
like a failure in the dbSNP audit trail that these RSs aren't appearing
in /any/ of those listings.


Numerical Breakdown
===================

One useful way to sort the spreadsheet is into SNPs reported by both
Chinese and Yoruba genome teams, SNPs reported only by the Chinese, and
SNPs reported only by the Yoruba (Column G on the spreadsheet).

Doing that gives the following breakdown:

* 451 SNPs derived in Chinese AND Yoruba, of which 200 have ascertained
positions (more-or-less) in the tree, a further 166 have rs numbers but
no ascertained positions (33 JW != CV); and 85 are novel, with no rs
number (4 JW != CV)

* 1084 SNPs derived in Chinese NOT Yoruba, of which 22 have ascertained
positions (more-or-less) in the tree, a further 274 have rs numbers but
no ascertained positions (13 RS reported by the Chinese now no longer in
dbSNP main list; 39 JW != CV); and 808 are novel, with no rs number
(10 JW != CV)

* 1305 SNPs derived in Yoruba NOT Chinese, of which 268 have ascertained
positions (more-or-less) in the tree, a further 377 have rs numbers but
no ascertained positions (2 RS no longer in dbSNP main list; 15 JW !=
CV); and 660 are novel, with no rs number (20 JW != CV)
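The three-way split can be sketched with plain set operations on chromosome positions. The function name and the positions in the example are made up for illustration; the real inputs would be the position lists extracted from the Chinese and Yoruba files.

```python
def breakdown(chinese, yoruba):
    """Split positions into both / Chinese-only / Yoruba-only sets."""
    chinese, yoruba = set(chinese), set(yoruba)
    return {
        "chinese_and_yoruba": chinese & yoruba,
        "chinese_not_yoruba": chinese - yoruba,
        "yoruba_not_chinese": yoruba - chinese,
    }


# Made-up positions, for illustration only.
groups = breakdown([100, 200, 300], [200, 300, 400, 500])
for name, positions in groups.items():
    print(name, len(positions), sorted(positions))
```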


I've noted the number for which the JW (Watson) report doesn't match the
CV (Venter) report, because under "normal" circumstances (i.e. if SNPs
are truly UEPs, the SNP reports were accurate, and SNPs haven't been
masked by anything else going on), one might expect that:

- a mutation in either JW or CV, and a mutation in either the Chinese or
the Yoruba should mean a mutation in the other (either the Yoruba or the
Chinese); (since this presumably indicates a mutation which is 'private'
in the HUGO sequence, or at least between there and JW); and
- a mutation in either the Chinese or the Yoruba but not both should
mean that JW=CV, since the mutation presumably should be occurring
elsewhere than downstream of P.

But the large number of cases where JW != CV (which I have highlighted
on the sheet) indicates that there is more going on here than I have
understood.
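The two expectations can be written as a small per-SNP check. This is only a sketch: each argument is True if that genome reports a SNP at the position, and the function name and the "rule 1" / "rule 2" labels are invented here for illustration.

```python
def violated_expectations(chinese, yoruba, jw, cv):
    """Return the labels of the expectations that one SNP row breaks."""
    problems = []
    # Rule 1: a SNP in JW or CV and in one of Chinese/Yoruba suggests a
    # reference-private mutation, so the other of Chinese/Yoruba should
    # report it too.
    if (jw or cv) and (chinese or yoruba) and not (chinese and yoruba):
        problems.append("rule 1")
    # Rule 2: a SNP in exactly one of Chinese/Yoruba should not separate
    # JW from CV.
    if chinese != yoruba and jw != cv:
        problems.append("rule 2")
    return problems


print(violated_expectations(True, True, True, True))
print(violated_expectations(True, False, True, False))
```

Running this over every row of the sheet would give a list of the positions that need a closer look, alongside the JW != CV highlighting.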


Finally, something I found very striking was how much larger the number
of mutations reported in "Chinese NOT Yoruba" (ie Haplogroup NO to
Haplogroup O1a*) was than that reported in "Chinese AND Yoruba" (ie
Haplogroup P to Haplogroup R1b1b2a1a2d3a): 1084 versus 451, or roughly
2.4 times as many, for what is by definition exactly the same span of
time in both cases.

Could different fixation rates due to different population histories
really lead to such a large difference?

And if they could, does that mean the dating method used in Karafet et
al (2008) now looks very questionable?


Hoping somebody can shed some light on all this,

All best,

James.








