From: James Heald <>
Subject: Re: [DNA] Comparison of Chinese, Yoruba,Watson and Venter genome y-snps
Date: Wed, 08 Apr 2009 19:54:17 +0100
1. If anyone wants to look at the internal "phred" numbers for the
Chinese data, to see whether the apparent over-supply of Chinese SNPs
and/or shortage in Yoruba SNPs (about 250 Yoruba SNPs too few, when
there are Chinese SNPs; or 500 Chinese SNPs too many, out of about 950
Chinese NOT Yoruba SNPs) might relate to any systematic differences in
the quality-control numbers for Chinese NOT Yoruba SNPs, as against
Chinese AND Yoruba SNPs, I have uploaded the full Y-chromosome part of
the original Chinese SNP file to
The Phred numbers are in the two-digit column after the position numbers.
However, most of them are very high; and even a Phred number of 20 is
supposed to indicate a 99% chance that the SNP is accurate. So this
*shouldn't* make much difference. But it might be interesting to see
whether there is *any* correlation between the Phred numbers and whether
or not a Yoruba SNP was reported for a Chinese one.
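
For anyone who wants to try this, here is a minimal Python sketch of the check. It assumes the standard Phred definition (Q = -10 log10 of the error probability), which is what gives 99% at Q = 20. It also assumes, purely for illustration, that the rows have already been reduced to (phred, seen_in_yoruba) pairs; that layout and the function names are hypothetical and not part of the uploaded file.

```python
from collections import defaultdict

def phred_to_accuracy(q):
    """Probability that a call is correct under the standard Phred
    definition: error probability = 10 ** (-q / 10)."""
    return 1.0 - 10.0 ** (-q / 10.0)

def match_rate_by_phred(records, bin_width=10):
    """Fraction of Chinese SNPs also reported in Yoruba, per Phred bin.

    records: iterable of (phred, seen_in_yoruba) pairs. This layout is
    an assumption for the sketch, not the format of the real file.
    Returns {bin_start: fraction}, with bins in ascending order.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for phred, seen in records:
        bin_start = (phred // bin_width) * bin_width
        totals[bin_start] += 1
        hits[bin_start] += 1 if seen else 0
    return {b: hits[b] / totals[b] for b in sorted(totals)}
```

If the match rate turned out to be roughly flat across the bins, that would suggest the Phred numbers are not what is driving the difference.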
(The probability for the Yoruba to show an SNP, given that it is seen in
the Chinese, ought to be about 50%: ie about (700/1400) or (450/900);
because both "Chinese NOT Yoruba" and "Chinese AND Yoruba" branches
relate to the same time span, the time from the present back to the
splitting of Haplogroup NOP.
But the observed conditional frequency f(Yoruba SNP|Chinese SNP) is in
fact only about (450/1400) = 32%; even having excluded 135 SNPs that
appear to be in Chinese-only zones, described in point 3 at
Why this difference? If the Yoruba team, like the Chinese, are only
reporting SNPs when they have a 99% chance (a Phred score of 20), rather
than a better-than-50% chance (a Phred score of about 3), then are their
Y-chromosome Phred numbers as a whole significantly lower than the
Chinese, so that many more (true) SNPs are falling in the no man's land,
and not getting reported?
Unless it's in their paper, maybe only somebody on the team could tell us.)
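
The arithmetic behind these figures can be checked with a few lines of Python. This is only a sketch using the rounded counts quoted above (about 450 "Chinese AND Yoruba" and about 950 "Chinese NOT Yoruba"); the function name and the 50% expectation passed as a parameter are just for illustration.

```python
def snp_discrepancy(shared, chinese_only, expected_rate=0.5):
    """Compare the observed f(Yoruba SNP | Chinese SNP) with the expected rate.

    shared:       SNPs seen in both Chinese and Yoruba (about 450)
    chinese_only: SNPs seen in Chinese but not Yoruba (about 950)
    """
    chinese_total = shared + chinese_only
    observed_rate = shared / chinese_total
    # If the Chinese total is right, how many Yoruba SNPs are missing?
    yoruba_shortfall = expected_rate * chinese_total - shared
    # If the shared count is right, how many Chinese-only SNPs are surplus?
    expected_chinese_only = shared * (1 - expected_rate) / expected_rate
    chinese_excess = chinese_only - expected_chinese_only
    return observed_rate, yoruba_shortfall, chinese_excess

rate, shortfall, excess = snp_discrepancy(450, 950)
print(f"observed rate: {rate:.0%}")            # about 32%
print(f"Yoruba SNPs too few: {shortfall:.0f}")  # about 250
print(f"Chinese SNPs too many: {excess:.0f}")   # about 500
```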
2. To make it easier to see the likely ancestral/derived state of SNPs,
and to see the potential SNPs that might be worth investigating for
Haplogroup G, I have now colour-coded position numbers in the spreadsheet
for regions where the HUGO sequence was *not* taken from RP-11, and have
uploaded an updated version
(Genotyping of HUGO discussed in more detail at
3. A couple of people have noted off-list that some of the haplogroup
assignments highlighted in green are not as detailed as they might be.
For example, the SNPs L53, L54 and L55 are believed to indicate Q1a3a, and
have only been found in people in that subgroup; but on the sheet they
are indicated as Q1a3+.
Similarly, in my original post I noted 14 SNPs not derived in the Yoruba
data that I suspected might be from E1b1a8 and E1b1a8a. These are all
marked as "E1b1+", even though they have only been seen under E1b1a in
23&Me samples, never under E1b1b.
I should therefore clarify that I believe Adriano adopts quite a
conservative approach on his 23&Me spreadsheet, giving only the most
conservative assessment which *must* be made from the sample that has
tested at 23&Me; rather than a correlation that might at first sight
seem more striking.
Thus the "E1b1+" indication is given because there are samples under
E1b1 which are ancestral (the whole of E1b1b, in fact), so it is known
the SNP must occur *somewhere* below E1b1. But it is not indicated as
"E1b1a+" or "E1b1a", even though all current 23&Me samples showing
derived are in E1b1a, because Adriano can't rule out that a future sample
from 'E1b1-something else' might also show as derived.
At least, I think that's what Adriano is doing, and I took the
assignments from his spreadsheet without looking into it much further.
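
To make the rule concrete, here is a small Python sketch of the conservative assignment as I understand it. It is my own reconstruction, not Adriano's actual procedure. It assumes YCC-style names in which each extra letter or number goes one step down the tree, so that the deepest clade holding all the derived samples and at least one ancestral sample is their longest shared prefix. The function names and sample lists are made up for illustration.

```python
import re

def tokens(name):
    """Split a YCC-style name, e.g. 'E1b1a8a' -> ['E','1','b','1','a','8','a']."""
    return re.findall(r"[A-Z]+|\d+|[a-z]", name)

def common_prefix(a, b):
    """Longest shared leading run of two token lists."""
    out = []
    for x, y in zip(a, b):
        if x != y:
            break
        out.append(x)
    return out

def conservative_label(derived, ancestral):
    """Deepest clade that holds every derived sample and at least one
    ancestral sample, written with a trailing '+' ("somewhere below").

    derived, ancestral: non-empty lists of haplogroup names of tested
    samples. An ancestral sample sharing no prefix with the derived
    ones contributes nothing (the result would then be just '+').
    """
    if not derived or not ancestral:
        raise ValueError("need at least one derived and one ancestral sample")
    clade = tokens(derived[0])
    for name in derived[1:]:
        clade = common_prefix(clade, tokens(name))
    best = max((common_prefix(clade, tokens(a)) for a in ancestral), key=len)
    return "".join(best) + "+"

# Derived only under E1b1a, ancestral samples only under E1b1b:
print(conservative_label(["E1b1a8", "E1b1a8a"], ["E1b1b1", "E1b1b1a"]))  # E1b1+
```

The label only moves further down once an ancestral sample turns up closer to the derived ones, which is why it stays at "E1b1+" however many derived samples are seen under E1b1a.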
Though the question does arise (for ISOGG and other tree-makers, as well
as Adriano), how many samples do you need to see before you make the
call that the SNP is "indistinguishable" from a subgroup, rather than
merely saying it occurs "under" the group above?
Anyway, this is just to clarify what I think Adriano does, which is what
I copied into the spreadsheet.
Note that I'm not particularly intending to maintain the spreadsheet, so
this column is likely to drift out of date, both as future 23&Me samples
allow Adriano to make more and more refined haplogroup assignments, and
as future tree revisions lead to nomenclature changes. The assessments
given should therefore be seen as reflecting a snapshot in time, only.