From: James Heald <>
Subject: Re: [DNA] Comparison of Chinese, Yoruba,Watson and Venter genome y-snps
Date: Mon, 06 Apr 2009 17:10:15 +0100
Tim, and John,
Thank you very much for the links, commentary and analysis.
To follow up some points:
1. The extraction of the Chinese Y-SNPs was not anything very clever.
There is a file YH-SNPs.gff that is available from
http://yh.genomics.org.cn/download.jsp (under "2 YH variants and
annotations"), which contains all the SNPs found in the whole genome.
The file is quite big (385 Mb), but it is then easy to extract the
Y-chromosome ones (lines which start "chrY"), using the unix utility
"grep". This gives a file which is a more manageable 180k.
That file contains a little more detail than I extracted, including a
column for the probability that it really was an SNP, and a comment
discussing the local neighbourhood of the sequence.
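
For anyone without grep to hand, the same filtering step can be sketched in a few lines of Python. This is only an illustration: the function name and the made-up sample rows are assumptions, and the only detail taken from the description above is that the wanted lines start with "chrY".

```python
from typing import Iterable, Iterator


def y_chromosome_lines(lines: Iterable[str], prefix: str = "chrY") -> Iterator[str]:
    """Yield only the lines that start with the given chromosome label."""
    for line in lines:
        if line.startswith(prefix):
            yield line


# Made-up rows in a GFF-like tab-separated layout (NOT real YH data).
sample = [
    "chr1\tsrc\tSNP\t1000\t1000\t.\t+\t.\tnote=a\n",
    "chrY\tsrc\tSNP\t2000\t2000\t.\t+\t.\tnote=b\n",
    "chrX\tsrc\tSNP\t3000\t3000\t.\t+\t.\tnote=c\n",
    "chrY\tsrc\tSNP\t4000\t4000\t.\t+\t.\tnote=d\n",
]

y_only = list(y_chromosome_lines(sample))
print(len(y_only), "Y-chromosome lines kept")
```

For the real download, the same function could be fed an open file handle instead of the sample list, so the 385 Mb file is read line by line rather than loaded whole.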
2. Mutation L52 (rs13304168) occurs at position 13151202 -- part of the
short stretch where HUGO is in Haplogroup G. Therefore, Venter and
Watson show a mutation, compared to the reference sequence; but the
Chinese and the Yoruba do not. It therefore appears on the second
spreadsheet, not the first one.
3. Tim's analysis by "million" is very interesting. The stretches
where there seems to be a particular excess of mutations in the Chinese
sequence compared to the Yoruba can be located more precisely at:
* ChrY:11700739..11722747 (53 SNPs)
* ChrY:11747533..11792857 (82 SNPs)
* ChrY:18022464..19270159 (29 SNPs)
* ChrY:20753369..20846749 (37 SNPs)
These are stretches where there are "runs" of the given number of SNPs
in the Chinese genome uninterrupted by any SNPs on the Yoruba genome --
something rather unlikely if the SNPs were simply distributed randomly.
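
One way such runs could be found is sketched below in Python. It is an assumed reconstruction, not necessarily how the ranges above were produced: the function name, the `min_run` threshold and the toy positions are all hypothetical. A run is taken to be a stretch of consecutive Chinese-only SNP positions with no Yoruba SNP position falling among them.

```python
def chinese_only_runs(chinese, yoruba, min_run=20):
    """Return (start, end, count) for each run of Chinese-only SNP positions
    that is not interrupted by any Yoruba SNP position."""
    yoruba_set = set(yoruba)
    positions = sorted(set(chinese) | yoruba_set)

    runs, current = [], []
    for pos in positions:
        if pos in yoruba_set:
            # Any Yoruba SNP (shared or not) ends the current run.
            if len(current) >= min_run:
                runs.append((current[0], current[-1], len(current)))
            current = []
        else:
            current.append(pos)
    # Flush a run that reaches the end of the data.
    if len(current) >= min_run:
        runs.append((current[0], current[-1], len(current)))
    return runs


# Toy positions only, chosen to make the behaviour easy to follow.
chinese = [100, 110, 120, 130, 200, 300, 310, 320]
yoruba = [150, 200, 250]
runs = chinese_only_runs(chinese, yoruba, min_run=3)
print(runs)
```

Treating a shared position as an interruption is a design choice here; counting it as part of the run would be equally easy to code but would blur the "Chinese NOT Yoruba" meaning of a run.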
Looking at them in Thomas Krahn's Ymap browser, the first region is (I
think) bang on the centromere of the chromosome. I don't know what the
significance of this might be -- would this be an area where
recombination might be particularly possible?
My suspicion is that the first two represent runs of unusually many
SNPs in the Chinese data -- perhaps indicating a mutation process
that changed several locations at once? Or
perhaps different reporting criteria for SNPs, particularly relevant in
this area? The last two, by contrast, seem to represent the Yoruba data
calling a surprisingly low number of SNPs.
Some of these Chinese SNPs were previously known, with RS numbers,
in all but the third of these ranges, so they presumably can't be
completely off the wall, even though none of them seems to have any
phylogenetic data yet.
4. If Haplogroup DE is about 65,000 years old, then the
Yoruba SNP data, in which 451 SNPs out of a total of 1756 are shared
with the Chinese, would suggest an age for the split in Haplogroup NOP of
about (451/1756)*2*65000 = 33,400 years (+/- appropriate margins of
error). Wikipedia currently gives a date for Haplogroup NOP of
30-35,000 years ago, so this is encouraging. (Of course many of the
SNPs were previously known and used by Karafet et al to establish these
same dates, but it is good to know that the new ones seem to follow the
1084 SNPs are shown in the Chinese genome but not the Yoruba one. Even
if we exclude 135 from the first two categories above, as likely all
reflecting some similar linked factor, that still leaves 949. Is the
difference simply random noise; are 451 and 949 simply two realisations
of a Poisson process with mean 700? It is unlikely. Such a Poisson
process would have a standard deviation of sqrt(700), or about 26
-- so this difference is far more than would arise from simple random
noise. I wondered in my original post whether there might be something
in the Chinese history that had led to much higher mutation fixation
rates. But perhaps John is nearer the mark, when he wrote "I wonder if
the computer software... might be discarding them as ambiguous" --
particularly if, as he suggests, the software may be weighted to reject
differences from HUGO, at least if there is not already something in
dbSNP. Also, perhaps SNPs in satellite sequences and repeat sequences
may not always be reported. If 250 mutations in the "Chinese NOT
Yoruba" list were shifted to the "Chinese AND Yoruba" list, the numbers
would balance -- and that need only represent the most marginal
one-sixth of the Chinese SNP calls (to which I applied no quality
control or quality threshold beyond the raw Chinese report).
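
The arithmetic in this point can be re-run with a short Python sketch. The counts and the 65,000-year figure are the ones quoted above; the variable names are illustrative, and the "deviation in standard deviations" line is just a restatement of the Poisson argument, not a formal test.

```python
import math

de_age_years = 65_000   # assumed age of Haplogroup DE, as quoted in the text
yoruba_total = 1756     # Yoruba Y-SNPs against the reference
shared = 451            # of those, also seen in the Chinese genome
chinese_only = 1084     # Chinese SNPs not seen in the Yoruba
excluded = 53 + 82      # the two dense runs set aside in the text

# Age estimate for the NOP split, using the formula exactly as written above.
nop_age = shared / yoruba_total * 2 * de_age_years

# Poisson check: could 451 and 949 both be draws with the same mean?
remaining = chinese_only - excluded       # 949
mean = (shared + remaining) / 2           # 700.0
sd = math.sqrt(mean)                      # Poisson sd is sqrt(mean)
z = (remaining - mean) / sd               # deviation in units of sd

print(f"NOP split estimate: {nop_age:,.0f} years")
print(f"remaining Chinese-only SNPs: {remaining}")
print(f"Poisson sd for mean {mean:.0f}: {sd:.1f}; deviation: {z:.1f} sd")
```

Each count sits roughly nine standard deviations from the common mean, which is the sense in which the difference is far more than simple random noise.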
5. Tim says he's estimated the age of P312 as about 3500 years.
Wikipedia's R1b page puts the age of L48 at about 4000 years. If we
were to estimate the age of the HUGO-R sequence splitting from Venter
and Watson at about 5000 years ago, that would mean we should expect to
see about (5000/130000)*1756 = 68 mutations shared by the Yoruba,
Chinese, Venter and Watson; and a similar further number of mutations in
either Watson or Venter compared to HUGO, not shared with the Yoruba or
Chinese, of which perhaps a quarter or a third might be common to Watson
Interestingly, there are about 60 SNPs which are positive in the Yoruba
and the Chinese and Venter, as compared with the reference sequence.
Nine are known to relate to Haplogroup G, from locations where HUGO is in
G. Two more are P312 and L20, in Haplogroup R.
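
As a rough cross-check of the expectation in this point, the sketch below recomputes (5000/130000)*1756 and sets it beside the "about 60" SNPs actually seen. Treating 60 as the directly comparable observed count is an assumption, as is using a Poisson standard deviation as the yardstick; the variable names are illustrative.

```python
import math

split_age_years = 5_000      # assumed age of the HUGO-R split, from the text
path_years = 2 * 65_000      # the 130,000 years used in the text's formula
yoruba_total = 1756          # Yoruba Y-SNPs against the reference

expected_shared = split_age_years / path_years * yoruba_total
observed_shared = 60         # "about 60" in the text; an approximate figure

# Poisson yardstick: sd is the square root of the expected count.
poisson_sd = math.sqrt(expected_shared)
within_one_sd = abs(observed_shared - expected_shared) <= poisson_sd

print(f"expected about {expected_shared:.0f}, observed about {observed_shared}")
print(f"Poisson sd {poisson_sd:.1f}; within one sd: {within_one_sd}")
```

On these assumptions the observed count is within one standard deviation of the expected one, though the observed figure also includes SNPs attributed to the HUGO-in-G region, so the comparison is only indicative.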
A further eleven are on the 23andMe chip. Currently none of them are
derived in any of Adriano's sample; but only four of his sample are
L20+, so it remains to be seen whether somebody may still appear who
comes in sharing a subgroup with HUGO on one or more of these markers.
(One, rs9786556, is at 13678833 so may be a private mutation associated
with HUGO in Haplogroup G, rather than R-L20+).
Of the remaining thirty-eight, twelve may be more likely to be
associated with (subgroups of) Haplogroup G:
12933864 rs9786460 (may be under Haplogroup K)
(I'm still not exactly sure where the boundaries are for the HUGO-in-G
region, nor exactly how sharp they are; but these are all either in or
close to that region).
Before anyone starts a walk-on-the-Y in Haplogroup G, it might be worth
That leaves 26 apparently most likely to be in Haplogroup R below P312:
6. Tim found a lot more than 130 SNPs for Venter in total (or even 200,
if the Venter SNPs were, like the Chinese SNPs, more likely to be reported
than Yoruba SNPs). So there's clearly more to understand about them. I
think they may be useful (as above) to prospect with for
phylogenetically interesting mutations; but at this stage I think John
is right, there is simply too much unclear about the Venter SNPs (and
even more so the Watson ones) for them to help with dates - which would
anyway be relying on only a very small number of SNPs, so would still be
As to prospecting, one thing I wonder, since the Chinese data has been
around for a while, do we know whether anyone perhaps like Ethnoancestry
may have already looked at these mutations, to see whether they do have
any potential to further clarify any groups in R? Is it possible any of
them may have already been genotyped against a panel, and those results
if negative never publicised?
7. A couple more ranges of interest shown up by the Venter SNPs, when
compared with Chinese and/or Yoruba SNPs:
The latter shows a sudden run of 16 SNPs, where the Venter sequence
agrees with the Yoruba (but not the Chinese); apart from two locations,
where Venter agrees with the Chinese but not the Yoruba against the
The former contains a run of eight SNPs that Watson or Venter share with
the Chinese. Unfortunately, the first half are marked "heterozygous" in
the Venter data, so not reliable. But even so, this is a very striking
sudden density of SNPs.
According to papers cited in the posts Tim linked, there were two short
stretches of the Y chromosome where the HUGO sequence was filled in from
another source, in addition to the section taken from a Haplogroup G
individual. I wonder if the patterns here could be related to that?
All best for now,