GENEALOGY-DNA-L Archives

From: James Heald <>
Subject: Re: [DNA] Comparison of Chinese, Yoruba,Watson and Venter genome y-snps
Date: Mon, 06 Apr 2009 17:10:15 +0100
References: <200904051042.n35AgVWO030245@mail.rootsweb.com><8644F32C911145A09C8DE5F2E0103BFB@john>
In-Reply-To: <8644F32C911145A09C8DE5F2E0103BFB@john>


Tim, and John,

Thank you very much for the links, commentary and analysis.

To follow up some points:

1. The extraction of the Chinese Y-SNPs was not anything very clever.
There is a file YH-SNPs.gff that is available from
http://yh.genomics.org.cn/download.jsp (under "2 YH variants and
annotations"), which contains all the SNPs found in the whole genome.
The file is quite big (385 Mb), but it is then easy to extract the
Y-chromosome ones (lines which start "chrY"), using the unix utility
"grep". This gives a file which is a more manageable 180k.

That file contains a little more detail than I extracted, including a
column for the probability that it really was an SNP, and a comment
discussing the local neighbourhood of the sequence.
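
For anyone without grep to hand, the extraction step can be sketched in
Python. This is only a minimal sketch: it assumes nothing beyond what is
said above (the Y-chromosome lines start with "chrY"), and the function
names and the output file name are made up for illustration.

```python
from typing import Iterable, Iterator


def chr_y_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the lines that start with "chrY" (the same test as grep '^chrY')."""
    for line in lines:
        if line.startswith("chrY"):
            yield line


def extract_chr_y(src_path: str, dst_path: str) -> int:
    """Stream src_path into dst_path, keeping chrY lines; return how many were kept."""
    kept = 0
    # Reading line by line means the 385 Mb input never has to fit in memory.
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in chr_y_lines(src):
            dst.write(line)
            kept += 1
    return kept
```

It would be called as extract_chr_y("YH-SNPs.gff", "YH-SNPs-chrY.gff"),
where the second file name is hypothetical.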

2. Mutation L52 (rs13304168) occurs at position 13151202 -- part of the
short stretch where HUGO is in Haplogroup G. Therefore, Venter and
Watson show a mutation, compared to the reference sequence; but the
Chinese and the Yoruba do not. It therefore appears on the second
spreadsheet, not the first one.

3. Tim's analysis by "million" is very interesting. The stretches
where there seems to be a particular excess of mutations in the Chinese
sequence compared to the Yoruba can be located more precisely, at:
* ChrY:11700739..11722747 (53 SNPs)
* ChrY:11747533..11792857 (82 SNPs)
* ChrY:18022464..19270159 (29 SNPs)
* ChrY:20753369..20846749 (37 SNPs)
These are stretches where there are "runs" of the given number of SNPs
in the Chinese genome uninterrupted by any SNPs on the Yoruba genome --
something rather unlikely if the SNPs were simply distributed randomly.
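
The run-finding can be sketched as below. This illustrates the idea
rather than the exact procedure used: it assumes two lists of SNP
positions, one per genome, and treats any position in the Yoruba list
(including SNPs shared with the Chinese) as interrupting a run. The
function name and the min_len parameter are hypothetical.

```python
from typing import Iterable, List, Tuple


def chinese_only_runs(
    chinese: Iterable[int], yoruba: Iterable[int], min_len: int = 2
) -> List[Tuple[int, int, int]]:
    """Return (start, end, count) for each maximal run of Chinese SNP
    positions that has no Yoruba SNP position inside it."""
    yoruba_set = set(yoruba)
    # Walk every SNP position from either genome in chromosome order.
    positions = sorted(set(chinese) | yoruba_set)
    runs: List[Tuple[int, int, int]] = []
    current: List[int] = []
    for pos in positions:
        if pos in yoruba_set:
            # A Yoruba SNP (shared or not) ends the current run.
            if len(current) >= min_len:
                runs.append((current[0], current[-1], len(current)))
            current = []
        else:
            current.append(pos)
    # Close a run that reaches the end of the data.
    if len(current) >= min_len:
        runs.append((current[0], current[-1], len(current)))
    return runs
```

Sorting the runs by count would then bring the longest stretches to the
top of the list.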

Looking at them in Thomas Krahn's Ymap browser, the first region is (I
think) bang on the centromere of the chromosome. I don't know what the
significance of this might be -- would this be an area where
recombination might be particularly possible?

My suspicion is that the first two represent two runs of unusually many
SNPs in the Chinese data -- perhaps indicating a mutation process that
has changed several locations all at once? Or perhaps different
reporting criteria for SNPs, particularly relevant in this area? The
last two, by contrast, seem to represent the Yoruba data calling a
surprisingly low number of SNPs.

Some of these Chinese SNPs were previously known, with RS numbers, in
all but the third of these ranges, so they presumably can't be
completely off the wall, even though none of them seems to have any
phylogenetic data yet.

4. If the age of Haplogroup DE is about 65,000 years, then the Yoruba
SNP data, in which 451 SNPs out of a total of 1756 are shared with the
Chinese, would suggest an age for the split in Haplogroup NOP of
about (451/1756)*2*65000 = 33,400 years (+/- appropriate margins of
error). Wikipedia currently gives a date for Haplogroup NOP of
30-35,000 years ago, so this is encouraging. (Of course many of the
SNPs were previously known and used by Karafet et al to establish these
same dates, but it is good to know that the new ones seem to follow the
pattern).
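
The arithmetic can be checked with a couple of lines of Python. This is
just the formula above written out; the function name is made up, and
the factor of 2 reflects my reading that the Yoruba and reference
lineages are separated by two branches, each as long as the age of DE.

```python
def split_age_years(shared_snps: int, total_snps: int, root_age_years: float) -> float:
    """Scale the shared fraction of SNPs by the total path length
    (2 * root_age_years) between the two lineages being compared."""
    return shared_snps / total_snps * 2 * root_age_years


# Figures from the text: 451 shared out of 1756, Haplogroup DE at 65,000 years.
age = split_age_years(451, 1756, 65_000)
print(round(age))       # 33388
print(round(age, -2))   # 33400.0
```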

1084 SNPs are shown in the Chinese genome but not the Yoruba one. Even
if we exclude 135 from the first two categories above, as likely all
reflecting some similar linked factor, that still leaves 949. Is the
difference simply random noise; are 451 and 949 simply two realisations
of a Poisson process with mean 700? It is unlikely. Such a Poisson
process would have a standard deviation of about sqrt(700), or about 26
-- so this difference is far more than would arise from simple random
noise. I wondered in my original post whether there might be something
in the Chinese history that had led to much higher mutation fixation
rates. But perhaps John is nearer the mark, when he wrote "I wonder if
the computer software... might be discarding them as ambiguous" --
particularly if, as he suggests, the software may be weighted to reject
differences from HUGO, at least if there is not already something in
dbSNP. Also, perhaps SNPs in satellite sequences and repeat sequences
may not always be reported. If 250 mutations in the "Chinese NOT
Yoruba" list were shifted to the "Chinese AND Yoruba" list, the numbers
would balance -- and that need only represent the most marginal
one-sixth of the Chinese SNP calls (which I took straight from the raw
Chinese report, without any quality control or quality threshold).
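
For anyone who wants to check the numbers, here is a rough sketch in
Python. It assumes, as above, that both counts are single draws from a
Poisson process with a common mean, and it takes the total number of
Chinese calls to be 451 + 1084; the variable names are my own.

```python
import math

shared = 451                       # Chinese AND Yoruba
chinese_only = 1084 - (53 + 82)    # Chinese NOT Yoruba, minus the first two runs

mean = (shared + chinese_only) / 2     # common Poisson mean under the null
sd = math.sqrt(mean)                   # Poisson standard deviation = sqrt(mean)
z = (chinese_only - mean) / sd         # distance of each count from the mean, in SDs

# Moving 250 calls from one list to the other roughly balances them.
shift = 250
balanced = (shared + shift, chinese_only - shift)
fraction_of_chinese_calls = shift / (451 + 1084)

print(chinese_only, mean, round(sd, 1), round(z, 1))
print(balanced, round(fraction_of_chinese_calls, 3))
```

Each count sits more than nine standard deviations from the common
mean, which is why simple random noise looks so unlikely.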

5. Tim says he's estimated the age of P312 as about 3500 years.
Wikipedia's R1b page puts the age of L48 at about 4000 years. If we
were to estimate the age of the HUGO-R sequence splitting from Venter
and Watson at about 5000 years ago, that would mean we should expect to
see about (5000/130000)*1756 = 68 mutations shared by the Yoruba,
Chinese, Venter and Watson; and a similar further number of mutations in
either Watson or Venter compared to HUGO, not shared with the Yoruba or
Chinese, of which perhaps a quarter or a third might be common to Watson
and Venter.
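
The same scaling can be run in the other direction to get the expected
count. Again this is only a sketch of the arithmetic above, with a
made-up function name and the 5000-year split age taken as an
assumption.

```python
def expected_snps(split_age_years: float, total_snps: int = 1756,
                  path_years: float = 2 * 65_000) -> float:
    """Expected number of SNPs on a branch of the given age, if SNPs
    accumulate evenly along the 130,000-year Yoruba-to-reference path."""
    return split_age_years / path_years * total_snps


n = expected_snps(5000)
# "Perhaps a quarter or a third" of a similar number, common to Watson and Venter.
low, high = n / 4, n / 3
print(round(n), round(low), round(high))
```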

Interestingly, there are about 60 SNPs which are positive in the Yoruba
and the Chinese and Venter, as compared with the reference sequence.
Nine are known to relate to Haplogroup G, from locations where HUGO is
in G. Two more are P312 and L20, in Haplogroup R.

A further eleven are on the 23&Me chip. Currently none of them is
derived in anyone in Adriano's sample; but only four people in his
sample are L20+, so it remains to be seen whether somebody may yet
appear who shares a subgroup with HUGO on one or more of these markers.
(One, rs9786556, is at 13678833 so may be a private mutation associated
with HUGO in Haplogroup G, rather than R-L20+).

Of the remaining thirty-eight, twelve may be more likely to be
associated with (subgroups of) Haplogroup G:
12581348 rs7892986
12581351 rs7892905
12654641 rs9785657
12933864 rs9786460 (may be under Haplogroup K)
13095332 rs7893102
13467612 rs13304344
13502752 rs11799152
13603793 rs9786878
13603794 rs9785906
13661711 rs35304448
13689485 rs62614587
13784594 rs1125978
(I'm still not exactly sure where the boundaries are for the HUGO-in-G
region, nor exactly how sharp they are; but these are all either in or
close to that region).

Before anyone starts a walk-on-the-Y in Haplogroup G, it might be worth
investigating these.

That leaves 26 apparently most likely to be in Haplogroup R below P312:
6111248 rs9786609
6992790 rs7892861
7310235 rs35955169
7381330 rs6530623
7424282 rs7067275
10420794 rs7067387
11654673 rs62610050
11698659
12057023
12132090
12417422 rs62617915
15229407 rs28819996
15514463 rs13304223
15907039 rs13305372
16354175 rs11799151
17554518 rs13304804
19613854 rs10465460
20688657
20854047 rs35546687
22382982 rs2178500
22815250 rs34173183
22857377 rs6530626
22863230 rs2019059
27198031 rs35733966
57435386 rs12171801
57437745 rs11152878

6. Tim found a lot more than 130 SNPs for Venter in total (or even 200,
if the Venter SNPs were, like the Chinese SNPs, more likely to be
reported than Yoruba SNPs). So there's clearly more to understand about
them. I
think they may be useful (as above) to prospect with for
phylogenetically interesting mutations; but at this stage I think John
is right, there is simply too much unclear about the Venter SNPs (and
even more so the Watson ones) for them to help with dates - which would
anyway be relying on only a very small number of SNPs, so would still be
very uncertain.

As to prospecting, one thing I wonder: since the Chinese data has been
around for a while, do we know whether anyone (perhaps Ethnoancestry)
may have already looked at these mutations, to see whether they have
any potential to further clarify any groups in R? Is it possible that
some of them have already been genotyped against a panel, and that the
results, if negative, were never publicised?

7. A couple more ranges of interest shown up by the Venter SNPs, when
compared with Chinese and/or Yoruba SNPs:

* ChrY:19612997..19614411
* ChrY:22309513..22310769

The latter shows a sudden run of 16 SNPs, where the Venter sequence
agrees with the Yoruba (but not the Chinese); apart from two locations,
where Venter agrees with the Chinese but not the Yoruba against the
reference.

The former contains a run of eight SNPs that Watson or Venter share with
the Chinese. Unfortunately, the first half are marked "heterozygous" in
the Venter data, so not reliable. But even so, this is a very striking
sudden density of SNPs.

According to papers cited in the posts Tim linked, there were two short
stretches of the Y chromosome where the HUGO sequence was filled in from
another source, in addition to the section taken from a Haplogroup G
individual. I wonder if the patterns here could be related to that?


All best for now,

James.




