GENEALOGY-DNA-L Archives

From: Charles <>
Subject: Re: [DNA] DNAPrint genomics, Inc. 2.0 test results fails my test
Date: Wed, 28 May 2003 11:31:56 -0400
References: <JCHBN.030527.164432.RC0@CUVMB.CC.COLUMBIA.EDU>

John,

Nice explanation. Thanks.

Using the sequence data and the BGA percentage results for the dozen or
so participants in Ann Turner's spreadsheet, do you have enough
information to write a computer program to reverse engineer the
algorithm DNAPrint is using to calculate the BGA results from the
sequence data? If we knew the algorithm being used, then others might
be able to add further to this discussion. Just wondering out loud
whether we have enough information in Ann Turner's database to
reverse engineer the algorithm. How many variables and how many
unknowns are there? Do we have enough data? Or is DNAPrint using fuzzy
logic and math techniques?

Charles Kerchner

"John F. Chandler" wrote:
>
> Cecilia wrote:
> > I'm not ready to "flunk" DNAPrint yet - even though it might need more
> > tweaking. It would help if the details of their methods were not hidden
> > behind "it's proprietary."
>
> Actually, their method is not secret. It's just computer-intensive. I
> think it might help if I cook up an extremely simplified example to show
> how it's done. In this example, we will assume the world is divided into
> only two groups, the Greens and the Blues. We will use only one marker,
> which has alleles A and T. Here are the allele frequencies:
>
>        Green   Blue
>   A     50%     10%
>   T     50%     90%
>
> Since there are two copies of each marker, we actually need the
> frequencies of the pair combinations. We can calculate these from the
> above quite simply -- if "a" is the fraction of A in the gene pool,
> then a*a is the fraction of AA pairs, 2(1-a)a is the fraction of TA,
> and (1-a)*(1-a) is the fraction of TT. For example, Blues have a=0.1,
> so TT should occur (1-0.1)*(1-0.1)=0.81 = 81%. Note that there are only
> three possible outcomes of this simple test (ignoring dropouts).
>
>        Green   Blue
>   AA    25%      1%
>   TA    50%     18%
>   TT    25%     81%
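
A minimal Python sketch of the pair-frequency calculation above, in case
it helps to see the arithmetic spelled out. The function name
genotype_freqs is made up for illustration; the only inputs are the
allele frequencies given in the example (a=0.5 for Greens, a=0.1 for
Blues).

```python
def genotype_freqs(a):
    """Expected fractions of AA, TA and TT pairs when A has frequency a."""
    return {
        "AA": a * a,
        "TA": 2 * (1 - a) * a,
        "TT": (1 - a) * (1 - a),
    }

green = genotype_freqs(0.5)  # Greens: A is 50% of the gene pool
blue = genotype_freqs(0.1)   # Blues: A is 10% of the gene pool

# Print one row per pair, Green column first, then Blue.
for pair in ("AA", "TA", "TT"):
    print(f"{pair}  {green[pair]:4.0%}  {blue[pair]:4.0%}")
```

The three fractions always add up to 1, which is why each column of the
table sums to 100%.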
>
> Now consider the following mixtures: all Green, 3/4 Green, 1/2 Green,
> 1/4 Green, and all Blue. We can compute a table that covers all these
> instead of just the "pure" populations. To do this, we simply go back
> and reset the value "a" according to the mixture. For example,
> half-and-half would have a=0.5*(0.5+0.1)=0.3 (just the average of the
> two pure frequencies). In this case, a*a=0.09 = 9%, the frequency of
> AA, and so on.
>
> ---- Green -- 3/4G+1/4B - 1/2G+1/2B - 1/4G+3/4B -- Blue
> AA -- 25% ----- 16% -------- 9% -------- 4% ------- 1%
> TA -- 50% ----- 48% ------- 42% ------- 32% ------ 18%
> TT -- 25% ----- 36% ------- 49% ------- 64% ------ 81%
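
The mixture table can be reproduced with a short Python sketch. It
assumes, as in the example, that the mixed value of "a" is the weighted
average of the two pure frequencies; the names mixed_allele_freq and
genotype_freqs are hypothetical helpers, not anything from DNAPrint.

```python
def mixed_allele_freq(green_share, a_green=0.5, a_blue=0.1):
    """Frequency of allele A in a gene pool that is green_share Green."""
    return green_share * a_green + (1 - green_share) * a_blue

def genotype_freqs(a):
    """Expected fractions of AA, TA and TT pairs when A has frequency a."""
    return {"AA": a * a, "TA": 2 * (1 - a) * a, "TT": (1 - a) * (1 - a)}

# Columns of the table: all Green, 3/4, 1/2, 1/4, all Blue.
shares = [1.0, 0.75, 0.5, 0.25, 0.0]
table = {s: genotype_freqs(mixed_allele_freq(s)) for s in shares}

# One row per pair, one column per mixture.
for pair in ("AA", "TA", "TT"):
    cells = "  ".join(f"{table[s][pair]:4.0%}" for s in shares)
    print(pair, cells)
```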
>
> Here is where we switch gears. Each column in the above table has
> numbers adding up to 100% and represents all possible outcomes given
> a particular mixture. Now, suppose we know the outcome but not the
> mixture -- if we assume that all possible mixtures are equally likely
> (i.e., that we don't know the ethnicity in advance), then we can look
> at each ROW of the table as a list of the relative likelihood of the
> various mixtures given a particular test outcome.
>
> I.e., we can pick out the maximum likelihood value by simply reading
> across each row: for AA and TA, the maximum likelihood is all Green,
> while for TT, the maximum is all Blue. I think you can see that the
> numbers vary smoothly along each row, and you can imagine filling in a
> table with much finer gradations of mixture. That isn't necessary for
> reading off the maximum in this simple setup, but it does allow finding
> the various confidence limits. Also, in this simple example, we don't
> need a triangle plot to represent the answer graphically; a line will
> do. I hope this lines up properly (the row of dots is the range from
> all Green to all Blue, the "R" marks the MLE, "y" marks the 1/2
> confidence limit, "b" the 1/5 confidence limit, and "k" the 1/10
> confidence limit). Just as in the DNAPrint graphs, a mark that "should
> be" off the end is simply placed at the end...
>
> AA R................y............b............k.| 100% G 0% B
> TA R......................................y.....b 100% G 0% B
> TT b...............y............................R 0% G 100% B
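
Here is a rough Python sketch of the "finer gradations" idea: scan the
Green share from 0 to 1 in small steps, take the share with the highest
likelihood as the MLE, and report the range of shares whose likelihood
stays within a given fraction of the maximum. Reading the 1/2, 1/5 and
1/10 limits as "likelihood at least that fraction of the maximum" is an
assumption on my part, and the function names and the 1000-step grid are
arbitrary choices for illustration.

```python
A_GREEN, A_BLUE = 0.5, 0.1  # allele-A frequencies from the example

def likelihood(genotype, green_share):
    """Chance of seeing `genotype` for a Green share between 0 and 1."""
    a = green_share * A_GREEN + (1 - green_share) * A_BLUE
    return {"AA": a * a, "TA": 2 * a * (1 - a), "TT": (1 - a) ** 2}[genotype]

def scan(genotype, steps=1000):
    """Return (MLE Green share, {ratio: (low, high)}) from a grid scan."""
    grid = [i / steps for i in range(steps + 1)]
    like = [likelihood(genotype, g) for g in grid]
    best = max(like)
    mle = grid[like.index(best)]
    limits = {}
    for ratio in (1 / 2, 1 / 5, 1 / 10):
        # Assumed meaning of the limits: shares whose likelihood is at
        # least `ratio` times the maximum (restricted to 0-100% Green).
        inside = [g for g, value in zip(grid, like) if value >= ratio * best]
        limits[ratio] = (min(inside), max(inside))
    return mle, limits

for genotype in ("AA", "TA", "TT"):
    mle, limits = scan(genotype)
    print(genotype, "MLE Green share:", mle)
    for ratio, (low, high) in limits.items():
        print(f"  within 1/{round(1 / ratio)} of the maximum: {low:.3f} to {high:.3f}")
```

A limit that comes out at 0 or 1 here corresponds to a mark that is
"simply placed at the end" in the line plot.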
>
> Note that the confidence limits go all the way across the plot. This
> simple example is woefully short on data, so you can't expect to get a
> precise answer. If we added a second marker, there would be 9 possible
> test results, and we could again list them all and calculate the
> percentages for each and read off the MLE as before, but with somewhat
> narrower confidence limits.
>
> Now for the extra-credit exercise...
>
> Observe that the results presented above were all-or-nothing, but that is
> simply because we looked only in the range 0-100% Green. What would we
> find with 5/4 Green + (-1/4) Blue?? The frequency for AA with that
> "mixture" is 36%, i.e., HIGHER than the MLE percentage. In other words,
> the "MLE" was artificially restricted, and the true maximum is what you
> might call "greener than Green". This is not just a fluke resulting
> from the simplicity of this example -- it follows from the fact that
> the allele frequencies are not (oppositely) saturated in the two
> populations.
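
The extra-credit arithmetic can be checked with a few lines of Python.
This sketch simply lets the Green share run past 1, using the same
weighted-average rule as before; allele_a_freq and aa_likelihood are
hypothetical helper names.

```python
def allele_a_freq(green_share, a_green=0.5, a_blue=0.1):
    """Weighted-average frequency of A; green_share may exceed 1 here."""
    return green_share * a_green + (1 - green_share) * a_blue

def aa_likelihood(green_share):
    """Fraction of AA pairs for the given (possibly extrapolated) share."""
    a = allele_a_freq(green_share)
    return a * a

inside = aa_likelihood(1.0)    # all Green, the restricted "MLE"
outside = aa_likelihood(1.25)  # 5/4 Green + (-1/4) Blue

print(f"AA at all Green:          {inside:.0%}")
print(f"AA at 5/4 Green, -1/4 B:  {outside:.0%}")
```

With a=0.6 the AA fraction is 0.6*0.6, which is the 36% quoted above,
compared with 25% at all Green.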
>
> John Chandler