Conservative Extensions of Linkage Disequilibrium Measures from Pairwise to Multi-loci and Algorithms for Optimal Tagging SNP Selection

Tarpine, Ryan; Lam, Fumei; Istrail, Sorin

Abstract:

We present results on two classes of problems. The first result addresses the long-standing open problem of finding unifying principles for Linkage Disequilibrium (LD) measures in population genetics (Lewontin 1964 [10], Hedrick 1987 [8], Devlin and Risch 1995 [5]). Two desirable properties have been proposed in the extensive literature on this topic, and the mutual consistency between these properties has remained at the heart of statistical and algorithmic difficulties with haplotype and genome-wide association study analysis. The first axiom is (1) the ability to extend LD measures to multiple loci as a conservative extension of pairwise LD. All widely used LD measures are pairwise measures. Despite significant attempts, it is not clear how to naturally extend these measures to multiple loci, leading to a “curse of the pairwise”. The second axiom is (2) the interpretability of intermediate values. In this paper, we resolve this mutual consistency problem by introducing a new LD measure, directed informativeness (the directed graph-theoretic counterpart of the informativeness measure introduced by Halldorsson et al. [6]), and show that it satisfies both of the above axioms. We also show that the maximum informative subset of tagging SNPs based on directed informativeness can be computed exactly in polynomial time for realistic genome-wide data. Furthermore, we present polynomial-time algorithms for optimal genome-wide tagging SNP selection for a number of commonly used LD measures, under the bounded neighborhood assumption for linked pairs of SNPs. One problem in the area is the search for a quality measure for tagging SNP selection that unifies the LD-based methods such as LD-select (implemented in Tagger, de Bakker et al. 2005 [4], Carlson et al. 2004 [3]) and the information-theoretic ones such as informativeness. We show that the objective function of the LD-select algorithm is the Minimal Dominating Set (MDS) on r²-SNP graphs and show that we can compute MDS in polynomial time for this class of graphs. Although in LD-select the “maximally informative” solution is obtained through a greedy algorithm, and is therefore better referred to as “locally maximally informative,” we show that in fact Tagger (LD-select) performs very close to the global maximally informative optimum.
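
To make the dominating-set view of tag SNP selection concrete, the following is a minimal sketch under stated assumptions, not the paper's algorithm or the LD-select/Tagger implementation. It computes pairwise r² from phased 0/1 haplotypes, links SNPs whose r² meets a threshold, and compares a generic greedy dominating-set heuristic with a brute-force exact search. All function names, the 0.8 threshold, and the toy data are hypothetical; the brute-force search is exponential and only stands in for the polynomial-time method the abstract refers to.

```python
from itertools import combinations

def r_squared(haplotypes, i, j):
    """Pairwise r^2 between biallelic SNPs i and j (rows are 0/1 haplotypes)."""
    n = len(haplotypes)
    p_i = sum(h[i] for h in haplotypes) / n
    p_j = sum(h[j] for h in haplotypes) / n
    p_ij = sum(h[i] * h[j] for h in haplotypes) / n
    d = p_ij - p_i * p_j                      # classical disequilibrium coefficient D
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    return 0.0 if denom == 0 else d * d / denom

def ld_graph(haplotypes, threshold=0.8):
    """Closed neighborhoods: SNP s covers itself and every SNP with r^2 >= threshold."""
    m = len(haplotypes[0])
    adj = {s: {s} for s in range(m)}
    for i, j in combinations(range(m), 2):
        if r_squared(haplotypes, i, j) >= threshold:
            adj[i].add(j)
            adj[j].add(i)
    return adj

def greedy_tags(adj):
    """Greedy heuristic: repeatedly pick the SNP covering the most uncovered SNPs."""
    uncovered = set(adj)
    tags = []
    while uncovered:
        best = max(sorted(adj), key=lambda s: len(adj[s] & uncovered))
        tags.append(best)
        uncovered -= adj[best]
    return tags

def exact_tags(adj):
    """Smallest dominating set by exhaustive search (exponential; toy inputs only)."""
    nodes = sorted(adj)
    for k in range(1, len(nodes) + 1):
        for subset in combinations(nodes, k):
            if set().union(*(adj[s] for s in subset)) == set(adj):
                return list(subset)
    return []

# Hypothetical toy data: SNPs 0 and 1 are identical, SNPs 2 and 3 are complementary.
H = [
    [0, 0, 0, 1],
    [0, 0, 1, 0],
    [1, 1, 0, 1],
    [1, 1, 1, 0],
]
graph = ld_graph(H)
print(greedy_tags(graph), exact_tags(graph))
```

On this toy input the greedy and exact selections coincide, which illustrates (but does not demonstrate) the kind of agreement between local and global optima that the abstract reports for Tagger.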

Reference:

Ryan Tarpine, Fumei Lam, Sorin Istrail, "Conservative Extensions of Linkage Disequilibrium Measures from Pairwise to Multi-loci and Algorithms for Optimal Tagging SNP Selection", Chapter in Research in Computational Molecular Biology, Springer Berlin / Heidelberg, vol. 6577, pp. 468-482, 2011.

Bibtex Entry:

@incollection {Istrail2011,
   author = {Tarpine, Ryan and Lam, Fumei and Istrail, Sorin},
   affiliation = {Center for Computational Molecular Biology, Department of Computer Science, Brown University, Providence, RI 02912},
   title = {Conservative Extensions of Linkage Disequilibrium Measures from Pairwise to Multi-loci and Algorithms for Optimal Tagging SNP Selection},
   booktitle = {Research in Computational Molecular Biology},
   series = {Lecture Notes in Computer Science},
   editor = {Bafna, Vineet and Sahinalp, S.},
   publisher = {Springer Berlin / Heidelberg},
   isbn = {978-3-642-20035-9},
   keyword = {Computer Science},
   pages = {468--482},
   volume = {6577},
   url = {http://dx.doi.org/10.1007/978-3-642-20036-6_42},
   note = {10.1007/978-3-642-20036-6_42},
   abstract = {We present results on two classes of problems. The first result addresses the long-standing open problem of finding unifying principles for Linkage Disequilibrium (LD) measures in population genetics (Lewontin 1964 [10], Hedrick 1987 [8], Devlin and Risch 1995 [5]). Two desirable properties have been proposed in the extensive literature on this topic, and the mutual consistency between these properties has remained at the heart of statistical and algorithmic difficulties with haplotype and genome-wide association study analysis. The first axiom is (1) the ability to extend LD measures to multiple loci as a conservative extension of pairwise LD. All widely used LD measures are pairwise measures. Despite significant attempts, it is not clear how to naturally extend these measures to multiple loci, leading to a “curse of the pairwise”. The second axiom is (2) the interpretability of intermediate values. In this paper, we resolve this mutual consistency problem by introducing a new LD measure, directed informativeness (the directed graph-theoretic counterpart of the informativeness measure introduced by Halldorsson et al. [6]), and show that it satisfies both of the above axioms. We also show that the maximum informative subset of tagging SNPs based on directed informativeness can be computed exactly in polynomial time for realistic genome-wide data. Furthermore, we present polynomial-time algorithms for optimal genome-wide tagging SNP selection for a number of commonly used LD measures, under the bounded neighborhood assumption for linked pairs of SNPs. One problem in the area is the search for a quality measure for tagging SNP selection that unifies the LD-based methods such as LD-select (implemented in Tagger, de Bakker et al. 2005 [4], Carlson et al. 2004 [3]) and the information-theoretic ones such as informativeness. We show that the objective function of the LD-select algorithm is the Minimal Dominating Set (MDS) on $r^2$-SNP graphs and show that we can compute MDS in polynomial time for this class of graphs. Although in LD-select the “maximally informative” solution is obtained through a greedy algorithm, and is therefore better referred to as “locally maximally informative,” we show that in fact Tagger (LD-select) performs very close to the global maximally informative optimum.},
   year = {2011},
   category = {Haplotype Analysis}
}