The Clark Phaseable Sample Size Problem: Long-Range Phasing and Loss of Heterozygosity in GWAS

Halldorsson, Bjarni; Aguiar, Derek; Tarpine, Ryan; Istrail, Sorin

doi:citeulike-article-id:9029749

Abstract:

A phase transition is taking place today. The amount of data generated by genome resequencing technologies is so large that in some cases it is now less expensive to repeat the experiment than to store the information generated by the experiment. In the next few years, it is quite possible that millions of Americans will have been genotyped. The question then arises of how to make the best use of this information and jointly estimate the haplotypes of all these individuals. The premise of this article is that long shared genomic regions (or tracts) are unlikely unless the haplotypes are identical by descent. These tracts can be used as input for a Clark-like phasing method to obtain a phasing solution of the sample. We show on simulated data that the algorithm will get an almost perfect solution if the number of individuals being genotyped is large enough and the correctness of the algorithm grows with the number of individuals being genotyped. We also study a related problem that connects copy number variation with phasing algorithm success. A loss of heterozygosity (LOH) event is when, by the laws of Mendelian inheritance, an individual should be heterozygote but, due to a deletion polymorphism, is not. Such polymorphisms are difficult to detect using existing algorithms, but play an important role in the genetics of disease and will confuse haplotype phasing algorithms if not accounted for. We will present an algorithm for detecting LOH regions across the genomes of thousands of individuals. The design of the long-range phasing algorithm and the loss of heterozygosity inference algorithms was inspired by our analysis of the Multiple Sclerosis (MS) GWAS dataset of the International Multiple Sclerosis Genetics Consortium. We present similar results to those obtained from the MS data.
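
For readers unfamiliar with the term "Clark-like phasing", the following is a minimal sketch of the classic Clark inference rule on a toy sample. It is not the tract-based algorithm of the article: the genotype encoding (0, 1, or 2 copies of the minor allele per site) and the function names `clark_resolve` and `phase_sample` are assumptions made for illustration only.

```python
def clark_resolve(genotype, known_haplotype):
    """Return the haplotype that complements known_haplotype to form genotype.

    genotype: tuple of minor-allele counts per site (0, 1, or 2).
    known_haplotype: tuple of alleles per site (0 or 1).
    Returns None when the known haplotype is incompatible with the genotype.
    """
    complement = []
    for g, h in zip(genotype, known_haplotype):
        c = g - h
        if c not in (0, 1):
            return None
        complement.append(c)
    return tuple(complement)


def phase_sample(genotypes):
    """Phase a dict of name -> genotype with Clark's rule (illustrative only)."""
    known = set()
    phased = {}

    # Seed: genotypes with at most one heterozygous site are unambiguous.
    for name, g in genotypes.items():
        if sum(1 for x in g if x == 1) <= 1:
            h1 = tuple(1 if x == 2 else 0 for x in g)
            h2 = tuple(1 if x >= 1 else 0 for x in g)
            phased[name] = (h1, h2)
            known.update((h1, h2))

    # Repeatedly explain unresolved genotypes with an already known haplotype.
    changed = True
    while changed:
        changed = False
        for name, g in genotypes.items():
            if name in phased:
                continue
            for h in sorted(known):
                c = clark_resolve(g, h)
                if c is not None:
                    phased[name] = (h, c)
                    known.add(c)  # the new haplotype can resolve others
                    changed = True
                    break
    return phased


sample = {"a": (2, 0, 2, 0), "b": (1, 1, 2, 0), "c": (0, 1, 1, 1)}
phased = phase_sample(sample)
```

In this toy sample, the homozygous individual `a` seeds the set of known haplotypes, `b` is resolved against it, and the haplotype newly inferred for `b` in turn resolves `c`. The more individuals are genotyped, the more likely each genotype is to share a haplotype with an already resolved one, which is the intuition behind the sample-size question in the title.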

Reference:

Bjarni Halldorsson, Derek Aguiar, Ryan Tarpine, Sorin Istrail, "The Clark Phaseable Sample Size Problem: Long-Range Phasing and Loss of Heterozygosity in GWAS", In Journal of Computational Biology, vol. 18, no. 3, pp. 323-333, 2011.

Bibtex Entry:

@ARTICLE{Halldorsson2011a,
  author = {Halldorsson, Bjarni and Aguiar, Derek and Tarpine, Ryan and Istrail,
	Sorin},
  title = {The Clark Phaseable Sample Size Problem: Long-Range Phasing and Loss
	of Heterozygosity in GWAS},
  journal = {Journal of Computational Biology},
  year = {2011},
  volume = {18},
  pages = {323--333},
  number = {3},
  abstract = {A phase transition is taking place today. The amount of data
	generated by genome resequencing technologies is so large that in
	some cases it is now less expensive to repeat the experiment than
	to store the information generated by the experiment. In the next
	few years, it is quite possible that millions of Americans will have
	been genotyped. The question then arises of how to make the best
	use of this information and jointly estimate the haplotypes of all
	these individuals. The premise of this article is that long shared
	genomic regions (or tracts) are unlikely unless the haplotypes are
	identical by descent. These tracts can be used as input for a Clark-like
	phasing method to obtain a phasing solution of the sample. We show
	on simulated data that the algorithm will get an almost perfect solution
	if the number of individuals being genotyped is large enough and
	the correctness of the algorithm grows with the number of individuals
	being genotyped. We also study a related problem that connects copy
	number variation with phasing algorithm success. A loss of heterozygosity
	(LOH) event is when, by the laws of Mendelian inheritance, an individual
	should be heterozygote but, due to a deletion polymorphism, is not.
	Such polymorphisms are difficult to detect using existing algorithms,
	but play an important role in the genetics of disease and will confuse
	haplotype phasing algorithms if not accounted for. We will present
	an algorithm for detecting LOH regions across the genomes of thousands
	of individuals. The design of the long-range phasing algorithm and
	the loss of heterozygosity inference algorithms was inspired by our
	analysis of the Multiple Sclerosis (MS) GWAS dataset of the International
	Multiple Sclerosis Genetics Consortium. We present similar results
	to those obtained from the MS data.},
  doi = {citeulike-article-id:9029749},
  owner = {Derek},
  timestamp = {2012.05.08},
  url = {http://www.brown.edu/Research/Istrail_Lab/papers/clarkphaseablejournal.pdf},
  category = {Haplotype Phasing, Haplotype Analysis, Deletion Inference}
}