The Clark Phase-able Sample Size Problem: Long-Range Phasing and Loss of Heterozygosity in GWAS
by Bjarni Halldorsson, Derek Aguiar, Ryan Tarpine, Sorin Istrail
Abstract:
A phase transition is taking place today. The amount of data generated by genome resequencing technologies is so large that in some cases it is now less expensive to repeat the experiment than to store the information generated by the experiment. In the next few years it is quite possible that millions of Americans will have been genotyped. The question then arises of how to make the best use of this information and jointly estimate the haplotypes of all these individuals. The premise of the paper is that long shared genomic regions (or tracts) are unlikely unless the haplotypes are identical by descent (IBD), in contrast to short shared tracts, which may be identical by state (IBS). Here we estimate for populations, using the US as a model, what sample size of genotyped individuals would be necessary to have sufficiently long shared haplotype regions (tracts) that are identical by descent (IBD), at a statistically significant level. These tracts can then be used as input for a Clark-like phasing method to obtain a complete phasing solution of the sample. We estimate in this paper that for a population like the US with about 1% of the people genotyped (approximately 2 million), tracts about 200 SNPs long are shared IBD between pairs of individuals with high probability, which assures the success of Clark-method phasing. We show on simulated data that the algorithm will get an almost perfect solution if the number of individuals being SNP arrayed is large enough, and that the correctness of the algorithm grows with the number of individuals being genotyped.
We also study a related problem that connects copy number variation with phasing algorithm success. A loss of heterozygosity (LOH) event occurs when, by the laws of Mendelian inheritance, an individual should be heterozygous but, due to a deletion polymorphism, is not. Such polymorphisms are difficult to detect using existing algorithms, but play an important role in the genetics of disease and will confuse haplotype phasing algorithms if not accounted for. We present an algorithm for detecting LOH regions across the genomes of thousands of individuals. The design of the long-range phasing algorithm and the Loss of Heterozygosity inference algorithms was inspired by analysis of the Multiple Sclerosis (MS) GWAS dataset of the International Multiple Sclerosis Consortium, and we present in this paper results similar to those obtained from the MS data.
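For readers unfamiliar with the "Clark-like phasing method" the abstract refers to, the following is a minimal sketch of the classic Clark inference rule: genotypes with at most one heterozygous site seed a pool of known haplotypes, and each remaining genotype is resolved as a known haplotype plus its complement. It is an illustration only, not the tract-based long-range algorithm of the paper. The function names (`complement`, `clark_phase`) and the 0/1/2 genotype coding are assumptions made for this example.

```python
from typing import List, Optional, Tuple

Haplotype = Tuple[int, ...]  # alleles coded 0/1 (assumed coding)
Genotype = Tuple[int, ...]   # number of 1-alleles per SNP: 0, 1 (het), or 2


def complement(genotype: Genotype, haplotype: Haplotype) -> Optional[Haplotype]:
    """Return h2 such that haplotype + h2 == genotype, or None if incompatible."""
    other = []
    for g, h in zip(genotype, haplotype):
        rest = g - h
        if rest not in (0, 1):  # this haplotype cannot be part of this genotype
            return None
        other.append(rest)
    return tuple(other)


def clark_phase(
    genotypes: List[Genotype],
) -> List[Optional[Tuple[Haplotype, Haplotype]]]:
    """Greedy Clark-style phasing; unresolved genotypes are returned as None."""
    known: List[Haplotype] = []  # ordered pool of haplotypes seen so far
    phased = {}

    # Seed: genotypes with at most one heterozygous site phase unambiguously.
    for i, g in enumerate(genotypes):
        if sum(1 for x in g if x == 1) <= 1:
            h1 = tuple(1 if x == 2 else 0 for x in g)
            h2 = tuple(1 if x >= 1 else 0 for x in g)
            phased[i] = (h1, h2)
            for h in (h1, h2):
                if h not in known:
                    known.append(h)

    # Inference rule: explain a genotype as (known haplotype, complement).
    progress = True
    while progress:
        progress = False
        for i, g in enumerate(genotypes):
            if i in phased:
                continue
            for h in known:
                other = complement(g, h)
                if other is not None:
                    phased[i] = (h, other)
                    if other not in known:
                        known.append(other)
                    progress = True
                    break

    return [phased.get(i) for i in range(len(genotypes))]


if __name__ == "__main__":
    # Toy sample: one homozygous individual seeds the haplotype pool.
    sample = [(0, 0, 2, 2), (1, 1, 1, 1), (2, 1, 0, 1)]
    for genotype, pair in zip(sample, clark_phase(sample)):
        print(genotype, "->", pair)
```

The greedy rule takes the first compatible haplotype, so its result depends on the order of the input. That is one reason a larger sample, with more long shared tracts to anchor the inference, would be expected to help.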
Reference:
Bjarni Halldorsson, Derek Aguiar, Ryan Tarpine, Sorin Istrail, "The Clark Phase-able Sample Size Problem: Long-Range Phasing and Loss of Heterozygosity in GWAS", In RECOMB, vol. 6044, pp. 158-173, 2010.
Bibtex Entry:
@INPROCEEDINGS{Halldorsson2010,
  author = {Halldorsson, Bjarni and Aguiar, Derek and Tarpine, Ryan and Istrail,
	Sorin},
  title = {The Clark Phase-able Sample Size Problem: Long-Range Phasing and
	Loss of Heterozygosity in GWAS},
  booktitle = {RECOMB},
  year = {2010},
  editor = {Berger, Bonnie},
  volume = {6044},
  pages = {158--173},
  abstract = {A phase transition is taking place today. The amount of data generated
	by genome resequencing technologies is so large that in some cases
	it is now less expensive to repeat the experiment than to store the
	information generated by the experiment. In the next few years it
	is quite possible that millions of Americans will have been genotyped.
	The question then arises of how to make the best use of this information
	and jointly estimate the haplotypes of all these individuals. The
	premise of the paper is that long shared genomic regions (or tracts)
	are unlikely unless the haplotypes are identical by descent (IBD),
	in contrast to short shared tracts which may be identical by state
	(IBS). Here we estimate for populations, using the US as a model,
	what sample size of genotyped individuals would be necessary to have
	sufficiently long shared haplotype regions (tracts) that are identical
	by descent (IBD), at a statistically significant level. These tracts
	can then be used as input for a Clark-like phasing method to obtain
	a complete phasing solution of the sample. We estimate in this paper
	that for a population like the US and about 1% of the people genotyped
	(approximately 2 million), tracts about 200 SNPs long are shared
	IBD between pairs of individuals with high probability, which assures
	the success of Clark-method phasing. We show on simulated data that
	the algorithm will get an almost perfect solution if the number of
	individuals being SNP arrayed is large enough and the correctness
	of the algorithm grows with the number of individuals being genotyped.
	We also study a related problem that connects copy number variation
	with phasing algorithm success. A loss of heterozygosity (LOH) event
	is when, by the laws of Mendelian inheritance, an individual should
	be heterozygous but, due to a deletion polymorphism, is not. Such
	polymorphisms are difficult to detect using existing algorithms,
	but play an important role in the genetics of disease and will confuse
	haplotype phasing algorithms if not accounted for. We present
	an algorithm for detecting LOH regions across the genomes of thousands
	of individuals. The design of the long-range phasing algorithm and
	the Loss of Heterozygosity inference algorithms was inspired by analysis
	of the Multiple Sclerosis (MS) GWAS dataset of the International
	Multiple Sclerosis Consortium, and we present in this paper results
	similar to those obtained from the MS data.},
  doi = {citeulike-article-id:10541015},
  owner = {Derek},
  timestamp = {2012.05.08},
  url = {http://www.brown.edu/Research/Istrail_Lab/papers/clarkphaseable.pdf},
  category = {Haplotype Phasing, Haplotype Analysis, Deletion Inference}
}