Wednesday, April 27, 2016 3:00pm - 5:00pm
Watson CIT - SWIG Boardroom (CIT241)
CCMB New Concentrators Reception and Honors Thesis presentations:
Samuel Crisanto: "HapROSE A Software Package for Long Range Haplotype Phasing"
Abstract: HapROSE is a software package for long-range haplotype phasing that implements different haplotype-clustering heuristics. I examine how a model of variable-memory inference proposed by Ron et al. (1996) compares with the variant at the heart of BEAGLE (Browning and Browning 2006). The BEAGLE heuristic adapts the algorithm proposed by Ron et al. to optimize for speed and for a finite number of training samples, and is a current gold-standard algorithm for this problem. I then implement the stochastic EM algorithm used by BEAGLE and compare phasing accuracy between the two heuristics under a number of different conditions. I also explore new methods of inferring the existence of haplotype blocks through the use of tracts in long reads, leveraging state-of-the-art PacBio genome sequencing data and the positional Burrows-Wheeler transform.
Advisor: Sorin Istrail
Abigail Janke: "Structure and dynamics of RNA polymerase II C-terminal domain in complex with cancer-linked FET protein assemblies"
Translocations of FET protein (FUS, EWS, TAF15) low complexity (LC) domains onto transcription factor DNA-binding domains are known to cause cancer (Arvand and Denny, 2001; Guipaud et al., 2006; Lessnick and Ladanyi, 2012). Since FET LC domains are believed to be potent transcriptional activators, FET LC fusions are thought to cause aberrant transcription of genes related to cell growth and survival (Kwon et al., 2013). But what makes FET LC domains such potent transcriptional activators? Various in vitro models have suggested that higher order assemblies of FET LC domains recruit the RNA polymerase II carboxy-terminal domain (CTD) to promoters, leading to formation of the pre-initiation complex (Kwon et al., 2013; Schwartz et al., 2013; Burke et al., 2015). While Kwon et al. have shown that TAF15 LC fibrils recruit the degenerate half of the RNA polymerase II CTD, structural details of the complex formed between self-assembled FET proteins and the CTD are largely unknown.
Here we detail the first nuclear magnetic resonance (NMR) study of the intact degenerate repeat half of the CTD (CTD27-52). We report (1)H, (15)N backbone resonance assignments as well as key structural and dynamic parameters of CTD27-52, which verify that the unphosphorylated degenerate half of the CTD exists in an entirely disordered conformation. We then characterize the dynamics of the RNA polymerase II CTD27-52 in complex with TAF15 LC fibrils. In the presence of TAF15 LC fibrils, backbone resonances within the first 7 heptads of the degenerate half of RNA polymerase II CTD exhibit heightened transverse relaxation, suggesting increased binding of TAF15 LC fibrils. Our findings help characterize the mechanism by which higher order assemblies of FET LC domains recruit RNA polymerase II CTD, which is critical for understanding the role of FET translocations in cancer. Separately, and perhaps more significantly, our backbone resonance assignments of unphosphorylated CTD27-52 facilitate future investigations of residue-specific interactions between the CTD and numerous transcription initiation factors.
Advisor: Nicolas Fawzi
Daniel Seidman: "LumberTracts: A Method for time Efficient Determination of Identical by Descent Tracts Between Unphased Genotypes"
Our goal is to produce a tool for the efficient analysis of IBD tracts in unphased genome data by taking advantage of the inherent sequence-organizing properties of a modified suffix tree data structure. An identical by descent (IBD) tract is a DNA sequence shared between two individuals that is long enough to be unlikely to have arisen by coincidence. The presence of such a tract implies that the two individuals descended from a common ancestor and that the tract was present in that ancestor's DNA. These tracts can be used to infer the amount of time separating the common ancestor from the individuals in question. We already know how to find these tracts efficiently in phased genotype data, represented by single nucleotide polymorphism sequences. When some of the input data is unphased, the task becomes more complex: the haplotypes themselves are unknown, so the tracts that could be shared between the sequences are uncertain. We intend to create an algorithm that uses a modified suffix tree to organize the unphased genotype data into a form that holds the tracts easily. We will also try to create a simpler tool that uses a more conventional method of pairwise comparison to test the same principle, and we will compare the performance of the two. The algorithm eventually used in the tool will be whichever of the two proves more efficient.
Advisor: Sorin Istrail
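As an illustration of the conventional pairwise comparison mentioned in the LumberTracts abstract, the following is a minimal sketch and not the author's method. It assumes unphased genotypes are coded as alternate-allele counts (0, 1, 2) and uses a common simplification: two individuals cannot share a haplotype at a site where they are opposite homozygotes (0 versus 2), so maximal runs free of such sites are reported as candidate tracts. The function name `candidate_tracts` and the `min_len` threshold are hypothetical.

```python
from typing import List, Sequence, Tuple


def candidate_tracts(
    g1: Sequence[int], g2: Sequence[int], min_len: int
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) site intervals of candidate shared tracts.

    Genotypes are alternate-allele counts (0, 1 or 2) at each SNP site.
    A site breaks a tract when the two genotypes are opposite homozygotes,
    because no haplotype can then be shared at that site.
    """
    if len(g1) != len(g2):
        raise ValueError("genotype vectors must have the same length")
    if min_len < 1:
        raise ValueError("min_len must be at least 1")

    tracts = []
    start = 0  # first site of the current compatible run
    for i, (a, b) in enumerate(zip(g1, g2)):
        if abs(a - b) == 2:  # 0 vs 2: opposite homozygotes
            if i - start >= min_len:
                tracts.append((start, i))
            start = i + 1
    # Close the final run, which extends to the end of the sequence.
    if len(g1) - start >= min_len:
        tracts.append((start, len(g1)))
    return tracts


if __name__ == "__main__":
    a = [0, 1, 2, 2, 0, 1, 1, 0]
    b = [0, 0, 2, 0, 0, 1, 2, 2]
    print(candidate_tracts(a, b, min_len=3))
```

This scan is linear in the number of sites for a single pair, so comparing all pairs grows quadratically with the number of individuals, which is the cost an index structure such as a suffix tree would aim to avoid.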
Andy Ly: "Towards Unifying Tagging SNP Selection Algorithms"
A single nucleotide polymorphism (SNP) is a variation at a single nucleotide position in the genome of a species. These variations occur between different members of the same species and are currently studied to determine their correlation with diseases. A tagging SNP is a representative SNP that can stand in for a haplotype, a set of SNPs on a chromatid of a chromosome. Because an organism has many SNPs, identifying tagging SNPs reduces the complexity and computation needed. Several algorithms have been developed to select tagging SNPs from a set of SNPs, but their approaches differ. The algorithms of interest are the LD-Select algorithm, the Informativeness algorithm, and principal component analysis (PCA). These algorithms share some features and differ in others. For example, both the PCA approach and the LD-Select algorithm rely on haplotype blocks, while the Informativeness algorithm is "block-free". In addition, the LD-Select and Informativeness algorithms use linkage disequilibrium (LD), a statistical measure of non-random association between SNPs, whereas the PCA approach relies on a different statistical measure that is more inherent to PCA but still related to LD. The algorithms also produce different results because of their differing approaches and assumptions. This study begins with an analysis of the algorithms, covering their statistical measures, benefits, and problems, both common and unique. They are then compared to identify similarities and differences in how each selects tagging SNPs. With this knowledge, a hybrid of these algorithms may be formed, with the intent of outperforming each of the individual algorithms.
Advisor: Sorin Istrail
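To make the role of linkage disequilibrium in tagging SNP selection concrete, here is a minimal sketch under stated assumptions. It computes the standard r^2 measure of LD between two biallelic SNPs from phased 0/1 haplotype columns, then applies a simplified greedy binning in the spirit of threshold-based tag selection. It is not the published LD-Select, Informativeness, or PCA procedure; the function names and the 0.8 threshold are illustrative choices.

```python
from typing import List, Sequence


def r_squared(x: Sequence[int], y: Sequence[int]) -> float:
    """r^2 between two biallelic SNPs given 0/1 alleles per haplotype."""
    n = len(x)
    p_a = sum(x) / n                               # allele frequency at SNP x
    p_b = sum(y) / n                               # allele frequency at SNP y
    p_ab = sum(a * b for a, b in zip(x, y)) / n    # frequency of the 1-1 haplotype
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:                                 # a monomorphic SNP carries no LD signal
        return 0.0
    d = p_ab - p_a * p_b                           # LD coefficient D
    return d * d / denom


def greedy_tags(snps: List[Sequence[int]], threshold: float = 0.8) -> List[int]:
    """Pick tag SNP indices so every SNP has r^2 >= threshold with some tag.

    At each step, choose the SNP that covers the most still-uncovered SNPs
    (ties go to the lowest index), then remove everything it covers.
    """
    remaining = set(range(len(snps)))
    tags = []
    while remaining:
        best, best_cov = None, set()
        for i in sorted(remaining):
            cov = {
                j for j in remaining
                if j == i or r_squared(snps[i], snps[j]) >= threshold
            }
            if len(cov) > len(best_cov):
                best, best_cov = i, cov
        tags.append(best)
        remaining -= best_cov
    return tags


if __name__ == "__main__":
    # Each inner list is one SNP observed across four haplotypes.
    snps = [[0, 0, 1, 1], [0, 0, 1, 1], [1, 1, 0, 0], [0, 1, 0, 1]]
    print(greedy_tags(snps))
```

In the toy data the first three SNPs are perfectly correlated (or anti-correlated) with one another, so a single tag covers them, while the fourth SNP is independent and needs its own tag.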
Thurs 4/28 at 12 noon in BioMed 202
Jacob Thomas: "Transcriptome Analysis of the Christianson Syndrome Mouse Model"
Christianson Syndrome is an X-linked neurodevelopmental disorder caused by mutations in SLC9A6. In this study, we present the results of a transcriptome-based approach to studying a Slc9a6 null mouse model of the disease. In particular, pipelines for the analysis of RNA-Seq data from both mRNA and miRNA are used to identify differentially expressed genes and miRNAs. Overall, 107 differentially expressed genes and 9 differentially expressed miRNAs are identified. Gene set enrichment analyses are performed on the set of 107 differentially expressed genes and on a set of 940 predicted target genes of the miRNAs. These sets show enrichment for protein degradation processes and neuronal processes. Additionally, a number of the miRNAs have been shown to affect autophagy and apoptosis.
Advisor: Eric Morrow