
An open letter from Sorin to prospective students

For Computer Science, Mathematics, and Engineering students: The genome is the quintessential example of Big Data. In the era of patients with their genomes sequenced and (kind of) assembled (human genome = 3GB of DNA sequence), the computational problems of GWAS and genomics are in great need of innovation in algorithms, statistics and mathematical models, databases, machine learning, software systems, and programming languages.

For Biologists and Medical students: in past years this “Medical Bioinformatics” course has included a variety of life science students, and some science and medical faculty have audited it as well, all making very valuable contributions to the class.

How do I pick graduate students to join my Lab, and how do undergraduate students come to work on research projects with me? A simple answer: take one of my two courses (my Spring course is CSCI 1820 “Algorithmic Foundations of Computational Biology”) and impress me. For undergraduates, I have openings; my very successful students started working with me in their freshman year, and many (4 this year) ended up with an honors thesis.

Small group of graduate students. I prefer to work very closely with a few students; I do not believe in large groups of graduate students. My graduate students are each “an army of one!” Speaking of graduate students, this year we are fortunate to have my PhD student Derek Aguiar as the graduate TA for the course. Derek is a model example of the talent and background experience that I look for in a graduate student: exceedingly strong in algorithms and software development, a gifted member of the teaching team, and a delight to have around. A true army of one!

Guest lecturers. This year we will have 2-3 guest lecturers in this class: Professor Matthew Stephens (U of Chicago), one of the world leaders in Population Genetics and GWAS; Professor Gary Stormo (Washington U, St. Louis), an authority on regulatory genomics; and (tentatively) Senior Statistician Wendy Wong (Illumina, England), co-author of seminal papers that we will study in this class. We have also hosted several distinguished guests in previous years; video recordings of their lectures can be found on the course website.

Derek and I hope to see you in class.


Dear undergraduate, graduate, or postdoctoral students interested in taking my courses, working with me on research projects, or joining my Lab,

I am writing to you as the Fall 2012 semester is about to begin. I was on teaching relief for the academic year Fall 2011-Spring 2012, after finishing a five-year term as Director of the Center for Computational Molecular Biology, which was started at Brown eight years ago. As a result, I am missing the usual continuity of students from my previous classes taking my next ones. Also, a larger than usual group of students from my Lab graduated in May 2012: Ryan Tarpine, my first graduating PhD student at Brown, went to Google Research; from the Brown class of ’12, going on to graduate school, Tim Johnstone went to Yale (Biology), James Hart to Berkeley (Biology), and James Weis to Harvard (Business); David Moskovitz ’11 went to Stanford (Computational Biology); and MS student Alex Gilmore went to Yelp.

The bottom line: new students are wanted!

I would like to encourage you to take my Fall graduate class this year, CSCI2820 “Medical Bioinformatics.” The course now has a permanent Computer Science course number (it was formerly a topics course, CSCI295L) and, as in the previous three years, it will focus on computational methods for “Genome-Wide (Disease) Association Studies (GWAS).” Two other topics, “Protein Folding” and “Immunogenomics,” will be covered in just a few lectures/projects related to specific diseases and their GWASs.

The main computational problem studied in the course. Imagine a matrix for a group of 1000 autistic patients (the “cases”) and their 2000 parents (the “controls”). Each of these 3000 people has a row in the matrix holding the values of half a million tested SNPs (single DNA positions on the genome), so the matrix has 3000 × 500,000 = 1.5 billion entries. The result of each SNP test is 0 or 1 (called the “allele” at that locus) or 2 (missing data).

The general GWAS computational problem can be stated as follows.

Given: a matrix in which the parents' SNP data are the first 2000 rows and the patients' SNP data are the last 1000 rows. Find: patterns of SNPs that are more associated with cases than with controls and, among them, the ones with the highest statistical discrepancy of association.
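To make the problem statement concrete, here is a minimal sketch in Python of its simplest special case: scoring one SNP at a time. It is an illustration under stated assumptions, not the method taught in the course. It builds a small synthetic matrix with the controls in the first rows and the cases in the last rows, uses the 0/1/2 encoding described above, and ranks SNPs by a Pearson chi-square statistic on a 2x2 case/control by allele table. The function names, the toy sizes, the planted "causal" SNP, and the choice of chi-square as the measure of discrepancy are all assumptions made for illustration. The sketch also treats parents as unrelated controls, which a real family-based analysis would not do, and it ignores multi-SNP patterns entirely.

```python
import random

MISSING = 2  # encoding from the text: 0 and 1 are alleles, 2 is missing data


def make_toy_matrix(n_controls, n_cases, n_snps, causal_snp, seed=0):
    """Synthetic data (hypothetical): controls first, then cases."""
    rng = random.Random(seed)
    rows = []
    for i in range(n_controls + n_cases):
        is_case = i >= n_controls
        row = []
        for j in range(n_snps):
            if rng.random() < 0.02:  # about 2% of tests fail
                row.append(MISSING)
                continue
            # Allele 1 is enriched in cases only at the planted SNP.
            p = 0.8 if (is_case and j == causal_snp) else 0.3
            row.append(1 if rng.random() < p else 0)
        rows.append(row)
    return rows


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the table [[a, b], [c, d]].

    Returns 0.0 when any row or column total is zero.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def rank_snps(matrix, n_controls):
    """Return (statistic, snp_index) pairs, most discrepant SNP first."""
    scores = []
    for j in range(len(matrix[0])):
        counts = {(g, v): 0 for g in (False, True) for v in (0, 1)}
        for i, row in enumerate(matrix):
            v = row[j]
            if v != MISSING:  # skip missing entries
                counts[(i >= n_controls, v)] += 1
        stat = chi_square_2x2(
            counts[(True, 1)], counts[(True, 0)],
            counts[(False, 1)], counts[(False, 0)],
        )
        scores.append((stat, j))
    return sorted(scores, reverse=True)


if __name__ == "__main__":
    toy = make_toy_matrix(n_controls=200, n_cases=100, n_snps=50, causal_snp=7)
    for stat, j in rank_snps(toy, n_controls=200)[:3]:
        print(f"SNP {j}: chi-square = {stat:.1f}")
```

At the scale described in the text (3000 rows by half a million columns) this one-SNP-at-a-time scan is still feasible, but searching over combinations of SNPs grows combinatorially, which is where the algorithmic questions begin.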