From non-uniform data distributions to asymptotic speed and memory improvements: compressive genomics and HyperMinHash

Add event to my Google calendar Add event to my Google calendar Share this event on facebook E-mail this event
Wednesday, March 14, 2018 4:00pm - 5:00pm
Watson CIT - SWIG Boardroom (CIT241)

Yun William Yu
Research Fellow
Department of Biomedical Informatics at Harvard Medical School

 "From non-uniform data distributions to asymptotic speed and memory improvements: compressive genomics and HyperMinHash"

Designing efficient algorithms requires that we understand the underlying distributions of the data. In the "compressive genomics" part of this talk, I show how we exploit the natural structure of biological data to speed up similarity search. We prove that by organizing the database to facilitate clustered search, our time-complexity scales with metric entropy (number of covering hyperspheres) if the fractal dimension of the dataset is low. This is the key insight behind our compressively accelerated versions of standard tools in genomics (CORA, 10-100x speedup for all-mapping of NGS reads), metagenomics (MICA, 3.5x speedup Diamond), and chemical informatics (Ammolite, 150x speedup SMSD).

However, understanding the underlying distributions is also crucial even when the distributions are completely artificial. In the "HyperMinHash" part of this talk, I discuss a method for reducing the space complexity from O(log n) to O(log log n) of the probabilistic MinHash sketch for finding Jaccard distance (a.k.a. similarity) between sets. We achieve this by taking advantage of the non-uniform distribution of minimum values from a collection of uniform random hashes. In ongoing work, we are applying these types of sketches to enable orders of magnitude faster Boolean queries on aggregate patient medical records.

Hosted by the CCMB and CS