
Rapid analysis and visualization of output from topic models

A family of methods in genomics uses multilocus genotype data to assign individuals membership in latent clusters, which often correspond to geographic regions or modes of subsistence. These methods belong to the broad class of topic models, which also includes latent Dirichlet allocation, a method widely used to analyze text corpora. Inference with a topic model can produce different output matrices when applied repeatedly to the same input, and the number of latent clusters is a parameter that is often varied within an analysis pipeline. For these reasons, quantifying, visualizing, and annotating the output of topic models is a bottleneck for investigators in disciplines ranging from ecology to text data mining. We developed, and continue to extend, Pong, a network-graphical approach for analyzing and interactively visualizing membership in latent clusters. Pong uses efficient algorithms for solving the Assignment Problem, which dramatically reduces runtime while increasing accuracy compared to competing methods.
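To illustrate how the Assignment Problem arises here, the following is a minimal sketch and not Pong's actual code. Two runs of a clustering method can return the same clusters in a different column order, so the columns of one membership matrix must be matched to the columns of the other before the runs can be compared or drawn with consistent colors. The function names (`solve_assignment`, `cluster_costs`, `align_run`) and the squared-difference cost are assumptions made for this example; the sketch uses a standard O(K^3) Hungarian-style solver written in pure Python.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def solve_assignment(cost: Matrix) -> List[int]:
    """Minimum-cost assignment for a square cost matrix (Hungarian method).

    Returns `match`, where match[i] is the column assigned to row i.
    """
    n = len(cost)
    inf = float("inf")
    u = [0.0] * (n + 1)    # row potentials (1-indexed)
    v = [0.0] * (n + 1)    # column potentials (1-indexed)
    p = [0] * (n + 1)      # p[j] = row currently assigned to column j (0 = none)
    way = [0] * (n + 1)    # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0 = p[j0]
            delta = inf
            j1 = 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j] = cur
                        way[j] = j0
                    if minv[j] < delta:
                        delta = minv[j]
                        j1 = j
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        # Flip assignments along the augmenting path back to the virtual column 0.
        while True:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
            if j0 == 0:
                break
    match = [0] * n
    for j in range(1, n + 1):
        match[p[j] - 1] = j - 1
    return match


def cluster_costs(q_a: Matrix, q_b: Matrix) -> List[List[float]]:
    """cost[i][j] = squared distance between cluster i of q_a and cluster j of q_b.

    Rows are individuals, columns are clusters. The squared-difference cost is
    an assumption chosen for illustration.
    """
    k = len(q_a[0])
    return [
        [sum((ra[i] - rb[j]) ** 2 for ra, rb in zip(q_a, q_b)) for j in range(k)]
        for i in range(k)
    ]


def align_run(q_a: Matrix, q_b: Matrix) -> List[List[float]]:
    """Reorder the columns of q_b so that column i corresponds to cluster i of q_a."""
    match = solve_assignment(cluster_costs(q_a, q_b))
    return [[row[match[i]] for i in range(len(match))] for row in q_b]


# Hypothetical membership matrices: run_b is run_a with its columns shuffled.
run_a = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]]
run_b = [[row[2], row[0], row[1]] for row in run_a]
print(solve_assignment(cluster_costs(run_a, run_b)))  # [1, 2, 0]
```

Solving the matching as an assignment problem avoids enumerating all K! column permutations, which is what makes aligning many runs at larger values of K tractable.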

Research Leads: 

Sohini Ramachandran

Funding Sources: 

National Science Foundation (CAREER to SR)

Pong's visualization of the membership of human individuals in 8 latent clusters, inferred from 1000 Genomes Project data. Each individual is depicted as a vertical line of pixels, and the proportion of each color in that line is the inferred genome-wide ancestry the individual has in the cluster denoted by that color. Black lines separate the geographic populations from which individuals were sampled. Membership in the red cluster is associated with eastern African ancestry, membership in the yellow cluster with western African ancestry, membership in the blue cluster with southern European ancestry, and so on.