Data Science, Computing and Visualization Workshops:
Weekly at noon on Fridays; see upcoming topics
“Are We Not Doing Phrasing Anymore?”: Towards a Cultural Informatics
John Laudun, University of Louisiana
Recent headlines in major news outlets like the New York Times and the Chronicle of Higher Education reveal the profound suspicion with which statistical methods have been received within the humanities. The pervasive belief is that a chasm lies between statistics and the humanities, one that not only cannot be bridged but should not be attempted, at the risk of losing the human. And yet, slowly and steadily, a growing number of practitioners have developed not only research programs but also pedagogical methods that open up new analytical perspectives as well as new avenues for students to explore the relationship between the subject matter and their own understanding. This talk offers a small survey of practices found in the digital humanities, alongside a few of the author's experiments in letting students experience how statistical methods in fact de-mystify the meaning-making process in language. Such methods empower students to ground their insights in things they can see and count, and, by treating texts as nothing more than particular sequences of words, open a path to making them better writers as well. Working from a broad survey to narrow applications, the talk suggests that the concern about a loss of humanity in the humanities is actually a concern about the loss of certain kinds of authority, but that new kinds of authority are possible, within which researchers and teachers will find firm ground from which to offer interpretations and evaluations of the kinds of complex artifacts that have long been the purview of the domain.
John Laudun received his MA in literary studies from Syracuse University in 1989 and his PhD in folklore studies from the Folklore Institute at Indiana University in 1999. He was a Jacob K. Javits Fellow while at Syracuse and Indiana (1987-1992) and a MacArthur Scholar at the Indiana Center for Global Change and World Peace (1993-94). He has written grants funded by the Grammy Foundation and the Louisiana Board of Regents, been a fellow with the EVIA Digital Archive, and served as a scholar in residence with UCLA's Institute for Pure and Applied Mathematics. His book, The Amazing Crawfish Boat, is a longitudinal ethnographic study of creativity and tradition within a material folk culture domain. Laudun's current work is in the realm of culture analytics. He is currently engaged in several collaborations with physicists and other scientists seeking to understand how texts can be modeled computationally in order to better describe their functions and features.
I will describe how to use data science methods to understand and reduce inequality in two domains: criminal justice and healthcare. First, I will discuss how to use Bayesian modeling to detect racial discrimination in policing. Second, I will describe how to use machine learning to explain racial and socioeconomic inequality in pain.
Emma Pierson is a PhD student in Computer Science at Stanford, supported by Hertz and NDSEG Fellowships. Previously, she completed a master’s degree in statistics at Oxford on a Rhodes Scholarship. She develops statistical and machine learning methods to study two deeply entwined problems: reducing inequality and improving healthcare. She also writes about these topics for broader audiences in publications including The New York Times, The Washington Post, FiveThirtyEight, and Wired. Her work has been recognized by best paper (AISTATS 2018), best poster (ICML Workshop on Computational Biology), and best talk (ISMB High Throughput Sequencing Workshop) awards, and she has been named a Rising Star in EECS and Forbes 30 Under 30 in Science.
Host: Professor Seny Kamara
In this talk I will give an overview of two ongoing projects at the interface of data and stochastic simulation/optimization models. I will first discuss our work on “stochastic package queries” (SPQs), a framework for synthesizing data management systems, predictive models, and optimization tools to provide an end-to-end system for decision support in the face of uncertainty. The goal of an SPQ is to select a subset of tuples in a table (e.g., a portfolio of stocks) to optimize the expected value of an aggregate over the subset, while satisfying a set of constraints with high probability. We use a Monte Carlo database system to incorporate stochastic predictive models into the database and provide a declarative extension to SQL, called sPaQL, for specifying SPQs. Prior stochastic programming approaches typically do not scale to the SPQ setting, so we provide novel techniques for scalably computing approximately optimal solutions with statistical guarantees, while not overwhelming optimization engines such as CPLEX and Gurobi.
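To make the SPQ setting concrete, here is a toy sketch (not the speakers' sPaQL system, and with made-up stock names and distributions): select a subset of stocks maximizing expected gain while keeping the probability of a large loss below a risk bound. Uncertainty is represented by Monte Carlo scenarios, and the tiny instance is solved by brute-force enumeration rather than an engine like CPLEX or Gurobi.

```python
from itertools import combinations
import random

random.seed(0)
stocks = ["A", "B", "C", "D"]
n_scenarios = 1000
# Hypothetical per-stock gain distributions (mean, std) in dollars.
params = {"A": (5, 2), "B": (8, 10), "C": (3, 1), "D": (10, 15)}
# Each scenario is one Monte Carlo draw of every stock's gain.
scenarios = [{s: random.gauss(*params[s]) for s in stocks}
             for _ in range(n_scenarios)]

def expected_gain(pkg):
    return sum(sum(sc[s] for s in pkg) for sc in scenarios) / n_scenarios

def loss_prob(pkg, threshold=-5.0):
    # Fraction of scenarios in which the package loses more than $5.
    return sum(1 for sc in scenarios
               if sum(sc[s] for s in pkg) < threshold) / n_scenarios

# Brute force: best package whose loss probability stays below 5%.
best = max((pkg for r in range(1, len(stocks) + 1)
            for pkg in combinations(stocks, r)
            if loss_prob(pkg) <= 0.05),
           key=expected_gain)
print(best, round(expected_gain(best), 2))
```

Enumeration is exponential in the table size, which is exactly why the talk's scalable approximation techniques are needed for realistic inputs.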
Our second project aims at the ultimate goal of automatically generating and executing discrete-event stochastic simulation (DESS) models to seamlessly incorporate expert domain knowledge into decision making under uncertainty for complex dynamic systems. Perhaps the most challenging step in the creation of a DESS model is specification of the input distributions, e.g., for the arrival process in a queueing model. Traditionally, small amounts of historical data would be available; distribution-fitting software would assume that interarrival times are iid, and then select and parameterize one of a small family of standard distributions, such as exponential, gamma, or Weibull. We show that such software often fails for processes with complex features such as multi-modal marginal distributions or temporal correlations. We design novel generative neural networks, specifically, variational autoencoders with LSTM components, that permit automated, higher fidelity simulation input modeling in data-rich scenarios. Preliminary results show that a range of complex processes can be automatically and accurately modeled by our techniques, without overfitting.
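The failure mode of classical input modeling can be illustrated with a small sketch (illustrative numbers only, and a deliberately simplistic fit rather than real distribution-fitting software): interarrival times drawn from a bimodal mixture are fitted with a single exponential via the textbook MLE (rate = 1/mean). The fitted model puts substantial probability mass in the "gap" between the two modes, where the real process almost never lands.

```python
import math
import random

random.seed(42)
n = 10_000
# Bimodal interarrival times: a fast mode near 1s and a slow mode near 10s.
data = [random.gauss(1.0, 0.2) if random.random() < 0.5
        else random.gauss(10.0, 0.5) for _ in range(n)]
data = [max(x, 0.01) for x in data]  # keep times positive

mean = sum(data) / n  # exponential MLE: rate = 1 / mean

def exp_cdf(x):
    return 1.0 - math.exp(-x / mean)

gap = (3.0, 8.0)  # region between the two modes
data_frac = sum(gap[0] < x < gap[1] for x in data) / n
model_frac = exp_cdf(gap[1]) - exp_cdf(gap[0])
print(f"mass in gap: data={data_frac:.3f}, fitted exponential={model_frac:.3f}")
```

The fitted exponential assigns roughly a third of its mass to a region the data essentially never visits; richer generative models, such as the VAE-with-LSTM approach described above, aim to capture exactly this kind of multi-modal and temporally correlated structure.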
METHODS FOR POPULATION HEALTH WITH LIMITED DATA
Zehang (Richard) Li, Yale School of Public Health
Data describing health outcomes of hidden populations and in low-resource areas are usually noisy and incomplete. In this talk, I will discuss two projects in such data-constrained settings. In the first project, I propose probabilistic approaches to estimating cause-of-death distributions using verbal autopsies (VAs). A VA is a survey-based method routinely carried out to assign causes to deaths when medically certified causes are not available. I will present an approach that uses latent Gaussian graphical models to characterize the joint distribution of symptoms and causes of death while accommodating informative prior knowledge about their marginal associations. This allows us to combine noisy data from multiple sources to improve cause-of-death classification. I will also briefly discuss the broader impact of probabilistic modeling of VAs, based on pilot studies integrating VAs with existing civil registration systems.

In the second project, I will discuss methods to evaluate population-level public health interventions for combating the opioid epidemic. Opioid use and overdose have become an important public health issue in the United States. However, understanding the dynamics of opioid overdose incidents and the effects of public health interventions remains challenging, as comprehensive datasets describing drug use are usually lacking. I will discuss challenges in evaluating the impacts of spatially and temporally varying exposures with unmeasured confounding and spillover effects. I will then discuss methods that leverage space-time structure to adjust for certain types of confounding due to smooth latent processes, and develop strategies to evaluate the sensitivity of such adjustments.
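The core idea behind probabilistic verbal-autopsy coding can be sketched in a few lines (this is not the latent Gaussian graphical model in the talk, and all probabilities here are illustrative, not real epidemiological values): given conditional probabilities P(symptom | cause) and a prior over causes, Bayes' rule yields a posterior cause distribution for one death, assuming symptoms are conditionally independent given the cause.

```python
causes = ["cardiac", "infection", "injury"]
prior = {"cardiac": 0.4, "infection": 0.4, "injury": 0.2}
# P(symptom present | cause), for three hypothetical interview items.
p_symptom = {
    "chest_pain": {"cardiac": 0.8, "infection": 0.2, "injury": 0.3},
    "fever":      {"cardiac": 0.1, "infection": 0.9, "injury": 0.2},
    "wound":      {"cardiac": 0.05, "infection": 0.1, "injury": 0.9},
}

def posterior(reported):
    """Posterior over causes for one death; reported maps symptom -> bool."""
    scores = {}
    for c in causes:
        p = prior[c]
        for s, present in reported.items():
            q = p_symptom[s][c]
            p *= q if present else (1.0 - q)
        scores[c] = p
    z = sum(scores.values())  # normalize so the posterior sums to one
    return {c: v / z for c, v in scores.items()}

post = posterior({"chest_pain": True, "fever": False, "wound": False})
print(max(post, key=post.get), {c: round(v, 3) for c, v in post.items()})
```

The models in the talk go well beyond this conditional-independence assumption, which is precisely where the latent Gaussian graphical structure comes in.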
Zehang (Richard) Li is currently a postdoctoral associate in the Department of Biostatistics at Yale School of Public Health. He completed his PhD in Statistics at the University of Washington in 2018. His research interests include Bayesian hierarchical models for high-dimensional data, spatial-temporal statistics, causal inference, global and population health, and reproducible research.
Data Science Computation and Visualization Workshop
EXPLORATORY DATA ANALYSIS WITH PANDAS IN PYTHON, PART TWO with Andras Zsom
Exploratory data analysis (EDA) is the first step of any data science project. In the second part of this pandas tutorial, I’ll walk through various visualization types you can use to better understand the properties of your data at a glance using pandas. Coding experience with python is required but no experience with the pandas package is necessary to follow the tutorial.
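A minimal sketch of the kind of pandas-based EDA the tutorial covers, using a made-up dataset (column names and values are illustrative, not from the workshop):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # draw off-screen so no display is needed

df = pd.DataFrame({
    "species": ["a", "a", "b", "b", "c", "c"],
    "height":  [1.2, 1.4, 3.1, 2.9, 5.0, 5.2],
    "weight":  [10, 11, 30, 28, 55, 58],
})

print(df.describe())                 # numeric summaries at a glance
print(df["species"].value_counts())  # category frequencies
print(df[["height", "weight"]].corr())

# One-line plots that pandas builds on top of matplotlib:
ax_hist = df["height"].plot(kind="hist")                    # distribution of one column
ax_scatter = df.plot(kind="scatter", x="height", y="weight")  # relationship between two
```

Each `plot(kind=...)` call returns a matplotlib Axes object, so the figures can be styled or saved with the usual matplotlib tools.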
Friday 2/28 @ 12pm
Carney Innovation Space, 4th Floor
164 Angell Street
Pizzas and sodas will be served. Sponsored by the Data Science Initiative and organized by the Center for Computation and Visualization.