Jan 24, Virtual, 12:00pm - 1:00pm
Tamara Broderick, PhD, Associate Professor in the Department of Electrical Engineering and Computer Science at MIT
An Automatic Finite-Sample Robustness Metric: Can Dropping a Little Data Change Conclusions?
One hopes that data analyses will be used to make beneficial decisions regarding people’s health, finances, and well-being. But the data fed to an analysis may systematically differ from the data where these decisions are ultimately applied. For instance, suppose we analyze data in one country and conclude that microcredit is effective at alleviating poverty; based on this analysis, we decide to distribute microcredit in other locations and in future years. We might then ask: can we trust our conclusion to apply under new conditions? If we found that a very small percentage of the original data was instrumental in determining the original conclusion, we might expect the conclusion to be unstable under new conditions. So we propose a method to assess the sensitivity of data analyses to the removal of a very small fraction of the data set. Analyzing all possible data subsets of a certain size is computationally prohibitive, so we provide an approximation. We call our resulting method the Approximate Maximum Influence Perturbation. Our approximation is automatically computable, theoretically supported, and works for common estimators — including (but not limited to) OLS, IV, GMM, MLE, MAP, and variational Bayes. We show that any non-robustness our metric finds is conclusive. Empirics demonstrate that while some applications are robust, in others the sign of a treatment effect can be changed by dropping less than 0.1% of the data — even in simple models and even when standard errors are small.
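The idea of approximating the effect of dropping a small data subset can be illustrated for OLS with a first-order (influence-function) calculation. The following is a minimal sketch, not the speaker's implementation: the function names `ols` and `amip_style_drop` are hypothetical, and the sketch only tries to lower a single coefficient. It ranks rows by their linearized influence, drops the most influential ones, and then refits exactly to check whether the change is real.

```python
import numpy as np

def ols(X, y):
    """Ordinary least squares coefficients."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def amip_style_drop(X, y, coef_index, frac):
    """Linearized estimate of how far dropping at most a fraction `frac`
    of the rows can lower one OLS coefficient, plus an exact refit check.

    Hypothetical helper for illustration only.
    """
    n = X.shape[0]
    k = int(np.floor(frac * n))
    beta = ols(X, y)
    resid = y - X @ beta
    # With data weights w (all ones at the full data), the derivative of the
    # coefficient vector with respect to w_n is (X'X)^{-1} x_n e_n.
    XtX_inv = np.linalg.inv(X.T @ X)
    influence = (X @ XtX_inv)[:, coef_index] * resid
    # Dropping row n (w_n: 1 -> 0) changes the coefficient by about
    # -influence[n], so the k largest positive influences lower it the most.
    order = np.argsort(influence)[::-1][:k]
    drop = order[influence[order] > 0]
    approx = beta[coef_index] - influence[drop].sum()
    # Exact refit without the selected rows confirms (or refutes) the estimate.
    keep = np.setdiff1d(np.arange(n), drop)
    exact = ols(X[keep], y[keep])[coef_index]
    return beta[coef_index], approx, exact, drop

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 1000
    x = rng.normal(size=n)
    y = 0.1 * x + rng.normal(size=n)  # simulated data, small true slope
    X = np.column_stack([np.ones(n), x])
    full, approx, exact, drop = amip_style_drop(X, y, coef_index=1, frac=0.01)
    print("full-data slope:", full)
    print("linearized slope after dropping", len(drop), "rows:", approx)
    print("refit slope after dropping the same rows:", exact)
```

Because the last step refits on the reduced data, a sign change found this way is an actual property of the data set rather than an artifact of the approximation; the linearization is only used to choose which rows to drop.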
Dec 6, Virtual and In Person, 12:00pm - 1:00pm
Dr. Li-Xuan Qin, PhD in biostatistics, Associate Member in Biostatistics, Memorial Sloan Kettering Cancer Center
Transcriptomics Data Normalization: Let’s Put It into Context
This talk will describe an assessment of transcriptomics data normalization (for removing artifacts due to inconsistent experimental handling in data collection) in the context of downstream analysis. With robustly benchmarked data and novel re-sampling-based simulations, I will illustrate several caveats of data normalization for biomarker discovery, sample classification, and survival prediction. I will then discuss the underlying causes for these caveats and provide alternative approaches that are more effective for dealing with the data artifacts.
Nov 22, Virtual, 12:15pm - 1:00pm
Dr. Jonathan Bartlett, PhD, Reader in Statistics, Department of Mathematical Sciences, University of Bath, England
Hypothetical estimands in clinical trials - a unification of causal inference and missing data methods
In clinical trials events may take place which complicate interpretation of the treatment effect. For example, in diabetes trials, some patients may require rescue medication during follow-up if their diabetes is not well controlled. Interpretation of the intention to treat effect is then complicated if the level of rescue medication is imbalanced between treatment groups. In such cases we may be interested in a so-called hypothetical estimand which targets what effect would have been seen in the absence of rescue medication. In this talk I will discuss estimation of such hypothetical estimands. Currently such estimands are typically estimated using standard missing data techniques after exclusion of any outcomes measured after such events take place. I will define hypothetical estimands using potential outcomes, and exploit standard results for identifiability of causal effects from observational data to describe assumptions sufficient for identification of hypothetical estimands in trials. I will then discuss both ‘causal inference’ and ‘missing data’ methods (such as mixed models) for estimation, and show that in certain situations estimators from these two sets are identical. These links may help those familiar with one set of methods but not the other. They may also identify situations where currently adopted estimation approaches may be relying on unrealistic assumptions, and suggest alternative approaches for estimation.
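As a sketch of how such an estimand can be written with potential outcomes, suppose (notation assumed here, not taken from the talk) that $A$ is the randomized treatment, $R$ indicates receipt of rescue medication, and $Y^{a,r}$ is the outcome that would be observed under treatment $a$ and rescue status $r$. The intention-to-treat effect and one hypothetical estimand are then:

```latex
\begin{align*}
\text{ITT effect:} \quad & E\left[Y^{a=1}\right] - E\left[Y^{a=0}\right] \\
\text{Hypothetical estimand:} \quad & E\left[Y^{a=1,\,r=0}\right] - E\left[Y^{a=0,\,r=0}\right]
\end{align*}
```

The second line fixes rescue medication at "none" in both arms, which is why identifying it requires causal assumptions beyond randomization of $A$.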
Nov 1, Virtual, 12:00pm - 1:00pm
Dr. Jean Feng, Assistant Professor, Department of Epidemiology and Biostatistics, University of California, San Francisco (UCSF)
Safe approval policies for continual learning systems in healthcare
The number of machine learning (ML)-based medical devices approved by the US Food and Drug Administration (FDA) has been rapidly increasing. The current regulatory policy requires these algorithms to be locked post-approval; subsequent changes must undergo additional scrutiny. Nevertheless, ML algorithms have the potential to improve over time by training over a growing body of data, better reflect real-world settings, and adapt to distributional shifts. To facilitate a move toward continual learning algorithms, the FDA is looking to streamline regulatory policies and design Algorithm Change Protocols (ACPs) that autonomously approve proposed modifications. However, the problem of designing ACPs cannot be taken lightly. We show that policies without error rate guarantees are prone to “bio-creep” and may not protect against distributional shifts. To this end, we investigate the problem of ACP design within the frameworks of online hypothesis testing and online learning and take the first steps towards developing safe ACPs.
Oct 25, Virtual and In Person, 12:00pm
Dr. Elizabeth Tipton, Associate Professor of Statistics, Co-Director, Statistics for Evidence-Based Policy & Practice (STEPP) Center, Faculty Fellow, Institute for Policy Research
Title: When you don’t know the covariance: Combining model-based and robust standard errors
When faced with complex data structures, one approach is to estimate standard errors based upon a model, while another approach is to rely on the CLT and use ‘robust’ standard errors. This problem arises frequently in meta-analysis, when there are multiple effect sizes reported in each study, but the nature of the dependence between these effect sizes is completely unknown. Here multiple approaches have been developed, including the use of multivariate models (based on assumed covariance structures), multi-level models, and the use of “robust variance estimation”. In this talk, I provide an overview of a new approach which combines these previous methods. By beginning with a ‘working model’ and then using robust standard errors, I show that the resulting regression estimators are more efficient (than standard robust methods) and the hypothesis tests are more robust (than model-based methods).
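One plausible reading of "working model plus robust standard errors" is a weighted regression whose weights come from an assumed within-study covariance, followed by a cluster-robust (sandwich) variance estimate. The sketch below illustrates that general recipe only and is not the speaker's estimator: the function name `working_model_robust_fit` is hypothetical, the working covariance is a simple compound-symmetric matrix with unit variances and an assumed correlation `rho`, and no small-sample correction is applied.

```python
import numpy as np

def working_model_robust_fit(X, y, cluster, rho):
    """GLS under a compound-symmetric working covariance, with a basic
    cluster-robust (sandwich) covariance for the coefficients.

    Hypothetical helper for illustration. `rho` must keep each working
    covariance invertible (for a cluster of size m: -1/(m-1) < rho < 1).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    cluster = np.asarray(cluster)
    p = X.shape[1]
    bread = np.zeros((p, p))
    xwy = np.zeros(p)
    blocks = []
    for g in np.unique(cluster):
        rows = cluster == g
        Xg, yg = X[rows], y[rows]
        m = Xg.shape[0]
        # Working model: unit variances, common correlation rho within cluster.
        Vg = (1.0 - rho) * np.eye(m) + rho * np.ones((m, m))
        Wg = np.linalg.inv(Vg)
        bread += Xg.T @ Wg @ Xg
        xwy += Xg.T @ Wg @ yg
        blocks.append((Xg, yg, Wg))
    bread_inv = np.linalg.inv(bread)
    beta = bread_inv @ xwy
    # Sandwich "meat": outer products of cluster-level weighted score vectors.
    meat = np.zeros((p, p))
    for Xg, yg, Wg in blocks:
        score = Xg.T @ Wg @ (yg - Xg @ beta)
        meat += np.outer(score, score)
    cov_robust = bread_inv @ meat @ bread_inv
    return beta, cov_robust

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    cluster = np.repeat(np.arange(30), 3)          # 30 studies, 3 effects each
    x = rng.normal(size=90)
    u = np.repeat(rng.normal(size=30), 3)          # shared study-level noise
    y = 1.0 + 0.5 * x + u + rng.normal(size=90)    # simulated data
    X = np.column_stack([np.ones(90), x])
    beta, cov = working_model_robust_fit(X, y, cluster, rho=0.5)
    print("coefficients:", beta)
    print("robust standard errors:", np.sqrt(np.diag(cov)))
```

If the working correlation is badly wrong, the coefficients lose some efficiency but the sandwich covariance remains a valid large-sample estimate; with `rho=0` the fit reduces to ordinary least squares with cluster-robust standard errors.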
Sep 20, Virtual, 12:00pm - 1:00pm
Dr. Amanda Mejia, Assistant Professor, Department of Statistics, Indiana University
Using empirical population priors to provide accurate subject-level insights into functional brain organization through template ICA
Abstract: A primary objective in resting-state fMRI studies is localization of functional areas (i.e. resting-state networks) and the functional connectivity (FC) between them. These spatial and temporal properties of brain organization may be related to disease progression, development, and aging, making them of high scientific and clinical interest. Independent component analysis (ICA) is a popular tool to estimate functional areas and their FC. However, due to typically low signal-to-noise ratio and short scan duration of fMRI data, subject-level ICA results tend to be highly noisy and unreliable. Thus, group-level functional areas are often used in lieu of subject-specific ones, ignoring inter-subject variability in functional topology. These group-average maps also form the basis for estimating FC, leading to potential bias in FC estimates given the topological differences in underlying functional areas. An alternative to these two extremes (noisy subject-level ICA and one-size-fits-all group ICA) is Bayesian hierarchical ICA, wherein information shared across subjects is leveraged to improve subject-level estimation of spatial maps and FC. However, fitting traditional hierarchical ICA models across many subjects is computationally intensive. Template ICA is a computationally convenient hierarchical ICA framework using empirical population priors derived from large fMRI databases or holdout data. Template ICA produces more accurate and reliable estimates of subject-level functional areas compared with popular ad-hoc approaches. The flexible Bayesian framework also facilitates incorporating other sources of a-priori information. In this talk, I will describe the template ICA framework, as well as two extensions to the baseline model: the first incorporates spatial priors to leverage information shared across neighboring brain locations, and the second incorporates empirical population priors on the FC between functional areas. 
I will also present recent findings from a study of the effects of psilocybin (the prodrug compound found in “magic mushrooms”) on the organization of the thalamus.
Bio: Mandy Mejia is an assistant professor in the Department of Statistics at Indiana University. Her research aims to develop statistical techniques to extract accurate individual insights from functional MRI data, which is noisy, big and complex. Her group pursues this goal in three primary ways: (1) developing computationally efficient Bayesian techniques, which leverage information shared across space and across individuals to produce more accurate estimates at the individual level; (2) developing statistically principled noise-reduction techniques; and (3) analyzing data on the cortical surface and subcortical gray matter to facilitate spatial modeling and improve inter-subject alignment. Her group has developed several software tools to facilitate cortical surface and Bayesian analysis of fMRI data in R.
Sep 13, Virtual and In Person, 12:00pm - 12:50pm
Dr. Alyssa Bilinski, Brown University, Department of Health Services, Policy & Practice
O Decision Tree, O Decision Tree: Interpretable classification metamodels for health policy (w/Nicolas Menzies, Jeffrey Eaton, John Giardina, and Joshua Salomon)
Over the past decade, researchers have developed a rich set of metamodeling techniques for complex decision analytic models. These create parsimonious model emulators, improving the tractability of computationally-intensive analyses. However, such techniques typically focus on reproducing a full model, requiring high fidelity to the full space of parameters and outcomes, and can be difficult to interpret. In this paper, we use decision tree classifiers to create metamodels of policy-important binary outcomes. We first detail methods to fit and test classifiers optimizing out-of-sample performance, to upsample strategically in regions of high uncertainty, and to develop and test interpretable decision rules for policymakers. We apply these to a previously published agent-based simulation model of COVID-19 transmission in schools, with >99% out-of-sample predictive validity and minimal training data requirements. We compare the identified decision rules to those proposed by policymakers and to output from alternative metamodels. Our approach can reduce the computational and analytic burden of creating a metamodel, optimize performance for decisions of interest and comparability across models, and provide interpretable, easy-to-update summaries for policymakers.
Mar 22, Virtual, 12:00pm - 1:00pm
Hyunseung Kang, PhD
Assistant Professor, Department of Statistics, University of Wisconsin-Madison
Title: Assumption-Lean Analysis of Cluster Randomized Trials in Infectious Diseases for Intent-to-Treat Effects and Spillover Effects Among A Vulnerable Subpopulation
Abstract: Cluster randomized trials (CRTs) are a popular design to study the effect of interventions in infectious disease settings. However, standard analysis of CRTs primarily relies on strong parametric methods, usually Normal mixed-effects models to account for the clustering structure, and focuses on the overall intent-to-treat (ITT) effect to evaluate effectiveness. The paper presents two methods to analyze two types of effects in CRTs: the overall and heterogeneous ITT effects, and the spillover effect among never-takers who cannot or refuse to take the intervention. For the ITT effects, we make a modest extension of an existing method where we do not impose parametric models or asymptotic restrictions on cluster size. For the spillover effect among never-takers, we propose a new bound-based method that uses pre-treatment covariates, classification algorithms, and a linear program to obtain sharp bounds. A key feature of our method is that the bounds can become dramatically narrower as the classification algorithm improves. The method may also be useful for studies of partial identification with instrumental variables. We conclude by reanalyzing a CRT studying the effect of face masks and hand sanitizers on transmission of 2008 interpandemic influenza in Hong Kong. This is joint work with Chan Park (UW-Madison).
Mar 15, Virtual, 12:00pm - 1:00pm
Sudipto Banerjee, PhD
Professor and Chair of the Department of Biostatistics
UCLA Fielding School of Public Health
Title: Bayesian Finite Population Survey Sampling from Spatial Process Settings
We develop a Bayesian model-based approach to finite population estimation accounting for spatial dependence. Our innovation here is a framework that achieves inference for finite population quantities in spatial process settings. A key distinction from the small area estimation setting is that we analyze finite populations referenced by their geographic coordinates (point-referenced data). Specifically, we consider a two-stage sampling design in which the primary units are geographic regions, the secondary units are point-referenced locations, and the measured values are assumed to be a partial realization of a spatial process. Traditional geostatistical models do not account for variation attributable to finite population sampling designs, which can impair inferential performance. On the other hand, design-based estimates will ignore the spatial dependence in the finite population. This motivates the introduction of geostatistical processes that will enable inference at arbitrary locations in our domain of interest. We demonstrate using simulation experiments that process-based finite population sampling models considerably improve model fit and inference over models that fail to account for spatial correlation. Furthermore, the process-based models offer richer inference with spatially interpolated maps over the entire region. We reinforce these improvements and also demonstrate scalable inference for spatial BIG DATA analysis with millions of locations using Nearest-Neighbor and Meshed Gaussian processes. We will demonstrate our framework with an example of groundwater Nitrate levels in the population of California Central Valley wells by offering estimates of mean Nitrate levels and their spatially interpolated maps.
Mar 8, Virtual, 12:00pm - 1:00pm
Dr. Feng Liang
Associate Professor at the Department of Statistics, University of Illinois at Urbana-Champaign
Title: Learning Topic Models: Identifiability and Rate of Convergence
Abstract: Topic models provide a useful text-mining tool for learning, extracting, and discovering latent structures in large text corpora. Although a plethora of algorithms have been proposed for topic modeling, little work has been done to study the statistical accuracy of the estimated structures. In this paper, we propose an MLE of latent topics based on an integrated likelihood. We further introduce a new set of conditions for topic model identifiability, which are weaker than conditions that rely on the existence of anchor words. In addition, we study the estimation consistency and establish the convergence rate of the proposed estimator. Our algorithm, which is an application of the EM algorithm, is demonstrated to have competitive performance through simulation studies and a real application.
This is based on joint work with Yinyin Chen, Shishuang He, and Yun Yang.