Statistical Theory and Methods

In contrast to frequentist approaches, Bayesian methods provide a principled framework for combining data with prior information when making inferences. When informative prior information is available, Bayesian methods can yield more precise inferences in small samples. In large samples, Bayesian nonparametric/machine learning methods can capture complex, nonlinear relationships in the data to produce accurate predictions and uncertainty quantification. Bayesian methods are widely used to solve complex inference problems in microsimulations, genomics, causal inference, missing data problems, and more.
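As a minimal sketch of how prior information and data combine, the conjugate Beta-Binomial model below updates a prior on a response proportion with a small sample. The study counts and prior parameters are hypothetical values chosen only for illustration; they do not come from any Center project.

```python
from math import sqrt


def beta_binomial_posterior(successes, trials, prior_a=1.0, prior_b=1.0):
    """Return the (a, b) parameters of the Beta posterior for a proportion.

    A Beta(prior_a, prior_b) prior combined with binomial data gives a
    Beta(prior_a + successes, prior_b + failures) posterior.
    """
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    return prior_a + successes, prior_b + (trials - successes)


def beta_mean_sd(a, b):
    """Mean and standard deviation of a Beta(a, b) distribution."""
    mean = a / (a + b)
    variance = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, sqrt(variance)


# Hypothetical small study: 3 responders among 10 participants.
flat = beta_binomial_posterior(3, 10)                # uniform Beta(1, 1) prior
informed = beta_binomial_posterior(3, 10, 6.0, 14.0)  # assumed informative prior

flat_mean, flat_sd = beta_mean_sd(*flat)
informed_mean, informed_sd = beta_mean_sd(*informed)
print(f"flat prior:     mean={flat_mean:.3f}, sd={flat_sd:.3f}")
print(f"informed prior: mean={informed_mean:.3f}, sd={informed_sd:.3f}")
```

The informative prior shrinks the posterior standard deviation relative to the flat prior, which is the sense in which prior information can add precision in small samples. That gain depends entirely on the prior being appropriate.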

Roberta De Vito  Arman Oganisian

Clinical trials prospectively assign participants to two or more interventions and prospectively measure outcomes on the participants. Clinical trials are the gold standard for studying the effectiveness of interventions and are usually required for regulatory approval. Several members of the department are involved in developing methods for the design, analysis, and interpretation of clinical trials. For example, Dr. Schmid and others have published extensively on meta-analysis for clinical trials, and Dr. Schmid is a leader in the development of methods for and applications of N-of-1 trials, which are single-person, multiple-crossover studies. Dr. Steingrimsson works on developing machine learning methods for data-driven discovery of subgroups with enhanced treatment effects. ECOG-ACRIN has carried out large trials comparing different screening modalities in cancer.

Jon Steingrimsson Christopher Schmid


Data science lives at the intersection of statistics, computational sciences, and subject matter knowledge. The Center for Statistical Sciences is heavily invested in health data science through a variety of projects in areas such as computational biology, machine learning, Bayesian statistics, network analysis, causal inference with big data, and analysis of neuroimaging data. The Department of Biostatistics is one of four core departments in Brown's Data Science Initiative (DSI). Dr. Hogan serves as the Deputy Director of the Initiative, Dr. De Vito is currently a member of the Data Science Executive Committee, and Dr. Eloyan is a member of the DSI Campus Advisory Board.

Joseph Hogan Roberta De Vito Ani Eloyan
Youjin Lee Arman Oganisian  

Data Fusion

In many applications in public health, medicine, and social science, patient characteristics are dispersed over multiple files, platforms, and/or studies. Analysis that links two or more separate data sources is increasingly important as researchers seek to integrate administrative and clinical datasets while adapting to privacy regulations that limit access to unique identifiers. Dr. Gutman has developed novel Bayesian procedures to link units that appear in two datasets by treating the unknown linking as missing data. He is collaborating with health services researchers and clinicians to estimate the effects of policies and interventions as well as predict health outcomes from clinical and demographic variables. Dr. De Vito has also developed novel statistical techniques that integrate multiple studies in a single analysis, concurrently estimating the characteristics shared among all the studies and the study-specific components.

Roee Gutman Roberta De Vito

Social network analysis

Statistical and causal inference problems routinely assume that subjects in the data are independent of one another. However, this assumption is easily violated when subjects interact with others through network ties in a large, high-dimensional dataset. Groups in the Center for Statistical Sciences have developed new approaches that remain valid even when subjects are interconnected. Applications of the new methods span diverse fields including HIV, alcohol and substance use research, and neuroimaging networks. Furthermore, we are working on how to use network interactions from diverse data sources to improve overall public health outcomes.

Joseph Hogan Youjin Lee  Ashley Buchanan

Causal inference and big data

Causal inference problems are often challenged by complexities in data from different sources, such as massive online experiments or electronic medical records. To unravel the causal relationships buried in a large data set, we

  1. establish the conditions needed for causal identification,

  2. develop nonparametric methods to estimate meaningful causal quantities flexibly, and

  3. deliver impactful causal implications for public health from big data.

Jon Steingrimsson Youjin Lee Arman Oganisian
Roee Gutman Joseph Hogan  

Loosely speaking, deep learning is a branch of machine learning that uses multi-layer neural networks to build models for tasks such as prediction or diagnosis. The unknown parameters, commonly referred to as weights, are estimated by minimizing a loss function, often subject to some form of regularization. Deep learning has shown promise in many domains, with image-based analysis being a common application area. Several center members are working on deep learning-related research, including analyzing medical images, uncertainty quantification, and interpretability of deep learning models.

Jon Steingrimsson Fenghai Duan

Latent variable models link observed (or manifest) variables to unobserved (or latent) constructs. They comprise two parts: a measurement model specifying the relationship between manifest and latent variables, and a structural model delineating the relationships among the latent variables themselves. Both the manifest and the latent variables can be either discrete or continuous in nature. When both are continuous, one obtains the factor analytic models used widely in psychology, e.g., to measure latent constructs such as human intelligence. When both are discrete, one obtains the latent class models used to categorize observations into distinct groups, e.g., to classify individuals into diseased vs. non-diseased according to their constellation of symptoms. Widely used in educational testing are Item Response Theory (IRT) models (also known as Latent Trait models) that relate a group of categorical manifest variables to a continuous latent variable, e.g., using answers to a multiple choice test to measure mastery of a particular academic subject. Finally, finite mixture models (also known as Latent Profile Analysis) relate a set of continuous manifest variables to underlying categorical constructs, e.g., by partitioning clinical trial participants into homogeneous groups across behavioral and cognitive dimensions of engagement with physical activity interventions. Originally developed for cross-sectional data, latent variable models have recently been generalized to longitudinal data. For example, Latent Transition Analysis has been used to model movement across stages of change in studies of smoking cessation. An example of latent variable modeling by our faculty is given by the 2-parameter logistic IRT models fit to the DSM-IV criteria for nicotine dependence by Dr. Papandonatos and his students. They uncovered a 2-dimensional structure with two positively correlated latent factors, thus contradicting conventional wisdom that DSM-IV symptoms measure a single dimension of liability to nicotine dependence.
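For readers unfamiliar with the 2-parameter logistic IRT model mentioned above, the sketch below shows its standard item response function for a single latent dimension. It is an illustration only, not the two-factor model fit to the DSM-IV data, and the item parameters are hypothetical.

```python
import math


def two_pl_probability(theta, discrimination, difficulty):
    """Probability of endorsing an item under the unidimensional 2PL model.

    P(X = 1 | theta) = 1 / (1 + exp(-a * (theta - b))), where a is the item
    discrimination and b is the item difficulty.
    """
    z = discrimination * (theta - difficulty)
    # Use the algebraically equivalent form for negative z to avoid overflow.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    expz = math.exp(z)
    return expz / (1.0 + expz)


# Hypothetical item: discrimination 1.5, difficulty 0.5 on the latent scale.
for theta in (-1.0, 0.5, 2.0):
    p = two_pl_probability(theta, discrimination=1.5, difficulty=0.5)
    print(f"theta={theta:+.1f} -> P(endorse)={p:.3f}")
```

The difficulty parameter marks the latent level at which endorsement is equally likely as not, and the discrimination parameter controls how sharply the probability rises around that point.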

George Papandonatos

Data from public health and medical research are often subject to clustering, either due to the way they are collected, e.g., multiple observations on the same subject over the duration of the observation period (longitudinal data), or due to some other inherent heterogeneity between groups (strata) of the sampling units. Advanced multivariate statistical methods (e.g., Generalized Estimating Equations (GEE) and mixed-effects models) have been developed to correctly account for and describe the sources of heterogeneity and the variability/correlation structure between and within groups of study subjects. Multivariate statistical methodology involves detecting, analyzing, and characterizing associations among multidimensional data. Related supervised or unsupervised techniques are mainly concerned with the dimension reduction of a system. Center faculty conduct extensive research on novel statistical techniques for analyzing longitudinal and multivariate data, including methods for analyzing individual and aggregated results from personalized (N-of-1) trials of treatment interventions and methods for developing and assessing predictive models for ordinal health outcomes.

Stavroula Chrysanthopoulou Christopher Schmid

Center faculty are leaders in developing and applying methods for meta-analysis, the quantitative combination of results from different studies. Prof. Gatsonis has pioneered the use of hierarchical summary ROC curves for assessing sensitivity and specificity and is developing methods for summarizing the predictive accuracy of diagnostic tests. Prof. Trikalinos heads the Center for Evidence Synthesis in Health, which he co-founded with Prof. Schmid. They have developed a variety of methods and software tools for synthesizing different types of data and studies, including meta-analysis of diagnostic tests, multivariate outcomes, and networks of treatments. Prof. Schmid also heads the Evidence Synthesis Academy, which aims to promote the wider use and understanding of meta-analysis among decision-makers.
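To make "quantitative combination of results" concrete, the sketch below pools study-level effect estimates with inverse-variance weights and adds a DerSimonian-Laird estimate of between-study variance for a random-effects summary. These are standard textbook estimators used here for illustration, not the hierarchical or network methods developed by Center faculty, and the input numbers are made up.

```python
from math import sqrt


def inverse_variance_pool(estimates, std_errors):
    """Fixed-effect pooled estimate and its standard error."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, estimates)) / total
    return pooled, sqrt(1.0 / total)


def dersimonian_laird_tau2(estimates, std_errors):
    """Method-of-moments estimate of the between-study variance tau^2."""
    k = len(estimates)
    if k < 2:
        return 0.0  # heterogeneity cannot be estimated from one study
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled, _ = inverse_variance_pool(estimates, std_errors)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, estimates))
    c = sum(weights) - sum(w ** 2 for w in weights) / sum(weights)
    return max(0.0, (q - (k - 1)) / c)


def random_effects_pool(estimates, std_errors):
    """Random-effects pooled estimate using the DerSimonian-Laird tau^2."""
    tau2 = dersimonian_laird_tau2(estimates, std_errors)
    adjusted = [sqrt(se ** 2 + tau2) for se in std_errors]
    return inverse_variance_pool(estimates, adjusted)


# Made-up effect estimates (e.g., log odds ratios) from three studies.
effects = [0.10, 0.35, 0.60]
ses = [0.10, 0.15, 0.20]
print("fixed effect:  ", inverse_variance_pool(effects, ses))
print("random effects:", random_effects_pool(effects, ses))
```

When the studies disagree more than their standard errors predict, the random-effects summary weights them more evenly and reports a wider standard error than the fixed-effect summary.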

Constantine Gatsonis Thomas A. Trikalinos Christopher Schmid Roberta De Vito George Papandonatos

Statistical methodology research on HIV/AIDS spans a broad spectrum and includes statistical causal inference (e.g., causal pathway analysis of HIV interventions involving behavioral changes); statistical/machine learning methods (e.g., super-learning for risk modeling of treatment failure and prediction); Bayesian statistical modeling of the treatment continuum; clinical decision making for optimizing HIV treatment in resource-limited settings; and microsimulation modeling. The collaborative and methodological research of Professors Hogan, Liu, and Chrysanthopoulou has been funded by NIAID, NIAAA, NHLBI, NICHD, USAID, and others.

Stavroula Chrysanthopoulou Joseph Hogan Tao Liu Jon Steingrimsson

Simulation models are widely used as a valuable tool in cost-effectiveness analyses, comparative effectiveness research, and related work supporting evidence-informed public health decision making. Recent advancements in computing technology have facilitated the development of increasingly intricate predictive models that describe complex health processes and systems using Monte Carlo simulation techniques. These models come in many varieties, distinguished by their specific characteristics, including, but not limited to, state transition, discrete event simulation, dynamic transmission, compartmental, microsimulation, and agent-based models. Microsimulation models, in particular, synthesize information from multiple sources and use computer technology to combine mathematical and statistical models for simulating individual trajectories related to the course of a disease, usually in conjunction with some treatment or other interventions.
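The idea of simulating individual trajectories can be sketched with a toy state-transition microsimulation. The three health states and the annual transition probabilities below are hypothetical and far simpler than any calibrated model; the sketch shows the mechanics only.

```python
import random

# Hypothetical annual transition probabilities between health states.
TRANSITIONS = {
    "healthy": {"healthy": 0.90, "sick": 0.08, "dead": 0.02},
    "sick": {"healthy": 0.10, "sick": 0.70, "dead": 0.20},
    "dead": {"dead": 1.0},  # absorbing state
}


def simulate_individual(rng, years, start="healthy"):
    """Simulate one person's yearly health states as a list of state names."""
    state = start
    path = [state]
    for _ in range(years):
        options = TRANSITIONS[state]
        state = rng.choices(list(options), weights=list(options.values()))[0]
        path.append(state)
    return path


def simulate_cohort(n, years, seed=0):
    """Simulate n independent individuals with a reproducible random stream."""
    rng = random.Random(seed)
    return [simulate_individual(rng, years) for _ in range(n)]


cohort = simulate_cohort(n=1000, years=10, seed=42)
alive = sum(path[-1] != "dead" for path in cohort)
print(f"{alive} of {len(cohort)} simulated individuals alive after 10 years")
```

Summaries such as survival or time spent sick are then computed across the simulated individuals. In practice the transition rules depend on individual covariates and are calibrated to observed data.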

Center faculty have extensive expertise in this area, working on statistical approaches for developing, evaluating, and implementing these types of simulation models, with applications to cancer, sexually transmitted diseases, opioid use disorder, COVID-19, dementia, and other conditions. Dr. Trikalinos is the PI of the NCI CISNET bladder cancer incubator site and has been core PI of several projects involving the development and application of simulation models for public health decision making. Dr. Chrysanthopoulou specializes in statistical techniques for calibration, validation, and predictive accuracy assessment of microsimulation models, has developed the open-source MIcrosimulation Lung Cancer (MILC) model of the natural history of lung cancer, and is involved in collaborative projects at Brown University and other institutions for building complex simulation models used in decision analysis.

Stavroula Chrysanthopoulou Thomas A. Trikalinos

N-of-1 trials are randomized multi-crossover experiments conducted on a single individual in order to determine the personalized relative efficacy of two or more treatments measured repeatedly over time. Prof. Schmid and a team of graduate students are developing time series and multilevel methods and software for the design and analysis of single trials as well as the meta-analysis of a series of N-of-1 trials that can estimate both individual and population level effects. The group has served as the analytic hub for several large federally and non-federally funded studies using the N-of-1 framework. These include studies of alternative treatments for chronic pain, diets for inflammatory bowel disease, triggers of atrial fibrillation, and behavioral interventions for anxiety and stress. The group is collaborating with other Brown scientists to develop a mobile app that can flexibly set up, run, analyze, and interpret data from one or more N-of-1 trials.

Christopher Schmid

Statistical learning is a framework under the broad umbrella of machine learning that uses techniques from functional analysis to understand data. Statistical learning is often divided into two common categories: (i) supervised and (ii) unsupervised learning. Briefly, supervised learning involves building a predictive model based on some response or outcome of interest, while unsupervised learning learns about relationships and data structures without any supervising outcome variable. Many faculty members in the Center are developing novel statistical learning approaches to tackle specific public health-related problems. Some of these areas include artificial neural networks for medical imaging, anomaly detection methods for clinical trials, online learning techniques for real-time clinical prognostics, and dimensionality reduction and structured prediction models in genome-wide association studies.

Jon Steingrimsson Arman Oganisian

Survival analysis is the branch of statistics that deals with analyzing data when the outcome of interest is the time to some event, such as time to death or disease progression. Such outcomes are often only partially observed due to participants dropping out of the study or not having experienced the event of interest before the end of the study period (referred to as censoring). This partial missingness creates statistical challenges, and several faculty members work on developing methods to address these challenges. Dr. Steingrimsson works on adapting machine learning algorithms for censored data, and Dr. Chrysanthopoulou works on approaches for simulating time-to-event data (accounting for censoring) in the context of complex simulation models (e.g., microsimulation models) used in public health decision making. In addition, several faculty members are involved in interdisciplinary collaborations that involve analysis of time-to-event outcomes.
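One standard way to handle censoring is the Kaplan-Meier estimator, sketched below as a general illustration rather than a method specific to the Center. Censored participants count as at risk until their last follow-up time but never contribute an event. The follow-up times are invented.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimates at each distinct event time.

    times  : follow-up time for each participant
    events : 1 if the event was observed at that time, 0 if censored
    Returns a list of (time, estimated survival probability) pairs.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        observed_events = 0
        leaving = 0
        # Group everyone whose follow-up ends at the same time t.
        while i < len(data) and data[i][0] == t:
            observed_events += data[i][1]
            leaving += 1
            i += 1
        if observed_events > 0:
            survival *= 1.0 - observed_events / at_risk
            curve.append((t, survival))
        # Both events and censored cases leave the risk set after time t.
        at_risk -= leaving
    return curve


# Invented follow-up data: 0 marks a censored observation.
follow_up = [2, 3, 3, 5, 8]
observed = [1, 1, 0, 1, 0]
for t, s in kaplan_meier(follow_up, observed):
    print(f"t={t}: S(t)={s:.2f}")
```

Simply dropping the censored participants would bias the estimate, because they are known to have survived at least until their censoring time.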

Jon Steingrimsson Stavroula Chrysanthopoulou

Topological data analysis (TDA) visualizes the “shape” of data from the spatial connectivity between discrete points. Prof. Crawford and his lab group use TDA to summarize complex patterns that underlie high-dimensional biological data. They are particularly interested in the “sub-image” selection problem, where the goal is to identify the physical features of a collection of 3D shapes (e.g., tumors and single cell formations) that best explain the variation in a given trait or phenotype. Actively collaborating with faculty in the Center for Computational Molecular Biology, the School of Engineering, and the Robert J. & Nancy D. Carney Institute for Brain Science, the Crawford Lab works to develop unified statistical and machine learning frameworks that generalize the use of topological summary statistics in 3D shape analyses. Current application areas include radiomics with clinical imaging of brain-based diseases, molecular biology with 3D microscopy of cells, biophysics with molecular dynamics simulations, and anthropology with computed tomography (CT) scans of bones.

Lorin Crawford