The program is designed to be completed in 12 months (September to August). Students may elect to complete the program over 16, 21, or 24 months, and many do so. In some cases, exceptionally well-prepared students might be able to complete their work in 9 months. All students begin the program in September; there is no option for starting in the spring semester.
Courses are as follows, and are only offered in the semester noted, except for the independent Practicum (see detailed course descriptions below).
- DATA 1010. Probability, Statistics, and Machine Learning. (Fall; mathematical foundations for data science, 2 credits)
- DATA 1030. Hands-on Data Science. (Fall; practicing the data science pipeline, from data exploration and cleaning to presentation, 1 credit)
- DATA 1050. Data Engineering. (Fall; computer science for data science, 1 credit)
- DATA 2020. Statistical Learning. (Spring; inferential methods for regression analysis and statistical learning, 1 credit)
- DATA 2040. Deep Learning and Special Topics in Data Science. (Spring; deep learning and big data, 1 credit)
- DATA 2080. Data and Society. (Spring; ethical and societal implications, 1 credit)
- Elective. (Domain knowledge relevant to individual interest, 1 credit)
Summer, Fall, or Spring
- DATA 2050. Data Practicum. (Real-world data project, in industry or academia, 1 credit)
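As a quick check on the overall load, the credit values stated in the course headings can be added up. The sketch below assumes that the courses listed above, plus one elective and the Practicum, make up the complete set of requirements; the name `CREDITS` is illustrative only.

```python
# Credit values as stated in the course list and course headings.
# Assumption: this list is the full set of program requirements.
CREDITS = {
    "DATA 1010": 2,
    "DATA 1030": 1,
    "DATA 1050": 1,
    "DATA 2020": 1,
    "DATA 2040": 1,
    "DATA 2080": 1,
    "Elective": 1,
    "DATA 2050": 1,
}

total_credits = sum(CREDITS.values())
print(f"Total credits: {total_credits}")
```

Under that assumption, the total comes to 9 credits across 8 requirements.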
DATA 1010: Probability, Statistics, and Machine Learning (Fall, 2 credits)
An introduction to the mathematical methods of data science through a combination of computational exploration, visualization, and theory. Students will learn scientific computing basics, topics in numerical linear algebra, mathematical probability (probability spaces, expectation, conditioning, common distributions, law of large numbers and the central limit theorem), statistics (point estimation, confidence intervals, hypothesis testing, maximum likelihood estimation, density estimation, bootstrapping, and cross-validation), and machine learning (regression, classification, and dimensionality reduction, including neural networks, principal component analysis, and t-SNE).
DATA 1030: Hands-on Data Science (Fall, 1 credit)
Develops all aspects of the data science pipeline: data acquisition and cleaning, handling missing data, data storage, exploratory data analysis, visualization, feature engineering, modeling, interpretation, and presentation, all in the context of real-world datasets. Fundamental considerations for data analysis are emphasized (the bias-variance tradeoff, training, validation, testing). Classical models and techniques for classification and regression are included (linear regression, ridge and lasso regression, logistic regression, support vector machines, decision trees, ensemble methods). Uses the Python data science ecosystem.
DATA 1050: Data Engineering (Fall, 1 credit)
Provides an introduction to computer science and programming for data science. Coverage includes data structures, algorithms, analysis of algorithms, algorithmic complexity, programming using test-driven design, use of debuggers and profilers, code organization, and version control. Additional topics include data science web applications, SQL and NoSQL databases, and distributed computing.
DATA 2020: Statistical Learning (Spring, 1 credit)
A modern introduction to inferential methods for regression analysis and statistical learning, with an emphasis on application in practical settings in the context of learning relationships from observed data. Topics will include basics of linear regression, variable selection and dimension reduction, and approaches to nonlinear regression. Extensions to other data structures such as longitudinal data and the fundamentals of causal inference will also be introduced. At the end of the course, students will be able to (1) describe the statistical underpinnings of regression-based approaches to data analysis, (2) use R to implement basic and advanced regression analysis on real data, (3) develop written explanations of data analyses used to answer scientific questions in context, and (4) provide a critical appraisal of common statistical analyses, including choice of method and assumptions underlying the method.
DATA 2040: Deep Learning and Special Topics in Data Science (Spring, 1 credit)
A hands-on introduction to neural networks, reinforcement learning, and related topics. Students will learn the theory of neural networks, including common optimization methods, activation and loss functions, regularization methods, and architectures. Topics include model interpretability, connections to other machine learning models, and computational considerations. Students will analyze a variety of real-world problems and data types, including image and natural language data.
DATA 2080: Data and Society (Spring, 1 credit)
A course on the social, political, and philosophical issues raised by the theory and practice of data science. Explores how data science is transforming not only our sense of science and scientific knowledge, but also our sense of ourselves, our communities, and our commitments concerning human affairs and institutions generally. Students will examine the field of data science in light of perspectives provided by the philosophy of science and technology, the sociology of knowledge, and science studies, and explore the consequences of data science for life in the first half of the 21st century.
DATA 2050: Data Practicum (Can be done anytime, 1 credit)
Students work with a practicum supervisor in industry (typically during an internship) or with an academic researcher (typically as part of an ongoing research program) to solve a real-world data problem that exercises the skills developed in the program. Students will submit a proposal, weekly status reports, and a final paper and presentation. To receive credit, the project must entail at least 180 hours of work; it typically takes between 5 and 12 weeks to complete.
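To get a feel for what the 180-hour minimum means in practice, the weekly commitment can be worked out for different project lengths. This is only a rough sketch: it assumes the project uses exactly the minimum number of hours and that the work is spread evenly across weeks, neither of which is stated above. The names `MIN_HOURS` and `weekly_hours` are illustrative.

```python
MIN_HOURS = 180  # minimum hours required for credit


def weekly_hours(total_hours: float, weeks: int) -> float:
    """Average hours per week if the work is spread evenly (an assumption)."""
    if weeks <= 0:
        raise ValueError("weeks must be a positive integer")
    return total_hours / weeks


# Shortest typical duration, a midpoint, and the longest typical duration.
for weeks in (5, 8, 12):
    print(f"{weeks:>2} weeks: {weekly_hours(MIN_HOURS, weeks):.1f} hours/week")
```

Under these assumptions, a 5-week practicum is close to a full-time commitment (36 hours per week), while a 12-week practicum averages 15 hours per week.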