Is Your Big Data Too Big or Too Small? Sample Complexity, Uniform Convergence, and Rademacher Averages
What sample size is needed to accurately estimate a parameter? For a given, fixed parameter, the answer is straightforward. But in most advanced data analysis the parameter of interest is not predefined; it is the outcome of a search over some parameter space. For example: what is the largest fraction of patients that share any set of 5 mutations? The difficulty here is that the question implies a large (exponential) multiple-comparison analysis, and classical statistics has no general, effective method to evaluate the accuracy of such estimates from a sample.
We will present computationally efficient methods for evaluating the sample complexity (the error bound for a given sample size) in specific domains relevant to massive data mining applications. Surprisingly, we demonstrate that our methods, which give provable statistical guarantees, are more computationally efficient than known heuristics that offer no guarantees on the quality of their solutions.
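To give a concrete flavor of the Rademacher-average technique named in the title (this is a generic illustration, not the speaker's specific method), the sketch below estimates an empirical Rademacher average by Monte Carlo for a hypothetical family of single-mutation frequency queries, and plugs it into the standard uniform-convergence bound. All names, sizes, and the data itself are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: n patients, each with a binary profile over m candidate
# mutations; the "parameter space" is the family of single-mutation frequency
# queries (a simple stand-in for the exponential family of mutation sets).
n, m = 1000, 50
X = rng.random((n, m)) < 0.3  # X[i, j] = 1 iff patient i has mutation j

def empirical_rademacher(X, trials=200, rng=rng):
    """Monte Carlo estimate of the empirical Rademacher average
    R = E_sigma[ max_j (1/n) * sum_i sigma_i * X[i, j] ],
    where the sigma_i are independent uniform random signs."""
    n = X.shape[0]
    total = 0.0
    for _ in range(trials):
        sigma = rng.choice([-1.0, 1.0], size=n)  # random sign vector
        total += np.max(sigma @ X) / n           # best correlation over queries
    return total / trials

R = empirical_rademacher(X)

# Standard uniform-convergence bound: with probability >= 1 - delta, every
# query's true frequency is within 2R + sqrt(ln(2/delta) / (2n)) of its
# sample frequency, simultaneously over all m queries.
delta = 0.05
eps = 2 * R + np.sqrt(np.log(2 / delta) / (2 * n))
print(f"Rademacher average ~ {R:.4f}, uniform deviation bound ~ {eps:.4f}")
```

The point of the technique is that the bound adapts to the data: the Rademacher average is computed from the sample itself, so it can be far tighter than a worst-case union bound over the whole (exponential) parameter space.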
Eli Upfal is a professor of computer science at Brown University, where he also served as department chair from 2002 to 2007. Prior to joining Brown in 1998, he was a researcher and project manager at the IBM Almaden Research Center, and a professor of Applied Mathematics and Computer Science at the Weizmann Institute of Science. His main research interests are randomized algorithms, probabilistic analysis of algorithms, and computational statistics, with applications ranging from combinatorial and stochastic optimization to massive data analysis, social networks, and computational biology.