QuIC-M - Safe Visual Data Exploration and Machine Learning
In the era of big data, plenty of software on the market helps people explore and visualize datasets in search of patterns and new discoveries. But users can’t tell whether the patterns they’re seeing are real or simply arise by random chance unless they apply appropriate statistical tests to validate their findings, and currently available commercial data exploration tools do not provide such tests.
Statisticians and scientists routinely use a suite of tests to measure whether or not a result is statistically significant. But the statistical issues in the big data world go well beyond basic significance tests. Modern data exploration tools make it easy to poke and prod a dataset in myriad ways with a few mouse clicks. That can create an issue known to statisticians as the “multiple comparisons problem,” and it’s one of the things this project will address.
The problem is essentially this: the more questions you ask of a dataset, the more likely you are to stumble upon something that looks like a genuine correlation but is actually just a random fluctuation in the data. Without proper statistical correction, this can lead to false discoveries. There are statistical techniques for dealing with the problem, but none of them is easily implemented in a real-time data exploration setting. This project will develop a technique suited to that setting.
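To make the multiple comparisons problem concrete, here is a minimal Python sketch. It is not the technique this project will develop; it only illustrates the textbook effect and the classical Bonferroni correction. It assumes that the tests are independent and that, when no real effect exists, each test's p-value is uniformly distributed between 0 and 1. The function names and parameters are hypothetical and chosen for illustration.

```python
import random


def familywise_error_rate(alpha, m):
    """Chance of at least one false positive across m independent tests,
    each run at significance level alpha, when no real effect exists."""
    return 1.0 - (1.0 - alpha) ** m


def bonferroni_threshold(alpha, m):
    """Classical Bonferroni correction: test each hypothesis at alpha / m."""
    return alpha / m


def simulate_false_discoveries(m, alpha, trials=2000, seed=0):
    """Simulate exploration sessions in which every pattern is pure noise.

    Assumption: under the null hypothesis each p-value is uniform on [0, 1).
    Returns the fraction of sessions with at least one "significant" result,
    first without correction and then with the Bonferroni threshold.
    """
    rng = random.Random(seed)
    threshold = bonferroni_threshold(alpha, m)
    uncorrected_hits = 0
    corrected_hits = 0
    for _ in range(trials):
        smallest_p = min(rng.random() for _ in range(m))
        if smallest_p < alpha:
            uncorrected_hits += 1
        if smallest_p < threshold:
            corrected_hits += 1
    return uncorrected_hits / trials, corrected_hits / trials


if __name__ == "__main__":
    # 20 questions asked of the same noise-only dataset, each tested at 0.05
    print(familywise_error_rate(0.05, 20))
    print(simulate_false_discoveries(20, 0.05))
```

Under these assumptions, asking 20 questions at the usual 0.05 level gives roughly a 64 percent chance of at least one spurious "discovery", whereas the Bonferroni threshold keeps that chance near 5 percent. The correction needs the total number of tests in advance, which is one reason it fits poorly with open-ended, real-time exploration.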