PROVIDENCE, R.I. [Brown University] — In 2019, Aaron Gokaslan and Vanya Cohen, then master's students in computer science at Brown University, uploaded a dataset they had developed, OpenWebText, to the internet for anyone to download and use.
The dataset, which can be used to train artificial intelligence language models, such as AI chatbots, to mimic human language, was met with initial praise. But it wasn't until four years later, amid the explosion of interest in AI chatbots, that Gokaslan and Cohen would learn just how much of an impact their small act of rebellion had made.
Recently, OpenWebText was uploaded to the website of Hugging Face, a popular machine learning company that hosts a library of open-source tools for AI models. The company tracks monthly downloads of those tools on its website, and through that data, Gokaslan and Cohen, who graduated in 2019, saw the unexpected: their dataset had gone viral four years after its release.
Out of 20,000 open-source datasets hosted on the Hugging Face website, OpenWebText ranked as the most downloaded from February into April, peaking at approximately 1.2 million downloads in February. It is still clocking close to 500,000 downloads per month.
“It has been really exciting and gratifying,” said Gokaslan, who is now a Ph.D. student at Cornell University. “It's really rare for these things to get thousands of downloads per month, much less millions and millions per month.”
Gokaslan and Cohen, who is now a Ph.D. student at the University of Texas at Austin, created OpenWebText as part of their effort to replicate OpenAI's language processing model GPT-2. They designed the dataset to train their own version of GPT-2, called OpenGPT-2. OpenAI, which is now known for upending the AI landscape with ChatGPT, had declined to publicly release the dataset used to train GPT-2 or the code for building it, saying doing so was too risky. The two Brown students were among the first to sidestep the AI organization and help experts better understand the technology, which led to OpenWebText's initial success.
“The dataset has only become more important and a de facto industry standard because OpenAI never released the dataset used to train their GPT-2 model or any of the training code,” Gokaslan said. The open-source replications the duo created are now often the go-to tool for experimenting with large language models, he added.
OpenWebText comprises a massive collection of text sourced from popular web pages linked on Reddit, mirroring OpenAI's process. Cohen and Gokaslan removed duplicate content, ensured the majority of the text is in English and maintained "high quality of content" throughout the data. The dataset has so far been instrumental in the development of a number of cutting-edge language models, including RoBERTa, Facebook's language model based on Google's BERT.
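The filtering steps described above — removing duplicates, keeping mostly English text and enforcing a quality floor — can be sketched in simplified form. The snippet below is an illustrative stand-in, not the actual OpenWebText pipeline: the `looks_english` heuristic, the exact-match hashing and the `min_words` cutoff are hypothetical simplifications of the real language-detection and deduplication tooling such a project would use.

```python
import hashlib

# A tiny set of very common English words, used only as a crude
# stand-in for a real language classifier (an assumption, not
# the method OpenWebText actually used).
COMMON_ENGLISH = {"the", "and", "of", "to", "a", "in", "is", "that", "on"}

def looks_english(text, threshold=0.05):
    """Guess whether text is English from the share of common words."""
    words = text.lower().split()
    if not words:
        return False
    hits = sum(1 for w in words if w in COMMON_ENGLISH)
    return hits / len(words) >= threshold

def content_hash(text):
    """Hash whitespace-normalized text so trivially reformatted
    copies count as exact duplicates."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def build_corpus(documents, min_words=128):
    """Filter and deduplicate extracted page text."""
    seen = set()
    corpus = []
    for text in documents:
        if len(text.split()) < min_words:  # quality floor: drop very short pages
            continue
        if not looks_english(text):        # keep the corpus mostly English
            continue
        h = content_hash(text)
        if h in seen:                      # skip exact duplicates
            continue
        seen.add(h)
        corpus.append(text)
    return corpus
```

In practice, a pipeline like this would also deduplicate near-identical pages (not just exact matches) and use a trained language detector rather than a word-list heuristic; the sketch only shows the overall shape of the filtering pass.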
Interest in the dataset may have begun climbing earlier, but Gokaslan and Cohen hadn't previously kept track of downloads, since the dataset had been hosted only on Google Drive. They strongly suspect the surge in popularity is tied to the spike in interest in language models driven by the AI-powered chatbots and tools that have now become so widely available.