Brown University computer scientists clear a path to stream 3D ‘volumetric’ video

A new method called PackUV compresses massive 3D video data into everyday video formats, potentially bringing immersive video experiences closer to home televisions and computers.

PROVIDENCE, R.I. [Brown University] — New research by Brown University computer scientists may be a key step in bringing volumetric video — video that can be viewed from virtually any perspective in a 3D scene — to computers and smart televisions.

The research introduces a new way of processing video called PackUV, which improves the capture of 3D action and makes the final product readily streamable, storable and compatible with the video codecs that currently power most video on the internet.

“With volumetric video, you can basically explore a scene from any vantage point you want,” said Aashish Rai, a computer science graduate student at Brown who led the work. “It captures three dimensions of space, plus time, making it a 4D video. With our work, we basically convert this entire 4D scene into a normal video that you can stream over the internet and share with friends.”

Rai, who works in the Interactive 3D Vision and Learning Lab at Brown led by Assistant Professor Srinath Sridhar, will present the work in June at the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

Volumetric video is an emerging way of both capturing and viewing video. Actions are recorded using multiple synchronized cameras that encircle a scene. Computer algorithms then rebuild that physical space in 3D, making a video reproduction that can be viewed from any perspective within the space. Directors could use it to show a scene from a perspective where a camera would have been impossible to place. With an added user interface, people can navigate their way through a scene — watching, for example, sporting events from on the field or a concert from the stage.

For all the promise of volumetric video, there are still challenges to overcome, and this new research tackles several. Chief among them, the work introduces a new way of compressing volumetric video, which makes storing and streaming feasible with current media infrastructure.

“Volumetric video is incredibly hard to store and stream,” Rai said. “A 30-minute clip can balloon to terabytes of data, and the formats it comes in are completely alien to the infrastructure the internet already runs on — your computer, your streaming service, your video codec.”

Their solution was to start with the state-of-the-art method of rendering 3D scenes, known as 3D Gaussian splatting. The technique renders 3D images using “Gaussians,” fuzzy blobs that encode the color, opacity and shape of points in space. The quality of the images is high, but file sizes are huge. The innovation in this new work is a way of mapping the 3D scene and its millions of Gaussians into a more manageable 2D image in a way that’s similar to projecting a globe onto a flat map.

The result is “a structured, multi-scale image that encodes the entire dynamic 3D scene,” Rai said. Stack those 3D-encoded images together, and it makes a video with a reasonable file size that is compatible with stalwart video codecs that run Netflix, YouTube and most of the rest of the internet.
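As a rough illustration of the "flatten 3D data into an image" idea, the sketch below lays out one attribute vector per Gaussian in a plain row-major 2D grid and then recovers it. This is not the PackUV mapping itself, which the article describes only at a high level; the function names, the row-major layout and the zero padding are all assumptions made for illustration.

```python
import math


def pack_to_grid(attributes, width):
    """Lay out one attribute vector per Gaussian in a row-major 2D grid.

    Hypothetical helper: a naive layout, not the structured multi-scale
    mapping used by the researchers.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    channels = len(attributes[0]) if attributes else 0
    height = math.ceil(len(attributes) / width)
    grid = []
    for row in range(height):
        cells = [list(a) for a in attributes[row * width:(row + 1) * width]]
        # Pad the last row with zero vectors so the grid is rectangular.
        cells += [[0.0] * channels for _ in range(width - len(cells))]
        grid.append(cells)
    return grid


def unpack_from_grid(grid, count):
    """Recover the first `count` attribute vectors from the grid."""
    flat = [cell for row in grid for cell in row]
    return flat[:count]
```

Each grid produced this way plays the role of one image; a sequence of such images, one per time step, is what a standard video codec could then compress.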

Creating a digital twin of the real world

There’s another key challenge the work addresses. Other Gaussian splatting approaches to volumetric video work well for short videos, the researchers say, but often break down over longer sequences. To work properly, rendering approaches to volumetric video must keep track of moving objects in a scene. But current tracking techniques often lose objects that temporarily disappear from sight, such as a ball that passes behind a person. They also have trouble with novel movement, such as a person entering a room in the middle of a sequence of events. The new work takes a different approach to the problem.

“We are able to handle this by splitting the long video into small chunks,” Rai said. “At the beginning of each chunk, we basically see if something has moved or if it has entered or left the room, and model accordingly.”

By restarting the tracking process more frequently, the new technique is better able to reacquire objects that have been temporarily blocked and deal appropriately with new movements. As a result, the new approach can render complex scenes of up to 30 minutes in length without breaking down, far longer than other Gaussian splatting approaches.
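The chunking idea can be sketched in a few lines. The example below splits a frame range into fixed-size chunks and, at a chunk boundary, compares which objects are visible before and after. It is a minimal sketch under stated assumptions: the article does not say how chunk length is chosen or how changes are detected, so the fixed chunk size, the function names and the use of object IDs are hypothetical.

```python
def split_into_chunks(num_frames, chunk_size):
    """Return (start, end) frame ranges, end exclusive, covering the video."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, num_frames))
        for start in range(0, num_frames, chunk_size)
    ]


def changes_at_chunk_start(previous_ids, current_ids):
    """Report which (hypothetical) object IDs entered or left the scene."""
    previous_ids, current_ids = set(previous_ids), set(current_ids)
    entered = current_ids - previous_ids
    left = previous_ids - current_ids
    return entered, left
```

For instance, `split_into_chunks(10, 4)` yields three ranges, the last one shorter than the others, and `changes_at_chunk_start` would flag a person who walks in between two chunks as newly entered.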

To test and benchmark their new technique, the researchers assembled what they believe to be the largest multi-view video dataset to date. Captured with an array of 50 to 90 synchronized cameras, the dataset includes video of people performing all kinds of actions, from playing basketball and pickleball to cooking and woodworking. The actions were captured both in a lab specially equipped with cameras and with a mobile camera array in real-world settings.

The team has made the entire dataset available to any researchers who may want to use it. The aim, Sridhar says, is to help advance a technology that he sees as having a wealth of future applications.

“There are real-world applications in entertainment and sports, for example, but also other use cases — manufacturing and other areas — where you need to create digital twins of the real world,” Sridhar said. “Fundamentally, that’s what this work is about.”

The research was supported by the Office of Naval Research (N00014-23-1-2804) and a National Science Foundation Career Award (2143576).
