Google’s New Robot Learned to Take Orders by Scraping the Web

The machine learning technique that taught notorious text generator GPT-3 to write can also help robots make sense of spoken commands.
PaLM robot picking up a dish sponge in a kitchen
Courtesy of Google

Late last week, Google research scientist Fei Xia sat in the center of a bright, open-plan kitchen and typed a command into a laptop connected to a one-armed, wheeled robot resembling a large floor lamp. “I’m hungry,” he wrote. The robot promptly zoomed over to a nearby countertop, gingerly picked up a bag of multigrain chips with a large plastic pincer, and wheeled over to Xia to offer up a snack.

The most impressive thing about that demonstration, held in Google’s robotics lab in Mountain View, California, was that no human coder had programmed the robot to understand what to do in response to Xia’s command. Its control software had learned how to translate a spoken phrase into a sequence of physical actions using millions of pages of text scraped from the web.

That means a person doesn’t have to use specific preapproved wording to issue commands, as can be necessary with virtual assistants such as Alexa or Siri. Tell the robot “I’m parched,” and it should try to find you something to drink; tell it “Whoops, I just spilled my drink,” and it ought to come back with a sponge.

“In order to deal with the diversity of the real world, robots need to be able to adapt and learn from their experiences,” Karol Hausman, a senior research scientist at Google, said during the demo, which also included the robot bringing a sponge over to clean up a spill. To interact with humans, machines must learn to grasp how words can be put together in a multitude of ways to generate different meanings. “It’s up to the robot to understand all the little subtleties and intricacies of language,” Hausman said.

Google’s demo was a step toward the longstanding goal of creating robots capable of interacting with humans in complex environments. In the past few years, researchers have found that feeding huge amounts of text taken from books or the web into large machine learning models can yield programs with impressive language skills, including OpenAI’s text generator GPT-3. By digesting the many forms of writing online, software can pick up the ability to summarize or answer questions about text, generate coherent articles on a given subject, or even hold cogent conversations.

Google and other Big Tech firms are making wide use of these large language models for search and advertising. A number of companies offer the technology via cloud APIs, and new services have sprung up applying AI language capabilities to tasks like generating code or writing advertising copy. Google engineer Blake Lemoine was recently fired after publicly warning that a chatbot powered by the technology, called LaMDA, might be sentient. A Google vice president who remains employed at the company wrote in The Economist that chatting with the bot felt like “talking to something intelligent.”

Despite those strides, AI programs are still prone to becoming confused or regurgitating gibberish. Language models trained with web text also lack a grasp of truth and often reproduce biases or hateful language found in their training data, suggesting careful engineering may be required to reliably guide a robot without it running amok.

The robot demonstrated by Hausman ran on PaLM, the most powerful language model Google has announced so far. PaLM is capable of many tricks, including explaining, in natural language, how it comes to a particular conclusion when answering a question. The same approach is used to generate a sequence of steps that the robot will execute to perform a given task.

To create the robot butler, researchers at Google worked with hardware from Everyday Robots, a company spun out of X, the division of Google parent Alphabet dedicated to “moonshot” research projects. They created a new program that uses the text processing capabilities of PaLM to translate a spoken phrase or command into a sequence of appropriate actions, such as “open drawer” or “pick up chips,” that the robot can perform.

The robot’s library of physical actions was learned through a separate training process in which humans remotely controlled the robot to demonstrate how to do things like pick up objects. The robot has a limited set of tasks that it can perform within its environment, which helps prevent misunderstandings by the language model from becoming errant behavior.
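To see why a limited action set acts as a guardrail, consider a rough sketch in Python. It is not Google's code, and it assumes one plausible design: the planner ranks entries in a fixed list instead of writing free text. The skill names, the `HINTS` table, and the `toy_score` function are all hypothetical; `toy_score` is a hand-written stand-in for the judgment a language model such as PaLM would supply.

```python
from typing import Callable, List

# Hypothetical skill library: the only actions this toy robot can execute.
SKILLS: List[str] = [
    "find chips", "pick up chips",
    "find water", "pick up water",
    "find sponge", "pick up sponge",
    "bring to user", "done",
]

# Stand-in for a language model's knowledge: which object suits which request.
HINTS = {"hungry": "chips", "parched": "water", "spilled": "sponge"}

Scorer = Callable[[str, List[str], str], float]


def toy_score(instruction: str, history: List[str], skill: str) -> float:
    """Return 1.0 if `skill` is the sensible next step, else 0.0 (toy heuristic)."""
    text = instruction.lower()
    target = next((obj for word, obj in HINTS.items() if word in text), None)
    if target is None:
        # Nothing recognizable was asked for, so the only sensible step is to stop.
        return 1.0 if skill == "done" else 0.0
    recipe = [f"find {target}", f"pick up {target}", "bring to user", "done"]
    step = min(len(history), len(recipe) - 1)
    return 1.0 if skill == recipe[step] else 0.0


def plan(instruction: str, score: Scorer = toy_score, max_steps: int = 6) -> List[str]:
    """Greedily pick the best-scoring skill from SKILLS until "done" is chosen."""
    history: List[str] = []
    for _ in range(max_steps):
        # The choice is always drawn from SKILLS, never from free-form text.
        best = max(SKILLS, key=lambda s: score(instruction, history, s))
        if best == "done":
            break
        history.append(best)
    return history


print(plan("I'm hungry"))
print(plan("Whoops, I just spilled my drink"))
```

Because every step is chosen from `SKILLS`, even a badly confused scorer can only produce an unhelpful sequence of known actions, never an action the robot was not taught.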

PaLM’s language skills can allow a robot to make sense of relatively abstract commands. When a robot arm was tasked with moving colored blocks and bowls around, Google research scientist Andy Zeng asked it to “imagine that my wife is the blue block and I am the green block. Bring us closer together.” The robot responded by moving the blue block to sit next to the green block.

"Applying large language models to robotics is an exciting direction," says Stefanie Tellex, an assistant professor at Brown University who specializes in robot learning and robot-human collaboration. But she adds that broadening the range of tasks that a robot can perform—so that it can do more things that a person might ask—remains "a large unsolved problem."

Brian Ichter, a research scientist at Google involved with the project, acknowledges that “plenty of things” can still befuddle the Google kitchen robot. Simply changing the lighting or moving an object can cause the machine to fail to grasp an object correctly, illustrating how robots can struggle with physical tasks that are trivial for humans.

It is also unclear whether the system would handle complex sentences or commands as smoothly as the short commands it responded to in demos. AI advances have already expanded what robots can do; for example, industrial robots can identify products or spot defects in factories. Many researchers are also exploring ways for robots to learn through practice, in the real world or in simulation, and from observation. But demos that seem impressive often work in only a limited setting.

Ichter says the project may lead to methods of imbuing language models with better understandings of physical reality. Mistakes made by AI language software are often underpinned by a lack of common sense knowledge, which humans use to make sense of the ambiguities of language. “Language models haven’t really experienced the world in any way. They only reflect the statistics of the words they have read on the internet,” Ichter says.

Google’s research project is a long way from being a product, but many of the company’s rivals have recently taken a new interest in home robots. Last September, Amazon demonstrated Astro, a home robot with far more limited abilities; this month the company announced that it plans to buy iRobot, the company behind the popular Roomba robot vacuum cleaner. Elon Musk has promised that Tesla will build a humanoid robot, although details on the project are scarce, and it may be more of a recruiting pitch than a product announcement.