Google is using Gemini AI to train robots so they can better navigate and perform tasks. In a new research paper, DeepMind's robotics team explains how Gemini 1.5 Pro's longer context window, which determines how much information an AI model can process at once, lets users interact with the RT-2 robot more easily using natural language commands.
This works by filming a video tour of a designated area, such as a home or office space, and using Gemini 1.5 Pro to make the robot "watch" the video and learn about the environment. The robot can then carry out commands based on what it has observed, using verbal and/or image outputs, such as guiding users to a power outlet after being shown a phone and asked "where can I charge this?" DeepMind says its Gemini-powered robot had a 90 percent success rate across more than 50 user instructions given in a 9,000-plus-square-foot operating area.

The researchers also found "preliminary evidence" that Gemini 1.5 Pro enables the droid to plan how to carry out instructions that go beyond simple navigation. For example, if a user with a bunch of Coke cans on their desk asks the droid whether their favorite drink is available, the team said, Gemini understands that the robot should head to the fridge, check whether there is any Coke, and then return to the user to report the result. DeepMind says it plans to study these results further.
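For a rough sense of what "watching" a video tour through a long context window looks like in practice, here is a minimal sketch using Google's public Gemini API via its Python SDK. This is not the research team's actual robot pipeline; the file name, prompt, and API key placeholder are illustrative assumptions.

```python
# Illustrative sketch only: not DeepMind's research system, just the same
# general idea (long-context, multimodal prompting) with the public Gemini SDK.
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # assumes you have a Gemini API key

# Upload a recorded walkthrough of the space (hypothetical file name).
tour_video = genai.upload_file(path="office_tour.mp4")

# Wait for the uploaded video to finish server-side processing.
while tour_video.state.name == "PROCESSING":
    time.sleep(5)
    tour_video = genai.get_file(tour_video.name)

# Gemini 1.5 Pro's long context window lets the whole tour fit in one prompt.
model = genai.GenerativeModel("gemini-1.5-pro")

# Ask a navigation-style question grounded in the video, along the lines of
# the demo's "where can I charge this?" example.
response = model.generate_content([
    tour_video,
    "My phone battery is dead. Based on the tour video, where in this space "
    "should I go to charge it?",
])
print(response.text)
```

In DeepMind's setup the model's answer would then be translated into actual robot navigation, a step this sketch leaves out.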
The video demo Google provided is impressive, but the obvious cuts after the droid acknowledges each request disguise the 10 to 30 seconds it takes to process those instructions, according to the research paper.

It may be a while before we share our homes with more advanced environment-mapping robots, but at least they may be able to find our lost keys or wallets.