Google DeepMind has introduced a pair of artificial intelligence models, Gemini Robotics-ER 1.5 and Gemini Robotics 1.5, designed to power general-purpose robots. The two models work in tandem, combining reasoning, visual perception, and physical action across diverse real-world situations. The ER 1.5 model plays the role of the ‘planner’ or ‘orchestrator,’ mapping out tasks, while its counterpart, the 1.5 model, executes those plans in response to simple, natural-language commands.
How Google DeepMind’s Gemini Models Serve as the Robot’s Brain
In a recent announcement, DeepMind shared exciting details about these new Gemini Robotics models, emphasizing their role in empowering general-purpose robots for the real world. Generative AI has already sparked a revolution in robotics, transforming how we interact with machines by allowing natural language commands instead of complex traditional interfaces.
Yet the effort to make AI work as a robot’s ‘brain’ has faced significant hurdles. Large language models often struggle with spatial and temporal understanding, which makes precise movement and the recognition of varied object shapes difficult. A big part of the problem has been that a single AI model was asked to both devise a plan and execute it, often resulting in errors and delays.
Google’s answer is a dual-model architecture. At its heart is Gemini Robotics-ER 1.5, a vision-language model (VLM) with advanced reasoning and tool-calling abilities that can formulate multi-step plans for a given task. DeepMind highlights its capacity for logical decision-making in physical environments and its ability to call external resources, such as Google Search, to gather the information it needs. The company also reports strong results on key spatial-understanding benchmarks.
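For developers, a request to the orchestrator might look roughly like the sketch below, which uses the google-genai Python SDK offered through Google AI Studio; the model identifier, file name, and prompt are illustrative assumptions rather than confirmed values.

```python
# Illustrative sketch only: the model id, image file, and prompt are assumptions.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

# Send a scene image plus a natural-language goal and ask for a step-by-step plan.
with open("workbench.jpg", "rb") as f:
    scene = types.Part.from_bytes(data=f.read(), mime_type="image/jpeg")

response = client.models.generate_content(
    model="gemini-robotics-er-1.5-preview",  # assumed model id; check Google AI Studio
    contents=[
        scene,
        "Plan the steps needed to clear this desk, as a numbered list of sub-tasks.",
    ],
)
print(response.text)  # e.g. a numbered plan a robot controller could iterate over
```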
Once a plan is in place, Gemini Robotics 1.5 takes over. This vision-language-action (VLA) model translates visual input and instructions directly into motor commands so the robot can carry out the task. Before acting, it works out an efficient execution path, and notably, it can articulate its thought process in natural language, adding a layer of transparency.
Google says the system markedly improves robots’ ability to understand and execute complex, multi-stage instructions. Consider a user who asks a robot to sort various items into compost, recycling, and trash bins: the system can first look up the latest recycling guidelines online, analyze the objects in front of it, formulate a sorting plan, and then execute the task.
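Put together, the planner-executor split for that sorting scenario might look something like the sketch below. The planner call mirrors the Gemini API usage shown earlier, while the executor is a stand-in stub for the partner-only Gemini Robotics 1.5 VLA; the function names, model id, and guidelines text are hypothetical.

```python
# Hypothetical planner/executor loop for the waste-sorting example.
# The executor is a stub, since Gemini Robotics 1.5 is limited to select partners.
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")
PLANNER_MODEL = "gemini-robotics-er-1.5-preview"  # assumed model id


def plan_sorting_task(scene_description: str, guidelines: str) -> list[str]:
    """Ask the orchestrator model for an ordered list of sub-tasks."""
    response = client.models.generate_content(
        model=PLANNER_MODEL,
        contents=(
            f"Local recycling guidelines:\n{guidelines}\n\n"
            f"Objects on the table:\n{scene_description}\n\n"
            "Return one sub-task per line, e.g. 'place banana peel in compost bin'."
        ),
    )
    return [line.strip() for line in response.text.splitlines() if line.strip()]


def execute_subtask(instruction: str) -> None:
    """Stand-in for the Gemini Robotics 1.5 VLA, which maps an instruction
    plus camera input to motor commands on the actual robot."""
    print(f"[robot] executing: {instruction}")


guidelines = "Plastic bottles -> recycling; food scraps -> compost; wrappers -> trash."
scene = "a plastic bottle, a banana peel, and a candy wrapper"

for step in plan_sorting_task(scene, guidelines):
    execute_subtask(step)
```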
Another highlight is that the models are designed to work across robots of different form factors and sizes, a versatility DeepMind attributes to their spatial-understanding capabilities. Developers can currently access the Gemini Robotics-ER 1.5 orchestrator model through the Gemini application programming interface (API) in Google AI Studio, while the VLA model is presently reserved for a select group of partners.