AI Large Models Toward Multimodality

Jan 04, 2025

Have you heard of Moravec's paradox? The paradox states that advanced reasoning requires very little computational power for an artificial intelligence (AI) system, while implementing the perceptual-motor skills that humans take for granted requires enormous computational resources. In essence, complex logical tasks are easier for AI than the basic sensory tasks that human instincts can accomplish. This paradox highlights the difference between AI and human cognitive abilities at this stage.


People are inherently multimodal. Each of us is like an intelligent terminal: we usually go to school to be educated (trained), but the purpose and result of that training is the ability to work and live autonomously, without constantly relying on external instructions and control.


We learn about the world around us through multiple sensory modalities, such as sight, speech, sound, touch, taste, and smell, and use them to analyze, reason, decide, and act.


After years of evolution in sensor fusion and AI, robots today are largely equipped with multimodal sensors. As we bring more computing power to edge devices such as robots, these devices are becoming smarter and smarter: they can perceive their surroundings, understand and communicate in natural language, acquire a sense of touch through digital sensing interfaces, and measure the robot's specific force, angular velocity, and even the surrounding magnetic field through a combination of accelerometers, gyroscopes, and magnetometers.
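As a minimal illustration of how such inertial measurements can be fused on an edge device, the sketch below combines accelerometer and gyroscope readings with a simple complementary filter to estimate pitch. The sensor-reading calls, filter gain, and sample rate are hypothetical placeholders, not part of any specific robot stack.

import math

ALPHA = 0.98          # complementary-filter gain (assumed tuning value)
DT = 0.01             # sample period in seconds (100 Hz, assumed)

def complementary_pitch(pitch_prev, gyro_rate_y, accel_x, accel_z):
    """Fuse gyroscope and accelerometer readings into a pitch estimate (radians)."""
    # Integrate the gyroscope's angular velocity for a smooth short-term estimate.
    pitch_gyro = pitch_prev + gyro_rate_y * DT
    # Derive a drift-free but noisy pitch from the gravity vector.
    pitch_accel = math.atan2(-accel_x, accel_z)
    # Blend: trust the gyro at high frequency, the accelerometer at low frequency.
    return ALPHA * pitch_gyro + (1.0 - ALPHA) * pitch_accel

# Hypothetical usage with readings from an IMU driver:
# pitch = 0.0
# while True:
#     gx, gy, gz = read_gyro()    # rad/s (placeholder driver call)
#     ax, ay, az = read_accel()   # m/s^2 (placeholder driver call)
#     pitch = complementary_pitch(pitch, gy, ax, az)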


Toward a New Era of Robotics and Machine Cognition


Prior to Transformers and large language models (LLMs), implementing multimodality in AI typically required multiple separate models, each responsible for a different type of data (text, images, audio), with the modalities integrated through a complex pipeline.


With the advent of Transformer models and LLMs, multimodality has become more integrated, allowing a single model to simultaneously process and understand multiple data types, resulting in AI systems that are more capable of comprehensively sensing their environment. This shift has greatly improved the efficiency and effectiveness of multimodal AI applications.
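To make the idea of one model handling several data types more concrete, the sketch below (PyTorch, with invented dimensions and module names) projects image patches and text tokens into a shared embedding space and runs the combined sequence through a single Transformer encoder. It is an illustrative pattern only, not the architecture of any specific product.

import torch
import torch.nn as nn

class TinyMultimodalEncoder(nn.Module):
    """Illustrative encoder: one Transformer over mixed image/text tokens."""
    def __init__(self, d_model=256, vocab_size=32000, patch_dim=768):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)   # text tokens -> vectors
        self.patch_proj = nn.Linear(patch_dim, d_model)       # image patches -> vectors
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, text_ids, image_patches):
        # Map each modality into the same embedding space...
        text_tokens = self.text_embed(text_ids)          # (B, T_text, d_model)
        image_tokens = self.patch_proj(image_patches)    # (B, T_img, d_model)
        # ...then let a single Transformer attend across both modalities jointly.
        fused = torch.cat([image_tokens, text_tokens], dim=1)
        return self.encoder(fused)

# Hypothetical usage:
# model = TinyMultimodalEncoder()
# out = model(torch.randint(0, 32000, (1, 16)), torch.randn(1, 49, 768))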


While LLMs such as GPT-3 are primarily text-based, the industry has made rapid progress toward multimodality. OpenAI's CLIP and DALL-E, and now Sora and GPT-4o, are examples of models that have moved toward multimodality and more natural human-computer interaction. For example, CLIP understands images paired with natural language, bridging the gap between visual and textual information, while DALL-E generates images from textual descriptions. We see Google's Gemini models undergoing a similar evolution.


In 2024, the multimodal evolution accelerated. In February, OpenAI released Sora, which generates realistic or imaginative videos from text descriptions. This could provide a promising path toward building general-purpose world simulators, or become an important tool for training robots. Three months later, GPT-4o significantly improved the quality of human-computer interaction, reasoning across audio, vision, and text in real time. Training a single model end-to-end on text, visual, and audio data eliminates the two conversion steps of a cascaded pipeline, from the input modality to text and from text to the output modality, which in turn dramatically improves performance.
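The difference between a cascaded pipeline and an end-to-end multimodal model can be sketched in a few lines of pseudocode. The function and object names below (transcribe, synthesize, and the model objects) are hypothetical placeholders used only to contrast the two data flows.

# Cascaded voice assistant: two modality conversions around a text-only LLM.
def cascaded_reply(audio_in):
    text_in = speech_to_text.transcribe(audio_in)    # audio -> text (conversion 1)
    text_out = text_llm.generate(text_in)            # text  -> text
    return text_to_speech.synthesize(text_out)       # text  -> audio (conversion 2)

# End-to-end multimodal model: audio (plus optional image) in, audio out,
# with no intermediate text bottleneck that can drop tone, emotion, or timing.
def end_to_end_reply(audio_in, image=None):
    return multimodal_model.generate(audio=audio_in, image=image)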


In the same week in February, Google released Gemini 1.5, which dramatically expanded the context window to 1 million tokens. This means that Gemini 1.5 Pro can process large amounts of information at once, including an hour of video, 11 hours of audio, or a codebase containing more than 30,000 lines of code or 700,000 words. Gemini 1.5 builds on Google's leading research into the Transformer and Mixture-of-Experts (MoE) architectures, and Google has also open-sourced 2B and 7B models that can be deployed at the edge. At the Google I/O conference in May, in addition to doubling the context length and releasing a series of generative AI tools and applications, Google shared its vision for Project Astra, a general-purpose AI assistant that processes multimodal information, understands the context the user is in, and converses with people in a very natural way.
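A Mixture-of-Experts layer routes each token to a small subset of expert networks, so only a fraction of the model's parameters are active for any given token. The minimal sketch below (PyTorch, with invented sizes and top-1 routing) illustrates the routing idea only; it does not reproduce Gemini's actual design.

import torch
import torch.nn as nn

class Top1MoELayer(nn.Module):
    """Toy Mixture-of-Experts layer with top-1 routing (illustrative only)."""
    def __init__(self, d_model=256, num_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)    # scores each token per expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                                # x: (tokens, d_model)
        scores = self.router(x).softmax(dim=-1)
        best = scores.argmax(dim=-1)                     # chosen expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = best == i
            if mask.any():                               # run only the tokens routed here
                out[mask] = expert(x[mask]) * scores[mask, i].unsqueeze(-1)
        return out

# Hypothetical usage: Top1MoELayer()(torch.randn(10, 256))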


As the company behind the open-source LLM Llama, Meta has also joined the artificial general intelligence (AGI) race.


This true multimodality greatly increases the level of machine intelligence and will lead to new paradigms for many industries.


For example, robots used to be very homogeneous, with some sensors and locomotion capabilities, but generally they did not have the "brain" to learn new things and adapt to unstructured and unfamiliar environments.


Multimodal LLMs are expected to transform the ability of robots to analyze, reason, and learn, moving them from specialization to generalization. PCs, servers, and smartphones are the leading general-purpose computing platforms, able to run many different kinds of software to deliver a wide variety of functions. Generalization helps robots scale up and generates economies of scale; as volumes grow, prices can fall dramatically, leading to a virtuous cycle of adoption in more areas.


Elon Musk recognized the benefits of general-purpose technology early on: Tesla's robots have evolved from Bumblebee in 2022 to Optimus Gen 1, announced in March 2023, and Gen 2, announced at the end of 2023, with ever-increasing versatility and learning capability. Over the past 6-12 months, we have witnessed a number of breakthroughs in robotics and humanoid robotics.


New technologies behind next-generation robotics and embodied intelligence


There is no doubt that we still have a lot of work to do before embodied intelligence reaches mass production. We need lighter designs, longer runtimes, and faster, more powerful edge computing platforms to process and fuse sensor data, make timely decisions, and control actions.


And we are moving toward humanoid robots. Thousands of years of human civilization have produced ubiquitous environments designed for humans, and because humanoid robots resemble us in form, they are expected to interact comfortably with people and their surroundings and perform the required operations in those human-built environments. These systems are well suited to dirty, hazardous, and boring tasks: patient care and rehabilitation, service work in the hospitality industry, teaching aids or learning companions in education, and dangerous jobs such as disaster response and hazardous materials handling. Such applications exploit the humanoid robot's human-like attributes to enable natural human-robot interaction, act in human-centered spaces, and perform tasks that traditional robots often find difficult.


Many AI and robotics companies are launching new research and collaboration around how to train robots to better reason and plan in new unstructured environments. As the new "brains" of robots, models that are pre-trained on large amounts of data have excellent generalization capabilities, allowing robots to see and understand their environments more comprehensively, adjust their movements and actions based on sensory feedback, and optimize their performance in a variety of dynamic environments.
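The closed loop described above, in which a pre-trained model observes, acts, and adjusts based on sensory feedback, can be summarized in a few lines of hypothetical pseudocode. The camera, policy, and robot objects are placeholders rather than a real API.

# Illustrative perception-action loop for a learned robot "brain".
def control_loop(camera, policy_model, robot, max_steps=1000):
    for _ in range(max_steps):
        observation = camera.capture()               # multimodal sensing
        action = policy_model.predict(observation)   # pre-trained model picks an action
        feedback = robot.execute(action)             # actuate and read proprioception
        policy_model.update_context(feedback)        # adjust to the dynamic environment
        if feedback.task_complete:
            break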


As an interesting example, Boston Dynamics' robot dog, Spot, can act as a tour guide in a museum, interacting with visitors, introducing them to the various exhibits, and answering their questions. It may be hard to believe, but in this use case, Spot's entertaining, interactive, and subtle performances are more important than making sure the facts are correct.


Robotics Transformer: The New Brain of Robotics


Robotics Transformer (RT) models are rapidly evolving to translate multimodal inputs directly into executable robot actions. Google DeepMind's RT-2 performs as well as its predecessor, RT-1, with a near-100% success rate on tasks it has seen before. However, when trained with PaLM-E (an embodied multimodal language model oriented toward robotics) and PaLI-X (a large-scale multilingual vision-and-language model not specifically designed for robots), RT-2 generalizes better and outperforms RT-1 on unseen tasks.
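RT-style models output robot actions as sequences of discrete tokens that the language model's decoder produces alongside ordinary text. The sketch below shows one hypothetical way such tokens could be converted back into continuous commands; the token layout, bin count, and scaling are invented for illustration and are not taken from RT-2.

# Hypothetical de-tokenization of discrete action tokens into robot commands.
# Assume 256 bins per dimension and an 8-number action vector, as an example:
# [terminate, dx, dy, dz, droll, dpitch, dyaw, gripper]
NUM_BINS = 256

def detokenize_action(action_tokens, low=-1.0, high=1.0):
    """Map each integer token (0..255) back to a continuous value in [low, high]."""
    assert len(action_tokens) == 8, "expects an 8-token action vector"
    scale = (high - low) / (NUM_BINS - 1)
    values = [low + t * scale for t in action_tokens]
    return {
        "terminate": values[0] > 0.5,
        "delta_position": values[1:4],   # x, y, z translation
        "delta_rotation": values[4:7],   # roll, pitch, yaw
        "gripper": values[7],            # open/close command
    }

# Example with made-up tokens emitted by a vision-language-action model:
# print(detokenize_action([0, 128, 130, 127, 128, 128, 128, 255]))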


Microsoft introduced LLaVA, a large language-and-vision assistant. Leveraging the power of GPT-4, which was originally designed for text-based tasks, LLaVA created a new paradigm of multimodal instruction-following data that seamlessly integrates textual and visual components, which can be useful for robotic tasks. Upon its introduction, LLaVA set new records on multimodal chat and science question-answering tasks, already exceeding average human performance.
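Multimodal instruction-following data of the kind LLaVA popularized pairs an image with a conversational instruction and a target response. The record below is a made-up example in a generic layout, meant only to show the shape of such data, not LLaVA's exact schema.

# A single, invented training example for multimodal instruction following.
example = {
    "image": "kitchen_scene_0421.jpg",   # hypothetical image file
    "conversations": [
        {"from": "human",
         "value": "<image>\nWhich object on the counter should the robot grasp "
                  "to pour a glass of water?"},
        {"from": "assistant",
         "value": "The transparent pitcher on the left side of the counter; "
                  "the robot should grasp its handle and tilt it over the glass."},
    ],
}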


As mentioned earlier, Tesla's foray into humanoid and general-purpose AI robotics is significant not only because it is designed for scale and mass production, but also because the strong Full Self-Driving (FSD) technology foundation of Tesla's automotive Autopilot can be reused for robots. Tesla also has a smart-manufacturing use case, applying Optimus to its electric-vehicle production process.


Arm is the cornerstone of the future of robotics


Arm believes that the robotic brain, both the "big brain" and the "little brain," should be a heterogeneous AI computing system that delivers superior performance, real-time response, and energy efficiency.

 


 

Robotics involves a wide range of tasks, from basic computation (e.g., sending and receiving signals to and from motors) to advanced data processing (e.g., interpreting image and sensor data) to running the multimodal LLMs mentioned earlier. The CPU is well suited to general-purpose tasks, while AI accelerators and GPUs handle parallel workloads such as machine learning (ML) and graphics processing more efficiently. Additional accelerators such as image signal processors and video codecs can also be integrated to enhance the robot's vision capabilities and storage/transmission efficiency. In addition, the CPU must offer real-time responsiveness and be able to run operating systems such as Linux along with ROS packages.
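One simple way to picture this division of labor in software is a dispatcher that keeps small, latency-sensitive control tasks on the CPU and sends large parallel workloads to an accelerator when one is present. The PyTorch-based sketch below is illustrative; the availability check and workload split are assumptions, not a description of any particular robot SoC.

import torch

# Pick an accelerator if one is available; otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def run_perception(image_batch, model):
    """Heavy, parallel inference work goes to the accelerator."""
    return model.to(device)(image_batch.to(device))

def compute_motor_command(target, current):
    """Light, latency-sensitive control math stays on the CPU."""
    error = target - current
    return 0.8 * error   # assumed proportional gain, purely illustrative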


When we extend this to the robotics software stack, the operating-system layer may also require a real-time operating system (RTOS) that can reliably handle time-critical tasks, as well as Linux distributions tailored for robotics and middleware such as ROS, which provides services designed for heterogeneous computing clusters. We believe that Arm-sponsored standards and certification programs such as SystemReady and PSA Certified will help scale the development of robotics software. SystemReady is designed to ensure that standard rich OS distributions run on a wide range of system-on-chips (SoCs) based on the Arm architecture, while PSA Certified helps simplify the implementation of security to meet regional security and regulatory requirements for connected devices.
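As a small example of the kind of service ROS provides at this layer, the sketch below is a minimal ROS 2 publisher node written with the Python client library rclpy. The node name, topic name, and message content are invented, and the code assumes a working ROS 2 installation.

import rclpy
from rclpy.node import Node
from std_msgs.msg import String

class HeartbeatPublisher(Node):
    """Minimal ROS 2 node that publishes a status message at 1 Hz."""
    def __init__(self):
        super().__init__('heartbeat_publisher')           # node name (arbitrary)
        self.pub = self.create_publisher(String, 'robot/status', 10)
        self.timer = self.create_timer(1.0, self.tick)    # 1-second period

    def tick(self):
        msg = String()
        msg.data = 'robot alive'                          # placeholder payload
        self.pub.publish(msg)

def main():
    rclpy.init()
    node = HeartbeatPublisher()
    rclpy.spin(node)                                      # hand control to the ROS executor
    node.destroy_node()
    rclpy.shutdown()

if __name__ == '__main__':
    main()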


Advances in large multimodal models and generative AI herald a new era for AI robots and humanoid robots. Alongside AI computing and ecosystems, energy efficiency, security, and functional safety are essential to making robotics mainstream in this new era. Arm processors are already widely used in robotics, and we look forward to working closely with the ecosystem to make Arm a cornerstone of the future of AI robotics.
