Multimodal AI, Which Is Also Related to Generative AI: AI That Can Integrate and Process Multiple Pieces of Information and Data Like Humans

1. Generative AI and Multimodal AI

AI technology is evolving rapidly. It is changing the way in which we live, do business, and more. Among those AI technologies, there are two types currently attracting an especially great deal of attention: generative AI and multimodal AI. Generative AI has the ability to automatically generate and output data in multiple formats such as text, images, and music. It is a type of AI that can support creative tasks that have been performed by humans up to now. (Refer to "Column: What Is Generative AI?")

Multimodal AI, on the other hand, refers to AI that handles input data in multiple formats (modalities*1)*2. A typical example is AI that takes different data formats, such as text and images, as input and then integrates those pieces of data to make predictions.
Because multimodal AI can integrate and utilize multiple pieces of data, it is expected to be used in a wide range of applications, including those of both conventional AI and generative AI. (Refer to "Column: Multimodal AI and Single Modal AI.")
In this article, we take a look at multimodal AI, a cutting-edge AI technology, and describe its evolution, utilization examples, and more.
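
To make the idea of integrating data formats more concrete, the following is a minimal Python sketch of one common approach, often called early fusion, in which features extracted from each modality are concatenated into a single vector before a prediction is made. It is an illustration under stated assumptions, not a description of how any particular multimodal AI is built: the functions `text_features`, `image_features`, `fuse`, and `predict`, the toy features, and the weights are all hypothetical, and real systems use learned neural encoders rather than hand-written features.

```python
from typing import List, Sequence


def text_features(text: str) -> List[float]:
    """Toy text encoder (hypothetical): word count and share of upper-case characters."""
    words = text.split()
    upper = sum(1 for ch in text if ch.isupper())
    return [float(len(words)), upper / max(len(text), 1)]


def image_features(pixels: Sequence[Sequence[int]]) -> List[float]:
    """Toy image encoder (hypothetical): mean brightness and contrast of an
    8-bit grayscale image given as rows of pixel values (0-255)."""
    flat = [p for row in pixels for p in row]
    mean_brightness = sum(flat) / len(flat) / 255.0
    contrast = (max(flat) - min(flat)) / 255.0
    return [mean_brightness, contrast]


def fuse(text: str, pixels: Sequence[Sequence[int]]) -> List[float]:
    """Early fusion: concatenate the per-modality features into one vector."""
    return text_features(text) + image_features(pixels)


def predict(features: Sequence[float], weights: Sequence[float], bias: float = 0.0) -> float:
    """Linear score over the fused vector; the weights are illustrative, not trained."""
    return sum(f * w for f, w in zip(features, weights)) + bias


# Example: one text and one tiny 2x2 image become a single input vector.
fused = fuse("Hello World", [[0, 255], [0, 255]])
score = predict(fused, weights=[0.1, 0.5, 1.0, -0.5])
print(fused, score)
```

The point of the sketch is only that the predictor sees both modalities at once: a single modal model would receive the output of `text_features` or `image_features` alone.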

*1: "Modal" in "multimodal" is the adjective form of "modality," which here refers to a data format. Accordingly, "multimodal" means involving modalities in multiple formats (text, images, audio, etc.).

*2: The term "multimodal AI" can cover output data formats as well as input data formats. For example, AI whose input and output formats differ, such as audio input and text output, is sometimes called multimodal AI even though it has only one input format and one output format.

2. Evolution of Multimodal AI

The concept of multimodal AI, and research on it, are said to date back to the 1980s. Since the 2000s, this research has progressed alongside deep learning, a branch of machine learning. Multimodal AI applications were announced in the 2010s. One such application trained AI on human facial expressions and text so that the facial expression of an on-screen avatar changes according to the text.

Illustration of Data Input into Multimodal AI

Since 2015, AI models that effectively capture the relationships between data in different formats have appeared, and data integration has become more advanced. This has made complex processing and advanced recognition possible. In the 2020s, multimodal AI is increasingly being introduced into major generative AI services and AI platforms.
For instance, some applications integrate images, text, and other data formats and respond in natural language through large language models (LLMs), some output both images and text in response to user questions, and some output text describing images. Multimodal AI is also being introduced into familiar hardware; for example, wearable devices equipped with multimodal AI have been announced.
As multimodal AI continues to develop in this way, it is expected to spread rapidly into various other fields, including autonomous driving technology, crime prevention, medicine, manufacturing and engineering, business support and management, sports, and entertainment.

3. Utilization Examples of Multimodal AI

As mentioned in Section 1, multimodal AI can handle multiple data formats as input, which makes it a highly flexible type of AI that can be applied to various purposes. Below, we introduce the main utilization examples of multimodal AI.

3.1 Web Field: Identifying Unauthorized Goods and Fake Videos

Multimodal AI Monitors Counterfeit Goods on Websites

A familiar utilization example of multimodal AI is the identification of counterfeit goods on online marketplaces for sales between individuals. Multimodal AI can identify counterfeit goods from the text (descriptions and tags) and the product image data that accompany newly listed products. It can also be used to identify fake videos from data in multiple formats, such as video and audio, on video sharing websites and other websites.
The identification ability of multimodal AI is expected to improve further, for example through training it to accurately identify superfakes (highly convincing counterfeits) of frequently counterfeited brand products and deepfake videos that imitate prominent figures and celebrities in various countries.
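
As a rough illustration of how signals from listing text and product images might be combined, the following Python sketch uses late fusion, in which each modality is scored separately and the scores are then merged. Everything in it is an assumption made for illustration: the `Listing` fields, the `SUSPICIOUS_TERMS` keyword list, the weighting, and in particular `image_similarity`, which stands in for the output of an image model that is not shown. Actual marketplace systems are not described in this article and are likely to be far more sophisticated.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical phrases that might hint at a counterfeit listing.
SUSPICIOUS_TERMS = ("replica", "1:1", "inspired by", "no box")


@dataclass
class Listing:
    description: str
    tags: Tuple[str, ...]
    # Assumed to come from a separate image model: similarity (0.0 to 1.0)
    # between the listing photo and reference photos of the genuine product.
    image_similarity: float


def text_risk(listing: Listing) -> float:
    """Risk from text alone: share of suspicious phrases found, capped at 1.0."""
    text = (listing.description + " " + " ".join(listing.tags)).lower()
    hits = sum(1 for term in SUSPICIOUS_TERMS if term in text)
    return min(1.0, hits / 2)


def image_risk(listing: Listing) -> float:
    """Risk from the image alone: low similarity to the genuine product is riskier."""
    return 1.0 - listing.image_similarity


def counterfeit_risk(listing: Listing, text_weight: float = 0.5) -> float:
    """Late fusion: weighted average of the per-modality risk scores."""
    return text_weight * text_risk(listing) + (1.0 - text_weight) * image_risk(listing)


suspect = Listing("High quality replica bag", ("1:1",), image_similarity=0.2)
genuine = Listing("Leather bag, bought in store", ("bag",), image_similarity=0.9)
print(counterfeit_risk(suspect), counterfeit_risk(genuine))
```

In this sketch a listing whose text looks harmless can still receive a high score if its image differs strongly from the genuine product, and vice versa, which is the practical benefit of using both formats.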

3.2 Automotive Field: Supporting Autonomous Driving Control

Multimodal AI Is Essential to Realize Autonomous Driving

Various research and testing is being conducted with the aim of the future practical application of level 5 autonomous driving for automobiles (driving systems that can drive autonomously anywhere and do not need a human to steer). The use of multimodal AI is attracting attention worldwide in research on advanced autonomous driving technologies.
Autonomous driving control requires diverse data to be processed on an integrated basis: data on the inside and outside of the automobile obtained with numerous sensors; data on the vehicle's position, other vehicles, and traffic conditions obtained through wireless communications; and audio data from interactions with passengers. The ability of multimodal AI to do this makes it an essential technology for autonomous driving control.

3.3 Medical Field: Making Supportive Proposals for Diagnosis and Treatment Methods

Expectations for the Full-scale Utilization of Multimodal AI in the Medical Field Where Data in Multiple Formats Is Handled

In the medical field, research is underway on using multimodal AI to analyze electronic medical records, test images, and other data on an integrated basis for the early discovery of illnesses and the optimization of treatment plans. For example, it is thought that multimodal AI will be able to output multifaceted judgments on the condition and progression of diseases, predict when cancer will occur, and make supportive proposals when diagnosis and treatment methods are being determined. In addition to helping predict when follow-up examinations should be performed and select appropriate treatment methods, this is expected to lower medical costs through the provision of appropriate treatment and to reduce the burden on medical professionals by making work less dependent on the know-how of individual practitioners. Multimodal AI is thus expected to contribute extensively to the medical field as well.

3.4 Crime Prevention and Surveillance Field: Judging the Situation

AI Judges the Situation from Video, Audio, and Other Data

Security cameras using conventional AI support situational judgments by analyzing video (images) only. In actual surveillance work, however, people judge the situation from a great deal of information, such as sights, sounds, vibrations, smells, and communication with other surveillance personnel.
By processing images, sounds, and other data in various formats on an integrated basis, multimodal AI can judge what kind of situation is occurring even in complex cases such as noise, disturbances and other disruptive behavior, fights, and fraudulent or illegal intrusions. As research on and practical use of such methods progress, AI support for surveillance work is expected to improve greatly.

3.5 Manufacturing and Development Field: Supporting Robot Control / Material Development

Multimodal AI Is Also Making the Manufacturing Industry Smarter

The number of industrial robots being introduced into manufacturing sites is increasing considerably. The movements of conventional industrial robots are controlled by specifying the mechanical movement angle, speed, strength, and other elements in a program, in combination with image identification and other recognition technologies. Meanwhile, research on robot control using multimodal AI is progressing. Because multimodal AI integrates and learns from the data of various sensors, its judgment abilities are improving compared to those of conventional robots, which is allowing robots to perform more delicate work. Multimodal AI is attracting attention as a technology that can also be applied to robots for medicine, nursing care, and agriculture in addition to the manufacturing field.

Multimodal AI is also being utilized in the development field. For example, by processing on an integrated basis the chemical structures, compositions, and measurement data (microscopic images, spectra, etc.) of substances reported in experimental data, papers, and other sources, multimodal AI can predict the physical and chemical properties of a substance with high accuracy. This makes it possible to optimize compounding conditions and compositions at high speed in a virtual space. This technology is a type of materials informatics (MI). It is expected to make the search for new materials and other areas of research and development far more efficient by greatly reducing the time and costs involved.
The application of multimodal AI to manufacturing and engineering is also expected to continue to progress rapidly. Examples include high-precision anomaly detection through the integration of data from various sensors placed in production equipment, and the use of robots to automate quality inspections and maintenance activities that were previously difficult to automate.
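
The benefit of integrating data from several sensors for anomaly detection can be sketched in a few lines of Python. This is a simplified statistical illustration, not a multimodal AI model and not a method described in this article: the sensor names, readings, and threshold are made up, and the combined score assumes the sensors vary independently of one another. Each sensor reading is converted to a z-score (its distance from normal in standard deviations), and the z-scores are then combined into one score.

```python
import math
from statistics import mean, stdev
from typing import Dict, List, Tuple

Baseline = Dict[str, Tuple[float, float]]


def fit_baseline(history: Dict[str, List[float]]) -> Baseline:
    """Learn the normal mean and standard deviation of each sensor."""
    return {name: (mean(vals), stdev(vals)) for name, vals in history.items()}


def z_scores(baseline: Baseline, reading: Dict[str, float]) -> Dict[str, float]:
    """Distance of each sensor value from its normal mean, in standard deviations."""
    return {name: (reading[name] - mu) / sigma for name, (mu, sigma) in baseline.items()}


def combined_score(baseline: Baseline, reading: Dict[str, float]) -> float:
    """Integrated score: Euclidean norm of the per-sensor z-scores
    (assumes independent sensors, which is a simplification)."""
    return math.sqrt(sum(z * z for z in z_scores(baseline, reading).values()))


def is_anomalous(baseline: Baseline, reading: Dict[str, float], threshold: float = 3.0) -> bool:
    """Flag the reading when the integrated score exceeds an illustrative threshold."""
    return combined_score(baseline, reading) > threshold


# Hypothetical history of normal operation for three sensors.
history = {
    "temperature": [48, 50, 52],
    "vibration": [4, 5, 6],
    "current": [9, 10, 11],
}
baseline = fit_baseline(history)

# Each sensor is only two standard deviations from normal on its own.
reading = {"temperature": 54, "vibration": 7, "current": 12}
print(z_scores(baseline, reading), is_anomalous(baseline, reading))
```

In this example no single sensor crosses a threshold of three standard deviations, yet the integrated score does, so a deviation that would be missed by monitoring each sensor separately is detected when the data is combined.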

4. Summary

In this article, we have mainly introduced utilization examples of multimodal AI.
In recent years, multimodal AI services that can handle data in multiple formats, such as text and images, have appeared on major AI platforms. As more of these platforms emerge and become more sophisticated, the utilization of multimodal AI is expected to expand to a wide range of fields, including business, creative work, sports, and entertainment. Multimodal AI and its progress are among the most noteworthy technology trends at present.

Column: What Is Generative AI?

Generative AI utilizes a type of machine learning*3 called deep learning*4 (Fig. 1). Its greatest strength is that it can extract characteristics from text, images, and other data formats prepared by humans and then automatically generate data in multiple formats (text, images, videos, audio, etc.) based on those characteristics. With conventional AI, the data formats that could be output were extremely limited, and generating content of any kind was itself difficult. With the emergence of generative AI that generates diverse data, serious discussions have started to take place about the expansion of human abilities through AI and about the arrival of the technological singularity, in which AI replaces humans.
For this reason, generative AI can be called an innovative type of AI that is distinct from conventional AI.

*3: This is a type of AI in which a machine (computer) learns about relationships between characteristics, patterns, and other elements from data to allow it to make predictions, distinctions, and more.

*4: This is a type of AI that enables more advanced characteristic extraction from data, more complex pattern recognition, and other features compared to conventional machine learning AI. A typical example of this is a deep neural network.

Fig. 1: Relationship between generative AI and the AI concept

Below, we summarize in a little more detail what generative AI can do, in comparison with conventional AI.
As shown in Tab. 1, data includes text, still images, video (a sequence of still images), audio, and more. Conventional AI can, for example, engage in conversations and make predictions using text data, or make identifications using video data (Tab. 2). Generative AI, on the other hand, can generate data and content in different formats and types, such as images, video, and audio in addition to text, mainly from text data (Tab. 2 and Tab. 3). By enabling the creation of such diverse content, generative AI offers capabilities that conventional AI did not have.

Tab. 1: Types of data

| Data | Concrete examples |
| --- | --- |
| Text (characters) | Email text, articles, documents, programs, etc. |
| Images (still images) | Photographs, illustrations, etc. |
| Video | Movies, television programs, etc. |
| Sound (audio) | Narration, telephone call recordings, etc. |
| Sound (music) | Songs, techno, background music, etc. |

Tab. 2: Broad division of functions between new AI (generative AI) and conventional AI

| Category | Type of AI | Examples |
| --- | --- | --- |
| Conventional AI | Conversation-based AI | Translation, chatbots, etc. |
| Conventional AI | Prediction-based AI | Detection of abnormal values in data, market forecasts, etc. |
| Conventional AI | Identification-based AI | Detection of foreign objects and defective items, X-ray and other image diagnoses, etc. |
| Conventional AI | Execution-based AI | Control of autonomous driving and robots (autonomous robots), etc. |
| New AI | Generation-based AI*5 | Composing music, creating narration and illustrations |

*5: Generative AI is sometimes called generation-based AI.

Tab. 3: What generative AI can do

| Generated data (output data) | Input data | Application examples |
| --- | --- | --- |
| Text: Answer questions and summarize sentences | Text | Generation of reports and articles, etc. |
| Images (still images): Generate images from the content of text | Text | Generation of illustrations, paintings, photographs of models, etc. |
| Video: Generate video from the content of text | Text, or text and still images | Generation of promotional video and animation, etc. |
| Audio: Generate new audio based on the input data content (emotions, tone of voice, etc.) | Text | Generation of narration and singing voice, etc. |
| Music: Generate new music data based on input data content (genre, lyrics, assumed composer and singer) | Text | Generation of songs and background music, etc. |

Column: Multimodal AI and Single Modal AI

In contrast to multimodal AI, which handles data in multiple formats, AI that handles only a single data format, as conventional AI does, is called single modal AI or unimodal AI.
Fig. 2 shows an illustration of multimodal AI and single modal AI. Single modal AI takes a single format of information as input, such as text only, images only, or audio only, and processes it on its own. One example is a generative AI service that is trained on online text and takes text input from users.

Another example of single modal AI is video or audio processing using edge AI, which performs AI inference on sensors or other terminals at the edge of a network (edge devices). Multimodal edge AI is also being trialed, for instance, in autonomous driving, and it is expected to continue to advance in diverse fields in the future.

Fig. 2: Illustration of multimodal AI and single modal AI

*6: As mentioned in *2 above as well, AI that can handle different data formats for input and output, such as audio input and text output, is sometimes called multimodal AI.
