In September, OpenAI announced that ChatGPT has become multimodal (also written "multi-modal") in a post titled "ChatGPT Can Now See, Hear and Speak." ChatGPT will now support both voice prompts and image uploads from users. These new capabilities offer a more intuitive type of interface by allowing you to have a voice conversation or show ChatGPT what you’re talking about.
The term "multimodal" means "having or involving several modes, modalities, or maxima." For example, a multimodal project might include a combination of text, images, video, files, code, speech or audio.
A variety of modalities are possible and have been explored with increasing frequency, because the same basic concepts that drive ChatGPT can be applied to any type of input or output.
The next frontier in AI is combining these modalities in interesting ways, using innovative UI/UX (not just a chat interface!): explaining what is in a photo, debugging a program with your voice, generating music from an image, and so on.
ChatGPT's present text-to-text, text-to-image, image-to-text and text-to-video functionalities will seem like simple demos compared with what is coming up next:
Up Next: Multimodal Neural Networks
Researchers are developing multimodal neural networks that can handle multiple types of data, such as text, PDFs, files, images, videos, speech, audio and more. These neural networks are called “multimodal” because they can process different modes of information.
This "multi-modality" will soon take center stage as programs accept as input, or produce as output, text, images, "point clouds" of physical space, speech audio, video, and entire computer functions as smart applications.
The magic happens when more modalities are combined together.
One example of this is the "High-Modality Multimodal Transformer", which was proposed by Paul Liang and his team at Carnegie Mellon University in 2023.
This neural network can deal with 10 different modes of data, including database tables and time series. The researchers found that adding more modes improved the performance and transferability of the neural network.
Another example is the "Meta-Transformer", which was developed by Yiyuan Zhang and his colleagues at the Multimedia Lab of the Chinese University of Hong Kong and the Shanghai AI Laboratory in 2023.
This neural network can handle 12 different modes of data. The Meta-Transformer is a unified framework for multimodal learning that converts each modality into a shared token sequence and processes it with a single shared encoder.
Finally, NExT-GPT is an "Any-to-Any Multimodal Large Language Model": a system that can accept and produce content in any combination of text, image, video, and audio modalities.
The system consists of three stages:
Multimodal encoding
LLM understanding and reasoning
Multimodal generation
The system leverages existing pre-trained models for each modality and connects them with projection layers. Only these projection layers, which hold a small fraction of the parameters, are fine-tuned. The system also introduces a modality-switching instruction tuning (MosIT) technique and a curated dataset for it, which enable the system to handle complex cross-modal semantic understanding and generation tasks.
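The three stages above can be sketched as a toy pipeline. This is only an illustration of the data flow, not NExT-GPT's actual code: every name below (Item, encode, project_to_llm, llm_reason, decode, any_to_any) is hypothetical, and simple string operations stand in for the pre-trained encoders, the LLM, and the decoders.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Item:
    modality: str  # "text", "image", "video", or "audio"
    content: str   # placeholder for the raw data


def encode(item: Item) -> str:
    """Stage 1: stand-in for a frozen, modality-specific encoder."""
    return f"<{item.modality}:{item.content}>"


def project_to_llm(features: str) -> str:
    """Stand-in for the small trainable projection layer into the LLM's input space."""
    return features.lower()


def llm_reason(tokens: List[str], target: str) -> str:
    """Stage 2: stand-in for the LLM, which also signals the output modality."""
    return f"[gen:{target}] " + " ".join(tokens)


def decode(llm_output: str) -> Item:
    """Stage 3: read the modality signal and hand off to a (stand-in) decoder."""
    header, _, body = llm_output.partition("] ")
    target = header[len("[gen:"):]
    return Item(target, f"generated from {body}")


def any_to_any(inputs: List[Item], target: str) -> Item:
    tokens = [project_to_llm(encode(item)) for item in inputs]
    return decode(llm_reason(tokens, target))


result = any_to_any([Item("text", "A dog"), Item("image", "photo.png")], "audio")
print(result.modality, "|", result.content)
```

In a real system only the projection step would be trained, which is why it is kept as a separate function here.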
These three examples suggest that future versions of tools like ChatGPT could understand and use much more information at once. They may be able to process entire books, movies, and even 3D structures.
The Evolution of AI: Integrating Multimodalities and the UX Challenge
The realm of artificial intelligence is constantly evolving, and one of the most exciting developments is the integration of various modalities. Imagine the possibilities: describing the contents of a photograph, troubleshooting a software issue using voice commands, or even creating music inspired by an image. While the technical aspects of merging these modalities are undeniably intricate, the real challenge lies in crafting the perfect user experience (UX).
Why Traditional Chat Interfaces Fall Short
Chat interfaces have long been the go-to when first introducing users to novel technological concepts. Their intuitive nature makes them an obvious choice, especially when releasing new AI advancements. However, as we move into the next phase of multimodal AI, the limitations of chat interfaces become evident. Embedding images, audio, and other modalities within a chat can quickly lead to a cluttered and overwhelming experience for the user.
Chat interfaces often fall short of being the best tool for any specific task. The challenge, then, is to strike a balance between versatility and specialization.
The Future of UX in Multimodal Generative AI
The integration of different modalities presents a vast opportunity in the UI/UX domain. The key lies in determining the most effective way to present diverse outputs—be it audio, text, images, or code—to users. Moreover, it's crucial to develop interfaces that not only display these outputs but also allow users to interact with, modify, and provide feedback on them. For instance, when considering the fine-tuning of a multimodal model, what mechanisms can we introduce to make this process intuitive and effective for the user?
Instead of a chat interface, you might display more dynamic elements: input boxes, sliders, forms, or other interactive UX elements.
In conclusion, as AI continues to break boundaries by integrating multiple modalities, the onus is on UX designers to create interfaces that not only showcase these advancements but also provide a seamless and intuitive experience for users. The future of AI isn't just about technological prowess; it's about crafting experiences that resonate with and empower users.
Full Examples of Multimodal Generative AI inputs and outputs:
Image-to-text: (OpenAI CLIP) This type of generative AI takes an image as input and produces text as output. For example, an image captioner can take an image as input and generate a textual description of the image content as output. An image classifier can take an image as input and generate a textual label or category for the image as output. An optical character recognition (OCR) system can take an image of printed or handwritten text as input and generate a textual transcription of the text as output.
Image-to-image: (img2img or pix2pix) This type of generative AI takes an image as input and produces another image as output. For example, an image style transfer system can take an image and a style reference as input and generate a new image that has the same content but a different style as output. An image super-resolution system can take a low-resolution image as input and generate a high-resolution image as output. An image inpainting system can take an incomplete or corrupted image as input and generate a complete or restored image as output.
Audio-to-text: This type of generative AI takes audio as input and produces text as the output. For example, a podcast recognition system can take podcast audio as input and generate a textual transcription of the speech content as output. A music transcription system can take music audio as input and generate a textual notation of the music score as output. A sound classifier can take sound audio as input and generate a textual label or category for the sound source or event as output.
Text-to-audio: (Meta MusicGen) This type of generative AI takes text as the input and produces audio as the output. For example, a text-to-audio generator can take a textual script or narration as input and generate realistic speech or music audio that matches the script or narration as output. A text-to-audio art system can take a creative or abstract text prompt as input and generate artistic speech or music audio that reflects the prompt as output.
Audio-to-audio: This type of generative AI takes audio as the input and produces another audio as the output. For example, a speech synthesis system can take text or speech audio as input and generate speech audio with different voice, accent, or emotion as output. A music synthesis system can take music audio or notation as input and generate music audio with different instruments, genres, or styles as output. A sound enhancement system can take noisy or distorted sound audio as input and generate clean or improved sound audio as output.
Speech-to-text: (OpenAI Whisper) This type of generative AI takes spoken language as the input and produces written text as the output. For example, a transcription system can take an audio recording of a lecture and generate a written transcript of the spoken content. A voice command recognition system can take verbal commands as input and produce corresponding textual instructions for a device or software to execute. A podcast summarization system can take long audio episodes and generate concise written summaries of the main topics discussed. This technology is fundamental for applications like voice assistants, real-time captioning, and audio indexing.
Text-to-speech: (Meta's MMS) This type of generative AI takes written text as the input and produces spoken language as the output. For example, an audiobook generation system can take a written novel and produce an audio version narrated in a human-like voice. A reading assistant tool can take digital text from articles, emails, or documents and vocalize it for users with visual impairments or for those who prefer auditory learning. A voice response system can take textual data or scripted responses and convert them into verbal feedback for user interactions in call centers or virtual assistants. This technology bridges the gap between written content and auditory experiences, enhancing accessibility and user engagement.
Speech-to-speech: This type of generative AI takes spoken language as the input and produces altered or translated spoken language as the output. For example, a real-time translation system can take a sentence spoken in English and produce its equivalent in Spanish audibly. A voice modulation system can take a user's voice and alter its tone, pitch, or accent to produce a different vocal characteristic or mimic another person's voice. A speech enhancement system can take unclear or noisy speech as input and generate a clearer, noise-reduced version as output. This technology facilitates cross-lingual communication, voice customization, and improved auditory experiences in challenging environments.
Text-to-video: (Synthesia, Picsart, HeyGen) This type of generative AI takes text as the input and produces video as the output. For example, a text-to-video generator can take a textual script or storyboard as input and generate a realistic video that matches the script or storyboard as output. A text-to-video art system can take a creative or abstract text prompt as input and generate an artistic video that reflects the prompt as output.
Video-to-text: (HappyScribe) This type of generative AI takes video as the input and produces text as the output. For example, a video captioner can take video as input and generate a textual description of the video content as output. A video summarizer can take video as input and generate a shorter summary of the video content as output. A video classifier can take video as input and generate a textual label or category for the video genre, topic, or sentiment as output.
Video-to-video: This type of generative AI takes video as the input and produces another video as the output. For example, a video style transfer system can take video and a style reference as input and generate a new video that has the same content but a different style as output. A video super-resolution system can take low-resolution video as input and generate high-resolution video as output. A video inpainting system can take incomplete or corrupted video as input and generate complete or restored video as output.
Code-to-text: (ChatGPT, etc.) This type of generative AI takes code as the input and produces textual descriptions or explanations as the output. For example, a code documentation system can take a segment of code and generate a detailed description or comment explaining its functionality. A code summarization system can take a lengthy code block as input and produce a concise summary of its main operations. A code-to-comment system can take uncommented or poorly documented code as input and generate relevant and informative comments for each segment, enhancing the readability and understanding of the code.
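As a rough illustration of the code-to-text idea, the sketch below uses Python's standard ast module to turn function definitions into plain-English sentences. It is a rule-based stand-in rather than a generative model, and the function name describe_functions and the sample source are made up for this example.

```python
import ast
from typing import List


def describe_functions(source: str) -> List[str]:
    """Return one plain-English sentence per top-level function in `source`."""
    sentences = []
    for node in ast.parse(source).body:
        if isinstance(node, ast.FunctionDef):
            args = [a.arg for a in node.args.args]
            arg_text = ", ".join(args) if args else "no arguments"
            doc = ast.get_docstring(node)
            sentence = f"Function '{node.name}' takes {arg_text}"
            if doc:
                sentence += f" and is documented as: {doc}"
            else:
                sentence += " and has no docstring"
            sentences.append(sentence + ".")
    return sentences


sample = '''
def area(width, height):
    """Compute the area of a rectangle"""
    return width * height

def ping():
    return "pong"
'''

for line in describe_functions(sample):
    print(line)
```

A generative model would go further and explain what the body does. This sketch only reports what can be read directly from the syntax tree.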
PDF-to-text: This type of generative AI takes a PDF as the input and produces text as the output. For example, a PDF-to-text converter can take a PDF document as input and generate text that extracts the textual content from the document as output. A PDF-to-text summarizer can take a PDF document as input and generate text that summarizes the main points or highlights from the document as output.
Text-to-PDF: This type of generative AI takes text as the input and produces a PDF as the output. For example, a text-to-PDF converter can take a text document as input and generate a PDF document that preserves the formatting, layout, and style of the text document as output. A text-to-PDF generator can take text data as input and generate a PDF document that visualizes the data with charts, graphs, or tables as output.
File-to-text: This type of generative AI takes a file as the input and produces text as the output. For example, a file-to-text converter can take a file of any format (such as Word, Excel, PowerPoint, etc.) as input and generate text that extracts the textual content from the file as output. A file-to-text summarizer can take a file of any format (such as Word, Excel, PowerPoint, etc.) as input and generate text that summarizes the main points or highlights from the file as output.
Text-to-file: This type of generative AI takes text as input and produces a file as output. For example, a text-to-file converter can take a text document as input and generate a file of any format (such as Word, Excel, PowerPoint, etc.) that preserves the formatting, layout, and style of the text document as output. A text-to-file generator can take text data as input and generate a file of any format (such as Word, Excel, PowerPoint, etc.) that visualizes the data with charts, graphs, or tables as output.
Point cloud-to-text: This type of generative AI takes a point cloud as input and produces text as output. A point cloud is a way of representing a physical object or space in digital form. It is made up of many points, each with a location in three dimensions (x, y, and z). For example, a point cloud-to-text converter can take a point cloud representation of a 3D object or scene as input and generate text that describes the shape, size, color, or texture of the object or scene as output. A point cloud-to-text classifier can take a point cloud representation of a 3D object or scene as input and generate text that labels or categorizes the object or scene as output.
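To make the point cloud idea concrete, here is a minimal sketch that stores a cloud as a list of (x, y, z) tuples and produces a one-sentence description of its size. The function name describe_point_cloud and the sample cube are illustrative assumptions. A real point cloud-to-text model would learn far richer descriptions than this bounding-box rule.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]  # one point: (x, y, z)


def describe_point_cloud(points: List[Point]) -> str:
    """Describe a point cloud by its point count and bounding-box size."""
    if not points:
        return "Empty point cloud."
    xs, ys, zs = zip(*points)
    # Extent along each axis is the largest coordinate minus the smallest.
    size = [max(axis) - min(axis) for axis in (xs, ys, zs)]
    return (
        f"Point cloud with {len(points)} points spanning "
        f"{size[0]:.1f} x {size[1]:.1f} x {size[2]:.1f} units."
    )


# The eight corners of a cube with side length 2.
cube_corners = [
    (x, y, z) for x in (0.0, 2.0) for y in (0.0, 2.0) for z in (0.0, 2.0)
]
print(describe_point_cloud(cube_corners))
# prints: Point cloud with 8 points spanning 2.0 x 2.0 x 2.0 units.
```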
Text-to-point cloud: This type of generative AI takes text as the input and produces a point cloud as the output. For example, a text-to-point cloud generator can take a textual description of a 3D object or scene as input and generate a point cloud representation of the object or scene that matches the description as output. A text-to-point cloud art system can take a creative or abstract text prompt as input and generate a point cloud representation of an artistic 3D object or scene that reflects the prompt as output.
Data table-to-text: This type of generative AI takes a data table as input and produces text as output. For example, a data table-to-text converter can take a data table containing numerical or categorical values as input and generate text that extracts the information from the table as output. A data table-to-text summarizer can take a data table containing numerical or categorical values as input and generate text that summarizes the main trends, patterns, or insights from the table as output.
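A template-based sketch can show what a data table-to-text summarizer outputs, even without a generative model. The function name summarize_column and the sample sales table are assumptions made for this example. Only Python's standard csv and statistics modules are used.

```python
import csv
import io
import statistics


def summarize_column(csv_text: str, column: str) -> str:
    """Summarize the trend of one numeric column in a CSV table."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    values = [float(row[column]) for row in rows]
    first, last = values[0], values[-1]
    if last > first:
        trend = "rose"
    elif last < first:
        trend = "fell"
    else:
        trend = "was flat"
    mean = statistics.mean(values)
    return (
        f"Across {len(values)} rows, {column} {trend} "
        f"from {first:g} to {last:g} (mean {mean:g})."
    )


table = "month,sales\nJan,100\nFeb,120\nMar,140\n"
print(summarize_column(table, "sales"))
# prints: Across 3 rows, sales rose from 100 to 140 (mean 120).
```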
Text-to-data table: This type of generative AI takes text as input and produces a data table as output. For example, a text-to-data table converter can take text containing numerical or categorical information as input and generate a data table that organizes the information into rows and columns as output. A text-to-data table generator can take text containing natural language queries or commands as input and generate a data table that answers the queries or executes the commands using external data sources as output.
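The converter case can be illustrated with a small pattern-matching sketch. It assumes the input text follows a fixed "Name scored N" phrasing, which a generative model would not need. The function name text_to_table and the sample sentence are hypothetical.

```python
import re
from typing import Dict, List, Union

Row = Dict[str, Union[str, int]]


def text_to_table(text: str) -> List[Row]:
    """Extract (name, score) rows from sentences like 'Alice scored 90 points.'"""
    pattern = re.compile(r"(?P<name>[A-Z][a-z]+) scored (?P<score>\d+)")
    return [
        {"name": match["name"], "score": int(match["score"])}
        for match in pattern.finditer(text)
    ]


rows = text_to_table("Alice scored 90 points. Bob scored 85 points.")

# Print the rows as a simple two-column table.
print(f"{'name':<8}{'score':>5}")
for row in rows:
    print(f"{row['name']:<8}{row['score']:>5}")
```

Each match becomes one row, and the named groups in the pattern become the column names.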
Infrared-to-text: This type of generative AI takes infrared imagery as input and produces text as output. For example, an infrared-to-text converter can take an infrared image or video as input and generate text that extracts the thermal information from the image or video as output. An infrared-to-text classifier can take an infrared image or video as input and generate text that labels or categorizes the objects or events based on their thermal signatures as output.
Text-to-infrared: This type of generative AI takes text as input and produces infrared imagery as output. For example, a text-to-infrared generator can take a textual description of an object or event with thermal information as input and generate an infrared image or video that matches the description as output. A text-to-infrared art system can take a creative or abstract text prompt with thermal information as input and generate an infrared image or video that reflects the prompt as output.
Smell/Odor-to-text: This type of generative AI takes smells as an input and produces a text output describing the smell.
Input Modality (Smell):
The AI would have a mechanism, potentially in the form of advanced sensors, to detect and analyze specific odors or volatile organic compounds (VOCs) in the environment.
Processing:
After detecting the smell, the AI would interpret or classify it based on its training data.
Output Modality (Text):
GenAI would then generate a textual description or classification of the detected smell. This could range from simple descriptors like "rotten eggs" or "pine" to more complex descriptions detailing the potential sources or components of the smell.
Applications:
Healthcare: Detecting diseases or infections based on body odor or breath, then providing a textual diagnosis or alert.
Environmental Monitoring: Detecting pollutants or hazardous chemicals in the air and informing through textual alerts.
Food Industry: Analyzing the freshness or quality of food items based on smell and providing textual feedback.
Perfume and Fragrance Industry: Describing complex fragrances in text based on their olfactory profile.
Safety and Security: Detecting prohibited or dangerous substances at checkpoints and generating textual alerts.
Text-to-smell/odor: This type of generative AI takes text as an input and produces a smell as the output through a dedicated hardware device.
Input Modality (Text):
GenAI receives textual input that describes a particular smell or set of smells. The description could range from simple terms like "vanilla" or "wet grass" to more complex descriptions or combinations.
Processing:
GenAI interprets the textual description to identify the required olfactory components. It would then determine the specific chemicals or combinations thereof needed to reproduce the described smell.
Output Modality (Smell):
Using a mechanism, possibly a device equipped with cartridges of various scent compounds, the AI would release the appropriate chemicals in the specified proportions to generate the described smell. The mechanism would act like a printer, but instead of colors, it would mix and release scents.
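The "scent printer" idea can be sketched as a lookup followed by normalization. Everything here is hypothetical: the RECIPES table, the cartridge names, and the function mix_for are invented for illustration, and simple keyword matching stands in for the interpretation step a generative model would perform.

```python
from typing import Dict

# Hypothetical recipes: parts of each scent cartridge per known smell term.
RECIPES: Dict[str, Dict[str, int]] = {
    "vanilla": {"cartridge_a": 3, "cartridge_b": 1},
    "wet grass": {"cartridge_b": 1, "cartridge_c": 1},
}


def mix_for(description: str) -> Dict[str, float]:
    """Return the fraction of each cartridge to release for a text description."""
    totals: Dict[str, int] = {}
    lowered = description.lower()
    for term, recipe in RECIPES.items():
        if term in lowered:
            # Blend recipes by adding up the parts for each cartridge.
            for cartridge, parts in recipe.items():
                totals[cartridge] = totals.get(cartridge, 0) + parts
    total_parts = sum(totals.values())
    if total_parts == 0:
        return {}  # no known smell term found in the description
    # Normalize so the proportions sum to 1, like ink ratios in a printer.
    return {cartridge: parts / total_parts for cartridge, parts in totals.items()}


print(mix_for("Vanilla"))
print(mix_for("vanilla with a hint of wet grass"))
```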
Applications:
Entertainment: Enhancing virtual reality (VR) or augmented reality (AR) experiences with scent based on textual cues from the content.
Retail: Allowing customers to "sample" perfumes, candles, or other scented products by inputting textual descriptions.
Education: Teaching about different environments or historical periods by recreating their scents based on textual descriptions.
Therapy: Using specific scents to trigger memories or emotions in therapeutic settings based on textual inputs.
Culinary Arts: Allowing chefs or food developers to experiment with aroma profiles by inputting desired scent descriptions.