Beyond Text and Images: The Power of Multimodal Models (MLLM)

Prasun Mishra
5 min read · Oct 7, 2024


Ref: Visual Instruction Tuning Paper (https://arxiv.org/abs/2304.08485)

Multimodal understanding is key to understanding how humans think. It’s like a puzzle with many pieces: seeing, hearing, feeling, tasting, and smelling. All these pieces work together to help us understand the world.

For example, eating an apple pie is more than just taste. We feel its texture, smell its aroma, and enjoy how it looks. This makes it a complete experience.

Watching a movie is also about more than just visuals. We hear the sounds, feel emotions, and understand the story better.

Our experiences are made up of many parts that connect and influence each other. To truly understand how humans think, we need to consider all these parts together. It’s like a computer program that combines different information to create a meaningful result. This is why multimodal embedding spaces are important. They help us understand human experiences in a way that shows how our senses and thoughts are connected.

But remember: Multimodal understanding is not perfect. It’s like a simplified model of how humans think.

MLLMs (Multimodal Large Language Models) are like upgraded LLMs. They can understand more than just words: they can also process pictures, sounds, and even videos. This makes them better at things like describing images, answering questions about videos, and creating new content that combines words and pictures.

LVLMs (Large Vision-Language Models) are similar to MLLMs, but they focus specifically on understanding images and language together. They are well suited to tasks like describing images and answering questions about pictures.

Here are some examples of LVLMs and MLLMs:

LVLMs:

  • LLaVA (Large Language and Vision Assistant)
  • CLIP (Contrastive Language-Image Pre-training)
  • FLAVA (Foundational Language And Vision Alignment model)
  • Qwen-VL
  • GPT-4V

MLLMs and other multimodal generative models:

  • DALL-E 2
  • Stable Diffusion
  • PaLM 2
  • Imagen
  • Midjourney

Multimodal RAG (Retrieval Augmented Generation)

The diagram "How Multimodal RAG works" (below) illustrates the pipeline. Multimodal embedding models (such as CLIP or ViT-based encoders) capture the relationships between different types of information (video, images, text, and audio) in a shared embedding space.

During inference, the user query is converted into an embedding, and a semantic search over the multimodal embedding space retrieves relevant data from the vector database. Finally, the retrieved data and the original query are sent to the LVLM, which generates a high-quality response.

Image: How Multimodal RAG works
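Below is a minimal sketch of the retrieval half of that pipeline, assuming the sentence-transformers CLIP checkpoint clip-ViT-B-32 and a tiny in-memory index (the frame paths are hypothetical). A real deployment would store the embeddings in a vector database and send the retrieved frames, together with the query, to an LVLM such as LLaVA.

# Minimal multimodal retrieval sketch (illustrative only).
# Assumes: pip install sentence-transformers pillow numpy
import numpy as np
from PIL import Image
from sentence_transformers import SentenceTransformer

# CLIP maps both images and text into the same embedding space.
model = SentenceTransformer("clip-ViT-B-32")

# Index a few video frames (paths are hypothetical).
frame_paths = ["frames/space1.jpg", "frames/space2.jpg", "frames/space3.jpg"]
frame_embeddings = model.encode([Image.open(p) for p in frame_paths])

# Embed the user query and run a cosine-similarity search.
query = "rocket on the launchpad before liftoff"
query_embedding = model.encode([query])[0]

scores = frame_embeddings @ query_embedding / (
    np.linalg.norm(frame_embeddings, axis=1) * np.linalg.norm(query_embedding)
)
best = frame_paths[int(np.argmax(scores))]
print("Most relevant frame:", best)

# The retrieved frame plus the original query would then be passed
# to an LVLM (e.g. LLaVA) to generate the final answer.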

Example:

We will use an LVLM (LLaVA) to analyze frames from a video titled "SpaceX launched to space to bring back stranded astronauts" (https://www.youtube.com/watch?v=-SIyey_Yo1k).
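The frames below (space1.jpg, space3.jpg) are still images captured from the video. As a rough sketch of how such frames could be pulled from a locally downloaded copy, something like the following OpenCV loop would work; the file name spacex_news.mp4 and the 10-second sampling interval are assumptions for illustration.

# One possible way to grab still frames from a local copy of the video.
# Assumes: pip install opencv-python, and the video saved as spacex_news.mp4.
import cv2

video = cv2.VideoCapture("spacex_news.mp4")
fps = video.get(cv2.CAP_PROP_FPS) or 30  # fall back if FPS is unavailable

saved, frame_index = 0, 0
while True:
    ok, frame = video.read()
    if not ok:
        break
    # Save one frame roughly every 10 seconds of video.
    if frame_index % int(fps * 10) == 0:
        saved += 1
        cv2.imwrite(f"space{saved}.jpg", frame)
    frame_index += 1

video.release()
print(f"Saved {saved} frames")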

Image: space1
>>> can you describe this image "..\space1.jpg"
llava: Added image '..\space1.jpg'
This image shows a screenshot of a television screen displaying an aerial view of spacecraft launchpad,
presumably at NASA's Kennedy Space Center. The video is titled "NASA to begin sending people to the moon," with
text underneath indicating that "Spacecraft Launches."

There are overlaid news-related elements: in the bottom left corner, there is a news logo and the caption
"Breaking News" with additional text reading "Spacecraft Launches to Retrieve Stranded Astronauts," suggesting an
event of concern or importance.
The background image appears to be a satellite view of Earth with a cloud-covered sky, and there are blurred texts
visible on the bottom left corner of the screen, likely referring to additional context or information about the
event being covered by the news program.

Now let's look at another frame from the middle of the video.

Image: space3
>>>  can you describe this image "..\space3.jpg"
llava: Added image '..\space3.jpg'
The image shows a split-screen view of two different scenes. On the left side of the screen is an aerial view of a spacecraft
launchpad, presumably at NASAs Kennedy Space Center. There are blurred texts visible on the bottom left corner of the screen,
likely referring to additional context or information about the event being covered by the news program.

On the right side, there is a news segment featuring two individuals in what appears to be a spacecraft environment. One person
is standing and holding a microphone, while the other individual is seated with their hands resting on the control panel of the
cockpit. The setting suggests they are in a simulated spacecraft, possibly part of a news report or documentary, discussing the
upcoming launch to retrieve stranded astronauts.

The top right corner has a news logo and the caption "Breaking News" with additional text reading "Spacecraft Launches to
Retrieve Stranded Astronauts."

The overall style of the image suggests it is a still frame from a television news program or video, meant to convey
information about a space exploration event.

Breaking Down the Space Launch: An LVLM Analysis

We put our LVLM to the test with frames from a SpaceX launch video. Without any extra context, it produced a detailed description of the scene, picking up the launchpad, the news overlays, and the on-screen captions. It's almost like it has a human eye for detail!

LVLMs analyze videos frame by frame: we extract still images and feed them to the model one at a time. If a frame contains captions or other on-screen text, the model reads and uses it. Even without text, it can infer what is happening from the visual patterns it learned during training. It's like describing a picture with no caption, relying on what you already know.
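The session above was run interactively, but the same frame-by-frame questioning can be scripted. Here is a rough sketch that assumes LLaVA is served locally through Ollama and that the ollama Python package is installed (pip install ollama); the frame paths are the ones used above.

# Script the frame-by-frame questioning of LLaVA via a local Ollama server.
import ollama

frames = ["space1.jpg", "space3.jpg"]  # frames shown above
for path in frames:
    response = ollama.chat(
        model="llava",
        messages=[{
            "role": "user",
            "content": "Can you describe this image?",
            "images": [path],  # local file path handed to the vision model
        }],
    )
    print(path, "->", response["message"]["content"])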

This makes LVLMs super smart. They can connect pictures, words, and sounds to understand things better. It’s a big step towards AI that can “see” and “understand” like we do.

Conclusion:

MLLMs and LVLMs are changing how we think about AI. They’re like supercomputers that can understand things the way we do, using information from different sources. This is a big deal because it makes AI more human-like.

This also opens up new possibilities for using AI in different areas. For example, when processing car insurance claims, we consider things like accident pictures, videos, witness statements, police reports, and even forensic evidence. MLLMs and LVLMs can help us analyze all this information more accurately.

So, in the future, we can expect AI to be even more helpful and useful in our daily lives.
