Building Multimodal Apps

Putting modalities together in a real product raises architecture questions a text-only app never faces. This page covers the decisions that recur.

Native model vs. pipeline

There are two ways to build a system that handles more than one modality.

A native multimodal model accepts several modalities directly — one model takes image and text, or audio and text. A pipeline chains specialized single-purpose models: speech-to-text → LLM → text-to-speech, or OCR → LLM.

	Native multimodal model	Pipeline of specialists
Cross-modal nuance	Preserved — tone, layout, context	Lost at each boundary
Control & debugging	One opaque box	Inspect and swap each stage
Cost / latency	One call	Often cheaper, tunable per stage
Best stage quality	Whatever the model offers	Pick the best tool for each step
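
The "inspect and swap each stage" row can be made concrete with a minimal sketch. The stage functions below (`fake_stt`, `fake_llm`, `fake_tts`) and the `Pipeline` class are hypothetical stand-ins, not real model calls. The point is only that every stage boundary becomes a place to log, test, or replace an implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple

Stage = Tuple[str, Callable[[Any], Any]]


@dataclass
class Pipeline:
    """Runs stages in order and records each intermediate output."""
    stages: List[Stage]
    trace: List[Tuple[str, Any]] = field(default_factory=list)

    def run(self, payload: Any) -> Any:
        self.trace.clear()
        for name, fn in self.stages:
            payload = fn(payload)
            # Each boundary is an inspection point for debugging.
            self.trace.append((name, payload))
        return payload


# Hypothetical stand-ins for real speech and language models.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


voice = Pipeline([("stt", fake_stt), ("llm", fake_llm), ("tts", fake_tts)])
reply = voice.run(b"hello")
```

After a run, `voice.trace` holds the output of every stage, and swapping one stage means replacing one entry in the list. The same structure also shows the cost: the LLM stage only ever sees a plain string, so anything the transcript does not carry (tone, for example) is gone by that point.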

Multimodal RAG

Standard RAG assumes text. When the knowledge you want to retrieve is images, diagrams, screenshots, or audio, you have two options:

Caption-based — at indexing time, a VLM describes each image in words; you then embed and retrieve those captions with ordinary text RAG. Simple, reuses your whole text stack, and the captions double as readable context. The cost: the caption is a lossy summary — detail the VLM didn’t mention is unsearchable.
Multimodal embeddings — a CLIP-style model embeds images and text into one shared space, so a text query retrieves images directly with no captioning step. More faithful, but a separate model and index to run, and the retrieved image still has to go through a VLM, or be described in text, before a text-only LLM can reason over it.
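
A minimal sketch of the caption-based option, under loud assumptions: `describe_image` is a hypothetical stand-in for a VLM captioning call with canned output, and `embed` is a toy bag-of-words vector where a real system would use a text embedding model. The image IDs and captions are invented for illustration.

```python
import math
from collections import Counter
from typing import List, Tuple

IndexEntry = Tuple[str, str, Counter]


def describe_image(image_id: str) -> str:
    """Hypothetical stand-in for a VLM captioning call (canned output)."""
    canned = {
        "img_001": "bar chart of quarterly revenue by region",
        "img_002": "screenshot of a login error dialog",
        "img_003": "architecture diagram with a load balancer and two databases",
    }
    return canned[image_id]


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding; assumption, not a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_index(image_ids: List[str]) -> List[IndexEntry]:
    """Indexing time: caption each image once, then embed the caption."""
    index = []
    for image_id in image_ids:
        caption = describe_image(image_id)
        index.append((image_id, caption, embed(caption)))
    return index


def search(index: List[IndexEntry], query: str, k: int = 1) -> List[Tuple[str, str]]:
    """Query time: ordinary text retrieval over the captions."""
    q = embed(query)
    ranked = sorted(index, key=lambda entry: cosine(q, entry[2]), reverse=True)
    return [(image_id, caption) for image_id, caption, _ in ranked[:k]]


index = build_index(["img_001", "img_002", "img_003"])
hits = search(index, "login error screenshot")
```

Each hit carries its caption, which can go straight into the prompt as readable context. The lossy side is visible too: a query about a detail no caption mentions, such as a colour, scores zero against every entry.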

Caption-based is the pragmatic default; reach for multimodal embeddings when visual nuance the captions miss is genuinely important.
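
For contrast, a sketch of retrieval in a shared embedding space. Everything numeric here is an assumption: the three-dimensional vectors are hand-written stand-ins for what a CLIP-style image encoder would produce, and `embed_text_query` is a hypothetical text encoder with one canned answer. Only the shape of the lookup is meant to carry over.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hand-written vectors standing in for image embeddings (assumption).
IMAGE_VECTORS: Dict[str, List[float]] = {
    "chart.png": [0.9, 0.1, 0.0],
    "error_dialog.png": [0.1, 0.9, 0.1],
    "diagram.png": [0.0, 0.2, 0.9],
}


def embed_text_query(query: str) -> List[float]:
    """Hypothetical text encoder of the same model; canned for the sketch."""
    canned = {"login failure popup": [0.2, 0.8, 0.1]}
    return canned[query]


def retrieve(query: str, k: int = 1) -> List[str]:
    """Rank images by similarity to the query vector; no captions involved."""
    q = embed_text_query(query)
    ranked = sorted(
        IMAGE_VECTORS,
        key=lambda name: cosine(q, IMAGE_VECTORS[name]),
        reverse=True,
    )
    return ranked[:k]


top = retrieve("login failure popup")
```

The query shares no words with any file name or caption, yet the nearest image is still found, because matching happens in vector space. What comes back is an image reference, not text, which is why a further step is needed before a language model can use it.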

Cost and latency are different here

Mixed media breaks text-based intuitions:

Images and audio are token-hungry. A single high-detail image can cost as much as several pages of text; audio scales with duration. Re-model your cost estimates — multimodal requests are not text requests.
Generation is slow. Image and audio generation take seconds — run them asynchronously, stream progress, never block a request on them.
Each modality fails its own way. A VLM misreads a chart; STT mistranscribes a name. Build a separate evaluation signal per modality — one aggregate score hides which part broke.
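
The last point, that one aggregate score hides which part broke, is easy to show with a small sketch. The result data below is invented for illustration, and `score_by_modality` is a hypothetical helper, not part of any evaluation library.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def score_by_modality(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Return the pass rate for each modality separately."""
    passed: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for modality, ok in results:
        total[modality] += 1
        passed[modality] += int(ok)
    return {modality: passed[modality] / total[modality] for modality in total}


# Invented evaluation results: (modality, did the check pass?)
results = (
    [("speech", True)] * 10
    + [("text", True)] * 5
    + [("vision", True)] * 2
    + [("vision", False)] * 3
)

aggregate = sum(ok for _, ok in results) / len(results)
per_modality = score_by_modality(results)
```

With this data the aggregate pass rate is 0.85, which looks healthy, while the vision pass rate on its own is 0.4.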

Recurring patterns

Document intelligence — PDF or scan → VLM/OCR → structured data extraction.
Visual support — user sends a screenshot → VLM diagnoses the problem.
Voice assistant — speech-to-text → LLM → text-to-speech, the voice agent pipeline.
Generated media — text or image in → image/audio out, as an async job.
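
The "async job" shape for generated media can be sketched with the standard library alone. The assumptions: `generate_image` is a hypothetical generator where a short sleep stands in for model time, and the in-memory `JOBS` dict stands in for whatever job store a real service would use.

```python
import asyncio
import uuid
from typing import Dict

# In-memory job store; a real service would persist this (assumption).
JOBS: Dict[str, dict] = {}


async def generate_image(prompt: str) -> str:
    """Hypothetical slow generator; the sleep stands in for model time."""
    await asyncio.sleep(0.05)
    return f"image-for:{prompt}"


async def _run_job(job_id: str, prompt: str) -> None:
    try:
        JOBS[job_id]["result"] = await generate_image(prompt)
        JOBS[job_id]["status"] = "done"
    except Exception as exc:
        # Failure path: record the error instead of losing the job.
        JOBS[job_id]["status"] = "failed"
        JOBS[job_id]["error"] = str(exc)


def submit(prompt: str) -> str:
    """Start generation in the background and return a job id at once."""
    job_id = uuid.uuid4().hex
    JOBS[job_id] = {"status": "running", "result": None}
    # Keep a reference to the task so it is not garbage collected.
    JOBS[job_id]["task"] = asyncio.create_task(_run_job(job_id, prompt))
    return job_id


async def main() -> tuple:
    job_id = submit("a red bicycle")
    status_at_submit = JOBS[job_id]["status"]  # caller is not blocked
    await JOBS[job_id]["task"]  # a client would poll or be notified
    return status_at_submit, JOBS[job_id]["status"], JOBS[job_id]["result"]


outcome = asyncio.run(main())
```

`submit` returns before any generation has happened, so the request that triggered it can answer immediately with the job id. The client then polls the status or receives a notification when the job reaches "done" or "failed".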

Each is the same discipline as any AI system: constrain the task, validate the output, handle the failure path — now across more than one kind of data.
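
A sketch of that discipline for the document intelligence pattern: constrain the task to two fields, validate what comes back, and route failures to review. The field names, the retry count, and the `model` callable are all hypothetical; `model` is any function that takes document text and returns a string.

```python
import json
from typing import Callable, Optional


def validate_extraction(raw: str) -> Optional[dict]:
    """Accept only a JSON object with a non-empty id and a non-negative total."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    number = data.get("invoice_number")
    total = data.get("total")
    if not isinstance(number, str) or not number:
        return None
    if isinstance(total, bool) or not isinstance(total, (int, float)) or total < 0:
        return None
    return {"invoice_number": number, "total": float(total)}


def extract_invoice(
    document_text: str, model: Callable[[str], str], max_attempts: int = 2
) -> dict:
    """Retry a bounded number of times, then hand off to human review."""
    for _ in range(max_attempts):
        parsed = validate_extraction(model(document_text))
        if parsed is not None:
            return {"status": "ok", "data": parsed}
    return {"status": "needs_review", "data": None}


def make_flaky_model() -> Callable[[str], str]:
    """Hypothetical model: first reply is prose, second is valid JSON."""
    replies = iter(
        ["Sure! The total is 42.", '{"invoice_number": "INV-7", "total": 42.5}']
    )
    return lambda _doc: next(replies)


result = extract_invoice("scanned invoice text", make_flaky_model())
fallback = extract_invoice("scanned invoice text", lambda _doc: "not json")
```

The first call recovers on its second attempt; the second call never produces valid output and ends in "needs_review" rather than passing bad data downstream.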

Key takeaways

Choose a native multimodal model for cross-modal nuance and simplicity, a pipeline of specialists for control and observability — watching for information loss at the seams. For multimodal RAG, caption-based retrieval is the pragmatic default; multimodal embeddings are more faithful but add a model and index. Images and audio consume far more tokens than text and generate slowly, so re-model cost and run generation asynchronously — and evaluate each modality separately, because each fails in its own way.