Building Multimodal Apps

Putting modalities together in a real product raises architecture questions a text-only app never faces. This page covers the decisions that recur.

There are two ways to build a system that handles more than one modality.

A native multimodal model accepts several modalities directly — one model takes image and text, or audio and text. A pipeline chains specialized single-purpose models: speech-to-text → LLM → text-to-speech, or OCR → LLM.

                    | Native multimodal model            | Pipeline of specialists
Cross-modal nuance  | Preserved (tone, layout, context)  | Lost at each boundary
Control & debugging | One opaque box                     | Inspect and swap each stage
Cost / latency      | One call                           | Often cheaper, tunable per stage
Best stage quality  | Whatever the model offers          | Pick the best tool for each step
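
The "inspect and swap each stage" property of a pipeline can be sketched as follows. This is a minimal illustration, assuming each stage is a plain Python callable; `Pipeline`, `fake_stt`, `fake_llm`, and `fake_tts` are hypothetical names standing in for real model calls.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple


@dataclass
class Pipeline:
    """Chain of named stages that records every intermediate result."""
    stages: List[Tuple[str, Callable[[Any], Any]]]
    trace: List[Tuple[str, Any]] = field(default_factory=list)

    def run(self, data: Any) -> Any:
        self.trace.clear()
        for name, stage in self.stages:
            data = stage(data)
            self.trace.append((name, data))  # keep the output of each stage
        return data


# Hypothetical stand-ins for real models (assumptions, not real APIs).
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")      # pretend the bytes are speech

def fake_llm(prompt: str) -> str:
    return f"You said: {prompt}"

def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")       # pretend the bytes are audio


voice = Pipeline([("stt", fake_stt), ("llm", fake_llm), ("tts", fake_tts)])
reply = voice.run(b"hello")
```

After a run, `voice.trace` holds the text that left the `stt` stage and the text that left the `llm` stage, so a bad answer can be traced to the stage that produced it. It also shows the seam: whatever the transcript does not carry, such as tone of voice, never reaches the later stages.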

Classic RAG assumes the knowledge base is text. When the knowledge you want to retrieve lives in images, diagrams, screenshots, or audio, you have two options:

[Figure: two retrieval designs. A · Caption-based: Image → VLM → Caption text → Text embed → Store. B · Multimodal embeddings: Image or text → Multimodal embedder → Shared vector space (a text query and an image embed into the same space and match directly).]
  • Caption-based: at indexing time, a VLM describes each image in words; you then embed and retrieve those captions with ordinary text RAG. It is simple, reuses your whole text stack, and the captions double as readable context. The cost is that the caption is a lossy summary: any detail the VLM didn’t mention is unsearchable.
  • Multimodal embeddings: a CLIP-style model embeds images and text into one shared space, so a text query retrieves images directly with no captioning step. This is more faithful, but it adds a separate model and index to run, and the retrieved image still needs describing before a text-only LLM can reason over it.
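
A rough sketch of the caption-based option, under loud assumptions: `caption_image` is a hypothetical stand-in for a VLM call that returns canned captions, and `embed_text` is a toy bag-of-words vector rather than a real text embedder. Only the shape of the flow (caption at index time, then ordinary text retrieval) is the point.

```python
import math
from collections import Counter
from typing import List, Tuple

IndexEntry = Tuple[str, str, Counter]  # (image id, caption, caption vector)


def caption_image(image_id: str) -> str:
    """Hypothetical stand-in for a VLM captioning call."""
    canned = {
        "img_001": "bar chart of quarterly revenue by region",
        "img_002": "screenshot of a login error dialog",
    }
    return canned[image_id]


def embed_text(text: str) -> Counter:
    """Toy bag-of-words vector; a real system would call a text embedder."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(image_ids: List[str]) -> List[IndexEntry]:
    """Index time: caption each image once, then embed the caption."""
    index = []
    for image_id in image_ids:
        caption = caption_image(image_id)
        index.append((image_id, caption, embed_text(caption)))
    return index


def search(index: List[IndexEntry], query: str) -> Tuple[str, str]:
    """Query time: plain text retrieval over the stored captions."""
    query_vec = embed_text(query)
    best = max(index, key=lambda entry: cosine(query_vec, entry[2]))
    return best[0], best[1]


index = build_index(["img_001", "img_002"])
hit_id, hit_caption = search(index, "login error screenshot")
```

The returned caption doubles as context for the LLM. The lossy-summary cost shows up too: a query about a detail the caption never mentions (say, the color of a button) has no overlap with any caption and cannot find the image.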

Caption-based is the pragmatic default; reach for multimodal embeddings when visual nuance the captions miss is genuinely important.
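
For comparison, retrieval in a shared vector space reduces to nearest-neighbor search between a query vector and image vectors. The sketch below assumes the vectors already exist; the three-dimensional values are hand-made placeholders, not output from any real embedder, and `top_k` is a hypothetical helper.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Placeholder vectors. In practice one multimodal embedder would produce
# both the image vectors and the query vector, so they are comparable.
image_vectors: Dict[str, List[float]] = {
    "chart.png": [0.9, 0.1, 0.0],
    "error_dialog.png": [0.1, 0.9, 0.2],
}
query_vector = [0.2, 0.8, 0.1]  # stands in for the embedded text "login error"


def top_k(query: Sequence[float], vectors: Dict[str, List[float]], k: int = 1) -> List[str]:
    """Rank image names by cosine similarity to the query vector."""
    ranked = sorted(vectors, key=lambda name: cosine(query, vectors[name]), reverse=True)
    return ranked[:k]


results = top_k(query_vector, image_vectors, k=1)
```

No caption is involved, so nothing is lost to summarization, but the result is an image name rather than text that can be handed to a text-only model.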

Mixed media breaks text-based intuitions:

  • Images and audio are token-hungry. A single high-detail image can cost as much as several pages of text; audio scales with duration. Re-model your cost estimates — multimodal requests are not text requests.
  • Generation is slow. Image and audio generation take seconds — run them asynchronously, stream progress, never block a request on them.
  • Each modality fails its own way. A VLM misreads a chart; STT mistranscribes a name. Build a separate evaluation signal per modality — one aggregate score hides which part broke.

Common application patterns:

  • Document intelligence — PDF or scan → VLM/OCR → structured data extraction.
  • Visual support — user sends a screenshot → VLM diagnoses the problem.
  • Voice assistant — the voice agent pipeline.
  • Generated media — text or image in → image/audio out, as an async job.
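
The "async job" shape for slow generation can be sketched with `asyncio`. Everything here is an assumption for illustration: `generate_image` is a hypothetical stand-in that just sleeps, and the in-memory `jobs` dict takes the place of a real job store or queue.

```python
import asyncio
import uuid
from typing import Dict, Tuple

jobs: Dict[str, dict] = {}  # in-memory job store; a real app would persist this


async def generate_image(prompt: str) -> str:
    """Hypothetical stand-in for a slow image-generation call."""
    await asyncio.sleep(0.05)
    return f"image-for:{prompt}"


async def _run_job(job_id: str, prompt: str) -> None:
    try:
        jobs[job_id]["result"] = await generate_image(prompt)
        jobs[job_id]["status"] = "done"
    except Exception as exc:  # record the failure instead of losing it
        jobs[job_id]["status"] = "failed"
        jobs[job_id]["error"] = str(exc)


def submit(prompt: str) -> str:
    """Start generation in the background and return a job id right away."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = {"status": "running", "result": None}
    # Keep a reference to the task so it is not garbage-collected.
    jobs[job_id]["task"] = asyncio.create_task(_run_job(job_id, prompt))
    return job_id


async def demo() -> Tuple[str, str, str]:
    job_id = submit("a red bicycle")
    status_at_submit = jobs[job_id]["status"]  # the request would return here
    await jobs[job_id]["task"]                 # later: the client polls or is notified
    return status_at_submit, jobs[job_id]["status"], jobs[job_id]["result"]


outcome = asyncio.run(demo())
```

The request path only calls `submit`, which returns immediately with status "running"; the generation itself finishes later and the client picks up the result by job id.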

Each is the same discipline as any AI system: constrain the task, validate the output, handle the failure path — now across more than one kind of data.
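
As one illustration of "validate the output, handle the failure path", here is a sketch for the document-intelligence pattern. It assumes the model was asked to return JSON; the schema in `REQUIRED_FIELDS` and the "needs_review" fallback are hypothetical choices, not a prescribed design.

```python
import json
from typing import Optional

# Hypothetical schema: field name -> accepted Python type(s).
REQUIRED_FIELDS = {"invoice_number": str, "total": (int, float)}


def validate_extraction(raw: str) -> Optional[dict]:
    """Return the parsed record if it matches the schema, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for name, expected in REQUIRED_FIELDS.items():
        if not isinstance(data.get(name), expected):
            return None
    return data


def handle(raw: str) -> dict:
    """Accept valid extractions; route everything else to a review queue."""
    record = validate_extraction(raw)
    if record is None:
        return {"status": "needs_review", "raw": raw}
    return {"status": "ok", "record": record}


good = handle('{"invoice_number": "INV-7", "total": 129.5}')
bad = handle('{"invoice_number": "INV-7", "total": "lots"}')
```

The task is constrained to two typed fields, the output is checked before anything downstream trusts it, and a record that fails the check is kept for review rather than silently dropped.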

Choose a native multimodal model for cross-modal nuance and simplicity, a pipeline of specialists for control and observability — watching for information loss at the seams. For multimodal RAG, caption-based retrieval is the pragmatic default; multimodal embeddings are more faithful but add a model and index. Images and audio consume far more tokens than text and generate slowly, so re-model cost and run generation asynchronously — and evaluate each modality separately, because each fails in its own way.
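
To make the last point concrete, the sketch below scores the same evaluation run two ways. The pass/fail rows are invented for illustration, and `aggregate_score` and `per_modality_scores` are hypothetical helper names.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (modality, passed) pairs from a hypothetical evaluation run.
results: List[Tuple[str, bool]] = [
    ("vision", True), ("vision", False), ("vision", False), ("vision", False),
    ("speech", True), ("speech", True), ("speech", True), ("speech", True),
]


def aggregate_score(rows: List[Tuple[str, bool]]) -> float:
    """Single pass rate over every test case."""
    return sum(passed for _, passed in rows) / len(rows)


def per_modality_scores(rows: List[Tuple[str, bool]]) -> Dict[str, float]:
    """Pass rate computed separately for each modality."""
    buckets: Dict[str, List[bool]] = defaultdict(list)
    for modality, passed in rows:
        buckets[modality].append(passed)
    return {modality: sum(flags) / len(flags) for modality, flags in buckets.items()}


overall = aggregate_score(results)
by_modality = per_modality_scores(results)
```

With these rows the aggregate is 0.625, which looks mediocre but not alarming, while the split shows speech at 1.0 and vision at 0.25: the aggregate hid which part broke.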