Multimodal AI

Most of this guide is about text. But real products increasingly speak, see, and listen. Multimodal AI covers models that work with images, audio, and video — and the engineering of applications that combine them with language.

The good news: almost everything you’ve learned about LLMs carries over. A vision-language model is still a transformer; a voice agent is still an LLM with extra stages around it. You’re adding new input and output types, not new mental models.

In this section

Vision & Images: Vision-language models for understanding images, and diffusion models for generating them.

Audio & Speech: Speech-to-text, text-to-speech, audio understanding, and the architecture of voice agents.

Building Multimodal Apps: Native models vs. pipelines, multimodal RAG, and the cost and latency realities of mixed media.

What you’ll be able to do

Choose between a vision-language model, dedicated OCR, and classic computer vision; design a voice agent and reason about its latency budget; and build retrieval over images and audio — not just text.

Prerequisites

LLM Engineering and, for the retrieval material, Vector Databases.