Multimodal AI

Most of this guide is about text. But real products increasingly speak, see, and listen. Multimodal AI covers models that work with images, audio, and video — and the engineering of applications that combine them with language.

The good news: almost everything you’ve learned about LLMs carries over. A vision-language model is still a transformer; a voice agent is still an LLM with extra stages around it. The mental models transfer — you’re just adding new input and output types.

By the end, you should be able to: choose between a vision-language model, dedicated OCR, and classic computer vision; design a voice agent and reason about its latency budget; and build retrieval over images and audio, not just text.

Prerequisites: LLM Engineering and, for the retrieval material, Vector Databases.