Vision & Images

Working with images splits into two very different capabilities: understanding images (describing, reading, answering questions about them) and generating images from a description. Different models, different trade-offs.

Vision-language models (VLMs)

A vision-language model takes images and text as input and produces text as output. “Describe this screenshot,” “extract the line items from this invoice,” “what’s wrong in this diagram” — all VLM tasks.

The architecture is simpler than it sounds, because it reuses the LLM you already know.

An image encoder turns the picture into feature vectors; a small projector maps those into the LLM’s embedding space; from there the image is just tokens the LLM reads alongside the text. That’s the whole trick — and it’s why an image “costs tokens.”
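The pipeline above can be sketched with toy stand-ins. This is a minimal illustration, not how any particular model is built: the patch size, the three-number "features", the random projector weights, and the function names are all assumptions chosen to keep the example small. It shows only the data flow: patches become feature vectors, the projector maps them to the LLM's embedding size, and the result is concatenated with the text tokens.

```python
import random

PATCH = 4        # patch side in pixels (assumed; real encoders use larger patches)
FEATURE_DIM = 3  # size of each encoder feature vector (toy value)
EMBED_DIM = 5    # size of the LLM's token embeddings (toy value)


def encode_image(image):
    """Toy 'image encoder': one feature vector per PATCH x PATCH tile."""
    height, width = len(image), len(image[0])
    features = []
    for top in range(0, height, PATCH):
        for left in range(0, width, PATCH):
            tile = [image[y][x]
                    for y in range(top, min(top + PATCH, height))
                    for x in range(left, min(left + PATCH, width))]
            # Stand-in features: mean, min, and max brightness of the tile.
            features.append([sum(tile) / len(tile), min(tile), max(tile)])
    return features


def project(features, weights):
    """Toy 'projector': a linear map from FEATURE_DIM to EMBED_DIM."""
    return [[sum(f * w for f, w in zip(vec, column)) for column in weights]
            for vec in features]


def build_llm_input(image, text_embeddings, weights):
    """Image tokens first, then text tokens, as one sequence for the LLM."""
    image_tokens = project(encode_image(image), weights)
    return image_tokens + text_embeddings


rng = random.Random(0)
weights = [[rng.uniform(-1, 1) for _ in range(FEATURE_DIM)] for _ in range(EMBED_DIM)]
image = [[rng.random() for _ in range(16)] for _ in range(8)]   # 8 x 16 grayscale
text_embeddings = [[0.0] * EMBED_DIM for _ in range(6)]          # 6 text tokens

sequence = build_llm_input(image, text_embeddings, weights)
print(len(sequence))  # 14: 8 image tokens (2 x 4 patches) + 6 text tokens
```

Doubling the image's width and height would quadruple the number of patches, which is the sense in which a larger image costs more tokens.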

What VLMs are good for

Document and screenshot understanding, OCR-style text extraction, chart and diagram reading, visual question answering, UI inspection, generating alt text for accessibility, and pulling structured data out of photos.

Limits to design around

VLMs are strong at describing and interpreting, weak at precision:

Exact counting of many objects is unreliable.
Precise spatial answers (exact coordinates, tight bounding boxes) are shaky.
Tiny or low-contrast text gets misread.
They hallucinate about images just as they do about text — confidently describing things that aren’t there.
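One way to design around these limits is to ask the model for structured output and check it mechanically before using it. The sketch below assumes a hypothetical invoice-extraction prompt whose reply is JSON with `line_items` (each with an `amount`) and a `total`; the field names and the `validate_invoice` helper are illustrative, not part of any API.

```python
import json


def validate_invoice(raw_output, tolerance=0.01):
    """Return (data, problems). An empty problems list means every check passed."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return None, [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["expected a JSON object"]

    problems = []
    items = data.get("line_items")
    total = data.get("total")
    if not isinstance(items, list) or not items:
        problems.append("line_items missing or empty")
        items = []
    if not isinstance(total, (int, float)) or isinstance(total, bool):
        problems.append("total missing or not a number")
        total = None

    computed = 0.0
    for index, item in enumerate(items):
        amount = item.get("amount") if isinstance(item, dict) else None
        if not isinstance(amount, (int, float)) or isinstance(amount, bool):
            problems.append(f"line_items[{index}].amount is not a number")
        else:
            computed += amount

    # Cross-check: a misread digit usually breaks the arithmetic.
    if total is not None and not problems and abs(computed - total) > tolerance:
        problems.append(f"line items sum to {computed:.2f}, but total is {total:.2f}")
    return data, problems


GOOD = ('{"line_items": [{"description": "Widget", "amount": 19.99}, '
        '{"description": "Shipping", "amount": 5.00}], "total": 24.99}')
BAD = GOOD.replace('"total": 24.99', '"total": 42.99')  # simulated misread digit

good_data, good_problems = validate_invoice(GOOD)
bad_data, bad_problems = validate_invoice(BAD)
print(good_problems)  # []
print(bad_problems)   # ['line items sum to 24.99, but total is 42.99']
```

A failed check does not say which value is wrong, only that the output should not be trusted as-is; a retry, a higher-resolution pass, or human review are all reasonable follow-ups.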

Practical notes

Images consume tokens roughly in proportion to their pixel count, because the encoder typically splits each image into fixed-size patches or tiles, so a high-detail image can cost as much as pages of text. Most APIs expose a detail or resolution setting; use the lowest that works. Validate VLM output the same way you would any LLM output: structure it, check it, don’t trust it raw.
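To make the cost point concrete, here is a rough estimator. It assumes one token per 32 x 32 pixel patch, which is an illustrative figure only; each provider documents its own formula, so treat the numbers as relative, not absolute. The `fit_within` helper is likewise a hypothetical stand-in for whatever downscaling or detail setting your API offers.

```python
import math


def estimate_image_tokens(width, height, patch=32):
    """Rough token count: one token per patch (the patch size is an assumption)."""
    return math.ceil(width / patch) * math.ceil(height / patch)


def fit_within(width, height, max_side):
    """Scale dimensions down so the longest side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


full = estimate_image_tokens(2048, 1536)
small = estimate_image_tokens(*fit_within(2048, 1536, max_side=1024))
print(full, small)  # 3072 768
```

Halving each side cuts the estimate to a quarter, so it is worth checking whether the task still succeeds at the smaller size before paying for the larger one.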

Image generation

Image generation runs the other direction: text in, a new image out. The dominant approach is the diffusion model.

The intuition: a diffusion model is trained by taking real images, progressively adding noise until only static remains, and learning to reverse that process. To generate, it starts from pure noise and iteratively denoises, guided by your text prompt, until a coherent image emerges.
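The noising and denoising loop can be sketched on a four-number stand-in for an image. Several things here are assumptions made for illustration: the linear noise schedule and its constants, the deterministic (DDIM-style) update, and above all the `oracle`, which simply returns the true noise. In a real model that role is played by a trained network that also takes the text prompt as input; there is no text guidance in this toy.

```python
import math
import random


def alpha_bar_schedule(steps=1000, beta_start=1e-4, beta_end=0.02):
    """Fraction of the original signal left at each step (linear noise schedule)."""
    alpha_bars, running = [], 1.0
    for t in range(steps):
        beta = beta_start + (beta_end - beta_start) * t / (steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, alpha_bar, noise):
    """Forward process: mix the clean signal with noise at a given level."""
    keep, mix = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [keep * x + mix * e for x, e in zip(x0, noise)]


def predict_clean(xt, alpha_bar, predicted_noise):
    """Invert add_noise using an estimate of the noise."""
    keep, mix = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(x - mix * e) / keep for x, e in zip(xt, predicted_noise)]


def denoise(xt, alpha_bars, predict_noise, stride=100):
    """Reverse process: step from high noise to low noise, re-estimating each time."""
    timesteps = list(range(len(alpha_bars) - 1, -1, -stride))
    x = xt
    for i, t in enumerate(timesteps):
        eps = predict_noise(x, t)
        x0_hat = predict_clean(x, alpha_bars[t], eps)
        if i + 1 == len(timesteps):
            return x0_hat
        # Move to the next, less noisy level using the current estimates.
        x = add_noise(x0_hat, alpha_bars[timesteps[i + 1]], eps)
    return x


rng = random.Random(0)
alpha_bars = alpha_bar_schedule()
x0 = [0.8, -0.3, 0.5, 0.1]                  # tiny stand-in "image"
noise = [rng.gauss(0.0, 1.0) for _ in x0]
x_noisy = add_noise(x0, alpha_bars[-1], noise)   # almost pure noise


def oracle(x, t):
    """Stand-in for the trained network: it already knows the true noise."""
    return noise


recovered = denoise(x_noisy, alpha_bars, oracle)
```

With a perfect noise estimate the loop recovers `x0` up to floating-point error. A trained network is only approximately right, which is one reason real samplers take many small steps instead of one large jump.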

Variants you’ll meet: text-to-image (prompt only), image-to-image (transform an existing image), inpainting (regenerate a masked region), and conditioning on structure (depth, pose, edges) for control.
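As a small illustration of what "regenerate a masked region" means at the pixel level, the sketch below keeps the original wherever the mask is 0 and takes newly generated pixels wherever it is 1. The `composite_inpaint` name and the nested-list image format are assumptions for the example; real pipelines apply the mask inside the denoising loop as well, which this toy does not attempt.

```python
def composite_inpaint(original, generated, mask):
    """Keep original pixels where mask is 0; take generated pixels where mask is 1."""
    return [[new if m else old for old, new, m in zip(orow, grow, mrow)]
            for orow, grow, mrow in zip(original, generated, mask)]


original = [[1, 1, 1],
            [1, 1, 1]]
generated = [[9, 9, 9],
             [9, 9, 9]]
mask = [[0, 1, 1],
        [0, 0, 1]]

result = composite_inpaint(original, generated, mask)
print(result)  # [[1, 9, 9], [1, 1, 9]]
```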

Strengths and limits

Generation excels at concepts, illustrations, mockups, textures, and variations. It struggles with rendering exact text inside images, precise counts, consistent characters across images, and exact spatial layouts.

Engineering notes

Generation is slow (many denoising steps) and costly — treat it as an asynchronous job, not a synchronous request. Stream progress to the user, cache aggressively, and set clear expectations about wait time.
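The job pattern described above might look like the following in-process sketch. Everything named here is hypothetical: `GenerationService`, the placeholder result string, and the `asyncio.sleep` standing in for a denoising step. A production system would more likely use a real queue and persistent storage, and caching on a normalised prompt is a simplification that only helps when identical requests repeat.

```python
import asyncio
import hashlib
import uuid


class GenerationService:
    def __init__(self, steps=5):
        self.steps = steps
        self.jobs = {}       # job_id -> {"status", "progress", "result"}
        self.cache = {}      # prompt hash -> finished result
        self._tasks = set()  # keep references so background tasks are not collected

    @staticmethod
    def _key(prompt):
        return hashlib.sha256(prompt.strip().lower().encode("utf-8")).hexdigest()

    def submit(self, prompt):
        """Return a job id immediately; generation runs in the background."""
        job_id, key = uuid.uuid4().hex, self._key(prompt)
        if key in self.cache:
            self.jobs[job_id] = {"status": "done", "progress": 1.0,
                                 "result": self.cache[key]}
            return job_id
        self.jobs[job_id] = {"status": "running", "progress": 0.0, "result": None}
        task = asyncio.create_task(self._run(job_id, key))
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)
        return job_id

    async def _run(self, job_id, key):
        job = self.jobs[job_id]
        for step in range(1, self.steps + 1):
            await asyncio.sleep(0.01)      # placeholder for one slow denoising step
            job["progress"] = step / self.steps
        result = f"image-for:{key[:8]}"    # placeholder for image bytes or a URL
        self.cache[key] = result
        job.update(status="done", result=result)

    async def progress(self, job_id, poll=0.005):
        """Yield each new progress value until the job is done."""
        job, last = self.jobs[job_id], None
        while True:
            if job["progress"] != last:
                last = job["progress"]
                yield last
            if job["status"] == "done":
                return
            await asyncio.sleep(poll)


async def demo():
    service = GenerationService()
    first = service.submit("a red bicycle")
    seen = [p async for p in service.progress(first)]   # what a UI would stream
    second = service.submit("A red bicycle ")           # normalises to a cache hit
    return service.jobs[first], service.jobs[second], seen


first_job, second_job, seen = asyncio.run(demo())
print(first_job["status"], second_job["status"], seen[-1])  # done done 1.0
```

The caller gets a job id back at once and reads progress separately, so a slow generation never holds a request open.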

Key takeaways

Image work splits into understanding and generation.
A vision-language model projects an image into the LLM’s token space, so it’s a transformer you already understand: great at interpreting images, weak at counting and precise spatial detail, and prone to hallucination.
Choose a VLM, dedicated OCR, or classic computer vision by the job’s variety and volume.
Image generation uses diffusion: strong on concepts, weak on exact text and consistency, slow and costly enough to run asynchronously, and carrying real IP and safety concerns.