Embeddings

An embedding is a list of numbers — a vector — that represents the meaning of a piece of content. It is the data type that makes semantic search possible.

An embedding model maps text to a point in a high-dimensional space, arranged so that similar meanings land near each other — regardless of shared words.

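[Figure: a 2-D sketch of four queries. "I forgot my password" and "how do I reset my login" sit together in a password / login cluster; "what's the weather today" and "is it going to rain" sit together in a weather cluster. Caption: Similar meanings land close together, even with no shared words.]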

The first two queries cluster together despite sharing almost no words, and the two weather queries form their own cluster. The model learned this geometry from patterns in language during self-supervised training, not from hand-written rules.
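To make "land near each other" concrete, here is a minimal sketch that uses the four queries above. The 2-D coordinates are hand-picked for illustration and are not output from any real model; `toy_points` and `nearest_neighbor` are hypothetical names introduced for this example.

```python
import math

# Hand-picked 2-D coordinates for illustration only (an assumption, not
# real model output); a real model returns far more dimensions.
toy_points = {
    "I forgot my password": (1.0, 4.0),
    "how do I reset my login": (1.5, 4.5),
    "what's the weather today": (6.0, 1.0),
    "is it going to rain": (6.5, 1.5),
}


def nearest_neighbor(query, points):
    """Return the other sentence whose point lies closest to `query`."""
    others = (text for text in points if text != query)
    return min(others, key=lambda text: math.dist(points[query], points[text]))


for sentence in toy_points:
    print(f"{sentence!r} -> {nearest_neighbor(sentence, toy_points)!r}")
```

With these toy coordinates, each password query picks the other password query as its nearest neighbor, and the same holds for the weather pair, even though the sentences share almost no words.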

A real embedding isn’t 2-D — it has hundreds or thousands of dimensions. Each dimension is a learned axis of meaning; you can’t name them, but together they position text precisely.

Embedding models are separate from chat/generation models, and specialized for this job.

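from openai import OpenAI

# The client reads the API key from the OPENAI_API_KEY environment variable.
client = OpenAI()

# Request an embedding for a single piece of text.
resp = client.embeddings.create(
    model="text-embedding-3-small",
    input="How do I reset my password?",
)

# One result per input; each holds the vector as a list of floats.
vector = resp.data[0].embedding  # e.g. a list of 1536 floats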

When picking one, weigh:

  • Dimensions — vector length (commonly 384–3072). More can capture more nuance but costs more storage and compute. Bigger is not automatically better.
  • Max input length — how much text the model embeds at once; sets your chunk size.
  • Quality — task-relevant benchmarks (e.g. the MTEB leaderboard) beat vendor marketing.
  • API vs. self-hosted — managed APIs are simplest; open models (running locally) cut cost and keep data in-house.
  • Domain fit — general models can struggle with legal, medical, or code text; check, or use a domain-tuned model.
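The storage side of the dimensions trade-off is simple arithmetic. The sketch below assumes each value is stored as a 4-byte float and that the corpus has one million chunks; both numbers, and the helper name `index_size_bytes`, are assumptions chosen for illustration. It ignores index overhead and metadata.

```python
def index_size_bytes(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Raw size of the stored vectors, ignoring index overhead and metadata."""
    return num_vectors * dimensions * bytes_per_value


NUM_CHUNKS = 1_000_000  # hypothetical corpus size

for dims in (384, 1536, 3072):
    gib = index_size_bytes(NUM_CHUNKS, dims) / 2**30
    print(f"{dims:>4} dims: {gib:.2f} GiB")
```

Raw storage grows linearly with vector length, so going from 384 to 3072 dimensions multiplies it by eight under these assumptions.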

To find “nearby” vectors you need a distance measure. Three are common:

| Metric | Measures | Notes |
|---|---|---|
| Cosine similarity | Angle between vectors | The default for text; ignores magnitude |
| Dot product | Angle and magnitude | Fast; equals cosine if vectors are normalized |
| Euclidean (L2) | Straight-line distance | Common for image and spatial data |

For text embeddings, cosine similarity is almost always the right choice — it compares direction (meaning) and ignores length. Most embedding models are trained with it in mind. Whatever you choose, use the same metric for indexing and querying.
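The three measures can be written in a few lines of plain Python. This is a minimal sketch on small made-up vectors, not an optimized implementation; the function names are chosen for this example, and production systems would normally rely on a vector library or database instead.

```python
import math


def dot(a, b):
    """Dot product: sensitive to both direction and magnitude."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; ignores magnitude."""
    return dot(a, b) / (norm(a) * norm(b))


def euclidean_distance(a, b):
    """Straight-line (L2) distance between a and b."""
    return math.dist(a, b)


def normalize(a):
    """Scale a vector to unit length."""
    length = norm(a)
    return [x / length for x in a]


a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # same direction as a, twice the length
c = [3.0, -1.0, 0.5]  # a different direction

print("cosine(a, b):", cosine_similarity(a, b))
print("euclidean(a, b):", euclidean_distance(a, b))
print("cosine(a, c):", cosine_similarity(a, c))
print("dot of normalized a, c:", dot(normalize(a), normalize(c)))
```

Vectors `a` and `b` point the same way, so their cosine similarity is 1 (up to floating-point rounding) even though their Euclidean distance is not zero. The last two printed values agree, which is the "equals cosine if vectors are normalized" note from the table.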

Embeddings aren’t limited to text. Multimodal models (CLIP-style) embed text and images into one shared space, so a text query can retrieve relevant images. The same idea extends to audio and code. The mechanics in this section — vectors, similarity, indexing — are identical regardless of what was embedded.

  • Not human-readable — you can’t reverse a vector back into exact text.
  • Not reasoning — they capture similarity, not logic or truth.
  • Not free — generating embeddings is a model call with real cost and latency; embed at ingestion time and store the result.
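One way to act on the last point is to embed each chunk once at ingestion and reuse the stored vector afterwards. The sketch below is an assumption-laden illustration: `EmbeddingStore` and `fake_embed` are hypothetical names, the store is an in-memory dict rather than a real vector database, and `fake_embed` stands in for a real model call so the example runs offline.

```python
import hashlib


class EmbeddingStore:
    """Embeds each distinct text once and reuses the stored vector afterwards."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn  # hypothetical callable: str -> list of floats
        self._vectors = {}         # content hash -> stored vector
        self.model_calls = 0       # how many times the model was actually called

    def get(self, text):
        # Key on a hash of the content so identical text maps to one entry.
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._vectors:
            self._vectors[key] = self._embed_fn(text)
            self.model_calls += 1
        return self._vectors[key]


def fake_embed(text):
    # Stand-in for a real embedding model call (an assumption for this sketch).
    return [float(len(text)), float(text.count(" "))]


store = EmbeddingStore(fake_embed)
for chunk in ["reset your password", "check the forecast", "reset your password"]:
    store.get(chunk)

print(store.model_calls)  # 2: the repeated chunk was served from the store
```

Keying on a content hash means unchanged chunks are never re-embedded when a document is ingested again; only new or edited text triggers a model call.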

An embedding is a vector encoding the meaning of content, positioned so similar meanings are geometrically close. Embedding models are specialized; choose by dimensions, input length, quality, and domain fit — and embed queries and documents with the same model. Compare text vectors with cosine similarity. Multimodal models put text and images in one shared space.