Embedding
Vector representations of text, images, or other data in a continuous space.
What is an Embedding?
An embedding is a way to represent discrete data (like words, sentences, or images) as vectors of real numbers in a continuous space. This transformation allows us to capture semantic relationships and similarities between different pieces of data.
Key Concepts
Vector Representation
- Each item is represented as a point in high-dimensional space
- Similar items are placed closer together (see the sketch after this list)
- Dimensionality typically ranges from 128 to 1536
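As a minimal sketch of that closeness property, here are hand-made 3-dimensional vectors (invented for illustration; real models produce the much larger vectors described above):

import numpy as np

# Toy 3-dimensional "embeddings" (invented for illustration;
# real models produce vectors with hundreds of dimensions)
cat = np.array([0.9, 0.8, 0.1])
kitten = np.array([0.85, 0.75, 0.15])
car = np.array([0.1, 0.2, 0.95])

# Euclidean distance: smaller means the points are closer together
print(np.linalg.norm(cat - kitten))  # small distance: semantically similar
print(np.linalg.norm(cat - car))     # large distance: semantically different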
Properties
- Dimensionality: The size of the vector (e.g., 768 dimensions)
- Distance Metrics: Cosine similarity, Euclidean distance
- Semantic Meaning: Related concepts cluster together
Mathematical Foundation
Given a vocabulary V and embedding dimension d, an embedding is a function:
E: V → R^d
Where each word w ∈ V is mapped to a vector E(w) ∈ R^d.
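In code, E is often implemented as a lookup into a |V| × d matrix, one row per vocabulary item. A minimal sketch with a toy vocabulary and random, untrained vectors (both assumptions for illustration; trained models learn these values):

import numpy as np

vocab = ["machine", "learning", "is", "fun"]  # V, a toy vocabulary
d = 4                                         # embedding dimension
rng = np.random.default_rng(0)

# The embedding table: one d-dimensional row per word in V
E = rng.normal(size=(len(vocab), d))
word_to_index = {w: i for i, w in enumerate(vocab)}

def embed(w: str) -> np.ndarray:
    """E(w): map a word to its vector in R^d."""
    return E[word_to_index[w]]

print(embed("machine"))  # a vector in R^4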
Cosine Similarity
The similarity between two embeddings can be measured using cosine similarity:
similarity(A, B) = (A · B) / (||A|| × ||B||)
Where:
- A · B is the dot product
- ||A|| and ||B|| are the magnitudes (Euclidean norms) of the vectors
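This formula translates directly to NumPy. A minimal sketch (scikit-learn and similar libraries ship equivalent implementations):

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # (A · B) / (||A|| × ||B||)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])
print(cosine_similarity(a, b))  # 1.0: same direction, maximal similarity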
Common Use Cases
- Text Search: Find semantically similar documents (a search sketch follows this list)
- Recommendation Systems: Identify similar items
- Clustering: Group related content
- Transfer Learning: Use pre-trained embeddings
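To make the text-search case concrete, here is a brute-force sketch that ranks toy document vectors against a query vector (all vectors invented for illustration; in practice they would come from an embedding model like the one in the example below):

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy document embeddings, one row per document (invented for illustration)
doc_embeddings = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.1],
    [0.8, 0.2, 0.1],
])
query = np.array([0.85, 0.15, 0.05])

# Rank documents by similarity to the query; the highest score wins
scores = [cosine_similarity(query, d) for d in doc_embeddings]
best = int(np.argmax(scores))
print(f"Most similar document: {best} (score {scores[best]:.3f})")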
Popular Embedding Models
- Word2Vec: Word-level (static) embeddings (a training sketch follows this list)
- BERT: Contextual embeddings
- Sentence-BERT: Sentence-level embeddings
- OpenAI Ada: General-purpose embeddings
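As a sketch of the word-level case, Word2Vec embeddings can be trained with the gensim library (assuming gensim 4.x is installed; the tiny corpus below is invented for illustration and far too small for meaningful results):

from gensim.models import Word2Vec

# A tiny toy corpus of pre-tokenized sentences (invented for
# illustration; real training uses millions of sentences)
corpus = [
    ["machine", "learning", "is", "fun"],
    ["deep", "learning", "is", "interesting"],
    ["machine", "learning", "uses", "data"],
]

# Train word-level embeddings; vector_size is the dimensionality d
model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1)

vec = model.wv["learning"]  # the 50-dimensional vector for "learning"
print(model.wv.most_similar("machine", topn=2))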
Example
from sentence_transformers import SentenceTransformer

# Load a pre-trained sentence embedding model (384-dimensional output)
model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate one embedding per input sentence
sentences = ["Machine learning is fun", "AI is interesting"]
embeddings = model.encode(sentences)

print(embeddings.shape)  # (2, 384): two sentences, 384 dimensions each
# embeddings[0] is a 384-dimensional vector representing the first sentence
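Continuing the example, the two embeddings can be compared with the cosine similarity defined earlier; util.cos_sim comes from the same library:

from sentence_transformers import util

# Compare the two sentence embeddings produced above
score = util.cos_sim(embeddings[0], embeddings[1])
print(score)  # values near 1 indicate similar meaning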
Visualization
Embeddings are often visualized in 2D or 3D using dimensionality reduction techniques such as the following (a PCA sketch appears after the list):
- t-SNE (t-Distributed Stochastic Neighbor Embedding)
- UMAP (Uniform Manifold Approximation and Projection)
- PCA (Principal Component Analysis)
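A minimal PCA sketch (assuming scikit-learn and matplotlib are installed; the random matrix stands in for real embeddings such as those from the encode call above):

import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Stand-in for real embeddings: 20 points in 384 dimensions
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(20, 384))

# Project down to 2 dimensions for plotting
points_2d = PCA(n_components=2).fit_transform(embeddings)

plt.scatter(points_2d[:, 0], points_2d[:, 1])
plt.title("Embeddings projected to 2D with PCA")
plt.show()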
Key Takeaways
- Embeddings convert discrete data into continuous vectors
- Similar items have similar embedding vectors
- Foundation for many modern ML applications
- Pre-trained embeddings save time and resources