Embedding

Vector representations of text, images, or other data in a continuous space.

fundamentals beginner Jan 28, 2025

What is an Embedding?

An embedding is a way to represent discrete data (like words, sentences, or images) as vectors of real numbers in a continuous space. This transformation allows us to capture semantic relationships and similarities between different pieces of data.

Key Concepts

Vector Representation

  • Each item is represented as a point in high-dimensional space
  • Similar items are placed closer together
  • Dimensionality typically ranges from 128 to 1,536, depending on the model

Properties

  • Dimensionality: The size of the vector (e.g., 768 dimensions)
  • Distance Metrics: Cosine similarity, Euclidean distance
  • Semantic Meaning: Related concepts cluster together
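
As a concrete illustration of these properties, here is a minimal sketch that checks the dimensionality of a pair of toy embeddings and computes the Euclidean distance between them (the two 4-dimensional vectors are made up for illustration):

import numpy as np

# Two toy 4-dimensional "embeddings" (values made up for illustration)
a = np.array([0.9, 0.1, 0.3, 0.5])
b = np.array([0.8, 0.2, 0.4, 0.4])

print(a.shape[0])             # dimensionality: 4
print(np.linalg.norm(a - b))  # Euclidean distance between the two vectors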

Mathematical Foundation

Given a vocabulary V and embedding dimension d, an embedding is a function:

E: V → R^d

Where each word w ∈ V is mapped to a vector E(w) ∈ R^d.
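
Concretely, for a small vocabulary this function is just a lookup table: a |V| × d matrix with one row per word. A minimal sketch (the vocabulary and random weights are made up for illustration; trained models learn these values from data):

import numpy as np

# Toy vocabulary V and embedding dimension d
vocab = {"cat": 0, "dog": 1, "car": 2}
d = 4

# E is a |V| x d matrix; E(w) is the row for word w
rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), d))

def embed(word: str) -> np.ndarray:
    """Map a word w in V to its vector E(w) in R^d."""
    return E[vocab[word]]

print(embed("cat"))  # a 4-dimensional vector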

Cosine Similarity

The similarity between two embeddings is commonly measured with cosine similarity, which ranges from -1 (opposite directions) to 1 (same direction):

similarity(A, B) = (A · B) / (||A|| × ||B||)

Where:

  • A · B is the dot product of A and B
  • ||A|| and ||B|| are the magnitudes (Euclidean norms) of the vectors
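
The formula translates directly into code; a minimal numpy sketch (the two example vectors are made up for illustration):

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """similarity(A, B) = (A · B) / (||A|| × ||B||)"""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 1.0])))  # ≈ 0.707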

Common Use Cases

  1. Text Search: Find semantically similar documents
  2. Recommendation Systems: Identify similar items
  3. Clustering: Group related content
  4. Transfer Learning: Use pre-trained embeddings

Popular Embedding Models

  • Word2Vec: Static, word-level embeddings
  • BERT: Contextual token embeddings
  • Sentence-BERT: Sentence-level embeddings
  • OpenAI Ada: General-purpose embeddings

Example

from sentence_transformers import SentenceTransformer

# Load a pre-trained sentence embedding model (outputs 384-dimensional vectors)
model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate one embedding per sentence
sentences = ["Machine learning is fun", "AI is interesting"]
embeddings = model.encode(sentences)

print(embeddings.shape)  # (2, 384): one 384-dimensional vector per sentence
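
Building on this, a minimal sketch of semantic text search: embed a query, score it against every corpus sentence with cosine similarity, and return the best match (the corpus and query strings are made up for illustration):

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

corpus = ["Machine learning is fun", "AI is interesting", "The weather is nice today"]
corpus_embeddings = model.encode(corpus)

# Embed the query and compute cosine similarity against each corpus sentence
query_embedding = model.encode(["What is artificial intelligence?"])[0]
scores = corpus_embeddings @ query_embedding / (
    np.linalg.norm(corpus_embeddings, axis=1) * np.linalg.norm(query_embedding)
)

# Highest cosine similarity = most semantically similar sentence
best = int(np.argmax(scores))
print(corpus[best], scores[best])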

Visualization

Embeddings are often visualized in 2D or 3D using dimensionality reduction techniques like:

  • t-SNE (t-Distributed Stochastic Neighbor Embedding)
  • UMAP (Uniform Manifold Approximation and Projection)
  • PCA (Principal Component Analysis)
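
As an illustration, a minimal sketch using scikit-learn's PCA to project embeddings down to 2D for plotting (assumes scikit-learn is installed; the random array stands in for real embeddings):

import numpy as np
from sklearn.decomposition import PCA

# Stand-in for real embeddings: 10 random 384-dimensional vectors
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10, 384))

# Project to 2 dimensions; each row becomes an (x, y) point to plot
points_2d = PCA(n_components=2).fit_transform(embeddings)
print(points_2d.shape)  # (10, 2)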

Key Takeaways

  • Embeddings convert discrete data into continuous vectors
  • Similar items have similar embedding vectors
  • Foundation for many modern ML applications
  • Pre-trained embeddings save time and resources