Embedding
Vector representations of text, images, or other data in a continuous space.
What is an Embedding?
An embedding is a way to represent discrete data (like words, sentences, or images) as vectors of real numbers in a continuous space. This transformation allows us to capture semantic relationships and similarities between different pieces of data.
Key Concepts
Vector Representation
- Each item is represented as a point in high-dimensional space
- Similar items are placed closer together (see the sketch after this list)
- Dimensionality typically ranges from 128 to 1536
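As a minimal sketch of that closeness property, here are hand-made 3-dimensional vectors (invented for illustration; real models produce the much larger vectors described above):

import numpy as np

# Toy 3-dimensional "embeddings" (invented for illustration;
# real models produce vectors with hundreds of dimensions)
cat = np.array([0.9, 0.8, 0.1])
kitten = np.array([0.85, 0.75, 0.15])
car = np.array([0.1, 0.2, 0.95])

# Euclidean distance: smaller means the points are closer together
print(np.linalg.norm(cat - kitten))  # small distance: semantically similar
print(np.linalg.norm(cat - car))     # large distance: semantically different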
Properties
- Dimensionality: The size of the vector (e.g., 768 dimensions)
- Distance Metrics: Cosine similarity, Euclidean distance
- Semantic Meaning: Related concepts cluster together
Mathematical Foundation
Given a vocabulary V and embedding dimension d, an embedding is a function:
E: V → R^d
Where each word w ∈ V is mapped to a vector E(w) ∈ R^d.
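In code, E is often implemented as a lookup into a |V| × d matrix, one row per vocabulary item. A minimal sketch with a toy vocabulary and random, untrained vectors (both assumptions for illustration; trained models learn these values):

import numpy as np

vocab = ["machine", "learning", "is", "fun"]  # V, a toy vocabulary
d = 4                                         # embedding dimension
rng = np.random.default_rng(0)

# The embedding table: one d-dimensional row per word in V
E = rng.normal(size=(len(vocab), d))
word_to_index = {w: i for i, w in enumerate(vocab)}

def embed(w: str) -> np.ndarray:
    """E(w): map a word to its vector in R^d."""
    return E[word_to_index[w]]

print(embed("machine"))  # a vector in R^4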
Cosine Similarity
The similarity between two embeddings can be measured using cosine similarity:
similarity(A, B) = (A · B) / (||A|| × ||B||)
Where:
- A · B is the dot product
- ||A|| and ||B|| are the magnitudes (Euclidean norms) of the vectors
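This formula translates directly to NumPy. A minimal sketch (scikit-learn and similar libraries ship equivalent implementations):

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # (A · B) / (||A|| × ||B||)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])
print(cosine_similarity(a, b))  # 1.0: same direction, maximal similarity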
Common Use Cases
- Text Search: Find semantically similar documents (a search sketch follows this list)
- Recommendation Systems: Identify similar items
- Clustering: Group related content
- Transfer Learning: Use pre-trained embeddings
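To make the text-search case concrete, here is a brute-force sketch that ranks toy document vectors against a query vector (all vectors invented for illustration; in practice they would come from an embedding model like the one in the example below):

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy document embeddings, one row per document (invented for illustration)
doc_embeddings = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.1],
    [0.8, 0.2, 0.1],
])
query = np.array([0.85, 0.15, 0.05])

# Rank documents by similarity to the query; the highest score wins
scores = [cosine_similarity(query, d) for d in doc_embeddings]
best = int(np.argmax(scores))
print(f"Most similar document: {best} (score {scores[best]:.3f})")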
Popular Embedding Models
- Word2Vec: Word-level (static) embeddings (a training sketch follows this list)
- BERT: Contextual embeddings
- Sentence-BERT: Sentence-level embeddings
- OpenAI Ada: General-purpose embeddings
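As a sketch of the word-level case, Word2Vec embeddings can be trained with the gensim library (assuming gensim 4.x is installed; the tiny corpus below is invented for illustration and far too small for meaningful results):

from gensim.models import Word2Vec

# A tiny toy corpus of pre-tokenized sentences (invented for
# illustration; real training uses millions of sentences)
corpus = [
    ["machine", "learning", "is", "fun"],
    ["deep", "learning", "is", "interesting"],
    ["machine", "learning", "uses", "data"],
]

# Train word-level embeddings; vector_size is the dimensionality d
model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1)

vec = model.wv["learning"]  # the 50-dimensional vector for "learning"
print(model.wv.most_similar("machine", topn=2))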
Example
from sentence_transformers import SentenceTransformer

# Load a pre-trained sentence embedding model (384-dimensional output)
model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate one embedding per input sentence
sentences = ["Machine learning is fun", "AI is interesting"]
embeddings = model.encode(sentences)

print(embeddings.shape)  # (2, 384): two sentences, 384 dimensions each
# embeddings[0] is a 384-dimensional vector representing the first sentence
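Continuing the example, the two embeddings can be compared with the cosine similarity defined earlier; util.cos_sim comes from the same library:

from sentence_transformers import util

# Compare the two sentence embeddings produced above
score = util.cos_sim(embeddings[0], embeddings[1])
print(score)  # values near 1 indicate similar meaning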
Visualization
Embeddings are often visualized in 2D or 3D using dimensionality reduction techniques such as the following (a PCA sketch appears after the list):
- t-SNE (t-Distributed Stochastic Neighbor Embedding)
- UMAP (Uniform Manifold Approximation and Projection)
- PCA (Principal Component Analysis)
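A minimal PCA sketch (assuming scikit-learn and matplotlib are installed; the random matrix stands in for real embeddings such as those from the encode call above):

import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Stand-in for real embeddings: 20 points in 384 dimensions
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(20, 384))

# Project down to 2 dimensions for plotting
points_2d = PCA(n_components=2).fit_transform(embeddings)

plt.scatter(points_2d[:, 0], points_2d[:, 1])
plt.title("Embeddings projected to 2D with PCA")
plt.show()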
Key Takeaways
- Embeddings convert discrete data into continuous vectors
- Similar items have similar embedding vectors
- Foundation for many modern ML applications
- Pre-trained embeddings save time and resources