Vector Databases and Embeddings: How AI Finds Similar Content
You search for "comfortable running shoes." Traditional search looks for exact word matches. Vector search understands you want "athletic footwear with cushioning." It finds relevant results even without exact keywords. This is powered by embeddings and vector databases — the technology behind semantic search and modern AI applications.
Understanding how embeddings work helps you build smarter search, recommendation systems, and RAG applications.
What Are Embeddings?
Embeddings are numerical representations of text, images, or other data. Similar content has similar embeddings.
**Example:**
"dog" → [0.2, 0.8, 0.1, ...] (1536 numbers)
"puppy" → [0.3, 0.7, 0.2, ...] (similar numbers)
"car" → [0.9, 0.1, 0.8, ...] (different numbers)
"Dog" and "puppy" have similar embeddings because they're semantically related. "Car" has a very different embedding because it's unrelated to both.
Embeddings capture meaning, not just words. Similar meanings have similar embeddings, regardless of exact wording.
How Embeddings Are Created
Embedding models (like OpenAI's text-embedding-ada-002) convert text to vectors:
**Input:** "The quick brown fox"
**Output:** [0.123, -0.456, 0.789, ...] (1536 dimensions)
The model learned these representations from massive text datasets. Similar contexts produce similar vectors.
**Common embedding models:**
- OpenAI text-embedding-ada-002: 1536 dimensions
- Cohere embed-english-v3.0: 1024 dimensions
- Sentence-BERT: 384-768 dimensions
- Google Universal Sentence Encoder: 512 dimensions
Vector Similarity
To find similar content, calculate distance between vectors:
**Cosine similarity:** Measures the angle between vectors. Range: -1 to 1 (1 = same direction, 0 = orthogonal/unrelated, -1 = opposite)
**Euclidean distance:** Straight-line distance between points. Lower = more similar.
**Dot product:** Like cosine similarity but sensitive to vector magnitude; the two are equivalent when vectors are normalized to unit length.
Most vector databases use cosine similarity for text embeddings.
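To make the metrics concrete, here's a small NumPy sketch using the illustrative three-dimensional vectors from the example above (real embeddings have hundreds or thousands of dimensions, but the math is identical):

```python
import numpy as np

a = np.array([0.2, 0.8, 0.1])  # "dog" (illustrative values)
b = np.array([0.3, 0.7, 0.2])  # "puppy"
c = np.array([0.9, 0.1, 0.8])  # "car"

def cosine_similarity(u, v):
    # Angle-based: 1 = same direction, 0 = orthogonal, -1 = opposite
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def euclidean_distance(u, v):
    # Straight-line distance between points: lower = more similar
    return np.linalg.norm(u - v)

print(cosine_similarity(a, b))  # high: semantically related words
print(cosine_similarity(a, c))  # much lower: unrelated words
print(euclidean_distance(a, b))
print(np.dot(a, b))  # dot product: like cosine, but scales with magnitude
```

Running this shows "dog" and "puppy" scoring far closer to each other than either does to "car", regardless of which metric you pick.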
What Are Vector Databases?
Vector databases store embeddings and enable fast similarity search across millions of vectors.
**Traditional database:** "Find exact match for 'dog'"
**Vector database:** "Find top 10 most similar to [0.2, 0.8, 0.1, ...]"
Vector databases use specialized approximate nearest-neighbor indexes (HNSW graphs, IVF partitions) to search millions or billions of vectors in milliseconds, trading a small amount of recall for large gains in speed.
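At its core, similarity search is a nearest-neighbor scan. The brute-force version below (plain NumPy, not any particular database's API) is the exact operation that indexes like HNSW and IVF approximate at scale:

```python
import numpy as np

def top_k_similar(query, vectors, k=2):
    """Brute-force cosine-similarity search: O(n) work per query."""
    # Normalize rows so a plain dot product equals cosine similarity
    vecs = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    scores = vecs @ q
    idx = np.argsort(scores)[::-1][:k]  # indices of the k highest scores
    return list(zip(idx.tolist(), scores[idx].tolist()))

db = np.array([
    [0.2, 0.8, 0.1],  # "dog"
    [0.3, 0.7, 0.2],  # "puppy"
    [0.9, 0.1, 0.8],  # "car"
])
print(top_k_similar(np.array([0.25, 0.75, 0.15]), db))
```

The linear scan is fine for thousands of vectors; specialized indexes exist because this O(n) cost becomes prohibitive at millions or billions.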
Popular Vector Databases
**Pinecone:** Fully managed, easy to use, scales automatically. Good for production.
**Weaviate:** Open-source, supports hybrid search (vector + keyword). Self-hosted or cloud.
**Qdrant:** Open-source, fast, written in Rust. Good performance.
**Chroma:** Open-source, simple, designed for LLM applications. Easy to start.
**Milvus:** Open-source, highly scalable, used by large companies.
**pgvector:** PostgreSQL extension. Use existing Postgres database for vectors.
The RAG Workflow
Retrieval-Augmented Generation uses vector databases to give LLMs access to external knowledge:
**1. Index documents:**
- Split documents into chunks
- Generate embeddings for each chunk
- Store embeddings in vector database
**2. Query time:**
- User asks question
- Generate embedding for question
- Search vector database for similar chunks
- Retrieve top 3-5 most relevant chunks
- Include chunks in LLM prompt
- LLM generates answer using retrieved context
This allows LLMs to answer questions about your specific documents without fine-tuning.
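The two-phase workflow above can be sketched end to end with a toy stand-in for the embedding model (bag-of-words counts instead of a real embedding API). The structure — index once, then embed the question, retrieve, and build the prompt — is the point, not the embedding itself:

```python
from collections import Counter
import math

def embed(text):
    """Toy stand-in for an embedding model: bag-of-words counts.
    A production system would call an embedding API here instead."""
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# 1. Index: chunk documents, store (embedding, text) pairs
chunks = [
    "The return policy allows refunds within 30 days.",
    "Shipping takes 3 to 5 business days.",
    "Support is available by email and chat.",
]
index = [(embed(c), c) for c in chunks]

# 2. Query: embed the question, retrieve top chunks, build the prompt
question = "How long do refunds take under the return policy?"
q = embed(question)
ranked = sorted(index, key=lambda pair: cosine(q, pair[0]), reverse=True)
context = "\n".join(text for _, text in ranked[:2])
prompt = f"Answer using this context:\n{context}\n\nQuestion: {question}"
print(prompt)
```

The final `prompt` string is what gets sent to the LLM; the retrieved chunks ground its answer in your documents.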
Chunking Strategies
Documents must be split into chunks before embedding:
**Fixed-size chunks:** 500-1000 tokens per chunk. Simple but might split mid-sentence.
**Semantic chunks:** Split at paragraph or section boundaries. Better context preservation.
**Overlapping chunks:** Include 50-100 tokens from previous chunk. Prevents losing context at boundaries.
**Recursive splitting:** Try paragraph splits first, fall back to sentence splits if too large.
Chunk size affects retrieval quality. Too small = missing context. Too large = irrelevant information.
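A minimal sketch of fixed-size chunking with overlap, counting words as a stand-in for tokens (production pipelines count tokens with the embedding model's tokenizer):

```python
def chunk_text(words, chunk_size=100, overlap=20):
    """Fixed-size chunking with overlap, counted in words.
    Each chunk repeats the last `overlap` words of the previous one,
    so context at chunk boundaries isn't lost."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # final chunk reached; avoid a tiny trailing duplicate
    return chunks

words = [f"w{i}" for i in range(250)]
parts = chunk_text(words, chunk_size=100, overlap=20)
print(len(parts))  # 3 chunks covering words 0-99, 80-179, 160-249
```

Swapping the splitting logic (paragraph boundaries first, then sentences) turns this same skeleton into recursive splitting.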
Metadata Filtering
Vector databases support filtering by metadata before similarity search:
**Example:** "Find similar documents where author='John' AND date > '2023-01-01'"
This combines semantic search with traditional filtering for more precise results.
**Common metadata:**
- Document ID, title, author
- Date, category, tags
- Source, language
- Access permissions
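The filter-then-rank pattern can be sketched over an in-memory list (the records and field names below are hypothetical; real databases express the same thing with their own filter syntax):

```python
from datetime import date

# Hypothetical in-memory records: (embedding, metadata) pairs
records = [
    ([0.2, 0.8], {"author": "John", "date": date(2023, 5, 1), "title": "A"}),
    ([0.3, 0.7], {"author": "Jane", "date": date(2023, 6, 1), "title": "B"}),
    ([0.9, 0.1], {"author": "John", "date": date(2022, 1, 1), "title": "C"}),
]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def filtered_search(query, predicate, k=5):
    # Apply the metadata filter first, then rank survivors by similarity
    candidates = [(emb, meta) for emb, meta in records if predicate(meta)]
    candidates.sort(key=lambda pair: dot(query, pair[0]), reverse=True)
    return [meta["title"] for _, meta in candidates[:k]]

# The example query: author='John' AND date > '2023-01-01'
hits = filtered_search(
    [0.25, 0.75],
    lambda m: m["author"] == "John" and m["date"] > date(2023, 1, 1),
)
print(hits)  # only "A" passes both filter conditions
```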
Hybrid Search
Combine vector search with keyword search for best results:
**Vector search:** Good for semantic similarity ("comfortable shoes" finds "cushioned sneakers")
**Keyword search:** Good for exact matches (product codes, names, technical terms)
**Hybrid:** Use both, combine scores. Weaviate and some others support this natively.
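One simple way to combine the two result sets is a weighted sum after normalizing each score scale; a sketch (production systems often use BM25 for the keyword side and fusion methods such as reciprocal rank fusion instead of min-max weighting):

```python
def hybrid_scores(vector_scores, keyword_scores, alpha=0.5):
    """Weighted combination of two score dicts (doc_id -> score).
    Scores are min-max normalized so the two scales are comparable;
    alpha controls the semantic vs. keyword balance."""
    def normalize(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {d: (s - lo) / span for d, s in scores.items()}
    v, k = normalize(vector_scores), normalize(keyword_scores)
    docs = set(v) | set(k)
    return {d: alpha * v.get(d, 0.0) + (1 - alpha) * k.get(d, 0.0)
            for d in docs}

combined = hybrid_scores(
    {"doc1": 0.9, "doc2": 0.6, "doc3": 0.2},  # semantic similarity
    {"doc1": 1.0, "doc2": 5.0, "doc3": 0.0},  # keyword match strength
)
best = max(combined, key=combined.get)
print(best)  # the keyword-heavy doc2 wins at alpha=0.5
```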
Cost Considerations
**Embedding generation:**
- OpenAI ada-002: $0.0001 per 1K tokens (prices change; check the provider's current rates)
- 1 million documents (500 tokens each) = $50
**Vector storage:**
- Pinecone: $0.096 per million vectors per month
- Self-hosted: Storage + compute costs
**Query costs:**
- Pinecone: $0.10 per 1K queries
- Self-hosted: Compute costs only
For high-volume applications, self-hosted can be more cost-effective.
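The $50 figure above is simple arithmetic, worth having as a reusable estimate (using the ada-002 rate quoted; substitute your provider's current pricing):

```python
# Back-of-the-envelope embedding cost for an indexing run
docs = 1_000_000
tokens_per_doc = 500
price_per_1k_tokens = 0.0001  # quoted ada-002 rate; verify current pricing

total_tokens = docs * tokens_per_doc
cost = total_tokens / 1000 * price_per_1k_tokens
print(f"${cost:.2f}")  # $50.00 for the one-time indexing pass
```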
Performance Optimization
**Index selection:** HNSW for speed, IVF for memory efficiency
**Dimension reduction:** Use PCA to reduce 1536 dimensions to 384. Faster search, slight accuracy loss.
**Quantization:** Store vectors in lower precision (int8 instead of float32). 4x less storage, minimal accuracy loss.
**Caching:** Cache frequent queries to avoid repeated vector searches.
**Batch operations:** Insert/query multiple vectors at once for better throughput.
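A minimal sketch of the int8 quantization idea with NumPy, storing one scale factor per vector (real libraries such as FAISS offer more sophisticated schemes, including product quantization):

```python
import numpy as np

def quantize_int8(vectors):
    """Symmetric scalar quantization: float32 -> int8, 4x less storage.
    Each vector keeps one float scale so it can be dequantized later."""
    scales = np.abs(vectors).max(axis=1, keepdims=True) / 127.0
    q = np.round(vectors / scales).astype(np.int8)
    return q, scales

def dequantize(q, scales):
    return q.astype(np.float32) * scales

vecs = np.random.default_rng(0).normal(size=(4, 8)).astype(np.float32)
q, scales = quantize_int8(vecs)
restored = dequantize(q, scales)
print(np.max(np.abs(vecs - restored)))  # small reconstruction error
print(q.nbytes, vecs.nbytes)  # int8 storage is 4x smaller than float32
```

The reconstruction error is bounded by half a quantization step per element, which is why accuracy loss in practice is minimal.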
Common Use Cases
**Semantic search:** Find documents by meaning, not just keywords
**RAG applications:** Give LLMs access to your documents
**Recommendation systems:** Find similar products, articles, or content
**Duplicate detection:** Find near-duplicate content
**Clustering:** Group similar items together
**Anomaly detection:** Find outliers that don't match patterns
Limitations
**Cold start:** Need existing data to find similar items
**Embedding quality:** Results depend on embedding model quality
**Exact match issues:** Might miss exact keyword matches if relying only on vectors
**Cost at scale:** Storing billions of vectors can be expensive
**Explainability:** Hard to explain why two items are similar
Getting Started
**1. Choose embedding model:** OpenAI's text-embedding-ada-002 is a good default
**2. Choose vector database:** Pinecone for managed, Chroma for local development
**3. Prepare data:** Split documents into chunks
**4. Generate embeddings:** Use embedding API
**5. Store in database:** Insert vectors with metadata
**6. Query:** Generate query embedding, search for similar vectors
**7. Iterate:** Adjust chunk size, metadata, and retrieval count based on results
Building vector search applications? The vector search builder helps you set up embeddings and vector databases quickly.