The Story That Explains Vector Databases
Imagine walking into a library and saying: "I want something that feels like the sadness of exile, the beauty of mathematics, and the weight of unspoken love." The librarian doesn't search titles. She converts your description into a fingerprint and finds the twelve books whose fingerprints are closest in space to yours.
That is exactly what a vector database does, and it is the engine behind most modern LLM applications that need memory, context, or semantic search.
Traditional databases answer questions like "Give me all rows where price < 100." Vector databases answer a fundamentally different kind of question: "Give me the things most similar in meaning to this." That distinction is everything when building AI applications.
LLMs have a fixed context window: they can only "see" a limited amount of text at once. Vector databases act as the LLM's long-term memory: store millions of document chunks as vectors and, at query time, retrieve only the most relevant ones to inject into the context window. This pattern is called RAG (Retrieval-Augmented Generation), and it is one of the most widely used architecture patterns in production LLM applications today.
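The "retrieve, then inject" step can be sketched in a few lines. This is a minimal illustration, not a production implementation: the function name `build_rag_prompt`, the prompt wording, and the character budget (a crude stand-in for a real token count) are all assumptions made for the example.

```python
def build_rag_prompt(question: str, ranked_chunks: list[str],
                     max_context_chars: int = 2000) -> str:
    """Pack the highest-ranked chunks into a prompt until the budget is spent."""
    selected: list[str] = []
    used = 0
    for chunk in ranked_chunks:  # assumed to be sorted, most relevant first
        if used + len(chunk) > max_context_chars:
            break  # the next chunk would overflow the context budget
        selected.append(chunk)
        used += len(chunk)
    context = "\n---\n".join(selected)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Chunks as a vector database might return them, best match first.
chunks = [
    "Pinecone is a fully managed serverless vector database.",
    "Vector databases store embeddings for semantic search.",
    "RAG improves LLM accuracy by grounding responses in retrieved context.",
]
prompt = build_rag_prompt("What is Pinecone?", chunks, max_context_chars=120)
print(prompt)
```

With a 120-character budget only the first two chunks fit, which is the point of the pattern: the model sees a small, relevant slice of the corpus rather than all of it.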
What Is a Vector (Embedding)?
Before comparing databases, you need to understand what they actually store. An embedding is a list of floating-point numbers produced by an embedding model. These numbers encode semantic meaning — the position of the text in a high-dimensional mathematical space.
The core operation a vector database must perform efficiently is called Approximate Nearest Neighbour (ANN) search: given a query vector, find the k stored vectors that are closest to it (measured by cosine similarity or Euclidean distance) across millions or billions of stored vectors, in milliseconds. "Approximate" means the index may occasionally miss a true nearest neighbour in exchange for that speed.
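To make the operation concrete, here is the exact (brute-force) version that ANN indexes approximate. It is a sketch for intuition only: the three-dimensional vectors and their labels are invented for the example, and a real database would use an index such as HNSW rather than scanning every vector.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: list[float], vectors: dict[str, list[float]], k: int = 2):
    """Exact nearest-neighbour search: score every stored vector, keep the best k."""
    scored = [(cosine_similarity(query, vec), key) for key, vec in vectors.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(key, score) for score, key in scored[:k]]


# Toy "embeddings" (hypothetical values, 3 dimensions instead of ~1,536).
store = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "invoice": [0.0, 0.1, 0.9],
}
nearest = top_k([1.0, 0.0, 0.0], store, k=2)
for key, score in nearest:
    print(f"{key}: {score:.4f}")
```

This scan costs one similarity computation per stored vector, which is why it stops being practical at millions of vectors and why ANN indexes exist.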
The Vector Database Landscape — Visual Overview
Each quadrant represents a different deployment philosophy. The right choice depends on your scale, stack, and ops budget — not on marketing.
🌲 Pinecone — The Fully Managed Powerhouse
With a managed object store such as Amazon S3 you never think about disks; you just call put_object and get_object. Pinecone is the same philosophy for vectors: a fully managed, serverless API that handles indexing, sharding, replication, and ANN search so you never touch infrastructure. You pay per query and per stored vector. You never SSH into anything.
Core Architecture
Pinecone uses a proprietary ANN index (based on HNSW and product quantisation) distributed across pods. Its serverless tier scales to zero when idle and spins up in milliseconds. Indexes are namespace-partitioned, allowing multi-tenancy within a single index.
Python: Full RAG Pipeline with Pinecone
# pip install pinecone openai   (older releases shipped as pinecone-client)
import os

from pinecone import Pinecone, ServerlessSpec
from openai import OpenAI

# ── 1. Initialise clients ──────────────────────────────────
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
# Named openai_client so it does not shadow the `openai` module name
openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# ── 2. Create or connect to an index ──────────────────────
INDEX_NAME = "rag-docs"
if INDEX_NAME not in [i.name for i in pc.list_indexes()]:
    pc.create_index(
        name=INDEX_NAME,
        dimension=1536,  # text-embedding-3-small output size
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )
index = pc.Index(INDEX_NAME)

# ── 3. Embed and upsert documents ─────────────────────────
documents = [
    {"id": "doc-1", "text": "Vector databases store embeddings for semantic search.",
     "metadata": {"source": "chapter_1", "topic": "vector-db"}},
    {"id": "doc-2", "text": "Pinecone is a fully managed serverless vector database.",
     "metadata": {"source": "chapter_2", "topic": "pinecone"}},
    {"id": "doc-3", "text": "RAG improves LLM accuracy by grounding responses in retrieved context.",
     "metadata": {"source": "chapter_3", "topic": "rag"}},
]

def embed(text: str) -> list[float]:
    """Return the embedding for one piece of text."""
    resp = openai_client.embeddings.create(input=text, model="text-embedding-3-small")
    return resp.data[0].embedding

# Each vector is an (id, values, metadata) tuple
vectors = [
    (f'ns1#{doc["id"]}', embed(doc["text"]), doc["metadata"])
    for doc in documents
]
index.upsert(vectors=vectors, namespace="chapter-ns")

# ── 4. Semantic search with metadata filter ────────────────
# Freshly upserted vectors can take a moment to become queryable.
query = "How does Pinecone scale for large datasets?"
q_vec = embed(query)
results = index.query(
    vector=q_vec,
    top_k=3,
    namespace="chapter-ns",
    filter={"topic": {"$in": ["pinecone", "vector-db"]}},
    include_metadata=True,
)
for match in results.matches:
    print(f"Score: {match.score:.4f} | Source: {match.metadata['source']}")
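Pinecone supports hybrid search by accepting both a dense vector (vector) and a sparse vector (sparse_vector at query time, sparse_values when upserting) in one query. This combines semantic similarity with exact keyword matching, which is critical for product catalogues and legal document retrieval where both matter. The blend is usually weighted with an alpha between 0 (pure keyword) and 1 (pure semantic), applied on the client by scaling the dense values by alpha and the sparse values by 1 - alpha before querying; alpha is not itself a query parameter. Note that sparse-dense indexes must use the dotproduct metric, not the cosine metric used in the example above.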
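One common way to apply such an alpha weight is on the client, scaling the two vectors before they are sent. The helper below is a sketch of that convention; the name `hybrid_scale` and the sample values are assumptions for illustration, and the sparse vector is represented as a dict with `indices` and `values` keys.

```python
def hybrid_scale(dense: list[float], sparse: dict, alpha: float):
    """Weight a dense/sparse pair: alpha=1 is pure semantic, alpha=0 is pure keyword."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    scaled_dense = [v * alpha for v in dense]
    scaled_sparse = {
        "indices": list(sparse["indices"]),  # token positions are unchanged
        "values": [v * (1.0 - alpha) for v in sparse["values"]],
    }
    return scaled_dense, scaled_sparse


# Hypothetical query vectors: 2 dense dimensions, 2 non-zero sparse terms.
dense, sparse = hybrid_scale(
    [0.2, 0.4],
    {"indices": [10, 45], "values": [0.5, 1.0]},
    alpha=0.75,
)
print(dense)
print(sparse)
```

The scaled pair would then be passed to the query call as the dense and sparse arguments. Because the weighting happens before the request, changing alpha needs no re-indexing.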
🔮 Weaviate — The Knowledge Graph + Vector Hybrid
Key Architecture: HNSW + Inverted Index
Weaviate uses HNSW (Hierarchical Navigable Small World) for ANN search, paired with a traditional inverted index for BM25 keyword search. The combination enables hybrid search — a weighted mix of semantic and keyword relevance — out of the box without extra infrastructure.
The blend is tuned with a single alpha parameter. No need for a separate Elasticsearch cluster alongside your vector DB.
Python: Define Schema, Upsert and Hybrid-Search
# pip install weaviate-client
import os

import weaviate
import weaviate.classes as wvc

# ── 1. Connect (local Docker or Weaviate Cloud) ────────────
# The text2vec-openai vectorizer needs an OpenAI key passed as a header.
# For Weaviate Cloud use connect_to_weaviate_cloud() (connect_to_wcs() in older v4 clients).
client = weaviate.connect_to_local(
    headers={"X-OpenAI-Api-Key": os.environ["OPENAI_API_KEY"]}
)

try:
    # ── 2. Create a collection (schema class) ─────────────────
    if not client.collections.exists("Article"):
        client.collections.create(
            name="Article",
            vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_openai(),
            generative_config=wvc.config.Configure.Generative.openai(),
            properties=[
                wvc.config.Property(name="title", data_type=wvc.config.DataType.TEXT),
                wvc.config.Property(name="content", data_type=wvc.config.DataType.TEXT),
                wvc.config.Property(name="year", data_type=wvc.config.DataType.INT),
            ],
        )

    # ── 3. Insert documents ───────────────────────────────────
    articles = client.collections.get("Article")
    with articles.batch.dynamic() as batch:
        batch.add_object(properties={"title": "Introduction to Vector Search",
                                     "content": "Embeddings capture semantic meaning...",
                                     "year": 2024})
        batch.add_object(properties={"title": "Weaviate in Production",
                                     "content": "Weaviate uses HNSW with BM25 hybrid...",
                                     "year": 2023})

    # ── 4. Hybrid search (BM25 + semantic, alpha=0.7) ─────────
    results = articles.query.hybrid(
        query="vector database hybrid search",
        alpha=0.7,  # 0 = pure BM25, 1 = pure vector
        filters=wvc.query.Filter.by_property("year").greater_than(2022),
        limit=5,
        return_metadata=wvc.query.MetadataQuery(score=True, explain_score=True),
    )
    for obj in results.objects:
        print(f"Title: {obj.properties['title']}")
        print(f"Score: {obj.metadata.score:.4f}")
        print(f"Explain: {obj.metadata.explain_score[:80]}")
finally:
    client.close()  # always release the connection, even if a step fails
Weaviate's HNSW index is fully loaded into RAM at startup.
At 1,536 dimensions with float32, each vector takes ~6 KB. One million vectors =
~6 GB RAM. Plan memory carefully for large indexes. Use the
pq (product quantisation) compression option to cut RAM usage by 4–32× at a
small recall cost when memory is constrained.
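The arithmetic behind those figures is easy to reproduce for your own corpus. The sketch below counts raw vector storage only; it deliberately ignores the HNSW graph links and any metadata, so treat the result as a lower bound. The function names are made up for the example.

```python
def raw_vector_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Bytes needed for the vectors alone (float32 = 4 bytes per dimension)."""
    return num_vectors * dims * bytes_per_value


def to_gb(num_bytes: float) -> float:
    """Convert bytes to decimal gigabytes."""
    return num_bytes / 1e9


per_vector = raw_vector_bytes(1, 1536)
full = raw_vector_bytes(1_000_000, 1536)
print(f"One vector:        {per_vector} bytes")
print(f"1M vectors, raw:   {to_gb(full):.2f} GB")

# Effect of the 4x and 32x compression ratios quoted above.
for ratio in (4, 32):
    print(f"1M vectors, {ratio:>2}x PQ: {to_gb(full / ratio):.3f} GB")
```

Swapping in your own vector count and dimension gives a quick first estimate before sizing a node.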
🎨 Chroma — The Developer's Best Friend for Prototyping
Two Modes: Ephemeral and Persistent
Chroma operates in two modes. In-memory (ephemeral): data lives in RAM and is gone when Python exits, which is perfect for tests. Persistent: data is written to a directory on disk, using SQLite plus on-disk HNSW index files under the hood (releases before 0.4 used DuckDB + Parquet). Both modes expose the same Python API, making it trivial to switch.
Chroma also offers a client/server mode: run chroma run to start an HTTP server, then connect multiple processes or microservices to the same Chroma instance.
Python: Local RAG with Chroma
# pip install chromadb sentence-transformers
import chromadb
from chromadb.utils import embedding_functions

# ── 1. Create a persistent client ─────────────────────────
client = chromadb.PersistentClient(path="./my_chroma_db")

# ── 2. Plug in a local embedding model (no API key!) ──────
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"  # 80 MB model, 384-dim
)

# ── 3. Create or get a collection ─────────────────────────
collection = client.get_or_create_collection(
    name="my_rag_docs",
    embedding_function=ef,
    metadata={"hnsw:space": "cosine"},
)

# ── 4. Add documents (Chroma embeds them automatically) ───
# upsert rather than add, so re-running the script does not hit duplicate IDs
collection.upsert(
    documents=[
        "Chroma is an open-source embedding database for AI applications.",
        "Sentence transformers create dense vector representations of text.",
        "RAG architecture retrieves context before generating an answer.",
        "pgvector extends PostgreSQL with native vector column support.",
    ],
    ids=["id1", "id2", "id3", "id4"],
    metadatas=[
        {"category": "chroma"},
        {"category": "embeddings"},
        {"category": "rag"},
        {"category": "pgvector"},
    ],
)

# ── 5. Query ───────────────────────────────────────────────
results = collection.query(
    query_texts=["Which database works well for RAG prototyping?"],
    n_results=2,
    where={"category": {"$in": ["chroma", "rag"]}},
    include=["documents", "distances", "metadatas"],
)

# Results are nested per query; index [0] selects our single query
for doc, dist, meta in zip(
    results["documents"][0],
    results["distances"][0],
    results["metadatas"][0],
):
    print(f"[{meta['category']:10s}] dist={dist:.4f} → {doc[:60]}")
Both LangChain and LlamaIndex ship first-class Chroma integrations. Chroma.from_documents(docs, embedding) in LangChain wraps the entire embed-and-upsert flow in one line. This makes Chroma one of the shortest paths from raw documents to a working RAG pipeline, with no server to provision and no API key to request.
🐘 pgvector — Vectors Inside Your Existing PostgreSQL
Two Index Types: IVFFlat vs HNSW
| Property | IVFFlat | HNSW |
|---|---|---|
| Build time | Fast | Slower |
| RAM usage | Low | Higher |
| Recall | ~90–95% | ~98–99% |
| Updates | Requires rebuild | Online (no rebuild) |
| Best for | Static datasets | Dynamic datasets |
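IVFFlat's recall depends heavily on two knobs: `lists` (how many clusters the index builds) and `ivfflat.probes` (how many clusters each query scans). The sketch below encodes the starting points suggested in the pgvector README (rows / 1000 lists up to 1M rows, sqrt(rows) above that, and probes of about sqrt(lists)) and prints the corresponding SQL. The helper names are hypothetical, and the `articles` table and `embedding` column are taken from this article's example schema; tune both values against your own recall measurements.

```python
import math


def ivfflat_starting_params(row_count: int) -> tuple[int, int]:
    """Return (lists, probes) starting points for an IVFFlat index."""
    if row_count <= 1_000_000:
        lists = max(1, row_count // 1000)
    else:
        lists = math.isqrt(row_count)
    probes = max(1, math.isqrt(lists))  # more probes = better recall, slower queries
    return lists, probes


def ivfflat_ddl(table: str, column: str, lists: int) -> str:
    """Build the CREATE INDEX statement for a cosine-distance IVFFlat index."""
    return (
        f"CREATE INDEX ON {table} USING ivfflat "
        f"({column} vector_cosine_ops) WITH (lists = {lists});"
    )


lists, probes = ivfflat_starting_params(500_000)
print(ivfflat_ddl("articles", "embedding", lists))
print(f"SET ivfflat.probes = {probes};")
```

Build the index after the table is populated: IVFFlat derives its clusters from the rows present at build time, which is also why the table above marks it as best for static datasets.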
Python + psycopg2: Full Setup and Search
# pip install psycopg2-binary pgvector openai numpy
import os

import numpy as np
import psycopg2
from openai import OpenAI
from pgvector.psycopg2 import register_vector

conn = psycopg2.connect(os.environ["DATABASE_URL"])
cur = conn.cursor()
client = OpenAI()  # reads OPENAI_API_KEY from the environment

# ── 1. Enable extension and create table ──────────────────
cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
conn.commit()
# Register the type adapter only after the extension exists;
# otherwise the vector type cannot be found in the database.
register_vector(conn)

cur.execute("""
    CREATE TABLE IF NOT EXISTS articles (
        id SERIAL PRIMARY KEY,
        title TEXT NOT NULL,
        content TEXT NOT NULL,
        category TEXT,
        published DATE,
        embedding VECTOR(1536)  -- OpenAI text-embedding-3-small
    );
""")
conn.commit()

# ── 2. Create HNSW index ──────────────────────────────────
cur.execute("""
    CREATE INDEX IF NOT EXISTS articles_hnsw_idx
    ON articles USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);
""")
conn.commit()

# ── 3. Insert with embedding ──────────────────────────────
def embed(text: str) -> np.ndarray:
    """Return the embedding as a NumPy array, which register_vector can adapt."""
    resp = client.embeddings.create(input=text, model="text-embedding-3-small")
    return np.array(resp.data[0].embedding)

rows = [
    ("Vector Databases 101", "Embeddings enable semantic search at scale.", "database"),
    ("pgvector Performance", "HNSW gives 99% recall with fast query time.", "database"),
    ("Building a RAG System", "Retrieval augmented generation needs a vector store.", "rag"),
]
for title, content, category in rows:
    emb = embed(content)
    cur.execute(
        "INSERT INTO articles (title, content, category, embedding) VALUES (%s, %s, %s, %s)",
        (title, content, category, emb),
    )
conn.commit()

# ── 4. Nearest-neighbour search — pure SQL! ───────────────
query_vec = embed("How do I add vector search to my Postgres database?")
cur.execute("""
    SELECT title, category,
           1 - (embedding <=> %s::vector) AS cosine_similarity
    FROM articles
    WHERE category = 'database'
    ORDER BY embedding <=> %s::vector
    LIMIT 3;
""", (query_vec, query_vec))

print(f"{'Title':<30} {'Category':<10} {'Similarity'}")
print("-" * 55)
for row in cur.fetchall():
    print(f"{row[0]:<30} {row[1]:<10} {row[2]:.4f}")

cur.close()
conn.close()
pgvector's three core distance operators are <-> (Euclidean distance), <=> (cosine distance; use 1 - (a <=> b) for similarity), and <#> (negative inner product). Use <=> for text embeddings. If you use inner product instead, normalise your embeddings to unit length first: unnormalised inner product does not equal cosine similarity.
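The relationship between the three operators is easier to see with numbers. The sketch below reimplements each metric in plain Python (this is the underlying maths, not pgvector's code) on two made-up 2-dimensional vectors, and checks the normalisation claim directly.

```python
import math


def inner_product(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_distance(a: list[float], b: list[float]) -> float:
    """What <-> computes."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a: list[float], b: list[float]) -> float:
    """What <=> computes: 1 minus cosine similarity."""
    norms = math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    return 1.0 - inner_product(a, b) / norms


def normalise(v: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    length = math.sqrt(inner_product(v, v))
    return [x / length for x in v]


a, b = [3.0, 4.0], [4.0, 3.0]  # illustrative vectors, both of length 5
print(f"<->  Euclidean distance:     {l2_distance(a, b):.4f}")
print(f"<=>  cosine distance:        {cosine_distance(a, b):.4f}")
print(f"<#>  negative inner product: {-inner_product(a, b):.4f}")

# Raw inner product is 24, far from the cosine similarity of 0.96 ...
print(f"raw inner product:           {inner_product(a, b):.4f}")
# ... but after normalising both vectors, the two agree.
print(f"normalised inner product:    {inner_product(normalise(a), normalise(b)):.4f}")
```

Once vectors are unit length, ordering by <#> and ordering by <=> return the same neighbours, which is why normalising up front lets you pick the inner-product operator safely.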
Master Comparison: Pinecone vs Weaviate vs Chroma vs pgvector
| Dimension | 🌲 Pinecone | 🔮 Weaviate | 🎨 Chroma | 🐘 pgvector |
|---|---|---|---|---|
| Deployment | Fully managed SaaS | Self-host or WCS cloud | Embedded / server | Your Postgres server |
| Open Source | ❌ Proprietary | ✅ BSD-3 | ✅ Apache-2.0 | ✅ PostgreSQL Licence |
| Scale (vectors) | Billions+ | Hundreds of millions | ~1–10 million | ~10–100 million |
| ANN Algorithm | Proprietary (HNSW/PQ) | HNSW | HNSW (hnswlib) | HNSW or IVFFlat |
| Hybrid Search | Sparse-dense vectors | ✅ BM25 + vector native | ❌ Vector-only | Manual (+ pg_trgm) |
| Metadata Filtering | ✅ Rich filter syntax | ✅ GraphQL filters | Basic where-clause | ✅ Full SQL WHERE |
| Multi-modal | ❌ Text/vector only | ✅ Text, image, audio | ❌ Text focus | ❌ Custom only |
| Setup time | 2 minutes (API key) | 15 min (Docker) | 30 seconds (pip) | Depends on Postgres |
| Cost model | Per query + storage | Infrastructure cost | Free / OSS | Postgres hosting cost |
| Joins with app data | ❌ Separate system | Cross-refs only | ❌ Separate system | ✅ Native SQL JOINs |
| Best use case | Production RAG, high scale | Knowledge graphs, multi-modal | Local dev, prototyping | Existing Postgres stacks |
Which Vector Database Should You Choose? — Decision Flow
This flow covers most common use cases. In ambiguous cases, the tiebreaker is almost always the trade-off between team ops capacity and managed-service cost.
Performance Benchmarks — What the Numbers Say
Benchmarking vector databases is genuinely hard — results depend on hardware, dataset size, index parameters, query concurrency, and recall tolerance. The table below reflects community benchmarks on 1M vectors at 1,536 dimensions (representative of OpenAI embeddings), single-node, non-filtered queries.
| Database | QPS (queries/sec) | p99 Latency | Recall@10 | Index Build (1M) | RAM (1M vecs) |
|---|---|---|---|---|---|
| 🌲 Pinecone | ~1,500+ | <20 ms | ~99% | Managed | Managed |
| 🔮 Weaviate HNSW | ~800–1,200 | 20–50 ms | ~97–99% | ~25 min | ~8 GB |
| 🎨 Chroma (hnswlib) | ~200–500 | 40–80 ms | ~97% | ~8 min | ~6 GB |
| 🐘 pgvector HNSW | ~300–700 | 30–60 ms | ~98% | ~30 min | ~10 GB |
| 🐘 pgvector IVFFlat | ~500–900 | <30 ms | ~90–95% | ~5 min | ~4 GB |
All these numbers degrade significantly with metadata filtering
— when you add a WHERE category = 'X' condition, you reduce the ANN search
space and some indexes (especially IVFFlat) lose recall rapidly. Pinecone
and Weaviate handle filtered ANN significantly better than pgvector IVFFlat.
HNSW-based indexes are more robust to filtering but use more RAM.
Always benchmark on your own data shape and filter pattern.
Full RAG Architecture — The Complete Picture
Source pages are chunked, embedded with text-embedding-3-small, and upserted into the vector database. This happens once, then incrementally as pages change. The vector DB now holds ~500,000 embedding vectors.
Golden Rules — Vector Database Production Checklist
If you index with text-embedding-3-small, you must query with the same model. Different models produce incompatible vector spaces, so querying with a different model returns garbage results without any error message.
Keep the source text next to its vector, storing {"text": "...", "embedding": [...]} together.
For pgvector HNSW indexes, run SET LOCAL hnsw.ef_search = 100 at query time. The default ef_search = 40 trades recall for speed. For sub-100M vectors, ef_search = 100 typically gives near-perfect recall with latency still under 50 ms, but verify this on your own data.
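A per-query `ef_search` override might look like the sketch below. It assumes a DB-API cursor (for example from psycopg2) that is inside an open transaction, since SET LOCAL only lasts until the transaction ends, and it reuses the `articles` table from the pgvector example. The `hnsw_search` helper and the `RecordingCursor` stand-in are hypothetical; the stand-in exists only so the example runs without a database.

```python
def hnsw_search(cur, query_vec: list[float], k: int = 5, ef_search: int = 100):
    """Raise hnsw.ef_search for this transaction, then run a cosine-distance search."""
    ef = int(ef_search)  # force an integer so the value is safe to inline in SQL
    if ef < 1:
        raise ValueError("ef_search must be a positive integer")
    # Text form of a pgvector literal, e.g. '[0.1,0.2]'
    vec_literal = "[" + ",".join(str(float(x)) for x in query_vec) + "]"
    cur.execute(f"SET LOCAL hnsw.ef_search = {ef}")
    cur.execute(
        "SELECT title, 1 - (embedding <=> %s::vector) AS cosine_similarity "
        "FROM articles ORDER BY embedding <=> %s::vector LIMIT %s",
        (vec_literal, vec_literal, k),
    )
    return cur.fetchall()


class RecordingCursor:
    """Stand-in cursor that records statements instead of talking to Postgres."""

    def __init__(self):
        self.statements = []

    def execute(self, sql, params=None):
        self.statements.append((sql, params))

    def fetchall(self):
        return []


cur = RecordingCursor()
hnsw_search(cur, [0.1, 0.2], k=3, ef_search=100)
for sql, params in cur.statements:
    print(sql, params)
```

With a real connection, pass `conn.cursor()` instead of the stand-in and commit or roll back afterwards; the higher `ef_search` then applies only to that transaction and leaves other sessions on the default.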