Vector Databases

Section 01

The Story That Explains Vector Databases

📖 Real World Analogy

The World's Strangest Library

Imagine a library where no book has a title. No catalogue. No ISBN. No shelf numbers. Instead, each book has a fingerprint — a list of 1,536 numbers that mathematically describe its meaning: how philosophical it is, how technical, how romantic, how violent. Books are stored by similarity of fingerprint, not by alphabet.

You walk in and say: "I want something that feels like the sadness of exile, the beauty of mathematics, and the weight of unspoken love." The librarian doesn't search titles — she computes your description into a fingerprint and finds the twelve books whose fingerprints are closest in space to yours.

That is exactly what a vector database does. And it is the engine behind most modern LLM applications that need memory, context, or semantic search.

Traditional databases answer questions like "Give me all rows where price < 100." Vector databases answer a fundamentally different kind of question: "Give me the things most similar in meaning to this." That distinction is everything when building AI applications.

🧠

Why LLMs Need Vector Databases

LLMs have a fixed context window — they can only "see" a limited amount of text at once. Vector databases act as the LLM's long-term memory: store millions of document chunks as vectors, and at query time, retrieve only the most relevant ones to inject into the context window. This pattern is called RAG (Retrieval-Augmented Generation) and it is the most important architecture pattern in production LLM applications today.

Section 02

What Is a Vector (Embedding)?

Before comparing databases, you need to understand what they actually store. An embedding is a list of floating-point numbers produced by an embedding model. These numbers encode semantic meaning — the position of the text in a high-dimensional mathematical space.

🔢 From Text → Vector: What Happens Under the Hood

Input

Raw text: "The cat sat on the mat."

Model

Embedding model (e.g. text-embedding-3-small, all-MiniLM-L6-v2) processes the text through transformer layers

Output

A vector of 1,536 floats (for text-embedding-3-small; all-MiniLM-L6-v2 produces 384): [0.023, -0.814, 0.441, ..., 0.009]

Magic

Semantically similar sentences produce vectors that are close together in 1,536-dimensional space

The core operation a vector database must perform efficiently is called Approximate Nearest Neighbour (ANN) search: given a query vector, find the k stored vectors that are closest to it, or very nearly so, as measured by cosine similarity or Euclidean distance, across millions or billions of stored vectors, in milliseconds. The "approximate" part is the trade: giving up a little recall buys lookups that are far faster than scanning every vector.
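To make the search problem concrete, here is a minimal brute-force sketch of exact k-nearest-neighbour search in plain Python. It scans every stored vector, which is the baseline that ANN indexes approximate. The toy 3-dimensional vectors and the function name exact_knn are illustrative assumptions, not any database's API.

```python
import heapq
import math

def exact_knn(query, stored, k):
    """Return the k (distance, id) pairs closest to `query` by Euclidean distance.

    This is an exhaustive scan: cost grows linearly with the number of vectors.
    """
    scored = ((math.dist(query, vec), doc_id) for doc_id, vec in stored.items())
    return heapq.nsmallest(k, scored)

# Hypothetical toy "embeddings" (real ones have hundreds of dimensions).
stored = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}

top = exact_knn([1.0, 0.0, 0.0], stored, k=2)
print([doc_id for _, doc_id in top])  # ['cat', 'kitten']
```

An ANN index such as HNSW aims to return the same neighbours most of the time while visiting only a small fraction of the stored vectors.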

Cosine Similarity

cos(θ) = (A·B) / (|A| × |B|)

Measures the angle between vectors. 1 = same direction, 0 = orthogonal, −1 = opposite. The usual default for text embeddings.

Euclidean Distance

d = √Σ(Aᵢ − Bᵢ)²

Straight-line distance in embedding space. Useful when magnitude of the vector matters.

Dot Product

A · B = Σ AᵢBᵢ

Fast and often used. Equivalent to cosine when vectors are normalised to unit length.

Manhattan Distance

d = Σ |Aᵢ − Bᵢ|

Sum of absolute differences. Faster to compute, occasionally used in high-dim sparse spaces.
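The four measures above are short enough to write out by hand. The sketch below implements them in plain Python on two small example vectors; the function names are my own, and production code would normally use NumPy or the database's built-in operators instead.

```python
import math

def dot(a, b):
    """Dot product: sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean(a, b):
    """Straight-line distance between the two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute per-dimension differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

a = [3.0, 4.0]
b = [4.0, 3.0]

print(dot(a, b))                         # 24.0
print(round(cosine_similarity(a, b), 4)) # 0.96
print(round(euclidean(a, b), 4))         # 1.4142
print(manhattan(a, b))                   # 2.0
```

Note that cosine similarity and dot product are "higher is closer", while the two distances are "lower is closer", which matters when you sort results.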

Section 03

The Vector Database Landscape — Visual Overview

🗺️ Four Contenders — Where They Live

Each quadrant represents a different deployment philosophy. The right choice depends on your scale, stack, and ops budget — not on marketing.

Section 04

🌲 Pinecone — The Fully Managed Powerhouse

📖 The Analogy

Pinecone = AWS S3 for Vectors

You don't think about how S3 stores your files, what hardware it runs on, or how it scales to a billion objects. You just call put_object and get_object. Pinecone is the same philosophy for vectors: a fully managed, serverless API that handles indexing, sharding, replication, and ANN search so you never touch infrastructure. You pay per query and per stored vector. You never SSH into anything.

Core Architecture

Pinecone uses a proprietary ANN index (based on HNSW and product quantisation) distributed across pods. Its serverless tier scales to zero when idle and spins up in milliseconds. Indexes are namespace-partitioned, allowing multi-tenancy within a single index.

⚡

Serverless

Scale to Zero

No always-on infrastructure. Pay only for what you query and store. Cold start is sub-second — ideal for bursty workloads.

🏷️

Namespaces

Multi-tenancy

Partition vectors within one index by namespace string. Query namespace-scoped results without separate indexes per tenant.

🔍

Metadata Filtering

Hybrid Search

Attach JSON metadata to each vector. Filter by metadata during ANN search — combine semantic and structured conditions in one call.

Python: Full RAG Pipeline with Pinecone

# pip install pinecone openai   (the SDK was previously published as pinecone-client)
import os
from pinecone import Pinecone, ServerlessSpec
from openai import OpenAI

# ── 1. Initialise clients ──────────────────────────────────
pc     = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
openai = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# ── 2. Create or connect to an index ──────────────────────
INDEX_NAME = "rag-docs"
if INDEX_NAME not in [i.name for i in pc.list_indexes()]:
    pc.create_index(
        name=INDEX_NAME,
        dimension=1536,          # text-embedding-3-small output size
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1")
    )
index = pc.Index(INDEX_NAME)

# ── 3. Embed and upsert documents ─────────────────────────
documents = [
    {"id": "doc-1", "text": "Vector databases store embeddings for semantic search.",
     "metadata": {"source": "chapter_1", "topic": "vector-db"}},
    {"id": "doc-2", "text": "Pinecone is a fully managed serverless vector database.",
     "metadata": {"source": "chapter_2", "topic": "pinecone"}},
    {"id": "doc-3", "text": "RAG improves LLM accuracy by grounding responses in retrieved context.",
     "metadata": {"source": "chapter_3", "topic": "rag"}},
]

def embed(text: str) -> list[float]:
    resp = openai.embeddings.create(input=text, model="text-embedding-3-small")
    return resp.data[0].embedding

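# Keep the original text in metadata so retrieved matches can be passed to the LLM.
vectors = [
    (f'ns1#{doc["id"]}', embed(doc["text"]), {**doc["metadata"], "text": doc["text"]})
    for doc in documents
]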
index.upsert(vectors=vectors, namespace="chapter-ns")

# ── 4. Semantic search with metadata filter ────────────────
query   = "How does Pinecone scale for large datasets?"
q_vec   = embed(query)
results = index.query(
    vector=q_vec,
    top_k=3,
    namespace="chapter-ns",
    filter={"topic": {"$in": ["pinecone", "vector-db"]}},
    include_metadata=True
)

for match in results.matches:
    print(f"Score: {match.score:.4f} | Source: {match.metadata['source']}")

OUTPUT

Score: 0.9121 | Source: chapter_2
Score: 0.8847 | Source: chapter_1

💡

Pinecone Sparse-Dense Hybrid Search

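Pinecone supports hybrid search by accepting both a dense vector (vector) and a sparse vector (sparse_vector at query time; sparse_values on upserted records) in one query. This combines semantic similarity with exact keyword matching, which is critical for product catalogues and legal document retrieval where both matter. There is no alpha query parameter: to weight the blend, scale the dense vector by alpha and the sparse values by 1 - alpha on the client before querying, where alpha = 0 is pure keyword and alpha = 1 is pure semantic. Hybrid queries require an index created with the dotproduct metric.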
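One common way to apply an alpha weighting is to scale both vectors on the client before sending the query. The sketch below shows that convex scaling in plain Python. The helper name hybrid_scale is hypothetical, and the sparse vector is assumed to be a dict with "indices" and "values" keys; check the current Pinecone client documentation for the exact query parameters before wiring it in.

```python
def hybrid_scale(dense, sparse, alpha):
    """Weight a dense vector by alpha and a sparse vector by (1 - alpha).

    alpha = 1.0 keeps only the semantic (dense) signal,
    alpha = 0.0 keeps only the keyword (sparse) signal.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    scaled_dense = [v * alpha for v in dense]
    scaled_sparse = {
        "indices": list(sparse["indices"]),
        "values": [v * (1.0 - alpha) for v in sparse["values"]],
    }
    return scaled_dense, scaled_sparse

# Hypothetical tiny vectors for illustration only.
dense = [0.2, 0.4]
sparse = {"indices": [10, 42], "values": [1.0, 0.5]}

scaled_dense, scaled_sparse = hybrid_scale(dense, sparse, alpha=0.75)
print(scaled_sparse["values"])  # [0.25, 0.125]
```

Because the similarity is a dot product over both parts, scaling the inputs this way scales each part's contribution to the final score by the same factor.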

Section 05

🔮 Weaviate — The Knowledge Graph + Vector Hybrid

📖 The Analogy

Weaviate = A Smart Librarian Who Knows Relationships

Pinecone is a pure similarity-search engine — brilliant at "find me the closest vectors." Weaviate is something richer. It's a knowledge graph that also does vector search. Your data has schema — Articles have Authors, Authors belong to Organisations. You can traverse those relationships in the same query that retrieves semantically similar documents. Ask: "Find articles about climate change, written by authors who are affiliated with a university, published after 2022." That is one Weaviate query. That is not possible in a pure vector store.

Key Architecture: HNSW + Inverted Index

Weaviate uses HNSW (Hierarchical Navigable Small World) for ANN search, paired with a traditional inverted index for BM25 keyword search. The combination enables hybrid search — a weighted mix of semantic and keyword relevance — out of the box without extra infrastructure.
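To see what "a weighted mix" means in practice, here is a simplified sketch of score fusion: normalise each result list to the 0 to 1 range, then blend with alpha. It illustrates the general normalise-then-blend idea only; it is not Weaviate's actual implementation, and the function names and toy scores are assumptions.

```python
def min_max(scores):
    """Rescale a non-empty {doc_id: score} dict so values fall in [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def hybrid_fuse(vector_scores, bm25_scores, alpha):
    """Blend normalised scores: alpha weights vector, (1 - alpha) weights BM25."""
    v, b = min_max(vector_scores), min_max(bm25_scores)
    docs = set(v) | set(b)
    fused = {d: alpha * v.get(d, 0.0) + (1 - alpha) * b.get(d, 0.0) for d in docs}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical raw scores from the two retrievers (different scales on purpose).
vector_scores = {"a": 0.90, "b": 0.80, "c": 0.50}
bm25_scores = {"b": 7.0, "c": 3.0, "d": 1.0}

ranked = hybrid_fuse(vector_scores, bm25_scores, alpha=0.5)
print([doc for doc, _ in ranked])  # ['b', 'a', 'c', 'd']
```

Document "b" wins because it scores well in both lists, even though "a" has the best vector score on its own.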

🕸️

GraphQL API

Relational + Vector

Query your data graph with full relational traversal. Navigate class references while performing ANN search in the same operation.

🔀

Hybrid Search

BM25 + Vector

Built-in hybrid search with a tunable alpha parameter. No need for a separate Elasticsearch cluster alongside your vector DB.

🖼️

Multi-Modal

Text · Image · Code

Native modules for CLIP (image + text), ImageBind (multi-modal), and custom vectorisers. Store and search across different data modalities in one DB.

Python: Define Schema, Upsert and Hybrid-Search

# pip install weaviate-client
import weaviate
import weaviate.classes as wvc
import os

# ── 1. Connect (local Docker or Weaviate Cloud) ────────────
client = weaviate.connect_to_local(           # or connect_to_weaviate_cloud()
    headers={"X-OpenAI-Api-Key": os.environ["OPENAI_API_KEY"]}  # needed by text2vec-openai
)

# ── 2. Create a collection (schema class) ─────────────────
if not client.collections.exists("Article"):
    client.collections.create(
        name="Article",
        vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_openai(),
        generative_config=wvc.config.Configure.Generative.openai(),
        properties=[
            wvc.config.Property(name="title",   data_type=wvc.config.DataType.TEXT),
            wvc.config.Property(name="content", data_type=wvc.config.DataType.TEXT),
            wvc.config.Property(name="year",    data_type=wvc.config.DataType.INT),
        ]
    )

# ── 3. Insert documents ───────────────────────────────────
articles = client.collections.get("Article")
with articles.batch.dynamic() as batch:
    batch.add_object({"title": "Introduction to Vector Search",
                      "content": "Embeddings capture semantic meaning...",
                      "year": 2024})
    batch.add_object({"title": "Weaviate in Production",
                      "content": "Weaviate uses HNSW with BM25 hybrid...",
                      "year": 2023})

# ── 4. Hybrid search (BM25 + semantic, alpha=0.7) ─────────
results = articles.query.hybrid(
    query="vector database hybrid search",
    alpha=0.7,                     # 0 = pure BM25, 1 = pure vector
    filters=wvc.query.Filter.by_property("year").greater_than(2022),
    limit=5,
    return_metadata=wvc.query.MetadataQuery(score=True, explain_score=True)
)

for obj in results.objects:
    print(f"Title: {obj.properties['title']}")
    print(f"Score: {obj.metadata.score:.4f}")
    print(f"Explain: {obj.metadata.explain_score[:80]}")

client.close()

OUTPUT

Title: Weaviate in Production
Score: 0.9344
Explain: hybrid relative score fusion: vector=0.951, bm25=0.887
Title: Introduction to Vector Search
Score: 0.8112
Explain: hybrid relative score fusion: vector=0.833, bm25=0.771

⚠️

Weaviate HNSW RAM Requirement

Weaviate's HNSW index is fully loaded into RAM at startup. At 1,536 dimensions with float32, each vector takes ~6 KB. One million vectors = ~6 GB RAM. Plan memory carefully for large indexes. Use the pq (product quantisation) compression option to cut RAM usage by 4–32× at a small recall cost when memory is constrained.
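The memory arithmetic above is easy to reproduce for your own numbers. This sketch estimates the RAM needed for the raw vectors only; the function name and the compression factor are illustrative assumptions, and the HNSW graph links add further overhead that is not modelled here.

```python
def raw_vector_ram_gb(num_vectors, dims, bytes_per_value=4, compression=1.0):
    """Estimate RAM for raw vector storage in GB (1 GB = 1e9 bytes).

    bytes_per_value=4 corresponds to float32.
    compression is the assumed reduction factor from quantisation (1.0 = none).
    """
    total_bytes = num_vectors * dims * bytes_per_value / compression
    return total_bytes / 1e9

uncompressed = raw_vector_ram_gb(1_000_000, 1536)
with_pq_8x = raw_vector_ram_gb(1_000_000, 1536, compression=8)

print(f"{uncompressed:.3f} GB")  # 6.144 GB
print(f"{with_pq_8x:.3f} GB")    # 0.768 GB
```

Running it for 10 million vectors gives roughly 61 GB uncompressed, which is usually the point where quantisation or a larger node becomes a real decision.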

Section 06

🎨 Chroma — The Developer's Best Friend for Prototyping

📖 The Analogy

Chroma = SQLite for Vectors

When you build a web app prototype, you don't set up PostgreSQL with connection pooling and SSL certificates. You use SQLite — it runs in-process, no server needed, file on disk, done in 30 seconds. Chroma is the SQLite of vector databases. It runs embedded in your Python process, stores data in a folder, and works with zero configuration. When you outgrow it, you swap it out. But for prototyping, hackathons, notebooks, and local RAG development, nothing touches its simplicity.

Two Modes: Ephemeral and Persistent

Chroma operates in two modes. In-memory (ephemeral): data lives in RAM and is gone when Python exits, which is perfect for tests. Persistent: data is written to a directory on disk (SQLite-backed since Chroma 0.4; earlier releases used DuckDB + Parquet). Both modes expose the same Python API, making it trivial to switch.

🏠

Embedded Mode

Zero Infrastructure

Runs in-process. No Docker. No server. No config. Import and go. Perfect for notebooks, unit tests, and local prototyping.

🔌

Auto-Embedding

Built-in Functions

Pass raw text strings — Chroma calls a built-in embedding function automatically. Swap embedding models with one parameter change.

🖥️

Client-Server Mode

Team Development

Run chroma run to start an HTTP server. Connect multiple processes or microservices to the same Chroma instance.

Python: Local RAG with Chroma in Under 30 Lines

# pip install chromadb sentence-transformers
import chromadb
from chromadb.utils import embedding_functions

# ── 1. Create a persistent client ─────────────────────────
client = chromadb.PersistentClient(path="./my_chroma_db")

# ── 2. Plug in a local embedding model (no API key!) ──────
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"         # 80 MB model, 384-dim
)

# ── 3. Create or get a collection ─────────────────────────
collection = client.get_or_create_collection(
    name="my_rag_docs",
    embedding_function=ef,
    metadata={"hnsw:space": "cosine"}
)

# ── 4. Add documents (Chroma embeds them automatically) ───
collection.add(
    documents=[
        "Chroma is an open-source embedding database for AI applications.",
        "Sentence transformers create dense vector representations of text.",
        "RAG architecture retrieves context before generating an answer.",
        "pgvector extends PostgreSQL with native vector column support.",
    ],
    ids=["id1", "id2", "id3", "id4"],
    metadatas=[
        {"category": "chroma"},
        {"category": "embeddings"},
        {"category": "rag"},
        {"category": "pgvector"},
    ]
)

# ── 5. Query ───────────────────────────────────────────────
results = collection.query(
    query_texts=["Which database works well for RAG prototyping?"],
    n_results=2,
    where={"category": {"$in": ["chroma", "rag"]}},
    include=["documents", "distances", "metadatas"]
)

for doc, dist, meta in zip(
    results["documents"][0],
    results["distances"][0],
    results["metadatas"][0]
):
    print(f"[{meta['category']:10s}] dist={dist:.4f} → {doc[:60]}")

OUTPUT

[chroma    ] dist=0.1823 → Chroma is an open-source embedding database for AI applicati
[rag       ] dist=0.2471 → RAG architecture retrieves context before generating an answ

✅

Chroma + LangChain / LlamaIndex Native Support

Both LangChain and LlamaIndex ship first-class Chroma integrations. Chroma.from_documents(docs, embedding) in LangChain wraps the entire embed-and-upsert flow in one line. This makes Chroma one of the shortest paths from raw documents to a working RAG pipeline.

Section 07

🐘 pgvector — Vectors Inside Your Existing PostgreSQL

📖 The Analogy

pgvector = Adding a Superpower to What You Already Own

Your company runs PostgreSQL. Your users table is there. Your orders are there. Your audit logs are there. Your DBAs know it. Your backup scripts target it. Your ORM speaks it. Now imagine you can add vector columns to existing tables — embed product descriptions right next to price and SKU. Run a single SQL query that filters by price range, joins to orders, and does a nearest-neighbour vector search all at once. No new infrastructure. No new ops burden. No new concept to explain to your CTO. That is pgvector.

Two Index Types: IVFFlat vs HNSW

IVFFlat Index

Property	Value
Build time	Fast
RAM usage	Low
Recall	~90–95%
Updates	Supported, but recall drifts; rebuild periodically
Best for	Static datasets

HNSW Index (pgvector 0.5+)

Property	Value
Build time	Slower
RAM usage	Higher
Recall	~98–99%
Updates	Online (no rebuild)
Best for	Dynamic datasets
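If you go with IVFFlat, the two knobs that matter are lists (set at index build time) and probes (set at query time). The sketch below encodes a commonly cited starting heuristic from the pgvector README: rows / 1000 lists up to 1M rows, sqrt(rows) beyond that, and probes of about sqrt(lists). Treat the numbers as assumptions to tune, and the helper names as hypothetical.

```python
import math

def ivfflat_params(row_count: int) -> tuple[int, int]:
    """Suggest (lists, probes) starting values for an IVFFlat index."""
    if row_count <= 1_000_000:
        lists = max(1, row_count // 1000)
    else:
        lists = round(math.sqrt(row_count))
    probes = max(1, round(math.sqrt(lists)))
    return lists, probes

def ivfflat_index_sql(table: str, column: str, lists: int) -> str:
    """Build the CREATE INDEX statement for a cosine-distance IVFFlat index."""
    return (
        f"CREATE INDEX ON {table} USING ivfflat "
        f"({column} vector_cosine_ops) WITH (lists = {lists});"
    )

lists, probes = ivfflat_params(500_000)
print(ivfflat_index_sql("articles", "embedding", lists))
print(f"SET ivfflat.probes = {probes};")
```

More probes means higher recall and slower queries; build the index after the table already holds representative data, since the list centroids are computed from the rows present at build time.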

Python + psycopg2: Full Setup and Search

# pip install psycopg2-binary pgvector openai
import psycopg2
from pgvector.psycopg2 import register_vector
from openai import OpenAI
import numpy as np
import os

conn   = psycopg2.connect(os.environ["DATABASE_URL"])
cur    = conn.cursor()
client = OpenAI()

# ── 1. Enable extension and create table ──────────────────
cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
register_vector(conn)   # must run after the extension exists, or the vector type is not found
cur.execute("""
  CREATE TABLE IF NOT EXISTS articles (
    id         SERIAL PRIMARY KEY,
    title      TEXT NOT NULL,
    content    TEXT NOT NULL,
    category   TEXT,
    published  DATE,
    embedding  VECTOR(1536)     -- OpenAI text-embedding-3-small
  );
""")
conn.commit()

# ── 2. Create HNSW index ──────────────────────────────────
cur.execute("""
  CREATE INDEX IF NOT EXISTS articles_hnsw_idx
  ON articles USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);
""")
conn.commit()

# ── 3. Insert with embedding ──────────────────────────────
def embed(text: str):
    resp = client.embeddings.create(input=text, model="text-embedding-3-small")
    return np.array(resp.data[0].embedding)

rows = [
    ("Vector Databases 101",   "Embeddings enable semantic search at scale.",   "database"),
    ("pgvector Performance",   "HNSW gives 99% recall with fast query time.",    "database"),
    ("Building a RAG System",  "Retrieval augmented generation needs a vector store.", "rag"),
]
for title, content, category in rows:
    emb = embed(content)
    cur.execute(
        "INSERT INTO articles (title, content, category, embedding) VALUES (%s, %s, %s, %s)",
        (title, content, category, emb)
    )
conn.commit()

# ── 4. Nearest-neighbour search — pure SQL! ───────────────
query_vec = embed("How do I add vector search to my Postgres database?")
cur.execute("""
  SELECT title, category,
         1 - (embedding <=> %s::vector) AS cosine_similarity
  FROM   articles
  WHERE  category = 'database'
  ORDER  BY embedding <=> %s::vector
  LIMIT  3;
""", (query_vec, query_vec))

print(f"{'Title':<30} {'Category':<10} {'Similarity'}")
print("-" * 55)
for row in cur.fetchall():
    print(f"{row[0]:<30} {row[1]:<10} {row[2]:.4f}")
cur.close(); conn.close()

OUTPUT

Title                          Category   Similarity
-------------------------------------------------------
pgvector Performance           database   0.9418
Vector Databases 101           database   0.8903

🔑

The Three Distance Operators in pgvector SQL

pgvector's three core distance operators are: <-> Euclidean distance, <=> cosine distance (use 1 - (a <=> b) for similarity), and <#> negative inner product. Newer releases add further operators, such as <+> for L1 distance. Use <=> for text embeddings. If you use inner product instead, normalise the embeddings to unit length first: unnormalised inner product does not equal cosine similarity.
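The normalisation warning is easy to demonstrate. In the toy sketch below, a long off-axis vector beats a perfectly aligned short one under raw inner product, while cosine similarity picks the aligned one. The vectors and names are made up for illustration.

```python
import math

def normalise(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def inner(a, b):
    """Raw inner product, sensitive to vector length."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity, computed as inner product of unit vectors."""
    return inner(normalise(a), normalise(b))

query = [1.0, 0.0]
docs = {
    "aligned_short": [1.0, 0.0],   # same direction as the query
    "off_axis_long": [5.0, 5.0],   # 45 degrees away, but much longer
}

best_by_inner = max(docs, key=lambda d: inner(query, docs[d]))
best_by_cosine = max(docs, key=lambda d: cosine(query, docs[d]))

print(best_by_inner)   # off_axis_long
print(best_by_cosine)  # aligned_short
```

Once every stored vector and the query are normalised, the two rankings agree, which is why inner product is safe only on unit-length embeddings.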

Section 08

Master Comparison: Pinecone vs Weaviate vs Chroma vs pgvector

Dimension	🌲 Pinecone	🔮 Weaviate	🎨 Chroma	🐘 pgvector
Deployment	Fully managed SaaS	Self-host or WCS cloud	Embedded / server	Your Postgres server
Open Source	❌ Proprietary	✅ BSD-3	✅ Apache-2.0	✅ PostgreSQL Licence
Scale (vectors)	Billions+	Hundreds of millions	~1–10 million	~10–100 million
ANN Algorithm	Proprietary (HNSW/PQ)	HNSW	HNSW (hnswlib)	HNSW or IVFFlat
Hybrid Search	Sparse-dense vectors	✅ BM25 + vector native	❌ Vector-only	Manual (+ pg_trgm)
Metadata Filtering	✅ Rich filter syntax	✅ GraphQL filters	Basic where-clause	✅ Full SQL WHERE
Multi-modal	❌ Text/vector only	✅ Text, image, audio	❌ Text focus	❌ Custom only
Setup time	2 minutes (API key)	15 min (Docker)	30 seconds (pip)	Depends on Postgres
Cost model	Per query + storage	Infrastructure cost	Free / OSS	Postgres hosting cost
Joins with app data	❌ Separate system	Cross-refs only	❌ Separate system	✅ Native SQL JOINs
Best use case	Production RAG, high scale	Knowledge graphs, multi-modal	Local dev, prototyping	Existing Postgres stacks

Section 09

Which Vector Database Should You Choose? — Decision Flow

🧭 Decision Flowchart

This flow covers ~90% of real use cases. The tiebreaker in ambiguous cases is almost always: team ops capacity vs managed cost trade-off.

Section 10

Performance Benchmarks — What the Numbers Say

Benchmarking vector databases is genuinely hard — results depend on hardware, dataset size, index parameters, query concurrency, and recall tolerance. The table below reflects community benchmarks on 1M vectors at 1,536 dimensions (representative of OpenAI embeddings), single-node, non-filtered queries.

Database	QPS (queries/sec)	p99 Latency	Recall@10	Index Build (1M)	RAM (1M vecs)
🌲 Pinecone	~1,500+	<20 ms	~99%	Managed	Managed
🔮 Weaviate HNSW	~800–1,200	20–50 ms	~97–99%	~25 min	~8 GB
🎨 Chroma (hnswlib)	~200–500	40–80 ms	~97%	~8 min	~6 GB
🐘 pgvector HNSW	~300–700	30–60 ms	~98%	~30 min	~10 GB
🐘 pgvector IVFFlat	~500–900	<30 ms	~90–95%	~5 min	~4 GB

⚠️

Benchmark Caveats — Read Before Concluding

All these numbers degrade significantly with metadata filtering. With a condition such as WHERE category = 'X', an index that applies the filter after picking its ANN candidates can discard most of them, so recall drops and fewer than k rows may come back; IVFFlat is especially sensitive. Pinecone and Weaviate handle filtered ANN significantly better than pgvector IVFFlat. HNSW-based indexes are more robust to filtering but use more RAM. Always benchmark on your own data shape and filter pattern.
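One way filtering can hurt results is when the filter runs after the index has already chosen its candidates. The toy sketch below uses exact search on four hand-made rows to contrast post-filtering with pre-filtering; real engines use more elaborate strategies, so this is only an illustration of the failure mode.

```python
import math

def top_k(query, rows, k):
    """Exact nearest-neighbour search: the k rows closest to the query."""
    return sorted(rows, key=lambda r: math.dist(query, r["vec"]))[:k]

# Hypothetical rows: the two nearest ones are in the "wrong" category.
rows = [
    {"id": 1, "category": "rag", "vec": [0.0, 0.1]},
    {"id": 2, "category": "rag", "vec": [0.1, 0.0]},
    {"id": 3, "category": "database", "vec": [0.5, 0.5]},
    {"id": 4, "category": "database", "vec": [0.9, 0.9]},
]
query = [0.0, 0.0]

# Post-filter: take the 2 nearest overall, then apply the category filter.
post = [r["id"] for r in top_k(query, rows, 2) if r["category"] == "database"]

# Pre-filter: apply the category filter first, then take the 2 nearest.
matching = [r for r in rows if r["category"] == "database"]
pre = [r["id"] for r in top_k(query, matching, 2)]

print(post)  # []
print(pre)   # [3, 4]
```

Post-filtering returns nothing here even though two matching rows exist, which is the pattern to look for when a filtered query comes back with fewer results than you asked for.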

Section 11

Full RAG Architecture — The Complete Picture

📖 End-to-End Story

How a Question Becomes an Intelligent Answer

Your company has 50,000 internal wiki pages. A new employee asks: "What is our policy on remote work for contractors in Germany?" The LLM alone cannot answer — it doesn't know your company's internal policies, and even if it did, the context window can't hold 50,000 pages. Here is what happens with RAG:

📄 Ingestion — Index Your Knowledge Base

All 50,000 wiki pages are chunked into paragraphs (~500 tokens each), embedded via text-embedding-3-small, and upserted into the vector database. This happens once, then incrementally as pages change. The vector DB now holds ~500,000 embedding vectors.
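The chunking step can be sketched as a sliding window with overlap. The version below works on a list of tokens and uses whitespace-separated words as a stand-in for real model tokens, which is an assumption; a production pipeline would count tokens with the embedding model's tokenizer. The function name and defaults are illustrative.

```python
def chunk_tokens(tokens, chunk_size=500, overlap=50):
    """Split tokens into windows of chunk_size that overlap by `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks

# A fake 1,200-"token" page, with words standing in for tokens.
words = [f"w{i}" for i in range(1200)]
chunks = chunk_tokens(words, chunk_size=500, overlap=50)

print([len(c) for c in chunks])  # [500, 500, 300]
print(chunks[1][0])              # w450
```

Each chunk starts 450 tokens after the previous one, so the last 50 tokens of one chunk reappear at the start of the next and a sentence split at a boundary still appears whole in at least one chunk.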

❓ Query — Embed the User's Question

The question "What is our policy on remote work for contractors in Germany?" is embedded into a 1,536-dimensional vector using the same embedding model. This vector captures the semantic meaning of the query.

🔍 Retrieval — ANN Search

The vector DB performs an ANN search across 500,000 vectors in <50 ms and returns the top-5 most semantically similar chunks. These chunks are almost certainly the relevant HR policy paragraphs — even if the exact words differ from the query.

🧩 Augmentation — Build the Prompt

The top-5 retrieved chunks are injected into the LLM prompt as context: "Answer the question using only the information in the following context: [chunk 1] [chunk 2] ... User question: What is our policy on remote work for contractors in Germany?"
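The augmentation step is plain string assembly. Here is a minimal sketch that numbers the retrieved chunks so the model can cite them; the function name, the numbering scheme, and the sample chunks are assumptions for illustration.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a grounded prompt from retrieved chunks and the user question."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the information in the following context.\n\n"
        f"Context:\n{context}\n\n"
        f"User question: {question}"
    )

# Hypothetical retrieved chunks.
retrieved = [
    "Contractors based in Germany may work remotely up to three days per week.",
    "Remote work requests are approved by the contractor's line manager.",
]
prompt = build_prompt("What is our policy on remote work for contractors in Germany?", retrieved)
print(prompt)
```

Numbering the chunks makes it straightforward to ask the model to cite sources like [1] or [2] and to map those citations back to the stored chunk metadata.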

🤖 Generation — LLM Synthesises the Answer

The LLM (e.g. GPT-4o, Claude, Gemini) reads the retrieved context and generates a specific, citable answer. Hallucination is reduced because the model is instructed to answer only from the retrieved chunks, so answer quality now depends heavily on retrieval quality.

Section 12

Golden Rules — Vector Database Production Checklist

🏆 Non-Negotiable Rules for Production Vector Search

Use the same embedding model for ingestion and retrieval. If you embed documents with text-embedding-3-small, you must query with the same model. Different models produce incompatible vector spaces — querying with a different model gives garbage results without error messages.

Chunk carefully: chunk size is not a hyperparameter to ignore. Too-small chunks (50 tokens) lose context. Too-large chunks (2,000 tokens) dilute the embedding signal. A common starting point for most RAG applications is 300–600 tokens with a 50-token overlap between consecutive chunks; tune it against your own evaluation set.

Store the original text alongside the vector — always. The vector is a lossy compressed representation. You need the original text to put in the LLM context window. Always store {"text": "...", "embedding": [...]} together.

Measure recall, not just latency. A fast vector search that returns the wrong results is worse than useless. Build a small evaluation set of (query, expected_chunk) pairs and measure Recall@5 and Recall@10 before deploying. Target >90% Recall@5.

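For pgvector: raise hnsw.ef_search at query time, for example SET LOCAL hnsw.ef_search = 100 inside the query's transaction (SET LOCAL has no effect outside one). The default ef_search = 40 trades recall for speed. Higher values improve recall at the cost of latency, so confirm the setting against your own recall and latency targets.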

Don't embed metadata into the vector — filter it. Some developers embed text like "Category: Finance. Author: John." into the chunk before embedding. This pollutes the semantic space with structural noise. Store metadata as structured fields and use the database's native filter mechanism to apply it at search time.

Prototype with Chroma, graduate to Pinecone or pgvector. Build your RAG pipeline with Chroma locally. Once the logic is proven, swap the vector DB client for Pinecone (for scale and ops simplicity) or pgvector (if you're already paying for Postgres). The core RAG logic remains identical — only the DB client changes.
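The recall rule above can be turned into a small harness. The sketch below computes Recall@k over (query, expected_chunk) pairs using a canned stand-in for the retriever; in a real test, fake_retrieve would be replaced by a call to your vector DB that returns chunk ids in ranked order. All names and data here are hypothetical.

```python
def recall_at_k(eval_set, retrieve, k):
    """Fraction of queries whose expected chunk id appears in the top-k results."""
    hits = 0
    for query, expected_id in eval_set:
        if expected_id in retrieve(query, k):
            hits += 1
    return hits / len(eval_set)

# Canned rankings standing in for a real vector search.
canned = {
    "remote work policy": ["c7", "c2", "c9"],
    "expense limits": ["c4", "c1", "c3"],
}

def fake_retrieve(query, k):
    """Return the first k canned chunk ids for the query."""
    return canned[query][:k]

eval_set = [("remote work policy", "c2"), ("expense limits", "c3")]

print(recall_at_k(eval_set, fake_retrieve, k=2))  # 0.5
print(recall_at_k(eval_set, fake_retrieve, k=3))  # 1.0
```

Because the harness only depends on a retrieve(query, k) callable, the same evaluation set can be reused unchanged when you swap Chroma for Pinecone or pgvector.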