
Vector Databases Pinecone vs Weaviate vs Chroma vs pgvector

A comprehensive, example-driven tutorial comparing the four most important vector databases for LLM API development. Covers what embeddings are, how ANN search works, full Python code examples for each database, a master

Section 01

The Story That Explains Vector Databases

The World's Strangest Library
Imagine a library where no book has a title. No catalogue. No ISBN. No shelf numbers. Instead, each book has a fingerprint — a list of 1,536 numbers that mathematically describe its meaning: how philosophical it is, how technical, how romantic, how violent. Books are stored by similarity of fingerprint, not by alphabet.

You walk in and say: "I want something that feels like the sadness of exile, the beauty of mathematics, and the weight of unspoken love." The librarian doesn't search titles — she computes your description into a fingerprint and finds the twelve books whose fingerprints are closest in space to yours.

That is exactly what a vector database does. And it is the engine behind every modern LLM API application that needs memory, context, or semantic search.

Traditional databases answer questions like "Give me all rows where price < 100." Vector databases answer a fundamentally different kind of question: "Give me the things most similar in meaning to this." That distinction is everything when building AI applications.

🧠
Why LLMs Need Vector Databases

LLMs have a fixed context window — they can only "see" a limited amount of text at once. Vector databases act as the LLM's long-term memory: store millions of document chunks as vectors, and at query time, retrieve only the most relevant ones to inject into the context window. This pattern is called RAG (Retrieval-Augmented Generation) and it is the most important architecture pattern in production LLM applications today.


Section 02

What Is a Vector (Embedding)?

Before comparing databases, you need to understand what they actually store. An embedding is a list of floating-point numbers produced by an embedding model. These numbers encode semantic meaning — the position of the text in a high-dimensional mathematical space.

🔢 From Text → Vector: What Happens Under the Hood
Input
Raw text: "The cat sat on the mat."
Model
Embedding model (e.g. text-embedding-3-small, all-MiniLM-L6-v2) processes the text through transformer layers
Output
A vector of floats, 1,536 of them for text-embedding-3-small (384 for all-MiniLM-L6-v2): [0.023, -0.814, 0.441, ..., 0.009]
Magic
Semantically similar sentences produce vectors that are close together in that high-dimensional space

The core operation a vector database must perform efficiently is called Approximate Nearest Neighbour (ANN) search: given a query vector, find the k stored vectors that are closest to it — measured by cosine similarity or Euclidean distance — across millions or billions of stored vectors, in milliseconds.
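To make the operation concrete, here is a minimal brute-force sketch of exact nearest-neighbour search by cosine similarity. It is the linear-scan baseline that ANN indexes approximate; the function names (`cosine_similarity`, `top_k`) and the toy 3-dimensional vectors are illustrative assumptions, not part of any database's API.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b (assumes neither is the zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query: list[float], stored: dict[str, list[float]], k: int) -> list[tuple[str, float]]:
    """Score every stored vector against the query and keep the k most similar."""
    scored = [(vec_id, cosine_similarity(query, vec)) for vec_id, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy 3-dimensional "embeddings"; real ones have hundreds or thousands of dimensions.
stored = {
    "cat":   [0.9, 0.1, 0.0],
    "dog":   [0.8, 0.3, 0.0],
    "stock": [0.0, 0.2, 0.9],
}
nearest = top_k([1.0, 0.2, 0.0], stored, k=2)
print(nearest)  # "cat" ranks first, then "dog"
```

This scan touches every stored vector, which is why it stops being practical at millions of vectors and why approximate indexes exist.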

Cosine Similarity
cos(θ) = (A·B) / (|A| × |B|)
Measures the angle between vectors. 1 = same direction, 0 = orthogonal, −1 = opposite. Best for text embeddings.
Euclidean Distance
d = √Σ(Aᵢ − Bᵢ)²
Straight-line distance in embedding space. Useful when magnitude of the vector matters.
Dot Product
A · B = Σ AᵢBᵢ
Fast and often used. Equivalent to cosine when vectors are normalised to unit length.
Manhattan Distance
d = Σ |Aᵢ − Bᵢ|
Sum of absolute differences. Faster to compute, occasionally used in high-dim sparse spaces.
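The four formulas above can be sketched in a few lines of plain Python. The helper names and the two toy vectors are assumptions chosen so the arithmetic is easy to check by hand; the last line illustrates the point that dot product and cosine agree once both vectors are scaled to unit length.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def normalise(a):
    """Scale a vector to unit length (assumes it is not the zero vector)."""
    n = norm(a)
    return [x / n for x in a]

a, b = [3.0, 4.0], [4.0, 3.0]
print(cosine(a, b))      # 24 / (5 * 5) = 0.96
print(euclidean(a, b))   # sqrt(1 + 1), about 1.4142
print(dot(a, b))         # 12 + 12 = 24.0
print(manhattan(a, b))   # 1 + 1 = 2.0

# After scaling both vectors to unit length, dot product and cosine agree.
ua, ub = normalise(a), normalise(b)
print(math.isclose(dot(ua, ub), cosine(a, b)))  # True
```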

Section 03

The Vector Database Landscape — Visual Overview

🗺️ Four Contenders — Where They Live
Managed cloud: 🌲 Pinecone (serverless · scale to zero)
Open-source / self-hosted: 🔮 Weaviate (GraphQL · multi-modal)
Embedded / local: 🎨 Chroma (Python-native · prototyping)
Postgres extension: 🐘 pgvector (Postgres extension · SQL)

Each quadrant represents a different deployment philosophy. The right choice depends on your scale, stack, and ops budget — not on marketing.


Section 04

🌲 Pinecone — The Fully Managed Powerhouse

Pinecone = AWS S3 for Vectors
You don't think about how S3 stores your files, what hardware it runs on, or how it scales to a billion objects. You just call put_object and get_object. Pinecone is the same philosophy for vectors: a fully managed, serverless API that handles indexing, sharding, replication, and ANN search so you never touch infrastructure. You pay per query and per stored vector. You never SSH into anything.

Core Architecture

Pinecone uses a proprietary ANN index whose internals are not fully public; it is reportedly built on ideas such as HNSW and product quantisation. The original tier runs on provisioned pods, while the serverless tier scales to zero when idle and answers a cold query in under a second. Indexes are namespace-partitioned, allowing multi-tenancy within a single index.

Serverless
Scale to Zero
No always-on infrastructure. Pay only for what you query and store. Cold start is sub-second — ideal for bursty workloads.
🏷️
Namespaces
Multi-tenancy
Partition vectors within one index by namespace string. Query namespace-scoped results without separate indexes per tenant.
🔍
Metadata Filtering
Hybrid Search
Attach JSON metadata to each vector. Filter by metadata during ANN search — combine semantic and structured conditions in one call.

Python: Full RAG Pipeline with Pinecone

# pip install pinecone openai   (the SDK was formerly published as pinecone-client)
import os
from pinecone import Pinecone, ServerlessSpec
from openai import OpenAI

# ── 1. Initialise clients ──────────────────────────────────
pc     = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
openai = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# ── 2. Create or connect to an index ──────────────────────
INDEX_NAME = "rag-docs"
if INDEX_NAME not in [i.name for i in pc.list_indexes()]:
    pc.create_index(
        name=INDEX_NAME,
        dimension=1536,          # text-embedding-3-small output size
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1")
    )
index = pc.Index(INDEX_NAME)

# ── 3. Embed and upsert documents ─────────────────────────
documents = [
    {"id": "doc-1", "text": "Vector databases store embeddings for semantic search.",
     "metadata": {"source": "chapter_1", "topic": "vector-db"}},
    {"id": "doc-2", "text": "Pinecone is a fully managed serverless vector database.",
     "metadata": {"source": "chapter_2", "topic": "pinecone"}},
    {"id": "doc-3", "text": "RAG improves LLM accuracy by grounding responses in retrieved context.",
     "metadata": {"source": "chapter_3", "topic": "rag"}},
]

def embed(text: str) -> list[float]:
    resp = openai.embeddings.create(input=text, model="text-embedding-3-small")
    return resp.data[0].embedding

vectors = [
    # Keep the original text in metadata so it can be returned with each match
    (doc["id"], embed(doc["text"]), {**doc["metadata"], "text": doc["text"]})
    for doc in documents
]
index.upsert(vectors=vectors, namespace="chapter-ns")

# ── 4. Semantic search with metadata filter ────────────────
query   = "How does Pinecone scale for large datasets?"
q_vec   = embed(query)
results = index.query(
    vector=q_vec,
    top_k=3,
    namespace="chapter-ns",
    filter={"topic": {"$in": ["pinecone", "vector-db"]}},
    include_metadata=True
)

for match in results.matches:
    print(f"Score: {match.score:.4f} | Source: {match.metadata['source']}")
OUTPUT
Score: 0.9121 | Source: chapter_2
Score: 0.8847 | Source: chapter_1
💡
Pinecone Sparse-Dense Hybrid Search

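Pinecone supports hybrid search by accepting both a dense vector and a sparse vector in one query (sparse_values on upsert, sparse_vector on query), on an index that uses the dotproduct metric. This combines semantic similarity with exact keyword matching, which is critical for product catalogues and legal document retrieval where both matter. The query call has no alpha parameter of its own: to weight the blend, scale the dense values by alpha and the sparse values by 1 − alpha on the client before querying, where alpha = 0 is pure keyword and alpha = 1 is pure semantic.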
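One common way to apply the alpha weighting is to scale the two query vectors on the client before sending them. The sketch below assumes the sparse vector is a dict with "indices" and "values" keys; `hybrid_scale` is a hypothetical helper name, not a Pinecone SDK function.

```python
def hybrid_scale(dense: list[float], sparse: dict, alpha: float):
    """Weight a dense/sparse query pair: alpha=1 is pure dense, alpha=0 is pure sparse."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    scaled_dense = [v * alpha for v in dense]
    scaled_sparse = {
        "indices": list(sparse["indices"]),
        "values": [v * (1.0 - alpha) for v in sparse["values"]],
    }
    return scaled_dense, scaled_sparse

dense_q = [0.5, 0.5]                                     # toy dense embedding
sparse_q = {"indices": [10, 42], "values": [2.0, 1.0]}   # toy keyword weights
d, s = hybrid_scale(dense_q, sparse_q, alpha=0.75)
print(d)  # [0.375, 0.375]
print(s)  # {'indices': [10, 42], 'values': [0.5, 0.25]}
```

The scaled pair would then be passed to the query call as the dense and sparse inputs; check the current SDK reference for the exact parameter names before relying on this.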


Section 05

🔮 Weaviate — The Knowledge Graph + Vector Hybrid

Weaviate = A Smart Librarian Who Knows Relationships
Pinecone is a pure similarity-search engine — brilliant at "find me the closest vectors." Weaviate is something richer. It's a knowledge graph that also does vector search. Your data has schema — Articles have Authors, Authors belong to Organisations. You can traverse those relationships in the same query that retrieves semantically similar documents. Ask: "Find articles about climate change, written by authors who are affiliated with a university, published after 2022." That is one Weaviate query. That is not possible in a pure vector store.

Key Architecture: HNSW + Inverted Index

Weaviate uses HNSW (Hierarchical Navigable Small World) for ANN search, paired with a traditional inverted index for BM25 keyword search. The combination enables hybrid search — a weighted mix of semantic and keyword relevance — out of the box without extra infrastructure.
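To show what "a weighted mix of semantic and keyword relevance" can mean in practice, here is a simplified score-fusion sketch: each result list is min-max normalised to the 0 to 1 range and the two are blended with alpha. This is an illustration of the idea only, under the assumption that missing scores count as zero; it is not Weaviate's actual implementation, and `min_max` and `fuse` are hypothetical names.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1]; if all scores are equal, map them to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def fuse(vector_scores: dict[str, float], bm25_scores: dict[str, float], alpha: float):
    """Blend normalised scores: alpha=1 is pure vector, alpha=0 is pure BM25."""
    v, b = min_max(vector_scores), min_max(bm25_scores)
    fused = {
        doc_id: alpha * v.get(doc_id, 0.0) + (1 - alpha) * b.get(doc_id, 0.0)
        for doc_id in set(v) | set(b)
    }
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)

vector_scores = {"a": 0.9, "b": 0.7, "c": 0.5}   # e.g. cosine similarities
bm25_scores = {"b": 12.0, "c": 3.0, "d": 6.0}    # e.g. raw BM25 scores
ranking = fuse(vector_scores, bm25_scores, alpha=0.7)
print([doc_id for doc_id, _ in ranking])  # ['a', 'b', 'd', 'c']
```

Normalising first matters because raw BM25 scores and cosine similarities live on different scales, so a direct weighted sum would let one signal dominate.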

🕸️
GraphQL API
Relational + Vector
Query your data graph with full relational traversal. Navigate class references while performing ANN search in the same operation.
🔀
Hybrid Search
BM25 + Vector
Built-in hybrid search with a tunable alpha parameter. No need for a separate Elasticsearch cluster alongside your vector DB.
🖼️
Multi-Modal
Text · Image · Code
Native modules for CLIP (image + text), ImageBind (the multi2vec-bind module, multi-modal), and custom vectorisers. Store and search across different data modalities in one DB.

Python: Define Schema, Upsert and Hybrid-Search

# pip install weaviate-client
import weaviate
import weaviate.classes as wvc
import os

# ── 1. Connect (local Docker or Weaviate Cloud) ────────────
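client = weaviate.connect_to_local(           # or connect_to_weaviate_cloud(...)
    headers={"X-OpenAI-Api-Key": os.environ["OPENAI_API_KEY"]}  # key used by text2vec-openai
)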

# ── 2. Create a collection (schema class) ─────────────────
if not client.collections.exists("Article"):
    client.collections.create(
        name="Article",
        vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_openai(),
        generative_config=wvc.config.Configure.Generative.openai(),
        properties=[
            wvc.config.Property(name="title",   data_type=wvc.config.DataType.TEXT),
            wvc.config.Property(name="content", data_type=wvc.config.DataType.TEXT),
            wvc.config.Property(name="year",    data_type=wvc.config.DataType.INT),
        ]
    )

# ── 3. Insert documents ───────────────────────────────────
articles = client.collections.get("Article")
with articles.batch.dynamic() as batch:
    batch.add_object({"title": "Introduction to Vector Search",
                      "content": "Embeddings capture semantic meaning...",
                      "year": 2024})
    batch.add_object({"title": "Weaviate in Production",
                      "content": "Weaviate uses HNSW with BM25 hybrid...",
                      "year": 2023})

# ── 4. Hybrid search (BM25 + semantic, alpha=0.7) ─────────
results = articles.query.hybrid(
    query="vector database hybrid search",
    alpha=0.7,                     # 0 = pure BM25, 1 = pure vector
    filters=wvc.query.Filter.by_property("year").greater_than(2022),
    limit=5,
    return_metadata=wvc.query.MetadataQuery(score=True, explain_score=True)
)

for obj in results.objects:
    print(f"Title: {obj.properties['title']}")
    print(f"Score: {obj.metadata.score:.4f}")
    print(f"Explain: {obj.metadata.explain_score[:80]}")

client.close()
OUTPUT
Title: Weaviate in Production
Score: 0.9344
Explain: hybrid relative score fusion: vector=0.951, bm25=0.887
Title: Introduction to Vector Search
Score: 0.8112
Explain: hybrid relative score fusion: vector=0.833, bm25=0.771
⚠️
Weaviate HNSW RAM Requirement

Weaviate's HNSW index is fully loaded into RAM at startup. At 1,536 dimensions with float32, each vector takes ~6 KB, so one million vectors need ~6 GB of RAM for the raw vectors alone, before the HNSW graph overhead. Plan memory carefully for large indexes. When memory is constrained, use the pq (product quantisation) compression option to cut RAM usage by 4–32× at a small recall cost.
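The memory arithmetic is easy to reproduce for your own vector count and dimensionality. This sketch counts raw float32 storage only and ignores HNSW graph links and metadata, so treat the result as a lower bound; the function names are illustrative.

```python
def raw_vector_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to hold the vectors themselves (float32 by default)."""
    return num_vectors * dims * bytes_per_value

def to_gib(n_bytes: int) -> float:
    return n_bytes / 1024 ** 3

per_vector = raw_vector_bytes(1, 1536)            # 6144 bytes = 6 KiB
one_million = raw_vector_bytes(1_000_000, 1536)   # 6,144,000,000 bytes
print(per_vector, round(to_gib(one_million), 2))  # 6144 5.72

# Hypothetical compression ratios taken from the 4–32× range quoted above.
for ratio in (4, 32):
    print(ratio, round(to_gib(one_million) / ratio, 2))
```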


Section 06

🎨 Chroma — The Developer's Best Friend for Prototyping

Chroma = SQLite for Vectors
When you build a web app prototype, you don't set up PostgreSQL with connection pooling and SSL certificates. You use SQLite — it runs in-process, no server needed, file on disk, done in 30 seconds. Chroma is the SQLite of vector databases. It runs embedded in your Python process, stores data in a folder, and works with zero configuration. When you outgrow it, you swap it out. But for prototyping, hackathons, notebooks, and local RAG development, nothing touches its simplicity.

Two Modes: Ephemeral and Persistent

Chroma operates in two modes. In-memory (ephemeral): data lives in RAM and is gone when Python exits, which is perfect for tests. Persistent: data is written to a directory on disk (SQLite plus HNSW index files in Chroma 0.4 and later; earlier releases used DuckDB + Parquet). Both modes expose the same Python API, making it trivial to switch.

🏠
Embedded Mode
Zero Infrastructure
Runs in-process. No Docker. No server. No config. Import and go. Perfect for notebooks, unit tests, and local prototyping.
🔌
Auto-Embedding
Built-in Functions
Pass raw text strings — Chroma calls a built-in embedding function automatically. Swap embedding models with one parameter change.
🖥️
Client-Server Mode
Team Development
Run chroma run to start an HTTP server. Connect multiple processes or microservices to the same Chroma instance.

Python: Local RAG with Chroma in Under 30 Lines

# pip install chromadb sentence-transformers
import chromadb
from chromadb.utils import embedding_functions

# ── 1. Create a persistent client ─────────────────────────
client = chromadb.PersistentClient(path="./my_chroma_db")

# ── 2. Plug in a local embedding model (no API key!) ──────
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"         # 80 MB model, 384-dim
)

# ── 3. Create or get a collection ─────────────────────────
collection = client.get_or_create_collection(
    name="my_rag_docs",
    embedding_function=ef,
    metadata={"hnsw:space": "cosine"}
)

# ── 4. Add documents (Chroma embeds them automatically) ───
collection.add(
    documents=[
        "Chroma is an open-source embedding database for AI applications.",
        "Sentence transformers create dense vector representations of text.",
        "RAG architecture retrieves context before generating an answer.",
        "pgvector extends PostgreSQL with native vector column support.",
    ],
    ids=["id1", "id2", "id3", "id4"],
    metadatas=[
        {"category": "chroma"},
        {"category": "embeddings"},
        {"category": "rag"},
        {"category": "pgvector"},
    ]
)

# ── 5. Query ───────────────────────────────────────────────
results = collection.query(
    query_texts=["Which database works well for RAG prototyping?"],
    n_results=2,
    where={"category": {"$in": ["chroma", "rag"]}},
    include=["documents", "distances", "metadatas"]
)

for doc, dist, meta in zip(
    results["documents"][0],
    results["distances"][0],
    results["metadatas"][0]
):
    print(f"[{meta['category']:10s}] dist={dist:.4f} → {doc[:60]}")
OUTPUT
[chroma    ] dist=0.1823 → Chroma is an open-source embedding database for AI applicati
[rag       ] dist=0.2471 → RAG architecture retrieves context before generating an answ
Chroma + LangChain / LlamaIndex Native Support

Both LangChain and LlamaIndex ship first-class Chroma integrations. Chroma.from_documents(docs, embedding) in LangChain wraps the entire embed-and-upsert flow in one line. This makes Chroma one of the fastest paths from raw documents to a working RAG pipeline.


Section 07

🐘 pgvector — Vectors Inside Your Existing PostgreSQL

pgvector = Adding a Superpower to What You Already Own
Your company runs PostgreSQL. Your users table is there. Your orders are there. Your audit logs are there. Your DBAs know it. Your backup scripts target it. Your ORM speaks it. Now imagine you can add vector columns to existing tables — embed product descriptions right next to price and SKU. Run a single SQL query that filters by price range, joins to orders, and does a nearest-neighbour vector search all at once. No new infrastructure. No new ops burden. No new concept to explain to your CTO. That is pgvector.

Two Index Types: IVFFlat vs HNSW

IVFFlat Index
Property | Value
Build time | Fast
RAM usage | Low
Recall | ~90–95%
Updates | Inserts work, but recall drifts as data changes; rebuild periodically
Best for | Static datasets
HNSW Index (pgvector 0.5+)
Property | Value
Build time | Slower
RAM usage | Higher
Recall | ~98–99%
Updates | Online (no rebuild)
Best for | Dynamic datasets
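If you choose IVFFlat, the main tuning knobs are the number of lists at build time and the number of probes at query time. The sketch below encodes the starting-point heuristic suggested in pgvector's README (rows / 1000 lists up to one million rows, sqrt(rows) beyond that, and probes of about sqrt(lists)); the helper names and the table and column names are assumptions for illustration.

```python
import math

def suggest_ivfflat_params(row_count: int) -> dict[str, int]:
    """Starting values only; tune against measured recall on your own data."""
    if row_count <= 1_000_000:
        lists = max(1, row_count // 1000)
    else:
        lists = math.isqrt(row_count)
    probes = max(1, math.isqrt(lists))
    return {"lists": lists, "probes": probes}

def ivfflat_index_sql(table: str, column: str, lists: int) -> str:
    """Build the CREATE INDEX statement; table and column must be trusted identifiers."""
    return (
        f"CREATE INDEX ON {table} USING ivfflat "
        f"({column} vector_cosine_ops) WITH (lists = {lists});"
    )

params = suggest_ivfflat_params(500_000)
print(params)  # {'lists': 500, 'probes': 22}
print(ivfflat_index_sql("articles", "embedding", params["lists"]))
print(f"SET ivfflat.probes = {params['probes']};")  # run in the querying session
```

Build an IVFFlat index after the table already holds representative data, since the list centroids are computed from the rows present at build time.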

Python + psycopg2: Full Setup and Search

# pip install psycopg2-binary pgvector openai
import psycopg2
from pgvector.psycopg2 import register_vector
from openai import OpenAI
import numpy as np
import os

conn   = psycopg2.connect(os.environ["DATABASE_URL"])
cur    = conn.cursor()
client = OpenAI()

# ── 1. Enable extension and create table ──────────────────
cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
conn.commit()
register_vector(conn)   # must run after the extension exists, or the vector type is not found
cur.execute("""
  CREATE TABLE IF NOT EXISTS articles (
    id         SERIAL PRIMARY KEY,
    title      TEXT NOT NULL,
    content    TEXT NOT NULL,
    category   TEXT,
    published  DATE,
    embedding  VECTOR(1536)     -- OpenAI text-embedding-3-small
  );
""")
conn.commit()

# ── 2. Create HNSW index ──────────────────────────────────
cur.execute("""
  CREATE INDEX IF NOT EXISTS articles_hnsw_idx
  ON articles USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);
""")
conn.commit()

# ── 3. Insert with embedding ──────────────────────────────
def embed(text: str):
    resp = client.embeddings.create(input=text, model="text-embedding-3-small")
    return np.array(resp.data[0].embedding)

rows = [
    ("Vector Databases 101",   "Embeddings enable semantic search at scale.",   "database"),
    ("pgvector Performance",   "HNSW gives 99% recall with fast query time.",    "database"),
    ("Building a RAG System",  "Retrieval augmented generation needs a vector store.", "rag"),
]
for title, content, category in rows:
    emb = embed(content)
    cur.execute(
        "INSERT INTO articles (title, content, category, embedding) VALUES (%s, %s, %s, %s)",
        (title, content, category, emb)
    )
conn.commit()

# ── 4. Nearest-neighbour search — pure SQL! ───────────────
query_vec = embed("How do I add vector search to my Postgres database?")
cur.execute("""
  SELECT title, category,
         1 - (embedding <=> %s::vector) AS cosine_similarity
  FROM   articles
  WHERE  category = 'database'
  ORDER  BY embedding <=> %s::vector
  LIMIT  3;
""", (query_vec, query_vec))

print(f"{'Title':<30} {'Category':<10} {'Similarity'}")
print("-" * 55)
for row in cur.fetchall():
    print(f"{row[0]:<30} {row[1]:<10} {row[2]:.4f}")
cur.close(); conn.close()
OUTPUT
Title                          Category   Similarity
-------------------------------------------------------
pgvector Performance           database   0.9418
Vector Databases 101           database   0.8903
🔑
The Three Distance Operators in pgvector SQL

pgvector's three core distance operators are <-> for Euclidean distance, <=> for cosine distance (use 1 - (a <=> b) for similarity), and <#> for negative inner product, which is negated so that ascending ORDER BY returns the best match first. Use <=> for text embeddings. If you use inner product, normalise the embeddings to unit length first: unnormalised inner product does not rank results the same way as cosine similarity.


Section 08

Master Comparison: Pinecone vs Weaviate vs Chroma vs pgvector

Dimension | 🌲 Pinecone | 🔮 Weaviate | 🎨 Chroma | 🐘 pgvector
Deployment | Fully managed SaaS | Self-host or WCS cloud | Embedded / server | Your Postgres server
Open Source | ❌ Proprietary | ✅ BSD-3 | ✅ Apache-2.0 | ✅ PostgreSQL Licence
Scale (vectors) | Billions+ | Hundreds of millions | ~1–10 million | ~10–100 million
ANN Algorithm | Proprietary (HNSW/PQ) | HNSW | HNSW (hnswlib) | HNSW or IVFFlat
Hybrid Search | Sparse-dense vectors | ✅ BM25 + vector native | ❌ Vector-only | Manual (+ pg_trgm)
Metadata Filtering | ✅ Rich filter syntax | ✅ GraphQL filters | Basic where-clause | ✅ Full SQL WHERE
Multi-modal | ❌ Text/vector only | ✅ Text, image, audio | ❌ Text focus | ❌ Custom only
Setup time | 2 minutes (API key) | 15 min (Docker) | 30 seconds (pip) | Depends on Postgres
Cost model | Per query + storage | Infrastructure cost | Free / OSS | Postgres hosting cost
Joins with app data | ❌ Separate system | Cross-refs only | ❌ Separate system | ✅ Native SQL JOINs
Best use case | Production RAG, high scale | Knowledge graphs, multi-modal | Local dev, prototyping | Existing Postgres stacks

Section 09

Which Vector Database Should You Choose? — Decision Flow

🧭 Decision Flowchart
Start: New AI project
Already using PostgreSQL? YES → pgvector (SQL joins + vectors). NO → next question.
Just prototyping / local dev? YES → Chroma (zero setup, fast). NO → next question.
Need multi-modal or object relationships? YES → Weaviate (graph + vector). NO → Pinecone (managed · scalable · production-ready).

This flow covers ~90% of real use cases. The tiebreaker in ambiguous cases is almost always: team ops capacity vs managed cost trade-off.
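The decision flow can be written down as a small function, which is handy for documenting the reasoning in a design doc or a test. The function and parameter names are illustrative; the order of the checks follows the flowchart.

```python
def choose_vector_db(
    uses_postgres: bool,
    prototyping: bool,
    needs_multimodal_or_relations: bool,
) -> str:
    """Walk the flowchart top to bottom and return the first match."""
    if uses_postgres:
        return "pgvector"    # SQL joins + vectors in the database you already run
    if prototyping:
        return "Chroma"      # zero setup, fast local iteration
    if needs_multimodal_or_relations:
        return "Weaviate"    # graph + vector
    return "Pinecone"        # managed, scalable default

print(choose_vector_db(uses_postgres=False, prototyping=True, needs_multimodal_or_relations=False))
```

Because the checks are ordered, an existing Postgres stack wins even for a prototype; reorder them if your priorities differ.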


Section 10

Performance Benchmarks — What the Numbers Say

Benchmarking vector databases is genuinely hard — results depend on hardware, dataset size, index parameters, query concurrency, and recall tolerance. The table below reflects community benchmarks on 1M vectors at 1,536 dimensions (representative of OpenAI embeddings), single-node, non-filtered queries.

Database | QPS (queries/sec) | p99 Latency | Recall@10 | Index Build (1M) | RAM (1M vecs)
🌲 Pinecone | ~1,500+ | <20 ms | ~99% | Managed | Managed
🔮 Weaviate HNSW | ~800–1,200 | 20–50 ms | ~97–99% | ~25 min | ~8 GB
🎨 Chroma (hnswlib) | ~200–500 | 40–80 ms | ~97% | ~8 min | ~6 GB
🐘 pgvector HNSW | ~300–700 | 30–60 ms | ~98% | ~30 min | ~10 GB
🐘 pgvector IVFFlat | ~500–900 | <30 ms | ~90–95% | ~5 min | ~4 GB
⚠️
Benchmark Caveats — Read Before Concluding

All of these numbers change significantly once metadata filtering is involved. A condition such as WHERE category = 'X' shrinks the set of eligible vectors, and an index that filters after the ANN scan (pgvector's IVFFlat is the usual example) can discard most of its candidates and lose recall rapidly. Pinecone and Weaviate handle filtered ANN significantly better than pgvector IVFFlat. HNSW-based indexes are more robust to filtering but use more RAM. Always benchmark on your own data shape and filter pattern.
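When you benchmark on your own data, Recall@k is the number to compute alongside latency: the share of the true top-k neighbours (from an exact search) that the ANN index also returned. A minimal sketch, with hypothetical function names and toy document ids:

```python
def recall_at_k(approx_ids: list[str], exact_ids: list[str], k: int) -> float:
    """Fraction of the true top-k neighbours that the ANN index also returned."""
    truth = set(exact_ids[:k])
    if not truth:
        return 0.0
    found = truth & set(approx_ids[:k])
    return len(found) / len(truth)

def mean_recall(runs: list[tuple[list[str], list[str]]], k: int) -> float:
    """Average Recall@k over (approximate, exact) result pairs, one pair per query."""
    return sum(recall_at_k(approx, exact, k) for approx, exact in runs) / len(runs)

runs = [
    (["d1", "d2", "d4"], ["d1", "d2", "d3"]),  # ANN missed d3: 2 of 3
    (["d7", "d8", "d9"], ["d9", "d8", "d7"]),  # same set, order ignored: 3 of 3
]
print(round(mean_recall(runs, k=3), 4))  # 0.8333
```

Run the same queries with and without your production filters to see how much recall the filter costs.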


Section 11

Full RAG Architecture — The Complete Picture

How a Question Becomes an Intelligent Answer
Your company has 50,000 internal wiki pages. A new employee asks: "What is our policy on remote work for contractors in Germany?" The LLM alone cannot answer — it doesn't know your company's internal policies, and even if it did, the context window can't hold 50,000 pages. Here is what happens with RAG:
01
📄 Ingestion — Index Your Knowledge Base
All 50,000 wiki pages are chunked into paragraphs (~500 tokens each), embedded via text-embedding-3-small, and upserted into the vector database. This happens once, then incrementally as pages change. The vector DB now holds ~500,000 embedding vectors.
02
❓ Query — Embed the User's Question
The question "What is our remote work policy for German contractors?" is embedded into a 1,536-dimensional vector using the same embedding model. This vector captures the semantic meaning of the query.
03
🔍 Retrieval — ANN Search
The vector DB performs an ANN search across 500,000 vectors in <50 ms and returns the top-5 most semantically similar chunks. These chunks are likely to be the relevant HR policy paragraphs, even if the exact words differ from the query.
04
🧩 Augmentation — Build the Prompt
The top-5 retrieved chunks are injected into the LLM prompt as context: "Answer the question using only the information in the following context: [chunk 1] [chunk 2] ... User question: What is our remote work policy for German contractors?"
05
🤖 Generation — LLM Synthesises the Answer
The LLM (GPT-4o, Claude, Gemini) reads the grounded context and generates a specific answer that can cite its source chunks. Hallucination is reduced, though not eliminated, because the model is instructed to answer only from the retrieved text.
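The augmentation step is plain string assembly. Here is a minimal sketch of a prompt builder that follows the wording of step 04; `build_prompt` and the sample chunks are hypothetical, and a production prompt would usually add instructions for what to do when the context does not contain the answer.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Inject retrieved chunks into the prompt ahead of the user's question."""
    context = "\n\n".join(
        f"[chunk {i}] {text}" for i, text in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the information in the following context:\n\n"
        f"{context}\n\n"
        f"User question: {question}"
    )

retrieved = [
    "Contractors based in Germany may work remotely up to three days per week.",
    "Remote work requests are approved by the contractor's line manager.",
]
prompt = build_prompt("What is our remote work policy for German contractors?", retrieved)
print(prompt)
```

Numbering the chunks gives the model something to cite, which makes the generated answer easier to verify against its sources.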

Section 12

Golden Rules — Vector Database Production Checklist

🏆 Non-Negotiable Rules for Production Vector Search
1
Use the same embedding model for ingestion and retrieval. If you embed documents with text-embedding-3-small, you must query with the same model. Different models produce incompatible vector spaces: if the dimensions differ the query fails, and if they happen to match it silently returns garbage results with no error message.
2
Chunk carefully: chunk size is a hyperparameter you cannot ignore. Too-small chunks (50 tokens) lose context. Too-large chunks (2,000 tokens) dilute the embedding signal. A common starting point for most RAG applications is 300–600 tokens with a 50-token overlap between consecutive chunks; tune it against your own retrieval metrics.
3
Store the original text alongside the vector — always. The vector is a lossy compressed representation. You need the original text to put in the LLM context window. Always store {"text": "...", "embedding": [...]} together.
4
Measure recall, not just latency. A fast vector search that returns the wrong results is worse than useless. Build a small evaluation set of (query, expected_chunk) pairs and measure Recall@5 and Recall@10 before deploying. Target >90% Recall@5.
5
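For pgvector: raise hnsw.ef_search at query time, for example with SET LOCAL hnsw.ef_search = 100 inside the same transaction as the query (SET LOCAL has no effect outside a transaction). The default ef_search = 40 trades recall for speed. A higher value improves recall at the cost of latency, so measure both on your own data before settling on a number.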
6
Don't embed metadata into the vector — filter it. Some developers embed text like "Category: Finance. Author: John." into the chunk before embedding. This pollutes the semantic space with structural noise. Store metadata as structured fields and use the database's native filter mechanism to apply it at search time.
7
Prototype with Chroma, graduate to Pinecone or pgvector. Build your RAG pipeline with Chroma locally. Once the logic is proven, swap the vector DB client for Pinecone (for scale and ops simplicity) or pgvector (if you're already paying for Postgres). The core RAG logic remains identical — only the DB client changes.
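Rule 2 is the one most often implemented by hand, so here is a minimal sketch of fixed-size chunking with overlap. It operates on an already tokenised list; in a real pipeline you would tokenise with the embedding model's own tokenizer, and the defaults (400 tokens, 50 overlap) are assumptions picked from the range mentioned above.

```python
def chunk_tokens(tokens: list[str], chunk_size: int = 400, overlap: int = 50) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens with the previous window."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks

tokens = [f"t{i}" for i in range(1000)]   # stand-in for a tokenised wiki page
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])           # [400, 400, 300]
print(chunks[0][-50:] == chunks[1][:50])  # True: consecutive chunks share 50 tokens
```

Keep the chunk's original text and its source metadata next to the resulting vector, as rule 3 requires.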