
Comparing LLM APIs: OpenAI vs Anthropic vs Gemini vs Open Source

A practical, code-first comparison of the four major LLM API families — OpenAI (GPT-4o, o3), Anthropic (Claude), Google Gemini, and open-source models (Llama, Mistral)

Section 01

The Story That Sets the Stage

The Four Chefs — Choosing the Right Kitchen for Your Dish
Imagine you need a meal cooked. You have four chefs available: Chef OpenAI — renowned, expensive, polished, with a long menu. Chef Anthropic — thoughtful, safety-conscious, writes beautifully. Chef Google (Gemini) — encyclopaedic, always connected to the internet, fast with multimodal dishes.

And then there's the Open Kitchen — no single chef, just open-source recipes anyone can cook. Free to customise, but you're responsible for the stove, the gas bill, and keeping the knives sharp.

Choosing the wrong chef doesn't mean bad food; it means a mismatched kitchen. A task needing deep reasoning shouldn't go to the cheapest option. A simple classification job shouldn't burn through GPT-4o output tokens at $10 per million when GPT-4o mini charges $0.60. This tutorial helps you match the task to the kitchen.

The modern AI landscape offers developers a rich but confusing set of LLM API choices. OpenAI, Anthropic, Google Gemini, and open-source models (via Hugging Face, Ollama, Together AI, etc.) each have distinct strengths, pricing models, safety postures, context windows, and API designs. Understanding these differences is not just academic — it directly impacts cost, latency, reliability, and the quality of what you build.

🧭
What This Tutorial Covers

We will compare API design, pricing, context windows, capabilities, safety, and practical use-cases for OpenAI, Anthropic Claude, Google Gemini, and leading open-source options — with Python code examples for each, decision tables, and real-world scenario guidance.


Section 02

The Four Players — Quick Landscape Map

🟢
OpenAI
GPT-4o · o3 · o1 · GPT-4o mini
The pioneer. Widest ecosystem, largest third-party tooling, best function-calling support, and the most mature API. Models range from fast+cheap (GPT-4o mini) to frontier reasoning (o3). Also offers fine-tuning, Assistants API, and Batch API.
🟣
Anthropic
Claude Opus 4 · Sonnet 4 · Haiku 3.5
Safety-first, exceptionally long context (up to 200K tokens), renowned for instruction-following, nuanced writing, and complex reasoning. The preferred choice for document analysis, coding assistance, and enterprise use-cases requiring careful, reliable outputs.
🔵
Google Gemini
Gemini 2.5 Pro · Flash · Flash-8B
Native multimodal (text, image, audio, video, code). Deep Google Search grounding, massive context (up to 1M tokens in Gemini 1.5 Pro). Strong for retrieval-augmented tasks, real-time information, and multimedia applications via Vertex AI or the Gemini API.
⚙️
Open Source (Llama 3, Mistral, Qwen, DeepSeek, Phi-4)
Self-hosted · Via API: Together AI, Groq, Replicate, Ollama
No vendor lock-in, full data privacy, zero per-token cost (only infra cost), fine-tuning freedom. Llama 3.3 70B rivals GPT-4 on many benchmarks. Trade-offs: you manage deployment, scaling, and model updates. Best for privacy-sensitive workloads, cost-sensitive high-volume tasks, and research.

Section 03

API Design — Hello World Across All Four

The best way to feel an API's design philosophy is to run the same task through each one. Below is a basic chat completion — "Explain transformers in one sentence" — implemented in each SDK.

🔧 Setup — Install the SDKs
OpenAI
pip install openai
Anthropic
pip install anthropic
Gemini
pip install google-generativeai
Open Source
pip install openai — most open-source hosts are OpenAI-compatible

OpenAI — GPT-4o

from openai import OpenAI

client = OpenAI(api_key="sk-...")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a concise data science tutor."},
        {"role": "user",   "content": "Explain transformers in one sentence."}
    ],
    max_tokens=100,
    temperature=0.7
)

print(response.choices[0].message.content)
# Usage stats
print(f"Tokens used: {response.usage.total_tokens}")
OUTPUT
Transformers are neural networks that use self-attention mechanisms to process sequences in parallel, enabling the model to weigh the importance of each token relative to every other token simultaneously, making them far more effective at long-range dependencies than RNNs.
Tokens used: 87

Anthropic — Claude

import anthropic

client = anthropic.Anthropic(api_key="sk-ant-...")

message = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=100,
    system="You are a concise data science tutor.",  # system is TOP-LEVEL, not in messages
    messages=[
        {"role": "user", "content": "Explain transformers in one sentence."}
    ]
)

print(message.content[0].text)
# Usage stats
print(f"Input tokens: {message.usage.input_tokens}")
print(f"Output tokens: {message.usage.output_tokens}")
OUTPUT
Transformers are deep learning models built around the self-attention mechanism, which allows them to dynamically focus on relevant parts of an input sequence regardless of distance, enabling parallel processing and exceptional performance on language tasks.
Input tokens: 28
Output tokens: 44
🔑
Key Anthropic API Difference

Anthropic separates the system prompt from messages at the top level. OpenAI embeds it inside the messages array as {"role": "system"}. This distinction matters when migrating code between providers.
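When migrating, the system prompt has to move out of the messages list. Here is a minimal sketch of that conversion, assuming every message carries plain-string content. The helper name to_anthropic is hypothetical and not part of either SDK.

```python
def to_anthropic(messages):
    """Split OpenAI-style messages into Anthropic's (system, messages) pair."""
    # Collect every system message; Anthropic takes them as one top-level string
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    # Everything else (user / assistant turns) stays in the messages list
    chat = [m for m in messages if m["role"] != "system"]
    return "\n\n".join(system_parts), chat


openai_messages = [
    {"role": "system", "content": "You are a concise data science tutor."},
    {"role": "user", "content": "Explain transformers in one sentence."},
]

system, chat = to_anthropic(openai_messages)
print(system)
print(chat)
```

The returned pair can then be passed as system=... and messages=... to client.messages.create. If the system string comes back empty, it is probably cleaner to omit the argument altogether.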

Google Gemini

import google.generativeai as genai

genai.configure(api_key="AIza...")

model = genai.GenerativeModel(
    model_name="gemini-2.0-flash",
    system_instruction="You are a concise data science tutor."
)

response = model.generate_content("Explain transformers in one sentence.")

print(response.text)
# Token usage
print(f"Total tokens: {response.usage_metadata.total_token_count}")
OUTPUT
Transformers are sequence models that use multi-head self-attention to capture contextual relationships between all positions in an input simultaneously, replacing recurrence with parallelism and enabling breakthrough performance across NLP and beyond.
Total tokens: 71

Open Source via Together AI (Llama 3.3 70B — OpenAI-compatible)

from openai import OpenAI

# Together AI, Groq, Anyscale etc. all accept the OpenAI client
client = OpenAI(
    api_key="together-...",
    base_url="https://api.together.xyz/v1"   # only change needed
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
    messages=[
        {"role": "system", "content": "You are a concise data science tutor."},
        {"role": "user",   "content": "Explain transformers in one sentence."}
    ],
    max_tokens=100
)

print(response.choices[0].message.content)
OUTPUT
Transformers are a type of deep learning architecture that uses self-attention mechanisms to process sequential data, such as text, by allowing the model to focus on different parts of the input sequence simultaneously, enabling state-of-the-art performance in NLP tasks.

Section 04

Pricing — The Real Cost of Intelligence

The Taxi Meter Metaphor
Every LLM API is a taxi that charges per mile — except the meter runs on tokens, not miles. Input tokens (your prompt) and output tokens (the model's response) are often priced differently, with output costing 3–5× more because generation is computationally heavier than encoding. A 10,000-document summarisation pipeline that sends 500 tokens and receives 100 tokens per document will have very different economics than a chatbot sending 50 tokens and receiving 300. Always model your actual token flow before committing to a provider.
| Provider & Model | Input (per 1M tokens) | Output (per 1M tokens) | Context Window | Best For |
| --- | --- | --- | --- | --- |
| OpenAI GPT-4o | $2.50 | $10.00 | 128K | General, function calling, agents |
| OpenAI GPT-4o mini | $0.15 | $0.60 | 128K | High-volume, cost-sensitive |
| OpenAI o3 | $10.00 | $40.00 | 200K | Complex reasoning, math, code |
| Anthropic Claude Opus 4 | $15.00 | $75.00 | 200K | Hardest tasks, long docs, research |
| Anthropic Claude Sonnet 4 | $3.00 | $15.00 | 200K | Balanced quality + cost |
| Anthropic Claude Haiku 3.5 | $0.80 | $4.00 | 200K | Fast, cheap, high volume |
| Gemini 2.5 Pro | $1.25 | $10.00 | 1M | Massive context, multimodal |
| Gemini 2.0 Flash | $0.10 | $0.40 | 1M | Fastest, cheapest Gemini |
| Llama 3.3 70B (Together AI) | $0.88 | $0.88 | 128K | Open source quality at low cost |
| Mistral Large (Mistral API) | $2.00 | $6.00 | 128K | European data residency, GDPR |
| Self-hosted (Ollama / vLLM) | $0.00 | $0.00 | Depends on model | Privacy, infinite volume, research |
⚠️
Pricing Changes Rapidly

LLM prices drop roughly 30–50% every 12 months. Always verify current pricing on the provider's official page before budgeting a production project. The table above reflects approximate mid-2025 rates.

Python: A Token Cost Estimator

import tiktoken

# Pricing per 1M tokens (input, output)
PRICING = {
    "gpt-4o":             (2.50,  10.00),
    "gpt-4o-mini":       (0.15,   0.60),
    "claude-sonnet-4-5": (3.00,  15.00),
    "claude-haiku-3-5":  (0.80,   4.00),
    "gemini-2.0-flash":  (0.10,   0.40),
    "llama-3.3-70b":     (0.88,   0.88),
}

def estimate_cost(prompt: str, expected_output_tokens: int, model: str) -> dict:
    # cl100k_base is an OpenAI tokenizer, so counts for other providers are only approximate
    enc = tiktoken.get_encoding("cl100k_base")
    input_tokens = len(enc.encode(prompt))
    price_in, price_out = PRICING[model]
    cost_in  = (input_tokens          / 1_000_000) * price_in
    cost_out = (expected_output_tokens / 1_000_000) * price_out
    return {
        "model":        model,
        "input_tokens": input_tokens,
        "cost_input":   round(cost_in,  6),
        "cost_output":  round(cost_out, 6),
        "total_usd":    round(cost_in + cost_out, 6),
    }

prompt = "Summarise the following 2000-word research paper in 5 bullet points: ..."

for model in PRICING:
    result = estimate_cost(prompt, expected_output_tokens=150, model=model)
    print(f"{result['model']:25s}  ${result['total_usd']:.6f}")
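OUTPUT (approximate: the short placeholder prompt adds only a handful of input tokens, so the 150 output tokens dominate)
gpt-4o              ≈ $0.0015
gpt-4o-mini         ≈ $0.00009
claude-sonnet-4-5   ≈ $0.0023
claude-haiku-3-5    ≈ $0.0006
gemini-2.0-flash    ≈ $0.00006
llama-3.3-70b       ≈ $0.00015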
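The estimator above prices a single call. To compare whole workloads, such as the two from the taxi-meter example, a rough sketch follows. Prices are copied from the table in this section. The chatbot's volume of 10,000 calls is an assumption added here so that the two workloads are comparable, and the dictionary keys are just labels.

```python
# USD per 1M tokens (input, output), taken from the pricing table in this section
PRICES = {
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
    "claude-sonnet-4": (3.00, 15.00),
    "gemini-2.0-flash": (0.10, 0.40),
}


def workload_cost(calls, tokens_in, tokens_out, model):
    """Total USD for `calls` requests with fixed per-call token counts."""
    price_in, price_out = PRICES[model]
    return calls * (tokens_in * price_in + tokens_out * price_out) / 1_000_000


# Input-heavy pipeline vs output-heavy chatbot (call count of the chatbot is assumed)
summarisation = dict(calls=10_000, tokens_in=500, tokens_out=100)
chatbot = dict(calls=10_000, tokens_in=50, tokens_out=300)

for model in PRICES:
    s = workload_cost(model=model, **summarisation)
    c = workload_cost(model=model, **chatbot)
    print(f"{model:18s} summarisation ${s:8.2f}   chatbot ${c:8.2f}")
```

On these assumptions GPT-4o comes to about $22.50 for the summarisation job and $31.25 for the chatbot, even though the chatbot moves fewer tokens overall. The difference comes from the output price.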

Section 05

Context Windows — How Much Can It Remember?

The context window is how many tokens the model can "see" in one call — your prompt, chat history, retrieved documents, and the model's own response combined. A small context window forces chunking; a large one allows whole-book analysis in a single call.

📏 Context Window Comparison (Tokens)
Gemini 1.5 Pro: 1,000,000
Gemini 2.5 Pro: 1,000,000
Claude (all): 200K
OpenAI o3: 200K
GPT-4o / mini: 128K
Llama 3.3 70B: 128K

Note: effective performance may degrade near context limits ("lost in the middle" problem).

Rough Token Rule: 1 token ≈ 0.75 words, so 1,000 tokens ≈ 750 words ≈ 1.5 pages of A4 text.
200K Tokens Fits: ~150,000 words. An entire novel, a large codebase, or 300+ research paper pages.
1M Tokens Fits: ~750,000 words. Roughly two-thirds of the Harry Potter series (about 1.1 million words in total), or a massive codebase with tests.
Practical Tip: Stay under 80%. Performance often degrades near the limit; budget headroom for the response.
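The 80% guideline can be turned into a pre-flight check. This is a rough sketch that uses the 0.75 words-per-token rule of thumb rather than a real tokenizer. The dictionary keys, the reserved_output default and the function names are illustrative assumptions.

```python
# Context windows in tokens, as listed in this section (keys are just labels)
CONTEXT_WINDOWS = {
    "gpt-4o": 128_000,
    "claude-sonnet-4": 200_000,
    "gemini-2.5-pro": 1_000_000,
}


def estimate_tokens(text):
    """Rule-of-thumb estimate: 1 token is about 0.75 words."""
    return round(len(text.split()) / 0.75)


def fits(text, model, reserved_output=1_000, headroom=0.80):
    """True if prompt plus reserved response stays under `headroom` of the window."""
    budget = CONTEXT_WINDOWS[model] * headroom
    return estimate_tokens(text) + reserved_output <= budget


document = "word " * 90_000  # stand-in for a 90,000-word document
for model in CONTEXT_WINDOWS:
    print(model, fits(document, model))
```

With this stand-in document (about 120,000 estimated tokens) the check fails for the 128K window and passes for the 200K and 1M windows. For real budgeting, a provider-specific token counter would be more accurate than the word heuristic.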

Section 06

Streaming — Making Your App Feel Instant

Without streaming, users stare at a loading spinner until the entire response is generated. With streaming, tokens appear word-by-word — like watching someone type. This is a UX requirement for any chat interface. All four providers support streaming.

OpenAI Streaming

from openai import OpenAI

client = OpenAI(api_key="sk-...")

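stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write 3 fun facts about Python."}],
    stream=True  # returns an iterator of chunks instead of one full response
)
for chunk in stream:
    # Some chunks carry no text (for example the final one), so guard before printing
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)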

Anthropic Streaming

import anthropic

client = anthropic.Anthropic(api_key="sk-ant-...")

with client.messages.stream(
    model="claude-sonnet-4-5",
    max_tokens=300,
    messages=[{"role": "user", "content": "Write 3 fun facts about Python."}]
) as stream:
    for text in stream.text_stream:  # text_stream is an iterator attribute, not a method
        print(text, end="", flush=True)

Gemini Streaming

import google.generativeai as genai

genai.configure(api_key="AIza...")
model = genai.GenerativeModel("gemini-2.0-flash")

for chunk in model.generate_content(
    "Write 3 fun facts about Python.",
    stream=True
):
    print(chunk.text, end="", flush=True)
💡
Streaming API Note

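Anthropic's official Python SDK offers a messages.stream() context manager whose text_stream attribute yields text as it arrives. OpenAI's chat completions endpoint and Gemini's generate_content both accept a stream=True flag and return an iterator of chunks. All approaches yield tokens as they arrive with minimal latency.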
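A chat interface usually needs the streamed text twice: once on screen as it arrives, and once as a complete string for the conversation history. Here is a provider-neutral sketch, assuming the provider-specific loop has already been reduced to an iterator of text pieces. Both collect_stream and fake_stream are hypothetical names used only for illustration.

```python
from typing import Callable, Iterable, Optional


def collect_stream(chunks: Iterable[Optional[str]], on_text: Callable[[str], None]) -> str:
    """Forward each non-empty piece to `on_text` and return the full text."""
    parts = []
    for text in chunks:
        if not text:  # skip empty or None pieces, which some chunks carry
            continue
        on_text(text)
        parts.append(text)
    return "".join(parts)


def fake_stream():
    # Stand-in for a real provider stream, so the sketch runs offline
    yield from ["Python ", "is named ", "", "after Monty Python."]


shown = []
full = collect_stream(fake_stream(), shown.append)
print(full)
```

In a real app, on_text might be a function that prints with end="" and flush=True, or one that pushes the piece over a websocket.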


Section 07

Function Calling & Tool Use — Making LLMs Act

Function calling (OpenAI/Gemini) or tool use (Anthropic) lets the model decide when to call an external function and what arguments to pass. This is the backbone of AI agents — allowing models to search the web, query databases, call APIs, or run code.

🔩 The Pattern — Always 3 Steps
Step 1
Define the tool/function schema (name, description, parameters as JSON Schema)
Step 2
Send to the model — it returns a structured call request when appropriate
Step 3
Execute the real function, send result back, model generates final response
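Step 3 is ordinary application code, and it helps to keep it separate from any one SDK. The sketch below shows one possible way to dispatch a requested call to a local function. TOOL_REGISTRY and run_tool are hypothetical names, and get_stock_price is a stub rather than a real market-data lookup.

```python
import json


def get_stock_price(ticker):
    # Stub standing in for a real market-data lookup
    return {"ticker": ticker, "price": 213.49, "currency": "USD"}


# Maps the tool names advertised to the model onto local callables
TOOL_REGISTRY = {"get_stock_price": get_stock_price}


def run_tool(name, arguments_json):
    """Execute a requested tool and return a JSON string to send back to the model."""
    if name not in TOOL_REGISTRY:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        args = json.loads(arguments_json)       # model-produced arguments may be malformed
        result = TOOL_REGISTRY[name](**args)    # wrong or missing keys raise TypeError
    except (json.JSONDecodeError, TypeError) as exc:
        return json.dumps({"error": str(exc)})
    return json.dumps(result)


print(run_tool("get_stock_price", '{"ticker": "AAPL"}'))
```

Returning errors as JSON instead of raising lets the model see what went wrong and retry with corrected arguments, which tends to be more robust in agent loops.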

OpenAI Function Calling

from openai import OpenAI
import json

client = OpenAI(api_key="sk-...")

tools = [{
    "type": "function",
    "function": {
        "name": "get_stock_price",
        "description": "Get the current stock price for a given ticker symbol.",
        "parameters": {
            "type": "object",
            "properties": {
                "ticker": {"type": "string", "description": "Stock ticker e.g. AAPL"}
            },
            "required": ["ticker"]
        }
    }
}]

messages = [{"role": "user", "content": "What is the current price of Apple stock?"}]
response = client.chat.completions.create(model="gpt-4o", tools=tools, messages=messages)

tool_call = response.choices[0].message.tool_calls[0]
args = json.loads(tool_call.function.arguments)
print(f"Model wants to call: {tool_call.function.name}({args})")
# → Model wants to call: get_stock_price({'ticker': 'AAPL'})

# Execute the real function and return result back to model
def get_stock_price(ticker): return {"ticker": ticker, "price": 213.49, "currency": "USD"}

messages.append(response.choices[0].message)
messages.append({
    "role": "tool",
    "tool_call_id": tool_call.id,
    "content": json.dumps(get_stock_price(**args))
})

final = client.chat.completions.create(model="gpt-4o", messages=messages)
print(final.choices[0].message.content)
OUTPUT
Model wants to call: get_stock_price({'ticker': 'AAPL'})
The current price of Apple (AAPL) is $213.49 USD.

Anthropic Tool Use

import anthropic, json

client = anthropic.Anthropic(api_key="sk-ant-...")

tools = [{
    "name": "get_stock_price",
    "description": "Get the current stock price for a ticker symbol.",
    "input_schema": {  # "input_schema" — Anthropic's term (not "parameters")
        "type": "object",
        "properties": {
            "ticker": {"type": "string", "description": "e.g. AAPL"}
        },
        "required": ["ticker"]
    }
}]

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=500,
    tools=tools,
    messages=[{"role": "user", "content": "What is Apple's stock price?"}]
)

tool_use_block = [b for b in response.content if b.type == "tool_use"][0]
print(f"Calling: {tool_use_block.name}({tool_use_block.input})")
OUTPUT
Calling: get_stock_price({'ticker': 'AAPL'})
🔍
Key Difference: Schema Field Name

OpenAI uses "parameters" for the tool schema definition. Anthropic uses "input_schema". This single field name difference trips up every developer migrating between the two — bookmark it.
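Because the two schemas differ mainly in nesting and in that one field name, the conversion can be mechanical. Here is a minimal sketch, assuming a tool defined in the OpenAI chat-completions shape shown earlier. The helper name openai_tool_to_anthropic is hypothetical.

```python
def openai_tool_to_anthropic(tool):
    """Convert an OpenAI-style tool definition into Anthropic's tool shape."""
    fn = tool["function"]  # OpenAI nests the definition under "function"
    return {
        "name": fn["name"],
        "description": fn.get("description", ""),
        "input_schema": fn["parameters"],  # same JSON Schema, different key
    }


openai_tool = {
    "type": "function",
    "function": {
        "name": "get_stock_price",
        "description": "Get the current stock price for a given ticker symbol.",
        "parameters": {
            "type": "object",
            "properties": {"ticker": {"type": "string", "description": "Stock ticker e.g. AAPL"}},
            "required": ["ticker"],
        },
    },
}

anthropic_tool = openai_tool_to_anthropic(openai_tool)
print(anthropic_tool["name"], list(anthropic_tool))
```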


Section 08

Capability Matrix — What Can Each Do?

| Capability | OpenAI | Anthropic | Gemini | Open Source |
| --- | --- | --- | --- | --- |
| Text Generation | ✅ Excellent | ✅ Excellent | ✅ Excellent | ✅ Good–Excellent |
| Code Generation | ✅ Best-in-class | ✅ Excellent | ⚡ Very Good | ⚡ Good (DeepSeek) |
| Image Input | ✅ GPT-4o | ✅ All Claude 3+ | ✅ Native multimodal | ⚡ LLaVA, Llama 3.2 |
| Image Generation | ✅ DALL-E 3 | ❌ No | ✅ Imagen 3 | ✅ Stable Diffusion |
| Audio Input | ✅ Whisper / GPT-4o | ❌ No | ✅ Native | ⚡ Whisper |
| Video Input | ❌ No | ❌ No | ✅ Native | ❌ Limited |
| Web Search (grounding) | ✅ Web search tool | ⚡ Web search tool | ✅ Google Search | ❌ Needs integration |
| Function Calling | ✅ Mature | ✅ Excellent | ✅ Good | ⚡ Model-dependent |
| Structured JSON Output | ✅ Strict mode | ✅ Reliable | ✅ Reliable | ⚡ Varies |
| Fine-tuning | ✅ GPT-4o mini, 3.5 | ❌ No public API | ✅ Via Vertex AI | ✅ Full freedom |
| Batch API (async cheap) | ✅ 50% discount | ✅ Message Batches | ⚡ Via Vertex | ⚡ Depends on host |
| Data Privacy / On-Prem | ⚡ Enterprise tier | ⚡ Enterprise tier | ⚡ Vertex AI | ✅ Full control |

Section 09

Safety & Alignment — Who Refuses What, and Why?

The Overly Cautious Librarian vs. the Reckless One
Imagine two librarians. One refuses to hand you a book on forensic science because "it could theoretically help someone". The other hands you anything without even reading the spine. Neither is ideal — you want a librarian who applies contextual judgement: answering research questions openly, but declining to provide operational instructions for harm.

That is the challenge every LLM provider wrestles with. Anthropic's Constitutional AI and OpenAI's RLHF-based alignment both aim for the thoughtful middle ground — but they are calibrated differently, and you will notice the differences in production.
🟢
OpenAI
RLHF + Policy-based
Broadly permissive for creative and professional tasks. Refusals can feel inconsistent across model versions. The Moderation API allows pre-checking inputs. GPT-4o has a lower refusal rate than GPT-4 Turbo for edge content.
🟣
Anthropic Claude
Constitutional AI
Trains models using a set of principles ("constitution") for self-critique. More consistent refusals, more nuanced reasoning before refusing, and a lower tendency to hallucinate malicious intent into benign requests. Preferred for legal, medical, and compliance-sensitive applications.
🔵
Gemini
Responsible AI policies
Google's safety filters are aggressively tuned, and sometimes over-tuned for consumer markets. Vertex AI deployments give more control. The Gemini API's safety settings let developers adjust blocking thresholds per harm category.
⚙️
Open Source Models
Community RLHF + Uncensored variants exist
Alignment varies wildly. Meta's Llama 3 is well-aligned. Community fine-tunes exist with reduced safety filters — useful for specialised research but requiring careful governance in production. Self-hosting means you own the safety responsibility end-to-end.

Section 10

Structured JSON Output — Reliable Extraction

Production pipelines rarely need prose — they need structured data. Extracting a {"name": ..., "price": ..., "category": ...} object reliably from model output requires structured output features.

OpenAI — Strict JSON Mode with Pydantic

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI(api_key="sk-...")

class Product(BaseModel):
    name: str
    price_usd: float
    category: str
    in_stock: bool

response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": "Extract product info: Sony WH-1000XM5 headphones, $299.99, electronics, available."
    }],
    response_format=Product   # Guaranteed schema adherence
)

product = response.choices[0].message.parsed
print(f"Name:     {product.name}")
print(f"Price:    ${product.price_usd}")
print(f"Category: {product.category}")
print(f"In Stock: {product.in_stock}")
OUTPUT
Name:     Sony WH-1000XM5
Price:    $299.99
Category: electronics
In Stock: True

Anthropic — JSON via Prompt Engineering

import anthropic, json

client = anthropic.Anthropic(api_key="sk-ant-...")

message = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=300,
    system="""You are a data extraction assistant.
Always respond with valid JSON only — no prose, no markdown fences.
Schema: {"name": str, "price_usd": float, "category": str, "in_stock": bool}""",
    messages=[{
        "role": "user",
        "content": "Extract: Sony WH-1000XM5 headphones, $299.99, electronics, available."
    }]
)

product = json.loads(message.content[0].text)
print(product)
OUTPUT
{'name': 'Sony WH-1000XM5', 'price_usd': 299.99, 'category': 'electronics', 'in_stock': True}
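Prompt-only JSON carries no schema guarantee, so it is worth validating the reply before it enters a pipeline. The sketch below is one possible defensive parser using only the standard library. parse_product, SCHEMA and FENCE are hypothetical names, and the fence handling is a precaution in case a model wraps its answer in a markdown fence despite the instruction.

```python
import json

FENCE = "`" * 3  # markdown code fence marker
SCHEMA = {"name": str, "price_usd": float, "category": str, "in_stock": bool}


def parse_product(reply):
    """Parse a model reply into a dict and check it against SCHEMA."""
    text = reply.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (it may carry a "json" tag) and the closing fence
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit(FENCE, 1)[0]
    data = json.loads(text)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for key, expected in SCHEMA.items():
        if key not in data:
            raise ValueError(f"missing field: {key}")
        value = data[key]
        is_bool = isinstance(value, bool)
        if expected is float and isinstance(value, int) and not is_bool:
            value = float(value)  # JSON has one number type, so accept 299 for 299.0
        if not isinstance(value, expected) or (is_bool and expected is not bool):
            raise ValueError(f"wrong type for {key}: {value!r}")
        data[key] = value
    return data


reply = '{"name": "Sony WH-1000XM5", "price_usd": 299.99, "category": "electronics", "in_stock": true}'
print(parse_product(reply))
```

A failed validation is a natural trigger for a single retry with the error message appended to the prompt.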

Section 11

Batch API — Cheap Async Processing at Scale

For offline workloads such as classifying 10,000 reviews or generating 5,000 product descriptions, the Batch APIs of OpenAI and Anthropic offer roughly a 50% cost reduction in exchange for asynchronous processing, with results typically returned within 24 hours.

Anthropic Message Batches

import anthropic

client = anthropic.Anthropic(api_key="sk-ant-...")

reviews = [
    "Great product, arrived quickly!",
    "Terrible quality, broke after one day.",
    "Average experience, nothing special.",
]

# Build a batch of requests (plain dicts with a custom_id and the usual message params)
requests = [
    {
        "custom_id": f"review-{i}",
        "params": {
            "model": "claude-3-5-haiku-20241022",
            "max_tokens": 10,
            "system": "Classify sentiment as POSITIVE, NEGATIVE, or NEUTRAL. One word only.",
            "messages": [{"role": "user", "content": review}]
        }
    }
    for i, review in enumerate(reviews)
]

batch = client.messages.batches.create(requests=requests)
print(f"Batch ID: {batch.id}  Status: {batch.processing_status}")
# Poll client.messages.batches.retrieve(batch.id) until processing_status is "ended",
# then read the outputs with client.messages.batches.results(batch.id)
OUTPUT
Batch ID: msgbatch_01XFDUDYJgAACncnFsF6xhCE Status: in_progress
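OpenAI's Batch API takes its requests as a JSONL file instead, one request object per line. The sketch below only builds that file content for the same three reviews; the upload and batch-creation calls are left out so that it runs offline. The helper name build_batch_lines and the choice of gpt-4o-mini are assumptions made for illustration.

```python
import json

SYSTEM = "Classify sentiment as POSITIVE, NEGATIVE, or NEUTRAL. One word only."


def build_batch_lines(reviews, model="gpt-4o-mini"):
    """Return JSONL text with one chat-completion request per review."""
    lines = []
    for i, review in enumerate(reviews):
        request = {
            "custom_id": f"review-{i}",        # used to match results to inputs
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "max_tokens": 10,
                "messages": [
                    {"role": "system", "content": SYSTEM},
                    {"role": "user", "content": review},
                ],
            },
        }
        lines.append(json.dumps(request))      # json.dumps keeps each request on one line
    return "\n".join(lines)


reviews = [
    "Great product, arrived quickly!",
    "Terrible quality, broke after one day.",
    "Average experience, nothing special.",
]
jsonl = build_batch_lines(reviews)
print(jsonl.splitlines()[0])
```

The resulting text would then be written to a file, uploaded with purpose="batch", and submitted via client.batches.create with a 24h completion window. Check the current OpenAI documentation for the exact parameters before relying on this.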

Section 12

Multimodal — Sending Images to the Model

OpenAI Image Input
"type": "image_url"
Pass URL or base64 in message content array
Supports: JPEG, PNG, GIF, WEBP
Anthropic Image Input
"type": "image" with source object
Base64 data with a media_type field, or a URL source
Supports: JPEG, PNG, GIF, WEBP

OpenAI Vision

import base64, openai

client = openai.OpenAI(api_key="sk-...")

with open("chart.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",      "text": "What trend does this chart show?"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}
        ]
    }]
)
print(response.choices[0].message.content)

Anthropic Vision

import anthropic, base64

client = anthropic.Anthropic(api_key="sk-ant-...")

with open("chart.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

message = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=500,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": b64
                }
            },
            {"type": "text", "text": "What trend does this chart show?"}
        ]
    }]
)
print(message.content[0].text)
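The two payloads differ only in how the same base64 data is wrapped, so a small helper can build either one. This is an illustrative sketch: image_block is a hypothetical name, and the media type is guessed from the file extension with the standard mimetypes module.

```python
import base64
import mimetypes


def image_block(filename, data, provider):
    """Build the image content block for 'openai' or 'anthropic' from raw bytes."""
    media_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    b64 = base64.b64encode(data).decode()
    if provider == "openai":
        # OpenAI expects a data URI inside an image_url object
        return {"type": "image_url",
                "image_url": {"url": f"data:{media_type};base64,{b64}"}}
    if provider == "anthropic":
        # Anthropic expects a source object with separate media_type and data fields
        return {"type": "image",
                "source": {"type": "base64", "media_type": media_type, "data": b64}}
    raise ValueError(f"unknown provider: {provider}")


fake_png = b"\x89PNG"  # placeholder bytes so the sketch runs without a real file
print(image_block("chart.png", fake_png, "openai"))
print(image_block("chart.png", fake_png, "anthropic"))
```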

Section 13

Embeddings — Turning Text into Vectors

Embeddings are fixed-length numerical vectors that capture semantic meaning. They power semantic search, RAG (Retrieval-Augmented Generation), clustering, and recommendation systems. OpenAI and Gemini offer embedding endpoints directly. Anthropic does not, so pair Claude with a third-party embedding service (such as Voyage AI or OpenAI) or an open-source embedding model.

| Provider | Model | Dimensions | Cost / 1M tokens | Best For |
| --- | --- | --- | --- | --- |
| OpenAI | text-embedding-3-large | 3072 (configurable) | $0.13 | High-accuracy retrieval |
| OpenAI | text-embedding-3-small | 1536 | $0.02 | Balanced cost/accuracy |
| Google | text-embedding-004 | 768 | $0.00 (free tier) | Google stack, RAG |
| Anthropic | No embedding model | n/a | n/a | Use Voyage AI or OpenAI |
| Open Source | nomic-embed-text, bge-m3 | 768–1024 | $0.00 | Private data, no API call |

from openai import OpenAI
import numpy as np

client = OpenAI(api_key="sk-...")

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

texts = [
    "Machine learning automates pattern recognition.",
    "Deep learning uses neural networks with many layers.",
    "The Eiffel Tower is in Paris, France."
]

response = client.embeddings.create(
    model="text-embedding-3-small",
    input=texts
)

vectors = [item.embedding for item in response.data]

print("ML vs Deep Learning similarity:",
      round(cosine_similarity(vectors[0], vectors[1]), 4))
print("ML vs Eiffel Tower similarity:  ",
      round(cosine_similarity(vectors[0], vectors[2]), 4))
OUTPUT
ML vs Deep Learning similarity: 0.7834
ML vs Eiffel Tower similarity:   0.1102
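Semantic search is the same similarity computation applied to a query against many stored vectors, followed by a sort. Here is a toy sketch in pure Python, with hand-written 3-dimensional vectors standing in for real embeddings. The vectors and the top_k helper are made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the k best (score, index) pairs, highest similarity first."""
    scored = [(cosine(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    return sorted(scored, reverse=True)[:k]


# Toy vectors: documents 0 and 1 point roughly the same way as the query, document 2 does not
docs = [[0.9, 0.1, 0.0], [0.8, 0.3, 0.1], [0.0, 0.1, 0.9]]
query = [1.0, 0.2, 0.0]

for score, index in top_k(query, docs):
    print(index, round(score, 3))
```

In a real RAG setup the document vectors would come from an embedding endpoint and live in a vector store, but the ranking step works the same way.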

Section 14

Open Source — Running Models Locally with Ollama

The Home Kitchen
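Every restaurant (API provider) charges you for every meal. But some chefs have published their recipes (model weights) openly. With Ollama, you install those recipes on your own hardware and cook unlimited meals with no per-plate charge. The ingredients (compute) cost money, but there's no restaurant bill. Whether that pays off depends on your volume and on the price you are replacing. Assuming roughly 1,000 tokens per call, 50,000 calls a month is 50M tokens: about $5 a month at $0.10 per 1M tokens, but about $500 a month at $10 per 1M. Against the cheap provider a $300 GPU takes years to break even; against the expensive one it takes weeks, provided the card can actually run a model that is good enough for the task.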
# Terminal: pull a model
# ollama pull llama3.3:70b
# ollama serve  (starts local server at localhost:11434)

from openai import OpenAI

# Ollama exposes an OpenAI-compatible endpoint
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"   # required but not checked
)

response = client.chat.completions.create(
    model="llama3.3:70b",
    messages=[
        {"role": "system", "content": "You are a helpful data science tutor."},
        {"role": "user",   "content": "What is gradient descent?"}
    ],
    stream=True
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
The Best Part About Ollama

Because Ollama exposes an OpenAI-compatible API, any code written for GPT-4o works with Llama 3.3 locally by changing just two lines: base_url and model. This makes provider migration essentially zero-cost.
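A break-even estimate like the one in the home-kitchen analogy is sensitive to tokens per call and to the API price being replaced, so it is worth computing rather than guessing. Here is a rough sketch. The function name, the 1,000 tokens-per-call figure and the optional power cost are assumptions, and it ignores setup time and maintenance.

```python
def months_to_break_even(hardware_usd, calls_per_month, tokens_per_call,
                         api_price_per_1m, power_usd_per_month=0.0):
    """Months until hardware cost is recovered from avoided API spend."""
    api_monthly = calls_per_month * tokens_per_call / 1_000_000 * api_price_per_1m
    saving = api_monthly - power_usd_per_month
    if saving <= 0:
        return float("inf")  # self-hosting never pays off at this volume
    return hardware_usd / saving


# 50,000 calls/month at an assumed 1,000 tokens each, on a $300 GPU
cheap = months_to_break_even(300, 50_000, 1_000, api_price_per_1m=0.10)
pricey = months_to_break_even(300, 50_000, 1_000, api_price_per_1m=10.00)
print(f"vs $0.10/1M tokens: {cheap:.1f} months")
print(f"vs $10.00/1M tokens: {pricey:.1f} months")
```

Under these assumptions the pay-back period is about 60 months against a $0.10/1M provider and under one month against a $10/1M one, which is why the comparison price matters more than the call count.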


Section 15

Decision Framework — Which Provider For Which Task?

| Use Case | Recommended Provider | Recommended Model | Reason |
| --- | --- | --- | --- |
| General chatbot (production) | OpenAI | GPT-4o mini | Cost-efficient, fast, reliable, ecosystem |
| Long document analysis (up to 200K tokens) | Anthropic | Claude Sonnet 4 | Reliable at long context, best instruction-following |
| Real-time web + data (RAG) | Gemini | Gemini 2.5 Pro | Native Google Search grounding |
| Video / audio analysis | Gemini | Gemini 2.5 Pro | Only provider with native video input |
| Complex reasoning / math | OpenAI / Anthropic | o3 / Claude Opus 4 | Best reasoning benchmarks |
| High-volume batch (cheap) | Gemini or Open Source | Flash 2.0 / Llama 3.3 | Lowest cost per token |
| Data privacy / GDPR / on-prem | Open Source | Llama 3.3 70B | Data never leaves your infra |
| Code generation / debugging | OpenAI / Anthropic | GPT-4o / Claude Sonnet 4 | Tied; try both |
| Fine-tuned domain expert | OpenAI or Open Source | GPT-4o mini fine-tuned / Llama | OpenAI has a fine-tune API; open source gives full control |
| Creative writing / storytelling | Anthropic | Claude Sonnet 4 / Opus 4 | Best prose quality and style consistency |
| European data residency | Mistral | Mistral Large | EU-based servers, GDPR native |

Section 16

Provider-Agnostic Code — Write Once, Switch Freely

Avoid coupling your application code to a single provider. A thin abstraction layer lets you swap providers with a single config change — critical when a provider raises prices, has an outage, or releases a better model.

from dataclasses import dataclass
from typing import Literal
import openai, anthropic

@dataclass
class LLMConfig:
    provider: Literal["openai", "anthropic", "openai-compat"]
    model: str
    api_key: str
    base_url: str | None = None

def chat(config: LLMConfig, system: str, user: str, max_tokens: int = 500) -> str:
    if config.provider in ("openai", "openai-compat"):
        client = openai.OpenAI(
            api_key=config.api_key,
            base_url=config.base_url
        )
        resp = client.chat.completions.create(
            model=config.model,
            max_tokens=max_tokens,
            messages=[
                {"role": "system", "content": system},
                {"role": "user",   "content": user}
            ]
        )
        return resp.choices[0].message.content

    elif config.provider == "anthropic":
        client = anthropic.Anthropic(api_key=config.api_key)
        resp = client.messages.create(
            model=config.model,
            max_tokens=max_tokens,
            system=system,
            messages=[{"role": "user", "content": user}]
        )
        return resp.content[0].text

    raise ValueError(f"Unknown provider: {config.provider}")

# Swap providers by changing only the config:
gpt_config     = LLMConfig("openai",       "gpt-4o",             "sk-...")
claude_config  = LLMConfig("anthropic",    "claude-sonnet-4-5",  "sk-ant-...")
llama_config   = LLMConfig("openai-compat", "llama3.3:70b",       "ollama",
                           base_url="http://localhost:11434/v1")

prompt = ("You are a helpful tutor.", "What is overfitting in ML?")
for cfg in [gpt_config, claude_config, llama_config]:
    answer = chat(cfg, *prompt, max_tokens=80)
    print(f"\n[{cfg.model}]\n{answer}")

Section 17

Golden Rules

🏆 LLM API Selection — Non-Negotiable Rules
1
Never hardcode a provider. Wrap all LLM calls behind a thin abstraction. Every provider will eventually have an outage, a price hike, or a superior competitor. Switching should take minutes, not weeks.
2
Model your token economics before choosing a provider. Calculate input/output token ratios for your actual workload. The "cheapest" provider for a summarisation task may be the most expensive for a chatbot with long histories.
3
Use the Batch API whenever latency doesn't matter. 50% cost reduction for offline workloads is free money. Route overnight classification jobs through Batch; reserve synchronous calls for interactive use-cases only.
4
Default to a small/fast model; escalate to the frontier only when needed. GPT-4o mini, Claude Haiku, or Gemini Flash are 10–50× cheaper and adequate for 80% of tasks. Reserve Opus/o3 for complex reasoning chains.
5
Implement provider fallback logic in production. If OpenAI returns a 503, automatically retry with Anthropic or a cached response. A multi-provider strategy dramatically improves reliability.
6
For anything touching personal data in the EU, evaluate Mistral or self-hosting. GDPR Article 46 constraints on third-country data transfers apply to US-based APIs. Mistral's EU infrastructure or an on-prem Llama deployment are the clean options.
7
Always log prompt + response + token usage + latency. LLM debugging without logs is archaeological work. Track model version alongside calls — silent model updates can change behaviour.
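Rules 5 and 7 combine naturally into one wrapper: try providers in order, and log what happened on every attempt. The sketch below uses stub callables in place of real SDK calls so that it runs offline. call_with_fallback, flaky_primary and stub_secondary are hypothetical names, and a production version would also log token usage and the model version.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("llm")


def call_with_fallback(providers, prompt):
    """Try each (name, callable) in order; return (name, answer) from the first success."""
    failed = []
    for name, call in providers:
        start = time.perf_counter()
        try:
            answer = call(prompt)
        except Exception as exc:  # broad on purpose: any provider error triggers fallback
            log.warning("provider=%s failed: %s", name, exc)
            failed.append(name)
            continue
        latency_ms = (time.perf_counter() - start) * 1000
        log.info("provider=%s latency_ms=%.1f prompt_chars=%d response_chars=%d",
                 name, latency_ms, len(prompt), len(answer))
        return name, answer
    raise RuntimeError(f"all providers failed: {failed}")


def flaky_primary(prompt):
    # Stub simulating an outage at the first provider
    raise ConnectionError("503 Service Unavailable")


def stub_secondary(prompt):
    # Stub standing in for a healthy second provider
    return f"echo: {prompt}"


used, answer = call_with_fallback(
    [("primary", flaky_primary), ("secondary", stub_secondary)],
    "What is overfitting in ML?",
)
print(used, answer)
```

In practice each callable could wrap the chat() function from Section 16 with a different LLMConfig, so the fallback order becomes a configuration detail.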