The Story That Sets the Stage
And then there's the Open Kitchen — no single chef, just open-source recipes anyone can cook. Free to customise, but you're responsible for the stove, the gas bill, and keeping the knives sharp.
Choosing the wrong chef doesn't mean bad food; it means a mismatched kitchen. A task needing deep reasoning shouldn't go to the cheapest option. A simple classification job shouldn't burn through GPT-4o tokens at $10 per million output tokens. This tutorial helps you match the task to the kitchen.
The modern AI landscape offers developers a rich but confusing set of LLM API choices. OpenAI, Anthropic, Google Gemini, and open-source models (via Hugging Face, Ollama, Together AI, etc.) each have distinct strengths, pricing models, safety postures, context windows, and API designs. Understanding these differences is not just academic — it directly impacts cost, latency, reliability, and the quality of what you build.
We will compare API design, pricing, context windows, capabilities, safety, and practical use-cases for OpenAI, Anthropic Claude, Google Gemini, and leading open-source options — with Python code examples for each, decision tables, and real-world scenario guidance.
The Four Players — Quick Landscape Map
API Design — Hello World Across All Four
The best way to feel an API's design philosophy is to run the same task through each one. Below is a basic chat completion — "Explain transformers in one sentence" — implemented in each SDK.
pip install openai
pip install anthropic
pip install google-generativeai
pip install openai  # most open-source hosts are OpenAI-compatible
OpenAI — GPT-4o
from openai import OpenAI
client = OpenAI(api_key="sk-...")
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": "You are a concise data science tutor."},
{"role": "user", "content": "Explain transformers in one sentence."}
],
max_tokens=100,
temperature=0.7
)
print(response.choices[0].message.content)
# Usage stats
print(f"Tokens used: {response.usage.total_tokens}")
Anthropic — Claude
import anthropic
client = anthropic.Anthropic(api_key="sk-ant-...")
message = client.messages.create(
model="claude-sonnet-4-5",
max_tokens=100,
system="You are a concise data science tutor.", # system is TOP-LEVEL, not in messages
messages=[
{"role": "user", "content": "Explain transformers in one sentence."}
]
)
print(message.content[0].text)
# Usage stats
print(f"Input tokens: {message.usage.input_tokens}")
print(f"Output tokens: {message.usage.output_tokens}")
Anthropic takes the system prompt as a top-level system parameter, separate from the messages array. OpenAI embeds it inside the messages array as {"role": "system"}. This distinction matters when migrating code between providers.
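To make the migration point concrete, here is a minimal sketch of a helper that splits an OpenAI-style message list into the two pieces Anthropic expects. The function name split_system_prompt is hypothetical, and joining multiple system messages with blank lines is an assumption rather than anything either SDK prescribes.

```python
from typing import Dict, List, Tuple

Message = Dict[str, str]


def split_system_prompt(messages: List[Message]) -> Tuple[str, List[Message]]:
    """Separate OpenAI-style system messages from the rest.

    Returns (system_text, remaining_messages), the shape that Anthropic's
    messages.create(system=..., messages=...) call takes.
    """
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    chat = [m for m in messages if m["role"] != "system"]
    # Assumption: several system messages are merged with a blank line between them
    return "\n\n".join(system_parts), chat


openai_style = [
    {"role": "system", "content": "You are a concise data science tutor."},
    {"role": "user", "content": "Explain transformers in one sentence."},
]
system, chat_messages = split_system_prompt(openai_style)
print(system)
print(chat_messages)
```

The two return values can then be passed as system=system and messages=chat_messages in the Anthropic call shown earlier.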
Google Gemini
import google.generativeai as genai
genai.configure(api_key="AIza...")
model = genai.GenerativeModel(
model_name="gemini-2.0-flash",
system_instruction="You are a concise data science tutor."
)
response = model.generate_content("Explain transformers in one sentence.")
print(response.text)
# Token usage
print(f"Total tokens: {response.usage_metadata.total_token_count}")
Open Source via Together AI (Llama 3.3 70B — OpenAI-compatible)
from openai import OpenAI
# Together AI, Groq, Anyscale etc. all accept the OpenAI client
client = OpenAI(
api_key="together-...",
base_url="https://api.together.xyz/v1" # only change needed
)
response = client.chat.completions.create(
model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
messages=[
{"role": "system", "content": "You are a concise data science tutor."},
{"role": "user", "content": "Explain transformers in one sentence."}
],
max_tokens=100
)
print(response.choices[0].message.content)
Pricing — The Real Cost of Intelligence
| Provider & Model | Input (per 1M tokens) | Output (per 1M tokens) | Context Window | Best For |
|---|---|---|---|---|
| OpenAI GPT-4o | $2.50 | $10.00 | 128K | General, function calling, agents |
| OpenAI GPT-4o mini | $0.15 | $0.60 | 128K | High-volume, cost-sensitive |
| OpenAI o3 | $10.00 | $40.00 | 200K | Complex reasoning, math, code |
| Anthropic Claude Opus 4 | $15.00 | $75.00 | 200K | Hardest tasks, long docs, research |
| Anthropic Claude Sonnet 4 | $3.00 | $15.00 | 200K | Balanced quality + cost |
| Anthropic Claude Haiku 3.5 | $0.80 | $4.00 | 200K | Fast, cheap, high volume |
| Gemini 2.5 Pro | $1.25 | $10.00 | 1M | Massive context, multimodal |
| Gemini 2.0 Flash | $0.10 | $0.40 | 1M | Fastest, cheapest Gemini |
| Llama 3.3 70B (Together AI) | $0.88 | $0.88 | 128K | Open source quality at low cost |
| Mistral Large (Mistral API) | $2.00 | $6.00 | 128K | European data residency, GDPR |
| Self-hosted (Ollama / vLLM) | $0.00 API fee (hardware and power extra) | $0.00 API fee (hardware and power extra) | Depends on model | Privacy, no per-token fees, research |
LLM prices drop roughly 30–50% every 12 months. Always verify current pricing on the provider's official page before budgeting a production project. The table above reflects approximate mid-2025 rates.
Python: A Token Cost Estimator
import tiktoken
# Pricing per 1M tokens (input, output)
PRICING = {
"gpt-4o": (2.50, 10.00),
"gpt-4o-mini": (0.15, 0.60),
"claude-sonnet-4-5": (3.00, 15.00),
"claude-haiku-3-5": (0.80, 4.00),
"gemini-2.0-flash": (0.10, 0.40),
"llama-3.3-70b": (0.88, 0.88),
}
def estimate_cost(prompt: str, expected_output_tokens: int, model: str) -> dict:
    # Approximation: cl100k_base is an OpenAI tokenizer, so token counts for
    # Claude, Gemini, and Llama are rough estimates rather than exact figures.
    enc = tiktoken.get_encoding("cl100k_base")
    input_tokens = len(enc.encode(prompt))
    price_in, price_out = PRICING[model]
    # Prices are quoted per 1M tokens, so scale the counts accordingly
    cost_in = (input_tokens / 1_000_000) * price_in
    cost_out = (expected_output_tokens / 1_000_000) * price_out
    return {
        "model": model,
        "input_tokens": input_tokens,
        "cost_input": round(cost_in, 6),
        "cost_output": round(cost_out, 6),
        "total_usd": round(cost_in + cost_out, 6),
    }

prompt = "Summarise the following 2000-word research paper in 5 bullet points: ..."
for model in PRICING:
    result = estimate_cost(prompt, expected_output_tokens=150, model=model)
    print(f"{result['model']:25s} ${result['total_usd']:.6f}")
Context Windows — How Much Can It Remember?
The context window is how many tokens the model can "see" in one call — your prompt, chat history, retrieved documents, and the model's own response combined. A small context window forces chunking; a large one allows whole-book analysis in a single call.
Figure: context window sizes per model, with bars scaled relative to 1M tokens. Note that effective performance may degrade near context limits (the "lost in the middle" problem).
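The chunking trade-off can be sketched in a few lines. This is an illustrative check under stated assumptions, not a precise tool: the 4-characters-per-token ratio is a rough heuristic for English text, the window sizes are copied from the pricing table, and the dictionary keys and helper names (estimate_tokens, fits_in_context, chunk_text) are hypothetical.

```python
# Context window sizes in tokens, taken from the pricing table in this tutorial
CONTEXT_WINDOWS = {
    "gpt-4o": 128_000,
    "claude-sonnet-4": 200_000,
    "gemini-2.0-flash": 1_000_000,
}

CHARS_PER_TOKEN = 4  # assumption: rough average for English text


def estimate_tokens(text: str) -> int:
    """Cheap token estimate; use the provider's tokenizer for exact counts."""
    return max(1, len(text) // CHARS_PER_TOKEN)


def fits_in_context(text: str, model: str, reserved_output_tokens: int = 1_000) -> bool:
    """True if the text plus a reserved output budget fits in the model's window."""
    budget = CONTEXT_WINDOWS[model] - reserved_output_tokens
    return estimate_tokens(text) <= budget


def chunk_text(text: str, max_tokens: int) -> list[str]:
    """Split text into pieces of at most max_tokens estimated tokens each."""
    size = max_tokens * CHARS_PER_TOKEN
    return [text[i:i + size] for i in range(0, len(text), size)]


document = "word " * 200_000  # 1,000,000 characters, about 250K estimated tokens
for model in CONTEXT_WINDOWS:
    print(model, fits_in_context(document, model))
```

With this sample document only the 1M-token window accepts the text in one call; the other two would need chunk_text first. Splitting on raw character offsets is the simplest possible strategy and ignores sentence boundaries.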
Streaming — Making Your App Feel Instant
Without streaming, users stare at a loading spinner until the entire response is generated. With streaming, tokens appear word-by-word — like watching someone type. This is a UX requirement for any chat interface. All four providers support streaming.
OpenAI Streaming
from openai import OpenAI
client = OpenAI(api_key="sk-...")
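# On older SDK versions this helper lives at client.beta.chat.completions.stream
with client.chat.completions.stream(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write 3 fun facts about Python."}]
) as stream:
    # The helper yields typed events; new text arrives in "content.delta" events
    for event in stream:
        if event.type == "content.delta":
            print(event.delta, end="", flush=True)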
Anthropic Streaming
import anthropic
client = anthropic.Anthropic(api_key="sk-ant-...")
with client.messages.stream(
    model="claude-sonnet-4-5",
    max_tokens=300,
    messages=[{"role": "user", "content": "Write 3 fun facts about Python."}]
) as stream:
    # text_stream is an iterator attribute, not a method, so no parentheses
    for text in stream.text_stream:
        print(text, end="", flush=True)
Gemini Streaming
import google.generativeai as genai
genai.configure(api_key="AIza...")
model = genai.GenerativeModel("gemini-2.0-flash")
for chunk in model.generate_content(
    "Write 3 fun facts about Python.",
    stream=True
):
    print(chunk.text, end="", flush=True)
OpenAI and Anthropic both offer a .stream() context manager in their
official Python SDKs for clean streaming. Gemini uses a simple stream=True
flag. All approaches yield tokens as they arrive with minimal latency.
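Whichever SDK produces the stream, the consuming side tends to look the same: print each piece as it arrives and keep the full text for later. The sketch below shows that pattern without any network call. fake_token_stream is a hypothetical stand-in for a provider stream, and consume_stream is an illustrative helper name, not part of any SDK.

```python
import time
from typing import Iterable, Iterator


def fake_token_stream(tokens: list[str], delay_s: float = 0.01) -> Iterator[str]:
    """Hypothetical stand-in for a provider stream: yields text pieces with a delay."""
    for tok in tokens:
        time.sleep(delay_s)
        yield tok


def consume_stream(chunks: Iterable[str]) -> dict:
    """Print chunks as they arrive; record the full text and time to first chunk."""
    start = time.perf_counter()
    first_chunk_s = None
    parts = []
    for chunk in chunks:
        if first_chunk_s is None:
            first_chunk_s = time.perf_counter() - start
        parts.append(chunk)
        print(chunk, end="", flush=True)
    print()
    return {
        "text": "".join(parts),
        "first_chunk_s": first_chunk_s,
        "total_s": time.perf_counter() - start,
    }


result = consume_stream(
    fake_token_stream(["Python ", "is ", "named ", "after ", "Monty ", "Python."])
)
```

Time to first chunk is the number users actually feel, which is why it is tracked separately from total time here.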
Function Calling & Tool Use — Making LLMs Act
Function calling (OpenAI/Gemini) or tool use (Anthropic) lets the model decide when to call an external function and what arguments to pass. This is the backbone of AI agents — allowing models to search the web, query databases, call APIs, or run code.
OpenAI Function Calling
from openai import OpenAI
import json
client = OpenAI(api_key="sk-...")
tools = [{
"type": "function",
"function": {
"name": "get_stock_price",
"description": "Get the current stock price for a given ticker symbol.",
"parameters": {
"type": "object",
"properties": {
"ticker": {"type": "string", "description": "Stock ticker e.g. AAPL"}
},
"required": ["ticker"]
}
}
}]
messages = [{"role": "user", "content": "What is the current price of Apple stock?"}]
response = client.chat.completions.create(model="gpt-4o", tools=tools, messages=messages)
tool_call = response.choices[0].message.tool_calls[0]
args = json.loads(tool_call.function.arguments)
print(f"Model wants to call: {tool_call.function.name}({args})")
# → Model wants to call: get_stock_price({'ticker': 'AAPL'})
# Execute the real function and return result back to model
def get_stock_price(ticker): return {"ticker": ticker, "price": 213.49, "currency": "USD"}
messages.append(response.choices[0].message)
messages.append({
"role": "tool",
"tool_call_id": tool_call.id,
"content": json.dumps(get_stock_price(**args))
})
final = client.chat.completions.create(model="gpt-4o", messages=messages)
print(final.choices[0].message.content)
Anthropic Tool Use
import anthropic, json
client = anthropic.Anthropic(api_key="sk-ant-...")
tools = [{
"name": "get_stock_price",
"description": "Get the current stock price for a ticker symbol.",
"input_schema": { # "input_schema" — Anthropic's term (not "parameters")
"type": "object",
"properties": {
"ticker": {"type": "string", "description": "e.g. AAPL"}
},
"required": ["ticker"]
}
}]
response = client.messages.create(
model="claude-sonnet-4-5",
max_tokens=500,
tools=tools,
messages=[{"role": "user", "content": "What is Apple's stock price?"}]
)
tool_use_block = [b for b in response.content if b.type == "tool_use"][0]
print(f"Calling: {tool_use_block.name}({tool_use_block.input})")
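The Anthropic example above stops once the model has asked for the tool. As a hedged sketch of the remaining step, the helper below builds the two messages that are normally appended before calling messages.create again with the same tools: the assistant turn containing the tool_use block, and a user turn carrying a tool_result block. The helper name build_tool_result_turns and the id "toolu_example" are hypothetical, and the tool_use block is written as a plain dict for illustration.

```python
import json


def get_stock_price(ticker: str) -> dict:
    # Placeholder data, mirroring the stub in the OpenAI example
    return {"ticker": ticker, "price": 213.49, "currency": "USD"}


def build_tool_result_turns(assistant_content: list, tool_use_id: str, result: dict) -> list:
    """Return the two messages to append after a tool_use response."""
    return [
        # Echo the assistant turn that requested the tool
        {"role": "assistant", "content": assistant_content},
        # Tool results go back in a user turn, linked by tool_use_id
        {"role": "user", "content": [{
            "type": "tool_result",
            "tool_use_id": tool_use_id,
            "content": json.dumps(result),
        }]},
    ]


# Hypothetical tool_use block, written as a plain dict for illustration
tool_use = {
    "type": "tool_use",
    "id": "toolu_example",
    "name": "get_stock_price",
    "input": {"ticker": "AAPL"},
}
follow_up = build_tool_result_turns(
    [tool_use], tool_use["id"], get_stock_price(**tool_use["input"])
)
print(json.dumps(follow_up, indent=2))
```

In a real call, the original user message followed by these two turns would form the messages list for the second request.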
OpenAI nests the tool schema under "function" and calls it "parameters". Anthropic uses a flat tool object and calls the schema "input_schema". This field name difference is easy to miss when migrating between the two, so bookmark it.
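Since the two tool formats differ only in nesting and one field name, the conversion can be sketched as a small pure function. The name openai_tool_to_anthropic is hypothetical, and defaulting to an empty object schema when "parameters" is absent is an assumption made for this example.

```python
def openai_tool_to_anthropic(tool: dict) -> dict:
    """Convert an OpenAI-style tool definition into Anthropic's flat format."""
    fn = tool["function"]
    return {
        "name": fn["name"],
        "description": fn.get("description", ""),
        # Same JSON Schema, different key: "parameters" becomes "input_schema"
        "input_schema": fn.get("parameters", {"type": "object", "properties": {}}),
    }


openai_tool = {
    "type": "function",
    "function": {
        "name": "get_stock_price",
        "description": "Get the current stock price for a given ticker symbol.",
        "parameters": {
            "type": "object",
            "properties": {
                "ticker": {"type": "string", "description": "Stock ticker e.g. AAPL"}
            },
            "required": ["ticker"],
        },
    },
}
anthropic_tool = openai_tool_to_anthropic(openai_tool)
print(anthropic_tool)
```

Keeping tool definitions in one format and converting at the boundary avoids maintaining two copies of every schema.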
Capability Matrix — What Can Each Do?
| Capability | OpenAI | Anthropic | Gemini | Open Source |
|---|---|---|---|---|
| Text Generation | ✅ Excellent | ✅ Excellent | ✅ Excellent | ✅ Good–Excellent |
| Code Generation | ✅ Best-in-class | ✅ Excellent | ⚡ Very Good | ⚡ Good (DeepSeek) |
| Image Input | ✅ GPT-4o | ✅ All Claude 3+ | ✅ Native multimodal | ⚡ LLaVA, Llama 3.2 |
| Image Generation | ✅ DALL-E 3 | ❌ No | ✅ Imagen 3 | ✅ Stable Diffusion |
| Audio Input | ✅ Whisper / GPT-4o | ❌ No | ✅ Native | ⚡ Whisper |
| Video Input | ❌ No | ❌ No | ✅ Native | ❌ Limited |
| Web Search (grounding) | ✅ (web search tool) | ⚡ Web search tool | ✅ Google Search | ❌ Needs integration |
| Function Calling | ✅ Mature | ✅ Excellent | ✅ Good | ⚡ Model-dependent |
| Structured JSON Output | ✅ Strict mode | ✅ Reliable | ✅ Reliable | ⚡ Varies |
| Fine-tuning | ✅ GPT-4o mini, 3.5 | ❌ No public API | ✅ Via Vertex AI | ✅ Full freedom |
| Batch API (async cheap) | ✅ 50% discount | ✅ Message Batches | ⚡ Via Vertex | ⚡ Depends on host |
| Data Privacy / On-Prem | ⚡ Enterprise tier | ⚡ Enterprise tier | ⚡ Vertex AI | ✅ Full control |
Safety & Alignment — Who Refuses What, and Why?
Being helpful without enabling harm, while not turning away legitimate requests, is the challenge every LLM provider wrestles with. Anthropic's Constitutional AI and OpenAI's RLHF-based alignment both aim for the thoughtful middle ground, but they are calibrated differently, and you will notice the differences in production.
Structured JSON Output — Reliable Extraction
Production pipelines rarely need prose — they need structured data. Extracting
a {"name": ..., "price": ..., "category": ...} object reliably
from model output requires structured output features.
OpenAI — Strict JSON Mode with Pydantic
from openai import OpenAI
from pydantic import BaseModel
client = OpenAI(api_key="sk-...")
class Product(BaseModel):
    name: str
    price_usd: float
    category: str
    in_stock: bool
response = client.beta.chat.completions.parse(
model="gpt-4o",
messages=[{
"role": "user",
"content": "Extract product info: Sony WH-1000XM5 headphones, $299.99, electronics, available."
}],
response_format=Product # Guaranteed schema adherence
)
product = response.choices[0].message.parsed
print(f"Name: {product.name}")
print(f"Price: ${product.price_usd}")
print(f"Category: {product.category}")
print(f"In Stock: {product.in_stock}")
Anthropic — JSON via Prompt Engineering
import anthropic, json
client = anthropic.Anthropic(api_key="sk-ant-...")
message = client.messages.create(
model="claude-sonnet-4-5",
max_tokens=300,
system="""You are a data extraction assistant.
Always respond with valid JSON only — no prose, no markdown fences.
Schema: {"name": str, "price_usd": float, "category": str, "in_stock": bool}""",
messages=[{
"role": "user",
"content": "Extract: Sony WH-1000XM5 headphones, $299.99, electronics, available."
}]
)
product = json.loads(message.content[0].text)
print(product)
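Prompt-only JSON has no schema guarantee, so the json.loads call above can fail or return the wrong shape. One way to guard against that, sketched here with the standard library only, is to strip a stray markdown fence and check each field's type before trusting the result. The helper name parse_product and the decision to accept integers where a float is expected are assumptions made for this example.

```python
import json

# Expected fields and types, matching the schema in the system prompt
SCHEMA = {"name": str, "price_usd": float, "category": str, "in_stock": bool}
FENCE = "`" * 3  # markdown code fence marker


def parse_product(raw: str) -> dict:
    """Parse model output into a validated product dict, or raise ValueError."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (it may carry a language tag) and the closing fence
        lines = text.splitlines()
        text = "\n".join(lines[1:-1] if lines[-1].strip() == FENCE else lines[1:])
    data = json.loads(text)  # json.JSONDecodeError is a subclass of ValueError
    for field, expected in SCHEMA.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        value = data[field]
        # Accept ints where a float is expected (300 instead of 300.0), but not bools
        if expected is float and isinstance(value, int) and not isinstance(value, bool):
            value = float(value)
        if not isinstance(value, expected) or (expected is not bool and isinstance(value, bool)):
            raise ValueError(f"field {field} should be {expected.__name__}")
        data[field] = value
    return data


sample = FENCE + 'json\n{"name": "Sony WH-1000XM5", "price_usd": 299.99, "category": "electronics", "in_stock": true}\n' + FENCE
print(parse_product(sample))
```

On a ValueError, a common follow-up is to retry the request once with the error message appended to the prompt.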
Batch API — Cheap Async Processing at Scale
For offline workloads (classifying 10,000 reviews, generating 5,000 product descriptions, embedding a corpus), the Batch APIs of OpenAI and Anthropic offer roughly 50% cost reduction in exchange for asynchronous processing, with results typically returned within 24 hours.
Anthropic Message Batches
import anthropic
client = anthropic.Anthropic(api_key="sk-ant-...")
reviews = [
"Great product, arrived quickly!",
"Terrible quality, broke after one day.",
"Average experience, nothing special.",
]
# Build a batch of requests: each entry pairs a custom_id with ordinary Messages API params
requests = [
    {
        "custom_id": f"review-{i}",
        "params": {
            "model": "claude-3-5-haiku-20241022",
            "max_tokens": 10,
            "system": "Classify sentiment as POSITIVE, NEGATIVE, or NEUTRAL. One word only.",
            "messages": [{"role": "user", "content": review}]
        }
    }
    for i, review in enumerate(reviews)
]
batch = client.messages.batches.create(requests=requests)
print(f"Batch ID: {batch.id} Status: {batch.processing_status}")
# Poll later with client.messages.batches.retrieve(batch.id) until processing_status is "ended"
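To see what the roughly 50% discount means in dollars, the arithmetic can be sketched as below. The per-review token counts (60 input, 5 output) are assumptions chosen for illustration, the rates are the Claude Haiku 3.5 figures from the pricing table, and batch_savings is a hypothetical helper name.

```python
def batch_savings(
    n_requests: int,
    input_tokens_each: int,
    output_tokens_each: int,
    price_in_per_m: float,
    price_out_per_m: float,
    batch_discount: float = 0.5,
) -> dict:
    """Compare standard and batch pricing for a uniform workload."""
    per_request = input_tokens_each * price_in_per_m + output_tokens_each * price_out_per_m
    standard = n_requests * per_request / 1_000_000  # prices are per 1M tokens
    batched = standard * (1 - batch_discount)
    return {
        "standard_usd": round(standard, 4),
        "batch_usd": round(batched, 4),
        "saved_usd": round(standard - batched, 4),
    }


# Assumed workload: 10,000 reviews, about 60 input and 5 output tokens each
estimate = batch_savings(10_000, 60, 5, price_in_per_m=0.80, price_out_per_m=4.00)
print(estimate)
```

At these assumed sizes the whole job costs about $0.68 at standard rates and about $0.34 batched, so the discount matters most when prompts are long or volumes are much larger.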
Multimodal — Sending Images to the Model
| Provider | Content block type | How the image is passed | Supported formats |
|---|---|---|---|
| OpenAI | "type": "image_url" | URL or base64 data URL in the message content array | JPEG, PNG, GIF, WEBP |
| Anthropic | "type": "image" with source object | Base64 (with a media_type field) or URL | JPEG, PNG, GIF, WEBP |
OpenAI Vision
import base64, openai
client = openai.OpenAI(api_key="sk-...")
with open("chart.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()
response = client.chat.completions.create(
model="gpt-4o",
messages=[{
"role": "user",
"content": [
{"type": "text", "text": "What trend does this chart show?"},
{"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}
]
}]
)
print(response.choices[0].message.content)
Anthropic Vision
import anthropic, base64
client = anthropic.Anthropic(api_key="sk-ant-...")
with open("chart.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()
message = client.messages.create(
model="claude-sonnet-4-5",
max_tokens=500,
messages=[{
"role": "user",
"content": [
{
"type": "image",
"source": {
"type": "base64",
"media_type": "image/png",
"data": b64
}
},
{"type": "text", "text": "What trend does this chart show?"}
]
}]
)
print(message.content[0].text)
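The two vision examples differ only in how the image block is shaped, so that part can be factored out. The following is a minimal sketch: image_block is a hypothetical helper, and guessing the media type from the file name with mimetypes is an assumption that works for common extensions but does not inspect the actual bytes.

```python
import base64
import mimetypes


def image_block(path: str, data: bytes, provider: str) -> dict:
    """Build the image content block for the given provider from raw bytes."""
    # Guess "image/png", "image/jpeg", etc. from the file extension
    media_type = mimetypes.guess_type(path)[0] or "application/octet-stream"
    b64 = base64.b64encode(data).decode()
    if provider == "openai":
        # OpenAI takes a data URL inside an image_url block
        return {"type": "image_url", "image_url": {"url": f"data:{media_type};base64,{b64}"}}
    if provider == "anthropic":
        # Anthropic takes a source object with an explicit media_type
        return {
            "type": "image",
            "source": {"type": "base64", "media_type": media_type, "data": b64},
        }
    raise ValueError(f"Unknown provider: {provider}")


fake_png = b"\x89PNG\r\n\x1a\n"  # placeholder bytes, not a complete image
openai_block = image_block("chart.png", fake_png, "openai")
anthropic_block = image_block("chart.png", fake_png, "anthropic")
print(openai_block["image_url"]["url"][:30])
print(anthropic_block["source"]["media_type"])
```

Either block can be placed in the content list next to a {"type": "text", ...} block, as in the examples above.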
Embeddings — Turning Text into Vectors
Embeddings are fixed-length numerical vectors that capture semantic meaning. They power semantic search, RAG (Retrieval-Augmented Generation), clustering, and recommendation systems. OpenAI and Gemini offer embedding endpoints directly; Anthropic does not, so when building on Claude, pair it with a third-party service such as Voyage AI or an open-source embedding model.
| Provider | Model | Dimensions | Cost / 1M tokens | Best For |
|---|---|---|---|---|
| OpenAI | text-embedding-3-large | 3072 (configurable) | $0.13 | High-accuracy retrieval |
| OpenAI | text-embedding-3-small | 1536 | $0.02 | Balanced cost/accuracy |
| Gemini | text-embedding-004 | 768 | $0.00 (free tier) | Google stack, RAG |
| Anthropic | No embedding model | n/a | n/a | Use Voyage AI or OpenAI |
| Open Source | nomic-embed-text, bge-m3 | 768–1024 | $0.00 | Private data, no API call |
from openai import OpenAI
import numpy as np
client = OpenAI(api_key="sk-...")
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
texts = [
"Machine learning automates pattern recognition.",
"Deep learning uses neural networks with many layers.",
"The Eiffel Tower is in Paris, France."
]
response = client.embeddings.create(
model="text-embedding-3-small",
input=texts
)
vectors = [item.embedding for item in response.data]
print("ML vs Deep Learning similarity:",
round(cosine_similarity(vectors[0], vectors[1]), 4))
print("ML vs Eiffel Tower similarity: ",
round(cosine_similarity(vectors[0], vectors[2]), 4))
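Pairwise similarity is the building block of semantic search: embed the query, score it against every document vector, and keep the best matches. The sketch below shows that ranking step in pure Python so it runs without an API key. The 3-dimensional vectors are made-up toy values standing in for real embeddings, and top_k is a hypothetical helper name.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k(query_vec, doc_vecs, k=2):
    """Return the k best (score, index) pairs, highest similarity first."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    return sorted(scored, reverse=True)[:k]


docs = [
    "Machine learning automates pattern recognition.",
    "Deep learning uses neural networks with many layers.",
    "The Eiffel Tower is in Paris, France.",
]
# Toy vectors for illustration only; real embeddings have hundreds of dimensions
doc_vecs = [[0.9, 0.1, 0.0], [0.8, 0.3, 0.1], [0.0, 0.1, 0.9]]
query_vec = [1.0, 0.2, 0.0]

for score, idx in top_k(query_vec, doc_vecs):
    print(f"{score:.3f}  {docs[idx]}")
```

In a RAG pipeline the texts behind the top-ranked indices would be pasted into the prompt as context. For large corpora a vector index replaces this linear scan.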
Open Source — Running Models Locally with Ollama
# Terminal: pull a model
# ollama pull llama3.3:70b
# ollama serve (starts local server at localhost:11434)
from openai import OpenAI
# Ollama exposes an OpenAI-compatible endpoint
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama" # required but not checked
)
response = client.chat.completions.create(
model="llama3.3:70b",
messages=[
{"role": "system", "content": "You are a helpful data science tutor."},
{"role": "user", "content": "What is gradient descent?"}
],
stream=True
)
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
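Because Ollama exposes an OpenAI-compatible API, most chat-completion code written for GPT-4o works with Llama 3.3 locally by changing just two values: base_url and model. This makes switching providers cheap for plain chat, although features such as tool calling and strict structured output depend on the model and the host.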
Decision Framework — Which Provider For Which Task?
| Use Case | Recommended Provider | Recommended Model | Reason |
|---|---|---|---|
| General chatbot (production) | OpenAI | GPT-4o mini | Cost-efficient, fast, reliable, ecosystem |
| Long document analysis (up to 200K) | Anthropic | Claude Sonnet 4 | Reliable at long context, best instruction-following |
| Real-time web + data (RAG) | Gemini | Gemini 2.5 Pro | Native Google Search grounding |
| Video / audio analysis | Gemini | Gemini 2.5 Pro | Only provider with native video input |
| Complex reasoning / math | OpenAI / Anthropic | o3 / Claude Opus 4 | Best reasoning benchmarks |
| High-volume batch (cheap) | Gemini or Open Source | Gemini 2.0 Flash / Llama 3.3 | Lowest cost per token |
| Data privacy / GDPR / on-prem | Open Source | Llama 3.3 70B | Data never leaves your infra |
| Code generation / debugging | OpenAI / Anthropic | GPT-4o / Claude Sonnet 4 | Tied — try both |
| Fine-tuned domain expert | OpenAI or Open Source | GPT-4o mini fine-tuned / Llama | OpenAI has fine-tune API; open source = full control |
| Creative writing / storytelling | Anthropic | Claude Sonnet 4 / Opus 4 | Best prose quality and style consistency |
| European data residency | Mistral | Mistral Large | EU-based servers, GDPR native |
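A decision table like this one can be encoded as a lookup so that routing lives in configuration rather than scattered if-statements. This is a sketch covering only a few rows. The task keys, the model label strings, and the pick_model helper are all hypothetical shorthand, not official model IDs.

```python
# A few rows of the decision table, keyed by hypothetical task labels
ROUTES = {
    "chatbot": ("openai", "gpt-4o-mini"),
    "long_document": ("anthropic", "claude-sonnet-4"),
    "video": ("gemini", "gemini-2.5-pro"),
    "bulk_batch": ("gemini", "gemini-2.0-flash"),
    "private_data": ("open-source", "llama-3.3-70b"),
}
DEFAULT_ROUTE = ("openai", "gpt-4o-mini")  # assumption: cheap general-purpose fallback


def pick_model(task: str, needs_on_prem: bool = False) -> tuple[str, str]:
    """Return (provider, model) for a task; on-prem requirements override the table."""
    if needs_on_prem:
        return ROUTES["private_data"]
    return ROUTES.get(task, DEFAULT_ROUTE)


print(pick_model("video"))
print(pick_model("chatbot", needs_on_prem=True))
print(pick_model("something_else"))
```

Hard constraints such as data residency are checked first because they rule out providers regardless of which one scores best on quality or cost.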
Provider-Agnostic Code — Write Once, Switch Freely
Avoid coupling your application code to a single provider. A thin abstraction layer lets you swap providers with a single config change — critical when a provider raises prices, has an outage, or releases a better model.
from dataclasses import dataclass
from typing import Literal
import openai, anthropic
@dataclass
class LLMConfig:
    provider: Literal["openai", "anthropic", "openai-compat"]
    model: str
    api_key: str
    base_url: str | None = None

def chat(config: LLMConfig, system: str, user: str, max_tokens: int = 500) -> str:
    if config.provider in ("openai", "openai-compat"):
        # OpenAI and OpenAI-compatible hosts share one code path
        client = openai.OpenAI(
            api_key=config.api_key,
            base_url=config.base_url
        )
        resp = client.chat.completions.create(
            model=config.model,
            max_tokens=max_tokens,
            messages=[
                {"role": "system", "content": system},
                {"role": "user", "content": user}
            ]
        )
        return resp.choices[0].message.content
    elif config.provider == "anthropic":
        # Anthropic takes the system prompt as a top-level argument
        client = anthropic.Anthropic(api_key=config.api_key)
        resp = client.messages.create(
            model=config.model,
            max_tokens=max_tokens,
            system=system,
            messages=[{"role": "user", "content": user}]
        )
        return resp.content[0].text
    raise ValueError(f"Unknown provider: {config.provider}")
# Swap providers by changing only the config:
gpt_config = LLMConfig("openai", "gpt-4o", "sk-...")
claude_config = LLMConfig("anthropic", "claude-sonnet-4-5", "sk-ant-...")
llama_config = LLMConfig("openai-compat", "llama3.3:70b", "ollama",
base_url="http://localhost:11434/v1")
prompt = ("You are a helpful tutor.", "What is overfitting in ML?")
for cfg in [gpt_config, claude_config, llama_config]:
    answer = chat(cfg, *prompt, max_tokens=80)
    print(f"\n[{cfg.model}]\n{answer}")
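The outage scenario mentioned at the start of this section suggests one more layer: try providers in order and fall back when one fails. The sketch below shows the idea with plain callables so it runs without any API keys. chat_with_fallback, flaky_provider, and backup_provider are hypothetical names; in practice each callable could wrap the chat function above with a different LLMConfig.

```python
from typing import Callable, Sequence


def chat_with_fallback(
    providers: Sequence[tuple[str, Callable[[str], str]]],
    prompt: str,
) -> tuple[str, str]:
    """Try each (name, call) pair in order; return the first successful answer."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # real code should catch the SDKs' specific error types
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Hypothetical stand-ins for real provider calls
def flaky_provider(prompt: str) -> str:
    raise TimeoutError("simulated outage")


def backup_provider(prompt: str) -> str:
    return f"echo: {prompt}"


name, answer = chat_with_fallback(
    [("primary", flaky_provider), ("backup", backup_provider)],
    "What is overfitting?",
)
print(name, answer)
```

Ordering the list by preference keeps the policy explicit: the first entry is used whenever it is healthy, and later entries only absorb failures.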