What Is Ollama? — The Story of the Laptop That Became a Brain
Now imagine that assistant normally lives in a giant server room in San Francisco, and every time you ask it a question, your words fly across the internet, get processed on a machine you've never seen, and the answer flies back. That's how cloud AI works today.
Ollama is the moving truck that brings that entire library, assistant and all, into your own home. Onto your own laptop. Behind your own firewall. No internet required once a model is downloaded. No subscription. No data leaving your machine. The librarian lives with you now.
Ollama is a free, open-source tool that lets you download, manage, and run large language models (LLMs) entirely on your local machine. It wraps the complexity of running these models — GPU acceleration, memory management, quantisation — behind a clean command-line interface and a local REST API.
Everything you need — installation, model library, documentation — lives at ollama.com. The model library alone has over 100 pre-built models ranging from tiny 1B to massive 70B+ parameter giants.
What Are "Billion Parameters"? — The Chef with a Million Recipes
This intuition lives in the chef's brain as billions of tiny learned associations, like a web of interconnected instincts. Each instinct is called a parameter. A model with 7 billion parameters has 7,000,000,000 such learned associations. A model with 70 billion has ten times as many, which generally makes it more nuanced (though not literally ten times better), but also roughly ten times harder to host in your kitchen.
A parameter (also called a weight) is a single floating-point number inside a neural network. During training, the model processes trillions of words and adjusts these numbers — incrementally, over weeks on giant GPU clusters — until they collectively capture the patterns of human language.
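To make "a parameter is a single number" concrete, here is a minimal sketch that counts the parameters of a toy fully connected network. The layer sizes are hypothetical and chosen only for illustration; real LLMs use more elaborate layer types, but the counting idea is the same.

```python
def dense_layer_params(n_inputs: int, n_outputs: int) -> int:
    """One weight per input-output pair, plus one bias per output."""
    return n_inputs * n_outputs + n_outputs


# Hypothetical toy network: 4 inputs -> 8 hidden units -> 2 outputs
layer_sizes = [4, 8, 2]
total = sum(
    dense_layer_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])
)
print(f"Total parameters: {total}")
```

A 7B model is the same bookkeeping carried out over layers that are thousands of units wide and stacked dozens deep.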
| Model Size | Parameters | RAM / VRAM Needed | Speed | Who Can Run It |
|---|---|---|---|---|
| Tiny (1B–3B) | 1–3 Billion | 2–4 GB | ⚡ Very Fast | Any laptop, even old ones |
| Small (7B) | 7 Billion | 4–8 GB | Fast | Modern laptop w/ 8GB RAM+ |
| Medium (13B) | 13 Billion | 8–16 GB | Moderate | 16GB RAM machine or mid GPU |
| Large (34B) | 34 Billion | 20–40 GB | Slow on CPU | High-end PC, 32GB RAM min |
| Huge (70B) | 70 Billion | 40–80 GB | Very Slow (CPU) | Workstation / multi-GPU setup |
* RAM requirements shown are for 4-bit quantized (Q4) versions — the default Ollama uses. Full-precision models need 2–4× more memory.
Ollama automatically downloads quantised models. A full-precision 7B model stores each parameter as a 16-bit or 32-bit float, requiring ~14 GB at 16 bits (or ~28 GB at 32 bits). A Q4-quantised version compresses each parameter to ~4 bits, cutting memory to ~4 GB with only a small drop in quality. It's like compressing a 4K movie to HD: still excellent, but fits in your pocket.
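The memory figures follow from simple arithmetic: number of parameters times bits per parameter, divided by eight to get bytes. The sketch below covers the weights only. Real usage is somewhat higher because of runtime overhead and the context cache, and practical Q4 formats tend to use slightly more than 4 bits per weight, so treat the output as a lower bound.

```python
def model_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


for label, bits in [("FP32", 32), ("FP16", 16), ("Q4", 4)]:
    print(f"7B at {label}: ~{model_memory_gb(7e9, bits):.1f} GB")
```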
Each parameter is just a number. The magic emerges from 7 billion of them working together, shaped by training on trillions of words.
Installing Ollama & Running Your First Model
🆕 Download and Run a Model in 30 Seconds
Ollama uses a simple pull + run workflow. Think of it like docker pull but for AI models:
# Pull the Llama 3.2 3B model (~2 GB download)
ollama pull llama3.2
# Or pull and start an interactive chat immediately
ollama run llama3.2
# Start a chat with a specific model
ollama run mistral
# Ask a single question and exit (great for scripts)
ollama run llama3.2 "What is the capital of France?"
# List all downloaded models on your machine
ollama list
# Check what models are currently loaded in memory
ollama ps
# Remove a model to free up disk space
ollama rm llama3.2
The Ollama REST API — Your Local AI Endpoint
When Ollama is running, it exposes a local REST API at http://localhost:11434. This is how Python (and any other language) talks to your local models — no internet, no API key, no rate limits.
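A quick way to confirm the endpoint is alive is to ask it which models are installed. The sketch below uses only the standard library and the `/api/tags` route, which returns a JSON object with a `models` list. The helper names are illustrative, not part of Ollama.

```python
import json
import urllib.request

OLLAMA_BASE_URL = "http://localhost:11434"  # default address mentioned above


def parse_model_names(payload: dict) -> list[str]:
    """Extract model names from a /api/tags response body."""
    return [model["name"] for model in payload.get("models", [])]


def list_local_models(base_url: str = OLLAMA_BASE_URL) -> list[str]:
    """Ask the local server which models have been downloaded."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return parse_model_names(json.load(resp))


if __name__ == "__main__":
    try:
        print(list_local_models())
    except (OSError, ValueError) as exc:  # connection refused, bad JSON, etc.
        print(f"Ollama does not seem to be reachable: {exc}")
```

If the server is not running, the call fails fast with a connection error instead of hanging, which makes this a handy pre-flight check in scripts.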
Using Ollama with Python — Example 1: Basic Generation
Method A — Using the requests library (raw HTTP)
The most transparent approach. The only extra package is requests, which is not part of the standard library (install it with: pip install requests):
import requests

# ── Configuration ───────────────────────────────────────────
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL_NAME = "llama3.2"  # must already be downloaded via: ollama pull llama3.2


def ask_ollama(prompt: str) -> str:
    """Send a prompt to local Ollama and return the full response."""
    payload = {
        "model": MODEL_NAME,
        "prompt": prompt,
        # Must stay False here: a streamed body is many JSON lines,
        # which response.json() cannot parse (see Method B for streaming).
        "stream": False,
    }
    response = requests.post(OLLAMA_URL, json=payload, timeout=120)
    response.raise_for_status()  # raise on any HTTP error status
    data = response.json()
    return data["response"]


# ── Example usage ────────────────────────────────────────────
answer = ask_ollama("Explain quantum entanglement in simple terms.")
print(answer)
Method B — Streaming responses (token by token)
For a ChatGPT-like typewriter effect where tokens appear as they're generated:
import requests
import json


def stream_ollama(prompt: str) -> None:
    """Stream the response token-by-token to stdout."""
    payload = {
        "model": "llama3.2",
        "prompt": prompt,
        "stream": True,  # ← key difference
    }
    with requests.post(
        "http://localhost:11434/api/generate",
        json=payload,
        stream=True,  # tell requests not to buffer the whole body
        timeout=120,
    ) as resp:
        resp.raise_for_status()  # raise on any HTTP error status
        for line in resp.iter_lines():
            if line:  # skip blank keep-alive lines
                chunk = json.loads(line)  # each line is one JSON object
                print(chunk.get("response", ""), end="", flush=True)
                if chunk.get("done"):
                    break
    print()  # newline at end


# Usage: tokens will appear progressively in the terminal
stream_ollama("Write a haiku about machine learning.")
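The streamed body is newline-delimited JSON: one small object per line, each carrying a fragment in `response` and a `done` flag on the last one. The sketch below separates that parsing step from the network call so it can be tried without a running server. `collect_stream` and the sample lines are illustrative, shaped after the fields the loop above reads.

```python
import json
from typing import Iterable


def collect_stream(lines: Iterable[bytes]) -> str:
    """Join the 'response' fragments of a newline-delimited JSON stream."""
    parts = []
    for line in lines:
        if not line:  # skip blank keep-alive lines
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


# Simulated stream standing in for resp.iter_lines()
fake_lines = [
    b'{"response": "Hel", "done": false}',
    b'{"response": "lo", "done": false}',
    b"",
    b'{"response": "", "done": true}',
]
print(collect_stream(fake_lines))
```

Keeping the parser separate also makes it easy to return the full text to a caller while still printing fragments as they arrive.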
Using the Official Ollama Python Library
The ollama Python package provides a cleaner, higher-level interface with built-in streaming support and typed responses. Install it with: pip install ollama
import ollama

# ── Single prompt — simplest form ────────────────────────────
response = ollama.generate(
    model="llama3.2",
    prompt="What are three benefits of local LLMs?",
)
print(response["response"])

# ── Chat with message history (multi-turn) ────────────────────
messages = [
    {"role": "system", "content": "You are a helpful Python tutor."},
    {"role": "user", "content": "What is a list comprehension?"},
]
response = ollama.chat(model="llama3.2", messages=messages)
reply = response["message"]["content"]
print(reply)

# Continue the conversation by appending assistant + new user message
messages.append({"role": "assistant", "content": reply})
messages.append({"role": "user", "content": "Can you give me an example with numbers?"})
response2 = ollama.chat(model="llama3.2", messages=messages)
print(response2["message"]["content"])
LLMs are stateless — they don't remember previous calls.
"Memory" in a chatbot is just sending the full conversation history
(all past messages) with every new request. The messages list is the memory.
The longer it gets, the more context the model has — up to its context window limit.
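Because the history grows with every turn, long chats eventually need trimming. Here is a minimal sketch of one common approach: keep the system message and as many of the newest turns as fit a budget. The character budget is an assumption standing in for a real token count, and `trim_history` is a hypothetical helper, not part of the ollama package.

```python
def trim_history(messages: list[dict], max_chars: int) -> list[dict]:
    """Keep system messages plus the newest turns that fit in max_chars."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    budget = max_chars - sum(len(m["content"]) for m in system)
    kept = []
    for message in reversed(turns):  # walk from newest to oldest
        cost = len(message["content"])
        if cost > budget:
            break  # everything older is dropped as well
        kept.append(message)
        budget -= cost
    return system + kept[::-1]  # restore chronological order


history = [
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "a" * 50},
    {"role": "assistant", "content": "b" * 50},
    {"role": "user", "content": "c" * 20},
]
trimmed = trim_history(history, max_chars=100)
print([m["role"] for m in trimmed])
```

The trimmed list can be passed as the `messages` argument in place of the full history. Dropping the oldest turns first is a design choice: recent context usually matters most, while the system message carries standing instructions.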
Advanced Example — Building a Local AI Code Reviewer
With Ollama running locally, you can build a code reviewer that lives on your machine, reads your Python files, and gives instant feedback without sending your proprietary code to any external server. At 3am. On a plane with no WiFi. For free. Forever.
import ollama
from pathlib import Path

# ── Local Code Reviewer using Ollama ─────────────────────────
SYSTEM_PROMPT = """You are a senior Python developer doing a code review.
For each piece of code you receive:
1. Identify bugs or logical errors
2. Point out style issues (PEP 8)
3. Suggest performance improvements
4. Highlight any security concerns
Be specific, concise, and constructive. Use bullet points."""


def review_code(code: str, filename: str = "snippet") -> str:
    """Ask the local LLM to review a code snippet."""
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Please review this Python code from '{filename}':\n\n```python\n{code}\n```"},
    ]
    response = ollama.chat(model="codellama:7b", messages=messages)
    return response["message"]["content"]


# ── Review a code snippet directly ───────────────────────────
# Note: this sample builds SQL by string concatenation, exactly the
# kind of issue a reviewer should flag.
sample_code = """
def get_user_data(user_id):
    import sqlite3
    conn = sqlite3.connect("users.db")
    query = "SELECT * FROM users WHERE id = " + user_id
    result = conn.execute(query)
    return result.fetchall()
"""
review = review_code(sample_code, "user_service.py")
print("=== CODE REVIEW ===")
print(review)


# ── Or review a file from disk ────────────────────────────────
def review_file(filepath: str) -> None:
    code = Path(filepath).read_text(encoding="utf-8")
    print(f"\n🔍 Reviewing: {filepath}")
    print(review_code(code, filepath))


# review_file("my_script.py")  # ← uncomment to use
Advanced Example — RAG: Chatting with Your Own Documents
This pattern — injecting your own text into the prompt as context — is called Retrieval Augmented Generation (RAG). At its simplest, it's just prompt engineering. At its most powerful, it involves a vector database. Let's start simple.
import ollama


# ── Simple RAG: Chat with a text document ─────────────────────
def load_document(filepath: str, chunk_size: int = 500) -> list[str]:
    """Load a text file and split it into chunks of chunk_size words."""
    with open(filepath, "r", encoding="utf-8") as f:
        text = f.read()
    words = text.split()
    chunks = [
        " ".join(words[i : i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]
    return chunks


def find_relevant_chunks(chunks: list[str], question: str, top_k: int = 3) -> list[str]:
    """Simple keyword-based retrieval (no vector DB needed!)."""
    question_words = set(question.lower().split())
    scored = []
    for chunk in chunks:
        chunk_words = set(chunk.lower().split())
        score = len(question_words & chunk_words)  # word overlap
        scored.append((score, chunk))
    # Sort on the score only, so ties keep their document order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]


def answer_question(document_path: str, question: str) -> str:
    """Full RAG pipeline: load doc → retrieve → generate answer."""
    # Step 1: Load and chunk
    chunks = load_document(document_path)
    # Step 2: Retrieve relevant chunks
    relevant = find_relevant_chunks(chunks, question)
    context = "\n\n---\n\n".join(relevant)
    # Step 3: Build augmented prompt
    prompt = f"""You are a helpful assistant. Answer the question using ONLY
the context below. If the answer isn't in the context, say so.
CONTEXT:
{context}
QUESTION: {question}
ANSWER:"""
    # Step 4: Generate answer
    response = ollama.generate(model="llama3.2", prompt=prompt)
    return response["response"]


# ── Usage ─────────────────────────────────────────────────────
answer = answer_question("manual.txt", "Does this device support Bluetooth 5.2?")
print(answer)
The keyword overlap trick works surprisingly well for small documents.
For production or large document sets, replace find_relevant_chunks
with semantic search using ChromaDB + Ollama embeddings:
pip install chromadb and use ollama.embeddings() to generate
vectors for each chunk. Then query by cosine similarity for much better retrieval.
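As a stepping stone before bringing in a vector database, the retrieval step can be swapped for cosine similarity over embeddings. The sketch below takes the embedding function as a parameter so it runs without a model. `find_relevant_chunks_semantic` and `toy_embed` are hypothetical names, and the toy embedding (keyword counts) only stands in for a real one.

```python
import math
from typing import Callable

Vector = list[float]


def cosine(v1: Vector, v2: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0


def find_relevant_chunks_semantic(
    chunks: list[str],
    question: str,
    embed: Callable[[str], Vector],
    top_k: int = 3,
) -> list[str]:
    """Rank chunks by similarity between their embedding and the question's."""
    q_vec = embed(question)
    scored = [(cosine(q_vec, embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]


def toy_embed(text: str) -> Vector:
    """Stand-in embedding: counts of three hand-picked keywords."""
    lowered = text.lower()
    return [float(lowered.count(w)) for w in ("bluetooth", "battery", "warranty")]


chunks = [
    "The battery lasts ten hours.",
    "Bluetooth 5.2 pairing is supported.",
    "Warranty covers two years.",
]
print(find_relevant_chunks_semantic(chunks, "Which bluetooth version?", toy_embed, top_k=1))
```

With Ollama running, `embed` could be `lambda text: ollama.embeddings(model="nomic-embed-text", prompt=text)["embedding"]`. This version re-embeds every chunk on each query, and avoiding that repeated work is what a vector store is for.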
Generating Embeddings with Ollama
An embedding is a numerical vector that represents the meaning of text. Similar sentences produce similar vectors. This is the foundation of semantic search, recommendation systems, and clustering.
import ollama
import math

# ── Generate an embedding vector ─────────────────────────────
response = ollama.embeddings(
    model="nomic-embed-text",  # pull this first: ollama pull nomic-embed-text
    prompt="The cat sat on the mat",
)
vector = response["embedding"]  # list of 768 floats
print(f"Embedding dimensions: {len(vector)}")  # 768


# ── Cosine similarity function ────────────────────────────────
def cosine_similarity(v1: list, v2: list) -> float:
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a**2 for a in v1))
    norm2 = math.sqrt(sum(b**2 for b in v2))
    return dot / (norm1 * norm2)


# ── Compare two sentences ─────────────────────────────────────
sentences = [
    "A feline rested on the rug.",  # very similar meaning
    "Machine learning is fascinating.",  # completely different topic
]


def embed(text: str) -> list:
    return ollama.embeddings(model="nomic-embed-text", prompt=text)["embedding"]


base = embed("The cat sat on the mat")
for s in sentences:
    sim = cosine_similarity(base, embed(s))
    print(f"Similarity: {sim:.4f} | '{s}'")
"The cat sat on the mat" and "A feline rested on the rug" share almost no content words, yet the model gives them a much higher similarity score than the unrelated sentence (the exact value depends on the embedding model). This is the magic of semantic embeddings: they capture meaning, not vocabulary. This is why vector search often outperforms keyword search for natural language queries.
Model Comparison & Choosing the Right One
| Model | Size | Best For | Min RAM | Speed | Pull Command |
|---|---|---|---|---|---|
| llama3.2:3b | 2.0 GB | Chat, summarisation, quick answers | 4 GB | ⚡ Very fast | ollama pull llama3.2 |
| llama3.1:8b | 4.7 GB | Balanced: reasoning + speed | 8 GB | Fast | ollama pull llama3.1 |
| mistral:7b | 4.1 GB | Instructions, structured output | 8 GB | Fast | ollama pull mistral |
| codellama:7b | 3.8 GB | Code generation, review, debug | 8 GB | Fast | ollama pull codellama |
| deepseek-coder:6.7b | 3.8 GB | Python, JS, SQL specialisation | 8 GB | Fast | ollama pull deepseek-coder |
| llama3.1:70b | 40 GB | Complex reasoning, analysis | 48 GB+ | Slow (CPU) | ollama pull llama3.1:70b |
| nomic-embed-text | 274 MB | Embeddings only (RAG, search) | 2 GB | ⚡ Very fast | ollama pull nomic-embed-text |
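The "Min RAM" column can be turned into a simple selection rule. This is a minimal sketch that picks the largest general-purpose model from the table that fits a given amount of RAM. The function name is hypothetical and the thresholds are copied from the table, so they are rough guidance, not guarantees.

```python
from typing import Optional

# (model tag, minimum RAM in GB), copied from the table above.
# Only the general-purpose chat models are listed here.
CHAT_MODELS = [
    ("llama3.2:3b", 4),
    ("llama3.1:8b", 8),
    ("llama3.1:70b", 48),
]


def largest_model_that_fits(ram_gb: float) -> Optional[str]:
    """Return the most demanding model whose minimum RAM is satisfied."""
    fitting = [(need, tag) for tag, need in CHAT_MODELS if need <= ram_gb]
    return max(fitting)[1] if fitting else None


for ram in (2, 8, 16, 64):
    print(f"{ram} GB RAM -> {largest_model_that_fits(ram)}")
```

In practice you may prefer a smaller model than the largest that fits, since the operating system and other applications need memory too.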
Golden Rules — Running LLMs Locally Like a Pro
stream=False for scripts, stream=True for UIs.
Non-streaming waits for the full response — easier to work with in automation.
Streaming gives the typewriter effect for interactive apps.
Keep the conversation history in the messages array. The model does not remember your last call.
Use codellama or deepseek-coder for code tasks.
General-purpose models write code too, but code-specialised models were fine-tuned
on millions of GitHub repositories. For anything code-related, the specialised model wins.
Run ollama ps: it shows you which models are loaded in memory
right now. If you're running multiple models and RAM is tight, use ollama stop <model>
to unload a model and free memory before loading another.
Every model has a maximum context window: the total number of tokens (sub-word pieces, roughly three-quarters of an English word each) it can process in one request. Llama 3.2 supports up to 128K tokens, but older or smaller models may cap at 4K–8K, and Ollama's default context length (the num_ctx option) is usually set well below a model's maximum unless you raise it. If your prompt + history exceeds the limit, the beginning of the conversation is silently truncated, causing the model to "forget" earlier context. Always check the model's spec on ollama.com.
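Ollama's API accepts an `options` object, and its `num_ctx` field sets the context length for a request. The sketch below builds an `/api/generate` body with that option and refuses prompts that are unlikely to fit. The four-characters-per-token estimate is a rough rule of thumb assumed for illustration, and both helper names are hypothetical.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; the ratio is a heuristic, not a tokenizer."""
    return int(len(text) / chars_per_token) + 1


def build_generate_payload(model: str, prompt: str, num_ctx: int) -> dict:
    """Build an /api/generate body that requests a specific context length."""
    needed = estimate_tokens(prompt)
    if needed > num_ctx:
        raise ValueError(
            f"Prompt needs about {needed} tokens but num_ctx is {num_ctx}"
        )
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},  # context length for this request
    }


payload = build_generate_payload("llama3.2", "Summarise this paragraph.", 8192)
print(payload["options"])
```

The resulting dictionary can be posted to the generate endpoint as the JSON body. A larger `num_ctx` uses more memory, so raise it only as far as your prompts need.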