
Ollama Tutorial: Run Local LLMs with Python

Learn how to install Ollama, understand what billion-parameter models really mean, and use local LLMs entirely from Python — with code examples for chat, code review, RAG document Q&A, and semantic embeddings.

Section 01

What Is Ollama? — The Story of the Laptop That Became a Brain

The Library That Moved Into Your House
Imagine the world's greatest library — millions of books, scientific papers, code repositories, philosophical treatises, every Wikipedia article ever written — all distilled by a master librarian into a single, extraordinarily knowledgeable assistant.

Now imagine that assistant normally lives in a giant server room in San Francisco, and every time you ask it a question, your words fly across the internet, get processed on a machine you've never seen, and the answer flies back. That's how cloud AI works today.

Ollama is the moving truck that brings that entire library — assistant and all — into your own home. Onto your own laptop. Behind your own firewall. No internet required. No subscription. No data leaving your machine. The librarian lives with you now.

Ollama is a free, open-source tool that lets you download, manage, and run large language models (LLMs) entirely on your local machine. It wraps the complexity of running these models — GPU acceleration, memory management, quantisation — behind a clean command-line interface and a local REST API.

🌐
Official Website

Everything you need — installation, model library, documentation — lives at ollama.com. The model library alone has over 100 pre-built models ranging from tiny 1B to massive 70B+ parameter giants.

🏭 How Ollama Sits Between You and Your Model
☁️ Cloud AI (Before)
You (local machine) → Internet (your data travels) → Server (GPU farm)
🏠 Ollama (Now)
You (CLI / Python / API) → Ollama (localhost:11434) → LLM (your GPU/CPU)

Section 02

What Are "Billion Parameters"? — The Chef with a Million Recipes

The World's Most Experienced Chef
Imagine a chef who has tasted and studied every recipe ever written. After years of training, they don't memorise the recipes word-for-word — instead, they develop an intuition. They learn: "garlic + tomato + olive oil → Italian direction", "soy + ginger + sesame → Asian direction", and countless other patterns.

This intuition lives in the chef's brain as billions of tiny learned associations, like a web of interconnected instincts. Each instinct is called a parameter. A model with 7 billion parameters has 7,000,000,000 such learned associations. A model with 70 billion has ten times as many, which gives it more room for nuance (though not literally ten times the quality) and makes it roughly ten times harder to host in your kitchen.

A parameter (also called a weight) is a single floating-point number inside a neural network. During training, the model processes trillions of words and adjusts these numbers — incrementally, over weeks on giant GPU clusters — until they collectively capture the patterns of human language.

⚡ Model Size vs What You Need to Run It
Model Size | Parameters | RAM / VRAM Needed | Speed | Who Can Run It
Tiny (1B–3B) | 1–3 billion | 2–4 GB | ⚡ Very fast | Any laptop, even old ones
Small (7B) | 7 billion | 4–8 GB | Fast | Modern laptop with 8 GB RAM or more
Medium (13B) | 13 billion | 8–16 GB | Moderate | 16 GB RAM machine or mid-range GPU
Large (34B) | 34 billion | 20–40 GB | Slow on CPU | High-end PC, 32 GB RAM minimum
Huge (70B) | 70 billion | 40–80 GB | Very slow (CPU) | Workstation / multi-GPU setup

* RAM requirements shown are for 4-bit quantized (Q4) versions — the default Ollama uses. Full-precision models need 2–4× more memory.

💡
The Quantisation Secret — Getting More for Less

Ollama automatically downloads quantised models. A full 7B model stores each parameter as a 16-bit float (~14 GB) or a 32-bit float (~28 GB). A Q4-quantised version compresses each parameter to roughly 4 bits, cutting memory to ~4 GB with only a small drop in quality. It's like compressing a 4K movie to HD: still excellent, but fits in your pocket.
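The arithmetic behind these figures can be sketched in a few lines. This is a rough estimate that assumes memory is dominated by the weights themselves; the function name estimate_weights_gb is illustrative, and real usage adds overhead for the context cache and runtime buffers.

```python
def estimate_weights_gb(params_billions: float, bits_per_param: float) -> float:
    """Rough size of the model weights alone, in gigabytes (1 GB = 10^9 bytes)."""
    total_bits = params_billions * 1e9 * bits_per_param
    return total_bits / 8 / 1e9


# Compare the same 7B model at three precisions
for bits in (32, 16, 4):
    print(f"7B model at {bits}-bit: ~{estimate_weights_gb(7, bits):.1f} GB")
```

Practical Q4 formats store slightly more than 4 bits per parameter, which is why a 7B download lands closer to 4 GB than to 3.5 GB.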

🧠 What a Parameter Actually Does
Token "Paris" (input vector) × w = 0.847 (a parameter!) + … × 6,999,999,999 more weights (7 billion total) → Activation (non-linearity) → "capital of France" (predicted next tokens)

Each parameter is just a number. Magic emerges from 7 billion of them working together, shaped by training on trillions of words.
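To make the flow concrete, here is a toy version of one artificial neuron. The inputs and weights are made-up numbers chosen for illustration, not values from any real model; a real LLM chains billions of such multiply-and-add steps.

```python
def relu(x: float) -> float:
    """A common non-linearity: negative values become zero."""
    return max(0.0, x)


def neuron(inputs: list[float], weights: list[float], bias: float) -> float:
    """Weighted sum of the inputs plus a bias, passed through the activation."""
    weighted_sum = sum(i * w for i, w in zip(inputs, weights)) + bias
    return relu(weighted_sum)


# Three input features and three parameters (weights), all illustrative
inputs = [1.0, 2.0, 4.0]
weights = [0.5, -0.25, 0.125]
print(neuron(inputs, weights, bias=0.25))
```

Training nudges numbers like those in weights up or down until the outputs become useful predictions.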


Section 03

Installing Ollama & Running Your First Model

🚀 Installation — One Command on Every Platform
macOS / Linux
Open Terminal and run: curl -fsSL https://ollama.com/install.sh | sh — installs in under 60 seconds.
Windows
Download the installer from ollama.com/download. Installs like any app. Runs as a system tray service.
Docker
Run: docker run -d -p 11434:11434 ollama/ollama for a containerised setup perfect for servers.
Verify
After install, run ollama --version in terminal. You should see a version number like ollama version 0.x.x.

🆕 Download and Run a Model in 30 Seconds

Ollama uses a simple pull + run workflow. Think of it like docker pull but for AI models:

# Pull the Llama 3.2 3B model (~2 GB download)
ollama pull llama3.2

# Or pull and start an interactive chat immediately
ollama run llama3.2

# Start a chat with a specific model
ollama run mistral

# Ask a single question and exit (great for scripts)
ollama run llama3.2 "What is the capital of France?"

# List all downloaded models on your machine
ollama list

# Check what models are currently loaded in memory
ollama ps

# Remove a model to free up disk space
ollama rm llama3.2
OUTPUT — ollama list
NAME               ID              SIZE      MODIFIED
llama3.2:latest    a80c4f17acd5    2.0 GB    2 hours ago
mistral:latest     f974a74358d6    4.1 GB    1 day ago
codellama:7b       8fdf8f752f6e    3.8 GB    3 days ago
🤖
General Chat
Best starter models
llama3.2, mistral, gemma2 — great for everyday Q&A, writing, and reasoning tasks.
💻
Code Generation
For developers
codellama, deepseek-coder, qwen2.5-coder — specialised in Python, JS, SQL, and more.
📈
Data & Science
Analysis focused
phi3, llama3.1 8B — excellent for data analysis explanations, math, and structured reasoning.

Section 04

The Ollama REST API — Your Local AI Endpoint

When Ollama is running, it exposes a local REST API at http://localhost:11434. This is how Python (and any other language) talks to your local models — no internet, no API key, no rate limits.

🔁 Request / Response Flow
1
POST /api/generate
Send a prompt string. Returns a streamed or single-shot response. Simplest endpoint.
2
POST /api/chat
Send a conversation history (messages array). Maintains multi-turn context. Use for chatbots.
3
POST /api/embeddings
Convert text to a vector embedding. Used for semantic search, RAG pipelines, similarity tasks. Newer Ollama releases also provide POST /api/embed, which accepts a batch of inputs.
4
GET /api/tags
List all downloaded models. Returns model names, sizes, and modification dates.
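As a rough illustration of the request and response shapes, the sketch below builds a /api/chat payload and reads model names out of a /api/tags response. The helper names are hypothetical, and the sample response is hand-written to mirror the fields described above rather than captured from a real server.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434"


def build_chat_payload(model: str, user_text: str) -> dict:
    """Body for POST /api/chat with a single user message, streaming off."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "stream": False,
    }


def model_names(tags_response: dict) -> list[str]:
    """Extract model names from a GET /api/tags response body."""
    return [m["name"] for m in tags_response.get("models", [])]


def get_tags() -> dict:
    """Fetch /api/tags from a running Ollama server (needs Ollama started)."""
    with urllib.request.urlopen(f"{BASE_URL}/api/tags", timeout=10) as resp:
        return json.load(resp)


# Offline demo with a hand-written response of the same shape
sample = {"models": [{"name": "llama3.2:latest", "size": 2019393189}]}
print(model_names(sample))
print(json.dumps(build_chat_payload("llama3.2", "Hello!"), indent=2))
# print(model_names(get_tags()))   # uncomment with Ollama running
```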

Section 05

Using Ollama with Python — Example 1: Basic Generation

The Receptionist Who Speaks HTTP
Ollama is running on your machine right now, listening patiently at port 11434. Think of it as a very knowledgeable receptionist. Your Python code walks up to the desk, hands over a note (your prompt as JSON), and the receptionist passes it to the expert in the back room (the LLM). A few seconds later, the answer comes back. You don't need to know anything about how the expert thinks — just how to write the note.

Method A — Using the requests library (raw HTTP)

The most transparent approach. The only extra package is requests (pip install requests):

import requests

# ── Configuration ───────────────────────────────────────────
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL_NAME = "llama3.2"  # must be already downloaded via: ollama pull llama3.2

def ask_ollama(prompt: str, stream: bool = False) -> str:
    """Send a prompt to local Ollama and return the full response."""
    payload = {
        "model": MODEL_NAME,
        "prompt": prompt,
        "stream": stream       # False = wait for complete response
    }
    response = requests.post(OLLAMA_URL, json=payload, timeout=120)
    response.raise_for_status()
    data = response.json()
    return data["response"]

# ── Example usage ────────────────────────────────────────────
answer = ask_ollama("Explain quantum entanglement in simple terms.")
print(answer)
OUTPUT
Quantum entanglement is like having two magic coins. When you flip one coin and it lands heads, the other coin — no matter how far away it is — instantly lands tails. They're "entangled": the state of one is always correlated with the other, no matter the distance between them...

Method B — Streaming responses (token by token)

For a ChatGPT-like typewriter effect where tokens appear as they're generated:

import requests
import json

def stream_ollama(prompt: str) -> None:
    """Stream the response token-by-token to stdout."""
    payload = {
        "model": "llama3.2",
        "prompt": prompt,
        "stream": True      # ← key difference
    }
    with requests.post(
        "http://localhost:11434/api/generate",
        json=payload,
        stream=True,
        timeout=120
    ) as resp:
        resp.raise_for_status()  # fail fast on HTTP errors (e.g. model not found)
        for line in resp.iter_lines():
            if line:
                chunk = json.loads(line)
                print(chunk.get("response", ""), end="", flush=True)
                if chunk.get("done"):
                    break
    print()  # newline at end

# Usage — tokens will appear progressively in terminal
stream_ollama("Write a haiku about machine learning.")
OUTPUT (appears token by token)
Weights slowly trained,
Patterns hidden in the noise,
The model now knows.
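Under the hood, the stream is newline-delimited JSON: each line is a small object carrying a fragment of text in its "response" field, and the last one has "done" set to true. The following sketch reassembles such lines into the full answer. The sample lines are hand-written for illustration, and collect_stream is a hypothetical helper name.

```python
import json
from typing import Iterable


def collect_stream(lines: Iterable[bytes]) -> str:
    """Join the "response" fragments of a streamed /api/generate reply."""
    parts = []
    for line in lines:
        if not line:          # skip blank keep-alive lines
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


# Hand-written sample in the same shape as resp.iter_lines() output
sample_lines = [
    b'{"response": "Hello", "done": false}',
    b'{"response": ", world", "done": false}',
    b'{"response": "", "done": true}',
]
print(collect_stream(sample_lines))
```

Collecting the fragments this way is useful when you want the typewriter effect on screen and the complete text for logging afterwards.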

Section 06

Using the Official Ollama Python Library

The ollama Python package provides a cleaner, higher-level interface with built-in streaming, an async client, and typed responses. Install it with: pip install ollama

import ollama

# ── Single prompt — simplest form ────────────────────────────
response = ollama.generate(
    model="llama3.2",
    prompt="What are three benefits of local LLMs?"
)
print(response["response"])

# ── Chat with message history (multi-turn) ────────────────────
messages = [
    {"role": "system",    "content": "You are a helpful Python tutor."},
    {"role": "user",      "content": "What is a list comprehension?"},
]
response = ollama.chat(model="llama3.2", messages=messages)
reply = response["message"]["content"]
print(reply)

# Continue the conversation by appending assistant + new user message
messages.append({"role": "assistant", "content": reply})
messages.append({"role": "user",      "content": "Can you give me an example with numbers?"})

response2 = ollama.chat(model="llama3.2", messages=messages)
print(response2["message"]["content"])
RESPONSE — What is a list comprehension?
A list comprehension is a concise way to create lists in Python. Instead of writing a for loop, you write the logic in a single line:
squares = [x**2 for x in range(10)]
This is equivalent to:
squares = []
for x in range(10):
    squares.append(x**2)
Memory Lives in the Messages List

LLMs are stateless — they don't remember previous calls. "Memory" in a chatbot is just sending the full conversation history (all past messages) with every new request. The messages list is the memory. The longer it gets, the more context the model has — up to its context window limit.
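One simple way to keep the history from outgrowing the context window is to drop the oldest turns while always keeping the system message. The sketch below uses a character budget as a crude stand-in for tokens; trim_history and the budget value are illustrative assumptions, not part of the ollama package.

```python
def trim_history(messages: list[dict], max_chars: int) -> list[dict]:
    """Keep system messages plus the most recent turns within a character budget."""
    system = [m for m in messages if m["role"] == "system"]
    others = [m for m in messages if m["role"] != "system"]

    budget = max_chars - sum(len(m["content"]) for m in system)
    kept = []
    # Walk backwards so the newest messages are kept first
    for msg in reversed(others):
        if len(msg["content"]) > budget:
            break
        kept.append(msg)
        budget -= len(msg["content"])
    return system + list(reversed(kept))


history = [
    {"role": "system", "content": "You are a helpful Python tutor."},
    {"role": "user", "content": "What is a list comprehension?"},
    {"role": "assistant", "content": "A concise way to build a list."},
    {"role": "user", "content": "Show an example with numbers."},
]
for m in trim_history(history, max_chars=95):
    print(m["role"], "|", m["content"])
```

The trimmed list can be passed to ollama.chat() in place of the full history. A production version would count tokens rather than characters.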


Section 07

Advanced Example — Building a Local AI Code Reviewer

The Senior Dev Who Never Gets Tired
Your team has a rule: every PR must be reviewed by a senior developer before merging. But your senior dev is in a meeting. Again. And your deadline is in an hour.

With Ollama running locally, you can build a code reviewer that lives on your machine, reads your Python files, and gives instant feedback without sending your proprietary code to any external server. At 3am. On a plane with no WiFi. For free. Forever.
import ollama
import sys
from pathlib import Path

# ── Local Code Reviewer using Ollama ─────────────────────────
SYSTEM_PROMPT = """You are a senior Python developer doing a code review.
For each piece of code you receive:
1. Identify bugs or logical errors
2. Point out style issues (PEP 8)
3. Suggest performance improvements
4. Highlight any security concerns
Be specific, concise, and constructive. Use bullet points."""

def review_code(code: str, filename: str = "snippet") -> str:
    """Ask the local LLM to review a code snippet."""
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user",   "content": f"Please review this Python code from '{filename}':\n\n```python\n{code}\n```"}
    ]
    response = ollama.chat(model="codellama:7b", messages=messages)
    return response["message"]["content"]

# ── Review a code snippet directly ───────────────────────────
sample_code = """
def get_user_data(user_id):
    import sqlite3
    conn = sqlite3.connect("users.db")
    query = "SELECT * FROM users WHERE id = " + user_id
    result = conn.execute(query)
    return result.fetchall()
"""

review = review_code(sample_code, "user_service.py")
print("=== CODE REVIEW ===")
print(review)

# ── Or review a file from disk ────────────────────────────────
def review_file(filepath: str) -> None:
    code = Path(filepath).read_text(encoding="utf-8")
    print(f"\n🔍 Reviewing: {filepath}")
    print(review_code(code, filepath))

# review_file("my_script.py")   # ← uncomment to use
OUTPUT — Code Review
=== CODE REVIEW ===
🐛 **Bugs & Security Issues:**
• CRITICAL: SQL Injection vulnerability. user_id is directly concatenated into the query. Use parameterized queries: cursor.execute("SELECT * FROM users WHERE id = ?", (user_id,))
• The database connection is never closed, which will cause resource leaks.
📏 **PEP 8 Issues:**
• Import inside function: move `import sqlite3` to module top level.
• Missing type hints and docstring.
⚡ **Performance:**
• Use context manager (with statement) to auto-close the connection.
• Consider returning a list directly: return list(result.fetchall())
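For comparison, here is one way the sample function could look after applying the review's main points: a parameterized query and a connection that is reliably closed. This is a sketch rather than the only valid fix; the db_path parameter and the table layout in the demo are assumptions added for illustration.

```python
import os
import sqlite3
import tempfile
from contextlib import closing


def get_user_data(user_id: int, db_path: str = "users.db") -> list[tuple]:
    """Fetch rows for one user with a parameterized query."""
    # closing() guarantees conn.close() even if the query raises
    with closing(sqlite3.connect(db_path)) as conn:
        cursor = conn.execute("SELECT * FROM users WHERE id = ?", (user_id,))
        return cursor.fetchall()


# Demo against a throwaway database file (schema is illustrative)
demo_path = os.path.join(tempfile.mkdtemp(), "demo.db")
with closing(sqlite3.connect(demo_path)) as conn:
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO users VALUES (1, 'Ada')")
    conn.commit()

print(get_user_data(1, demo_path))
```

Because the value travels separately from the SQL text, a malicious string such as "1 OR 1=1" is compared as data and matches nothing.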

Section 08

Advanced Example — RAG: Chatting with Your Own Documents

The Expert Who Read Your Manual
You have a 200-page product manual. A customer asks: "Does this device support Bluetooth 5.2?" You could grep the PDF manually, or... you could give a local LLM your document as context and just ask it in plain English. The LLM becomes an expert on your specific document.

This pattern — injecting your own text into the prompt as context — is called Retrieval Augmented Generation (RAG). At its simplest, it's just prompt engineering. At its most powerful, it involves a vector database. Let's start simple.
🔁 Simple RAG Pipeline — How It Works
1
Load Your Document
Read text from a .txt or .md file (a PDF needs a text-extraction step first). Chunk it into 500-word segments to fit in the context window.
2
Find Relevant Chunks
When user asks a question, find the document chunks that are most relevant (by keyword or semantic similarity).
3
Build an Augmented Prompt
Combine: system instruction + relevant chunks + user question into one big prompt for the model.
4
Get a Grounded Answer
The model is told to answer using only the context you gave it. This sharply reduces hallucinated facts from outside your document, though it cannot rule them out entirely.
import ollama

# ── Simple RAG: Chat with a text document ─────────────────────

def load_document(filepath: str, chunk_size: int = 500) -> list[str]:
    """Load a text file and split into chunks."""
    with open(filepath, "r", encoding="utf-8") as f:
        text = f.read()
    words = text.split()
    chunks = [
        " ".join(words[i : i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]
    return chunks

def find_relevant_chunks(chunks: list[str], question: str, top_k: int = 3) -> list[str]:
    """Simple keyword-based retrieval (no vector DB needed!)."""
    def tokenize(text: str) -> set[str]:
        # Lowercase and strip surrounding punctuation so "5.2?" matches "5.2"
        return {w.strip(".,;:!?\"'()") for w in text.lower().split()} - {""}

    question_words = tokenize(question)
    scored = []
    for chunk in chunks:
        score = len(question_words & tokenize(chunk))  # word overlap
        scored.append((score, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest overlap first
    return [chunk for _, chunk in scored[:top_k]]

def answer_question(document_path: str, question: str) -> str:
    """Full RAG pipeline: load doc → retrieve → generate answer."""
    # Step 1: Load and chunk
    chunks = load_document(document_path)
    
    # Step 2: Retrieve relevant chunks
    relevant = find_relevant_chunks(chunks, question)
    context = "\n\n---\n\n".join(relevant)
    
    # Step 3: Build augmented prompt
    prompt = f"""You are a helpful assistant. Answer the question using ONLY 
the context below. If the answer isn't in the context, say so.

CONTEXT:
{context}

QUESTION: {question}

ANSWER:"""
    
    # Step 4: Generate answer
    response = ollama.generate(model="llama3.2", prompt=prompt)
    return response["response"]

# ── Usage ─────────────────────────────────────────────────────
answer = answer_question("manual.txt", "Does this device support Bluetooth 5.2?")
print(answer)
💡
Level Up: Add Vector Search with ChromaDB

The keyword overlap trick works surprisingly well for small documents. For production or large document sets, replace find_relevant_chunks with semantic search using ChromaDB + Ollama embeddings: pip install chromadb and use ollama.embeddings() to generate vectors for each chunk. Then query by cosine similarity for much better retrieval.
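Before bringing in ChromaDB, the idea can be tried with plain Python. The sketch below swaps keyword overlap for cosine similarity over embeddings. It takes the embedding function as a parameter so it can run with a toy stand-in here, and with a wrapper around ollama.embeddings() in practice. The names semantic_top_k and toy_embed are hypothetical.

```python
import math
from typing import Callable


def cosine(v1: list[float], v2: list[float]) -> float:
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0


def semantic_top_k(chunks: list[str], question: str,
                   embed: Callable[[str], list[float]], top_k: int = 3) -> list[str]:
    """Rank chunks by cosine similarity between chunk and question embeddings."""
    q_vec = embed(question)
    scored = [(cosine(q_vec, embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]


def toy_embed(text: str) -> list[float]:
    """Stand-in embedding: counts of a few hand-picked keywords (illustration only)."""
    vocab = ["bluetooth", "battery", "warranty"]
    lowered = text.lower()
    return [float(lowered.count(word)) for word in vocab]


chunks = [
    "The battery lasts ten hours.",
    "Bluetooth 5.2 is supported on all models.",
    "Warranty covers two years.",
]
print(semantic_top_k(chunks, "Which Bluetooth version is supported?", toy_embed, top_k=1))

# With Ollama running, a real embedding function could be:
# import ollama
# embed = lambda t: ollama.embeddings(model="nomic-embed-text", prompt=t)["embedding"]
```

This version re-embeds every chunk on each question, which is fine for a handful of chunks. A vector database earns its place by storing the chunk vectors once and searching them quickly.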


Section 09

Generating Embeddings with Ollama

An embedding is a numerical vector that represents the meaning of text. Similar sentences produce similar vectors. This is the foundation of semantic search, recommendation systems, and clustering.

import ollama
import math

# ── Generate an embedding vector ─────────────────────────────
response = ollama.embeddings(
    model="nomic-embed-text",   # pull this first: ollama pull nomic-embed-text
    prompt="The cat sat on the mat"
)
vector = response["embedding"]      # list of 768 floats
print(f"Embedding dimensions: {len(vector)}")   # 768

# ── Cosine similarity function ────────────────────────────────
def cosine_similarity(v1: list, v2: list) -> float:
    dot   = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a**2 for a in v1))
    norm2 = math.sqrt(sum(b**2 for b in v2))
    return dot / (norm1 * norm2)

# ── Compare two sentences ─────────────────────────────────────
sentences = [
    "A feline rested on the rug.",          # very similar meaning
    "Machine learning is fascinating.",     # completely different topic
]

def embed(text: str) -> list:
    return ollama.embeddings(model="nomic-embed-text", prompt=text)["embedding"]

base   = embed("The cat sat on the mat")

for s in sentences:
    sim = cosine_similarity(base, embed(s))
    print(f"Similarity: {sim:.4f}  |  '{s}'")
OUTPUT
Embedding dimensions: 768
Similarity: 0.9341  |  'A feline rested on the rug.'
Similarity: 0.3218  |  'Machine learning is fascinating.'
🧠
0.93 Similarity — Different Words, Same Meaning

"The cat sat on the mat" and "A feline rested on the rug" share only filler words such as "on" and "the", yet in this run the model scores them at 0.93 cosine similarity. This is the strength of semantic embeddings: they capture meaning, not vocabulary. It is also why vector search often beats keyword search for natural-language queries. Exact scores vary from one embedding model to another.


Section 10

Model Comparison & Choosing the Right One

Model | Size | Best For | Min RAM | Speed | Pull Command
llama3.2:3b | 2.0 GB | Chat, summarisation, quick answers | 4 GB | ⚡ Very fast | ollama pull llama3.2
llama3.1:8b | 4.7 GB | Balanced: reasoning + speed | 8 GB | Fast | ollama pull llama3.1
mistral:7b | 4.1 GB | Instructions, structured output | 8 GB | Fast | ollama pull mistral
codellama:7b | 3.8 GB | Code generation, review, debug | 8 GB | Fast | ollama pull codellama
deepseek-coder:6.7b | 3.8 GB | Python, JS, SQL specialisation | 8 GB | Fast | ollama pull deepseek-coder:6.7b
llama3.1:70b | 40 GB | Complex reasoning, analysis | 48 GB+ | Slow (CPU) | ollama pull llama3.1:70b
nomic-embed-text | 274 MB | Embeddings only (RAG, search) | 2 GB | ⚡ Very fast | ollama pull nomic-embed-text

Section 11

Golden Rules — Running LLMs Locally Like a Pro

🛠️ Ollama + Local LLM — Essential Rules
1
Start with the smallest model that does the job. A 3B model is typically several times faster than a 13B model on the same hardware, because generation speed falls roughly in step with model size. Only go bigger when quality falls short.
2
Always set stream=False for scripts, stream=True for UIs. Non-streaming waits for the full response — easier to work with in automation. Streaming gives the typewriter effect for interactive apps.
3
Keep models warm between requests. The first request after loading a model is slow (it loads into memory). Subsequent requests are fast. In a long-running service, send a warm-up prompt at startup.
4
Write a system prompt for every serious use case. A good system prompt can turn a generic 7B model into a specialist. "You are a senior Python developer who writes PEP-8 compliant, type-annotated code" produces dramatically better code than no system prompt.
5
LLMs are stateless — you manage the memory. Every API call is independent. If you want conversation context, you must include all previous messages in the messages array. The model does not remember your last call.
6
Use codellama or deepseek-coder for code tasks. General-purpose models write code too, but code-specialised models were trained on large corpora of source code. For code-related work, a specialised model usually beats a general model of similar size.
7
Monitor with ollama ps. It shows which models are loaded in memory right now and whether they sit on the GPU or the CPU. If you're running multiple models and RAM is tight, use ollama stop <model> to unload a model and free memory before loading another.
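Rule 3 can be sketched as a small warm-up request. This relies on two behaviours of the REST API as I understand them: a /api/generate request without a prompt only loads the model, and the keep_alive field controls how long it stays in memory afterwards. The helper names and the "10m" value are illustrative assumptions.

```python
import json
import urllib.request

GENERATE_URL = "http://localhost:11434/api/generate"


def warmup_payload(model: str, keep_alive: str = "10m") -> dict:
    """Request body that loads a model without generating any text."""
    # No "prompt" key: the server loads the model and keeps it in memory
    # for the keep_alive duration after the last request.
    return {"model": model, "keep_alive": keep_alive, "stream": False}


def send_warmup(model: str, keep_alive: str = "10m") -> dict:
    """POST the warm-up payload to a running Ollama server."""
    body = json.dumps(warmup_payload(model, keep_alive)).encode("utf-8")
    request = urllib.request.Request(
        GENERATE_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=120) as resp:
        return json.load(resp)


print(warmup_payload("llama3.2"))
# send_warmup("llama3.2")   # uncomment with Ollama running
```

Calling send_warmup() once at service startup moves the slow model load out of the first user request.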
⚠️
Context Window Is Not Infinite

Every model has a maximum context window: the total number of tokens (word pieces and punctuation) it can process in one request. Llama 3.2 supports up to 128K tokens, but older or smaller models may cap at 4K–8K. Ollama also loads models with a default context length that can be much smaller than the model's maximum; the num_ctx option raises it. If your prompt + history exceeds the limit in effect, the oldest part of the conversation is silently truncated, causing the model to "forget" earlier context. Always check the model's spec on ollama.com.
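A quick pre-flight check can flag prompts that are likely to overflow. The sketch below leans on the common rule of thumb of about four characters per token for English text, which is only an approximation; the helper names and the 512-token reply reserve are assumptions made for illustration.

```python
def estimate_tokens(text: str) -> int:
    """Very rough estimate: about four characters per token for English text."""
    return max(1, len(text) // 4)


def fits_in_context(messages: list[dict], context_limit: int,
                    reserve_for_reply: int = 512) -> bool:
    """True if the estimated prompt size still leaves room for the reply."""
    used = sum(estimate_tokens(m["content"]) for m in messages)
    return used + reserve_for_reply <= context_limit


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "word " * 2000},   # about 10,000 characters
]
for limit in (2048, 8192):
    print(f"limit={limit}: fits={fits_in_context(messages, limit)}")
```

With the ollama package, a larger window can be requested per call through the options argument, for example options={"num_ctx": 8192}, at the cost of more memory.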


Section 12

Quick Reference — All The Commands You'll Use Daily

Pull a Model
ollama pull <model>
Downloads model to ~/.ollama/models. Only needed once per model.
Run Interactively
ollama run <model>
Opens an interactive chat session in your terminal. Type /bye to exit.
Python API Endpoint
localhost:11434
REST API base URL. Active whenever Ollama is running as a service.
Install Python Package
pip install ollama
Official Python client. Cleaner than raw requests. Auto-handles streaming.
🌠 Five-Minute Quick Start — From Zero to Running LLM
Step 1
Install Ollama: curl -fsSL https://ollama.com/install.sh | sh
Step 2
Pull a model: ollama pull llama3.2 — downloads ~2GB, takes 1–3 minutes
Step 3
Test it: ollama run llama3.2 "Hello, who are you?"
Step 4
Install Python package: pip install ollama
Step 5
Run Python: import ollama; print(ollama.generate(model="llama3.2", prompt="Hello!")["response"])