Prompt Engineering Tutorial: Zero-Shot, Few-Shot

Section 01

The Story That Explains Prompt Engineering

📖 Real World Analogy

The Genius Intern — Instructions Are Everything

Imagine you hire the most brilliant intern in the world. She speaks 100 languages, has read every book ever written, and can write, code, calculate, and reason at superhuman speed. On her first day, you say: "Do the thing."

She stares at you. Does nothing useful. Not because she's incompetent — but because your instruction was meaningless.

Now you say: "You are a senior tax attorney. I am a freelancer in India. In plain English, list the five most overlooked deductions for me this financial year, with a short example for each."

Suddenly, she produces a brilliant, focused, actionable report. Same person. Different prompt. That delta — from useless to brilliant — is entirely Prompt Engineering.

Prompt Engineering is the discipline of designing, structuring, and refining inputs (prompts) to get the best possible output from a Large Language Model (LLM). It is part art, part science, and entirely learnable.

🌟

Why It Matters More Than You Think

You do not need to retrain a model to dramatically improve its output. The same model, whether GPT-4, Claude, or Gemini, can produce anything from a confidently wrong answer to an expert-level one, depending largely on how the prompt is written. That makes prompt engineering one of the highest-leverage skills for anyone working with LLMs.

Section 02

The Anatomy of a Great Prompt

Before diving into techniques, you need to understand what a prompt actually contains. Most powerful prompts share the same building blocks:

🧰 The 5 Layers of a Production-Grade Prompt

Role

Who the model should behave as. "You are a senior data scientist…"

Context

Background the model needs. "The dataset has 50,000 rows, 12 features, heavy class imbalance…"

Task

Exactly what to do. "Write a Python function that…"

Format

How to structure the output. "Respond in a numbered list. Under each item include a one-line code example."

Constraints

What to avoid. "Do not use external libraries. Keep each explanation under 3 sentences."

💡

Not Every Layer Is Always Needed

Simple tasks need fewer layers. "Translate this to French" is a complete prompt. Complex tasks — code generation, multi-step analysis, creative writing with rules — benefit enormously from all five. Always ask yourself: what does the model need to know that it cannot guess?
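The five layers can be assembled mechanically. The sketch below is one possible way to do it, not a standard API: `PromptLayers` and `render` are hypothetical names introduced here for illustration, and the completed task sentence is an invented example. Only the task is required, which mirrors the point that simple prompts need fewer layers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PromptLayers:
    """Hypothetical container for the five layers; only `task` is required."""
    task: str
    role: Optional[str] = None
    context: Optional[str] = None
    output_format: Optional[str] = None
    constraints: Optional[str] = None


def render(layers: PromptLayers) -> str:
    """Join the layers that are present, in Role/Context/Task/Format/Constraints order."""
    ordered = [
        layers.role,
        f"Context: {layers.context}" if layers.context else None,
        f"Task: {layers.task}",
        f"Format: {layers.output_format}" if layers.output_format else None,
        f"Constraints: {layers.constraints}" if layers.constraints else None,
    ]
    # Skip any layer that was left out.
    return "\n".join(part for part in ordered if part)


# A simple task needs one layer; a complex one uses all five.
simple = render(PromptLayers(task="Translate this to French: 'Good morning'"))
full = render(PromptLayers(
    role="You are a senior data scientist.",
    context="The dataset has 50,000 rows, 12 features, heavy class imbalance.",
    task="Write a Python function that rebalances the classes.",
    output_format="Respond in a numbered list.",
    constraints="Do not use external libraries.",
))
print(full)
```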

Section 03

Technique 1 — Zero-Shot Prompting

📖 Story

The Expert Called Cold

A detective gets a phone call at midnight. No briefing, no case file. The caller just says: "A man is dead. His window is open. There is a puddle of water on the floor and a fish bowl nearby." The detective immediately knows what happened — because her expertise lets her infer from almost nothing.

Zero-shot prompting is asking the model to solve a task with no examples provided. You rely entirely on the knowledge baked into the model during training. It works well for tasks the model has seen many times, and it struggles when precision or output format matters a lot.

Zero-Shot Prompting means you give the model a task directly — no demonstrations, no examples. The model is expected to draw entirely on its pre-trained knowledge.

✅

When It Works

Best for well-defined tasks

Classification, translation, summarisation, answering factual questions. Any task that is well-represented in the model's training data.

❌

When It Fails

Avoid for novel formats

Custom output formats, unusual reasoning styles, niche domain knowledge, or any task where the model lacks sufficient training exposure.

🔍

The Fix

Upgrade to few-shot

If zero-shot gives inconsistent or poorly formatted results, add 2–3 demonstrations (few-shot). This is usually the most impactful first upgrade.

Zero-Shot: Basic Examples

Task	Weak Zero-Shot Prompt	Strong Zero-Shot Prompt
Sentiment	Is this positive or negative?	Classify the sentiment of the following text as exactly one of: Positive, Negative, or Neutral. Output only the label, nothing else. Text: "…"
Summarise	Summarise this article.	Summarise the following article in exactly 3 bullet points. Each bullet must be one sentence and under 20 words. Article: "…"
Code	Write a Python function for sorting.	You are a Python 3.11 expert. Write a pure-Python function that sorts a list of dictionaries by a specified key, ascending or descending. Include type hints and a docstring.

Zero-Shot Code Example — Sentiment Classifier

import anthropic

client = anthropic.Anthropic()

def classify_sentiment_zero_shot(text: str) -> str:
    """Zero-shot sentiment classification — no examples given."""
    prompt = (
        f"Classify the sentiment of the following customer review as exactly "
        f"one of: POSITIVE, NEGATIVE, or NEUTRAL. "
        f"Output ONLY the label — no explanation, no punctuation.\n\n"
        f"Review: {text}"
    )
    message = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=10,
        temperature=0.0,  # classification should be as repeatable as possible
        messages=[{"role": "user", "content": prompt}]
    )
    return message.content[0].text.strip()

# Test it
reviews = [
    "The battery died after 2 days. Terrible product.",
    "Absolutely love this! Best purchase I've made all year.",
    "It arrived on Tuesday. The box was blue."
]

for review in reviews:
    result = classify_sentiment_zero_shot(review)
    print(f"Sentiment: {result:10s} | Review: {review[:50]}...")

OUTPUT

Sentiment: NEGATIVE   | Review: The battery died after 2 days. Terrible product....
Sentiment: POSITIVE   | Review: Absolutely love this! Best purchase I've made all ...
Sentiment: NEUTRAL    | Review: It arrived on Tuesday. The box was blue....

⚠️

Zero-Shot's Hidden Danger — Format Drift

Without constraints, the model might return "The sentiment is Positive." instead of just "POSITIVE". This breaks any downstream parser. Always specify: label only, no punctuation, exact casing. Zero-shot without format rules is a time bomb in production.
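Even with strict format rules, it is safer to validate the reply before passing it downstream. The following is a minimal sketch of such a guard, assuming the three labels used in the classifier above; `normalise_label` and `ALLOWED_LABELS` are hypothetical names, not part of any SDK.

```python
import re

ALLOWED_LABELS = {"POSITIVE", "NEGATIVE", "NEUTRAL"}


def normalise_label(raw: str) -> str:
    """Map a raw model reply to exactly one allowed label, or raise ValueError."""
    # Collect every allowed label that appears as a whole word, ignoring case.
    found = {
        label for label in ALLOWED_LABELS
        if re.search(rf"\b{label}\b", raw, re.IGNORECASE)
    }
    # Zero matches or several matches are both treated as format drift.
    if len(found) != 1:
        raise ValueError(f"Expected exactly one label, got: {raw!r}")
    return found.pop()


print(normalise_label("POSITIVE"))                    # POSITIVE
print(normalise_label("The sentiment is Positive."))  # POSITIVE
```

Raising on ambiguous replies, rather than guessing, keeps drift visible in logs instead of silently corrupting downstream data.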

Section 04

Technique 2 — Few-Shot Prompting

📖 Story

The Apprentice and the Three Paintings

A master painter wants to teach her apprentice a unique style — the exact blend of colours she's developed over 40 years. She cannot describe it in words. So she paints three canvases and says: "Do this. Do this. Do this. Now paint one yourself."

The apprentice looks at the three examples, extracts the pattern, and produces something that perfectly captures the style — without a single explicit rule being stated.

That is few-shot prompting. You show, not tell. The model learns the pattern from your demonstrations and applies it to new input.

Few-Shot Prompting means providing 2–10 input/output examples before your actual task. The model learns the desired format, style, and reasoning pattern from those examples and applies the same pattern to the final input.

📚

The Research Behind Few-Shot

Brown et al. (2020), in the original GPT-3 paper, showed that performance on many benchmarks improved as the number of in-context examples grew from zero to one to several, in some cases approaching or matching task-specific fine-tuned models. The model is doing in-context learning: it adapts its behaviour to the examples at inference time, without any change to its weights.

Few-Shot: The Anatomy of a Shot

🎯 Structure of a Single "Shot"

Input

The example question, text, or task instance the model should process.

Output

The exact, correct response — in the exact format you want for every future answer.

Separator

A clear delimiter (like ### or ---) between shots to avoid confusion.

Actual

The real input at the end — what you actually want answered, following the same format.
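The four parts above map directly onto a small prompt builder. This is a sketch under assumptions: the function name `build_few_shot_prompt`, the `Input:`/`Output:` labels, and the default `---` separator are illustrative choices, not a required convention.

```python
from typing import List, Tuple


def build_few_shot_prompt(instruction: str,
                          shots: List[Tuple[str, str]],
                          actual_input: str,
                          separator: str = "---") -> str:
    """Assemble instruction, shots, and the real input into one prompt string."""
    blocks = [instruction.strip()]
    for example_input, example_output in shots:
        # Every shot uses the identical Input/Output layout.
        blocks.append(f"Input: {example_input}\nOutput: {example_output}")
    # The final block ends with an empty "Output:" for the model to complete.
    blocks.append(f"Input: {actual_input}\nOutput:")
    return f"\n\n{separator}\n".join(blocks)


prompt = build_few_shot_prompt(
    "Classify each review as POSITIVE or NEGATIVE.",
    [("Loved it.", "POSITIVE"), ("Broke in a day.", "NEGATIVE")],
    "Works as described.",
)
print(prompt)
```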

Few-Shot: Visual Diagram

📊 HOW FEW-SHOT WORKS — INPUT → PATTERN → OUTPUT

The model sees 3 examples, extracts the I/O pattern, then applies it to the new input — all within a single API call.

Few-Shot Code Example — Custom Entity Extractor

import anthropic
import json

client = anthropic.Anthropic()

def extract_entities_few_shot(text: str) -> dict:
    """
    Few-shot entity extraction.
    Returns JSON with keys: person, organisation, location, amount.
    """
    few_shot_prompt = """Extract entities from financial news. 
Return ONLY a JSON object with keys: person, organisation, location, amount.
If a key is not present, use null.

---
Input: "Sundar Pichai announced Google will invest $1.2 billion in its Dublin campus."
Output: {"person": "Sundar Pichai", "organisation": "Google", "location": "Dublin", "amount": "$1.2 billion"}

---
Input: "The Reserve Bank of India raised interest rates by 0.25%."
Output: {"person": null, "organisation": "Reserve Bank of India", "location": "India", "amount": "0.25%"}

---
Input: "Elon Musk's X Corp faced regulatory scrutiny in Brussels last month."
Output: {"person": "Elon Musk", "organisation": "X Corp", "location": "Brussels", "amount": null}

---
Input: """ + text + """
Output:"""

    message = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=200,
        messages=[{"role": "user", "content": few_shot_prompt}]
    )
    raw = message.content[0].text.strip()
    return json.loads(raw)

# Test
test = "Mukesh Ambani's Reliance Industries secured a ₹5,000 crore deal in Mumbai."
result = extract_entities_few_shot(test)
print(json.dumps(result, indent=2))

OUTPUT

{ "person": "Mukesh Ambani", "organisation": "Reliance Industries", "location": "Mumbai", "amount": "₹5,000 crore" }
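Calling `json.loads` directly on the reply fails as soon as the model adds a sentence of prose around the object. A more defensive parser might look like the sketch below; `parse_entities` and `EXPECTED_KEYS` are hypothetical helpers, and the "outermost braces" heuristic is an assumption that suits flat objects like this one.

```python
import json

EXPECTED_KEYS = {"person", "organisation", "location", "amount"}


def parse_entities(raw: str) -> dict:
    """Parse the model reply and check it has exactly the expected keys."""
    text = raw.strip()
    # Take the outermost {...} span so surrounding prose is ignored.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        raise ValueError(f"No JSON object in reply: {raw!r}")
    # json.JSONDecodeError is a subclass of ValueError, so callers can catch one type.
    data = json.loads(text[start:end + 1])
    if set(data) != EXPECTED_KEYS:
        raise ValueError(f"Unexpected keys: {sorted(data)}")
    return data


reply = ('Here is the result: {"person": null, "organisation": "Reliance Industries", '
         '"location": "Mumbai", "amount": null}')
print(parse_entities(reply)["organisation"])  # Reliance Industries
```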

How Many Shots Do You Need?

Shots (k)	Use Case	Notes
k = 0	Zero-Shot — model relies on training	Start here. Works for simple, common tasks.
k = 1	One-Shot — single demonstration	Better format control. Still unreliable on edge cases.
k = 3–5	Sweet spot for most tasks	Covers positive, negative, edge cases. High accuracy.
k = 5–10	Complex classification, custom formats	Use for niche domains. Diminishing returns beyond 10.
k > 10	Consider fine-tuning instead	Long prompts → higher cost, longer latency, risk of confusion.

⚡

Golden Rule of Few-Shot Example Selection

Your examples must cover the full range of variability in your task: include easy cases, hard edge cases, and at least one example per output class. A model shown only positive examples will tend to bias toward positive outputs. Diversity in examples usually beats sheer quantity.
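The "one example per output class" rule is easy to check automatically before a prompt ships. A minimal sketch, assuming examples are stored as dicts with an `output` key; `missing_classes` is a hypothetical helper name.

```python
from collections import Counter
from typing import Dict, List


def missing_classes(examples: List[Dict[str, str]], classes: List[str]) -> List[str]:
    """Return the output classes that no few-shot example demonstrates."""
    seen = Counter(example["output"] for example in examples)
    return [label for label in classes if seen[label] == 0]


examples = [
    {"input": "Revenue grew 23% YoY.", "output": "BULLISH"},
    {"input": "Margins expanded again.", "output": "BULLISH"},
    {"input": "CEO resigned amid irregularities.", "output": "BEARISH"},
]
print(missing_classes(examples, ["BULLISH", "BEARISH", "NEUTRAL"]))  # ['NEUTRAL']
```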

Section 05

Technique 3 — Chain-of-Thought (CoT) Prompting

📖 Story

Sherlock Holmes Never Just Says the Answer

Holmes does not walk into a room and blurt: "The butler did it." He examines the mud on the victim's shoes, the ink stain on the suspect's right index finger, the angle of the wound — and then walks you step-by-step through his reasoning. Each step feeds the next. The conclusion emerges from a chain.

When you ask an LLM a hard maths or logic question and expect an instant answer, you are asking Holmes to skip the reasoning. He gets it wrong.

Chain-of-Thought prompting forces the model to show its working, just like Holmes. And — remarkably — this alone dramatically improves accuracy on multi-step problems.

Chain-of-Thought (CoT) Prompting — introduced by Wei et al. (2022) at Google — instructs the model to reason through a problem step-by-step before giving the final answer. The reasoning chain itself becomes part of the context, and each step acts as a scaffold for the next.

🧠

Why CoT Works — The Computational Argument

Transformer models generate one token at a time, with a roughly fixed amount of computation per token. A hard problem might need 20 reasoning steps. Without CoT, the model has to emit the answer directly, which leaves no room to work through those steps explicitly. CoT spreads the steps across many tokens of "thinking", so each step gets its own computation and can build on the ones before it. In effect, you trade context window space for reasoning power.

Three Forms of Chain-of-Thought

🧰

Zero-Shot CoT

Magic phrase trigger

Simply append "Let's think step by step." to your prompt (Kojima et al., 2022). No examples needed. Often effective on maths and logic problems, because the phrase prompts the model to write out intermediate steps before committing to an answer.

📝

Few-Shot CoT

Demonstrated reasoning

Provide 2–3 examples where you show the full reasoning chain, not just the answer. Teaches the model your preferred reasoning style — not just what to conclude, but how to get there.

🧪

Self-Consistency CoT

Sample then vote

Run the same CoT prompt multiple times (e.g. 5×) with a temperature above 0 so that the reasoning chains differ. Take the majority-vote answer across all chains. This often boosts accuracy on complex problems, at the cost of more API calls.

Chain-of-Thought: Visual Comparison

📊 STANDARD PROMPT vs CHAIN-OF-THOUGHT

The same question, same model, same temperature — only the prompt changed. CoT surfaces the computation instead of hiding it.

CoT Code Example — Multi-Step Reasoning with Structured Output

import anthropic
import re

client = anthropic.Anthropic()

def solve_with_cot(problem: str) -> dict:
    """
    Chain-of-Thought solver.
    Forces the model to reason step-by-step, then extract the final answer.
    Returns: {"reasoning": str, "answer": str}
    """
    prompt = f"""You are a careful analytical reasoner.
Solve the problem below step-by-step.

Rules:
1. Show every calculation or logical step.
2. Label each step clearly: Step 1, Step 2, etc.
3. At the very end, write exactly: FINAL ANSWER: <your answer>

Problem: {problem}

Let's think step by step."""

    message = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=800,
        messages=[{"role": "user", "content": prompt}]
    )

    full_response = message.content[0].text
    # Parse out reasoning and final answer
    match = re.search(r"FINAL ANSWER:\s*(.+)", full_response, re.IGNORECASE)
    final_answer = match.group(1).strip() if match else "Not found"

    return {
        "reasoning": full_response,
        "answer": final_answer
    }

# Test with a word problem
problem = """A shopkeeper sells apples at ₹12 each and oranges at ₹8 each.
If a customer buys 5 apples and 7 oranges and pays ₹100,
how much change does the customer receive?"""

result = solve_with_cot(problem)
print("=== REASONING ===")
print(result["reasoning"])
print("\n=== FINAL ANSWER ===")
print(result["answer"])

OUTPUT

=== REASONING ===
Step 1: Calculate cost of 5 apples.
5 × ₹12 = ₹60
Step 2: Calculate cost of 7 oranges.
7 × ₹8 = ₹56
Step 3: Calculate total cost.
₹60 + ₹56 = ₹116
Step 4: Calculate change.
Customer paid ₹100. But the total is ₹116.
The customer does NOT have enough money; they are ₹16 short.
FINAL ANSWER: The customer is short by ₹16 (cannot complete the purchase).

=== FINAL ANSWER ===
The customer is short by ₹16 (cannot complete the purchase).

🌟

Notice What Happened — CoT Found the Trap

A standard prompt might have returned "₹16 change." The CoT approach walked through every step and caught that the customer couldn't afford the purchase at all. Reasoning chains prevent the model from pattern-matching to a plausible-sounding answer instead of the correct one.
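The arithmetic behind the trap is quick to verify by hand or in code. The snippet below simply recomputes the numbers from the problem statement; the variable names are illustrative.

```python
apple_price, orange_price = 12, 8   # rupees each, from the problem
apples, oranges = 5, 7
paid = 100

total = apples * apple_price + oranges * orange_price
balance = paid - total              # negative means the customer is short

print(total)    # 116
print(balance)  # -16
```

The sign of `balance` is exactly what a pattern-matched "₹16 change" answer loses.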

Section 06

Few-Shot CoT — The Combination That Dominates

Few-Shot CoT combines both techniques: you provide 2–3 examples where each example shows a full reasoning chain, not just an answer. This teaches the model both what to produce (format) and how to think (reasoning style).

📖 Story

The Medical Resident and the Attending Physician

A junior doctor watches the attending physician work through three patient cases out loud: "The fever started 3 days ago, combined with the rash pattern, plus the recent travel history — this suggests dengue, not malaria. I'll order NS1 antigen first."

The resident does not just observe the conclusion. She witnesses the clinical reasoning process. When the fourth patient arrives, she can reproduce that same systematic thinking herself.

That is Few-Shot CoT. Show the chain, not just the answer.

Few-Shot CoT: Full Code Example — Medical Triage Classifier

import anthropic

client = anthropic.Anthropic()

FEW_SHOT_COT_SYSTEM = """You are a clinical triage assistant.
For each patient description, reason through the symptoms step-by-step,
then assign exactly one triage level: CRITICAL, URGENT, or STANDARD.

Format your response as:
Reasoning: <numbered steps followed by a one-line conclusion>
Triage: <CRITICAL, URGENT, or STANDARD>"""

FEW_SHOT_EXAMPLES = [
    {
        "role": "user",
        "content": "Patient: 58-year-old male. Sudden severe chest pain radiating to left arm. Profuse sweating. Onset 20 minutes ago."
    },
    {
        "role": "assistant",
        "content": """Reasoning:
Step 1: Sudden severe chest pain in a 58yo male is a cardiac red flag.
Step 2: Radiation to the left arm strongly suggests myocardial involvement.
Step 3: Diaphoresis (profuse sweating) indicates sympathetic response — consistent with acute MI.
Step 4: 20-minute onset means active event, not resolved episode.
Conclusion: Classic STEMI presentation. Every minute counts for myocardial salvage.
Triage: CRITICAL"""
    },
    {
        "role": "user",
        "content": "Patient: 24-year-old female. Persistent headache for 2 days. No fever. No vomiting. No vision changes. Responsive and oriented."
    },
    {
        "role": "assistant",
        "content": """Reasoning:
Step 1: Persistent headache for 2 days — duration warrants attention but not emergency.
Step 2: No fever makes meningitis/encephalitis less likely as an immediate concern.
Step 3: No vomiting or vision changes — no raised ICP indicators.
Step 4: Patient is alert and oriented — no neurological compromise.
Conclusion: Tension or migraine-type headache most likely. Needs evaluation, not emergency bypass.
Triage: URGENT"""
    }
]

def triage_patient(description: str) -> str:
    messages = FEW_SHOT_EXAMPLES + [
        {"role": "user", "content": description}
    ]
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=400,
        system=FEW_SHOT_COT_SYSTEM,
        messages=messages
    )
    return response.content[0].text

# New patient
new_patient = "Patient: 6-year-old child. Mild cough and runny nose for 3 days. Playing normally. No difficulty breathing. Temp: 37.8°C."
print(triage_patient(new_patient))

OUTPUT

Reasoning:
Step 1: Mild cough and runny nose for 3 days in a child: classic upper respiratory infection pattern.
Step 2: "Playing normally" is a critical indicator: children who are critically ill stop playing.
Step 3: No difficulty breathing rules out lower respiratory involvement (bronchitis, pneumonia).
Step 4: Low-grade fever at 37.8°C is typical of viral URI, not sepsis.
Conclusion: Common cold presentation. Needs standard paediatric consultation, not emergency triage.
Triage: STANDARD
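Because the reply mixes free-text reasoning with a machine-readable label, downstream code usually needs only the last line. One way to pull it out is sketched below; `parse_triage` and `TRIAGE_LEVELS` are hypothetical names, and the parser assumes the model followed the "Triage:" line format requested in the system prompt.

```python
import re

TRIAGE_LEVELS = ("CRITICAL", "URGENT", "STANDARD")


def parse_triage(response_text: str) -> str:
    """Return the label on the last 'Triage:' line, or raise ValueError."""
    # MULTILINE lets ^ and $ anchor to each line of the reply.
    matches = re.findall(r"^Triage:\s*(\w+)\s*$", response_text, re.MULTILINE)
    if not matches or matches[-1].upper() not in TRIAGE_LEVELS:
        raise ValueError("No valid triage level found in response")
    return matches[-1].upper()


sample = "Reasoning:\nStep 1: Mild symptoms only.\nConclusion: Common cold.\nTriage: STANDARD"
print(parse_triage(sample))  # STANDARD
```

Rejecting unknown labels matters more here than in most classifiers: a reply the parser cannot read should go to a human, not default to a level.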

Section 07

Advanced CoT Variants

🌐

Tree-of-Thought (ToT)

Branching reasoning

Explore multiple reasoning branches simultaneously. At each step, generate several possible next steps and evaluate which branch is most promising. Better for open-ended creative or strategic problems.

♿

Self-Consistency

Sample & majority vote

Run the same CoT prompt 5–10× with temperature > 0. Extract the final answer from each run. Take the most common answer as the final output. This reduces the impact of lucky guesses and one-off reasoning errors.

📋

ReAct (Reason + Act)

CoT + tool use

Interleave reasoning steps with actions (search, code execution, calculator). The model thinks, then acts, then thinks again based on the action result. Foundation of modern AI agents.

📄

Least-to-Most

Decompose & solve

First prompt the model to break the problem into ordered sub-problems. Then solve each sub-problem in sequence, feeding prior answers as context. Excellent for complex multi-step tasks.

👥

Auto-CoT

Generated demonstrations

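Generate the few-shot CoT demonstrations automatically instead of hand-crafting them: cluster the task's questions (the original method uses k-means), pick a representative question from each cluster for diversity, and let zero-shot CoT write the reasoning chain for each. Scales to large tasks without manual example writing.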

👀

Metacognitive CoT

Confidence & reflection

After the reasoning chain, ask: "How confident are you in each step, on a scale of 1–5? Flag any step you are uncertain about." Forces the model to surface its own uncertainty — critical for high-stakes use cases.
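Of the variants above, Least-to-Most has the simplest control flow to sketch. The version below is an illustration under assumptions: `least_to_most` is a hypothetical function, the prompt wording is invented, and `ask` stands for any callable that sends a prompt to a model and returns text. A stub replaces the real model so the flow can be run offline.

```python
from typing import Callable, List


def least_to_most(problem: str, ask: Callable[[str], str]) -> List[str]:
    """Decompose `problem` into sub-problems, then solve them in order."""
    plan = ask(f"List the sub-problems needed to solve this, one per line:\n{problem}")
    sub_problems = [line.strip() for line in plan.splitlines() if line.strip()]

    answers: List[str] = []
    for sub in sub_problems:
        # Feed every earlier answer back in as context for the next step.
        solved = "\n".join(f"{q} -> {a}" for q, a in zip(sub_problems, answers))
        answers.append(ask(f"Problem: {problem}\nSolved so far:\n{solved}\nNow solve: {sub}"))
    return answers


# Stub model: returns a fixed plan, then numbered placeholder answers.
calls: List[str] = []

def fake_ask(prompt: str) -> str:
    calls.append(prompt)
    if prompt.startswith("List the sub-problems"):
        return "cost of apples\ncost of oranges\n"
    return f"answer {len(calls) - 1}"


print(least_to_most("How much do 5 apples and 7 oranges cost?", fake_ask))
```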

Self-Consistency CoT — Code Example

import anthropic
from collections import Counter
import re

client = anthropic.Anthropic()

def self_consistency_cot(problem: str, n_samples: int = 5) -> dict:
    """
    Self-Consistency CoT:
    Generate n reasoning chains, extract final answers, return majority vote.
    """
    prompt = f"""Solve this maths problem step by step.
At the end, write: ANSWER: <number>

Problem: {problem}

Let's think step by step."""

    answers = []
    reasoning_chains = []

    for i in range(n_samples):
        response = client.messages.create(
            model="claude-opus-4-5",
            max_tokens=500,
            temperature=0.7,   # Temperature > 0 for diverse chains
            messages=[{"role": "user", "content": prompt}]
        )
        text = response.content[0].text
        reasoning_chains.append(text)

        # Capture the number, then drop trailing punctuation so "18." and "18" count as one vote
        match = re.search(r"ANSWER:\s*([\d.,]+)", text)
        if match:
            answers.append(match.group(1).rstrip(".,"))

    # Majority vote
    vote_counts = Counter(answers)
    majority_answer = vote_counts.most_common(1)[0][0] if answers else "No answer"

    return {
        "all_answers": answers,
        "vote_distribution": dict(vote_counts),
        "final_answer": majority_answer,
        "confidence": vote_counts[majority_answer] / n_samples if answers else 0.0
    }

problem = "A tank fills in 6 hours. A drain empties it in 9 hours. If both are open, how many hours to fill an empty tank?"
result = self_consistency_cot(problem, n_samples=5)
print(f"All answers:       {result['all_answers']}")
print(f"Vote distribution: {result['vote_distribution']}")
print(f"Final answer:      {result['final_answer']}")
print(f"Confidence:        {result['confidence']:.0%}")

OUTPUT

All answers:       ['18', '18', '18', '18', '18']
Vote distribution: {'18': 5}
Final answer:      18
Confidence:        100%
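The voted answer can be checked independently with exact rational arithmetic, which is a useful habit when building test cases for a self-consistency pipeline. The rates come straight from the problem statement; the variable names are illustrative.

```python
from fractions import Fraction

fill_rate = Fraction(1, 6)    # fraction of the tank filled per hour
drain_rate = Fraction(1, 9)   # fraction of the tank drained per hour

net_rate = fill_rate - drain_rate   # net fraction gained per hour
hours_to_fill = 1 / net_rate

print(net_rate)       # 1/18
print(hours_to_fill)  # 18
```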

Section 08

Comparing All Three Techniques

Property	Zero-Shot	Few-Shot	Chain-of-Thought
Best for	Simple, well-known tasks	Custom formats & labels	Multi-step reasoning & maths
Examples needed	None	2–10 input/output pairs	0 (Zero-CoT) or 2–5 with chains
Token cost	Lowest	Medium	Highest (long output)
Accuracy on reasoning	Lowest	Medium	Highest
Format consistency	Inconsistent	Very consistent	Varies — requires parsing
Latency	Fastest	Medium	Slowest
Combine with	Add examples if failing	Add CoT to each example	Few-shot + CoT = peak power

🏆

The Practitioner's Decision Tree

Start with Zero-Shot. If accuracy is insufficient or format is inconsistent → upgrade to Few-Shot. If reasoning is still wrong on multi-step problems → add Chain-of-Thought (either "Let's think step by step" or full few-shot CoT chains). If you need maximum accuracy and cost is secondary → add Self-Consistency (5× samples + majority vote). This four-step ladder covers most production scenarios.

Section 09

Prompt Engineering in the Wild — Full Pipeline

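Here is a compact pipeline that assembles a prompt from a single config object. The config declares the task type, optional few-shot examples, an output format, and whether to append a CoT trigger; the builder applies whichever of these are set: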

import anthropic
from enum import Enum
from dataclasses import dataclass
from typing import Optional

client = anthropic.Anthropic()

class TaskType(Enum):
    CLASSIFICATION = "classification"
    REASONING      = "reasoning"
    EXTRACTION     = "extraction"
    GENERATION     = "generation"

@dataclass
class PromptConfig:
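    task_type:   TaskType                  # descriptive tag; build_prompt does not branch on it
    user_input:  str
    system_role: str             = "You are a helpful AI assistant."
    examples:    Optional[list]  = None    # list of {"input": ..., "output": ...} dicts
    use_cot:     bool            = False
    output_fmt:  Optional[str]   = None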

def build_prompt(config: PromptConfig) -> str:
    """Build the optimal prompt for the given config."""
    parts = []

    if config.output_fmt:
        parts.append(f"Output format: {config.output_fmt}\n")

    # Inject few-shot examples if provided
    if config.examples:
        parts.append("Examples:\n")
        for i, ex in enumerate(config.examples, 1):
            parts.append(f"Example {i}:\nInput: {ex['input']}\nOutput: {ex['output']}\n---")
        parts.append("")

    parts.append(f"Input: {config.user_input}")

    if config.use_cot:
        parts.append("\nLet's think step by step.")

    return "\n".join(parts)

def run_pipeline(config: PromptConfig) -> str:
    prompt = build_prompt(config)
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=600,
        system=config.system_role,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text

# ── Example 1: Few-Shot Classification ───────────────────────
classification_config = PromptConfig(
    task_type=TaskType.CLASSIFICATION,
    system_role="You are a financial sentiment classifier. Output one label only.",
    examples=[
        {"input": "Revenue grew 23% YoY. Strong forward guidance.",      "output": "BULLISH"},
        {"input": "CEO resigned amid accounting irregularities.",          "output": "BEARISH"},
        {"input": "Q3 results in line with analyst consensus estimates.",    "output": "NEUTRAL"},
    ],
    user_input="The company beat earnings by 15% and raised its full-year dividend forecast.",
    output_fmt="Exactly one of: BULLISH, BEARISH, or NEUTRAL"
)

# ── Example 2: CoT Reasoning ────────────────────────────────
reasoning_config = PromptConfig(
    task_type=TaskType.REASONING,
    system_role="You are a careful logical reasoner.",
    user_input="If all Bloops are Razzles, and all Razzles are Lazzles, are all Bloops definitely Lazzles?",
    use_cot=True
)

print("=== Classification Result ===")
print(run_pipeline(classification_config))

print("\n=== Reasoning Result ===")
print(run_pipeline(reasoning_config))

OUTPUT

=== Classification Result ===
BULLISH

=== Reasoning Result ===
Step 1: Examine the first premise: "All Bloops are Razzles."
This means every Bloop is contained within the set of Razzles.
Step 2: Examine the second premise: "All Razzles are Lazzles."
This means every Razzle is contained within the set of Lazzles.
Step 3: Apply transitive reasoning.
If Bloops ⊆ Razzles, and Razzles ⊆ Lazzles, then by transitivity: Bloops ⊆ Lazzles.
Step 4: Conclusion.
Yes, all Bloops are definitely Lazzles. The syllogism is valid.

Section 10

Golden Rules of Prompt Engineering

⚡ Non-Negotiable Rules for Production Prompts

Be explicit about output format. Never assume the model will guess the format you want. State it: "Respond with ONLY a JSON object with keys x, y, z. No prose, no markdown fences." Format ambiguity is the #1 source of broken pipelines.

Include what NOT to do. Negative constraints can be as useful as positive ones: "Do not include a disclaimer." "Do not repeat the question." "Do not use bullet points." Without explicit constraints, models often default to padded, hedged, repetitive outputs.

Use few-shot before fine-tuning. Adding 3 high-quality examples costs only a few extra prompt tokens, takes minutes, and can approach the accuracy of a fine-tuned model on structured tasks. Exhaust prompt engineering before spending money on fine-tuning.

For reasoning tasks, try CoT first. Appending "Let's think step by step." takes no effort to write, though it adds output tokens and latency, and it often improves accuracy on maths, logic, and multi-step tasks. Few prompt additions offer a better return for the effort.

Version-control your prompts. Treat prompts like code: use git, write tests, record accuracy metrics for each version. A prompt that works on GPT-4 may degrade silently on the next model version. Prompt regression testing is not optional in production.

Temperature controls creativity vs precision. Classification tasks → temperature=0.0 (as repeatable as the API allows). Creative writing → temperature=0.8–1.0. Self-Consistency CoT → temperature=0.7 (diversity needed). Leaving a classifier on a high default temperature is a silent bug.

Put the most important instruction at the end. Models tend to follow instructions placed near the actual input more reliably than those buried at the top of a long system prompt. Repeat critical constraints at the end of the prompt, especially for long contexts.
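The version-control rule above becomes concrete once each prompt version has a score attached. The harness below is a minimal sketch: `prompt_accuracy` is a hypothetical helper, and `stub_classifier` stands in for a real prompt plus model call so the example runs offline. In practice you would pass in a function that wraps your actual prompt.

```python
from typing import Callable, List, Tuple


def prompt_accuracy(classify: Callable[[str], str],
                    labelled_cases: List[Tuple[str, str]]) -> float:
    """Fraction of labelled cases for which `classify` returns the expected label."""
    if not labelled_cases:
        raise ValueError("Need at least one labelled case")
    correct = sum(1 for text, expected in labelled_cases if classify(text) == expected)
    return correct / len(labelled_cases)


# Stand-in for a real prompt + model call; it never predicts NEUTRAL.
def stub_classifier(text: str) -> str:
    return "NEGATIVE" if "terrible" in text.lower() else "POSITIVE"


CASES = [
    ("The battery died after 2 days. Terrible product.", "NEGATIVE"),
    ("Absolutely love this! Best purchase I've made all year.", "POSITIVE"),
    ("It arrived on Tuesday. The box was blue.", "NEUTRAL"),
]

accuracy = prompt_accuracy(stub_classifier, CASES)
print(f"{accuracy:.2f}")  # 0.67
```

Storing this number alongside each prompt revision in git makes a silent regression after a model upgrade show up as a failing check.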