The Story That Explains Prompt Engineering
She stares at you. Does nothing useful. Not because she's incompetent — but because your instruction was meaningless.
Now you say: "You are a senior tax attorney. I am a freelancer in India. In plain English, list the five most overlooked deductions for me this financial year, with a short example for each."
Suddenly, she produces a brilliant, focused, actionable report. Same person. Different prompt. That delta — from useless to brilliant — is entirely Prompt Engineering.
Prompt Engineering is the discipline of designing, structuring, and refining inputs (prompts) to get the best possible output from a Large Language Model (LLM). It is part art, part science, and entirely learnable.
You do not need to retrain a model to dramatically improve its output. The same model (GPT-4, Claude, Gemini) can produce anything from useless hallucination to expert-level answers depending on how the prompt is written. That makes prompt engineering one of the highest-ROI skills in applied AI right now.
The Anatomy of a Great Prompt
Before diving into techniques, you need to understand what a prompt actually contains. Most powerful prompts share the same building blocks:
Simple tasks need fewer layers. "Translate this to French" is a complete prompt. Complex tasks — code generation, multi-step analysis, creative writing with rules — benefit enormously from all five. Always ask yourself: what does the model need to know that it cannot guess?
Technique 1 — Zero-Shot Prompting
Zero-shot prompting means giving the model a task directly, with no demonstrations or examples, so it must draw entirely on the knowledge baked in during training. It works well for tasks the model has seen many times in its training data and struggles when precision or output format matters a lot.
Zero-Shot: Basic Examples
| Task | Weak Zero-Shot Prompt | Strong Zero-Shot Prompt |
|---|---|---|
| Sentiment | Is this positive or negative? | Classify the sentiment of the following text as exactly one of: Positive, Negative, or Neutral. Output only the label, nothing else. Text: "…" |
| Summarise | Summarise this article. | Summarise the following article in exactly 3 bullet points. Each bullet must be one sentence and under 20 words. Article: "…" |
| Code | Write a Python function for sorting. | You are a Python 3.11 expert. Write a pure-Python function that sorts a list of dictionaries by a specified key, ascending or descending. Include type hints and a docstring. |
Zero-Shot Code Example — Sentiment Classifier
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def classify_sentiment_zero_shot(text: str) -> str:
    """Zero-shot sentiment classification: no examples given."""
    prompt = (
        "Classify the sentiment of the following customer review as exactly "
        "one of: POSITIVE, NEGATIVE, or NEUTRAL. "
        "Output ONLY the label: no explanation, no punctuation.\n\n"
        f"Review: {text}"
    )
    message = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=10,
        messages=[{"role": "user", "content": prompt}],
    )
    return message.content[0].text.strip()


# Test it
reviews = [
    "The battery died after 2 days. Terrible product.",
    "Absolutely love this! Best purchase I've made all year.",
    "It arrived on Tuesday. The box was blue.",
]
for review in reviews:
    result = classify_sentiment_zero_shot(review)
    print(f"Sentiment: {result:10s} | Review: {review[:50]}...")
Without constraints, the model might return "The sentiment is Positive." instead of just "POSITIVE". This breaks any downstream parser. Always specify: label only, no punctuation, exact casing. Zero-shot without format rules is a time bomb in production.
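Even with a strict format instruction, it can help to validate the reply before it reaches downstream code. The sketch below is one possible guard, assuming the three labels used in the prompt above; `normalize_label` and `ALLOWED_LABELS` are hypothetical names for illustration, not part of any SDK.

```python
import re

ALLOWED_LABELS = ("POSITIVE", "NEGATIVE", "NEUTRAL")


def normalize_label(raw: str) -> str:
    """Map a raw model reply to exactly one allowed label, or raise ValueError."""
    # Collect every allowed label that appears as a whole word, ignoring case.
    found = {
        label
        for label in ALLOWED_LABELS
        if re.search(rf"\b{label}\b", raw, re.IGNORECASE)
    }
    if len(found) != 1:
        raise ValueError(f"Expected exactly one label, got: {raw!r}")
    return found.pop()


print(normalize_label("The sentiment is Positive."))  # POSITIVE
print(normalize_label(" neutral\n"))                  # NEUTRAL
```

Raising on zero or multiple labels is a deliberate choice here: a loud failure is usually easier to handle than a silently wrong label.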
Technique 2 — Few-Shot Prompting
The apprentice looks at the three examples, extracts the pattern, and produces something that perfectly captures the style — without a single explicit rule being stated.
That is few-shot prompting. You show, not tell. The model learns the pattern from your demonstrations and applies it to new input.
Few-Shot Prompting means providing 2–10 input/output examples before your actual task. The model learns the desired format, style, and reasoning pattern from those examples and applies the same pattern to the final input.
Brown et al. (2020), in the original GPT-3 paper, showed that performance generally improved as the number of in-context examples grew from zero to one to several, and that on some benchmarks few-shot prompting was competitive with task-specific fine-tuned models. The model is doing in-context learning: it adapts its behaviour to the examples at inference time, without any change to its weights.
Few-Shot: The Anatomy of a Shot
Few-Shot: Visual Diagram
The model sees 3 examples, extracts the I/O pattern, then applies it to the new input — all within a single API call.
Few-Shot Code Example — Custom Entity Extractor
import json

import anthropic

client = anthropic.Anthropic()


def extract_entities_few_shot(text: str) -> dict:
    """
    Few-shot entity extraction.
    Returns JSON with keys: person, organisation, location, amount.
    """
    # Three demonstrations, then the new input. The prompt ends on "Output:"
    # so the model continues with the JSON object.
    few_shot_prompt = """Extract entities from financial news.
Return ONLY a JSON object with keys: person, organisation, location, amount.
If a key is not present, use null.
---
Input: "Sundar Pichai announced Google will invest $1.2 billion in its Dublin campus."
Output: {"person": "Sundar Pichai", "organisation": "Google", "location": "Dublin", "amount": "$1.2 billion"}
---
Input: "The Reserve Bank of India raised interest rates by 0.25%."
Output: {"person": null, "organisation": "Reserve Bank of India", "location": "India", "amount": "0.25%"}
---
Input: "Elon Musk's X Corp faced regulatory scrutiny in Brussels last month."
Output: {"person": "Elon Musk", "organisation": "X Corp", "location": "Brussels", "amount": null}
---
Input: """ + text + """
Output:"""
    message = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=200,
        messages=[{"role": "user", "content": few_shot_prompt}],
    )
    raw = message.content[0].text.strip()
    return json.loads(raw)  # raises json.JSONDecodeError if the reply is not pure JSON


# Test
test = "Mukesh Ambani's Reliance Industries secured a ₹5,000 crore deal in Mumbai."
result = extract_entities_few_shot(test)
print(json.dumps(result, indent=2))
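A bare `json.loads` on the reply fails if the model wraps the object in a code fence or adds a stray sentence. The following is a minimal sketch of a more forgiving parser, assuming the four keys from the prompt above; `parse_entity_json` and `EXPECTED_KEYS` are hypothetical names, and the sample reply is made up for illustration.

```python
import json

EXPECTED_KEYS = {"person", "organisation", "location", "amount"}


def parse_entity_json(raw: str) -> dict:
    """Parse a model reply into the four-key entity dict, or raise ValueError."""
    text = raw.strip()
    # Keep only the span from the first "{" to the last "}" so that
    # surrounding code fences or stray prose do not break json.loads.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        raise ValueError(f"No JSON object found in: {raw!r}")
    data = json.loads(text[start : end + 1])  # JSONDecodeError is a ValueError
    if not isinstance(data, dict) or set(data) != EXPECTED_KEYS:
        raise ValueError(f"Unexpected keys: {data!r}")
    return data


fence = "`" * 3  # built at runtime to simulate a fenced reply
reply = (
    fence
    + 'json\n{"person": null, "organisation": "Reserve Bank of India", '
    + '"location": "India", "amount": "0.25%"}\n'
    + fence
)
print(parse_entity_json(reply)["organisation"])  # Reserve Bank of India
```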
How Many Shots Do You Need?
| Shots (k) | Use Case | Notes |
|---|---|---|
| k = 0 | Zero-Shot — model relies on training | Start here. Works for simple, common tasks. |
| k = 1 | One-Shot — single demonstration | Better format control. Still unreliable on edge cases. |
| k = 3–5 | Sweet spot for most tasks | Covers positive, negative, edge cases. High accuracy. |
| k = 6–10 | Complex classification, custom formats | Use for niche domains. Diminishing returns beyond 10. |
| k > 10 | Consider fine-tuning instead | Long prompts → higher cost, longer latency, risk of confusion. |
Your examples must cover the full range of variability in your task: include easy cases, hard edge cases, and at least one example per output class. A prompt that shows only positive examples will bias the model toward positive outputs. Diversity in examples usually beats quantity.
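The "at least one example per output class" rule is easy to check mechanically before a prompt ships. This is a small sketch under the assumption that examples are stored as dicts with `input` and `output` keys; `missing_labels` is a hypothetical helper name.

```python
from collections import Counter


def missing_labels(examples: list[dict], labels: list[str]) -> list[str]:
    """Return the labels that no example demonstrates, in the order given."""
    seen = Counter(ex["output"] for ex in examples)
    return [label for label in labels if seen[label] == 0]


examples = [
    {"input": "Best purchase I've made all year.", "output": "POSITIVE"},
    {"input": "Absolutely love this!", "output": "POSITIVE"},
    {"input": "The battery died after 2 days.", "output": "NEGATIVE"},
]
print(missing_labels(examples, ["POSITIVE", "NEGATIVE", "NEUTRAL"]))  # ['NEUTRAL']
```

Here the example set is skewed toward POSITIVE and has no NEUTRAL case at all, which is exactly the kind of gap the paragraph above warns about.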
Technique 3 — Chain-of-Thought (CoT) Prompting
When you ask an LLM a hard maths or logic question and expect an instant answer, you are asking Holmes to skip the reasoning. He gets it wrong.
Chain-of-Thought prompting forces the model to show its working, just like Holmes. And — remarkably — this alone dramatically improves accuracy on multi-step problems.
Chain-of-Thought (CoT) Prompting — introduced by Wei et al. (2022) at Google — instructs the model to reason through a problem step-by-step before giving the final answer. The reasoning chain itself becomes part of the context, and each step acts as a scaffold for the next.
Transformer models generate one token at a time, with a fixed amount of computation per token. A hard problem might need 20 reasoning steps. Without CoT, the model has to compress all 20 steps into the computation behind the first few answer tokens, which often fails. CoT spreads those steps across many tokens of "thinking", so each step gets its own computation and can build on the text already written. It is like trading context window space for reasoning power.
Three Forms of Chain-of-Thought
Chain-of-Thought: Visual Comparison
The same question, same model, same temperature — only the prompt changed. CoT surfaces the computation instead of hiding it.
CoT Code Example — Multi-Step Reasoning with Structured Output
import re

import anthropic

client = anthropic.Anthropic()


def solve_with_cot(problem: str) -> dict:
    """
    Chain-of-Thought solver.
    Asks the model to reason step-by-step, then extracts the final answer.
    Returns: {"reasoning": str, "answer": str}
    """
    prompt = f"""You are a careful analytical reasoner.
Solve the problem below step-by-step.

Rules:
1. Show every calculation or logical step.
2. Label each step clearly: Step 1, Step 2, etc.
3. At the very end, write one line that starts with "FINAL ANSWER:" followed by the answer.

Problem: {problem}

Let's think step by step."""
    message = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=800,
        messages=[{"role": "user", "content": prompt}],
    )
    full_response = message.content[0].text

    # Parse out the final answer; the full text is kept as the reasoning trace
    match = re.search(r"FINAL ANSWER:\s*(.+)", full_response, re.IGNORECASE)
    final_answer = match.group(1).strip() if match else "Not found"
    return {
        "reasoning": full_response,
        "answer": final_answer,
    }


# Test with a word problem
problem = """A shopkeeper sells apples at ₹12 each and oranges at ₹8 each.
If a customer buys 5 apples and 7 oranges and pays ₹100,
how much change does the customer receive?"""
result = solve_with_cot(problem)
print("=== REASONING ===")
print(result["reasoning"])
print("\n=== FINAL ANSWER ===")
print(result["answer"])
A direct-answer prompt could easily slip here, for example by returning "₹16 change." Working through the steps gives 5 × ₹12 + 7 × ₹8 = ₹116, which is more than the ₹100 paid, so the customer is ₹16 short and receives no change. Reasoning chains make it harder for the model to pattern-match to a plausible-sounding answer instead of the correct one.
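The arithmetic behind the shopkeeper problem is short enough to check in a few lines of Python. This is only a verification of the numbers in the problem statement, not output from the model.

```python
APPLE_PRICE, ORANGE_PRICE = 12, 8  # rupees each, from the problem statement
apples, oranges, paid = 5, 7, 100

total = apples * APPLE_PRICE + oranges * ORANGE_PRICE
change = paid - total

print(total)   # 116
print(change)  # -16, i.e. the customer is 16 rupees short
```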
Few-Shot CoT — The Combination That Dominates
Few-Shot CoT combines both techniques: you provide 2–3 examples where each example shows a full reasoning chain, not just an answer. This teaches the model both what to produce (format) and how to think (reasoning style).
The resident does not just observe the conclusion. She witnesses the clinical reasoning process. When the fourth patient arrives, she can reproduce that same systematic thinking herself.
That is Few-Shot CoT. Show the chain, not just the answer.
Few-Shot CoT: Full Code Example — Medical Triage Classifier
import anthropic

client = anthropic.Anthropic()

FEW_SHOT_COT_SYSTEM = """You are a clinical triage assistant.
For each patient description, reason through the symptoms step-by-step,
then assign exactly one triage level: CRITICAL, URGENT, or STANDARD.

Format your response as:
Reasoning: numbered steps, then a one-line conclusion
Triage: exactly one of CRITICAL, URGENT, STANDARD"""

# Each demonstration is a user/assistant pair showing the full reasoning chain
FEW_SHOT_EXAMPLES = [
    {
        "role": "user",
        "content": "Patient: 58-year-old male. Sudden severe chest pain radiating to left arm. Profuse sweating. Onset 20 minutes ago.",
    },
    {
        "role": "assistant",
        "content": """Reasoning:
Step 1: Sudden severe chest pain in a 58yo male is a cardiac red flag.
Step 2: Radiation to the left arm strongly suggests myocardial involvement.
Step 3: Diaphoresis (profuse sweating) indicates sympathetic response, consistent with acute MI.
Step 4: 20-minute onset means active event, not resolved episode.
Conclusion: Classic STEMI presentation. Every minute counts for myocardial salvage.
Triage: CRITICAL""",
    },
    {
        "role": "user",
        "content": "Patient: 24-year-old female. Persistent headache for 2 days. No fever. No vomiting. No vision changes. Responsive and oriented.",
    },
    {
        "role": "assistant",
        "content": """Reasoning:
Step 1: Persistent headache for 2 days: duration warrants attention but not emergency.
Step 2: No fever rules out meningitis/encephalitis as immediate concern.
Step 3: No vomiting or vision changes, so no raised ICP indicators.
Step 4: Patient is alert and oriented, so no neurological compromise.
Conclusion: Tension or migraine-type headache most likely. Needs evaluation, not emergency bypass.
Triage: URGENT""",
    },
]


def triage_patient(description: str) -> str:
    # Demonstrations first, then the new patient as the final user turn
    messages = FEW_SHOT_EXAMPLES + [
        {"role": "user", "content": description}
    ]
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=400,
        system=FEW_SHOT_COT_SYSTEM,
        messages=messages,
    )
    return response.content[0].text


# New patient
new_patient = "Patient: 6-year-old child. Mild cough and runny nose for 3 days. Playing normally. No difficulty breathing. Temp: 37.8°C."
print(triage_patient(new_patient))
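Because the reply mixes free-text reasoning with a label, downstream code still has to pull the label out. One possible way to do that is sketched below, assuming the reply ends with a `Triage:` line as in the demonstrations; `extract_triage` is a hypothetical helper and the sample reply is invented for illustration.

```python
import re

TRIAGE_LEVELS = ("CRITICAL", "URGENT", "STANDARD")


def extract_triage(response: str) -> str:
    """Return the level on the last 'Triage:' line, or raise ValueError."""
    # MULTILINE lets ^ and $ match at each line, so only a line consisting
    # of "Triage: <word>" counts, not a mention inside the reasoning text.
    matches = re.findall(r"^Triage:[ \t]*(\w+)[ \t]*$", response, re.MULTILINE)
    if not matches or matches[-1].upper() not in TRIAGE_LEVELS:
        raise ValueError("No valid triage level found in response")
    return matches[-1].upper()


sample = "Reasoning:\nStep 1: Mild symptoms, child playing normally.\nTriage: STANDARD"
print(extract_triage(sample))  # STANDARD
```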
Advanced CoT Variants
Self-Consistency CoT — Code Example
import re
from collections import Counter

import anthropic

client = anthropic.Anthropic()


def self_consistency_cot(problem: str, n_samples: int = 5) -> dict:
    """
    Self-Consistency CoT:
    Generate n reasoning chains, extract final answers, return majority vote.
    """
    prompt = f"""Solve this maths problem step by step.
At the end, write a line that starts with "ANSWER:" followed by the number only.

Problem: {problem}

Let's think step by step."""
    answers = []
    reasoning_chains = []
    for _ in range(n_samples):
        response = client.messages.create(
            model="claude-opus-4-5",
            max_tokens=500,
            temperature=0.7,  # Temperature > 0 for diverse chains
            messages=[{"role": "user", "content": prompt}],
        )
        text = response.content[0].text
        reasoning_chains.append(text)
        match = re.search(r"ANSWER:\s*([\d.,]+)", text)
        if match:
            # Drop trailing punctuation so "18." and "18" count as the same vote
            answers.append(match.group(1).rstrip(".,"))

    # Majority vote
    vote_counts = Counter(answers)
    majority_answer = vote_counts.most_common(1)[0][0] if answers else "No answer"
    return {
        "all_answers": answers,
        "vote_distribution": dict(vote_counts),
        "final_answer": majority_answer,
        # Share of all samples (including unparsable ones) that agree
        "confidence": vote_counts[majority_answer] / n_samples if answers else 0.0,
    }


problem = "A tank fills in 6 hours. A drain empties it in 9 hours. If both are open, how many hours to fill an empty tank?"
result = self_consistency_cot(problem, n_samples=5)
print(f"All answers: {result['all_answers']}")
print(f"Vote distribution: {result['vote_distribution']}")
print(f"Final answer: {result['final_answer']}")
print(f"Confidence: {result['confidence']:.0%}")
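The voting step can be tried without any API calls. The sketch below runs a majority vote over canned reasoning chains for the tank problem (net rate 1/6 - 1/9 = 1/18 of a tank per hour, so 18 hours); the chains are hypothetical stand-ins for sampled responses, and `majority_vote` is an illustrative helper that compares answers numerically rather than as strings.

```python
import re
from collections import Counter


def majority_vote(chains: list[str]) -> tuple[str, float]:
    """Return (most common answer, share of all chains that voted for it)."""
    answers = []
    for chain in chains:
        match = re.search(r"ANSWER:\s*(\d+(?:\.\d+)?)", chain)
        if match:
            # Compare numerically so that "18" and "18.0" count as one vote.
            answers.append(float(match.group(1)))
    if not answers:
        raise ValueError("No chain contained a parsable ANSWER line")
    value, count = Counter(answers).most_common(1)[0]
    return f"{value:g}", count / len(chains)


# Hypothetical canned chains standing in for five sampled responses.
chains = [
    "Net rate = 1/6 - 1/9 = 1/18 tank per hour. ANSWER: 18",
    "Fill 1/6, drain 1/9, net 1/18. ANSWER: 18.0",
    "6 + 9 = 15. ANSWER: 15",
    "Net rate is 1/18 per hour. ANSWER: 18",
    "I am not sure.",
]
print(majority_vote(chains))  # ('18', 0.6)
```

Dividing by the total number of chains, not just the parsable ones, keeps the confidence figure conservative when some samples fail to produce an answer.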
Comparing All Three Techniques
| Property | Zero-Shot | Few-Shot | Chain-of-Thought |
|---|---|---|---|
| Best for | Simple, well-known tasks | Custom formats & labels | Multi-step reasoning & maths |
| Examples needed | None | 2–10 input/output pairs | 0 (Zero-CoT) or 2–5 with chains |
| Token cost | Lowest | Medium | Highest (long output) |
| Accuracy on reasoning | Lowest | Medium | Highest |
| Format consistency | Inconsistent | Very consistent | Varies — requires parsing |
| Latency | Fastest | Medium | Slowest |
| Combine with | Add examples if failing | Add CoT to each example | Few-shot + CoT = peak power |
Start with Zero-Shot. If accuracy is insufficient or format is inconsistent → upgrade to Few-Shot. If reasoning is still wrong on multi-step problems → add Chain-of-Thought (either "Let's think step by step" or full few-shot CoT chains). If you need maximum accuracy and cost is secondary → add Self-Consistency (5× samples + majority vote). This four-step ladder covers most production scenarios.
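The ladder can be written down as a small decision function. This is only a sketch of the rule of thumb above; the function name and the three boolean flags are hypothetical, and in practice each flag would come from evaluating the prompt on real test cases.

```python
def choose_technique(
    format_unreliable: bool = False,
    multi_step_errors: bool = False,
    max_accuracy: bool = False,
) -> str:
    """Return the highest rung of the ladder that the observed problems call for."""
    if max_accuracy:
        return "self-consistency"   # several sampled CoT chains + majority vote
    if multi_step_errors:
        return "chain-of-thought"   # zero-shot trigger or few-shot CoT chains
    if format_unreliable:
        return "few-shot"           # add input/output demonstrations
    return "zero-shot"              # the default starting point


print(choose_technique())                         # zero-shot
print(choose_technique(format_unreliable=True))   # few-shot
print(choose_technique(multi_step_errors=True))   # chain-of-thought
print(choose_technique(max_accuracy=True))        # self-consistency
```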
Prompt Engineering in the Wild — Full Pipeline
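Here is a compact pipeline sketch that assembles a prompt from a configuration object: a system role, an optional output format, optional few-shot examples, and an optional Chain-of-Thought trigger. The technique is chosen by the caller through the config; the code does not select it automatically.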
from dataclasses import dataclass
from enum import Enum
from typing import Optional

import anthropic

client = anthropic.Anthropic()


class TaskType(Enum):
    CLASSIFICATION = "classification"
    REASONING = "reasoning"
    EXTRACTION = "extraction"
    GENERATION = "generation"


@dataclass
class PromptConfig:
    task_type: TaskType  # descriptive only; build_prompt does not branch on it
    user_input: str
    system_role: str = "You are a helpful AI assistant."
    examples: Optional[list] = None
    use_cot: bool = False
    output_fmt: Optional[str] = None


def build_prompt(config: PromptConfig) -> str:
    """Build the user prompt from the layers enabled in the config."""
    parts = []
    if config.output_fmt:
        parts.append(f"Output format: {config.output_fmt}\n")
    # Inject few-shot examples if provided
    if config.examples:
        parts.append("Examples:\n")
        for i, ex in enumerate(config.examples, 1):
            parts.append(f"Example {i}:\nInput: {ex['input']}\nOutput: {ex['output']}\n---")
        parts.append("")
    parts.append(f"Input: {config.user_input}")
    if config.use_cot:
        parts.append("\nLet's think step by step.")
    return "\n".join(parts)


def run_pipeline(config: PromptConfig) -> str:
    prompt = build_prompt(config)
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=600,
        system=config.system_role,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text


# --- Example 1: Few-Shot Classification ---
classification_config = PromptConfig(
    task_type=TaskType.CLASSIFICATION,
    system_role="You are a financial sentiment classifier. Output one label only.",
    examples=[
        {"input": "Revenue grew 23% YoY. Strong forward guidance.", "output": "BULLISH"},
        {"input": "CEO resigned amid accounting irregularities.", "output": "BEARISH"},
        {"input": "Q3 results in line with analyst consensus estimates.", "output": "NEUTRAL"},
    ],
    user_input="The company beat earnings by 15% and raised its full-year dividend forecast.",
    output_fmt="Exactly one of: BULLISH, BEARISH, or NEUTRAL",
)

# --- Example 2: CoT Reasoning ---
reasoning_config = PromptConfig(
    task_type=TaskType.REASONING,
    system_role="You are a careful logical reasoner.",
    user_input="If all Bloops are Razzles, and all Razzles are Lazzles, are all Bloops definitely Lazzles?",
    use_cot=True,
)

print("=== Classification Result ===")
print(run_pipeline(classification_config))
print("\n=== Reasoning Result ===")
print(run_pipeline(reasoning_config))