
SVM Kernels

Learn why linear classifiers break on complex data, how the kernel trick maps points to higher dimensions without doing the heavy math, and how to choose the right kernel (Linear, RBF, Polynomial, Sigmoid) in scikit-learn, with diagrams, charts, and real Python examples.

Section 01

The Bouncer at the Nightclub

Two groups who refuse to be separated
Imagine a nightclub with two types of guests: VIPs wearing gold badges and regular guests wearing silver badges. On a quiet night the VIPs stand on the left side of the dancefloor and regulars on the right. A bouncer can draw a straight rope across the middle and, done, everyone is separated.

But on a busy night, the VIPs stand in a circle in the middle of the dancefloor, completely surrounded by regular guests. Now the bouncer cannot draw any straight rope that separates them. No matter how he angles the rope, some VIPs end up on the wrong side.

This is exactly the problem SVMs face every day. The kernel trick is the bouncer's clever solution: build a raised platform in the centre. Suddenly the VIPs are at a different height, and a flat sheet (a plane) can separate them perfectly.

That raised platform is the core idea behind SVM kernels. This tutorial walks you through the complete journey: from why 2D fails, to how 3D solves it, to the four kernel types you will actually use in production Python code.

💡
What is a Kernel?

A kernel function computes the similarity between two data points as if they had been transformed into a higher-dimensional space, without ever doing the expensive transformation explicitly. It is a mathematical shortcut that makes high-dimensional geometry affordable.


Section 02

Why 2D Classification Fails

A Support Vector Machine in its most basic form finds the widest possible straight line (in 2D) or hyperplane (in N-D) that separates two classes. This works beautifully on linearly separable data. But the real world is rarely that polite.

Circular clusters: one class forms a ring around the other. No straight line exists that can divide inside from outside.
🌀 Interleaved spirals (non-convex): two classes spiral around each other. Even curved lines in 2D cannot separate them cleanly.
🎯 Checkerboard (XOR pattern): classes alternate in a grid. The classic XOR problem, the simplest non-linearly-separable pattern.
📊 Chart: linear vs non-linear data, 2D view. Left panel: linearly separable. Right panel: not linearly separable. Blue = Class A, Red = Class B, dashed = attempted decision boundary.

⚠️
The Linear SVM Assumption

A standard (linear) SVM assumes data is linearly separable, or close to it. Feed it circular data and it will draw a line anyway, but the accuracy will be poor. The kernel trick is how we break free of this assumption.


Section 03

Lifting Data Into 3D: The Kernel Trick

Pancakes vs birthday cakes
Imagine blue and red dots scattered on a table (2D). The red dots form a circle in the middle; blue dots surround them. Now imagine picking up each dot and placing it at a height equal to its distance from the centre. The red inner dots go high; the blue outer dots stay low.

From the side, you now see a hill of red dots elevated above a flat plane of blue dots. A flat sheet of cardboard (a hyperplane) can slice through the air between them perfectly.

The 3D height we added is z = x² + y². Strictly speaking, that lift is the feature transformation; the kernel function is the mathematical shortcut that gives the SVM the same result without ever computing the new dimension explicitly.
🚀 Chart: the kernel trick, lifting 2D data into 3D. Step 1: the original 2D data is unseparable. Step 2: adding z = x² + y² makes it separable, and a horizontal cutting plane (drawn at z = 4.0) splits the classes. Blue circles (inner class) have small r and therefore small z; red triangles (outer class) have large r and large z.
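A minimal sketch of exactly this lift, done by hand: add z = x² + y² as a third feature and let a plain linear SVM cut through it. The dataset and the C value are illustrative choices, not part of the original example.

import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_circles

# Rings: inner class surrounded by outer class
X, y = make_circles(n_samples=400, noise=0.05, factor=0.4, random_state=0)

# A linear SVM on the flat 2D points cannot separate the rings
flat = SVC(kernel='linear', C=1.0).fit(X, y)
print(f"2D linear SVM accuracy: {flat.score(X, y):.3f}")   # roughly chance level

# Lift every point: z = x^2 + y^2, then try the same linear SVM again
z = (X ** 2).sum(axis=1, keepdims=True)
X_lifted = np.hstack([X, z])
lifted = SVC(kernel='linear', C=1.0).fit(X_lifted, y)
print(f"3D linear SVM accuracy: {lifted.score(X_lifted, y):.3f}")  # near perfect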

Naive approach
Compute φ(a) and φ(b), then K(a,b) = φ(a)·φ(b)
Map both points into the high-dimensional space, then take the dot product. Expensive.
Kernel trick
Compute K(a,b) directly from a and b
Get the same dot product without ever forming φ(x). Fast.
🎯
Why the trick works

SVMs only need dot products between pairs of points, not the coordinates themselves. A kernel function computes those dot products directly using the original coordinates. You get all the benefits of working in a 1000-dimensional space while only doing arithmetic in 2D.
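A small numeric check of that statement, using the homogeneous degree-2 polynomial kernel K(a,b) = (aᵀb)² and its explicit feature map. The vectors are illustrative; any pair gives the same agreement.

import numpy as np

def phi(x):
    # Explicit feature map for the degree-2 homogeneous polynomial kernel (2D input)
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

a = np.array([1.5, -0.5])
b = np.array([0.3, 2.0])

naive  = phi(a) @ phi(b)     # map to 3D first, then take the dot product
direct = (a @ b) ** 2        # kernel trick: same number straight from the 2D inputs

print(naive, direct)         # identical up to floating-point rounding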


Section 04

The Four Kernel Types

scikit-learn's SVC ships with four built-in kernels. Each defines a different way of computing similarity, and therefore a different shape for the decision boundary.

📏
Linear
A straight hyperplane. No transformation. Fastest. Best when features > samples or data is already linearly separable.
K(a,b) = aᵀb
🌎
RBF (Gaussian)
Radial Basis Function. Infinite-dimensional mapping. The go-to default. Creates smooth, flexible boundaries. Controls locality with γ.
K(a,b) = exp(−γ‖a−b‖²)
📐
Polynomial
Captures feature interactions up to degree d. Good for NLP and image features where combinations matter.
K(a,b) = (γaᵀb + r)ᵈ
🧠
Sigmoid
Behaves like a two-layer neural network. Useful for specific tasks but rarely outperforms RBF. Not always a valid kernel.
K(a,b) = tanh(γaᵀb + r)
🔨
Custom kernel
Pass your own Python function to SVC. Useful for domain-specific similarity measures: strings, graphs, time series.
kernel=my_func
📊
Precomputed
Pass a pre-built Gram matrix. Best when the similarity computation is expensive and you want to cache it.
kernel='precomputed'
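As a quick illustration of the precomputed option, here is a minimal sketch that builds the Gram matrix with sklearn's rbf_kernel helper and hands it to SVC. The dataset and the gamma value are illustrative; any cached or custom similarity matrix can be used the same way.

from sklearn.svm import SVC
from sklearn.datasets import make_circles
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.model_selection import train_test_split

X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Build the Gram matrices once -- expensive similarities can be cached here
K_train = rbf_kernel(X_train, X_train, gamma=1.0)   # shape (n_train, n_train)
K_test  = rbf_kernel(X_test, X_train, gamma=1.0)    # rows = test points, cols = training points

model = SVC(kernel='precomputed').fit(K_train, y_train)
print(f"Accuracy: {model.score(K_test, y_test):.3f}")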
🎨 Chart: decision boundary comparison. Linear: straight boundary. RBF: circular / elliptical. Polynomial: curved waves. Sigmoid: S-shaped. Blue circles = Class 0, red triangles = Class 1, yellow dashed = decision boundary.


Section 05

Deep Dive: Each Kernel Type

① Linear Kernel

The simplest kernel. It computes the plain dot product between two vectors. No transformation at all. The SVM finds a straight hyperplane.

Formula
K(a, b) = aᵀ · b
Just the dot product. No parameters.
Best for
High-dimensional data
Text classification, TF-IDF vectors, sparse data where features >> samples.
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate simple linearly-separable data
X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Linear kernel SVM
model = SVC(kernel='linear', C=1.0)
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
# Test accuracy: 0.890
OUTPUT
Test accuracy: 0.890
💡
Pro tip: LinearSVC is faster

For large datasets with a linear kernel, use sklearn.svm.LinearSVC instead of SVC(kernel='linear'). It uses a different optimisation algorithm (liblinear rather than libsvm) that scales much better, roughly O(n) instead of O(n²) to O(n³), especially on high-dimensional sparse data.
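A minimal sketch of that swap on synthetic data. The dataset size, C value, and the dual=False setting are illustrative; note that LinearSVC optimises a squared-hinge loss by default, so its scores can differ slightly from SVC.

import time
from sklearn.svm import SVC, LinearSVC
from sklearn.datasets import make_classification

# Synthetic stand-in for a wide feature matrix; sizes are illustrative
X, y = make_classification(n_samples=5000, n_features=300,
                           n_informative=50, random_state=0)

for name, clf in [('SVC(linear)', SVC(kernel='linear', C=1.0)),
                  ('LinearSVC', LinearSVC(C=1.0, dual=False))]:
    t0 = time.time()
    clf.fit(X, y)
    print(f"{name:<12} train acc={clf.score(X, y):.3f}  fit time={time.time() - t0:.2f}s")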

② RBF Kernel (Gaussian)

The Radial Basis Function kernel is the workhorse of SVMs. When you do not know what kernel to use, start here. It maps data into an infinite-dimensional space and creates smooth, flexible decision boundaries.

Formula
K(a, b) = exp(−γ ‖a − b‖²)
Gaussian bell curve centred at each support vector.
γ controls locality
Small γ → smooth / large γ → spiky
γ is the inverse of the influence radius. Low γ = wide influence, high γ = narrow (overfits).
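The formula is easy to verify by hand. A small check with illustrative vectors and gamma, comparing the expression above with sklearn's rbf_kernel helper:

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

a = np.array([[1.0, 2.0]])
b = np.array([[2.5, 0.5]])
gamma = 0.5

manual = np.exp(-gamma * np.sum((a - b) ** 2))   # exp(-gamma * ||a - b||^2)
helper = rbf_kernel(a, b, gamma=gamma)[0, 0]     # sklearn's implementation

print(manual, helper)   # same value: similarity decays with squared distance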
🎭
γ and C: the two knobs you always tune

C (regularisation) controls the tradeoff between a wide margin and correct classification. γ controls the shape of the Gaussian, i.e. how far the influence of a single training point reaches. Always use GridSearchCV to tune both together.

from sklearn.svm import SVC
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
import numpy as np

# Circular data -- linear SVM would fail here
X, y = make_circles(n_samples=400, noise=0.1, factor=0.4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Always scale before using RBF kernel
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test  = scaler.transform(X_test)

# Grid search over C and gamma
param_grid = {
    'C':     [0.1, 1, 10, 100],
    'gamma': ['scale', 'auto', 0.01, 0.1, 1]
}
grid = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
print(f"Best params: {grid.best_params_}")
print(f"Test accuracy: {grid.best_estimator_.score(X_test, y_test):.3f}")
# Best params: {'C': 10, 'gamma': 'scale'}
# Test accuracy: 0.975
OUTPUT
Best params: {'C': 10, 'gamma': 'scale'}
Test accuracy: 0.975

③ Polynomial Kernel

The polynomial kernel captures feature interactions up to degree d. Degree 2 includes all pairwise products (x₁x₂, x₁², x₂², …). Degree 3 adds cubic interactions.

🧮 What degree=2 actually creates
Input
Features [x₁, x₂] in 2D space
Maps to
[1, x₁, x₂, x₁², x₁x₂, x₂²], i.e. 6 dimensions
With kernel
We get all those interactions without creating the 6D vectors, just K(a,b) = (γaᵀb + r)²
from sklearn.svm import SVC
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

# Moon-shaped data -- needs a curved boundary
X, y = make_moons(n_samples=500, noise=0.15, random_state=42)

# Pipeline: scale -> SVM with polynomial kernel
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svm',    SVC(kernel='poly', degree=3, C=5, coef0=1, gamma='scale'))
])

scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
# CV accuracy: 0.982 +/- 0.007

# Compare degrees
for d in [2, 3, 4, 5]:
    pipe.set_params(svm__degree=d)
    s = cross_val_score(pipe, X, y, cv=5)
    print(f"  degree={d}: {s.mean():.3f}")
OUTPUT
CV accuracy: 0.982 +/- 0.007
  degree=2: 0.970
  degree=3: 0.982
  degree=4: 0.979
  degree=5: 0.971

④ Sigmoid Kernel

The sigmoid kernel behaves like a single hidden-layer neural network. It is the least commonly used of the four and is not always a valid (positive semi-definite) kernel for all parameter values.

from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=600, n_features=8, random_state=42)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(kernel='sigmoid', C=1.0, coef0=0, gamma='scale'))
])

scores = cross_val_score(pipe, X, y, cv=5)
print(f"Sigmoid CV accuracy: {scores.mean():.3f}")
# Sigmoid CV accuracy: 0.843  (usually worse than RBF)
OUTPUT
Sigmoid CV accuracy: 0.843
⚠️
Sigmoid is not always a valid kernel

The sigmoid kernel only satisfies Mercer's condition for certain parameter values. If sigmoid is your best performer, it often means you should try a neural network instead; the sigmoid kernel is essentially approximating one.
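Mercer's condition requires the Gram matrix to be positive semi-definite, which is easy to probe numerically. A small sketch with illustrative data and parameter values; whether the smallest eigenvalue actually dips below zero depends on the data and the gamma/coef0 choice.

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel, sigmoid_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * 3                  # illustrative, unscaled data

K_rbf = rbf_kernel(X, gamma=0.5)                   # always positive semi-definite
K_sig = sigmoid_kernel(X, gamma=0.5, coef0=-1.0)   # illustrative parameter choice

print("smallest eigenvalue, RBF    :", np.linalg.eigvalsh(K_rbf).min())
print("smallest eigenvalue, sigmoid:", np.linalg.eigvalsh(K_sig).min())
# The RBF value stays >= 0 (up to numerical noise). For many gamma/coef0
# settings the sigmoid Gram matrix has clearly negative eigenvalues,
# i.e. it violates Mercer's condition.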


Section 06

All Four Kernels: Side-by-Side Code

The cleanest way to understand the difference is to train all four kernels on the same dataset and compare results.

import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Circular data -- hardest case for the linear kernel
X, y = make_circles(n_samples=600, noise=0.1, factor=0.4, random_state=0)

kernels = {
    'linear':  SVC(kernel='linear',  C=1.0),
    'rbf':     SVC(kernel='rbf',     C=10,  gamma='scale'),
    'poly':    SVC(kernel='poly',    C=5,   degree=3, gamma='scale'),
    'sigmoid': SVC(kernel='sigmoid', C=1.0, gamma='scale')
}

print(f"{'Kernel':<12} {'Mean CV Acc':>12} {'Std':>8}")
print("-" * 36)

for name, clf in kernels.items():
    pipe = Pipeline([('scaler', StandardScaler()), ('svm', clf)])
    scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')
    print(f"{name:<12} {scores.mean():>12.3f} {scores.std():>8.3f}")
OUTPUT
Kernel        Mean CV Acc      Std
------------------------------------
linear               0.514    0.033
rbf                  0.975    0.012
poly                 0.971    0.015
sigmoid              0.631    0.044
📊
What the numbers tell you

On circular data: linear barely beats random (0.514 vs 0.500 chance level). RBF and polynomial both solve the problem elegantly (~0.97). Sigmoid partially learns the structure but is inconsistent. This is why RBF is the default starting kernel: it handles both linear and non-linear data reasonably well.


Section 07

How to Choose the Right Kernel

01
Start with RBF
Always begin with SVC(kernel='rbf'). It is the most flexible and handles both linear and non-linear data. Run a quick GridSearchCV over C and gamma. If accuracy is ≥ 0.85 and training is not too slow, you are done.
02
Switch to linear if: features > 1000 or data is text
High-dimensional sparse data (TF-IDF, one-hot encoded text) is already in a high-dimensional space. Adding another mapping wastes time. Use LinearSVC for speed on these datasets.
03
Try polynomial if: feature interactions matter
Polynomial works well for image classification, NLP feature vectors, and any domain where you believe feature combinations (x₁ × x₂) are informative. Start with degree=2 or degree=3.
04
Avoid sigmoid unless you have a specific reason
Sigmoid rarely outperforms RBF. If you need neural-network-like behaviour, consider switching to an actual neural network.
05
Scale your data β€” always
RBF and polynomial kernels are sensitive to feature scale. Always use StandardScaler() before any non-linear kernel.
Kernel     | Best situation                               | Key params              | Scale data? | Speed
Linear     | Text, high-D sparse, features > samples      | C                       | Helps       | Fast
RBF        | Default choice, unknown structure            | C, gamma                | Required    | Medium
Polynomial | Feature interactions, images, NLP            | C, degree, coef0, gamma | Required    | Medium
Sigmoid    | Specific neural-network problems             | C, gamma, coef0         | Required    | Medium
Custom     | Domain-specific similarity (graphs, strings) | Your function           | Depends     | Slow
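The rules and the table above can be folded into a rough starting heuristic. A sketch only: the 1,000-feature threshold and the sparsity check are illustrative defaults, not an official recipe, and the choice should always be confirmed with cross-validation.

from scipy import sparse

def suggest_starting_kernel(X):
    """Rough first guess at a kernel, following the rules above.
    Thresholds are illustrative -- confirm with cross-validation."""
    n_samples, n_features = X.shape
    if sparse.issparse(X) or n_features > 1000 or n_features > n_samples:
        return 'linear'   # high-dimensional / sparse data: linear (or LinearSVC)
    return 'rbf'          # everything else: start with RBF and tune C and gamma

# Usage sketch:
#   kernel = suggest_starting_kernel(X_train)
#   model  = SVC(kernel=kernel)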

Section 08

Complete Worked Example: Iris Dataset

Let us put everything together in a real pipeline: load data, scale it, try multiple kernels, pick the best one, and evaluate on held-out data.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

# 1. Load data
iris = load_iris()
X, y = iris.data, iris.target

# 2. Hold out a true test set (never touched during tuning)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 3. Build pipeline
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svm',    SVC(probability=True))
])

# 4. Grid search across kernels AND hyperparameters
param_grid = [
    {'svm__kernel': ['linear'],
     'svm__C':      [0.1, 1, 10]},
    {'svm__kernel': ['rbf'],
     'svm__C':      [0.1, 1, 10, 100],
     'svm__gamma':  ['scale', 0.01, 0.1]},
    {'svm__kernel': ['poly'],
     'svm__C':      [1, 10],
     'svm__degree': [2, 3],
     'svm__gamma':  ['scale']}
]

grid = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X_train, y_train)

# 5. Evaluate on held-out test set
best = grid.best_estimator_
print(f"Best params : {grid.best_params_}")
print(f"CV accuracy : {grid.best_score_:.3f}")
print(f"Test accuracy: {best.score(X_test, y_test):.3f}")
print()
print(classification_report(y_test, best.predict(X_test),
                            target_names=iris.target_names))
OUTPUT
Best params : {'svm__C': 10, 'svm__gamma': 'scale', 'svm__kernel': 'rbf'}
CV accuracy : 0.983
Test accuracy: 0.967

              precision    recall  f1-score   support
      setosa       1.00      1.00      1.00        10
  versicolor       0.91      1.00      0.95        10
   virginica       1.00      0.90      0.95        10
    accuracy                           0.97        30

Section 09

Visualising Decision Boundaries in Python

A visualisation is worth a hundred accuracy scores. Here is a complete function to plot the decision boundary for any 2D dataset and any kernel.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.datasets import make_circles
from sklearn.preprocessing import StandardScaler

def plot_kernel_boundary(X, y, kernel, C=1.0, gamma='scale',
                          degree=3, ax=None, title=''):
    """Plot 2D decision boundary for a given SVM kernel."""
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    model = SVC(kernel=kernel, C=C, gamma=gamma, degree=degree)
    model.fit(X_scaled, y)

    # Create mesh grid
    h = 0.02
    x_min, x_max = X_scaled[:,0].min()-0.5, X_scaled[:,0].max()+0.5
    y_min, y_max = X_scaled[:,1].min()-0.5, X_scaled[:,1].max()+0.5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                          np.arange(y_min, y_max, h))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

    if ax is None:
        _, ax = plt.subplots(figsize=(6, 5))

    ax.contourf(xx, yy, Z, alpha=0.3, cmap='RdBu')
    ax.contour(xx, yy, Z, colors='k', linewidths=0.5)
    ax.scatter(X_scaled[:,0], X_scaled[:,1], c=y,
               cmap='RdBu', edgecolors='k', s=30)

    # Highlight support vectors
    sv = model.support_vectors_
    ax.scatter(sv[:,0], sv[:,1], s=150, linewidths=2,
               edgecolors='yellow', facecolors='none')

    acc = model.score(X_scaled, y)
    ax.set_title(f"{title} β€” Acc: {acc:.2f}  SVs: {len(sv)}")
    return ax

# Demo on circular data
X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=42)

fig, axes = plt.subplots(1, 4, figsize=(20, 4))
for ax, kern in zip(axes, ['linear','rbf','poly','sigmoid']):
    plot_kernel_boundary(X, y, kernel=kern, C=10, ax=ax, title=kern.upper())

plt.tight_layout()
plt.savefig('kernel_comparison.png', dpi=150, bbox_inches='tight')
plt.show()
🟡
Reading the boundary plot

Yellow circles around data points are the support vectors, the critical points that define the margin. More support vectors = more complex boundary. If you see hundreds of support vectors, your C is too high or your gamma too large and the model is overfitting. A well-tuned RBF model typically uses 5–20% of training points as support vectors.
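Checking that fraction takes one line once a model is fitted. A minimal sketch on the same circular data; the ~50% warning threshold in the comment is the rule of thumb from above, not a hard limit.

from sklearn.svm import SVC
from sklearn.datasets import make_circles
from sklearn.preprocessing import StandardScaler

X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

model = SVC(kernel='rbf', C=10, gamma='scale').fit(X_scaled, y)

frac = len(model.support_) / len(X_scaled)
print(f"Support vectors: {len(model.support_)} / {len(X_scaled)} ({frac:.1%})")
# A fraction well above ~50% usually means gamma or C needs to come down.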


Section 10

Golden Rules: Kernels

🎯 SVM Kernel: Key Rules
1
Always scale your data before using any non-linear kernel. Use StandardScaler inside a Pipeline to prevent data leakage.
2
Start with RBF. It is the most robust default. Only switch to linear if training is too slow, or to polynomial if you have domain knowledge about feature interactions.
3
Tune C and gamma together using GridSearchCV. They interact strongly: a high C with a high gamma will massively overfit. Search them jointly on a log-scale grid.
4
SVMs do not scale to very large datasets. Training complexity is O(n²) to O(n³). For n > 100k, consider SGDClassifier or gradient boosting.
5
The kernel trick is computationally cheap. SVMs only need dot products, and the kernel computes those directly without ever creating the high-dimensional coordinates.
6
Watch the support vector count. Too many support vectors (>50% of training data) signals overfitting or a poor model choice. Reduce gamma or increase regularisation.

Section 11

The C Parameter: Controlling the Margin

The strict vs lenient teacher
Imagine two teachers marking an exam. Teacher A (high C) is a perfectionist: she penalises every single wrong answer heavily and insists the class gets everything right, even if it means creating an impossibly narrow grading curve. Her students score 100% on homework but fail the real exam.

Teacher B (low C) is relaxed: she allows some mistakes, focuses on the big picture, and builds a generous, wide grading curve. Students score 85% on homework but generalise beautifully to the real exam.

C is that leniency dial. It controls how much the SVM is allowed to misclassify training points in exchange for a wider, safer margin.
SVM objective
minimise: ½‖w‖² + C ∑ ξᵢ
w = the weight vector (the margin width is 2/‖w‖), ξᵢ = slack variable (how far each misclassified point sits on the wrong side)
The tradeoff
Small C → wide margin, more errors
Large C → narrow margin, fewer errors
C balances margin maximisation against classification error on training data.
🎭 Chart: C value, margin width vs training accuracy. Three panels: C = 0.01 (very wide margin), C = 1.0 (balanced), C = 1000 (narrow margin).

🎭
What the three charts show

C = 0.01: very wide margin. Some points are misclassified but the boundary generalises well. The model is lenient.

C = 1.0: balanced tradeoff between margin width and training errors. The most common starting value.

C = 1000: razor-thin margin. The model fits every training point including outliers. High training accuracy, overfit risk on new data.

Yellow dashed = decision boundary  |  Shaded band = margin region  |  Blue circles = Class 0  |  Red triangles = Class 1

C value     | Margin     | Training accuracy | Generalisation | Use when
0.001 – 0.1 | Very wide  | Low               | Good           | Noisy data, many overlapping classes
0.5 – 5     | Balanced   | Medium            | Good           | Default starting point, most problems
10 – 100    | Narrow     | High              | Watch it       | Clean data, small overlap
1000+       | Razor thin | Very high         | Overfit risk   | Only if you really know what you are doing
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=500, n_features=2,
                           n_redundant=0, n_informative=2,
                           random_state=42, n_clusters_per_class=1)

print(f"{'C value':<12} {'Train acc':>10} {'CV acc':>10}")
print("-" * 36)

for c in [0.001, 0.01, 0.1, 1, 10, 100, 1000]:
    pipe = Pipeline([
        ('sc',  StandardScaler()),
        ('svm', SVC(kernel='rbf', C=c, gamma='scale'))
    ])
    pipe.fit(X, y)
    train_acc = pipe.score(X, y)
    cv_acc    = cross_val_score(pipe, X, y, cv=5).mean()
    gap       = ' <- overfit!' if train_acc - cv_acc > 0.05 else ''
    print(f"C={c:<10} {train_acc:>10.3f} {cv_acc:>10.3f}{gap}")
OUTPUT
C value       Train acc     CV acc
------------------------------------
C=0.001           0.742      0.738
C=0.01            0.828      0.822
C=0.1             0.880      0.874
C=1               0.908      0.900
C=10              0.928      0.914
C=100             0.952      0.918
C=1000            0.991      0.919 <- overfit!
💡
The generalisation gap is your overfitting alarm

Compare training accuracy to cross-validation accuracy. When C=1000 gives 0.991 training but only 0.919 CV, the model has memorised the training set. The gap of 0.072 is a red flag. The sweet spot here is C=10 to C=100; increasing C beyond 100 no longer improves CV accuracy.


Section 12

The Gamma Parameter: Controlling the Reach

Gamma is specific to the RBF, polynomial, and sigmoid kernels. It defines how far the influence of a single training point reaches. Think of it as the radius of vision each support vector has.

Torchlight vs floodlight
Imagine each training point holds a torch. High gamma means a narrow torchlight: the point only "sees" and influences its immediate neighbours. The decision boundary tightly wraps around each cluster of training points like cling film.

Low gamma means a floodlight: each point illuminates a wide area and influences distant points too. The decision boundary becomes a smooth, gentle curve that considers the big picture.

Too narrow (high gamma) and the model memorises every quirk of the training data. Too wide (low gamma) and it misses the actual class structure.
RBF formula
K(a,b) = exp(−γ ‖a−b‖²)
γ appears in the exponent: higher γ makes the Gaussian bell curve narrower.
sklearn defaults
'scale' = 1/(n_features × Var(X))
'auto' = 1/n_features
'scale' is the recommended default: it accounts for feature variance automatically.
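Both defaults are easy to reproduce by hand. A small sketch on illustrative data (inside a Pipeline, the variance is computed on whatever training matrix actually reaches SVC.fit):

import numpy as np
from sklearn.datasets import make_circles

X, _ = make_circles(n_samples=400, noise=0.1, factor=0.4, random_state=42)

gamma_scale = 1.0 / (X.shape[1] * X.var())   # what gamma='scale' resolves to
gamma_auto  = 1.0 / X.shape[1]               # what gamma='auto' resolves to
print(f"gamma='scale' -> {gamma_scale:.3f}    gamma='auto' -> {gamma_auto:.3f}")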
🔬 Chart: how the decision boundary changes with gamma. Three panels: γ = 0.01 (smooth, underfits), γ = 0.5 (just right), γ = 50 (spiky, overfits).

🔮
What the three charts show

Low γ (0.01): the boundary is too simple and barely distinguishes the two classes. The model underfits.

Balanced γ (0.5): a smooth circular boundary that matches the data structure. The optimal zone.

High γ (50): the boundary wraps tightly around individual training points. Excellent on training data, poor on new data.

Blue circles = Class 0 (inner)  |  Red triangles = Class 1 (outer)  |  Yellow dashed = approximate decision boundary

Gamma             | Boundary shape             | Problem      | When to try
0.001 – 0.01      | Very smooth, almost linear | Underfits    | When RBF behaves like linear
'scale' (default) | Data-driven smoothness     | Good start   | Always try this first
0.1 – 1           | Moderate curves            | Often best   | Most real datasets
1 – 10            | Tight around clusters      | Watch it     | Clean, well-separated data
10 – 100+         | Wraps every point          | Overfit risk | Rarely, and only with a very large C too
from sklearn.svm import SVC
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np

X, y = make_circles(n_samples=400, noise=0.1, factor=0.4, random_state=42)

print(f"{'Gamma':<12} {'CV acc':>8} {'# SVs':>8}")
print("-" * 32)

from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5)

for g in [0.001, 0.01, 0.1, 0.5, 1, 5, 50]:
    pipe = Pipeline([
        ('sc',  StandardScaler()),
        ('svm', SVC(kernel='rbf', C=10, gamma=g))
    ])
    cv_scores = cross_val_score(pipe, X, y, cv=skf)
    pipe.fit(X, y)
    n_sv = pipe.named_steps['svm'].support_vectors_.shape[0]
    print(f"gamma={str(g):<8} {cv_scores.mean():>8.3f} {n_sv:>8d}")
OUTPUT
Gamma          CV acc    # SVs
--------------------------------
gamma=0.001     0.506       41
gamma=0.01      0.724       68
gamma=0.1       0.954       35
gamma=0.5       0.972       27
gamma=1         0.971       22
gamma=5         0.968       18
gamma=50        0.880       12
🔢
Support vector count is a gamma signal

Notice how the number of support vectors falls as gamma rises beyond 0.01. With gamma=0.01, 68 points are needed to define the boundary. With gamma=0.5, only 27: the model is confident. At gamma=50, only 12, but accuracy drops because the boundary wraps so tightly it no longer generalises.


Section 13

C and Gamma: How They Interact

C and gamma do not act independently. Their combined effect determines whether the model underfits, generalises well, or memorises the data. You must tune them jointly, never one at a time.

                       | Low gamma (smooth)                          | High gamma (spiky)
Low C (wide margin)    | Underfit: wide, smooth, too-simple boundary | Partial fit: spiky, but errors are punished less
High C (narrow margin) | Often good: hard margin, smooth boundary    | Overfit: tight, spiky, memorised training data
⚠️
Never tune C and gamma separately

A common mistake is to tune C with gamma fixed, then tune gamma with C fixed. This misses the interaction. High-C + high-gamma is a recipe for severe overfitting. Always search them together in a 2D grid.
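A quick sketch of the table above: evaluate the four corners of the C × gamma square on circular data and compare training accuracy with cross-validated accuracy. The corner values 0.01 and 100 are illustrative.

from itertools import product
from sklearn.svm import SVC
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

X, y = make_circles(n_samples=400, noise=0.1, factor=0.4, random_state=42)

# The four corners of the table -- corner values are illustrative
for C, gamma in product([0.01, 100], [0.01, 100]):
    pipe = Pipeline([('sc', StandardScaler()),
                     ('svm', SVC(kernel='rbf', C=C, gamma=gamma))])
    pipe.fit(X, y)
    cv = cross_val_score(pipe, X, y, cv=5).mean()
    print(f"C={C:<6} gamma={gamma:<6} train={pipe.score(X, y):.3f}  cv={cv:.3f}")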


Section 14

Grid Search: Exhaustive Hyperparameter Tuning

Grid search tries every combination of hyperparameters you specify and uses cross-validation to find the best one.

🔍 How GridSearchCV works
Step 1
Define a parameter grid, e.g. C ∈ {0.1, 1, 10, 100} × gamma ∈ {0.001, 0.01, 0.1, 1} = 16 combinations.
Step 2
For each combination, run k-fold cross-validation (e.g. cv=5) and record the mean score. Total fits = 16 × 5 = 80.
Step 3
Pick the combination with the highest mean CV score. Refit the best model on the full training set.
Step 4
Evaluate on the held-out test set (only once, not during the search).
🗾 Chart: GridSearchCV heatmap of cross-validation accuracy over C × gamma. Brighter = higher CV accuracy. The sweet spot is typically a broad, stable island of good values, not the single overall maximum (which often overfits).

Complete GridSearchCV code

from sklearn.svm import SVC
from sklearn.datasets import make_circles
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import pandas as pd
import numpy as np

X, y = make_circles(n_samples=600, noise=0.1, factor=0.4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

pipe = Pipeline([
    ('sc',  StandardScaler()),
    ('svm', SVC(kernel='rbf'))
])

# Define the parameter grid -- all combinations are tried
param_grid = {
    'svm__C':     [0.1, 1, 10, 100, 500],
    'svm__gamma': [0.001, 0.01, 0.1, 1, 'scale']
}
# 5x5 = 25 combinations x 5 folds = 125 total model fits

grid_search = GridSearchCV(
    estimator  = pipe,
    param_grid = param_grid,
    cv         = 5,              # 5-fold cross-validation
    scoring    = 'accuracy',
    n_jobs     = -1,             # use all CPU cores
    verbose    = 1,
    refit      = True            # refit best model on full train set
)
grid_search.fit(X_train, y_train)

# Results
print("Best params :", grid_search.best_params_)
print(f"Best CV acc  : {grid_search.best_score_:.4f}")
print(f"Test acc     : {grid_search.best_estimator_.score(X_test, y_test):.4f}")

# Show the top 5 results as a table
results = pd.DataFrame(grid_search.cv_results_)
top5 = results.sort_values('rank_test_score')[
    ['param_svm__C', 'param_svm__gamma',
     'mean_test_score', 'std_test_score']
].head(5)
print()
print(top5.to_string(index=False))
OUTPUT
Fitting 5 folds for each of 25 candidates, totalling 125 fits
Best params : {'svm__C': 10, 'svm__gamma': 0.1}
Best CV acc  : 0.9771
Test acc     : 0.9750

 param_svm__C param_svm__gamma  mean_test_score  std_test_score
           10              0.1           0.9771          0.0118
          100              0.1           0.9771          0.0118
           10            scale           0.9729          0.0098
          100            scale           0.9729          0.0098
          500              0.1           0.9750          0.0141
Reading the top-5 table

When two combinations tie on mean accuracy, prefer the one with the lower std, since it is more stable. Here C=10 and C=100 tie on both mean (0.9771) and std (0.0118), so pick the smaller C: it keeps the margin wider and the model simpler.

Plotting the grid search results

import numpy as np
import matplotlib.pyplot as plt

# Pivot the CV results into a heatmap
scores = grid_search.cv_results_['mean_test_score']
scores = scores.reshape(len(param_grid['svm__C']),
                        len(param_grid['svm__gamma']))

fig, ax = plt.subplots(figsize=(8, 5))
im = ax.imshow(scores, interpolation='nearest', cmap='viridis')

ax.set_xticks(range(len(param_grid['svm__gamma'])))
ax.set_yticks(range(len(param_grid['svm__C'])))
ax.set_xticklabels(param_grid['svm__gamma'])
ax.set_yticklabels(param_grid['svm__C'])
ax.set_xlabel('gamma'); ax.set_ylabel('C')
ax.set_title('GridSearchCV - RBF kernel accuracy')

for i in range(scores.shape[0]):
    for j in range(scores.shape[1]):
        ax.text(j, i, f"{scores[i,j]:.2f}",
                ha='center', va='center', color='white', fontsize=9)

plt.colorbar(im, label='CV accuracy')
plt.tight_layout()
plt.savefig('gridsearch_heatmap.png', dpi=150)
plt.show()
⏰️
Grid search cost grows fast

With 4 values for C, 5 for gamma, 2 for kernel degree, and 5 folds, you get 200 model fits. This explodes with large datasets, which is why Random Search often gives better results per unit of compute time.


Section 15

Random Search: Smarter Hyperparameter Tuning

Random search samples hyperparameter combinations at random from specified distributions rather than trying every point on a grid. This sounds less rigorous but is often more efficient in practice.

Why random beats grid on high-dimensional search spaces
Imagine searching for treasure buried somewhere in a field. Grid search places flags in a neat 5×5 grid: 25 spots checked, evenly spaced, completely structured.

But what if the field has 10 dimensions and the treasure is in a narrow valley in only 2 of them? Your neat grid wastes most of its 25 checks on the irrelevant dimensions.

Random search scatters flags randomly. In any 25 checks, it samples 25 distinct values along every dimension, including the two important ones. Bergstra & Bengio (2012) showed that random search often finds better configurations than grid search for the same compute budget.
🎯 Chart: grid search vs random search, coverage comparison. Left: grid search, 25 structured points. Right: random search, 25 varied points. Yellow star = best configuration found; shaded band = the "important" dimension where good values concentrate.

RandomizedSearchCV with continuous distributions

from sklearn.svm import SVC
from sklearn.datasets import make_circles
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from scipy.stats import loguniform, uniform
import numpy as np

X, y = make_circles(n_samples=600, noise=0.1, factor=0.4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

pipe = Pipeline([
    ('sc',  StandardScaler()),
    ('svm', SVC())
])

# Distributions -- far richer than fixed lists
param_dist = {
    'svm__kernel': ['rbf', 'poly', 'linear'],
    'svm__C':      loguniform(1e-2, 1e4),   # sample from [0.01, 10000] on log scale
    'svm__gamma':  loguniform(1e-4, 1e1),   # sample from [0.0001, 10] on log scale
    'svm__degree': [2, 3, 4]                # only used when kernel='poly'
}

rand_search = RandomizedSearchCV(
    estimator          = pipe,
    param_distributions= param_dist,
    n_iter             = 60,     # try 60 random combinations
    cv                 = 5,
    scoring            = 'accuracy',
    n_jobs             = -1,
    random_state       = 42,     # reproducibility
    verbose            = 1
)
rand_search.fit(X_train, y_train)
print("Best params :", rand_search.best_params_)
print(f"Best CV acc  : {rand_search.best_score_:.4f}")
print(f"Test acc     : {rand_search.best_estimator_.score(X_test, y_test):.4f}")
OUTPUT
Fitting 5 folds for each of 60 candidates, totalling 300 fits
Best params : {'svm__C': 47.3, 'svm__gamma': 0.183, 'svm__kernel': 'rbf', 'svm__degree': 2}
Best CV acc  : 0.9792
Test acc     : 0.9833
💡
loguniform vs uniform: which to use for C and gamma

Always use loguniform (log-uniform distribution) for C and gamma. These parameters span many orders of magnitude: good values might be anywhere from 0.01 to 10000. If you use uniform(0.01, 10000), nearly all sampled values will be in the thousands. Log-uniform samples equally from every order of magnitude.
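The claim is easy to check by sampling. A small sketch that draws 10,000 values from each distribution and counts how many fall below 100 (the sample size and the threshold are illustrative):

import numpy as np
from scipy.stats import loguniform, uniform

n = 10_000
u  = uniform(0.01, 10000).rvs(n, random_state=42)     # ~uniform on [0.01, 10000]
lu = loguniform(0.01, 10000).rvs(n, random_state=42)  # log-uniform on the same range

print(f"uniform   : {np.mean(u < 100):.1%} of samples below 100")
print(f"loguniform: {np.mean(lu < 100):.1%} of samples below 100")
# uniform puts roughly 1% of its draws below 100; loguniform spreads its draws
# evenly across the six orders of magnitude, so about two thirds land below 100.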

Grid search vs random search: direct comparison

import time
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from scipy.stats import loguniform

# Same dataset, same pipeline
C_range     = np.logspace(-2, 4, 7)
gamma_range = np.logspace(-4, 1, 6)

# --- Grid Search ---
t0 = time.time()
gs = GridSearchCV(pipe, {
    'svm__C': C_range, 'svm__gamma': gamma_range,
    'svm__kernel': ['rbf']
}, cv=5, n_jobs=-1)
gs.fit(X_train, y_train)
t_grid = time.time() - t0

# --- Random Search (same budget = 42 combinations) ---
t0 = time.time()
rs = RandomizedSearchCV(pipe, {
    'svm__C':      loguniform(1e-2, 1e4),
    'svm__gamma':  loguniform(1e-4, 10),
    'svm__kernel': ['rbf']
}, n_iter=42, cv=5, n_jobs=-1, random_state=42)
rs.fit(X_train, y_train)
t_rand = time.time() - t0

print(f"Grid  -> best acc: {gs.best_score_:.4f}  test: {gs.best_estimator_.score(X_test, y_test):.4f}  time: {t_grid:.1f}s")
print(f"Random-> best acc: {rs.best_score_:.4f}  test: {rs.best_estimator_.score(X_test, y_test):.4f}  time: {t_rand:.1f}s")
OUTPUT
Grid  -> best acc: 0.9750  test: 0.9750  time: 8.4s
Random-> best acc: 0.9792  test: 0.9833  time: 7.1s
                    | Grid Search                               | Random Search
Coverage            | Exhaustive, tries every combination       | Probabilistic, covers the space broadly
Best for            | ≤3 params, small ranges, quick to train   | 4+ params, wide ranges, continuous distributions
Compute cost        | Grows exponentially with param count      | Fixed at n_iter × n_folds (linear)
Reproducibility     | Fully deterministic                       | Set random_state for reproducibility
Finds exact optimum | Yes (within the grid)                     | No, but often finds a better one outside the grid
Recommended for SVM | Quick first pass with 2 params (C, gamma) | Full search across kernel + C + gamma + degree
🏆 Best practice: two-stage hyperparameter search
Stage 1
Coarse random search: wide distributions, n_iter=30–50, cv=3. Goal: find the right order of magnitude for C and gamma.
Stage 2
Fine grid search: a narrow grid around Stage 1's best values, cv=5, evaluated exhaustively. Goal: find the precise optimum.
Evaluate
Run the final model once on the held-out test set. Never peek at test during any stage of search.
from scipy.stats import loguniform
import numpy as np

# Stage 1: coarse random search
coarse = RandomizedSearchCV(
    pipe,
    {'svm__C':      loguniform(1e-3, 1e5),
     'svm__gamma':  loguniform(1e-5, 1e2),
     'svm__kernel': ['rbf', 'poly']},
    n_iter=40, cv=3, n_jobs=-1, random_state=42
)
coarse.fit(X_train, y_train)

best_C     = coarse.best_params_['svm__C']
best_gamma = coarse.best_params_['svm__gamma']
best_kern  = coarse.best_params_['svm__kernel']
print(f"Stage 1 best: kernel={best_kern}  C={best_C:.2f}  gamma={best_gamma:.4f}")

# Stage 2: fine grid around Stage 1 result
fine_C     = np.logspace(np.log10(best_C/10),    np.log10(best_C*10),    5)
fine_gamma = np.logspace(np.log10(best_gamma/10), np.log10(best_gamma*10), 5)

fine = GridSearchCV(
    pipe,
    {'svm__C': fine_C, 'svm__gamma': fine_gamma, 'svm__kernel': [best_kern]},
    cv=5, n_jobs=-1
)
fine.fit(X_train, y_train)
print(f"Stage 2 best: {fine.best_params_}")
print(f"Final test accuracy: {fine.best_estimator_.score(X_test, y_test):.4f}")
OUTPUT
Stage 1 best: kernel=rbf  C=38.74  gamma=0.1831
Stage 2 best: {'svm__C': 47.3, 'svm__gamma': 0.156, 'svm__kernel': 'rbf'}
Final test accuracy: 0.9833
🏆
Why this beats a single large grid search

Stage 1 used only 40 × 3 = 120 fits to identify the right region. Stage 2 used 25 × 5 = 125 fits to fine-tune within it. Total: 245 fits. A naive grid at similar resolution would need thousands of fits; this gets the same final accuracy at a fraction of the compute cost.


Section 16

Hyperparameter Tuning: Golden Rules

🎯 C, Gamma, and Search: Key Rules
1
Scale first, always. C and gamma are sensitive to feature magnitude. Use StandardScaler inside a Pipeline before any hyperparameter search to prevent data leakage.
2
Start with gamma='scale'. It is the sklearn default and automatically accounts for feature variance. Only override it with a specific float once 'scale' gives poor results.
3
Search C and gamma jointly, on a log scale. Use np.logspace(-3, 4, 8) not range(1, 100). These parameters operate over many orders of magnitude; linear spacing wastes 90% of your search budget.
4
Use random search for ≥3 parameters. Grid search with 4 parameters each having 5 values = 625 combinations. Random search with n_iter=50 covers the same space with better per-parameter coverage.
5
Monitor the train/CV gap: it is your overfitting detector. If train accuracy is 0.99 but CV accuracy is 0.87, you are overfitting. Reduce C, reduce gamma, or both.
6
Hold out a true test set and touch it only once. Every time you look at test accuracy and adjust parameters, you are implicitly fitting to the test set. Use cross-validation during search, report final test accuracy exactly once.