The Bouncer at the Nightclub
Picture a nightclub bouncer whose job is to keep the VIPs apart from the regular guests by stretching a single straight velvet rope across the floor. On a quiet night this works: VIPs on one side, everyone else on the other. But on a busy night, the VIPs stand in a circle in the middle of the dancefloor, completely surrounded by regular guests. Now the bouncer cannot draw any straight rope that separates them. No matter how he angles the rope, some VIPs end up on the wrong side.
This is exactly the problem SVMs face every day. And the kernel trick is the bouncer's clever solution: build a raised platform in the centre. Suddenly the VIPs are at a different height, and a flat rope (a plane) can separate them perfectly.
That raised platform is the core idea behind SVM kernels. This tutorial walks you through the complete journey – from why 2D fails, to how 3D solves it, to the four kernel types you will actually use in production Python code.
A kernel function computes the similarity between two data points as if they had been transformed into a higher-dimensional space – without ever doing the expensive transformation explicitly. It is a mathematical shortcut that makes high-dimensional geometry affordable.
Why 2D Classification Fails
A Support Vector Machine in its most basic form finds the widest possible straight line (in 2D) or hyperplane (in N-D) that separates two classes. This works beautifully on linearly separable data. But the real world is rarely that polite.
[Figure: two scatter plots. Left – linearly separable data (✓); right – NOT linearly separable data (✗). Blue = Class A, red = Class B, dashed = attempted decision boundary.]
A standard (linear) SVM assumes data is linearly separable – or close to it. Feed it circular data and it will draw a line anyway, but the accuracy will be poor. The kernel trick is how we break free of this assumption.
Lifting Data Into 3D – The Kernel Trick
Take the circular data and give every point a third coordinate: a height. From the side, you now see a hill of red dots elevated above a flat plane of blue dots. A flat sheet of cardboard – a hyperplane – can slice through the air between them perfectly.
The 3D height we added was z = x² + y². That is the kernel function: a mathematical rule for computing the new dimension.
[Figure: two panels. Step 1 – the 2D original (unseparable). Step 2 – after adding z = x² + y² (separable!). Blue circles (inner class) have small r, hence small z; red triangles (outer class) have large r, hence large z. The dashed horizontal line is the cutting plane that separates them.]
SVMs only need dot products between pairs of points, not the coordinates themselves. A kernel function computes those dot products, as they would be in the transformed space, directly from the original coordinates. You get all the benefits of working in a 1000-dimensional space while only doing arithmetic in 2D.
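Here is a small numeric check of that claim – a sketch added for illustration, not code from the article. For the degree-2 polynomial kernel K(a, b) = (a·b)², evaluating the kernel in 2D gives exactly the dot product of the explicitly lifted 3D features φ(x) = (x₁², √2·x₁x₂, x₂²):

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])

# Explicit degree-2 feature map: phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)
def phi(x):
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

explicit = phi(a) @ phi(b)   # dot product computed in the lifted 3D space
shortcut = (a @ b) ** 2      # kernel computed purely in 2D

print(explicit, shortcut)    # both print 121.0 – identical results
```

The kernel never builds φ(x); it gets the same number from 2D arithmetic alone.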
The Four Kernel Types
scikit-learn's SVC ships with four built-in kernels. Each defines a different way of computing similarity – and therefore a different shape for the decision boundary.
[Figure: the four kernels on the same dataset. Linear – straight boundary; RBF – circular/elliptical; Polynomial – curved waves; Sigmoid – S-shaped. Blue circles = Class 0, red triangles = Class 1, yellow dashed = decision boundary.]
Deep Dive – Each Kernel Type
① Linear Kernel
The simplest kernel: K(x, x′) = ⟨x, x′⟩, the plain dot product between two vectors. No transformation at all. The SVM finds a straight hyperplane.
```python
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate simple linearly-separable data
X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Linear kernel SVM
model = SVC(kernel='linear', C=1.0)
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
# Test accuracy: ~0.89 (exact value depends on the split)
```
For large datasets with a linear kernel, use sklearn.svm.LinearSVC instead of SVC(kernel='linear'). It uses a different optimisation algorithm (liblinear rather than libsvm) that scales far better – roughly linear in the number of samples, versus quadratic-to-cubic for libsvm – especially on high-dimensional sparse data.
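A minimal usage sketch (the dataset and parameters are illustrative, and dual="auto" requires scikit-learn ≥ 1.3):

```python
from sklearn.svm import LinearSVC
from sklearn.datasets import make_classification

# A dataset large enough that SVC(kernel='linear') would be painfully slow
X, y = make_classification(n_samples=100_000, n_features=50, random_state=42)

model = LinearSVC(C=1.0, dual="auto", max_iter=10_000)
model.fit(X, y)
print(f"Training accuracy: {model.score(X, y):.3f}")
```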
② RBF Kernel (Gaussian)
The Radial Basis Function kernel, K(x, x′) = exp(−γ‖x − x′‖²), is the workhorse of SVMs. When you do not know what kernel to use, start here. It implicitly maps data into an infinite-dimensional space and creates smooth, flexible decision boundaries.
C (regularisation) controls the tradeoff between a wide margin and correct classification. γ controls the shape of the Gaussian – how far the influence of a single training point reaches. Always use GridSearchCV to tune both together.
```python
from sklearn.svm import SVC
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler

# Circular data – a linear SVM would fail here
X, y = make_circles(n_samples=400, noise=0.1, factor=0.4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Always scale before using the RBF kernel
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Grid search over C and gamma
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': ['scale', 'auto', 0.01, 0.1, 1]
}
grid = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

print(f"Best params: {grid.best_params_}")
print(f"Test accuracy: {grid.best_estimator_.score(X_test, y_test):.3f}")
# Best params: {'C': 10, 'gamma': 'scale'}
# Test accuracy: 0.975
```
③ Polynomial Kernel
The polynomial kernel, K(x, x′) = (γ⟨x, x′⟩ + coef0)^d, captures feature interactions up to degree d. Degree 2 includes all pairwise products (x₁x₂, x₁², x₂², …). Degree 3 adds cubic interactions.
```python
from sklearn.svm import SVC
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

# Moon-shaped data – needs a curved boundary
X, y = make_moons(n_samples=500, noise=0.15, random_state=42)

# Pipeline: scale -> SVM with polynomial kernel
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(kernel='poly', degree=3, C=5, coef0=1, gamma='scale'))
])
scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
# CV accuracy: 0.982 +/- 0.007

# Compare degrees
for d in [2, 3, 4, 5]:
    pipe.set_params(svm__degree=d)
    s = cross_val_score(pipe, X, y, cv=5)
    print(f"  degree={d}: {s.mean():.3f}")
```
④ Sigmoid Kernel
The sigmoid kernel, K(x, x′) = tanh(γ⟨x, x′⟩ + coef0), behaves like a single hidden-layer neural network. It is the least commonly used of the four and is not always a valid (positive semi-definite) kernel for all parameter values.
```python
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=600, n_features=8, random_state=42)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(kernel='sigmoid', C=1.0, coef0=0, gamma='scale'))
])
scores = cross_val_score(pipe, X, y, cv=5)
print(f"Sigmoid CV accuracy: {scores.mean():.3f}")
# Sigmoid CV accuracy: 0.843 (usually worse than RBF)
```
The sigmoid kernel only satisfies Mercer's condition for certain parameter values. If sigmoid is your best performer, it often means you should try a neural network instead – the sigmoid kernel is essentially approximating one.
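If you want to test that advice directly, a rough comparison against a single hidden-layer network is easy to sketch with scikit-learn's MLPClassifier (the 16-unit tanh layer is an arbitrary choice, not a tuned architecture):

```python
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=600, n_features=8, random_state=42)

# One hidden layer of tanh units – the architecture the sigmoid kernel mimics
mlp = Pipeline([
    ('scaler', StandardScaler()),
    ('net', MLPClassifier(hidden_layer_sizes=(16,), activation='tanh',
                          max_iter=2000, random_state=42))
])
print(f"MLP CV accuracy: {cross_val_score(mlp, X, y, cv=5).mean():.3f}")
```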
All Four Kernels – Side-by-Side Code
The cleanest way to understand the difference is to train all four kernels on the same dataset and compare results.
```python
from sklearn.svm import SVC
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Circular data – the hardest case for a linear kernel
X, y = make_circles(n_samples=600, noise=0.1, factor=0.4, random_state=0)

kernels = {
    'linear':  SVC(kernel='linear', C=1.0),
    'rbf':     SVC(kernel='rbf', C=10, gamma='scale'),
    'poly':    SVC(kernel='poly', C=5, degree=3, gamma='scale'),
    'sigmoid': SVC(kernel='sigmoid', C=1.0, gamma='scale')
}

print(f"{'Kernel':<12} {'Mean CV Acc':>12} {'Std':>8}")
print("-" * 36)
for name, clf in kernels.items():
    pipe = Pipeline([('scaler', StandardScaler()), ('svm', clf)])
    scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')
    print(f"{name:<12} {scores.mean():>12.3f} {scores.std():>8.3f}")
```
On circular data: linear barely beats random (0.514 vs the 0.500 chance level). RBF and polynomial both solve the problem elegantly (~0.97). Sigmoid partially learns the structure but is inconsistent. This is why RBF is the default starting kernel – it handles both linear and non-linear data reasonably well.
How to Choose the Right Kernel
1. Start with SVC(kernel='rbf'). It is the most flexible and handles both linear and non-linear data. Run a quick GridSearchCV over C and gamma. If accuracy is ≥ 0.85 and training is not too slow, you are done.
2. For text or other high-dimensional sparse data, switch to LinearSVC for speed on these datasets.
3. Apply StandardScaler() before any non-linear kernel.

| Kernel | Best situation | Key params | Scale data? | Speed |
|---|---|---|---|---|
| Linear | Text, high-D sparse, features > samples | C | Helps | Fast |
| RBF | Default choice, unknown structure | C, gamma | Required | Medium |
| Polynomial | Feature interactions, images, NLP | C, degree, coef0, gamma | Required | Medium |
| Sigmoid | Specific neural-network problems | C, gamma, coef0 | Required | Medium |
| Custom | Domain-specific similarity (graphs, strings) | Your function | Depends | Slow |
Complete Worked Example – Iris Dataset
Let us put everything together in a real pipeline: load data, scale it, try multiple kernels, pick the best one, and evaluate on held-out data.
```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

# 1. Load data
iris = load_iris()
X, y = iris.data, iris.target

# 2. Hold out a true test set (never touched during tuning)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 3. Build pipeline
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(probability=True))
])

# 4. Grid search across kernels AND hyperparameters
param_grid = [
    {'svm__kernel': ['linear'],
     'svm__C': [0.1, 1, 10]},
    {'svm__kernel': ['rbf'],
     'svm__C': [0.1, 1, 10, 100],
     'svm__gamma': ['scale', 0.01, 0.1]},
    {'svm__kernel': ['poly'],
     'svm__C': [1, 10],
     'svm__degree': [2, 3],
     'svm__gamma': ['scale']}
]
grid = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X_train, y_train)

# 5. Evaluate on the held-out test set
best = grid.best_estimator_
print(f"Best params  : {grid.best_params_}")
print(f"CV accuracy  : {grid.best_score_:.3f}")
print(f"Test accuracy: {best.score(X_test, y_test):.3f}")
print()
print(classification_report(y_test, best.predict(X_test),
                            target_names=iris.target_names))
```
Visualising Decision Boundaries in Python
A visualisation is worth a hundred accuracy scores. Here is a complete function to plot the decision boundary for any 2D dataset and any kernel.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.datasets import make_circles
from sklearn.preprocessing import StandardScaler

def plot_kernel_boundary(X, y, kernel, C=1.0, gamma='scale',
                         degree=3, ax=None, title=''):
    """Plot the 2D decision boundary for a given SVM kernel."""
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    model = SVC(kernel=kernel, C=C, gamma=gamma, degree=degree)
    model.fit(X_scaled, y)

    # Create a mesh grid covering the data
    h = 0.02
    x_min, x_max = X_scaled[:, 0].min() - 0.5, X_scaled[:, 0].max() + 0.5
    y_min, y_max = X_scaled[:, 1].min() - 0.5, X_scaled[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

    if ax is None:
        _, ax = plt.subplots(figsize=(6, 5))
    ax.contourf(xx, yy, Z, alpha=0.3, cmap='RdBu')
    ax.contour(xx, yy, Z, colors='k', linewidths=0.5)
    ax.scatter(X_scaled[:, 0], X_scaled[:, 1], c=y,
               cmap='RdBu', edgecolors='k', s=30)

    # Highlight support vectors
    sv = model.support_vectors_
    ax.scatter(sv[:, 0], sv[:, 1], s=150, linewidths=2,
               edgecolors='yellow', facecolors='none')

    acc = model.score(X_scaled, y)
    ax.set_title(f"{title} - Acc: {acc:.2f}  SVs: {len(sv)}")
    return ax

# Demo on circular data
X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=42)
fig, axes = plt.subplots(1, 4, figsize=(20, 4))
for ax, kern in zip(axes, ['linear', 'rbf', 'poly', 'sigmoid']):
    plot_kernel_boundary(X, y, kernel=kern, C=10, ax=ax, title=kern.upper())
plt.tight_layout()
plt.savefig('kernel_comparison.png', dpi=150, bbox_inches='tight')
plt.show()
```
Yellow circles around data points are the support vectors – the critical points that define the margin. More support vectors generally means a more complex boundary. If you see hundreds of support vectors wrapping individual points, your gamma is likely too large – the model is overfitting. A well-tuned RBF model typically uses 5–20% of training points as support vectors.
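You can check that fraction on any fitted model. A quick sketch using the standard SVC attributes (support_vectors_ and n_support_):

```python
from sklearn.svm import SVC
from sklearn.datasets import make_circles
from sklearn.preprocessing import StandardScaler

X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=42)
X = StandardScaler().fit_transform(X)

model = SVC(kernel='rbf', C=10, gamma='scale').fit(X, y)

# n_support_ holds the number of support vectors per class
frac = model.support_vectors_.shape[0] / len(X)
print(f"Support vectors per class: {model.n_support_}")
print(f"Fraction of training set : {frac:.1%}")  # roughly 5-20% is healthy
```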
Golden Rules – Kernels
- Always scale features with StandardScaler inside a Pipeline to prevent data leakage.
- Tune C and gamma together with GridSearchCV. They interact strongly – a high C with a high gamma will massively overfit. Search them jointly on a log-scale grid.
- If a kernel SVM is too slow for your dataset size, consider SGDClassifier or gradient boosting instead.

The C Parameter – Controlling the Margin
Imagine two teachers grading the same class. Teacher A (high C) is strict: every mistake on homework is punished, so students memorise the answer key and then stumble on the real exam. Teacher B (low C) is relaxed – she allows some mistakes, focuses on the big picture, and builds a generous, wide grading curve. Students score 85% on homework but generalise beautifully to the real exam.
C is that leniency dial. It controls how much the SVM is allowed to misclassify training points in exchange for a wider, safer margin.
[Figure: three RBF decision boundaries at increasing C. Small C → wide margin, more errors; large C → narrow margin, fewer errors. Yellow dashed = decision boundary, shaded band = margin region, blue circles = Class 0, red triangles = Class 1.]

- C = 0.01 – very wide margin. Some points are misclassified but the boundary generalises well. The model is lenient.
- C = 1.0 – a balanced tradeoff between margin width and training errors. The most common starting value.
- C = 1000 – razor-thin margin. The model fits every training point, including outliers. High training accuracy, overfit risk on new data.
| C value | Margin | Training accuracy | Generalisation | Use when |
|---|---|---|---|---|
| 0.001 – 0.1 | Very wide | Low | Good | Noisy data, many overlapping classes |
| 0.5 – 5 | Balanced | Medium | Good | Default starting point – most problems |
| 10 – 100 | Narrow | High | Watch it | Clean data, small overlap |
| 1000+ | Razor thin | Very high | Overfit risk | Only if you really know what you are doing |
```python
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=500, n_features=2,
                           n_redundant=0, n_informative=2,
                           random_state=42, n_clusters_per_class=1)

print(f"{'C value':<12} {'Train acc':>10} {'CV acc':>10}")
print("-" * 36)
for c in [0.001, 0.01, 0.1, 1, 10, 100, 1000]:
    pipe = Pipeline([
        ('sc', StandardScaler()),
        ('svm', SVC(kernel='rbf', C=c, gamma='scale'))
    ])
    pipe.fit(X, y)
    train_acc = pipe.score(X, y)
    cv_acc = cross_val_score(pipe, X, y, cv=5).mean()
    gap = '  <- overfit!' if train_acc - cv_acc > 0.05 else ''
    print(f"C={c:<10} {train_acc:>10.3f} {cv_acc:>10.3f}{gap}")
```
Compare training accuracy to cross-validation accuracy. When C=1000 gives 0.991 training but only 0.919 CV, the model has memorised the training set. The gap of 0.072 is a red flag. The sweet spot here is C=10 to C=100 – increasing C beyond 100 no longer improves CV accuracy.
The Gamma Parameter – Controlling the Reach
Gamma is specific to the RBF, polynomial, and sigmoid kernels. It defines how far the influence of a single training point reaches. Think of it as the radius of vision each support vector has.
Low gamma means a floodlight: each point illuminates a wide area and influences distant points too. The decision boundary becomes a smooth, gentle curve that considers the big picture. High gamma means a narrow spotlight: each point lights up only its immediate neighbourhood, so the boundary bends around individual examples.
Too narrow (high gamma) and the model memorises every quirk of the training data. Too wide (low gamma) and it misses the actual class structure.
In scikit-learn, gamma='scale' (the default) resolves to 1 / (n_features × X.var()), while gamma='auto' resolves to 1 / n_features.
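Both string options resolve to numbers you can compute yourself – a quick check (X.var() is the variance over all entries of the training matrix, which is how scikit-learn defines 'scale'):

```python
from sklearn.datasets import make_circles
from sklearn.preprocessing import StandardScaler

X, y = make_circles(n_samples=400, noise=0.1, factor=0.4, random_state=42)
X = StandardScaler().fit_transform(X)

gamma_scale = 1 / (X.shape[1] * X.var())  # what gamma='scale' resolves to
gamma_auto = 1 / X.shape[1]               # what gamma='auto' resolves to
print(f"gamma='scale' -> {gamma_scale:.4f}")
print(f"gamma='auto'  -> {gamma_auto:.4f}")
```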
[Figure: three RBF decision boundaries at increasing gamma. Blue circles = Class 0 (inner), red triangles = Class 1 (outer), yellow dashed = approximate decision boundary.]

- γ = 0.01 – smooth / underfits. The boundary is too simple and barely distinguishes the two classes. The model undershoots.
- γ = 0.5 – just right. A smooth circular boundary matching the data structure. The optimal zone.
- γ = 50 – spiky / overfits. The boundary wraps tightly around individual training points. Excellent on training data, poor on new data.
| Gamma | Boundary shape | Problem | When to try |
|---|---|---|---|
| 0.001 – 0.01 | Very smooth, almost linear | Underfits | When RBF behaves like linear |
| 'scale' (default) | Data-driven smoothness | Good start | Always try this first |
| 0.1 – 1 | Moderate curves | Often best | Most real datasets |
| 1 – 10 | Tight around clusters | Watch it | Clean, well-separated data |
| 10 – 100+ | Wraps every point | Overfit risk | Rarely – only with very large C too |
```python
from sklearn.svm import SVC
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

X, y = make_circles(n_samples=400, noise=0.1, factor=0.4, random_state=42)

skf = StratifiedKFold(n_splits=5)
print(f"{'Gamma':<12} {'CV acc':>8} {'# SVs':>8}")
print("-" * 32)
for g in [0.001, 0.01, 0.1, 0.5, 1, 5, 50]:
    pipe = Pipeline([
        ('sc', StandardScaler()),
        ('svm', SVC(kernel='rbf', C=10, gamma=g))
    ])
    cv_scores = cross_val_score(pipe, X, y, cv=skf)
    pipe.fit(X, y)  # refit on all data to count support vectors
    n_sv = pipe.named_steps['svm'].support_vectors_.shape[0]
    print(f"gamma={str(g):<8} {cv_scores.mean():>8.3f} {n_sv:>8d}")
```
Notice how the number of support vectors drops as gamma rises. With gamma=0.01, 68 points are needed to define the boundary. With gamma=0.5, only 27 – the model is confident. At gamma=50, only 12 – but accuracy drops because the model wraps so tightly it no longer generalises.
C and Gamma – How They Interact
C and gamma do not act independently. Their combined effect determines whether the model underfits, generalises well, or memorises the data. You must tune them jointly β never one at a time.
| | Low gamma (smooth) | High gamma (spiky) |
|---|---|---|
| Low C (wide margin) | Underfit – wide boundary, smooth, too simple | Partial fit – spiky but punished less |
| High C (narrow margin) | Often good – hard margin, smooth boundary | Overfit – tight, spiky, memorised training data |
A common mistake is to tune C with gamma fixed, then tune gamma with C fixed. This misses the interaction. High-C + high-gamma is a recipe for severe overfitting. Always search them together in a 2D grid.
Grid Search – Exhaustive Hyperparameter Tuning
Grid search tries every combination of hyperparameters you specify and uses cross-validation to find the best one.
[Figure: heatmap of mean CV accuracy over the (C, gamma) grid. Brighter = higher CV accuracy. The sweet spot is typically the top-right of a valid island, not the overall maximum (which often overfits).]
Complete GridSearchCV code
```python
from sklearn.svm import SVC
from sklearn.datasets import make_circles
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import pandas as pd

X, y = make_circles(n_samples=600, noise=0.1, factor=0.4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

pipe = Pipeline([
    ('sc', StandardScaler()),
    ('svm', SVC(kernel='rbf'))
])

# Define the parameter grid – all combinations are tried
param_grid = {
    'svm__C': [0.1, 1, 10, 100, 500],
    'svm__gamma': [0.001, 0.01, 0.1, 1, 'scale']
}

# 5 x 5 = 25 combinations x 5 folds = 125 total model fits
grid_search = GridSearchCV(
    estimator=pipe,
    param_grid=param_grid,
    cv=5,                # 5-fold cross-validation
    scoring='accuracy',
    n_jobs=-1,           # use all CPU cores
    verbose=1,
    refit=True           # refit the best model on the full train set
)
grid_search.fit(X_train, y_train)

# Results
print("Best params :", grid_search.best_params_)
print(f"Best CV acc : {grid_search.best_score_:.4f}")
print(f"Test acc    : {grid_search.best_estimator_.score(X_test, y_test):.4f}")

# Show the top 5 results as a table
results = pd.DataFrame(grid_search.cv_results_)
top5 = results.sort_values('rank_test_score')[
    ['param_svm__C', 'param_svm__gamma',
     'mean_test_score', 'std_test_score']
].head(5)
print()
print(top5.to_string(index=False))
```
When two combinations tie on mean accuracy, pick the one with the lower std – it is more stable. C=10 and C=100 both give 0.9771, but the lower std at C=10 makes it the safer choice.
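That tie-breaking rule is easy to apply mechanically from cv_results_. A sketch (the 0.001 tolerance for "tied" is an arbitrary choice):

```python
import pandas as pd

results = pd.DataFrame(grid_search.cv_results_)

# Among configurations within 0.001 of the best mean, prefer the lowest std
best_mean = results['mean_test_score'].max()
tied = results[results['mean_test_score'] >= best_mean - 0.001]
stable = tied.sort_values('std_test_score').iloc[0]

print(stable[['params', 'mean_test_score', 'std_test_score']])
```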
Plotting the grid search results
```python
import matplotlib.pyplot as plt

# Pivot the CV results into a heatmap (rows = C, columns = gamma)
scores = grid_search.cv_results_['mean_test_score']
scores = scores.reshape(len(param_grid['svm__C']),
                        len(param_grid['svm__gamma']))

fig, ax = plt.subplots(figsize=(8, 5))
im = ax.imshow(scores, interpolation='nearest', cmap='viridis')
ax.set_xticks(range(len(param_grid['svm__gamma'])))
ax.set_yticks(range(len(param_grid['svm__C'])))
ax.set_xticklabels(param_grid['svm__gamma'])
ax.set_yticklabels(param_grid['svm__C'])
ax.set_xlabel('gamma'); ax.set_ylabel('C')
ax.set_title('GridSearchCV - RBF kernel accuracy')
for i in range(scores.shape[0]):
    for j in range(scores.shape[1]):
        ax.text(j, i, f"{scores[i, j]:.2f}",
                ha='center', va='center', color='white', fontsize=9)
plt.colorbar(im, label='CV accuracy')
plt.tight_layout()
plt.savefig('gridsearch_heatmap.png', dpi=150)
plt.show()
```
With 4 values for C, 5 for gamma, 2 for the kernel degree, and 5 folds, you get 200 model fits. The count multiplies with every parameter you add, and each fit gets slower on large datasets – which is why Random Search often gives better results per unit of compute time.
Random Search – Smarter Hyperparameter Tuning
Random search samples hyperparameter combinations at random from specified distributions rather than trying every point on a grid. This sounds less rigorous but is often more efficient in practice.
Imagine hunting for buried treasure in a field by planting 25 flags in a neat 5 × 5 grid and digging at each one. But what if the field has 10 dimensions and the treasure is in a narrow valley in only 2 of them? Your neat grid wastes most of its 25 checks on the irrelevant dimensions.
Random search scatters flags randomly. In any 25 checks, it samples 25 unique values along every dimension β including the two important ones. Bergstra & Bengio (2012) showed random search finds better configurations than grid search with the same compute budget.
[Figure: grid search places 25 points in a rigid lattice; random search places 25 points with varied coordinates. Yellow star = best configuration found; shaded band = the "important" dimension where good values concentrate.]
RandomizedSearchCV with continuous distributions
```python
from sklearn.svm import SVC
from sklearn.datasets import make_circles
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from scipy.stats import loguniform

X, y = make_circles(n_samples=600, noise=0.1, factor=0.4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

pipe = Pipeline([
    ('sc', StandardScaler()),
    ('svm', SVC())
])

# Distributions – far richer than fixed lists
param_dist = {
    'svm__kernel': ['rbf', 'poly', 'linear'],
    'svm__C': loguniform(1e-2, 1e4),      # sample [0.01, 10000] on a log scale
    'svm__gamma': loguniform(1e-4, 1e1),  # sample [0.0001, 10] on a log scale
    'svm__degree': [2, 3, 4]              # only used when kernel='poly'
}

rand_search = RandomizedSearchCV(
    estimator=pipe,
    param_distributions=param_dist,
    n_iter=60,           # try 60 random combinations
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    random_state=42,     # reproducibility
    verbose=1
)
rand_search.fit(X_train, y_train)

print("Best params :", rand_search.best_params_)
print(f"Best CV acc : {rand_search.best_score_:.4f}")
print(f"Test acc    : {rand_search.best_estimator_.score(X_test, y_test):.4f}")
```
Always use loguniform (log-uniform distribution) for C and gamma. These parameters span many orders of magnitude – good values might be anywhere from 0.01 to 10000. If you use uniform(0.01, 10000), nearly all sampled values will be in the thousands. Log-uniform samples equally from every order of magnitude.
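A quick numeric check of that claim (the sample size is arbitrary):

```python
from scipy.stats import loguniform, uniform

n = 100_000
lin = uniform(0.01, 10_000 - 0.01).rvs(n, random_state=0)  # uniform on [0.01, 10000]
log = loguniform(0.01, 10_000).rvs(n, random_state=0)

# Fraction of samples landing below C=100
print(f"uniform   : {(lin < 100).mean():.1%} below 100")   # ~1%
print(f"loguniform: {(log < 100).mean():.1%} below 100")   # ~67%
```

With a linear-uniform draw, 99% of the budget goes to C > 100; log-uniform spreads samples evenly across every decade from 0.01 to 10000.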
Grid search vs random search – direct comparison
```python
import time
import numpy as np
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from scipy.stats import loguniform

# Same dataset and pipeline as above
C_range = np.logspace(-2, 4, 7)
gamma_range = np.logspace(-4, 1, 6)

# --- Grid Search: 7 x 6 = 42 combinations ---
t0 = time.time()
gs = GridSearchCV(pipe, {
    'svm__C': C_range, 'svm__gamma': gamma_range,
    'svm__kernel': ['rbf']
}, cv=5, n_jobs=-1)
gs.fit(X_train, y_train)
t_grid = time.time() - t0

# --- Random Search (same budget = 42 combinations) ---
t0 = time.time()
rs = RandomizedSearchCV(pipe, {
    'svm__C': loguniform(1e-2, 1e4),
    'svm__gamma': loguniform(1e-4, 10),
    'svm__kernel': ['rbf']
}, n_iter=42, cv=5, n_jobs=-1, random_state=42)
rs.fit(X_train, y_train)
t_rand = time.time() - t0

print(f"Grid  -> best acc: {gs.best_score_:.4f}  test: {gs.best_estimator_.score(X_test, y_test):.4f}  time: {t_grid:.1f}s")
print(f"Random-> best acc: {rs.best_score_:.4f}  test: {rs.best_estimator_.score(X_test, y_test):.4f}  time: {t_rand:.1f}s")
```
| | Grid Search | Random Search |
|---|---|---|
| Coverage | Exhaustive – tries every combination | Probabilistic – covers the space broadly |
| Best for | ≤3 params, small ranges, quick to train | 4+ params, wide ranges, continuous distributions |
| Compute cost | Grows exponentially with param count | Fixed at n_iter × n_folds (linear) |
| Reproducibility | Fully deterministic | Set random_state for reproducibility |
| Finds exact optimum | Yes (within the grid) | No – but often finds a better one outside the grid |
| Recommended for SVM | Quick first pass with 2 params (C, gamma) | Full search across kernel + C + gamma + degree |
A powerful pattern is to combine the two: a coarse random search to find the right region of the space, then a fine grid search zoomed in around the winner.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from scipy.stats import loguniform

# Stage 1: coarse random search over very wide ranges
coarse = RandomizedSearchCV(
    pipe,
    {'svm__C': loguniform(1e-3, 1e5),
     'svm__gamma': loguniform(1e-5, 1e2),
     'svm__kernel': ['rbf', 'poly']},
    n_iter=40, cv=3, n_jobs=-1, random_state=42
)
coarse.fit(X_train, y_train)
best_C = coarse.best_params_['svm__C']
best_gamma = coarse.best_params_['svm__gamma']
best_kern = coarse.best_params_['svm__kernel']
print(f"Stage 1 best: kernel={best_kern} C={best_C:.2f} gamma={best_gamma:.4f}")

# Stage 2: fine grid around the Stage 1 result (one decade either side)
fine_C = np.logspace(np.log10(best_C / 10), np.log10(best_C * 10), 5)
fine_gamma = np.logspace(np.log10(best_gamma / 10), np.log10(best_gamma * 10), 5)
fine = GridSearchCV(
    pipe,
    {'svm__C': fine_C, 'svm__gamma': fine_gamma, 'svm__kernel': [best_kern]},
    cv=5, n_jobs=-1
)
fine.fit(X_train, y_train)
print(f"Stage 2 best: {fine.best_params_}")
print(f"Final test accuracy: {fine.best_estimator_.score(X_test, y_test):.4f}")
```
Stage 1 used only 40 × 3 = 120 fits to identify the right region. Stage 2 used 25 × 5 = 125 fits to fine-tune within it. Total: 245 fits. A naive grid at similar resolution would need thousands of fits – the same final accuracy at a fraction of the compute cost.
Hyperparameter Tuning – Golden Rules
- Always put StandardScaler inside a Pipeline before any hyperparameter search to prevent data leakage.
- Start with gamma='scale' and only search explicit gamma values if 'scale' gives poor results.
- Search C and gamma on a log scale – np.logspace(-3, 4, 8), not range(1, 100). These parameters operate over many orders of magnitude; linear spacing wastes 90% of your search budget.