The Story That Explains SVM
Picture a royal cartographer asked to draw the border between two kingdoms, each dotted with settlements. He does not draw just any line. He says: "I want the line that keeps the nearest settlement on each side as far away from the boundary as possible: a wide, safe no-man's-land on both sides." He looks only at the settlements nearest to the boundary, the ones that define where the line must go. He calls them his support villages. Every other settlement is irrelevant to where the border is drawn.
The wider the no-man's-land, the more confident he is that future travellers will be assigned to the correct kingdom even if their location is slightly uncertain. This maximum-width no-man's-land is the margin. The boundary line is the hyperplane. The support villages are the support vectors.
That is the Support Vector Machine: find the boundary that maximises the gap between the two classes.
SVM finds the hyperplane that maximises the margin: the distance between the decision boundary and the nearest data points from each class. A larger margin means better generalisation to unseen data.
The Hyperplane: What SVM Is Drawing
In 2D, a decision boundary is a line. In 3D it is a plane. In higher dimensions it is a hyperplane: the same concept, just harder to visualise. SVM works identically regardless of dimensionality.
| Dimensions | Number of Features | Decision Boundary | Example |
|---|---|---|---|
| 2D | 2 features | A straight line | Height vs weight → classify sport |
| 3D | 3 features | A flat plane | Height, weight, age → classify disease |
| nD | n features | A hyperplane | 10,000 word features → classify spam |
Support Vectors: The Only Points That Matter
This is the defining property of SVM: the decision boundary is determined entirely by the handful of points closest to it. Every other point is irrelevant as far as the boundary is concerned. This makes SVM remarkably robust to outliers that are far from the boundary, and also explains why SVM memory usage scales with the number of support vectors, not with total data size.
After fitting an SVM, you can access the support vectors via clf.support_vectors_, their indices via clf.support_, and the count per class via clf.n_support_. A model with very many support vectors is a warning sign: it usually means the classes overlap heavily, the data is noisy, or the hyperparameters are poorly chosen, and generalisation is likely to suffer. Very few support vectors usually means a clean, well-separated dataset.
The SVM Objective: What Is Being Optimised
SVM is fundamentally an optimisation problem. It finds the weight vector w and bias b that define the widest possible margin while correctly classifying all training points.
The hard margin formulation requires all points to be perfectly classified with no violations, which fails whenever the data is noisy, overlapping, or not linearly separable. One misplaced point makes the optimisation infeasible. Real-world data is almost never perfectly separable. Enter the soft margin.
Hard Margin vs Soft Margin: The C Parameter
The soft margin SVM introduces slack variables ξᵢ that allow some points to violate the margin, at a cost controlled by the hyperparameter C.
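For reference, this is the standard soft-margin primal problem (written here for the linear case; with a kernel, xᵢ is replaced by a feature map φ(xᵢ)):

$$
\min_{w,\,b,\,\xi}\;\; \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad\text{subject to}\quad
y_i\,(w\cdot x_i + b) \ge 1 - \xi_i,\qquad \xi_i \ge 0
$$

Forcing every ξᵢ to zero (equivalently, letting C grow without bound) recovers the hard-margin problem described above.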
| C Value | Margin Width | Violations Allowed | Bias/Variance | Risk |
|---|---|---|---|---|
| Very Small (0.001) | Wide margin | Many (very tolerant) | High bias (underfitting) | Too simple, misclassifies many |
| Small (0.1) | Wide | Some violations OK | Leans toward underfitting | Smoother, more generalised |
| Medium (1.0) | Balanced | Moderate | Balanced | Good default starting point |
| Large (100) | Narrow margin | Rarely allowed | High variance (overfitting) | Memorises training noise |
| ∞ (Hard Margin) | Thinnest possible | Zero tolerance | Extreme variance | Fails on non-separable data |
Think of C as how much you care about misclassifying training points. High C = "I hate mistakes; fit the training data as tightly as possible." Low C = "Some mistakes are fine; give me a wide, general boundary." Always tune C with cross-validation. Start at 1.0 and search on a log scale: 0.001 → 0.01 → 0.1 → 1 → 10 → 100.
The Kernel Trick: Handling Non-Linear Data
What if no straight line can separate the classes? The kernel trick handles this case: it implicitly projects the data into a higher-dimensional space where it becomes linearly separable, then finds a hyperplane there. The magic is that you never actually compute the high-dimensional coordinates explicitly; you only compute dot products between points using a kernel function K(xᵢ, xⱼ). Fast, exact, and elegant.
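To see the effect in practice, here is a minimal sketch on sklearn's synthetic make_circles data (one class nested inside the other, so no straight line can separate them); the linear kernel typically scores near chance while the RBF kernel separates the classes almost perfectly:

from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the original 2D space
X, y = make_circles(n_samples=500, factor=0.3, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

for kernel in ['linear', 'rbf']:
    clf = SVC(kernel=kernel, C=1.0, gamma='scale').fit(X_train, y_train)
    print(f"{kernel:6s} kernel accuracy: {clf.score(X_test, y_test):.3f}")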
The Four Kernels: Choosing the Right One
| Kernel | Key Parameter | Effect of High Value | Effect of Low Value | Best For |
|---|---|---|---|---|
| Linear | C only | Tighter fit to training data | Wider margin, more regularised | Text, sparse high-D data |
| RBF | C and γ (gamma) | High γ → wiggly boundary, overfits | Low γ → smooth boundary, underfits | General-purpose, most datasets |
| Polynomial | C, degree d, γ, r | High d → very complex boundary | Low d (=1) → reduces to linear | Image processing, NLP interactions |
| Sigmoid | C, γ, r | Erratic behaviour | May not converge | Rarely used; prefer RBF or Linear |
In the RBF kernel, γ (gamma) controls the "reach" of each training point's influence. High γ: each point only influences its immediate neighbours, producing a very jagged, localised boundary and extreme overfitting. Low γ: each point influences the entire space, producing an overly smooth boundary and underfitting. In sklearn, gamma='scale' (the default, which sets γ = 1 / (n_features × var(X))) is almost always a better starting point than gamma='auto'.
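A quick illustrative sketch of that trade-off on sklearn's noisy make_moons data (the γ values are deliberately exaggerated): large γ typically pushes training accuracy up while test accuracy falls.

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for gamma in [0.01, 1, 100]:
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('svc', SVC(kernel='rbf', C=1.0, gamma=gamma))
    ])
    pipe.fit(X_train, y_train)
    # Large gamma: training accuracy climbs, test accuracy drops (overfitting)
    print(f"gamma={gamma:<6} "
          f"train={pipe.score(X_train, y_train):.3f} "
          f"test={pipe.score(X_test, y_test):.3f}")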
Python Implementation
Basic SVC: Iris Dataset with All Kernels
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# ⚠️ CRITICAL: Always scale features before SVM
# SVM is distance-based; unscaled features dominate unfairly
kernels = ['linear', 'rbf', 'poly', 'sigmoid']
for kernel in kernels:
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('svc', SVC(
            kernel=kernel,
            C=1.0,
            gamma='scale',   # 1 / (n_features * var(X))
            random_state=42
        ))
    ])
    pipe.fit(X_train, y_train)
    acc = pipe.score(X_test, y_test)
    nsv = pipe['svc'].n_support_.sum()  # total support vectors
    print(f"Kernel: {kernel:8s} | Accuracy: {acc:.4f} | Support Vectors: {nsv}")
Inspecting the Fitted Model
# Fit a single model and inspect internals
pipe = Pipeline([
('scaler', StandardScaler()),
('svc', SVC(kernel='rbf', C=1.0, gamma='scale',
probability=True, random_state=42))
])
pipe.fit(X_train, y_train)
svc = pipe['svc']
print("Support vectors per class:", svc.n_support_)
print("Total support vectors: ", svc.n_support_.sum())
print("Support vector shape: ", svc.support_vectors_.shape)
# Predict probabilities (requires probability=True at fit time)
y_prob = pipe.predict_proba(X_test)
y_pred = pipe.predict(X_test)
print("\nSample predictions:")
for i in range(4):
    cls = iris.target_names[y_pred[i]]
    conf = y_prob[i].max() * 100
    print(f" Predicted: {cls:12s} ({conf:.1f}% confidence)")
Hyperparameter Tuning: GridSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np
pipe = Pipeline([
('scaler', StandardScaler()),
('svc', SVC(random_state=42))
])
# Search over C and gamma on a log scale
param_grid = {
'svc__kernel': ['rbf', 'linear'],
'svc__C': [0.01, 0.1, 1, 10, 100],
'svc__gamma': ['scale', 'auto', 0.001, 0.01, 0.1],
}
grid = GridSearchCV(
pipe, param_grid,
cv=5,
scoring='accuracy',
n_jobs=-1,
verbose=0
)
grid.fit(X_train, y_train)
print("Best parameters: ", grid.best_params_)
print(f"Best CV accuracy: {grid.best_score_:.4f}")
print(f"Test accuracy: {grid.score(X_test, y_test):.4f}")
# Show top 5 combinations
import pandas as pd
results = pd.DataFrame(grid.cv_results_)
top5 = results.nsmallest(5, 'rank_test_score')[
['param_svc__kernel', 'param_svc__C',
'param_svc__gamma', 'mean_test_score']
]
print(top5.to_string(index=False))
LinearSVC: Faster for Large Datasets and Text
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_20newsgroups
from sklearn.metrics import classification_report
# LinearSVC is much faster than SVC(kernel='linear') for large datasets
# Uses liblinear instead of libsvm; scales to millions of samples
categories = ['sci.space', 'comp.graphics',
'rec.sport.hockey', 'talk.politics.guns']
train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)
text_pipeline = Pipeline([
('tfidf', TfidfVectorizer(
sublinear_tf=True,
max_features=50000,
ngram_range=(1, 2),
stop_words='english'
)),
('clf', LinearSVC(
C=1.0,
max_iter=2000,
random_state=42
))
])
text_pipeline.fit(train.data, train.target)
y_pred = text_pipeline.predict(test.data)
print(f"Test Accuracy: {text_pipeline.score(test.data, test.target):.4f}")
print(classification_report(
test.target, y_pred,
target_names=train.target_names
))
Use LinearSVC when: the dataset has more than ~10,000 samples, the features are already high-dimensional and sparse (text), or you need the fastest possible training. Use SVC(kernel='linear') when: you need predict_proba(), or your dataset is small enough that the overhead doesn't matter. For non-linear boundaries, use SVC(kernel='rbf'), or on very large datasets a kernel approximation (such as Nystroem) feeding a linear model.
SVM for Regression: SVR
SVM can also do regression. Instead of finding a boundary that separates classes, SVR (Support Vector Regression) finds a tube of width 2ε that contains as many training points as possible while staying as flat as possible. Points inside the tube contribute zero loss. Points outside the tube are penalised.
from sklearn.svm import SVR
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score, mean_absolute_error
housing = fetch_california_housing()
X, y = housing.data, housing.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# SVR β always needs StandardScaler
svr_pipe = Pipeline([
('scaler', StandardScaler()),
('svr', SVR(
kernel='rbf',
C=100, # penalty for points outside tube
epsilon=0.1, # tube half-width; zero loss inside
gamma='scale'
))
])
svr_pipe.fit(X_train, y_train)
y_pred = svr_pipe.predict(X_test)
print(f"RΒ² Score: {r2_score(y_test, y_pred):.4f}")
print(f"MAE: ${mean_absolute_error(y_test, y_pred)*100_000:.0f}")
print(f"Support Vectors: {svr_pipe['svr'].n_support_[0]}")
- ε (epsilon): width of the no-penalty tube. Larger ε → fewer support vectors, smoother fit, less sensitive to noise. Smaller ε → more support vectors, tighter fit.
- C: penalty for points outside the tube. High C forces points inside, giving a tight fit and an overfitting risk.
- kernel + gamma: same meaning as in classification with SVC.
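As a rough sketch of the ε effect, re-using the California-housing split and imports from above (a subsample keeps runtime short), the support-vector count should fall as the tube widens:

# Sketch: a wider epsilon tube leaves fewer points outside it -> fewer support vectors
for eps in [0.01, 0.1, 0.5]:
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('svr', SVR(kernel='rbf', C=100, epsilon=eps, gamma='scale'))
    ])
    pipe.fit(X_train[:3000], y_train[:3000])   # subsample for speed
    print(f"epsilon={eps:<5} support vectors: {len(pipe['svr'].support_)}")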
Feature Importance with Linear SVM
Linear SVMs have a weight vector w that directly indicates how much each feature contributes to the decision boundary. This is equivalent to feature importance: the magnitude of each coefficient tells you how strongly that feature pushes a prediction toward one class.
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_breast_cancer
import numpy as np
import pandas as pd
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
pipe = Pipeline([
('scaler', StandardScaler()),
('svm', LinearSVC(C=1.0, max_iter=5000, random_state=42))
])
pipe.fit(X, y)
# Coefficients: positive pushes toward class 1 (benign), negative toward class 0 (malignant)
coef = pipe['svm'].coef_[0]
feat_df = pd.DataFrame({
'feature': cancer.feature_names,
'coefficient': coef
}).sort_values('coefficient', key=lambda x: x.abs(), ascending=False)
print("Top features pushing toward MALIGNANT (positive coef):")
print(feat_df[feat_df['coefficient'] > 0].head(5).to_string(index=False))
print("\nTop features pushing toward BENIGN (negative coef):")
print(feat_df[feat_df['coefficient'] < 0].head(5).to_string(index=False))
coef_ is only available for LinearSVC and SVC(kernel='linear'). For RBF, polynomial, or sigmoid kernels, there is no direct weight vector in the original feature space; the model lives in a high-dimensional transformed space. Use permutation_importance from sklearn for non-linear SVMs, as sketched below.
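A minimal sketch of that approach, re-using the breast-cancer X, y, cancer and the Pipeline/StandardScaler imports from the example above (the train/test split here is added purely for illustration):

from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
rbf_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(kernel='rbf', C=1.0, gamma='scale', random_state=42))
])
rbf_pipe.fit(X_train, y_train)

# Shuffle each feature on the test set and measure the resulting drop in accuracy
result = permutation_importance(rbf_pipe, X_test, y_test,
                                n_repeats=10, random_state=42)
top = result.importances_mean.argsort()[::-1][:5]
for i in top:
    print(f"{cancer.feature_names[i]:25s} {result.importances_mean[i]:.4f}")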
SVM vs Other Classifiers: Complete Comparison
| Property | SVM (RBF) | Random Forest | Logistic Reg. | Neural Net |
|---|---|---|---|---|
| Training speed | Slow, O(n²) to O(n³) | Fast (parallel) | Very fast | Very slow |
| Inference speed | O(n_sv × features) | Moderate | Fastest | Fast |
| Scales to large N | Poorly (>100k painful) | Well | Very well | Best |
| High-dimensional sparse | Excellent (linear) | Poor | Excellent | Moderate |
| Feature scaling needed | Mandatory | Never | Yes | Yes |
| Probability outputs | Via Platt scaling only | Native (vote fraction) | Native, well calibrated | Native (softmax) |
| Non-linear boundaries | Excellent (kernels) | Excellent | Cannot (linear only) | Best |
| Hyperparameter tuning | C + gamma, sensitive | Robust defaults | Simple (C only) | Many, very sensitive |
| Interpretability | Linear: moderate | Feature importance | High (coefficients) | Black box |
When to Use SVM
Strengths vs Weaknesses
| Strengths | Weaknesses |
|---|---|
| Theoretically optimal: maximum margin guarantee | Slow training: O(n²) to O(n³) in samples |
| Excellent on high-dimensional sparse data (text) | Feature scaling is mandatory and easy to forget |
| Works when features > samples (rare advantage) | No native probability outputs; Platt scaling adds cost |
| Robust to outliers far from the decision boundary | Sensitive to C and gamma; tuning is non-trivial |
| Memory efficient: only stores support vectors | Black box for non-linear kernels; not interpretable |
| Kernel trick handles complex non-linear boundaries | Painful on >100k samples; does not scale |
| Strong generalisation with correct hyperparameters | Multi-class requires one-vs-one or one-vs-rest tricks |
| Dual formulation enables efficient kernel computation | No online/incremental learning support |
For datasets too large for SVC, use SGDClassifier(loss='hinge'): this implements a linear SVM trained with Stochastic Gradient Descent. It scales to millions of samples, supports incremental/online learning via partial_fit, and produces nearly identical results to LinearSVC on large datasets. Set loss='modified_huber' if you also need probability estimates. A short sketch follows.
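A minimal sketch of that swap, re-using the 20-newsgroups train and test objects from the LinearSVC example above (the alpha value is only an illustrative choice):

from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

sgd_pipe = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('clf', SGDClassifier(
        loss='hinge',      # hinge loss = linear SVM objective
        alpha=1e-4,        # regularisation strength (plays the role of 1/C)
        max_iter=1000,
        random_state=42
    ))
])
sgd_pipe.fit(train.data, train.target)
print(f"SGD hinge-loss accuracy: {sgd_pipe.score(test.data, test.target):.4f}")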
Real-World Applications
Bioinformatics
- Cancer classification from gene expression
- Protein structure prediction
- Works when genes >> patients
- First major ML breakthrough in genomics

Computer vision
- Handwriting recognition (MNIST)
- Face detection (pre-deep learning era)
- Object classification with HOG features
- Still used for small training sets

Finance and security
- Fraud detection and anomaly scoring
- Credit risk classification
- Intrusion detection systems
- One-class SVM for outlier detection
Golden Rules
- Always scale: put StandardScaler inside a Pipeline to prevent data leakage. Forgetting this is the single most common SVM mistake in practice.
- Tune C and gamma together: use GridSearchCV with C ∈ {0.01, 0.1, 1, 10, 100} and gamma ∈ {'scale', 0.001, 0.01, 0.1}. Never tune one while holding the other fixed.
- Use gamma='scale', never 'auto'. The 'scale' default (γ = 1 / (n_features × var(X))) is almost always better than 'auto' (γ = 1/n_features) because it accounts for the actual variance of your data, not just its dimensionality.
- Pick the right linear implementation: LinearSVC uses liblinear and scales roughly linearly in the number of samples, far faster than SVC(kernel='linear'), which uses libsvm at O(n²) to O(n³). For more than ~10,000 samples with a linear kernel, LinearSVC is the correct choice.
- If you need probabilities, set probability=True at construction time. Enabling it after fitting is not possible; the model must be re-trained. Be aware that probability=True uses 5-fold cross-validated Platt scaling internally, making training roughly 5× slower. Use it only when calibrated probabilities are genuinely needed.
- For anomaly detection, use OneClassSVM, with StandardScaler in the pipeline as always. sklearn.svm.OneClassSVM learns the boundary of normal data without any negative class examples and predicts −1 for outliers, +1 for inliers. Set nu (the approximate fraction of outliers expected) rather than C. This is one of the best off-the-shelf anomaly detection algorithms available; a short sketch follows.
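A minimal sketch of that setup on synthetic data (the nu value and the generated "normal" and "outlier" points are purely illustrative):

import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

rng = np.random.RandomState(42)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))   # "normal" observations
X_outliers = rng.uniform(low=-6, high=6, size=(25, 2))     # injected anomalies

oc_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('ocsvm', OneClassSVM(kernel='rbf', gamma='scale', nu=0.05))  # nu ≈ expected outlier fraction
])
oc_pipe.fit(X_normal)   # fit on normal data only, no labels needed

# +1 = inlier, -1 = outlier
pred = oc_pipe.predict(np.vstack([X_normal[:5], X_outliers[:5]]))
print(pred)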