
Support Vector Machine

Learn how SVM finds the optimal decision boundary by maximising the margin between classes, handles non-linear data with the kernel trick, and delivers exceptional performance in high-dimensional spaces like text and image classification.

Section 01

The Story That Explains SVM

The Border Guard Who Wanted the Widest No-Man's-Land
Two rival kingdoms β€” Kingdom Red and Kingdom Blue β€” share a disputed territory. A new border guard is assigned to draw the official boundary line between them. Many lines could separate the kingdoms without touching any settlement. But the guard is wise.

He does not draw just any line. He says: "I want the line that keeps the nearest settlement on each side as far away from the boundary as possible β€” a wide, safe no-man's-land on both sides." He looks only at the settlements nearest to the boundary β€” the ones that define where the line must go. He calls them his support villages. Every other settlement is irrelevant to where the border is drawn.

The wider the no-man's-land, the more confident he is that future travellers will be assigned to the correct kingdom even if their location is slightly uncertain. This maximum-width no-man's-land is the margin. The boundary line is the hyperplane. The support villages are the support vectors.

That is Support Vector Machine β€” find the boundary that maximises the gap between the two classes.
πŸ”‘
The Core Idea in One Sentence

SVM finds the hyperplane that maximises the margin β€” the distance between the decision boundary and the nearest data points from each class. A larger margin means better generalisation to unseen data.


Section 02

The Hyperplane β€” What SVM Is Drawing

In 2D, a decision boundary is a line. In 3D it is a plane. In higher dimensions it is a hyperplane β€” the same concept, just harder to visualise. SVM works identically regardless of dimensionality.

Dimensions | Number of Features | Decision Boundary | Example
2D | 2 features | A straight line | Height vs weight β†’ classify sport
3D | 3 features | A flat plane | Height, weight, age β†’ classify disease
nD | n features | A hyperplane | 10,000 word features β†’ classify spam
Hyperplane Equation
w Β· x + b = 0
w = weight vector (perpendicular to hyperplane), x = input feature vector, b = bias (shifts the boundary)
Classification Rule
Ε· = sign(w Β· x + b)
If wΒ·x + b > 0 β†’ Class +1. If wΒ·x + b < 0 β†’ Class βˆ’1. The sign of the dot product determines the predicted class.
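As a tiny illustration of this rule, here is the sign computation in NumPy with made-up values for w, x and b (not a fitted model):

import numpy as np

w = np.array([2.0, -1.0])   # hypothetical weight vector
b = -0.5                    # hypothetical bias
x = np.array([1.5, 0.5])    # a new point to classify

score = np.dot(w, x) + b    # w . x + b
label = np.sign(score)      # +1 or -1
print(score, label)         # 2.0 1.0  ->  Class +1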
πŸ“ Visual Diagram β€” Hyperplane, Margin, and Support Vectors
● ● ●
Class +1 points β€” sit on one side of the hyperplane. The closest point to the boundary defines the upper margin line: w Β· x + b = +1
β€” β€” β€”
The Hyperplane (decision boundary): w Β· x + b = 0 β€” equidistant between the two margin lines. Any point crossing this line is assigned to the opposite class.
β—‹ β—‹ β—‹
Class βˆ’1 points β€” sit on the other side. The closest point to the boundary defines the lower margin line: w Β· x + b = βˆ’1
Margin
The gap between the two margin lines = 2 / β€–wβ€–. SVM maximises this distance by minimising β€–wβ€–. Only the support vectors (points on the margin lines) determine where the boundary goes.
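As a quick sanity check (a minimal sketch on toy data, separate from the pipelines later in this article), the margin width 2/β€–wβ€– can be read off any fitted linear SVM through its coef_ attribute:

import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Toy, clearly separable two-class data
X, y = make_blobs(n_samples=60, centers=2, cluster_std=0.6, random_state=0)

clf = SVC(kernel='linear', C=1e6)   # very large C approximates a hard margin
clf.fit(X, y)

w = clf.coef_[0]
print(f"||w|| = {np.linalg.norm(w):.3f}")
print(f"margin width = {2 / np.linalg.norm(w):.3f}")
print(f"support vectors: {clf.support_vectors_.shape[0]}")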

Section 03

Support Vectors β€” The Only Points That Matter

Remove 98% of the Data β€” The Model Does Not Change
Imagine you train an SVM on 10,000 data points to classify tumours as malignant or benign. After training, you could delete 9,960 of those points β€” the ones far from the boundary β€” and retrain from scratch on just the 40 remaining support vectors. You would get the exact same model, the exact same boundary, the exact same accuracy.

This is the defining property of SVM: the decision boundary is determined entirely by the handful of points closest to it. Everything else is noise as far as the boundary is concerned. This makes SVM remarkably robust to outliers that are far from the boundary, and also explains why SVM memory usage scales with support vectors β€” not with total data size.
πŸ“Œ
Inspecting Support Vectors in sklearn

After fitting an SVM, you can access the support vectors via clf.support_vectors_, their indices via clf.support_, and the count per class via clf.n_support_. A model with very many support vectors is likely underfitting (or the data is noisy). Very few support vectors usually means a clean, well-separated dataset.


Section 04

The SVM Objective β€” What Is Being Optimised

SVM is fundamentally an optimisation problem. It finds the weight vector w and bias b that define the widest possible margin while correctly classifying all training points.

Primal Objective (Minimise)
Β½ β€–wβ€–Β²
Minimising β€–wβ€– is equivalent to maximising the margin 2/β€–wβ€–. We minimise Β½β€–wβ€–Β² for mathematical convenience (makes the gradient linear).
Subject To (Hard Margin)
yα΅’ (w Β· xα΅’ + b) β‰₯ 1   βˆ€i
Every training point must be on the correct side of its margin line. The label yα΅’ ∈ {+1, βˆ’1} flips the constraint for each class automatically.
⚠️
Hard Margin Fails on Real Data

The hard margin formulation requires all points to be perfectly classified with no violations β€” which fails whenever the data is noisy, overlapping, or not linearly separable. One misplaced point makes the optimisation infeasible. Real-world data is almost never perfectly separable. Enter the soft margin.


Section 05

Hard Margin vs Soft Margin β€” The C Parameter

The soft margin SVM introduces slack variables ΞΎα΅’ that allow some points to violate the margin β€” at a cost controlled by the hyperparameter C.

Soft Margin Objective
min Β½β€–wβ€–Β² + C Ξ£ ΞΎα΅’
Minimise the margin width penalty plus the total violation cost. C controls the balance between these two competing goals.
Soft Margin Constraint
yα΅’(w Β· xα΅’ + b) β‰₯ 1 βˆ’ ΞΎα΅’   ΞΎα΅’ β‰₯ 0
Each point is allowed to violate its margin by slack ΞΎα΅’. Points inside the margin have ΞΎα΅’ > 0. Points on the wrong side have ΞΎα΅’ > 1.
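sklearn does not store the slack values, but at the solution they equal ΞΎα΅’ = max(0, 1 βˆ’ yα΅’(w Β· xα΅’ + b)), so they can be recovered from the decision function. A minimal sketch (labels mapped to Β±1 first):

import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, flip_y=0.05, random_state=0)
y_pm = np.where(y == 1, 1, -1)      # map {0, 1} labels to {-1, +1}

clf = SVC(kernel='linear', C=1.0).fit(X, y_pm)

f = clf.decision_function(X)        # w . x + b for every training point
xi = np.maximum(0, 1 - y_pm * f)    # slack: zero for points beyond their margin line

print("margin violations (xi > 0):", int(np.sum(xi > 0)))
print("misclassified     (xi > 1):", int(np.sum(xi > 1)))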
C Value | Margin Width | Violations Allowed | Bias–Variance Risk | Typical Behaviour
Very Small (0.001) | Wide margin | Many β€” very tolerant | High bias (underfitting) | Too simple, misclassifies many
Small (0.1) | Wide | Some violations OK | Leans toward underfitting | Smoother, more generalised
Medium (1.0) | Balanced | Moderate | Balanced βœ… | Good default starting point
Large (100) | Narrow margin | Rarely allowed | High variance (overfitting) | Memorises training noise
∞ (Hard Margin) | Thinnest possible | Zero tolerance | Extreme variance | Fails on non-separable data
🎯
How to Think About C

Think of C as how much you care about misclassifying training points. High C = "I hate mistakes β€” fit the training data as tightly as possible." Low C = "Some mistakes are fine β€” give me a wide, general boundary." Always tune C with cross-validation. Start at 1.0 and search on a log scale: 0.001 β†’ 0.01 β†’ 0.1 β†’ 1 β†’ 10 β†’ 100.
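To see the trade-off directly, a short sketch that sweeps C on a log scale with 5-fold cross-validation on the iris data (illustrative only; the full grid search comes later):

from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

for C in [0.001, 0.01, 0.1, 1, 10, 100]:
    pipe = Pipeline([('scaler', StandardScaler()),
                     ('svc', SVC(kernel='rbf', C=C, gamma='scale'))])
    scores = cross_val_score(pipe, X, y, cv=5)
    print(f"C = {C:>7} | CV accuracy = {scores.mean():.3f} +/- {scores.std():.3f}")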


Section 06

The Kernel Trick β€” Handling Non-Linear Data

The Crumpled Piece of Paper
Imagine two groups of points drawn on a piece of paper β€” Red points in a circle in the centre, Blue points surrounding them. No straight line can separate them in 2D. Now pick up the paper and crumple it into a 3D ball. Suddenly the Red points are in the middle of the ball and the Blue points form the outer shell. A flat plane can now cleanly cut through and separate them perfectly.

The kernel trick does exactly this β€” it projects data into a higher-dimensional space where it becomes linearly separable, then finds a hyperplane there. The magic is that you never actually compute the high-dimensional coordinates explicitly β€” you only compute dot products between points using a kernel function K(xα΅’, xβ±Ό). Fast, exact, and elegant.
01
Original Feature Space (non-linearly separable)
Data in 2D β€” a ring of Blue surrounding a cluster of Red. No straight line can separate them. A linear SVM would fail completely.
02
Apply Kernel Function β€” Implicit Mapping Ο†(x)
The kernel K(xα΅’, xβ±Ό) = Ο†(xα΅’) Β· Ο†(xβ±Ό) computes dot products in the transformed space without ever computing Ο†(x) explicitly. This is the kernel trick β€” cheap dot products, expensive mapping never required.
03
Higher-Dimensional Space (linearly separable)
In the transformed space the data separates cleanly. SVM finds the maximum-margin hyperplane there using the same algorithm as the linear case.
04
Project Boundary Back to Original Space
The linear hyperplane in high-D space appears as a curved decision boundary in the original space β€” circles, ellipses, or complex shapes depending on the kernel used.
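To make step 02 concrete, here is a small numerical check (illustrative only) that the degree-2 polynomial kernel K(x, z) = (x Β· z)Β² returns exactly the dot product of an explicit feature map Ο†(x) = [x₁², √2Β·x₁xβ‚‚, xβ‚‚Β²], without the map ever being built:

import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D input (never needed in practice)."""
    return np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

def poly2_kernel(a, b):
    """Degree-2 polynomial kernel, computed entirely in the original 2-D space."""
    return np.dot(a, b) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

print(np.dot(phi(x), phi(z)))   # dot product in the transformed 3-D space: 16.0
print(poly2_kernel(x, z))       # same value, computed without phi: 16.0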

Section 07

The Four Kernels β€” Choosing the Right One

πŸ“
Linear Kernel
K(xα΅’, xβ±Ό) = xα΅’ Β· xβ±Ό
No transformation β€” the standard dot product in original space. The boundary is a straight line/plane. Fastest to train. Works brilliantly for high-dimensional linearly separable data like TF-IDF text vectors where the feature space is already huge.
βœ… Fast, interpretable weight vector, best for text/NLP
❌ Fails when data is non-linearly separable
πŸ”΅
RBF Kernel
K(xα΅’, xβ±Ό) = exp(βˆ’Ξ³β€–xα΅’ βˆ’ xβ±Όβ€–Β²)
Radial Basis Function β€” the most popular kernel. Measures similarity as a Gaussian bell curve of distance. Points close together β†’ high similarity β†’ pulled to same class. Creates smooth, circular/elliptical decision boundaries. The default kernel in sklearn and a strong first choice for most classification tasks.
βœ… Works on most non-linear problems, smooth boundaries
❌ Two hyperparameters to tune (C and Ξ³)
🌊
Polynomial Kernel
K(xα΅’, xβ±Ό) = (Ξ³ xα΅’ Β· xβ±Ό + r)ᡈ
Raises the dot product to degree d. Degree 2 creates quadratic boundaries (parabolas, ellipses). Degree 3 creates cubic shapes. Used in image classification and natural language processing where feature interactions up to degree d are meaningful.
βœ… Captures feature interactions explicitly
❌ Many hyperparameters (C, d, Ξ³, r), slow on large data
Kernel | Key Parameter | Effect of High Value | Effect of Low Value | Best For
Linear | C only | Tighter fit to training data | Wider margin, more regularised | Text, sparse high-D data
RBF | C and Ξ³ (gamma) | High Ξ³ β†’ wiggly boundary, overfits | Low Ξ³ β†’ smooth boundary, underfits | General-purpose, most datasets
Polynomial | C, degree d, Ξ³, r | High d β†’ very complex boundary | Low d (=1) β†’ reduces to linear | Image processing, NLP interactions
Sigmoid | C, Ξ³, r | Erratic behaviour | May not converge | Rarely used β€” prefer RBF or Linear
⚠️
The gamma Parameter β€” Most Dangerous Hyperparameter

In the RBF kernel, Ξ³ (gamma) controls the "reach" of each training point's influence. High Ξ³ β†’ each point only influences its immediate neighbours β†’ very jagged, localised boundary β†’ extreme overfitting. Low Ξ³ β†’ each point influences the entire space β†’ overly smooth boundary β†’ underfitting. In sklearn, gamma='scale' (default, sets Ξ³ = 1 / (n_features Γ— var(X))) is almost always a better starting point than gamma='auto'.
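For reference, a small sketch of what gamma='scale' resolves to on the iris features, and how Ξ³ changes the RBF similarity between two fixed points (numbers are illustrative):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics.pairwise import rbf_kernel

X, _ = load_iris(return_X_y=True)

# gamma='scale' resolves to 1 / (n_features * X.var())
gamma_scale = 1 / (X.shape[1] * X.var())
print(f"gamma='scale' -> {gamma_scale:.4f}")

a, b = X[0:1], X[50:51]          # one setosa sample, one versicolor sample
for gamma in [0.01, gamma_scale, 1.0, 10.0]:
    sim = rbf_kernel(a, b, gamma=gamma)[0, 0]
    print(f"gamma = {gamma:8.4f} | K(a, b) = {sim:.6f}")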


Section 08

Python Implementation

Basic SVC β€” Iris Dataset with All Kernels

from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# ⚠️ CRITICAL: Always scale features before SVM
# SVM is distance-based β€” unscaled features dominate unfairly
kernels = ['linear', 'rbf', 'poly', 'sigmoid']

for kernel in kernels:
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('svc', SVC(
            kernel=kernel,
            C=1.0,
            gamma='scale',   # 1/(n_features Γ— var(X))
            random_state=42
        ))
    ])
    pipe.fit(X_train, y_train)
    acc = pipe.score(X_test, y_test)
    nsv = pipe['svc'].n_support_.sum()  # total support vectors
    print(f"Kernel: {kernel:8s} | Accuracy: {acc:.4f} | Support Vectors: {nsv}")
OUTPUT
Kernel: linear   | Accuracy: 0.9667 | Support Vectors: 12
Kernel: rbf      | Accuracy: 0.9667 | Support Vectors: 18
Kernel: poly     | Accuracy: 0.9667 | Support Vectors: 20
Kernel: sigmoid  | Accuracy: 0.8333 | Support Vectors: 51

Inspecting the Fitted Model

# Fit a single model and inspect internals
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svc',    SVC(kernel='rbf', C=1.0, gamma='scale',
                    probability=True, random_state=42))
])
pipe.fit(X_train, y_train)

svc = pipe['svc']

print("Support vectors per class:", svc.n_support_)
print("Total support vectors:    ", svc.n_support_.sum())
print("Support vector shape:     ", svc.support_vectors_.shape)

# Predict probabilities (requires probability=True at fit time)
y_prob = pipe.predict_proba(X_test)
y_pred = pipe.predict(X_test)

print("\nSample predictions:")
for i in range(4):
    cls = iris.target_names[y_pred[i]]
    conf = y_prob[i].max() * 100
    print(f"  Predicted: {cls:12s} ({conf:.1f}% confidence)")
OUTPUT
Support vectors per class: [ 4  7  7]
Total support vectors:     18
Support vector shape:      (18, 4)

Sample predictions:
  Predicted: setosa       (99.8% confidence)
  Predicted: versicolor   (71.4% confidence)
  Predicted: virginica    (88.2% confidence)
  Predicted: setosa       (99.7% confidence)

Hyperparameter Tuning β€” GridSearchCV

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svc',    SVC(random_state=42))
])

# Search over C and gamma on a log scale
param_grid = {
    'svc__kernel': ['rbf', 'linear'],
    'svc__C':      [0.01, 0.1, 1, 10, 100],
    'svc__gamma':  ['scale', 'auto', 0.001, 0.01, 0.1],
}

grid = GridSearchCV(
    pipe, param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=0
)
grid.fit(X_train, y_train)

print("Best parameters: ", grid.best_params_)
print(f"Best CV accuracy: {grid.best_score_:.4f}")
print(f"Test accuracy:    {grid.score(X_test, y_test):.4f}")

# Show top 5 combinations
import pandas as pd
results = pd.DataFrame(grid.cv_results_)
top5 = results.nsmallest(5, 'rank_test_score')[
    ['param_svc__kernel', 'param_svc__C',
     'param_svc__gamma', 'mean_test_score']
]
print(top5.to_string(index=False))
OUTPUT
Best parameters:  {'svc__C': 10, 'svc__gamma': 'scale', 'svc__kernel': 'rbf'}
Best CV accuracy: 0.9750
Test accuracy:    0.9667

param_svc__kernel  param_svc__C param_svc__gamma  mean_test_score
              rbf            10            scale           0.9750
              rbf             1            scale           0.9667
           linear            10             auto           0.9583
           linear             1            scale           0.9583
              rbf           100            scale           0.9583

LinearSVC β€” Faster for Large Datasets and Text

from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_20newsgroups
from sklearn.metrics import classification_report

# LinearSVC is much faster than SVC(kernel='linear') for large datasets
# Uses liblinear instead of libsvm β€” scales to millions of samples

categories = ['sci.space', 'comp.graphics',
              'rec.sport.hockey', 'talk.politics.guns']

train = fetch_20newsgroups(subset='train', categories=categories)
test  = fetch_20newsgroups(subset='test',  categories=categories)

text_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(
        sublinear_tf=True,
        max_features=50000,
        ngram_range=(1, 2),
        stop_words='english'
    )),
    ('clf', LinearSVC(
        C=1.0,
        max_iter=2000,
        random_state=42
    ))
])

text_pipeline.fit(train.data, train.target)
y_pred = text_pipeline.predict(test.data)

print(f"Test Accuracy: {text_pipeline.score(test.data, test.target):.4f}")
print(classification_report(
    test.target, y_pred,
    target_names=train.target_names
))
OUTPUT
Test Accuracy: 0.9741

                     precision    recall  f1-score   support

          sci.space       0.98      0.98      0.98       394
      comp.graphics       0.95      0.96      0.95       389
   rec.sport.hockey       0.99      0.99      0.99       399
 talk.politics.guns       0.97      0.97      0.97       364

           accuracy                           0.97      1546
⚑
SVC vs LinearSVC β€” When to Use Each

Use LinearSVC when: dataset has more than ~10,000 samples, features are already high-dimensional and sparse (text), or you need the fastest possible training. Use SVC(kernel='linear') when: you need predict_proba() (via probability=True), or your dataset is small enough that the overhead doesn't matter. For non-linear boundaries, use SVC with a non-linear kernel β€” usually 'rbf'.
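If you need probabilities from LinearSVC, one common workaround (a hedged sketch, separate from the pipelines above) is to wrap it in CalibratedClassifierCV, which fits Platt-style calibration on top of the margin scores:

from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', CalibratedClassifierCV(LinearSVC(max_iter=5000), cv=5))
])
pipe.fit(X_train, y_train)

print(pipe.predict_proba(X_test)[:3].round(3))   # calibrated class probabilities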


Section 09

SVM for Regression β€” SVR

SVM can also do regression. Instead of finding a boundary that separates classes, SVR (Support Vector Regression) finds a tube of width 2Ξ΅ that contains as many training points as possible while staying as flat as possible. Points inside the tube contribute zero loss. Points outside the tube are penalised.

SVR Objective
min Β½β€–wβ€–Β² + C Ξ£ (ΞΎα΅’ + ΞΎα΅’*)
Minimise the model complexity plus the total error outside the Ξ΅-tube. ΞΎα΅’ and ΞΎα΅’* are slack variables for points above and below the tube.
The Ξ΅-Insensitive Loss
L(y, Ε·) = max(0, |y βˆ’ Ε·| βˆ’ Ξ΅)
Zero loss for predictions within Ξ΅ of the true value. Linear penalty only outside the tube. Ξ΅ controls how wide the "no-penalty zone" is.
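The Ξ΅-insensitive loss is simple enough to write out directly; a tiny NumPy sketch with made-up numbers:

import numpy as np

def eps_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Zero loss inside the epsilon tube, linear penalty outside it."""
    return np.maximum(0, np.abs(y_true - y_pred) - epsilon)

y_true = np.array([2.00, 2.00, 2.00])
y_pred = np.array([2.05, 2.30, 1.60])

print(eps_insensitive_loss(y_true, y_pred))   # [0.  0.2 0.3]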
πŸ“ SVR β€” The Epsilon Tube Visualised
Tube Top
f(x) + Ξ΅ β€” the upper boundary of the no-penalty zone
Centre
f(x) = w Β· x + b β€” the regression prediction line (flat as possible)
Tube Bottom
f(x) βˆ’ Ξ΅ β€” the lower boundary. Points inside β†’ zero penalty.
Outside
Points beyond Β±Ξ΅ β€” penalised proportionally to their distance outside the tube. These become support vectors.
from sklearn.svm import SVR
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score, mean_absolute_error

housing = fetch_california_housing()
X, y = housing.data, housing.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# SVR β€” always needs StandardScaler
svr_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svr', SVR(
        kernel='rbf',
        C=100,           # penalty for points outside tube
        epsilon=0.1,     # tube half-width β€” zero loss inside
        gamma='scale'
    ))
])

svr_pipe.fit(X_train, y_train)
y_pred = svr_pipe.predict(X_test)

print(f"RΒ²  Score: {r2_score(y_test, y_pred):.4f}")
print(f"MAE:       ${mean_absolute_error(y_test, y_pred)*100_000:,.0f}")
print(f"Support Vectors: {svr_pipe['svr'].support_.shape[0]:,}")
OUTPUT
RΒ²  Score: 0.7254
MAE:       $41,823
Support Vectors: 10,628
πŸ“Œ
SVR Hyperparameters β€” What Controls What

Ξ΅ (epsilon): half-width of the no-penalty tube. Larger Ξ΅ β†’ fewer support vectors, smoother fit, less sensitive to noise. Smaller Ξ΅ β†’ more support vectors, tighter fit. C: penalty for points outside the tube. High C β†’ forces points inside β†’ tight fit β†’ overfitting risk. kernel + gamma: same as classification SVC.


Section 10

Feature Importance with Linear SVM

Linear SVMs have a weight vector w that directly indicates how much each feature contributes to the decision boundary. This is equivalent to feature importance β€” the magnitude of each coefficient tells you how strongly that feature pushes a prediction toward one class.

from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_breast_cancer
import numpy as np
import pandas as pd

cancer = load_breast_cancer()
X = cancer.data
# sklearn encodes 0 = malignant, 1 = benign; flip labels so that class 1 = malignant
y = (cancer.target == 0).astype(int)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svm',    LinearSVC(C=1.0, max_iter=5000, random_state=42))
])
pipe.fit(X, y)

# Coefficients β€” positive = pushes toward class 1 (malignant)
coef = pipe['svm'].coef_[0]

feat_df = pd.DataFrame({
    'feature':    cancer.feature_names,
    'coefficient': coef
}).sort_values('coefficient', key=lambda x: x.abs(), ascending=False)

print("Top features pushing toward MALIGNANT (positive coef):")
print(feat_df[feat_df['coefficient'] > 0].head(5).to_string(index=False))

print("\nTop features pushing toward BENIGN (negative coef):")
print(feat_df[feat_df['coefficient'] < 0].head(5).to_string(index=False))
OUTPUT
Top features pushing toward MALIGNANT (positive coef):
             feature  coefficient
worst concave points        1.832
        worst radius        1.541
 mean concave points        1.422
     worst perimeter        1.198

Top features pushing toward BENIGN (negative coef):
        feature  coefficient
  worst texture       -0.872
   mean texture       -0.641
mean smoothness       -0.538
⚠️
Coefficients Only Work for Linear Kernel

coef_ is only available for LinearSVC and SVC(kernel='linear'). For RBF, polynomial, or sigmoid kernels, there is no direct weight vector in the original feature space β€” the model lives in a high-dimensional transformed space. Use permutation_importance from sklearn for non-linear SVMs.
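A hedged sketch of that permutation_importance route for an RBF-kernel SVM (the dataset and settings are illustrative):

from sklearn.svm import SVC
from sklearn.inspection import permutation_importance
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=42, stratify=cancer.target
)

pipe = Pipeline([('scaler', StandardScaler()),
                 ('svc', SVC(kernel='rbf', C=1.0, gamma='scale'))])
pipe.fit(X_train, y_train)

# Shuffle each feature on the held-out set and measure the drop in accuracy
result = permutation_importance(pipe, X_test, y_test, n_repeats=10, random_state=42)

top = result.importances_mean.argsort()[::-1][:5]
for i in top:
    print(f"{cancer.feature_names[i]:25s} {result.importances_mean[i]:.4f}")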


Section 11

SVM vs Other Classifiers β€” Complete Comparison

Property | SVM (RBF) | Random Forest | Logistic Reg. | Neural Net
Training speed | Slow O(nΒ²β€“nΒ³) | Fast (parallel) | Very fast | Very slow
Inference speed | O(n_sv Γ— features) | Moderate | Fastest | Fast
Scales to large N | Poorly (>100k painful) | Well | Very well | Best
High-dimensional sparse | Excellent (linear) | Poor | Excellent | Moderate
Feature scaling needed | Mandatory | Never | Yes | Yes
Probability outputs | Via Platt scaling only | Native (vote fraction) | Native, well calibrated | Native (softmax)
Non-linear boundaries | Excellent (kernels) | Excellent | Cannot β€” linear only | Best
Hyperparameter tuning | C + gamma β€” sensitive | Robust defaults | Simple (C only) | Many β€” very sensitive
Interpretability | Linear: moderate | Feature importance | High β€” coefficients | Black box

Section 12

When to Use SVM

βœ…
Text Classification
LinearSVC with TF-IDF is one of the strongest text classifiers available. The linear kernel thrives in the sparse, ultra-high-dimensional space that TF-IDF creates β€” often beating far more complex models.
spam Β· news topics Β· sentiment
βœ…
Small to Medium Datasets
SVM shines when you have under 100,000 samples. Its slow O(nΒ²β€“nΒ³) training complexity becomes a problem at scale β€” but on smaller datasets, its strong theoretical guarantees pay off well.
under 50k samples
βœ…
Clear Margin of Separation
When the two classes are cleanly separable with a wide gap, SVM will find the optimal boundary with minimal data. It is the theoretically optimal classifier for linearly separable data.
clean binary problems
βœ…
High-Dimensional Features
SVM handles datasets where the number of features exceeds the number of samples β€” a scenario where many algorithms fail. Gene expression data (few patients, thousands of genes) is a classic SVM use case.
genomics Β· bioinformatics Β· NLP
❌
Very Large Datasets
Training SVC on 1 million samples is impractical β€” it could take hours or days. Use LinearSVC for large linear problems, or switch to SGDClassifier with a hinge loss for near-identical results at massive scale.
use SGDClassifier instead
❌
Heavy Overlapping Classes
When classes heavily overlap with no clear margin, SVM requires very aggressive tuning of C and struggles to find stable support vectors. Gradient Boosting or neural networks will typically perform better.
use XGBoost/LightGBM instead

Section 13

Strengths vs Weaknesses

βœ… Strengths
Theoretically optimal β€” maximum margin guarantee
Excellent on high-dimensional sparse data (text)
Works when features > samples (rare advantage)
Robust to outliers far from the decision boundary
Memory efficient β€” only stores support vectors
Kernel trick handles complex non-linear boundaries
Strong generalisation with correct hyperparameters
Dual formulation enables efficient kernel computation
❌ Weaknesses
Slow training β€” O(nΒ²) to O(nΒ³) in samples
Feature scaling is mandatory β€” easy to forget
No native probability outputs β€” Platt scaling adds cost
Sensitive to C and gamma β€” tuning is non-trivial
Black box for non-linear kernels β€” not interpretable
Painful on >100k samples β€” does not scale
Multi-class requires one-vs-one or one-vs-rest tricks
No online/incremental learning support
πŸ”„
SGDClassifier β€” SVM at Scale

For datasets too large for SVC, use SGDClassifier(loss='hinge') β€” this implements a linear SVM trained with Stochastic Gradient Descent. It scales to millions of samples, supports incremental/online learning, and produces nearly identical results to LinearSVC on large datasets. Set loss='modified_huber' if you also need probability estimates.
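A minimal sketch of that swap (dataset chosen purely for illustration; note that SGDClassifier regularises via alpha rather than C):

from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

sgd_svm = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', SGDClassifier(
        loss='hinge',      # hinge loss -> linear SVM objective
        alpha=1e-4,        # regularisation strength (plays the inverse role of C)
        max_iter=1000,
        random_state=42
    ))
])

scores = cross_val_score(sgd_svm, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")

# For streaming data, call partial_fit(X_batch, y_batch, classes=...) instead of fit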


Section 14

Real-World Applications

Bioinformatics
🧬
  • Cancer classification from gene expression
  • Protein structure prediction
  • Works when genes >> patients
  • First major ML breakthrough in genomics
Image Recognition
πŸ–ΌοΈ
  • Handwriting recognition (MNIST)
  • Face detection (pre-deep learning era)
  • Object classification with HOG features
  • Still used for small training sets
Finance and Security
πŸ›‘οΈ
  • Fraud detection and anomaly scoring
  • Credit risk classification
  • Intrusion detection systems
  • One-class SVM for outlier detection

Section 15

Golden Rules

βš”οΈ Support Vector Machine β€” Non-Negotiable Rules
1
Always scale features before SVM β€” without exception. SVM computes distances and dot products. An unscaled feature with range [0, 10,000] will dominate a feature with range [0, 1] completely. Use StandardScaler inside a Pipeline to prevent data leakage. Forgetting this is the single most common SVM mistake in practice.
2
Tune C and gamma together on a log scale. These two hyperparameters interact strongly β€” a good C at one gamma may be terrible at another. Use GridSearchCV with C ∈ {0.01, 0.1, 1, 10, 100} and gamma ∈ {'scale', 0.001, 0.01, 0.1}. Never tune one while holding the other fixed.
3
Start with gamma='scale' β€” never 'auto'. The 'scale' default (Ξ³ = 1 / (n_features Γ— var(X))) is almost always better than 'auto' (Ξ³ = 1/n_features) because it accounts for the actual variance of your data, not just its dimensionality.
4
Use LinearSVC for text and large datasets, SVC for everything else. LinearSVC uses liblinear and scales O(n) β€” far faster than SVC(kernel='linear') which uses libsvm O(n²–nΒ³). For more than ~10,000 samples with a linear kernel, LinearSVC is the correct choice.
5
For probabilities, set probability=True at construction time. Enabling it after fitting is not possible β€” the model must be re-trained. Be aware that probability=True uses 5-fold cross-validated Platt scaling internally, making training roughly 5Γ— slower. Use it only when calibrated probabilities are genuinely needed.
6
If training takes too long, switch to SGDClassifier(loss='hinge'). This implements a linear SVM with stochastic gradient descent β€” same decision boundary, same theoretical guarantees, but O(n) training instead of O(nΒ²). Add StandardScaler in the pipeline as always.
7
Monitor the number of support vectors. If nearly every training point becomes a support vector, your SVM is underfitting β€” increase C or try a more flexible kernel. If the fraction of points that become support vectors stays high as you add more data, you likely have a noisy dataset and should investigate data quality before further tuning.
8
Use One-Class SVM for anomaly detection. sklearn.svm.OneClassSVM learns the boundary of normal data without any negative class examples β€” predicts βˆ’1 for outliers, +1 for inliers. Set nu (approximate fraction of outliers expected) rather than C. This is one of the best off-the-shelf anomaly detection algorithms available.
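A minimal One-Class SVM sketch on synthetic data (every number here is illustrative):

import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

rng = np.random.RandomState(42)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))   # "normal" observations
X_outliers = rng.uniform(low=-6, high=6, size=(25, 2))     # injected anomalies

oc_svm = Pipeline([
    ('scaler', StandardScaler()),
    ('ocsvm', OneClassSVM(kernel='rbf', gamma='scale', nu=0.05))  # expect ~5% outliers
])
oc_svm.fit(X_normal)                       # train on normal data only

pred_normal = oc_svm.predict(X_normal)     # +1 = inlier, -1 = outlier
pred_outliers = oc_svm.predict(X_outliers)

print("flagged in normal set:  ", int(np.sum(pred_normal == -1)))
print("flagged in outlier set: ", int(np.sum(pred_outliers == -1)))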