Support Vector Machine

Section 01

The Story That Explains SVM

📖 Real World Analogy

The Border Guard Who Wanted the Widest No-Man's-Land

Two rival kingdoms — Kingdom Red and Kingdom Blue — share a disputed territory. A new border guard is assigned to draw the official boundary line between them. Many lines could separate the kingdoms without touching any settlement. But the guard is wise.

He does not draw just any line. He says: "I want the line that keeps the nearest settlement on each side as far away from the boundary as possible — a wide, safe no-man's-land on both sides." He looks only at the settlements nearest to the boundary — the ones that define where the line must go. He calls them his support villages. Every other settlement is irrelevant to where the border is drawn.

The wider the no-man's-land, the more confident he is that future travellers will be assigned to the correct kingdom even if their location is slightly uncertain. This maximum-width no-man's-land is the margin. The boundary line is the hyperplane. The support villages are the support vectors.

That is Support Vector Machine — find the boundary that maximises the gap between the two classes.

🔑

The Core Idea in One Sentence

SVM finds the hyperplane that maximises the margin — the distance between the decision boundary and the nearest data points from each class. A larger margin means better generalisation to unseen data.

Section 02

The Hyperplane — What SVM Is Drawing

In 2D, a decision boundary is a line. In 3D it is a plane. In higher dimensions it is a hyperplane — the same concept, just harder to visualise. SVM works identically regardless of dimensionality.

Dimensions	Number of Features	Decision Boundary	Example
2D	2 features	A straight line	Height vs weight → classify sport
3D	3 features	A flat plane	Height, weight, age → classify disease
nD	n features	A hyperplane	10,000 word features → classify spam

Hyperplane Equation

w · x + b = 0

w = weight vector (perpendicular to hyperplane), x = input feature vector, b = bias (shifts the boundary)

Classification Rule

ŷ = sign(w · x + b)

If w·x + b > 0 → Class +1. If w·x + b < 0 → Class −1. The sign of the dot product determines the predicted class.

📐 Visual Diagram — Hyperplane, Margin, and Support Vectors

● ● ●

Class +1 points — sit on one side of the hyperplane. The closest point to the boundary defines the upper margin line: w · x + b = +1

— — —

The Hyperplane (decision boundary): w · x + b = 0 — equidistant between the two margin lines. Any point crossing this line is assigned to the opposite class.

○ ○ ○

Class −1 points — sit on the other side. The closest point to the boundary defines the lower margin line: w · x + b = −1

Margin

The gap between the two margin lines = 2 / ‖w‖. SVM maximises this distance by minimising ‖w‖. Only the support vectors (points on the margin lines) determine where the boundary goes.

Section 03

Support Vectors — The Only Points That Matter

📖 Key Insight

Remove 98% of the Data — The Model Does Not Change

Imagine you train an SVM on 10,000 data points to classify tumours as malignant or benign. After training, you could delete 9,960 of those points — the ones far from the boundary — and retrain from scratch on just the 40 remaining support vectors. You would get the exact same model, the exact same boundary, the exact same accuracy.

This is the defining property of SVM: the decision boundary is determined entirely by the handful of points closest to it. Everything else is noise as far as the boundary is concerned. This makes SVM remarkably robust to outliers that are far from the boundary, and also explains why SVM memory usage scales with support vectors — not with total data size.

📌

Inspecting Support Vectors in sklearn

After fitting an SVM, you can access the support vectors via clf.support_vectors_, their indices via clf.support_, and the count per class via clf.n_support_. A model with very many support vectors is likely underfitting (or the data is noisy). Very few support vectors usually means a clean, well-separated dataset.

Section 04

The SVM Objective — What Is Being Optimised

SVM is fundamentally an optimisation problem. It finds the weight vector w and bias b that define the widest possible margin while correctly classifying all training points.

Primal Objective (Minimise)

½ ‖w‖²

Minimising ‖w‖ is equivalent to maximising the margin 2/‖w‖. We minimise ½‖w‖² for mathematical convenience (makes the gradient linear).

Subject To (Hard Margin)

yᵢ (w · xᵢ + b) ≥ 1 ∀i

Every training point must be on the correct side of its margin line. The label yᵢ ∈ {+1, −1} flips the constraint for each class automatically.

⚠️

Hard Margin Fails on Real Data

The hard margin formulation requires all points to be perfectly classified with no violations — which fails whenever the data is noisy, overlapping, or not linearly separable. One misplaced point makes the optimisation infeasible. Real-world data is almost never perfectly separable. Enter the soft margin.

Section 05

Hard Margin vs Soft Margin — The C Parameter

The soft margin SVM introduces slack variables ξᵢ that allow some points to violate the margin — at a cost controlled by the hyperparameter C.

Soft Margin Objective

min ½‖w‖² + C Σ ξᵢ

Minimise the margin width penalty plus the total violation cost. C controls the balance between these two competing goals.

Soft Margin Constraint

yᵢ(w · xᵢ + b) ≥ 1 − ξᵢ ξᵢ ≥ 0

Each point is allowed to violate its margin by slack ξᵢ. Points inside the margin have ξᵢ > 0. Points on the wrong side have ξᵢ > 1.

C Value	Margin Width	Violations Allowed	Bias–Variance	Risk
Very Small (0.001)	Wide margin	Many — very tolerant	High bias (underfitting)	Too simple, misclassifies many
Small (0.1)	Wide	Some violations OK	Leans toward underfitting	Smoother, more generalised
Medium (1.0)	Balanced	Moderate	Balanced ✅	Good default starting point
Large (100)	Narrow margin	Rarely allowed	High variance (overfitting)	Memorises training noise
∞ (Hard Margin)	Thinnest possible	Zero tolerance	Extreme variance	Fails on non-separable data

🎯

How to Think About C

Think of C as how much you care about misclassifying training points. High C = "I hate mistakes — fit the training data as tightly as possible." Low C = "Some mistakes are fine — give me a wide, general boundary." Always tune C with cross-validation. Start at 1.0 and search on a log scale: 0.001 → 0.01 → 0.1 → 1 → 10 → 100.

Section 06

The Kernel Trick — Handling Non-Linear Data

📖 Story

The Crumpled Piece of Paper

Imagine two groups of points drawn on a piece of paper — Red points in a circle in the centre, Blue points surrounding them. No straight line can separate them in 2D. Now pick up the paper and crumple it into a 3D ball. Suddenly the Red points are in the middle of the ball and the Blue points form the outer shell. A flat plane can now cleanly cut through and separate them perfectly.

The kernel trick does exactly this — it projects data into a higher-dimensional space where it becomes linearly separable, then finds a hyperplane there. The magic is that you never actually compute the high-dimensional coordinates explicitly — you only compute dot products between points using a kernel function K(xᵢ, xⱼ). Fast, exact, and elegant.

Original Feature Space (non-linearly separable)

Data in 2D — a ring of Blue surrounding a cluster of Red. No straight line can separate them. A linear SVM would fail completely.

Apply Kernel Function — Implicit Mapping φ(x)

The kernel K(xᵢ, xⱼ) = φ(xᵢ) · φ(xⱼ) computes dot products in the transformed space without ever computing φ(x) explicitly. This is the kernel trick — cheap dot products, expensive mapping never required.

Higher-Dimensional Space (linearly separable)

In the transformed space the data separates cleanly. SVM finds the maximum-margin hyperplane there using the same algorithm as the linear case.

Project Boundary Back to Original Space

The linear hyperplane in high-D space appears as a curved decision boundary in the original space — circles, ellipses, or complex shapes depending on the kernel used.

Section 07

The Four Kernels — Choosing the Right One

📏

Linear Kernel

K(xᵢ, xⱼ) = xᵢ · xⱼ

No transformation — the standard dot product in original space. The boundary is a straight line/plane. Fastest to train. Works brilliantly for high-dimensional linearly separable data like TF-IDF text vectors where the feature space is already huge.

✅ Fast, interpretable weight vector, best for text/NLP

❌ Fails when data is non-linearly separable

🔵

RBF Kernel

K(xᵢ, xⱼ) = exp(−γ‖xᵢ − xⱼ‖²)

Radial Basis Function — the most popular kernel. Measures similarity as a Gaussian bell curve of distance. Points close together → high similarity → pulled to same class. Creates smooth, circular/elliptical decision boundaries. The default kernel in sklearn and a strong first choice for most classification tasks.

✅ Works on most non-linear problems, smooth boundaries

❌ Two hyperparameters to tune (C and γ)

🌊

Polynomial Kernel

K(xᵢ, xⱼ) = (γ xᵢ · xⱼ + r)ᵈ

Raises the dot product to degree d. Degree 2 creates quadratic boundaries (parabolas, ellipses). Degree 3 creates cubic shapes. Used in image classification and natural language processing where feature interactions up to degree d are meaningful.

✅ Captures feature interactions explicitly

❌ Many hyperparameters (C, d, γ, r), slow on large data

Kernel	Key Parameter	Effect of High Value	Effect of Low Value	Best For
Linear	C only	Tighter fit to training data	Wider margin, more regularised	Text, sparse high-D data
RBF	C and γ (gamma)	High γ → wiggly boundary, overfits	Low γ → smooth boundary, underfits	General-purpose, most datasets
Polynomial	C, degree d, γ, r	High d → very complex boundary	Low d (=1) → reduces to linear	Image processing, NLP interactions
Sigmoid	C, γ, r	Erratic behaviour	May not converge	Rarely used — prefer RBF or Linear

⚠️

The gamma Parameter — Most Dangerous Hyperparameter

In the RBF kernel, γ (gamma) controls the "reach" of each training point's influence. High γ → each point only influences its immediate neighbours → very jagged, localised boundary → extreme overfitting. Low γ → each point influences the entire space → overly smooth boundary → underfitting. In sklearn, gamma='scale' (default, sets γ = 1/n_features × var(X)) is almost always a better starting point than gamma='auto'.

Section 08

Python Implementation

Basic SVC — Iris Dataset with All Kernels

from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# ⚠️ CRITICAL: Always scale features before SVM
# SVM is distance-based — unscaled features dominate unfairly
kernels = ['linear', 'rbf', 'poly', 'sigmoid']

for kernel in kernels:
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('svc', SVC(
            kernel=kernel,
            C=1.0,
            gamma='scale',   # 1/(n_features × var(X))
            random_state=42
        ))
    ])
    pipe.fit(X_train, y_train)
    acc = pipe.score(X_test, y_test)
    nsv = pipe['svc'].n_support_.sum()  # total support vectors
    print(f"Kernel: {kernel:8s} | Accuracy: {acc:.4f} | Support Vectors: {nsv}")

OUTPUT

Inspecting the Fitted Model

# Fit a single model and inspect internals
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svc',    SVC(kernel='rbf', C=1.0, gamma='scale',
                    probability=True, random_state=42))
])
pipe.fit(X_train, y_train)

svc = pipe['svc']

print("Support vectors per class:", svc.n_support_)
print("Total support vectors:    ", svc.n_support_.sum())
print("Support vector shape:     ", svc.support_vectors_.shape)

# Predict probabilities (requires probability=True at fit time)
y_prob = pipe.predict_proba(X_test)
y_pred = pipe.predict(X_test)

print("\nSample predictions:")
for i in range(4):
    cls = iris.target_names[y_pred[i]]
    conf = y_prob[i].max() * 100
    print(f"  Predicted: {cls:12s} ({conf:.1f}% confidence)")

OUTPUT

Support vectors per class: [ 4 7 7] Total support vectors: 18 Support vector shape: (18, 4) Sample predictions: Predicted: setosa (99.8% confidence) Predicted: versicolor (71.4% confidence) Predicted: virginica (88.2% confidence) Predicted: setosa (99.7% confidence)

Hyperparameter Tuning — GridSearchCV

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svc',    SVC(random_state=42))
])

# Search over C and gamma on a log scale
param_grid = {
    'svc__kernel': ['rbf', 'linear'],
    'svc__C':      [0.01, 0.1, 1, 10, 100],
    'svc__gamma':  ['scale', 'auto', 0.001, 0.01, 0.1],
}

grid = GridSearchCV(
    pipe, param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=0
)
grid.fit(X_train, y_train)

print("Best parameters: ", grid.best_params_)
print(f"Best CV accuracy: {grid.best_score_:.4f}")
print(f"Test accuracy:    {grid.score(X_test, y_test):.4f}")

# Show top 5 combinations
import pandas as pd
results = pd.DataFrame(grid.cv_results_)
top5 = results.nsmallest(5, 'rank_test_score')[
    ['param_svc__kernel', 'param_svc__C',
     'param_svc__gamma', 'mean_test_score']
]
print(top5.to_string(index=False))

OUTPUT

Best parameters: {'svc__C': 10, 'svc__gamma': 'scale', 'svc__kernel': 'rbf'} Best CV accuracy: 0.9750 Test accuracy: 0.9667 param_svc__kernel param_svc__C param_svc__gamma mean_test_score rbf 10 scale 0.9750 rbf 1 scale 0.9667 linear 10 auto 0.9583 linear 1 scale 0.9583 rbf 100 scale 0.9583

LinearSVC — Faster for Large Datasets and Text

from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_20newsgroups
from sklearn.metrics import classification_report

# LinearSVC is much faster than SVC(kernel='linear') for large datasets
# Uses liblinear instead of libsvm — scales to millions of samples

categories = ['sci.space', 'comp.graphics',
              'rec.sport.hockey', 'talk.politics.guns']

train = fetch_20newsgroups(subset='train', categories=categories)
test  = fetch_20newsgroups(subset='test',  categories=categories)

text_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(
        sublinear_tf=True,
        max_features=50000,
        ngram_range=(1, 2),
        stop_words='english'
    )),
    ('clf', LinearSVC(
        C=1.0,
        max_iter=2000,
        random_state=42
    ))
])

text_pipeline.fit(train.data, train.target)
y_pred = text_pipeline.predict(test.data)

print(f"Test Accuracy: {text_pipeline.score(test.data, test.target):.4f}")
print(classification_report(
    test.target, y_pred,
    target_names=train.target_names
))

OUTPUT

Test Accuracy: 0.9741 precision recall f1-score support sci.space 0.98 0.98 0.98 394 comp.graphics 0.95 0.96 0.95 389 rec.sport.hockey 0.99 0.99 0.99 399 talk.politics.guns 0.97 0.97 0.97 364 accuracy 0.97 1546

⚡

SVC vs LinearSVC — When to Use Each

Use LinearSVC when: dataset has more than ~10,000 samples, features are already high-dimensional and sparse (text), or you need the fastest possible training. Use SVC(kernel='linear') when: you need predict_proba(), or your dataset is small enough that the overhead doesn't matter. For non-linear boundaries, SVC(kernel='rbf') is the only option.

Section 09

SVM for Regression — SVR

SVM can also do regression. Instead of finding a boundary that separates classes, SVR (Support Vector Regression) finds a tube of width 2ε that contains as many training points as possible while staying as flat as possible. Points inside the tube contribute zero loss. Points outside the tube are penalised.

SVR Objective

min ½‖w‖² + C Σ (ξᵢ + ξᵢ*)

Minimise the model complexity plus the total error outside the ε-tube. ξᵢ and ξᵢ* are slack variables for points above and below the tube.

The ε-Insensitive Loss

L(y, ŷ) = max(0, |y − ŷ| − ε)

Zero loss for predictions within ε of the true value. Linear penalty only outside the tube. ε controls how wide the "no-penalty zone" is.

📐 SVR — The Epsilon Tube Visualised

Tube Top

f(x) + ε — the upper boundary of the no-penalty zone

Centre

f(x) = w · x + b — the regression prediction line (flat as possible)

Tube Bottom

f(x) − ε — the lower boundary. Points inside → zero penalty.

Outside

Points beyond ±ε — penalised proportionally to their distance outside the tube. These become support vectors.

from sklearn.svm import SVR
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score, mean_absolute_error

housing = fetch_california_housing()
X, y = housing.data, housing.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# SVR — always needs StandardScaler
svr_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svr', SVR(
        kernel='rbf',
        C=100,           # penalty for points outside tube
        epsilon=0.1,     # tube half-width — zero loss inside
        gamma='scale'
    ))
])

svr_pipe.fit(X_train, y_train)
y_pred = svr_pipe.predict(X_test)

print(f"R²  Score: {r2_score(y_test, y_pred):.4f}")
print(f"MAE:       ${mean_absolute_error(y_test, y_pred)*100_000:.0f}")
print(f"Support Vectors: {svr_pipe['svr'].n_support_[0]}")

OUTPUT

R² Score: 0.7254 MAE: $41,823 Support Vectors: 10,628

📌

SVR Hyperparameters — What Controls What

ε (epsilon): width of the no-penalty tube. Larger ε → fewer support vectors, smoother fit, less sensitive to noise. Smaller ε → more support vectors, tighter fit. C: penalty for points outside the tube. High C → forces points inside → tight fit → overfitting risk. kernel + gamma: same as classification SVC.

Section 10

Feature Importance with Linear SVM

Linear SVMs have a weight vector w that directly indicates how much each feature contributes to the decision boundary. This is equivalent to feature importance — the magnitude of each coefficient tells you how strongly that feature pushes a prediction toward one class.

from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_breast_cancer
import numpy as np
import pandas as pd

cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svm',    LinearSVC(C=1.0, max_iter=5000, random_state=42))
])
pipe.fit(X, y)

# Coefficients — positive = pushes toward class 1 (malignant)
coef = pipe['svm'].coef_[0]

feat_df = pd.DataFrame({
    'feature':    cancer.feature_names,
    'coefficient': coef
}).sort_values('coefficient', key=lambda x: x.abs(), ascending=False)

print("Top features pushing toward MALIGNANT (positive coef):")
print(feat_df[feat_df['coefficient'] > 0].head(5).to_string(index=False))

print("\nTop features pushing toward BENIGN (negative coef):")
print(feat_df[feat_df['coefficient'] < 0].head(5).to_string(index=False))

OUTPUT

Top features pushing toward MALIGNANT (positive coef): feature coefficient worst concave points 1.832 worst radius 1.541 mean concave points 1.422 worst perimeter 1.198 Top features pushing toward BENIGN (negative coef): feature coefficient worst texture -0.872 mean texture -0.641 mean smoothness -0.538

⚠️

Coefficients Only Work for Linear Kernel

coef_ is only available for LinearSVC and SVC(kernel='linear'). For RBF, polynomial, or sigmoid kernels, there is no direct weight vector in the original feature space — the model lives in a high-dimensional transformed space. Use permutation_importance from sklearn for non-linear SVMs.

Section 11

SVM vs Other Classifiers — Complete Comparison

Property	SVM (RBF)	Random Forest	Logistic Reg.	Neural Net
Training speed	Slow O(n²–n³)	Fast (parallel)	Very fast	Very slow
Inference speed	O(n_sv × features)	Moderate	Fastest	Fast
Scales to large N	Poorly (>100k painful)	Well	Very well	Best
High-dimensional sparse	Excellent (linear)	Poor	Excellent	Moderate
Feature scaling needed	Mandatory	Never	Yes	Yes
Probability outputs	Via Platt scaling only	Native (vote fraction)	Native, well calibrated	Native (softmax)
Non-linear boundaries	Excellent (kernels)	Excellent	Cannot — linear only	Best
Hyperparameter tuning	C + gamma — sensitive	Robust defaults	Simple (C only)	Many — very sensitive
Interpretability	Linear: moderate	Feature importance	High — coefficients	Black box

Section 12

When to Use SVM

✅

Text Classification

LinearSVC with TF-IDF is one of the strongest text classifiers available. The linear kernel thrives in the sparse, ultra-high-dimensional space that TF-IDF creates — often beating far more complex models.

spam · news topics · sentiment

✅

Small to Medium Datasets

SVM shines when you have under 100,000 samples. Its slow O(n²–n³) training complexity becomes a problem at scale — but on smaller datasets, its strong theoretical guarantees pay off well.

under 50k samples

✅

Clear Margin of Separation

When the two classes are cleanly separable with a wide gap, SVM will find the optimal boundary with minimal data. It is the theoretically optimal classifier for linearly separable data.

clean binary problems

✅

High-Dimensional Features

SVM handles datasets where the number of features exceeds the number of samples — a scenario where many algorithms fail. Gene expression data (few patients, thousands of genes) is a classic SVM use case.

genomics · bioinformatics · NLP

❌

Very Large Datasets

Training SVC on 1 million samples is impractical — it could take hours or days. Use LinearSVC for large linear problems, or switch to SGDClassifier with a hinge loss for near-identical results at massive scale.

use SGDClassifier instead

❌

Heavy Overlapping Classes

When classes heavily overlap with no clear margin, SVM requires very aggressive tuning of C and struggles to find stable support vectors. Gradient Boosting or neural networks will typically perform better.

use XGBoost/LightGBM instead

Section 13

Strengths vs Weaknesses

✅ Strengths

Theoretically optimal — maximum margin guarantee

Excellent on high-dimensional sparse data (text)

Works when features > samples (rare advantage)

Robust to outliers far from the decision boundary

Memory efficient — only stores support vectors

Kernel trick handles complex non-linear boundaries

Strong generalisation with correct hyperparameters

Dual formulation enables efficient kernel computation

❌ Weaknesses

Slow training — O(n²) to O(n³) in samples

Feature scaling is mandatory — easy to forget

No native probability outputs — Platt scaling adds cost

Sensitive to C and gamma — tuning is non-trivial

Black box for non-linear kernels — not interpretable

Painful on >100k samples — does not scale

Multi-class requires one-vs-one or one-vs-rest tricks

No online/incremental learning support

🔄

SGDClassifier — SVM at Scale

For datasets too large for SVC, use SGDClassifier(loss='hinge') — this implements a linear SVM trained with Stochastic Gradient Descent. It scales to millions of samples, supports incremental/online learning, and produces nearly identical results to LinearSVC on large datasets. Set loss='modified_huber' if you also need probability estimates.

Section 14

Real-World Applications

Bioinformatics

🧬

Cancer classification from gene expression
Protein structure prediction
Works when genes >> patients
First major ML breakthrough in genomics

Image Recognition

🖼️

Handwriting recognition (MNIST)
Face detection (pre-deep learning era)
Object classification with HOG features
Still used for small training sets

Finance and Security

🛡️

Fraud detection and anomaly scoring
Credit risk classification
Intrusion detection systems
One-class SVM for outlier detection

Section 15

Golden Rules

⚔️ Support Vector Machine — Non-Negotiable Rules

Always scale features before SVM — without exception. SVM computes distances and dot products. An unscaled feature with range [0, 10,000] will dominate a feature with range [0, 1] completely. Use StandardScaler inside a Pipeline to prevent data leakage. Forgetting this is the single most common SVM mistake in practice.

Tune C and gamma together on a log scale. These two hyperparameters interact strongly — a good C at one gamma may be terrible at another. Use GridSearchCV with C ∈ {0.01, 0.1, 1, 10, 100} and gamma ∈ {'scale', 0.001, 0.01, 0.1}. Never tune one while holding the other fixed.

Start with gamma='scale' — never 'auto'. The 'scale' default (γ = 1 / (n_features × var(X))) is almost always better than 'auto' (γ = 1/n_features) because it accounts for the actual variance of your data, not just its dimensionality.

Use LinearSVC for text and large datasets, SVC for everything else. LinearSVC uses liblinear and scales O(n) — far faster than SVC(kernel='linear') which uses libsvm O(n²–n³). For more than ~10,000 samples with a linear kernel, LinearSVC is the correct choice.

For probabilities, set probability=True at construction time. Enabling it after fitting is not possible — the model must be re-trained. Be aware that probability=True uses 5-fold cross-validated Platt scaling internally, making training roughly 5× slower. Use it only when calibrated probabilities are genuinely needed.

If training takes too long, switch to SGDClassifier(loss='hinge'). This implements a linear SVM with stochastic gradient descent — same decision boundary, same theoretical guarantees, but O(n) training instead of O(n²). Add StandardScaler in the pipeline as always.

Monitor the number of support vectors. If nearly every training point becomes a support vector, your SVM is underfitting — increase C or try a more flexible kernel. If the number of support vectors grows as you add data, you likely have a noisy dataset and should investigate data quality before further tuning.

Use One-Class SVM for anomaly detection. sklearn.svm.OneClassSVM learns the boundary of normal data without any negative class examples — predicts −1 for outliers, +1 for inliers. Set nu (approximate fraction of outliers expected) rather than C. This is one of the best off-the-shelf anomaly detection algorithms available.