
Rosenblatt's Perceptron Algorithm

Learn how deep learning differs from traditional machine learning with intuitive stories, diagrams, and Python code. Covers neural networks, feature engineering, backpropagation, and when to use each approach.

Section 01

The Story That Separates Deep Learning from ML

The Detective and the Oracle
Imagine two investigators are given a photograph and asked: "Is this a cat or a dog?"

The classical ML detective pulls out a notepad. He measures ear shape, snout length, fur texture, eye spacing: each feature hand-crafted by a domain expert. Then he feeds those numbers into a formula and gives his verdict.

The deep learning oracle simply stares at the raw pixels for a very long time. Nobody told her what "ear" or "snout" means. She found those concepts on her own, buried inside millions of examples. Now she just knows, and she's usually right.

That difference, hand-crafted features vs. learned features, is the single most important distinction between classical ML and deep learning.

Classical Machine Learning is a toolkit of mathematical models (logistic regression, SVMs, decision trees, Random Forest) that learn patterns from structured, human-prepared features. A data scientist must decide which features to extract before training even begins.

Deep Learning is a sub-field of ML that uses layered artificial neural networks to learn hierarchical feature representations directly from raw data. The network builds its own internal vocabulary (edges, shapes, textures, concepts) layer by layer, without being told what to look for.

💡
Key Relationship

Deep Learning is not a replacement for Machine Learning; it is a specialised subset of it. All deep learning is machine learning, but not all machine learning is deep learning. Think of ML as the continent and deep learning as its largest, fastest-growing city.


Section 02

The Hierarchy: How They Fit Together

Before going further, a quick map so you never confuse the terms:

🧭 The AI Family Tree
Level 1
Artificial Intelligence: any technique that lets machines mimic human intelligence (rules, search, logic)
Level 2
Machine Learning: AI systems that learn from data instead of following hand-written rules
Level 3
Deep Learning: ML using multi-layered neural networks to learn hierarchical representations automatically
Level 4
Foundation Models / LLMs: very large deep learning models (GPT, BERT, Gemini) trained on internet-scale data

Section 03

Feature Engineering: The Dividing Line

The Chef vs The Recipe Robot
A classical ML pipeline is like hiring a team of chefs who each contribute their specialist knowledge: one measures sweetness, another sniffs for salt, a third checks texture. Only after all their measurements are combined does the machine taste the dish.

A deep learning pipeline is like a robot that tastes the raw ingredients directly, no chefs needed. Given enough dishes to taste, it eventually learns what "too salty" and "perfectly balanced" mean on its own. It's slower to train but it never needs a chef again.

⚙ Classical ML: You Engineer Features
Step | Who Does It
Collect raw data (images, text, audio) | Engineer
Extract meaningful features by hand | Domain Expert
Scale / normalise / encode features | Data Scientist
Feed clean feature vectors into model | Algorithm
Model maps features → prediction | Algorithm

⚡ Deep Learning: Network Engineers Features
Step | Who Does It
Collect raw data (images, text, audio) | Engineer
Feed raw data directly into network | Algorithm
Layer 1 learns low-level features (edges) | Network
Layer N learns high-level concepts (faces) | Network
Final layer maps concepts → prediction | Network

⚠️
The Cost of "Automatic" Features

Deep learning trades manual effort for compute and data. You no longer write the features, but you need thousands or millions of labelled examples and significant GPU hours to learn them. Classical ML can work well with just hundreds of rows and a laptop.
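
To make the "extract meaningful features by hand" step concrete, here is a minimal sketch, assuming a greyscale image stored as a NumPy array with values between 0 and 1. The function name `handcrafted_features` and the three features it computes are illustrative choices, not a standard recipe.

```python
import numpy as np

def handcrafted_features(image: np.ndarray) -> np.ndarray:
    """Turn a 2-D greyscale image into three hand-picked numbers."""
    mean_brightness = image.mean()
    contrast = image.std()
    # Mean absolute difference between horizontally adjacent pixels:
    # a crude measure of how many vertical edges the image contains
    edge_strength = np.abs(np.diff(image, axis=1)).mean()
    return np.array([mean_brightness, contrast, edge_strength])

# Toy 4x4 image: dark left half, bright right half
image = np.array([[0.0, 0.0, 1.0, 1.0]] * 4)
features = handcrafted_features(image)
print(features)  # three numbers: brightness 0.5, contrast 0.5, edge strength 1/3
```

A classical model would see only these three numbers, never the pixels. Every choice inside the function is a human decision, which is exactly the work a deep network takes over.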


Section 04

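Inside a Neural Network: The Core Mechanism

A neural network is built from layers of neurons. Each neuron receives inputs, multiplies them by learned weights, adds a bias, and passes the result through an activation function. With enough neurons, such a network can approximate any continuous function on a bounded input range to arbitrary accuracy, a result known as the Universal Approximation Theorem. The theorem already holds for a single, sufficiently wide hidden layer; in practice, stacking layers often reaches the same accuracy with far fewer neurons.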

🧠 What Happens Inside One Neuron
Input
Receive numbers from the previous layer: x₁, x₂, x₃ …
Weight
Multiply each input by a learnable weight, sum the results, and add a bias: w₁·x₁ + w₂·x₂ + w₃·x₃ + b
Activate
Pass the sum through a non-linear function (ReLU, sigmoid, tanh) to allow complex pattern learning
Output
Send the activated result to every neuron in the next layer
Neuron Output
y = σ(Wx + b)
W = weight matrix, x = inputs, b = bias, σ = activation function
ReLU Activation
ReLU(z) = max(0, z)
Most common hidden-layer activation. Sets negative values to zero, passes positives unchanged
Loss (Cross-Entropy)
L = −Σ y·log(ŷ)
Measures how wrong the predictions are: y is the true label (one-hot), ŷ is the predicted probability, and the sum runs over the classes. The network minimises this
Weight Update (SGD)
w ← w − η·∂L/∂w
η = learning rate. Gradient descent nudges weights in the direction that reduces loss
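
The four formulas above can be traced by hand in a few lines of NumPy. This is a minimal sketch with made-up numbers: the weights, inputs, logits, and gradient value are illustrative assumptions, not values from a trained network.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    exp = np.exp(z - z.max())  # subtract the max for numerical stability
    return exp / exp.sum()

# Neuron output: y = σ(Wx + b), here for a layer of two ReLU neurons
x = np.array([1.0, 2.0, 3.0])
W = np.array([[0.5, -1.0, 1.0],
              [1.0,  0.0, -1.0]])
b = np.array([0.0, 1.0])
layer_out = relu(W @ x + b)   # pre-activations are [1.5, -1.0]
print(layer_out)              # [1.5 0. ]

# Cross-entropy: L = −Σ y·log(ŷ) for a 3-class prediction
logits = np.array([2.0, 1.0, 0.1])
y_true = np.array([1.0, 0.0, 0.0])   # one-hot label: class 0 is correct
y_hat = softmax(logits)
loss = -np.sum(y_true * np.log(y_hat))
print(loss)

# Weight update: w ← w − η·∂L/∂w for a single weight with a given gradient
w, grad, eta = 0.8, 0.5, 0.1
w_new = w - eta * grad               # 0.8 - 0.1 * 0.5 = 0.75
```

Because the label is one-hot, the loss reduces to the negative log of the probability assigned to the correct class, so a confident correct prediction yields a loss close to zero.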
🔐
Why "Deep"?

The word deep refers to the number of hidden layers, not to some philosophical insight. A network with one or two hidden layers is usually called "shallow". Modern networks go much further: ResNet-50 has 50 layers, and the largest GPT-3 model stacks 96 Transformer layers. Each extra layer allows the network to build on the abstractions of the layer below it.
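
One point worth making concrete: extra layers only add expressive power because of the non-linear activation between them. The sketch below uses small hand-picked matrices (illustrative values, not taken from any real model) to show that two stacked layers without an activation collapse into a single layer, while inserting ReLU breaks that equivalence.

```python
import numpy as np

W1 = np.array([[1.0, 0.0],
               [0.0, 1.0]])   # layer 1: 2 inputs -> 2 hidden units
W2 = np.array([[1.0, 1.0]])   # layer 2: 2 hidden units -> 1 output
x = np.array([1.0, -2.0])

# Two layers with no activation in between ...
two_linear_layers = W2 @ (W1 @ x)
# ... give the same result as one layer with weight matrix W2 @ W1
one_collapsed_layer = (W2 @ W1) @ x
print(np.allclose(two_linear_layers, one_collapsed_layer))  # True

# With ReLU in between, the negative hidden value is zeroed out,
# so the network is no longer equivalent to a single matrix multiply
with_relu = W2 @ np.maximum(0.0, W1 @ x)
print(np.allclose(with_relu, one_collapsed_layer))          # False
```

Without the activation, a 100-layer network would be no more expressive than a single linear layer, so depth and non-linearity have to be read together.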


Section 05

How Learning Happens: Backpropagation

The Blame Game
Imagine a factory assembly line of 10 workers. A defective product comes out at the end. The manager asks: "Who is responsible?"

In backpropagation, the network does exactly this, but mathematically. It measures the error at the output, then propagates blame backwards through each layer, assigning a gradient (a share of responsibility) to every weight. Weights that contributed heavily to the error get adjusted more; innocent weights barely move. After millions of examples, the weights converge to values that produce correct answers.
01
Forward Pass
Input data flows forward through every layer. Each layer transforms the data until a final prediction is produced at the output layer.
02
Compute Loss
Compare the network's prediction against the true label using a loss function (e.g., cross-entropy for classification). This gives a single number measuring how wrong it is.
03
Backward Pass (Backprop)
Using the chain rule of calculus, the gradient of the loss with respect to every weight is computed, starting from the output and moving back through each layer.
04
Weight Update
An optimiser (SGD, Adam, RMSProp) uses the gradients to nudge every weight slightly in the direction that reduces the loss. Repeat millions of times.
05
Convergence
After enough passes over the training data (epochs, each made up of many weight-update iterations), the loss flattens out. The network has learned a mapping from inputs to outputs that fits the training data, and training stops.
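
The five steps above can be sketched end to end in plain NumPy. This is a minimal illustration under stated assumptions: a toy XOR dataset, one hidden layer of 8 tanh units, a sigmoid output, and full-batch gradient descent rather than SGD or Adam. The layer sizes, learning rate, and step count are arbitrary choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: XOR, which a model without a hidden layer cannot fit
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

W1 = rng.normal(scale=0.5, size=(2, 8)); b1 = np.zeros(8)
W2 = rng.normal(scale=0.5, size=(8, 1)); b2 = np.zeros(1)
learning_rate = 0.5
losses = []

for step in range(2000):
    # 1. Forward pass
    h = np.tanh(X @ W1 + b1)
    y_hat = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))

    # 2. Compute loss (binary cross-entropy, averaged over the 4 examples)
    loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    losses.append(loss)

    # 3. Backward pass: chain rule, from the output back to the first layer
    dz2 = (y_hat - y) / len(X)     # gradient w.r.t. the pre-sigmoid output
    dW2 = h.T @ dz2
    db2 = dz2.sum(axis=0)
    dh = dz2 @ W2.T                # blame passed back to the hidden layer
    dz1 = dh * (1 - h ** 2)        # derivative of tanh
    dW1 = X.T @ dz1
    db1 = dz1.sum(axis=0)

    # 4. Weight update: w <- w - learning_rate * gradient
    W1 -= learning_rate * dW1; b1 -= learning_rate * db1
    W2 -= learning_rate * dW2; b2 -= learning_rate * db2

# 5. Convergence: compare the loss at the start and at the end
print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

Frameworks such as TensorFlow compute step 3 automatically, but the arithmetic underneath has the same shape: one gradient per weight, obtained by walking backwards through the layers.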

Section 06

ML vs Deep Learning: Side-by-Side Comparison

Property | Classical ML | Deep Learning
Feature extraction | Manual, by domain expert | Automatic, learned from data
Data requirement | Works with hundreds of rows | Needs thousands to millions of examples
Compute requirement | CPU, laptop-scale | GPU / TPU, hours to weeks
Interpretability | Often explainable (trees, linear) | Black box, hard to explain
Best data types | Tabular / structured | Images, text, audio, video
Performance on unstructured data | Poor without heavy preprocessing | State-of-the-art
Performance on tabular data | Excellent (XGBoost still wins often) | Competitive but rarely better
Training time | Seconds to minutes | Hours to weeks
Inference speed | Very fast | Fast (but larger models are slow)
Transfer learning | Not typically possible | Yes, pre-trained models reused widely
🏆
The Practitioner's Rule of Thumb

Start with classical ML (XGBoost, Random Forest) for tabular data: it is faster, more interpretable, and often just as accurate. Move to deep learning when your data is images, audio, text, or any domain where human feature engineering is too expensive or impossible.


Section 07

Real-World Examples: Where Each Wins

📈
Classical ML Wins
Structured / Tabular Data
Credit scoring, fraud detection, customer churn, house price prediction, medical risk scoring. XGBoost and Random Forest often match or beat neural networks on small and medium-sized tables (roughly under 100K rows). Features are meaningful numbers, not raw pixels or words.
🖼️
Deep Learning Wins
Unstructured / Raw Data
Image classification, object detection, speech recognition, machine translation, sentiment analysis, generative AI. Any task where the input is raw pixels, waveforms, or words: hand-crafted features are far too expensive or simply impossible to define.
⚖️
The Gray Zone
Hybrid Approaches
Recommender systems, time-series forecasting, and NLP on structured logs sit in the middle. Deep learning models like Transformers have started challenging XGBoost even on tabular data (TabTransformer, FT-Transformer), so the boundary is actively shifting.

Section 08

Diagram: Layers Learning Representations

The power of depth is best understood by watching what each layer actually learns in a computer vision network.

👀 What CNN Layers Learn (Image Classification)
Layer 1–2
Low-level features: edges, corners, colour gradients. These are simple visual primitives no different from what you'd write by hand with an edge-detection filter
Layer 3–5
Mid-level features: textures, patterns, simple shapes. These are combinations of the edges from earlier layers forming something like "fur" or "scales"
Layer 6–10
High-level features: object parts such as eyes, wheels, doors, whiskers. These are concepts a domain expert would have had to manually define in classical ML
Final Layer
Semantic classes: "cat", "car", "face". This is the fully composed concept assembled from all prior layers, ready for the output prediction
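
To see what a hand-written edge-detection filter looks like, the kind of primitive the first layers end up learning on their own, here is a minimal sketch in NumPy. The helper name `filter2d` and the toy image are assumptions made for illustration; the kernel is the standard horizontal Sobel operator.

```python
import numpy as np

def filter2d(image, kernel):
    """Slide the kernel over the image (no padding) and sum elementwise products.
    Like a Conv2D layer, this is technically cross-correlation: the kernel is not flipped."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Hand-written vertical-edge detector (Sobel kernel)
sobel_x = np.array([[-1., 0., 1.],
                    [-2., 0., 2.],
                    [-1., 0., 1.]])

# 5x6 image: dark left half (0), bright right half (1)
image = np.zeros((5, 6))
image[:, 3:] = 1.0

response = filter2d(image, sobel_x)
print(response)  # each row is [0. 4. 4. 0.]: strong response only at the edge
```

In a CNN the nine numbers in the kernel are not written by hand: they start out random and are adjusted by backpropagation, and filters resembling this one tend to emerge in the first layer.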
🔭
Transfer Learning Follows Directly from This

Because early layers learn universal features (edges, textures exist in all natural images), a network trained on ImageNet can be fine-tuned on your 500-image medical dataset by freezing the early layers and only retraining the final classifier. Classical ML models cannot do this: they carry no reusable internal representation.
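
The freeze-and-retrain idea can be sketched without any framework. In this minimal NumPy illustration, `W_early` stands in for layers that were already trained on a large dataset (here it is just random numbers, purely an assumption for the demo), and only a new logistic-regression head is trained on a small hypothetical dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for pre-trained early layers (random here, for illustration only)
W_early = rng.normal(scale=0.3, size=(10, 16))
W_early_before = W_early.copy()

# Small hypothetical dataset: 50 examples, 10 raw inputs each, binary labels
X_small = rng.normal(size=(50, 10))
y_small = (X_small[:, 0] > 0).astype(float)

# New classifier head, trained from scratch
w_head = np.zeros(16)
b_head = 0.0

def predict(features):
    return 1.0 / (1.0 + np.exp(-(features @ w_head + b_head)))

def bce(p, y):
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Frozen layers: forward pass only, computed once because W_early never changes
features = np.maximum(0.0, X_small @ W_early)
initial_loss = bce(predict(features), y_small)

for _ in range(200):
    p = predict(features)
    grad_logits = (p - y_small) / len(y_small)
    w_head -= 0.2 * (features.T @ grad_logits)   # only the head is updated
    b_head -= 0.2 * grad_logits.sum()

final_loss = bce(predict(features), y_small)
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
print("early layers unchanged:", np.array_equal(W_early, W_early_before))
```

In Keras the same idea is expressed by setting `trainable = False` on the layers to freeze before compiling the model, so the optimiser only touches the new head.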


Section 09

Python Code: Classical ML vs Deep Learning on the Same Task

Let's train both approaches on the MNIST handwritten digit dataset (28×28 pixel greyscale images, 10 classes, 60K train / 10K test). The contrast shows exactly where the work lives in each paradigm.

Part A: Classical ML (Random Forest on flattened pixels)

# ── Classical ML approach: flatten image → feature vector → model ──
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from tensorflow.keras.datasets import mnist

# Load MNIST (28×28 grayscale images)
(X_train, y_train), (X_test, y_test) = mnist.load_data()

# ── Feature Engineering (manual): flatten 28×28 = 784 pixel values
X_train_flat = X_train.reshape(-1, 784) / 255.0  # normalise 0–1
X_test_flat  = X_test.reshape(-1, 784)  / 255.0

# ── No deeper feature design: we hand the raw pixels to the model
rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
rf.fit(X_train_flat, y_train)

y_pred = rf.predict(X_test_flat)
print(f"Random Forest Accuracy: {accuracy_score(y_test, y_pred):.4f}")
OUTPUT
Random Forest Accuracy: 0.9705

Part B: Deep Learning (CNN that learns its own features)

# ── Deep Learning approach: raw pixels → CNN learns features itself ──
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.datasets import mnist

# Load and reshape for CNN (needs channel dimension)
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train = X_train[..., np.newaxis] / 255.0  # shape: (60000, 28, 28, 1)
X_test  = X_test[...,  np.newaxis] / 255.0

# ── Architecture: no manual features, the Conv layers find them ──
model = models.Sequential([
    # Block 1: learns edges and simple patterns
    layers.Conv2D(32, (3,3), activation='relu', input_shape=(28,28,1)),
    layers.MaxPooling2D((2,2)),

    # Block 2: learns higher-order shapes
    layers.Conv2D(64, (3,3), activation='relu'),
    layers.MaxPooling2D((2,2)),

    # Flatten and classify
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.3),
    layers.Dense(10, activation='softmax')  # 10 digit classes
])

model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

model.fit(X_train, y_train,
         epochs=5,
         batch_size=128,
         validation_split=0.1,
         verbose=1)

test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
print(f"CNN Accuracy: {test_acc:.4f}")
OUTPUT
Epoch 1/5 - loss: 0.2341 - accuracy: 0.9301 - val_accuracy: 0.9812
Epoch 2/5 - loss: 0.0742 - accuracy: 0.9776 - val_accuracy: 0.9876
Epoch 3/5 - loss: 0.0544 - accuracy: 0.9833 - val_accuracy: 0.9904
Epoch 4/5 - loss: 0.0432 - accuracy: 0.9869 - val_accuracy: 0.9913
Epoch 5/5 - loss: 0.0356 - accuracy: 0.9892 - val_accuracy: 0.9921
CNN Accuracy: 0.9921
📊
What the Numbers Tell You

Random Forest reached 97.05% by treating every pixel as an independent feature: a surprisingly strong baseline, but it has no understanding of spatial structure.

The CNN reached 99.21% by learning that nearby pixels form edges, edges form curves, and curves form digit shapes: exactly the hierarchy no classical model can discover alone. That gap of roughly 2 percentage points is the sound of spatial understanding.
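
A gap of about 2 points can look small, so it helps to restate the two accuracies reported above as error counts on the 10,000 MNIST test images. The arithmetic below uses only those two figures and the test-set size; the variable names are illustrative.

```python
test_size = 10_000
rf_accuracy = 0.9705
cnn_accuracy = 0.9921

rf_errors = round((1 - rf_accuracy) * test_size)    # 295 misclassified digits
cnn_errors = round((1 - cnn_accuracy) * test_size)  # 79 misclassified digits

gap_points = (cnn_accuracy - rf_accuracy) * 100          # about 2.16 percentage points
error_reduction = (rf_errors - cnn_errors) / rf_errors   # about 0.73

print(f"{rf_errors} vs {cnn_errors} errors")
print(f"gap: {gap_points:.2f} points, relative error reduction: {error_reduction:.0%}")
```

Seen this way, the CNN makes roughly 73% fewer mistakes than the Random Forest on the same test set.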


Section 10

When to Use Which: Decision Guide

🧭 Choosing the Right Approach: Non-Negotiable Signals
1
If your data is a clean spreadsheet (<500K rows, no images or text), start with XGBoost or RandomForest. Classical ML will be faster to train, easier to explain, and usually just as accurate. Only switch to deep learning if you've squeezed every drop from the tree-based models.
2
If your input is images, audio, or raw text, use deep learning from the start. Pre-trained CNNs (ResNet, EfficientNet) and Transformers (BERT, DistilBERT) will outperform any hand-crafted feature pipeline by a wide margin.
3
If you have fewer than 1,000 labelled examples, be very careful with deep learning: a network trained from scratch is likely to overfit. Use transfer learning (fine-tune a pre-trained model) or stick with classical ML with strong regularisation.
4
If the model needs to be explainable (medical diagnosis, loan decisions, legal applications), classical ML is almost always the right choice. SHAP values and decision tree paths are far more trustworthy than neural network saliency maps.
5
If you are on a tight compute or time budget, classical ML wins. A Random Forest trains in seconds; a ResNet-50 from scratch takes hours on a GPU. For production systems where latency matters, smaller ML models often win on inference time too.
6
If your task is generative (create images, write text, synthesise audio), deep learning is effectively the only practical option: classical ML has no competitive equivalent of a diffusion model or a large language model.
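
The six signals above can be written down as a small rule function. This is a hypothetical helper, not an established API: the name `suggest_approach`, its parameters, and the order in which the rules are checked are all assumptions made for this sketch.

```python
def suggest_approach(data_type, n_labelled, needs_explanation=False,
                     tight_budget=False, generative=False):
    """Map the six signals to a starting point. Rule order is an assumption."""
    if generative:                                # signal 6
        return "deep learning"
    if needs_explanation or tight_budget:         # signals 4 and 5
        return "classical ML"
    if data_type in {"image", "audio", "text"}:   # signal 2
        if n_labelled < 1_000:                    # signal 3
            return "transfer learning"
        return "deep learning"
    return "classical ML"                         # signal 1: tabular data

print(suggest_approach("tabular", n_labelled=50_000))   # classical ML
print(suggest_approach("image", n_labelled=500))        # transfer learning
print(suggest_approach("text", n_labelled=100_000))     # deep learning
```

Real projects rarely reduce to five arguments, so treat the function as a checklist in code form rather than a decision procedure.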

Section 11

Common Deep Learning Architectures: Quick Map

🖼️
CNN
Convolutional Neural Network
Uses spatial convolution filters to detect local patterns in images. Dominant in computer vision: classification, detection, segmentation. Key models: LeNet, VGG, ResNet, EfficientNet.
🕐
RNN / LSTM
Recurrent Neural Network
Processes sequences by maintaining a hidden state across time steps. Used in time-series and early NLP. LSTMs added gating to mitigate the vanishing gradient problem. Largely replaced by Transformers for text.
🤖
Transformer
Attention-Based Architecture
Uses self-attention to relate all positions in a sequence simultaneously. Dominates NLP and vision. Powers BERT, GPT, T5, ViT. The architecture behind every modern LLM including Claude.
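
To show what "relate all positions in a sequence simultaneously" means in practice, here is a minimal sketch of single-head scaled dot-product self-attention in NumPy. It assumes random projection matrices and leaves out masking, multiple heads, and positional encodings; the sizes and names are illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Every position attends to every other position in one matrix product."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (seq_len, seq_len) similarity table
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 4, 8, 8
X = rng.normal(size=(seq_len, d_model))      # 4 token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))

output, weights = self_attention(X, Wq, Wk, Wv)
print(output.shape, weights.shape)           # (4, 8) (4, 4)
```

Row i of `weights` says how much token i draws from each of the other tokens. Unlike an RNN, nothing here is processed step by step, which is why the computation parallelises well.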
🌟
The Simplest Summary You Can Share

Classical ML: you hand the algorithm facts → it learns a decision.
Deep Learning: you hand the algorithm raw sensory data → it learns what facts to extract, then learns the decision.

Deep learning adds an extra representation-learning step that makes it powerful on unstructured data, and expensive on everything else.
