Complete Math Foundation

Math for AI Engineers

Everything you need to understand LLMs, embeddings, RAG, and neural networks — explained simply with real examples.

No proofs
No university extras
Applied to AI directly
3 Parts · 30 Topics
Part 01

Linear Algebra

1.1 Scalars

A scalar is simply a single number. That's it. When you say "the temperature is 37 degrees" or "the confidence is 0.92" — those are scalars. They have magnitude (size) but no direction.

Types of scalars

Type | Example | In AI
Positive integer | 5, 42, 1000 | Number of layers in a model
Negative | -3, -0.5 | Negative reward in RL
Decimal (float) | 0.92, 3.14 | Probability score, loss value
Zero | 0 | Padding, zero gradient

Scalar operations

# Simple scalar operations: just normal numbers
a = 5
b = 3
a + b = 8       # addition
a - b = 2       # subtraction
a * b = 15      # multiplication
a / b ≈ 1.67    # division (rounded)
Real AI Example

When an LLM gives you a confidence score like 0.94 — that is a scalar.
When your loss function returns 2.37 during training — that is a scalar.
When you set learning rate = 0.001 — that is a scalar.

Key Difference
A scalar is just a plain number. A vector is a list of numbers. A matrix is a grid of numbers. That's the core idea of linear algebra.
1.2 Vectors

A vector is an ordered list of numbers. Think of it as a point in space that also has a direction. In AI, vectors are everywhere — your text, images, and audio are all converted into vectors called embeddings.

Writing vectors

# Row vector (horizontal)
v = [3, 4]

# Column vector (vertical)
v = [3]
    [4]

# 3D vector
v = [1, 2, 3]

# An embedding vector (real AI example, 4 dimensions)
word_king = [0.81, 0.34, -0.23, 0.92]

Vector Addition

Add matching positions together. Both vectors must have the same length.

u = [1, 2, 3]
v = [4, 5, 6]
u + v = [1+4, 2+5, 3+6]
      = [5, 7, 9]
Example 2 — Subtraction
u = [9, 6, 3]
v = [2, 1, 1]
u - v = [9-2, 6-1, 3-1]
      = [7, 5, 2]

Scalar Multiplication

Multiply every element of the vector by the scalar. This scales the vector — makes it longer or shorter without changing direction (if scalar is positive).

v = [2, 3, 5]
scalar = 3
3 * v   = [3*2, 3*3, 3*5]       = [6, 9, 15]
0.5 * v = [0.5*2, 0.5*3, 0.5*5] = [1, 1.5, 2.5]

Magnitude (Length) of a Vector — L2 Norm

The magnitude tells you how long the vector is. Formula: square each element, sum them, take the square root.

v = [3, 4]

||v|| = √(3² + 4²) = √(9 + 16) = √25 = 5
Example — 3D vector
v = [1, 2, 2]
||v|| = √(1² + 2² + 2²) = √(1 + 4 + 4) = √9 = 3

Unit Vector

A unit vector has magnitude exactly 1. To make a unit vector from any vector, divide each element by the magnitude.

v = [3, 4]
||v|| = 5
unit_v = v / ||v|| = [3/5, 4/5] = [0.6, 0.8]
# Verify: √(0.6² + 0.8²) = √(0.36 + 0.64) = √1 = 1 ✓
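The magnitude and unit-vector recipes above can be sketched in a few lines of plain Python. The function names `magnitude` and `unit_vector` are illustrative choices, not part of any library.

```python
import math

def magnitude(v):
    """L2 norm: square each element, sum, take the square root."""
    return math.sqrt(sum(x * x for x in v))

def unit_vector(v):
    """Scale v to length 1. The zero vector has no direction, so reject it."""
    length = magnitude(v)
    if length == 0:
        raise ValueError("the zero vector cannot be normalized")
    return [x / length for x in v]

print(magnitude([3, 4]))     # 5.0
print(magnitude([1, 2, 2]))  # 3.0
print(unit_vector([3, 4]))   # [0.6, 0.8]
```

The zero-vector check is an addition to the text: dividing by a magnitude of 0 would otherwise raise a less helpful error.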
Why this matters in AI
In a vector database (like Pinecone or Weaviate), your text is stored as a vector of 768 or 1536 numbers. When you search, you find vectors close to your query vector. Magnitude and direction determine what "close" means.
Quick Check
What is the magnitude of vector v = [6, 8]?
12
14
10
48
√(36 + 64) = √100 = 10 ✓ Remember: square each number, add, square root.
1.3 Matrices

A matrix is a 2D grid of numbers arranged in rows and columns. Think of it as a spreadsheet. Matrices are how neural networks store their learned weights.

Matrix notation — rows × columns

# 2×2 matrix (2 rows, 2 columns)
A = [1, 2]
    [3, 4]

# 2×3 matrix (2 rows, 3 columns)
B = [1, 2, 3]
    [4, 5, 6]

# 3×2 matrix (3 rows, 2 columns)
C = [10, 20]
    [30, 40]
    [50, 60]

Special Matrices

# Identity matrix: like the number 1 for matrices
# Any matrix × Identity = same matrix
I = [1, 0]
    [0, 1]

# Zero matrix: like the number 0
O = [0, 0]
    [0, 0]

Matrix Addition and Subtraction

Both matrices must be the same size. Add/subtract matching positions.

A = [1, 2]    B = [5, 6]
    [3, 4]        [7, 8]

A + B = [1+5, 2+6]    A - B = [1-5, 2-6]
        [3+7, 4+8]            [3-7, 4-8]

      = [ 6,  8]            = [-4, -4]
        [10, 12]              [-4, -4]
Why this matters in AI
In a neural network, the weights of one layer are stored in a matrix. When you train the model, you update the weight matrix by subtracting a small gradient matrix from it — that's matrix subtraction in action.
1.4 Transpose

Transposing a matrix means flipping rows into columns. The first row becomes the first column, second row becomes the second column, and so on. Written as Aᵀ or A^T.

# Original 2×3 matrix
A = [1, 2, 3]
    [4, 5, 6]

# Transposed: becomes a 3×2 matrix
A^T = [1, 4]
      [2, 5]
      [3, 6]
Example 2 — square matrix
A = [1, 2]    A^T = [1, 3]
    [3, 4]          [2, 4]
# Row 1 became Column 1
# Row 2 became Column 2
Transpose of a vector
A row vector [1, 2, 3] transposed becomes a column vector:
[1]
[2]
[3]
This matters for matrix multiplication — you'll use transpose to align dimensions.
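A minimal sketch of the transpose, assuming a matrix is stored as a list of equal-length rows (a convention chosen here for illustration):

```python
def transpose(matrix):
    """Turn rows into columns. zip(*matrix) groups the i-th entry of every row."""
    return [list(column) for column in zip(*matrix)]

A = [[1, 2, 3],
     [4, 5, 6]]

print(transpose(A))            # [[1, 4], [2, 5], [3, 6]]
print(transpose([[1, 2, 3]]))  # [[1], [2], [3]]  (row vector -> column vector)
```

Transposing twice returns the original matrix, which is a handy sanity check.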
In Transformers — QKᵀ
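The attention formula is softmax(QKᵀ / √dₖ) · V, and the T in QKᵀ is a transpose. Q is a matrix of queries and K is a matrix of keys, each with one row per token. Transposing K lines up the inner dimensions so that every query row can be dotted with every key row. This is one of the most important operations in modern AI.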
1.5 Dot Product

The dot product takes two vectors of the same length and returns a single scalar. Multiply matching positions, then add everything up.

u = [1, 2, 3] v = [4, 5, 6]

u · v = (1×4) + (2×5) + (3×6)
= 4 + 10 + 18
= 32
Example 2 — with negatives
u = [2, -1, 3]
v = [4, 5, -2]
u · v = (2×4) + (-1×5) + (3×-2)
      = 8 + (-5) + (-6)
      = -3

What the value tells you

Dot Product | Meaning | Vectors are...
Large positive | Pointing in same direction | Similar
Zero | Perpendicular (90°) | Unrelated
Large negative | Pointing opposite | Opposite meaning
Embedding similarity — Real AI use case
# Simplified embeddings for words
king   = [0.8, 0.3, 0.9]
queen  = [0.7, 0.4, 0.8]
banana = [0.1, 0.9, 0.1]

king · queen  = (0.8×0.7) + (0.3×0.4) + (0.9×0.8)
              = 0.56 + 0.12 + 0.72
              = 1.40   ← HIGH: similar!

king · banana = (0.8×0.1) + (0.3×0.9) + (0.9×0.1)
              = 0.08 + 0.27 + 0.09
              = 0.44   ← LOW: different!
1.6 Cosine Similarity

The dot product has a problem: longer vectors get higher scores simply because they are long, not because they are more similar. Cosine similarity fixes this by dividing the dot product by both magnitudes. It compares direction only, ignoring size.

cos(θ) = (u · v) / (||u|| × ||v||)

Result is always between -1 and 1:

Score | Meaning | Example
1.0 | Identical direction | "king" vs "king"
0.85+ | Very similar | "king" vs "queen"
~0.0 | Unrelated | "king" vs "banana"
-1.0 | Opposite meaning | "hot" vs "cold"
Full worked example
u = [3, 4]
v = [6, 8]   ← v is just 2× u, same direction

# Step 1: dot product
u · v = (3×6) + (4×8) = 18 + 32 = 50

# Step 2: magnitudes
||u|| = √(9+16) = 5
||v|| = √(36+64) = 10

# Step 3: divide
cos(θ) = 50 / (5 × 10) = 50 / 50 = 1.0

# Result: 1.0 → Identical direction! Makes sense.
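The three steps of the worked example translate directly into code. This is a minimal sketch in plain Python. The helper names are illustrative, and the zero-vector check is an added safeguard the text does not discuss.

```python
import math

def dot(u, v):
    """Multiply matching positions, then add everything up."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Dot product divided by both magnitudes; always between -1 and 1."""
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(v, v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot(u, v) / (norm_u * norm_v)

print(dot([1, 2, 3], [4, 5, 6]))             # 32
print(cosine_similarity([3, 4], [6, 8]))     # 1.0  (same direction)
print(cosine_similarity([1, 0], [0, 1]))     # 0.0  (perpendicular)
print(cosine_similarity([3, 4], [-3, -4]))   # -1.0 (opposite)
```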
In RAG Systems
When you ask a question, it's converted to an embedding vector. Cosine similarity compares your question vector to the stored chunk vectors (in practice through an approximate index rather than one by one). The highest-scoring chunks are retrieved as context for the LLM. This is the retrieval step at the heart of RAG.
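Here is a toy sketch of that retrieval step. Everything in it is hypothetical: the chunk names, the 3-dimensional embeddings (real ones have hundreds of dimensions), and the `top_k` helper. It uses a brute-force scan, whereas a real vector database would use an index.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query, chunks, k=2):
    """Return the ids of the k chunks most similar to the query embedding."""
    scored = [(cosine_similarity(query, emb), chunk_id)
              for chunk_id, emb in chunks.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [chunk_id for _, chunk_id in scored[:k]]

# Made-up embeddings for three stored document chunks
chunks = {
    "refund_policy":  [0.9, 0.1, 0.2],
    "shipping_times": [0.2, 0.8, 0.1],
    "privacy_notice": [0.1, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # pretend embedding of "How do I get my money back?"

print(top_k(query, chunks, k=2))  # ['refund_policy', 'shipping_times']
```

The retrieved chunk ids would then be looked up and their text pasted into the LLM prompt as context.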
1.7 Matrix Multiplication

Matrix multiplication is the most important operation in neural networks. Every forward pass is a series of matrix multiplications. It's NOT element-wise — it's rows × columns.

The Rule: inner dimensions must match

# (m × n) × (n × p) = (m × p)
# The inner n's must match. Output is m × p.
(2×3) × (3×4) = (2×4)   ✓ inner 3s match
(2×3) × (4×3) = ERROR   ✗ 3 ≠ 4, doesn't work
(5×2) × (2×1) = (5×1)   ✓ inner 2s match

How to compute it — Row × Column

A = [1, 2]    B = [5, 6]
    [3, 4]        [7, 8]

# Output[0][0] = Row 0 of A · Col 0 of B
#              = [1,2] · [5,7] = 1×5 + 2×7 = 5+14 = 19
# Output[0][1] = Row 0 of A · Col 1 of B
#              = [1,2] · [6,8] = 1×6 + 2×8 = 6+16 = 22
# Output[1][0] = Row 1 of A · Col 0 of B
#              = [3,4] · [5,7] = 3×5 + 4×7 = 15+28 = 43
# Output[1][1] = Row 1 of A · Col 1 of B
#              = [3,4] · [6,8] = 3×6 + 4×8 = 18+32 = 50

A × B = [19, 22]
        [43, 50]
Example 2 — 2×3 times 3×2
A = [1, 0, 2]    B = [3, 1]
    [4, 1, 0]        [2, 0]
                     [1, 4]

# (2×3) × (3×2) = (2×2)
Out[0][0] = [1,0,2]·[3,2,1] = 3+0+2  = 5
Out[0][1] = [1,0,2]·[1,0,4] = 1+0+8  = 9
Out[1][0] = [4,1,0]·[3,2,1] = 12+2+0 = 14
Out[1][1] = [4,1,0]·[1,0,4] = 4+0+0  = 4

Result = [ 5, 9]
         [14, 4]

Matrix × Vector — The Neural Network Operation

# Neural net: output = W × input + bias
# W is weights (matrix), input is a vector
W = [2, 1]
    [0, 3]
    [1, 1]      ← 3×2 weight matrix

x = [4]
    [5]         ← input vector (2×1)

W × x = [2×4 + 1×5]   [13]
        [0×4 + 3×5] = [15]   ← output vector
        [1×4 + 1×5]   [ 9]
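The row × column rule can be written as a short pure-Python function. This is a teaching sketch with an illustrative name (`matmul`). Real projects would use NumPy or PyTorch, which do the same arithmetic much faster.

```python
def matmul(A, B):
    """Multiply an (m x n) matrix by an (n x p) matrix, both stored as lists of rows."""
    n = len(B)
    if any(len(row) != n for row in A):
        raise ValueError("inner dimensions must match")
    columns_of_B = list(zip(*B))
    # Each output entry is the dot product of one row of A with one column of B
    return [[sum(a * b for a, b in zip(row, col)) for col in columns_of_B]
            for row in A]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul(A, B))  # [[19, 22], [43, 50]]
print(matmul(B, A))  # [[23, 34], [31, 46]]  (different: order matters)

W = [[2, 1], [0, 3], [1, 1]]  # 3x2 weight matrix
x = [[4], [5]]                # 2x1 input vector
print(matmul(W, x))           # [[13], [15], [9]]
```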
Order Matters! AB ≠ BA
Unlike numbers where 3×4 = 4×3, with matrices order matters. AB and BA often give completely different results — and sometimes BA doesn't even work because the dimensions don't match.
1.8 L1 and L2 Norms

A norm is a way to measure the "size" or "length" of a vector. There are different ways to measure size, and L1 and L2 are the two most important ones.

L1 Norm — Manhattan Distance

Add up the absolute values of all elements. Imagine walking in a city — you can only go left/right or up/down, not diagonally.

v = [3, -4, 2]

||v||₁ = |3| + |-4| + |2| = 3 + 4 + 2 = 9

L2 Norm — Euclidean Distance

This is the "straight line" distance. Square everything, sum, take square root. This is the default "magnitude" we've been using.

v = [3, -4, 2]

||v||₂ = √(3² + (-4)² + 2²) = √(9+16+4) = √29 ≈ 5.39
Norm | Formula | When used
L1 | Sum of |values| | Sparse features, LASSO regularization
L2 | √(Sum of squares) | Embeddings, most distance calculations
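Both norms are one-liners in Python. A minimal sketch, with function names chosen here for illustration:

```python
import math

def l1_norm(v):
    """Manhattan size: add up the absolute values."""
    return sum(abs(x) for x in v)

def l2_norm(v):
    """Euclidean size: square, sum, take the square root."""
    return math.sqrt(sum(x * x for x in v))

v = [3, -4, 2]
print(l1_norm(v))            # 9
print(round(l2_norm(v), 2))  # 5.39
```

For any vector the L1 norm is at least as large as the L2 norm, because walking along the grid is never shorter than the straight line.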
In embeddings
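Many embedding models normalize their output vectors to have L2 norm = 1: OpenAI's embedding models do, and Sentence Transformers models do when they include a normalization layer or when you request normalized output. For such vectors the denominator of cosine similarity is just 1×1 = 1, so cosine similarity = dot product. That lets a vector database skip the magnitude calculation and rank results with plain dot products.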
1.9 Normalization

Normalization means adjusting values so they fit into a standard range or distribution. In AI, this happens constantly — before training, during training, and when using embeddings.

Vector Normalization (Unit Norm)

Divide every element by the L2 norm. Result: a vector with magnitude = 1, same direction.

v = [3, 4]
||v||₂ = 5
normalized = [3/5, 4/5] = [0.6, 0.8]
# Verify: √(0.6² + 0.8²) = √(0.36 + 0.64) = √1 = 1 ✓

Min-Max Normalization

Scale values to fit between 0 and 1. Useful for training data preparation.

data = [10, 20, 30, 40, 50]
min = 10, max = 50

formula: (x - min) / (max - min)

normalized = [(10-10)/40, (20-10)/40, (30-10)/40, ...]
           = [0.0, 0.25, 0.5, 0.75, 1.0]
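The min-max formula above, as a small Python sketch. The function name is illustrative, and the guard for constant data is an added assumption, since the formula would otherwise divide by zero.

```python
def min_max_normalize(values):
    """Rescale values so the smallest becomes 0 and the largest becomes 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical, so there is no range to scale")
    return [(x - lo) / (hi - lo) for x in values]

print(min_max_normalize([10, 20, 30, 40, 50]))  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

When preparing training data, a common practice is to compute min and max on the training set only and reuse them for validation and test data.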

Why Normalization Matters for RAG

Embedding normalization
OpenAI's text-embedding-ada-002 returns vectors that are already L2-normalized (magnitude = 1), and many other modern embedding models do the same. For those vectors, dot product = cosine similarity, so a vector DB can rank results with the cheaper dot product. Not every model normalizes its output, so check the magnitude of a few vectors and normalize them yourself if it is not 1.
1.10 Linear Combinations

A linear combination is what you get when you take several vectors, multiply each by a scalar (a weight), and add the results together. It is one of the most fundamental operations in linear algebra, and it is the core of what a neural network layer computes before the bias and activation function are applied.

Given vectors v₁, v₂, v₃ and scalars c₁, c₂, c₃:

result = c₁·v₁ + c₂·v₂ + c₃·v₃

Worked Example

# Two vectors
v1 = [1, 0]
v2 = [0, 1]

# Weights (scalars)
c1 = 3
c2 = 5

# Linear combination
result = 3 × [1, 0] + 5 × [0, 1]
       = [3, 0] + [0, 5]
       = [3, 5]

Geometric meaning

Geometrically, each vector is like an arrow pointing in a direction. Multiplying by a scalar stretches or shrinks that arrow. Adding the arrows together gives you a new arrow — the combination. By changing the weights, you can reach any point in space that those vectors can describe.

Weighted sum of word embeddings
# Sentence embedding = weighted sum of word embeddings
word_the = [0.1, 0.2]
word_cat = [0.9, 0.4]
word_sat = [0.5, 0.8]

# Equal weights (simple average is a linear combination)
sentence = (1/3)·word_the + (1/3)·word_cat + (1/3)·word_sat
         = [0.500, 0.467]
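A linear combination is easy to sketch in plain Python. The function name `linear_combination` is an illustrative choice, and the word vectors are the toy values from the example above.

```python
def linear_combination(weights, vectors):
    """Return weights[0]*vectors[0] + weights[1]*vectors[1] + ... position by position."""
    if len(weights) != len(vectors):
        raise ValueError("need exactly one weight per vector")
    size = len(vectors[0])
    return [sum(c * v[i] for c, v in zip(weights, vectors)) for i in range(size)]

print(linear_combination([3, 5], [[1, 0], [0, 1]]))  # [3, 5]

# Averaging word embeddings is a linear combination with equal weights
words = [[0.1, 0.2], [0.9, 0.4], [0.5, 0.8]]
sentence = linear_combination([1 / 3, 1 / 3, 1 / 3], words)
print([round(x, 3) for x in sentence])  # [0.5, 0.467]
```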
Why this matters in AI
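Every neuron in a neural network computes a weighted sum of its inputs, which is a linear combination, then adds a bias and applies an activation function. Seen at the level of a layer, W × x is a linear combination of the columns of W, with the input values acting as the scalars c₁, c₂, ... Training a neural network means finding the weights that produce the output you want. It's linear combinations, with nonlinear activations in between, all the way down.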
1.11 Span

The span of a set of vectors is the complete collection of all possible linear combinations you can make from them. In plain English: every point you could ever reach by stretching, shrinking, and adding those vectors together.

The span of {v₁, v₂, ..., vₖ} is:

span{v₁, ..., vₖ} = { c₁v₁ + c₂v₂ + ... + cₖvₖ | c₁,...,cₖ ∈ ℝ }

Geometric intuition

# One vector in 2D → span is a line through the origin
v1 = [1, 2]
span{v1} = all multiples: [2,4], [-1,-2], [0.5,1], ... → a LINE

# Two independent vectors in 2D → span is the entire plane
v1 = [1, 0]
v2 = [0, 1]
span{v1, v2} = any [x, y] you want → the entire 2D PLANE
Why span matters
When you store a document as an embedding vector in a high-dimensional space, that embedding only "lives" in a certain region of the space — the span of the training data's embedding directions. PCA and SVD work by finding which directions in that span carry the most information.
Span in PCA and SVD
When you have a dataset with 1000 features (dimensions), most of the meaningful variation lives in a much smaller subspace — often just 10–50 directions. PCA finds the span of those directions. Everything inside the span is captured; everything outside is discarded as noise. This is how dimensionality reduction works.
1.12 Basis

A basis is the smallest set of vectors whose span covers an entire space — with no redundancy. Every vector in the space can be written as exactly one linear combination of the basis vectors. You can think of a basis as the coordinate system for a space.

Standard Basis

The standard basis is the most familiar one. In 2D, it uses two simple unit vectors pointing along the axes:

# Standard basis for 2D space
e1 = [1, 0]   ← points along the x-axis
e2 = [0, 1]   ← points along the y-axis

# Any 2D vector is just a combination of these
[3, 5]  = 3·e1 + 5·e2
[-2, 7] = -2·e1 + 7·e2

# Standard basis for 3D space
e1 = [1, 0, 0]
e2 = [0, 1, 0]
e3 = [0, 0, 1]

What makes a valid basis?

Requirement | Meaning
Spans the space | Every point in the space is reachable
Linearly independent | No vector in the set is redundant
Minimal | You can't remove any vector and still span the space
Non-standard basis example
# This is also a valid basis for 2D space
b1 = [1, 1]
b2 = [1, -1]

# We can still reach any 2D point
# [3, 5] = 4·b1 + (-1)·b2 = [4,4] + [-1,1] = [3,5] ✓

# In PCA, the principal components ARE the new basis
pc1 = [0.71, 0.71]    ← direction of most variance
pc2 = [-0.71, 0.71]   ← perpendicular direction
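To find the coordinates of a point in a non-standard 2D basis, you solve c1·b1 + c2·b2 = target for c1 and c2. The sketch below does this with Cramer's rule, a technique chosen here for brevity and not mentioned in the text. The function name is illustrative, and it only handles the 2D case.

```python
def coordinates_in_basis(target, b1, b2):
    """Solve c1*b1 + c2*b2 = target for 2D vectors using Cramer's rule."""
    det = b1[0] * b2[1] - b2[0] * b1[1]
    if det == 0:
        raise ValueError("b1 and b2 are linearly dependent, so they are not a basis")
    c1 = (target[0] * b2[1] - b2[0] * target[1]) / det
    c2 = (b1[0] * target[1] - target[0] * b1[1]) / det
    return c1, c2

print(coordinates_in_basis([3, 5], [1, 1], [1, -1]))  # (4.0, -1.0)
print(coordinates_in_basis([3, 5], [1, 0], [0, 1]))   # (3.0, 5.0)
```

The same point has different coordinates in different bases, and each pair of coordinates is the unique answer for its basis.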
Basis in coordinate systems and PCA
When you do PCA on a dataset, you're essentially finding a new, better basis — one that aligns with the directions of greatest variation in your data. Projecting data onto this new basis is what dimensionality reduction means. The principal components are the new basis vectors for a compressed representation of your data.
1.13 Dimension

The dimension of a space is simply the number of vectors in its basis — which is also the number of independent directions you need to describe every point in that space. It tells you how many coordinates it takes to pin down a location.

dim(space) = number of basis vectors required

2D plane → dimension 2
3D space → dimension 3
text-embedding-ada-002 embedding space → dimension 1536

High-dimensional spaces in AI

# Each model uses vectors of a fixed dimension
text-embedding-ada-002    → 1536-dimensional vectors
BERT base                 → 768-dimensional vectors
GPT-2 hidden states       → 768-dimensional vectors
Llama 3 8B hidden states  → 4096-dimensional vectors

# Each token processed by an LLM is a point
# in a 4096-dimensional space (for Llama 3 8B)
token_"cat" = [0.23, -0.81, ..., 0.14]   ← 4096 numbers

Why high dimensions are strange

The curse of dimensionality
In high-dimensional spaces, almost everything is far from everything else. If you have 1000 data points in a 1536-dimensional embedding space, they are likely spread very thin. This is why nearest-neighbor search in vector databases is a hard engineering problem — and why approximate methods like HNSW and IVF indexes exist.
Embeddings and LLM vector spaces
Every word, sentence, or document that an LLM processes exists as a point in a high-dimensional space. The dimension of that space is a design choice made by the model creators — higher dimensions can capture more nuance but cost more memory and compute. Dimensionality reduction (PCA, UMAP) collapses these spaces down to 2D or 3D so humans can visualize them.
1.14 Linear Independence

A set of vectors is linearly independent if none of the vectors can be written as a linear combination of the others. In plain terms: no vector in the group is redundant — each one adds new directional information that the others cannot provide.

Vectors v₁, v₂, ..., vₖ are linearly independent if:

c₁v₁ + c₂v₂ + ... + cₖvₖ = 0
has ONLY the solution c₁ = c₂ = ... = cₖ = 0

Dependent vs Independent — concrete examples

--- LINEARLY DEPENDENT (bad: one is redundant) ---
v1 = [1, 2]
v2 = [2, 4]   ← v2 = 2 × v1, adds nothing new
v3 = [3, 6]   ← v3 = 3 × v1, also redundant
# These three vectors only span a LINE, not a plane

--- LINEARLY INDEPENDENT (good: each adds new info) ---
v1 = [1, 0]
v2 = [0, 1]   ← cannot be made from v1
# These two vectors span the full 2D plane
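One way to test independence in code is to count how many independent directions a set of vectors contains (its rank) and compare that to the number of vectors. The sketch below uses Gaussian elimination in pure Python. The function names and the tolerance `tol` are assumptions made for illustration, and in practice a library routine such as NumPy's rank function would normally be used.

```python
def rank(vectors, tol=1e-9):
    """Count independent directions among the vectors via Gaussian elimination."""
    rows = [[float(x) for x in v] for v in vectors]
    n_cols = len(rows[0]) if rows else 0
    found = 0  # number of pivot rows found so far
    for col in range(n_cols):
        # Pick the remaining row with the largest entry in this column
        pivot = max(range(found, len(rows)),
                    key=lambda r: abs(rows[r][col]), default=None)
        if pivot is None or abs(rows[pivot][col]) < tol:
            continue  # no new direction in this column
        rows[found], rows[pivot] = rows[pivot], rows[found]
        for r in range(found + 1, len(rows)):
            factor = rows[r][col] / rows[found][col]
            rows[r] = [a - factor * b for a, b in zip(rows[r], rows[found])]
        found += 1
    return found

def is_linearly_independent(vectors):
    return rank(vectors) == len(vectors)

print(rank([[1, 2], [2, 4], [3, 6]]))             # 1 (they only span a line)
print(is_linearly_independent([[1, 0], [0, 1]]))  # True
```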

What independence means in practice

Situation | Meaning | Problem
Dependent features | One feature = combination of others | Redundant info, hurts training
Dependent weight rows | Matrix is singular / rank-deficient | Cannot invert, unstable solution
Independent features | Each feature adds unique information | None (this is what you want)
Feature engineering example
# These two features carry the same information
temperature_celsius    = [0, 20, 100]
temperature_fahrenheit = [32, 68, 212]

# fahrenheit = celsius × 1.8 + 32 → exact linear formula
# Together with a constant (bias) term, the columns are linearly DEPENDENT:
# fahrenheit = 1.8 × celsius + 32 × [1, 1, 1]
# Giving both to a model adds zero new information
# and can cause numerical instability
# Solution: drop one. Use only Celsius.
In Feature Engineering and PCA
When two features are exact rescalings of each other (e.g. height in cm and height in inches), they are linearly dependent. Strongly correlated features are nearly dependent, which causes milder versions of the same problems. Models can still learn from them, but it wastes capacity and can create numerical problems. PCA works by finding uncorrelated directions in your data: the principal components are orthogonal, and therefore linearly independent. Checking for and removing dependent features is a core step in good data preparation.
Quick Check
Are the vectors v1 = [1, 2, 3] and v2 = [2, 4, 6] linearly independent?
Yes — they have different values
No — v2 = 2 × v1, so v2 is redundant
It depends on their magnitudes
Only in 3D space are they dependent
Correct! v2 is exactly 2 × v1. They point in the same direction and span only a line, not a plane. Linearly independent vectors must each add a genuinely new direction.
1.15 Tensors

A tensor is the general container for numbers in any number of dimensions. Scalars, vectors, and matrices are all just special cases of tensors. In deep learning, every single piece of data — inputs, weights, gradients, activations — lives inside a tensor.

0D tensor = Scalar → a single number
1D tensor = Vector → a list of numbers
2D tensor = Matrix → a grid of numbers
3D tensor → a stack of matrices
ND tensor → any number of dimensions

Visualising the dimensions

# 0D: Scalar
loss = 2.37                             # shape: ()

# 1D: Vector
embedding = [0.2, 0.8, -0.3, 0.5]       # shape: (4,)

# 2D: Matrix
weight_matrix = [                       # shape: (3, 4)
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2]
]

# 3D: Batch of sequences (batch × time × features)
tokens = shape (32, 512, 768)
# ↑ 32 sequences, each 512 tokens long,
#   each token represented by 768 numbers

# 4D: Batch of images (batch × height × width × channels)
images = shape (16, 224, 224, 3)
# ↑ 16 images, 224×224 pixels, 3 color channels (RGB)

Tensors in PyTorch and TensorFlow

import torch

# Create tensors
scalar = torch.tensor(3.14)              # shape: torch.Size([])
vector = torch.tensor([1.0, 2.0, 3.0])   # shape: torch.Size([3])
matrix = torch.zeros(3, 4)               # shape: torch.Size([3, 4])
batch  = torch.randn(32, 512, 768)       # shape: torch.Size([32, 512, 768])

# The shape tells you exactly the tensor's dimensions
print(batch.shape)     # torch.Size([32, 512, 768])
print(batch.ndim)      # 3
print(batch.numel())   # 12582912 (= 32 × 512 × 768)

Tensor operations are parallelised on GPUs

Why tensors = speed
When you do A @ B on two tensors in PyTorch, it does not loop through elements one by one. GPUs execute thousands of multiplications simultaneously on tensor data. This is why a model that would take weeks on a CPU trains in hours on a GPU — the data structure itself enables massive parallelism.
Tensors through an LLM layer
# Input: a batch of tokenised text
x = shape (8, 512)               # 8 sequences, 512 tokens each

# After embedding layer
x_embed = shape (8, 512, 768)    # each token → 768-dim vector

# After attention layer: same shape, but transformed
x_attn = shape (8, 512, 768)

# Logits over vocabulary
logits = shape (8, 512, 50257)   # one score per vocab token
Every PyTorch and TensorFlow model uses tensors
There is no deep learning without tensors. The model's weights are tensors. The input data is a tensor. The loss is a scalar tensor. The gradient is a tensor with the same shape as the weights. Understanding tensor shapes is the first practical skill you need when debugging neural networks — when you see a shape mismatch error, you are seeing a tensor dimension problem.
Quick Check
A batch of 64 images, each 128×128 pixels with 3 colour channels, has what tensor shape?
(128, 128, 3)
(64, 3, 128)
(64, 128, 128, 3)
(64, 49152)
Correct! The shape is (batch=64, height=128, width=128, channels=3). Each dimension represents one axis of the tensor. PyTorch often uses (batch, channels, height, width) order instead, but the number of dimensions is the same.

Part 02

Statistics & Probability

2.1 Mean (Average)

The mean is the center of your data. Add all values, divide by how many there are. It answers: "what is the typical value?"

Mean = (x₁ + x₂ + x₃ + ... + xₙ) / n
= Σx / n
Example 1 — model accuracy scores
scores = [0.85, 0.90, 0.78, 0.92, 0.88]
mean = (0.85 + 0.90 + 0.78 + 0.92 + 0.88) / 5
     = 4.33 / 5
     = 0.866

Average accuracy = 86.6%
Example 2 — training losses
losses = [3.2, 2.8, 2.1, 1.9]
mean = (3.2 + 2.8 + 2.1 + 1.9) / 4
     = 10.0 / 4
     = 2.5
2.2 Variance

Variance measures how spread out your data is from the mean. High variance = data is scattered. Low variance = data is clustered.

Step 1: find the mean
Step 2: subtract mean from each value, square it
Step 3: average those squared differences

Variance = Σ(xᵢ - mean)² / n
Worked Example
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Step 1: mean
mean = (2+4+4+4+5+5+7+9) / 8 = 40/8 = 5

# Step 2: squared differences
(2-5)² = 9
(4-5)² = 1   (three times)
(5-5)² = 0   (twice)
(7-5)² = 4
(9-5)² = 16

# Step 3: average
variance = (9+1+1+1+0+0+4+16) / 8 = 32/8 = 4
2.3 Standard Deviation

Standard deviation is just the square root of variance. It's easier to interpret because it's in the same units as your original data.

std = √Variance
# From the variance example above
variance = 4
std = √4 = 2

# Mean was 5, std is 2
# Most data falls between: 5-2=3 and 5+2=7
# That matches our data [2, 4, 4, 4, 5, 5, 7, 9]
Metric | What it tells you | AI use case
Mean | Center value | Average loss, average accuracy
Variance | Spread² (squared units) | Used in batch normalization
Std Dev | Spread (same units) | Detecting outliers in features
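The three steps for variance can be checked in Python. The hand-written function below follows the formula in the text (divide by n, the population variance). The comparison with Python's `statistics` module is an added note: its `variance` function divides by n - 1 instead, which is the sample variance.

```python
import math
import statistics

def population_variance(data):
    mean = sum(data) / len(data)                     # Step 1: mean
    squared_diffs = [(x - mean) ** 2 for x in data]  # Step 2: squared differences
    return sum(squared_diffs) / len(data)            # Step 3: average them

data = [2, 4, 4, 4, 5, 5, 7, 9]
var = population_variance(data)
print(var)             # 4.0
print(math.sqrt(var))  # 2.0

# The standard library agrees when asked for the population versions
assert statistics.pvariance(data) == var
assert statistics.pstdev(data) == math.sqrt(var)

# The sample variance divides by n - 1, so it is slightly larger: 32 / 7
print(round(statistics.variance(data), 3))  # 4.571
```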
2.4 Probability

Probability measures how likely an event is to happen, on a scale from 0 (impossible) to 1 (certain).

P(A) = (Favorable Outcomes) / (Total Outcomes)
Example 1 — coin flip
P(heads) = 1 / 2 = 0.5   # 50%
P(tails) = 1 / 2 = 0.5
Example 2 — AI classification output
# An email spam classifier output
P(spam)     = 0.87   # 87% chance it's spam
P(not spam) = 0.13   # 13% chance it's not

# They must sum to 1.0 (complement rule)
0.87 + 0.13 = 1.0

Probability Rules

# Complement Rule: P(not A) = 1 - P(A)
P(rains) = 0.3
P(no rain) = 1 - 0.3 = 0.7

# Addition Rule (OR): P(A or B) = P(A) + P(B) - P(A and B)
# If A and B can't both happen (mutually exclusive):
P(A or B) = P(A) + P(B)

# Multiplication Rule (AND, independent events):
# P(A and B) = P(A) × P(B)
P(two heads) = P(H) × P(H) = 0.5 × 0.5 = 0.25
2.5 Conditional Probability

Conditional probability is the probability of event A happening, given that event B has already happened. The "|" means "given".

P(A|B) = P(A and B) / P(B)
Example 1 — spam detection
# We want: P(spam | email contains "FREE MONEY")
# = What's the chance it's spam, GIVEN it has "FREE MONEY"?

# From data:
P(spam AND has "FREE MONEY") = 0.40
P(email has "FREE MONEY")    = 0.45

P(spam | "FREE MONEY") = 0.40 / 0.45 = 0.89
# 89% chance it's spam if it contains "FREE MONEY"
Example 2 — language model context
# An LLM predicts next word with conditional probability
# P(next word | all previous words)
P("world" | "hello") = 0.72   # likely next word
P("pizza" | "hello") = 0.01   # unlikely next word

# This is literally how language models work!
# They compute P(next_token | previous_tokens)
2.6 Bayes' Theorem

Bayes' Theorem tells you how to update your belief when you get new evidence. You start with a prior belief, see evidence, and get a better (posterior) belief.

P(A|B) = P(B|A) × P(A) / P(B)

P(A|B) = posterior — updated belief
P(B|A) = likelihood — how well evidence fits
P(A) = prior — initial belief
P(B) = evidence probability (normalizer)
Example — Medical Test
# Disease affects 1% of population
P(disease)    = 0.01   # Prior
P(no disease) = 0.99

# Test is 95% accurate
P(positive | disease)    = 0.95   # Likelihood
P(positive | no disease) = 0.05   # False positive rate

# Total probability of a positive test
P(positive) = P(pos|disease)×P(disease) + P(pos|no disease)×P(no disease)
            = 0.95×0.01 + 0.05×0.99
            = 0.0095 + 0.0495
            = 0.059

# Bayes: P(disease | positive test)
P(disease | positive) = (0.95 × 0.01) / 0.059
                      = 0.0095 / 0.059
                      = 0.161   ← only 16% chance!

# Surprise! A positive test only means 16% chance
# because the disease is rare (prior was very low)
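The medical-test calculation can be wrapped in a small function, which makes it easy to see how the answer changes with the prior. This is a sketch with illustrative parameter names. The second-test scenario is an added example that assumes the two tests are independent.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(A | B) from P(A), P(B | A) and P(B | not A), using Bayes' theorem."""
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

after_one_test = posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(round(after_one_test, 3))  # 0.161

# Yesterday's posterior becomes today's prior: a second positive test
after_two_tests = posterior(prior=after_one_test, likelihood=0.95,
                            false_positive_rate=0.05)
print(round(after_two_tests, 2))  # 0.78
```

Each new piece of evidence moves the belief further, which is the "updating" the section describes.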
In AI
Bayes' theorem underlies Naive Bayes classifiers, many recommendation systems, and the general principle of "updating beliefs with evidence" that appears throughout probabilistic ML. LLMs are also sometimes described as behaving in a loosely Bayesian way, combining prior knowledge from training with the evidence in the context, although they do not apply Bayes' theorem explicitly.
2.7 Log Probability

When you multiply many probabilities together, the product quickly becomes so small that floating-point numbers lose precision or underflow to zero. Log probability converts these tiny numbers into manageable negative numbers. It also turns multiplication into addition, which is cheaper and numerically more stable.

What is log?

# log base e (natural log, written as ln or log)
# Asks: "e to what power gives me this number?"
log(1)     = 0        # e⁰ = 1
log(2.718) ≈ 1        # e¹ ≈ 2.718
log(0.5)   ≈ -0.693   # negative! log of fraction is negative
log(0.01)  ≈ -4.6     # small probability = large negative number

Why this is useful

# Without log: multiply tiny probabilities
P(sentence) = 0.001 × 0.002 × 0.003 × 0.001
            = 0.000000000006   ← already tiny after 4 tokens; hundreds of tokens underflow to 0

# With log: add log probabilities
log P(sentence) = log(0.001) + log(0.002) + log(0.003) + log(0.001)
                = -6.9 + -6.2 + -5.8 + -6.9
                = -25.8        ← clean number, no precision loss
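The underflow problem is easy to reproduce. In this sketch the 200 per-token probabilities of 0.01 are made-up values chosen to force the issue. With standard 64-bit floats the product collapses to exactly zero, while the sum of logs stays perfectly usable.

```python
import math

token_probs = [0.01] * 200  # hypothetical probabilities for a 200-token text

product = math.prod(token_probs)                  # true value is 1e-400
log_sum = sum(math.log(p) for p in token_probs)

print(product)            # 0.0  (underflow: too small for a 64-bit float)
print(round(log_sum, 1))  # -921.0

# The 4-token example from the text, done with logs
short = [0.001, 0.002, 0.003, 0.001]
print(round(sum(math.log(p) for p in short), 1))  # -25.8
```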

Log in Cross-Entropy Loss

# When your model predicts the right class with high confidence
P(correct) = 0.95
loss = -log(0.95) = -(-0.051) = 0.051   ← very low loss, good!

# When model is wrong / uncertain
P(correct) = 0.10
loss = -log(0.10) = -(-2.3) = 2.3       ← high loss, penalized!
Key insight
Log probability is always ≤ 0. The closer the probability is to 1.0, the closer log is to 0. The closer to 0.0, the more negative. Negative log probability is the loss! Maximize P → minimize -log(P).
2.8 Entropy & Cross-Entropy Loss

Entropy measures how much uncertainty or randomness exists in a probability distribution. High entropy = very uncertain. Low entropy = confident.

# Entropy formula
H(p) = -Σ p(x) × log(p(x))

# Example 1: fair coin (high entropy = uncertain)
p = [0.5, 0.5]
H = -(0.5×log(0.5) + 0.5×log(0.5))
  = -(0.5×-0.693 + 0.5×-0.693)
  = 0.693   ← maximum uncertainty for 2 outcomes

# Example 2: biased coin (low entropy = more certain)
p = [0.9, 0.1]
H = -(0.9×log(0.9) + 0.1×log(0.1))
  ≈ -(0.9×-0.105 + 0.1×-2.302)
  ≈ 0.325   ← less uncertain
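A short sketch of the entropy formula, using the natural log as in the examples above. Skipping outcomes with probability 0 is a standard convention (0 × log 0 is treated as 0) that the text does not spell out.

```python
import math

def entropy(p):
    """H(p) = -sum of p(x) * log(p(x)), ignoring outcomes that never happen."""
    return -sum(x * math.log(x) for x in p if x > 0)

print(round(entropy([0.5, 0.5]), 3))  # 0.693  (fair coin)
print(round(entropy([0.9, 0.1]), 3))  # 0.325  (biased coin)
```

A distribution that puts all its probability on one outcome has entropy 0: there is nothing left to be uncertain about.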

Cross-Entropy Loss — The most common loss in AI

Cross-entropy compares what the model predicted vs what the true answer was. This is what you minimize during training of almost every classifier, and every LLM.

Cross-Entropy Loss = -Σ true(x) × log(predicted(x))
Classification Example — 3 classes
# True label: cat (class 0)
true   = [1,    0,    0   ]   # one-hot: only cat is 1
#         cat   dog   bird

# Model prediction A (confident and correct)
pred_A = [0.90, 0.07, 0.03]
loss_A = -(1×log(0.90) + 0×log(0.07) + 0×log(0.03))
       = -(1×-0.105)
       = 0.105   ← low loss, good prediction!

# Model prediction B (wrong and confident)
pred_B = [0.05, 0.90, 0.05]
loss_B = -(1×log(0.05))
       = -(1×-2.996)
       = 2.996   ← high loss, bad prediction!
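The same calculation as a function. This is a minimal sketch: the small `eps` floor is an assumption added here so that a predicted probability of exactly 0 does not crash the logarithm. It is a common practical trick, not part of the formula itself.

```python
import math

def cross_entropy(true_dist, predicted, eps=1e-12):
    """-sum of true(x) * log(predicted(x)); eps guards against log(0)."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(true_dist, predicted))

true = [1, 0, 0]  # one-hot label: cat

print(round(cross_entropy(true, [0.90, 0.07, 0.03]), 3))  # 0.105
print(round(cross_entropy(true, [0.05, 0.90, 0.05]), 3))  # 2.996
```

With a one-hot label, only the predicted probability of the correct class matters, so the loss reduces to -log(P(correct)).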
In LLM Training
Every token prediction during LLM pretraining uses cross-entropy loss. The model predicts a probability for each token in its vocabulary, and cross-entropy measures how wrong that prediction is. The core pretraining loop of models like GPT, Claude, and Llama minimizes cross-entropy loss over trillions of tokens; later fine-tuning stages may add other objectives.
2.9 Softmax

A neural network outputs raw numbers (called logits) — they can be anything, like [2.1, -0.5, 1.8]. Softmax converts these into proper probabilities that add up to 1.0.

softmax(xᵢ) = e^(xᵢ) / Σⱼ e^(xⱼ)   (sum over all j)

e ≈ 2.718 (Euler's number)
Step-by-step Softmax
logits = [2.0, 1.0, 0.1]   # raw model output

# Step 1: raise e to each power
e^2.0 = 7.389
e^1.0 = 2.718
e^0.1 = 1.105

# Step 2: sum them
total = 7.389 + 2.718 + 1.105 = 11.212

# Step 3: divide each by total
softmax = [7.389/11.212, 2.718/11.212, 1.105/11.212]
        = [0.659, 0.242, 0.099]

# Check: 0.659 + 0.242 + 0.099 = 1.0 ✓
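A sketch of softmax in plain Python. One detail goes beyond the three steps above: subtracting the largest logit before exponentiating. This is a widely used stability trick that leaves the result unchanged (the common factor cancels in the division) but prevents overflow for large logits.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]

# Without the shift, math.exp(1000.0) would raise OverflowError
print(softmax([1000.0, 1000.0]))     # [0.5, 0.5]
```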
Softmax in Transformers
In the attention mechanism, softmax converts attention scores (dot products of Q and K) into attention weights — probabilities that sum to 1. These tell each token how much to "attend" to every other token. Softmax is in every transformer layer.

Part 03

Calculus

When to study this
You don't need calculus to build RAG pipelines, use LLM APIs, or even understand most of what an LLM does. Learn this section when you start training or fine-tuning neural networks.
3.1 Derivatives

A derivative measures how much a function's output changes when you change the input slightly. Think of it as the slope of a curve at a specific point.

If f(x) = x², and x = 3, the derivative tells you: "if I nudge x a tiny bit, how much does the output change?"

Basic rules
Constant: d/dx(5) = 0 — constants don't change
Power rule: d/dx(xⁿ) = n×xⁿ⁻¹
Examples:
d/dx(x²) = 2x
d/dx(x³) = 3x²
d/dx(x) = 1
Example — applying power rule
f(x) = x²
f'(x) = 2x          # the derivative

f'(3) = 2×3 = 6     # slope at x=3 is 6
f'(5) = 2×5 = 10    # slope at x=5 is 10
f'(0) = 2×0 = 0     # slope at x=0 is 0 (minimum!)
Example 2 — more terms
f(x) = 3x³ + 2x² + 5x + 7
f'(x) = 9x² + 4x + 5

# Each term differentiated separately:
# 3x³ → 3×3×x² = 9x²
# 2x² → 2×2×x¹ = 4x
# 5x  → 5×1×x⁰ = 5
# 7   → 0 (constant)
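You can check any derivative numerically by nudging x a tiny bit in both directions and measuring the slope. The sketch below uses a central difference. The step size `h` and the function names are illustrative assumptions.

```python
def numerical_derivative(f, x, h=1e-5):
    """Approximate f'(x) as the slope between x - h and x + h."""
    return (f(x + h) - f(x - h)) / (2 * h)

def square(x):
    return x ** 2

def cubic(x):
    return 3 * x**3 + 2 * x**2 + 5 * x + 7

print(round(numerical_derivative(square, 3), 4))  # 6.0   (power rule: 2x = 6)
print(round(numerical_derivative(cubic, 2), 4))   # 49.0  (9x² + 4x + 5 = 36 + 8 + 5)
```

Comparing a hand-derived formula against this kind of numerical estimate is a common way to catch mistakes.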
In AI Training
The loss function is just f(weights). Its derivative tells us "which direction does the loss increase if we change each weight?" We go in the opposite direction to reduce the loss. This is gradient descent.
3.2 Chain Rule

When functions are nested inside each other, you use the chain rule to find the derivative. This is the mathematical heart of backpropagation.

If y = f(g(x)), then:

dy/dx = f'(g(x)) × g'(x)

"derivative of outer × derivative of inner"
Example 1
y = (3x + 1)²

# Outer function: f(u) = u²    → f'(u) = 2u
# Inner function: g(x) = 3x+1  → g'(x) = 3

dy/dx = 2(3x+1) × 3 = 6(3x+1)

# At x=2: dy/dx = 6(7) = 42
Example 2 — neural network layer
# In a neural network: output = activation(W × x)
# The loss depends on the output, and the output depends on the weights
z = W × x                  # pre-activation
output = activation(z)
loss = L(output)

# Chain rule gives us:
dLoss/dW = dLoss/dOutput × dOutput/dz × dz/dW

# This chain is what backpropagation computes!
# It "chains" derivatives backwards through each layer
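To make the chain concrete, here is a small sketch for a single neuron. The sigmoid activation, the squared-error loss, and the numeric values are all assumptions chosen for illustration; real networks do the same thing with matrices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_fn(w, x, target):
    # loss = (activation(w * x) - target)^2
    return (sigmoid(w * x) - target) ** 2

w, x, target = 0.5, 2.0, 1.0   # illustrative values

# Forward pass, keeping the intermediate values
z = w * x                      # pre-activation
out = sigmoid(z)               # output of the neuron

# Backward pass: multiply the local derivatives along the chain
dloss_dout = 2 * (out - target)    # d(loss)/d(output)
dout_dz = out * (1 - out)          # derivative of sigmoid at z
dz_dw = x                          # d(w*x)/dw
dloss_dw = dloss_dout * dout_dz * dz_dw

# Sanity check against a finite-difference estimate
h = 1e-6
numeric = (loss_fn(w + h, x, target) - loss_fn(w - h, x, target)) / (2 * h)
print(dloss_dw, numeric)
```

The gradient is negative here, meaning that increasing w would lower the loss.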
3.3 Partial Derivatives

When a function has multiple variables, a partial derivative measures the change with respect to one variable at a time, treating others as constants.

f(x, y) = x² + 3y + xy

# Partial derivative with respect to x (treat y as constant)
∂f/∂x = 2x + y

# Partial derivative with respect to y (treat x as constant)
∂f/∂y = 3 + x
Evaluating at a point
f(x, y) = x² + 3y + xy
∂f/∂x = 2x + y
∂f/∂y = 3 + x

# At point (x=2, y=3):
∂f/∂x = 2×2 + 3 = 7    # slope in x direction is 7
∂f/∂y = 3 + 2 = 5      # slope in y direction is 5
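"Treating the other variable as a constant" translates directly into code: nudge one input and hold the other fixed. A minimal sketch, with the helper names `partial_x` and `partial_y` chosen for illustration:

```python
def f(x, y):
    return x**2 + 3 * y + x * y

def partial_x(func, x, y, h=1e-6):
    # Nudge x only, hold y fixed
    return (func(x + h, y) - func(x - h, y)) / (2 * h)

def partial_y(func, x, y, h=1e-6):
    # Nudge y only, hold x fixed
    return (func(x, y + h) - func(x, y - h)) / (2 * h)

df_dx = partial_x(f, 2.0, 3.0)
df_dy = partial_y(f, 2.0, 3.0)
print(round(df_dx, 4), round(df_dy, 4))  # 7.0 5.0
```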
In AI
A neural network's loss depends on millions of weights (w₁, w₂, ... wₙ). We compute ∂Loss/∂w₁, ∂Loss/∂w₂, etc. — one partial derivative per weight. This tells us exactly how to nudge each weight to reduce the loss. Backpropagation automates this computation efficiently.
3.4 Gradients

The gradient is just a vector that collects all the partial derivatives of a function. It points in the direction of steepest increase of the function.

∇f = [∂f/∂x₁, ∂f/∂x₂, ..., ∂f/∂xₙ]

∇ is called "nabla" — just means "gradient of"
f(x, y) = x² + y²
∇f = [∂f/∂x, ∂f/∂y] = [2x, 2y]

# At point (3, 4):
∇f = [6, 8]
# "go right 6, up 8" → steepest increase
# negate it → [-6, -8] → steepest DECREASE (where we want to go!)
Gradient intuition
Imagine you're on a hill. The gradient says "this is the steepest direction uphill." To find the bottom of the hill (minimum loss), you go in the opposite direction of the gradient. That's gradient descent.
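The hill picture can be checked numerically. The sketch below uses f(x, y) = x² + y² from the example and an arbitrary step size of 0.1 (an assumption for illustration) to show that stepping along the gradient raises f while stepping against it lowers f.

```python
import numpy as np

def f(p):
    x, y = p
    return x**2 + y**2

def grad_f(p):
    x, y = p
    return np.array([2 * x, 2 * y])

point = np.array([3.0, 4.0])
g = grad_f(point)                 # gradient at (3, 4) is [6, 8]

step = 0.1                        # illustrative step size
uphill = point + step * g         # move along the gradient
downhill = point - step * g       # move against the gradient

# Roughly 25, 36 and 16: uphill increases f, downhill decreases it
print(f(point), f(uphill), f(downhill))
```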
3.5 Gradient Descent

Gradient descent is the algorithm that trains neural networks. It repeatedly computes the gradient of the loss and takes a small step in the opposite direction to reduce the loss.

w = w - α × ∇Loss

w = weight (parameter)
α = learning rate (how big a step)
∇Loss = gradient of the loss w.r.t. w
1. Forward pass: Run the model on input data, get a prediction
2. Compute loss: Compare the prediction to the true label using a loss function such as cross-entropy
3. Backward pass: Use the chain rule to compute the gradient for every weight
4. Update weights: w = w - α × gradient (move each weight slightly in the direction that reduces the loss)
5. Repeat thousands of times until the loss is low enough
Worked Example — one step
# Simple 1-weight example
w = 5.0           # current weight
α = 0.1           # learning rate
gradient = 3.0    # ∂Loss/∂w computed by backprop

# Update:
w_new = w - α × gradient
      = 5.0 - 0.1 × 3.0
      = 5.0 - 0.3
      = 4.7       # moved slightly toward minimum

# Next iteration: compute new gradient at w=4.7, repeat
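Repeating that single step in a loop gives the full algorithm. The sketch below assumes a hypothetical loss, (w - 3.5)², picked only because its gradient at w = 5.0 is 3.0, which matches the numbers in the worked example.

```python
def loss(w):
    return (w - 3.5) ** 2       # hypothetical loss with its minimum at w = 3.5

def gradient(w):
    return 2 * (w - 3.5)        # d(loss)/dw

w = 5.0        # starting weight
alpha = 0.1    # learning rate

history = [w]
for _ in range(50):
    w = w - alpha * gradient(w)   # the update rule: w = w - α × gradient
    history.append(w)

print(history[1])   # first step, approximately 4.7
print(w)            # close to 3.5 after 50 steps
```

Each step is smaller than the last because the gradient shrinks as w approaches the minimum.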

The Learning Rate matters a lot

Learning Rate (α) | Problem | Example
Too large (0.9) | Overshoots minimum, oscillates | Jumps past the answer
Just right (0.001) | Steady convergence | Standard for Adam optimizer
Too small (0.000001) | Very slow training | Takes too long to converge
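The effect of the step size is easy to see on a toy problem. The sketch below runs gradient descent on the loss w² (gradient 2w) with three learning rates. The specific values are assumptions that suit this toy loss; a good learning rate always depends on the problem and the optimizer.

```python
def run(alpha, steps=20, w=5.0):
    """Gradient descent on the toy loss w**2, whose gradient is 2*w."""
    path = [w]
    for _ in range(steps):
        w = w - alpha * 2 * w
        path.append(w)
    return path

too_large = run(0.9)     # each step multiplies w by -0.8: the sign keeps flipping
moderate = run(0.1)      # each step multiplies w by 0.8: steady shrink toward 0
too_small = run(1e-6)    # w barely moves in 20 steps

print([round(v, 2) for v in too_large[:4]])   # values jump back and forth across 0
print(round(moderate[-1], 4))                 # small, close to the minimum
print(too_small[-1])                          # still almost 5.0
```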
The Full Picture
Every time you call model.fit() or run a training loop, this is what happens, often millions of times. Large language models such as GPT-4 are trained the same way at enormous scale, with billions of weights updated over vast numbers of tokens. In practice, optimizers such as Adam refine the basic step, but the core idea is still w = w - α∇Loss.
3.6 Least Squares

Least squares is the mathematical method that finds the best-fit line (or hyperplane) through a set of data points. It is the direct mathematical foundation of Linear Regression — and it connects linear algebra to calculus in one elegant formula.

The problem: overdetermined systems

Imagine you have 100 data points (x, y) and you want to find a straight line y = mx + b that fits them all. In matrix form this is Aw = y, where A is your data matrix and w holds the weights you want to learn. But with 100 equations and only 2 unknowns, there is no exact solution — the system is overdetermined. Least squares finds the best possible answer.

# Example: fitting y = m·x + b to 4 data points
# (x=1,y=2), (x=2,y=3.9), (x=3,y=6.1), (x=4,y=7.8)

# Matrix form: A·w = y
A = [1, 1]    ← each row: [x_i, 1]
    [2, 1]
    [3, 1]
    [4, 1]

w = [m]       ← unknowns: slope m, intercept b
    [b]

y = [2.0]     ← actual target values
    [3.9]
    [6.1]
    [7.8]

# 4 equations, 2 unknowns → overdetermined → no exact solution

Residuals — the error you're minimising

A residual is the difference between what your model predicts and what the actual value is. For each data point:

residual_i = y_i − ŷ_i = actual − predicted

Total error = r₁² + r₂² + r₃² + ... + rₙ²

Least Squares: minimise ||y − Aw||²
Residuals in action
# Suppose we guess w = [2.0, 0.1] (slope=2.0, intercept=0.1)
predictions = A·w = [2.1, 4.1, 6.1, 8.1]
actual      = [2.0, 3.9, 6.1, 7.8]
residuals   = actual − predictions = [-0.1, -0.2, 0.0, -0.3]

# Sum of squared residuals
= 0.01 + 0.04 + 0.00 + 0.09 = 0.14

# Least squares finds the w that makes this as small as possible
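The same residual calculation in NumPy, as a short sketch. The variable names (`w_guess`, `sse`) are illustrative; the data and the guess are the ones from the example.

```python
import numpy as np

A = np.array([[1, 1], [2, 1], [3, 1], [4, 1]], dtype=float)  # rows: [x_i, 1]
y = np.array([2.0, 3.9, 6.1, 7.8])
w_guess = np.array([2.0, 0.1])        # slope 2.0, intercept 0.1

predictions = A @ w_guess             # ŷ = A·w
residuals = y - predictions           # actual − predicted
sse = np.sum(residuals ** 2)          # sum of squared residuals

print(np.round(predictions, 2))
print(np.round(residuals, 2))
print(round(float(sse), 2))           # 0.14
```

Trying a different `w_guess` and watching `sse` change is a quick way to build intuition for what least squares is minimising.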

The Normal Equation — the closed-form solution

Taking the derivative of the squared error with respect to w, setting it to zero, and solving gives a direct formula for the optimal weights:

w* = (AᵀA)⁻¹ Aᵀ y

This is called the Normal Equation.
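Spelled out, the derivation takes four lines. This is a sketch that assumes AᵀA is invertible, which holds when the columns of A are linearly independent; E(w) is our own label for the squared error.

```latex
\begin{aligned}
E(w) &= \lVert y - Aw \rVert^2 = (y - Aw)^\top (y - Aw) \\
\nabla_w E &= -2A^\top (y - Aw) = 0 \\
A^\top A\, w &= A^\top y \\
w^* &= (A^\top A)^{-1} A^\top y
\end{aligned}
```

The second line is the gradient from section 3.4 applied to the squared error, and setting it to zero says the residual y - Aw must be orthogonal to every column of A.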
import numpy as np
from sklearn.linear_model import LinearRegression

# Data
X = np.array([[1, 1], [2, 1], [3, 1], [4, 1]])  # A matrix: each row is [x_i, 1]
y = np.array([2.0, 3.9, 6.1, 7.8])

# Normal equation: w = (XᵀX)⁻¹ Xᵀy
w = np.linalg.inv(X.T @ X) @ X.T @ y
print(w)  # ≈ [1.96, 0.05] → slope ≈ 1.96, intercept ≈ 0.05

# sklearn's LinearRegression solves the same least-squares problem.
# It fits the intercept itself, so pass only the x column.
model = LinearRegression().fit(X[:, :1], y)
print(model.coef_, model.intercept_)  # same result, up to floating-point error

Gradient descent vs Normal Equation

Method | How | Best when
Normal Equation | Direct formula: w = (AᵀA)⁻¹Aᵀy | Small datasets, few features
Gradient Descent | Iterative: step by step downhill | Large datasets, neural networks
Why not always use the Normal Equation?
Inverting AᵀA costs roughly O(n³) in the number of features n. With 10,000 features, this becomes computationally expensive. Neural networks are non-linear and have millions of parameters, so no closed-form solution exists and gradient descent (or a variant of it) is the practical option. For simple linear regression with a few features, the Normal Equation is fast and exact.
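On a small problem both methods land on the same weights. The sketch below fits the four data points from this section twice: once with NumPy's least-squares solver and once with plain gradient descent on the mean squared error. The learning rate and step count are assumptions tuned for this tiny dataset.

```python
import numpy as np

A = np.array([[1, 1], [2, 1], [3, 1], [4, 1]], dtype=float)
y = np.array([2.0, 3.9, 6.1, 7.8])

# Direct solution, via a least-squares solver rather than an explicit inverse
w_exact, *_ = np.linalg.lstsq(A, y, rcond=None)

# Iterative solution: gradient descent on the mean squared error
w = np.zeros(2)
alpha = 0.05                                  # illustrative learning rate
for _ in range(5000):
    grad = 2 * A.T @ (A @ w - y) / len(y)     # gradient of mean((Aw - y)^2)
    w = w - alpha * grad

print(np.round(w_exact, 2))   # [1.96 0.05]
print(np.round(w, 2))         # matches the direct solution
```

The direct solver needs one call, while gradient descent needs thousands of cheap steps. That trade-off flips once the number of features grows large.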
This is the mathematics behind Linear Regression
When you train a Linear Regression model, whether in scikit-learn, from scratch in NumPy, or as a linear output layer trained with squared-error loss, the math underneath is least squares. The model finds the weights w that minimise the sum of squared residuals between predictions and targets. More complex models extend the same idea of minimising a loss over the weights: Ridge and LASSO add a penalty term to the squared error, Logistic Regression swaps in a different loss, and neural networks replace the linear model with a non-linear one.
Quick Check
Why is a system with 100 data points and 2 unknowns called "overdetermined"?
A. Because the data is too large to process
B. Because there are more equations than unknowns, so no exact solution exists
C. Because the matrix A cannot be transposed
D. Because the residuals are too large
Answer: B. With 100 equations and only 2 unknowns, you have far more constraints than variables. No single line can pass through all 100 points exactly. Least squares finds the line that is closest to all of them, minimising the total squared error.

You're Ready 🎉

With this foundation, you can now study Classical ML and then Deep Learning with real understanding — not just memorizing formulas.

Next: Classical ML
Linear Regression · Logistic Regression · KNN · Decision Trees · Random Forest · XGBoost
Then: Deep Learning
Neural Networks · Backpropagation · Activation Functions · CNNs · RNNs
Then: AI Engineering
Transformers · LLMs · RAG · Agents · Fine-tuning · Evaluation