Complete Math Foundation

Math for AI Engineers

Everything you need to understand LLMs, embeddings, RAG, and neural networks — explained simply with real examples.

No proofs

No university extras

Applied to AI directly

3 Parts · 30 Topics

Part 01

Linear Algebra

1.1 Scalars

A scalar is simply a single number. That's it. When you say "the temperature is 37 degrees" or "the confidence is 0.92" — those are scalars. They have magnitude (size) but no direction.

Types of scalars

Type	Example	In AI
Positive integer	5, 42, 1000	Number of layers in a model
Negative	-3, -0.5	Negative reward in RL
Decimal (float)	0.92, 3.14	Probability score, loss value
Zero	0	Padding, zero gradient

Scalar operations

# Simple scalar operations — just normal numbers
a = 5
b = 3

a + b = 8       # addition
a - b = 2       # subtraction
a * b = 15      # multiplication
a / b ≈ 1.67    # division (5 / 3 = 1.666...)
      

Real AI Example

When an LLM gives you a confidence score like 0.94 — that is a scalar.
When your loss function returns 2.37 during training — that is a scalar.
When you set learning rate = 0.001 — that is a scalar.

Key Difference

A scalar is just a plain number. A vector is a list of numbers. A matrix is a grid of numbers. That's the core idea of linear algebra.

1.2 Vectors

A vector is an ordered list of numbers. Think of it as a point in space that also has a direction. In AI, vectors are everywhere — your text, images, and audio are all converted into vectors called embeddings.

Writing vectors

# Row vector (horizontal)
v = [3, 4]

# Column vector (vertical)
v = [3]
    [4]

# 3D vector
v = [1, 2, 3]

# An embedding vector (real AI example, 4 dimensions)
word_king = [0.81, 0.34, -0.23, 0.92]
      

Vector Addition

Add matching positions together. Both vectors must have the same length.

u = [1, 2, 3]
v = [4, 5, 6]

u + v = [1+4, 2+5, 3+6]
      = [  5,   7,   9]
      

Example 2 — Subtraction

u = [9, 6, 3]
v = [2, 1, 1]

u - v = [9-2, 6-1, 3-1]
      = [  7,   5,   2]
        

Scalar Multiplication

Multiply every element of the vector by the scalar. This scales the vector — makes it longer or shorter without changing direction (if scalar is positive).

v = [2, 3, 5]
scalar = 3

3 * v = [3*2, 3*3, 3*5]
       = [  6,   9,  15]

0.5 * v = [0.5*2, 0.5*3, 0.5*5]
         = [    1,   1.5,   2.5]
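
The element-wise rules above can be sketched in plain Python. This is a minimal illustration, not a library API: the helper names `add`, `sub`, and `scale` are made up for this example.

```python
def add(u, v):
    """Add two vectors position by position."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return [a + b for a, b in zip(u, v)]


def sub(u, v):
    """Subtract v from u position by position."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return [a - b for a, b in zip(u, v)]


def scale(c, v):
    """Multiply every element of v by the scalar c."""
    return [c * a for a in v]


print(add([1, 2, 3], [4, 5, 6]))   # [5, 7, 9]
print(sub([9, 6, 3], [2, 1, 1]))   # [7, 5, 2]
print(scale(3, [2, 3, 5]))         # [6, 9, 15]
print(scale(0.5, [2, 3, 5]))       # [1.0, 1.5, 2.5]
```

In practice a library such as NumPy or PyTorch does this for you; the loops here only make the rule visible.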
      

Magnitude (Length) of a Vector — L2 Norm

The magnitude tells you how long the vector is. Formula: square each element, sum them, take the square root.

v = [3, 4]
||v|| = √(3² + 4²) = √(9 + 16) = √25 = 5

Example — 3D vector

v = [1, 2, 2]

||v|| = √(1² + 2² + 2²)
      = √(1  + 4  + 4)
      = √9
      = 3
        

Unit Vector

A unit vector has magnitude exactly 1. To make a unit vector from any vector, divide each element by the magnitude.

v = [3, 4]
||v|| = 5

unit_v = v / ||v||
       = [3/5, 4/5]
       = [0.6, 0.8]

# Verify: √(0.6² + 0.8²) = √(0.36 + 0.64) = √1 = 1 ✓
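
The magnitude and unit-vector recipes can be checked with a few lines of Python. This is a sketch using only the standard library; `magnitude` and `unit` are illustrative names chosen here.

```python
import math


def magnitude(v):
    """L2 norm: square each element, sum, take the square root."""
    return math.sqrt(sum(a * a for a in v))


def unit(v):
    """Divide each element by the magnitude to get a length-1 vector."""
    norm = magnitude(v)
    if norm == 0:
        raise ValueError("the zero vector has no direction")
    return [a / norm for a in v]


print(magnitude([3, 4]))     # 5.0
print(magnitude([1, 2, 2]))  # 3.0
print(unit([3, 4]))          # [0.6, 0.8]
```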
      

Why this matters in AI

In a vector database (like Pinecone or Weaviate), your text is stored as a vector of 768 or 1536 numbers. When you search, you find vectors close to your query vector. Magnitude and direction determine what "close" means.

Quick Check

What is the magnitude of vector v = [6, 8]?

√(36 + 64) = √100 = 10 ✓ Remember: square each number, add, square root.

1.3 Matrices

A matrix is a 2D grid of numbers arranged in rows and columns. Think of it as a spreadsheet. Matrices are how neural networks store their learned weights.

Matrix notation — rows × columns

# 2×2 matrix (2 rows, 2 columns)
A = [1, 2]
    [3, 4]

# 2×3 matrix (2 rows, 3 columns)
B = [1, 2, 3]
    [4, 5, 6]

# 3×2 matrix (3 rows, 2 columns)
C = [10, 20]
    [30, 40]
    [50, 60]
      

Special Matrices

# Identity Matrix — like the number 1 for matrices
# Any matrix × Identity = same matrix
I = [1, 0]
    [0, 1]

# Zero Matrix — like the number 0
O = [0, 0]
    [0, 0]
      

Matrix Addition and Subtraction

Both matrices must be the same size. Add/subtract matching positions.

A = [1, 2]    B = [5, 6]
    [3, 4]        [7, 8]

A + B = [1+5, 2+6]   A - B = [1-5, 2-6]
        [3+7, 4+8]           [3-7, 4-8]

      = [6,  8]            = [-4, -4]
        [10, 12]              [-4, -4]
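
As a rough sketch, matrix addition and subtraction look like this when a matrix is stored as a list of rows. The helper names (`elementwise`, `mat_add`, `mat_sub`) are invented for this example.

```python
def elementwise(A, B, op):
    """Apply op to matching positions of two same-sized matrices."""
    if len(A) != len(B) or any(len(ra) != len(rb) for ra, rb in zip(A, B)):
        raise ValueError("matrices must be the same size")
    return [[op(a, b) for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def mat_add(A, B):
    return elementwise(A, B, lambda a, b: a + b)


def mat_sub(A, B):
    return elementwise(A, B, lambda a, b: a - b)


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(mat_add(A, B))  # [[6, 8], [10, 12]]
print(mat_sub(A, B))  # [[-4, -4], [-4, -4]]
```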
      

Why this matters in AI

In a neural network, the weights of one layer are stored in a matrix. When you train the model, you update the weight matrix by subtracting a small gradient matrix from it — that's matrix subtraction in action.

1.4 Transpose

Transposing a matrix means flipping rows into columns. The first row becomes the first column, second row becomes the second column, and so on. Written as Aᵀ or A^T.

# Original 2×3 matrix
A   = [1, 2, 3]
      [4, 5, 6]

# Transposed — becomes 3×2 matrix
A^T = [1, 4]
      [2, 5]
      [3, 6]
      

Example 2 — square matrix

A   = [1, 2]    A^T = [1, 3]
      [3, 4]           [2, 4]

# Row 1 became Column 1
# Row 2 became Column 2
        

Transpose of a vector

A row vector [1, 2, 3] transposed becomes a column vector:

[1]
[2]
[3]
        

This matters for matrix multiplication — you'll use transpose to align dimensions.
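
A transpose is short to write in Python when the matrix is a list of rows. This is a sketch; `transpose` is a name picked for this example.

```python
def transpose(matrix):
    """Turn the rows of a matrix (a list of rows) into columns."""
    return [list(column) for column in zip(*matrix)]


A = [[1, 2, 3],
     [4, 5, 6]]

print(transpose(A))                  # [[1, 4], [2, 5], [3, 6]]
print(transpose(transpose(A)) == A)  # True
```

Transposing twice returns the original matrix, which makes a handy sanity check.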

In Transformers — QKᵀ

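The attention formula contains the term QKᵀ, and that T is a transpose. Q is a matrix of queries and K is a matrix of keys, each with one row per token. Transposing K lines up the inner dimensions so that every query row is dotted with every key row. This is one of the most important operations in modern AI.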

1.5 Dot Product

The dot product takes two vectors of the same length and returns a single scalar. Multiply matching positions, then add everything up.

u = [1, 2, 3]
v = [4, 5, 6]

u · v = (1×4) + (2×5) + (3×6)
      = 4 + 10 + 18
      = 32

Example 2 — with negatives

u = [2, -1,  3]
v = [4,  5, -2]

u · v = (2×4) + (-1×5) + (3×-2)
      = 8 + -5 + -6
      = -3
        

What the value tells you

Dot Product	Meaning	Vectors are...
Large positive	Pointing in same direction	Similar
Zero	Perpendicular (90°)	Unrelated
Large negative	Pointing opposite	Opposite meaning

Embedding similarity — Real AI use case

# Simplified embeddings for words
king   = [0.8, 0.3, 0.9]
queen  = [0.7, 0.4, 0.8]
banana = [0.1, 0.9, 0.1]

king · queen  = (0.8×0.7) + (0.3×0.4) + (0.9×0.8)
              = 0.56 + 0.12 + 0.72 = 1.40  ← HIGH: similar!

king · banana = (0.8×0.1) + (0.3×0.9) + (0.9×0.1)
              = 0.08 + 0.27 + 0.09 = 0.44  ← LOW: different!
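
The same dot products can be reproduced in Python. This is a minimal sketch; `dot` is an illustrative helper, and the three-number "embeddings" are the simplified ones from the example above, not output from a real model.

```python
def dot(u, v):
    """Multiply matching positions, then add everything up."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


king = [0.8, 0.3, 0.9]
queen = [0.7, 0.4, 0.8]
banana = [0.1, 0.9, 0.1]

print(dot([1, 2, 3], [4, 5, 6]))    # 32
print(dot([2, -1, 3], [4, 5, -2]))  # -3
print(round(dot(king, queen), 2))   # 1.4
print(round(dot(king, banana), 2))  # 0.44
```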
        

1.6 Cosine Similarity

The dot product has a problem: long vectors get higher scores simply because they are long, not because they are similar. Cosine similarity fixes this by dividing the dot product by both magnitudes, so it compares direction only and ignores size.

cos(θ) = (u · v) / (||u|| × ||v||)

Result is always between -1 and 1:

Score	Meaning	Example
1.0	Identical direction	"king" vs "king"
0.85+	Very similar	"king" vs "queen"
~0.0	Unrelated	"king" vs "banana"
-1.0	Opposite meaning	"hot" vs "cold"

Full worked example

u = [3, 4]
v = [6, 8]   ← v is just 2× u, same direction

# Step 1: dot product
u · v = (3×6) + (4×8) = 18 + 32 = 50

# Step 2: magnitudes
||u|| = √(9+16) = 5
||v|| = √(36+64) = 10

# Step 3: divide
cos(θ) = 50 / (5 × 10) = 50 / 50 = 1.0

# Result: 1.0 → Identical direction! Makes sense.
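
The three steps above translate directly into code. This is a sketch with an illustrative function name; the zero-vector check is an added safeguard, since the formula would otherwise divide by zero.

```python
import math


def cosine_similarity(u, v):
    """Dot product divided by the product of the two magnitudes."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


print(cosine_similarity([3, 4], [6, 8]))    # 1.0  (same direction)
print(cosine_similarity([1, 0], [0, 1]))    # 0.0  (perpendicular)
print(cosine_similarity([1, 2], [-1, -2]))  # approximately -1.0 (opposite)
```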
        

In RAG Systems

When you ask a question, it's converted to an embedding vector. Cosine similarity compares your question vector to every stored chunk vector. The highest scores are retrieved as context for the LLM. This is literally how RAG works.
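
The retrieval step described above can be sketched as follows. The three-dimensional vectors, the chunk texts, and the `top_k` helper are invented for illustration; a real system would get its vectors from an embedding model and would usually search with an approximate index rather than scoring every chunk.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 3-dimensional "embeddings"; real ones have hundreds of dimensions.
chunks = {
    "Cats are small domesticated felines.": [0.9, 0.1, 0.2],
    "The stock market fell on Monday.": [0.1, 0.9, 0.3],
    "Kittens need several meals a day.": [0.8, 0.2, 0.3],
}

# Pretend this is the embedding of the question "How do I feed a cat?"
query_vector = [0.85, 0.15, 0.25]


def top_k(query, store, k=2):
    """Score every chunk against the query and keep the k best."""
    scored = [(cosine_similarity(query, vec), text) for text, vec in store.items()]
    scored.sort(reverse=True)
    return [text for _, text in scored[:k]]


# Prints the two cat-related chunks; the stock market chunk is left out.
print(top_k(query_vector, chunks))
```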

1.7 Matrix Multiplication

Matrix multiplication is the most important operation in neural networks. Every forward pass is a series of matrix multiplications. It's NOT element-wise — it's rows × columns.

The Rule: inner dimensions must match

# (m × n) × (n × p) = (m × p)
# The inner n's must match. Output is m × p.

(2×3) × (3×4) = (2×4)   ✓  inner 3s match
(2×3) × (4×3) = ERROR   ✗  3 ≠ 4, doesn't work
(5×2) × (2×1) = (5×1)   ✓  inner 2s match
      

How to compute it — Row × Column

A = [1, 2]    B = [5, 6]
    [3, 4]        [7, 8]

# Output[0][0] = Row 0 of A · Col 0 of B
#              = [1,2] · [5,7] = 1×5 + 2×7 = 5+14 = 19

# Output[0][1] = Row 0 of A · Col 1 of B
#              = [1,2] · [6,8] = 1×6 + 2×8 = 6+16 = 22

# Output[1][0] = Row 1 of A · Col 0 of B
#              = [3,4] · [5,7] = 3×5 + 4×7 = 15+28 = 43

# Output[1][1] = Row 1 of A · Col 1 of B
#              = [3,4] · [6,8] = 3×6 + 4×8 = 18+32 = 50

A × B = [19, 22]
        [43, 50]
      

Example 2 — 2×3 times 3×2

A = [1, 0, 2]    B = [3, 1]
    [4, 1, 0]        [2, 0]
                     [1, 4]

# (2×3) × (3×2) = (2×2)

Out[0][0] = [1,0,2]·[3,2,1] = 3+0+2 = 5
Out[0][1] = [1,0,2]·[1,0,4] = 1+0+8 = 9
Out[1][0] = [4,1,0]·[3,2,1] = 12+2+0 = 14
Out[1][1] = [4,1,0]·[1,0,4] = 4+0+0 = 4

Result = [5,  9]
         [14, 4]
        

Matrix × Vector — The Neural Network Operation

# Neural net: output = W × input + bias
# W is weights (matrix), input is a vector

W = [2, 1]
    [0, 3]
    [1, 1]    ← 3×2 weight matrix

x = [4]
    [5]        ← input vector (2×1)

W × x = [2×4+1×5]   = [13]
        [0×4+3×5]   = [15]  ← output vector
        [1×4+1×5]   = [ 9]
      

Order Matters! AB ≠ BA

Unlike numbers where 3×4 = 4×3, with matrices order matters. AB and BA often give completely different results — and sometimes BA doesn't even work because the dimensions don't match.
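
The row × column rule, and the fact that order matters, can both be seen in a small sketch. `matmul` is an illustrative helper written for this example; it reuses the matrices from the worked examples above.

```python
def matmul(A, B):
    """Multiply an (m x n) matrix by an (n x p) matrix, giving (m x p)."""
    n = len(B)
    if any(len(row) != n for row in A):
        raise ValueError("inner dimensions must match")
    columns_of_B = list(zip(*B))
    # Each output entry is the dot product of one row of A with one column of B.
    return [[sum(a * b for a, b in zip(row, col)) for col in columns_of_B]
            for row in A]


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul(A, B))  # [[19, 22], [43, 50]]
print(matmul(B, A))  # [[23, 34], [31, 46]]  (different from A x B)

# Matrix x vector: a 3x2 weight matrix times a 2x1 input column
W = [[2, 1], [0, 3], [1, 1]]
x = [[4], [5]]
print(matmul(W, x))  # [[13], [15], [9]]
```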

1.8 L1 and L2 Norms

A norm is a way to measure the "size" or "length" of a vector. There are different ways to measure size, and L1 and L2 are the two most important ones.

L1 Norm — Manhattan Distance

Add up the absolute values of all elements. Imagine walking in a city — you can only go left/right or up/down, not diagonally.

v = [3, -4, 2]
||v||₁ = |3| + |-4| + |2| = 3 + 4 + 2 = 9

L2 Norm — Euclidean Distance

This is the "straight line" distance. Square everything, sum, take square root. This is the default "magnitude" we've been using.

v = [3, -4, 2]
||v||₂ = √(3² + (-4)² + 2²) = √(9 + 16 + 4) = √29 ≈ 5.39

Norm	Formula	When used
L1	Sum of |values|	Sparse features, LASSO regularization
L2	√(Sum of squares)	Embeddings, most distance calculations
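
Both norms are one-liners in Python. A minimal sketch, with function names chosen for this example:

```python
import math


def l1_norm(v):
    """Manhattan length: sum of absolute values."""
    return sum(abs(a) for a in v)


def l2_norm(v):
    """Euclidean length: square root of the sum of squares."""
    return math.sqrt(sum(a * a for a in v))


v = [3, -4, 2]
print(l1_norm(v))            # 9
print(round(l2_norm(v), 2))  # 5.39
```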

In embeddings

Many embedding models return vectors that are already normalized to L2 norm = 1 (OpenAI's embedding models do; some Sentence Transformers models need normalization applied separately). For such vectors the denominator of cosine similarity is just 1×1 = 1, so cosine similarity = dot product. Vector databases exploit this by running the cheaper dot product instead of the full cosine formula.

1.9 Normalization

Normalization means adjusting values so they fit into a standard range or distribution. In AI, this happens constantly — before training, during training, and when using embeddings.

Vector Normalization (Unit Norm)

Divide every element by the L2 norm. Result: a vector with magnitude = 1, same direction.

v = [3, 4]
||v||₂ = 5

normalized = [3/5, 4/5] = [0.6, 0.8]

# Verify: √(0.6² + 0.8²) = √(0.36 + 0.64) = √1 = 1 ✓
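
One consequence worth seeing in code: once two vectors are normalized, their plain dot product equals their cosine similarity. The sketch below uses made-up vectors and illustrative helper names.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Divide every element by the L2 norm."""
    norm = math.sqrt(dot(v, v))
    if norm == 0:
        raise ValueError("cannot normalize the zero vector")
    return [a / norm for a in v]


def cosine_similarity(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


u, v = [3, 4], [4, 3]
print(round(cosine_similarity(u, v), 6))          # 0.96
print(round(dot(normalize(u), normalize(v)), 6))  # 0.96 (same value)
```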

Min-Max Normalization

Scale values to fit between 0 and 1. Useful for training data preparation.

data = [10, 20, 30, 40, 50]
min = 10,  max = 50

formula:  (x - min) / (max - min)

normalized = [(10-10)/40, (20-10)/40, (30-10)/40, ...]
           = [0.0,  0.25,  0.5,  0.75,  1.0]
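
The min-max formula can be written as a short function. This is a sketch; the function name is illustrative, and the guard for identical values is an added safeguard because the formula would divide by zero in that case.

```python
def min_max_normalize(values):
    """Rescale values so the smallest becomes 0 and the largest becomes 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal, so the range is zero")
    return [(x - lo) / (hi - lo) for x in values]


print(min_max_normalize([10, 20, 30, 40, 50]))  # [0.0, 0.25, 0.5, 0.75, 1.0]
```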
      

Why Normalization Matters for RAG

Embedding normalization

OpenAI's text-embedding-ada-002 and many other modern embedding models return vectors that are already L2-normalized (magnitude = 1). For these, dot product = cosine similarity, so a vector DB can rank results with the cheaper dot product. If a model does not normalize its output, normalize the vectors yourself before comparing them with a dot product.

1.10 Linear Combinations

A linear combination is what you get when you take several vectors, multiply each by a scalar (a weight), and add the results together. It is the most fundamental operation in all of linear algebra — and it is exactly what a neural network layer does.

Given vectors v₁, v₂, v₃ and scalars c₁, c₂, c₃:

result = c₁·v₁ + c₂·v₂ + c₃·v₃

Worked Example

# Two vectors
v1 = [1, 0]
v2 = [0, 1]

# Weights (scalars)
c1 = 3
c2 = 5

# Linear combination
result = 3 × [1, 0] + 5 × [0, 1]
       = [3, 0] + [0, 5]
       = [3, 5]
      

Geometric meaning

Geometrically, each vector is like an arrow pointing in a direction. Multiplying by a scalar stretches or shrinks that arrow. Adding the arrows together gives you a new arrow — the combination. By changing the weights, you can reach any point in space that those vectors can describe.

Weighted sum of word embeddings

# Sentence embedding = weighted sum of word embeddings
word_the  = [0.1, 0.2]
word_cat  = [0.9, 0.4]
word_sat  = [0.5, 0.8]

# Equal weights (simple average is a linear combination)
sentence = (1/3)·word_the + (1/3)·word_cat + (1/3)·word_sat
         = [0.500, 0.467]
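
Both worked examples can be reproduced with one small function. This is a sketch; `linear_combination` is a name invented here, and the word vectors are the toy values from the example above.

```python
def linear_combination(weights, vectors):
    """Scale each vector by its weight and add the results together."""
    if len(weights) != len(vectors):
        raise ValueError("need exactly one weight per vector")
    length = len(vectors[0])
    if any(len(v) != length for v in vectors):
        raise ValueError("all vectors must have the same length")
    return [sum(c * v[i] for c, v in zip(weights, vectors)) for i in range(length)]


print(linear_combination([3, 5], [[1, 0], [0, 1]]))  # [3, 5]

words = [[0.1, 0.2], [0.9, 0.4], [0.5, 0.8]]  # the, cat, sat
sentence = linear_combination([1 / 3, 1 / 3, 1 / 3], words)
print([round(x, 3) for x in sentence])  # [0.5, 0.467]
```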
        

Why this matters in AI

Every neuron in a neural network starts by computing a linear combination: it multiplies each input by a learned weight and adds the results (a bias and an activation function are then applied on top). The weights play the role of the scalars c₁, c₂, ... Training a neural network amounts to finding the weights that produce the output you want.

1.11 Span

The span of a set of vectors is the complete collection of all possible linear combinations you can make from them. In plain English: every point you could ever reach by stretching, shrinking, and adding those vectors together.

The span of {v₁, v₂, ..., vₖ} is:

span{v₁, ..., vₖ} = { c₁v₁ + c₂v₂ + ... + cₖvₖ | c₁, ..., cₖ ∈ ℝ }

Geometric intuition

# One vector in 2D → span is a line through the origin
v1 = [1, 2]
span{v1} = all multiples: [2,4], [-1,-2], [0.5,1], ...
         → a LINE

# Two independent vectors in 2D → span is the entire plane
v1 = [1, 0]
v2 = [0, 1]
span{v1, v2} = any [x, y] you want
             → the entire 2D PLANE
      

Why span matters

When you store a document as an embedding vector in a high-dimensional space, that embedding only "lives" in a certain region of the space — the span of the training data's embedding directions. PCA and SVD work by finding which directions in that span carry the most information.

Span in PCA and SVD

When you have a dataset with 1000 features (dimensions), most of the meaningful variation lives in a much smaller subspace — often just 10–50 directions. PCA finds the span of those directions. Everything inside the span is captured; everything outside is discarded as noise. This is how dimensionality reduction works.

1.12 Basis

A basis is the smallest set of vectors whose span covers an entire space — with no redundancy. Every vector in the space can be written as exactly one linear combination of the basis vectors. You can think of a basis as the coordinate system for a space.

Standard Basis

The standard basis is the most familiar one. In 2D, it uses two simple unit vectors pointing along the axes:

# Standard basis for 2D space
e1 = [1, 0]   ← points along the x-axis
e2 = [0, 1]   ← points along the y-axis

# Any 2D vector is just a combination of these
[3, 5] = 3·e1 + 5·e2
[-2, 7] = -2·e1 + 7·e2

# Standard basis for 3D space
e1 = [1, 0, 0]
e2 = [0, 1, 0]
e3 = [0, 0, 1]
      

What makes a valid basis?

Requirement	Meaning
Spans the space	Every point in the space is reachable
Linearly independent	No vector in the set is redundant
Minimal	You can't remove any vector and still span the space

Non-standard basis example

# This is also a valid basis for 2D space
b1 = [1, 1]
b2 = [1, -1]

# We can still reach any 2D point
# [3, 5] = 4·b1 + (-1)·b2 = [4,4] + [-1,1] = [3,5] ✓

# In PCA, the principal components ARE the new basis
pc1 = [0.71, 0.71]   ← direction of most variance
pc2 = [-0.71, 0.71]  ← perpendicular direction
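
Finding the coordinates of a point in a different basis means solving a small system of equations. The sketch below does this for 2D only, using Cramer's rule; `coordinates_in_basis` is a hypothetical helper written for this example.

```python
def coordinates_in_basis(target, b1, b2):
    """Solve c1*b1 + c2*b2 = target for 2D vectors using Cramer's rule."""
    det = b1[0] * b2[1] - b2[0] * b1[1]
    if det == 0:
        raise ValueError("b1 and b2 are linearly dependent, so they are not a basis")
    c1 = (target[0] * b2[1] - b2[0] * target[1]) / det
    c2 = (b1[0] * target[1] - target[0] * b1[1]) / det
    return c1, c2


# Non-standard basis from the example above
print(coordinates_in_basis([3, 5], [1, 1], [1, -1]))  # (4.0, -1.0)

# Standard basis: the coordinates are just the vector's own entries
print(coordinates_in_basis([3, 5], [1, 0], [0, 1]))   # (3.0, 5.0)
```

The same point gets different coordinates depending on the basis, which is exactly what happens when PCA re-expresses data in terms of principal components.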
        

Basis in coordinate systems and PCA

When you do PCA on a dataset, you're essentially finding a new, better basis — one that aligns with the directions of greatest variation in your data. Projecting data onto this new basis is what dimensionality reduction means. The principal components are the new basis vectors for a compressed representation of your data.

1.13 Dimension

The dimension of a space is simply the number of vectors in its basis — which is also the number of independent directions you need to describe every point in that space. It tells you how many coordinates it takes to pin down a location.

dim(space) = number of basis vectors required

2D plane                                → dimension 2
3D space                                → dimension 3
text-embedding-ada-002 embedding space  → dimension 1536

High-dimensional spaces in AI

# Each model uses vectors of a fixed dimension
text-embedding-ada-002         → 1536-dimensional vectors
BERT base                      → 768-dimensional vectors
GPT-2 (small) hidden states    → 768-dimensional vectors
Llama 3 8B hidden states       → 4096-dimensional vectors

# Each token processed by an LLM is a point
# in a 4096-dimensional space (for Llama 3 8B)
token_"cat" = [0.23, -0.81, ..., 0.14]  ← 4096 numbers
      

Why high dimensions are strange

The curse of dimensionality

In high-dimensional spaces, almost everything is far from everything else. If you have 1000 data points in a 1536-dimensional embedding space, they are likely spread very thin. This is why nearest-neighbor search in vector databases is a hard engineering problem — and why approximate methods like HNSW and IVF indexes exist.

Embeddings and LLM vector spaces

Every word, sentence, or document that an LLM processes exists as a point in a high-dimensional space. The dimension of that space is a design choice made by the model creators — higher dimensions can capture more nuance but cost more memory and compute. Dimensionality reduction (PCA, UMAP) collapses these spaces down to 2D or 3D so humans can visualize them.

1.14 Linear Independence

A set of vectors is linearly independent if none of the vectors can be written as a linear combination of the others. In plain terms: no vector in the group is redundant — each one adds new directional information that the others cannot provide.

Vectors v₁, v₂, ..., vₖ are linearly independent if:

c₁v₁ + c₂v₂ + ... + cₖvₖ = 0

has ONLY the solution c₁ = c₂ = ... = cₖ = 0

Dependent vs Independent — concrete examples

--- LINEARLY DEPENDENT (bad — one is redundant) ---
v1 = [1, 2]
v2 = [2, 4]   ← v2 = 2 × v1, adds nothing new
v3 = [3, 6]   ← v3 = 3 × v1, also redundant

# These three vectors only span a LINE, not a plane

--- LINEARLY INDEPENDENT (good — each adds new info) ---
v1 = [1, 0]
v2 = [0, 1]   ← cannot be made from v1

# These two vectors span the full 2D plane
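
One way to test independence in code is to count how many independent directions a set of vectors contains (its rank) and compare that with the number of vectors. The sketch below uses plain Gaussian elimination; `rank`, `is_independent`, and the tolerance `tol` are choices made for this example, and a numerical library would normally be used instead.

```python
def rank(vectors, tol=1e-9):
    """Count independent directions using Gaussian elimination."""
    rows = [[float(a) for a in v] for v in vectors]
    rank_count = 0
    for col in range(len(rows[0])):
        # Look for a row (not yet used as a pivot) with a nonzero entry here.
        pivot = None
        for r in range(rank_count, len(rows)):
            if abs(rows[r][col]) > tol:
                pivot = r
                break
        if pivot is None:
            continue
        rows[rank_count], rows[pivot] = rows[pivot], rows[rank_count]
        pivot_row = rows[rank_count]
        # Remove this column's contribution from every row below the pivot.
        for r in range(rank_count + 1, len(rows)):
            factor = rows[r][col] / pivot_row[col]
            rows[r] = [a - factor * b for a, b in zip(rows[r], pivot_row)]
        rank_count += 1
        if rank_count == len(rows):
            break
    return rank_count


def is_independent(vectors):
    return rank(vectors) == len(vectors)


print(is_independent([[1, 2], [2, 4], [3, 6]]))  # False (all on one line)
print(is_independent([[1, 0], [0, 1]]))          # True
print(is_independent([[1, 2, 3], [2, 4, 6]]))    # False (second = 2 x first)
```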
      

What independence means in practice

Situation	Meaning	Problem
Dependent features	One feature = combination of others	Redundant info, hurts training
Dependent weight rows	Matrix is singular / rank-deficient	Cannot invert, unstable solution
Independent features	Each feature adds unique information	— (this is what you want)

Feature engineering example

# These two features are redundant: each is an exact
# linear function of the other
temperature_celsius    = [0, 20, 100]
temperature_fahrenheit = [32, 68, 212]

# fahrenheit = celsius × 1.8 + 32 → exact linear formula
# Combined with the model's constant (bias) term, the two
# columns are linearly DEPENDENT: giving both to a model
# adds zero new information and can cause numerical instability

# Solution: drop one. Use only Celsius.
        

In Feature Engineering and PCA

When one feature is an exact multiple of another (e.g. height in cm and height in inches), the two are linearly dependent; strongly correlated features are nearly dependent. Models can still learn from them, but the redundancy wastes capacity and can create numerical problems. PCA works by finding independent directions in your data: the principal components are orthogonal, and therefore always linearly independent. Checking for and removing dependent features is a core step in good data preparation.

Quick Check

Are the vectors v1 = [1, 2, 3] and v2 = [2, 4, 6] linearly independent?

Yes — they have different values

No — v2 = 2 × v1, so v2 is redundant

It depends on their magnitudes

Only in 3D space are they dependent

Correct! v2 is exactly 2 × v1. They point in the same direction and span only a line, not a plane. Linearly independent vectors must each add a genuinely new direction.

1.15 Tensors

A tensor is the general container for numbers in any number of dimensions. Scalars, vectors, and matrices are all just special cases of tensors. In deep learning, every single piece of data — inputs, weights, gradients, activations — lives inside a tensor.

0D tensor = Scalar → a single number
1D tensor = Vector → a list of numbers
2D tensor = Matrix → a grid of numbers
3D tensor          → a stack of matrices
ND tensor          → any number of dimensions

Visualising the dimensions

# 0D — Scalar
loss = 2.37              # shape: ()

# 1D — Vector
embedding = [0.2, 0.8, -0.3, 0.5]   # shape: (4,)

# 2D — Matrix
weight_matrix = [               # shape: (3, 4)
  [0.1, 0.2, 0.3, 0.4],
  [0.5, 0.6, 0.7, 0.8],
  [0.9, 1.0, 1.1, 1.2]
]

# 3D — Batch of sequences (batch × time × features)
tokens = shape (32, 512, 768)
# ↑ 32 sequences, each 512 tokens long,
#   each token represented by 768 numbers

# 4D — Batch of images (batch × height × width × channels)
images = shape (16, 224, 224, 3)
# ↑ 16 images, 224×224 pixels, 3 color channels (RGB)
      

Tensors in PyTorch and TensorFlow

import torch

# Create tensors
scalar = torch.tensor(3.14)                    # shape: torch.Size([])
vector = torch.tensor([1.0, 2.0, 3.0])          # shape: torch.Size([3])
matrix = torch.zeros(3, 4)                      # shape: torch.Size([3, 4])
batch  = torch.randn(32, 512, 768)               # shape: torch.Size([32, 512, 768])

# The shape tells you exactly the tensor's dimensions
print(batch.shape)    # torch.Size([32, 512, 768])
print(batch.ndim)     # 3
print(batch.numel())  # 32 × 512 × 768 = 12,582,912 numbers
      

Tensor operations are parallelised on GPUs

Why tensors = speed

When you do A @ B on two tensors in PyTorch, it does not loop through elements one by one. GPUs execute thousands of multiplications simultaneously on tensor data. This is why a model that would take weeks on a CPU trains in hours on a GPU — the data structure itself enables massive parallelism.

Tensors through an LLM layer

# Input: a batch of tokenised text
x       = shape (8, 512)        # 8 sequences, 512 tokens each

# After embedding layer
x_embed = shape (8, 512, 768)   # each token → 768-dim vector

# After attention layer — same shape, but transformed
x_attn  = shape (8, 512, 768)

# Logits over vocabulary
logits  = shape (8, 512, 50257) # one score per vocab token
        

Every PyTorch and TensorFlow model uses tensors

There is no deep learning without tensors. The model's weights are tensors. The input data is a tensor. The loss is a scalar tensor. The gradient is a tensor with the same shape as the weights. Understanding tensor shapes is the first practical skill you need when debugging neural networks — when you see a shape mismatch error, you are seeing a tensor dimension problem.

Quick Check

A batch of 64 images, each 128×128 pixels with 3 colour channels, has what tensor shape?

(128, 128, 3)

(64, 3, 128)

(64, 128, 128, 3)

(64, 49152)

Correct! The shape is (batch=64, height=128, width=128, channels=3). Each dimension represents one axis of the tensor. PyTorch often uses (batch, channels, height, width) order instead, but the number of dimensions is the same.

Part 02

Statistics & Probability

2.1 Mean (Average)

The mean is the center of your data. Add all values, divide by how many there are. It answers: "what is the typical value?"

Mean = (x₁ + x₂ + x₃ + ... + xₙ) / n = Σx / n

Example 1 — model accuracy scores

scores = [0.85, 0.90, 0.78, 0.92, 0.88]

mean = (0.85 + 0.90 + 0.78 + 0.92 + 0.88) / 5
     = 4.33 / 5
     = 0.866

Average accuracy = 86.6%
        

Example 2 — training losses

losses = [3.2, 2.8, 2.1, 1.9]

mean = (3.2 + 2.8 + 2.1 + 1.9) / 4
     = 10.0 / 4
     = 2.5
        

2.2 Variance

Variance measures how spread out your data is from the mean. High variance = data is scattered. Low variance = data is clustered.

Step 1: find the mean
Step 2: subtract mean from each value, square it
Step 3: average those squared differences

Variance = Σ(xᵢ - mean)² / n

Worked Example

data = [2, 4, 4, 4, 5, 5, 7, 9]

# Step 1: mean
mean = (2+4+4+4+5+5+7+9) / 8 = 40/8 = 5

# Step 2: squared differences
(2-5)² = 9
(4-5)² = 1  (three times)
(5-5)² = 0  (twice)
(7-5)² = 4
(9-5)² = 16

# Step 3: average
variance = (9+1+1+1+0+0+4+16) / 8 = 32/8 = 4
        

2.3 Standard Deviation

Standard deviation is just the square root of variance. It's easier to interpret because it's in the same units as your original data.

std = √Variance

# From the variance example above
variance = 4
std = √4 = 2

# Mean was 5, std is 2
# Most data falls between: 5-2=3 and 5+2=7
# That matches our data [2, 4, 4, 4, 5, 5, 7, 9]
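
Mean, variance, and standard deviation as defined above fit in a few lines of Python. This sketch divides by n, matching the formula in this section; be aware that some library functions (for example `statistics.variance` in the standard library) divide by n - 1 instead, so their results differ slightly.

```python
import math


def mean(values):
    return sum(values) / len(values)


def variance(values):
    """Average squared distance from the mean (divides by n)."""
    m = mean(values)
    return sum((x - m) ** 2 for x in values) / len(values)


def std(values):
    return math.sqrt(variance(values))


data = [2, 4, 4, 4, 5, 5, 7, 9]
print(mean(data))      # 5.0
print(variance(data))  # 4.0
print(std(data))       # 2.0
```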
      

Metric	What it tells you	AI use case
Mean	Center value	Average loss, average accuracy
Variance	Spread² (squared units)	Used in batch normalization
Std Dev	Spread (same units)	Detecting outliers in features

2.4 Probability

Probability measures how likely an event is to happen, on a scale from 0 (impossible) to 1 (certain).

P(A) = (Favorable Outcomes) / (Total Outcomes)    (when all outcomes are equally likely)

Example 1 — coin flip

P(heads) = 1 / 2 = 0.5   # 50%
P(tails) = 1 / 2 = 0.5
        

Example 2 — AI classification output

# An email spam classifier output
P(spam)    = 0.87   # 87% chance it's spam
P(not spam) = 0.13  # 13% chance it's not

# They must sum to 1.0 (complement rule)
0.87 + 0.13 = 1.0 ✓
        

Probability Rules

# Complement Rule: P(not A) = 1 - P(A)
P(rains) = 0.3
P(no rain) = 1 - 0.3 = 0.7

# Addition Rule (OR): P(A or B) = P(A) + P(B) - P(A and B)
# If A and B can't both happen (mutually exclusive):
P(A or B) = P(A) + P(B)

# Multiplication Rule (AND, independent events):
# P(A and B) = P(A) × P(B)
P(two heads) = P(H) × P(H) = 0.5 × 0.5 = 0.25
      

2.5 Conditional Probability

Conditional probability is the probability of event A happening, given that event B has already happened. The "|" means "given".

P(A|B) = P(A and B) / P(B)

Example 1 — spam detection

# We want: P(spam | email contains "FREE MONEY")
# = What's the chance it's spam, GIVEN it has "FREE MONEY"?

# From data:
P(spam AND has "FREE MONEY")  = 0.40
P(email has "FREE MONEY")     = 0.45

P(spam | "FREE MONEY") = 0.40 / 0.45 = 0.89

# 89% chance it's spam if it contains "FREE MONEY"
        

Example 2 — language model context

# An LLM predicts next word with conditional probability
# P(next word | all previous words)

P("world" | "hello") = 0.72   # likely next word
P("pizza" | "hello") = 0.01   # unlikely next word

# This is literally how language models work!
# They compute P(next_token | previous_tokens)
        

2.6 Bayes' Theorem

Bayes' Theorem tells you how to update your belief when you get new evidence. You start with a prior belief, see evidence, and get a better (posterior) belief.

P(A|B) = P(B|A) × P(A) / P(B)

P(A|B) = posterior: updated belief
P(B|A) = likelihood: how well evidence fits
P(A)   = prior: initial belief
P(B)   = evidence probability (normalizer)

Example — Medical Test

# Disease affects 1% of population
P(disease)          = 0.01  # Prior
P(no disease)       = 0.99

# Test is 95% accurate
P(positive | disease)    = 0.95  # Likelihood
P(positive | no disease) = 0.05  # False positive rate

# Total probability of a positive test
P(positive) = P(pos|disease)×P(disease) + P(pos|no disease)×P(no disease)
            = 0.95×0.01 + 0.05×0.99
            = 0.0095 + 0.0495 = 0.059

# Bayes: P(disease | positive test)
P(disease | positive) = (0.95 × 0.01) / 0.059
                      = 0.0095 / 0.059
                      = 0.161  ← only 16% chance!

# Surprise! A positive test only means 16% chance
# because the disease is rare (prior was very low)
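
The medical-test calculation can be wrapped in a small function so you can try other numbers. This is a sketch; the function and parameter names are invented for this example.

```python
def bayes_posterior(prior, likelihood, false_positive_rate):
    """P(A | B) computed from P(A), P(B | A) and P(B | not A)."""
    # Total probability of seeing the evidence B
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


posterior = bayes_posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(round(posterior, 3))  # 0.161
```

Raising the prior (a more common disease) while keeping the test the same pushes the posterior up sharply, which is the point of the example.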
        

In AI

Bayes' theorem underlies Naive Bayes classifiers, many recommendation systems, and the general principle of "updating beliefs with evidence" that appears throughout probabilistic ML. LLM behavior is also sometimes described in loosely Bayesian terms: knowledge learned in training acts like a prior that the prompt context then updates.

2.7 Log Probability

When you multiply many small probabilities together, the product quickly becomes too small for a computer to represent accurately (floating-point underflow). Log probability converts these tiny numbers into manageable negative numbers. It also turns multiplication into addition, which is numerically more stable.

What is log?

# log base e (natural log, written as ln or log)
# Asks: "e to what power gives me this number?"

log(1)   = 0       # e⁰ = 1
log(2.718) ≈ 1     # e¹ ≈ 2.718
log(0.5) ≈ -0.693  # negative! log of fraction is negative
log(0.01) ≈ -4.6   # small probability = large negative number
      

Why this is useful

# Without log: multiply tiny probabilities
P(sentence) = 0.001 × 0.002 × 0.003 × 0.001
            = 0.000000000006   ← tiny; with hundreds of tokens this underflows to 0!

# With log: add log probabilities
log P(sentence) = log(0.001) + log(0.002) + log(0.003) + log(0.001)
                = -6.9  + -6.2  + -5.8  + -6.9
                = -25.8   ← clean number, no precision loss
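
The difference becomes obvious with a longer sequence. The sketch below repeats the four-token example and then tries 200 tokens; the probabilities are made up for illustration.

```python
import math

token_probs = [0.001, 0.002, 0.003, 0.001]

product = 1.0
for p in token_probs:
    product *= p
log_sum = sum(math.log(p) for p in token_probs)

print(product)            # about 6e-12
print(round(log_sum, 1))  # -25.8

# A longer sequence: 200 tokens, each with probability 0.001
long_probs = [0.001] * 200

long_product = 1.0
for p in long_probs:
    long_product *= p
long_log_sum = sum(math.log(p) for p in long_probs)

print(long_product)            # 0.0 (the true value, 1e-600, is too small for a float)
print(round(long_log_sum, 1))  # -1381.6
```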
      

Log in Cross-Entropy Loss

# When your model predicts the right class with high confidence
P(correct) = 0.95
loss = -log(0.95) = -(-0.051) = 0.051   ← very low loss, good!

# When model is wrong / uncertain
P(correct) = 0.10
loss = -log(0.10) = -(-2.3) = 2.3      ← high loss, penalized!
      

Key insight

Log probability is always ≤ 0. The closer the probability is to 1.0, the closer log is to 0. The closer to 0.0, the more negative. Negative log probability is the loss! Maximize P → minimize -log(P).

2.8 Entropy & Cross-Entropy Loss

Entropy measures how much uncertainty or randomness exists in a probability distribution. High entropy = very uncertain. Low entropy = confident.

# Entropy formula
H(p) = -Σ p(x) × log(p(x))

# Example 1: fair coin (high entropy = uncertain)
p = [0.5, 0.5]
H = -(0.5×log(0.5) + 0.5×log(0.5))
  = -(0.5×-0.693 + 0.5×-0.693)
  = 0.693  ← maximum uncertainty for 2 outcomes

# Example 2: biased coin (low entropy = more certain)
p = [0.9, 0.1]
H = -(0.9×log(0.9) + 0.1×log(0.1))
  ≈ -(0.9×-0.105 + 0.1×-2.302)
  ≈ 0.325  ← less uncertain
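
The entropy formula in Python, as a sketch. It uses the natural log, like the examples above, and skips outcomes with probability 0 (their contribution is taken to be 0, a standard convention that avoids log(0)).

```python
import math


def entropy(p):
    """Entropy in nats of a probability distribution given as a list."""
    return -sum(px * math.log(px) for px in p if px > 0)


print(round(entropy([0.5, 0.5]), 3))  # 0.693
print(round(entropy([0.9, 0.1]), 3))  # 0.325
```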
      

Cross-Entropy Loss — The most common loss in AI

Cross-entropy compares what the model predicted vs what the true answer was. This is what you minimize during training of almost every classifier, and every LLM.

Cross-Entropy Loss = -Σ true(x) × log(predicted(x))

Classification Example — 3 classes

# True label: cat (class 0)
true      = [1,   0,    0   ]  # one-hot: only cat is 1
           # cat  dog   bird

# Model prediction A (confident and correct)
pred_A = [0.90, 0.07, 0.03]
loss_A = -(1×log(0.90) + 0×log(0.07) + 0×log(0.03))
       = -(1×-0.105)
       = 0.105   ← low loss, good prediction!

# Model prediction B (wrong and confident)
pred_B = [0.05, 0.90, 0.05]
loss_B = -(1×log(0.05))
       = -(1×-2.996)
       = 2.996   ← high loss, bad prediction!
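
The two predictions above can be scored with a short function. This is a sketch: the `eps` floor is an assumption added here to avoid log(0) when a model assigns probability 0, and is not part of the formula itself.

```python
import math


def cross_entropy(true_dist, predicted, eps=1e-12):
    """-sum of true(x) * log(predicted(x)), using the natural log."""
    return -sum(t * math.log(max(q, eps)) for t, q in zip(true_dist, predicted))


true = [1, 0, 0]  # one-hot label: cat

print(round(cross_entropy(true, [0.90, 0.07, 0.03]), 3))  # 0.105
print(round(cross_entropy(true, [0.05, 0.90, 0.05]), 3))  # 2.996
```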
        

In LLM Training

Every token prediction during LLM training uses cross-entropy loss. The model predicts a probability for each token in its vocabulary, and cross-entropy measures how wrong that prediction is. The pretraining of models such as GPT, Claude, and Llama comes down to minimizing cross-entropy loss over trillions of tokens.

2.9 Softmax

A neural network outputs raw numbers (called logits) — they can be anything, like [2.1, -0.5, 1.8]. Softmax converts these into proper probabilities that add up to 1.0.

softmax(xᵢ) = e^(xᵢ) / Σⱼ e^(xⱼ)    (sum over all j)

e ≈ 2.718 (Euler's number)

Step-by-step Softmax

logits = [2.0, 1.0, 0.1]   # raw model output

# Step 1: raise e to each power
e^2.0 = 7.389
e^1.0 = 2.718
e^0.1 = 1.105

# Step 2: sum them
total = 7.389 + 2.718 + 1.105 = 11.212

# Step 3: divide each by total
softmax = [7.389/11.212, 2.718/11.212, 1.105/11.212]
        = [      0.659,           0.242,           0.099]

# Check: 0.659 + 0.242 + 0.099 = 1.0 ✓
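
Here is the same calculation as a Python sketch. It adds one detail not shown in the steps above: subtracting the largest logit before exponentiating. This leaves the result unchanged mathematically but keeps `math.exp` from overflowing on large logits.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # stability trick; does not change the output
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
print(round(sum(probs), 6))          # 1.0

# Large logits would overflow a naive e^x, but work here
print(softmax([1000.0, 1000.0]))     # [0.5, 0.5]
```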
        

Softmax in Transformers

In the attention mechanism, softmax converts attention scores (dot products of Q and K) into attention weights — probabilities that sum to 1. These tell each token how much to "attend" to every other token. Softmax is in every transformer layer.
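
Here is a rough sketch of that step with made-up numbers. The Q and K matrices are random placeholders, and dividing by the square root of the key dimension is an assumption taken from the standard scaled dot-product attention formulation rather than from the text above.

```python
import numpy as np

def softmax_rows(scores):
    """Row-wise softmax: each row becomes a probability distribution."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))   # 3 tokens, 4-dim queries (made-up values)
K = rng.normal(size=(3, 4))   # 3 tokens, 4-dim keys (made-up values)

scores = Q @ K.T / np.sqrt(K.shape[-1])   # scaled dot products, shape (3, 3)
weights = softmax_rows(scores)            # attention weights

print(weights.sum(axis=-1))   # each row sums to 1
```

Row i of `weights` says how much token i attends to each of the three tokens.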

Part 03

Calculus

When to study this

You don't need calculus to build RAG pipelines, use LLM APIs, or even understand most of what an LLM does. Learn this section when you start training or fine-tuning neural networks.

3.1 Derivatives

A derivative measures how much a function's output changes when you change the input slightly. Think of it as the slope of a curve at a specific point.

If f(x) = x², and x = 3, the derivative tells you: "if I nudge x a tiny bit, how much does the output change?"

Basic rules
Constant: d/dx(5) = 0 (constants don't change)
Power rule: d/dx(xⁿ) = n × xⁿ⁻¹
Examples: d/dx(x²) = 2x, d/dx(x³) = 3x², d/dx(x) = 1

Example — applying power rule

f(x) = x²

f'(x) = 2x   # the derivative

f'(3) = 2×3 = 6   # slope at x=3 is 6
f'(5) = 2×5 = 10  # slope at x=5 is 10
f'(0) = 2×0 = 0   # slope at x=0 is 0 (minimum!)
        

Example 2 — more terms

f(x) = 3x³ + 2x² + 5x + 7

f'(x) = 9x² + 4x + 5

# Each term differentiated separately:
# 3x³ → 3×3×x² = 9x²
# 2x² → 2×2×x¹ = 4x
# 5x  → 5×1×x⁰ = 5
# 7   → 0 (constant)
        

In AI Training

The loss function is just f(weights). Its derivative with respect to a weight tells us how the loss changes when that weight increases. We move each weight in the opposite direction to reduce the loss. This is gradient descent.
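
The power-rule results in this section can be checked numerically by actually nudging x. This is a minimal sketch using a central difference; the helper name `numerical_derivative` and the step size `h` are illustrative choices.

```python
def numerical_derivative(f, x, h=1e-5):
    """Central-difference estimate of f'(x): nudge x both ways and measure the change."""
    return (f(x + h) - f(x - h)) / (2 * h)

def f(x):
    return x ** 2

def g(x):
    return 3 * x**3 + 2 * x**2 + 5 * x + 7

def g_prime(x):
    return 9 * x**2 + 4 * x + 5     # from the power rule, term by term

slope_at_3 = numerical_derivative(f, 3.0)   # power rule says 2*3 = 6
poly_slope = numerical_derivative(g, 2.0)   # power rule says g_prime(2) = 49
print(slope_at_3, poly_slope, g_prime(2.0)) # the estimates should be very close to 6 and 49
```

The numerical estimate is only approximate, which is why training frameworks compute derivatives with exact rules instead.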

3.2 Chain Rule

When functions are nested inside each other, you use the chain rule to find the derivative. This is the mathematical heart of backpropagation.

If y = f(g(x)), then: dy/dx = f'(g(x)) × g'(x)
"derivative of outer × derivative of inner"

Example 1

y = (3x + 1)²

# Outer function: f(u) = u²   → f'(u) = 2u
# Inner function: g(x) = 3x+1 → g'(x) = 3

dy/dx = 2(3x+1) × 3
       = 6(3x+1)

# At x=2: dy/dx = 6(7) = 42
        

Example 2 — neural network layer

# In a neural network: output = activation(z), where z = weights × input
# loss depends on output, output depends on z, z depends on weights

z = W × x
loss = L(activation(z))

# Chain rule gives us:
dLoss/dW = dLoss/dOutput × dOutput/dz × dz/dW

# This chain is what backpropagation computes!
# It "chains" derivatives through each layer backwards
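
To make the chain concrete, here is a sketch for a single neuron with one weight. The sigmoid activation, the squared-error loss, and the numbers are all assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# One neuron with one weight: output = sigmoid(w * x), loss = (output - target)^2
w, x, target = 0.5, 2.0, 1.0

# Forward pass
z = w * x
output = sigmoid(z)
loss = (output - target) ** 2

# Backward pass: multiply the local derivatives along the chain
dloss_doutput = 2 * (output - target)
doutput_dz = output * (1 - output)      # derivative of sigmoid
dz_dw = x
dloss_dw = dloss_doutput * doutput_dz * dz_dw

# Sanity check against a finite-difference estimate
h = 1e-6
loss_plus = (sigmoid((w + h) * x) - target) ** 2
loss_minus = (sigmoid((w - h) * x) - target) ** 2
numeric = (loss_plus - loss_minus) / (2 * h)

print(dloss_dw, numeric)   # the two values should agree closely
```

A real network repeats this multiplication layer by layer, reusing the part of the chain already computed for the layers after it.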
        

3.3 Partial Derivatives

When a function has multiple variables, a partial derivative measures the change with respect to one variable at a time, treating others as constants.

f(x, y) = x² + 3y + xy

# Partial derivative with respect to x (treat y as constant)
∂f/∂x = 2x + y

# Partial derivative with respect to y (treat x as constant)
∂f/∂y = 3 + x
      

Evaluating at a point

f(x, y) = x² + 3y + xy

∂f/∂x = 2x + y
∂f/∂y = 3 + x

# At point (x=2, y=3):
∂f/∂x = 2×2 + 3 = 7   # slope in x direction is 7
∂f/∂y = 3 + 2   = 5   # slope in y direction is 5
        

In AI

A neural network's loss depends on millions of weights (w₁, w₂, ... wₙ). We compute ∂Loss/∂w₁, ∂Loss/∂w₂, etc. — one partial derivative per weight. This tells us exactly how to nudge each weight to reduce the loss. Backpropagation automates this computation efficiently.
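
The "one variable at a time" idea can be checked numerically on the example function above. This is a sketch with hypothetical helper names `partial_x` and `partial_y`; each nudges one input and holds the other fixed.

```python
def f(x, y):
    return x**2 + 3*y + x*y

def partial_x(f, x, y, h=1e-5):
    """Nudge x only, hold y fixed."""
    return (f(x + h, y) - f(x - h, y)) / (2 * h)

def partial_y(f, x, y, h=1e-5):
    """Nudge y only, hold x fixed."""
    return (f(x, y + h) - f(x, y - h)) / (2 * h)

df_dx = partial_x(f, 2.0, 3.0)   # analytic answer: 2*2 + 3 = 7
df_dy = partial_y(f, 2.0, 3.0)   # analytic answer: 3 + 2 = 5
print(df_dx, df_dy)              # should be very close to 7 and 5
```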

3.4 Gradients

The gradient is just a vector that collects all the partial derivatives of a function. It points in the direction of steepest increase of the function.

∇f = [∂f/∂x₁, ∂f/∂x₂, ..., ∂f/∂xₙ]
∇ is called "nabla" and just means "gradient of"

f(x, y) = x² + y²

∇f = [∂f/∂x, ∂f/∂y]
    = [2x, 2y]

# At point (3, 4):
∇f = [6, 8]   # "go right 6, up 8" → steepest increase
    # negate it → [-6, -8] → steepest DECREASE (where we want to go!)
      

Gradient intuition

Imagine you're on a hill. The gradient says "this is the steepest direction uphill." To find the bottom of the hill (minimum loss), you go in the opposite direction of the gradient. That's gradient descent.
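
A small sketch of the hill idea, using f(x, y) = x² + y² from the example above. The helper `numerical_gradient` and the step size 0.1 are illustrative assumptions; the point is that one small step against the gradient lowers f.

```python
import numpy as np

def numerical_gradient(f, point, h=1e-5):
    """Estimate the gradient of f at `point` with central differences."""
    point = np.asarray(point, dtype=float)
    grad = np.zeros_like(point)
    for i in range(point.size):
        step = np.zeros_like(point)
        step[i] = h                  # nudge one coordinate at a time
        grad[i] = (f(point + step) - f(point - step)) / (2 * h)
    return grad

def f(p):
    return p[0] ** 2 + p[1] ** 2     # f(x, y) = x^2 + y^2

start = np.array([3.0, 4.0])
grad = numerical_gradient(f, start)  # analytic gradient is [6, 8]

downhill = start - 0.1 * grad        # small step against the gradient
print(np.round(grad, 4), f(start), f(downhill))
```

Here f drops from 25 at the start to about 16 after one step downhill.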

3.5 Gradient Descent

Gradient descent is the algorithm that trains neural networks. It repeatedly computes the gradient of the loss and takes a small step in the opposite direction to reduce the loss.

w = w - α × ∇Loss
w = weight (parameter)
α = learning rate (how big a step)
∇Loss = gradient of the loss w.r.t. w

Forward pass: Run the model on input data, get a prediction

Compute loss: Compare prediction to true label using cross-entropy loss

Backward pass: Use chain rule to compute gradient for every weight

Update weights: w = w - α × gradient — move each weight slightly in the right direction

Repeat thousands of times until loss is low enough
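
The steps above can be sketched as a tiny training loop. Everything specific here is an assumption for illustration: the four-point dataset, the single-weight logistic model, the learning rate of 0.5, and the 200 iterations.

```python
import numpy as np

# Tiny made-up dataset: one feature, binary label
X = np.array([[0.5], [1.5], [2.5], [3.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])

w = np.zeros(1)      # weight
b = 0.0              # bias
alpha = 0.5          # learning rate (illustrative choice)
losses = []

for step in range(200):
    # 1. Forward pass: predicted probability of class 1
    z = X @ w + b
    pred = 1.0 / (1.0 + np.exp(-z))

    # 2. Compute loss: binary cross-entropy, averaged over examples
    loss = -np.mean(y * np.log(pred) + (1 - y) * np.log(1 - pred))
    losses.append(loss)

    # 3. Backward pass: for sigmoid + cross-entropy the chain rule
    #    simplifies to (pred - y) times the input
    error = pred - y
    grad_w = X.T @ error / len(y)
    grad_b = np.mean(error)

    # 4. Update weights: step against the gradient
    w = w - alpha * grad_w
    b = b - alpha * grad_b

print(losses[0], losses[-1])   # the loss should fall over the run
```

Deep learning frameworks run the same four steps; they just compute the backward pass automatically for much larger models.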

Worked Example — one step

# Simple 1-weight example
w = 5.0         # current weight
α = 0.1         # learning rate
gradient = 3.0  # ∂Loss/∂w computed by backprop

# Update:
w_new = w - α × gradient
       = 5.0 - 0.1 × 3.0
       = 5.0 - 0.3
       = 4.7   # moved slightly toward minimum

# Next iteration: compute new gradient at w=4.7, repeat
        

The Learning Rate matters a lot

Learning Rate (α)	Effect	In practice
Too large (0.9)	Overshoots minimum, oscillates	Jumps past the answer
Just right (0.001)	Steady convergence	Common default for the Adam optimizer
Too small (0.000001)	Very slow training	Takes too long to converge
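
The three behaviours in the table can be seen on a toy loss. This sketch assumes Loss(w) = w², whose gradient is 2w, and a starting weight of 6. Which values count as too large or too small depends on the loss: on this toy problem 0.1 converges well, while real models usually need much smaller rates.

```python
def run_gradient_descent(alpha, w=6.0, steps=20):
    """Minimise Loss(w) = w^2, whose gradient is 2w, and return the path of w."""
    path = [w]
    for _ in range(steps):
        gradient = 2 * w
        w = w - alpha * gradient
        path.append(w)
    return path

too_large = run_gradient_descent(alpha=0.9)       # overshoots: sign flips every step
moderate = run_gradient_descent(alpha=0.1)        # shrinks steadily toward 0
too_small = run_gradient_descent(alpha=0.000001)  # barely moves in 20 steps

print(too_large[:4])                 # jumps back and forth across the minimum at 0
print(moderate[-1], too_small[-1])   # close to 0 versus still close to 6
```

On this loss, any learning rate above 1.0 makes each jump bigger than the last, so the weight diverges instead of settling.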

The Full Picture

Every time you call model.fit() or run a training loop, this update runs over and over, often millions of times. Large models such as GPT-4 are trained the same way: gradient descent over an enormous number of tokens, updating billions of weights. Optimizers like Adam add refinements, but the core update is still w = w - α∇Loss, just at enormous scale.

3.6 Least Squares

Least squares is the mathematical method that finds the best-fit line (or hyperplane) through a set of data points. It is the direct mathematical foundation of Linear Regression — and it connects linear algebra to calculus in one elegant formula.

The problem: overdetermined systems

Imagine you have 100 data points (x, y) and you want to find a straight line y = mx + b that fits them all. In matrix form this is Aw = y, where A is your data matrix and w holds the weights you want to learn. But with 100 equations and only 2 unknowns, there is generally no exact solution unless every point already lies on one line. Such a system is called overdetermined. Least squares finds the best possible answer.

# Example: fitting y = w·x + b to 4 data points
# (x=1,y=2), (x=2,y=3.9), (x=3,y=6.1), (x=4,y=7.8)

# Matrix form: A·w = y
A = [1,  1]    ← each row: [x_i, 1]
    [2,  1]
    [3,  1]
    [4,  1]

w = [m]          ← unknowns: slope m, intercept b
    [b]

y = [2.0]        ← actual target values
    [3.9]
    [6.1]
    [7.8]

# 4 equations, 2 unknowns → overdetermined → no exact solution
      

Residuals — the error you're minimising

A residual is the difference between what your model predicts and what the actual value is. For each data point:

residual_i = y_i - ŷ_i = actual - predicted
Total error = r₁² + r₂² + r₃² + ... + rₙ²
Least Squares: minimise ||y - Aw||²

Residuals in action

# Suppose we guess w = [2.0, 0.1] (slope=2.0, intercept=0.1)
predictions = A·w = [2.1, 4.1, 6.1, 8.1]
actual      =       [2.0, 3.9, 6.1, 7.8]

residuals   =       [-0.1, -0.2, 0.0, -0.3]   # actual - predicted

# Sum of squared residuals = 0.01 + 0.04 + 0.00 + 0.09 = 0.14
# Least squares finds the w that makes this as small as possible
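
The same calculation in NumPy, as a minimal sketch. It follows the definition above (residual = actual minus predicted) and uses the guessed weights from this example.

```python
import numpy as np

A = np.array([[1.0, 1.0], [2.0, 1.0], [3.0, 1.0], [4.0, 1.0]])  # rows: [x_i, 1]
y = np.array([2.0, 3.9, 6.1, 7.8])

w_guess = np.array([2.0, 0.1])          # slope 2.0, intercept 0.1

predictions = A @ w_guess               # [2.1, 4.1, 6.1, 8.1]
residuals = y - predictions             # actual minus predicted
sse = float(np.sum(residuals ** 2))     # sum of squared residuals

print(np.round(residuals, 2), round(sse, 2))
```

The sum of squared residuals comes out to 0.14. Squaring removes the signs, so over-predictions and under-predictions cannot cancel each other out.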
        

The Normal Equation — the closed-form solution

Taking the derivative of the squared error with respect to w, setting it to zero, and solving gives a direct formula for the optimal weights:

w* = (AᵀA)⁻¹ Aᵀ y
This is called the Normal Equation.

import numpy as np

# Data
X = np.array([[1, 1], [2, 1], [3, 1], [4, 1]])   # A matrix
y = np.array([2.0, 3.9, 6.1, 7.8])

# Normal equation: w = (XᵀX)⁻¹ Xᵀy
w = np.linalg.inv(X.T @ X) @ X.T @ y

print(w)   # ≈ [1.96, 0.05]  → slope ≈ 1.96, intercept ≈ 0.05

# sklearn's LinearRegression solves the same least-squares problem
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(X[:, :1], y)
print(model.coef_, model.intercept_)  # same result
      

Gradient descent vs Normal Equation

Method	How	Best when
Normal Equation	Direct formula: w = (AᵀA)⁻¹Aᵀy	Small datasets, few features
Gradient Descent	Iterative: step by step downhill	Large datasets, neural networks

Why not always use the Normal Equation?

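Solving the Normal Equation costs roughly O(n³) in the number of features n, because it requires inverting (or factorising) the n × n matrix AᵀA. With 10,000 features, this becomes computationally expensive. Neural networks are also non-linear, so no closed-form solution exists for them at all; with millions of parameters, gradient descent and its variants are the practical option. For simple linear regression with a few features, the Normal Equation is fast and exact.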
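
On a small problem, both methods land on the same weights. This sketch reuses the four data points from this section; the learning rate of 0.05 and the 5,000 iterations are illustrative choices, and `np.linalg.solve` is used here in place of an explicit inverse because it is the numerically safer way to apply the same formula.

```python
import numpy as np

A = np.array([[1.0, 1.0], [2.0, 1.0], [3.0, 1.0], [4.0, 1.0]])
y = np.array([2.0, 3.9, 6.1, 7.8])

# Closed form: solve (AᵀA) w = Aᵀy without forming the inverse explicitly
w_closed = np.linalg.solve(A.T @ A, A.T @ y)

# Iterative: gradient descent on the mean squared error
w_gd = np.zeros(2)
alpha = 0.05                      # learning rate (illustrative choice)
for _ in range(5000):
    residuals = A @ w_gd - y      # predicted minus actual, used for the gradient
    gradient = 2 * A.T @ residuals / len(y)
    w_gd = w_gd - alpha * gradient

print(np.round(w_closed, 3), np.round(w_gd, 3))   # both ≈ [1.96, 0.05]
```

The closed form gets there in one step, while gradient descent needs thousands of small ones. The trade-off flips only when the number of features is large enough that solving the system becomes the bottleneck.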

This is the mathematics behind Linear Regression

When you train a Linear Regression model, whether in scikit-learn, from scratch in NumPy, or as a linear output layer trained with squared error, the math underneath is least squares. The model finds the weights w that minimise the sum of squared residuals between predictions and targets. Many more complex models (Ridge, LASSO, Logistic Regression, neural networks) extend the same idea of choosing weights that minimise a loss, each in a different direction.

Quick Check

Why is a system with 100 data points and 2 unknowns called "overdetermined"?

A. Because the data is too large to process

B. Because there are more equations than unknowns, so no exact solution exists

C. Because the matrix A cannot be transposed

D. Because the residuals are too large

Answer: B. With 100 equations and only 2 unknowns, you have far more constraints than variables. No single line can pass through all 100 points exactly. Least squares finds the line that is closest to all of them, minimising the total squared error.

You're Ready 🎉

With this foundation, you can now study Classical ML and then Deep Learning with real understanding — not just memorizing formulas.

Next: Classical ML

Linear Regression · Logistic Regression · KNN · Decision Trees · Random Forest · XGBoost

Then: Deep Learning

Neural Networks · Backpropagation · Activation Functions · CNNs · RNNs

Then: AI Engineering

Transformers · LLMs · RAG · Agents · Fine-tuning · Evaluation