A Moment of Change#
The Math of a Learning Network
So, I was exploring different ways to initialize network parameters and to select an appropriate learning rate (LR), to better understand how to improve a model's learning during backpropagation.
This led me down yet another "math" rabbit hole 🐇 to clarify for myself a few unanswered questions around derivatives, rates of change, and so on.
This photo is from 2017, of me test riding a Street Triple on a national highway in India. The speed on the speedometer indicates the current speed at the given moment. It’s basically an estimate of how much distance the bike would cover in one hour if it continues at the same speed at that moment — i.e., 149 km per hour.
In order to measure, there needs to be a change: we can't measure anything if nothing changes. Speed, for example, is the change in distance over the change in time:

\[ S = \frac{\Delta D}{\Delta T} \]
where:
\(S\) is speed
\(\Delta D\) is distance traveled = \(D_{\text{final}} - D_{\text{initial}}\)
\(\Delta T\) is time = \(T_{\text{final}} - T_{\text{initial}}\)
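The formula above can be sketched in a few lines of Python. The distance and time values here are hypothetical, chosen only so the result matches the speedometer reading from the photo:

```python
# Average speed as change in distance over change in time.
# The values below are illustrative, not real measurements.
d_initial, d_final = 0.0, 74.5   # kilometres
t_initial, t_final = 0.0, 0.5    # hours

delta_d = d_final - d_initial    # distance traveled
delta_t = t_final - t_initial    # elapsed time
speed = delta_d / delta_t        # km per hour

print(speed)  # 149.0
```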
When you travel from one point to another, infinitely many speeds are possible along the way. The change of speed is the gradient of speed with respect to a moment in time.
What is a moment in time?#
A moment in time is an infinitesimally small interval of time, often denoted as \(\Delta t\) or \(dt\).
🔍 What is a “moment” in calculus?#
In calculus, when we talk about a "moment" (often casually), we're usually referring to a very short interval of time or space, one that approaches zero. It's typically denoted by \(h\): a positive number that can be made arbitrarily close to 0, but is never exactly 0 (note: there are infinitely many values between 0 and 1, so there is no single "smallest" one).
How to represent a moment using limits#
A "moment" is mathematically modeled as an infinitesimally small change, typically using a limit such as

\[ \lim_{h \to 0} \frac{f(x + h) - f(x)}{h} \]
Here, \(h\) represents a very small change (often in \(x\), time, etc.), and as it "goes to zero," we study how a function behaves in that infinitesimally small "moment." When we say "\(h\) tends to zero," we mean that \(h\) approaches zero but never actually equals it; there is no "next number after zero," only values arbitrarily close to it. The smaller \(h\) gets, the more accurate your measurement of instantaneous change becomes.
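You can watch this happen numerically. A minimal sketch, using a toy function \(f(x) = x^2\) (whose exact derivative at \(x = 3\) is 6), shows the estimate improving as \(h\) shrinks:

```python
# Numerically estimate the instantaneous rate of change of f at x
# using the limit idea: (f(x + h) - f(x)) / h as h -> 0.
def f(x):
    return x ** 2  # toy function; its exact derivative is 2x

x = 3.0
for h in [1.0, 0.1, 0.001, 1e-6]:
    estimate = (f(x + h) - f(x)) / h
    print(h, estimate)  # estimates approach 6.0 as h shrinks
```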
Neural Network: Derivatives at a Moment aka Gradient Descent#
Gradient descent is an optimization algorithm used to minimize a loss function (i.e., to improve a model). So if the loss is

\[ L = \text{Loss}(y, \hat{y}) \]
\( y \): Actual output.
\( \hat{y} \): Predicted output by the model.
\(\text{Loss}\): Loss function (e.g., mean squared error, cross-entropy loss).
The optimization step would be:

\[ \theta \leftarrow \theta - \eta \, \nabla_\theta L \]
Where:
\(\theta\) = model parameters (weights, biases, etc.)
\(\eta\) = learning rate (a small scalar)
\(\nabla_\theta L\) = the gradient of the loss function \(L\) with respect to \(\theta\)
To improve the predictions \( \hat{y} \), we want to minimize the loss, so we move in the opposite direction of the gradient.
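The update rule is small enough to run by hand. A minimal sketch on a 1-D quadratic loss \(L(\theta) = (\theta - 2)^2\), where the learning rate and starting point are illustrative choices:

```python
# Gradient descent on L(theta) = (theta - 2)^2,
# whose gradient is 2 * (theta - 2). Minimum is at theta = 2.
theta = 10.0  # arbitrary starting point
eta = 0.1     # learning rate (a small scalar)

for step in range(100):
    grad = 2 * (theta - 2)       # gradient of L at the current theta
    theta = theta - eta * grad   # move opposite to the gradient

print(theta)  # converges toward 2.0, the minimum of L
```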
💡 To compute \(\nabla_\theta L\), we would use derivatives.#
The derivative tells us:
"At a given moment (i.e., at a specific point in parameter space), how a tiny (infinitesimal) change in \(\theta\) will affect the loss" or, equivalently, "the effect of the smallest possible change in \(\theta\) on the loss at that exact moment."
Formally:

\[ \frac{\partial L}{\partial \theta} = \lim_{h \to 0} \frac{L(\theta + h) - L(\theta)}{h} \]
The gradient is not the small change itself — it’s the vector of partial derivatives that tells you how the loss would change if you made a small change in \(\theta\) (parameters).
The above is the formal (limit) definition of a derivative. In a multilayer neural network, it would be very inefficient to compute the gradient using the limit definition. Instead, we use the chain rule (applied calculus) to backpropagate the loss and compute the gradient efficiently.
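To see why the limit definition is inefficient, here is a finite-difference sketch (a toy loss of my own choosing): it needs one extra loss evaluation per parameter, so a model with millions of parameters would need millions of extra forward passes.

```python
# Finite-difference gradient via the limit definition: one extra
# loss evaluation per parameter. The chain rule (backprop) gets the
# whole gradient from a single backward pass instead.
def loss(theta):
    # toy loss over a list of parameters (illustrative)
    return sum((t - 1.0) ** 2 for t in theta)

theta = [3.0, -2.0, 0.5]
h = 1e-6
grad = []
for i in range(len(theta)):
    bumped = theta.copy()
    bumped[i] += h                        # perturb one parameter
    grad.append((loss(bumped) - loss(theta)) / h)

print(grad)  # close to the analytic gradient [4.0, -6.0, -1.0]
```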
Backpropagation: Chain Rule of Tiny Changes#
Backpropagation uses the chain rule to track how small changes in parameters affect the loss at a specific moment.
If:
\(L\) is the loss
\(z\) is a neuron output
\(w\) is a weight
Then:

\[ \frac{\partial L}{\partial w} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial w} \]
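The chain rule above can be traced by hand for a single neuron. A minimal sketch, assuming a linear neuron \(z = w \cdot x\) with squared-error loss \(L = (z - y)^2\) and made-up values:

```python
# Chain rule for one neuron: z = w * x, L = (z - y)^2.
# dL/dw = dL/dz * dz/dw, no limit-based approximation needed.
x, y = 2.0, 10.0   # one training example (illustrative values)
w = 3.0            # current weight

z = w * x                 # forward pass
loss = (z - y) ** 2       # = 16.0 here

dL_dz = 2 * (z - y)       # local derivative of the loss w.r.t. z
dz_dw = x                 # local derivative of z w.r.t. w
dL_dw = dL_dz * dz_dw     # chain rule

print(dL_dw)  # -16.0
```

The negative sign says the loss decreases as \(w\) increases at this point, so gradient descent would push \(w\) up.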
Lastly, every partial derivative in backprop is a slope, and geometrically a slope is just the tangent of an angle, i.e., \(\tan(\theta)\).
For example:

\[ \frac{\partial L}{\partial w} = \tan(\theta) \]
And that slope can be visualized as:
Where \(\theta\) is the angle between the weight axis and the tangent line of the loss curve at that point.
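Going from slope to angle is just the inverse tangent. A tiny sketch with an illustrative slope value:

```python
import math

# A slope dL/dw equals tan(theta), where theta is the angle between
# the weight axis and the tangent line. Recovering the angle from a
# slope is atan. The slope value here is illustrative.
slope = 1.0                 # suppose dL/dw = 1 at some point
theta = math.atan(slope)    # angle of the tangent line, in radians

print(math.degrees(theta))  # approximately 45 degrees
```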
TODO: 🚧 To be continued… 🚧