← Back to Projects

A Comparative Study on First Order Optimization Algorithms

Optimization Algorithms
Python Optimization Gradient Descent

Brief

In this project, I implemented and compared several first-order optimization algorithms from scratch, including Steepest Descent (Gradient Descent with Line-Search), Momentum Gradient Descent, and Nesterov Accelerated Gradient Descent (NAGD). I also developed a custom Armijo backtracking line search to determine adaptive step sizes. These algorithms were then tested across convex and non-convex functions, as well as on a signal denoising task, to analyze their convergence behavior and robustness.

Details

The study begins with the classical Steepest Descent and Momentum Gradient Descent methods. Momentum improves upon the standard approach by incorporating a velocity term that helps the optimizer maintain direction and accelerate through shallow regions.

Nesterov’s Accelerated Gradient Descent (NAGD) refines this idea further by predicting the next position before computing the gradient, leading to faster convergence and improved handling of ill-conditioned functions. The algorithm’s dynamics are governed by:

\[ x_{i+1} = y_i - \tau_i \nabla f(y_i) \] \[ y_{i+1} = x_{i+1} + \frac{\lambda_i - 1}{\lambda_{i+1}} (x_{i+1} - x_i) \]

where the momentum parameter \( \lambda_i \) evolves according to:

\[ \lambda_0 = 0, \quad \lambda_{i+1} = \frac{1 + \sqrt{1 + 4\lambda_i^2}}{2} \]

To ensure convergence and stability, I implemented an Armijo Backtracking Line Search. This strategy adaptively selects the step size \( \tau \) based on a sufficient decrease condition:

\[ f(x) \leq f(y) + \langle \nabla f(y), x - y \rangle + \frac{1}{2\tau} \|x - y\|^2 \]

The step size \( \tau \) is chosen as:

\[ \tau = \frac{1}{2^m}, \quad \text{where } m \text{ is the smallest integer for which the condition holds.} \]

I applied this framework to a variety of test scenarios, including convex functions, non-convex functions, and a real-world signal denoising task. Each algorithm’s performance was assessed based on convergence speed, stability, and accuracy. NAGD consistently demonstrated superior behavior in terms of stability and fast convergence.

Rotated Hyper-Ellipsoid Function

This convex benchmark function is defined as:

\[ f(x) = \sum_{i=1}^{n} \left( \sum_{j=1}^i x_j \right)^2, \quad n = 10, 100 \]

It has a global minimum at the origin:

\[ f(0,0,\dots,0) = 0 \]

The stopping criterion used was:

\[ |f(x_n) - f(x_{n-1})| < 10^{-6} \]

Results Table

Scheme Iterations (N=10) Value Iterations (N=100) Value
Steepest Descent 407 7.188e-5 23304 0.008
SD with Momentum 101 2.816e-5 3833 0.001
NAGD Fixed 91 0.0001 1609 0.0003
NAGD 65 0.0007 673 0.0007

Ackley's Function

The Ackley function is defined as:

\[ f(\mathbf{x}) = -a \exp \left( -b \sqrt{\frac{1}{d} \sum_{i=1}^{d} x_i^2} \right) - \exp \left( \frac{1}{d} \sum_{i=1}^{d} \cos(c x_i) \right) + a + \exp(1) \]

where \( a = 20 \), \( b = 0.2 \), \( c = 2\pi \). This is a non-convex function commonly used to test convergence behavior of optimizers.

Results Table

Method Iterations Gradient Norm
Nesterov w/o line search (fixed param) 25 1.263e-07
Nesterov w/o line search 33 2.718e-05
Momentum Gradient Descent 86 9.608e-07
Vanilla Gradient Descent 55 8.291e-07

Trajectories

Ackley NAGD FP

Nesterov w/o Line Search (Fixed Parameter)

Ackley NAGD VP

Nesterov w/o Line Search

Ackley MGD

Momentum Gradient Descent

Ackley GD

Vanilla Gradient Descent

Styblinski–Tang Function

The Styblinski–Tang function is defined as:

\[ f(\mathbf{x}) = \sum_{i=1}^{d} \left( \frac{x_i^4 - 16x_i^2 + 5x_i}{2} \right) \]

This function has four local minima and one global minimum, making it useful for evaluating the performance of optimizers on non-convex landscapes.

Results Table

Method Iterations Minimum Value
Nesterov with Line Search (Variable Param) 9 4.563e-04
Nesterov w/o Line Search (Fixed Param) 14 9.758e-04
Nesterov w/o Line Search 10 1.774e-04
Gradient Descent with Momentum 104 5.630e-05

Optimization Trajectories

ST LS VP

Nesterov with Line Search (Variable Parameter)

ST FP

Nesterov Fixed Parameter (w/o Line Search)

ST MGD

Momentum Gradient Descent

ST GD

Vanilla Gradient Descent