Mathematical Guide
Introduction
This serves as technical documentation for the mathematical background behind FinDi: Finite Difference Gradient Descent Python optimizer. Regular gradient descent will be briefly explained first, followed by its finite difference version, as implemented in FinDi. Lastly, an example will be shown of a problem that cannot be solved by a regular gradient descent optimizer.
(Regular) Gradient Descent
Gradient Descent (GD) is an iterative algorithm for finding a local minimum of a function. The basis of GD is that the objective function \(f\) decreases the fasters in the direction of its negative gradient at a particular point \(v\). Note that a gradient is defined as
and gradient descent update rule is given by
where \(\gamma\) is a learning rate and \(\gamma \in \mathbb{R_{+}}.\) Mathematically speaking, given initial guess \(v_{0}\), GD constructs a monotonic sequence \(f(v_{0}) \geq f(v_{1}) \geq f(v_{2}) \geq \cdot \cdot \cdot\) that should converge to a local minimum, given appropriate learning rate \(\gamma\).
Problems of (regular) Gradient Descent
Objective function \(f\) needs to be differentiable, i.e. a function that has a cusp or any type of discontinuity wouldn’t work. Moreover, any function without analytical form or without analytical gradient also is not solvable with the regular gradient descent.
Sensitivity to learning rate \(\gamma\)
For GD to find the global minimum, \(f\) needs to be convex. Otherwise, GD might converge to a local minimum (or not converge at all).
Finite Difference Gradient Descent
Differentiation
The Finite difference operator maps functions to functions and is defined as
The above expression is called the forward difference due to positive propagation. Analogously, the backward difference is defined as
and the central difference is defined as
Remember the pre-analysis definition of a derivative:
Usefully to us finite difference approximates derivatives by, instead of using infinitesimal changes, it uses finite changes. Let \(h\) be some finite, non-zero value, then
Since \(\frac{\Delta_{h}[f]}{h}=\frac{\Delta f}{\Delta x}\)
This works for backward and central difference as well.
The error of this approximation can be obtained with Taylor’s theorem. By rearranging the Taylor series expansion
where \(\eta \in (x, x+h)\).
While this error is the same for backward difference approximation, the error for central difference approximation is smaller and is given by
where \(\eta_{1} \in (x, x+h)\) and \(\eta_{2} \in (x-h, x).\)
Gradient Descent
Finite Difference Gradient Descent (FDGD) is a modification of the regular GD algorithm that approximates the gradient of the objective function with finite difference derivatives, as
Analogously, the FDGD update rule is given as
where \(\gamma\) is the same as in the regular GD. Given appropriate \(\gamma\), FDGD still constructs a monotonic sequence \(f(v_{0}) \geq f(v_{1}) \geq f(v_{2}) \geq \cdot \cdot \cdot\), however, due to the gradient approximation the convergence has an error proportional to the error discussed in Differentiation subsection.
An Example
Below is an example of a problem that minimizes the sum of squares between some data and some values that can be or are a function of parameters. Since we cannot find a gradient of the objective function, we cannot use any regular implementation of gradient descent. However, FinDi can solve this problem easily as it only needs the objective to be evaluated to optimize it.
import numpy as np
import findi as fd
import cvxpy
# Defining the objective function
def objective(params):
data = np.array([5, 2, 7])
values = np.vstack((np.array(params[0:3]), np.array(params[3:6]))).transpose()
x = cvxpy.Variable(2)
objective = cvxpy.Minimize(cvxpy.sum_squares(values @ x - data))
constraints = [0 <= x, x <= 1, cvxpy.sum(x) == 1]
prob = cvxpy.Problem(objective, constraints)
result = prob.solve()
return prob.value, x.value[0], x.value[1]
# Descent
outputs, parameters = fd.descent(
objective=objective,
initial=np.array([0, 0, 0, 0, 0, 0]),
h=0.001,
l=0.01,
epochs=1000,
)
print("Solution (argmin): ", parameters[-1])
print("Objective value at solution (min): ", outputs[-1])
Summary
Finite difference gradient descent uses a simple modification of the regular gradient descent to be able to optimize a much wider variety of objective functions. Unlike derivative-free optimization methods, FDGD is intuitive making it easier to integrate in scientific, engineering and commercial projects.