Skip to the content of the web site.

11.4 Gradient Descent

Introduction Theory HOWTO Error Analysis Examples Questions Applications in Engineering Matlab Maple


The method of gradient descent using the gradient to convert an N-dimensional problem into a 1-dimensional problem. At this point, any of the other techniques which we have used are now available to solve this 1-dimensional problem, which may then be translated back to an N-dimensional solution.


  • Mathews, Section 8.3, Steepest Descent or Gradient Method, p.446.


Given any scalar-valued function f(x) of an N-dimensional vector variable x, the gradient f(x) is a vector which gives the direction of maximum ascent at any point x. In this case, -f(x) is the direction of maximum descent.

Thus, if we have a point x0, we can evaluate the gradient at that point and define

x0 - t f(x0).

This defines a straight line in the N-dimensional space parametrized by the real variable t. If we evaluate the function f(x) at the points on this line, namely f(x0 + t f(x0)), this now becomes a function of a single variable t. Thus, we can use quadratic optimization or Newton's method to find a local minimum along that line.

Suppose we now calculate the value at which f attains a minimum along the given line. Call this point t1. If we now define x1 = x0 + t1 f(x)), then we are hopefully close to the local minima of f(x).

Note that because the point x1 is at a local minima along the line x0 + t f(x)), this means that the directional derivative at x1 in the direction of f(x0) must be zero. Therefore, the gradient f(x1) must be perpendicular to the gradient f(x0).

Having found x1, we can now find the next approximation x2.



Given a scalar-valued function of a vector variable, f(x), find a local minima of that function. We will assume x is an N-dimensional vector.


We will assume that the function f(x) is continuous.


We will use sampling, Taylor series (in higher dimensions), and iteration.

Initial Requirements

We have an initial bound x0 on the minima.

Iteration Process

Given the approximation xn − 1, calculate the gradient f(x). Evaluate this gradient at the point xn − 1 and then evaluate f( xn − 1 + tf(xn − 1) ). This is a function of a single real variable t, so at this point, use whichever optimization technique for a single variable is appropriate (be it a golden-mean search, Newton's method, or quadratic optimization).

Having found a value tn, let:

xn = xn − 1 + tn f(xn − 1).

Halting Conditions

  1. We halt if both of the following conditions are met:
    • The step between successive iterates is sufficiently small, ||xn - xn − 1|| < εstep, and
    • The difference between y values sufficiently small, |f(xn) - f(xn − 1)| < εabs.
  2. If the gradient is 0.
  3. If we have iterated some maximum number of times, say N, and have not met Condition 1, we halt and indicate that a solution was not found.

If we halt due to Condition 1, we state that xn is our approximation to the minimum.

If we halt due to either Condition 2 or 3, we may either choose a different initial approximation x0 or state that a solution may not exist.

Error Analysis

To be completed.


Example 1

Consider the function f(x) = x12 - 4x1 + 2x1x2 + 2x22 + 2*x2 + 14. This function has a minimum at x = (5, -3)T. This function is shown in Figure 1.

Figure 1. A function of a 2-dimensional vector.

Suppose, however, that we have the approximation x = (4, -4)T. We can calculate the gradient:

This gradient evaluated at (4, -4)T is (-4, -6)T. Figure 2 shows an arrow pointing in the opposite direction of the gradient at the point (4, -4)T. (Recall that the gradient points in the direction of maximum ascent, and we are looking for a minimum, so we must look in the opposite direction.)

Figure 2. The negative gradient vector at (4, -4)T.

The line on which this vector lies is (4, -4)T + t(4, 6)T = (4 + 4t, -4 + 6t)T. If we evaluate the function f(x) along this line

f( (4 + 4t, -4 + 6t)T ) = (4 + 4t)2 - 4(4 + 4t) + 2(4 + 4t)(-4 + 6t) + 2(-4 + 6t)2 + 2(-4 + 6t) + 14 = 136 t2 - 52 t + 6

we see that we have a function of one variable. Figure 3 shows this univariate function.

Figure 3. The function evaluated along the line parallel to the gradient.

Click here to see an animation rotating around this point. The blue dot is the actual minimum, though you will note that the function passes much closer to that point than our original guess (4, -4)T.

This quadratic function of a single variable t has a minimum at t1 = 13/68, and therefore (4, -4)T + (13/68)⋅(4, 6)T = (81/17, -97/34)T ≈ (4.7647, -2.8529)T is a better approximation the minimum (which you will recall is at (5, -3)T). The gradient at this new point is f( (81/17, -97/34)T ) = (-3/17, 2/17)T which you may note is perpendicular to our first gradient, (-4, -6)T.

We can now follow this next gradient to find an even better approximation of the minimum.


Question 1

Suppose you have the function f(x) = sin(x1) + sin(x2) + sin(x3). What is the univariate function you must minimize if you wish to apply gradient descent starting with the point x0 = (0, 0, 0)T?

Answer: f(t) = 3 sin( t ).

Question 2

Suppose you have a function f(x) = x12 - x1 + sin(x2). What is the univariate function you must minimize if you want to apply gradient descent starting with the point x0 = (0, 0)T?

Answer: f(x) = (-t)2 + t + sin(t).

Question 3

Perform two iterations to find a minimum of the function f(x) = x12 + x22 - x1x2 + x1 - 2x2 starting with x = (1, 1)T.

Answer: f(x0) = (2, -1)T, f( (1 + 2t, 1 - t)T ) = 7t2 − 5t. x1 = (0.2857, 1.3571)T.

x2 = (0.10714, 1)T.

Question 4

Compare the error of the three approximations in Question 3 and the actual location of the minimum, (0, 1)T.

Answer: ||x0 - (0, 1)T||2 = 1
||x1 - (0, 1)T||2 = 0.457
||x2 - (0, 1)T||2 = 0.107.

Applications to Engineering

Suppose we have a series of N antennae where all receive signals from same source s(t). The received signals will be denoted by sk(t) where k = 1, ..., N.

Since the antennae are spatially separated, their received signals do not have the same phase (are not coherent in the phase), and due to noise and variations in construction, tolerances, and other factors, the amplitudes of the signals will also have minor variations.

In order to build a stronger signal from the individual signals, it is common to add the signals together, however, when adding signals which are not in phase with each other, this sum will be destructive. The problem, therefore, is to introduce a phase shift ωk into each signal sk(t) so as to maximize the power of the combined signal. This is shown in Figure 1.

Figure 1. Combining the signals of an array of antennae.

Now, if the vector ω of phase shifts is a solution which maximizes the power, then ω + c is also a solution, so therefore we can fix the first phase shift to ω1 = 0.

This represents a real-valued function in N − 1 variables ω2, ω3, ..., ωN and a gradient-descent method could be used to find the optimal values of these phase shifts.

Thanks to Mohammad Fakharzadeh for this example.


Programming the gradient method requires you to explicitly enter the gradient as a function.


To be done.