ADAM: A METHOD FOR STOCHASTIC OPTIMIZATION

Xiaotong Chen 20713364

Valerie McKay-Crites 20732319

ECE 602 - Winter 2018

Prof. Oleg Michailovich

Paper: ADAM: A Method for Stochastic Optimization by Diederik P. Kingma and Jimmy Lei Ba [1]

Contents

Problem formulation
Proposed Solution
Data sources
Solution
Visualization of results
Solvers and Functions
Analysis and conclusions
References

Problem formulation

The authors propose Adam, a simple and computationally efficient method for gradient-based optimization of stochastic objective functions. It combines the advantages of two popular optimization methods: the ability of AdaGrad to deal with sparse gradients, and the ability of RMSProp to deal with non-stationary objectives. This project evaluates the method on multi-class logistic regression using the MNIST dataset.

Proposed Solution

1. Adam algorithm

The Adam update rule consists of the following steps:

$$m_{t} = \beta_1m_{t-1} + (1 - \beta_1) g_t$$

This computes an exponentially weighted moving average of the gradients (the momentum term). Gradient descent with momentum takes steps with much smaller oscillations, which allows the algorithm to take a more direct path towards the minimum. One advantage of the exponentially weighted average is that it takes very little memory: we only need to keep the latest average, which the formula keeps overwriting with new values. Compared with averaging over all past gradients, it takes a single line of code and the storage of a single number per parameter.

$$v_{t} = \beta_2v_{t-1} + (1 - \beta_2) g_t^2$$

We want learning to converge quickly while damping the oscillations. The term $v_t$ accumulates an exponentially weighted average of the squared gradients: parameters that receive few or small updates end up with an effectively larger learning rate, while frequently updated parameters get a smaller one. One effect is that we can use a larger learning rate that decreases over time and get faster learning without diverging in the oscillating directions.

$$\hat{m}_{t} = \frac{m_t}{1 - \beta_1^t}$$

$$\hat{v}_{t} = \frac{v_t}{1 - \beta_2^t}$$

When the moving averages are initialized at zero, their first several values are biased towards zero and are therefore poor estimates of the gradients. The bias-corrected first and second moment estimates compensate for this, especially during the initial phase: dividing by $1-\beta^t$ removes the bias, because when the iteration count $t$ is low the denominator is small and the corrected estimate becomes correspondingly larger.

$$\theta_t = \theta_{t-1} - \alpha \frac{\hat{m}_{t}}{\sqrt{\hat{v}_{t}} + \epsilon}$$
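For reference, the sketch below assembles these steps into a single Adam update for a generic parameter vector (adam_step and its arguments are illustrative names, not part of the implementation further down). Note that our update_parameters_adam function below substitutes an annealed step size $\alpha/\sqrt{t}$ for the explicit bias correction.

function [theta, m, v] = adam_step(theta, g, m, v, t, alpha, beta1, beta2, epsilon)
    % One Adam update following the equations above.
    m = beta1 .* m + (1 - beta1) .* g;         % first moment (momentum)
    v = beta2 .* v + (1 - beta2) .* g.^2;      % second moment (RMSProp-style)
    m_hat = m ./ (1 - beta1^t);                % bias-corrected first moment
    v_hat = v ./ (1 - beta2^t);                % bias-corrected second moment
    theta = theta - alpha .* m_hat ./ (sqrt(v_hat) + epsilon);
end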

2. Logistic Regression

minimize $l(\theta)$

where $\theta \in \mathbf{R}^{n+1}$ is the variable and $l$ is the logistic loss, i.e. the negative log-likelihood

$$l(\theta) = -\log L(\theta) = -\sum_{i=1}^m \left[ y^{(i)} \log h(x^{(i)}) + \left(1-y^{(i)}\right)\log\left(1-h(x^{(i)})\right) \right]$$

where $m$ is the number of training samples $(x^{(i)}, y^{(i)})$ and $h(x) = g(\theta^Tx) = \frac{1}{1+e^{-\theta^Tx}}$ is the sigmoid hypothesis.

For multi-class classification we train $k$ one-vs-all binary classifiers; the predicted class label for a single input is the one whose classifier attains the maximum value:

$$h(x) = \max_{i=1,2,\dots,k} g(\theta_i^Tx) = \max_{i=1,2,\dots,k} \frac{1}{1+e^{-\theta_i^Tx}}$$

where $k$ is the number of binary classifiers, which equals the number of classes.
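As an illustration, a one-vs-all prediction can be sketched as follows (predict_class and Theta are hypothetical names and are not used in the training code below):

function label = predict_class(x, Theta)
    % Theta is an (n+1)-by-k matrix whose columns are the k binary classifiers;
    % x is an (n+1)-by-1 input with the bias term appended.
    scores = 1 ./ (1 + exp(-Theta' * x));   % sigmoid output of each classifier
    [~, label] = max(scores);               % class of the largest output
end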

For $i = 1,\dots,m$ and $j = 1,\dots,n$, the per-sample derivatives are

$$\frac{\partial l}{\partial h^{(i)}} = -\frac{y^{(i)}}{h^{(i)}} + \frac{1-y^{(i)}}{1-h^{(i)}}$$

$$\frac{\partial l}{\partial \theta_j} = \frac{\partial l}{\partial h^{(i)}} \frac{\partial h^{(i)}}{\partial \theta_j} = x_j^{(i)}\left(h^{(i)} - y^{(i)}\right)$$

$$\frac{\partial l}{\partial \theta_0} = h^{(i)} - y^{(i)}$$

where $m$ is the number of samples and $n$ is the number of feature dimensions.
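These expressions are implemented in the backward_propagation function further down. As a sanity check, the analytic gradient can be compared against a finite-difference approximation; the sketch below assumes a function handle loss_fn(theta) that evaluates $l(\theta)$ on a small batch (both names are illustrative):

function g_num = numerical_gradient(loss_fn, theta, h)
    % Central finite-difference approximation of the gradient of loss_fn at theta.
    g_num = zeros(size(theta));
    for j = 1:numel(theta)
        e = zeros(size(theta));
        e(j) = h;
        g_num(j) = (loss_fn(theta + e) - loss_fn(theta - e)) / (2 * h);
    end
end

The maximum absolute difference between g_num and the analytic gradient should be small, on the order of $h^2$.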

Data sources

We take the MNIST training image files [3] and convert them to MATLAB .mat files, loading the labels as a categorical array and the training images as a double matrix. The 60,000 MNIST training images are used as the training set and converted to 784-dimensional vectors. The samples are then randomly shuffled and split into mini-batches of size 128 (except the last one, which contains 96 samples). Mini-batches require less computation per weight update than full-batch gradient descent.

clear; clc;

load IMAGES.mat
load LABELS.mat

% Create a shuffled version of the training set
p = randperm(size(IMAGES,1));
I_shuffle = zeros(60000,784);
labels_shuffle= categorical(zeros(60000,1));
for i = 1:size(I_shuffle,1)
     I_shuffle(i,:) = IMAGES(p(i),:);
     labels_shuffle(i,1) = LABELS(p(i));
end
% Create mini-batches
for j = 1:468
    I(j).image = I_shuffle(128*j-127:128*j,:);
    I(j).labels = labels_shuffle(128*j-127:128*j,:);
end
I(469).image = I_shuffle(59905:60000,:);
I(469).labels = labels_shuffle(59905:60000,:);

Solution

ADAM

% initialize variables
N = 784;
w = randn(N,10) * 0.01; b = randn(10,1) * 0.01;
V_dw = zeros(N,10); V_db = zeros(10,1);
S_dw = zeros(N, 10); S_db = zeros(10,1);
% Hyperparameters
iter = 45;
alpha = 0.01;
belta1 = 0.9;
belta2 = 0.999;
epsilon = 10.^(-8);
batches=469;
cost = zeros(batches,iter);
cost1 = zeros(1,iter);
for k = 1:iter
       for i = 1:batches
         % Forward Propagation
         a = forward_propagation(I(i).image, w, b);

         % Compute cost
         cost(i,k) = compute_cost(a, I(i).labels);

         % Backward propagation
         [dw, db] = backward_propagation(I(i).image, I(i).labels, a);

         %Update Parameters with Adam
         [w, b, V_dw, V_db, S_dw, S_db] = update_parameters_adam(I(i).image,...
           dw, db, w, b, V_dw, V_db, S_dw, S_db, alpha, belta1, belta2, k, epsilon);

       end
       cost1(k) = cost(k,k);
        if mod(k-1,5) == 0 || k > 40
            if k == 1
                fprintf('ADAM Values Every 5 Iterations \n');
            end
            fprintf('Cost for ADAM after iteration %d: %f \n', k-1, cost(k,k));
       end
end

AdaGrad

In this experiment, the paper compares Adam with two different optimizers: AdaGrad and SGD with Nesterov momentum.

AdaGrad is a modified stochastic gradient descent method with a per-parameter learning rate [2]. It effectively increases the learning rate for parameters with sparse (infrequently updated) gradients and decreases it for parameters that are updated frequently.

$$G_t = \sum_{k=1}^t g_k g_k^T$$

where $g_k$ is the gradient at iteration $k$. Only the diagonal of $G_t$, a vector of accumulated squared gradients, is used, and it is updated after every iteration. The formula for an update is

$$\theta_t = \theta_{t-1} - \alpha\, \mathrm{diag}(G_t)^{-1/2} g_t$$

AdaGrad corresponds to a version of Adam with $\beta_1 = 0$, an infinitesimal $(1 - \beta_2)$, and the step size $\alpha$ replaced by the annealed version $\alpha_t = \alpha\, t^{-1/2}$ [1]. Therefore, we can implement AdaGrad by changing the hyperparameters of the Adam algorithm.
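For comparison, a minimal sketch of a stand-alone AdaGrad step is shown below (adagrad_step and its variable names are illustrative); G accumulates the squared gradients, i.e. the diagonal of the matrix above. In the experiment that follows, we instead implement AdaGrad by modifying the Adam update with the hyperparameters listed above.

function [theta, G] = adagrad_step(theta, g, G, alpha, epsilon)
    % Standard AdaGrad: the per-parameter step size shrinks as gradients accumulate.
    G = G + g.^2;                                       % running sum of squared gradients
    theta = theta - alpha .* g ./ (sqrt(G) + epsilon);
end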

% initialize variables
N = 784;
w = randn(N,10) * 0.01; b = randn(10,1) * 0.01;
V_dw = zeros(N,10); V_db = zeros(10,1);
S_dw = zeros(N, 10); S_db = zeros(10,1);
% Hyperparameters
iter = 45;
batches = 469;
alpha = 0.01;
% Changing hyperparameters
belta1 = 0;
belta2 = 0.999999;
epsilon = 10.^(-8);
cost2 = zeros(1,iter);
for k = 1:iter
       for i = 1:batches

         % Forward Propagation
         a = forward_propagation(I(i).image, w, b);

         % Compute cost
         cost(i,k) = compute_cost(a, I(i).labels);

         % Backward propagation
         [dw, db] = backward_propagation(I(i).image, I(i).labels, a);

         % Update Parameters with AdaGrad
         [w, b, V_dw, V_db, S_dw, S_db] = update_parameters_AdaGrad(I(i).image,...
           dw, db, w, b, V_dw, V_db, S_dw, S_db, alpha, belta1, belta2, k, epsilon);

       end
       cost2(k) = cost(k,k);
end

SGD Nesterov

Nesterov momentum first makes a big jump in the direction of the previously accumulated gradient, then measures the gradient where it ends up and makes a correction.
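A sketch of this lookahead form of Nesterov momentum is shown below (nesterov_step, g_fn and the other names are illustrative, and g_fn(theta) is an assumed handle that returns the gradient at theta). Our update_parameters_SGDNesterov function further down applies a simpler momentum-style update of the accumulated gradient.

function [theta, v] = nesterov_step(theta, g_fn, v, alpha, mu)
    % Lookahead: evaluate the gradient after the momentum jump, then correct.
    g_ahead = g_fn(theta + mu .* v);   % gradient measured at the lookahead point
    v = mu .* v - alpha .* g_ahead;    % updated velocity
    theta = theta + v;                 % apply the corrected step
end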

% initialize variables
N = 784;
w = randn(N,10) * 0.01; b = randn(10,1) * 0.01;
V_dw = zeros(N,10); V_db = zeros(10,1);
S_dw = zeros(N, 10); S_db = zeros(10,1);
% Hyperparameters
iter = 45;
batches = 469;
alpha = 0.1;
belta1 = 0.9;
belta2 = 0.999;
epsilon = 10.^(-8);
cost = zeros(batches,iter);
cost3 = zeros(1,iter);
for k = 1:iter
       for i = 1:batches

         % Forward Propagation
         a = forward_propagation(I(i).image, w, b);

         % Compute cost
         cost(i,k) = compute_cost(a, I(i).labels);

         % Backward propagation
         [dw, db] = backward_propagation(I(i).image, I(i).labels, a);

        % Update Parameters with SGDNesterov
        [w, b, V_dw, V_db] = update_parameters_SGDNesterov(I(i).image, dw, db, w, b,...
            V_dw, V_db, alpha, belta1, k);
       end
       cost3(k) = cost(k,k);
end

Visualization of results

figure, plot(cost1);
hold on, plot(cost2);
hold on, plot(cost3);
axis([0 45 0.2 0.7]);
legend('ADAM', 'AdaGrad', 'SGD Nesterov');
title('MNIST Logistic Regression');
xlabel('Iterations Over Entire Dataset');
ylabel('Training Cost')
%
% The random fluctuations in the training cost are due to the different
% gradients on different mini-batches.

Solvers and Functions

function cost = compute_cost(a, labels)
cost = 0;
      for i = 1:size(labels,1)
        num = str2num(char(labels(i)));
        y = zeros(10,1);
        y(num+1) = 1;
        % Cost Function
        cost = cost + ( -( y' * log( a(i, :)' ) + ( 1 - y )' * log( 1 - a(i, :)')) )  /size(labels,1)/10;
      end
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function a = forward_propagation(I, w, b)
z= zeros( size(I,1), 10);
a = zeros( size(I,1), 10);
    for i = 1:size(I,1)
        for j = 1:size(w,2)
            % Output of Logistic Regression
            z(i,j) = w(:,j)' * I(i,:)'./255 + b(j);
            a(i,j) = 1./(1+exp(-z(i,j)));
        end
    end
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [dw, db] = backward_propagation(I, labels, a)
dw = zeros(size(I,2),10); db = zeros(10,1);
y = zeros(10, size(a,1));
    for i=1:size(I,1)
        num(i) = str2num(char(labels(i)));
        y(num(i)+1, i) = 1;
    end
dz = a' - y;
db = sum(dz,2) ./ size(I,1);
dw = I' * dz' ./ size(I,1);
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [w, b, V_dw, V_db, S_dw, S_db] = update_parameters_adam(I, dw, db, w, b, V_dw, V_db, S_dw, S_db, alpha, belta1, belta2, t, epsilon)

    % Momentum
    V_dw = belta1 .* V_dw + (1-belta1) * 1/size(I,1) .* dw;
    V_db = belta1 .* V_db + (1-belta1) * 1/size(I,1) .* db;

    % RMSprop
    S_dw = belta2 .* S_dw + (1-belta2) * 1/size(I,1) .* dw.^2;
    S_db = belta2 .* S_db + (1-belta2) * 1/size(I,1) .* db.^2;

    alpha = alpha/sqrt(t);

    w = w - alpha .* V_dw ./ (sqrt(S_dw) + epsilon);
    b = b - alpha .* V_db ./ (sqrt(S_db) + epsilon);
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [w, b, V_dw, V_db, S_dw, S_db] = update_parameters_AdaGrad(I, dw, db, w, b, V_dw, V_db, S_dw, S_db, alpha, belta1, belta2, t, epsilon)

    % Calculate an exponentially weighted average of past gradients, and
    % stores it in variables v.
    V_dw = belta1 .* V_dw + (1-belta1) * 1/size(I,1) .* dw;
    V_db = belta1 .* V_db + (1-belta1) * 1/size(I,1) .* db;

    % Calculate an exponentially weighted average of the squares of past
    % gradients, and stores it in variables S
    S_dw = belta2 .* S_dw + (1-belta2) * 1/size(I,1) .* dw.^2;
    S_db = belta2 .* S_db + (1-belta2) * 1/size(I,1) .* db.^2;

    % Bias-Correct
    V_dw_corrected = V_dw./ (1-belta1^t);
    V_db_corrected = V_db ./ (1-belta1^t);

    S_dw_corrected = S_dw./ (1-belta2^t);
    S_db_corrected = S_db ./ (1-belta2^t);

    % Update parameters in a direction based on v and S
    w = w - alpha/sqrt(t) .* V_dw_corrected ./ (sqrt(S_dw_corrected) + epsilon);
    b = b - alpha/sqrt(t) .* V_db_corrected ./ (sqrt(S_db_corrected) + epsilon);

end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [w, b, V_dw, V_db] = update_parameters_SGDNesterov(I, dw, db, w, b, V_dw, V_db, alpha, belta1, t)
    % Momentum
    V_dw = belta1 .* V_dw + (1-belta1) * 1/size(I,1) .* dw;
    V_db = belta1 .* V_db + (1-belta1) * 1/size(I,1) .* db;

    w = w - alpha .* V_dw;
    b = b - alpha .*  V_db;
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
ADAM Values Every 5 Iterations 
Cost for ADAM after iteration 0: 0.694234 
Cost for ADAM after iteration 5: 0.318023 
Cost for ADAM after iteration 10: 0.304476 
Cost for ADAM after iteration 15: 0.295369 
Cost for ADAM after iteration 20: 0.287921 
Cost for ADAM after iteration 25: 0.281101 
Cost for ADAM after iteration 30: 0.277861 
Cost for ADAM after iteration 35: 0.272720 
Cost for ADAM after iteration 40: 0.264486 
Cost for ADAM after iteration 41: 0.261258 
Cost for ADAM after iteration 42: 0.268538 
Cost for ADAM after iteration 43: 0.266488 
Cost for ADAM after iteration 44: 0.262349 

Analysis and conclusions

Adam is an optimization algorithm that can be used instead of the classical stochastic gradient descent procedure to update network weights iteratively based on training data. Among its attractive benefits, it is straightforward to implement, computationally efficient, and has low memory requirements [1].

In reproducing the results of the paper "ADAM: A Method for Stochastic Optimization", we found that our implementation of Adam produced very similar results. It converges similarly to the SGD Nesterov method on the MNIST dataset. We also observed that both of those methods converge faster than AdaGrad on MNIST.

References

[1] Kingma, D.P. and Ba, J., 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[2] Duchi, J., Hazan, E. and Singer, Y., 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul), pp.2121-2159.

[3] Y. LeCun, C. Cortes and C. Burges, "MNIST handwritten digit database", Yann.lecun.com, 2010. [Online]. Available: http://yann.lecun.com/exdb/mnist/. [Accessed: 14- Apr- 2018].