%% *Fast large-scale optimization by unifying stochastic gradient and quasi-Newton methods*
%% *Group Members*
%
% ECE602: Introduction to Optimization (Winter, 2018)
%
% Group Members: 
% *Lingheng Meng, 20702336;*
% *Ala Eldin Omer, 20738888*
% *Daiwei Lin, 20490018;*   


%% *1. Project Discription* 
% In this project, we will reimplememt *Sum of Functions (SFO)* optimizer
% proposed in [1], then test the optimizer in optimizing an Autoencoder on synthetic data, 
% and compared its performance against lbfgs.
%
% Autoencoder is a basic neural netorwork model intending to automatically
% abstract features from original data. The classic structure of
% Autoencoder is a two layer NN that has a input layer, a hidden layer and
% a output layer. The objective of Autoendocer is to reconstruct input on
% the output layer i.e. optimizing objective function to let $input  \approx output$.
%
% With stacking, greedy-layerwise training and fine-tuning strategies, Autoencoder could
% act as a block to construct deep neural network.
% 
% In this project, we reproduce result on Autoencoder as illustrated in
% paper [1].
%
%% *2. Formal Definition of Objective Function in Autoencoder*
%
% In this section, we will illustrate the formal experssion of objective
% function and derive the back-propagated gradient with respect to each
% parameter in an Autoencoder. 
% 
% *Forward Propagation* (for each datapoint) :
%
% $$x^{(i)}$$
%
% $$ s_{1}^{(i)} =  w_{1} * x^{(i)} + b_{1}  $$ 
%
% $$ h^{(i)} =  \frac{1}{1 +  e^{-s^{(i)}} } \quad \big(pointwise \: operation\big) $$
%
% $$ \widehat{y}^{(i)} =  w_{2} * h^{(i)} +  b_{2} $$
%
% *Objective Function* (for all datapoints in a batch):
% 
% $$ f = \frac{1}{N}  \sum_{i=1}^{N} (y^{(i)} -  \widehat{y}^{(i)} )^{2}$$
% 
%
% *Back-Propagation of Gradient* (for each datapoint):
% 
% $$ \nabla_{ \widehat{y}^{(i)} } f  =  \frac{2}{N}  \big( \widehat{y}^{(i)} -y^{(i)} \big) $$
% 
% $$ \nabla_{ b_{2} } f=\sum_{i=1}^{N} \nabla_{ \widehat{y}^{(i)} } f $$
% 
% $$\nabla_{ w_{2} } f = \sum_{i=1}^{N}  \big(h^{(i)}  {\nabla_{ \widehat{y}^{(i)} } f}^{T} \big) $$
% 
% $$\nabla_{ h^{(i)} } f =  w_{2} \nabla_{ \widehat{y}^{(i)} } f $$
% 
% $$ \nabla_{ b_{1} } f = \sum_{i=1}^{N} \nabla_{ h^{(i)} } f  \odot h^{(i)} \odot \big(1-h^{(i)}\big) \quad \big(\odot \: is \: pointwise \: operation\big)  $$
%
% $$\nabla_{ w_{1} } f = \sum_{i=1}^{N}  \big(x^{(i)} { \big(\nabla_{ h^{(i)} } f  \odot h^{(i)} \odot \big(1-h^{(i)}\big)\big)}^{T} \big) \quad \big(\odot is\: pointwise \: operation\big)$$
% 
% The forward and backward propagation we derived here will be
% incorporated in a single function called _*f_df_autoencoder*_. And this
% function wil be called in *SFO* optimizer.
%
% The optimization problem can be expressed as:
%
% $$ minimize {\quad} f = \frac{1}{N}  \sum_{i=1}^{N} (y^{(i)} -  \widehat{y}^{(i)} )^{2}$$
%
% <include>f_df_autoencoder.m</include>
%
%% *3. Overall Design Sum of Functions Optimizer*
%
% In this section, we are going to illustrate the underlying mehodology to
% implememt *SFO* using a MATLAB code.
% 
%%% 3.1 Approximating the objective function
% 
% A series of functions $G\left(x\right)$ (sum of N subfunctions $g_{i}\left(x\right)$ each of whcih is a quadratic approximation to 
% its corresponding $f_{i}\left(x\right)$) is defined to approximate the objective function $F\left(x\right)$ in each learning iteration as shown in the Figure below.
% The functions $g_{i}\left(x\right)$ will be stored and one of them is
% updated every learning step t:
% 
% <<figure_cartoon_pane_A.jpg>> 
%
% *i*. A vector $x^{t}$ is chosen as shown in the Figure below by minimizing the approximated objective function ${G}^{t-1}\left(x\right)$ at the previous iteration by 
% a Newton step.
%
% $$x^t= argmin_{x} G^{t-1}\left (x\right)$$ Equation(4)
%
% <<figure_cartoon_pane_B.jpg>> 
%
% *ii*. Only one subfunction $g_{j}^{t}\left(x\right)$ of index ${j}$ is chosen (to be updated) according to Equation (13), that is last evaluated 
% farthest from current location (least accurate)
% 
% $$ j= argmax_{i} [x^{t}-x^{\tau_{i}}]^{T} Q^{t} [x^{t}-x^{\tau_{i}}] $$ Equation (13)
%
% where ${\tau_{i}}$ is the time at which subfunction $i$ was last evaluated and Q is set either to approximated 
% Hessian for this subfunction (more accurate measure of distance) or full Hessian to avoid bad Hessian estimate for 
% a single subfunction (evaluation impossible).
%
% <include>get_target_index.m</include>
%
%%% 3.2 Get an Estimate of Subfunction
% The chosen subfunction of index ($indx$) is updated (as shown in the Figure below for $g_1^t(x)$) using a second order power series (Equation (6)) around the vector $x^{t}$
% chosen in the step (i). 
%
% $$g_{i}^{t}(x)=f_{i}(x^{t})+(x-x^{t})^{T}f_{i}^{'}(x^{t})+\frac{1}{2}(x-x^{t})^{T}H_{i}^{t}(x-x^{t}) $$ Equation (6)
%
% <<figure_cartoon_pane_C.jpg>>
%
% where the quadratic term $H_{j}^{t}$ is the Hessian of this subfunction and is approximated using BFGS algorithm
% based on its history of grdaient evaluation (details in step 3.4)
% 
% <include>get_predicted_subf.m</include> 
%
%%% 3.3 Subspace Update
% The subspace is expanded at each optimization step t to include the most recent gradient direction as an additional 
% column by finding the gradient vector component that lies outside the exsiting subspace and appending it to the current subspace.
% Following equations were used:
%
% $$ q_{orth}=f_{j}^{'}(x^{t})-P^{t-1}(P^{t-1})^{T} f_{j}^{'}(x^{t}) $$ Equation (7)
%
% $$ P^{t}=\left [P^{t-1} \frac{q_{orth}}{\left \|q_{orth}\right\|}\right] $$ Equation (8)
%
% *i*. The function *update_subspace(obj, x_in)* is written to update the subspace at each optimization
% step t to include the most recent gradient direction as an additional
% column. This is done by finding the gradient vector component that lies
% outside the exsiting subspace and appending it to the current subspace.
% 
% <include>update_subspace.m</include>
%
% *ii*. function *collapse_subspace(obj, xl)*: collapse the subspace to its smallest dimensionality. $xl$ is a new direction that may not 
% be in the history yet, hence to be included.
% In case the subspace dimensionality $K^{t}$ exceeds $K_{max}=3N$, it is collapsed to a new subspace that includes only the most recent 
% gradient and position measurements from each subfunction through Orthogonal-triangular (QR) decomposition
%
% $$ P^{'}=orth \left(\left[f_{1}^{'}(x^{\tau_{1}^{t}}) \ldots f_{N}^{'}(x^{\tau_{N}^{t}}) \quad x^{\tau_{1}^{t}} \ldots x^{\tau_{N}^{t}}\right]\right) $$ Equation (9)
%
% <include>collapse_subspace.m</include>
%
% *iii*. *reorthogonalize_subspace(obj)*: re-orthogonalizes the subspace if small errors are accumulated by performing Orthogonal-triangular 
% (QR) decomposition to the non-orthogonal subspace.    
%
% <include>reorthogonalize_subspace.m</include>
%
% *iv*. *apply_subspace_transformation(obj,T_left,T_right)*: apply change-of-subspace transformation when the subspace is collapsed 
% to project into the new lower dimensional subspace where T_left and T_right are the covariant and contravariant subspace 
% projection matrices, respectively. 
%
% <include>apply_subspace_transformation.m</include>
%
%%% 3.4 Online Hessian Approximation
% In this section we addressed all the calculations correspond to maintain an independent Hessian 
% approximation for each subfunction computed through BFGS algorithm on its history of gradient evaluations and positions 
%
% *i*. For each subfunction ${g_{j}^t}(x)$, two matrices $\Delta{f'}$ and $\Delta{x}$ are constructed to hold the change in 
% subfunction gradient and the corresponding change in position x between successive evaluations. This history is updated 
% using the function *update_history(obj, indx, theta_proj, f, df_proj)* that updates history of position & gradient differences for subfunction indx.
%
% <include>update_history.m</include>
%
% *ii*. Then, the Hessian approximation for a single subfunction (indx) is updated using BFGS algorithm as detailed in 
% function *update_hessian(obj,indx)* that iterates through the columns in $\Delta f'$ and $\Delta {x}$ from oldest 
% to most recent. At certain column of index $s$ the approximated Hessian should satisfy:
%
% * The secant equation $\Delta f'_{s} =B_{s} \Delta {x_s}$
%
% * Smallest weighted Forbenius norm between the current and prior estimate of Hessian (standard BFGS update equation)
%
% $$B_{s} = B_{s-1} + \frac {\Delta f'_{s} \Delta {f'_{s}}^{T}}{\Delta {f'_{s}}^{T} \Delta x_{s}} - \frac {B_{s-1} \Delta x_{s} \Delta x_{s}^{T} B_{s-1}}{\Delta x_{s}^{T} B_{s-1} \Delta x_{s}}$$ Equation (10)
%
% <include>update_hessian.m</include>
%
% *iii*. The full approximated Hessian is etimated using the function *get_full_H_with_diagonal(obj)*.
%  
% <include>get_full_H_with_diagonal.m</include>
%
%%% 3.5 Detecting bad updates
% Bad proposed parameter updates are identidied using the function *handle_step_failure(obj, f, df_proj, indx)*.
% Since only one subfunction is evaluated per update step, a line search on
% the full objective is impossible, therefore upades are instead lablelled
% bad when the value of subfunction has increased since its previous
% evaluation, and also exceeds its approximating function by more than the
% corresponding reduction in the summed approximating function (i.e. $f_j(x^t) -g_j^{t-1} > G^{t-1}(x^{t-1}) - G^{t-1}(x^{t})$)
%
% When a bad update is detected, $x^{t}$ is reset to its previous value $x^{t-1}$, the BFGS history matrices are also updated to 
% include the change in gradient in the failed update direction and the step length is updated as follows
%
% Case 1: successful update ${\eta}^{t+1} = \frac{1}{N}+\frac{N-1}{N}{\eta}^{t}$
%
% Case 2: failed update ${\eta}^{t+1} = \frac{1}{2}{\eta}^{t+1}$ Equation (16)
% 
% <include>handle_step_failure.m</include>
%
%%% 3.6 Growing the number of active subfunctions
% The algorithm begins with only small number of active subfunctions (two for our implementation) and expands the active 
% set as learning progresses by one subfunction every time the average gradient shrinks to within a factor $\alpha$ of 
% the standard error in the average gradient, i.e. the active subset is incremented whenever 
%
% $$ {(\bar{f}^{'t})}^{T} H^{t-1}{\bar{f}^{'t}} < \alpha \frac{\sum_{i}{(\bar{f}^{'t})}^{T} H^{t-1} {\bar{f}}^{'t}}{(N^{t}-1)N^{t}} $$ Equation (14)
% 
% where $N^{t}$ is the active subset size at time, $H^{t}$ is the full Hessian, and $\bar{f}^{'t}$ is the average gradient given by:  
%
% $$ \bar{f}^{'t}= \frac{1}{N^t} \sum_{i} f_{i}^{'} (x_{i}^{t}) $$ Equation (15)
%
% The active subset is also increased when a bad update is detected or when
% a full pass through the active batch occurs without a batch size increase
% 
% <include>expand_active_subfunctions.m</include>
%
%%% 3.7 Perform One Step Optimization
% After we defined all these subfunctions, we can perform a single optimization step. 
% 
% # Choose an index using get_target_index() function.
% # Calculate the gradient, 
% # Calculate inverse of hessian
% # Performe finally the Newton step according to Equation (5). (The Newton step length is adjusted in order to avoid numerical errors caused by small numbers or bouncing
% around opitimized values caused by large step size.)
% # Active subfunction set is expended at the end.
%
% $$ x^{t}=x^{t-1}-\eta^{t}\left(H^{t-1}\right)\frac{\partial
% G^{t-1}\left(x^{t-1}\right)}{\partial x} $$ Equation(5)
% 
% <include>optimization_step.m</include>
%% *4. Set Up Hyperparameters*
% Set Autoencoder structure and training dataset parameters
M = 784; % number visible units
J = 50; % number hidden units
D = 10000; % full data batch size
N = floor(sqrt(D)/10.); % number minibatches
%% *5. Data Import*
% The dataset we used in this project is <http://yann.lecun.com/exdb/mnist/
% MNIST> dataset.
v = loadMNISTImages('train-images-idx3-ubyte.idx3-ubyte');

% create the cell array of subfunction specific arguments
sub_refs = cell(N,1);
for i = 1:N
    % extract a single minibatch of training data.
    sub_refs{i} = v(:,i:N:end);
end

display_network(v(:,1:100));
%% *6. Optimize Autoencoder Using SFO on MNIST Dataset*
% initialize parameters
r  = sqrt(6) / sqrt(M+J+1);   % we'll choose weights uniformly from the interval [-r, r]
w = randn(J,M) * 2 * r - r; 
b_h = randn(J,1); 
b_v = randn(M,1);
b_h = zeros(J,1);
b_v = zeros(M,1);

theta_init = {w, b_h, b_v};
% initialize the optimizer
optimizer = sfo(@f_df_autoencoder, theta_init, sub_refs);
maxIteration = 200;
theta = optimizer.optimize(maxIteration);

% plot the convergence trace
plot(optimizer.hist_f_flat);
xlabel('Iteration');
ylabel('Minibatch Function Value');
title('Convergence Trace of SFO');

%% 6.1 Reconstruction of Autoencoder Trained by SFO
% Randomly pick up 100 samples to test the reconstruction performence of
% trained autoencoder
testSamples = v(:,1:100);
[reconstructed_testSamples] = autoencoder(theta, testSamples);
%Original testSamples
figure(); display_network(testSamples);

%Reconstructed testSamples
figure(); display_network(reconstructed_testSamples);
%% *7. Optimize Autoencoder Using Other Optimizers On MNIST Dataset*
%  Use <http://www.cs.ubc.ca/~schmidtm/Software/ lbfgs> to minimize the same objective function
addpath ./minFunc/
options.Method = 'lbfgs'; % Here, we use L-BFGS to optimize our cost
                          % function. Generally, for minFunc to work, you
                          % need a function pointer with two outputs: the
                          % function value and the gradient. In our problem,
                          % sparseAutoencoderCost.m satisfies this.
options.maxIter = 200;	  % Maximum number of iterations of L-BFGS to run 
options.display = 'off';

theta_init2 = [w(:) ; b_h(:) ; b_v(:)];

%  Run fminlbfgs
[opttheta, loss, flag, trace] = minFunc( @(p)f_df_autoencoder(p,v), ...
                                        theta_init2, options);

plot(trace.trace.fval);
xlabel('Iteration');
ylabel('Function Value on WHole Dataset');
title('Convergence Trace of lbfgs');
%% 7.1 Reconstruction of Autoencoder Trained by lbfgs
testSamples = v(:,1:100);
[reconstructed_testSamples] = autoencoder(opttheta, testSamples);
%Original testSamples
figure(); display_network(testSamples);
%Reconstructed testSamples
figure(); display_network(reconstructed_testSamples);

%% *8. Analysis and conclusions*
% *Experiment results analysis*:
%
% # Comparing convergence traces produced by SFO and lbfgs, we can see that
% both optimizers converge to a similar optimal function value i.e. around
% 15. This can also be seen from the reconstruction results that the
% reconstructed figures of Autoendoers trained by SFO and lbfgs
% respecitvely look very similar.
% # Another obvious character is that the convergence trace produced by SFO is less 
% smooth than the trace produced by lbfgs. This is bacause SFO trains
% objective function on subfunction i.e. minibatches rather than on a whole
% dataset as lbfgs does. Although the convergence trance is less smooth,
% SFO needs much less running time memory than lbfgs, and this feature will
% be more important in task with very large training dataset.
% # The way that SFO works on subfunctions of objective function also
% brings it a disadvantage in task with small dataset.
%
% *Summarize Key Features of SFO*:
%
% # Single subfunction is evaluated per update step, hence quadratic approximation 
% is maintained for each to get the full hessian estimation (direct target for SFO).
% Other algorithms considered the hessian of each subfunction as a noisy approximation to the 
% full Hessian.
% # Recent history of gradients and positions for each subfunction (iterates) 
% is used to determine a shared and adaptive low dimensional subspace where 
% active subfunctions are stored, manipulated and updated. This guarantees 
% that subspace includes both the steepest gradient descent direction over 
% the full batch as well as update directions from the most recent newton steps. 
% This way SFO algorithm maintains computational tractability and limits memory
% requirements in the face of high dimensional optimization problems (large M).
% # The algorithm initializes the approximate hessian for each subfunction 
% using a diagonal matrix (identity times median eigenvalue of other active 
% subfunctions average hessian (no history), or identity times minimum eigenvalue 
% found by solving squared secant equation (available history)). It might be 
% effective to use the average approximate hessian from all other subfunctions 
% for this initialization that would be overwritten by divergence of individual 
% subfunctions. This way will take the advantage that different 
% subfunctions (for many objective functions) have very similar hessians. 
% 
%% *References*
% [1] Sohl-Dickstein, Jascha, Ben Poole, and Surya Ganguli. "Fast large-scale optimization by unifying stochastic gradient and quasi-newton methods." In International Conference on Machine Learning, pp. 604-612. 2014.
% 
% [2] Boyd, Stephen, and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.