%% *Fast large-scale optimization by unifying stochastic gradient and quasi-Newton methods* %% *Group Members* % % ECE602: Introduction to Optimization (Winter, 2018) % % Group Members: % *Lingheng Meng, 20702336;* % *Ala Eldin Omer, 20738888* % *Daiwei Lin, 20490018;* %% *1. Project Discription* % In this project, we will reimplememt *Sum of Functions (SFO)* optimizer % proposed in [1], then test the optimizer in optimizing an Autoencoder on synthetic data, % and compared its performance against lbfgs. % % Autoencoder is a basic neural netorwork model intending to automatically % abstract features from original data. The classic structure of % Autoencoder is a two layer NN that has a input layer, a hidden layer and % a output layer. The objective of Autoendocer is to reconstruct input on % the output layer i.e. optimizing objective function to let $input \approx output$. % % With stacking, greedy-layerwise training and fine-tuning strategies, Autoencoder could % act as a block to construct deep neural network. % % In this project, we reproduce result on Autoencoder as illustrated in % paper [1]. % %% *2. Formal Definition of Objective Function in Autoencoder* % % In this section, we will illustrate the formal experssion of objective % function and derive the back-propagated gradient with respect to each % parameter in an Autoencoder. % % *Forward Propagation* (for each datapoint) : % % $$x^{(i)}$$ % % $$ s_{1}^{(i)} = w_{1} * x^{(i)} + b_{1} $$ % % $$ h^{(i)} = \frac{1}{1 + e^{-s^{(i)}} } \quad \big(pointwise \: operation\big) $$ % % $$ \widehat{y}^{(i)} = w_{2} * h^{(i)} + b_{2} $$ % % *Objective Function* (for all datapoints in a batch): % % $$ f = \frac{1}{N} \sum_{i=1}^{N} (y^{(i)} - \widehat{y}^{(i)} )^{2}$$ % % % *Back-Propagation of Gradient* (for each datapoint): % % $$ \nabla_{ \widehat{y}^{(i)} } f = \frac{2}{N} \big( \widehat{y}^{(i)} -y^{(i)} \big) $$ % % $$ \nabla_{ b_{2} } f=\sum_{i=1}^{N} \nabla_{ \widehat{y}^{(i)} } f $$ % % $$\nabla_{ w_{2} } f = \sum_{i=1}^{N} \big(h^{(i)} {\nabla_{ \widehat{y}^{(i)} } f}^{T} \big) $$ % % $$\nabla_{ h^{(i)} } f = w_{2} \nabla_{ \widehat{y}^{(i)} } f $$ % % $$ \nabla_{ b_{1} } f = \sum_{i=1}^{N} \nabla_{ h^{(i)} } f \odot h^{(i)} \odot \big(1-h^{(i)}\big) \quad \big(\odot \: is \: pointwise \: operation\big) $$ % % $$\nabla_{ w_{1} } f = \sum_{i=1}^{N} \big(x^{(i)} { \big(\nabla_{ h^{(i)} } f \odot h^{(i)} \odot \big(1-h^{(i)}\big)\big)}^{T} \big) \quad \big(\odot is\: pointwise \: operation\big)$$ % % The forward and backward propagation we derived here will be % incorporated in a single function called _*f_df_autoencoder*_. And this % function wil be called in *SFO* optimizer. % % The optimization problem can be expressed as: % % $$ minimize {\quad} f = \frac{1}{N} \sum_{i=1}^{N} (y^{(i)} - \widehat{y}^{(i)} )^{2}$$ % % f_df_autoencoder.m % %% *3. Overall Design Sum of Functions Optimizer* % % In this section, we are going to illustrate the underlying mehodology to % implememt *SFO* using a MATLAB code. % %%% 3.1 Approximating the objective function % % A series of functions $G\left(x\right)$ (sum of N subfunctions $g_{i}\left(x\right)$ each of whcih is a quadratic approximation to % its corresponding $f_{i}\left(x\right)$) is defined to approximate the objective function $F\left(x\right)$ in each learning iteration as shown in the Figure below. % The functions $g_{i}\left(x\right)$ will be stored and one of them is % updated every learning step t: % % <> % % *i*. A vector $x^{t}$ is chosen as shown in the Figure below by minimizing the approximated objective function ${G}^{t-1}\left(x\right)$ at the previous iteration by % a Newton step. % % $$x^t= argmin_{x} G^{t-1}\left (x\right)$$ Equation(4) % % <> % % *ii*. Only one subfunction $g_{j}^{t}\left(x\right)$ of index ${j}$ is chosen (to be updated) according to Equation (13), that is last evaluated % farthest from current location (least accurate) % % $$ j= argmax_{i} [x^{t}-x^{\tau_{i}}]^{T} Q^{t} [x^{t}-x^{\tau_{i}}] $$ Equation (13) % % where ${\tau_{i}}$ is the time at which subfunction $i$ was last evaluated and Q is set either to approximated % Hessian for this subfunction (more accurate measure of distance) or full Hessian to avoid bad Hessian estimate for % a single subfunction (evaluation impossible). % % get_target_index.m % %%% 3.2 Get an Estimate of Subfunction % The chosen subfunction of index ($indx$) is updated (as shown in the Figure below for $g_1^t(x)$) using a second order power series (Equation (6)) around the vector $x^{t}$ % chosen in the step (i). % % $$g_{i}^{t}(x)=f_{i}(x^{t})+(x-x^{t})^{T}f_{i}^{'}(x^{t})+\frac{1}{2}(x-x^{t})^{T}H_{i}^{t}(x-x^{t}) $$ Equation (6) % % <> % % where the quadratic term $H_{j}^{t}$ is the Hessian of this subfunction and is approximated using BFGS algorithm % based on its history of grdaient evaluation (details in step 3.4) % % get_predicted_subf.m % %%% 3.3 Subspace Update % The subspace is expanded at each optimization step t to include the most recent gradient direction as an additional % column by finding the gradient vector component that lies outside the exsiting subspace and appending it to the current subspace. % Following equations were used: % % $$ q_{orth}=f_{j}^{'}(x^{t})-P^{t-1}(P^{t-1})^{T} f_{j}^{'}(x^{t}) $$ Equation (7) % % $$ P^{t}=\left [P^{t-1} \frac{q_{orth}}{\left \|q_{orth}\right\|}\right] $$ Equation (8) % % *i*. The function *update_subspace(obj, x_in)* is written to update the subspace at each optimization % step t to include the most recent gradient direction as an additional % column. This is done by finding the gradient vector component that lies % outside the exsiting subspace and appending it to the current subspace. % % update_subspace.m % % *ii*. function *collapse_subspace(obj, xl)*: collapse the subspace to its smallest dimensionality. $xl$ is a new direction that may not % be in the history yet, hence to be included. % In case the subspace dimensionality $K^{t}$ exceeds $K_{max}=3N$, it is collapsed to a new subspace that includes only the most recent % gradient and position measurements from each subfunction through Orthogonal-triangular (QR) decomposition % % $$ P^{'}=orth \left(\left[f_{1}^{'}(x^{\tau_{1}^{t}}) \ldots f_{N}^{'}(x^{\tau_{N}^{t}}) \quad x^{\tau_{1}^{t}} \ldots x^{\tau_{N}^{t}}\right]\right) $$ Equation (9) % % collapse_subspace.m % % *iii*. *reorthogonalize_subspace(obj)*: re-orthogonalizes the subspace if small errors are accumulated by performing Orthogonal-triangular % (QR) decomposition to the non-orthogonal subspace. % % reorthogonalize_subspace.m % % *iv*. *apply_subspace_transformation(obj,T_left,T_right)*: apply change-of-subspace transformation when the subspace is collapsed % to project into the new lower dimensional subspace where T_left and T_right are the covariant and contravariant subspace % projection matrices, respectively. % % apply_subspace_transformation.m % %%% 3.4 Online Hessian Approximation % In this section we addressed all the calculations correspond to maintain an independent Hessian % approximation for each subfunction computed through BFGS algorithm on its history of gradient evaluations and positions % % *i*. For each subfunction ${g_{j}^t}(x)$, two matrices $\Delta{f'}$ and $\Delta{x}$ are constructed to hold the change in % subfunction gradient and the corresponding change in position x between successive evaluations. This history is updated % using the function *update_history(obj, indx, theta_proj, f, df_proj)* that updates history of position & gradient differences for subfunction indx. % % update_history.m % % *ii*. Then, the Hessian approximation for a single subfunction (indx) is updated using BFGS algorithm as detailed in % function *update_hessian(obj,indx)* that iterates through the columns in $\Delta f'$ and $\Delta {x}$ from oldest % to most recent. At certain column of index $s$ the approximated Hessian should satisfy: % % * The secant equation $\Delta f'_{s} =B_{s} \Delta {x_s}$ % % * Smallest weighted Forbenius norm between the current and prior estimate of Hessian (standard BFGS update equation) % % $$B_{s} = B_{s-1} + \frac {\Delta f'_{s} \Delta {f'_{s}}^{T}}{\Delta {f'_{s}}^{T} \Delta x_{s}} - \frac {B_{s-1} \Delta x_{s} \Delta x_{s}^{T} B_{s-1}}{\Delta x_{s}^{T} B_{s-1} \Delta x_{s}}$$ Equation (10) % % update_hessian.m % % *iii*. The full approximated Hessian is etimated using the function *get_full_H_with_diagonal(obj)*. % % get_full_H_with_diagonal.m % %%% 3.5 Detecting bad updates % Bad proposed parameter updates are identidied using the function *handle_step_failure(obj, f, df_proj, indx)*. % Since only one subfunction is evaluated per update step, a line search on % the full objective is impossible, therefore upades are instead lablelled % bad when the value of subfunction has increased since its previous % evaluation, and also exceeds its approximating function by more than the % corresponding reduction in the summed approximating function (i.e. $f_j(x^t) -g_j^{t-1} > G^{t-1}(x^{t-1}) - G^{t-1}(x^{t})$) % % When a bad update is detected, $x^{t}$ is reset to its previous value $x^{t-1}$, the BFGS history matrices are also updated to % include the change in gradient in the failed update direction and the step length is updated as follows % % Case 1: successful update ${\eta}^{t+1} = \frac{1}{N}+\frac{N-1}{N}{\eta}^{t}$ % % Case 2: failed update ${\eta}^{t+1} = \frac{1}{2}{\eta}^{t+1}$ Equation (16) % % handle_step_failure.m % %%% 3.6 Growing the number of active subfunctions % The algorithm begins with only small number of active subfunctions (two for our implementation) and expands the active % set as learning progresses by one subfunction every time the average gradient shrinks to within a factor $\alpha$ of % the standard error in the average gradient, i.e. the active subset is incremented whenever % % $$ {(\bar{f}^{'t})}^{T} H^{t-1}{\bar{f}^{'t}} < \alpha \frac{\sum_{i}{(\bar{f}^{'t})}^{T} H^{t-1} {\bar{f}}^{'t}}{(N^{t}-1)N^{t}} $$ Equation (14) % % where $N^{t}$ is the active subset size at time, $H^{t}$ is the full Hessian, and $\bar{f}^{'t}$ is the average gradient given by: % % $$ \bar{f}^{'t}= \frac{1}{N^t} \sum_{i} f_{i}^{'} (x_{i}^{t}) $$ Equation (15) % % The active subset is also increased when a bad update is detected or when % a full pass through the active batch occurs without a batch size increase % % expand_active_subfunctions.m % %%% 3.7 Perform One Step Optimization % After we defined all these subfunctions, we can perform a single optimization step. % % # Choose an index using get_target_index() function. % # Calculate the gradient, % # Calculate inverse of hessian % # Performe finally the Newton step according to Equation (5). (The Newton step length is adjusted in order to avoid numerical errors caused by small numbers or bouncing % around opitimized values caused by large step size.) % # Active subfunction set is expended at the end. % % $$ x^{t}=x^{t-1}-\eta^{t}\left(H^{t-1}\right)\frac{\partial % G^{t-1}\left(x^{t-1}\right)}{\partial x} $$ Equation(5) % % optimization_step.m %% *4. Set Up Hyperparameters* % Set Autoencoder structure and training dataset parameters M = 784; % number visible units J = 50; % number hidden units D = 10000; % full data batch size N = floor(sqrt(D)/10.); % number minibatches %% *5. Data Import* % The dataset we used in this project is dataset. v = loadMNISTImages('train-images-idx3-ubyte.idx3-ubyte'); % create the cell array of subfunction specific arguments sub_refs = cell(N,1); for i = 1:N % extract a single minibatch of training data. sub_refs{i} = v(:,i:N:end); end display_network(v(:,1:100)); %% *6. Optimize Autoencoder Using SFO on MNIST Dataset* % initialize parameters r = sqrt(6) / sqrt(M+J+1); % we'll choose weights uniformly from the interval [-r, r] w = randn(J,M) * 2 * r - r; b_h = randn(J,1); b_v = randn(M,1); b_h = zeros(J,1); b_v = zeros(M,1); theta_init = {w, b_h, b_v}; % initialize the optimizer optimizer = sfo(@f_df_autoencoder, theta_init, sub_refs); maxIteration = 200; theta = optimizer.optimize(maxIteration); % plot the convergence trace plot(optimizer.hist_f_flat); xlabel('Iteration'); ylabel('Minibatch Function Value'); title('Convergence Trace of SFO'); %% 6.1 Reconstruction of Autoencoder Trained by SFO % Randomly pick up 100 samples to test the reconstruction performence of % trained autoencoder testSamples = v(:,1:100); [reconstructed_testSamples] = autoencoder(theta, testSamples); %Original testSamples figure(); display_network(testSamples); %Reconstructed testSamples figure(); display_network(reconstructed_testSamples); %% *7. Optimize Autoencoder Using Other Optimizers On MNIST Dataset* % Use to minimize the same objective function addpath ./minFunc/ options.Method = 'lbfgs'; % Here, we use L-BFGS to optimize our cost % function. Generally, for minFunc to work, you % need a function pointer with two outputs: the % function value and the gradient. In our problem, % sparseAutoencoderCost.m satisfies this. options.maxIter = 200; % Maximum number of iterations of L-BFGS to run options.display = 'off'; theta_init2 = [w(:) ; b_h(:) ; b_v(:)]; % Run fminlbfgs [opttheta, loss, flag, trace] = minFunc( @(p)f_df_autoencoder(p,v), ... theta_init2, options); plot(trace.trace.fval); xlabel('Iteration'); ylabel('Function Value on WHole Dataset'); title('Convergence Trace of lbfgs'); %% 7.1 Reconstruction of Autoencoder Trained by lbfgs testSamples = v(:,1:100); [reconstructed_testSamples] = autoencoder(opttheta, testSamples); %Original testSamples figure(); display_network(testSamples); %Reconstructed testSamples figure(); display_network(reconstructed_testSamples); %% *8. Analysis and conclusions* % *Experiment results analysis*: % % # Comparing convergence traces produced by SFO and lbfgs, we can see that % both optimizers converge to a similar optimal function value i.e. around % 15. This can also be seen from the reconstruction results that the % reconstructed figures of Autoendoers trained by SFO and lbfgs % respecitvely look very similar. % # Another obvious character is that the convergence trace produced by SFO is less % smooth than the trace produced by lbfgs. This is bacause SFO trains % objective function on subfunction i.e. minibatches rather than on a whole % dataset as lbfgs does. Although the convergence trance is less smooth, % SFO needs much less running time memory than lbfgs, and this feature will % be more important in task with very large training dataset. % # The way that SFO works on subfunctions of objective function also % brings it a disadvantage in task with small dataset. % % *Summarize Key Features of SFO*: % % # Single subfunction is evaluated per update step, hence quadratic approximation % is maintained for each to get the full hessian estimation (direct target for SFO). % Other algorithms considered the hessian of each subfunction as a noisy approximation to the % full Hessian. % # Recent history of gradients and positions for each subfunction (iterates) % is used to determine a shared and adaptive low dimensional subspace where % active subfunctions are stored, manipulated and updated. This guarantees % that subspace includes both the steepest gradient descent direction over % the full batch as well as update directions from the most recent newton steps. % This way SFO algorithm maintains computational tractability and limits memory % requirements in the face of high dimensional optimization problems (large M). % # The algorithm initializes the approximate hessian for each subfunction % using a diagonal matrix (identity times median eigenvalue of other active % subfunctions average hessian (no history), or identity times minimum eigenvalue % found by solving squared secant equation (available history)). It might be % effective to use the average approximate hessian from all other subfunctions % for this initialization that would be overwritten by divergence of individual % subfunctions. This way will take the advantage that different % subfunctions (for many objective functions) have very similar hessians. % %% *References* % [1] Sohl-Dickstein, Jascha, Ben Poole, and Surya Ganguli. "Fast large-scale optimization by unifying stochastic gradient and quasi-newton methods." In International Conference on Machine Learning, pp. 604-612. 2014. % % [2] Boyd, Stephen, and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.