Gradient Descent

If we have a complex, difficult to solve, system of equations for which we need to find a minimum value, it can be easier to just pick a random set of inputs, calculate the result (which is usually must easier) and then see if we can change the inputs to make the output value go down. If it goes up, we can move the inputs in the other direction. But there is an easier way to figure out which direction to move: Take the derivative of the original formula. That gives us the slope, and we can just role the input down the hill to the bottom.

In practice, we have a guess, call it theta, which represents the inputs to the formula. In order to change theta to a better value, we can modify it by a small increment (represented by a or alpha) times the slope of our error. Doing this again and again will slowly move our prediction over the entire training set to an optimal line IF the value of alpha is small enough. If alpha is too large, the new predicted value of theta may overshoot the ideal value and bounce out of control. Of course, very small values of alpha may converge to the ideal parameters very slowly.

Note that we are computing the derivative of the cost function to find it's slope so that we know which direction to move and by how much. That's why we use half the MSE as our cost function: The derivative of ½x² is x which is easy to calculate. The slope for parameter O_J is simply (O^TX - y) X, or the actual error (difference not MSE) times X.

% Basic Linear Regression Gradient Decent with multiple parameters in Octave
alpha = 0.01; % try larger values 0.01, 0.03, 0.1, 0.3, etc... 
m = length(y); % number of training examples
p = size(X,2); % number of parameters (second dimension of X)

for iter = 1:num_iters
  hyp = X*theta;  %calculate our hypothesis using current parameters
  err = (hyp .- y); %find the error between that and the real data
  s = ( X' * err )./m; %find the slope of the error. (should that be .' instead of '?)
  %Note: This is the derivative of our cost function
  theta = theta - alpha .* s; 
  %adjust our parameters by a small distance along that slope.
  end

Given an error curve that smoothly approaches a local minimum, the correction applied to each theta is decreased as our error slope decreases, even though alpha is not changed over each run. This helps us quickly converge when the error is large, but slow down and not overshoot our goal as we approach the best fit. This nice curve is not always present.