Notes on parameter adjustment by gradient descent
----------------
Reading: Murray Smith, Neural Networks for Statistical Modeling,
Van Nostrand Reinhold (1993), QA76.87. (Murray Smith is the father of an
ex-122 student.)
The problem: Imagine a system with K parameters (weights, [W1 W2 W3 ...
WK]) that influence the output V by weighting the same number of inputs,
[x1 x2 x3 ... xK].
Furthermore, say that for an input pattern p, [x1p x2p x3p ... xKp],
we know the desired output, dp.
Then the LMS (least mean square) error for that pattern will be
Ep = (1/2)(dp − Vp)²,
where it's likely that V won't equal d "to begin with".
How do we use knowledge of the error to make a correction to each parameter Wi?
One "simple" answer: if we actually know all input–desired-output
combinations, and no tanh (nonlinearity) is involved, then
W = D*pinv([X]),
where [X] is a matrix of all input combinations (each pattern is a different column),
D is the row of desired outputs (one per pattern), and pinv is a Matlab function
for non-square matrices which finds the LMS-best matrix Mpi such that M*Mpi = I.
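As a sketch, the pseudoinverse solution can be written in NumPy instead of Matlab. The dimensions, random seed, and the "true" weights used to generate the desired outputs below are all hypothetical choices for illustration:

```python
import numpy as np

# Hypothetical data: K = 3 weights, P = 5 input patterns.
# Each pattern is a column of X, matching the [X] convention above.
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 5))          # K x P input matrix
W_true = np.array([1.0, -2.0, 0.5])      # assumed weights to recover
d = W_true @ X                           # desired outputs, one per pattern

# LMS solution: W = D * pinv([X]), with np.linalg.pinv as Matlab's pinv
W = d @ np.linalg.pinv(X)

print(W)    # close to W_true when X has full row rank
```

When X has rank K, X*pinv(X) = I exactly and the recovery is exact; otherwise pinv gives the least-squares best fit.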
----------------
Otherwise, assume the desired outputs are not all known in advance, or there are
nonlinearities in the problem. Then we consider gradient descent. See the diagram
below:
[Diagram: the error E graphed as a function of ONE weight wi, showing two local minima.]
From a starting point, we want to end up at a minimum of E. A global minimum would
be best, but we'll settle for a local min. There are two "local" minima in the
diagram above.
Assume all other weights W are fixed, and we are concerned with varying wi only.
Suppose we compute dE/dwi,
the gradient of E at Wk. Consider the case where the gradient is positive: the local
min will be to the left, so ΔW must be negative. Therefore the rule
ΔW = −η·(dE/dW), with learning rate η > 0,
must be true, where ΔW is the change in W, and Wnew = Wold + ΔW.
Now think about the other case, where the gradient is negative: the local min must
be to the right, and the above equation must also hold (it now gives a positive ΔW).
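A minimal one-dimensional sketch of the sign rule, using an assumed example curve E(w) = (w − 2)² whose single minimum sits at w = 2:

```python
# Sign rule: Delta_w = -eta * dE/dw, so we always step downhill.
# E(w) = (w - 2)^2 is an assumed example curve; dE/dw = 2(w - 2).
def dE_dw(w):
    return 2.0 * (w - 2.0)

eta = 0.1
w = 5.0                      # start to the right: gradient positive, step left
for _ in range(100):
    w += -eta * dE_dw(w)     # move opposite the gradient

print(w)                     # converges toward the minimum at w = 2
```

Starting from the left of the minimum instead, the gradient is negative and the same rule produces a rightward step, matching the two cases above.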
LMS gradient. Figure out dE/dWi.
By the chain rule,
dE/dWi = (dE/dV)·(dV/dWi) = −(d − V)·xi,
where we assume V = W*x.
Let d − V = δ, the local error. Now the LMS gradient descent learning rule
is
ΔWi = η·δ·xi.
If we had defined the local error to be V − d, you should be able to prove that the
same ΔW result would obtain. The rule says that on each update the weight should
change in proportion to the local error times the local input times a learning rate
eta (η).
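The update rule can be sketched as a per-pattern loop. NumPy stands in for Matlab here, and the data, dimensions, and η value are hypothetical:

```python
import numpy as np

# Per-pattern LMS: delta = d - V with V = W @ x, and each weight moves
# by eta * delta * x_i. All data below are made up for illustration.
rng = np.random.default_rng(1)
K, P = 3, 200
X = rng.standard_normal((K, P))          # one pattern per column
W_true = np.array([0.5, -1.0, 2.0])      # assumed weights to learn
d = W_true @ X                           # desired outputs

W = np.zeros(K)
eta = 0.05
for p in range(P):
    x = X[:, p]
    V = W @ x
    delta = d[p] - V             # local error, delta = d - V
    W += eta * delta * x         # Delta_W_i = eta * delta * x_i

print(W)                         # approaches W_true as patterns are presented
```

Unlike the pseudoinverse solution, this never needs all patterns at once; each update uses only the current input and its desired output.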
For Simulink. For a continuous "update", think of ΔW as
a rate of change of W:
dWi/dt = η·δ·xi,
which you can implement as a product sent to an integrator.
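A forward-Euler loop can stand in for the Simulink product-and-integrator wiring. The constant input, target, η, and step size below are assumed for illustration:

```python
# Continuous-time version: dW/dt = eta * delta * x, integrated by
# forward Euler (playing the role of Simulink's integrator block).
x = [1.0, 0.5]       # assumed constant input
W = [0.0, 0.0]
d = 2.0              # assumed constant desired output
eta = 1.0
dt = 0.01            # Euler step size

for _ in range(2000):                 # integrate for 20 "seconds"
    V = sum(wi * xi for wi, xi in zip(W, x))
    delta = d - V
    W = [wi + dt * eta * delta * xi for wi, xi in zip(W, x)]

print(W)   # settles where W.x = d, so delta -> 0
```

Note the loop drives δ to zero rather than recovering unique weights: with one constant pattern, any W satisfying W·x = d is a fixed point.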
Summary
§ System with many parameters
§ pseudoinverse solution
§ gradient descent
§ LMS gradient descent learning rule