Notes on parameter adjustment by gradient descent
----------------
Reading: Murray Smith, Neural Networks for Statistical Modeling,
Van Nostrand Reinhold (1993), QA76.87. (Murray Smith is the father of an
ex-122 student.)
The problem: Imagine a system with K parameters (weights, [W1 W2 W3 ...
WK]) that influence the output V by weighting the same number of inputs,
[x1 x2 x3 ... xK].
Furthermore, say that for an input pattern p, [x1p x2p x3p ... xKp],
we know the desired output, dp.
Then the LMS (least mean square) error for that pattern will be
Ep = (1/2)(dp − Vp)²,
where it's likely that V won't equal d "to begin with".
How do we use knowledge of the error to make a correction to each parameter Wi?
One "simple" answer: if we actually know all input–desired-output
combinations, and no tanh (nonlinearity) is involved, then
W = D*pinv([X]),
where [X] is a matrix of all input combinations (each pattern is a different column),
D is the row of desired outputs (one per pattern), and pinv is a Matlab function
for non-square matrices which finds the LMS-best matrix Mpi such that M*Mpi = I.
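As a sketch, the pseudoinverse solution can be written in NumPy instead of Matlab. The dimensions, random seed, and the "true" weights used to generate the desired outputs below are all hypothetical choices for illustration:

```python
import numpy as np

# Hypothetical data: K = 3 weights, P = 5 input patterns.
# Each pattern is a column of X, matching the [X] convention above.
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 5))          # K x P input matrix
W_true = np.array([1.0, -2.0, 0.5])      # assumed weights to recover
d = W_true @ X                           # desired outputs, one per pattern

# LMS solution: W = D * pinv([X]), with np.linalg.pinv as Matlab's pinv
W = d @ np.linalg.pinv(X)

print(W)    # close to W_true when X has full row rank
```

When X has rank K, X*pinv(X) = I exactly and the recovery is exact; otherwise pinv gives the least-squares best fit.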
----------------
Otherwise, assume the desired outputs are not all known in advance, or there are
nonlinearities in the problem. Then we consider gradient descent. See the diagram
below:
[Diagram: the error E graphed as a function of ONE weight wi, showing two local minima.]
From a starting point, we want to end up at a minimum of E. A global minimum would
be best, but we'll settle for a local min. There are two "local" minima in the
diagram above.
Assume all other weights W are fixed, and we are concerned with varying wi only.
Suppose we compute dE/dwi,
the gradient of E at Wk. Consider the case where the gradient is positive: the local
min will be to the left, so ΔW must be negative. Therefore the rule
ΔW = −η·(dE/dW), with learning rate η > 0,
must be true, where ΔW is the change in W, and Wnew = Wold + ΔW.
Now think about the other case, where the gradient is negative: the local min must
be to the right, and the above equation must also hold (it now gives a positive ΔW).
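A minimal one-dimensional sketch of the sign rule, using an assumed example curve E(w) = (w − 2)² whose single minimum sits at w = 2:

```python
# Sign rule: Delta_w = -eta * dE/dw, so we always step downhill.
# E(w) = (w - 2)^2 is an assumed example curve; dE/dw = 2(w - 2).
def dE_dw(w):
    return 2.0 * (w - 2.0)

eta = 0.1
w = 5.0                      # start to the right: gradient positive, step left
for _ in range(100):
    w += -eta * dE_dw(w)     # move opposite the gradient

print(w)                     # converges toward the minimum at w = 2
```

Starting from the left of the minimum instead, the gradient is negative and the same rule produces a rightward step, matching the two cases above.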
LMS gradient. Figure out dE/dWi.
By the chain rule,
dE/dWi = (dE/dV)·(dV/dWi) = −(d − V)·xi,
where we assume V = W*x.
Let d − V = δ, the local error. Now the LMS gradient descent learning rule
is
ΔWi = η·δ·xi.
If we had defined the local error to be V − d, you should be able to prove that the
same ΔW result would obtain. The rule says that on each update the weight should
change in proportion to the local error times the local input times a learning rate
eta (η).
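The update rule can be sketched as a per-pattern loop. NumPy stands in for Matlab here, and the data, dimensions, and η value are hypothetical:

```python
import numpy as np

# Per-pattern LMS: delta = d - V with V = W @ x, and each weight moves
# by eta * delta * x_i. All data below are made up for illustration.
rng = np.random.default_rng(1)
K, P = 3, 200
X = rng.standard_normal((K, P))          # one pattern per column
W_true = np.array([0.5, -1.0, 2.0])      # assumed weights to learn
d = W_true @ X                           # desired outputs

W = np.zeros(K)
eta = 0.05
for p in range(P):
    x = X[:, p]
    V = W @ x
    delta = d[p] - V             # local error, delta = d - V
    W += eta * delta * x         # Delta_W_i = eta * delta * x_i

print(W)                         # approaches W_true as patterns are presented
```

Unlike the pseudoinverse solution, this never needs all patterns at once; each update uses only the current input and its desired output.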
For Simulink. For a continuous "update", think of ΔW as
a rate of change of W:
dWi/dt = η·δ·xi,
which you can implement as a product sent to an integrator.
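A forward-Euler loop can stand in for the Simulink product-and-integrator wiring. The constant input, target, η, and step size below are assumed for illustration:

```python
# Continuous-time version: dW/dt = eta * delta * x, integrated by
# forward Euler (playing the role of Simulink's integrator block).
x = [1.0, 0.5]       # assumed constant input
W = [0.0, 0.0]
d = 2.0              # assumed constant desired output
eta = 1.0
dt = 0.01            # Euler step size

for _ in range(2000):                 # integrate for 20 "seconds"
    V = sum(wi * xi for wi, xi in zip(W, x))
    delta = d - V
    W = [wi + dt * eta * delta * xi for wi, xi in zip(W, x)]

print(W)   # settles where W.x = d, so delta -> 0
```

Note the loop drives δ to zero rather than recovering unique weights: with one constant pattern, any W satisfying W·x = d is a fixed point.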
Summary
§ System with many parameters
§ pseudoinverse solution
§ gradient descent
§ LMS gradient descent learning rule