Feb 22, 1995 and forward, from EN291S, reformatted 2005

Back-propagation delta functions

See chapter 4 of Simon Haykin, Neural Networks: A Comprehensive Foundation, 2nd Ed, Prentice-Hall, Upper Saddle River, NJ (1999) pp 161 ff.

And try Murray Smith's book, Neural Networks for Statistical Modeling, Van Nostrand Reinhold, 1993, ISBN 0-442-01310-8, QA76.87.S62

First: believe in gradient descent! What is it?
See previous file. w_k is a weight in a neural network. The update Δw_k = −η ∂E/∂w_k can be seen as the least-mean-square (LMS) approximation to true gradient descent. E is an "energy" or error function.
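As a concrete sketch of that update rule (a NumPy toy example rather than the Matlab used later; eta, x, d, and the iteration count are made up):

```python
import numpy as np

# Gradient descent on one weight w for E = 0.5*(d - w*x)**2.
# dE/dw = -(d - w*x)*x, so Delta-w = -eta*dE/dw = eta*(d - w*x)*x.
eta = 0.1
x, d = 2.0, 1.0   # one input and its desired output
w = 0.0           # initial weight
for _ in range(100):
    w += eta * (d - w * x) * x   # LMS / delta-rule step
# after the loop, the output w*x has converged toward d
```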

In principle, a 2-layer BP network is powerful enough to solve any (reasonable) classification problem, given enough time and precision (time for learning, and precision of weight resolution in the computer). Why? Because the hidden layer presents various nonlinear compression functions to be combined at the output layer. Each compression function is weighted and biased. The weights may be negative or positive. The combination of weighted compression functions has similarities to Fourier or Laplace transforms: representing something important with something convenient.
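To see how weighted, biased compression functions combine, here is a small NumPy illustration (all slopes, biases, and weights are made-up values): the difference of two bias-shifted tanh units makes a localized "bump", and sums of such bumps can build up quite general shapes.

```python
import numpy as np

# Two tanh "compression functions", each biased and weighted:
# their weighted difference is a bump centered between the two
# bias points.
x = np.linspace(-4, 4, 201)
unit1 = np.tanh(2 * (x + 1))       # bias shifts the function left
unit2 = np.tanh(2 * (x - 1))       # bias shifts the function right
bump = 0.5 * unit1 - 0.5 * unit2   # output-layer weights +0.5, -0.5
# bump is near 1 at x = 0 and near 0 far from the center
```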

What do the biases do? They provide an offset, sliding each compression function along its input axis...
future GRAPHIC: COMPRESSION FUNCTIONS

For any weight w in the 2-layer network, the gradient ∂E/∂w is all you need to compute, in principle, to obtain "instantaneous" weight updates.

In simulating a 2-layer back-prop network, a forward pass computes the hidden and output layer neuron responses. These responses are then used in a backward pass which updates the weights.
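The forward pass can be sketched as follows (a NumPy stand-in for the Matlab used later; the sizes and weights are made up, and the names K, J, I, w, W, v, V anticipate the notation below):

```python
import numpy as np

# Forward pass for a fully connected 2-layer network with tanh units:
# K inputs, J hidden units, I outputs.
K, J, I = 2, 3, 1
rng = np.random.default_rng(0)
w = rng.uniform(-0.5, 0.5, (J, K))   # hidden-layer weights w_jk
W = rng.uniform(-0.5, 0.5, (I, J))   # output-layer weights W_ij
x = np.array([0.5, -1.0])            # one input pattern {x_k}
v = np.tanh(w @ x)                   # hidden-unit outputs v_j
V = np.tanh(W @ v)                   # network outputs V_i
# v and V are exactly what the backward pass will need
```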

Notation
Haykin's notation is reasonable if you need something for an arbitrary number of layers, but here we use a more direct notation to find hidden unit weight update rules for a strict 2-layer system.

The only choice the designer normally has (and it's a big choice) is the number of hidden units, J, below. Inputs K and outputs I are usually fixed by the nature of the problem. What follows is notation and a method to update all weights by back propagation of error: input components x_k (k = 1, ..., K); hidden unit outputs v_j (j = 1, ..., J) fed by hidden-layer weights w_jk; network outputs V_i (i = 1, ..., I) fed by output-layer weights W_ij; desired outputs d_i; and compression function f.


Furthermore, η is the learning rate for back prop.
Later we may also use superscript μ to denote which input pattern is being presented.

BP weight update for output layer
Here we derive the update rule for the output layer weights in back prop, for one pattern x, the components of which are the input x_k's. The weights for the output layer are the W_ij's; we want a particular i & j to update, so we need to compute

    ΔW_ij = −η ∂E/∂W_ij ,   with   E = (1/2) Σ_i (d_i − V_i)² ,   V_i = f(u_i) ,   u_i = Σ_j W_ij v_j .

Carrying out the partial derivative,

    ∂E/∂W_ij = −(d_i − V_i) f'(u_i) v_j ,   so   ΔW_ij = η (d_i − V_i) f'(u_i) v_j = η δ_i v_j ,

with δ_i ≡ (d_i − V_i) f'(u_i). The d − V term is set up to appear as negative feedback.

Updating hidden layer weights
Consider hidden layer weight w_jk, where j is the hidden layer index and k is the input index. We want

    Δw_jk = −η ∂E/∂w_jk .   (eqn **)

Here comes a critical concept to grasp. See figure below.


Notice that in a fully connected 2-layer network, hidden unit weight w_jk affects all the outputs V_m, through hidden unit output v_j projecting to output layer weights W_mj, m = 1, 2, ..., I.
Examine

    V_m = f(u_m) ,   u_m = Σ_j W_mj v_j ,   with   v_j = f(u_j) ,   u_j = Σ_k w_jk x_k ,

to verify that any V_m will involve a particular v_j. Therefore the partial in equation ** above depends on all the δ_i's of the output layer. Write out, by the chain rule,

    ∂E/∂w_jk = Σ_m (∂E/∂V_m)(∂V_m/∂v_j)(∂v_j/∂w_jk)
             = −Σ_m (d_m − V_m) f'(u_m) W_mj · f'(u_j) x_k
             = −( Σ_m δ_m W_mj ) f'(u_j) x_k .
This development gets us to the hidden-layer delta

    δ_j ≡ f'(u_j) Σ_m W_mj δ_m ,   so that   Δw_jk = η δ_j x_k .

We're done! We have two formulas, one for updating the output layer, the other for updating the hidden layer weights, pattern by {xk}μ pattern.
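A minimal NumPy sketch of both formulas applied to one pattern, with a finite-difference check that the hidden-layer delta really gives the negative gradient (tanh compression, so f'(u) = 1 − f(u)²; all sizes and values are made up):

```python
import numpy as np

# One back-prop step for a 2-layer tanh network: K inputs, J hidden, I outputs.
K, J, I, eta = 2, 3, 2, 0.1
rng = np.random.default_rng(1)
w = rng.uniform(-0.5, 0.5, (J, K))   # hidden-layer weights w_jk
W = rng.uniform(-0.5, 0.5, (I, J))   # output-layer weights W_ij
x = np.array([0.3, -0.7])            # one input pattern
d = np.array([0.5, -0.2])            # desired outputs

def forward(w_, W_, x_):
    v_ = np.tanh(w_ @ x_)            # hidden-unit outputs
    return v_, np.tanh(W_ @ v_)      # ... and network outputs

v, V = forward(w, W, x)
delta_out = (d - V) * (1 - V**2)           # delta_i = (d_i - V_i) f'(u_i)
delta_hid = (1 - v**2) * (W.T @ delta_out) # delta_j = f'(u_j) sum_m W_mj delta_m

# Numerical check that dE/dw_jk = -delta_j * x_k for E = 0.5*sum (d - V)^2.
E = lambda w_: 0.5 * np.sum((d - forward(w_, W, x)[1])**2)
eps = 1e-6
w_pert = w.copy()
w_pert[0, 0] += eps
num = (E(w_pert) - E(w)) / eps
assert abs(num - (-delta_hid[0] * x[0])) < 1e-5

# The two update formulas: DW_ij = eta*delta_i*v_j ; dw_jk = eta*delta_j*x_k.
W = W + eta * np.outer(delta_out, v)
w = w + eta * np.outer(delta_hid, x)
```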

Do we need to know the hidden unit (HU) outputs in order to update by BP? Yes, and they're computed on the forward pass!

Implementing back prop with the Matlab Neural Network toolbox.

Using MATLAB you can efficiently implement the back prop algorithm.
(Efficiently for number of instructions, maybe not so efficient for speed...)
Let's say you want a 2-layer network:
You the designer need to make some decisions...
- How many hidden units?
- What compression function?
- How to initialize the weights?
- Error stopping criterion?
- Learning rate?

Back prop material quoted from the Matlab Neural Network Toolbox documentation (3/05):
Creating a Network (newff).
The first step in training a feedforward network is to create the network object. The function newff creates a feedforward network. It requires four inputs and returns the network object. The first input is an R by 2 matrix of minimum and maximum values for each of the R elements of the input vector. The second input is an array containing the sizes of each layer. The third input is a cell array containing the names of the transfer functions to be used in each layer. The final input contains the name of the training function to be used.

For example, the following command creates a two-layer network. There is one input vector with two elements. The values for the first element of the input vector range between -1 and 2, the values of the second element of the input vector range between 0 and 5. There are three neurons in the first layer and one neuron in the second (output) layer. The transfer function in the first layer is tan-sigmoid, and the output layer transfer function is linear. The training function is traingd (which is described in a later section).

net=newff([-1 2; 0 5],[3,1],{'tansig','purelin'},'traingd');

This command creates the network object and also initializes the weights and biases of the network; therefore the network is ready for training. There are times when you may want to reinitialize the weights, or to perform a custom initialization. The next section explains the details of the initialization process.

Initializing Weights (init). Before training a feedforward network, the weights and biases must be initialized. The newff command will automatically initialize the weights, but you may want to reinitialize them. This can be done with the command init. This function takes a network object as input and returns a network object with all weights and biases initialized.

Here is how a network is initialized (or reinitialized):

net = init(net);

Below is Matlab code for direct computation of back-propagated errors in a time series prediction application:
% Note: cnt must be initialized (cnt = 0) and del filled by one forward
% pass (V = tanh(WtBP*hist_tot); del = (des_tot-V).*(1-V.*V);) before
% entering the loop.
while cnt < max_cnt % & ssE > accu
  % shuffle the training patterns, keeping the last two in place
  shuf = [shuffleDT([1:sze-2], sze-2) sze-1 sze];
  eta_decay = eta*exp(-3*cnt/max_cnt);        % decaying learning rate
  for mm = 1:sze %:-1:1
    nn = shuf(mm);
    DELW = eta_decay*del(nn)*hist_tot(:,nn)'; % delta-rule update, one pattern
    WtBP = WtBP+DELW;
  end
  V = tanh(WtBP*hist_tot);   % forward pass over all patterns
  epsilon = des_tot-V;
  del = epsilon.*(1-V.*V);   % back-propagated error; tanh' = 1-V.^2
  ssE = sumsqr(del);         % sum squared error
  cnt = cnt+1;
end
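A rough NumPy translation of that loop, with toy stand-ins for the application's data (hist_tot and des_tot are fabricated here so the fragment is self-contained, and shuffleDT is replaced by a library shuffle):

```python
import numpy as np

# Single tanh layer trained to predict des_tot from the columns of
# hist_tot, pattern by pattern, with a decaying learning rate.
rng = np.random.default_rng(2)
sze, n_in = 40, 3
hist_tot = rng.uniform(-1, 1, (n_in, sze))                # patterns as columns
des_tot = np.tanh(np.array([0.5, -0.3, 0.2]) @ hist_tot)  # learnable targets
WtBP = np.zeros((1, n_in))
eta, max_cnt = 0.05, 200

# forward pass before the loop so delta exists on the first iteration
V = np.tanh(WtBP @ hist_tot)
delta = (des_tot - V) * (1 - V**2)
ssE0 = np.sum(delta**2)                            # initial sum squared error
for cnt in range(max_cnt):
    eta_decay = eta * np.exp(-3 * cnt / max_cnt)   # decaying learning rate
    for nn in rng.permutation(sze):                # shuffled pattern order
        WtBP += eta_decay * delta[:, nn:nn+1] @ hist_tot[:, nn:nn+1].T
    V = np.tanh(WtBP @ hist_tot)                   # forward pass, all patterns
    delta = (des_tot - V) * (1 - V**2)             # back-propagated error
    ssE = np.sum(delta**2)                         # sum squared error
```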

Batch learning (speed) vs one-pattern-at-a-time learning (accuracy)
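The trade-off can be seen on a toy delta-rule problem (a NumPy sketch with made-up data and settings): batch averages the gradient over all patterns before stepping once per epoch, while one-pattern-at-a-time steps after every pattern.

```python
import numpy as np

# Batch vs one-pattern-at-a-time delta-rule learning on a linear unit.
rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, (20, 2))          # 20 patterns, 2 inputs
d = X @ np.array([1.0, -2.0])            # targets from known weights
eta, epochs = 0.2, 2000

w_batch = np.zeros(2)
w_online = np.zeros(2)
for _ in range(epochs):
    err = d - X @ w_batch
    w_batch += eta * err @ X / len(X)    # one averaged step per epoch
    for x_n, d_n in zip(X, d):           # one step per pattern
        w_online += eta * (d_n - w_online @ x_n) * x_n
# both approach [1, -2]; online usually gets there in fewer epochs
```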

Could use other methods (breeding algorithm) to search for optimal weights...