Choosing the correct representation for data can have a dramatic effect on training time and results. The topics are scaling and standardizing the inputs, transforming angles and other periodic data, principal component analysis, singular value decomposition, coding the outputs, dealing with missing values, and character recognition.
There is a widespread belief that the INPUT to backprop networks must be scaled to between 0 and 1 if the activation function used is the standard sigmoid, and to between -1 and 1 for tanh. This is absolutely not true: the INPUTS can be any real value, although the network may choke completely or learn very slowly if the magnitudes of the inputs are too large. Outputs are another story. If you use a function like the standard sigmoid, with a range of 0 to 1, or tanh, which runs from -1 to 1, you can only get outputs within that range. For target values outside this range you would have to scale the outputs if you really wanted to use these functions, but even here you can use the linear function and avoid the scaling, although if the magnitudes of the outputs are too large then scaling may be helpful with the linear function as well.
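If you do need to squeeze target values into the output function's range, a minimal sketch of the idea in Python with NumPy follows (the function names and the margin parameter are my own for illustration, not from any particular package); the margin keeps targets away from the sigmoid's asymptotes at 0 and 1, where the derivative is tiny:

    import numpy as np

    def scale_targets(y, margin=0.1):
        # Map targets into [margin, 1 - margin] so a sigmoid output unit can reach them.
        lo, hi = np.min(y), np.max(y)
        scaled = (y - lo) / (hi - lo)                  # now in [0, 1]
        return margin + (1.0 - 2.0 * margin) * scaled, lo, hi

    def unscale_outputs(out, lo, hi, margin=0.1):
        # Invert scale_targets to get back to the original units.
        return lo + (hi - lo) * (out - margin) / (1.0 - 2.0 * margin)

Remember to use the same lo and hi when scaling any new data, otherwise the outputs will not come back in the right units.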
One of the simplest transformations you can make is to take each column of data (each input variable), subtract the mean of that column and divide by the standard deviation (sometimes training is faster when you divide by some value other than the standard deviation). This transformation gives you inputs that are symmetric about 0, and learning is normally much faster. For an example of how well this works, try training a network to learn sin(x) with inputs from 0 to 2 * pi and then with inputs from -pi to +pi; the network with the symmetric data will learn much faster. One factor at work here is that when the input values are large the hidden layer activations for two different large values are nearly identical; for instance, compare the hidden layer activations when the input is 5 and when it is 6. It is extremely hard for the network to use these slight differences to produce a large difference in the output. An article by Le Cun, Kanter and Solla explains why scaling helps, apparently with a fair amount of higher math (I have not seen the article). An article by Brown, An, Harris and Wang gives the same results; in fact their analysis predicts that using tanh on the hidden layer should also speed up convergence. In an online article by Neal, Goodacre and Kell they found that simply scaling each input parameter to between 0 and 1 greatly sped up convergence in their particular application.
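As a small sketch of this transformation (Python with NumPy assumed; the function name is mine):

    import numpy as np

    def standardize_columns(x):
        # x has one row per pattern and one column per input variable.
        # Subtract each column's mean and divide by its standard deviation.
        mean = x.mean(axis=0)
        std = x.std(axis=0)
        std[std == 0] = 1.0        # leave any constant column alone
        return (x - mean) / std, mean, std

Keep the mean and standard deviation from the training data and apply the same shift and scale to any test or production data.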
For parameters with an exceptionally large range, taking the log of the values may work, or, if some of the numbers are in the range [0..1), try log(1+x).
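A quick sketch of both transforms (Python with NumPy assumed):

    import numpy as np

    x = np.array([0.001, 0.5, 10.0, 1.0e6])    # an exceptionally large range
    plain_log = np.log(x)       # fine here because every value is positive
    shifted_log = np.log1p(x)   # log(1 + x), also safe for 0 and for values in [0..1)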
For measurements made in degrees, the values 1 degree and 359 degrees are rather close together (think wind direction), but if you feed the network 1 and 359 you make it hard for the network to generalize: 1 and 359 are virtually the same direction and should give almost identical outputs. In this case transform each angle into an (x,y) coordinate pair with x = cos(the_angle_in_radians) and y = sin(the_angle_in_radians); this makes the representations of 1 degree and 359 degrees very close. Use the same transformation for periodic data like the day of the year or the time of day.
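A minimal sketch of the transformation (Python with NumPy assumed; the function name is mine):

    import numpy as np

    def encode_angle(degrees):
        # Put the angle on the unit circle so that 1 degree and 359 degrees
        # end up at nearby (x, y) points.
        radians = np.deg2rad(degrees)
        return np.cos(radians), np.sin(radians)

    print(encode_angle(1))      # roughly (0.9998, 0.0175)
    print(encode_angle(359))    # roughly (0.9998, -0.0175)

For the day of the year the angle would be 2 * pi * day / 365, and for the time of day 2 * pi * hour / 24.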
Another data transformation that comes up is principal component analysis (PCA); one site with PCA software is the Carnegie-Mellon Statistics Library, see the file pca.c. I have never tried it. If you can say anything good or bad about PCA please let me know.
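For reference, here is a small sketch of the usual PCA computation in Python with NumPy (this is the standard eigendecomposition of the covariance matrix, not anything taken from the pca.c file mentioned above):

    import numpy as np

    def pca_transform(x, n_components):
        # Project each pattern (a row of x) onto the top principal components.
        centered = x - x.mean(axis=0)
        cov = np.cov(centered, rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
        top = eigvecs[:, ::-1][:, :n_components]      # columns for the largest eigenvalues
        return centered @ top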
Another data transformation that comes up is singular value decomposition (SVD). The online article by Kalman and Kwasny claims that this treatment improves training, and for problems with large input vectors it can reduce the length of the vector (eliminate some parameters). I have never tried it. If you can say anything good or bad about it please let me know.
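A rough sketch of the idea in Python with NumPy (ordinary truncated SVD; this is not necessarily the exact treatment Kalman and Kwasny describe):

    import numpy as np

    def svd_reduce(x, k):
        # Keep only the coordinates along the top k right singular vectors,
        # shortening each input vector from x.shape[1] values down to k.
        u, s, vt = np.linalg.svd(x, full_matrices=False)
        return x @ vt[:k].T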
The normal method for coding outputs in classification problems is to put a 1 in the output vector position for the right class and a 0 in all the other positions.
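For example, with four classes the target vectors would look like this (Python with NumPy assumed):

    import numpy as np

    def one_hot(class_index, n_classes):
        # A 1 in the position of the correct class and a 0 everywhere else.
        target = np.zeros(n_classes)
        target[class_index] = 1.0
        return target

    print(one_hot(2, 4))    # [0. 0. 1. 0.]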
There is an alternative to this I've been meaning to try, maybe someday (sigh).
When an input value is missing, one method is to insert random values. In some situations random values may even help the network find a lower error minimum, as in simulated annealing.
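As a tiny sketch (Python with NumPy assumed; marking missing values with NaN is just one convention):

    import numpy as np

    rng = np.random.default_rng(0)
    pattern = np.array([0.3, np.nan, 0.8, np.nan])    # NaN marks a missing value
    filled = np.where(np.isnan(pattern), rng.uniform(size=pattern.shape), pattern)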
You can also use an auto-associative network to estimate a missing value: simply train an auto-associative network on complete patterns, then give the network an incomplete pattern and read off the missing value on the output units.
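Here is a hedged sketch of the procedure using scikit-learn's MLPRegressor purely for convenience (the toy data, the bottleneck size and the training settings are all made up for illustration, and whether the output at the missing position is a useful estimate, rather than a copy of whatever you put in that slot, depends on the data and the training):

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    complete = rng.uniform(size=(200, 4))
    complete[:, 3] = complete[:, 0] + complete[:, 1]   # give the network something learnable

    # Auto-associative training: the inputs and the targets are the same patterns.
    # A hidden layer smaller than the input pushes the network to learn the structure
    # of the data instead of simply copying its inputs.
    net = MLPRegressor(hidden_layer_sizes=(3,), max_iter=5000, random_state=0)
    net.fit(complete, complete)

    # An incomplete pattern: the last value is unknown, so start it at the column mean.
    pattern = np.array([[0.2, 0.7, 0.5, complete[:, 3].mean()]])
    estimate = net.predict(pattern)[0, 3]   # read the missing value off the output units
    print(estimate)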
Every once in a while someone asks about how to do character recognition, so I put together a short tutorial on the subject.