When input and output pattern values have magnitudes that are not fairly close to 1, backprop can have trouble learning the patterns, or the learning can take much longer than if the values were scaled to roughly a magnitude of 1. Training is also faster when the mean of each column is 0. To make training easier there is a separate program called scale that can be run on a set of files (usually a training set and a testing set) to scale the values in various ways. This section describes how to use the program and the new data files the program writes.
To run the scale program from the command line type:
The menu bar contains the choices:
Go to the Commands menu entry on the menu bar and do the following. First, select `Define the Network' and answer the questions. Second, select `Name the Files': click a file name, then Select, then another file name, then Select, and so on until you've named them all, then click Cancel. Third, select `Major Input Transformation' and choose whichever input transformation will be used the most; this fills in that transformation for all the inputs. Fourth, if you want to change some input transformations, select `Fiddle with Input Transformations' and answer the questions. Fifth, if the outputs need to be transformed, select `Major Output Transformation'; this fills in that transformation for all the outputs. Sixth, if you need to change some output unit transformations, select `Fiddle with Output Transformations'. When you're done, select `Save and Quit' from the File menu.
The 'Define the Network' entry is messy to describe but it is pretty much the same as what needs to be done when making a network for the bp program.
The scale program asks a number of questions about the files you want scaled in order to do its job. The first set of questions is designed to find out information about the data on the training and testing files. Note that the point of all these questions is to determine how many columns of data the program will read from your original unscaled data files. The first question you get is:
How many input variables are there for each pattern? This number should include all the inputs you supply and however many columns come from values recirculated down from the output layer and however many input columns are shifted around on the input layer BUT NOT however many inputs come from the hidden layer. How many?
Or to put it another way, how many inputs are there minus the inputs that come from the first hidden layer? At this point type in the number. For instance, suppose this is the data for the problem at hand:
1 0 0 0 0   o1 o2 o3 o4 X   0 1 0 0
0 0 1 0 0   o1 o2 o3 o4 H   1 0 0 0
0 1 0 0 0   o1 o2 o3 o4 H   0 0 0 1
0 1 0 0 0   o1 o2 o3 o4 X   0 1 0 0
0 0 1 0 0   o1 o2 o3 o4 H   0 0 1 0
0 0 0 1 0   o1 o2 o3 o4 H   0 0 0 1
The answer here is 9: 5 of them are regular user-supplied numbers, plus there are 4 more that are copied down from the output layer.
The second question is:
Does the network use any recurrent connections? Type y or n:
If the problem does not involve using recurrent connections from the hidden layer then type n; otherwise type y. If you type y the following questions will come up:
Will the network have a FIXED number of short-term memory units? Type y or n:
The backprop programs are prepared to accept either a fixed or variable number of inputs carried over from the hidden layer. When puttering around with a recurrent network problem you normally want to change the number of hidden layer units and use the variable number of hidden layer inputs notation in the data. If you use X or H the answer is n but if you use x or h then you must tell the program how many columns use x or h.
The next question again applies to all networks and it is:
Is this a classification type problem? Type y or n
If the problem is a classification type problem then type y, and there will not be any more questions about the nature of the data because the program knows there will be only one column giving the class number of the answer. If it is not a classification type problem, that is, if there are arbitrary output values, the program asks:
How many output units will there be?
and you must type in the correct number of output units.
After all these questions the program knows enough to read the data and find the mean, standard deviation, and maximum values of each column of data. After doing this it starts asking questions about how you want the data scaled. At this point it will be asking what you want done with MOST of the inputs and with MOST of the outputs. After taking in this information it will move on and let you change the transformation for ANY column of data, so you can get whatever type of scaling you want; but in the first phase, where it asks about MOST, give whichever transformation will leave you with the fewest changes to make later in the second phase.
The initial set of questions goes like this:
Do you want most or all of the regular (not any recurrent) inputs scaled by the mean and standard deviation?
Type y or n:
If at this point you type y then it will not ask if you want most of them scaled by the maximum value. If you type n it will go on and ask:
Do you want most or all of the regular (not any recurrent) inputs scaled by the maximum value?
Type y or n:
If you answer n to this then it will flag the inputs as unscaled.
Now if it is not a classification type problem it has to ask about how you want the outputs transformed. The questions are:
Do you want most or all of the outputs scaled by the mean and standard deviation?
Type y or n
If you type y that's it but if you type n the program goes on to ask:
Do you want most or all of the outputs scaled by the maximum value?
Type y or n
At this point the first round of questioning about which transformations to apply has finished and the program lists the current set of transformations, such as the following for a network with one input and one output:
The 'Name the Files' entry gives you a listing of the files in the current directory; you select a file from the listing, then click the Select button at the bottom, then select another file, and so on until you have selected them all.
YOU MUST GIVE THE SCALING PROGRAM ALL THE FILES AT ONCE SO IT WILL PRODUCE THE SAME SCALE FACTORS FOR EVERY FILE. The scaling program writes new files that contain the scaled values, with a listing at the top of each file, within comments, giving the scaling parameters. When such a TRAINING file (not a TESTING file) is read by the backprop programs, these scaling parameters at the top of the file are also read, so the scaling and unscaling can be done by the backprop programs.
Every input parameter unit can receive a different transformation, but the next step, the `Major Input Transformation' entry, lets you select whichever one is used the most. All the inputs then get this transformation, and you individually change the units that get other transformations. Of course, in many cases all the units get the same transformation.
The possible methods of scaling the data are:
The most generally useful method is to translate by the mean and scale down by the standard deviation. So given some value, x, apply the transform:

x' = (x - mean) / stddev
In scaling down by the maximum value apply the transform:

x' = x / xmax

where xmax is the largest absolute value in that column of data. Next there is an option for the user to supply two values a and b and apply the transform:
Then for extremely large values, or values that cover quite a wide range (e.g. 0.001 to 1000), using the log function is worthwhile.
Select the `Fiddle with Input Transformations' option to change any input parameters from one transformation to another. A listing of the inputs and their transformations comes up in a list box.
If you need to transform the outputs then select the `Major Output Transformation' option and choose the major output transformation. Again, this sets all the outputs to that transform, but in the next step you can change individual entries.
Select the `Fiddle with Output Transformations' option to change any output parameters from one transformation to another. A listing of the outputs and their transformations comes up in a list box.
When you're finished determining the transformations go to the File menu bar button and select the 'Save and Quit' option. The program takes a few seconds to save the file.
These are the input pattern transformations:

input #    translate by     scale by         transform
      1    2.000000e+000    1.632993e+000    scale by mean and standard deviation

These are the output pattern transformations:

output #   translate by     scale by         transform
       1   0.000000e+000    1.000000e+000    no scaling
If you use the option of inputting a set of values on a command line to the network, the program will scale them the same way as the training data and submit the scaled data to the network to produce a set of output values, but ordinarily the network output will be unscaled unless certain other options are set. To get the output scaled back to its original form you must use "f os" (output scaled). Of course, the backprop calculations are done with the scaled values, not the original ones, and the way the program is written, the error values you get out will be based on the scaled values, not the original ones. To get errors based on the original unscaled values you must use "f u+" as well as "f os", and EVEN THEN THE AUTOMATIC CHECKING FOR CONVERGENCE MECHANISM STILL USES THE SCALED VALUES AND ERRORS (I will change this someday). In real-world problems the patterns seldom all converge; what you really do is run the training until you get a minimum on the test set or you get tired of running it. The checking for convergence can in effect be bypassed by making the tolerances zero.
To use this scaling within your own C program you need to call the function readscaling after making the network like so:
   if (readscaling(dummy, "xor.sca") == 0) {
      printf("bad scaling file\n");
      exit(1);
   }